Research Project

Kiln

Learning professional audio production as a direct sample-stream transformation.

Kiln is a Resonant Labs research project investigating whether a neural network can learn to close the gap between volunteer-mixed stereo livestream audio and release-quality production while preserving the original performance.

The initial proving ground is church livestream audio: a real-world environment with repeatable rooms, recurring teams, and a uniquely stable dataset opportunity.

30M target model parameters at full width
300k+ US churches streaming weekly
300-600 estimated GPU-hours for full training

01

Church livestreams create a repeatable training environment.

The wedge is not just market access. It is measurement quality. Recurring rooms, volunteer teams, and stable signal chains create a rare real-world setting where transfer can be evaluated across repeated captures instead of isolated demos.

Synthetic pairs. Same-source amateurization gives controlled curriculum data before live capture volume is large.
Live captures. Repeated stereo workflows from the same environment make longitudinal evaluation possible.
Reference supervision. Targets define production lift while keeping preservation constraints explicit.

02

The model has to change production without replacing the performance.

Kiln targets the underexplored middle regime between tiny direct-waveform systems and huge generative music models: a causal transfer stack large enough to reshape production, but constrained enough to stay attached to the original stereo performance.

Latent audio prior. Pretrained stereo latents provide production-scale music context.
Hierarchical sequence model. Long-context modeling handles reverb, balance, dynamics, and sectional movement.
Preservation losses. Timing, identity, and phase coherence keep the output anchored to the source.

03

The output matters only if the same band still sounds like itself.

The goal is not a cleaner imitation of professional music in the abstract. The goal is evidence that production can be lifted while the singer, timing, phrasing, and musical identity of the original capture remain intact.

Preservation. Same singer, same timing, same feel.
Production lift. Better translation in space, balance, and dynamics.
Deployment path. Offline post-production first, reduced-width real-time inference later.

Problem

Church livestreams expose a large, stable production gap.

Every week, churches stream music mixed in real time by volunteers on mid-tier digital consoles with limited time for iteration. The resulting audio is often technically competent as performance capture but consistently amateur as production: buried vocals, uncontrolled dynamics, diffuse reverberation, narrow stereo imaging, and none of the section-aware decisions that distinguish polished releases from live stereo bus output.

The gap is production, not performance.

The musicians usually are not the limiting factor. The gap between a competent church livestream and a commercial worship release is driven by engineering decisions, spatial treatment, dynamics handling, and mix architecture rather than by songwriting or musicianship.

Churches are the proving ground, not the ceiling.

Churches matter because the need is real, but also because the environment is unusually stable. Recurring rooms, consistent signal chains, repeat volunteer teams, and weekly services create a rare real-world setting where direct-stream production transfer can be evaluated on repeated captures rather than isolated examples.

300k+ US churches livestreaming weekly
$500-$2k typical cost of one human mix session
Repeatable rooms. Stable conditions for longitudinal evaluation.
Broader upside. If it works here, the market extends well beyond churches.

Research Question

Can professional production be learned as a direct sample-stream transformation?

Direct sample-stream audio transformation is well-established for small systems such as guitar amplifier modeling, while large generative music systems operate in token or latent spaces at far larger scales. Kiln investigates the unclaimed middle regime: direct raw-audio or latent-audio transformation at tens of millions of parameters for full-mix production transfer.

01

Hierarchical selective state-space models

SaShiMi-style and Mamba-class architectures make long-context causal audio processing tractable in time linear in sequence length, addressing the receptive-field bottlenecks that limited older direct-waveform models.
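
As a rough illustration of the linear-time, causal property, here is a minimal diagonal state-space recurrence in NumPy. It is not SaShiMi or Mamba: there is no input-dependent selectivity and no hierarchy, and the names `ssm_scan`, `a`, `b`, and `c` are hypothetical choices made for this sketch.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Causal diagonal linear state-space scan in a single pass.

    x: (T,) input samples; a, b, c: (N,) per-state decay, input, output weights.
    h[t] = a * h[t-1] + b * x[t];  y[t] = sum(c * h[t])
    """
    h = np.zeros_like(a, dtype=float)
    y = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):  # T steps, O(N) work per step
        h = a * h + b * x_t
        y[t] = np.dot(c, h)
    return y

rng = np.random.default_rng(0)
a = np.array([0.5, 0.9, 0.99])   # decay rates closer to 1 remember longer
b = np.ones(3)
c = np.array([1.0, 0.5, 0.25])
x = rng.standard_normal(64)
y = ssm_scan(x, a, b, c)

# Causality check: perturbing sample 40 must not change outputs before it.
x_future = x.copy()
x_future[40] += 1.0
y_future = ssm_scan(x_future, a, b, c)
```

The cost grows linearly with the number of samples, and the fixed-size state `h` is what carries long context forward without a growing receptive field.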

02

Pretrained stereo audio latents

Pretrained latent autoencoders for 44.1 kHz stereo audio provide strong music priors at production quality, letting a task-specific transformation model operate with less data than a full generative system would require.

03

Slimmable training for deployment

A single trained model can potentially serve both offline high-quality inference and reduced-width real-time deployment, which matters if live stereo-bus use becomes practical later.
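
A minimal sketch of the weight-sharing idea behind slimmable widths, assuming the simplest possible case of a single linear layer. The function `slim_linear` and the `width_ratio` parameter are hypothetical; a real slimmable network would also slice the input channels of following layers and typically handle normalization per width, which is not shown here.

```python
import numpy as np

def slim_linear(x, weight, bias, width_ratio):
    """Linear layer that uses only the leading fraction of its output channels.

    x: (batch, in_dim); weight: (out_dim, in_dim); bias: (out_dim,).
    """
    out_dim = weight.shape[0]
    active = max(1, int(round(out_dim * width_ratio)))
    return x @ weight[:active].T + bias[:active]

rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 4))
bias = rng.standard_normal(8)
x = rng.standard_normal((2, 4))

full = slim_linear(x, weight, bias, 1.0)  # full width, e.g. offline use
half = slim_linear(x, weight, bias, 0.5)  # reduced width, same parameters
```

Because the narrow variant reuses the leading slice of the same parameters, its output here equals the first channels of the full-width output. Training across several widths is what would make those shared channels useful on their own.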

04

Performance-preserving transfer

The target is not generation of a new song. The target is a transformation that preserves singer, drummer, timing, phrasing, and imperfections while replacing amateur production with studio-quality production choices.

Prior Experiments

Six failed paths informed the current design.

The current approach is the result of multiple documented failures. Each one narrowed the real problem: this task needs enough expressive power to reshape reverb, restore transients, and alter production structure without introducing artifacts or performer drift.

01

CycleGAN spectrogram transfer

Unpaired mel-spectrogram transfer produced adversarial artifacts and weak production change, suggesting that this domain gap is too large for naive unpaired translation.

02

Synthetic paired inversion

Paired learning worked in principle, but unconstrained waveform models exploited spectral losses by generating spectrally matched noise rather than perceptually better audio.
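
One way to see how a magnitude-only spectral loss can be gamed is the single-frame illustration below. It keeps every FFT magnitude of a target and scrambles every phase. The decaying tone burst is a stand-in signal invented for this sketch, and the experiment described above used its own losses and data. Overlapping multi-resolution STFT losses constrain phase more than this single full-length FFT does, but the same blind spot applies in a weaker form.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2048
t = np.arange(n)
# A decaying tone burst stands in for a transient-rich target.
target = np.sin(2 * np.pi * 0.05 * t) * np.exp(-t / 200.0)

spec = np.fft.rfft(target)
# Keep every magnitude, randomize every phase (DC and Nyquist stay real).
phase = rng.uniform(-np.pi, np.pi, size=spec.shape)
phase[0] = 0.0
phase[-1] = 0.0
scrambled = np.fft.irfft(np.abs(spec) * np.exp(1j * phase), n=n)

scr_spec = np.fft.rfft(scrambled)
mag_loss = np.mean((np.abs(scr_spec) - np.abs(spec)) ** 2)   # magnitude only
complex_loss = np.mean(np.abs(scr_spec - spec) ** 2)         # phase-aware
wave_loss = np.mean((scrambled - target) ** 2)               # time domain
```

The magnitude loss is numerically zero even though the scrambled signal has lost its transient shape, while the complex and time-domain errors are large. That asymmetry is the kind of pathology a phase-aware term is meant to close.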

03

Differentiable DSP parameter prediction

Interpretable DSP prediction converged cleanly, but it could not create enough depth, transient snap, or spatial dimension to close the amateur-to-pro gap.

04

Cross-quality JEPA plus DSP decoder

Representation learning improved the optimization path, but expressive limits in the decoder still bottlenecked audible production change.

05

Tropical dynamics network

An interpretable compressor-style network revealed that the target gap is not predominantly about dynamics. Training collapsed toward per-band makeup gain, which is effectively EQ.
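
The collapse toward makeup gain can be pictured with a textbook static compressor curve. This is a generic hard-knee formula, not the network from the experiment, and the parameter names are assumptions for illustration.

```python
import numpy as np

def compressor_gain_db(level_db, threshold_db, ratio, makeup_db):
    """Static gain (dB) of a hard-knee downward compressor."""
    over = np.maximum(level_db - threshold_db, 0.0)   # dB above threshold
    return makeup_db - over * (1.0 - 1.0 / ratio)

levels = np.linspace(-60.0, -6.0, 10)  # input levels in dBFS

# Threshold inside the signal range: gain depends on level (real dynamics).
active = compressor_gain_db(levels, threshold_db=-20.0, ratio=4.0, makeup_db=3.0)

# Threshold above every input level: gain is constant, i.e. only makeup gain.
collapsed = compressor_gain_db(levels, threshold_db=0.0, ratio=4.0, makeup_db=3.0)
```

When the threshold never engages, each band applies one fixed gain regardless of level. A set of fixed per-band gains is a static EQ curve, which matches the behavior described above.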

06

Band-split waveform residual

A 5.3M-parameter residual model produced whirring and grinding artifacts, pointing to cross-band phase drift and exploitable loss pathologies when phase supervision is weak.

Current Approach

A 30M-parameter causal transfer model in pretrained stereo latent space.

The current experiment combines hierarchical Mamba-style sequence modeling, pretrained stereo audio latents, synthetic same-source curriculum learning, same-performance captures, and phase-aware losses into a single direct-stream transformation system.

01

Architecture

A SaShiMi-style three-tier hierarchy of Mamba-2 blocks, causal throughout, operating in pretrained stereo latent space. The full-width model has around 30M parameters and is slimmable toward smaller real-time variants.

02

Training data

Curriculum learning blends synthetic same-source pairs, same-performance captures from recurring real-world workflows, cross-performance style supervision, and unpaired professional realism priors.

03

Loss composition

Latent reconstruction, multi-scale spectral losses, complex phase-aware STFT losses, feature matching, adversarial realism, and performer-preservation embeddings work together to stabilize transfer quality.
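
A minimal sketch of two of the listed terms, a multi-resolution log-magnitude loss plus a complex STFT loss, written in NumPy for a mono signal. The FFT sizes, the hop of one quarter window, the equal weighting, and the names `stft` and `multires_loss` are all assumptions of this sketch and not the project's settings. A stereo version might apply the same function per channel or to mid and side signals.

```python
import numpy as np

def stft(x, n_fft, hop):
    """Hann-windowed STFT of a mono signal, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def multires_loss(pred, target, ffts=(256, 512, 1024), phase_weight=1.0):
    """Mean log-magnitude error plus weighted complex error over resolutions."""
    total = 0.0
    for n_fft in ffts:
        p = stft(pred, n_fft, n_fft // 4)
        q = stft(target, n_fft, n_fft // 4)
        mag = np.mean(np.abs(np.log1p(np.abs(p)) - np.log1p(np.abs(q))))
        cplx = np.mean(np.abs(p - q))  # sensitive to phase and polarity
        total += mag + phase_weight * cplx
    return total / len(ffts)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
same = multires_loss(x, x)
flipped_mag_only = multires_loss(-x, x, phase_weight=0.0)
flipped_full = multires_loss(-x, x, phase_weight=1.0)
```

A polarity-inverted copy scores as perfect under the magnitude term alone and is penalized only once the complex term is switched on, which is one reason to include a phase-aware component.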

04

Deployment path

The full-width model targets offline post-production first. Slimmable training preserves a path toward lower-width real-time inference if live stereo-bus deployment becomes viable.

1. Pair and align

Pair repeatable livestream captures with curated production references and align them with chroma and lyric cues.

2. Learn reference production

Use pretrained music representations to separate production quality targets from performance identity.

3. Constrain the transfer

Apply differentiable DSP and phase-aware objectives so the output remains grounded in the original signal.

4. Residual neural lift

Use residual neural transformation to capture production changes that fixed DSP alone cannot express.

5. Evaluate and slim

Measure transfer quality, artifact behavior, and real-time viability across offline and reduced-width variants.
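
For the pair-and-align step, a minimal sketch of estimating a constant frame offset between two feature sequences by correlation. The feature matrices stand in for 12-bin chroma frames and are random data here. The function `estimate_lag` is hypothetical, lyric cues are not modeled, and aligning different performances would likely need time-warping and not just a single global offset.

```python
import numpy as np

def estimate_lag(ref_feats, cap_feats, max_lag):
    """Frame offset of cap_feats relative to ref_feats with highest correlation.

    Inputs are (frames, dims) matrices. A positive result means the capture's
    content occurs that many frames later than in the reference.
    """
    ref = ref_feats - ref_feats.mean(axis=0)
    cap = cap_feats - cap_feats.mean(axis=0)
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = ref, cap[lag:]
        else:
            a, b = ref[-lag:], cap
        n = min(len(a), len(b))
        score = np.mean(np.sum(a[:n] * b[:n], axis=1))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

rng = np.random.default_rng(0)
ref_feats = rng.standard_normal((200, 12))           # stand-in reference frames
pad = rng.standard_normal((7, 12))
cap_feats = np.concatenate([pad, ref_feats])[:200]   # capture starts 7 frames late
lag = estimate_lag(ref_feats, cap_feats, max_lag=20)
```

Swapping the two arguments returns the negated offset, which is a quick sanity check on the sign convention.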

Why GPUs matter

Long-form paired training, frozen perceptual encoders, complex multi-resolution losses, adversarial fine-tuning, and minute-scale stereo inference are all GPU-intensive. Kiln training targets NVIDIA A100-class hardware and is designed with a future real-time inference path in mind.

Expected Contribution

Useful results even if the final audio never clears a commercial-quality bar.

Success would be interesting. Failure would still be publishable. Either way, Kiln should produce evidence about a region of direct-stream audio transformation that has barely been explored in public research.

30M regime

Empirical characterization of direct-stream audio transformation at a scale where little or no prior published work exists.

Hierarchical Mamba evidence

Architectural evidence about causal hierarchical state-space models applied to stereo transformation instead of pure generation.

Synthetic amateurization recipe

A documented method for generating realistic same-source amateur-to-pro training pairs via inverse-mastering style degradations.
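
To make the idea of same-source amateurization concrete, here is a minimal sketch that degrades a stereo signal by narrowing its image and dulling its top end. The specific degradations, the default values, and the names `amateurize` and `one_pole_lowpass` are assumptions chosen for illustration and are not the project's documented recipe.

```python
import numpy as np

def one_pole_lowpass(x, alpha):
    """Simple one-pole low-pass; alpha=1.0 passes the signal unchanged."""
    y = np.empty_like(x)
    state = 0.0
    for i, sample in enumerate(x):
        state += alpha * (sample - state)
        y[i] = state
    return y

def amateurize(stereo, width=0.4, alpha=0.3):
    """Degrade a (samples, 2) signal: narrower stereo image, duller top end."""
    left, right = stereo[:, 0], stereo[:, 1]
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    side = width * side                    # shrink the side (width) signal
    mid = one_pole_lowpass(mid, alpha)
    side = one_pole_lowpass(side, alpha)
    return np.stack([mid + side, mid - side], axis=1)

rng = np.random.default_rng(0)
pro = rng.standard_normal((2000, 2))       # stand-in for a finished stereo mix
amateur = amateurize(pro)                  # paired input, same underlying source
```

Because both signals come from the same source, the pair is sample-aligned by construction, which is what makes this kind of data usable as controlled curriculum material.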

Identifiability analysis

Clearer evidence about what loss structure is needed to prevent performer identity drift when supervision crosses performances.

Compute Requirements

Enough for a real research run, not just architecture triage.

Early self-funded experimentation is enough to identify bad architectures. It is not enough to run the full curriculum required to train and evaluate the current design at publication-quality depth.

Stage A

Synthetic pretraining

Approx. 1-3 days on 4xA100 80GB or 2xH100 for latent reconstruction and curriculum warm-up on synthetic same-source pairs.

Stages B-D

Supervised and adversarial fine-tuning

Approx. 3-7 days on the same cluster for same-performance supervision, cross-performance style transfer, and realism-oriented fine-tuning.

Total

300-600 GPU-hours

That estimate covers the full experimental protocol. Additional time would be needed if distillation from the full-width teacher into reduced-width variants is explored for real-time deployment evaluation.
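
As a rough cross-check, the per-stage wall-clock figures can be converted to raw GPU-hours. This sketch simply multiplies days by 24 hours and by device count. It assumes continuous utilization and does not normalize for the throughput difference between A100 and H100, so it is only a bracketing exercise.

```python
def gpu_hours(days_low, days_high, gpus):
    """Convert a wall-clock range in days on `gpus` devices into GPU-hours."""
    return days_low * 24 * gpus, days_high * 24 * gpus

def total_range(gpus):
    a_low, a_high = gpu_hours(1, 3, gpus)   # Stage A: approx. 1-3 days
    b_low, b_high = gpu_hours(3, 7, gpus)   # Stages B-D: approx. 3-7 days
    return a_low + b_low, a_high + b_high

a100_total = total_range(4)   # 4xA100 80GB
h100_total = total_range(2)   # 2xH100
```

Under these assumptions the two hardware options bracket the quoted 300-600 GPU-hour total. The lower end corresponds to shorter runs on the two-GPU configuration and the upper end to the longest runs on four GPUs.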

About The Work

Audio research grounded in shipping software and real production practice.

Kiln sits at the intersection of music production, systems engineering, and applied machine learning. The project is being developed under Resonant Labs as research work, not as a public product launch page.

Research Context

Why this direction is credible

The work combines formal music training, audio-production experience, and long-running software engineering practice. It also builds on prior shipped work in audio measurement and applied ML through RoomScore.

Resonant Labs LLC is the vehicle for that work: applied audio research, experimental systems, and productized software where the technical problem is real enough to measure.

Why churches first

  • Stable rooms and repeat workflows create a better proving ground than most live-music environments.
  • The dataset opportunity is unusually strong because capture conditions recur week after week.
  • If the transfer quality proves out there, the adjacent market is far larger than the initial wedge.

Contact

Research correspondence.

For research correspondence, collaboration inquiries, or compute-allocation discussions: hello@resonantlabs.tech.

Detailed experimental protocol notes, provenance documentation, training-code excerpts, and publication plans are available on request.