Training a large AI model is expensive, not just in dollars, but in time, energy, and computational resources. That's been the uncomfortable truth for anyone who has watched GPU bills climb alongside model sizes. Researchers keep pushing model capacity higher, but the cost of getting there hasn't budged much.

Traditionally, obtaining a smaller, faster model requires either training a massive one first and then trimming it down, or training a small one from scratch and accepting weaker performance. Both paths have real costs. The first wastes compute on capacity you'll eventually throw away. The second leaves performance on the table.

A new technique from MIT and collaborators flips that logic entirely. Instead of compressing after the fact, it compresses during training, and the numbers coming out of the research are hard to ignore.


What Is CompreSSM?

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), Max Planck Institute for Intelligent Systems, European Laboratory for Learning and Intelligent Systems, ETH, and Liquid AI have now developed a new method that sidesteps this trade-off entirely, compressing models during training, rather than after.

The technique, called CompreSSM, targets a family of AI architectures known as state-space models, which power applications ranging from language processing to audio generation and robotics. State-space models (SSMs) have gained serious traction as an alternative to transformers, particularly for tasks requiring long-context understanding with lower memory overhead.

"It's essentially a technique to make models grow smaller and faster as they are training," says Makram Chahine, a PhD student in electrical engineering and computer science, CSAIL affiliate, and lead author of the paper. "During learning, they're also getting rid of parts that are not useful to their development."


How CompreSSM Works

The core insight here comes from control theory, a branch of engineering that deals with how systems evolve and stabilize over time.

By borrowing its mathematical tools, the researchers can identify which parts of a model are pulling their weight and which are dead weight, then surgically remove the unnecessary components early in training.

The key empirical finding is that the relative importance of different components within these models stabilizes surprisingly early during training. Using a mathematical quantity called Hankel singular values, which measure how much each internal state contributes to the model's overall behavior, the team showed they can reliably rank which dimensions matter and which don't after only about 10 percent of the training process.
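To make the Hankel-singular-value idea concrete, here is a minimal numpy sketch for a plain discrete-time linear state-space model. The fixed-point Lyapunov iteration and the toy matrices are my own illustration, not the paper's implementation; for a stable system, each Hankel singular value is the square root of an eigenvalue of the product of the controllability and observability Gramians.

```python
import numpy as np

def gramian(A, M, iters=500):
    """Solve the discrete Lyapunov equation X = A X A^T + M by fixed-point
    iteration (converges when A is stable, i.e. spectral radius < 1)."""
    X = np.zeros_like(M)
    for _ in range(iters):
        X = A @ X @ A.T + M
    return X

def hankel_singular_values(A, B, C):
    """Hankel singular values of the linear SSM x_{t+1} = A x_t + B u_t, y_t = C x_t.

    Large values mark states that shape input-output behavior;
    small values mark states that can be pruned with little loss."""
    P = gramian(A, B @ B.T)        # controllability Gramian
    Q = gramian(A.T, C.T @ C)      # observability Gramian
    sigma = np.sqrt(np.abs(np.linalg.eigvals(P @ Q).real))
    return np.sort(sigma)[::-1]

# Toy two-state system: the slow mode (a = 0.9) dominates the fast one (a = 0.5)
A = np.diag([0.9, 0.5])
B = np.array([[1.0], [1.0]])
C = np.array([[1.0, 1.0]])
hsv = hankel_singular_values(A, B, C)
```

Here the slow state's singular value is roughly ten times the fast state's, which is exactly the kind of gap the ranking exploits.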

That 10 percent threshold is the critical window. Once those rankings are established, the less-important components can be safely discarded, and the remaining 90 percent of training proceeds at the speed of a much smaller model. The model runs big when it needs to learn, then sheds what it doesn't need and finishes the job lean.
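As a toy illustration of the prune-and-keep-going idea (not the authors' code), the sketch below builds a diagonal linear SSM, ranks its decoupled states by their per-mode Hankel singular values (which have the closed form |b·c| / (1 − a²) for a 1-D mode), and shows that dropping the low-ranked states barely changes the output:

```python
import numpy as np

# Toy diagonal SSM: x_{t+1} = a * x_t + b * u_t, y_t = c . x_t
# (illustrative values, not from the paper)
a = np.array([0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20])
b = np.ones(8)
c = np.array([1.0, 1.0, 1.0, 1.0, 1e-3, 1e-3, 1e-3, 1e-3])

def run(a, b, c, u):
    """Simulate the diagonal SSM and return the output sequence."""
    x = np.zeros(len(a))
    ys = []
    for ut in u:
        x = a * x + b * ut
        ys.append(c @ x)
    return np.array(ys)

# Per-mode Hankel singular value for decoupled 1-D modes: |b_i c_i| / (1 - a_i^2)
hsv = np.abs(b * c) / (1.0 - a**2)
keep = np.argsort(hsv)[::-1][:4]          # retain the 4 most important states

u = np.sin(0.3 * np.arange(200))          # arbitrary test input
y_full = run(a, b, c, u)
y_small = run(a[keep], b[keep], c[keep], u)
rel_err = np.linalg.norm(y_full - y_small) / np.linalg.norm(y_full)
```

In this toy setup the half-sized model reproduces the full model's output to well under one percent relative error, because the discarded states contribute almost nothing to the input-output map. The real method applies this ranking partway through training, then continues training only the retained dimensions.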


Key Technical Highlights

  • Target architecture: State-space models (SSMs), including Mamba, one of the most widely adopted SSM architectures
  • Compression trigger: Hankel singular values computed at ~10% training completion
  • Mechanism: Structured pruning of low-importance state dimensions during training
  • Advantage over alternatives: No expensive eigenvalue computations at every gradient step

Compared to Hankel nuclear norm regularization, a recently proposed spectral technique for encouraging compact state-space models, CompreSSM was more than 40 times faster, while also achieving higher accuracy. The regularization approach slowed training by roughly 16 times because it required expensive eigenvalue computations at every single gradient step, and even then, the resulting models underperformed.

Knowledge distillation, another common compression strategy, also falls short by comparison. Against knowledge distillation on CIFAR-10, CompreSSM held a clear advantage for heavily compressed models: at smaller state dimensions, distilled models saw significant accuracy drops, while CompreSSM-compressed models maintained near-full performance. And because distillation requires a forward pass through both the teacher and student at every training step, even its smaller student models trained slower than the full-sized baseline.


Performance and Benchmarks

The results from the paper are specific and consistent across multiple test conditions.

On image classification benchmarks, compressed models maintained nearly the same accuracy as their full-sized counterparts while training up to 1.5 times faster. A compressed model reduced to roughly a quarter of its original state dimension achieved 85.7 percent accuracy on the CIFAR-10 benchmark, compared to just 81.8 percent for a model trained at that smaller size from scratch.

On Mamba, one of the most widely used state-space architectures, the method achieved approximately 4x training speedups, compressing a 128-dimensional model down to around 12 dimensions while maintaining competitive performance.

"You get the performance of the larger model, because you capture most of the complex dynamics during the warm-up phase, then only keep the most-useful states," Chahine says. "The model is still able to perform at a higher level than training a small model from the start."

A 4x speedup on Mamba is not a marginal gain. For teams training SSMs at scale, that's a meaningful reduction in both time and infrastructure cost.


What This Means for the AI Industry

The broader implication here is philosophical as much as practical. Most of the AI field has treated compression as a post-training concern. You train big, you deploy small. CompreSSM challenges that default.

"What's exciting about this work is that it turns compression from an afterthought into part of the learning process itself," says senior author Daniela Rus, MIT professor and director of CSAIL. "Instead of training a large model and then figuring out how to make it smaller, CompreSSM lets the model discover its own efficient structure as it learns. That's a fundamentally different way to think about building AI systems."

The timing also matters. In 2025, as demand for AI at the edge grows in sectors like automotive, healthcare, and IoT, model compression increasingly determines whether an embedded solution can scale at all. SSMs are already being explored for real-time applications in audio, robotics, and embedded systems. A technique that produces compact SSMs without the usual accuracy penalty could accelerate deployment in exactly those spaces.

The field has also been watching the broader efficiency push closely. Research on compressing large language models has expanded significantly, driven by demand for efficient deployment across diverse hardware; the goal is to reduce computational cost and memory footprint while retaining model performance. CompreSSM adds a new and distinct approach to that toolkit, one that doesn't require a trained model to exist before compression begins.


Final Thoughts

What stands out to me about CompreSSM isn't just the speedup numbers. It's the decision to treat importance ranking as a property of the training process itself, not something you measure after the fact. Using Hankel singular values to identify which state dimensions stabilize early is a genuinely clever application of control theory to a machine learning problem, and it's the kind of cross-disciplinary move that tends to age well.

The current limitation worth watching is scope: CompreSSM targets state-space models specifically. Transformers, which still dominate most production deployments, aren't in scope yet. Whether the same Hankel-based approach generalizes to attention-based architectures is an open question, and probably the most interesting one to follow from here.

If you're working with Mamba or other SSM-based architectures, this research is worth reading in full. What do you think? Drop your thoughts in the comments.


FAQ

What is CompreSSM?

CompreSSM is a technique developed by MIT CSAIL and collaborators that compresses AI state-space models during training rather than after, using mathematical tools from control theory to remove low-importance components early in the process.

What are state-space models?

State-space models (SSMs) are a family of AI architectures used for tasks like language processing, audio generation, and robotics. Mamba is one of the most widely known examples.

How much faster does CompreSSM make training?

On the Mamba architecture, CompreSSM achieved approximately 4x training speedups. On image classification benchmarks, compressed models trained up to 1.5 times faster while maintaining near-identical accuracy.

How does CompreSSM compare to knowledge distillation?

CompreSSM outperforms knowledge distillation at heavy compression levels. Distilled models showed significant accuracy drops at smaller state dimensions, while CompreSSM-compressed models held near-full performance, and distillation's dual forward-pass requirement made it slower overall.

Does CompreSSM work with transformer models?

Currently, CompreSSM targets state-space models specifically. Whether the approach extends to transformer architectures remains an open research question.