arXiv cs.LG
5/11/2026

Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models
Short summary
Toeplitz MLP Mixers (TMMs) replace quadratic attention with masked Toeplitz matrix operations, reducing complexity from the transformer's quadratic scaling to O(dn log n) for training and O(dn) for inference. Despite a simpler architecture than competing sub-quadratic models, TMMs achieve better training efficiency, information retention, and in-context learning performance. An analysis based on operator index theory shows that trained Toeplitz layers become nearly invertible, which explains the model's superior ability to preserve input information.
- Replaces attention with masked Toeplitz matrices, achieving O(dn log n) training complexity instead of quadratic scaling
- Outperforms other sub-quadratic architectures on training efficiency, information retrieval, and in-context learning
- Mathematical analysis shows trained layers become nearly invertible, explaining improved information retention
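
To make the complexity claim concrete, here is a minimal NumPy sketch of a causally masked Toeplitz mixing step. It is an illustration under stated assumptions, not the paper's implementation: the function names, the choice of a single weight vector `w` shared across all `d` channels, and the zero-padding scheme are all hypothetical. The dense version builds the lower-triangular Toeplitz matrix explicitly and costs O(dn^2); the FFT version computes the same product as a truncated convolution in O(dn log n).

```python
import numpy as np

def causal_toeplitz_mix_dense(x, w):
    """Reference version: y = T @ x with T[i, j] = w[i - j] for i >= j, else 0.

    x: (n, d) sequence of n tokens with d channels.
    w: (n,) Toeplitz weights (assumed shared across channels).
    """
    n = x.shape[0]
    offset = np.arange(n)[:, None] - np.arange(n)[None, :]  # i - j
    # Causal mask: positions with j > i (offset < 0) are zeroed out.
    T = np.where(offset >= 0, w[np.clip(offset, 0, n - 1)], 0.0)
    return T @ x

def causal_toeplitz_mix_fft(x, w):
    """Same product computed as a linear convolution via the FFT."""
    n = x.shape[0]
    m = 2 * n  # zero-pad so circular convolution matches linear convolution
    W = np.fft.rfft(w, n=m)
    X = np.fft.rfft(x, n=m, axis=0)
    y = np.fft.irfft(W[:, None] * X, n=m, axis=0)
    return y[:n]  # keep only the first n outputs (the causal part)

rng = np.random.default_rng(0)
n, d = 16, 4
x = rng.standard_normal((n, d))
w = rng.standard_normal(n)

y_dense = causal_toeplitz_mix_dense(x, w)
y_fft = causal_toeplitz_mix_fft(x, w)
```

In this sketch, output row `i` depends only on input rows `0..i`, so producing one new row at generation time is a single weighted sum over earlier positions, which is presumably where the O(dn) inference figure comes from.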