Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

Short summary

Toeplitz MLP Mixers (TMMs) replace quadratic attention with masked Toeplitz matrix operations, reducing the cost of sequence mixing from quadratic in sequence length to O(dn log n) for training and O(dn) for inference. Despite a simpler architecture than competing sub-quadratic models, TMMs achieve better training efficiency, information retention, and in-context learning performance. An analysis based on operator index theory shows that trained Toeplitz layers become nearly invertible, which explains the model's superior ability to preserve input information.

• Replaces attention with masked Toeplitz matrices, achieving O(dn log n) training complexity instead of quadratic scaling
• Outperforms other sub-quadratic architectures on training efficiency, information retrieval, and in-context learning
• Mathematical analysis shows that trained layers become nearly invertible, explaining the improved information retention
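
The O(n log n) figure follows from a standard fact: multiplying by a masked (lower-triangular) Toeplitz matrix is a causal convolution, which can be computed with FFTs instead of an explicit n × n matrix. The sketch below illustrates that equivalence in NumPy. It is not the paper's implementation: the function names, the single kernel `w` shared across all channels, and the reading of n as sequence length and d as hidden width are assumptions made for illustration.

```python
import numpy as np

def toeplitz_mix_dense(x, w):
    """Reference path: build the masked Toeplitz matrix explicitly, O(d n^2).

    x: (n, d) sequence of token vectors.
    w: (n,) kernel (hypothetical: shared across channels); w[k] weights the
       token k positions back.
    """
    n = x.shape[0]
    idx = np.arange(n)
    lag = idx[:, None] - idx[None, :]            # lag[i, j] = i - j
    # Entry (i, j) depends only on i - j (Toeplitz); the mask zeroes j > i.
    T = np.where(lag >= 0, w[np.clip(lag, 0, n - 1)], 0.0)
    return T @ x

def toeplitz_mix_fft(x, w):
    """Same product via FFT-based causal convolution, O(d n log n)."""
    n = x.shape[0]
    m = 2 * n                                    # zero-pad to avoid wrap-around
    W = np.fft.rfft(w, n=m)                      # (m//2 + 1,)
    X = np.fft.rfft(x, n=m, axis=0)              # (m//2 + 1, d)
    y = np.fft.irfft(W[:, None] * X, n=m, axis=0)
    return y[:n]                                 # keep the first n (causal) outputs

# Small demonstration with random data.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4))
w = rng.standard_normal(16)
y_dense = toeplitz_mix_dense(x, w)
y_fft = toeplitz_mix_fft(x, w)
```

Both functions return the same (n, d) array up to floating-point error, and output position i depends only on inputs at positions 0 through i, which is what the causal mask enforces.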

Generated with AI, which can make mistakes.


Read full article at arXiv cs.LG
