Author here — a few clarifications up front:

How is this different from Performer / linear attention?
Performer and related methods approximate the softmax kernel with random features or low-rank projections. Summation is not an approximation — it drops the pairwise similarity computation altogether. Tokens are modulated by positional encodings, projected with nonlinearities, and aggregated by direct addition.
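For concreteness, here is a minimal PyTorch sketch of how such a layer could look. The class name, the learned positional table, and the residual broadcast are my own illustrative choices rather than the repo's actual API, and this non-causal version only fits classification-style tasks (a causal variant would use a cumulative sum).

```python
# Illustrative sketch of summation-based aggregation (not the repo's actual code).
import torch
import torch.nn as nn

class SummationBlock(nn.Module):
    def __init__(self, d_model: int, max_len: int = 2048):
        super().__init__()
        # Learned positional table used to modulate tokens (assumed form).
        self.pos = nn.Parameter(0.02 * torch.randn(max_len, d_model))
        # Nonlinear projection applied to each position-modulated token.
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        n = x.size(1)
        h = self.proj(x * self.pos[:n])        # position-modulated, nonlinear projection
        pooled = h.sum(dim=1, keepdim=True)    # direct addition over the sequence: no pairwise scores
        return self.norm(x + pooled)           # broadcast the aggregate back to every token
```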
Does pure summation replace attention?
In classification and multimodal regression, yes — summation alone is competitive and often faster. In autoregressive language modeling, pure summation underperforms. But a hybrid design (summation in most layers + a single final attention layer) matches or slightly beats full attention while keeping most of the network near-linear.
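A sketch of what such a hybrid stack could look like, reusing the SummationBlock sketch above (the layer count and the attention layer's configuration are placeholders, not the paper's settings):

```python
# Illustrative hybrid stack: summation blocks plus one final attention layer
# (reuses the SummationBlock sketch above; not the repo's actual code).
import torch.nn as nn

class HybridEncoder(nn.Module):
    def __init__(self, d_model: int = 256, n_layers: int = 6, n_heads: int = 4):
        super().__init__()
        # All but the last layer use summation-based aggregation (near-linear in sequence length).
        self.summation_layers = nn.ModuleList(
            [SummationBlock(d_model) for _ in range(n_layers - 1)]
        )
        # A single standard self-attention layer at the top (quadratic, but paid only once).
        self.final_attention = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        for layer in self.summation_layers:
            x = layer(x)
        return self.final_attention(x)
```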
What scale are the experiments?
Small-to-moderate scale (document classification, WikiText-2, AG News, etc.). Scaling laws remain an open question — collaboration on larger-scale validation is very welcome.
Why might this work?
Summation acts as a bottleneck: only task-relevant features survive aggregation, which seems to restructure embeddings before the final attention layer stabilizes them. PCA and dimensionality analyses show distinctive representation dynamics compared to attention.
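This kind of claim can be probed with a simple layer-wise analysis. The sketch below is my own illustration of one such probe (effective dimensionality via PCA explained variance), not the paper's exact protocol:

```python
# Illustrative probe: effective dimensionality of layer activations via PCA
# (my own analysis sketch, not the paper's exact protocol).
import numpy as np
from sklearn.decomposition import PCA

def effective_dim(hidden_states: np.ndarray, var_threshold: float = 0.9) -> int:
    """Number of principal components needed to explain `var_threshold` of the variance.

    hidden_states: (num_tokens, d_model) activations collected from one layer.
    """
    pca = PCA().fit(hidden_states)
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cum_var, var_threshold) + 1)

# Usage idea: compare the same layer depth in a summation model and an attention
# baseline, and track how effective_dim changes across depth.
```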
Summation-based aggregation replaces pairwise similarity with position-modulated projections and direct summation, reducing per-layer cost from quadratic to near-linear in sequence length.
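As a rough accounting of where the saving comes from (standard big-O bookkeeping for sequence length n and width d, not measured numbers from the paper):

```latex
\underbrace{O(n^2 d + n d^2)}_{\text{self-attention layer}}
\quad \longrightarrow \quad
\underbrace{O(n d^2)}_{\text{projection per token + one sum}}
```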
On its own, summation is competitive for classification and multimodal tasks. In language modeling, a hybrid design — summation in most layers with a single final attention layer — matches or slightly outperforms full attention while staying nearly linear in cost.
GitHub: https://github.com/pfekin/summation-based-transformers