In mid-March 2026, Kimi (Moonshot AI) released a paper titled Attention Residuals. It is a deceptively simple idea: instead of blindly adding all previous layer outputs together in a residual stream, let each layer attend over the history of representations and decide what is actually worth carrying forward. The paper validates this at scale — including a 48B-parameter model trained on 1.4T tokens — and the results are compelling.
This post does two things. First, it is a thorough tutorial on the Attention Residuals paper — the math, the intuition, and why it matters. Second, I propose a novel extension: AttnRes-LoRA, which applies the same selective-attention idea to LoRA-based fine-tuning, solving a documented problem in ResLoRA and potentially enabling more expressive parameter-efficient fine-tuning at lower cost.
1. The Problem with Standard Residual Connections
Modern transformers use Pre-Layer Normalization (PreNorm) residual blocks. For a model with $L$ layers, the hidden state at layer $l$ is updated as:

$$\mathbf{h}_l = \mathbf{h}_{l-1} + F_l(\mathrm{LN}(\mathbf{h}_{l-1}))$$
where $F_l$ is the transformer block (self-attention + feed-forward), and $\mathrm{LN}$ is layer normalization. After $L$ layers, the final hidden state is:

$$\mathbf{h}_L = \mathbf{h}_0 + \sum_{l=1}^{L} F_l(\mathrm{LN}(\mathbf{h}_{l-1}))$$
Every layer contributes exactly once, with a weight of 1. This has two connected problems:
- Uncontrolled growth: The magnitude of $\mathbf{h}_L$ grows proportionally to $L$. As depth increases, later layers must work against an increasingly noisy, large-magnitude state.
- Dilution: Because everything is added uniformly, early layer outputs — which encode syntactic, low-level features — are averaged in with late-layer semantic representations. Each individual layer's contribution is diluted by $1/L$. This is especially damaging in very deep models.
In a 96-layer model, any single layer's output contributes — on average — just 1% of the final representation. Even if a specific layer learns something critical, it is drowned out by the noise from the other 95 layers.
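The dilution is easy to see numerically. The toy simulation below uses random, uncorrelated stand-ins for layer outputs (real layer outputs are correlated, which makes the norm growth faster than this toy's $\sqrt{L}$ behavior); it adds 96 unit-norm contributions with weight 1 and measures each one's share of the final state:

```python
import torch

torch.manual_seed(0)
d, L = 512, 96

h = torch.zeros(d)
contribs = []
for _ in range(L):
    out = torch.randn(d)       # stand-in for one block's output F_l(LN(h))
    out = out / out.norm()     # unit-norm contribution for a fair comparison
    contribs.append(out)
    h = h + out                # standard residual: every layer added with weight 1

# Each layer's share of the final state, measured by projection onto h
shares = torch.stack([(c @ h).abs() / h.norm() ** 2 for c in contribs])
print(f"final norm: {h.norm():.1f}")                  # grows with depth
print(f"mean per-layer share: {shares.mean():.4f}")   # close to 1/L ~ 0.0104
```

No single contribution stands out: every layer is squeezed toward a $1/L$ share of the final representation, regardless of how important its output was.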
2. The Attention Residuals Paper
The Attention Residuals paper replaces fixed uniform summation with a learned, input-dependent weighting. At layer $l$, instead of adding $\mathbf{h}_{l-1}$ with weight 1, the model attends over all previous layer outputs $\{\mathbf{h}_0, \mathbf{h}_1, \ldots, \mathbf{h}_{l-1}\}$ and computes a soft weighted sum.
The update rule becomes:

$$\mathbf{h}_l = F_l(\mathrm{LN}(\mathbf{h}_{l-1})) + \sum_{i=0}^{l-1} \alpha_{i \to l}\, \mathbf{h}_i$$
where the attention weights $\alpha_{i \to l}$ are computed as:

$$\alpha_{i \to l} = \frac{\exp\!\big(\mathbf{w}_l^\top\, \mathrm{LN}(\mathbf{h}_i)\big)}{\sum_{j=0}^{l-1} \exp\!\big(\mathbf{w}_l^\top\, \mathrm{LN}(\mathbf{h}_j)\big)}$$
Here, $\mathbf{w}_l$ is a small learnable pseudo-query vector — one per layer, with $d$ parameters. The softmax is applied across all $l$ previous layer indices, so $\sum_{i=0}^{l-1} \alpha_{i \to l} = 1$. This means the residual stream magnitude is automatically controlled: the contribution to $\mathbf{h}_l$ from the history is always a convex combination, bounded in norm.
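A minimal sketch of this weighting at a single token position (toy dimensions; LayerNorm as the normalization and the $0.02$ init scale are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, l = 64, 6                        # hidden size, current layer index
history = torch.randn(l, d)         # h_0 ... h_{l-1} for one token (toy values)
w_l = torch.randn(d) * 0.02         # per-layer pseudo-query (assumed init scale)

# Score each normalized previous state against the pseudo-query, then
# softmax over the layer axis so the weights form a convex combination
normed = F.layer_norm(history, (d,))
alpha = F.softmax(normed @ w_l, dim=0)   # shape [l], sums to 1
residual = alpha @ history               # weighted mixture of h_0 .. h_{l-1}

print(alpha)
```

Because `alpha` sums to 1, the mixture's norm can never exceed the largest single $\|\mathbf{h}_i\|$, which is exactly the boundedness argument above.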
Why this helps
- Dynamic weighting: For different tokens and contexts, the learned $\alpha$ scores shift. A syntactically complex token might rely heavily on early layers; a semantic reasoning step might weight later layers more.
- Gradient flow: Softmax attention creates short direct paths between any layer $i$ and any future layer $l$, similar to how attention heads connect distant tokens. This dramatically improves gradient flow during training.
- Bounded growth: Since attention weights sum to 1, the residual contribution is always a unit-norm-bounded mixture. The uncontrolled magnitude explosion of standard residuals is eliminated.
The paper reports consistent improvements across model sizes and demonstrates more uniform output magnitudes and gradient distribution across depth when integrated into a 48B-parameter model trained on 1.4T tokens.
3. Block AttnRes: Scaling It Up
A naive implementation of Attention Residuals attends over all $l$ previous layer outputs at every layer. For a 96-layer model, the final layer computes attention over 95 vectors in $\mathbb{R}^d$. Each individual aggregation is cheap, but every previous hidden state must stay resident in memory (a cost linear in depth, batch size, and sequence length), and the total aggregation cost summed over all layers grows quadratically with depth.
To manage this, the paper introduces Block AttnRes: partition the $L$ layers into $B$ blocks of size $k = L/B$. Each layer only attends over the $B$ block-level representations (typically the last hidden state of each block), not all previous individual layer outputs.
This reduces the attention cost from $O(L^2 d)$ to $O(B^2 d)$ per forward pass, where $B \ll L$. The paper shows this preserves most of the gains of full AttnRes while remaining practical at large scale.
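The bookkeeping is simple to sketch, assuming the "last hidden state of each block" convention mentioned above:

```python
import torch

torch.manual_seed(0)
L, k = 96, 12              # layers, block size
B = L // k                 # 8 blocks

# Toy per-layer hidden states for one token position
h = torch.randn(L, 64)

# Block representative = last hidden state of each block; a layer then
# attends over at most B of these instead of up to L individual states
block_reps = h[k - 1 :: k]
print(block_reps.shape)    # torch.Size([8, 64])
```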
4. LoRA Recap
Before extending the idea, let's ground ourselves in LoRA. Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) technique where instead of updating a full weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d \times d}$, we learn a low-rank decomposition:

$$\Delta\mathbf{W} = \mathbf{B}\mathbf{A}$$
where $\mathbf{B} \in \mathbb{R}^{d \times r}$, $\mathbf{A} \in \mathbb{R}^{r \times d}$, and $r \ll d$ is the rank. The forward pass through a LoRA-adapted layer is:

$$\mathbf{h} = \mathbf{W}_0\,\mathbf{x} + \mathbf{B}\mathbf{A}\,\mathbf{x}$$
$\mathbf{W}_0$ is frozen. Only $\mathbf{A}$ and $\mathbf{B}$ are trained. With $r = 4$ and $d = 4096$, LoRA requires just $2 \times 4096 \times 4 = 32{,}768$ trainable parameters per layer — compared to $4096^2 \approx 16.7M$ for full fine-tuning. That is a 512× reduction.
Because $\mathbf{W}_0$ is frozen, the pre-trained weights are preserved exactly: remove the adapter and you recover the original model bit-for-bit. LoRA only adds a correction signal on top. (Note this preserves weights, not behavior; the adapted model's outputs can still drift on tasks outside the fine-tuning data.)
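The recap condenses to a few lines. This is a sketch of the core mechanism, not a full PEFT integration (scaling factors and dropout omitted):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, r = 4096, 4

W0 = nn.Linear(d, d, bias=False)
W0.weight.requires_grad = False              # frozen base weight

A = nn.Parameter(torch.randn(r, d) * 0.01)   # Gaussian init breaks symmetry
B = nn.Parameter(torch.zeros(d, r))          # zero init: delta starts at exactly 0

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    # y = W0 x + B A x: frozen base output plus low-rank correction
    return W0(x) + x @ A.T @ B.T

x = torch.randn(2, d)
assert torch.allclose(lora_forward(x), W0(x))  # B = 0, so no change at init

print(A.numel() + B.numel())  # 32768 trainable parameters, vs ~16.7M in W0
```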
5. ResLoRA and Its Dilution Problem
ResLoRA (Microsoft Research, 2024) recognized that LoRA adapters in different layers are isolated from each other — a late adapter cannot benefit from what an early adapter learned. Their fix: add residual connections between LoRA blocks.
Formally (in simplified form), at LoRA layer $l$:

$$\mathbf{h}_l = \mathbf{W}_0\,\mathbf{x} + \mathrm{LoRA}_l(\mathbf{x}) + \sum_{i=1}^{l-1} \mathrm{LoRA}_i(\mathbf{x})$$
This speeds up training convergence significantly, because deep adapters receive gradient signals through shorter paths. However, ResLoRA inherits exactly the same flaw the Attention Residuals paper identified in standard residual connections: all previous adapter outputs are summed with equal weight.
As you stack more LoRA blocks — which you naturally do when adapting deeper models or adding multiple task-specific adapters — the sum in ResLoRA grows unboundedly. Early adapter signals from shallow layers are buried under the accumulated noise of every later adapter. The effective signal-to-noise ratio of any individual adapter's contribution drops as $1/(l-1)$.
The dilution problem: In a model with 32 LoRA blocks, the adapter at block 1 contributes just 3% of the final residual sum — even if it learned something critically relevant to the current token.
6. AttnRes-LoRA: The Proposed Extension
The fix is direct: apply the Attention Residuals idea to the LoRA residual stream. Instead of blindly summing all previous adapter outputs, each LoRA block uses a lightweight query vector to attend over the history and selectively aggregate.
Let $\mathrm{LoRA}_i(\mathbf{x}) \in \mathbb{R}^d$ denote the output of the $i$-th adapter. The AttnRes-LoRA update at block $l$ is:

$$\mathbf{h}_l = \mathbf{W}_0\,\mathbf{x} + \mathrm{LoRA}_l(\mathbf{x}) + \sum_{i=1}^{l-1} \alpha_{i \to l}\, \mathrm{LoRA}_i(\mathbf{x})$$
The attention weights are computed exactly as in the Attention Residuals paper, but applied to adapter outputs rather than full hidden states:

$$\alpha_{i \to l} = \frac{\exp\!\big(\mathbf{w}_l^\top\, \mathrm{RMSNorm}(\mathrm{LoRA}_i(\mathbf{x}))\big)}{\sum_{j=1}^{l-1} \exp\!\big(\mathbf{w}_l^\top\, \mathrm{RMSNorm}(\mathrm{LoRA}_j(\mathbf{x}))\big)}$$
Here $\mathbf{w}_l$ is a small learnable query vector specific to block $l$. It is the only additional parameter introduced beyond the standard LoRA matrices: $d$ scalars per block, which is negligible (4,096 floats for a typical LLM hidden size of $d = 4096$).
What gets trained vs. what stays frozen
- Frozen: $\mathbf{W}_0$ (all base model weights) — untouched, as in standard LoRA.
- Trained: $\mathbf{A}_l, \mathbf{B}_l$ (LoRA matrices) — same as standard LoRA.
- Trained (new): $\mathbf{w}_l$ (query vector per block) — $d$ parameters per block, trivially small.
Why the softmax is essential
The softmax over $i$ enforces that $\sum_{i=1}^{l-1} \alpha_{i \to l} = 1$. This has two consequences:
- The residual contribution is always a convex combination — bounded in norm regardless of depth.
- The weights are sparse in practice: softmax with sharp attention naturally assigns most of the weight to one or two adapters, while the rest get near-zero. This is the foundation for dynamic adapter skipping.
7. Dynamic Adapter Skipping Explained
One of the most compelling properties of AttnRes-LoRA is that it enables inference-time compute savings through dynamic adapter skipping. Let me be precise about how this works, because the causality is subtle.
What we are NOT skipping
When the token passes through Layer 12, the LoRA adapter at Layer 12 computes its output. This happens unconditionally — we cannot skip it because Layer 12 must produce a hidden state for Layer 13 to consume. AttnRes-LoRA does not skip the forward computation of adapters.
What we ARE skipping
After the token has already passed through Layers 1–49, we are now at Layer 50 computing the AttnRes aggregation. Layer 50 needs to compute:

$$\sum_{i=1}^{49} \alpha_{i \to 50}\, \mathrm{LoRA}_i(\mathbf{x})$$
In standard ResLoRA, the GPU must fetch all 49 stored adapter outputs from high-bandwidth memory (HBM) and perform 49 tensor additions. This is expensive — not because of computation, but because of memory bandwidth (the "memory wall").
In AttnRes-LoRA, we first compute the attention scores using the cheap dot products $\mathbf{w}_{50}^\top \mathrm{RMSNorm}(\mathrm{LoRA}_i(\mathbf{x}))$. Each score is a single $d$-dimensional dot product against the small query $\mathbf{w}_{50}$, negligible next to fetching 49 full activation tensors. The result might look like (illustrative numbers):

$$\alpha_{1 \to 50} = 0.001,\ \ldots,\ \alpha_{47 \to 50} = 0.002,\ \alpha_{48 \to 50} = 0.31,\ \alpha_{49 \to 50} = 0.62$$
With a threshold $\tau = 0.05$: only adapters 48 and 49 meet the threshold. We skip fetching the other 47 adapter outputs from HBM entirely. We do not perform their tensor additions. We save ~96% of the memory bandwidth cost for this aggregation step.
Token-dependent routing: For the token "quantum," adapters 48 and 49 dominate. For the next token "entanglement," the scores shift — perhaps adapter 12 (which learned scientific vocabulary during fine-tuning) suddenly gets $\alpha_{12\to50} = 0.72$. The skipping pattern changes token-by-token, dynamically.
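The thresholding logic can be sketched for a single token position. Toy values throughout; the sharpened softmax temperature is purely illustrative, and a production kernel would gather only the surviving tensors from HBM rather than materializing everything first:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_prev, d, tau = 49, 512, 0.05

# Stored adapter outputs for blocks 1..49 at one token position (toy values)
prev = torch.randn(n_prev, d)
w50 = torch.randn(d) * 0.02

# RMS-normalize each stored output, then score with 49 cheap dot products
normed = prev * torch.rsqrt(prev.pow(2).mean(dim=-1, keepdim=True) + 1e-6)
scores = normed @ w50
alpha = F.softmax(scores / 0.05, dim=0)   # temperature-sharpened for illustration
keep = alpha >= tau                       # which adapters are worth fetching at all

# Aggregate only the kept adapters, renormalizing their weights
kept_alpha = alpha[keep] / alpha[keep].sum()
residual = kept_alpha @ prev[keep]

print(f"fetched {int(keep.sum())} of {n_prev} adapter outputs")
```

The fraction of adapters that survive the threshold is exactly the fraction of memory traffic you cannot avoid; everything below `tau` is never touched.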
8. PyTorch Implementation
Below is a clean implementation of a single AttnRes-LoRA layer. It wraps any linear layer with LoRA matrices and the depth-wise attention query.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import List, Optional, Tuple


class AttnResLoRALayer(nn.Module):
    """
    A single LoRA-adapted linear layer with Attention Residuals support.

    The forward pass accepts a list of previous adapter outputs and computes
    a softmax-weighted aggregation (AttnRes) before adding it to the current
    adapter output.

    Args:
        base_layer: The frozen linear layer to adapt.
        rank: LoRA rank r (default 4).
        alpha: LoRA scaling factor (default 1.0).
        skip_threshold: Adapters with attention weight below this value are
            skipped during inference aggregation (default 0.05).
    """

    def __init__(
        self,
        base_layer: nn.Linear,
        rank: int = 4,
        alpha: float = 1.0,
        skip_threshold: float = 0.05,
    ):
        super().__init__()
        self.base_layer = base_layer
        self.rank = rank
        self.scaling = alpha / rank
        self.skip_threshold = skip_threshold

        d_out, d_in = base_layer.weight.shape

        # LoRA matrices, same as standard LoRA: A random, B zero
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))

        # Depth-wise query for attention over previous adapter outputs
        self.w_query = nn.Parameter(torch.randn(d_out) * 0.01)
        # No affine scale, so w_query stays the only new parameter per block
        # (nn.RMSNorm requires PyTorch >= 2.4)
        self.rms_norm = nn.RMSNorm(d_out, elementwise_affine=False)

        # Freeze base layer
        for p in self.base_layer.parameters():
            p.requires_grad = False

    def lora_output(self, x: torch.Tensor) -> torch.Tensor:
        """Compute only the LoRA delta (no base layer)."""
        return (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    def forward(
        self,
        x: torch.Tensor,
        prev_lora_outputs: Optional[List[torch.Tensor]] = None,
        training: bool = True,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            x: Input tensor [batch, seq_len, d_in]
            prev_lora_outputs: List of previous adapter outputs,
                each [batch, seq_len, d_out]
            training: If False, apply skip_threshold for efficiency.

        Returns:
            (output, current_lora_out) — output is the full layer result;
            current_lora_out should be appended to prev_lora_outputs
            for the next layer.
        """
        base_out = self.base_layer(x)   # [B, T, d_out]
        lora_out = self.lora_output(x)  # [B, T, d_out]

        attn_residual = torch.zeros_like(lora_out)
        if prev_lora_outputs:
            # Stack previous outputs: [L, B, T, d_out]
            stacked = torch.stack(prev_lora_outputs, dim=0)

            # Normalize each previous output before scoring: [L, B, T, d_out]
            normed = self.rms_norm(stacked)

            # Attention scores via dot product with w_query
            # w_query: [d_out] -> scores: [L, B, T]
            scores = torch.einsum('d,lbtd->lbt', self.w_query, normed)

            # Softmax over the depth dimension L
            alpha = F.softmax(scores, dim=0)  # [L, B, T]

            if not training and self.skip_threshold > 0:
                # Zero out adapters below threshold (dynamic skipping);
                # average over batch/seq for the threshold decision
                alpha_mean = alpha.mean(dim=(1, 2), keepdim=True)  # [L, 1, 1]
                mask = (alpha_mean >= self.skip_threshold).float()
                alpha = alpha * mask
                # Re-normalize after masking
                alpha = alpha / alpha.sum(dim=0, keepdim=True).clamp(min=1e-8)
                # NOTE: a production kernel would gather only the surviving
                # adapter outputs from HBM; here we mask for clarity.

            # Weighted sum of previous adapter outputs:
            # [L, B, T] x [L, B, T, d] -> [B, T, d]
            attn_residual = torch.einsum('lbt,lbtd->btd', alpha, stacked)

        output = base_out + lora_out + attn_residual
        # Return lora_out WITHOUT detaching: later blocks must be able to
        # backpropagate through the attention residual into this adapter,
        # which is the whole point of the improved gradient-flow argument.
        return output, lora_out
```
AttnResLoRALayer — single layer implementation with dynamic skipping support.
Here is how you wire multiple layers together in a simple model:
```python
class AttnResLoRAModel(nn.Module):
    """
    Wrapper that applies AttnRes-LoRA across multiple adapted layers.
    Each layer receives the cached outputs of all previous LoRA adapters.
    """

    def __init__(self, frozen_layers: List[nn.Linear], rank: int = 4):
        super().__init__()
        self.lora_layers = nn.ModuleList([
            AttnResLoRALayer(layer, rank=rank) for layer in frozen_layers
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prev_outputs: List[torch.Tensor] = []
        h = x
        for layer in self.lora_layers:
            h, lora_out = layer(h, prev_lora_outputs=prev_outputs,
                                training=self.training)
            prev_outputs.append(lora_out)
        return h
```
```python
# Quick usage example
if __name__ == "__main__":
    # Simulate 8 frozen linear layers (e.g., projection layers in a transformer)
    frozen_linears = [nn.Linear(512, 512) for _ in range(8)]
    model = AttnResLoRAModel(frozen_linears, rank=4)

    x = torch.randn(2, 128, 512)  # [batch=2, seq_len=128, d=512]
    out = model(x)
    print(f"Output shape: {out.shape}")  # [2, 128, 512]

    # Count trainable parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable params: {trainable:,}")
    # LoRA matrices + queries: 8 layers x (4x512x2 + 512) = 36,864
```
Composing multiple AttnRes-LoRA layers with the shared adapter output cache.
9. What Is Worth Exploring
The ideas above raise several concrete questions that seem worth running experiments on. I'll try to be precise about what each experiment would actually test and what a positive or negative result would mean.
Does the gradient flow actually improve?
In ResLoRA, the gradient of the loss $\mathcal{L}$ with respect to an early adapter $\mathrm{LoRA}_i$ that feeds into layer $l$ is (considering only the residual path through $\mathbf{h}_l$):

$$\frac{\partial \mathcal{L}}{\partial\, \mathrm{LoRA}_i(\mathbf{x})} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_l}$$
In AttnRes-LoRA, the same gradient is scaled by the learned attention weight (up to the additional terms that flow through $\alpha$ itself):

$$\frac{\partial \mathcal{L}}{\partial\, \mathrm{LoRA}_i(\mathbf{x})} \approx \alpha_{i \to l}\, \frac{\partial \mathcal{L}}{\partial \mathbf{h}_l}$$
This is a self-reinforcing signal: an adapter that is useful for a task gets attended to more, receives stronger gradients, and specializes further. An adapter that is not useful gets near-zero gradient from the residual path and converges quickly on its local input only. Whether this produces faster or more stable convergence than uniform weighting is an empirical question — but the mechanism is theoretically sound and worth measuring with training loss curves against LoRA and ResLoRA baselines.
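The $\alpha$-scaled gradient can be verified with autograd on a toy mixture. Here the scores are detached so only the mixture path is measured, and normalization is omitted for clarity:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_prev = 16, 3
prev = torch.randn(n_prev, d, requires_grad=True)  # earlier adapter outputs
w = torch.randn(d) * 0.5

# Detach the scores so the measured gradient flows only through the mixture,
# not through the softmax itself
alpha = F.softmax(prev.detach() @ w, dim=0)
residual = alpha @ prev        # AttnRes aggregation at the current block
residual.sum().backward()

# Through the mixture path, d(residual)/d(LoRA_i) is exactly alpha_i
for i in range(n_prev):
    assert torch.allclose(prev.grad[i], alpha[i].expand(d))
print(alpha)
```

Note this also means the popular ResLoRA-style `detach()` on cached adapter outputs must not be used here: detaching would sever exactly the gradient path this argument relies on.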
Does $\alpha$ become sparse naturally, and does it mean anything?
Softmax over $L-1$ values does not guarantee sparsity — for small $L$ or low-temperature attention, the weights can remain close to uniform. But as $L$ grows and the model trains, we would expect the weights to concentrate. The interesting question is not just whether sparsity emerges, but what structure it has.
One testable hypothesis: for tokens of the same syntactic type (e.g., verbs), the $\alpha$ distribution at a given layer is more consistent across examples than for tokens of different types. If true, that would suggest the attention mechanism is discovering something semantically meaningful about which adapter layers capture which type of information — which would be a compelling interpretability result on top of the efficiency story.
A simpler sanity check: visualize $\alpha_{i \to l}$ as a heatmap (adapter index $i$ on x-axis, block index $l$ on y-axis) and check whether it resembles a diagonal (each block attends mostly to its immediate predecessor, similar to standard residuals) or a more structured pattern.
Can AttnRes-LoRA with low rank match high-rank LoRA?
Standard LoRA's expressivity is bounded by its rank $r$. The effective rank of the learned update $\Delta\mathbf{W} = \mathbf{BA}$ is at most $r$. In AttnRes-LoRA, the aggregated residual at block $l$ is a weighted sum of $l-1$ rank-$r$ updates (treating the input-dependent $\alpha$ weights as fixed for a given token):

$$\Delta\mathbf{W}_{\text{eff}} = \sum_{i=1}^{l-1} \alpha_{i \to l}\, \mathbf{B}_i \mathbf{A}_i$$
Each term has rank $\leq r$, but their weighted sum can have rank up to $\min((l-1) \cdot r,\, d)$. With $l = 10$ and $r = 4$, the effective update can have rank up to 36. This is the mathematical basis for the claim that many low-rank adapters, selectively combined, can match the expressivity of a single high-rank adapter — without the memory cost of storing a large matrix.
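The rank claim is easy to verify numerically with random matrices (toy size $d = 64$):

```python
import torch

torch.manual_seed(0)
d, r, n = 64, 4, 9   # hidden size, adapter rank, number of aggregated adapters

# n independent rank-r updates B_i A_i and a softmax mixture over them
updates = [torch.randn(d, r) @ torch.randn(r, d) for _ in range(n)]
alpha = torch.softmax(torch.randn(n), dim=0)
mixed = sum(a * u for a, u in zip(alpha, updates))

print(torch.linalg.matrix_rank(updates[0]).item())  # 4: each update is rank r
print(torch.linalg.matrix_rank(mixed).item())       # 36: up to n * r
```

With generic (random) factors the mixture attains the full $n \cdot r = 36$ rank; in a trained model the adapters may overlap, so 36 is an upper bound, not a guarantee.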
The clean experiment: fine-tune a fixed LLM on a reasoning benchmark (e.g., GSM8K or MATH) with (a) LoRA $r=4$, (b) LoRA $r=32$, (c) AttnRes-LoRA $r=4$ across 8 blocks. If (c) matches (b) while using 8× fewer parameters per adapter, that is a meaningful result.
Multi-task routing without task labels
A separate and interesting direction: fine-tune a single model on two distinct tasks simultaneously (e.g., code generation and mathematical reasoning) using AttnRes-LoRA. The hypothesis is that the $\alpha$ weights will naturally separate — code tokens will route through a different subset of adapters than math tokens — without any explicit task identifier or routing supervision.
If this holds, it is a meaningful result for continual learning: you could add a new task by appending new LoRA blocks, and the attention mechanism would learn to route the new task's inputs through the new adapters while preserving the old pathways. No replay buffer, no task boundaries, no explicit masks.
This is the most speculative direction. The $\alpha$ weights are conditioned on the hidden state $\mathbf{x}$, which by the time it reaches a deep block already contains mixed task information. Whether the attention mechanism can cleanly separate tasks from a mixed signal is genuinely unclear without experiments.
The Initialization Trap: Why $\mathbf{w}_l$ Needs a New Recipe
In standard LoRA, initialization is a solved problem: matrix $\mathbf{A}$ is initialized with random Gaussian noise to break symmetry, and matrix $\mathbf{B}$ is initialized to zero so that the adapter starts by doing nothing ($\Delta\mathbf{W} = 0$). For AttnRes-LoRA, we absolutely keep this standard recipe for the adapter matrices themselves.
However, the new depth-wise query vector $\mathbf{w}_l$ introduces a trap. It is tempting to initialize $\mathbf{w}_l$ to all zeros as well. But if $\mathbf{w}_l = \mathbf{0}$, the dot product against every previous adapter output is zero, and the softmax turns these identical scores into a perfectly uniform distribution (e.g., $[0.25, 0.25, 0.25, 0.25]$ over four predecessors).
Ironically, applying the "safe" zero-initialization means our network starts step zero acting exactly like ResLoRA — blindly averaging all previous layers — which is the exact dilution problem we are trying to fix!
This leaves us with an open research question:
- Do we initialize $\mathbf{w}_l$ with zeros, accept the uniform ResLoRA-like start, and trust the optimizer to learn sparsity quickly?
- Or do we initialize $\mathbf{w}_l$ with a hard-coded bias that heavily favors the immediate preceding block ($\alpha_{l-1} \approx 1.0$), forcing the network to initially mimic a standard sequential flow before it learns to route dynamically?
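Both options are easy to sketch. Note that $\mathbf{w}_l$ alone scores content, not position, so one way to realize the predecessor-biased start is an additional per-index logit bias; that bias term is a hypothetical parameter of my own, not part of the original proposal:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_prev, d = 4, 64
prev = torch.randn(n_prev, d)      # toy previous-adapter outputs
w_zero = torch.zeros(d)            # the "safe" zero-initialized query

# Option 1: zero init. All scores are 0, so softmax yields a uniform,
# ResLoRA-like start
alpha_uniform = F.softmax(prev @ w_zero, dim=0)
print(alpha_uniform)               # tensor([0.2500, 0.2500, 0.2500, 0.2500])

# Option 2: a per-index logit bias favoring the immediate predecessor
# (hypothetical extra parameter, since w_l cannot target an index by itself)
bias = torch.zeros(n_prev)
bias[-1] = 5.0                     # large logit on block l-1
alpha_biased = F.softmax(prev @ w_zero + bias, dim=0)
print(alpha_biased)                # ~0.98 of the mass on the predecessor
```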
Conclusion
The Attention Residuals paper by the Kimi team challenges one of the longest-standing defaults in deep learning architecture. By showing that depth-wise attention is not only feasible at scale but also better behaved than uniform addition (bounded residual norms, shorter gradient paths), they have opened the door to more intelligent and communicative network designs.
AttnRes-LoRA is a natural and necessary extension of this idea. As we push the limits of Parameter-Efficient Fine-Tuning (PEFT) on complex reasoning and high-fidelity visual tasks, we cannot afford to let early adapter signals drown in a noisy, unweighted residual stream.
By allowing adapters to dynamically route features and selectively remember past states, we can solve the dilution problem of ResLoRA, enable dynamic compute-saving skips during inference, and drastically increase the expressivity of low-rank fine-tuning without expanding our parameter budget.
The theoretical foundation is there, and the PyTorch implementation is lightweight. The next step is empirical validation. If you are exploring parameter-constrained fine-tuning or want to collaborate on benchmarking these routing dynamics, feel free to reach out!
Umar Khalid is an AI Research Scientist II at Axon Enterprises, working on computer vision and multimodal AI. Previously at Meta, Samsung Research, and Microsoft. PhD from the University of Central Florida.
Have thoughts on AttnRes-LoRA? Reach out at umar@umarkhalid.com or connect on LinkedIn.