Attention Residuals & AttnRes-LoRA: Teaching Networks to Selectively Remember

Contents

  1. The Problem with Standard Residual Connections
  2. The Attention Residuals Paper
  3. Block AttnRes: Scaling It Up
  4. LoRA Recap
  5. ResLoRA and Its Dilution Problem
  6. AttnRes-LoRA: The Proposed Extension
  7. Dynamic Adapter Skipping Explained
  8. PyTorch Implementation
  9. What Is Worth Exploring

In mid-March 2026, Kimi (Moonshot AI) released a paper titled Attention Residuals. It is a deceptively simple idea: instead of blindly adding all previous layer outputs together in a residual stream, let each layer attend over the history of representations and decide what is actually worth carrying forward. The paper validates this at scale — including a 48B-parameter model trained on 1.4T tokens — and the results are compelling.

This post does two things. First, it is a thorough tutorial on the Attention Residuals paper — the math, the intuition, and why it matters. Second, I propose a novel extension: AttnRes-LoRA, which applies the same selective-attention idea to LoRA-based fine-tuning, solving a documented problem in ResLoRA and potentially enabling more expressive parameter-efficient fine-tuning at lower cost.

1. The Problem with Standard Residual Connections

Modern transformers use Pre-Layer Normalization (PreNorm) residual blocks. For a model with $L$ layers, the hidden state at layer $l$ is updated as:

$$\mathbf{h}_l = \mathbf{h}_{l-1} + F_l\!\left(\mathrm{LN}(\mathbf{h}_{l-1})\right)$$
Standard PreNorm residual update

where $F_l$ is the transformer block (self-attention + feed-forward), and $\mathrm{LN}$ is layer normalization. After $L$ layers, the final hidden state is:

$$\mathbf{h}_L = \mathbf{h}_0 + \sum_{l=1}^{L} F_l\!\left(\mathrm{LN}(\mathbf{h}_{l-1})\right)$$
Unrolled residual stream after L layers

Every layer contributes exactly once, with a fixed weight of 1. This has two connected problems:

  1. No selectivity: a layer cannot up-weight a distant representation that happens to be relevant to the current token, or down-weight one that is not. Every contribution is treated identically, regardless of content.
  2. Dilution with depth: in a 96-layer model, any single layer's output contributes, on average, just 1% of the final representation. Even if a specific layer learns something critical, it is drowned out by the noise from the other 95 layers.
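
The equivalence between the iterative update and the unrolled sum is easy to verify numerically. A minimal sketch, with small linear layers standing in for the real transformer blocks $F_l$:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, L = 16, 4
# Stand-ins for the transformer blocks F_l (toy, not real attention+FFN blocks)
blocks = [nn.Linear(d, d) for _ in range(L)]
ln = nn.LayerNorm(d)

h0 = torch.randn(d)

# Iterative PreNorm update: h_l = h_{l-1} + F_l(LN(h_{l-1}))
h = h0
contributions = []
for F_l in blocks:
    delta = F_l(ln(h))
    contributions.append(delta)
    h = h + delta

# Unrolled form: h_L = h_0 + sum of per-layer contributions, each with weight 1
h_unrolled = h0 + sum(contributions)
assert torch.allclose(h, h_unrolled, atol=1e-5)
```

Every `delta` enters the final state with the same fixed weight, which is exactly the uniformity the next section attacks.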

2. The Attention Residuals Paper

The Attention Residuals paper replaces fixed uniform summation with a learned, input-dependent weighting. At layer $l$, instead of adding $\mathbf{h}_{l-1}$ with weight 1, the model attends over all previous layer outputs $\{\mathbf{h}_0, \mathbf{h}_1, \ldots, \mathbf{h}_{l-1}\}$ and computes a soft weighted sum.

The update rule becomes:

$$\mathbf{h}_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot \mathbf{h}_i \;+\; F_l\!\left(\mathrm{LN}(\mathbf{h}_{l-1})\right)$$
Attention Residuals update

where the attention weights $\alpha_{i \to l}$ are computed as:

$$\alpha_{i \to l} = \mathrm{Softmax}_i\!\left(\mathbf{w}_l^\top \cdot \mathrm{RMSNorm}(\mathbf{h}_i)\right), \quad \mathbf{w}_l \in \mathbb{R}^d$$
Depth-wise attention weight computation

Here, $\mathbf{w}_l$ is a small learnable pseudo-query vector — one per layer, with $d$ parameters. The softmax is applied across all $l$ previous layer indices, so $\sum_{i=0}^{l-1} \alpha_{i \to l} = 1$. This means the residual stream magnitude is automatically controlled: the contribution to $\mathbf{h}_l$ from the history is always a convex combination, bounded in norm.
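
A minimal sketch of this weight computation, with a hand-rolled RMSNorm and toy dimensions (this is an illustration of the formula above, not the paper's code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, l = 64, 6  # hidden size and current layer index (toy values)

history = torch.randn(l, d)  # previous layer outputs h_0 .. h_{l-1}
w_l = torch.randn(d)         # learnable pseudo-query for layer l

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learnable gain, for brevity
    return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

scores = rms_norm(history) @ w_l   # one scalar score per previous layer: [l]
alpha = F.softmax(scores, dim=0)   # softmax across the l history entries

residual = alpha @ history         # convex combination in R^d

assert torch.isclose(alpha.sum(), torch.tensor(1.0))
```

Because `alpha` sums to one, the norm of `residual` is bounded by the largest history norm no matter how deep the model is.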

Why this helps

The paper reports consistent improvements across model sizes and demonstrates more uniform output magnitudes and gradient distribution across depth when integrated into a 48B-parameter model trained on 1.4T tokens.

3. Block AttnRes: Scaling It Up

A naive implementation of Attention Residuals requires attending over all $l$ previous layer outputs at every layer. For a 96-layer model, layer 95 would compute attention over 95 vectors in $\mathbb{R}^d$. The dot products themselves are cheap, but every previous hidden state must stay resident: the cache of layer outputs grows linearly with depth (and with batch size and sequence length), and the attention cost summed over all layers grows quadratically with depth.

To manage this, the paper introduces Block AttnRes: partition the $L$ layers into $B$ blocks of size $k = L/B$. Each layer only attends over the $B$ block-level representations (typically the last hidden state of each block), not all previous individual layer outputs.

$$\mathbf{h}_l = \sum_{b=1}^{\lfloor l/k \rfloor} \alpha_{b \to l} \cdot \mathbf{h}^{(b)}_{\text{end}} \;+\; F_l\!\left(\mathrm{LN}(\mathbf{h}_{l-1})\right)$$
Block AttnRes: attend over block-level representations

This reduces the total attention cost per forward pass from $O(L^2 d)$ to $O(LBd)$, and the cache from up to $L$ stored states to $B$, where $B \ll L$. The paper shows this preserves most of the gains of full AttnRes while remaining practical at large scale.
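
A toy sketch of the caching pattern, with a small noise update standing in for the real block computation (the numbers here are illustrative, not from the paper):

```python
import torch

torch.manual_seed(0)
d, L, B = 32, 96, 8
k = L // B  # block size: 12 layers per block

block_ends = []  # cache of block-end hidden states (at most B entries)
h = torch.randn(d)
for l in range(1, L + 1):
    # Stand-in for the real update h = h + F_l(LN(h))
    h = h + 0.01 * torch.randn(d)
    if l % k == 0:
        # Cache only one representative state per completed block
        block_ends.append(h.clone())

# Any layer attends over at most B cached states instead of up to L-1
assert len(block_ends) == B
```

The key saving is that only `block_ends` must be kept around for the depth-wise attention, not all 96 intermediate states.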

4. LoRA Recap

Before extending the idea, let's ground ourselves in LoRA. Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) technique where instead of updating a full weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d \times d}$, we learn a low-rank decomposition:

$$\mathbf{W} = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$$
LoRA weight decomposition

where $\mathbf{B} \in \mathbb{R}^{d \times r}$, $\mathbf{A} \in \mathbb{R}^{r \times d}$, and $r \ll d$ is the rank. The forward pass through a LoRA-adapted layer is:

$$\mathbf{h} = \mathbf{W}_0 \mathbf{x} + \frac{\alpha}{r}\,\mathbf{B}\mathbf{A}\mathbf{x}$$
LoRA forward pass (α is a scaling hyperparameter)

$\mathbf{W}_0$ is frozen. Only $\mathbf{A}$ and $\mathbf{B}$ are trained. With $r = 4$ and $d = 4096$, LoRA requires just $2 \times 4096 \times 4 = 32{,}768$ trainable parameters per layer — compared to $4096^2 \approx 16.7M$ for full fine-tuning. That is a 512× reduction.
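
The arithmetic behind those counts, spelled out:

```python
d, r = 4096, 4

lora_params = 2 * d * r   # A is r x d, B is d x r
full_params = d * d       # a full update of W_0

print(lora_params)                  # 32768
print(full_params // lora_params)   # 512  (the 512x reduction)
```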

Because $\mathbf{W}_0$ is frozen, the pre-trained weights are never overwritten: LoRA only adds a correction signal on top. This strongly mitigates catastrophic forgetting, although the adapted model's behavior can still shift, since the correction changes every forward pass through the layer.

5. ResLoRA and Its Dilution Problem

ResLoRA (Microsoft Research, 2024) recognized that LoRA adapters in different layers are isolated from each other — a late adapter cannot benefit from what an early adapter learned. Their fix: add residual connections between LoRA blocks.

Formally, at LoRA layer $l$:

$$\mathbf{h}_l = \mathrm{Base}_l(\mathbf{x}) + \mathrm{LoRA}_l(\mathbf{x}) + \sum_{i=1}^{l-1} \mathrm{LoRA}_i(\mathbf{x})$$
ResLoRA: fixed uniform sum of all previous adapter outputs

This speeds up training convergence significantly, because deep adapters receive gradient signals through shorter paths. However, ResLoRA inherits exactly the same flaw the Attention Residuals paper identified in standard residual connections: all previous adapter outputs are summed with equal weight.

As you stack more LoRA blocks, which you naturally do when adapting deeper models or adding multiple task-specific adapters, the sum in ResLoRA grows without bound. Early adapter signals from shallow layers are buried under the accumulated contributions of every later adapter: any individual adapter's share of the total residual weight drops as $1/(l-1)$.

The dilution problem: In a model with 32 LoRA blocks, the adapter at block 1 contributes just 3% of the final residual sum — even if it learned something critically relevant to the current token.
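
A quick numerical illustration of the contrast, with random vectors standing in for adapter outputs (toy data, not trained adapters):

```python
import torch

torch.manual_seed(0)
d = 512

for n in (4, 16, 32):
    outs = torch.randn(n, d)            # stand-ins for n adapter outputs
    uniform_sum = outs.sum(dim=0)       # ResLoRA-style: every weight is 1
    convex = outs.mean(dim=0)           # any convex combination stays bounded
    # The uniform sum's norm grows roughly like sqrt(n); the mean's shrinks
    print(n, uniform_sum.norm().item(), convex.norm().item())

# With 32 blocks, block 1 is one of 31 predecessors at the final block:
assert abs(1 / 31 - 0.032) < 0.01  # ~3% of the total residual weight
```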

6. AttnRes-LoRA: The Proposed Extension

The fix is direct: apply the Attention Residuals idea to the LoRA residual stream. Instead of blindly summing all previous adapter outputs, each LoRA block uses a lightweight query vector to attend over the history and selectively aggregate.

Let $\mathrm{LoRA}_i(\mathbf{x}) \in \mathbb{R}^d$ denote the output of the $i$-th adapter. The AttnRes-LoRA update at block $l$ is:

$$\mathbf{h}_l = \mathrm{Base}_l(\mathbf{x}) + \mathrm{LoRA}_l(\mathbf{x}) + \sum_{i=1}^{l-1} \alpha_{i \to l} \cdot \mathrm{LoRA}_i(\mathbf{x})$$
AttnRes-LoRA forward pass

The attention weights are computed exactly as in the Attention Residuals paper, but applied to adapter outputs rather than full hidden states:

$$\alpha_{i \to l} = \mathrm{Softmax}_i\!\left(\mathbf{w}_l^\top \cdot \mathrm{RMSNorm}\!\left(\mathrm{LoRA}_i(\mathbf{x})\right)\right), \quad \mathbf{w}_l \in \mathbb{R}^d$$
Depth-wise attention over adapter outputs

Here $\mathbf{w}_l$ is a small learnable query vector specific to block $l$. It is the only additional parameter introduced beyond the standard LoRA matrices: $d$ scalars per block, which is negligible (4,096 floats for a hidden size of $d = 4096$, regardless of the rank).

What gets trained vs. what stays frozen

As in standard LoRA, the base weights $\mathbf{W}_0$ of every adapted layer stay frozen. The trainable parameters are the per-block LoRA matrices $\mathbf{A}_l$ and $\mathbf{B}_l$, plus the new per-block query vector $\mathbf{w}_l$ (and, in the implementation below, a small RMSNorm gain).

Why the softmax is essential

The softmax over $i$ enforces that $\sum_{i=1}^{l-1} \alpha_{i \to l} = 1$. This has two consequences:

  1. The residual contribution is always a convex combination — bounded in norm regardless of depth.
  2. The weights can become sparse in practice: once the attention sharpens during training, the softmax tends to concentrate most of the weight on one or two adapters while the rest receive near-zero weight. This is the foundation for dynamic adapter skipping.

7. Dynamic Adapter Skipping Explained

One of the most compelling properties of AttnRes-LoRA is that it enables inference-time compute savings through dynamic adapter skipping. Let me be precise about how this works, because the causality is subtle.

What we are NOT skipping

When the token passes through Layer 12, the LoRA adapter at Layer 12 computes its output. This happens unconditionally — we cannot skip it because Layer 12 must produce a hidden state for Layer 13 to consume. AttnRes-LoRA does not skip the forward computation of adapters.

What we ARE skipping

After the token has already passed through Layers 1–49, we are now at Layer 50 computing the AttnRes aggregation. Layer 50 needs to compute:

$$\sum_{i=1}^{49} \alpha_{i \to 50} \cdot \mathrm{LoRA}_i(\mathbf{x})$$

In standard ResLoRA, the GPU must fetch all 49 stored adapter outputs from high-bandwidth memory (HBM) and perform 49 tensor additions. This is expensive — not because of computation, but because of memory bandwidth (the "memory wall").

In AttnRes-LoRA, we first compute the attention scores using the cheap dot products $\mathbf{w}_{50}^\top \cdot \mathrm{RMSNorm}(\mathrm{LoRA}_i(\mathbf{x}))$. These dot products use only the small query $\mathbf{w}_{50}$ and the low-dimensional adapter outputs — very cheap. The result might look like:

$$[\alpha_{1\to50}, \ldots, \alpha_{49\to50}] = [0.001, 0.002, \ldots, 0.002, 0.85, 0.14, 0.001, \ldots]$$

With a threshold $\tau = 0.05$: only adapters 48 and 49 meet the threshold. We skip fetching the other 47 adapter outputs from HBM entirely. We do not perform their tensor additions. We save ~96% of the memory bandwidth cost for this aggregation step.
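
A standalone sketch of the thresholding step, using made-up weights that match the example above:

```python
import torch

tau = 0.05  # skip threshold

# Hypothetical attention weights over 49 previous adapters (illustrative)
alpha = torch.full((49,), 0.01 / 47)
alpha[47] = 0.85
alpha[48] = 0.14
alpha = alpha / alpha.sum()  # make it a valid distribution

keep = alpha >= tau
print(keep.sum().item())           # 2 adapters survive the threshold
print(1 - keep.sum().item() / 49)  # ~0.96 of the HBM fetches are skipped

# Renormalize the surviving weights before the weighted sum
alpha_kept = alpha[keep] / alpha[keep].sum()
assert torch.isclose(alpha_kept.sum(), torch.tensor(1.0))
```

Renormalizing after masking keeps the aggregation a convex combination even when most adapters are skipped.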

Token-dependent routing: For the token "quantum," adapters 48 and 49 dominate. For the next token "entanglement," the scores shift — perhaps adapter 12 (which learned scientific vocabulary during fine-tuning) suddenly gets $\alpha_{12\to50} = 0.72$. The skipping pattern changes token-by-token, dynamically.

8. PyTorch Implementation

Below is a clean implementation of a single AttnRes-LoRA layer. It wraps any linear layer with LoRA matrices and the depth-wise attention query.

import torch
import torch.nn as nn
import torch.nn.functional as F
from typing import Optional, List, Tuple


class AttnResLoRALayer(nn.Module):
    """
    A single LoRA-adapted linear layer with Attention Residuals support.

    Forward pass accepts a list of previous adapter outputs and computes
    a softmax-weighted aggregation (AttnRes) before adding to the current
    adapter output.

    Args:
        base_layer: The frozen linear layer to adapt.
        rank: LoRA rank r (default 4).
        alpha: LoRA scaling factor (default 1.0).
        skip_threshold: Adapters with alpha < skip_threshold are skipped
                        during inference aggregation (default 0.05).
    """
    def __init__(
        self,
        base_layer: nn.Linear,
        rank: int = 4,
        alpha: float = 1.0,
        skip_threshold: float = 0.05,
    ):
        super().__init__()
        self.base_layer = base_layer
        self.rank = rank
        self.scaling = alpha / rank
        self.skip_threshold = skip_threshold

        d_out, d_in = base_layer.weight.shape

        # LoRA matrices — same as standard LoRA
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))

        # Depth-wise query for attention over previous adapter outputs
        self.w_query = nn.Parameter(torch.randn(d_out) * 0.01)
        self.rms_norm = nn.RMSNorm(d_out)

        # Freeze base layer
        for p in self.base_layer.parameters():
            p.requires_grad = False

    def lora_output(self, x: torch.Tensor) -> torch.Tensor:
        """Compute only the LoRA delta (no base layer)."""
        return (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    def forward(
        self,
        x: torch.Tensor,
        prev_lora_outputs: Optional[List[torch.Tensor]] = None,
        training: bool = True,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Args:
            x: Input tensor [batch, seq_len, d_in]
            prev_lora_outputs: List of previous adapter outputs,
                               each [batch, seq_len, d_out]
            training: If False, apply skip_threshold for efficiency.

        Returns:
            (output, current_lora_out) — output is the full layer result;
            current_lora_out should be appended to prev_lora_outputs
            for the next layer.
        """
        base_out = self.base_layer(x)        # [B, T, d_out]
        lora_out = self.lora_output(x)       # [B, T, d_out]

        attn_residual = torch.zeros_like(lora_out)

        if prev_lora_outputs:
            # Stack previous outputs: [L, B, T, d_out]
            stacked = torch.stack(prev_lora_outputs, dim=0)
            L = stacked.shape[0]

            # Normalize each previous output: [L, B, T, d_out]
            normed = self.rms_norm(stacked)

            # Compute attention scores via dot product with w_query
            # w_query: [d_out] → scores: [L, B, T]
            scores = torch.einsum('d,lbtd->lbt', self.w_query, normed)

            # Softmax over the L dimension
            alpha = F.softmax(scores, dim=0)  # [L, B, T]

            if not training and self.skip_threshold > 0:
                # Zero out adapters below threshold (dynamic skipping).
                # Averaging over batch/seq is a simplification: it yields one
                # skip decision per batch, whereas true token-level routing
                # would threshold alpha per position.
                alpha_mean = alpha.mean(dim=(1, 2), keepdim=True)  # [L,1,1]
                mask = (alpha_mean >= self.skip_threshold).float()
                alpha = alpha * mask
                # Re-normalize after masking
                alpha_sum = alpha.sum(dim=0, keepdim=True).clamp(min=1e-8)
                alpha = alpha / alpha_sum

            # Weighted sum of previous adapter outputs
            # [L,B,T] × [L,B,T,d] → [B,T,d]
            attn_residual = torch.einsum('lbt,lbtd->btd', alpha, stacked)

        output = base_out + lora_out + attn_residual
        # Do NOT detach lora_out here: later layers aggregate it with their
        # attention weights, and detaching would cut the residual gradient
        # path that the gradient-flow argument in Section 9 relies on.
        return output, lora_out

AttnResLoRALayer — single layer implementation with dynamic skipping support.

Here is how you wire multiple layers together in a simple model:

class AttnResLoRAModel(nn.Module):
    """
    Wrapper that applies AttnRes-LoRA across multiple adapted layers.
    Each layer receives the cached outputs of all previous LoRA adapters.
    """
    def __init__(self, frozen_layers: List[nn.Linear], rank: int = 4):
        super().__init__()
        self.lora_layers = nn.ModuleList([
            AttnResLoRALayer(layer, rank=rank) for layer in frozen_layers
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prev_outputs: List[torch.Tensor] = []
        h = x
        for layer in self.lora_layers:
            h, lora_out = layer(h, prev_lora_outputs=prev_outputs,
                                training=self.training)
            prev_outputs.append(lora_out)
        return h


# Quick usage example
if __name__ == "__main__":
    # Simulate 8 frozen linear layers (e.g., projection layers in a transformer)
    frozen_linears = [nn.Linear(512, 512) for _ in range(8)]
    model = AttnResLoRAModel(frozen_linears, rank=4)

    x = torch.randn(2, 128, 512)  # [batch=2, seq_len=128, d=512]
    out = model(x)
    print(f"Output shape: {out.shape}")  # [2, 128, 512]

    # Count trainable parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable params: {trainable:,}")
    # Per layer: lora_A (r×d) + lora_B (d×r) + w_query (d) + RMSNorm gain (d)
    # = 8 × (4×512×2 + 512 + 512) = 40,960

Composing multiple AttnRes-LoRA layers with the shared adapter output cache.

9. What Is Worth Exploring

The ideas above raise several concrete questions that seem worth running experiments on. I'll try to be precise about what each experiment would actually test and what a positive or negative result would mean.

Does the gradient flow actually improve?

In ResLoRA, the gradient of the loss $\mathcal{L}$ with respect to an early adapter $\mathrm{LoRA}_i$ that feeds into layer $l$ is:

$$\frac{\partial \mathcal{L}}{\partial \mathrm{LoRA}_i} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_l} \cdot 1$$
ResLoRA: uniform gradient weight — every adapter receives equal signal regardless of relevance

In AttnRes-LoRA, the same gradient is scaled by the learned attention weight:

$$\frac{\partial \mathcal{L}}{\partial \mathrm{LoRA}_i} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_l} \cdot \alpha_{i \to l}$$
AttnRes-LoRA: gradient is weighted — adapters that are attended to more receive stronger training signal

This is a self-reinforcing signal: an adapter that is useful for a task gets attended to more, receives stronger gradients, and specializes further. An adapter that is not useful gets near-zero gradient from the residual path and converges quickly on its local input only. Whether this produces faster or more stable convergence than uniform weighting is an empirical question — but the mechanism is theoretically sound and worth measuring with training loss curves against LoRA and ResLoRA baselines.
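
The weighted-gradient claim can be checked directly with autograd. A toy sketch (the score path is detached so the gradient flows only through the weighted sum, matching the simplified equation above, which ignores the gradient through $\alpha$ itself):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n = 8, 3
prev = torch.randn(n, d, requires_grad=True)  # previous adapter outputs
w = torch.randn(d)                            # depth-wise query (toy)

alpha = F.softmax(prev.detach() @ w, dim=0)   # stop-grad through the scores
h = (alpha.unsqueeze(1) * prev).sum(dim=0)    # attention-weighted residual
loss = h.sum()
loss.backward()

# Each adapter's gradient is the upstream gradient (here all ones) scaled
# by its attention weight alpha_i, exactly as in the equation above
for i in range(n):
    expected = alpha[i] * torch.ones(d)
    assert torch.allclose(prev.grad[i], expected, atol=1e-6)
```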

Does $\alpha$ become sparse naturally, and does it mean anything?

Softmax over $L-1$ values does not guarantee sparsity — for small $L$ or low-temperature attention, the weights can remain close to uniform. But as $L$ grows and the model trains, we would expect the weights to concentrate. The interesting question is not just whether sparsity emerges, but what structure it has.

One testable hypothesis: for tokens of the same syntactic type (e.g., verbs), the $\alpha$ distribution at a given layer is more consistent across examples than for tokens of different types. If true, that would suggest the attention mechanism is discovering something semantically meaningful about which adapter layers capture which type of information — which would be a compelling interpretability result on top of the efficiency story.

A simpler sanity check: visualize $\alpha_{i \to l}$ as a heatmap (adapter index $i$ on x-axis, block index $l$ on y-axis) and check whether it resembles a diagonal (each block attends mostly to its immediate predecessor, similar to standard residuals) or a more structured pattern.

Can AttnRes-LoRA with low rank match high-rank LoRA?

Standard LoRA's expressivity is bounded by its rank $r$: the effective rank of the learned update $\Delta\mathbf{W} = \mathbf{BA}$ is at most $r$. In AttnRes-LoRA, the update seen by the base model at block $l$ combines the current adapter with a weighted sum of the $l-1$ previous rank-$r$ adapters:

$$\Delta\mathbf{W}_{\text{eff}} = \mathbf{B}_l\mathbf{A}_l + \sum_{i=1}^{l-1} \alpha_{i \to l}\,\mathbf{B}_i\mathbf{A}_i$$
Effective weight update seen by the base model at block l

Each term has rank $\leq r$, but their sum can have rank up to $\min(l \cdot r,\, d)$ once the current adapter is included. With $l = 10$ and $r = 4$, the effective update can have rank up to 40. This is the mathematical basis for the claim that many low-rank adapters, selectively combined, can match the expressivity of a single high-rank adapter without the memory cost of storing a large matrix.
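
The rank claim is easy to check numerically with random low-rank factors (toy dimensions, not a trained model):

```python
import torch

torch.manual_seed(0)
d, r, l = 64, 4, 10

# l adapters, each a rank-r product B_i A_i
deltas = [torch.randn(d, r) @ torch.randn(r, d) for _ in range(l)]
alpha = torch.softmax(torch.randn(l), dim=0)

single = deltas[0]
combined = sum(a * D for a, D in zip(alpha, deltas))

print(torch.linalg.matrix_rank(single).item())    # 4
# For generic random factors the combination has rank min(l*r, d) = 40
print(torch.linalg.matrix_rank(combined).item())
```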

The clean experiment: fine-tune a fixed LLM on a reasoning benchmark (e.g., GSM8K or MATH) with (a) LoRA $r=4$, (b) LoRA $r=32$, (c) AttnRes-LoRA $r=4$ across 8 blocks. If (c) matches (b) while using 4× fewer parameters per adapter, that is a meaningful result.

Multi-task routing without task labels

A separate and interesting direction: fine-tune a single model on two distinct tasks simultaneously (e.g., code generation and mathematical reasoning) using AttnRes-LoRA. The hypothesis is that the $\alpha$ weights will naturally separate — code tokens will route through a different subset of adapters than math tokens — without any explicit task identifier or routing supervision.

If this holds, it is a meaningful result for continual learning: you could add a new task by appending new LoRA blocks, and the attention mechanism would learn to route the new task's inputs through the new adapters while preserving the old pathways. No replay buffer, no task boundaries, no explicit masks.

This is the most speculative direction. The $\alpha$ weights are conditioned on the hidden state $\mathbf{x}$, which by the time it reaches a deep block already contains mixed task information. Whether the attention mechanism can cleanly separate tasks from a mixed signal is genuinely unclear without experiments.

The Initialization Trap: Why $\mathbf{w}_l$ Needs a New Recipe

In standard LoRA, initialization is a solved problem: matrix $\mathbf{A}$ is initialized with random Gaussian noise to break symmetry, and matrix $\mathbf{B}$ is initialized to zero so that the adapter starts by doing nothing ($\Delta\mathbf{W} = 0$). For AttnRes-LoRA, we absolutely keep this standard recipe for the adapter matrices themselves.

However, the new depth-wise query vector $\mathbf{w}_l$ introduces a trap. It is tempting to initialize $\mathbf{w}_l$ to all zeros as well. But if $\mathbf{w}_l = \mathbf{0}$, the dot product against all previous adapters is zero. When passed through the Softmax function, these zeros turn into a perfectly uniform distribution (e.g., $[0.25, 0.25, 0.25, 0.25]$).

Ironically, applying the "safe" zero-initialization means our network starts step zero acting exactly like ResLoRA — blindly averaging all previous layers — which is the exact dilution problem we are trying to fix!
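
The trap takes only a few lines to reproduce (RMSNorm omitted, since no normalization can rescue a zero query):

```python
import torch
import torch.nn.functional as F

d, num_prev = 8, 4
w_l = torch.zeros(d)              # the tempting "safe" zero init
prev = torch.randn(num_prev, d)   # stand-ins for previous adapter outputs

scores = prev @ w_l               # all zeros, whatever the history contains
alpha = F.softmax(scores, dim=0)
print(alpha)  # tensor([0.2500, 0.2500, 0.2500, 0.2500]) — uniform averaging

# A small random init breaks the tie from the very first step
w_l = torch.randn(d) * 0.01
alpha2 = F.softmax(prev @ w_l, dim=0)
```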

This leaves us with an open research question: how should $\mathbf{w}_l$ be initialized? Plausible candidates include small random values (breaking the uniform tie immediately, at the cost of an arbitrary starting preference) or a bias toward the most recent adapter (so that step zero mimics a standard residual connection rather than a uniform average). Which recipe trains best is an empirical question.

Conclusion

The Attention Residuals paper by the Kimi team challenges one of the longest-standing defaults in deep learning architecture. By showing that depth-wise attention over the residual stream is not only feasible but empirically beneficial at scale, it opens the door to more intelligent and communicative network designs.

AttnRes-LoRA is a natural extension of this idea. As we push the limits of Parameter-Efficient Fine-Tuning (PEFT) on complex reasoning and high-fidelity visual tasks, we cannot afford to let early adapter signals drown in a noisy, unweighted residual stream.

By allowing adapters to dynamically route features and selectively remember past states, we can solve the dilution problem of ResLoRA, enable dynamic compute-saving skips during inference, and drastically increase the expressivity of low-rank fine-tuning without expanding our parameter budget.

The theoretical foundation is there, and the PyTorch implementation is lightweight. The next step is empirical validation. If you are exploring parameter-constrained fine-tuning or want to collaborate on benchmarking these routing dynamics, feel free to reach out!


Umar Khalid is an AI Research Scientist II at Axon Enterprises, working on computer vision and multimodal AI. Previously at Meta, Samsung Research, and Microsoft. PhD from the University of Central Florida.

Have thoughts on AttnRes-LoRA? Reach out at umar@umarkhalid.com or connect on LinkedIn.