“Attention Over the Air”: Low-Latency Transformers in RAN

#AI-RAN #AI-for-RAN #Transformer


*Portions of this article are derived from a research manuscript currently under review at IEEE. A preprint is available here.

1. Why We Need a New AI for RAN Architecture

Modern 5G/6G radios must operate under rapidly changing channel and interference conditions while supporting flexible numerologies, MU-MIMO, and tight real-time deadlines. Hand-crafted PHY pipelines—synchronization → channel estimation → equalization → demapping → decoding—perform well only near the assumptions they were designed for; when pilot patterns, bands, or mobility profiles shift, they require re-tuning. A data-driven, spectrum-aware approach that learns correlations directly from I/Q and adapts without re-engineering is therefore desirable.
To gauge the upside, we first replaced individual blocks with learned models (see Section 2). Two practical issues remained:


  • Limited context. Fixed receptive fields make CNNs miss long-range correlations across subcarriers, symbols, and antennas, which are crucial under fast fading, Doppler, and frequency-selective interference.

  • Engineering overhead. Without a common design, each PHY task required its own architecture and bespoke pre/post-processing, increasing design, tuning, and maintenance effort.

2. Starting Point: Previous Achievements

Before developing the Transformer blueprint, we delivered single-task AI baselines at MWC 2025, which serve as concrete references for this work (see Fig. 1):


  • Channel Frequency Interpolation × CNN: ≈ 20% uplink throughput gain vs a non-AI estimator.

  • SRS Prediction × MLP: ≈ 13% downlink throughput gain at 80 km/h (smaller at lower speeds).

Fig. 1 — Baselines presented at MWC 2025: CNN-based channel interpolation (uplink) and MLP-based SRS prediction (downlink).

3. From Local to Global: Convolution vs Self-Attention

Convolutions look locally: a fixed-size kernel aggregates only a small neighborhood on the time–frequency grid. Wireless channels, however, often exhibit dependencies spanning tens to hundreds of resource elements (REs) across time and frequency (multipath clusters, Doppler, sparse pilots). As a result, CNNs either under-fit long-range structure or must grow deep and slow to compensate.

Self-attention is different: each token can attend to any other token, letting the model connect distant but relevant positions on the grid. This global view is well-suited to interpolation and prediction, where far-apart pilots or historical REs can be informative for the current estimate.

Fig. 2 — Convolution (local receptive fields) vs self-attention (global dependencies) on the time–frequency grid.
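
To make the contrast concrete, here is a minimal single-head self-attention in NumPy (a sketch: random projections stand in for learned weights, and the deployed encoder is multi-head). Every token receives a nonzero contribution from every other token, however far apart they sit on the grid — something a small convolution kernel cannot do in one layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x, d_k=16):
    """Single-head scaled dot-product self-attention over a token sequence.

    x: (num_tokens, d_model) -- one token per resource element (RE).
    Returns the attended output and the (num_tokens, num_tokens) weight matrix.
    """
    d_model = x.shape[1]
    # Random projections stand in for learned weight matrices in this sketch.
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d_k)                      # (N, N): every RE vs every RE
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V, weights

tokens = rng.standard_normal((48, 32))   # e.g. a 12-subcarrier x 4-symbol patch, d_model=32
out, attn = self_attention(tokens)
# Unlike a 3x3 kernel, token 0 is directly influenced by the most distant token 47.
print(out.shape, attn[0, 47] > 0)
```

Because the softmax weights are strictly positive, the "receptive field" of every output token is the entire sequence after a single layer.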

4. Unified Transformer — Design Principles

We standardize on a common blueprint (architectural template; weights trained per task) built around three principles:

1. Resource-Element (RE) tokenization

Flatten the 2D grid into a sequence of tokens, one per RE; optional fields (antenna/UE indices, pilot mask) can be appended. Positional embeddings preserve grid coordinates.
Why it fits: keeps per-RE fidelity while exposing global context without hand-crafted neighborhoods.

2. Shallow, amplitude-preserving encoder

A compact stack of multi-head self-attention layers (e.g., four layers, four heads) without early normalization, so magnitude cues are retained.
Why it fits: delivers long-range dependencies without deep stacks, containing latency/memory growth and avoiding amplitude wash-out that hurts channel reconstruction.

3. Task-specific heads

A lightweight head (LN → small FFN → projection) maps the shared representation to task-specific outputs: complex channels (interpolation/estimation), per-bit LLRs (demapping), or scalar scores (prediction/scheduling).
Why it fits: one blueprint across tasks reduces engineering overhead (I/O, embedding, implementation), while each task is trained and deployed as its own model instance.

Fig. 3 — Common blueprint (RE tokenization + shallow encoder + task-specific heads).
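
As a concrete illustration of RE tokenization, here is a NumPy sketch (dimensions are illustrative, and the normalized index fields merely stand in for the learned positional embeddings of the actual blueprint):

```python
import numpy as np

def tokenize_grid(grid, pilot_mask):
    """Flatten a complex time-frequency grid into one token per RE.

    grid:       (n_subcarriers, n_symbols) complex channel/IQ samples
    pilot_mask: same shape, 1.0 at pilot REs, 0.0 at data REs
    Each token = [Re, Im, pilot_flag, subcarrier_idx, symbol_idx];
    the normalized index fields stand in for learned positional embeddings.
    """
    F, T = grid.shape
    f_idx, t_idx = np.meshgrid(np.arange(F), np.arange(T), indexing="ij")
    tokens = np.stack(
        [grid.real, grid.imag, pilot_mask, f_idx / F, t_idx / T], axis=-1
    ).reshape(F * T, 5)
    return tokens

# Toy 12-subcarrier x 14-symbol grid with a sparse (illustrative) pilot pattern.
grid = np.exp(1j * np.linspace(0, np.pi, 12 * 14)).reshape(12, 14)
mask = np.zeros((12, 14))
mask[::4, ::7] = 1.0
tokens = tokenize_grid(grid, mask)
print(tokens.shape)   # (168, 5): 168 RE tokens, 5 features each
```

Per-RE fidelity is preserved because no pooling or striding ever merges neighboring REs; the grid can always be reconstructed by reshaping the token sequence back to (F, T).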

5. How It Works


  • Inputs: One or more resource grids (e.g., uplink PUSCH with data/pilots, or an SRS grid). Each RE becomes a token by concatenating its real/imaginary parts; optional side fields (antenna index, UE ID, pilot mask, target horizon) can be appended.

  • Positional encoding: Embeddings encode (subcarrier, symbol, antenna) so the sequence retains the 2D grid’s structure.

  • Encoding: The shallow self-attention encoder processes the sequence; any RE can attend to any other, capturing time-, frequency-, and antenna-wise dependencies without stacking many convolutional layers.

  • Heads & outputs: A task head reshapes back to grid geometry and emits the desired quantity—e.g., a complex channel grid (interpolation/estimation), per-bit LLRs (demapping), or a scalar forecast (SRS prediction). Minimal post-processing (e.g., hard decisions) is applied where needed.
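
As an illustration of the last step, a NumPy sketch of a demapping head followed by the hard-decision post-processing (random weights stand in for trained parameters, and the LLR sign convention shown is one common choice, not necessarily the deployed one):

```python
import numpy as np

rng = np.random.default_rng(0)

def demapping_head(encoded, bits_per_re=2):
    """LN -> small FFN -> projection to per-bit LLRs, one row per RE token."""
    n, d = encoded.shape
    # LayerNorm over the feature dimension
    x = (encoded - encoded.mean(-1, keepdims=True)) / (encoded.std(-1, keepdims=True) + 1e-6)
    W1 = rng.standard_normal((d, 64)) / np.sqrt(d)       # placeholder weights
    W2 = rng.standard_normal((64, bits_per_re)) / np.sqrt(64)
    return np.maximum(x @ W1, 0.0) @ W2                  # (n_tokens, bits_per_re) LLRs

def hard_decide(llrs):
    """Minimal post-processing: LLR > 0 -> bit 0 (one common sign convention)."""
    return (llrs < 0).astype(np.uint8)

llrs = demapping_head(rng.standard_normal((168, 32)))    # 168 RE tokens; QPSK: 2 bits/RE
bits = hard_decide(llrs)
print(llrs.shape, bits.shape)   # (168, 2) (168, 2)
```

An interpolation head differs only in its final projection (2 real outputs per token, recombined into a complex value) and a reshape back to the grid geometry.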

Implementation note: Latency numbers and hardware choices are discussed in Section 7 (Latency & Implementation).

6. Featured Use Cases

6.1 Channel Frequency Interpolation

Problem

In 5G, channels are directly observed only at pilot REs; values at data REs must be inferred. Classical interpolators (e.g., LMMSE) or 2D CNNs rely on local neighborhoods and may underuse far-apart pilot information.

Fig. 4 — Channel-interpolation sketch: pilots (green) and predicted values at data REs (orange).

Approach

Instantiate the blueprint for interpolation: feed pilot estimates and a pilot mask; output a complex channel grid across the full band. Self-attention exploits long-range correlations across subcarriers and symbols that local kernels miss.
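
For contrast, this is the kind of local baseline the attention-based interpolator is meant to improve on — per-symbol linear interpolation across frequency (a NumPy sketch; a practical LMMSE interpolator would additionally exploit channel statistics):

```python
import numpy as np

def linear_freq_interp(pilot_vals, pilot_idx, n_subcarriers):
    """Classical local baseline: linear interpolation across frequency.

    pilot_vals: complex pilot-RE channel estimates at subcarrier indices pilot_idx.
    Each data RE is inferred only from its two nearest pilots -- exactly the
    locality that the attention-based interpolator is designed to overcome.
    """
    k = np.arange(n_subcarriers)
    re = np.interp(k, pilot_idx, pilot_vals.real)   # interpolate real part
    im = np.interp(k, pilot_idx, pilot_vals.imag)   # interpolate imaginary part
    return re + 1j * im

pilots = np.array([1 + 0j, 0 + 1j, -1 + 0j])        # toy estimates at sparse pilot REs
H_hat = linear_freq_interp(pilots, np.array([0, 6, 11]), 12)
print(H_hat.shape)   # (12,)
```

In the Transformer instantiation, every data-RE token can instead weight all pilot tokens across the band, not just the two adjacent ones.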

Results

In 100 MHz n78 OTA evaluations, the Transformer interpolator achieved ≈ 30% uplink throughput gain vs LMMSE and ≈ 8% vs a CNN interpolator, with higher medians.

Fig. 5 — Uplink throughput (OTA): Transformer vs CNN vs LMMSE (box plots).

6.2 Sounding Reference Signal (SRS) Prediction

Problem

When uplink SRS reports are missing (UL gaps, mobility), the gNB must forecast near-future downlink beam quality to maintain alignment.

Fig. 6 — SRS prediction problem: missing SRS intervals on the UL timeline and the gNB’s forecast of the near-future channel.

Approach

Instantiate the blueprint for SRS prediction: ingest recent UL pilots/SRS and predict a future beam/quality for the target horizon. The global receptive field helps capture longer temporal dependencies.

Results

New simulations show ≈ 29% downlink gain at 80 km/h and ≈ 31% at 40 km/h, roughly 2.2–3.1× the gains of a prior MLP baseline.

Fig. 7 — SRS prediction (simulation): Transformer vs MLP gains at 40/80 km/h.

7. Latency & Implementation

7.1 Meeting the 500 µs Deadline

Constraint

Uplink reception requires PHY processing in ≈ 500 µs.

Result

For 100 MHz PUSCH (OTA), the end-to-end receiver completes in ≈ 337 µs on NVIDIA GH200—≈ 1.36× faster than a CNN-based receiver—comfortably within budget.

Fig. 8 — Mean end-to-end PUSCH latency (OTA): Transformer vs CNN.

7.2 The GPU Imperative

Observation

Even lightweight models exceed the budget on CPUs alone (> 500 µs on a modern Xeon).

Result

On NVIDIA H100, channel-interpolation inference runs in ≈ 0.205 ms, meeting the real-time target. GPUs (or dedicated AI accelerators) are therefore essential for practical deployment.

Fig. 9 — Inference time for channel interpolation: GPU (H100) vs CPU (Xeon).
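
For context, a minimal timing harness of the sort used to produce such measurements (a Python sketch: it times a NumPy matmul as a stand-in, not the actual model):

```python
import time
import numpy as np

def time_inference(fn, warmup=3, iters=20):
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()                                   # warm caches / allocator first
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return float(np.median(samples))           # median is robust to OS jitter

x = np.random.default_rng(0).standard_normal((256, 256))
ms = time_inference(lambda: x @ x)             # stand-in workload
print(f"{ms:.3f} ms")
```

On a GPU, the timed call must include a device synchronization (e.g., `torch.cuda.synchronize()` when using PyTorch); otherwise the measurement captures only the asynchronous kernel launch, not the inference itself.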

8. Other Applications


  • End-to-end reception: Output LLRs directly to unify estimation → equalization → demapping, reducing BLER while respecting latency budgets.

  • Full-band channel estimation: Reconstruct the entire band from sparse pilots; observed to outperform CNN estimators.

  • Cross-layer hooks: Score user pairs for MU-pairing and other RRM tasks.

9. Challenges and Future Directions


  • Broader task coverage: Extend beyond interpolation/SRS to cross-layer functions (e.g., scheduling, resource allocation) so the blueprint contributes to end-to-end optimization.

  • Efficiency & scale: Refine positional encodings; apply compression and early-exit strategies to push the accuracy–latency–power frontier on baseband/edge hardware.

  • Robustness & evaluation: Improve behavior under high mobility/fast fading and advance 3GPP-compliant system-level evaluations.

  • Interpretability & standards: Use attention insights to inform pilot design and debugging while engaging with standards on the path from research to deployment.

10. Toward the Realization of AI-Native RAN

In this column we showed that attention-based signal processing can run over the air within slot-time constraints and that a reusable Transformer blueprint—RE tokenization, a shallow attention encoder, and task-specific (non-shared) heads—maps cleanly onto real RAN tasks. We illustrated the approach on channel-frequency interpolation (OTA) and SRS prediction (simulation) and discussed the practical pieces that matter in deployment: I/O conventions, runtime envelopes, and the need for accelerator-class inference.

From here, our focus is on broadening 3GPP-aligned evaluations, pushing efficiency (compression, early-exit, numerology-robust embeddings), and extending integrations toward scheduling and RRM. We’ll keep sharing what works—and what doesn’t—through papers, talks, and collaborations with partners who want to help build AI-native RAN.
