Query
How do the ML models used in wireless sensing (CNNs in 802.11bf) compare to transformer-based architectures like those in Attention Is All You Need, and what drives the architectural choice?
Answer
CNN (MILAGRO) vs. Transformer: Architectural Comparison for Wireless Sensing
Structural Differences
| Dimension | MILAGRO 1D CNN | Transformer |
|---|---|---|
| Core operation | Conv1D cross-correlation over local kernel (kernel=2) | Scaled dot-product attention: softmax(QKᵀ/√d_k)·V |
| Dependency range | Local — kernel captures consecutive time steps only | Global — every position attends to every other position simultaneously |
| Complexity | O(n) per layer | O(n²) per layer (attention matrix) |
| Positional awareness | Implicit via convolution order | Requires explicit positional encoding (sinusoidal) |
| Architecture depth | Shallow cascade: Conv1D → MaxPool1D → Dense(128) → Dense(labels) | Deep stacked encoder/decoder layers with residual connections and layer normalization |
| Multi-stream fusion | Hierarchical two-block cascade (Block 1 mmWave → Block 2 5 GHz + Block 1 output) | Multi-head attention attends across subspaces simultaneously |
Sources: 802.11bf-multiband-passive-sening-reusing-wifi-singling-for-sensing, NIPS-2017-attention-is-all-you-need-Paper
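The "core operation" row of the table can be made concrete with a minimal numpy sketch of scaled dot-product attention. The formula follows Vaswani et al. (2017); the toy sizes (n=6, d_k=4) and variable names are illustrative only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) score matrix -> the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 6, 4                                        # toy sequence length and key dimension
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)                            # output is (n, d_k); weights are (n, n)
```

Every row of `w` is a full distribution over all n positions, which is exactly the "global dependency range" the table contrasts against the CNN's local kernel.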
What Drives the Architectural Choice
The wiki explicitly addresses this tradeoff:
1. Nature of temporal dependencies in the data
CSI beacon data and BT sweep data exhibit short-range temporal correlations. A Conv1D kernel of size 2 is sufficient to capture dependencies between consecutive time steps (beacons at ~100 ms intervals, consecutive AWV sweeps). The Transformer’s self-attention mechanism is designed for long-range dependencies — its key motivation in NLP is capturing relationships between distant tokens regardless of sequence length. This long-range capability is unnecessary and costly for the CSI/PDP input structure used in MILAGRO 802.11bf-multiband-passive-sening-reusing-wifi-singling-for-sensing.
2. Computational complexity
The O(n²) complexity of self-attention (where n is sequence length) is called out explicitly as a distinguishing factor versus CNN’s O(n) complexity 802.11bf-multiband-passive-sening-reusing-wifi-singling-for-sensing. For fixed-size RF input matrices (52×100 for 5 GHz CSI, 64×3000 for mmWave PDP), the quadratic scaling of attention would add overhead without benefit, since the task does not require reasoning over arbitrary-length sequences.
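A back-of-envelope comparison makes the scaling gap concrete for the input sizes quoted above. Channel and width constants are dropped, so these are order-of-growth illustrations, not FLOP counts for the actual models.

```python
# Per-layer scaling in sequence length n (constants dropped):
#   Conv1D with kernel k: ~k*n multiply-adds   -> O(n)
#   Self-attention:       ~n*n score entries   -> O(n^2)
def conv_ops(n, k=2):
    return k * n

def attn_ops(n):
    return n * n

for name, n in [("5 GHz CSI (100 time steps)", 100),
                ("mmWave PDP (3000 samples)", 3000)]:
    print(f"{name}: conv ~{conv_ops(n):,} vs attention ~{attn_ops(n):,} "
          f"({attn_ops(n) // conv_ops(n)}x)")
```

For the 3000-sample mmWave PDP, the attention matrix alone is 1500× the cost of the kernel-2 convolution pass, and since the input length is fixed, none of that extra work buys longer-range reasoning the task needs.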
3. Task nature: classification vs. sequence transduction
The Transformer was designed for sequence-to-sequence transduction (e.g., machine translation), where the decoder must generate output tokens conditioned on a full encoded input NIPS-2017-attention-is-all-you-need-Paper. MILAGRO’s task is multi-class classification over fixed-dimensional RF tensors — the output is a probability vector over labels, not a generated sequence. This mismatch in task structure further reduces the motivation to use attention.
4. Data volume and sample efficiency
MILAGRO saturates accuracy at approximately 60 training samples and 120 epochs 802.11bf-multiband-passive-sening-reusing-wifi-singling-for-sensing. Transformers are known to require large datasets to train effectively (the base Transformer in the original paper used millions of sentence pairs NIPS-2017-attention-is-all-you-need-Paper). A shallow CNN is far more appropriate when labeled training data is scarce — a practical constraint in RF sensing deployments requiring environment-specific data collection.
5. Fusion architecture requirements
MILAGRO’s design challenge is heterogeneous multiband fusion: two physically distinct signal modalities (60 GHz mmWave PDP, 5 GHz beacon CSI) with different resolutions and failure modes (mmWave is LOS-only, 5 GHz penetrates walls). The cascade architecture (Block 1 pre-classifies to constrain Block 2’s inference space) is a deliberate engineering choice to exploit complementary modality strengths. The Transformer’s multi-head attention could in principle attend across modalities, but the wiki does not suggest this was evaluated, and the cascade outperforms naive late-fusion of single-band models 802.11bf-multiband-passive-sening-reusing-wifi-singling-for-sensing.
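The cascade dataflow described above can be sketched schematically. Only the wiring (Block 1's mmWave prediction concatenated into Block 2's 5 GHz input) reflects the described design; the random linear maps stand in for trained Conv1D blocks, the label count is hypothetical, and the tensor sizes are reduced from the quoted 64×3000 and 52×100 for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
N_LABELS = 4                                   # hypothetical label count

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def block(x, n_out, seed):
    """Stand-in for a trained Conv1D block: random linear map + softmax."""
    W = np.random.default_rng(seed).standard_normal((n_out, x.size))
    return softmax(W @ x.ravel())

pdp_60ghz = rng.standard_normal((64, 30))      # toy mmWave PDP (reduced from 64x3000)
csi_5ghz = rng.standard_normal((52, 10))       # toy 5 GHz beacon CSI (reduced from 52x100)

# Block 1: mmWave-only pre-classification.
p1 = block(pdp_60ghz, N_LABELS, seed=2)

# Block 2: 5 GHz CSI fused with Block 1's output, constraining its inference space.
fused = np.concatenate([csi_5ghz.ravel(), p1])
p2 = block(fused, N_LABELS, seed=3)
print(p2)                                      # final probability vector over labels
```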
Summary Judgment
The architectural choice of a 1D CNN over a Transformer is driven by four aligned factors: local (not long-range) temporal structure in CSI/PDP data, O(n) vs O(n²) per-layer complexity, a classification (not transduction) task type, and low training-data availability — with the heterogeneous multiband cascade design as a fifth, fusion-specific consideration. The Transformer’s parallelization advantage over RNNs NIPS-2017-attention-is-all-you-need-Paper is irrelevant here since CNNs are already fully parallelizable. The Transformer’s global attention is powerful precisely where MILAGRO does not need it.