Query
How do the ML models used in wireless sensing (CNNs in 802.11bf) compare to transformer-based architectures like those in Attention Is All You Need, and what drives the architectural choice?
Answer
CNN (MILAGRO) vs. Transformer: Architectural Comparison for Wireless Sensing
Structural Differences
| Dimension | MILAGRO 1D CNN | Transformer |
|---|---|---|
| Core operation | Conv1D cross-correlation over local kernel (kernel=2) | Scaled dot-product attention: softmax(QKᵀ/√d_k)·V |
| Dependency range | Local — kernel captures consecutive time steps only | Global — every position attends to every other position simultaneously |
| Complexity | O(n) per layer | O(n²) per layer (attention matrix) |
| Positional awareness | Implicit via convolution order | Requires explicit positional encoding (sinusoidal) |
| Architecture depth | Shallow cascade: Conv1D → MaxPool1D → Dense(128) → Dense(labels) | Deep stacked encoder/decoder layers with residual connections and layer normalization |
| Multi-stream fusion | Hierarchical two-block cascade (Block 1 mmWave → Block 2 5 GHz + Block 1 output) | Multi-head attention attends across subspaces simultaneously |
Sources: 802.11bf-multiband-passive-sening-reusing-wifi-singling-for-sensing, NIPS-2017-attention-is-all-you-need-Paper
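The "core operation" row of the table can be made concrete with a minimal numpy sketch of scaled dot-product attention. The formula follows Vaswani et al. (2017); the toy sizes (n=6, d_k=4) and variable names are illustrative only.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) score matrix -> the O(n^2) term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 6, 4                                        # toy sequence length and key dimension
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)                            # output is (n, d_k); weights are (n, n)
```

Every row of `w` is a full distribution over all n positions, which is exactly the "global dependency range" the table contrasts against the CNN's local kernel.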
What Drives the Architectural Choice
The wiki explicitly addresses this tradeoff:
1. Nature of temporal dependencies in the data
CSI beacon data and BT sweep data exhibit short-range temporal correlations. A Conv1D kernel of size 2 is sufficient to capture dependencies between consecutive time steps (beacons at ~100 ms intervals, consecutive AWV sweeps). The Transformer’s self-attention mechanism is designed for long-range dependencies — its key motivation in NLP is capturing relationships between distant tokens regardless of sequence length. This long-range capability is unnecessary and costly for the CSI/PDP input structure used in MILAGRO 802.11bf-multiband-passive-sening-reusing-wifi-singling-for-sensing.
2. Computational complexity
The O(n²) complexity of self-attention (where n is sequence length) is called out explicitly as a distinguishing factor versus CNN’s O(n) complexity 802.11bf-multiband-passive-sening-reusing-wifi-singling-for-sensing. For fixed-size RF input matrices (52×100 for 5 GHz CSI, 64×3000 for mmWave PDP), the quadratic scaling of attention would add overhead without benefit, since the task does not require reasoning over arbitrary-length sequences.
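A back-of-envelope comparison makes the scaling gap concrete for the input sizes quoted above. Channel and width constants are dropped, so these are order-of-growth illustrations, not FLOP counts for the actual models.

```python
# Per-layer scaling in sequence length n (constants dropped):
#   Conv1D with kernel k: ~k*n multiply-adds   -> O(n)
#   Self-attention:       ~n*n score entries   -> O(n^2)
def conv_ops(n, k=2):
    return k * n

def attn_ops(n):
    return n * n

for name, n in [("5 GHz CSI (100 time steps)", 100),
                ("mmWave PDP (3000 samples)", 3000)]:
    print(f"{name}: conv ~{conv_ops(n):,} vs attention ~{attn_ops(n):,} "
          f"({attn_ops(n) // conv_ops(n)}x)")
```

For the 3000-sample mmWave PDP, the attention matrix alone is 1500× the cost of the kernel-2 convolution pass, and since the input length is fixed, none of that extra work buys longer-range reasoning the task needs.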
3. Task nature: classification vs. sequence transduction
The Transformer was designed for sequence-to-sequence transduction (e.g., machine translation), where the decoder must generate output tokens conditioned on a full encoded input NIPS-2017-attention-is-all-you-need-Paper. MILAGRO’s task is multi-class classification over fixed-dimensional RF tensors — the output is a probability vector over labels, not a generated sequence. This mismatch in task structure further reduces the motivation to use attention.
4. Data volume and sample efficiency
MILAGRO saturates accuracy at approximately 60 training samples and 120 epochs 802.11bf-multiband-passive-sening-reusing-wifi-singling-for-sensing. Transformers are known to require large datasets to train effectively (the base Transformer in the original paper used millions of sentence pairs NIPS-2017-attention-is-all-you-need-Paper). A shallow CNN is far more appropriate when labeled training data is scarce — a practical constraint in RF sensing deployments requiring environment-specific data collection.
5. Fusion architecture requirements
MILAGRO’s design challenge is heterogeneous multiband fusion: two physically distinct signal modalities (60 GHz mmWave PDP, 5 GHz beacon CSI) with different resolutions and failure modes (mmWave is LOS-only, 5 GHz penetrates walls). The cascade architecture (Block 1 pre-classifies to constrain Block 2’s inference space) is a deliberate engineering choice to exploit complementary modality strengths. The Transformer’s multi-head attention could in principle attend across modalities, but the wiki does not suggest this was evaluated, and the cascade outperforms naive late-fusion of single-band models 802.11bf-multiband-passive-sening-reusing-wifi-singling-for-sensing.
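The cascade dataflow described above can be sketched schematically. Only the wiring (Block 1's mmWave prediction concatenated into Block 2's 5 GHz input) reflects the described design; the random linear maps stand in for trained Conv1D blocks, the label count is hypothetical, and the tensor sizes are reduced from the quoted 64×3000 and 52×100 for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
N_LABELS = 4                                   # hypothetical label count

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def block(x, n_out, seed):
    """Stand-in for a trained Conv1D block: random linear map + softmax."""
    W = np.random.default_rng(seed).standard_normal((n_out, x.size))
    return softmax(W @ x.ravel())

pdp_60ghz = rng.standard_normal((64, 30))      # toy mmWave PDP (reduced from 64x3000)
csi_5ghz = rng.standard_normal((52, 10))       # toy 5 GHz beacon CSI (reduced from 52x100)

# Block 1: mmWave-only pre-classification.
p1 = block(pdp_60ghz, N_LABELS, seed=2)

# Block 2: 5 GHz CSI fused with Block 1's output, constraining its inference space.
fused = np.concatenate([csi_5ghz.ravel(), p1])
p2 = block(fused, N_LABELS, seed=3)
print(p2)                                      # final probability vector over labels
```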
Summary Judgment
The architectural choice of a 1D CNN over a Transformer is driven by four aligned factors: local (not long-range) temporal structure in CSI/PDP data, O(n) vs O(n²) per-layer complexity, a classification (not transduction) task type, and low training-data availability — with the heterogeneous multiband cascade design as a fifth, fusion-specific consideration. The Transformer’s parallelization advantage over RNNs NIPS-2017-attention-is-all-you-need-Paper is irrelevant here since CNNs are already fully parallelizable. The Transformer’s global attention is powerful precisely where MILAGRO does not need it.