Multi-Head Attention: Full Input Projection, Not Slicing
A critical clarification about how multi-head attention works in Transformers: Each attention head receives the entire input embedding, not a slice of it.
The Common Misconception
Many explanations suggest that with 768 dimensions and 12 heads, each head gets a different 64-dimensional "slice" (dims 1-64, 65-128, etc.). This is incorrect.
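The difference is easiest to see in a minimal sketch, assuming PyTorch and hypothetical tensor names (x, W_q0):

```python
# Illustrative contrast (hypothetical tensors): slicing vs. projecting.
import torch

d_model, d_head = 768, 64
x = torch.randn(10, d_model)          # 10 tokens, each a full 768-dim embedding

# Misconception: head 0 sees only the first 64 coordinates of x.
sliced = x[:, :d_head]                # (10, 64) -- NOT what Transformers do

# Reality: head 0 sees all 768 dims, projected through its own learned matrix.
W_q0 = torch.randn(d_model, d_head)   # per-head projection, shape (768, 64)
q0 = x @ W_q0                         # (10, 64) -- every output dim mixes all 768 inputs
```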
What Actually Happens
- Full input to each head: Every attention head receives the complete 768-dimensional embedding
- Unique projections: Each head has its own learned Q, K, V weight matrices that project the full input down to 64 dimensions
- Parallel processing: All heads compute attention simultaneously in their own 64-dim spaces
- Concatenation: The 12 outputs (12 × 64 = 768) are concatenated
- Final mixing: A learned linear transformation combines information across all heads (see the sketch after this list)
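These steps correspond to the following minimal self-attention sketch, assuming PyTorch and the common fused-projection layout; the class and variable names are illustrative, not taken from any particular library:

```python
# Minimal multi-head self-attention sketch (assumed d_model=768, n_heads=12).
# Each head projects the FULL 768-dim input; nothing is sliced beforehand.
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads          # 64 dims per head
        # One fused matrix per Q/K/V; conceptually 12 separate (768 x 64) projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)    # final mixing across heads

    def forward(self, x):                         # x: (batch, seq, 768)
        b, t, d = x.shape

        # Project the *entire* input, then reshape into 12 heads of 64 dims each
        def split(proj):
            return proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q), split(self.w_k), split(self.w_v)   # (b, 12, t, 64)

        # Parallel scaled dot-product attention in each head's 64-dim space
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)     # (b, 12, t, t)
        attn = scores.softmax(dim=-1)
        out = attn @ v                                                 # (b, 12, t, 64)

        # Concatenate the 12 heads (12 x 64 = 768), then mix with the output projection
        out = out.transpose(1, 2).contiguous().view(b, t, d)
        return self.w_o(out)
```

Note that each fused nn.Linear(d_model, d_model) is mathematically equivalent to 12 separate 768 × 64 per-head matrices: the split into heads happens after the full input has been projected, never before.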
The Key Insight
Think of it as 12 different experts examining the same patient: they all see the complete picture, but each learns to focus on different diagnostic patterns through its own projection matrices. The "different parts of the feature space" refers to different learned representations, not different input slices.
This design achieves both computational efficiency (parallel 64-dim operations) and representational richness (12 different learned perspectives on the same data).
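One way to see the efficiency claim, under the assumed 768-dim / 12-head configuration, is that the per-head projections cost exactly as many parameters as a single full-width projection would:

```python
# Parameter-count check (assumed d_model=768, n_heads=12): the 12 narrow
# per-head projections together cost no more than one full-width projection.
d_model, n_heads = 768, 12
d_head = d_model // n_heads              # 64
per_head = d_model * d_head              # 768 * 64 = 49,152 weights per head (for Q, say)
all_heads = n_heads * per_head           # 12 * 49,152 = 589,824
assert all_heads == d_model * d_model    # same as a single 768 x 768 matrix
```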