`W_k` and `W_v` turn out to live in amazingly low-dimensional spaces when v-stacked and fit with a single PCA. With principal components explaining just 30% of the variance, the LLM still performs reasonably well. No wonder the DeepSeek team added a latent matrix for `W_kv` to generate `W_k` and `W_v`: K and V are a nothing burger.
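As a rough illustration of that observation, here is a minimal sketch that v-stacks one layer's `k_proj` and `v_proj` weights, fits a single PCA, and reports how many components are needed to reach 30% explained variance. The checkpoint name and the choice of layer 0 are illustrative assumptions, not the repo's exact script:

```python
# Minimal sketch: how low-dimensional are K and V when v-stacked and fit with one PCA?
# The checkpoint name and layer index below are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
attn = model.model.layers[0].self_attn

# Projection weights as numpy arrays (rows = output dimensions).
k = attn.k_proj.weight.detach().numpy()
v = attn.v_proj.weight.detach().numpy()

# Stack K and V vertically and fit a single PCA on the combined matrix.
stacked = np.vstack([k, v])
pca = PCA().fit(stacked)

# How many shared components does it take to explain 30% of the variance?
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.30)) + 1
print(f"{n_components} of {pca.n_components_} components explain 30% of the variance")
```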
Also, this PCA trick is a quick way to find architectural improvements to LLMs without fine-tuning or blind exploration.
This project explores the effects of dimensionality reduction on TinyLlama by replacing the `W_k`, `W_q`, and `W_v` attention matrices with versions reconstructed from their PCA components.
In the `pca` branch of this repository, the `W_k`, `W_q`, and `W_v` matrices of TinyLlama are replaced with matrices reconstructed from PCA components that explain different amounts of the original variance.
As the variance threshold decreases, the model's performance progressively degrades, demonstrating the trade-offs between model compression and capabilities.
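For concreteness, a minimal sketch of this per-matrix reconstruction might look like the following. The checkpoint name and the 0.90 threshold are illustrative assumptions, not the repo's exact code:

```python
# Minimal sketch of the per-matrix reconstruction (pca branch): each of q_proj,
# k_proj and v_proj is compressed and reconstructed with its own PCA.
# The checkpoint name and 0.90 threshold are illustrative assumptions.
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM

VARIANCE_THRESHOLD = 0.90

def pca_reconstruct(weight: np.ndarray, variance_threshold: float) -> np.ndarray:
    """Keep just enough principal components to explain the threshold, then invert."""
    pca = PCA(n_components=variance_threshold)  # float in (0, 1): sklearn picks the count
    return pca.inverse_transform(pca.fit_transform(weight))

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

for layer in model.model.layers:
    attn = layer.self_attn
    for proj in (attn.q_proj, attn.k_proj, attn.v_proj):
        reconstructed = pca_reconstruct(proj.weight.detach().cpu().numpy(), VARIANCE_THRESHOLD)
        with torch.no_grad():
            proj.weight.copy_(torch.from_numpy(reconstructed).to(proj.weight.dtype))
```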
In the `shared_kv_pca` branch of this repository:
- The code now groups parameters by layer name (e.g., `model.layers.0.self_attn`)
- For each layer:
  - `q_proj` is processed separately with its own PCA
  - `k_proj` and `v_proj` are processed together with a shared PCA:
    - The matrices are stacked vertically using `np.vstack`
    - A single PCA is fit on the combined matrix
    - The same PCA components are then used to transform and reconstruct each matrix separately

This approach ensures that the `k_proj` and `v_proj` matrices use the same principal components, potentially capturing related patterns across both matrices while still maintaining their individual characteristics.
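A minimal sketch of that shared K/V step, with an illustrative 0.90 threshold and random stand-in matrices instead of the real weights:

```python
# Minimal sketch of the shared K/V PCA (shared_kv_pca branch): one PCA is fit on the
# v-stacked k_proj/v_proj weights, and both matrices are reconstructed from that
# shared basis. The 0.90 threshold and random stand-in matrices are illustrative.
import numpy as np
from sklearn.decomposition import PCA

def shared_kv_reconstruct(k: np.ndarray, v: np.ndarray, variance_threshold: float):
    """Fit a single PCA on np.vstack([k, v]); reconstruct K and V separately with it."""
    pca = PCA(n_components=variance_threshold)
    pca.fit(np.vstack([k, v]))                        # shared principal components
    k_rec = pca.inverse_transform(pca.transform(k))   # reconstruct K in the shared basis
    v_rec = pca.inverse_transform(pca.transform(v))   # reconstruct V in the shared basis
    return k_rec, v_rec

# Stand-ins shaped like TinyLlama's grouped-query K/V projections (256 x 2048).
k = np.random.randn(256, 2048).astype(np.float32)
v = np.random.randn(256, 2048).astype(np.float32)
k_rec, v_rec = shared_kv_reconstruct(k, v, variance_threshold=0.90)
print(k_rec.shape, v_rec.shape)
```

The only difference from the per-matrix version is that the components (and the mean used for centering) come from the stacked matrix, so K and V are projected onto the same subspace.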
Reconstructions are generated at the following variance thresholds:
- 99% of original variance
- 90% of original variance
- 80% of original variance
- 70% of original variance
- 60% of original variance
- 50% of original variance
- 40% of original variance
- 30% of original variance
- 20% of original variance
- 10% of original variance
Notice how, when we `np.vstack` the `k_proj` and `v_proj` matrices and fit a shared PCA that is used to reconstruct both, the LLM is much more resilient to the PCA variance threshold than with the trivial approach of using a separate PCA for `k_proj` and `v_proj`.
Notice how this method requires no training, fine-tuning, or backpropagation; it runs very fast on a CPU. Perhaps this is how teams competing to build foundation models quickly derive insights such as the fact that the latent matrices which help create `W_k` and `W_v` share structure. Also, it seems like KV has a lot more redundancy than I thought. Perhaps the next step is to reconstruct from PCA components shared between layers. Requesting comments.
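Purely as a sketch of that proposed next step (nothing like this is implemented in this repo), a cross-layer variant might fit one PCA on the K/V weights stacked across all layers:

```python
# Purely a sketch of the proposed next step (not implemented in this repo):
# fit one PCA on the k_proj/v_proj weights stacked across *all* layers, then
# reconstruct every matrix from that single cross-layer basis.
import numpy as np
from sklearn.decomposition import PCA

def cross_layer_kv_reconstruct(kv_weights: list[np.ndarray], variance_threshold: float):
    """kv_weights: the k_proj and v_proj matrices gathered from every layer."""
    pca = PCA(n_components=variance_threshold)
    pca.fit(np.vstack(kv_weights))  # one basis shared across all layers
    return [pca.inverse_transform(pca.transform(w)) for w in kv_weights]
```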
Make sure you have Python and uv installed.
`make run`
This runs the script with the default variance threshold (90%).
`make run VARIANCE_THRESHOLD=0.95`
You can specify any variance threshold between 0 and 1.
To automatically rerun the script when Python files change:
`make watch`
You can also specify a custom variance threshold in watch mode:
`make watch VARIANCE_THRESHOLD=0.99`
To just create the virtual environment and install dependencies:
`make install`