`W_k` and `W_v` turn out to live in amazingly low-dimensional spaces when v-stacked and fit with a single PCA. With principal components explaining just 30% of the variance, the LLM still performs reasonably well. No wonder the DeepSeek team added a latent matrix for `W_kv` to generate `W_k` and `W_v`: K and V are a nothing burger.
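As a rough illustration of that observation, here is a minimal sketch that v-stacks one layer's `k_proj` and `v_proj` weights, fits a single PCA, and reports how many components are needed to reach 30% explained variance. The checkpoint name and the choice of layer 0 are illustrative assumptions, not the repo's exact script:

```python
# Minimal sketch: how low-dimensional are K and V when v-stacked and fit with one PCA?
# The checkpoint name and layer index below are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
attn = model.model.layers[0].self_attn

# Projection weights as numpy arrays (rows = output dimensions).
k = attn.k_proj.weight.detach().numpy()
v = attn.v_proj.weight.detach().numpy()

# Stack K and V vertically and fit a single PCA on the combined matrix.
stacked = np.vstack([k, v])
pca = PCA().fit(stacked)

# How many shared components does it take to explain 30% of the variance?
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.30)) + 1
print(f"{n_components} of {pca.n_components_} components explain 30% of the variance")
```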
Also, this PCA trick is a quick way to find architectural improvements to LLMs without fine-tuning or blind exploration.
This project explores the effects of dimensionality reduction on TinyLlama by replacing the `W_k`, `W_q`, and `W_v` attention matrices with versions reconstructed from their PCA components.
In the `pca` branch of this repository, the `W_k`, `W_q`, and `W_v` matrices of TinyLlama are replaced with matrices reconstructed from PCA components that explain different amounts of the original variance.
As the variance threshold decreases, the model's performance progressively degrades, demonstrating the trade-offs between model compression and capabilities.
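For concreteness, a minimal sketch of this per-matrix reconstruction might look like the following. The checkpoint name and the 0.90 threshold are illustrative assumptions, not the repo's exact code:

```python
# Minimal sketch of the per-matrix reconstruction (pca branch): each of q_proj,
# k_proj and v_proj is compressed and reconstructed with its own PCA.
# The checkpoint name and 0.90 threshold are illustrative assumptions.
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM

VARIANCE_THRESHOLD = 0.90

def pca_reconstruct(weight: np.ndarray, variance_threshold: float) -> np.ndarray:
    """Keep just enough principal components to explain the threshold, then invert."""
    pca = PCA(n_components=variance_threshold)  # float in (0, 1): sklearn picks the count
    return pca.inverse_transform(pca.fit_transform(weight))

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

for layer in model.model.layers:
    attn = layer.self_attn
    for proj in (attn.q_proj, attn.k_proj, attn.v_proj):
        reconstructed = pca_reconstruct(proj.weight.detach().cpu().numpy(), VARIANCE_THRESHOLD)
        with torch.no_grad():
            proj.weight.copy_(torch.from_numpy(reconstructed).to(proj.weight.dtype))
```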
In the `shared_kv_pca` branch of this repository:
- The code now groups parameters by layer name (e.g., `model.layers.0.self_attn`)
- For each layer:
  - `q_proj` is processed separately with its own PCA
  - `k_proj` and `v_proj` are processed together with a shared PCA:
    - The matrices are stacked vertically using `np.vstack`
    - A single PCA is fit on the combined matrix
    - The same PCA components are then used to transform and reconstruct each matrix separately

This approach ensures that the `k_proj` and `v_proj` matrices use the same principal components, potentially capturing related patterns across both matrices while still maintaining their individual characteristics.
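A minimal sketch of that shared K/V step, with an illustrative 0.90 threshold and random stand-in matrices instead of the real weights:

```python
# Minimal sketch of the shared K/V PCA (shared_kv_pca branch): one PCA is fit on the
# v-stacked k_proj/v_proj weights, and both matrices are reconstructed from that
# shared basis. The 0.90 threshold and random stand-in matrices are illustrative.
import numpy as np
from sklearn.decomposition import PCA

def shared_kv_reconstruct(k: np.ndarray, v: np.ndarray, variance_threshold: float):
    """Fit a single PCA on np.vstack([k, v]); reconstruct K and V separately with it."""
    pca = PCA(n_components=variance_threshold)
    pca.fit(np.vstack([k, v]))                        # shared principal components
    k_rec = pca.inverse_transform(pca.transform(k))   # reconstruct K in the shared basis
    v_rec = pca.inverse_transform(pca.transform(v))   # reconstruct V in the shared basis
    return k_rec, v_rec

# Stand-ins shaped like TinyLlama's grouped-query K/V projections (256 x 2048).
k = np.random.randn(256, 2048).astype(np.float32)
v = np.random.randn(256, 2048).astype(np.float32)
k_rec, v_rec = shared_kv_reconstruct(k, v, variance_threshold=0.90)
print(k_rec.shape, v_rec.shape)
```

The only difference from the per-matrix version is that the components (and the mean used for centering) come from the stacked matrix, so K and V are projected onto the same subspace.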
Reconstructions are generated at the following variance thresholds:
- 99% of original variance
- 90% of original variance
- 80% of original variance
- 70% of original variance
- 60% of original variance
- 50% of original variance
- 40% of original variance
- 30% of original variance
- 20% of original variance
- 10% of original variance
Notice how, when we `np.vstack` the `k_proj` and `v_proj` matrices and fit a shared PCA that is used to reconstruct both, the LLM is much more resilient to the PCA variance threshold than with the trivial approach of using a separate PCA for `k_proj` and `v_proj`.
Notice how this method requires no training, fine-tuning, or backpropagation; it runs very fast on a CPU. Perhaps this is how teams competing to build foundation models quickly derive insights such as the fact that the latent matrices which help create `W_k` and `W_v` share structure. Also, it seems like KV has a lot more redundancy than I thought. Perhaps the next step is to reconstruct from PCA components shared between layers. Requesting comments.
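Purely as a sketch of that proposed next step (nothing like this is implemented in this repo), a cross-layer variant might fit one PCA on the K/V weights stacked across all layers:

```python
# Purely a sketch of the proposed next step (not implemented in this repo):
# fit one PCA on the k_proj/v_proj weights stacked across *all* layers, then
# reconstruct every matrix from that single cross-layer basis.
import numpy as np
from sklearn.decomposition import PCA

def cross_layer_kv_reconstruct(kv_weights: list[np.ndarray], variance_threshold: float):
    """kv_weights: the k_proj and v_proj matrices gathered from every layer."""
    pca = PCA(n_components=variance_threshold)
    pca.fit(np.vstack(kv_weights))  # one basis shared across all layers
    return [pca.inverse_transform(pca.transform(w)) for w in kv_weights]
```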
Make sure you have Python and uv installed.
`make run`
This runs the script with the default variance threshold (90%).
`make run VARIANCE_THRESHOLD=0.95`
You can specify any variance threshold between 0 and 1.
To automatically rerun the script when Python files change:
`make watch`
You can also specify a custom variance threshold in watch mode:
`make watch VARIANCE_THRESHOLD=0.99`
To just create the virtual environment and install dependencies:
`make install`