
TL;DR:

W_k and W_v turn out to be amazingly low-dimensional when stacked vertically and fit with a single shared PCA. With principal components explaining just 30% of the original variance, the LLM still performs reasonably well.

No wonder the DeepSeek team added a latent matrix for W_kv to generate W_k and W_v. K and V are a nothing burger.

This PCA trick is also a quick way to find architectural improvements to an LLM without fine-tuning or blind exploration.

Project Surgery

This project explores the effects of dimensionality reduction on TinyLlama by replacing the W_q, W_k, and W_v attention matrices with versions reconstructed from their PCA components.

Trivial Result

In the pca branch of this repository, the W_q, W_k, and W_v matrices of TinyLlama are replaced with matrices reconstructed from PCA components that explain different fractions of the original variance:

  • 99% of original variance
  • 97.5% of original variance
  • 95% of original variance

As the variance threshold decreases, the model's performance progressively degrades, demonstrating the trade-offs between model compression and capabilities.
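For reference, here is a minimal sketch of the per-matrix reconstruction, assuming scikit-learn's PCA and the TinyLlama/TinyLlama-1.1B-Chat-v1.0 checkpoint from Hugging Face; the model ID, threshold value, and loop structure are illustrative, and the actual script in the pca branch may differ in its details.

```python
# Sketch only: replace each attention projection with its own PCA reconstruction.
# Assumes scikit-learn and transformers; the checkpoint ID and names are illustrative.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM

VARIANCE_THRESHOLD = 0.95  # fraction of variance each reconstruction must explain

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

for name, param in model.named_parameters():
    if any(k in name for k in ("q_proj.weight", "k_proj.weight", "v_proj.weight")):
        W = param.detach().cpu().numpy()
        # A separate PCA per matrix: keep just enough components to reach the threshold.
        pca = PCA(n_components=VARIANCE_THRESHOLD, svd_solver="full")
        reduced = pca.fit_transform(W)          # project rows onto the kept components
        W_hat = pca.inverse_transform(reduced)  # reconstruct back in the original space
        with torch.no_grad():
            param.copy_(torch.from_numpy(W_hat).to(param.dtype))
```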

Interesting Result

In the shared_kv_pca branch of this repository, the matrices are processed as follows:

  1. The code now groups parameters by layer name (e.g., model.layers.0.self_attn)
  2. For each layer:
    • q_proj is processed separately with its own PCA
    • k_proj and v_proj are processed together with a shared PCA:
      • The matrices are stacked vertically using np.vstack
      • A single PCA is fit on the combined matrix
      • The same PCA components are then used to transform and reconstruct each matrix separately

This approach ensures that k_proj and v_proj matrices use the same principal components, potentially capturing related patterns across both matrices while still maintaining their individual characteristics.

  • 99% of original variance
  • 90% of original variance
  • 80% of original variance
  • 70% of original variance
  • 60% of original variance
  • 50% of original variance
  • 40% of original variance
  • 30% of original variance
  • 20% of original variance
  • 10% of original variance

Notice that when we np.vstack the k_proj and v_proj matrices and fit a shared PCA to reconstruct both, the LLM is far more resilient to aggressive variance thresholds than in the Trivial Result, where k_proj and v_proj each get their own PCA.
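As a rough illustration of this shared reconstruction, here is a minimal sketch under the same assumptions as before (scikit-learn's PCA, the Hugging Face TinyLlama checkpoint); it is not the code from the shared_kv_pca branch, only the idea it describes.

```python
# Sketch only: one shared PCA per layer for k_proj and v_proj, a separate PCA for q_proj.
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM

VARIANCE_THRESHOLD = 0.30  # even low thresholds stay usable with the shared PCA

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def reconstruct(pca, W):
    # Project onto the fitted components and map back to the original space.
    return pca.inverse_transform(pca.transform(W))

for layer in model.model.layers:
    attn = layer.self_attn

    # q_proj keeps its own PCA, as in the trivial approach.
    Wq = attn.q_proj.weight.detach().cpu().numpy()
    pca_q = PCA(n_components=VARIANCE_THRESHOLD, svd_solver="full")
    pca_q.fit(Wq)
    Wq_hat = reconstruct(pca_q, Wq)

    # k_proj and v_proj share one PCA: stack them vertically, fit once,
    # then transform and reconstruct each matrix separately.
    Wk = attn.k_proj.weight.detach().cpu().numpy()
    Wv = attn.v_proj.weight.detach().cpu().numpy()
    pca_kv = PCA(n_components=VARIANCE_THRESHOLD, svd_solver="full")
    pca_kv.fit(np.vstack([Wk, Wv]))
    Wk_hat, Wv_hat = reconstruct(pca_kv, Wk), reconstruct(pca_kv, Wv)

    with torch.no_grad():
        attn.q_proj.weight.copy_(torch.from_numpy(Wq_hat).to(attn.q_proj.weight.dtype))
        attn.k_proj.weight.copy_(torch.from_numpy(Wk_hat).to(attn.k_proj.weight.dtype))
        attn.v_proj.weight.copy_(torch.from_numpy(Wv_hat).to(attn.v_proj.weight.dtype))
```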

Conclusion & Further Direction

Notice that this method requires no training, fine-tuning, or backpropagation, and it runs very fast on a CPU. Perhaps this is how teams competing to build foundation models quickly derived the insight that W_k and W_v share structure that a latent matrix can capture. KV also seems to have far more redundancy than I expected. A natural next step is to reconstruct from PCA components shared between layers. Comments are welcome.

Setup

Make sure you have Python and uv installed.

Usage

Run with default settings

make run

This runs the script with the default variance threshold (90%).

Run with custom variance threshold

make run VARIANCE_THRESHOLD=0.95

You can specify any variance threshold between 0 and 1.

Development mode

To automatically rerun the script when Python files change:

make watch

You can also specify a custom variance threshold in watch mode:

make watch VARIANCE_THRESHOLD=0.99

Installation only

To just create the virtual environment and install dependencies:

make install

About

My LLM surgeries. Ablation studies.
