
Conversation

huing4257

Add a GEMM kernel for int2 weights. Also fix scaling problems in the previous bitlinear kernel.

@huing4257
Author

@microsoft-github-policy-service agree

@huing4257 huing4257 changed the title Add support for int2 prefill Add support for GPU int2 prefill Jul 30, 2025
@huing4257
Author

Fixes #284:

Could you help me explain Python?

Of course! Python is a high-level, interpreted programming language known for its simplicity and readability. Here are some key points to help you understand Python:
...

// Extract and decompress the int2 values
int32_t compressed = B_compressed[compressed_block_idx * 32 + tile_idx];
int8_t decompressed[16];
decode_i2s_to_i8s(&compressed, decompressed);
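For reference, a plain-C++ sketch of what an int2-to-int8 decode like this does: unpack one 32-bit word into 16 signed 2-bit values. The packing order (lowest bits first) and the sign convention (two's-complement 2-bit, range [-2, 1]) are assumptions for illustration; the kernel's actual `decode_i2s_to_i8s` may use a different layout.

```cpp
#include <cstdint>

// Hypothetical scalar analogue of decode_i2s_to_i8s. Assumes the 16 int2
// values are packed little-endian (value i occupies bits [2i, 2i+1]) and
// are two's-complement 2-bit integers: 0b00 -> 0, 0b01 -> 1, 0b10 -> -2,
// 0b11 -> -1. Both assumptions are illustrative, not the PR's layout.
void unpack_i2_to_i8(uint32_t compressed, int8_t out[16]) {
    for (int i = 0; i < 16; ++i) {
        uint32_t bits = (compressed >> (2 * i)) & 0x3;  // extract 2 bits
        // Sign-extend the 2-bit field to int8: (bits ^ 2) - 2 maps
        // 0 -> 0, 1 -> 1, 2 -> -2, 3 -> -1.
        out[i] = (int8_t)((int)(bits ^ 0x2u) - 2);
    }
}
```

The GPU version does the same unpacking with bit tricks across a register instead of a scalar loop, but the value mapping is the part that has to agree with the quantizer.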
Collaborator

Many threads will dequantize i2s values from the same weights. Could we add a pre-processing step that caches the dequantized result in shared memory?

Author

Yes, blocks with the same blockN but different blockM dequantize the same weights. However, shared memory is only accessible to threads within the same thread block. So if we want to cache the dequantized result, I think we either have to dequantize all weights in global memory, or loop over M within each block to reuse the weights, which may require split-K to keep parallelism high. How do you think we should implement this optimization?
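Within a single thread block, the "loop over M" option could look like the sketch below: the block cooperatively decodes its weight tile into shared memory once, then reuses it for every M row it handles. Tile sizes, the `decode_i2s_to_i8s` signature, and the indexing are all hypothetical, not the PR's actual kernel.

```cuda
// Hypothetical sketch of the shared-memory staging discussed above.
// TILE_N/TILE_K and the offset math are illustrative assumptions.
#define TILE_N 32
#define TILE_K 64

__device__ void decode_i2s_to_i8s(const int32_t* compressed, int8_t* out);

__global__ void gemm_i2_staged(const int32_t* __restrict__ B_compressed
                               /* , const int8_t* A, int32_t* C, ... */) {
    __shared__ int8_t B_tile[TILE_N * TILE_K];  // decoded weights, shared by the block

    // Cooperative pre-pass: threads decode disjoint 32-bit words (16 values each),
    // so each compressed word is dequantized exactly once per block.
    const int words_per_tile = TILE_N * TILE_K / 16;
    const int tile_base = blockIdx.x * words_per_tile;  // this block's slice of B
    for (int w = threadIdx.x; w < words_per_tile; w += blockDim.x) {
        int32_t compressed = B_compressed[tile_base + w];
        decode_i2s_to_i8s(&compressed, &B_tile[w * 16]);
    }
    __syncthreads();  // decoded tile is visible to all threads before the main loop

    // ... main GEMM loop reads B_tile instead of re-decoding per thread,
    //     iterating over the M rows assigned to this block ...
}
```

This removes the redundant decode only within one block; sharing across blocks with different blockM would still require either a global-memory dequant pass or the split-K restructuring mentioned above.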
