
Conversation

@mahdizaferanchi (Contributor)

Used CUDA block-level shared memory to optimize training kernels.

Find tests and results here.
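
For context, a minimal sketch of the general pattern being described (a block-level shared-memory reduction). The kernel, names, and block size below are illustrative only, not the PR's actual code:

```cuda
// Illustrative block-level shared-memory reduction; not the PR's actual kernel.
#define BLOCK_SIZE 256  // tunable constant; the best value can depend on the GPU model

__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float smem[BLOCK_SIZE];       // per-block shared memory
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    smem[tid] = (i < n) ? in[i] : 0.0f;      // stage global reads in shared memory
    __syncthreads();
    // Tree reduction: halve the number of active threads each step.
    // Assumes blockDim.x == BLOCK_SIZE and is a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = smem[0]; // one partial sum per block
}
```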

@Felix-Petersen (Owner)

Thanks for the additions! These optimizations restructure the code quite a bit and require a constant that could depend on the GPU model. I saw you showed improvements on a P100, but before merging, I want to test the changes myself on more recent and diverse GPU models.

That being said, I very much appreciate the improvements, and I intend to do these tests and integrate the block-level shared memory for the next version release (probably later this year).
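
As an aside, one possible way to handle such a GPU-dependent constant is to query device attributes at runtime rather than hard-coding it. A minimal sketch, illustrative only and not necessarily how this PR or repository handles it:

```cuda
// Illustrative: query per-device limits that could inform the constant's value.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int dev = 0;
    cudaGetDevice(&dev);
    int smem = 0, threads = 0;
    cudaDeviceGetAttribute(&smem, cudaDevAttrMaxSharedMemoryPerBlock, dev);
    cudaDeviceGetAttribute(&threads, cudaDevAttrMaxThreadsPerBlock, dev);
    printf("max shared memory per block: %d B, max threads per block: %d\n",
           smem, threads);
    return 0;
}
```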

@mahdizaferanchi (Contributor, Author)

That sounds great! Yes, I understand the decision to merge these changes isn't trivial. I'm glad my work could be helpful.
