GPU optimized symmetry operations #1097
Conversation
Thanks, some thoughts.
Changing the order of the loops reduces the number of kernel calls and large copies. As a result, performance is greatly improved:
Note that this introduces some level of clunkiness. For example, all data accessed within the
Nice. Your current implementation sparked another idea: fuse both functions. If it's too complicated, don't bother, but I think it should work.
acc = ρ_i
for S in symm_S
    idx = index_G_vectors(fft_size, S * G)
    acc *= isnothing(idx) ? zero(complex(T)) : one(complex(T))
end
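The multiply-by-zero-or-one trick in the snippet above is a branch-free mask: instead of an `if` that would cause divergent execution paths on the GPU, the accumulator is multiplied by 0 when any symmetry partner falls outside the grid. A minimal CPU-runnable sketch of the idea, using a toy `toy_index` stand-in for DFTK's `index_G_vectors` (names here are illustrative, not the project's API):

```julia
# Toy stand-in for index_G_vectors: returns an index when S*G lies inside
# the FFT grid, `nothing` otherwise. (Hypothetical simplification.)
toy_index(fft_size, G) = all(abs.(G) .< fft_size .÷ 2) ? 1 : nothing

# Branch-free masking: every iteration executes the same instructions;
# the accumulator is zeroed out multiplicatively if any partner is missing.
function masked_acc(ρ_i::Complex{T}, symm_S, fft_size, G) where {T}
    acc = ρ_i
    for S in symm_S
        idx = toy_index(fft_size, S * G)
        acc *= isnothing(idx) ? zero(complex(T)) : one(complex(T))
    end
    acc
end
```

With the identity symmetry and `fft_size = (8, 8, 8)`, a G-vector inside the grid passes through unchanged, while one outside is masked to zero.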
Formatting off.
function accumulate_over_symmetries!(ρaccu::AbstractArray, ρin::AbstractArray,
                                     basis::PlaneWaveBasis{T}, symmetries) where {T}
Formatting
Also, how much worse is this implementation compared to the CPU version? Can we not just make this the CPU version, too? It looks like it should not be much worse than what we have.
    ρaccu
end
function lowpass_for_symmetry!(ρ::AbstractGPUArray, basis::PlaneWaveBasis{T}; |
The only place where we need this is in symmetrize_ρ, where it comes right after accumulate_over_symmetries!. I think for the GPU version it would make a lot of sense to fuse these two functions into one and thus have a single map! going over all Gs. This you should be able to do with a boolean flag do_lowpass, which hopefully Julia is smart enough to constant-prop into the GPU kernel and fully compile away if set to false.
Again feel free to make this fused function also the CPU function if this does not hurt performance too much.
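The suggested fusion could look roughly like the sketch below: one map! over all G-vectors that accumulates symmetry partners and, guarded by the do_lowpass flag, zeroes components with a partner outside the grid. This is only an illustration of the idea; accumulate_and_lowpass!, index_G_vector, and the (S, τ_phase) symmetry representation are hypothetical stand-ins, not DFTK's actual signatures:

```julia
# Hypothetical fused kernel: single pass over all Gs doing both the
# symmetry accumulation and (optionally) the lowpass masking.
function accumulate_and_lowpass!(ρaccu, ρin, Gs, symmetries, index_G_vector;
                                 do_lowpass::Bool=true)
    map!(ρaccu, eachindex(Gs)) do iG
        acc  = zero(eltype(ρaccu))
        keep = true
        for (S, τ_phase) in symmetries       # τ_phase: precomputed phase factor
            idx = index_G_vector(S * Gs[iG])
            if isnothing(idx)
                keep = false                 # a symmetry partner left the grid
            else
                acc += τ_phase * ρin[idx]
            end
        end
        # When do_lowpass is a compile-time constant `false`, constant
        # propagation should let the compiler drop the masking entirely.
        (!do_lowpass || keep) ? acc : zero(acc)
    end
    ρaccu
end
```

On a toy 1D grid with identity and inversion symmetry, the same map! body runs unchanged on a plain CPU Array, which is what makes sharing one implementation between CPU and GPU plausible.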
When simulating large systems with symmetries, symmetrizing the density is extremely slow: the array is first transferred to the CPU, and only then does a double loop over symmetries and G-vectors take place. Compared to loops running on the GPU this is slow, and the operation becomes a major bottleneck.
This PR introduces GPU-specific implementations for accumulate_over_symmetries! and lowpass_for_symmetry!, using the map! construct to run on the device and thus avoid data transfers. The loop itself is also greatly accelerated. Note that these new implementations do run on the CPU, but slower than the current solution. For illustration, using the input below, I observe these timings (seconds):