KernelAbstractions for FFTs #1099
Conversation
Nice, but for large-scale computations we might be better off just doing non-variational computations on a cubic grid, which avoids this computation entirely.
I think there is a sweet spot in between where you need variational data. Especially in data generation for ML, consistency between energy and forces within a structure and across a dataset is quite important, for example. But we should really test "how bad" such inconsistencies would be for non-variational calculations.
I've learnt to be wary of non-variational calculations in my (more scientific) past, especially if gradients are involved. Generally, I think that variational calculations should always be an option, and probably the default one. Then, we might as well do it efficiently.
Careful about the word variational here. Non-variational in the DFTK context just means that the discretization is non-variational, i.e. the energy is not necessarily above the true one. Also note that even the "variational" energy is not strictly variational, because the XC energy is approximated on a grid. The fact that the discretization is not variational doesn't mean that the discrete orbitals are not: they still minimize the discretized energy, and the Hellmann-Feynman theorem is still valid. So I wouldn't expect problems like that.
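For reference, a minimal sketch of the Hellmann-Feynman argument in the discretized setting (my own notation; it assumes the discretization space V_h does not depend on the parameter being differentiated, which holds for a plane-wave basis and atomic positions, and leaves the orthonormality constraints implicit):

$$
E_h(\lambda) = \min_{\psi \in V_h} \mathcal{E}(\psi, \lambda),
\qquad
\frac{\mathrm{d}E_h}{\mathrm{d}\lambda}
= \underbrace{\partial_\psi \mathcal{E}(\psi_h, \lambda)\!\left[\frac{\mathrm{d}\psi_h}{\mathrm{d}\lambda}\right]}_{=\,0}
\;+\; \partial_\lambda \mathcal{E}(\psi_h, \lambda)
\;=\; \partial_\lambda \mathcal{E}(\psi_h, \lambda).
$$

The first term vanishes because the discrete minimizer $\psi_h$ is stationary within $V_h$, so forces computed from $E_h$ are consistent with $E_h$ itself, even though $E_h$ need not lie above the exact energy.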
I'd need a more careful review, but I think given the performance improvement that's a reasonable amount of code to add, don't you agree, @antoine-levitt?
Sounds like the kernel overload should just be in GPUArrays, no? (i.e. x[I] = y should do what you implemented.) Can we fix it there?
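To illustrate the suggestion, a rough sketch of what such an upstream fix could look like (the method signature and placement are my assumptions, not GPUArrays' actual indexing machinery):

```julia
using GPUArrays, KernelAbstractions

# Hypothetical sketch: make `dest[idx] = src` launch a backend-agnostic scatter
# kernel, so downstream packages would not need their own remapping function.
@kernel function scatter_kernel!(dest, idx, src)
    i = @index(Global, Linear)
    @inbounds dest[idx[i]] = src[i]  # assumes idx holds valid linear indices
end

function Base.setindex!(dest::GPUArrays.AbstractGPUArray,
                        src::GPUArrays.AbstractGPUArray,
                        idx::GPUArrays.AbstractGPUArray{<:Integer})
    @assert length(src) == length(idx)
    backend = get_backend(dest)
    scatter_kernel!(backend)(dest, idx, src; ndrange = length(idx))
    KernelAbstractions.synchronize(backend)
    return dest
end
```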
  # Pad the input data
  fill!(f_real, 0)
- f_real[Gvec_mapping] = f_fourier
+ ifft_remap!(f_real, Gvec_mapping, f_fourier)
More meaningful names would be good. Something like fill_grid_with_kpt_data!()? Maybe also comment that this does f_real[Gvec_mapping] = f_fourier?
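For what it's worth, a minimal sketch of what such a kernel could look like with KernelAbstractions, using the name suggested above (an illustration of the approach, not the PR's actual code):

```julia
using KernelAbstractions

# Scatter the k-point Fourier coefficients onto the padded FFT grid,
# i.e. the equivalent of f_real[Gvec_mapping] = f_fourier.
@kernel function fill_grid_with_kpt_data_kernel!(f_real, Gvec_mapping, f_fourier)
    i = @index(Global, Linear)
    @inbounds f_real[Gvec_mapping[i]] = f_fourier[i]
end

function fill_grid_with_kpt_data!(f_real, Gvec_mapping, f_fourier)
    backend = get_backend(f_real)   # CUDABackend(), ROCBackend(), CPU(), ...
    kernel! = fill_grid_with_kpt_data_kernel!(backend)
    kernel!(f_real, Gvec_mapping, f_fourier; ndrange = length(Gvec_mapping))
    KernelAbstractions.synchronize(backend)
    return f_real
end
```

The gather needed by fft! (f_fourier = f_real[Gvec_mapping]) would be the same kernel with the assignment reversed.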
I think KernelAbstractions also has a CPU backend, so perhaps we can actually avoid special-casing the GPU code. I also have a few issues with the proposed code structure ... will review more later.
KernelAbstractions does have a CPU backend, and it uses Julia threads for parallelization (it can also use OpenMP). I tested it, and overall performance drops when using multiple threads (compared to master). My interpretation is that since these FFTs are mostly (only?) called within
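For concreteness, running the same kind of kernel on the CPU backend looks roughly like this (reusing the kernel sketched above; the arrays here are toy data, not DFTK's actual FFT buffers):

```julia
using KernelAbstractions

f_real       = zeros(ComplexF64, 24, 24, 24)        # toy FFT grid
Gvec_mapping = [1, 5, 42, 600]                       # toy linear-index mapping
f_fourier    = rand(ComplexF64, length(Gvec_mapping))

backend = CPU()   # parallelizes across the available Julia threads
fill_grid_with_kpt_data_kernel!(backend)(f_real, Gvec_mapping, f_fourier;
                                         ndrange = length(Gvec_mapping))
KernelAbstractions.synchronize(backend)
```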
Calling it on just one thread is not easily feasible? If not, then let's not bother and let's go with your proposed structure.
I looked into it, and it's not obvious. The only solution I found also underperforms with respect to master when running with threads.
I'll look into it; this could be a nice feature with potential impact elsewhere in the code. This would also mean overloading
Careful GPU profiling shows that a lot of time is wasted reshuffling arrays during k-point-specific FFT calls. Since FFTs are a large part of DftHamiltonian multiplication and compute_density, accelerating them can have a large impact on performance, especially for systems with dense k-point meshes.

For illustration, an ifft! call currently looks like this in the profiler: [profiler screenshot] The 3 blue squares on the right are the actual FFT, and the blue square on the left is the kernel filling the array with zeros. All the rest, i.e. the majority of the runtime, is spent in this operation: f_real[Gvec_mapping] = f_fourier. Something similar happens for fft!.

This PR introduces explicit GPU kernels for the reshuffling of data during FFTs. The KernelAbstractions.jl package is used in order to remain vendor-agnostic (it is already an indirect dependency of DFTK through the GPU packages). The kernels themselves are very simple, but their performance is much better. The same ifft! call now looks like the following in the profiler: [profiler screenshot]

For this to be possible, some light code uglification is necessary. The Gvec_mapping field of a given k-point must be available on the GPU at all times, and transferring it to the device at each call would defeat the purpose of this PR. As a solution, I added a Gvec_mapping_gpu field to the k-point struct (it could also be called Gvec_mapping_fft). I decided to duplicate the data rather than move the original Gvec_mapping field to the device, because the latter is assumed to be on the CPU in most places in the code, and its inverse mapping_inv is not in a format suitable for the GPU (and it makes sense for the forward and backward mappings to be stored in the same manner).

Finally, here are some timings (s) for the input shown below: