KernelAbstractions for FFTs #1099
Conversation
Nice, but for large-scale computations we might be better off just doing non-variational computations on a cubic grid, which avoids this computation entirely.
I think there is a sweet spot in between where you need variational data. Especially in data generation for ML, consistency between energy and forces within a structure and across a dataset is quite important, for example. But we should really test "how bad" such inconsistencies would be for non-variational calculations.
I've learnt to be wary of non-variational calculations in my (more scientific) past, especially if gradients are involved. Generally, I think that variational calculations should always be an option, and probably the default one. Then, we might as well do it efficiently.
Careful about the word variational here. Non-variational in the DFTK context just means that the discretization is non-variational, i.e. the energy is not necessarily above the true one. Also note that even the "variational" energy is not strictly variational, because the XC energy is approximated on a grid. The fact that the discretization is not variational doesn't mean that the discrete orbitals are not: they still minimize the discretized energy, and the Hellmann-Feynman theorem is still valid. So I wouldn't expect problems like that.
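For reference, a minimal sketch of the Hellmann-Feynman argument in the discretized setting (my own notation; it assumes the discretization space V_h does not depend on the parameter being differentiated, which holds for a plane-wave basis and atomic positions, and leaves the orthonormality constraints implicit):

$$
E_h(\lambda) = \min_{\psi \in V_h} \mathcal{E}(\psi, \lambda),
\qquad
\frac{\mathrm{d}E_h}{\mathrm{d}\lambda}
= \underbrace{\partial_\psi \mathcal{E}(\psi_h, \lambda)\!\left[\frac{\mathrm{d}\psi_h}{\mathrm{d}\lambda}\right]}_{=\,0}
\;+\; \partial_\lambda \mathcal{E}(\psi_h, \lambda)
\;=\; \partial_\lambda \mathcal{E}(\psi_h, \lambda).
$$

The first term vanishes because the discrete minimizer $\psi_h$ is stationary within $V_h$, so forces computed from $E_h$ are consistent with $E_h$ itself, even though $E_h$ need not lie above the exact energy.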
I'd need a more careful review, but I think given the performance improvement that's a reasonable amount of code to add, don't you agree, @antoine-levitt?
Sounds like the kernel overload should just be in GPUArrays, no? (i.e. x[I] = y should do what you implemented.) Can we fix it there?
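To illustrate the suggestion, a rough sketch of what such an upstream fix could look like (the method signature and placement are my assumptions, not GPUArrays' actual indexing machinery):

```julia
using GPUArrays, KernelAbstractions

# Hypothetical sketch: make `dest[idx] = src` launch a backend-agnostic scatter
# kernel, so downstream packages would not need their own remapping function.
@kernel function scatter_kernel!(dest, idx, src)
    i = @index(Global, Linear)
    @inbounds dest[idx[i]] = src[i]  # assumes idx holds valid linear indices
end

function Base.setindex!(dest::GPUArrays.AbstractGPUArray,
                        src::GPUArrays.AbstractGPUArray,
                        idx::GPUArrays.AbstractGPUArray{<:Integer})
    @assert length(src) == length(idx)
    backend = get_backend(dest)
    scatter_kernel!(backend)(dest, idx, src; ndrange = length(idx))
    KernelAbstractions.synchronize(backend)
    return dest
end
```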
  # Pad the input data
  fill!(f_real, 0)
- f_real[Gvec_mapping] = f_fourier
+ ifft_remap!(f_real, Gvec_mapping, f_fourier)
More meaningful names would be good. Something like fill_grid_with_kpt_data!()? Maybe also comment that this does f_real[Gvec_mapping] = f_fourier?
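For what it's worth, a minimal sketch of what such a kernel could look like with KernelAbstractions, using the name suggested above (an illustration of the approach, not the PR's actual code):

```julia
using KernelAbstractions

# Scatter the k-point Fourier coefficients onto the padded FFT grid,
# i.e. the equivalent of f_real[Gvec_mapping] = f_fourier.
@kernel function fill_grid_with_kpt_data_kernel!(f_real, Gvec_mapping, f_fourier)
    i = @index(Global, Linear)
    @inbounds f_real[Gvec_mapping[i]] = f_fourier[i]
end

function fill_grid_with_kpt_data!(f_real, Gvec_mapping, f_fourier)
    backend = get_backend(f_real)   # CUDABackend(), ROCBackend(), CPU(), ...
    kernel! = fill_grid_with_kpt_data_kernel!(backend)
    kernel!(f_real, Gvec_mapping, f_fourier; ndrange = length(Gvec_mapping))
    KernelAbstractions.synchronize(backend)
    return f_real
end
```

The gather needed by fft! (f_fourier = f_real[Gvec_mapping]) would be the same kernel with the assignment reversed.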
I think KernelAbstractions also has a CPU backend, so perhaps we can actually avoid special-casing the GPU code. I also have a few issues with the proposed code structure ... will review more later.
KernelAbstractions does have a CPU backend, and it uses Julia threads for parallelization (it can also use OpenMP). I tested it, and overall performance drops when using multiple threads (compared to master). My interpretation is that since these FFTs are mostly (only?) called within
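For concreteness, running the same kind of kernel on the CPU backend looks roughly like this (reusing the kernel sketched above; the arrays here are toy data, not DFTK's actual FFT buffers):

```julia
using KernelAbstractions

f_real       = zeros(ComplexF64, 24, 24, 24)        # toy FFT grid
Gvec_mapping = [1, 5, 42, 600]                       # toy linear-index mapping
f_fourier    = rand(ComplexF64, length(Gvec_mapping))

backend = CPU()   # parallelizes across the available Julia threads
fill_grid_with_kpt_data_kernel!(backend)(f_real, Gvec_mapping, f_fourier;
                                         ndrange = length(Gvec_mapping))
KernelAbstractions.synchronize(backend)
```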
Calling it on just one thread is not easily feasible? If not, then let's not bother and let's go with your proposed structure.
I looked into it, and it's not obvious. The only solution I found also underperforms with respect to master when running with threads.
I'll look into it; this could be a nice feature with potential impact elsewhere in the code. This would also mean overloading
Careful GPU profiling shows that a lot of time is wasted reshuffling arrays during k-point-specific FFT calls. Since FFTs are a large part of DftHamiltonian multiplication and compute_density, accelerating them can have a large impact on performance, especially for systems with dense k-point meshes.

For illustration, an ifft! call currently looks like this in the profiler: [profiler screenshot] The 3 blue squares on the right are the actual FFT, and the blue square on the left is the kernel filling the array with zeros. All the rest, i.e. the majority of the runtime, is spent in this operation: f_real[Gvec_mapping] = f_fourier. Something similar happens for fft!.

This PR introduces explicit GPU kernels for the reshuffling of data during FFTs. The KernelAbstractions.jl package is used in order to remain vendor-agnostic (it is already an indirect dependency of DFTK through the GPU packages). The kernels themselves are very simple, but their performance is much better. The same ifft! call now looks like the following in the profiler: [profiler screenshot]

For this to be possible, some light code uglification is necessary. The Gvec_mapping field of a given k-point must be available on the GPU at all times, and transferring it to the device at each call would defeat the purpose of this PR. As a solution, I added a Gvec_mapping_gpu field to the k-point struct (it could also be called Gvec_mapping_fft). I decided to duplicate the data rather than move the original Gvec_mapping field to the device, because the latter is assumed to be on the CPU in most places in the code, and its inverse mapping_inv is not in a format suitable for the GPU (and it makes sense for the forward and backward mappings to be stored in the same manner).

Finally, here are some timings (s) for the input shown below: