GPU optimized symmetry operations #1097


Open
abussy wants to merge 2 commits into master

Conversation

@abussy (Collaborator) commented May 16, 2025

When simulating large systems with symmetries, symmetrization of the density is extremely slow. This is because the array is first transferred to the CPU, where a double loop over symmetries and G vectors takes place. Compared to loops running on the GPU this is slow, and the operation becomes a major bottleneck.

This PR introduces GPU-specific implementations of accumulate_over_symmetries! and lowpass_for_symmetry!, using the map! construct to run on the device and thus save data transfers. The loop itself is also greatly accelerated. Note that these new implementations also run on the CPU, but more slowly than the current solution.
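
The pattern is, roughly, a gather per symmetry running entirely on the device. A minimal sketch, not the actual PR code: it assumes Gs is a device array of integer G vectors, symm_S/symm_τ hold each symmetry's rotation and fractional translation as isbits StaticArrays, and the phase convention follows the existing CPU implementation.

using LinearAlgebra: dot

ρtmp = similar(ρaccu)                  # device array, same shape as ρaccu
for (S, τ) in zip(symm_S, symm_τ)      # host-side loop over symmetries
    # One map! kernel per symmetry; ρin and ρaccu never leave the GPU.
    map!(ρtmp, Gs) do G
        idx = index_G_vectors(fft_size, S * G)
        isnothing(idx) ? zero(eltype(ρin)) : cis(-2π * dot(G, τ)) * ρin[idx]
    end
    ρaccu .+= ρtmp                     # a second kernel launch per symmetry
end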

For illustration, using the input below, I observe these timings (seconds):

GPU     master   this PR
A100    195      14
H100    135      9
Mi200   212      26
using DFTK
using PseudoPotentialData
using CUDA
# using AMDGPU
setup_threading()

arch = DFTK.GPU(CuArray)
# arch = DFTK.GPU(ROCArray)

Ecut = 32.0
kgrid = [2, 1, 1]
maxiter = 5
tol = 1.0e-8
temperature = 1e-4

factor = 4
a = 3.8267
lattice = factor*a * [[0.0 1.0 1.0];
                      [1.0 0.0 1.0];
                      [1.0 1.0 0.0]]
Al = ElementPsp(:Al, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
atoms = [Al for i in 1:factor^3]
positions = Vector{Vector{Float64}}([])
for i = 1:factor, j = 1:factor, k=1:factor
   push!(positions, [i/factor, j/factor, k/factor])
end

model = model_DFT(lattice, atoms, positions; temperature=temperature,
                  functionals=PBE(), smearing=DFTK.Smearing.Gaussian())

# warmup
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=3, tol=tol);

# actual calculation
DFTK.reset_timer!(DFTK.timer)
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=maxiter, tol=tol);
@show DFTK.timer

@mfherbst (Member) left a comment

Thanks, some thoughts.

@abussy (Collaborator, Author) commented May 16, 2025

Changing the order of the loops reduces the number of kernel calls and large copies. As a result, performance is greatly improved:

GPU     master   original PR   new loop order
A100    195      14            8.5
H100    135      9             6.5
Mi200   212      26            24

Note that this introduces some clunkiness. For example, all data accessed within the map! must be isbits. Therefore, the various fields of the symmetry vector must first be copied into their own arrays and transferred to the GPU. The cost of these copies is negligible, due to the small size of the data and the low frequency of the transfers (for peace of mind, I compared timings against a version where these transfers were done once and for all).
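
Schematically, the reordered version becomes a single map! with the symmetry loop inside the kernel. Again a sketch rather than the exact PR code: to_device stands in for whatever helper moves the small arrays onto the device, and S/τ are assumed to be the relevant SymOp fields.

# SymOp structs are not isbits, so copy the needed fields into plain arrays
# of isbits StaticArrays and transfer those to the device.
symm_S = to_device(arch, [symop.S for symop in symmetries])
symm_τ = to_device(arch, [symop.τ for symop in symmetries])

# A single map! over all G vectors: one kernel launch, no large temporaries;
# each thread loops over all symmetries for its own output entry.
map!(ρaccu, Gs) do G
    acc = zero(eltype(ρin))
    for isym in eachindex(symm_S)
        idx = index_G_vectors(fft_size, symm_S[isym] * G)
        isnothing(idx) && continue
        acc += cis(-2π * dot(G, symm_τ[isym])) * ρin[idx]
    end
    acc
end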

@mfherbst (Member) left a comment

Nice. Your current implementation sparked another idea: fuse both functions. If it's too complicated, don't bother, but I think it should work.

Comment on lines +37 to +41
acc = ρ_i
for S in symm_S
    idx = index_G_vectors(fft_size, S * G)
    acc *= isnothing(idx) ? zero(complex(T)) : one(complex(T))
end
@mfherbst (Member) left a comment
Formatting off.

Comment on lines +3 to +4
function accumulate_over_symmetries!(ρaccu::AbstractArray, ρin::AbstractArray,
                                     basis::PlaneWaveBasis{T}, symmetries) where {T}
@mfherbst (Member) left a comment
Formatting

@mfherbst (Member) left a comment
Also, how much worse is this implementation compared to the CPU version? Can we not just make this the CPU version, too? It looks like it should not be much worse than what we have.

ρaccu
end

function lowpass_for_symmetry!(ρ::AbstractGPUArray, basis::PlaneWaveBasis{T};
@mfherbst (Member) left a comment
The only place where we need this is in symmetrize_ρ, where it comes right after accumulate_over_symmetries!. I think for the GPU version it would make a lot of sense to fuse these two functions into one and thus have a single map! going over all Gs. This you should be able to do with a boolean flag do_lowpass, which hopefully Julia is smart enough to constant-prop into the GPU kernel and fully compile away if set to false.

Again feel free to make this fused function also the CPU function if this does not hurt performance too much.
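
A sketch of what that fusion could look like (hypothetical name and signature; do_lowpass is the flag suggested above and would hopefully be constant-propagated away when false):

# Fused kernel: accumulate over symmetries and, if requested, zero out any G
# vector that some S maps outside the grid (the lowpass condition).
function accumulate_and_lowpass!(ρaccu, ρin, Gs, fft_size, symm_S, symm_τ;
                                 do_lowpass=true)
    map!(ρaccu, Gs) do G
        acc  = zero(eltype(ρin))
        keep = true
        for isym in eachindex(symm_S)
            idx = index_G_vectors(fft_size, symm_S[isym] * G)
            if isnothing(idx)
                keep = false   # S * G falls outside the grid
            else
                acc += cis(-2π * dot(G, symm_τ[isym])) * ρin[idx]
            end
        end
        (keep || !do_lowpass) ? acc : zero(acc)
    end
    ρaccu
end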
