GPU optimized symmetry operations #1097


Open
abussy wants to merge 2 commits into master

Conversation

@abussy (Collaborator) commented May 16, 2025

When simulating large systems with symmetries, symmetrization of the density is extremely slow. This is because the array is first transferred to the CPU, where a double loop over symmetries and G vectors takes place. Compared to loops running on the GPU this is slow, and the operation becomes a major bottleneck.

This PR introduces GPU-specific implementations of accumulate_over_symmetries! and lowpass_for_symmetry!, using the map! construct to run on the device and thus save data transfers. The loop itself is also greatly accelerated. Note that these new implementations also run on the CPU, but more slowly than the current solution.
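
The pattern is, roughly, a gather per symmetry running entirely on the device. A minimal sketch, not the actual PR code: it assumes Gs is a device array of integer G vectors, symm_S/symm_τ hold each symmetry's rotation and fractional translation as isbits StaticArrays, and the phase convention follows the existing CPU implementation.

using LinearAlgebra: dot

ρtmp = similar(ρaccu)                  # device array, same shape as ρaccu
for (S, τ) in zip(symm_S, symm_τ)      # host-side loop over symmetries
    # One map! kernel per symmetry; ρin and ρaccu never leave the GPU.
    map!(ρtmp, Gs) do G
        idx = index_G_vectors(fft_size, S * G)
        isnothing(idx) ? zero(eltype(ρin)) : cis(-2π * dot(G, τ)) * ρin[idx]
    end
    ρaccu .+= ρtmp                     # a second kernel launch per symmetry
end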

For illustration, using the input below, I observe these timings (seconds):

GPU     master   this PR
A100    195      14
H100    135      9
Mi200   212      26
using DFTK
using PseudoPotentialData
using CUDA
# using AMDGPU
setup_threading()

arch = DFTK.GPU(CuArray)
# arch = DFTK.GPU(ROCArray)

Ecut = 32.0
kgrid = [2, 1, 1]
maxiter = 5
tol = 1.0e-8
temperature = 1e-4

factor = 4
a = 3.8267
lattice = factor*a * [[0.0 1.0 1.0];
                      [1.0 0.0 1.0];
                      [1.0 1.0 0.0]]
Al = ElementPsp(:Al, PseudoFamily("dojo.nc.sr.pbe.v0_4_1.stringent.upf"))
atoms = [Al for i in 1:factor^3]
positions = Vector{Vector{Float64}}([])
for i = 1:factor, j = 1:factor, k=1:factor
   push!(positions, [i/factor, j/factor, k/factor])
end

model = model_DFT(lattice, atoms, positions; temperature=temperature,
                  functionals=PBE(), smearing=DFTK.Smearing.Gaussian())

# warmup
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=3, tol=tol);

# actual calculation
DFTK.reset_timer!(DFTK.timer)
basis  = PlaneWaveBasis(model; Ecut, kgrid, architecture=arch)
scfres = self_consistent_field(basis; maxiter=maxiter, tol=tol);
@show DFTK.timer

@mfherbst (Member) left a comment

Thanks, some thoughts.

@abussy (Collaborator, Author) commented May 16, 2025

Changing the order of the loops reduces the number of kernel calls and large copies. As a result, performance is greatly improved:

GPU     master   original PR   new loop order
A100    195      14            8.5
H100    135      9             6.5
Mi200   212      26            24

Note that this introduces some clunkiness. For example, all data accessed within the map! must be isbits. Therefore, the various fields of the symmetry vector must first be copied into their own arrays and transferred to the GPU. The cost of these copies is negligible, due to the small size of the data and the low frequency of the transfers (for peace of mind, I compared timings against a version where these transfers were done once and for all).
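
Schematically, the reordered version becomes a single map! with the symmetry loop inside the kernel. Again a sketch rather than the exact PR code: to_device stands in for whatever helper moves the small arrays onto the device, and S/τ are assumed to be the relevant SymOp fields.

# SymOp structs are not isbits, so copy the needed fields into plain arrays
# of isbits StaticArrays and transfer those to the device.
symm_S = to_device(arch, [symop.S for symop in symmetries])
symm_τ = to_device(arch, [symop.τ for symop in symmetries])

# A single map! over all G vectors: one kernel launch, no large temporaries;
# each thread loops over all symmetries for its own output entry.
map!(ρaccu, Gs) do G
    acc = zero(eltype(ρin))
    for isym in eachindex(symm_S)
        idx = index_G_vectors(fft_size, symm_S[isym] * G)
        isnothing(idx) && continue
        acc += cis(-2π * dot(G, symm_τ[isym])) * ρin[idx]
    end
    acc
end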

@mfherbst (Member) left a comment

Nice. Your current implementation sparked another idea: fuse both functions. If it's too complicated, don't bother, but I think it should work.

Comment on lines +37 to +41
acc = ρ_i
for S in symm_S
    idx = index_G_vectors(fft_size, S * G)
    acc *= isnothing(idx) ? zero(complex(T)) : one(complex(T))
end
@mfherbst (Member) left a comment
Formatting off.

Comment on lines +3 to +4
function accumulate_over_symmetries!(ρaccu::AbstractArray, ρin::AbstractArray,
                                     basis::PlaneWaveBasis{T}, symmetries) where {T}
@mfherbst (Member) left a comment
Formatting

@mfherbst (Member) left a comment
Also, how much worse is this implementation compared to the CPU version? Can we not just make this the CPU version, too? It looks like it should not be much worse than what we have.

ρaccu
end

function lowpass_for_symmetry!(ρ::AbstractGPUArray, basis::PlaneWaveBasis{T};
@mfherbst (Member) left a comment
The only place where we need this is in symmetrize_ρ, where it comes right after accumulate_over_symmetries!. I think for the GPU version it would make a lot of sense to fuse these two functions into one and thus have a single map! going over all Gs. This you should be able to do with a boolean flag do_lowpass, which hopefully Julia is smart enough to constant-prop into the GPU kernel and fully compile away if set to false.

Again feel free to make this fused function also the CPU function if this does not hurt performance too much.
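
A sketch of what that fusion could look like (hypothetical name and signature; do_lowpass is the flag suggested above and would hopefully be constant-propagated away when false):

# Fused kernel: accumulate over symmetries and, if requested, zero out any G
# vector that some S maps outside the grid (the lowpass condition).
function accumulate_and_lowpass!(ρaccu, ρin, Gs, fft_size, symm_S, symm_τ;
                                 do_lowpass=true)
    map!(ρaccu, Gs) do G
        acc  = zero(eltype(ρin))
        keep = true
        for isym in eachindex(symm_S)
            idx = index_G_vectors(fft_size, symm_S[isym] * G)
            if isnothing(idx)
                keep = false   # S * G falls outside the grid
            else
                acc += cis(-2π * dot(G, symm_τ[isym])) * ρin[idx]
            end
        end
        (keep || !do_lowpass) ? acc : zero(acc)
    end
    ρaccu
end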
