memory usage when reading (really large) files #59

Open
lmiq opened this issue Nov 7, 2024 · 6 comments

@lmiq
Contributor

lmiq commented Nov 7, 2024

I have a CIF file with 65 million atoms. It occupies 5.7 GB on disk. My computer has 32 GB of memory, so I would expect that the structure could be read on this machine. Nevertheless, the Julia process gets killed because it runs out of memory.

VMD, on the other hand, reads it fine.

I might take a look to see if there's something I can contribute, but in any case I'm reporting it so that we know the problem exists.

@jgreener64
Member

Thanks for reporting; I haven't tried very large structure files.

@timholy
Collaborator

timholy commented May 8, 2025

You might want to test #70 and see if it helps a bit.

@lmiq
Contributor Author

lmiq commented May 8, 2025

Thanks Tim. That will certainly help, but it still falls short for the type of structure I have to handle, which might have 70M atoms, for example. I have, in parallel, implemented a framework that fits my needs in PDBTools.jl. I know duplicating work is not ideal, but I'm not sure we can really merge the requirements of the two packages. This is what I have now, for a 230K-atom file:

with PDBTools.jl:

julia> @time ats = read_mmcif("./4v6x.cif");
  0.424724 seconds (702.72 k allocations: 68.556 MiB, 5.81% gc time)

with BioStructures.jl (stable):

julia> @time ats = read("./4v6x.cif", MMCIFFormat);
  1.065855 seconds (7.83 M allocations: 592.486 MiB, 42.56% gc time)

with BioStructures.jl, with #70:

julia> @time ats = read("./4v6x.cif", MMCIFFormat);
  0.977309 seconds (6.43 M allocations: 448.288 MiB, 45.73% gc time)

As you can see, PDBTools is still using about an order of magnitude less memory. This was possible only by tuning every detail of the structure that holds the atom information: 32-bit representations where possible, no internal references, InlineStrings for the string fields. And the structure data is totally flat, just a vector of atoms. It is probably too much to ask for a general package like BioStructures.jl to support these huge structures while keeping all the features of the data representation it carries. Still, I think the issue should not be considered closed, because these enormous structures will probably become more common in the future, and some way of handling them will be necessary.
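For illustration, this is the kind of flat, isbits-friendly layout I mean (field names and widths are just a sketch, not the actual PDBTools.jl definition):

using InlineStrings

struct FlatAtom
    index::Int32
    name::String7        # e.g. "CA"
    resname::String7     # e.g. "ALA"
    chain::String3       # e.g. "A"
    resnum::Int32
    x::Float32
    y::Float32
    z::Float32
    occup::Float32
    beta::Float32
end

# Every field is isbits, so a plain Vector{FlatAtom} stores the atoms
# contiguously, with no per-atom heap objects or internal references.
atoms = FlatAtom[]
sizehint!(atoms, 65_000_000)    # reserve capacity up front for a huge file
push!(atoms, FlatAtom(1, String7("N"), String7("MET"), String3("A"),
                      1, 27.34f0, 24.43f0, 2.614f0, 1.0f0, 0.0f0))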

Edit: On the other hand, the resulting data structures are only about 2x larger in BioStructures.jl than in PDBTools.jl:

PDBTools.jl

julia> Base.summarysize(ats) / 1024^2
22.110557556152344

BioStructures.jl with #70

julia> Base.summarysize(ats) / 1024^2
43.35328006744385

That means the memory used during reading can, in principle, be reduced, and the resulting data structure is not that bad.
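Put differently, the ratio of memory allocated while reading to the size of the final structure, roughly, from the numbers above:

68.556 / 22.11     # ≈ 3.1x for PDBTools.jl
448.288 / 43.353   # ≈ 10.3x for BioStructures.jl with #70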

@timholy
Collaborator

timholy commented May 8, 2025

There's also the issue that when parsing an mmCIF we first extract all tokens we intend to keep, and then we parse them. On a truly big file you might run out of memory just storing all the tokens.

Thanks for supplying details that illustrate the remaining gap! I agree that generality probably costs us something, but it is also worth having. Nevertheless, it's an incentive to see whether more can be squeezed out. It looks like we're using about 200 B per Atom:

julia> c
Chain A with 260 residues, 0 other molecules, 1949 atoms

julia> Base.summarysize(c; exclude=Model) / 1949
212.0831195484864

and that does seem like a fair bit. Switching from Float64 to Float32 drops that to about 188 B. Atom itself is "just" 88 B (with Float64), though, so the higher-order structure is what costs us. Reducing it further might take a significant restructuring: a Dict alone is 64 bytes.
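As a rough back-of-envelope, scaling that figure to the 65M-atom file from the original report (and ignoring parse-time token storage, which comes on top):

212.08 * 65_000_000 / 1024^3   # ≈ 12.8 GiB just for the in-memory structure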

@lmiq
Contributor Author

lmiq commented May 8, 2025

Yes, one of the things I've done in PDBTools.jl is to remove a Dict from the Atom structure, leaving an optional parametric field for storing custom values, with Nothing as the default type. The resulting Atom structure is 88 bytes now.

Using InlineStrings for the string values was necessary not only to save space, but also to drop allocations from the reading procedure.
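As an illustration of the parametric-field pattern (hypothetical field names, not the actual PDBTools.jl definition):

struct CompactAtom{CustomType}
    index::Int32
    x::Float32
    y::Float32
    z::Float32
    custom::CustomType          # user-defined payload; Nothing by default
end

# The default constructor carries no custom data; Nothing is a zero-size type.
CompactAtom(i, x, y, z) = CompactAtom{Nothing}(i, x, y, z, nothing)

sizeof(CompactAtom{Nothing})    # 16 bytes for this toy struct; no 64-byte Dict per atom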

@jgreener64
Member

Switching to Float32 is probably a good idea, but it would be a breaking change, so it would have to be worth it.

I recall that parsing and storing strings accounted for a significant fraction of the parse time. At the time I thought I had reached the limit and stopped optimising, but looking back there is room for improvement there.
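For what it's worth, a small illustration of the effect (not BioStructures.jl code, just the general idea of materializing tokens as inline strings instead of heap-allocated Strings):

using InlineStrings

tokens = split("N CA C O CB CG CD OE1 NE2 " ^ 20_000)   # 180,000 SubStrings

heap   = map(String, tokens)    # one heap-allocated String per token
inline = map(String7, tokens)   # isbits String7s, stored inline in the Vector

Base.summarysize(heap)          # counts every String object individually
Base.summarysize(inline)        # ~8 bytes per token plus the Vector header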
