memory usage when reading (really large) files #59
Comments
Thanks for reporting, I haven't tried very large structure files.

You might want to test #70 and see if it helps a bit.
Thanks Tim. That will certainly help, but it still falls short for the type of structure I have to handle, which might have 70M atoms, for example. I have, in parallel, implemented a framework that fits my needs in PDBTools.jl. I know duplicating work is not ideal, but I'm not sure we can really merge the requirements of the two packages. This is what I have now, for a 230K-atom file:

with PDBTools.jl:

```julia
julia> @time ats = read_mmcif("./4v6x.cif");
  0.424724 seconds (702.72 k allocations: 68.556 MiB, 5.81% gc time)
```

with BioStructures.jl (stable):

```julia
julia> @time ats = read("./4v6x.cif", MMCIFFormat);
  1.065855 seconds (7.83 M allocations: 592.486 MiB, 42.56% gc time)
```

with BioStructures.jl, with #70:

```julia
julia> @time ats = read("./4v6x.cif", MMCIFFormat);
  0.977309 seconds (6.43 M allocations: 448.288 MiB, 45.73% gc time)
```

As you can see, in PDBTools I'm still using about an order of magnitude less memory. That was possible only by tuning every detail of the structure that holds the atom information: 32-bit representations where possible, no internal references, ...

Edit: On the other hand, the resulting data structures are only about 2x larger in BioStructures.jl than in PDBTools.jl:

PDBTools.jl:

```julia
julia> Base.summarysize(ats) / 1024^2
22.110557556152344
```

BioStructures.jl with #70:

```julia
julia> Base.summarysize(ats) / 1024^2
43.35328006744385
```

That means the memory used while reading can, in principle, be reduced, and the resulting data structure is not that bad.
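For illustration, here is a minimal sketch of the kind of compact, reference-free atom layout described above. The field names and types are made up for the example and are not the actual PDBTools.jl definitions:

```julia
# Illustrative only: a flat atom record using 32-bit numbers and fixed-size
# inline strings, in the spirit of the tuning described above.
using InlineStrings

struct CompactAtom
    index::Int32        # serial number
    name::String7       # atom name, e.g. "CA"
    resname::String7    # residue name, e.g. "ALA"
    chain::String3      # chain identifier
    resnum::Int32       # residue number
    x::Float32
    y::Float32
    z::Float32
    occupancy::Float32
    beta::Float32       # temperature factor
end

# Every field is a bits type, so a Vector{CompactAtom} is a single contiguous
# allocation of about sizeof(CompactAtom) * natoms bytes, with no per-atom
# pointers for the GC to trace.
@assert isbitstype(CompactAtom)
```

With heap-allocated `String` fields and references back to parent containers, each atom instead carries pointers to separately allocated objects, which is where much of the extra per-atom size and GC time tends to come from.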
There's also the issue that when parsing an mmCIF we first extract all the tokens we intend to keep, and only then parse them. On a truly big file you might run out of memory just storing the tokens.

Thanks for supplying details that illustrate the remaining gap! I agree that generality probably costs us something, but it is also worth having. Nevertheless, it's an incentive to see whether more can be squeezed out. It looks like we're using about 200 B per Atom:

```julia
julia> c
Chain A with 260 residues, 0 other molecules, 1949 atoms

julia> Base.summarysize(c; exclude=Model) / 1949
212.0831195484864
```

and that does seem like a fair bit. Switching ...
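Below is a hedged sketch of a streaming alternative for the token problem: each atom row is parsed as it is read and pushed into typed column vectors, so the raw tokens are never all held in memory at once. The whitespace splitting and hard-coded column positions are simplifying assumptions, not the real BioStructures.jl tokenizer:

```julia
# Sketch: stream a simple whitespace-delimited _atom_site loop. Real mmCIF
# needs a proper tokenizer (quoted values, multi-line values, loop headers);
# this only illustrates the memory pattern: the tokens for one row are
# discarded as soon as that row has been parsed.
function stream_atom_sites(io::IO)
    xs = Float32[]; ys = Float32[]; zs = Float32[]
    for line in eachline(io)
        (startswith(line, "ATOM") || startswith(line, "HETATM")) || continue
        fields = split(line)   # transient; garbage after this iteration
        # assumed positions of Cartn_x/y/z; in practice read from the loop header
        push!(xs, parse(Float32, fields[11]))
        push!(ys, parse(Float32, fields[12]))
        push!(zs, parse(Float32, fields[13]))
    end
    return xs, ys, zs
end
```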
Yes, one of the things I've done in PDBTools.jl is to remove a ... Using InlineStrings for the string values was necessary not only to save space, but also to drop allocations from the reading procedure.
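For example, with InlineStrings.jl a short fixed-size string is a plain bits value rather than a pointer to a heap-allocated object (a minimal illustration; exact byte counts depend on the Julia version):

```julia
using InlineStrings

name = String7("CA")   # inline string type holding up to 7 bytes

sizeof(String7)        # 8: the whole value fits inline in a struct or array
isbitstype(String7)    # true, so a Vector{String7} has no per-element heap objects
Base.summarysize("CA") # a regular String is a separate heap object, plus an
                       # 8-byte pointer in every field or array slot that holds it
```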
Switching to Float32 is probably a good idea, but it would be a breaking change, so it would have to be worth it. I recall that parsing and storing strings accounted for a significant fraction of the parse time. At the time I thought I had reached the limit and stopped optimising, but looking back there is room for improvement there.
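To put the Float32 suggestion in rough numbers for a structure of the size discussed in this issue (65M atoms), considering the coordinates alone (a back-of-the-envelope calculation, not a measurement):

```julia
natoms = 65_000_000                     # atom count discussed in this issue
3 * sizeof(Float64) * natoms / 1024^3   # ≈ 1.45 GiB of x/y/z at Float64
3 * sizeof(Float32) * natoms / 1024^3   # ≈ 0.73 GiB at Float32
```

The same halving applies to occupancy and temperature-factor fields, so the switch saves on the order of a gigabyte at that scale, although most of the footprint comes from the remaining per-atom fields and the parse-time allocations.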
I have a CIF file with 65 million atoms. It occupies 5.7 GB on disk. My computer has 32 GB of memory, so I would expect the structure to be readable on this machine. Nevertheless, Julia gets killed because it runs out of memory.

VMD, on the other hand, reads the file fine.

I might take a look to see if there's something I can contribute, but in any case I'm reporting this so we know the problem exists.
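For a rough consistency check using the numbers reported elsewhere in this thread (an extrapolation, not a measurement on this file):

```julia
natoms   = 65_000_000
per_atom = 212                 # bytes per Atom, from the summarysize estimate above
natoms * per_atom / 1024^3     # ≈ 12.8 GiB just for the final data structure
# The 230K-atom benchmark above allocated roughly 10x the size of the final
# structure while parsing, so transient allocations plus the stored tokens can
# plausibly push a 65M-atom read past 32 GB, even if much of it is garbage
# collected along the way.
```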