Skip to content

WIP: Add .bcif support #71

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 25 commits into
base: master
Choose a base branch
from
Draft

Conversation

BradyAJohnston
Copy link

@BradyAJohnston BradyAJohnston commented May 9, 2025

This is discussed somewhat in #45.

Still very much a WIP, but has initial support for parsing a .bcif file into a MolecularStructure. Based on the BCIF Spec and the biotite implementation.

This is my first sizeable Julia project, so I would like feedback on ways things could be better approached in a more "Julia-esque" way. Currently the .bcif file is unpacked using MsgPack and just left as a dictionary that is then processed and decoded at different steps. Should the different sub-dictionaries instead be turned into custom structs for multiple dispatch?

Read function is bare-bones but works. I am still getting some errors for the construction of the MolecularStructure for some files which I am not sure how we approach them:

TODOS:

  • Refactor multiple dispatch for decoding into single larger function
  • DSSP / Stride support
  • extract SS information when opening
  • use MsgPack.unpack(io, BCIFDict)
  • cleanup superfluous function definitions and dictionary lookups

Benchmarks

Some initial bencharks that I'll make some plots for & be more comprehensive when closer to being finished. Benchmarks run with julia --threads 8 as the .bcif columns are returned all together so can be decoded in parallel.

Unsurprisingly it is slightly slower on the small structures, with minimal difference in file sizes on disk. For the larger structures it ends up being about ~2x faster to return the MolecularStructure and with 1/2 the memory required for reading. Decoding the file into a dictionary for the largest structure tested (8J07) only takes ~400 ms with 1.5 GB memory (vs 6.5 s & 4 GB for returning full structure) with the remaining ~6 s being the construction of the MolecularStructure.

1AKE

CIF Size: 427 KB

julia> @benchmark read("1AKE.cif", MMCIFFormat)
BenchmarkTools.Trial: 483 samples with 1 evaluation per sample.
 Range (min  max):   9.590 ms   18.404 ms  ┊ GC (min  max): 0.00%  46.39%
 Time  (median):      9.969 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.352 ms ± 899.304 μs  ┊ GC (mean ± σ):  3.96% ±  6.87%

     ▃█▇▅▅▂▂▂                                                   
  ▃▄▇████████▄▄▃▄▃▂▂▂▁▁▃▁▁▁▁▁▁▁▁▁▃▃▂▃▃▃▄▃▄▄▃▄▄▂▄▃▃▄▃▃▂▃▁▃▂▁▁▂▃ ▃
  9.59 ms         Histogram: frequency by time         12.6 ms <

BCIF Size: 224 KB

julia> @benchmark read("1AKE.bcif", BCIFFormat)
BenchmarkTools.Trial: 214 samples with 1 evaluation per sample.
 Range (min  max):  22.036 ms  33.331 ms  ┊ GC (min  max): 0.00%  28.67%
 Time  (median):     22.674 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.419 ms ±  1.502 ms  ┊ GC (mean ± σ):  3.42% ±  5.25%

    ▂▂▁█▁                                                      
  ▄▆█████▆▄▃▃▂▂▃▃▃▂▃▁▁▁▃▁▁▃▃▃▆▄▄▄▃▅▃▃▁▁▃▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
  22 ms           Histogram: frequency by time        28.3 ms <

 Memory estimate: 15.89 MiB, allocs estimate: 277943.

1CD3

CIF Size: 1 MB

julia> @benchmark read("1CD3.cif", MMCIFFormat)
BenchmarkTools.Trial: 194 samples with 1 evaluation per sample.
 Range (min  max):  23.517 ms  35.699 ms  ┊ GC (min  max): 0.00%  29.85%
 Time  (median):     24.963 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.872 ms ±  2.146 ms  ┊ GC (mean ± σ):  5.94% ±  7.00%

  ▁█▅▇▅   ▄▁                   ▁                               
  █████▆▇███▇▇▅▅▃▅▄▃▁▁▄▇▃▅▆▆▆▅▄█▄█▃▆▄▄▄▆▅▄▃▄▃▁▃▃▄▁▁▁▁▃▁▃▁▃▃▁▃ ▃
  23.5 ms         Histogram: frequency by time        31.6 ms <

 Memory estimate: 26.06 MiB, allocs estimate: 336460.

BCIF Size: 316 KB

julia> @benchmark read("1CD3.bcif", BCIFFormat)
BenchmarkTools.Trial: 187 samples with 1 evaluation per sample.
 Range (min  max):  22.042 ms  39.896 ms  ┊ GC (min  max): 0.00%  7.66%
 Time  (median):     26.303 ms              ┊ GC (median):    6.11%
 Time  (mean ± σ):   26.707 ms ±  2.701 ms  ┊ GC (mean ± σ):  5.17% ± 4.21%

             █▁▄▅ ▃▄ ▆▄▂  ▂▁                                   
  ▄▁▇▇▁▅▅▅▇▆▆████▆██▄███▆▅██▆▅▃▁▅▁▄▆▄▇▁▅▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃ ▃
  22 ms           Histogram: frequency by time        36.5 ms <

 Memory estimate: 24.05 MiB, allocs estimate: 431307.

7CGO

CIF Size: 36.3 MB

julia> @benchmark read("7CGO.cif", MMCIFFormat)
BenchmarkTools.Trial: 5 samples with 1 evaluation per sample.
 Range (min  max):  1.067 s    1.254 s  ┊ GC (min  max):  8.48%  22.51%
 Time  (median):     1.159 s              ┊ GC (median):    12.82%
 Time  (mean ± σ):   1.172 s ± 78.064 ms  ┊ GC (mean ± σ):  15.69% ±  5.87%

  █                    █     █                         █  █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█ ▁
  1.07 s         Histogram: frequency by time        1.25 s <

 Memory estimate: 841.24 MiB, allocs estimate: 11335780.

BCIF Size: 4.2 MB

julia> @benchmark read("7CGO.bcif", BCIFFormat)
BenchmarkTools.Trial: 9 samples with 1 evaluation per sample.
 Range (min  max):  530.286 ms  604.980 ms  ┊ GC (min  max):  7.25%  16.83%
 Time  (median):     572.595 ms               ┊ GC (median):    12.51%
 Time  (mean ± σ):   573.714 ms ±  20.675 ms  ┊ GC (mean ± σ):  12.92% ±  2.96%

  █                          █  █  ██    █    █    █          █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁██▁▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ ▁
  530 ms           Histogram: frequency by time          605 ms <

 Memory estimate: 455.95 MiB, allocs estimate: 8453142.

8J07

CIF Size: 353.3 MB

julia> @benchmark read("8J07.cif", MMCIFFormat)
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 12.902 s (14.72% GC) to evaluate,
 with a memory estimate of 8.26 GiB, over 106536150 allocations.

BCIF Size 37.5 MB

julia> @benchmark read("8J07.bcif", BCIFFormat)
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 6.537 s (8.73% GC) to evaluate,
 with a memory estimate of 3.94 GiB, over 74444193 allocations.

Just decoding to Dict()

julia> @benchmark datablock_to_dict(MsgPack.unpack(read("8J07.bcif"))["dataBlocks"][1])
BenchmarkTools.Trial: 13 samples with 1 evaluation per sample.
 Range (min  max):  366.401 ms  424.174 ms  ┊ GC (min  max): 23.95%  26.69%
 Time  (median):     395.279 ms               ┊ GC (median):    31.13%
 Time  (mean ± σ):   396.415 ms ±  22.352 ms  ┊ GC (mean ± σ):  30.22% ±  3.46%

  █      █            ▁   ▁     ▁  ▁                   ▁▁  ▁  █  
  █▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁█▁▁█ ▁
  366 ms           Histogram: frequency by time          424 ms <

 Memory estimate: 1.74 GiB, allocs estimate: 19754142.

Copy link

codecov bot commented May 9, 2025

Codecov Report

Attention: Patch coverage is 81.56425% with 33 lines in your changes missing coverage. Please review.

Project coverage is 94.24%. Comparing base (a507527) to head (080ad27).

Files with missing lines Patch % Lines
src/bcif.jl 80.47% 33 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
- Coverage   95.32%   94.24%   -1.09%     
==========================================
  Files          14       15       +1     
  Lines        2033     2205     +172     
==========================================
+ Hits         1938     2078     +140     
- Misses         95      127      +32     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@BradyAJohnston BradyAJohnston marked this pull request as draft May 9, 2025 06:09
Copy link
Member

@jgreener64 jgreener64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this PR, looks great. I have left some comments but overall it seems "Julia-esque". It needs tests, but I imagine you will get round to that.

Should the different sub-dictionaries instead be turned into custom structs for multiple dispatch?

Not unless it gives a benefit such as speed, which I doubt.

julia> read("1SSU.bcif", BCIFFormat)

This is a NMR structure with multiple models, the model number is in _atom_site.pdbx_PDB_model_num for mmCIF so hopefully is accessible from BCIF.

Sounds good about performance, it may even get faster and use less memory with some of Tim's planned changes (#70).

@@ -47,6 +48,7 @@ Graphs = "1"
LinearAlgebra = "1.9"
MMTF = "1"
MetaGraphs = "0.7, 0.8"
MsgPack = "1.2.1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to know how much time this adds to using BioStructures. I expect not a lot, but it can be put in an extension if it is significant.

src/bcif.jl Outdated
dict::Dict{String,BCIFArrayTypes}
end

function BCIFDict(dict::Dict{String,BCIFArrayTypes})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constructor should be available by default.

src/download.jl Outdated
"https://models.rcsb.org/$pdbid.bcif",
pdbpath,
)
return pdbpath
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be better not to return from within the try block, but to use the return at the end of the function and add BCIFFormat checks as required.

@BradyAJohnston
Copy link
Author

Ah sorry for the auto-format spam. Will go through comments and now I know you're happy with overall I'll put together some tests / benchmarks as well.

Copy link
Collaborator

@timholy timholy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite impressive for a first project!

I'm neither a MsgPack nor BinaryCIF guru, but rather than making all your types mutable and passing an "empty" object to encode(enc, data), consider making the types immutable and just construct them on demand. You may also find that writing a big

function decode(typecode::TypeCode, data)
    if typecode == INT8
        return ...
    elseif typecode == INT16
        return ...
    ...
end

is actually faster: the advantage is that typecode is a concrete type, and so everything except the return type is inferrable, whereas dispatching based on types that you can't predict in advance requires julia to take more laborious steps at runtime. You can use ProfileView (or its analog built-in to vscode) to look for type-instability in your current implementation; if there isn't any, then you can ignore this suggestion!

BradyAJohnston and others added 3 commits May 15, 2025 14:52
Incorporates most feedback from PR
Co-authored-by: Tim Holy <[email protected]>
This reverts commit b168125.
@BradyAJohnston
Copy link
Author

Thanks for the feedback @timholy and @jgreener64 . I've gone through with some refactoring and incorporated most of it and it's much cleaner. I've got very minor test currently, but over the coming days I'll continue to add more tests for BCIF-specific things. I've currently got the encode() functions commented out as I haven't worked on them at all since the initial write up.

@BradyAJohnston
Copy link
Author

@timholy are you able to maybe explain if I should still be trying to attempt this? There has been a bit of a restructure, but with the decoding process there can be a different encoding type, but same final type (mostly with integers), so specifying the type I don't think would work. I might be misunderstanding this though.

function decode(typecode::TypeCode, data)
    if typecode == INT8
        return ...
    elseif typecode == INT16
        return ...
    ...
end

@timholy
Copy link
Collaborator

timholy commented May 15, 2025

Looking good! What I meant was the following:

function reencode1!(out, types::Vector{DataType}, vals)
    for (T, v) in zip(types, vals)
        push!(out, T(v))
    end
    return out
end

function reencode2!(out, typecodes::Vector{Int}, vals)
    for (nb, v) in zip(typecodes, vals)
        if nb == 8
            push!(out, Int8(v))
        elseif nb == 16
            push!(out, Int16(v))
        elseif nb == 32
            push!(out, Int32(v))
        elseif nb == 64
            push!(out, Int64(v))
        end
    end
    return out
end

vals = 1:16
types = repeat([Int8, Int16, Int32, Int64], 4)
typecodes = repeat([8, 16, 32, 64], 4)
out = sizehint!(Integer[], length(vals))

using BenchmarkTools

@btime reencode1!($out, $types, $vals) setup=(empty!(out));
@btime reencode2!($out, $typecodes, $vals) setup=(empty!(out));

Result:

julia> include("bench.jl")
  1.300 μs (0 allocations: 0 bytes)
  74.794 ns (0 allocations: 0 bytes)

As you can see, the second mostly-type-stable version is much faster.

@BradyAJohnston
Copy link
Author

Okay thanks for the additional info on that. I'll have a play around with some benchmarking and see what I can come up with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants