WIP: Add `.bcif` support #71

BradyAJohnston · 2025-05-09T05:12:50Z

This is discussed somewhat in #45.

Still very much a WIP, but has initial support for parsing a .bcif file into a MolecularStructure. Based on the BCIF Spec and the biotite implementation.

This is my first sizeable Julia project, so I would like feedback on ways things could be better approached in a more "Julia-esque" way. Currently the .bcif file is unpacked using MsgPack and just left as a dictionary that is then processed and decoded at different steps. Should the different sub-dictionaries instead be turned into custom structs for multiple dispatch?

Read function is bare-bones but works. I am still getting some errors for the construction of the MolecularStructure for some files which I am not sure how we approach them:

TODOS:

Refactor multiple dispatch for decoding into single larger function
DSSP / Stride support
extract SS information when opening
use MsgPack.unpack(io, BCIFDict)
cleanup superfluous function definitions and dictionary lookups

Benchmarks

Some initial bencharks that I'll make some plots for & be more comprehensive when closer to being finished. Benchmarks run with julia --threads 8 as the .bcif columns are returned all together so can be decoded in parallel.

Unsurprisingly it is slightly slower on the small structures, with minimal difference in file sizes on disk. For the larger structures it ends up being about ~2x faster to return the MolecularStructure and with 1/2 the memory required for reading. Decoding the file into a dictionary for the largest structure tested (8J07) only takes ~400 ms with 1.5 GB memory (vs 6.5 s & 4 GB for returning full structure) with the remaining ~6 s being the construction of the MolecularStructure.

1AKE

CIF Size: 427 KB

julia> @benchmark read("1AKE.cif", MMCIFFormat)
BenchmarkTools.Trial: 483 samples with 1 evaluation per sample.
 Range (min … max):   9.590 ms …  18.404 ms  ┊ GC (min … max): 0.00% … 46.39%
 Time  (median):      9.969 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.352 ms ± 899.304 μs  ┊ GC (mean ± σ):  3.96% ±  6.87%

     ▃█▇▅▅▂▂▂                                                   
  ▃▄▇████████▄▄▃▄▃▂▂▂▁▁▃▁▁▁▁▁▁▁▁▁▃▃▂▃▃▃▄▃▄▄▃▄▄▂▄▃▃▄▃▃▂▃▁▃▂▁▁▂▃ ▃
  9.59 ms         Histogram: frequency by time         12.6 ms <

BCIF Size: 224 KB

julia> @benchmark read("1AKE.bcif", BCIFFormat)
BenchmarkTools.Trial: 214 samples with 1 evaluation per sample.
 Range (min … max):  22.036 ms … 33.331 ms  ┊ GC (min … max): 0.00% … 28.67%
 Time  (median):     22.674 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   23.419 ms ±  1.502 ms  ┊ GC (mean ± σ):  3.42% ±  5.25%

    ▂▂▁█▁                                                      
  ▄▆█████▆▄▃▃▂▂▃▃▃▂▃▁▁▁▃▁▁▃▃▃▆▄▄▄▃▅▃▃▁▁▃▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
  22 ms           Histogram: frequency by time        28.3 ms <

 Memory estimate: 15.89 MiB, allocs estimate: 277943.

1CD3

CIF Size: 1 MB

julia> @benchmark read("1CD3.cif", MMCIFFormat)
BenchmarkTools.Trial: 194 samples with 1 evaluation per sample.
 Range (min … max):  23.517 ms … 35.699 ms  ┊ GC (min … max): 0.00% … 29.85%
 Time  (median):     24.963 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   25.872 ms ±  2.146 ms  ┊ GC (mean ± σ):  5.94% ±  7.00%

  ▁█▅▇▅   ▄▁                   ▁                               
  █████▆▇███▇▇▅▅▃▅▄▃▁▁▄▇▃▅▆▆▆▅▄█▄█▃▆▄▄▄▆▅▄▃▄▃▁▃▃▄▁▁▁▁▃▁▃▁▃▃▁▃ ▃
  23.5 ms         Histogram: frequency by time        31.6 ms <

 Memory estimate: 26.06 MiB, allocs estimate: 336460.

BCIF Size: 316 KB

julia> @benchmark read("1CD3.bcif", BCIFFormat)
BenchmarkTools.Trial: 187 samples with 1 evaluation per sample.
 Range (min … max):  22.042 ms … 39.896 ms  ┊ GC (min … max): 0.00% … 7.66%
 Time  (median):     26.303 ms              ┊ GC (median):    6.11%
 Time  (mean ± σ):   26.707 ms ±  2.701 ms  ┊ GC (mean ± σ):  5.17% ± 4.21%

             █▁▄▅ ▃▄ ▆▄▂  ▂▁                                   
  ▄▁▇▇▁▅▅▅▇▆▆████▆██▄███▆▅██▆▅▃▁▅▁▄▆▄▇▁▅▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃ ▃
  22 ms           Histogram: frequency by time        36.5 ms <

 Memory estimate: 24.05 MiB, allocs estimate: 431307.

7CGO

CIF Size: 36.3 MB

julia> @benchmark read("7CGO.cif", MMCIFFormat)
BenchmarkTools.Trial: 5 samples with 1 evaluation per sample.
 Range (min … max):  1.067 s …   1.254 s  ┊ GC (min … max):  8.48% … 22.51%
 Time  (median):     1.159 s              ┊ GC (median):    12.82%
 Time  (mean ± σ):   1.172 s ± 78.064 ms  ┊ GC (mean ± σ):  15.69% ±  5.87%

  █                    █     █                         █  █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█ ▁
  1.07 s         Histogram: frequency by time        1.25 s <

 Memory estimate: 841.24 MiB, allocs estimate: 11335780.

BCIF Size: 4.2 MB

julia> @benchmark read("7CGO.bcif", BCIFFormat)
BenchmarkTools.Trial: 9 samples with 1 evaluation per sample.
 Range (min … max):  530.286 ms … 604.980 ms  ┊ GC (min … max):  7.25% … 16.83%
 Time  (median):     572.595 ms               ┊ GC (median):    12.51%
 Time  (mean ± σ):   573.714 ms ±  20.675 ms  ┊ GC (mean ± σ):  12.92% ±  2.96%

  █                          █  █  ██    █    █    █          █  
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁██▁▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁█ ▁
  530 ms           Histogram: frequency by time          605 ms <

 Memory estimate: 455.95 MiB, allocs estimate: 8453142.

8J07

CIF Size: 353.3 MB

julia> @benchmark read("8J07.cif", MMCIFFormat)
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 12.902 s (14.72% GC) to evaluate,
 with a memory estimate of 8.26 GiB, over 106536150 allocations.

BCIF Size 37.5 MB

julia> @benchmark read("8J07.bcif", BCIFFormat)
BenchmarkTools.Trial: 1 sample with 1 evaluation per sample.
 Single result which took 6.537 s (8.73% GC) to evaluate,
 with a memory estimate of 3.94 GiB, over 74444193 allocations.

Just decoding to `Dict()`

julia> @benchmark datablock_to_dict(MsgPack.unpack(read("8J07.bcif"))["dataBlocks"][1])
BenchmarkTools.Trial: 13 samples with 1 evaluation per sample.
 Range (min … max):  366.401 ms … 424.174 ms  ┊ GC (min … max): 23.95% … 26.69%
 Time  (median):     395.279 ms               ┊ GC (median):    31.13%
 Time  (mean ± σ):   396.415 ms ±  22.352 ms  ┊ GC (mean ± σ):  30.22% ±  3.46%

  █      █            ▁   ▁     ▁  ▁                   ▁▁  ▁  █  
  █▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁█▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁█▁▁█ ▁
  366 ms           Histogram: frequency by time          424 ms <

 Memory estimate: 1.74 GiB, allocs estimate: 19754142.

codecov · 2025-05-09T05:16:03Z

Codecov Report

Attention: Patch coverage is 81.56425% with 33 lines in your changes missing coverage. Please review.

Project coverage is 94.24%. Comparing base (a507527) to head (080ad27).

Files with missing lines	Patch %	Lines
src/bcif.jl	80.47%	33 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #71      +/-   ##
==========================================
- Coverage   95.32%   94.24%   -1.09%     
==========================================
  Files          14       15       +1     
  Lines        2033     2205     +172     
==========================================
+ Hits         1938     2078     +140     
- Misses         95      127      +32

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jgreener64

Thanks for making this PR, looks great. I have left some comments but overall it seems "Julia-esque". It needs tests, but I imagine you will get round to that.

Should the different sub-dictionaries instead be turned into custom structs for multiple dispatch?

Not unless it gives a benefit such as speed, which I doubt.

julia> read("1SSU.bcif", BCIFFormat)

This is a NMR structure with multiple models, the model number is in _atom_site.pdbx_PDB_model_num for mmCIF so hopefully is accessible from BCIF.

Sounds good about performance, it may even get faster and use less memory with some of Tim's planned changes (#70).

jgreener64 · 2025-05-12T22:06:29Z

Project.toml

@@ -47,6 +48,7 @@ Graphs = "1"
 LinearAlgebra = "1.9"
 MMTF = "1"
 MetaGraphs = "0.7, 0.8"
+MsgPack = "1.2.1"


Would be good to know how much time this adds to using BioStructures. I expect not a lot, but it can be put in an extension if it is significant.

src/bcif.jl

jgreener64 · 2025-05-12T22:10:53Z

src/bcif.jl

+    dict::Dict{String,BCIFArrayTypes}
+end
+
+function BCIFDict(dict::Dict{String,BCIFArrayTypes})


This constructor should be available by default.

src/model.jl

jgreener64 · 2025-05-12T22:23:03Z

src/download.jl

+                        "https://models.rcsb.org/$pdbid.bcif",
+                        pdbpath,
+                    )
+                    return pdbpath


Might be better not to return from within the try block, but to use the return at the end of the function and add BCIFFormat checks as required.

BradyAJohnston · 2025-05-13T08:38:12Z

Ah sorry for the auto-format spam. Will go through comments and now I know you're happy with overall I'll put together some tests / benchmarks as well.

This reverts commit 88d6940.

timholy

This is quite impressive for a first project!

I'm neither a MsgPack nor BinaryCIF guru, but rather than making all your types mutable and passing an "empty" object to encode(enc, data), consider making the types immutable and just construct them on demand. You may also find that writing a big

function decode(typecode::TypeCode, data)
    if typecode == INT8
        return ...
    elseif typecode == INT16
        return ...
    ...
end

is actually faster: the advantage is that typecode is a concrete type, and so everything except the return type is inferrable, whereas dispatching based on types that you can't predict in advance requires julia to take more laborious steps at runtime. You can use ProfileView (or its analog built-in to vscode) to look for type-instability in your current implementation; if there isn't any, then you can ignore this suggestion!

src/bcif.jl

Incorporates most feedback from PR

Co-authored-by: Tim Holy <[email protected]>

This reverts commit b168125.

…o bcif

BradyAJohnston · 2025-05-15T07:04:40Z

Thanks for the feedback @timholy and @jgreener64 . I've gone through with some refactoring and incorporated most of it and it's much cleaner. I've got very minor test currently, but over the coming days I'll continue to add more tests for BCIF-specific things. I've currently got the encode() functions commented out as I haven't worked on them at all since the initial write up.

BradyAJohnston · 2025-05-15T12:39:17Z

@timholy are you able to maybe explain if I should still be trying to attempt this? There has been a bit of a restructure, but with the decoding process there can be a different encoding type, but same final type (mostly with integers), so specifying the type I don't think would work. I might be misunderstanding this though.

function decode(typecode::TypeCode, data)
    if typecode == INT8
        return ...
    elseif typecode == INT16
        return ...
    ...
end

timholy · 2025-05-15T20:12:15Z

Looking good! What I meant was the following:

function reencode1!(out, types::Vector{DataType}, vals)
    for (T, v) in zip(types, vals)
        push!(out, T(v))
    end
    return out
end

function reencode2!(out, typecodes::Vector{Int}, vals)
    for (nb, v) in zip(typecodes, vals)
        if nb == 8
            push!(out, Int8(v))
        elseif nb == 16
            push!(out, Int16(v))
        elseif nb == 32
            push!(out, Int32(v))
        elseif nb == 64
            push!(out, Int64(v))
        end
    end
    return out
end

vals = 1:16
types = repeat([Int8, Int16, Int32, Int64], 4)
typecodes = repeat([8, 16, 32, 64], 4)
out = sizehint!(Integer[], length(vals))

using BenchmarkTools

@btime reencode1!($out, $types, $vals) setup=(empty!(out));
@btime reencode2!($out, $typecodes, $vals) setup=(empty!(out));

Result:

julia> include("bench.jl")
  1.300 μs (0 allocations: 0 bytes)
  74.794 ns (0 allocations: 0 bytes)

As you can see, the second mostly-type-stable version is much faster.

BradyAJohnston · 2025-05-16T00:23:58Z

Okay thanks for the additional info on that. I'll have a play around with some benchmarking and see what I can come up with.

BradyAJohnston added 11 commits May 7, 2025 17:52

initial working BCIF decoding

cf71a8f

turn into read() function

8bdb53f

cleanup

a7defb2

cleanup

7367a76

run JuliaFormatter

c0911d7

cleanup and multithread column decoding

0886b58

Return as MolecularStructure

c427d66

entire datablock to dict

070fd36

cleanup

c1ee984

cleanup

9fb8b7b

add download and cleanup

88d6940

BradyAJohnston marked this pull request as draft May 9, 2025 06:09

jgreener64 reviewed May 12, 2025

View reviewed changes

Merge branch 'master' into bcif

6ffcbfc

BradyAJohnston added 3 commits May 13, 2025 16:41

Revert "add download and cleanup"

4f191cc

This reverts commit 88d6940.

BCIF support for downloading

137186b

format

1bdf0fc

BradyAJohnston force-pushed the bcif branch from 605acd8 to 1bdf0fc Compare May 13, 2025 10:06

BradyAJohnston added 5 commits May 13, 2025 18:10

fix downloading

b168125

parsing working again

f91a94a

fix parsing of multi-model files

cd08209

initial test comparison

df0a03d

remove extra type lookups and conversions

5421031

timholy mentioned this pull request May 14, 2025

Improve read perf, store coords as SVector #70

Open

timholy reviewed May 14, 2025

View reviewed changes

BradyAJohnston and others added 3 commits May 15, 2025 14:52

incorporate most feedback

8270f08

Incorporates most feedback from PR

Update src/bcif.jl

244a6a0

Co-authored-by: Tim Holy <[email protected]>

remove 1BNA

622f4b7

This reverts commit b168125.

Merge branch 'bcif' of github.com:BradyAJohnston/BioStructures.jl int…

512bb8b

…o bcif

make structs immutable

080ad27

BradyAJohnston mentioned this pull request May 16, 2025

Deciding order of encodings molstar/BinaryCIF#12

Closed

WIP: Add .bcif support #71

Are you sure you want to change the base?

WIP: Add .bcif support #71

Uh oh!

Conversation

BradyAJohnston commented May 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmarks

1AKE

1CD3

7CGO

8J07

Just decoding to Dict()

Uh oh!

codecov bot commented May 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jgreener64 left a comment

Choose a reason for hiding this comment

Uh oh!

jgreener64 May 12, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jgreener64 May 12, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jgreener64 May 12, 2025

Choose a reason for hiding this comment

Uh oh!

BradyAJohnston commented May 13, 2025

Uh oh!

timholy left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

BradyAJohnston commented May 15, 2025

Uh oh!

BradyAJohnston commented May 15, 2025

Uh oh!

timholy commented May 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

BradyAJohnston commented May 16, 2025

Uh oh!

Uh oh!

WIP: Add `.bcif` support #71

WIP: Add `.bcif` support #71

BradyAJohnston commented May 9, 2025 •

edited

Loading

Just decoding to `Dict()`

codecov bot commented May 9, 2025 •

edited

Loading

timholy left a comment •

edited

Loading

timholy commented May 15, 2025 •

edited

Loading