-
Notifications
You must be signed in to change notification settings - Fork 24
WIP: Add .bcif
support
#71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #71 +/- ##
==========================================
- Coverage 95.32% 94.24% -1.09%
==========================================
Files 14 15 +1
Lines 2033 2205 +172
==========================================
+ Hits 1938 2078 +140
- Misses 95 127 +32 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this PR, looks great. I have left some comments but overall it seems "Julia-esque". It needs tests, but I imagine you will get round to that.
Should the different sub-dictionaries instead be turned into custom structs for multiple dispatch?
Not unless it gives a benefit such as speed, which I doubt.
julia> read("1SSU.bcif", BCIFFormat)
This is a NMR structure with multiple models, the model number is in _atom_site.pdbx_PDB_model_num
for mmCIF so hopefully is accessible from BCIF.
Sounds good about performance, it may even get faster and use less memory with some of Tim's planned changes (#70).
@@ -47,6 +48,7 @@ Graphs = "1" | |||
LinearAlgebra = "1.9" | |||
MMTF = "1" | |||
MetaGraphs = "0.7, 0.8" | |||
MsgPack = "1.2.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to know how much time this adds to using BioStructures
. I expect not a lot, but it can be put in an extension if it is significant.
src/bcif.jl
Outdated
dict::Dict{String,BCIFArrayTypes} | ||
end | ||
|
||
function BCIFDict(dict::Dict{String,BCIFArrayTypes}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This constructor should be available by default.
src/download.jl
Outdated
"https://models.rcsb.org/$pdbid.bcif", | ||
pdbpath, | ||
) | ||
return pdbpath |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be better not to return from within the try
block, but to use the return at the end of the function and add BCIFFormat
checks as required.
Ah sorry for the auto-format spam. Will go through comments and now I know you're happy with overall I'll put together some tests / benchmarks as well. |
This reverts commit 88d6940.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite impressive for a first project!
I'm neither a MsgPack nor BinaryCIF guru, but rather than making all your types mutable
and passing an "empty" object to encode(enc, data)
, consider making the types immutable and just construct them on demand. You may also find that writing a big
function decode(typecode::TypeCode, data)
if typecode == INT8
return ...
elseif typecode == INT16
return ...
...
end
is actually faster: the advantage is that typecode
is a concrete type, and so everything except the return type is inferrable, whereas dispatching based on types that you can't predict in advance requires julia to take more laborious steps at runtime. You can use ProfileView (or its analog built-in to vscode) to look for type-instability in your current implementation; if there isn't any, then you can ignore this suggestion!
Incorporates most feedback from PR
Co-authored-by: Tim Holy <[email protected]>
This reverts commit b168125.
Thanks for the feedback @timholy and @jgreener64 . I've gone through with some refactoring and incorporated most of it and it's much cleaner. I've got very minor test currently, but over the coming days I'll continue to add more tests for BCIF-specific things. I've currently got the |
@timholy are you able to maybe explain if I should still be trying to attempt this? There has been a bit of a restructure, but with the decoding process there can be a different encoding type, but same final type (mostly with integers), so specifying the
|
Looking good! What I meant was the following: function reencode1!(out, types::Vector{DataType}, vals)
for (T, v) in zip(types, vals)
push!(out, T(v))
end
return out
end
function reencode2!(out, typecodes::Vector{Int}, vals)
for (nb, v) in zip(typecodes, vals)
if nb == 8
push!(out, Int8(v))
elseif nb == 16
push!(out, Int16(v))
elseif nb == 32
push!(out, Int32(v))
elseif nb == 64
push!(out, Int64(v))
end
end
return out
end
vals = 1:16
types = repeat([Int8, Int16, Int32, Int64], 4)
typecodes = repeat([8, 16, 32, 64], 4)
out = sizehint!(Integer[], length(vals))
using BenchmarkTools
@btime reencode1!($out, $types, $vals) setup=(empty!(out));
@btime reencode2!($out, $typecodes, $vals) setup=(empty!(out)); Result: julia> include("bench.jl")
1.300 μs (0 allocations: 0 bytes)
74.794 ns (0 allocations: 0 bytes) As you can see, the second mostly-type-stable version is much faster. |
Okay thanks for the additional info on that. I'll have a play around with some benchmarking and see what I can come up with. |
This is discussed somewhat in #45.
Still very much a WIP, but has initial support for parsing a
.bcif
file into aMolecularStructure
. Based on the BCIF Spec and the biotite implementation.This is my first sizeable Julia project, so I would like feedback on ways things could be better approached in a more "Julia-esque" way. Currently the
.bcif
file is unpacked usingMsgPack
and just left as a dictionary that is then processed and decoded at different steps. Should the different sub-dictionaries instead be turned into custom structs for multiple dispatch?Read function is bare-bones but works. I am still getting some errors for the construction of the
MolecularStructure
for some files which I am not sure how we approach them:TODOS:
MsgPack.unpack(io, BCIFDict)
Benchmarks
Some initial bencharks that I'll make some plots for & be more comprehensive when closer to being finished. Benchmarks run with
julia --threads 8
as the.bcif
columns are returned all together so can be decoded in parallel.Unsurprisingly it is slightly slower on the small structures, with minimal difference in file sizes on disk. For the larger structures it ends up being about ~2x faster to return the
MolecularStructure
and with 1/2 the memory required for reading. Decoding the file into a dictionary for the largest structure tested (8J07
) only takes ~400 ms with 1.5 GB memory (vs 6.5 s & 4 GB for returning full structure) with the remaining ~6 s being the construction of theMolecularStructure
.1AKE
CIF Size:
427 KB
BCIF Size:
224 KB
1CD3
CIF Size:
1 MB
BCIF Size:
316 KB
7CGO
CIF Size:
36.3 MB
BCIF Size:
4.2 MB
8J07
CIF Size:
353.3 MB
BCIF Size
37.5 MB
Just decoding to
Dict()