Refactoring of Molecule module and change to how graphs are stored. #85
There is enough functionality involving molecules to warrant a full subpackage. Within the subpackage, feel free to split the content into several smaller modules to enable faster partial rebuilds.
As you can see, we already have several modules devoted to working with molecules and molecular graphs. This will make them easier to find and generally clean up the source tree.
Most of the time I think you can just import things from the subpackage directly, e.g. ::

    from rmgpy.molecule import Molecule, Group, fromAdjacencyList

rather than trying to remember which submodule these are defined in. The only exception would be if you are cimporting from a Cython module; in this case you will need to refer to the full subpackage.
The idea is that the unit test modules should generally be organized parallel to the main rmgpy modules.
All of the unit tests in the unittest/molecule subfolder now run as they did at the start of this branch.
…ation. This functionality is complex enough to warrant its own module, especially when we get around to adding QMTP functionality to RMG-Py.
Again, the idea is to parallel the structure of the modules in the rmgpy.* source tree. Note that, as before, a few of the unit tests fail because they represent tricky edge cases that the current symmetry number algorithms do not yet capture.
The intent was to conserve memory in some situations by storing the number of hydrogen atoms adjacent to each heavy atom as an integer, instead of storing the hydrogen atoms explicitly as Atom objects. (The idea came from OpenBabel, which has this functionality.) However, this turned out to be far from trivial to implement when dealing with the many graph manipulation requirements of an RMG job, and I don't think the memory savings are worth the trouble any more. There was probably also a time savings in the graph isomorphism evaluation when comparing implicit to implicit, but I think we can get at least some of that savings back in other ways.
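For readers unfamiliar with the trade-off, here is a minimal sketch of the two representations. It is not the removed RMG-Py code; the `Atom` class, the `implicit_hydrogens` attribute, and `make_hydrogens_explicit()` are hypothetical names chosen for illustration.

```python
class Atom:
    """Hypothetical atom: hydrogens are either counted or stored as objects."""

    def __init__(self, symbol, implicit_hydrogens=0):
        self.symbol = symbol
        # Implicit representation: a plain integer count, no Atom objects.
        self.implicit_hydrogens = implicit_hydrogens
        # Explicit representation: neighbouring Atom -> bond order.
        self.bonds = {}


def make_hydrogens_explicit(atoms):
    """Replace every implicit hydrogen count with real Atom objects."""
    result = list(atoms)
    for atom in atoms:
        for _ in range(atom.implicit_hydrogens):
            hydrogen = Atom('H')
            atom.bonds[hydrogen] = 1
            hydrogen.bonds[atom] = 1
            result.append(hydrogen)
        atom.implicit_hydrogens = 0
    return result


# Methane: one object in implicit form, five objects in explicit form.
carbon = Atom('C', implicit_hydrogens=4)
methane = make_hydrogens_explicit([carbon])
```

The implicit form saves objects, but every graph operation (bond formation, isomorphism, ring search) has to handle both forms, which is the complexity this commit removes.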
Since this functionality no longer exists, we need to remove these references for RMG-Py to run.
Previously we stored the edges in a dict of dicts on the Graph, and did not store the vertices on the edge at all. However, the convenience of having these attributes on the Vertex and Edge objects outweighs the slight increase in memory use. This is a pretty significant change in approach, and many of the methods of the Graph class needed modification as a result.
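A rough sketch of the new layout, for illustration only. The class names follow the description above, but the attribute and method names (`edges`, `vertex1`, `vertex2`, `add_edge()`, and so on) are assumptions and may not match the actual rmgpy.molecule.graph code.

```python
class Vertex:
    def __init__(self, label=''):
        self.label = label
        # Each vertex carries its own adjacency: neighbouring Vertex -> Edge.
        self.edges = {}


class Edge:
    def __init__(self, vertex1, vertex2):
        # Each edge knows the two vertices it connects.
        self.vertex1 = vertex1
        self.vertex2 = vertex2


class Graph:
    def __init__(self):
        # The graph keeps only the vertex list; there is no dict of dicts here.
        self.vertices = []

    def add_vertex(self, vertex):
        self.vertices.append(vertex)
        return vertex

    def add_edge(self, edge):
        # Adding an edge mutates both end vertices.
        edge.vertex1.edges[edge.vertex2] = edge
        edge.vertex2.edges[edge.vertex1] = edge
        return edge

    def remove_edge(self, edge):
        del edge.vertex1.edges[edge.vertex2]
        del edge.vertex2.edges[edge.vertex1]

    def has_edge(self, vertex1, vertex2):
        return vertex2 in vertex1.edges


graph = Graph()
a, b = graph.add_vertex(Vertex('a')), graph.add_vertex(Vertex('b'))
ab = graph.add_edge(Edge(a, b))
```

Because `add_edge()` and `remove_edge()` change the vertices themselves, two graphs that share Vertex objects can no longer be edited independently.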
The new implementation exploits the fact that we are now storing the edges on the Vertex objects to dramatically speed up the isomorphism check by significantly decreasing the number of calls into the slow Python/C API. In earlier tests I was seeing roughly an 8x speedup in isomorphism evaluation for Graph objects, and about 2x for Molecule objects.
These functions needed quite a bit of work to adjust to storing the bonds on the atoms themselves. In particular, we now need to make deep copies of the Molecule object in more places, since adding and removing bonds now modifies the Atom objects. This may need more attention in the future, but seems okay for now.
…sts. This functionality is used by both Molecule and Group objects. Before it was stored in the rmgpy.molecule.group module, a bit of an unfortunate compromise. Now that rmgpy.molecule is a subpackage, we have space to place the adjacency list functionality in a module of its own, as done here.
As before, this is mostly updating to reflect that the edges are now stored on the Atom objects instead of the Molecule objects. Some of the corresponding unit tests also needed a bit of work.
As before, this is mostly updating to reflect that the edges are now stored on the GroupAtom objects instead of the Group objects. Some of the corresponding unit tests also needed a bit of work.
As before, this is mostly updating to reflect that the edges are now stored on the Atom objects instead of the Molecule objects.
Mostly this is creating Bond and GroupBond objects with the two atoms specified in the __init__() method. The findIsomorphism() and findSubgraphIsomorphisms() methods also now only return the list of mappings; this caused a few additional changes in the database code.
Small changes, but I think improvements. One reduces the number of if checks, the other reduces the number of loops.
I *think* these are more accurate...
The getSmallestSetOfSmallestRings() method generates a copy of the graph before applying the SSSR algorithm so as to not modify the original. Before, this was safe because we could make a shallow copy and reuse the same Vertex and Edge objects, since we didn't store any information about the graph connectivity on these objects. Now that we are, we must make a deep copy of the graph to use for the SSSR algorithm. As a result, we also need to map the vertices of the copy back to those of the original graph before returning. This was causing the molecule drawing to fail for cyclic species; the problem should now be fixed.
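The copy-then-map-back pattern can be sketched as follows. This is not the SSSR algorithm: as a stand-in for the destructive step it only prunes terminal vertices until the ring atoms remain. All names here (`deep_copy()`, `ring_vertices()`, `connect()`) are hypothetical, and the sketch assumes the vertex list passed in is a complete graph.

```python
class Vertex:
    def __init__(self, label=''):
        self.label = label
        self.edges = {}  # neighbouring Vertex -> Edge


class Edge:
    def __init__(self, vertex1, vertex2):
        self.vertex1, self.vertex2 = vertex1, vertex2


def connect(v1, v2):
    edge = Edge(v1, v2)
    v1.edges[v2] = edge
    v2.edges[v1] = edge
    return edge


def deep_copy(vertices):
    """Copy vertices and edges; also return a map from copy to original."""
    to_copy = {v: Vertex(v.label) for v in vertices}
    seen = set()
    for v in vertices:
        for neighbor, edge in v.edges.items():
            if id(edge) not in seen:  # each edge appears on both end vertices
                seen.add(id(edge))
                connect(to_copy[v], to_copy[neighbor])
    copies = [to_copy[v] for v in vertices]
    to_original = {copy: v for v, copy in to_copy.items()}
    return copies, to_original


def ring_vertices(vertices):
    """Destructively prune a deep copy, then report the ORIGINAL vertices."""
    copies, to_original = deep_copy(vertices)
    pruned = True
    while pruned:
        pruned = False
        for v in list(copies):
            if len(v.edges) <= 1:  # terminal or isolated: cannot be in a ring
                for neighbor in list(v.edges):
                    del neighbor.edges[v]
                v.edges.clear()
                copies.remove(v)
                pruned = True
    # Map the surviving copies back to the caller's own Vertex objects.
    return [to_original[copy] for copy in copies]
```

Without the final mapping step the caller would receive Vertex objects that do not belong to its graph, which is the kind of mismatch that broke the drawing of cyclic species.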
The new method of storing bonds on atoms requires this change. Closes #81. NB. Ring corrections are searched using a molecule fragment with only the ring and no adjoining atoms. If a ring correction definition depends on ligands, it will not be found. (This is not a change in this commit, just an observation)
We (currently) don't have any stereochemistry; it is just guessed by OpenBabel, and it makes the SMILES strings longer and uglier than they need to be. This removes the @@h signs, and also might make it faster as we make fewer calls through pybel.
I think I'm ready to merge it into master, simply because there are several branches that already incorporate it, resulting in a history that is increasingly difficult to follow (for me at least). We should write lots more unit tests in case we haven't quite found all of the bugs in the new graph storage method. (Really, we should write lots more unit tests anyway!) If there are no objections, I'll merge it tomorrow.
Fair enough. I am not yet convinced that the branches behave the same, but we can always determine the differences afterwards, and I agree that continued divergence would be unfortunate.
Two major things on this branch:
The second of these was complicated and inevitably introduced some bugs. We have found several so far, but an unknown number (perhaps zero!) remain undiscovered.
This pull request is to collect debugging efforts so we can merge ASAP, ideally once we have fixed the unit tests!