Refactoring of Molecule module and change to how graphs are stored. #85

rwest · 2012-06-20T22:53:28Z

Two major things on this branch:

a new "molecule" package containing lots of things related to molecules.
a change to how graphs are stored (Now storing dict of edges on Vertex and vertices on Edge.)

The second of these was complicated and inevitably introduced some bugs, several of which we have so far found, but an unknown number (perhaps zero!) have not been.
This pull request is to collect debugging efforts so we can merge ASAP, ideally once we have fixed the unit tests!

There is enough functionality involving molecules to warrant a full subpackage. Within the subpackage, feel free to split the content into several smaller modules to enable faster partial rebuilds.

As you can see, we already have several modules devoted to working with molecules and molecular graphs. This will make them easier to find and generally clean up the source tree.

Most of the time I think you can just import things from the subpackage directly, e.g. :: from rmgpy.molecule import Molecule, Group, fromAdjacencyList rather than trying to remember which submolecule these are defined in. The only exception would be if you are cimporting from a Cython module; in this case you will need to refer to the full subpackage.

The idea is that the unit test modules should generally be organized parallel to the main rmgpy modules.

All of the unit tests in the unittest/molecule subfolder now run as they did at the start of this branch.

…ation. This functionality is complex enough to warrant its own module, especially when we get around to adding QMTP functionality to RMG-Py.

Again, the idea is to parallel the structure of the modules in the rmgpy.* source tree. Note that, as before, a few of the unit tests fail because they represent tricky edge cases that the current symmetry number algorithms do not yet capture.

The intent was to conserve memory in some situations by storing the number of hydrogen atoms adjacent to each heavy atom as an integer, instead of storing the hydrogen atoms explicitly as Atom objects. (The idea came from OpenBabel, which has this functionality.) However, this turned out to be far from nontrivial to implement when dealing with the many graph manipulation requirements of an RMG job, and I don't think the memory savings is worth the trouble any more. There was probably also a time savings in the graph isomorphism evaluation when comparing implicit to implicit, but I think we can get at least some of that savings back in other ways.

Since this functionality no longer exists, we need to remove these references for RMG-Py to run.

Previously we stored the edges in a dict of dicts on the Graph, and did not store the vertices on the edge at all. However, the convenience of having these attributes on the Vertex and Edge objects outweights the slight increase in memory use. This is a pretty significant change in approach, and many of the methods of the Graph class needed modification as a result.

The new implementation exploits the fact that we are now storing the edges on the Vertex objects to dramatically speed up the isomorphism check by significantly decreasing the number of calls into the slow Python/C API. In earlier tests I was seeing a ~8x speedup in isomorphism evaluation for Graph objects, and about ~2x for Molecule objects.

These functions needed quite a bit of work to adjust to storing the bonds on the atoms itself. In particular, we now need to make deep copies of the Molecule object in more places, since adding and removing bonds now modified the Atom objects. This may need more attention in the future, but seems okay for now.

…sts. This functionality is used by both Molecule and Group objects. Before it was stored in the rmgpy.molecule.group module, a bit of an unfortunate compromise. Now that rmgpy.molecule is a subpackage, we have space to place the adjacency list functionality in a module of its own, as done here.

As before, this is mostly updating to reflect that the edges are now stored on the Atom objects instead of the Molecule objects. Some of the corresponding unit tests also needed a bit of work.

As before, this is mostly updating to reflect that the edges are now stored on the GroupAtom objects instead of the Group objects. Some of the corresponding unit tests also needed a bit of work.

As before, this is mostly updating to reflect that the edges are now stored on the Atom objects instead of the Molecule objects.

Mostly this is creating Bond and GroupBond objects with the two atoms specified in the __init__() method. The findIsomorphism() and findSubgraphIsomorphisms() methods also now only return the list of mappings; this caused a few additional changes in the database code.

Small changes, but I think improvements. One reduces the number of if checks, the other reduces the number of loops.

I *think* these are more accurate...

The getSmallestSetOfSmallestRings() method generates a copy of the graph before applying the SSSR algorithm so as to not modify the original. Before, this was safe because we could make a shallow copy and reuse the same Vertex and Edge objects, since we didn't store any information about the graph connectivity on these objects. Now that we are, we must make a deep copy of the graph to use for the SSSR algorithm. As a result, we also need to map the vertices of the copy back to those of the original graph before returning. This was causing the molecule drawing to fail for cyclic species; the problem should now be fixed.

The new method of storing bonds on atoms requires this change. Closes #81. NB. Ring corrections are searched using a molecule fragment with only the ring and no adjoining atoms. If a ring correction definition depends on ligands, it will not be found. (This is not a change in this commit, just an observation)

…ings Closes #82.

@h

We (currently) don't have any stereochemistry, it's just guessed by OpenBabel, and makes the SMILES strings longer and more ugly than they need to be. This removes the @@h signs, and also might make it faster as we make fewer calls through pybel.

jwallen · 2012-07-05T19:56:30Z

I think I'm ready to merge it into master, simply because there are several branches that already incorporate it, resulting in a history that is increasingly difficult to follow (for me at least). We should write lots more unit tests in case we haven't quite found all of the bugs in the new graph storage method. (Really, we should write lots more unit tests anyway!) If there are no objections, I'll merge it tomorrow.

rwest · 2012-07-06T02:26:42Z

Fair enough. I am not yet convinced that the branches behave the same, but we can always determine the differences afterwards, and I agree that continued divergence would be unfortunate.

jwallen and others added 23 commits June 5, 2012 09:33

Created rmgpy.molecule subpackage.

a31603f

There is enough functionality involving molecules to warrant a full subpackage. Within the subpackage, feel free to split the content into several smaller modules to enable faster partial rebuilds.

Moved several modules into the rmgpy.molecule subpackage.

4c8db82

As you can see, we already have several modules devoted to working with molecules and molecular graphs. This will make them easier to find and generally clean up the source tree.

Moved rmgpy.molecule unit test modules into molecule subfolder.

5021e15

The idea is that the unit test modules should generally be organized parallel to the main rmgpy modules.

Updated references to moved rmgpy.molecule submodules in unit tests.

3f8ecee

All of the unit tests in the unittest/molecule subfolder now run as they did at the start of this branch.

Created new rmgpy.molecule.symmetry module for symmetry number calcul…

7f8e1d9

…ation. This functionality is complex enough to warrant its own module, especially when we get around to adding QMTP functionality to RMG-Py.

Removed assorted references to implicit/explicit hydrogens.

6573ee9

Since this functionality no longer exists, we need to remove these references for RMG-Py to run.

Updated Atom, Bond, and Molecule to work with new Graph format.

d07a1df

As before, this is mostly updating to reflect that the edges are now stored on the Atom objects instead of the Molecule objects. Some of the corresponding unit tests also needed a bit of work.

Updated GroupAtom, GroupBond, and Group to work with new Graph format.

744b10c

As before, this is mostly updating to reflect that the edges are now stored on the GroupAtom objects instead of the Group objects. Some of the corresponding unit tests also needed a bit of work.

Updated molecule drawing functionality to work with new Graph format.

0581d15

As before, this is mostly updating to reflect that the edges are now stored on the Atom objects instead of the Molecule objects.

Reordered a couple of 'if' blocks to be more efficient.

6bbb693

Small changes, but I think improvements. One reduces the number of if checks, the other reduces the number of loops.

Small changes to docstring and comment.

82b6a2b

I *think* these are more accurate...

Fix issue #82 in rmgpy.molecule.graph.Graph.getSmallestSetOfSmallestR…

108886f

…ings Closes #82.

jwallen merged commit 0bd9b13 into master Jul 6, 2012

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactoring of Molecule module and change to how graphs are stored. #85

Refactoring of Molecule module and change to how graphs are stored. #85

Uh oh!

rwest commented Jun 20, 2012

Uh oh!

jwallen commented Jul 5, 2012

Uh oh!

rwest commented Jul 6, 2012

Uh oh!

Uh oh!

Refactoring of Molecule module and change to how graphs are stored. #85

Refactoring of Molecule module and change to how graphs are stored. #85

Uh oh!

Conversation

rwest commented Jun 20, 2012

Uh oh!

jwallen commented Jul 5, 2012

Uh oh!

rwest commented Jul 6, 2012

Uh oh!

Uh oh!