Skip to content

Refactoring of Molecule module and change to how graphs are stored. #85

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Jul 6, 2012

Conversation

rwest
Copy link
Member

@rwest rwest commented Jun 20, 2012

Two major things on this branch:

  1. a new "molecule" package containing lots of things related to molecules.
  2. a change to how graphs are stored (Now storing dict of edges on Vertex and vertices on Edge.)

The second of these was complicated and inevitably introduced some bugs, several of which we have so far found, but an unknown number (perhaps zero!) have not been.
This pull request is to collect debugging efforts so we can merge ASAP, ideally once we have fixed the unit tests!

jwallen and others added 23 commits June 5, 2012 09:33
There is enough functionality involving molecules to warrant a full
subpackage. Within the subpackage, feel free to split the content into
several smaller modules to enable faster partial rebuilds.
As you can see, we already have several modules devoted to working with
molecules and molecular graphs. This will make them easier to find and
generally clean up the source tree.
Most of the time I think you can just import things from the subpackage
directly, e.g. ::

    from rmgpy.molecule import Molecule, Group, fromAdjacencyList

rather than trying to remember which submolecule these are defined in.
The only exception would be if you are cimporting from a Cython module;
in this case you will need to refer to the full subpackage.
The idea is that the unit test modules should generally be organized
parallel to the main rmgpy modules.
All of the unit tests in the unittest/molecule subfolder now run as they
did at the start of this branch.
…ation.

This functionality is complex enough to warrant its own module, especially
when we get around to adding QMTP functionality to RMG-Py.
Again, the idea is to parallel the structure of the modules in the
rmgpy.* source tree. Note that, as before, a few of the unit tests fail
because they represent tricky edge cases that the current symmetry number
algorithms do not yet capture.
The intent was to conserve memory in some situations by storing the
number of hydrogen atoms adjacent to each heavy atom as an integer,
instead of storing the hydrogen atoms explicitly as Atom objects. (The
idea came from OpenBabel, which has this functionality.) However, this
turned out to be far from nontrivial to implement when dealing with
the many graph manipulation requirements of an RMG job, and I don't think
the memory savings is worth the trouble any more.

There was probably also a time savings in the graph isomorphism
evaluation when comparing implicit to implicit, but I think we can get
at least some of that savings back in other ways.
Since this functionality no longer exists, we need to remove these
references for RMG-Py to run.
Previously we stored the edges in a dict of dicts on the Graph, and did
not store the vertices on the edge at all. However, the convenience of
having these attributes on the Vertex and Edge objects outweights the
slight increase in memory use.

This is a pretty significant change in approach, and many of the methods
of the Graph class needed modification as a result.
The new implementation exploits the fact that we are now storing the
edges on the Vertex objects to dramatically speed up the isomorphism
check by significantly decreasing the number of calls into the slow
Python/C API. In earlier tests I was seeing a ~8x speedup in isomorphism
evaluation for Graph objects, and about ~2x for Molecule objects.
These functions needed quite a bit of work to adjust to storing the bonds
on the atoms itself. In particular, we now need to make deep copies of
the Molecule object in more places, since adding and removing bonds now
modified the Atom objects. This may need more attention in the future,
but seems okay for now.
…sts.

This functionality is used by both Molecule and Group objects. Before it
was stored in the rmgpy.molecule.group module, a bit of an unfortunate
compromise. Now that rmgpy.molecule is a subpackage, we have space to
place the adjacency list functionality in a module of its own, as done
here.
As before, this is mostly updating to reflect that the edges are now
stored on the Atom objects instead of the Molecule objects. Some of the
corresponding unit tests also needed a bit of work.
As before, this is mostly updating to reflect that the edges are now
stored on the GroupAtom objects instead of the Group objects. Some of
the corresponding unit tests also needed a bit of work.
As before, this is mostly updating to reflect that the edges are now
stored on the Atom objects instead of the Molecule objects.
Mostly this is creating Bond and GroupBond objects with the two atoms
specified in the __init__() method. The findIsomorphism() and
findSubgraphIsomorphisms() methods also now only return the list of
mappings; this caused a few additional changes in the database code.
Small changes, but I think improvements.
One reduces the number of if checks, the other reduces the number of loops.
I *think* these are more accurate...
The getSmallestSetOfSmallestRings() method generates a copy of the graph
before applying the SSSR algorithm so as to not modify the original.
Before, this was safe because we could make a shallow copy and reuse the
same Vertex and Edge objects, since we didn't store any information about
the graph connectivity on these objects. Now that we are, we must make a
deep copy of the graph to use for the SSSR algorithm. As a result, we
also need to map the vertices of the copy back to those of the original
graph before returning.

This was causing the molecule drawing to fail for cyclic species; the
problem should now be fixed.
The new method of storing bonds on atoms requires this change.
Closes #81.

NB. Ring corrections are searched using a molecule fragment with
only the ring and no adjoining atoms. If a ring correction definition
depends on ligands, it will not be found. (This is not a change in 
this commit, just an observation)
We (currently) don't have any stereochemistry, it's just guessed
by OpenBabel, and makes the SMILES strings longer and more ugly
than they need to be. This removes the @@h signs, and also might
make it faster as we make fewer calls through pybel.
@jwallen
Copy link
Contributor

jwallen commented Jul 5, 2012

I think I'm ready to merge it into master, simply because there are several branches that already incorporate it, resulting in a history that is increasingly difficult to follow (for me at least). We should write lots more unit tests in case we haven't quite found all of the bugs in the new graph storage method. (Really, we should write lots more unit tests anyway!) If there are no objections, I'll merge it tomorrow.

@rwest
Copy link
Member Author

rwest commented Jul 6, 2012

Fair enough. I am not yet convinced that the branches behave the same, but we can always determine the differences afterwards, and I agree that continued divergence would be unfortunate.

@jwallen jwallen merged commit 0bd9b13 into master Jul 6, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants