Version 3 with cached cross chunk edges #454

Open: wants to merge 113 commits into base: main

@akhileshh akhileshh commented Aug 6, 2023

  • Adds a new column family for cached cross chunk edges.
  • Adds a MaxAgeGCRule to the previous column family, which stores supervoxel cross chunk edges; these edges are only needed during ingest and are eventually deleted.
  • Edits now make use of the cached cross chunk edges.

Summary of changes in pychunkedgraph.ingest:

  • Layer 2 creation is mostly unchanged; it stores cross chunk edges at the supervoxel level.
    • The column family used to store these edges now has a max age garbage collection rule.
    • During ingest, these edges can be used to cache higher layer cross chunk edges; they will eventually be deleted by BigTable's garbage collection routines.
  • When ingesting layer 3, cross edges for children (layer 2) get updated and "lifted" using the supervoxel cross chunk edges mentioned above. The lifted edges are stored in a different column family, so they are retained forever.
    • At the same time, cross edges for parents at layer 3 are created by merging the cross edges of their children. These are intermediate and will be lifted when ingesting the next parent layer.
  • For each layer > 3, up to the root layer:
    • Update children cross chunk edges by "lifting" the edges created during the previous layer's ingest.
    • Add parent cross chunk edges by merging children cross chunk edges; they will be updated when ingesting the next layer.
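
The "lift" and "merge" steps above can be sketched roughly as follows. This is a minimal illustration, not the code in pychunkedgraph.ingest: the function names `lift_cross_edges` and `merge_children_cross_edges`, the plain-dict parent lookup, and the small integer IDs are all assumptions made for the example.

```python
def lift_cross_edges(edges, parent_of):
    """Replace both endpoints of each edge with their parent (hypothetical helper).

    `edges` is an iterable of (node_a, node_b) pairs at the lower layer and
    `parent_of` maps each lower-layer node to its parent. Duplicates created
    by lifting are dropped.
    """
    return sorted({(parent_of[a], parent_of[b]) for a, b in edges})


def merge_children_cross_edges(children, cx_edges_of):
    """Build a parent's intermediate cross edges as the union of its children's."""
    merged = set()
    for child in children:
        merged.update(cx_edges_of.get(child, []))
    return sorted(merged)


# Supervoxel cross chunk edges written during layer 2 creation (made-up IDs).
sv_edges = [(1, 2), (3, 4)]
# Supervoxels 1 and 3 belong to layer 2 node 10; 2 and 4 belong to node 20.
sv_parent = {1: 10, 3: 10, 2: 20, 4: 20}
lifted = lift_cross_edges(sv_edges, sv_parent)

# A layer 3 parent with children 10 and 11 inherits their cross edges.
children_edges = {10: [(10, 20)], 11: [(11, 21)]}
parent_edges = merge_children_cross_edges([10, 11], children_edges)
```

In this sketch the merged parent edges still reference child-layer IDs, which mirrors the description above: they are intermediate and only get lifted when the next layer is ingested.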

This assumes all chunks at the lower layer have been created before the current layer is created, so parent chunk jobs can no longer be queued automatically once their children chunks are complete.

We must now ingest/create one layer at a time.

Summary of changes in pychunkedgraph.graph.edits:

  • Edits are expected to be faster now; going to layer 2 to extract cross chunk edges is no longer necessary since they're cached at each layer.
  • During an edit, these cached cross chunk edges must be updated in both directions: from the newly created nodes to their existing neighbors, and from those neighbors back to the new nodes.
    • Most changes in this module are to handle this step.
    • Caching these edges has also made the edit logic simpler and cleaner.
    • When updating new cross edges, we need to ensure descendants get replaced by the highest parent.
    • For splits, we need to filter out inactive cross edges after the local graph is read from bucket storage.
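
As a rough illustration of replacing descendants by the highest parent, the sketch below walks a parent mapping upward until it runs out. The names `highest_parent` and `replace_descendants`, and the assumption that the nodes created during an edit are available as a plain acyclic dict, are hypothetical and not taken from pychunkedgraph.graph.edits.

```python
def highest_parent(node, parent_of):
    """Follow parent links until a node without a recorded parent is reached."""
    while node in parent_of:
        node = parent_of[node]
    return node


def replace_descendants(cx_edges, parent_of):
    """Rewrite edge endpoints so descendants point at their highest parent.

    Endpoints that are not in `parent_of` are left unchanged; duplicate
    edges produced by the rewrite are collapsed.
    """
    return sorted(
        {(highest_parent(a, parent_of), highest_parent(b, parent_of))
         for a, b in cx_edges}
    )


# Assumed mapping of nodes created during one edit: 21 -> 31 -> 41.
new_parents = {21: 31, 31: 41}
# Neighbor 50 still references the descendants 21 and 31, plus an untouched node 60.
stale_edges = [(50, 21), (50, 31), (50, 60)]
updated_edges = replace_descendants(stale_edges, new_parents)
```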

@akhileshh akhileshh requested a review from sdorkenw August 6, 2023 20:04
@akhileshh akhileshh changed the title from "WIP" to "WIP V3" Aug 11, 2023
@akhileshh akhileshh marked this pull request as ready for review August 23, 2023 22:56
@akhileshh akhileshh changed the title from "WIP V3" to "Version 3 with cached cross chunk edges" Aug 23, 2023
@akhileshh akhileshh requested a review from fcollman August 24, 2023 00:43

def parents_multiple(self, node_ids: np.ndarray, *, time_stamp: datetime = None):
node_ids = np.array(node_ids, dtype=NODE_ID)

@nkemnitz nkemnitz Sep 6, 2023

Just saw this here (and some other places) - same as in #458: np.array will by default create a copy. np.asarray will avoid copies, if the requirements are already met.
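
A small check of the difference, assuming `NODE_ID` is `np.uint64` (the alias is not shown in this excerpt, so treat that as an assumption):

```python
import numpy as np

NODE_ID = np.uint64  # assumed dtype alias for this example

ids = np.arange(5, dtype=NODE_ID)

# np.array copies by default, even when the input already has the right dtype.
copied = np.array(ids, dtype=NODE_ID)
copy_shares_memory = np.shares_memory(ids, copied)

# np.asarray returns the input unchanged when it is already an ndarray of that dtype.
passthrough = np.asarray(ids, dtype=NODE_ID)
passthrough_is_same = passthrough is ids

# A conversion still happens when it is needed, e.g. for a Python list.
from_list = np.asarray([1, 2, 3], dtype=NODE_ID)
```

One thing to keep in mind with `np.asarray`: since no copy is made, a function that mutates the result in place would also mutate the caller's array.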

@sdorkenw sdorkenw left a comment

Overall this looks good besides the one point - a tricky one though - that I marked

new_cx_edges_d[layer] = edges
assert np.all(edges[:, 0] == new_id)
cg.cache.cross_chunk_edges_cache[new_id] = new_cx_edges_d
entries = _update_neighbor_cross_edges(

@sdorkenw sdorkenw Sep 8, 2023

I think this here can introduce problems if a neighboring node is a neighbor to multiple new_l2_ids.

_update_neighbor_cross_edges looks right to me. It writes a complete new set of L2 edges for a node. But if the same node is updated multiple times, then only the last update is reflected. Maybe the logic here takes care of this somehow but then it still introduces multiple unnecessary writes.

So, if I am correct about this, the solution would be to consolidate this call across all new_l2_ids to only make one call per neighboring node id.
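
To make the concern concrete, here is a toy sketch of the two approaches. It does not use `_update_neighbor_cross_edges` or any real storage; `per_new_id_writes`, `consolidated_writes`, and the dict standing in for a row store are hypothetical, and each write is assumed to replace the neighbor's full edge list.

```python
from collections import defaultdict


def per_new_id_writes(edges_by_new_id):
    """One write per (new_id, neighbor) pair, each carrying only that new_id's edges."""
    writes = []
    for new_id, edges in edges_by_new_id.items():
        for neighbor in sorted({b for _, b in edges}):
            writes.append((neighbor, [(neighbor, new_id)]))
    return writes


def consolidated_writes(edges_by_new_id):
    """Collect edges per neighbor first, then emit exactly one write per neighbor."""
    per_neighbor = defaultdict(set)
    for edges in edges_by_new_id.values():
        for new_id, neighbor in edges:
            per_neighbor[neighbor].add((neighbor, new_id))
    return [(n, sorted(e)) for n, e in sorted(per_neighbor.items())]


def apply_writes(writes):
    """Simulate a store where each write overwrites the neighbor's whole row."""
    store = {}
    for neighbor, edges in writes:
        store[neighbor] = edges
    return store


# Neighbor 7 is adjacent to both new nodes 101 and 102 (made-up IDs).
edges_by_new_id = {101: [(101, 7)], 102: [(102, 7), (102, 8)]}

naive = per_new_id_writes(edges_by_new_id)
naive_store = apply_writes(naive)

merged = consolidated_writes(edges_by_new_id)
merged_store = apply_writes(merged)
```

Under these assumptions the per-new-id version writes neighbor 7 twice and keeps only the last edge list, while the consolidated version writes it once with both edges.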

new_cx_edges_d[layer] = edges
assert np.all(edges[:, 0] == new_id)
cg.cache.cross_chunk_edges_cache[new_id] = new_cx_edges_d
entries = _update_neighbor_cross_edges(

same issue as above

@akhileshh akhileshh force-pushed the pcgv3 branch 3 times, most recently from 96e42c4 to 976574a Compare July 31, 2025 20:40
3 participants