Efficiently identifying nodes in a list of marginal trees (indexed from across the genome) #3232

moshejasper · 2025-06-24T10:31:31Z


I have a tree sequence and am interested in a particular subset of its trees (predefined and indexed with respect to some property).

I am wondering if there is a way of efficiently identifying which nodes are in this subset of trees?

(Preferably more efficient than: for tree in ts.trees(): for node in tree.nodes(): nodelist.add(node), which is extremely redundant, as most nodes are shared between trees.)

EDIT: purpose was to identify nodes found in arbitrary trees sprinkled across the tree sequence according to some criterion that filters trees.

The broader context is my exploration of 'tree masking', i.e. removing trees that fail some criteria (which may vary) when performing certain kinds of tree calculations.

Answered by hyanwong

Jun 24, 2025


hyanwong · 2025-06-24T10:44:01Z

hyanwong
Jun 24, 2025
Maintainer

How about using keep_intervals and then looking for the nodes mentioned in the edges table?

# warning: untested
import numpy as np

new_ts = ts.keep_intervals([[1e4, 2e4]], simplify=False)
nodes_in_interval = np.unique(
    np.concatenate((new_ts.edges_parent, new_ts.edges_child, new_ts.samples()))
)

7 replies

moshejasper Jun 24, 2025
Author

Hi! This potentially looks like a great base to work with, though there are a couple of things I'll have to check. (I am particularly keen to avoid duplicating the tree sequence, as I am working in memory-tight environments on large trees, which is one of the reasons I'm trying to access only the bits I need for a particular operation.)

I thought I was keen to preserve order, but I guess if I'm just checking membership of a category, a set would also work (or I could switch to an array and sort it, as nodes are well-ordered).

I'll have a crack at both & get back here once I have a better sense of how they went.

Thanks for the help!

moshejasper Jun 24, 2025
Author

Thanks again for the above code. Actually, while that is helpful, it presumes that my filtered trees are consecutive (i.e. that I am doing a region filter, which is not necessarily true). They are generated programmatically (e.g. via a function that evaluates some statistic/attribute of trees) in advance, and very likely sprinkled throughout.

I've built off the edge_diffs idea and come up with the following solution (not yet tested).
It basically uses either an index mask or a test, and manages a temporary container that keeps track of activated and inactivated nodes along the way. Is this likely to do the job, or am I missing something?

from collections import Counter

def get_filtered_nodes(ts, treemask):
    """ `treemask` is a logical list (indexing trees) describing whether to use them or not
    either [True, False, False] or [1, 0, 0], but could also be a test condition if we were iterating trees.
    Nodes are found via edges only, so isolated samples (no edges in a tree) are not reported."""

    nodes_filtered = set()  # set of nodes to return
    edge_count = Counter()  # number of current-tree edges touching each node
    consecutive = False     # tracks when we can get away with less work

    for ed, valid_tree in zip(ts.edge_diffs(), treemask):   # or zip tree in trees() then perform a validity test

        # manage temp space: handle removals first, so a node that loses one edge
        # but gains another at the same breakpoint is not dropped
        for o in ed.edges_out:
            for u in (o.parent, o.child):
                edge_count[u] -= 1
                if edge_count[u] == 0:      # node no longer touched by any edge
                    del edge_count[u]
        n_add = set()
        for e in ed.edges_in:
            for u in (e.parent, e.child):
                edge_count[u] += 1
                n_add.add(u)

        # if valid, update final nodelist
        if valid_tree:                                      # or is_valid(tree)
            if not consecutive:
                nodes_filtered.update(edge_count)           # all nodes in the current tree
                consecutive = True
            else:
                nodes_filtered.update(n_add)                # only nodes new to this tree
        else:
            consecutive = False

    return nodes_filtered

hyanwong Jun 24, 2025
Maintainer

If you want to use the keep_intervals thing, you can pass a list of many intervals. I suspect that might be easier? But if there are loads of little intervals, it might inflate the edge list too much.
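A minimal sketch of how a per-tree mask might be turned into the list of intervals mentioned above, merging adjacent selected trees so that fewer interval boundaries are introduced. The helper name mask_to_intervals and the toy breakpoints are hypothetical, not part of tskit; the function only needs the tree boundaries and one flag per tree.

```python
def mask_to_intervals(breakpoints, treemask):
    """Merge runs of selected trees into [left, right) genomic intervals.

    breakpoints: tree boundaries along the genome, length num_trees + 1
    treemask: one truthy/falsy flag per tree
    """
    intervals = []
    for i, keep in enumerate(treemask):
        if not keep:
            continue
        left, right = breakpoints[i], breakpoints[i + 1]
        if intervals and intervals[-1][1] == left:
            intervals[-1][1] = right          # extend the previous run
        else:
            intervals.append([left, right])   # start a new run
    return intervals


# Toy example (made-up numbers): five trees, of which 0, 1 and 3 pass the filter.
breakpoints = [0.0, 10.0, 25.0, 40.0, 70.0, 100.0]
treemask = [True, True, False, True, False]
intervals = mask_to_intervals(breakpoints, treemask)
print(intervals)  # [[0.0, 25.0], [40.0, 70.0]]
```

With a real tree sequence, the boundaries could presumably come from ts.breakpoints(as_array=True) and the result be passed to ts.keep_intervals(intervals, simplify=False). This still builds a second tree sequence, so it may not suit memory-tight settings.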

hyanwong Jun 24, 2025
Maintainer

Re the edge_diffs approach, the general idea of using a node mask and adding / subtracting from that is probably right.

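Have you tried using the fast Tree array accessors? That might also be a good option.

Simply:

import numpy as np

used_nodes = np.zeros(ts.num_nodes, dtype=bool)  # boolean mask, one entry per node
for tree in ts.trees():
    if I_am_using_this_tree:  # placeholder for whatever test selects a tree
        used_nodes[tree.preorder()] = True  # mark every node in this tree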

It could well be fast enough?
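For reference, the loop above could be wrapped as a function that takes a per-tree mask, roughly as follows. This is only a sketch: nodes_in_masked_trees and ToyTree are hypothetical names, and the function assumes nothing about each tree except a preorder() method returning an array of node IDs, as tskit's Tree provides. The stand-in class is there so the example runs without a real tree sequence.

```python
import numpy as np


def nodes_in_masked_trees(num_nodes, trees, treemask):
    """Return a boolean array marking nodes present in any selected tree."""
    used_nodes = np.zeros(num_nodes, dtype=bool)
    # Each tree is handled before the iterator advances, which matters because
    # ts.trees() reuses a single Tree object while iterating.
    for tree, keep in zip(trees, treemask):
        if keep:
            used_nodes[tree.preorder()] = True
    return used_nodes


class ToyTree:
    """Stand-in for tskit.Tree that exposes only preorder()."""

    def __init__(self, nodes):
        self._nodes = np.array(nodes, dtype=np.int32)

    def preorder(self):
        return self._nodes


# Three toy trees over seven nodes; the middle tree is masked out.
toy_trees = [ToyTree([4, 0, 1]), ToyTree([5, 2, 3]), ToyTree([6, 4, 0, 1])]
mask = nodes_in_masked_trees(7, toy_trees, [True, False, True])
print(mask)
```

With a real tree sequence the call would look like nodes_in_masked_trees(ts.num_nodes, ts.trees(), treemask), where treemask has one entry per tree.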

Answer selected by moshejasper

moshejasper Jun 24, 2025
Author

If you want to use the keep_intervals thing, you can pass a list of many intervals. I suspect that might be easier?

I imagine so (e.g. I could use trees to define intervals, etc.).

However, my concern with this method is that it fundamentally changes the structure of the surrounding tree sequence (truncating edges, removing sites, etc.). I am working with a program that has already preprocessed a tree sequence and is now running node-based statistics. Having an index that maps both to a tree and to a node-based calculation based on that tree has its own benefits (e.g. recalculating a summary statistic for only the trees that pass a filter, without rerunning everything).

moshejasper Jun 24, 2025
Author

Have you tried using the fast Tree array accessors. That might also be a good option?

Simply:
used_nodes = np.zeros(ts.num_nodes)
for tree in ts.trees():
    if I_am_using_this_tree:
        used_nodes[tree.preorder()] = True

That looks quite interesting, though I am a little less familiar with it.
Is the idea that the tree has a preorder method that tells us which nodes it contains (or something like that), and we are just setting those to True as we go along?

EDIT:
So it returns a NumPy array of all the nodes in the tree, which we then set to True in our own array. Looks good. I'll check speed, but I think this is what I'm after!
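On the earlier question of order versus membership: a boolean mask of this kind gives both. A small sketch with a made-up mask (the values are illustrative only):

```python
import numpy as np

# Hypothetical mask over seven nodes, as produced by the loop above.
used_nodes = np.array([True, True, False, False, True, False, True])

node_ids = np.flatnonzero(used_nodes)  # IDs of marked nodes, in ascending order
print(node_ids)                        # [0 1 4 6]
print(bool(used_nodes[4]))             # True: membership test by direct indexing
```

The mask costs one byte per node, so it stays small relative to the tree sequence itself.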
