BHC speedup #5
Conversation
Hi @martinalex000, thanks for this valuable contribution!

I did the rebase, could you have a look?

@martinalex000 there are some formatting issues raised by the GitHub workflow. Could you please address them?

I fixed the formatting issues (even though the black formatter's default settings conflict with the flake8 settings). There was a minor issue with the flake8 config: `ignore` discards everything flake8 ignores by default, whereas `extend-ignore` adds to the defaults, which is probably what was intended. I can't do much about the file size, though.
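The `ignore` vs. `extend-ignore` distinction can be illustrated with a small `setup.cfg` fragment. The codes shown here (E203, W503, line length 88) are the classic black/flake8 conflicts, not necessarily the exact values used in this repo:

```ini
# setup.cfg
[flake8]
# "ignore = ..." REPLACES flake8's built-in default ignore list entirely,
# silently re-enabling checks that are normally off.
# "extend-ignore = ..." APPENDS to the defaults instead:
extend-ignore = E203, W503   # rules that conflict with black's formatting
max-line-length = 88         # black's default line length
```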
Pull Request Overview
Performance optimization of the Bayesian Hierarchical Clustering (BHC) algorithm to improve scalability from hundreds to thousands of data points through strategic caching and batching modifications.
- Introduced caching of active cluster data to avoid repeated recomputation
- Added batched computation method for initial pairwise log-likelihood calculations
- Optimized tmp_merge data structure management with pre-allocation and filtering
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| setup.cfg | Updated flake8 configuration from `ignore` to `extend-ignore` |
| bhc/core/prior.py | Added batched log-likelihood computation method and improved code formatting |
| bhc/core/bhc.py | Implemented caching system and optimized merge operations for performance |
| bhc/api.py | Improved code formatting and consistency |
```python
# ------------------------------------------------------------------
rp = self.r + 2.0  # each cluster has two points
vp = self.v + 2.0
sign, logdet = np.linalg.slogdet(s_mat_p)  # (N-i-1,)
```
Copilot (AI) · Sep 1, 2025
The variable `sign` is computed but never used. If the determinant can be negative or zero, this could lead to incorrect log-likelihood calculations, since the subsequent computation assumes positive determinants.
```diff
  sign, logdet = np.linalg.slogdet(s_mat_p)  # (N-i-1,)
+ if not np.all(sign > 0):
+     raise ValueError(
+         "Posterior scale matrix is not positive-definite "
+         "for some pairs (determinant <= 0)."
+     )
```
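For context on why the guard matters: `np.linalg.slogdet` returns a `(sign, logabsdet)` pair, so discarding `sign` silently turns log(det) into log|det| whenever the determinant is non-positive. A small sketch with hypothetical matrices (not taken from the PR):

```python
import numpy as np

# Positive-definite matrix: sign == 1.0 and logdet is the true log-determinant.
spd = np.array([[2.0, 0.5], [0.5, 1.0]])
sign, logdet = np.linalg.slogdet(spd)      # det = 1.75 > 0

# Indefinite matrix: det = -3, so sign == -1.0 and logdet is log|det|.
# Using logdet alone would silently treat log(3) as a valid log-determinant.
indef = np.array([[1.0, 2.0], [2.0, 1.0]])
sign_i, logdet_i = np.linalg.slogdet(indef)

# Hence the suggested guard: verify np.all(sign > 0) before trusting logdet.
```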
```python
data_per_cluster[i] = None
data_per_cluster[j] = None
```
Copilot (AI) · Sep 1, 2025
Setting merged cluster data to None creates potential for accessing None values later. Consider using a dictionary or set to track active clusters instead of relying on None checks.
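A minimal sketch of the alternative the review suggests: track live clusters in an explicit set instead of overwriting entries with `None`. The name `data_per_cluster` follows the diff above; everything else here is hypothetical scaffolding, not the PR's actual code:

```python
# Three initial singleton-ish clusters keyed by id.
data_per_cluster = {0: [1.0, 2.0], 1: [3.0], 2: [4.0, 5.0]}
active = set(data_per_cluster)   # which cluster ids are still live

def merge(i, j, new_id, merged_data):
    """Merge clusters i and j into new_id, retiring the originals."""
    active.discard(i)
    active.discard(j)
    data_per_cluster[new_id] = merged_data
    active.add(new_id)

merge(0, 1, 3, [1.0, 2.0, 3.0])

# Iteration touches only live clusters; no None checks are ever needed.
live = {c: data_per_cluster[c] for c in active}
```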
The original BHC code did not scale well to datasets with more than a few hundred points. With some minor modifications (caching the active clusters and batching the original `tmp_merge` computation), it now scales to a few thousand points.

I did not optimize the rose-tree variant of BHC, but caching the active clusters should also speed that code up considerably, and the batched computation should carry over to the `__init_pairs()` function.
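The batching idea described above can be sketched as follows: instead of a Python double loop over all cluster pairs, the pairwise statistics are built in one vectorized pass over index arrays. The function and variable names here are illustrative, not the PR's actual API:

```python
import numpy as np

def pairwise_sums_batched(X):
    """Compute X[i] + X[j] for all pairs i < j in one broadcasted
    operation, rather than looping over every pair in Python."""
    n = X.shape[0]
    i_idx, j_idx = np.triu_indices(n, k=1)   # index arrays for all i < j
    return X[i_idx] + X[j_idx], (i_idx, j_idx)

X = np.arange(8.0).reshape(4, 2)             # 4 points in 2-D
sums, (i_idx, j_idx) = pairwise_sums_batched(X)
# 4 points -> 6 unordered pairs, so sums has shape (6, 2).
```

The same pattern applies to the initial pairwise log-likelihoods: once the per-pair sufficient statistics are stacked into one array, a single call to a batched routine such as `np.linalg.slogdet` replaces thousands of tiny per-pair calls.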