Conversation

martinalex000

The original BHC code did not scale well to larger datasets with more than a few hundred points. With some minor modifications (caching the active clusters and batching the original tmp_merge computation), it now scales to a few thousand points.

I did not perform any optimization on the rose-tree variant of BHC, but the caching of active clusters should also speed up that code quite a bit, and the batched computation should carry over to the __init_pairs() function.
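To illustrate the batching idea (a minimal sketch under assumed shapes, not the PR's actual code): rather than calling np.linalg.slogdet once per candidate pair inside a Python loop, the posterior scale matrices for all pairs can be stacked and evaluated in a single vectorized call:

```python
import numpy as np

# Loop version: one LAPACK call per pair, slow for thousands of points.
def pair_logdets_loop(s_mats):
    return np.array([np.linalg.slogdet(s)[1] for s in s_mats])

# Batched version: stack the (d, d) posterior scale matrices into one
# (n_pairs, d, d) array; slogdet vectorizes over the leading axis.
def pair_logdets_batched(s_mats):
    s_mat_p = np.stack(s_mats)                 # (n_pairs, d, d)
    sign, logdet = np.linalg.slogdet(s_mat_p)  # each of shape (n_pairs,)
    return logdet
```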

@caponetto
Owner

Hi @martinalex000, thanks for this valuable contribution!
I've fixed the outdated GitHub Workflows. Could you please rebase your branch?

@martinalex000
Author

I did the rebase; could you have a look?

@caponetto
Owner

@martinalex000 there are some formatting issues raised by the GitHub workflow. Could you please address them?

@martinalex000
Author

I fixed the formatting issues (even though the default settings of the black formatter conflict with the flake8 settings). There was also a minor problem with the flake8 config: "ignore" replaces the entire default ignore list, whereas "extend-ignore" appends to it, which is what was probably intended.
With that change, W503 is now ignored by default, in accordance with current PEP 8 rules.
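For concreteness, a sketch of the setup.cfg difference (the exact error codes in the PR may differ; E203 is a typical black-compatibility entry):

```ini
[flake8]
# "ignore = E203" would replace flake8's entire default ignore list,
# re-enabling W503 and friends. "extend-ignore" appends to the defaults,
# so W503 stays ignored, as PEP 8 now recommends:
extend-ignore = E203
```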

I can't do much about the file size, though.

@caponetto requested a review from Copilot on September 1, 2025 at 11:09

Copilot AI left a comment

Pull Request Overview

Performance optimization of the Bayesian Hierarchical Clustering (BHC) algorithm to improve scalability from hundreds to thousands of data points through strategic caching and batching modifications.

  • Introduced caching of active cluster data to avoid repeated recomputation
  • Added batched computation method for initial pairwise log-likelihood calculations
  • Optimized tmp_merge data structure management with pre-allocation and filtering (see the sketch below)
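A rough sketch of that pre-allocation and filtering idea (hypothetical row layout and names; the PR's actual tmp_merge structure may differ):

```python
import numpy as np

# Hypothetical layout: each tmp_merge row holds (score, cluster_i, cluster_j).
def filter_merged(tmp_merge, active_ids):
    """Keep only candidate merges whose clusters are still active, instead
    of rebuilding the candidate array from scratch after every merge."""
    ids = list(active_ids)
    mask = np.isin(tmp_merge[:, 1], ids) & np.isin(tmp_merge[:, 2], ids)
    return tmp_merge[mask]

# New candidates for a freshly merged cluster can then be written into a
# pre-allocated block rather than re-stacking the whole array each step.
```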

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • setup.cfg: Updated flake8 configuration from "ignore" to "extend-ignore"
  • bhc/core/prior.py: Added batched log-likelihood computation method and improved code formatting
  • bhc/core/bhc.py: Implemented caching system and optimized merge operations for performance
  • bhc/api.py: Improved code formatting and consistency


# ------------------------------------------------------------------
rp = self.r + 2.0 # each cluster has two points
vp = self.v + 2.0
sign, logdet = np.linalg.slogdet(s_mat_p) # (N-i-1,)

Copilot AI Sep 1, 2025


The variable sign is computed but never used. If the determinant can be negative or zero, this could lead to incorrect log-likelihood calculations since the subsequent computation assumes positive determinants.

Suggested change
  sign, logdet = np.linalg.slogdet(s_mat_p) # (N-i-1,)
+ if not np.all(sign > 0):
+     raise ValueError("Posterior scale matrix is not positive-definite for some pairs (determinant <= 0).")
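For reference, a small illustration (hypothetical values) of why discarding the sign is risky:

```python
import numpy as np

a = np.array([[1.0, 0.0],
              [0.0, -2.0]])            # indefinite matrix: det(a) = -2
sign, logdet = np.linalg.slogdet(a)
# sign == -1.0 and logdet == np.log(2.0); det(a) = sign * np.exp(logdet).
# Using logdet alone would silently treat det = -2 as det = +2, which is
# why the suggestion checks that every sign is positive.
```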


Comment on lines +112 to +113
data_per_cluster[i] = None
data_per_cluster[j] = None

Copilot AI Sep 1, 2025


Setting merged cluster data to None creates potential for accessing None values later. Consider using a dictionary or set to track active clusters instead of relying on None checks.
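A minimal sketch of that alternative (illustrative names, not the PR's code): keep the live cluster ids in a set, so merged entries never need None sentinels:

```python
# id -> cluster data; clusters 0 and 1 are about to merge into cluster 2.
data_per_cluster = {0: [1.0, 2.0], 1: [3.0, 4.0]}
active = set(data_per_cluster)

def merge(i, j, new_id):
    data_per_cluster[new_id] = data_per_cluster[i] + data_per_cluster[j]
    active.add(new_id)
    active.discard(i)   # i and j leave the active set; no None writes
    active.discard(j)

merge(0, 1, 2)
# Later passes iterate over `active` only, so no None checks are needed:
for cid in active:
    assert data_per_cluster[cid] is not None
```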

