BHC speedup #5
Conversation
Hi @martinalex000, thanks for this valuable contribution!

I did the rebase, could you have a look?

@martinalex000 there are some formatting issues raised by the GitHub workflow. Could you please address them?

I fixed the formatting issues (even though the black formatter's default settings conflict with the flake8 settings). There was a minor issue with the flake8 config: `ignore` discards everything flake8 ignores by default, whereas `extend-ignore` adds to the defaults, which is probably what was intended. I can't do much about the file size, though.
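The `ignore` vs. `extend-ignore` distinction can be illustrated with a small `setup.cfg` fragment. The codes shown here (E203, W503, line length 88) are the classic black/flake8 conflicts, not necessarily the exact values used in this repo:

```ini
# setup.cfg
[flake8]
# "ignore = ..." REPLACES flake8's built-in default ignore list entirely,
# silently re-enabling checks that are normally off.
# "extend-ignore = ..." APPENDS to the defaults instead:
extend-ignore = E203, W503   # rules that conflict with black's formatting
max-line-length = 88         # black's default line length
```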
Pull Request Overview
Performance optimization of the Bayesian Hierarchical Clustering (BHC) algorithm to improve scalability from hundreds to thousands of data points through strategic caching and batching modifications.
- Introduced caching of active cluster data to avoid repeated recomputation
- Added batched computation method for initial pairwise log-likelihood calculations
- Optimized tmp_merge data structure management with pre-allocation and filtering
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| setup.cfg | Updated flake8 configuration from `ignore` to `extend-ignore` |
| bhc/core/prior.py | Added batched log-likelihood computation method and improved code formatting |
| bhc/core/bhc.py | Implemented caching system and optimized merge operations for performance |
| bhc/api.py | Improved code formatting and consistency |
```python
# ------------------------------------------------------------------
rp = self.r + 2.0  # each cluster has two points
vp = self.v + 2.0
sign, logdet = np.linalg.slogdet(s_mat_p)  # (N-i-1,)
```
Copilot (AI) · Sep 1, 2025
The variable `sign` is computed but never used. If the determinant can be negative or zero, this could lead to incorrect log-likelihood calculations, since the subsequent computation assumes positive determinants.
```diff
  sign, logdet = np.linalg.slogdet(s_mat_p)  # (N-i-1,)
+ if not np.all(sign > 0):
+     raise ValueError(
+         "Posterior scale matrix is not positive-definite "
+         "for some pairs (determinant <= 0)."
+     )
```
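For context on why the guard matters: `np.linalg.slogdet` returns a `(sign, logabsdet)` pair, so discarding `sign` silently turns log(det) into log|det| whenever the determinant is non-positive. A small sketch with hypothetical matrices (not taken from the PR):

```python
import numpy as np

# Positive-definite matrix: sign == 1.0 and logdet is the true log-determinant.
spd = np.array([[2.0, 0.5], [0.5, 1.0]])
sign, logdet = np.linalg.slogdet(spd)      # det = 1.75 > 0

# Indefinite matrix: det = -3, so sign == -1.0 and logdet is log|det|.
# Using logdet alone would silently treat log(3) as a valid log-determinant.
indef = np.array([[1.0, 2.0], [2.0, 1.0]])
sign_i, logdet_i = np.linalg.slogdet(indef)

# Hence the suggested guard: verify np.all(sign > 0) before trusting logdet.
```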
```python
data_per_cluster[i] = None
data_per_cluster[j] = None
```
Copilot (AI) · Sep 1, 2025
Setting merged cluster data to None creates potential for accessing None values later. Consider using a dictionary or set to track active clusters instead of relying on None checks.
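A minimal sketch of the alternative the review suggests: track live clusters in an explicit set instead of overwriting entries with `None`. The name `data_per_cluster` follows the diff above; everything else here is hypothetical scaffolding, not the PR's actual code:

```python
# Three initial singleton-ish clusters keyed by id.
data_per_cluster = {0: [1.0, 2.0], 1: [3.0], 2: [4.0, 5.0]}
active = set(data_per_cluster)   # which cluster ids are still live

def merge(i, j, new_id, merged_data):
    """Merge clusters i and j into new_id, retiring the originals."""
    active.discard(i)
    active.discard(j)
    data_per_cluster[new_id] = merged_data
    active.add(new_id)

merge(0, 1, 3, [1.0, 2.0, 3.0])

# Iteration touches only live clusters; no None checks are ever needed.
live = {c: data_per_cluster[c] for c in active}
```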
The original BHC code did not scale well to datasets with more than a few hundred points. With some minor modifications (caching the active clusters and batching the original `tmp_merge` computation), it now scales to a few thousand points.

I did not optimize the rose-tree variant of BHC, but caching the active clusters should also speed that code up considerably, and the batched computation should carry over to the `__init_pairs()` function.
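The batching idea described above can be sketched as follows: instead of a Python double loop over all cluster pairs, the pairwise statistics are built in one vectorized pass over index arrays. The function and variable names here are illustrative, not the PR's actual API:

```python
import numpy as np

def pairwise_sums_batched(X):
    """Compute X[i] + X[j] for all pairs i < j in one broadcasted
    operation, rather than looping over every pair in Python."""
    n = X.shape[0]
    i_idx, j_idx = np.triu_indices(n, k=1)   # index arrays for all i < j
    return X[i_idx] + X[j_idx], (i_idx, j_idx)

X = np.arange(8.0).reshape(4, 2)             # 4 points in 2-D
sums, (i_idx, j_idx) = pairwise_sums_batched(X)
# 4 points -> 6 unordered pairs, so sums has shape (6, 2).
```

The same pattern applies to the initial pairwise log-likelihoods: once the per-pair sufficient statistics are stacked into one array, a single call to a batched routine such as `np.linalg.slogdet` replaces thousands of tiny per-pair calls.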