-
Couldn't load subscription status.
- Fork 64
improvement: speed up kmeans on large dataset #1219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Important Installation incomplete: to start using Gemini Code Assist, please ask the organization owner(s) to visit the Gemini Code Assist Admin Console and sign the Terms of Services. |
Reviewer's GuideThis PR optimizes object pooling by introducing thread-local caches to minimize global locking and enhances HGraph index setup in KMeans by parameterizing the degree and marking the index as immutable. Class diagram for updated ResourceObjectPool with thread-local poolclassDiagram
class ResourceObjectPool {
+TakeOne() std::shared_ptr<T>
+ReturnOne(std::shared_ptr<T>&)
+GetSize() uint64_t
-resize(uint64_t)
-get_local_pool() std::deque<std::shared_ptr<T>>&
-pool_ std::unique_ptr<Deque<std::shared_ptr<T>>>
-pool_size_ std::atomic<uint64_t>
-owned_allocator_ std::shared_ptr<Allocator>
-kLocalPoolCapacity static const uint64_t
}
class Deque
class Allocator
ResourceObjectPool --> Deque : uses
ResourceObjectPool --> Allocator : owns
Class diagram for updated KMeansCluster HGraph index setupclassDiagram
class KMeansCluster {
+find_nearest_one_with_hgraph(const float* query, ...)
-dim_ int
-allocator_ Allocator
-thread_pool_ ThreadPool
}
class InnerIndexInterface {
+FastCreateIndex(string, param)
}
class HGraphIndex {
+Build(base)
+SetImmutable()
}
class Dataset {
+Make()
}
KMeansCluster --> InnerIndexInterface : creates index
KMeansCluster --> HGraphIndex : builds and sets immutable
KMeansCluster --> Dataset : uses
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/utils/resource_object_pool.h:71-75` </location>
<code_context>
- std::unique_lock<std::mutex> lock(mutex_);
- if (pool_->empty()) {
+ auto& local_pool = this->get_local_pool();
+ if (not local_pool.empty()) {
+ std::shared_ptr<T> obj = local_pool.front();
+ local_pool.pop_front();
+ return obj;
+ }
</code_context>
<issue_to_address>
**suggestion:** Consider resetting objects from the local pool before returning.
Objects from the local pool are returned without Reset(), unlike those from the shared pool. This may cause inconsistent state if Reset() is required before reuse. Please review whether Reset() should be called for local pool objects as well.
```suggestion
if (not local_pool.empty()) {
std::shared_ptr<T> obj = local_pool.front();
local_pool.pop_front();
if (obj) {
obj->Reset();
}
return obj;
}
```
</issue_to_address>
### Comment 2
<location> `src/impl/cluster/kmeans_cluster.cpp:255` </location>
<code_context>
param.allocator_ = std::make_shared<SafeAllocator>(this->allocator_);
param.thread_pool_ = this->thread_pool_;
param.metric_ = MetricType::METRIC_TYPE_L2SQR;
+ auto max_degree = std::max(32, dim_ / 8);
- auto hgraph = InnerIndexInterface::FastCreateIndex("hgraph|32|fp32", param);
</code_context>
<issue_to_address>
**question:** Check for potential integer division issues with dim_ / 8.
If dim_ is less than 8, max_degree will default to 32. Please verify this is correct for small values of dim_.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| param.allocator_ = std::make_shared<SafeAllocator>(this->allocator_); | ||
| param.thread_pool_ = this->thread_pool_; | ||
| param.metric_ = MetricType::METRIC_TYPE_L2SQR; | ||
| auto max_degree = std::max(32, dim_ / 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Check for potential integer division issues with dim_ / 8.
If dim_ is less than 8, max_degree will default to 32. Please verify this is correct for small values of dim_.
src/utils/resource_object_pool.h
Outdated
|
|
||
| private: | ||
| std::shared_ptr<Allocator> owned_allocator_{nullptr}; | ||
| static const uint64_t kLocalPoolCapacity = 4; // Tunable parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: rename kLocalPoolCapacity to local_pool_capacity_ or LOCAL_POOL_CAPACITY
src/utils/resource_object_pool.h
Outdated
| private: | ||
| inline void | ||
| resize(uint64_t size) { | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove unrelated change
src/utils/resource_object_pool.h
Outdated
| return; | ||
| } | ||
| { | ||
| std::lock_guard<std::mutex> lock(mutex_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change std::lock_guard to std::scoped_lock
428773e to
e3992fc
Compare
Codecov Report❌ Patch coverage is @@ Coverage Diff @@
## main #1219 +/- ##
==========================================
+ Coverage 91.42% 91.52% +0.10%
==========================================
Files 318 318
Lines 17622 17621 -1
==========================================
+ Hits 16111 16128 +17
+ Misses 1511 1493 -18
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
- vt pool waste too much time on fast search in HGraph Index Signed-off-by: LHT129 <[email protected]>
Summary by Sourcery
Optimize k-means clustering performance by reducing contention in the resource pool and tuning the HGraph index parameters dynamically.
Enhancements: