Use parallel reduce threading primitive in covariance algorithm #3126
Conversation
{
    return;
}
tls_data_t<algorithmFPType, cpu> result = tbb::parallel_reduce(
Wouldn't it be better to make a NUMA-aware version of this function that would first reduce within nodes and then globally?
@david-cortes-intel Thank you for looking into this.
I would prefer to go step-by-step with NUMA.
If a non-NUMA version is enough to improve performance, I would be more likely to add it first.
If not, then I'll probably add NUMA awareness into the primitives.
I also have to say that this version of the code is just an initial quick-and-dirty commit. The code will change a lot further. At the very least, it should pass testing, and all the tbb:: stuff should be moved into the threading layer.
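For readers following the NUMA discussion above, here is a minimal sketch of the two-level idea from the review comment: reduce within each node first, then combine the per-node results globally. It is only an illustration under stated assumptions. `two_level_sum`, `nNodes` and `threadsPerNode` are hypothetical names, plain `std::thread` stands in for the threading layer, and no real NUMA pinning or topology query is performed.

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper, not part of oneDAL or TBB.
// Workers are grouped into "nodes" only by index; nothing is pinned to real NUMA nodes.
// Assumes nNodes > 0 and threadsPerNode > 0.
double two_level_sum(const std::vector<double>& data, std::size_t nNodes, std::size_t threadsPerNode)
{
    const std::size_t nWorkers = nNodes * threadsPerNode;
    std::vector<double> perThread(nWorkers, 0.0);
    std::vector<std::thread> workers;
    workers.reserve(nWorkers);

    // Static split: worker w sums the half-open range [begin, end) into its own slot.
    for (std::size_t w = 0; w < nWorkers; ++w)
    {
        const std::size_t begin = data.size() * w / nWorkers;
        const std::size_t end   = data.size() * (w + 1) / nWorkers;
        workers.emplace_back([&data, &perThread, w, begin, end] {
            perThread[w] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        });
    }
    for (auto& t : workers) t.join();

    // Level 1: combine the partial sums of the threads that belong to the same node.
    std::vector<double> perNode(nNodes, 0.0);
    for (std::size_t n = 0; n < nNodes; ++n)
        for (std::size_t t = 0; t < threadsPerNode; ++t)
            perNode[n] += perThread[n * threadsPerNode + t];

    // Level 2: combine the per-node results into the global result.
    return std::accumulate(perNode.begin(), perNode.end(), 0.0);
}
```

In this sketch both combine levels run serially after the workers finish. A real NUMA-aware primitive would presumably also keep each node's partial results in memory local to that node, which is outside the scope of this illustration.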
…:tls_data_t struct
Pull changes from main branch
Description
Adds a new daal::Reducer interface class that defines the API that algorithms have to implement in order to use reduction primitives based on tbb::parallel_reduce.

Two new threading primitives were added:
threader_reduce implements parallel reduction using dynamic work balancing,
static_threader_reduce implements parallel reduction using static work balancing.

The dense covariance algorithm in oneDAL was modified to use the new static_threader_reduce primitive instead of static_threader_for + single-thread reduction, as was done previously.

The tls_data_t structure previously used as thread-local storage for partial results in the Covariance algorithm was replaced with the CovarianceReducer class, which implements the new daal::Reducer interface to perform parallel reduction.

This PR depends on #3159 because CovarianceReducer uses TArrayScalableCalloc to store the partial results, and the performance of TArrayScalableCalloc is not optimal due to unaligned memory stores (they take up to 45% of the total Covariance compute time). PR #3159 fixes the issue.

PR completeness and readability
Testing
Performance
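As a rough illustration of the reducer pattern outlined in the Description, the sketch below shows one way a reducer interface and a statically balanced reduction could fit together. Everything in it is an assumption made for illustration: `ReducerSketch`, its `create`/`update`/`merge` methods, `ColumnSumReducer` and `static_reduce_sketch` are hypothetical names, not the actual daal::Reducer or static_threader_reduce API, and `std::thread` is used in place of the TBB-based threading layer. Column sums are used as the example because they are one of the partial results a covariance computation needs.

```cpp
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical interface, loosely modelled on the split/accumulate/join shape
// that tbb::parallel_reduce expects from its body; not the actual daal::Reducer.
struct ReducerSketch
{
    virtual ~ReducerSketch() = default;
    virtual std::unique_ptr<ReducerSketch> create() const = 0;   // fresh, empty partial result
    virtual void update(std::size_t begin, std::size_t end) = 0; // accumulate rows [begin, end)
    virtual void merge(const ReducerSketch& other) = 0;          // fold another partial result in
};

// Example reducer: accumulates column sums of a row-major matrix.
struct ColumnSumReducer : ReducerSketch
{
    const double* data;
    std::size_t nCols;
    std::vector<double> sums;

    ColumnSumReducer(const double* d, std::size_t c) : data(d), nCols(c), sums(c, 0.0) {}

    std::unique_ptr<ReducerSketch> create() const override
    {
        return std::unique_ptr<ReducerSketch>(new ColumnSumReducer(data, nCols));
    }
    void update(std::size_t begin, std::size_t end) override
    {
        for (std::size_t i = begin; i < end; ++i)
            for (std::size_t j = 0; j < nCols; ++j) sums[j] += data[i * nCols + j];
    }
    void merge(const ReducerSketch& other) override
    {
        // The sketch assumes both reducers have the same concrete type.
        const auto& o = static_cast<const ColumnSumReducer&>(other);
        for (std::size_t j = 0; j < nCols; ++j) sums[j] += o.sums[j];
    }
};

// Static work balancing: each thread gets one fixed, contiguous slice of [0, n).
// Assumes nThreads > 0. Partial results are merged serially into `result` at the end.
void static_reduce_sketch(std::size_t n, std::size_t nThreads, ReducerSketch& result)
{
    std::vector<std::unique_ptr<ReducerSketch>> partial(nThreads);
    std::vector<std::thread> workers;
    workers.reserve(nThreads);
    for (std::size_t t = 0; t < nThreads; ++t)
    {
        partial[t] = result.create();
        ReducerSketch* r        = partial[t].get();
        const std::size_t begin = n * t / nThreads;
        const std::size_t end   = n * (t + 1) / nThreads;
        workers.emplace_back([r, begin, end] { r->update(begin, end); });
    }
    for (auto& w : workers) w.join();
    for (const auto& p : partial) result.merge(*p);
}
```

A caller would construct one reducer over the input, for example `ColumnSumReducer r(matrix, nCols);`, call `static_reduce_sketch(nRows, nThreads, r);`, and then read the totals from `r.sums`. Compared with a thread-local-storage struct filled inside a parallel loop and reduced on a single thread afterwards, this shape keeps the accumulate and merge logic in one class.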