Use parallel reduce threading primitive in covariance algorithm #3126

Vika-F · 2025-03-17T16:18:05Z

Description

Adds a new daal::Reducer interface class. It defines the API that an algorithm has to implement in order to use the reduction primitives based on tbb::parallel_reduce.
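For readers unfamiliar with the pattern, here is a minimal sketch of what such an interface could look like. It is modeled on the Body concept of tbb::parallel_reduce (splitting constructor, operator() over a range, join) and is not the actual daal::Reducer declaration: the names SketchReducer, SumReducer, split, update, join and reduceInTwoHalves are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical interface modeled on the tbb::parallel_reduce Body concept.
// This is NOT the real daal::Reducer; all names here are illustrative.
class SketchReducer
{
public:
    virtual ~SketchReducer() = default;
    // Create a fresh, empty partial result for another worker
    // (plays the role of the Body splitting constructor).
    virtual std::unique_ptr<SketchReducer> split() const = 0;
    // Accumulate the half-open index range [begin, end) into this partial result.
    virtual void update(std::size_t begin, std::size_t end) = 0;
    // Fold another partial result into this one (plays the role of Body::join).
    virtual void join(const SketchReducer & other) = 0;
};

// Example implementation: sums the elements of an array.
class SumReducer : public SketchReducer
{
public:
    explicit SumReducer(const std::vector<double> & data) : _data(data), _sum(0.0) {}

    std::unique_ptr<SketchReducer> split() const override { return std::make_unique<SumReducer>(_data); }

    void update(std::size_t begin, std::size_t end) override
    {
        assert(begin <= end && end <= _data.size());
        for (std::size_t i = begin; i < end; ++i) _sum += _data[i];
    }

    void join(const SketchReducer & other) override { _sum += static_cast<const SumReducer &>(other)._sum; }

    double sum() const { return _sum; }

private:
    const std::vector<double> & _data; // input shared by all partial results
    double _sum;                       // this worker's partial result
};

// Sequential driver standing in for the threading layer: it splits the range
// in two, updates each half on its own reducer, then joins the results.
double reduceInTwoHalves(const std::vector<double> & data)
{
    SumReducer left(data);
    std::unique_ptr<SketchReducer> right = left.split();
    const std::size_t mid = data.size() / 2;
    left.update(0, mid);
    right->update(mid, data.size());
    left.join(*right);
    return left.sum();
}
```

In a real threading layer the two update calls would run on different threads; the driver above only shows the order in which the interface methods are expected to be called.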

Two new threading primitives were added:

threader_reduce implements parallel reduction using dynamic work balancing,
static_threader_reduce implements parallel reduction using static work balancing.
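The difference between the two balancing strategies can be sketched with plain std::thread. This is a simplified stand-in, not the code in this PR: the function names and the atomic-counter chunking scheme are assumptions, and TBB's own scheduler distributes work differently.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Static balancing: [0, n) is cut into nThreads contiguous chunks before any
// work starts, so each thread always processes the same part of the data.
double staticReduceSum(const std::vector<double> & data, std::size_t nThreads)
{
    assert(nThreads > 0);
    const std::size_t n = data.size();
    std::vector<double> partial(nThreads, 0.0); // one partial result per thread
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nThreads; ++t)
    {
        workers.emplace_back([&data, &partial, n, nThreads, t] {
            const std::size_t begin = n * t / nThreads;
            const std::size_t end   = n * (t + 1) / nThreads;
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
        });
    }
    for (std::thread & w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0); // final join
}

// Dynamic balancing: threads repeatedly grab the next free chunk from a shared
// counter, so a fast thread ends up processing more chunks than a slow one.
double dynamicReduceSum(const std::vector<double> & data, std::size_t nThreads, std::size_t chunk)
{
    assert(nThreads > 0 && chunk > 0);
    const std::size_t n = data.size();
    std::atomic<std::size_t> next(0);
    std::vector<double> partial(nThreads, 0.0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nThreads; ++t)
    {
        workers.emplace_back([&data, &partial, &next, n, chunk, t] {
            double local = 0.0;
            for (;;)
            {
                const std::size_t begin = next.fetch_add(chunk);
                if (begin >= n) break;
                const std::size_t end = std::min(begin + chunk, n);
                local = std::accumulate(data.begin() + begin, data.begin() + end, local);
            }
            partial[t] = local;
        });
    }
    for (std::thread & w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0); // final join
}
```

With static balancing the mapping of data to threads is deterministic, which tends to help data locality; dynamic balancing trades that for better load distribution when chunks take uneven time.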

The dense covariance algorithm in oneDAL now uses the new static_threader_reduce primitive. Previously it used static_threader_for followed by a single-threaded reduction of the partial results.

The tls_data_t structure, previously used as thread-local storage for partial results in the covariance algorithm, was replaced with the CovarianceReducer class, which implements the new daal::Reducer interface to perform the parallel reduction.
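To illustrate why covariance fits the reduce pattern, here is a rough sketch of a partial result that can be updated on a block of rows and joined with another partial result. It uses the textbook sums and cross-product formulation; the CovPartial type and its members are hypothetical and do not reproduce the CovarianceReducer in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical partial result for a block of rows; not the oneDAL CovarianceReducer.
struct CovPartial
{
    std::size_t nFeatures;
    std::size_t nObservations = 0;
    std::vector<double> sums;         // per-feature sums, size p
    std::vector<double> crossProduct; // sum of x * x^T, row-major p x p

    explicit CovPartial(std::size_t p) : nFeatures(p), sums(p, 0.0), crossProduct(p * p, 0.0) {}

    // Accumulate rows [rowBegin, rowEnd) of a row-major n x p data matrix.
    void update(const std::vector<double> & data, std::size_t rowBegin, std::size_t rowEnd)
    {
        assert(rowBegin <= rowEnd && rowEnd * nFeatures <= data.size());
        for (std::size_t r = rowBegin; r < rowEnd; ++r)
        {
            const double * x = data.data() + r * nFeatures;
            for (std::size_t i = 0; i < nFeatures; ++i)
            {
                sums[i] += x[i];
                for (std::size_t j = 0; j < nFeatures; ++j) crossProduct[i * nFeatures + j] += x[i] * x[j];
            }
        }
        nObservations += rowEnd - rowBegin;
    }

    // Joining two partial results is element-wise addition.
    void join(const CovPartial & other)
    {
        assert(nFeatures == other.nFeatures);
        nObservations += other.nObservations;
        for (std::size_t i = 0; i < sums.size(); ++i) sums[i] += other.sums[i];
        for (std::size_t i = 0; i < crossProduct.size(); ++i) crossProduct[i] += other.crossProduct[i];
    }

    // Sample covariance: (crossProduct - sums * sums^T / n) / (n - 1).
    std::vector<double> finalize() const
    {
        assert(nObservations > 1);
        const double n = static_cast<double>(nObservations);
        std::vector<double> cov(nFeatures * nFeatures);
        for (std::size_t i = 0; i < nFeatures; ++i)
            for (std::size_t j = 0; j < nFeatures; ++j)
                cov[i * nFeatures + j] = (crossProduct[i * nFeatures + j] - sums[i] * sums[j] / n) / (n - 1.0);
        return cov;
    }
};
```

Because join is plain addition, partial results from different threads can be combined pairwise in parallel instead of being folded one by one on a single thread.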

This PR depends on #3159 because CovarianceReducer uses TArrayScalableCalloc to store the partial results.
The performance of TArrayScalableCalloc is currently not optimal due to unaligned memory stores, which take up to 45% of the total Covariance compute time. PR #3159 fixes the issue.

PR completeness and readability

I have reviewed my changes thoroughly before submitting this pull request.
I have commented my code, particularly in hard-to-understand areas.
I have updated the documentation to reflect the changes or created a separate PR with update and provided its number in the description, if necessary.
Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
I have added a respective label(s) to PR if I have a permission for that.
I have resolved any merge conflicts that might occur with the base branch.

Testing

I have run it locally and tested the changes extensively.
All CI jobs are green or I have provided justification why they aren't.

Performance

I have measured performance for affected algorithms using scikit-learn_bench and provided at least summary table with measured data, if performance change is expected.
I have provided justification why performance has changed or why changes are not expected.
I have provided justification why quality metrics have changed or why changes are not expected.
I have extended benchmarking suite and provided corresponding scikit-learn_bench PR if new measurable functionality was introduced in this PR.

david-cortes-intel · 2025-03-18T08:50:05Z

cpp/daal/src/algorithms/covariance/covariance_impl.i

-            {
-                return;
-            }
+        tls_data_t<algorithmFPType, cpu> result = tbb::parallel_reduce(


Wouldn't it be better to make a NUMA-aware version of this function that would first reduce within nodes and then globally?

Vika-F · 2025-03-20

@david-cortes-intel Thank you for looking into this.

I would prefer to go step-by-step with NUMA.
If a non-NUMA version is enough to improve performance, I would be more likely to add it first.
If not, then I'll probably add NUMA awareness to the primitives.

I also have to say that this version of the code is just an initial quick and dirty commit. The code will still change a lot. At a minimum, it has to pass testing, and all the tbb:: code has to move into the threading layer.
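For context, the two-level reduction suggested in the review could be sketched as follows. The code only shows the order of the joins and runs sequentially; the thread-to-node mapping is passed in as plain data, no real NUMA API is used, and the function name twoLevelSum is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// threadPartials[t] is the partial result of thread t;
// nodeOfThread[t] is the (hypothetical) NUMA node that thread t ran on.
double twoLevelSum(const std::vector<double> & threadPartials, const std::vector<std::size_t> & nodeOfThread,
                   std::size_t nNodes)
{
    assert(threadPartials.size() == nodeOfThread.size());
    std::vector<double> nodePartials(nNodes, 0.0);

    // Stage 1: join partial results within each node, so that memory traffic
    // stays local to the node that produced the data.
    for (std::size_t t = 0; t < threadPartials.size(); ++t)
    {
        assert(nodeOfThread[t] < nNodes);
        nodePartials[nodeOfThread[t]] += threadPartials[t];
    }

    // Stage 2: join the per-node results globally; only nNodes values cross node boundaries.
    double total = 0.0;
    for (std::size_t node = 0; node < nNodes; ++node) total += nodePartials[node];
    return total;
}
```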

Vika-F added 2 commits March 17, 2025 09:16

Initial commit

4a46802

Minor fix

f0e3045

david-cortes-intel reviewed Mar 18, 2025

Vika-F added the enhancement label Mar 20, 2025

Vika-F added 4 commits March 21, 2025 07:08

Fix copy constructor and assignment operator in covariance::internal::tls_data_t struct

8ebbd4a

Implement rule-of-zero in covariance::internal::tls_data_t

cf9112b

Replace unaligned memory copy with aligned memory copy when possible in DinamicArray

c10597b

Use 'Body' form for tbb::parallel_reduce

d2cbbc3

Vika-F mentioned this pull request Mar 31, 2025

Enhancement: NUMA-aware threading on CPU #3053

Closed

13 tasks

Vika-F added 13 commits March 31, 2025 05:12

Move TBB-specific code into threading.cpp

1a712d2

Fixes

8013194

Fix a typo

4664857

calng-format; std::uint64_t -> DAAL_UNIT64 in DynamicArray

a239d73

std::uint32_t -> unsigned int

9853f53

ABI and win build fixes

4e5660e

Refactoring

d773279

Merge pull request #41 from uxlfoundation/main

b148447

Pull changes from main branch

Add missing files

ab9ee4d

clang-format

688ce5d

Remove unnecessary changes

4adb408

Minor fix

1fe4813

Remove unnecessary changes

bc3ea97

Vika-F added the perf Performance optimization label Apr 2, 2025

Vika-F added 4 commits April 3, 2025 07:25

Add static parallel reduce primitive

e4d7062

Fix a typo

c41ca9e

Changes in memory managemant were moved to a separate PR

071a0b1

clang-format

5d8cb7d

Vika-F mentioned this pull request Apr 11, 2025

Use aligned loads and stores where possible in DAAL memory management #3159

Open

10 tasks
