
Conversation

PawelSwider2000

Implementation of kernels for complex datatype support for 4 ops: mm, bmm, addmm, baddbmm, using OneMKL.

The current implementation of these ops for XPU lives in pytorch/aten/src/ATen/native/mkldnn/xpu/Blas.cpp. Since OneMKL is a torch-xpu-ops dependency and is available only with USE_ONEMKL_XPU=ON (the default), the implementation needs to live in torch-xpu-ops, and the kernels and the TORCH_LIBRARY_IMPL registration are wrapped in an ifdef macro to avoid compilation errors when OneMKL is not available. The newly declared ops are called from the existing torch implementation through c10::Dispatcher.
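
A minimal sketch of the registration pattern this describes (the op name `torch_xpu_ops::mm_complex`, its schema, and `mm_complex_kernel` are illustrative placeholders, not the identifiers used in this PR):

```cpp
#include <ATen/core/Tensor.h>
#include <ATen/core/dispatch/Dispatcher.h>
#include <torch/library.h>

#ifdef USE_ONEMKL_XPU
namespace at::native::xpu {

// OneMKL-backed complex mm kernel; the actual gemm call is omitted in this sketch.
at::Tensor& mm_complex_kernel(const at::Tensor& self, const at::Tensor& mat2, at::Tensor& out) {
  // oneapi::mkl::blas::...::gemm(...) would go here.
  return out;
}

// Schema and XPU implementation are registered only when OneMKL is built in,
// so nothing references OneMKL symbols when USE_ONEMKL_XPU=OFF.
TORCH_LIBRARY_FRAGMENT(torch_xpu_ops, m) {
  m.def("mm_complex.out(Tensor self, Tensor mat2, *, Tensor(a!) out) -> Tensor(a!)");
}
TORCH_LIBRARY_IMPL(torch_xpu_ops, XPU, m) {
  m.impl("mm_complex.out", TORCH_FN(mm_complex_kernel));
}

} // namespace at::native::xpu
#endif // USE_ONEMKL_XPU

// Caller side in the existing XPU Blas implementation: the op is resolved through
// c10::Dispatcher at runtime, so the call site itself needs no ifdef.
at::Tensor& call_mm_complex(const at::Tensor& self, const at::Tensor& mat2, at::Tensor& out) {
  static auto op = c10::Dispatcher::singleton()
      .findSchemaOrThrow("torch_xpu_ops::mm_complex", "out")
      .typed<at::Tensor&(const at::Tensor&, const at::Tensor&, at::Tensor&)>();
  return op.call(self, mat2, out);
}
```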

This is part of: #1853

@Copilot Copilot AI review requested due to automatic review settings August 29, 2025 11:02
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR implements OneMKL-based kernels for complex datatype support in matrix multiplication operations (mm, bmm, addmm, baddbmm) on XPU devices. The implementation provides optimized BLAS operations for complex numbers using OneMKL library integration.

Key changes include:

  • Addition of OneMKL-based complex matrix multiplication kernels
  • Implementation of four core matrix operations with complex number support
  • Conditional compilation support for OneMKL availability


@PawelSwider2000
Author

@CuiYifeng @kbinias Please review

@PawelSwider2000
Author

Follow-up change with tests: #1993

Comment on lines +39 to +42
oneapi::mkl::blas::row_major::gemm(
c10::xpu::getCurrentXPUStream().queue(),
oneapi::mkl::transpose::nontrans,
oneapi::mkl::transpose::nontrans,
Contributor


Please note that the storage of the input tensors and the output tensor is not always row-major.
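
For reference, a minimal sketch of the full row_major::gemm USM call that the quoted lines belong to, with placeholder dimensions and leading dimensions (the PR's actual values come from the tensors' sizes and strides):

```cpp
#include <complex>
#include <cstdint>
#include <oneapi/mkl.hpp>
#include <c10/xpu/XPUStream.h>

// Sketch only: all names here are placeholders, not code from the PR.
void gemm_row_major_sketch(const std::complex<float>* a,
                           const std::complex<float>* b,
                           std::complex<float>* c,
                           std::int64_t m, std::int64_t n, std::int64_t k) {
  // Leading dimensions for densely packed row-major 2-D data.
  const std::int64_t lda = k, ldb = n, ldc = n;
  const std::complex<float> alpha{1.0f, 0.0f}, beta{0.0f, 0.0f};

  // nontrans/nontrans assumes both operands are already stored row-major;
  // that is exactly the assumption which does not always hold for PyTorch tensors.
  oneapi::mkl::blas::row_major::gemm(
      c10::xpu::getCurrentXPUStream().queue(),
      oneapi::mkl::transpose::nontrans,
      oneapi::mkl::transpose::nontrans,
      m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}
```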

Author

@PawelSwider2000 PawelSwider2000 Sep 1, 2025


Yes, we could decide which algorithm (row-major or col-major) to use based on the input/output tensors. However, the implementation would be much more complicated, since the decision has to be made from strides and shapes. For example, one tensor could be row-major and the other col-major, and then deciding what to use becomes more complicated.

Using the row-major path and transposing inputs into that format does lead to worse performance than operating on genuinely row-major data; however, the contiguous, second_col_major, and first_col_major cases are still comparable. Only both_col_major is visibly worse.

I would suggest making these performance improvements in subsequent PRs, as it is much better to have complex support at all first. Also, mm is the simplest of these ops; for baddbmm, an algorithm for efficiently selecting between row-major and col-major could be more complicated (see the sketch below).
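
A rough sketch of the kind of stride-based layout check this would involve (assuming 2-D tensors; the helper names are hypothetical and this is not code from the PR):

```cpp
#include <ATen/core/Tensor.h>

// Hypothetical layout classification for a 2-D tensor. A real implementation
// would also need to handle size-1 dimensions, conjugated tensors, and
// arbitrary strides (e.g. by falling back to .contiguous()).
enum class Layout2d { RowMajor, ColMajor, Other };

inline Layout2d classify(const at::Tensor& t) {
  const auto rows = t.size(0);
  const auto cols = t.size(1);
  if (t.stride(1) == 1 && t.stride(0) == cols) {
    return Layout2d::RowMajor;  // rows are contiguous
  }
  if (t.stride(0) == 1 && t.stride(1) == rows) {
    return Layout2d::ColMajor;  // columns are contiguous (a transposed view)
  }
  return Layout2d::Other;
}
```

The complication mentioned above is that the two inputs and the output can each land in a different bucket, so a single row-major/col-major choice rarely fits all operands.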

Author

@PawelSwider2000 PawelSwider2000 Sep 2, 2025


I made a more detailed comparison against the existing reference on IPEX GPU and found the following performance issues:

  1. Perf degradation when one/both inputs are not row-major
  2. Perf degradation when one/both inputs are conjugated
  3. Worse performance for smaller sizes

The observed differences in perf are large; in some cases the current implementation is a few times slower than the reference.

@CuiYifeng do you know if there is a performance difference between row-major and column-major implementations?

The other issue I noticed is that for addmm and baddbmm some tests fail, e.g. TestCommonXPU::test_noncontiguous_samples_addmm_xpu_complex64, which was passing with the implementation proposed in this PR.

Author

@PawelSwider2000 PawelSwider2000 Sep 2, 2025


To be more precise about perf: for 4096x4096 tensors with contiguous, non-conjugated inputs the speedup of this implementation is 1.028, and it is similar when both inputs are conjugated.

The same measurement for 256x256 tensors is around 0.565.

For 4096x4096 tensors with both inputs column-major we get 0.255.

When only one tensor is column-major, the differences are smaller but still large.

Contributor

@CuiYifeng CuiYifeng Sep 3, 2025


Different memory layouts may lead to performance differences. In a complex MatMul kernel, data reordering for different layouts may also introduce differences.

Comment on lines +90 to +93
oneapi::mkl::blas::row_major::gemm_batch(
c10::xpu::getCurrentXPUStream().queue(),
oneapi::mkl::transpose::nontrans,
oneapi::mkl::transpose::nontrans,
Contributor


Ditto.

Comment on lines +173 to +176
oneapi::mkl::blas::row_major::gemm(
c10::xpu::getCurrentXPUStream().queue(),
oneapi::mkl::transpose::nontrans,
oneapi::mkl::transpose::nontrans,
Contributor


Ditto.

Comment on lines +239 to +242
oneapi::mkl::blas::row_major::gemm_batch(
c10::xpu::getCurrentXPUStream().queue(),
oneapi::mkl::transpose::nontrans,
oneapi::mkl::transpose::nontrans,
Contributor


Ditto.
