From 9628bd01b8ff255b8e3adef9e1f2d242dcef4fdb Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 9 Jan 2023 13:43:06 -0800 Subject: [PATCH 01/51] - Remove the general query from TODO list as an example is added to the llvm-test-suite - Add get coord API and remove it from TODO list - Remove the local memory future API looking as it is no more relevant --- .../sycl_ext_intel_matrix.asciidoc | 49 +++++++++++++++++++ .../sycl_ext_oneapi_matrix.asciidoc | 37 -------------- 2 files changed, 49 insertions(+), 37 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 883c73c655217..a63a61a9b8d4f 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -151,5 +151,54 @@ The table below provides a list of the combinations that `joint_matrix` implemen | bf16 | bf16 | fp32 | +<=+ 8 | 16 | 16 |====================== + +## WI data to joint matrix mapping coordinates information for piece-wise operations +The indexing provided inside the `wi_data` class accesses only the portion of the matrix held by the current WI. It is not possible to know the location of this portion in the joint matrix. This coordinates mapping is implementation defined and changes from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping information is needed to reason about the matrix view. +Within the joint matrix extension, we want to write, as much as possible, one code to run on different backends. If backend X states that a WI owns one exact row of the matrix for instance, writing the following code will work only on that backend for that version of hardware. 
If a different implementation is used, the same WI may own only half of the row if, for example, the SG size increased.
+The following code locally calculates the sum of rows of matrix `joint_matrix tA;`. In this example, we assume each WI owns one column of `tA` and the sub-group size is 8. `data.length` returns 8 elements per WI.
+
+[frame="none",options="header"]
+|======================
+| a00 | a01 | a02 |a03 | a04 | a05|a06|a07|a08|a09|a010|a011|a012|a013|a014|a015
+| a10 | a11 | a12 |a13 | a14 | a15|a16|a17|a18|a19|a110|a111|a112|a113|a114|a115
+| a20 | a21 | a22 |a23 | a24 | a25|a26|a27|a28|a29|a210|a211|a212|a213|a214|a215
+| a30 | a31 | a32 |a33 | a34 | a35|a36|a37|a38|a39|a310|a311|a312|a313|a314|a315
+| a40 | a41 | a42 |a43 | a44 | a45|a46|a47|a48|a49|a410|a411|a412|a413|a414|a415
+| a50 | a51 | a52 |a53 | a54 | a55|a56|a57|a58|a59|a510|a511|a512|a513|a514|a515
+| a60 | a61 | a62 |a63 | a64 | a65|a66|a67|a68|a69|a610|a611|a612|a613|a614|a615
+| a70 | a71 | a72 |a73 | a74 | a75|a76|a77|a78|a79|a710|a711|a712|a713|a714|a715
+|======================
+
+
+```c++
+auto data = get_wi_data(sg, tA);
+// each WI calculates local sum of rows
+for (int row = 0; row < 8; row++) {
+  for (int i = 0; i < data.length()/8; ++i) { // WI owns 1 element per row
+    sum_of_local_rows[row] += data[i+row];
+  }
+}
+```
+The code above assumes knowledge of how the joint matrix is distributed across the different work-items, and it no longer holds under a different distribution. In order to be agnostic to this distribution, instead of hard-coding this mapping, we use general backend- and target-agnostic functionality. To this end, a new method is added to `wi_element` to query this mapping.
+
+```c++
+namespace sycl::ext::intel::experimental::matrix {
+  std::tuple<size_t, size_t> get_coord();
+}
+```
+
+`get_coord` returns the [row,col] coordinates of the current `wi_element` object of the joint matrix.
The code above results into the following: + + +```c++ +auto data = get_wi_data(sg, tA); +// each WI calculates local sum of rows +for (int i = 0; i < data.length(); ++i) { + auto [row, col] = data[i].get_coord(); + sum_of_local_rows[row] += data[i]; +} +``` + + ## Open Questions - Should the same class, `joint_matrix`, handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case)? Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes that can be variable. The ability to define only one interface for both would make it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes. But since this was not implemented at all, we decided to remove it. We can revisit this design choice if this comes up as part of a customer request or if SPIRV matrix extension extends its support to dynamic sizes. diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 4c7214ab56e7a..290e702cf5479 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -546,43 +546,6 @@ joint_matrix sub_c; //Remainder handling ``` -## Future-looking API - -### Memory scope -The current experimental API uses `joint_` semantics to define the memory scope of the matrix. The long term solution is to use the proposed link:../supported/sycl_ext_oneapi_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below. 
- - -```c++ -multi_ptr, address_space::local_space> tA_ptr = group_local_memory>(sg); -``` -We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet. - -### WI data to joint matrix mapping coordinates information for piece-wise operations -The indexing provided inside the `wi_data` class accesses only the portion of the matrix held by the current WI. It is not possible to know the location of this portion in the original matrix. This coordinates mapping is implementation defined and changes from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping information is needed to reason about the matrix view. -Within the joint matrix extension, we want to write, as much as possible, one code to run on different backends. If backend X states that a WI owns one exact row of the matrix for instance, writing the following code will work only on that backend for that version of hardware. If a different hardware and implementation is used, the same WI may own only half of the row if, for example, the SG size increased. - -```c++ -auto data = get_wi_data(sg, C); -for (int i = 0; i < data.length(); ++i) { - sum_of_local_rows[row] += data[i]; -} -``` - -We want to keep backward compatibility in the joint matrix code when implementations or hardware change. To that end, instead of hard-coding this mapping, we use general backend and target-agnostic functionality, especially in the JIT compilation mode of SYCL. For this reason we would like to be able to query this mapping so that code does not have to change from one version to the other. 
- -So for the mapping problem, since this mapping is implementation-defined, one of the proposals is to add runtime functions like: -```c++ -auto data = get_wi_data(sg, C); -for (int i = 0; i < data.length; ++i) { - auto row, col = data[i].get_coord(); - sum_of_local_rows[row] += data[i]; -} -``` - -## TODO List -- Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. -- Add a more realistic and complete example that shows the value of the general query. - ## Revision History From 39875df06db3419bcfb8dc75efd749b23b0d922f Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 9 Jan 2023 14:13:26 -0800 Subject: [PATCH 02/51] add an other distribution example --- .../sycl_ext_intel_matrix.asciidoc | 43 +++++++++++++++---- 1 file changed, 35 insertions(+), 8 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index a63a61a9b8d4f..c2adf24266f9e 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -1,4 +1,4 @@ -# Additional Intel-only specifics about matrix extension for DPC++ +# Intel-specific matrix features :source-highlighter: coderay :coderay-linenums-mode: table @@ -153,11 +153,11 @@ The table below provides a list of the combinations that `joint_matrix` implemen ## WI data to joint matrix mapping coordinates information for piece-wise operations -The indexing provided inside the `wi_data` class accesses only the portion of the matrix held by the current WI. It is not possible to know the location of this portion in the joint matrix. This coordinates mapping is implementation defined and changes from one backend to the other. 
For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping information is needed to reason about the matrix view.
-Within the joint matrix extension, we want to write, as much as possible, one code to run on different backends. If backend X states that a WI owns one exact row of the matrix for instance, writing the following code will work only on that backend for that version of hardware. If a different implementation is used, the same WI may own only half of the row if, for example, the SG size increased.
-The following code locally calculates sum of rows of matrix `joint_matrix tA;`. In this example, we assume each WI owns 1 columns of `tA` and the sub-group size is 8. `data.length` returns 8 elements per WI.
+The indexing provided inside the `wi_data` class accesses only the portion of the matrix held by the current WI. It is not possible to know the location of this portion in the joint matrix because the coordinate mapping is implementation defined and changes from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping information is needed to reason about the matrix view.
+The joint matrix extension aims to enable writing, as much as possible, a single code that runs on different backends. If backend X states, for instance, that a WI owns one exact column of the matrix, writing the following code will work only on that backend for that version of hardware. If a different implementation is used, the same WI may own only half of the column if, for example, the SG size increased.
+The following code locally calculates the sum of rows of matrix `joint_matrix tA;`. In this example, we assume each WI owns 2 successive columns of `tA` and the sub-group size is 8. `data.length` returns 16 elements per WI.
-[frame="none",options="header"] +[frame="none"] |====================== | a00 | a01 | a02 |a03 | a04 | a05|a06|a07|a08|a09|a010|a011|a012|a013|a014|a015 | a10 | a11 | a12 |a13 | a14 | a15|a16|a17|a18|a19|a110|a111|a112|a113|a114|a115 @@ -170,16 +170,43 @@ The following code locally calculates sum of rows of matrix `joint_matrix Date: Mon, 9 Jan 2023 18:21:03 -0800 Subject: [PATCH 03/51] add revision history --- .../sycl_ext_intel_matrix.asciidoc | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index c2adf24266f9e..e9aee358cf91b 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -176,6 +176,7 @@ WI0 elements returned in `data` are the first 2 columns as follows: [frame="none"] |====================== | a00 | a01 | a10 |a11 | a20 | a21|a30|a31|a40|a41|a50|a51|a60|a61|a70|a71 +|====================== ```c++ @@ -187,6 +188,7 @@ for (int row = 0; row < 8; row++) { } } ``` + The code above assumes knowledge of the distribution of the joint matrix across the different work-items. This is different when a different distribution happens. 
For instance, if we assume a round-robin distribution of the joint matrix elements among the work-items, WI0 elements returned in `data` are the first and 8th columns as follows: @@ -194,8 +196,11 @@ For instance, if we assume a round-robin distribution of the joint matrix elemen [frame="none"] |====================== | a00 | a10 | a20 |a30 | a40 | a50|a60|a70|a08|a18|a28|a38|a48|a58|a68|a78 +|====================== + The code becomes: + ```c++ auto data = get_wi_data(sg, tA); // each WI calculates local sum of rows @@ -229,3 +234,12 @@ for (int i = 0; i < data.length(); ++i) { ## Open Questions - Should the same class, `joint_matrix`, handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case)? Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes that can be variable. The ability to define only one interface for both would make it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes. But since this was not implemented at all, we decided to remove it. We can revisit this design choice if this comes up as part of a customer request or if SPIRV matrix extension extends its support to dynamic sizes. + +## Revision History + +[frame="none",options="header"] +|====================== +|Rev |Date |Author |Changes +|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API and layout information. +|2 |2023-01-09 |Dounia Khaldi |Add `get_coord` API. 
+|====================== From 8bb98c1d961e89dc59dfa2a1267cab7644f1bff4 Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 9 Jan 2023 18:35:28 -0800 Subject: [PATCH 04/51] Bader comments --- .../sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc | 8 +------- .../sycl_ext_oneapi_matrix.asciidoc | 4 ++-- 2 files changed, 3 insertions(+), 9 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index e9aee358cf91b..513426fc5e275 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -169,8 +169,6 @@ The following code locally calculates sum of rows of matrix `joint_matrix Date: Tue, 10 Jan 2023 08:31:40 -0800 Subject: [PATCH 05/51] better wording --- .../sycl_ext_intel_matrix.asciidoc | 33 ++++++++++--------- 1 file changed, 18 insertions(+), 15 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 513426fc5e275..7d2eaeb725a3e 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -152,10 +152,22 @@ The table below provides a list of the combinations that `joint_matrix` implemen |====================== -## WI data to joint matrix mapping coordinates information for piece-wise operations +## WI data to joint matrix mapping coordinates The indexing provided inside the `wi_data` class accesses only the portion of the matrix held by the current WI. 
It is not possible to know the location of this portion in the joint matrix because the coordinates mapping is implementation defined and changes from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping information is needed to reason about the matrix view. -The joint matrix extension aims to enable writing, as much as possible, one code to run on different backends. If backend X states that a WI owns one exact column of the matrix for instance, writing the following code will work only on that backend for that version of hardware. If a different implementation is used, the same WI may own only half of the row if, for example, the SG size increased. -The following code locally calculates sum of rows of matrix `joint_matrix tA;`. In this example, we assume each WI owns 2 successive columns of `tA` and the sub-group size is 8. `data.length` returns 16 elements per WI. +The joint matrix extension aims to enable writing, as much as possible, one code to run on different backends. If backend X states that a WI owns one exact column of the matrix for instance, writing code like sum of rows or sum of columns will only work on that backend for that version of hardware. If a different implementation is used, the same WI may own only half of the row if, for example, the SG size increased. + +The following code locally calculates sum of rows of matrix `joint_matrix tA;`. In this example, we assume that each WI owns 2 successive columns of `tA` and the sub-group size is 8. `data.length` returns 16 elements per WI. 
+ +```c++ +auto data = get_wi_data(sg, tA); +// each WI calculates local sum of rows +for (int row = 0; row < 8; row++) { + for (int i = 0; i < data.length()/8; ++i) {//WI owns 2 element per row + sum_of_local_rows[row] += data[i+row*2]; + } +} +``` +`tA` matrix can be visualized as follows: [frame="none"] |====================== @@ -169,26 +181,17 @@ The following code locally calculates sum of rows of matrix `joint_matrix Date: Thu, 12 Jan 2023 09:39:49 -0800 Subject: [PATCH 06/51] Incorporate Greg comments and other improvements, specifically: - Put all combinations in appendix - move get_coord to the main document - Correct the example by converting USM pointers to multi_ptr --- .../sycl_ext_intel_matrix.asciidoc | 102 ------------------ .../sycl_ext_oneapi_matrix.asciidoc | 75 +++++++++---- 2 files changed, 56 insertions(+), 121 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 7d2eaeb725a3e..62306b0840789 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -128,107 +128,6 @@ The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the ca // --------------------------------- // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4 -## Supported Combinations Per Hardware - -The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class. 
- -### Intel AMX Supported Combinations - -[frame="none",options="header"] -|====================== -| A type | B type | Accumulator type | M | N | K -| (u)int8_t | (u)int8_t | int32_t | +<=+ 16 | +<=+ 16 | +<=+ 64 -| bf16 | bf16 | fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 -|====================== - -### Intel XMX Supported Combinations - -[frame="none",options="header"] -|====================== -| A type | B type | Accumulator type | M | N | K -| (u)int8_t | (u)int8_t | int32_t | +<=+ 8 | 16 | 32 -| fp16 | fp16 | fp32 | +<=+ 8 | 16 | 16 -| bf16 | bf16 | fp32 | +<=+ 8 | 16 | 16 -|====================== - - -## WI data to joint matrix mapping coordinates -The indexing provided inside the `wi_data` class accesses only the portion of the matrix held by the current WI. It is not possible to know the location of this portion in the joint matrix because the coordinates mapping is implementation defined and changes from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping information is needed to reason about the matrix view. -The joint matrix extension aims to enable writing, as much as possible, one code to run on different backends. If backend X states that a WI owns one exact column of the matrix for instance, writing code like sum of rows or sum of columns will only work on that backend for that version of hardware. If a different implementation is used, the same WI may own only half of the row if, for example, the SG size increased. - -The following code locally calculates sum of rows of matrix `joint_matrix tA;`. In this example, we assume that each WI owns 2 successive columns of `tA` and the sub-group size is 8. `data.length` returns 16 elements per WI. 
- -```c++ -auto data = get_wi_data(sg, tA); -// each WI calculates local sum of rows -for (int row = 0; row < 8; row++) { - for (int i = 0; i < data.length()/8; ++i) {//WI owns 2 element per row - sum_of_local_rows[row] += data[i+row*2]; - } -} -``` -`tA` matrix can be visualized as follows: - -[frame="none"] -|====================== -| a00 | a01 | a02 |a03 | a04 | a05|a06|a07|a08|a09|a010|a011|a012|a013|a014|a015 -| a10 | a11 | a12 |a13 | a14 | a15|a16|a17|a18|a19|a110|a111|a112|a113|a114|a115 -| a20 | a21 | a22 |a23 | a24 | a25|a26|a27|a28|a29|a210|a211|a212|a213|a214|a215 -| a30 | a31 | a32 |a33 | a34 | a35|a36|a37|a38|a39|a310|a311|a312|a313|a314|a315 -| a40 | a41 | a42 |a43 | a44 | a45|a46|a47|a48|a49|a410|a411|a412|a413|a414|a415 -| a50 | a51 | a52 |a53 | a54 | a55|a56|a57|a58|a59|a510|a511|a512|a513|a514|a515 -| a60 | a61 | a62 |a63 | a64 | a65|a66|a67|a68|a69|a610|a611|a612|a613|a614|a615 -| a70 | a71 | a72 |a73 | a74 | a75|a76|a77|a78|a79|a710|a711|a712|a713|a714|a715 -|====================== - -WI0 elements returned in `data` are the first 2 columns of `tA` and can be visualized as follows: - -[frame="none"] -|====================== -| a00 | a01 | a10 |a11 | a20 | a21|a30|a31|a40|a41|a50|a51|a60|a61|a70|a71 -|====================== - - -The code above assumes knowledge of the distribution of the joint matrix across the different work-items. The same code will be different when a different distribution happens. - -For instance, if we assume a round-robin distribution of the joint matrix elements among the work-items. 
In this case, WI0 elements returned in `data` are the first and the 8th columns as follows: - -[frame="none"] -|====================== -| a00 | a10 | a20 |a30 | a40 | a50|a60|a70|a08|a18|a28|a38|a48|a58|a68|a78 -|====================== - -The code becomes: - -```c++ -auto data = get_wi_data(sg, tA); -// each WI calculates local sum of rows -for (int row = 0; row < 8; row++) { - for (int i = 0; i < data.length()/8; ++i) {//WI owns 2 element per row - sum_of_local_rows[row] += data[i*8+row]; - } -} -``` - -In order to make element-wise operations on joint matrices agnostic to this distribution, instead of hard-coding this mapping, we use general backend and target-agnostic functionality. To this end, a new method is added to 'wi_element' to query this mapping. - -```c++ -namespace sycl::ext::intel::experimental::matrix { - std::tuple get_coord(); -} -``` - -`get_coord` returns [row,col] coordinates of the current object `wi_element` of the joint matrix. The code above results into the following: - -```c++ -auto data = get_wi_data(sg, tA); -// each WI calculates local sum of rows -for (int i = 0; i < data.length(); ++i) { - auto [row, col] = data[i].get_coord(); - sum_of_local_rows[row] += data[i]; -} -``` - ## Open Questions - Should the same class, `joint_matrix`, handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case)? Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes that can be variable. The ability to define only one interface for both would make it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes. But since this was not implemented at all, we decided to remove it. 
We can revisit this design choice if this comes up as part of a customer request or if SPIRV matrix extension extends its support to dynamic sizes. @@ -238,5 +137,4 @@ for (int i = 0; i < data.length(); ++i) { |====================== |Rev |Date |Author |Changes |1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API and layout information. -|2 |2023-01-09 |Dounia Khaldi |Add `get_coord` API. |====================== diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 25b27152bff0e..6fe0a704d1eba 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -216,9 +216,7 @@ We introduce a new function `get_wi_data` that provides a view of the portion of Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group as this is implementation defined and changes from one backend to the other. For general piece-wise operations such as summing the rows of a matrix, the WI data to joint matrix mapping coordinates information must be known in order to reason about the matrix view and extract the relevant piece. However, for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees the whole matrix element-wise operation. -Therefore, this extension currently only supports class 1 of operations because the mapping between `get_wi_data` and `joint_matrix` elements is not required to be known for these operations. 
However, general piece-wise operations will be supported in the future as a new API will be provided to convey the mapping from `joint_matrix` domain to WI domain (See Section "WI data to joint matrix mapping coordinates information for piece-wise operations" for more information).
-
-Also, note that `get_wi_data` cannot return a fixed size array length because the length of the WI portion is a runtime variable for the following reasons:
+Note that `get_wi_data` cannot return a fixed size array length because the length of the WI portion is a runtime variable for the following reasons:
 
 1- The main compilation mode of SYCL is JIT compilation and partitioning among WIs is implementation defined.
 
@@ -241,7 +239,8 @@
 template <typename Group, typename T, size_t NumRows, size_t NumCols,
           use Use, layout Layout>
 struct wi_element {
+  std::tuple<size_t, size_t> get_coord();
 };
 }
 ```
@@ -258,7 +257,21 @@ for (int i = 0; i < wi_data_c.length(); i++)
 
 IMPORTANT: In the current implementation, only the `sub_group` scope is supported.
 
-IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet.
+##### Work-item data to joint matrix mapping coordinates
+The `wi_data` and `wi_element` classes provide access to the matrix elements that are local to the calling work-item. However, the distribution of matrix elements to each work-item is implementation-defined, so application code cannot assume any fixed distribution. Instead, application code can use the `get_coord` method to query the matrix coordinates of an individual `wi_element`.
+
+`get_coord` returns the [row,col] coordinates of the current `wi_element` object within the joint matrix. Using it, the sum of rows example can be written as follows:
+
+```c++
+auto data = get_wi_data(sg, tA);
+// each WI calculates local sum of rows
+for (int i = 0; i < data.length(); ++i) {
+  auto [row, col] = data[i].get_coord();
+  sum_of_local_rows[row] += data[i];
+}
+```
+
+IMPORTANT: `get_coord` is not implemented yet.
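To see why coordinate-based indexing makes the reduction distribution-agnostic, the following standalone host-side C++ sketch can help. It is not SYCL code: the `Elem` struct and the two packing orders are assumptions made purely for illustration of what a `get_coord`-style query enables. A sum of rows keyed on coordinates gives the same result no matter how the implementation packs a work-item's slice.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Host-side stand-in for a wi_element that can report its coordinates,
// mimicking what a get_coord()-style query would return.
struct Elem {
  int value;
  std::pair<std::size_t, std::size_t> coord; // (row, col)
};

// Sum of rows over one WI's slice. Because it keys on coordinates rather
// than on the position inside the slice, the packing order is irrelevant.
std::array<int, 8> local_sum_of_rows(const std::vector<Elem> &wi_data) {
  std::array<int, 8> sums{};
  for (const Elem &e : wi_data)
    sums[e.coord.first] += e.value;
  return sums;
}
```

The same slice packed row-by-row or column-by-column produces identical row sums, which is exactly the property the hard-coded indexing examples lack.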
## Example using int8_t type ```c++ @@ -282,22 +295,27 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) joint_matrix tC; joint_matrix_fill(sg, tC, 0); for (int k = 0; k < K; k += tK) { - joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K); - joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N); + joint_matrix_load(sg, tA, + multi_ptr(memA) + + sg_startx * tM * K + k, K); + joint_matrix_load(sg, tB, + multi_ptr(memB) + + k * N + sg_starty/SG_SIZE*tN, N); tC = joint_matrix_mad(sg, tA, tB, tC); } auto wi_data_c = get_wi_data(sg, tC); for (int i = 0; i < wi_data_c.length(); i++) - wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C - joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); + wi_data_c[i] *= alpha; + joint_matrix_store(sg, tC, + multi_ptr(memC) + + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); }).wait(); ``` == Query Interface -Intel AMX, Intel XMX and Nvidia TPUs support different sizes and types. -The query interface is used to validate user code and inform them about supported types, sizes, scope, and layouts by the implementation. -This also offers development and tuning productivity by both scientists and library developers. The query interface we are proposing here is a compile-time query, -so there will be no runtime errors. +Intel AMX, Intel XMX and Nvidia TPUs support different sizes and types (see Appendix: Supported Combinations Per Hardware). The query interface is used to validate user code and inform them about supported types, sizes, scope, and layouts by the implementation. +This also offers development and tuning productivity by both scientists and library developers. The query interface we are proposing here is a compile-time query, so there will be no runtime errors. 
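The compile-time flavor of such a query can be sketched in plain C++. The trait and class names below are illustrative stand-ins, not the actual `tpu_params` interface: a table of supported type combinations drives a `static_assert`, so an unsupported combination is rejected during compilation rather than at run time.

```cpp
#include <cstdint>
#include <type_traits>

// Illustrative traits table: which (A, B, Accumulator) type combinations a
// hypothetical device supports. A real implementation would also encode
// sizes, scopes, and layouts.
template <typename Ta, typename Tb, typename Tc>
struct is_supported_combination : std::false_type {};

template <>
struct is_supported_combination<std::int8_t, std::int8_t, std::int32_t>
    : std::true_type {};

// Hypothetical params class in the spirit of the query interface: it refuses
// to instantiate for combinations the table does not list.
template <typename Ta, typename Tb, typename Tc>
struct matrix_params_sketch {
  static_assert(is_supported_combination<Ta, Tb, Tc>::value,
                "unsupported (A, B, Accumulator) type combination");
  using a_type = Ta;
  using b_type = Tb;
  using c_type = Tc;
};
```

Instantiating `matrix_params_sketch<int8_t, int8_t, int32_t>` compiles, while any combination missing from the table produces a clear compile-time diagnostic — the same user experience the validation functionality aims for.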
+ The query interface proposed here consists of three functionalities: - Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters. @@ -330,8 +348,6 @@ The table below provides a description for each of the member variables and type |`num_combinations`| validation, default values, general query|indicates number of combinations supported by the TPU implementation which corresponds to the size of the `combinations` array |====================== - - ```c++ namespace sycl::ext::oneapi::experimental::matrix { template @@ -472,7 +488,6 @@ struct tpu_params { sizeof(combinations) / sizeof(combination); }; - enum class tpu { xmx8, xmx16, @@ -505,8 +520,6 @@ enum class scope_t { }; } ``` - - === Validation Example: ```c++ // User can provide sizes besides the types and tpu_params can assert if they are supported or not @@ -546,6 +559,29 @@ joint_matrix sub_c; //Remainder handling ``` +## Appendix: Supported Combinations Per Hardware + +The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class. 
+ +### Intel AMX Supported Combinations + +[frame="none",options="header"] +|====================== +| A type | B type | Accumulator type | M | N | K +| (u)int8_t | (u)int8_t | int32_t | +<=+ 16 | +<=+ 16 | +<=+ 64 +| bf16 | bf16 | fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 +|====================== + +### Intel XMX Supported Combinations + +[frame="none",options="header"] +|====================== +| A type | B type | Accumulator type | M | N | K +| (u)int8_t | (u)int8_t | int32_t | +<=+ 8 | 16 | 32 +| fp16 | fp16 | fp32 | +<=+ 8 | 16 | 16 +| bf16 | bf16 | fp32 | +<=+ 8 | 16 | 16 +|====================== + ## Revision History @@ -556,5 +592,6 @@ joint_matrix sub_c; |2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS |3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support |4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the new matrix use parameter and remove reference to the AOT AMX initial implementation -|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia tensor Cores, and move the Intel-specifics to a separate extension document. +|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia tensor Cores, and move the Intel-specifics to a separate extension document. +|6 |2023-01-09 |Dounia Khaldi |Add `get_coord` API and supported combinations appendix. 
|====================== From 6f915255efa41f5331d9120a644a5140dbfe1c07 Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 30 Jan 2023 10:38:25 -0800 Subject: [PATCH 07/51] Update the specification document to follow the formal template --- .../sycl_ext_intel_matrix.asciidoc | 79 +++++++----- .../sycl_ext_oneapi_matrix.asciidoc | 120 ++++++++++-------- 2 files changed, 120 insertions(+), 79 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 62306b0840789..c64e153531c1f 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -1,8 +1,7 @@ -# Intel-specific matrix features += sycl_ext_oneapi_matrix :source-highlighter: coderay :coderay-linenums-mode: table -:dpcpp: pass:[DPC++] // This section needs to be after the document title. :doctype: book @@ -10,8 +9,7 @@ :toc: left :encoding: utf-8 :lang: en - -:blank: pass:[ +] +:dpcpp: pass:[DPC++] // Set the default source code type in this document to C++, // for syntax highlighting purposes. This is needed because @@ -21,44 +19,68 @@ == Notice -Copyright (c) 2021-2022 Intel Corporation. All rights reserved. +Copyright (c) 2022-2022 Intel Corporation. All rights reserved. NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by permission by Khronos. -This extension is written against the SYCL 2020 revision 5 specification. All +== Contact + +To report problems with this extension, please open a new issue at: + +https://github.com/intel/llvm/issues + +== Dependencies + +This extension is written against the SYCL 2020 revision 6 specification. 
All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision. -**_NOTE:_** This document describes the extra features and details for the implementation of `joint_matrix` extension on Intel AMX and Intel XMX. - This is an initial experimental version to try out functionality -and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. +This extension also depends on the following other SYCL extensions: + +* link:../experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc[ + sycl_ext_oneapi_matrix] -## Introduction +== Status +This is an experimental extension specification, intended to provide early +access to features and gather community feedback. Interfaces defined in this +specification are implemented in {dpcpp}, but they are not finalized and may +change incompatibly in future versions of {dpcpp} without prior notice. +*Shipping software products should not rely on APIs defined in this +specification.* + +== Backend support status +This document describes the extra features and details for the implementation of `joint_matrix` extension on Intel AMX and Intel XMX. + +== Overview The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are additional specifics about the supported layouts that enable extra performance and functionality listed in this document. This extension presents some supplementary Intel AMX and Intel XMX features not contained within the sycl_ext_oneapi_matrix extension. The additional features are built on top of the sycl_ext_oneapi_matrix extension but are only supported by the Intel AMX and Intel XMX backends. 
-## Feature test macro +== Specification + +=== Feature test macro This extension provides a feature-test macro as described in the core SYCL -specification section 6.3.3 "Feature test macros". Therefore, an -implementation supporting this extension must predefine the macro -`SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below. +specification. An implementation supporting this extension must predefine the macro `SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below. Applications can test for the existence of this macro to determine if the implementation supports this feature, or applications can test the macro's value to determine which of the extension's APIs the implementation supports. -[frame="none",options="header"] -|====================== -|Value |Description -|1 |Introduce `packed` layout and extend `joint_matrix_store` to Matrix A and B. -|====================== +[%header,cols="1,5"] +|=== +|Value +|Description + +|1 +|The APIs of this experimental extension are not versioned, so the + feature-test macro always has this value. +|=== -## Extra Functionality +=== Joint Matrix Intel-Specific Matrix Features -### Layout +==== Layout Besides row major and column major layouts, `layout` introduces the custom layout packed layout that refers to the VNNI format descibed in the following section. ```c++ @@ -70,17 +92,16 @@ enum class layout { ``` -### Layout argument in `joint_matrix_load` +==== Layout argument in `joint_matrix_load` `layout` in `joint_matrix_load` can take `packed` as argument to specify that the data has already been transformed into VNNI format (`packed`). in this case, `stride` argument of `joint_matrix_load` describes the number of elements between consecutive rows for packed layouts. In order to get maximum performance on Intel AMX and Intel XMX, prepacking data in the memory is necessary. 
If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below. IMPORTANT: In the current Intel AMX and Intel XMX implementations, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`. -### Store Operation +==== Store Operation Besides store of matrix `accumulator`, the Intel implementation allows store on matrix `a` and `b` as well. -#### Store ```c++ namespace sycl::ext::intel::experimental::matrix { template tA; ``` -## Matrix Operations and their Execution Scope +=== Matrix Operations and their Execution Scope We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation. This set of functions can be easily extended if the matrix hardware implements new features. Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group. @@ -132,7 +152,7 @@ To be aligned with the SYCL 2020 group algorithms, an additional group argument IMPORTANT: In the current implementation, only the `sub_group` scope is supported. 
-#### Load +==== Load ```c++ namespace sycl::ext::oneapi::experimental::matrix { template (G, L), [=](nd_item<2> item) }).wait(); ``` -== Query Interface +=== Query Interface Intel AMX, Intel XMX and Nvidia TPUs support different sizes and types (see Appendix: Supported Combinations Per Hardware). The query interface is used to validate user code and inform them about supported types, sizes, scope, and layouts by the implementation. This also offers development and tuning productivity by both scientists and library developers. The query interface we are proposing here is a compile-time query, so there will be no runtime errors. @@ -520,7 +540,7 @@ enum class scope_t { }; } ``` -=== Validation Example: +==== Validation Example: ```c++ // User can provide sizes besides the types and tpu_params can assert if they are supported or not // in this case, an assertion will happens as 16 is not a supported size for M @@ -529,7 +549,7 @@ size_t NDRangeM = M / myparams::M; //Assertion would happen at this line size_t NDRangeN = N / myparams::N; ``` -=== Default Values Example: +==== Default Values Example: ```c++ using myparams = tpu_params_both; // use this to construct the ranges on the host side @@ -543,7 +563,7 @@ myparams::joint_matrix_accumulator sub_c; ``` -=== General Query Example: +==== General Query Example: ```c++ constexpr int M = 1500; // with msize = 8 and msize = 4, // M can be broken up to 125 sequence of 8-sized ops and remaining 500 using 125 sequence of 4-sized ops @@ -559,11 +579,11 @@ joint_matrix sub_c; //Remainder handling ``` -## Appendix: Supported Combinations Per Hardware +=== Appendix: Supported Combinations Per Hardware The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class. 
-### Intel AMX Supported Combinations +==== Intel AMX Supported Combinations [frame="none",options="header"] |====================== @@ -572,7 +592,7 @@ The table below provides a list of the combinations that `joint_matrix` implemen | bf16 | bf16 | fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 |====================== -### Intel XMX Supported Combinations +==== Intel XMX Supported Combinations [frame="none",options="header"] |====================== @@ -583,7 +603,7 @@ The table below provides a list of the combinations that `joint_matrix` implemen |====================== -## Revision History +=== Revision History [frame="none",options="header"] |====================== From cdcab5a83a0a920aae662c2bea09249efa851ded Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 30 Jan 2023 12:03:43 -0800 Subject: [PATCH 08/51] add tf32 type and conversion function --- .../sycl_ext_intel_matrix.asciidoc | 91 +++- .../sycl_ext_oneapi_matrix.asciidoc | 503 ++++++++++++++---- 2 files changed, 460 insertions(+), 134 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index c64e153531c1f..5484303530cf8 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -51,21 +51,35 @@ change incompatibly in future versions of {dpcpp} without prior notice. specification.* == Backend support status -This document describes the extra features and details for the implementation of `joint_matrix` extension on Intel AMX and Intel XMX. +This document describes the extra features and details for the +implementation of `joint_matrix` extension on Intel AMX and Intel +XMX. 
== Overview -The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are additional specifics about the supported layouts that enable extra performance and functionality listed in this document. -This extension presents some supplementary Intel AMX and Intel XMX features not contained within the sycl_ext_oneapi_matrix extension. The additional features are built on top of the sycl_ext_oneapi_matrix extension but are only supported by the Intel AMX and Intel XMX backends. +The Intel backend implementations on both Intel AMX and Intel XMX +support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, +`joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query +interface, as they are defined in the sycl_ext_oneapi_matrix +extension. There are additional specifics about the supported layouts +that enable extra performance and functionality listed in this +document. +This extension presents some supplementary Intel AMX and Intel XMX +features not contained within the sycl_ext_oneapi_matrix +extension. The additional features are built on top of the +sycl_ext_oneapi_matrix extension but are only supported by the Intel +AMX and Intel XMX backends. == Specification === Feature test macro This extension provides a feature-test macro as described in the core SYCL -specification. An implementation supporting this extension must predefine the macro `SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below. +specification. An implementation supporting this extension must +predefine the macro `SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below. 
Applications can test for the existence of this macro to determine if the
 implementation supports this feature, or applications can test the macro's
-value to determine which of the extension's APIs the implementation supports.
+value to determine which of the extension's APIs the implementation
+supports.
 
 [%header,cols="1,5"]
 |===
@@ -81,7 +95,9 @@ value to determine which of the extension's APIs the implementation supports.
 === Joint Matrix Intel-Specific Matrix Features
 
 ==== Layout
-Besides row major and column major layouts, `layout` introduces the custom layout packed layout that refers to the VNNI format descibed in the following section.
+Besides row major and column major layouts, `layout` introduces the
+custom `packed` layout, which refers to the VNNI format described in
+the following section.
 
 ```c++
 namespace sycl::ext::intel::experimental::matrix {
@@ -93,14 +109,31 @@ enum class layout {
 
 ==== Layout argument in `joint_matrix_load`
 
-`layout` in `joint_matrix_load` can take `packed` as argument to specify that the data has already been transformed into VNNI format (`packed`). in this case, `stride` argument of `joint_matrix_load` describes the number of elements between consecutive rows for packed layouts.
-
-In order to get maximum performance on Intel AMX and Intel XMX, prepacking data in the memory is necessary. If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below.
-
-IMPORTANT: In the current Intel AMX and Intel XMX implementations, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX. 
The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`. +`layout` in `joint_matrix_load` can take `packed` as argument to +specify that the data has already been transformed into VNNI format +(`packed`). in this case, `stride` argument of `joint_matrix_load` +describes the number of elements between consecutive rows for packed +layouts. + +In order to get maximum performance on Intel AMX and Intel XMX, +prepacking data in the memory is necessary. If users did not specify +the packed layouts, transforms done by the implementation will be slow +due to extra scatter/gather operations. Hence, we expose the `packed` +layout to the user to specify that A or B have already been +VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` +section below. + +IMPORTANT: In the current Intel AMX and Intel XMX implementations, the +layout in the load of matrix B (provided by the `layout memL` +parameter below) must be `packed` or `row_major`. Automatic VNNI +transform is supported on AMX. The layout in the load of matrices A +and C must be `row_major`, and the layout in the store of matrix C +(provided by the `layout memL` parameter below) must also be +`row_major`. ==== Store Operation -Besides store of matrix `accumulator`, the Intel implementation allows store on matrix `a` and `b` as well. +Besides store of matrix `accumulator`, the Intel implementation allows +store on matrix `a` and `b` as well. ```c++ namespace sycl::ext::intel::experimental::matrix { @@ -114,11 +147,20 @@ namespace sycl::ext::intel::experimental::matrix { ==== VNNI/Packed Layout -Intel AMX and Intel XMX compute assumes that the B tile register (src1) is in the VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. -The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. 
While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type. +Intel AMX and Intel XMX compute assumes that the B tile register +(src1) is in the VNNI format as they need 32bit of K-data in A and B +to be contiguous in memory. +The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 +in the case of 8-bit types. While the current implementation assumes +that the matrix has been already packed by the user for performance +reasons, the layout information is needed to inform the implementation +about this transformation. The following example illustrates how a +matrix in `row_major` layout is transformed into the `packed` layout +for a 16-bit type. ===== Example 1: 16-bit elements - // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout. + // Example of a 4 row x 4 column matrix using a 16-bit data + element, in row-major layout. // Element a1 is contiguous in memory with element b1, etc. // --------------------------------- // a1, b1, c1, d1 @@ -135,7 +177,8 @@ The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the ca ===== Example 2: 8-bit elements - // Example of a 4 row x 4 column matrix using a 8-bit data element, in row-major layout. + // Example of a 4 row x 4 column matrix using a 8-bit data + element, in row-major layout. // Element a1 is contiguous in memory with element b1, etc. 
// --------------------------------- // a1, b1, c1, d1 @@ -150,12 +193,24 @@ The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the ca // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4 == Issues -- Should the same class, `joint_matrix`, handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case)? Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes that can be variable. The ability to define only one interface for both would make it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes. But since this was not implemented at all, we decided to remove it. We can revisit this design choice if this comes up as part of a customer request or if SPIRV matrix extension extends its support to dynamic sizes. +- Should the same class, `joint_matrix`, handle both cases where sizes +are constant (GPU case) and when sizes are variable (CPU case)? Note +that a Intel AMX 2d tile register permits sizes up to 1024 +(16rowsx64cols) bytes that can be variable. The ability to define only +one interface for both would make it possible to give the user a way +to make use of the flexibility introduced by the CPU but at the same +time save resources on the GPU. In a previous version of the design, +we used `sycl::dynamic_extent` to differentiate between static and +dynamic sizes. But since this was not implemented at all, we decided +to remove it. We can revisit this design choice if this comes up as +part of a customer request or if SPIRV matrix extension extends its +support to dynamic sizes. == Revision History [frame="none",options="header"] |====================== |Rev |Date |Author |Changes -|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API and layout information. 
+|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API and +layout information. |====================== diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index f8f071b345f8c..bdd2db5d83896 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -48,11 +48,18 @@ change incompatibly in future versions of {dpcpp} without prior notice. specification.* == Backend support status -This extension is currently implemented in {dpcpp} only for devices that contain a matrix hardware, specifically Intel(R) Advanced Matrix Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) XMX) and Nvidia(R) Tensor Cores. +This extension is currently implemented in {dpcpp} only for devices +that contain a matrix hardware, specifically Intel(R) Advanced Matrix +Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) +XMX) and Nvidia(R) Tensor Cores. == Overview Joint matrix is a SYCL extension for matrix hardware programming. It -unifies targets like Intel AMX in CPUs, Intel XMX in Intel GPUs and Nvidia Tensor Cores. This provides portable but performant API for users who want to build their own neural networks applications, perform custom optimzations, or experiment with new operations in a timely and performing manner. +unifies targets like Intel AMX in CPUs, Intel XMX in Intel GPUs and +Nvidia Tensor Cores. This provides portable but performant API for +users who want to build their own neural networks applications, +perform custom optimzations, or experiment with new operations in a +timely and performing manner. 
== Specification @@ -60,9 +67,11 @@ unifies targets like Intel AMX in CPUs, Intel XMX in Intel GPUs and Nvidia Tenso This extension provides a feature-test macro as described in the core SYCL specification. An implementation supporting this extension must predefine -the macro `SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the table below. Applications can test for the existence of this macro to determine if the -implementation supports this feature, or applications can test the macro's -value to determine which of the extension's features the implementation supports. +the macro `SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the +table below. Applications can test for the existence of this macro to +determine if the implementation supports this feature, or applications +can test the macro's value to determine which of the extension's +features the implementation supports. [%header,cols="1,5"] |=== @@ -76,18 +85,33 @@ value to determine which of the extension's features the implementation supports === Matrix API Versions -While this document presents the core API that unifies Intel AMX, Intel XMX, and Nvidia Tensor Cores, the implementations support slightly different versions of the API. For this reason, we introduce a new macro, namely `SYCL_EXT_ONEAPI_MATRIX_VERSION` to distinguish between these different implementations. The goal in the next few months is to get rid of this implementation versioning macro. These are the current values for this macro. +While this document presents the core API that unifies Intel AMX, +Intel XMX, and Nvidia Tensor Cores, the implementations support +slightly different versions of the API. For this reason, we introduce +a new macro, namely `SYCL_EXT_ONEAPI_MATRIX_VERSION` to distinguish +between these different implementations. The goal in the next few +months is to get rid of this implementation versioning macro. These +are the current values for this macro. 
[frame="none",options="header"] |====================== |Value |Description -|1 |Initial extension JIT implementation on Intel AMX and Intel XMX. load, store, mad, fill, piece-wise operations, and the query interface are supported. The old API used for this implementation is detailed in link:../../deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc[matrix extension] +|1 |Initial extension JIT implementation on Intel AMX and Intel +XMX. load, store, mad, fill, piece-wise operations, and the query +interface are supported. The old API used for this implementation is +detailed in +link:../../deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc[matrix extension] |3 |Initial implementation on Nvidia Tensor Cores -|4 |JIT implementation on Intel AMX and Intel XMX. load, store, mad, fill, piece-wise operations, and the query interface are supported. Plus, AOT implementation on Nvidia tensor Cores +|4 |JIT implementation on Intel AMX and Intel XMX. load, store, +mad, fill, piece-wise operations, and the query interface are +supported. Plus, AOT implementation on Nvidia Tensor Cores |====================== === New `joint_matrix` class -We introduce a new class called `joint_matrix`. The user needs to specify the group memory scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description: +We introduce a new class called `joint_matrix`. The user needs to +specify the group memory scope, the type of the elements, the shape, +the matrix use, and the memory layout of the matrix. 
This results in +the following description: ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -99,10 +123,13 @@ struct joint_matrix { } ``` -IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrix with `use::accumulator` +IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only +to matrix with `use::accumulator` ==== Use -Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator +(C)+ is required by backend implementations to reason about the layout of the matrix in registers. +Specifying the usage of the matrix: matrix left (A), matrix right (B) +or accumulator +(C)+ is required by backend implementations to reason +about the layout of the matrix in registers. ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -115,7 +142,8 @@ enum class use { ``` ==== Shape -The shape of a `joint_matrix` refers to its number of rows `Rows` and number of columns `Cols`. +The shape of a `joint_matrix` refers to its number of rows `Rows` and +number of columns `Cols`. ==== Layout This specifies the memory layout and it can be row major or column major. @@ -130,11 +158,14 @@ enum class layout { } ``` - ==== Group Memory Scope -In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The group scope is added as an additional template parameter. +In this API, we use the terminology of `joint_matrix` instead of plain +`matrix` to emphasize that the matrix is shared among a group of work +items and is not private to each work item. The group scope is added +as an additional template parameter. 
-IMPORTANT: In the current implementation, only the `sub_group` scope is supported +IMPORTANT: In the current implementation, only the `sub_group` scope +is supported When the group is a `sycl::sub_group`, a matrix is declared as follows: @@ -142,39 +173,61 @@ When the group is a `sycl::sub_group`, a matrix is declared as follows: joint_matrix tA; ``` - === Matrix Operations and their Execution Scope -We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation. This set of functions can be easily extended if the matrix hardware implements new features. +We define three new functions needed to perform the main and common +operations on matrices, namely load, store, and the actual multiply +and add operation. This set of functions can be easily extended if the +matrix hardware implements new features. -Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group. +Since the matrix functions are group operations (as defined in Section +4.17.3 of the SYCL specification), the matrix API has to be accessed +by all the work-items in the group in a convergent control flow. The +`Group` template argument can be a work-group or a sub-group. These +functions will be called once by each work item in the group. -To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: +To be aligned with the SYCL 2020 group algorithms, an additional group +argument is added to the matrix operations to designate that these +functions are collective operations. 
The {dpcpp} syntax is the following: -IMPORTANT: In the current implementation, only the `sub_group` scope is supported. +IMPORTANT: In the current implementation, only the `sub_group` scope +is supported. ==== Load ```c++ namespace sycl::ext::oneapi::experimental::matrix { - template void joint_matrix_load(Group sg, - joint_matrix &res, - multi_ptr src, size_t stride, layout Layout); + joint_matrix &res, + multi_ptr src, size_t stride, layout Layout); - template + template void joint_matrix_load(Group sg, joint_matrix &res, - multi_ptr src, size_t stride); + multi_ptr src, size_t stride); } ``` -`joint_matrix_load` loads data from memory to the 2d tiles/registers of the tensor hardware. -We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not. -The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that used the default value `layout::dynamic`. -The second overload without a memory layout must not be used with a `joint_matrix` type that used the default value `layout::dynamic`. +`joint_matrix_load` loads data from memory to the 2d tiles/registers +of the matrix hardware. +We define two overloads of the load function depending on whether the +memory layout was declared as part of the `joint_matrix` type or not. +The first overload that takes memory layout as an argument is only +available for a `joint_matrix` type that used the default value +`layout::dynamic`. +The second overload without a memory layout must not be used with a +`joint_matrix` type that used the default value `layout::dynamic`. -The base pointer `src` here determines the starting address of the matrix to be loaded from. `Layout` determines whether the data is being read in a row (`row_major`), column major (`column_major`) fashion. 
`stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout. +The base pointer `src` here determines the starting address of the +matrix to be loaded from. `Layout` determines whether the data is +being read in a row (`row_major`), column major (`column_major`) +fashion. `stride` describes the number of elements between consecutive +rows for the row major layout, or between columns for the column major +layout. ==== Store @@ -183,62 +236,132 @@ namespace sycl::ext::oneapi::experimental::matrix { template void joint_matrix_store(Group sg, - joint_matrix &res, + joint_matrix &res, multi_ptr dest, size_t stride, layout Layout); } ``` -This function stores the data in the accumulator matrix from the 2d tiles back to memory. +This function stores the data in the accumulator matrix from the 2d +tiles back to memory. -The base pointer `dest` here determines the starting address of the matrix to be stored. `Layout` determines whether the data is being written in a row (`row_major`), column major (`column_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout. +The base pointer `dest` here determines the starting address of the +matrix to be stored. `Layout` determines whether the data is being +written in a row (`row_major`), column major (`column_major`) +fashion. `stride` describes the number of elements between consecutive +rows for the row major layout, or between columns for the column major layout. ==== Multiply and Add ```c++ namespace sycl::ext::oneapi::experimental::matrix { - template - joint_matrix joint_matrix_mad(Group sg, + joint_matrix + joint_matrix_mad(Group sg, joint_matrix A, joint_matrix B, joint_matrix C); } ``` -The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. 
+The matrix multiply and add function performs the multiply operation +on the matrices `A` and `B`, accumulates the result with `C`, and +returns the result. ==== Matrix Initialization: `joint_matrix_fill` -Unlike `joint_matrix_load` that assumes that all the matrices are directly loaded from memory, `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. On Intel AMX, if the initialization constant is zero, this would map to the `_tile_zero` intrinsic: +Unlike `joint_matrix_load`, which assumes that all the matrices are +directly loaded from memory, `joint_matrix_fill` makes it possible to +multiply a matrix which is not directly loaded from memory but rather +initialized directly in the register. On Intel AMX, if the +initialization constant is zero, this would map to the `_tile_zero` intrinsic: ```c++ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); + void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); } ``` -IMPORTANT: In the current implementation, only the `sub_group` scope is supported. +IMPORTANT: In the current implementation, only the `sub_group` scope +is supported. ==== Element Indexing and Piece-Wise Operations ===== Background -Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in a SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into: - -Class 1- Element-wise operations where the same operation is performed on every element of the matrix, such that the operation can be performed without knowledge of the position of the element within the matrix. Activation functions or adding a constant value to every element of the matrix are two examples.
- -Class 2- Piece-wise operations where the operation depends on the element index of the matrix or the operation takes multiple elements as operands (such as a sum of all elements in a row for example). Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. - -// We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future. +Besides matrix multiply and add, this extension aims to make it +possible to perform piece-wise operations on matrices in a SPMD +manner. The mechanisms that are recommended to perform such piece-wise +operations depend upon which of the following classes the operation +falls into: + +Class 1- Element-wise operations where the same operation is performed +on every element of the matrix, such that the operation can be +performed without knowledge of the position of the element within the +matrix. Activation functions or adding a constant value to every +element of the matrix are two examples. + +Class 2- Piece-wise operations where the operation depends on the +element index of the matrix or the operation takes multiple elements +as operands (such as a sum of all elements in a row for +example). Quantization that is needed for conversion between low +precision types like `int8_t` and `fp32` uses piece-wise operations. 
+ +// We explored multiple options to enable this feature in the matrix +interface: 1) Allowing non-restrictive element indexing on the matrix +elements would result in slow indexing on the GPU, 2) Operator +overloading can represent only element-wise operations and not the +operations on pieces (row, column, diagonal, etc.) of the matrix. 3) +Providing specific functions for these piece-wise operations can +resolve some of the operations we know of today, like the ones involved +in quantization, but it is not general to any problem that may occur in +the future. ===== Explicit conversion with mapping from SIMD to SPMD -The data elements in a `joint_matrix` are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the work items in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. However, it is necessary to allocate WIs to specific elements of the matrix in order to perform element-wise operations.
This means that modifying `wi_data` also modifies the corresponding joint matrix elements. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation. - -Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group as this is implementation defined and changes from one backend to the other. For general piece-wise operations such as summing the rows of a matrix, the WI data to joint matrix mapping coordinates information must be known in order to reason about the matrix view and extract the relevant piece. However, for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees the whole matrix element-wise operation. - -Note that `get_wi_data` cannot return a fixed size array length because the length of the WI portion is a runtime variable for the following reasons: - -1- The main compilation mode of SYCL is JIT compilation and partitioning among WIs is implementation defined. +The data elements in a `joint_matrix` are distributed or shared across +the work-items in the Group in an implementation-defined way. There is +no fixed allocation of matrix elements owned by a `joint_matrix` +instance to the WIs comprising the group used to instantiate it. For +instance, the matrix is a shared entity among the work items in the +case of the AMX backend because the AMX tile that holds the matrix +data is a 2d register that is shared among the work items. Therefore +the partitioning among the WIs is implementation defined. However, it +is necessary to allocate WIs to specific elements of the matrix in +order to perform element-wise operations. 
In order to be able to +perform element-wise operations in a general and efficient way, we +provide a conversion function from the `joint_matrix` domain that is +owned by a group of work items to the portion that is owned by each +work item. This enables the WI to perform piece-wise operations on the +matrix within the SYCL SPMD programming model. + +We introduce a new function `get_wi_data` that provides a view of the +portion of the matrix that is owned by the current WI. The indexing +provided inside the `wi_data` class accesses only the portion of the +current WI and returns `wi_element`. The latter holds a reference to +the original `joint_matrix` that `wi_data` was constructed from. This +means that modifying `wi_data` also modifies the corresponding joint +matrix elements. Users can use the `=` operator to update the element +of the `joint_matrix` represented by the `wi_element` after the +element-wise operation. + +Using `get_wi_data`, it is not possible to know which portions of data +are owned by each thread in the group, as this is implementation +defined and changes from one backend to another. For general +piece-wise operations such as summing the rows of a matrix, the WI +data to joint matrix mapping coordinates must be known in +order to reason about the matrix view and extract the relevant +piece. However, for element-wise operations where the same operation +is performed on all the elements of the matrix, having all the WIs in +the group apply the operation inside a loop iterating over the +`length` of `wi_data` guarantees that the operation covers the whole matrix. + +Note that `get_wi_data` cannot return a fixed-size array, +because the length of the WI portion is a runtime variable for the +following reasons: + +1- The main compilation mode of SYCL is JIT compilation, and +partitioning among WIs is implementation defined.
@@ -246,9 +369,11 @@ The code listing below shows a synopsis of these new APIs. ```c++ namespace sycl::ext::oneapi::experimental::matrix { - wi_data get_wi_data(Group sg, joint_matrix Mat); + wi_data get_wi_data(Group sg, + joint_matrix Mat); -template +template class wi_data { size_t length(); wi_element operator[](size_t i); @@ -259,28 +384,43 @@ template get_coord(); }; } ``` -In the following example `wi_data_c` is a reference to the WI owned portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`. -Vectorization along the sub group dimension will get enabled automatically to vectorize the contiguous portion of the matrix. +In the following example `wi_data_c` is a reference to the WI owned +portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` +updates the corresponding matrix element in the joint_matrix `matC`. +Vectorization along the sub group dimension will get enabled +automatically to vectorize the contiguous portion of the matrix. ```c++ auto wi_data_c = get_wi_data(sg, matC); for (int i = 0; i < wi_data_c.length(); i++) - wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C + wi_data_c[i] *= alpha; // Note that the indexing here "i" + is in the vector owned by a WI, not in the matrix C ``` -IMPORTANT: In the current implementation, only the `sub_group` scope is supported. +IMPORTANT: In the current implementation, only the `sub_group` scope +is supported. ===== Work-item data to joint matrix mapping coordinates -The `wi_data` and `wi_element` classes provide access to the matrix elements that are local to the calling work-item. However, the distribution of matrix elements to each work-item is implementation-defined, so application code cannot assume any fixed distribution. Instead, application code can use the `get_coord` method to query the matrix coordinates of an individual `wi_element`. 
+The `wi_data` and `wi_element` classes provide access to the matrix +elements that are local to the calling work-item. However, the +distribution of matrix elements to each work-item is +implementation-defined, so application code cannot assume any fixed +distribution. Instead, application code can use the `get_coord` method +to query the matrix coordinates of an individual `wi_element`. -`get_coord` returns [row,col] coordinates of the current object `wi_element` of the joint matrix. The code above results into the following: +`get_coord` returns the [row, col] coordinates of the current +`wi_element` object within the joint matrix. The code above results in the following: ```c++ auto data = get_wi_data(sg, tA); @@ -293,6 +433,35 @@ for (int i = 0; i < data.length(); ++i) { IMPORTANT: `get_coord` is not implemented yet. +=== Joint Matrix Additional Types +Besides C++ half, float, double types, and sycl::bfloat16 types, joint +matrix implementations may support other low-precision floating-point types +such as tf32. The tf32 type has a 19-bit format with one sign bit, 8 +exponent bits offering the same range as fp32, and 10 mantissa bits +offering the same precision as the half type. Use of the tf32 type is +restricted to `joint_matrix` using: +sycl::ext::oneapi::experimental::matrix::precision::tf32. + +Joint matrix type tf32 is defined as an empty class with no member functions. +```c++ +namespace precision { +class tf32; +} +``` +Besides the type, one conversion function is added: +`round_to_tf32`, which performs the rounding to tf32. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + float round_to_tf32(float &elem); +} +``` +Joint matrix load/store/fill perform float-type memory accesses to/from +a tf32 joint matrix. Element-wise accesses of a +tf32 `joint_matrix` also return float. In this case, general arithmetic is +done on fp32 data.
+ + === Example using int8_t type ```c++ using namespace sycl::ext::oneapi::experimental::matrix; @@ -333,44 +502,116 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) ``` === Query Interface -Intel AMX, Intel XMX and Nvidia TPUs support different sizes and types (see Appendix: Supported Combinations Per Hardware). The query interface is used to validate user code and inform them about supported types, sizes, scope, and layouts by the implementation. -This also offers development and tuning productivity by both scientists and library developers. The query interface we are proposing here is a compile-time query, so there will be no runtime errors. +Intel AMX, Intel XMX and Nvidia matrix hardware support different +sizes and types (see Appendix: Supported Combinations Per +Hardware). The query interface is used to validate user code and +inform the user about the types, sizes, scopes, and layouts supported by the +implementation. This also improves development and tuning productivity for both +scientists and library developers. The query interface we are +proposing here is a compile-time query, so there will be no runtime +errors. The query interface proposed here consists of three functionalities: -- Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters. - -- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/accumulator` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K. - -- General query: the general query interface provides information about sizes, types, and scopes that are supported by a specific TPU implementation.
This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and accumulator. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The Intel XMX implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. - -The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which they are defined. +- Validation: at compile time, the validation functionality informs + the user whether a specific combination is valid or not. This takes + place when the user specifies all template parameters. + +- Default values: this provides a default shape if the user does not + provide a specific combination. In this case, aliases to the + `joint_matrix` type can be used, namely + `joint_matrix_a/b/accumulator` where no additional argument is + needed. This form happens when the user specifies all template + parameters except the sizes of the matrices (`tiles`) M, N, and K. + +- General query: the general query interface provides information + about sizes, types, and scopes that are supported by a specific TPU + implementation. This is needed to avoid padding by the user, for + tuning, and efficient code generation if used by a library. The + general query returns an array of `combinations` of `combination` + type. 
Each combination includes the sizes and the types for the + matrices A, B, and accumulator. Note that for each TPU, the query + returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` + exclusively, depending on whether the implementation supports a + continuous or discrete number of sizes. For example, the Intel AMX + implementation supports a continuous number of sizes, so the `max_*` + variant is applied and only the maximum number is returned. The + Intel XMX implementation, on the other hand, supports a discrete + list of numbers so the `msize, nsize, ksize` variant is applied. + This form takes place when users only specify the TPU they are + interested in using. + +The table below provides a description for each of the member +variables and type aliases in `tpu_params` class and the forms in +which they are defined. [frame="none",options="header"] |====================== | Member/type alias in `tpu_params` | Forms they are defined in |Description |`type_a`| validation, default values|type alias for the type of matrix A |`type_b`| validation, default values|type alias for the type of matrix B -|`type_accumulator`| validation, default values|type alias for the type of matrix accumulator -|`M`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation -|`N`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation -|`K`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation -|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A -|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B -|`joint_matrix_accumulator`| validation, default values| type alias for `joint_matrix` for matrix accumulator -|numtiles| validation, default values, general query|indicates number of tiles in Intel AMX (does not apply to Intel XMX) -|scopes| validation, default values, general query| indicates the memory and execution scopes supported by the TPU implementation -|`combination` | validation, default values, general query|composes the types and sizes of A, B, accumulator matrices allowed in one combination -|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query| if the TPU implementation supports a continuous number of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete number of element sizes, each of these members has the value zero -|`msize`, `nsize`, `ksize`| validation, default values, general query| if the TPU implementation supports a discrete number of element sizes, each of these members is non-zero, and the value tells one of the supported element sizes. 
By contrast, if the TPU supports a continuous number of element sizes, each of these members has the value zero -|`atype`, `btype`, `accumulatortype`| validation, default values, general query| indicates the types supported in the combination -|`combinations` | validation, default values, general query| tells the set of supported matrix sizes and types according to the template parameters that are provided. In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes so only this specific combination is returned in the combinations array. -|`num_combinations`| validation, default values, general query|indicates number of combinations supported by the TPU implementation which corresponds to the size of the `combinations` array +|`type_accumulator`| validation, default values|type alias for the +type of matrix accumulator +|`M`| validation, default values|when no sizes are provided by the +user, indicates the suggested default size for M; usually this +corresponds to the maximum size the implementation supports. In +validation mode, where the user does provide sizes, this is the same +value M that the user provides if M is supported by the implementation +|`N`| validation, default values|when no sizes are provided by the +user, indicates the suggested default size for N; usually this +corresponds to the maximum size the implementation supports. 
In +validation mode, where the user does provide sizes, this is the same +value N that the user provides if N is supported by the implementation +|`K`| validation, default values|when no sizes are provided by the +user, indicates the suggested default size for K; usually this +corresponds to the maximum size the implementation supports. In +validation mode, where the user does provide sizes, this is the same +value K that the user provides if K is supported by the implementation +|`joint_matrix_a`| validation, default values|type alias for +`joint_matrix` for matrix A +|`joint_matrix_b`| validation, default values| type alias for +`joint_matrix` for matrix B +|`joint_matrix_accumulator`| validation, default values| type alias +for `joint_matrix` for matrix accumulator +|numtiles| validation, default values, general query|indicates number +of tiles in Intel AMX (does not apply to Intel XMX) +|scopes| validation, default values, general query| indicates the +memory and execution scopes supported by the TPU implementation +|`combination` | validation, default values, general query|composes +the types and sizes of A, B, accumulator matrices allowed in one combination +|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, +general query| if the TPU implementation supports a continuous number +of element sizes, each of these members is non-zero, and the TPU +implementation supports all element sizes from 1 up to (and including) +that number. By contrast, if the TPU implementation supports a +discrete number of element sizes, each of these members has the value zero +|`msize`, `nsize`, `ksize`| validation, default values, general +query| if the TPU implementation supports a discrete number of element +sizes, each of these members is non-zero, and the value tells one of +the supported element sizes. 
By contrast, if the TPU supports a +continuous number of element sizes, each of these members has the value zero +|`atype`, `btype`, `accumulatortype`| validation, default values, +general query| indicates the types supported in the combination +|`combinations` | validation, default values, general query| tells +the set of supported matrix sizes and types according to the template +parameters that are provided. In the "general query" form, the user +provides only the TPU type, so the combinations array contains all +supported tile sizes and element types for that TPU. In the "default +values" form, the user provides the TPU type and element types, so the +combinations array contains only those supported matrix sizes and +element types that match those element types on that TPU. In the +"validation" form, the user provides the TPU type, element types, and +element sizes so only this specific combination is returned in the +combinations array. +|`num_combinations`| validation, default values, general +query|indicates number of combinations supported by the TPU +implementation which corresponds to the size of the `combinations` array |====================== ```c++ namespace sycl::ext::oneapi::experimental::matrix { -template +template struct tpu_params; // Validation form: Valid or not @@ -387,12 +628,16 @@ struct tpu_params< (is_combination_valid_amx(sM, sN, sK)), "Invalid parameters for Intel AMX, query valid types and maximum sizes " "using: " - "tpu_params myparams; and then check out myparams.combinations array"); + "tpu_params myparams; and then check out " + "myparams.combinations array"); - using type_a = Ta; // this type alias is not available in the + // current implementation + using type_b = Tb;
// this type alias is not available in the + // current implementation + using type_accumulator = Tc; // this type alias is not available in + // the current implementation // if combination is valid, construct the matrices @@ -402,11 +647,14 @@ struct tpu_params< (sK != 0) ? sK : ((sizeof(Ta) == 1) ? 64 : 32); template - using joint_matrix_a = joint_matrix; + using joint_matrix_a = joint_matrix; template - using joint_matrix_b = joint_matrix; + using joint_matrix_b = joint_matrix; template - using joint_matrix_accumulator = joint_matrix; + using joint_matrix_accumulator = joint_matrix; static constexpr uint32_t numtiles = 8; static constexpr scope_t scopes[] = {scope_t::sub_group}; @@ -422,7 +670,8 @@ struct tpu_params< matrix_type btype; matrix_type accumulatortype; }; - // In this case, the combinations array contains only the combination that the user provided + // In this case, the combinations array contains only the + // combination that the user provided static constexpr combination combinations[] = { {16, 16, (sizeof(Ta) == 1) ?
64 : 32, sM, sN, sK}}; static constexpr int num_combinations = @@ -437,13 +686,17 @@ struct tpu_params && !std::is_same_v)>::type> { static_assert((are_types_valid_amx()), - "Invalid types for Intel AMX, supported types are int8_t, uint8_t, " + "Invalid types for Intel AMX, supported types are " + "int8_t, uint8_t, " "and bf16 (Note that unsigned short should be used in the" "DPC++ code to implement bf16) "); - using type_a = Ta; // this type alias is not available in the current implementation - using type_b = Tb; // this type alias is not available in the current implementation - using type_accumulator = Tc; // this type alias is not available in the current implementation + using type_a = Ta; // this type alias is not available in the + // current implementation + using type_b = Tb; // this type alias is not available in the + // current implementation + using type_accumulator = Tc; // this type alias is not available in + // the current implementation // construct the matrices using the default sizes static constexpr std::size_t M = 16; @@ -455,7 +708,8 @@ struct tpu_params using joint_matrix_b = joint_matrix; template - using joint_matrix_accumulator = joint_matrix; + using joint_matrix_accumulator = joint_matrix; static constexpr uint32_t numtiles = 8; static constexpr scope_t scopes[] = {scope_t::sub_group}; @@ -471,7 +725,8 @@ struct tpu_params { }; static constexpr combination combinations[] = { - {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32}, - {16, 16, 32, 0, 0,0, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}}; + {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, + matrix_type::sint32}, + {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, +
matrix_type::sint32}, + {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, + matrix_type::sint32}, + {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, + matrix_type::sint32}, + {16, 16, 32, 0, 0,0, matrix_type::bf16, matrix_type::bf16, + matrix_type::fp32}}; static constexpr int num_combinations = sizeof(combinations) / sizeof(combination); }; @@ -542,7 +802,8 @@ enum class scope_t { ``` ==== Validation Example: ```c++ -// User can provide sizes besides the types and tpu_params can assert if they are supported or not +// User can provide sizes besides the types and tpu_params can assert + // if they are supported or not // in this case, an assertion will happen as 16 is not a supported size for M using myparams = tpu_params; size_t NDRangeM = M / myparams::M; // Assertion would happen at this line @@ -566,7 +827,8 @@ myparams::joint_matrix_accumulator sub_c; ==== General Query Example: ```c++ constexpr int M = 1500; // with msize = 8 and msize = 4, - // M can be broken up to 125 sequence of 8-sized ops and remaining 500 using 125 sequence of 4-sized ops + // M can be broken into 125 sequences of 8-sized ops and the + // remaining 500 using 125 sequences of 4-sized ops tpu_params params; constexpr int msize = break_dimension(params, M); constexpr int msize_remainder = break_dimension_remainder(params, M); @@ -581,7 +843,10 @@ joint_matrix sub_c; === Appendix: Supported Combinations Per Hardware -The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class. +The table below provides a list of the combinations that +`joint_matrix` implementations support on each of Intel AMX and Intel +XMX hardware. Note that these can be returned in a parametrized way +using the `tpu_params` query class.
==== Intel AMX Supported Combinations @@ -610,8 +875,14 @@ The table below provides a list of the combinations that `joint_matrix` implemen |Rev |Date |Author |Changes |1 |2021-04-13 |Dounia Khaldi |Initial public working draft. |2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS -|3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support -|4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the new matrix use parameter and remove reference to the AOT AMX initial implementation -|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia tensor Cores, and move the Intel-specifics to a separate extension document. -|6 |2023-01-09 |Dounia Khaldi |Add `get_coord` API and supported combinations appendix. +|3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise +operations support +|4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the +new matrix use parameter and removing the reference to the AOT AMX initial +implementation +|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it +portable across Intel AMX, Intel XMX and Nvidia Tensor Cores, and moving +the Intel-specifics to a separate extension document. +|6 |2023-01-09 |Dounia Khaldi |Add `get_coord` API, tf32 type, and supported +combinations appendix.
|====================== From 04e18fe673516225e150880bfafdb43cb6ab2e31 Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 30 Jan 2023 12:20:36 -0800 Subject: [PATCH 09/51] correct the matrix types in the appendix --- .../sycl_ext_intel_matrix.asciidoc | 5 ++- .../sycl_ext_oneapi_matrix.asciidoc | 45 ++++++++++--------- 2 files changed, 28 insertions(+), 22 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 5484303530cf8..d4fa2d291755d 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -137,11 +137,12 @@ store on matrix `a` and `b` as well. ```c++ namespace sycl::ext::intel::experimental::matrix { - template void joint_matrix_store(Group sg, joint_matrix &res, - multi_ptr src, size_t stride); + multi_ptr src, size_t stride); } ``` diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index bdd2db5d83896..e97ee451ffe9f 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -434,18 +434,18 @@ for (int i = 0; i < data.length(); ++i) { IMPORTANT: `get_coord` is not implemented yet. === Joint Matrix Additional Types -Besides C++ half, float, double types, and sycl::bfloat16 types, joint +Besides C++ `half`, `float`, `double` types, and `sycl::bfloat16` types, joint matrix implementations may support other low-precision floating-point types such as tf32. 
tf32 type has a 19 bit format with one sign bit, 8 exponent bits offering the same range as fp32, and 10 mantissa bits offering same precision as half type. The usage of tf32 type is restricted to `joint_matrix` using: -sycl::ext::oneapi::experimental::matrix::precision::tf32. +`sycl::ext::oneapi::experimental::matrix::precision::tf32`. Joint matrix type tf32 is defined as an empty class with no member functions. ```c++ namespace precision { -class tf32; + class tf32; } ``` Besides the type, one conversion function is added: @@ -780,18 +780,18 @@ enum class matrix_type { tf32, fp32, fp64, - sint2, - sint4, - sint8, - sint16, - sint32, - sint64, - uint2, - uint4, - uint8, - uint16, - uint32, - uint64 + sint2_t, + sint4_t, + sint8_t, + sint16_t, + sint32_t, + sint64_t, + uint2_t, + uint4_t, + uint8_t, + uint16_t, + uint32_t, + uint64_t }; enum class scope_t { @@ -853,8 +853,10 @@ using the `tpu_params` query class. [frame="none",options="header"] |====================== | A type | B type | Accumulator type | M | N | K -| (u)int8_t | (u)int8_t | int32_t | +<=+ 16 | +<=+ 16 | +<=+ 64 -| bf16 | bf16 | fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 +| matrix_type::(u)int8_t | matrix_type::(u)int8_t | +matrix_type::sint32_t | +<=+ 16 | +<=+ 16 | +<=+ 64 +| matrix_type::bf16 | matrix_type::bf16 | +matrix_type::fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 |====================== ==== Intel XMX Supported Combinations @@ -862,9 +864,12 @@ using the `tpu_params` query class. 
[frame="none",options="header"] |====================== | A type | B type | Accumulator type | M | N | K -| (u)int8_t | (u)int8_t | int32_t | +<=+ 8 | 16 | 32 -| fp16 | fp16 | fp32 | +<=+ 8 | 16 | 16 -| bf16 | bf16 | fp32 | +<=+ 8 | 16 | 16 +| matrix_type::(u)int8_t | matrix_type::(u)int8_t | +matrix_type::int32_t | +<=+ 8 | 16 | 32 +| matrix_type::fp16 | matrix_type::fp16 | +matrix_type::fp32 | +<=+ 8 | 16 | 16 +| matrix_type::bf16 | matrix_type::bf16 | +matrix_type::fp32 | +<=+ 8 | 16 | 16 |====================== From 9403a38ca19f8c01670593010657ae03f83c2d6a Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 30 Jan 2023 12:23:36 -0800 Subject: [PATCH 10/51] correct the matrix types in the appendix --- .../sycl_ext_oneapi_matrix.asciidoc | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index e97ee451ffe9f..85e6f4cf5a778 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -853,10 +853,10 @@ using the `tpu_params` query class. 
[frame="none",options="header"] |====================== | A type | B type | Accumulator type | M | N | K -| matrix_type::(u)int8_t | matrix_type::(u)int8_t | -matrix_type::sint32_t | +<=+ 16 | +<=+ 16 | +<=+ 64 -| matrix_type::bf16 | matrix_type::bf16 | -matrix_type::fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 +| `matrix_type::(u)int8_t` | `matrix_type::(u)int8_t` | +`matrix_type::sint32_t` | +<=+ 16 | +<=+ 16 | +<=+ 64 +| `matrix_type::bf16` | `matrix_type::bf16` | +`matrix_type::fp32` | +<=+ 16 | +<=+ 16 | +<=+ 32 |====================== ==== Intel XMX Supported Combinations @@ -864,12 +864,12 @@ matrix_type::fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 [frame="none",options="header"] |====================== | A type | B type | Accumulator type | M | N | K -| matrix_type::(u)int8_t | matrix_type::(u)int8_t | -matrix_type::int32_t | +<=+ 8 | 16 | 32 -| matrix_type::fp16 | matrix_type::fp16 | -matrix_type::fp32 | +<=+ 8 | 16 | 16 -| matrix_type::bf16 | matrix_type::bf16 | -matrix_type::fp32 | +<=+ 8 | 16 | 16 +| `matrix_type::(u)int8_t` | `matrix_type::(u)int8_t` | +`matrix_type::int32_t` | +<=+ 8 | 16 | 32 +| `matrix_type::fp16` | `matrix_type::fp16` | +`matrix_type::fp32` | +<=+ 8 | 16 | 16 +| `matrix_type::bf16` | `matrix_type::bf16` | +`matrix_type::fp32` | +<=+ 8 | 16 | 16 |====================== From ddb87f1b2156bb07c97b758a2801a9d446ed6bf0 Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 30 Jan 2023 12:48:35 -0800 Subject: [PATCH 11/51] remove _t from the types --- .../sycl_ext_oneapi_matrix.asciidoc | 32 +++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 85e6f4cf5a778..505d2f0d7fd2b 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ 
b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -780,18 +780,18 @@ enum class matrix_type { tf32, fp32, fp64, - sint2_t, - sint4_t, - sint8_t, - sint16_t, - sint32_t, - sint64_t, - uint2_t, - uint4_t, - uint8_t, - uint16_t, - uint32_t, - uint64_t + sint2, + sint4, + sint8, + sint16, + sint32, + sint64, + uint2, + uint4, + uint8, + uint16, + uint32, + uint64 }; enum class scope_t { @@ -853,8 +853,8 @@ using the `tpu_params` query class. [frame="none",options="header"] |====================== | A type | B type | Accumulator type | M | N | K -| `matrix_type::(u)int8_t` | `matrix_type::(u)int8_t` | -`matrix_type::sint32_t` | +<=+ 16 | +<=+ 16 | +<=+ 64 +| `matrix_type::(u)int8` | `matrix_type::(u)int8` | +`matrix_type::sint32` | +<=+ 16 | +<=+ 16 | +<=+ 64 | `matrix_type::bf16` | `matrix_type::bf16` | `matrix_type::fp32` | +<=+ 16 | +<=+ 16 | +<=+ 32 |====================== @@ -864,8 +864,8 @@ using the `tpu_params` query class. [frame="none",options="header"] |====================== | A type | B type | Accumulator type | M | N | K -| `matrix_type::(u)int8_t` | `matrix_type::(u)int8_t` | -`matrix_type::int32_t` | +<=+ 8 | 16 | 32 +| `matrix_type::(u)int8` | `matrix_type::(u)int8` | +`matrix_type::int32` | +<=+ 8 | 16 | 32 | `matrix_type::fp16` | `matrix_type::fp16` | `matrix_type::fp32` | +<=+ 8 | 16 | 16 | `matrix_type::bf16` | `matrix_type::bf16` | From 8a8e0a9bcea960bc6ca73bb6c624f0decc860af9 Mon Sep 17 00:00:00 2001 From: Dounia Date: Sat, 4 Feb 2023 12:34:58 -0800 Subject: [PATCH 12/51] Specify in Status that joint matrix is an optional kernel feature --- .../sycl_ext_oneapi_matrix.asciidoc | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 505d2f0d7fd2b..ebfe3716c7d10 100644 --- 
a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -39,7 +39,6 @@ references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision. == Status - This is an experimental extension specification, intended to provide early access to features and gather community feedback. Interfaces defined in this specification are implemented in {dpcpp}, but they are not finalized and may @@ -47,6 +46,16 @@ change incompatibly in future versions of {dpcpp} without prior notice. *Shipping software products should not rely on APIs defined in this specification.* +_Note The joint_matrix type is an optional kernel feature as defined +in section 5.7 of the core SYCL specification. Each device supports +only certain values for the `Rows` and `Cols` template parameters and +only certain types for the `T` template parameter. Applications can +use the query API `tpu_params` to determine the set of legal +parameters for each device. If the application submits a kernel using +an unsupported `joint_matrix` parameter, the implementation throws a +synchronous exception with the `errc::kernel_not_supported` error code +as described in section 5.7. 
+ == Backend support status This extension is currently implemented in {dpcpp} only for devices that contain a matrix hardware, specifically Intel(R) Advanced Matrix From 7e610aa323a641e5568d787c4f87a86227cf9354 Mon Sep 17 00:00:00 2001 From: Dounia Date: Wed, 8 Feb 2023 18:39:16 -0800 Subject: [PATCH 13/51] Move the iteration-style EWOps to the Intel extension and introduce joint_matrix_apply map function --- .../sycl_ext_intel_matrix.asciidoc | 204 ++++++++++++++++-- .../sycl_ext_oneapi_matrix.asciidoc | 183 ++++------------ 2 files changed, 226 insertions(+), 161 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index d4fa2d291755d..bb16ee698e646 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -1,4 +1,4 @@ -= sycl_ext_oneapi_matrix += sycl_ext_intel_matrix :source-highlighter: coderay :coderay-linenums-mode: table @@ -146,6 +146,151 @@ namespace sycl::ext::intel::experimental::matrix { } ``` +==== Element Indexing and Piece-Wise Operations +===== Background +Besides matrix multiply and add, this extension aims to make it +possible to perform piece-wise operations on matrices in a SPMD +manner. The mechanisms that are recommended to perform such piece-wise +operations depend upon which of the following classes the operation +falls into: + +Class 1- Element-wise operations where the same operation is performed +on every element of the matrix, such that the operation can be +performed without knowledge of the position of the element within the +matrix. Activation functions or adding a constant value to every +element of the matrix are two examples. In this case +`joint_matrix_apply` should be used. 
+
+Class 2- Piece-wise operations where the operation depends on the
+element index of the matrix or the operation takes multiple elements
+as operands (such as a sum of all elements in a row for
+example). Quantization that is needed for conversion between low
+precision types like `int8_t` and `fp32` uses piece-wise operations.
+
+// We explored multiple options to enable this feature in the matrix
+// interface: 1) Allowing non-restrictive element indexing on the matrix
+// elements would result in slow indexing on the GPU, 2) Operator
+// overloading can represent only element-wise operations and not the
+// operations on pieces (row, column, diagonal, etc) of the matrix. 3)
+// Providing specific functions for these piece-wise operations can
+// resolve some of the functions we know of today like the ones involved
+// in quantization but it is not general to any problem that may occur in
+// the future.
+
+===== Explicit conversion with mapping from SIMD to SPMD
+The data elements in a `joint_matrix` are distributed or shared across
+the work-items in the Group in an implementation-defined way. There is
+no fixed allocation of matrix elements owned by a `joint_matrix`
+instance to the WIs comprising the group used to instantiate it. For
+instance, the matrix is a shared entity among the work items in the
+case of the AMX backend because the AMX tile that holds the matrix
+data is a 2d register that is shared among the work items. Therefore
+the partitioning among the WIs is implementation defined. However, it
+is necessary to allocate WIs to specific elements of the matrix in
+order to perform element-wise operations. In order to be able to
+perform element-wise operations in a general and efficient way, we
+provide a conversion function from the `joint_matrix` domain that is
+owned by a group of work items to the portion that is owned by each
+work item. This enables the WI to perform piece-wise operations on the
+matrix within the SYCL SPMD programming model.
+ +We introduce a new function `get_wi_data` that provides a view of the +portion of the matrix that is owned by the current WI. The indexing +provided inside the `wi_data` class accesses only the portion of the +current WI and returns `wi_element`. This latter holds a reference to +the original joint_matrix that `wi_data` was constructed from. This +means that modifying `wi_data` also modifies the corresponding joint +matrix elements. Users can use the `=` operator to update the element +of the `joint_matrix` represented by the `wi_element` after the +element-wise operation. + +Using `get_wi_data`, it is not possible to know which portions of data +are owned by each thread in the group as this is implementation +defined and changes from one backend to the other. For general +piece-wise operations such as summing the rows of a matrix, the WI +data to joint matrix mapping coordinates information must be known in +order to reason about the matrix view and extract the relevant +piece. However, for element-wise operations where the same operation +is performed on all the elements of the matrix, having all the WIs in +the group apply the operation inside a loop iterating over the +`length` of `wi_data` guarantees the whole matrix element-wise operation. + +Note that `get_wi_data` cannot return a fixed size array length +because the length of the WI portion is a runtime variable for the +following reasons: + +1- The main compilation mode of SYCL is JIT compilation and +partitioning among WIs is implementation defined. + +2- Sub group size is not generally fixed. + +The code listing below shows a synopsis of these new APIs. 
+ +```c++ +namespace sycl::ext::intel::experimental::matrix { + wi_data get_wi_data(Group sg, + joint_matrix Mat); + +template +class wi_data { + size_t length(); + wi_element operator[](size_t i); +}; +template +class wi_element { + operator T(); + wi_element &operator=(const T &rhs); + wi_element &operator+=(const T &rhs); + wi_element &operator-=(const T &rhs); + wi_element &operator*=(const T &rhs); + wi_element &operator/=(const T &rhs); + + std::tuple get_coord(); +}; +} +``` + +In the following example `wi_data_c` is a reference to the WI owned +portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` +updates the corresponding matrix element in the joint_matrix `matC`. +Vectorization along the sub group dimension will get enabled +automatically to vectorize the contiguous portion of the matrix. + + +```c++ +auto wi_data_c = get_wi_data(sg, matC); +for (int i = 0; i < wi_data_c.length(); i++) + wi_data_c[i] *= alpha; // Note that the indexing here "i" + is in the vector owned by a WI, not in the matrix C +``` + +IMPORTANT: In the current implementation, only the `sub_group` scope +is supported. + +===== Work-item data to joint matrix mapping coordinates +The `wi_data` and `wi_element` classes provide access to the matrix +elements that are local to the calling work-item. However, the +distribution of matrix elements to each work-item is +implementation-defined, so application code cannot assume any fixed +distribution. Instead, application code can use the `get_coord` method +to query the matrix coordinates of an individual `wi_element`. + +`get_coord` returns [row,col] coordinates of the current object +`wi_element` of the joint matrix. The code above results into the following: + +```c++ +auto data = get_wi_data(sg, tA); +// each WI calculates local sum of rows +for (int i = 0; i < data.length(); ++i) { + auto [row, col] = data[i].get_coord(); + sum_of_local_rows[row] += data[i]; +} +``` + +IMPORTANT: `get_coord` is not implemented yet. 
==== VNNI/Packed Layout Intel AMX and Intel XMX compute assumes that the B tile register @@ -193,25 +338,52 @@ for a 16-bit type. // --------------------------------- // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4 -== Issues -- Should the same class, `joint_matrix`, handle both cases where sizes -are constant (GPU case) and when sizes are variable (CPU case)? Note -that a Intel AMX 2d tile register permits sizes up to 1024 -(16rowsx64cols) bytes that can be variable. The ability to define only -one interface for both would make it possible to give the user a way -to make use of the flexibility introduced by the CPU but at the same -time save resources on the GPU. In a previous version of the design, -we used `sycl::dynamic_extent` to differentiate between static and -dynamic sizes. But since this was not implemented at all, we decided -to remove it. We can revisit this design choice if this comes up as -part of a customer request or if SPIRV matrix extension extends its -support to dynamic sizes. 
+=== Example using int8_t type +```c++ +using namespace sycl::ext::oneapi::experimental::matrix; + +queue q; +range<2> G = {M/tM, N}; +range<2> L = {1, SG_SIZE}; +int8_t *memA = malloc_shared(M*K, q); +int8_t *memB = malloc_shared(K*N, q); +int32_t *memC = malloc_shared(M*N, q); +q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) + [[sycl::reqd_sub_group_size(SG_SIZE)]] { + const auto global_idx = item.get_global_id(0); + const auto global_idy = item.get_global_id(1); + const auto sg_startx = global_idx - item.get_local_id(0); + const auto sg_starty = global_idy - item.get_local_id(1); + sub_group sg = item.get_sub_group(); + joint_matrix tA; + joint_matrix tB; + joint_matrix tC; + joint_matrix_fill(sg, tC, 0); + for (int k = 0; k < K; k += tK) { + joint_matrix_load(sg, tA, + multi_ptr(memA) + + sg_startx * tM * K + k, K); + joint_matrix_load(sg, tB, + multi_ptr(memB) + + k * N*4 + sg_starty/SG_SIZE*tN*4, N*4); + tC = joint_matrix_mad(sg, tA, tB, tC); + } + auto wi_data_c = ext::intel::experimental::matrix::get_wi_data(sg, tC); + for (int i = 0; i < wi_data_c.length(); i++) + wi_data_c[i] *= alpha; + joint_matrix_store(sg, tC, + multi_ptr(memC) + + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); +}).wait(); +``` == Revision History [frame="none",options="header"] |====================== |Rev |Date |Author |Changes -|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API and -layout information. 
+|1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API, +layout information, iterative-based element-wise operations, and +mapping |====================== diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index ebfe3716c7d10..0e459b195e0e7 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -46,7 +46,13 @@ change incompatibly in future versions of {dpcpp} without prior notice. *Shipping software products should not rely on APIs defined in this specification.* -_Note The joint_matrix type is an optional kernel feature as defined +== Backend support status +This extension is currently implemented in {dpcpp} only for devices +that contain a matrix hardware, specifically Intel(R) Advanced Matrix +Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) +XMX) and Nvidia(R) Tensor Cores. + +The joint_matrix type is an optional kernel feature as defined in section 5.7 of the core SYCL specification. Each device supports only certain values for the `Rows` and `Cols` template parameters and only certain types for the `T` template parameter. Applications can @@ -56,12 +62,6 @@ an unsupported `joint_matrix` parameter, the implementation throws a synchronous exception with the `errc::kernel_not_supported` error code as described in section 5.7. -== Backend support status -This extension is currently implemented in {dpcpp} only for devices -that contain a matrix hardware, specifically Intel(R) Advanced Matrix -Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) -XMX) and Nvidia(R) Tensor Cores. - == Overview Joint matrix is a SYCL extension for matrix hardware programming. 
It unifies targets like Intel AMX in CPUs, Intel XMX in Intel GPUs and @@ -297,150 +297,43 @@ namespace sycl::ext::oneapi::experimental::matrix { IMPORTANT: In the current implementation, only the `sub_group` scope is supported. -==== Element Indexing and Piece-Wise Operations -===== Background +==== Element-Wise Operations Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in a SPMD -manner. The mechanisms that are recommended to perform such piece-wise -operations depend upon which of the following classes the operation -falls into: - -Class 1- Element-wise operations where the same operation is performed -on every element of the matrix, such that the operation can be -performed without knowledge of the position of the element within the -matrix. Activation functions or adding a constant value to every -element of the matrix are two examples. - -Class 2- Piece-wise operations where the operation depends on the -element index of the matrix or the operation takes multiple elements -as operands (such as a sum of all elements in a row for -example). Quantization that is needed for conversion between low -precision types like `int8_t` and `fp32` uses piece-wise operations. - -// We explored multiple options to enable this feature in the matrix -interface: 1) Allowing non-restrictive element indexing on the matrix -elements would result into slow indexing on the GPU, 2) Operator -overloading can represent only element-wise operations and not the -operations on pieces (row, column, diagonal, etc) of the matrix. 3) -Providing specific functions for these piece-wise operations can -resolve some of the functions we know of today like the ones involved -in quantization but it is not general to any problem that may occur in -the future. 
- -===== Explicit conversion with mapping from SIMD to SPMD -The data elements in a `joint_matrix` are distributed or shared across -the work-items in the Group in an implementation-defined way. There is -no fixed allocation of matrix elements owned by a `joint_matrix` -instance to the WIs comprising the group used to instantiate it. For -instance, the matrix is a shared entity among the work items in the -case of the AMX backend because the AMX tile that holds the matrix -data is a 2d register that is shared among the work items. Therefore -the partitioning among the WIs is implementation defined. However, it -is necessary to allocate WIs to specific elements of the matrix in -order to perform element-wise operations. In order to be able to -perform element-wise operations in a general and efficient way, we -provide a conversion function from the `joint_matrix` domain that is -owned by a group of work items to the portion that is owned by each -work item. This enables the WI to perform piece-wise operations on the -matrix within the SYCL SPMD programming model. - -We introduce a new function `get_wi_data` that provides a view of the -portion of the matrix that is owned by the current WI. The indexing -provided inside the `wi_data` class accesses only the portion of the -current WI and returns `wi_element`. This latter holds a reference to -the original joint_matrix that `wi_data` was constructed from. This -means that modifying `wi_data` also modifies the corresponding joint -matrix elements. Users can use the `=` operator to update the element -of the `joint_matrix` represented by the `wi_element` after the -element-wise operation. - -Using `get_wi_data`, it is not possible to know which portions of data -are owned by each thread in the group as this is implementation -defined and changes from one backend to the other. 
For general
-piece-wise operations such as summing the rows of a matrix, the WI
-data to joint matrix mapping coordinates information must be known in
-order to reason about the matrix view and extract the relevant
-piece. However, for element-wise operations where the same operation
-is performed on all the elements of the matrix, having all the WIs in
-the group apply the operation inside a loop iterating over the
-`length` of `wi_data` guarantees the whole matrix element-wise operation.
-
-Note that `get_wi_data` cannot return a fixed size array length
-because the length of the WI portion is a runtime variable for the
-following reasons:
-
-1- The main compilation mode of SYCL is JIT compilation and
-partitioning among WIs is implementation defined.
-
-2- Sub group size is not generally fixed.
-
-The code listing below shows a synopsis of these new APIs.
+manner. The `joint_matrix_apply` function performs an element-wise
+operation where the same operation is performed on every element of
+the joint matrix, such that the operation can be performed without
+knowledge of the position of the element within the matrix. Activation
+functions or adding a constant value to every element of the matrix
+are two examples of this usage. When the operation depends on the
+element index of the matrix, an Intel-specific extension is available
+as part of link:sycl_ext_intel_matrix.asciidoc[sycl_ext_intel_matrix].
+
+Besides the `Group` and the `joint_matrix` arguments,
+`joint_matrix_apply` takes a lambda expression that specifies the
+operation to apply to each element of the input matrix.
```c++
namespace sycl::ext::oneapi::experimental::matrix {
-  wi_data get_wi_data(Group sg,
-                      joint_matrix Mat);
-
-template
-class wi_data {
-  size_t length();
-  wi_element operator[](size_t i);
-};
-template
-class wi_element {
-  operator T();
-  wi_element &operator=(const T &rhs);
-  wi_element &operator+=(const T &rhs);
-  wi_element &operator-=(const T &rhs);
-  wi_element &operator*=(const T &rhs);
-  wi_element &operator/=(const T &rhs);
-
-  std::tuple get_coord();
-};
+  template <typename Group, typename T, use Use, size_t Rows,
+            size_t Cols, layout Layout, typename F>
+  void joint_matrix_apply(Group g,
+                          joint_matrix<Group, T, Use, Rows, Cols, Layout> &C,
+                          F&& lambda);
}
```
-In the following example `wi_data_c` is a reference to the WI owned
-portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs`
-updates the corresponding matrix element in the joint_matrix `matC`.
-Vectorization along the sub group dimension will get enabled
-automatically to vectorize the contiguous portion of the matrix.
-
+In the following example, every element of the matrix `C` is
+multiplied by `alpha`. Then, an activation function, `relu` in this
+example, is applied on each of the elements of `C`.

```c++
-auto wi_data_c = get_wi_data(sg, matC);
-for (int i = 0; i < wi_data_c.length(); i++)
-  wi_data_c[i] *= alpha; // Note that the indexing here "i"
-  is in the vector owned by a WI, not in the matrix C
-```
-
-IMPORTANT: In the current implementation, only the `sub_group` scope
-is supported.
-
-===== Work-item data to joint matrix mapping coordinates
-The `wi_data` and `wi_element` classes provide access to the matrix
-elements that are local to the calling work-item. However, the
-distribution of matrix elements to each work-item is
-implementation-defined, so application code cannot assume any fixed
-distribution. Instead, application code can use the `get_coord` method
-to query the matrix coordinates of an individual `wi_element`.
-
-`get_coord` returns [row,col] coordinates of the current object
-`wi_element` of the joint matrix.
The code above results into the following: - -```c++ -auto data = get_wi_data(sg, tA); -// each WI calculates local sum of rows -for (int i = 0; i < data.length(); ++i) { - auto [row, col] = data[i].get_coord(); - sum_of_local_rows[row] += data[i]; -} -``` +joint_matrix_apply(sg, C, [=](T x) { + x *= alpha; + relu(x); +}); -IMPORTANT: `get_coord` is not implemented yet. +IMPORTANT: `joint_matrix_apply` is not implemented yet. === Joint Matrix Additional Types Besides C++ `half`, `float`, `double` types, and `sycl::bfloat16` types, joint @@ -501,9 +394,9 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) k * N + sg_starty/SG_SIZE*tN, N); tC = joint_matrix_mad(sg, tA, tB, tC); } - auto wi_data_c = get_wi_data(sg, tC); - for (int i = 0; i < wi_data_c.length(); i++) - wi_data_c[i] *= alpha; + joint_matrix_apply(sg, tC, [=](int8_t x) { + x *= alpha; + }); joint_matrix_store(sg, tC, multi_ptr(memC) + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); @@ -897,6 +790,6 @@ implementation |5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia Tensor Cores, and move the Intel-specifics to a separate extension document. -|6 |2023-01-09 |Dounia Khaldi |Add `get_coord` API, tf32 type, and supported -combinations appendix. +|6 |2023-01-09 |Dounia Khaldi |Add `joint_matrix_apply` API, tf32 +type, and supported combinations appendix. 
|====================== From 509056ca112f0dfe2d0142633872b270a55ba090 Mon Sep 17 00:00:00 2001 From: Dounia Date: Fri, 10 Feb 2023 09:52:50 -0800 Subject: [PATCH 14/51] Address Jack's comments --- .../sycl_ext_oneapi_matrix.asciidoc | 35 ++++--------------- 1 file changed, 6 insertions(+), 29 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 0e459b195e0e7..249ed2e602b05 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -92,30 +92,6 @@ features the implementation supports. feature-test macro always has this value. |=== -=== Matrix API Versions - -While this document presents the core API that unifies Intel AMX, -Intel XMX, and Nvidia Tensor Cores, the implementations support -slightly different versions of the API. For this reason, we introduce -a new macro, namely `SYCL_EXT_ONEAPI_MATRIX_VERSION` to distinguish -between these different implementations. The goal in the next few -months is to get rid of this implementation versioning macro. These -are the current values for this macro. - -[frame="none",options="header"] -|====================== -|Value |Description -|1 |Initial extension JIT implementation on Intel AMX and Intel -XMX. load, store, mad, fill, piece-wise operations, and the query -interface are supported. The old API used for this implementation is -detailed in -link:../../deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc[matrix extension] -|3 |Initial implementation on Nvidia Tensor Cores -|4 |JIT implementation on Intel AMX and Intel XMX. load, store, -mad, fill, piece-wise operations, and the query interface are -supported. 
Plus, AOT implementation on Nvidia Tensor Cores -|====================== - === New `joint_matrix` class We introduce a new class called `joint_matrix`. The user needs to specify the group memory scope, the type of the elements, the shape, @@ -133,12 +109,13 @@ struct joint_matrix { ``` IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only -to matrix with `use::accumulator` +to `joint_matrix` with `use::accumulator` ==== Use -Specifying the usage of the matrix: matrix left (A), matrix right (B) -or accumulator +(C)+ is required by backend implementations to reason -about the layout of the matrix in registers. +The main operation performed by the matrix hardware is `D=C+A*B`. `Use` +argument specifies the usage of the matrix: matrix left (`A`), matrix +right (`B`) or accumulator +(`C`)+ and `D`. This is required by backend +implementations to reason about the layout of the matrix in registers. ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -275,7 +252,7 @@ namespace sycl::ext::oneapi::experimental::matrix { } ``` The matrix multiply and add function performs the multiply operation -on the matrices `A` and `B`, accumulate the result with `C` and return +on the matrices `A` and `B`, accumulates the result with `C` and returns the result. 
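The multiply-add hunk above defines the semantics as `D = C + A*B`. As an illustration only (this is plain C++, not the SYCL extension API; the shapes, the int8-to-int32 accumulation, and the function name `mad_reference` are assumptions made for this sketch), the per-element math that a single `joint_matrix_mad` call performs on a tile can be written as a scalar reference:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar reference for the semantics of joint_matrix_mad: D = C + A*B.
// A is M x K, B is K x N, C and D are M x N, all stored row-major.
// The matrix hardware performs the same computation on a whole tile
// in a single operation; this loop nest only spells out the math.
template <int M, int K, int N>
std::array<int32_t, M * N> mad_reference(const std::array<int8_t, M * K> &A,
                                         const std::array<int8_t, K * N> &B,
                                         const std::array<int32_t, M * N> &C) {
  std::array<int32_t, M * N> D = C; // accumulate on top of C
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < K; ++k)
        D[m * N + n] += static_cast<int32_t>(A[m * K + k]) *
                        static_cast<int32_t>(B[k * N + n]);
  return D;
}
```

For int8 inputs accumulating into int32 (the combination used in the code examples later in the patches), such a reference makes it easy to check a hardware-produced tile against expected values.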
From 805630cdbb965baf6c6fa3157fda915f8e539139 Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 13 Feb 2023 11:49:56 -0800 Subject: [PATCH 15/51] Add get_info runtime query --- .../sycl_ext_intel_matrix.asciidoc | 40 +- .../sycl_ext_oneapi_matrix.asciidoc | 395 ++++++++---------- 2 files changed, 184 insertions(+), 251 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index bb16ee698e646..fef685415ec25 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -58,11 +58,11 @@ XMX. == Overview The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, -`joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query -interface, as they are defined in the sycl_ext_oneapi_matrix -extension. There are additional specifics about the supported layouts -that enable extra performance and functionality listed in this -document. +`joint_matrix_mad`, `joint_matrix_fill`, `joint_matrix_apply`, and the +query interface, as they are defined in the sycl_ext_oneapi_matrix +extension. Besides element-wise operations with mapping information, +there are additional specifics about the supported layouts that enable +extra performance and functionality listed in this document. This extension presents some supplementary Intel AMX and Intel XMX features not contained within the sycl_ext_oneapi_matrix extension. The additional features are built on top of the @@ -75,11 +75,11 @@ AMX and Intel XMX backends. This extension provides a feature-test macro as described in the core SYCL specification. 
An implementation supporting this extension must
-predefine the macro `SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below.
-Applications can test for the existence of this macro to determine if the
-implementation supports this feature, or applications can test the macro's
-value to determine which of the extension's APIs the implementation
-supports.
+predefine the macro `SYCL_EXT_INTEL_MATRIX` to one of the values
+defined in the table below. Applications can test for the existence of
+this macro to determine if the implementation supports this feature,
+or applications can test the macro's value to determine which of the
+extension's APIs the implementation supports.

[%header,cols="1,5"]
|===
@@ -213,7 +213,7 @@ order to reason about the matrix view and extract the relevant piece.
However, for element-wise operations where the same operation is
performed on all the elements of the matrix, having all the WIs in the
group apply the operation inside a loop iterating over the
-`length` of `wi_data` guarantees the whole matrix element-wise operation.
+`length` of `wi_data` guarantees the whole matrix element-wise operation.

Note that `get_wi_data` cannot return a fixed size array length
because the length of the WI portion is a runtime variable for the
@@ -248,7 +248,7 @@ class wi_element {
   wi_element &operator*=(const T &rhs);
   wi_element &operator/=(const T &rhs);
-  std::tuple get_coord();
+  std::tuple get_coord();
};
}
```
In the following example `wi_data_c` is a reference to the WI owned
portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs`
updates the corresponding matrix element in the joint_matrix
`matC`. Vectorization along the sub group dimension will get enabled
-automatically to vectorize the contiguous portion of the matrix.
+automatically to vectorize the contiguous portion of the matrix.
```c++ auto wi_data_c = get_wi_data(sg, matC); for (int i = 0; i < wi_data_c.length(); i++) wi_data_c[i] *= alpha; // Note that the indexing here "i" - is in the vector owned by a WI, not in the matrix C + is in the vector owned by a WI, not in the matrix C ``` IMPORTANT: In the current implementation, only the `sub_group` scope @@ -287,7 +287,7 @@ auto data = get_wi_data(sg, tA); for (int i = 0; i < data.length(); ++i) { auto [row, col] = data[i].get_coord(); sum_of_local_rows[row] += data[i]; -} +} ``` IMPORTANT: `get_coord` is not implemented yet. @@ -314,7 +314,7 @@ for a 16-bit type. // a3, b3, c3, d3 // a4, b4, c4, d4 // --------------------------------- - // The same matrix reformatted in packed layout. + // The same matrix reformatted in packed layout. // Here, packing of 2 elements is needed to form 32 bits. // Element a1 is contiguous in memory with element a2, etc. // --------------------------------- @@ -332,7 +332,7 @@ for a 16-bit type. // a3, b3, c3, d3 // a4, b4, c4, d4 // --------------------------------- - // The same matrix reformatted in packed layout. + // The same matrix reformatted in packed layout. // Here, packing of 4 elements is needed to form 32 bits. // Elements a1, a2, a3, a4 are contiguous in memory, etc. 
// --------------------------------- @@ -348,7 +348,7 @@ range<2> L = {1, SG_SIZE}; int8_t *memA = malloc_shared(M*K, q); int8_t *memB = malloc_shared(K*N, q); int32_t *memC = malloc_shared(M*N, q); -q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) +q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) [[sycl::reqd_sub_group_size(SG_SIZE)]] { const auto global_idx = item.get_global_id(0); const auto global_idy = item.get_global_id(1); @@ -366,12 +366,12 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) sg_startx * tM * K + k, K); joint_matrix_load(sg, tB, multi_ptr(memB) + - k * N*4 + sg_starty/SG_SIZE*tN*4, N*4); + k * N*4 + sg_starty/SG_SIZE*tN*4, N*4); tC = joint_matrix_mad(sg, tA, tB, tC); } auto wi_data_c = ext::intel::experimental::matrix::get_wi_data(sg, tC); for (int i = 0; i < wi_data_c.length(); i++) - wi_data_c[i] *= alpha; + wi_data_c[i] *= alpha; joint_matrix_store(sg, tC, multi_ptr(memC) + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 249ed2e602b05..072e752cb0325 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -56,11 +56,12 @@ The joint_matrix type is an optional kernel feature as defined in section 5.7 of the core SYCL specification. Each device supports only certain values for the `Rows` and `Cols` template parameters and only certain types for the `T` template parameter. Applications can -use the query API `tpu_params` to determine the set of legal -parameters for each device. If the application submits a kernel using +use the query API in `tpu_params` or +`get_info` to determine the set of +legal parameters for each device. 
If the application submits a kernel using an unsupported `joint_matrix` parameter, the implementation throws a synchronous exception with the `errc::kernel_not_supported` error code -as described in section 5.7. +as described in section 5.7. == Overview Joint matrix is a SYCL extension for matrix hardware programming. It @@ -114,7 +115,7 @@ to `joint_matrix` with `use::accumulator` ==== Use The main operation performed by the matrix hardware is `D=C+A*B`. `Use` argument specifies the usage of the matrix: matrix left (`A`), matrix -right (`B`) or accumulator +(`C`)+ and `D`. This is required by backend +right (`B`) or accumulator (`C`) and `D`. This is required by backend implementations to reason about the layout of the matrix in registers. ```c++ @@ -173,10 +174,10 @@ functions will be called once by each work item in the group. To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these -functions are collective operations. The {dpcpp} syntax is the following: +functions are collective operations. The {dpcpp} syntax is the following: IMPORTANT: In the current implementation, only the `sub_group` scope -is supported. +is supported. ==== Load ```c++ @@ -213,7 +214,7 @@ matrix to be loaded from. `Layout` determines whether the data is being read in a row (`row_major`), column major (`column_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major -layout. +layout. ==== Store @@ -261,7 +262,7 @@ Unlike `joint_matrix_load` that assumes that all the matrices are directly loaded from memory, `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. 
On Intel AMX, if the -initialization constant is zero, this would map to the `_tile_zero` intrinsic: +initialization constant is zero, this would map to the `_tile_zero` intrinsic: ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -272,7 +273,7 @@ namespace sycl::ext::oneapi::experimental::matrix { } ``` IMPORTANT: In the current implementation, only the `sub_group` scope -is supported. +is supported. ==== Element-Wise Operations Besides matrix multiply and add, this extension aims to make it @@ -289,7 +290,7 @@ of the * link:sycl_ext_intel_matrix.asciidoc[sycl_ext_intel_matrix] Besides the `Group` and the `joint_matrix` argument, `joint_matrix_apply` takes a lambda expression as an argument that specifies the specific operation on each of the elements of the input -matrix. +matrix. ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -302,14 +303,14 @@ namespace sycl::ext::oneapi::experimental::matrix { In the following example, every element of the matrix `C` is multiplied by `alpha`. Then, an activation function, `relu` in this -example, is applied on each of the elements of `C`. +example, is applied on each of the elements of `C`. ```c++ -joint_matrix_apply(sg, C, [=](T x) { - x *= alpha; - relu(x); +joint_matrix_apply(sg, C, [=](T x) { + x *= alpha; + relu(x); }); - +``` IMPORTANT: `joint_matrix_apply` is not implemented yet. === Joint Matrix Additional Types @@ -319,13 +320,13 @@ such as tf32. tf32 type has a 19 bit format with one sign bit, 8 exponent bits offering the same range as fp32, and 10 mantissa bits offering same precision as half type. The usage of tf32 type is restricted to `joint_matrix` using: -`sycl::ext::oneapi::experimental::matrix::precision::tf32`. +`sycl::ext::oneapi::experimental::matrix::precision::tf32`. -Joint matrix type tf32 is defined as an empty class with no member functions. +Joint matrix type tf32 is defined as an empty class with no member functions. 
```c++ namespace precision { class tf32; -} +} ``` Besides the type, one conversion function is added: `round_to_tf32` that performs the rounding to tf32. @@ -351,7 +352,7 @@ range<2> L = {1, SG_SIZE}; int8_t *memA = malloc_shared(M*K, q); int8_t *memB = malloc_shared(K*N, q); int32_t *memC = malloc_shared(M*N, q); -q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) +q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) [[sycl::reqd_sub_group_size(SG_SIZE)]] { const auto global_idx = item.get_global_id(0); const auto global_idy = item.get_global_id(1); @@ -365,14 +366,14 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) for (int k = 0; k < K; k += tK) { joint_matrix_load(sg, tA, multi_ptr(memA) + - sg_startx * tM * K + k, K); + sg_startx * tM * K + k, K); joint_matrix_load(sg, tB, multi_ptr(memB) + - k * N + sg_starty/SG_SIZE*tN, N); + k * N + sg_starty/SG_SIZE*tN, N); tC = joint_matrix_mad(sg, tA, tB, tC); } - joint_matrix_apply(sg, tC, [=](int8_t x) { - x *= alpha; + joint_matrix_apply(sg, tC, [=](int8_t x) { + x *= alpha; }); joint_matrix_store(sg, tC, multi_ptr(memC) + @@ -381,16 +382,19 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) ``` === Query Interface -Intel AMX, Intel XMX and Nvidia matrix hardware support different +Intel AMX, Intel XMX and Nvidia Tensor Cores matrix hardware support different sizes and types (see Appendix: Supported Combinations Per Hardware). The query interface is used to validate user code and inform them about supported types, sizes, scope, and layouts by the -implementation. This also offers development and tuning productivity by both -scientists and library developers. The query interface we are -proposing here is a compile-time query, so there will be no runtime -errors. +implementation. This also offers development and tuning productivity +by both scientists and library developers. We provide two types of the +query interface: compile-time query and runtime query. 
-The query interface proposed here consists of three functionalities: +==== Compile-Time Query +This returns `constexpr` values to use in `joint_matrix` template +arguments but depends on an enumeration of the matrix hardware that +can be tested. The compile-time query interface proposed here +consists of two functionalities: - Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes @@ -403,88 +407,35 @@ The query interface proposed here consists of three functionalities: needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K. -- General query: the general query interface provides information - about sizes, types, and scopes that are supported by a specific TPU - implementation. This is needed to avoid padding by the user, for - tuning, and efficient code generation if used by a library. The - general query returns an array of `combinations` of `combination` - type. Each combination includes the sizes and the types for the - matrices A, B, and accumulator. Note that for each TPU, the query - returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` - exclusively, depending on whether the implementation supports a - continuous or discrete number of sizes. For example, the Intel AMX - implementation supports a continuous number of sizes, so the `max_*` - variant is applied and only the maximum number is returned. The - Intel XMX implementation, on the other hand, supports a discrete - list of numbers so the `msize, nsize, ksize` variant is applied. - This form takes place when users only specify the TPU they are - interested in using. - The table below provides a description for each of the member -variables and type aliases in `tpu_params` class and the forms in -which they are defined. +variables in `matrix_params` class and the forms in which they are +defined. 
[frame="none",options="header"] |====================== -| Member/type alias in `tpu_params` | Forms they are defined in |Description -|`type_a`| validation, default values|type alias for the type of matrix A -|`type_b`| validation, default values|type alias for the type of matrix B -|`type_accumulator`| validation, default values|type alias for the -type of matrix accumulator -|`M`| validation, default values|when no sizes are provided by the -user, indicates the suggested default size for M; usually this -corresponds to the maximum size the implementation supports. In -validation mode, where the user does provide sizes, this is the same -value M that the user provides if M is supported by the implementation -|`N`| validation, default values|when no sizes are provided by the -user, indicates the suggested default size for N; usually this -corresponds to the maximum size the implementation supports. In -validation mode, where the user does provide sizes, this is the same -value N that the user provides if N is supported by the implementation -|`K`| validation, default values|when no sizes are provided by the -user, indicates the suggested default size for K; usually this -corresponds to the maximum size the implementation supports. 
In -validation mode, where the user does provide sizes, this is the same -value K that the user provides if K is supported by the implementation -|`joint_matrix_a`| validation, default values|type alias for -`joint_matrix` for matrix A -|`joint_matrix_b`| validation, default values| type alias for -`joint_matrix` for matrix B -|`joint_matrix_accumulator`| validation, default values| type alias -for `joint_matrix` for matrix accumulator -|numtiles| validation, default values, general query|indicates number -of tiles in Intel AMX (does not apply to Intel XMX) -|scopes| validation, default values, general query| indicates the -memory and execution scopes supported by the TPU implementation -|`combination` | validation, default values, general query|composes -the types and sizes of A, B, accumulator matrices allowed in one combination -|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, -general query| if the TPU implementation supports a continuous number -of element sizes, each of these members is non-zero, and the TPU -implementation supports all element sizes from 1 up to (and including) -that number. By contrast, if the TPU implementation supports a -discrete number of element sizes, each of these members has the value zero -|`msize`, `nsize`, `ksize`| validation, default values, general -query| if the TPU implementation supports a discrete number of element -sizes, each of these members is non-zero, and the value tells one of -the supported element sizes. By contrast, if the TPU supports a -continuous number of element sizes, each of these members has the value zero -|`atype`, `btype`, `accumulatortype`| validation, default values, -general query| indicates the types supported in the combination -|`combinations` | validation, default values, general query| tells -the set of supported matrix sizes and types according to the template -parameters that are provided. 
In the "general query" form, the user -provides only the TPU type, so the combinations array contains all -supported tile sizes and element types for that TPU. In the "default -values" form, the user provides the TPU type and element types, so the -combinations array contains only those supported matrix sizes and -element types that match those element types on that TPU. In the -"validation" form, the user provides the TPU type, element types, and -element sizes so only this specific combination is returned in the -combinations array. -|`num_combinations`| validation, default values, general -query|indicates number of combinations supported by the TPU -implementation which corresponds to the size of the `combinations` array +| Member/type alias in `tpu_params` | Description +|`type_a`| type alias for the type of matrix A +|`type_b`| type alias for the type of matrix B +|`type_accumulator`| type alias for the type of matrix accumulator +|`M`|when no sizes are provided by the user, indicates the suggested +default size for M; usually this corresponds to the maximum size the +implementation supports. In validation mode, where the user does +provide sizes, this is the same value M that the user provides if M is +supported by the implementation +|`N`|when no sizes are provided by the user, indicates the suggested +default size for N; usually this corresponds to the maximum size the +implementation supports. In validation mode, where the user does +provide sizes, this is the same value N that the user provides if N is +supported by the implementation +|`K`| when no sizes are provided by the user, indicates the suggested +default size for K; usually this corresponds to the maximum size the +implementation supports. 
In validation mode, where the user does +provide sizes, this is the same value K that the user provides if K is +supported by the implementation +|`joint_matrix_a`| type alias for `joint_matrix` for matrix A +|`joint_matrix_b`| type alias for `joint_matrix` for matrix B | +`joint_matrix_accumulator`| type alias for `joint_matrix` for matrix +accumulator |====================== ```c++ @@ -495,7 +446,8 @@ struct tpu_params; // Validation form: Valid or not // Specialization when both types and sizes are given -template +template struct tpu_params< tpu::amx, Ta, Tb, Tc, sM, sN, sK, typename std::enable_if<( @@ -506,11 +458,8 @@ struct tpu_params< (sM == 0 && sN == 0 && sK == 0) || (is_combination_valid_amx(sM, sN, sK)), "Invalid parameters for Intel AMX, query valid types and maximum sizes " - "using: " - "tpu_params myparams; and then check out - myparams.combinations array"); - - + "using: dev.get_info(); and + then check out matrix::combinations array"); using type_a = Ta; // this type alias is not available in the current implementation using type_b = Tb; // this type alias is not available in the @@ -534,27 +483,6 @@ struct tpu_params< template using joint_matrix_accumulator = joint_matrix; - - static constexpr uint32_t numtiles = 8; - static constexpr scope_t scopes[] = {scope_t::sub_group}; - static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); - struct combination { - uint32_t max_msize; - uint32_t max_nsize; - uint32_t max_ksize; - uint32_t msize; - uint32_t nsize; - uint32_t ksize; - matrix_type atype; - matrix_type btype; - matrix_type accumulatortype; - }; - // In this case, the combinations array contains only the - combination that the user provided - static constexpr combination combinations[] = { - {16, 16, (sizeof(Ta) == 1) ? 
64 : 32, sM, sN, sK}}; - static constexpr int num_combinations = - sizeof(combinations) / sizeof(combination); }; // Default values form: Sizes-only query @@ -566,9 +494,7 @@ struct tpu_params)>::type> { static_assert((are_types_valid_amx()), "Invalid types for Intel AMX, supported types are - int8_t, uint8_t, " - "and bf16 (Note that unsigned short should be used in the" - "DPC++ code to implement bf16) "); + int8_t, uint8_t, and bfloat16) "); using type_a = Ta; // this type alias is not available in the current implementation @@ -589,107 +515,24 @@ struct tpu_params using joint_matrix_accumulator = joint_matrix; - - static constexpr uint32_t numtiles = 8; - static constexpr scope_t scopes[] = {scope_t::sub_group}; - static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); - struct combination { - uint32_t max_msize; - uint32_t max_nsize; - uint32_t max_ksize; - uint32_t msize; - uint32_t nsize; - uint32_t ksize; - matrix_type atype; - matrix_type btype; - matrix_type accumulatortype; - }; - // In this case, the combinations array contain only the - combinations that correspond to the Ta, Tb, and Tc - // types that the user provided - static constexpr combination combinations[] = { - {16, 16, (sizeof(Ta) == 1) ? 
64 : 32}}; - static constexpr int num_combinations = - sizeof(combinations) / sizeof(combination); }; - -// General query form: -// types are not given, no default sizes and no implicit matrix construction -template -struct tpu_params { - static constexpr uint32_t numtiles = 8; - static constexpr scope_t scopes[] = {scope_t::sub_group}; - static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); - struct combination { - uint32_t max_msize; - uint32_t max_nsize; - uint32_t max_ksize; - uint32_t msize; - uint32_t nsize; - uint32_t ksize; - matrix_type atype; - matrix_type btype; - matrix_type accumulatortype; - }; - - static constexpr combination combinations[] = { - {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, - matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, - matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, - matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, - matrix_type::sint32}, - {16, 16, 32, 0, 0,0, matrix_type::bf16, matrix_type::bf16, - matrix_type::fp32}}; - static constexpr int num_combinations = - sizeof(combinations) / sizeof(combination); -}; - enum class tpu { xmx8, xmx16, amx }; - -enum class matrix_type { - bf16, - fp16, - tf32, - fp32, - fp64, - sint2, - sint4, - sint8, - sint16, - sint32, - sint64, - uint2, - uint4, - uint8, - uint16, - uint32, - uint64 -}; - -enum class scope_t { - sub_group, - work_group -}; -} ``` -==== Validation Example: +===== Validation Example: ```c++ // User can provide sizes besides the types and tpu_params can assert if they are supported or not // in this case, an assertion will happens as 16 is not a supported size for M -using myparams = tpu_params; +using myparams = tpu_params; size_t NDRangeM = M / myparams::M; //Assertion would happen at this line size_t NDRangeN = N / myparams::N; ``` -==== Default Values Example: +===== Default Values Example: ```c++ using myparams = 
tpu_params_both; // use this to construct the ranges on the host side
@@ -701,18 +544,108 @@ myparams::joint_matrix_a sub_a;
myparams::joint_matrix_b sub_b;
myparams::joint_matrix_accumulator sub_c;
+```
+==== Runtime Query
+This provides a more general query interface with information about
+sizes, types, and scopes that are supported by a specific matrix
+implementation. This is needed to avoid padding by the user, for
+tuning, and efficient code generation if used by a library. The
+general query returns an array of `combinations` of `combination`
+type. Each combination includes the sizes and the types for the
+matrices A, B, and accumulator. Note that for each matrix hardware,
+the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize,
+ksize` exclusively, depending on whether the implementation supports a
+continuous or discrete number of sizes. For example, the Intel AMX
+implementation supports a continuous number of sizes, so the `max_*`
+variant is applied and only the maximum number is returned. The Intel
+XMX implementation, on the other hand, supports a discrete list of
+numbers, so the `msize, nsize, ksize` variant is applied.
+
+The table below provides a description for each of the device matrix
+descriptors that can be queried using the `get_info` API.
+
+[frame="none",options="header"]
+|======================
+| Device descriptors | Return type| Description
+|`ext::oneapi::experimental::info::device::matrix::numtiles`| `uint32_t`
+|indicates the number of tiles in Intel AMX (does not apply to Intel XMX)
+|`ext::oneapi::experimental::info::device::matrix::scopes`
+|`std::vector<scope_t>`
+|indicates the memory and execution scopes supported by the matrix
+implementation
+|`ext::oneapi::experimental::info::device::matrix::num_scopes`|`uint32_t`
+|indicates the number of scopes supported by the matrix implementation
+|`ext::oneapi::experimental::info::device::matrix::combinations` |
+`std::vector<combination>`| tells the set of supported matrix sizes
+and types
+|`ext::oneapi::experimental::info::device::matrix::num_combinations`|`uint32_t`
+|indicates the number of combinations supported by the matrix
+implementation, which corresponds to the size of the `combinations` array
+|`combination`
+|`combination`
+|composes the types and sizes of A, B, accumulator matrices allowed in
+one combination
+|`max_msize`, `max_nsize`, `max_ksize`| `uint32_t`| if the matrix
+implementation supports a continuous number of element sizes, each of
+these members is non-zero, and the matrix implementation supports all
+element sizes from 1 up to (and including) that number. By contrast,
+if the matrix implementation supports a discrete number of element
+sizes, each of these members has the value zero
+|`msize`, `nsize`, `ksize`| `uint32_t`| if the matrix implementation
+supports a discrete number of element sizes, each of these members is
+non-zero, and the value tells one of the supported element sizes.
By +contrast, if the matrix hardware supports a continuous number of +element sizes, each of these members has the value zero +|`atype`, `btype`, `accumulatortype`| `matrix_type` | indicates the +types supported in the combination +|====================== + +```c++ +enum class scope_t { + sub_group, + work_group +}; +enum class matrix_type { + bf16, + fp16, + tf32, + fp32, + fp64, + sint2, + sint4, + sint8, + sint16, + sint32, + sint64, + uint2, + uint4, + uint8, + uint16, + uint32, + uint64 +}; +struct combination { + uint32_t max_msize; + uint32_t max_nsize; + uint32_t max_ksize; + uint32_t msize; + uint32_t nsize; + uint32_t ksize; + matrix_type atype; + matrix_type btype; + matrix_type accumulatortype; +}; ``` -==== General Query Example: +===== General Query Example: ```c++ constexpr int M = 1500; // with msize = 8 and msize = 4, // M can be broken up to 125 sequence of 8-sized ops and - remaining 500 using 125 sequence of 4-sized ops -tpu_params params; -constexpr int msize = break_dimension(params, M); -constexpr int msize_remainder = break_dimension_remainder(params, M); -constexpr int nsize = params.combinations[0].nsize; -constexpr int ksize = params.combinations[0].ksize; + // remaining 500 using 125 sequence of 4-sized ops +auto combinations = device.get_info(); + +constexpr int {msize, nsize, ksize} = break_dimension(combinations, M); +constexpr int msize_remainder = break_dimension_remainder(combinations, M); // device code: joint_matrix sub_a; joint_matrix sub_b; @@ -768,5 +701,5 @@ implementation portable across Intel AMX, Intel XMX and Nvidia Tensor Cores, and move the Intel-specifics to a separate extension document. |6 |2023-01-09 |Dounia Khaldi |Add `joint_matrix_apply` API, tf32 -type, and supported combinations appendix. +type, runtime query, and supported combinations appendix. 
|====================== From 20c09c9e6a4c6db2d0ca5d95f76a7ac3b1dc1c10 Mon Sep 17 00:00:00 2001 From: Dounia Date: Tue, 14 Feb 2023 10:42:09 -0800 Subject: [PATCH 16/51] reword the optional device feature checking --- .../sycl_ext_oneapi_matrix.asciidoc | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 072e752cb0325..03a36e092ea63 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -52,14 +52,14 @@ that contain a matrix hardware, specifically Intel(R) Advanced Matrix Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) XMX) and Nvidia(R) Tensor Cores. -The joint_matrix type is an optional kernel feature as defined +The `joint_matrix_mad` function is an optional kernel feature as defined in section 5.7 of the core SYCL specification. Each device supports -only certain values for the `Rows` and `Cols` template parameters and -only certain types for the `T` template parameter. Applications can -use the query API in `tpu_params` or +only certain values for the `M`, `N`, and `K` template parameters and +only certain types for the `Ta`, `Tb`, and `Tc` template parameters. +Applications can use the query API in `tpu_params` or `get_info` to determine the set of legal parameters for each device. If the application submits a kernel using -an unsupported `joint_matrix` parameter, the implementation throws a +an unsupported `joint_matrix_mad` combination, the implementation throws a synchronous exception with the `errc::kernel_not_supported` error code as described in section 5.7. 
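The patch above makes `joint_matrix_mad` an optional kernel feature: submitting a kernel with an unsupported combination raises a synchronous exception with `errc::kernel_not_supported`. One shape-checking guard an application might run on the host first is sketched below. The `combination` struct mirrors the runtime-query type defined earlier; the helper name `shape_is_supported` and the sample values in the usage note are hypothetical, since the real data comes from the device query.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the runtime query's `combination` struct: for continuous-size
// hardware (e.g. Intel AMX) the max_* members are non-zero; for
// discrete-size hardware (e.g. Intel XMX) the exact-size members are.
struct combination {
  uint32_t max_msize, max_nsize, max_ksize;
  uint32_t msize, nsize, ksize;
};

// Returns true if the requested (M, N, K) shape matches any reported
// combination, so submitting the kernel would not throw.
bool shape_is_supported(const std::vector<combination> &combos, uint32_t M,
                        uint32_t N, uint32_t K) {
  for (const auto &c : combos) {
    if (c.max_msize != 0) {
      // Continuous case: every size from 1 up to the maximum is legal.
      if (M <= c.max_msize && N <= c.max_nsize && K <= c.max_ksize)
        return true;
    } else if (M == c.msize && N == c.nsize && K == c.ksize) {
      return true; // Discrete case: only an exact match is legal.
    }
  }
  return false; // Submitting anyway would raise errc::kernel_not_supported.
}
```

For example, against an AMX-like entry `{16, 16, 64, 0, 0, 0}` a requested shape of (8, 16, 32) passes, while against an XMX-like entry `{0, 0, 0, 8, 16, 32}` only the exact shape (8, 16, 32) does.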
From a7494c822ca0087ae99639d211742a8c093953ae Mon Sep 17 00:00:00 2001 From: Dounia Date: Tue, 28 Feb 2023 12:55:28 -0800 Subject: [PATCH 17/51] Address Greg's comments --- .../sycl_ext_intel_matrix.asciidoc | 10 + .../sycl_ext_oneapi_matrix.asciidoc | 216 ++++++++++-------- 2 files changed, 133 insertions(+), 93 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index fef685415ec25..6e0e3d76b7f51 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -378,6 +378,16 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) }).wait(); ``` +=== Intel-Specific Runtime Query +Besides the query we provide in ../experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], some device descriptors are Intel hardware specific. 
These are provided as part of `ext::intel::experimental::info::device::matrix` namespace:
+
+[frame="none",options="header"]
+|======================
+| Device descriptors | Return type| Description
+|`ext::intel::experimental::info::device::matrix::numtiles`| `uint32_t`
+|indicates number of tiles in Intel AMX (does not apply to Intel XMX)
+|======================
+
 == Revision History
 
 [frame="none",options="header"]
diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
index 03a36e092ea63..b72da3317ace9 100644
--- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
+++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc
@@ -1,4 +1,4 @@
-sycl_ext_oneapi_matrix
+= sycl_ext_oneapi_matrix
 :source-highlighter: coderay
 :coderay-linenums-mode: table
@@ -20,7 +20,7 @@ sycl_ext_oneapi_matrix
 == Notice
 [%hardbreaks]
-Copyright (c) 2021-2022 Intel Corporation. All rights reserved.
+Copyright (c) 2021-2023 Intel Corporation. All rights reserved.
 Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are
 trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of
 Apple Inc. used by
@@ -66,7 +66,7 @@ as described in section 5.7.
 == Overview
 Joint matrix is a SYCL extension for matrix hardware programming. It
 unifies targets like Intel AMX in CPUs, Intel XMX in Intel GPUs and
-Nvidia Tensor Cores. This provides portable but performant API for
+Nvidia Tensor Cores. This provides a portable and performant API for
 users who want to build their own neural network applications,
 perform custom optimizations, or experiment with new operations in a
 timely and performant manner.
@@ -102,15 +102,15 @@ the following description: ```c++ namespace sycl::ext::oneapi::experimental::matrix { template -struct joint_matrix { - joint_matrix() {} -}; + layout Layout = (Use == use::accumulator) ? + layout::dynamic : /*unspecified*/ > +struct joint_matrix; } ``` - -IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only -to `joint_matrix` with `use::accumulator` +When the `Use` parameter is `use::accumulator`, the `Layout` parameter +defaults to `layout::dynamic`, and it is invalid to specify any other +value for `Layout`. When `Use` has any other value, there is no default +for `Layout`, and the application must specify one explicitly. ==== Use The main operation performed by the matrix hardware is `D=C+A*B`. `Use` @@ -183,18 +183,18 @@ is supported. ```c++ namespace sycl::ext::oneapi::experimental::matrix { template void joint_matrix_load(Group sg, - joint_matrix &res, multi_ptr src, size_t stride, layout Layout); - + template void joint_matrix_load(Group sg, - joint_matrix &res, + joint_matrix &res, multi_ptr src, size_t stride); } ``` @@ -202,7 +202,7 @@ namespace sycl::ext::oneapi::experimental::matrix { `joint_matrix_load` loads data from memory to the 2d tiles/registers of the matrix hardware. We define two overloads of the load function depending on whether the -memory layout was declared as part of the `joint_matrix` type or not. +memory layout was declared as part of the `joint_matrix` type or not. The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that used the default value `layout::dynamic`. @@ -220,10 +220,10 @@ layout. ==== Store ```c++ namespace sycl::ext::oneapi::experimental::matrix { - template void joint_matrix_store(Group sg, - joint_matrix &res, multi_ptr dest, size_t stride, layout Layout); } @@ -235,7 +235,7 @@ The base pointer `dest` here determines the starting address of the matrix to be stored. 
`Layout` determines whether the data is being written in a row (`row_major`), column major (`column_major`) fashion. `stride` describes the number of elements between consecutive -rows for the row major layout, or between columns for the column major layout. +rows for the row major layout, or between columns for the column major layout. ==== Multiply and Add @@ -243,7 +243,7 @@ rows for the row major layout, or between columns for the column major layout. ```c++ namespace sycl::ext::oneapi::experimental::matrix { template joint_matrix joint_matrix_mad(Group sg, @@ -262,14 +262,16 @@ Unlike `joint_matrix_load` that assumes that all the matrices are directly loaded from memory, `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. On Intel AMX, if the -initialization constant is zero, this would map to the `_tile_zero` intrinsic: +initialization constant is zero, this would map to the `_tile_zero` +intrinsic. Note that the value type `Tv` must be convertible to the +matrix elements type `T`. ```c++ namespace sycl::ext::oneapi::experimental::matrix { - template void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); + Rows, Cols, Layout> &m, Tv v); } ``` IMPORTANT: In the current implementation, only the `sub_group` scope @@ -287,17 +289,18 @@ examples of this usage. When the operation depends on the element index of the matrix, an Intel-specific extension is available as part of the * link:sycl_ext_intel_matrix.asciidoc[sycl_ext_intel_matrix] -Besides the `Group` and the `joint_matrix` argument, -`joint_matrix_apply` takes a lambda expression as an argument that -specifies the specific operation on each of the elements of the input -matrix. +Besides the `Group` and the `joint_matrix` arguments, +`joint_matrix_apply` takes a C++ Callable object which is invoked once +for each element of the matrix. 
This callable object must be invocable +with a single parameter of type `T&`. Commonly, applications pass a +lambda expression. ```c++ namespace sycl::ext::oneapi::experimental::matrix { - template - void joint_matrix_apply(Group g, joint_matrixC, F&& lambda); + void joint_matrix_apply(Group g, joint_matrixC, F&& func); } ``` @@ -306,41 +309,55 @@ multiplied by `alpha`. Then, an activation function, `relu` in this example, is applied on each of the elements of `C`. ```c++ -joint_matrix_apply(sg, C, [=](T x) { +joint_matrix_apply(sg, C, [=](T &x) { x *= alpha; relu(x); }); ``` IMPORTANT: `joint_matrix_apply` is not implemented yet. -=== Joint Matrix Additional Types +=== Support for `tf32` Floating Point Type Besides C++ `half`, `float`, `double` types, and `sycl::bfloat16` types, joint matrix implementations may support other low-precision floating-point types -such as tf32. tf32 type has a 19 bit format with one sign bit, 8 -exponent bits offering the same range as fp32, and 10 mantissa bits -offering same precision as half type. The usage of tf32 type is +such as `tf32`. `tf32` type has a 19 bit format with one sign bit, 8 +exponent bits offering the same range as `fp32`, and 10 mantissa bits +offering same precision as half type. The usage of `tf32` type is restricted to `joint_matrix` using: `sycl::ext::oneapi::experimental::matrix::precision::tf32`. -Joint matrix type tf32 is defined as an empty class with no member functions. +Joint matrix type `tf32` is defined as an empty class with no member functions. ```c++ namespace precision { class tf32; } ``` -Besides the type, one conversion function is added: -`round_to_tf32` that performs the rounding to tf32. +In this case, a `tf32` joint matrix type is declared by using the +`precision::tf32` type for the `T` template parameter as follows: + +```c++ +joint_matrix tA; +``` + +The purpose of this support is to reduce the precision of the +`joint_matrix_mad` operation. The rest of the application uses `fp32` +type. 
Specifically, joint matrix load/store/fill perform float type +memory access to/from tf32 joint matrix. Also, the return type of +element-wise accesses of a tf32 `joint_matrix` returns +float. Consequently, general arithmetic is done on `fp32` data. + +Joint matrix APIs manipulate floats. No implicit rounding happens when +users load or store data to/from joint matrices. By default, +`joint_matrix_mad` works on truncated values (13 bits set to zero). If +users want joint matrix data to be actually rounded to `tf32` instead of +truncated, an explicit rounding function should be used. A new function +`round_to_tf32` is added to perform the rounding to `tf32`. ```c++ namespace sycl::ext::oneapi::experimental::matrix { - float round_to_tf32(float &elem); + float round_to_tf32(float elem); } ``` -Joint matrix load/store/fill perform float type memory access to/from -tf32 joint matrix. Also, the return type of element-wise accesses of a -tf32 `joint_matrix` returns float. In this case, general arithmetic is -done on fp32 data. - === Example using int8_t type ```c++ @@ -534,7 +551,7 @@ size_t NDRangeN = N / myparams::N; ===== Default Values Example: ```c++ -using myparams = tpu_params_both; +using myparams = tpu_params; // use this to construct the ranges on the host side size_t NDRangeM = M / myparams::M; size_t NDRangeN = N / myparams::N; @@ -549,57 +566,33 @@ myparams::joint_matrix_accumulator sub_c; This provides a more general query interface with information about sizes, types, and scopes that are supported by a specific matrix implementation. This is needed to avoid padding by the user, for -tuning, and efficient code generation if used by a library. The -general query returns an array of `combinations` of `combination` -type. Each combination includes the sizes and the types for the -matrices A, B, and accumulator. 
Note that for each matrix hardware, -the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, -ksize` exclusively, depending on whether the implementation supports a -continuous or discrete number of sizes. For example, the Intel AMX -implementation supports a continuous number of sizes, so the `max_*` -variant is applied and only the maximum number is returned. The Intel -XMX implementation, on the other hand, supports a discrete list of -numbers so the `msize, nsize, ksize` variant is applied. +tuning, and efficient code generation if used by a library. The table below provides a description for each of the device matrix -desciptors that can be queried using `get_info` API. +descriptors that can be queried using `get_info` API. [frame="none",options="header"] |====================== | Device descriptors | Return type| Description -|`ext::oneapi::experimental::info::device::matrix::numtiles`| `uint32_t` -|indicates number of tiles in Intel AMX (does not apply to Intel XMX) |`ext::oneapi::experimental::info::device::matrix::scopes` |`std::vector` -|indicates the memory and execution scopes supported by the matrix -implementation -|`ext::oneapi::experimental::info::device::matrix::num_scopes`|`uint32_t` -|indicates number of scopes supported by the matrix implementation +|tells the execution scopes that are supported by the joint matrix +operations on this device |`ext::oneapi::experimental::info::device::matrix::combinations` | `std::vector`| tells the set of supported matrix sizes -and types -|`ext::oneapi::experimental::info::device::matrix::num_combinations`|`uint32_t` -|indicates number of combinations supported by the matrix -implementation which corresponds to the size of the `combinations` array -|`combination` -|`combination` -|composes the types and sizes of A, B, accumulator matrices allowed in -one combination -|`max_msize`, `max_nsize`, `max_ksize`| `uint32_t`| if the matrix -implementation supports a continuous number of element sizes, 
each of
-these members is non-zero, and the matrix implementation supports all
-element sizes from 1 up to (and including) that number. By contrast,
-if the TPU implementation supports a discrete number of element
-sizes, each of these members has the value zero
-|`msize`, `nsize`, `ksize`| `uint32_t`| if the matrix implementation
-supports a discrete number of element sizes, each of these members is
-non-zero, and the value tells one of the supported element sizes. By
-contrast, if the matrix hardware supports a continuous number of
-element sizes, each of these members has the value zero
-|`atype`, `btype`, `accumulatortype`| `matrix_type` | indicates the
-types supported in the combination
+and types on this device
 |======================
 
+The general query returns a vector of `combinations` of `combination`
+type. Each combination includes the sizes and the types for the
+matrices A, B, and accumulator. Note that for each matrix hardware,
+the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize,
+ksize` exclusively, depending on whether the implementation supports a
+continuous or discrete number of sizes. If a device supports a
+continuous number of sizes, the `max_*` variant is applied and only
+the maximum number is returned. However, if a device supports a
+discrete list of sizes, the `msize, nsize, ksize` variant is applied.
+
 ```c++
 enum class scope_t {
   sub_group,
   work_group
 };
@@ -611,14 +604,10 @@ enum class matrix_type {
   tf32,
   fp32,
   fp64,
-  sint2,
-  sint4,
   sint8,
   sint16,
   sint32,
   sint64,
-  uint2,
-  uint4,
   uint8,
   uint16,
   uint32,
@@ -637,6 +626,44 @@ struct combination {
 };
 ```
 
+Each combination of the `combinations` vector composes the types and
+sizes of A, B, accumulator matrices supported by the device
+implementation. The
+table below provides a description of each member of the `combination` struct.
+
+[frame="none",options="header"]
+|======================
+| Member of `combination` | Description
+|`max_msize`, `max_nsize`, `max_ksize`| if the matrix implementation
+supports a continuous number of element sizes, each of these members
+is non-zero, and the matrix implementation supports all element sizes
+from 1 up to (and including) that number. By contrast, if the TPU
+implementation supports a discrete number of element sizes, each of
+these members has the value zero
+|`msize`, `nsize`, `ksize`| if the matrix implementation supports a
+discrete number of element sizes, each of these members is non-zero,
+and the value tells one of the supported element sizes. By contrast,
+if the matrix hardware supports a continuous number of element sizes,
+each of these members has the value zero
+|`atype`, `btype`, `accumulatortype`| indicates the types supported in
+the combination. These are of type `matrix_type`, which tells the list
+of types that are supported for the A, B, and accumulator matrices in
+the `T` template parameter as follows:
+
+`bf16`: `sycl::bfloat16` +
+`fp16`: `sycl::half` +
+`tf32`: `sycl::ext::oneapi::experimental::matrix::precision::tf32` +
+`fp32`: `float` +
+`fp64`: `double` +
+`sint8`: 8-bit signed integer +
+`sint16`: `signed short` +
+`sint32`: `signed int` +
+`sint64`: `signed long` +
+`uint8`: 8-bit unsigned integer +
+`uint16`: `unsigned short` +
+`uint32`: `unsigned int` +
+`uint64`: `unsigned long`
+|======================
+
 ===== General Query Example:
 ```c++
 constexpr int M = 1500; // with msize = 8 and msize = 4,
@@ -644,12 +671,15 @@ constexpr int M = 1500; // with msize = 8 and msize = 4,
 // remaining 500 using 125 sequence of 4-sized ops
 auto combinations = device.get_info();
 
-constexpr int {msize, nsize, ksize} = break_dimension(combinations, M);
-constexpr int msize_remainder = break_dimension_remainder(combinations, M);
+int {msize, nsize, ksize} = break_dimension(combinations, M);
+int msize_remainder =
break_dimension_remainder(combinations, M); // device code: -joint_matrix sub_a; -joint_matrix sub_b; -joint_matrix sub_c; + +//joint_matrix sub_a; +//joint_matrix sub_b; +//joint_matrix sub_c; //Remainder handling ``` From 71595914786f3ce74d7f00b6155aa31737f2ca92 Mon Sep 17 00:00:00 2001 From: Dounia Date: Tue, 28 Feb 2023 13:46:18 -0800 Subject: [PATCH 18/51] Incorporate the last batch of Greg's comments --- .../sycl_ext_intel_matrix.asciidoc | 31 +++-- .../sycl_ext_oneapi_matrix.asciidoc | 121 +++++++++++------- 2 files changed, 95 insertions(+), 57 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 6e0e3d76b7f51..b12d2c148deec 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -101,10 +101,12 @@ the following section. ```c++ namespace sycl::ext::intel::experimental::matrix { + enum class layout { packed }; -} + +} // namespace sycl::ext::intel::experimental::matrix ``` @@ -137,13 +139,16 @@ store on matrix `a` and `b` as well. ```c++ namespace sycl::ext::intel::experimental::matrix { - template - void joint_matrix_store(Group sg, + +template + void joint_matrix_store(Group g, joint_matrix &res, multi_ptr src, size_t stride); -} + +} // namespace sycl::ext::intel::experimental::matrix ``` ==== Element Indexing and Piece-Wise Operations @@ -228,8 +233,9 @@ The code listing below shows a synopsis of these new APIs. 
```c++ namespace sycl::ext::intel::experimental::matrix { - wi_data get_wi_data(Group sg, - joint_matrix Mat); + +wi_data get_wi_data(Group g, + joint_matrix Mat); template @@ -250,7 +256,8 @@ class wi_element { std::tuple get_coord(); }; -} + +} // namespace sycl::ext::intel::experimental::matrix ``` In the following example `wi_data_c` is a reference to the WI owned @@ -379,7 +386,11 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) ``` === Intel-Specific Runtime Query -Besides the query we provide in ../experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], some device descriptors are Intel hardware specific. These are provided as part of `ext::intel::experimental::info::device::matrix` namespace: +Besides the query we provide in +../experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], +some device descriptors are Intel hardware specific. These are +provided as part of `ext::intel::experimental::info::device::matrix` +namespace: [frame="none",options="header"] |====================== diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index b72da3317ace9..5f73aabdf4450 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -101,11 +101,13 @@ the following description: ```c++ namespace sycl::ext::oneapi::experimental::matrix { + template struct joint_matrix; -} + +} // namespace sycl::ext::oneapi::experimental::matrix ``` When the `Use` parameter is `use::accumulator`, the `Layout` parameter defaults to `layout::dynamic`, and it is invalid to specify any other @@ -120,12 +122,14 @@ implementations to reason about the layout of the matrix in registers. 
```c++
namespace sycl::ext::oneapi::experimental::matrix {
+
 enum class use {
   a,
   b,
   accumulator
 };
+
+} // namespace sycl::ext::oneapi::experimental::matrix
```

==== Shape
The shape of a `joint_matrix` refers to its number of rows `Rows` and
number of columns `Cols`.

==== Layout
-This specifies the memory layout and it can be row major or column major.
+This specifies the memory layout and it can be row major or column
+major. The `dynamic` layout is used on the `joint_matrix` type for the
+`accumulator` matrix, in which case the actual layout is specified on
+the memory operations instead.

```c++
namespace sycl::ext::oneapi::experimental::matrix {
+
 enum class layout {
   row_major,
   col_major,
   dynamic
-  };
-}
+};
+
+} // namespace sycl::ext::oneapi::experimental::matrix
```
@@ -169,12 +178,7 @@ matrix hardware implements new features.

Since the matrix functions are group operations (as defined in Section
4.17.3 of the SYCL specification), the matrix API has to be accessed
by all the work-items in the group in a convergent control flow. The
-`Group` template argument can be a work-group or a sub-group. These
-functions will be called once by each work item in the group.
-
-To be aligned with the SYCL 2020 group algorithms, an additional group
-argument is added to the matrix operations to designate that these
-functions are collective operations. The {dpcpp} syntax is the following:
+`Group` template argument can be a work-group or a sub-group.

IMPORTANT: In the current implementation, only the `sub_group` scope
==== Load ```c++ namespace sycl::ext::oneapi::experimental::matrix { - template - void joint_matrix_load(Group sg, + +template + void joint_matrix_load(Group g, joint_matrix &res, multi_ptr src, size_t stride, layout Layout); - template - void joint_matrix_load(Group sg, +// Only available when Layout != layout::dynamic +template + void joint_matrix_load(Group g, joint_matrix &res, multi_ptr src, size_t stride); -} + +} // namespace sycl::ext::oneapi::experimental::matrix ``` `joint_matrix_load` loads data from memory to the 2d tiles/registers @@ -207,26 +215,28 @@ The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that used the default value `layout::dynamic`. The second overload without a memory layout must not be used with a -`joint_matrix` type that used the default value `layout::dynamic`. +`joint_matrix` type that has `layout::dynamic`. -The base pointer `src` here determines the starting address of the +The base pointer `src` of type `S` here determines the starting address of the matrix to be loaded from. `Layout` determines whether the data is -being read in a row (`row_major`), column major (`column_major`) +being read in a row (`row_major`), column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major -layout. - +layout. Note that the type `S` must be convertible to matrix elements +type `T`. ==== Store ```c++ namespace sycl::ext::oneapi::experimental::matrix { - template - void joint_matrix_store(Group sg, + +template + void joint_matrix_store(Group g, joint_matrix &res, multi_ptr dest, size_t stride, layout Layout); -} + +} // namespace sycl::ext::oneapi::experimental::matrix ``` This function stores the data in the accumulator matrix from the 2d tiles back to memory. @@ -242,15 +252,17 @@ rows for the row major layout, or between columns for the column major layout. 
```c++ namespace sycl::ext::oneapi::experimental::matrix { - template joint_matrix - joint_matrix_mad(Group sg, - joint_matrix A, - joint_matrix B, + joint_matrix_mad(Group g, + joint_matrix A, + joint_matrix B, joint_matrix C); -} + +} // namespace sycl::ext::oneapi::experimental::matrix ``` The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C` and returns @@ -268,11 +280,13 @@ matrix elements type `T`. ```c++ namespace sycl::ext::oneapi::experimental::matrix { - template - void joint_matrix_fill(Group sg, joint_matrix + void joint_matrix_fill(Group g, joint_matrix &m, Tv v); -} + +} // namespace sycl::ext::oneapi::experimental::matrix ``` IMPORTANT: In the current implementation, only the `sub_group` scope is supported. @@ -297,11 +311,13 @@ lambda expression. ```c++ namespace sycl::ext::oneapi::experimental::matrix { - template void joint_matrix_apply(Group g, joint_matrixC, F&& func); -} + +} // namespace sycl::ext::oneapi::experimental::matrix ``` In the following example, every element of the matrix `C` is @@ -327,9 +343,11 @@ restricted to `joint_matrix` using: Joint matrix type `tf32` is defined as an empty class with no member functions. ```c++ -namespace precision { - class tf32; -} +namespace sycl::ext::oneapi::experimental::matrix::precision { + +class tf32; + +} // namespace sycl::ext::oneapi::experimental::matrix::precision ``` In this case, a `tf32` joint matrix type is declared by using the `precision::tf32` type for the `T` template parameter as follows: @@ -355,8 +373,10 @@ truncated, an explicit rounding function should be used. 
A new function ```c++ namespace sycl::ext::oneapi::experimental::matrix { - float round_to_tf32(float elem); -} + +float round_to_tf32(float elem); + +} // namespace sycl::ext::oneapi::experimental::matrix ``` === Example using int8_t type @@ -394,7 +414,7 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) }); joint_matrix_store(sg, tC, multi_ptr(memC) + - sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); }).wait(); ``` @@ -457,6 +477,7 @@ accumulator ```c++ namespace sycl::ext::oneapi::experimental::matrix { + template struct tpu_params; @@ -538,6 +559,8 @@ enum class tpu { xmx16, amx }; + +} // namespace sycl::ext::oneapi::experimental::matrix ``` ===== Validation Example: ```c++ @@ -575,7 +598,7 @@ descriptors that can be queried using `get_info` API. |====================== | Device descriptors | Return type| Description |`ext::oneapi::experimental::info::device::matrix::scopes` -|`std::vector` +|`std::vector` |tells the execution scopes that are supported by the joint matrix operations on this device |`ext::oneapi::experimental::info::device::matrix::combinations` | @@ -594,7 +617,9 @@ the maximum number is returned. However, if a device supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. 
```c++ -enum class scope_t { +namespace sycl::ext::oneapi::experimental::matrix { + +enum class scope { sub_group, work_group }; @@ -624,6 +649,8 @@ struct combination { matrix_type btype; matrix_type accumulatortype; }; + +} // namespace sycl::ext::oneapi::experimental::matrix ``` Each combination of the `combinations` vector composes the types and From 5b9fdfc22031d25e3fb1dcd82e677288cab5f408 Mon Sep 17 00:00:00 2001 From: Dounia Date: Wed, 1 Mar 2023 19:17:44 -0800 Subject: [PATCH 19/51] incorporate Greg's comments: query syntax --- .../sycl_ext_oneapi_matrix.asciidoc | 150 ++++++++---------- 1 file changed, 70 insertions(+), 80 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 5f73aabdf4450..bf37be98bc252 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -56,7 +56,7 @@ The `joint_matrix_mad` function is an optional kernel feature as defined in section 5.7 of the core SYCL specification. Each device supports only certain values for the `M`, `N`, and `K` template parameters and only certain types for the `Ta`, `Tb`, and `Tc` template parameters. -Applications can use the query API in `tpu_params` or +Applications can use the query API in `matrix_params` or `get_info` to determine the set of legal parameters for each device. If the application submits a kernel using an unsupported `joint_matrix_mad` combination, the implementation throws a @@ -419,6 +419,17 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) ``` === Query Interface +Most devices support only certain values for the `Rows` and `Cols` +template parameters and only certain types for the `T` template +parameter. 
Moreover, most devices support only certain combinations of
+these template parameters for the A, B, and accumulator matrices. This
+extension adds two query APIs that can be used to determine the set of
+legal parameters for a particular device. One form provides
+`constexpr` values for these parameters, which can be used when the
+application knows the specific device architecture on which it will
+run. The other form uses the standard information descriptor queries
+for the device object.
+
 Intel AMX, Intel XMX and Nvidia Tensor Cores matrix hardware support
 different sizes and types (see Appendix: Supported Combinations Per
 Hardware). The query interface is used to validate user code and
@@ -450,7 +461,7 @@ defined.
 
 [frame="none",options="header"]
 |======================
-| Member/type alias in `tpu_params` | Description
+| Member/type alias in `matrix_params` | Description
 |`type_a`| type alias for the type of matrix A
 |`type_b`| type alias for the type of matrix B
 |`type_accumulator`| type alias for the type of matrix accumulator
@@ -478,82 +489,61 @@ accumulator
 
 ```c++
 namespace sycl::ext::oneapi::experimental::matrix {
 
-template
-struct tpu_params;
-
-// Validation form: Valid or not
-// Specialization when both types and sizes are given
-template
-struct tpu_params<
-    tpu::amx, Ta, Tb, Tc, sM, sN, sK,
-    typename std::enable_if<(
-        !std::is_same_v && !std::is_same_v &&
-        !std::is_same_v && sM != 0 && sN != 0 && sK != 0)>::type> {
-  // Validate that parameters are supported
-  static_assert(
-      (sM == 0 && sN == 0 && sK == 0) ||
-          (is_combination_valid_amx(sM, sN, sK)),
-      "Invalid parameters for Intel AMX, query valid types and maximum sizes "
-      "using: dev.get_info(); and
-      then check out matrix::combinations array");
-  using type_a = Ta; // this type alias is not available in the
-  current implementation
-  using type_b = Tb; // this type alias is not available in the
-  current implementation
-  using type_accumulator = Tc; // this type alias is not available
in - the current implementation - - // if combination is valid, construct the matrices - - static constexpr std::size_t M = (sM != 0) ? sM : 16; - static constexpr std::size_t N = (sN != 0) ? sN : 16; - static constexpr std::size_t K = - (sK != 0) ? sK : ((sizeof(Ta) == 1) ? 64 : 32); - - template - using joint_matrix_a = joint_matrix; - template - using joint_matrix_b = joint_matrix; +// This is the validation form, when all template parameters are +// specified. +template +struct matrix_params { + // An implementation typically uses static_assert here to trigger a + // compilation error when the matrix types or shapes are not + // supported by the device identified by "Dev". + + using type_a = /* implementation defined */; + using type_b = /* implementation defined */; + using type_accumulator = /* implementation defined */; + + static constexpr size_t M = sM; + static constexpr size_t N = sN; + static constexpr size_t K = sK; + + template + using joint_matrix_a = joint_matrix; + + template + using joint_matrix_b = joint_matrix; + template - using joint_matrix_accumulator = joint_matrix; + using joint_matrix_accumulator = joint_matrix; }; -// Default values form: Sizes-only query -// Specialization for when only types are given, need to query only sizes -template -struct tpu_params && - !std::is_same_v && - !std::is_same_v)>::type> { - static_assert((are_types_valid_amx()), - "Invalid types for Intel AMX, supported types are - int8_t, uint8_t, and bfloat16) "); - - using type_a = Ta; // this type alias is not available in the - current implementation - using type_b = Tb; // this type alias is not available in the - current implementation - using type_accumulator = Tc; // this type alias is not available in - the current implementation - - // construct the matrices using the default sizes - static constexpr std::size_t M = 16; - static constexpr std::size_t N = 16; - static constexpr std::size_t K = ((sizeof(Ta) == 1) ? 
64 : 32); - - template - using joint_matrix_a = joint_matrix; - template - using joint_matrix_b = joint_matrix; +// This is the default values form, where the matrix dimensions are +// omitted. +template +struct matrix_params { + // An implementation typically uses static_assert here to trigger a + compilation error when the matrix types are not supported by the + device identified by "Dev". + + using type_a = /* implementation defined */; + using type_b = /* implementation defined */; + using type_accumulator = /* implementation defined */; + + static constexpr size_t M = /* implementation defined */; + static constexpr size_t N = /* implementation defined */; + static constexpr size_t K = /* implementation defined */; + + template + using joint_matrix_a = joint_matrix; + + template + using joint_matrix_b = joint_matrix; + template - using joint_matrix_accumulator = joint_matrix; + using joint_matrix_accumulator = joint_matrix; }; + enum class tpu { xmx8, xmx16, @@ -564,17 +554,17 @@ enum class tpu { ``` ===== Validation Example: ```c++ -// User can provide sizes besides the types and tpu_params can assert +// User can provide sizes besides the types and matrix_params can assert if they are supported or not // in this case, an assertion will happens as 16 is not a supported size for M -using myparams = tpu_params; +using myparams = matrix_params; size_t NDRangeM = M / myparams::M; //Assertion would happen at this line size_t NDRangeN = N / myparams::N; ``` ===== Default Values Example: ```c++ -using myparams = tpu_params; +using myparams = matrix_params; // use this to construct the ranges on the host side size_t NDRangeM = M / myparams::M; size_t NDRangeN = N / myparams::N; @@ -664,9 +654,9 @@ table below provides a description of each member of the `combination` struct. 
|`max_msize`, `max_nsize`, `max_ksize`| if the matrix implementation supports a continuous number of element sizes, each of these members is non-zero, and the matrix implementation supports all element sizes -from 1 up to (and including) that number. By contrast, if the TPU -implementation supports a discrete number of element sizes, each of -these members has the value zero +from 1 up to (and including) that number. By contrast, if the matrix +hardware implementation supports a discrete number of element sizes, +each of these members has the value zero |`msize`, `nsize`, `ksize`| if the matrix implementation supports a discrete number of element sizes, each of these members is non-zero, and the value tells one of the supported element sizes. By contrast, @@ -715,7 +705,7 @@ int msize_remainder = break_dimension_remainder(combinations, M); The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way -using the `tpu_params` query class. +using the `matrix_params` query class. 
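The continuous-versus-discrete semantics of `max_msize`/`msize` and friends can be sketched as a host-side helper. This is an illustration only: the `combination` struct below is a hypothetical stand-in that mirrors just the members described above, and `shape_supported` is not part of the extension API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the `combination` struct returned by the
// runtime query; only the size members described above are mirrored.
struct combination {
  std::size_t max_msize, max_nsize, max_ksize; // non-zero: continuous range
  std::size_t msize, nsize, ksize;             // non-zero: one discrete size
};

// A dimension is supported either when it falls in the continuous
// range [1, max_size] or when it equals the discrete size.
bool dim_supported(std::size_t dim, std::size_t max_size, std::size_t size) {
  return max_size != 0 ? (dim >= 1 && dim <= max_size) : dim == size;
}

// A shape (M, N, K) is supported when some combination accepts all
// three dimensions at once.
bool shape_supported(const std::vector<combination> &combos,
                     std::size_t M, std::size_t N, std::size_t K) {
  for (const combination &c : combos)
    if (dim_supported(M, c.max_msize, c.msize) &&
        dim_supported(N, c.max_nsize, c.nsize) &&
        dim_supported(K, c.max_ksize, c.ksize))
      return true;
  return false;
}
```

Under this reading, an entry with `max_msize = 16`, `max_nsize = 16`, `max_ksize = 32` accepts any `M` up to 16, while a discrete entry accepts only the exact sizes it lists.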
==== Intel AMX Supported Combinations From e0f683ed45b38875c767af572bd63ac3dd7200db Mon Sep 17 00:00:00 2001 From: Dounia Date: Thu, 2 Mar 2023 08:29:07 -0800 Subject: [PATCH 20/51] use sycl::ext::oneapi::experimental::architecture and remove scope query --- .../sycl_ext_intel_matrix.asciidoc | 7 +- .../sycl_ext_oneapi_matrix.asciidoc | 87 +++++++------------ 2 files changed, 35 insertions(+), 59 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index b12d2c148deec..f29689a813955 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -270,13 +270,10 @@ automatically to vectorize the contiguous portion of the matrix. ```c++ auto wi_data_c = get_wi_data(sg, matC); for (int i = 0; i < wi_data_c.length(); i++) - wi_data_c[i] *= alpha; // Note that the indexing here "i" - is in the vector owned by a WI, not in the matrix C + wi_data_c[i] *= alpha; // Note that the indexing here "i" + //is in the vector owned by a WI, not in the matrix C ``` -IMPORTANT: In the current implementation, only the `sub_group` scope -is supported. - ===== Work-item data to joint matrix mapping coordinates The `wi_data` and `wi_element` classes provide access to the matrix elements that are local to the calling work-item. 
However, the diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index bf37be98bc252..adfd94da71cdc 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -158,12 +158,8 @@ enum class layout { In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The group scope is added -as an additional template parameter. - -IMPORTANT: In the current implementation, only the `sub_group` scope -is supported - -When the group is a `sycl::sub_group`, a matrix is declared as follows: +as an additional template parameter. `Group` template parameter must +be `sycl::sub_group`. In this case, a matrix is declared as follows: ```c++ joint_matrix tA; @@ -178,10 +174,7 @@ matrix hardware implements new features. Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The -`Group` template argument can be a work-group or a sub-group. - -IMPORTANT: In the current implementation, only the `sub_group` scope -is supported. +`Group` template argument must be `sycl::sub_group`. ==== Load ```c++ @@ -288,8 +281,6 @@ template (G, L), [=](nd_item<2> item) Most devices support only certain values for the `Rows` and `Cols` template parameters and only certain types for the `T` template parameter. Moreover, most devices support only certain combinations of -these template parameter for the A, B, and accumulator matrices. This -extension adds two query APIs that can be used to determine the set of -legal parameters for a particular device. 
One form provides -`constexpr` values for these parameters, which can be used when the -application knows the specific device architecture on which it will -run. The other form uses the standard information descriptor queries -for the device object. - -Intel AMX, Intel XMX and Nvidia Tensor Cores matrix hardware support different -sizes and types (see Appendix: Supported Combinations Per -Hardware). The query interface is used to validate user code and -inform them about supported types, sizes, scope, and layouts by the -implementation. This also offers development and tuning productivity -by both scientists and library developers. We provide two types of the -query interface: compile-time query and runtime query. +these template parameter for the A, B, and accumulator matrices (see +Appendix: Supported Combinations Per Hardware). This extension adds +two query APIs that can be used to determine the set of legal +parameters for a particular device. One form provides `constexpr` +values for these parameters, which can be used when the application +knows the specific device architecture on which it will run. The other +form uses the standard information descriptor queries for the device +object. ==== Compile-Time Query This returns `constexpr` values to use in `joint_matrix` template -arguments but depends on an enumeration of the matrix hardware that -can be tested. The compile-time query interface proposed here -consists of two functionalities: +arguments but depends on an enumeration of the matrix hardware (See +`sycl::ext::oneapi::experimental::architecture`) that can be tested. +The compile-time query interface proposed here consists of two +functionalities: - Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes @@ -480,10 +465,12 @@ default size for K; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation -|`joint_matrix_a`| type alias for `joint_matrix` for matrix A -|`joint_matrix_b`| type alias for `joint_matrix` for matrix B | -`joint_matrix_accumulator`| type alias for `joint_matrix` for matrix -accumulator +|`template using joint_matrix_a;`| type +alias for `joint_matrix` for matrix A +|`template using joint_matrix_b;`| type +alias for `joint_matrix` for matrix B +|`template using joint_matrix_accumulator;`| type +alias for `joint_matrix` for matrix accumulator |====================== ```c++ @@ -491,8 +478,9 @@ namespace sycl::ext::oneapi::experimental::matrix { // This is the validation form, when all template parameters are // specified. -template +template struct matrix_params { // An implementation typically uses static_assert here to trigger a // compilation error when the matrix types or shapes are not @@ -519,7 +507,8 @@ struct matrix_params { // This is the default values form, where the matrix dimensions are // omitted. 
-template +template struct matrix_params { // An implementation typically uses static_assert here to trigger a compilation error when the matrix types are not supported by the @@ -544,12 +533,6 @@ struct matrix_params { use::accumulator, sM, sN>; }; -enum class tpu { - xmx8, - xmx16, - amx -}; - } // namespace sycl::ext::oneapi::experimental::matrix ``` ===== Validation Example: @@ -557,14 +540,18 @@ enum class tpu { // User can provide sizes besides the types and matrix_params can assert if they are supported or not // in this case, an assertion will happens as 16 is not a supported size for M -using myparams = matrix_params; +using myparams = +matrix_params; size_t NDRangeM = M / myparams::M; //Assertion would happen at this line size_t NDRangeN = N / myparams::N; ``` ===== Default Values Example: ```c++ -using myparams = matrix_params; +using myparams = +matrix_params; // use this to construct the ranges on the host side size_t NDRangeM = M / myparams::M; size_t NDRangeN = N / myparams::N; @@ -577,7 +564,7 @@ myparams::joint_matrix_accumulator sub_c; ``` ==== Runtime Query This provides a more general query interface with information about -sizes, types, and scopes that are supported by a specific matrix +sizes and types that are supported by a specific matrix implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. @@ -587,10 +574,6 @@ descriptors that can be queried using `get_info` API. 
[frame="none",options="header"] |====================== | Device descriptors | Return type| Description -|`ext::oneapi::experimental::info::device::matrix::scopes` -|`std::vector` -|tells the execution scopes that are supported by the joint matrix -operations on this device |`ext::oneapi::experimental::info::device::matrix::combinations` | `std::vector`| tells the set of supported matrix sizes and types on this device @@ -609,10 +592,6 @@ discrete list of numbers so the `msize, nsize, ksize` variant is applied. ```c++ namespace sycl::ext::oneapi::experimental::matrix { -enum class scope { - sub_group, - work_group -}; enum class matrix_type { bf16, fp16, From 008dbfc79eb7f06c9b213ba14187a081baddc263 Mon Sep 17 00:00:00 2001 From: Dounia Date: Thu, 2 Mar 2023 08:36:04 -0800 Subject: [PATCH 21/51] fix the comments formatting --- .../sycl_ext_intel_matrix.asciidoc | 14 ++------------ .../sycl_ext_oneapi_matrix.asciidoc | 6 +++--- 2 files changed, 5 insertions(+), 15 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index f29689a813955..5b4c6bf5d35b7 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -172,16 +172,6 @@ as operands (such as a sum of all elements in a row for example). Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. -// We explored multiple options to enable this feature in the matrix -interface: 1) Allowing non-restrictive element indexing on the matrix -elements would result into slow indexing on the GPU, 2) Operator -overloading can represent only element-wise operations and not the -operations on pieces (row, column, diagonal, etc) of the matrix. 
3) -Providing specific functions for these piece-wise operations can -resolve some of the functions we know of today like the ones involved -in quantization but it is not general to any problem that may occur in -the future. - ===== Explicit conversion with mapping from SIMD to SPMD The data elements in a `joint_matrix` are distributed or shared across the work-items in the Group in an implementation-defined way. There is @@ -310,7 +300,7 @@ for a 16-bit type. ===== Example 1: 16-bit elements // Example of a 4 row x 4 column matrix using a 16-bit data - element, in row-major layout. + // element, in row-major layout. // Element a1 is contiguous in memory with element b1, etc. // --------------------------------- // a1, b1, c1, d1 @@ -328,7 +318,7 @@ for a 16-bit type. ===== Example 2: 8-bit elements // Example of a 4 row x 4 column matrix using a 8-bit data - element, in row-major layout. + // element, in row-major layout. // Element a1 is contiguous in memory with element b1, etc. // --------------------------------- // a1, b1, c1, d1 diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index adfd94da71cdc..b026e78721954 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -511,8 +511,8 @@ template struct matrix_params { // An implementation typically uses static_assert here to trigger a - compilation error when the matrix types are not supported by the - device identified by "Dev". + // compilation error when the matrix types are not supported by the + // device identified by "Dev". 
using type_a = /* implementation defined */; using type_b = /* implementation defined */; @@ -538,7 +538,7 @@ struct matrix_params { ===== Validation Example: ```c++ // User can provide sizes besides the types and matrix_params can assert - if they are supported or not +// if they are supported or not // in this case, an assertion will happens as 16 is not a supported size for M using myparams = matrix_params Date: Mon, 6 Mar 2023 14:11:09 -0800 Subject: [PATCH 22/51] - Add overloads and explanation for each of the API in the tf32 section - Add clarification about rounding to tf32 - Correct runtime query example - Besides joint_matrix_mad, add joint_matrix to the optional device checks --- .../sycl_ext_intel_matrix.asciidoc | 61 +++---- .../sycl_ext_oneapi_matrix.asciidoc | 165 ++++++++++++------ 2 files changed, 138 insertions(+), 88 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 5b4c6bf5d35b7..ef2c21d38e09a 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -339,42 +339,43 @@ using namespace sycl::ext::oneapi::experimental::matrix; queue q; range<2> G = {M/tM, N}; range<2> L = {1, SG_SIZE}; -int8_t *memA = malloc_shared(M*K, q); -int8_t *memB = malloc_shared(K*N, q); -int32_t *memC = malloc_shared(M*N, q); -q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) +auto bufA = sycl::buffer{memA, sycl::range{M*K}}; +auto bufB = sycl::buffer{memB, sycl::range{K*N}}; +auto bufC = sycl::buffer{memC, sycl::range{M*N}}; +q.submit([&](sycl::handler& cgh) { + auto accA = sycl::accessor{bufA, cgh, sycl::read_only}; + auto accB = sycl::accessor{bufB, cgh, sycl::read_only}; + auto accC = sycl::accessor{bufC, cgh, sycl::read_write}; + cgh.parallel_for(nd_range<2>(G, L), 
[=](nd_item<2> item) [[sycl::reqd_sub_group_size(SG_SIZE)]] { - const auto global_idx = item.get_global_id(0); - const auto global_idy = item.get_global_id(1); - const auto sg_startx = global_idx - item.get_local_id(0); - const auto sg_starty = global_idy - item.get_local_id(1); - sub_group sg = item.get_sub_group(); - joint_matrix tA; - joint_matrix tA; + joint_matrix tB; - joint_matrix tC; - joint_matrix_fill(sg, tC, 0); - for (int k = 0; k < K; k += tK) { - joint_matrix_load(sg, tA, - multi_ptr(memA) + - sg_startx * tM * K + k, K); - joint_matrix_load(sg, tB, - multi_ptr(memB) + - k * N*4 + sg_starty/SG_SIZE*tN*4, N*4); - tC = joint_matrix_mad(sg, tA, tB, tC); - } - auto wi_data_c = ext::intel::experimental::matrix::get_wi_data(sg, tC); - for (int i = 0; i < wi_data_c.length(); i++) - wi_data_c[i] *= alpha; - joint_matrix_store(sg, tC, - multi_ptr(memC) + - sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); -}).wait(); + joint_matrix tC; + joint_matrix_fill(sg, tC, 0); + for (int k = 0; k < K; k += tK) { + joint_matrix_load(sg, tA, accA + sg_startx * tM * K + k, K); + joint_matrix_load(sg, tB, accB + k * N*4 + sg_starty/SG_SIZE*tN*4, N*4); + tC = joint_matrix_mad(sg, tA, tB, tC); + } + auto wi_data_c = ext::intel::experimental::matrix::get_wi_data(sg, tC); + for (int i = 0; i < wi_data_c.length(); i++) + wi_data_c[i] *= alpha; + joint_matrix_store(sg, tC, + accC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); + }); +}); +q.wait(); ``` === Intel-Specific Runtime Query Besides the query we provide in -../experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], +link:../experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], some device descriptors are Intel hardware specific. 
These are provided as part of `ext::intel::experimental::info::device::matrix` namespace: diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index b026e78721954..cf1113bde5556 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -52,16 +52,17 @@ that contain a matrix hardware, specifically Intel(R) Advanced Matrix Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) XMX) and Nvidia(R) Tensor Cores. -The `joint_matrix_mad` function is an optional kernel feature as defined -in section 5.7 of the core SYCL specification. Each device supports -only certain values for the `M`, `N`, and `K` template parameters and -only certain types for the `Ta`, `Tb`, and `Tc` template parameters. -Applications can use the query API in `matrix_params` or -`get_info` to determine the set of -legal parameters for each device. If the application submits a kernel using -an unsupported `joint_matrix_mad` combination, the implementation throws a -synchronous exception with the `errc::kernel_not_supported` error code -as described in section 5.7. +The `joint_matrix` type and the `joint_matrix_mad` function are +optional kernel features as defined in section 5.7 of the core SYCL +specification. Each device supports only certain values for the `M`, +`N`, and `K` template parameters and only certain types for the `Ta`, +`Tb`, and `Tc` template parameters. Applications can use the query API +in `matrix_params` or `get_info` +to determine the set of legal parameters for each device. 
If the +application submits a kernel using an unsupported `joint_matrix` type +or calls `joint_matrix_mad` with an unsupported combination, the +implementation throws a synchronous exception with the +`errc::kernel_not_supported` error code as described in section 5.7. == Overview Joint matrix is a SYCL extension for matrix hardware programming. It @@ -180,22 +181,22 @@ by all the work-items in the group in a convergent control flow. The ```c++ namespace sycl::ext::oneapi::experimental::matrix { -template void joint_matrix_load(Group g, joint_matrix &res, - multi_ptr src, size_t stride, layout Layout); + multi_ptr src, size_t stride, layout Layout); // Only available when Layout != layout::dynamic -template void joint_matrix_load(Group g, joint_matrix &res, - multi_ptr src, size_t stride); + multi_ptr src, size_t stride); } // namespace sycl::ext::oneapi::experimental::matrix ``` @@ -210,13 +211,12 @@ available for a `joint_matrix` type that used the default value The second overload without a memory layout must not be used with a `joint_matrix` type that has `layout::dynamic`. -The base pointer `src` of type `S` here determines the starting address of the +The base pointer `src` of type `T` here determines the starting address of the matrix to be loaded from. `Layout` determines whether the data is being read in a row (`row_major`), column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major -layout. Note that the type `S` must be convertible to matrix elements -type `T`. +layout. ==== Store ```c++ @@ -306,7 +306,7 @@ namespace sycl::ext::oneapi::experimental::matrix { template void joint_matrix_apply(Group g, joint_matrixC, F&& func); + Layout>& C, F&& func); } // namespace sycl::ext::oneapi::experimental::matrix ``` @@ -348,19 +348,60 @@ joint_matrix tA; ``` -The purpose of this support is to reduce the precision of the -`joint_matrix_mad` operation. 
The rest of the application uses `fp32` -type. Specifically, joint matrix load/store/fill perform float type -memory access to/from tf32 joint matrix. Also, the return type of -element-wise accesses of a tf32 `joint_matrix` returns -float. Consequently, general arithmetic is done on `fp32` data. +The purpose of this support is to accelerate the `joint_matrix_mad` +operation while reducing its precision. The rest of the application +uses `fp32` type. -Joint matrix APIs manipulate floats. No implicit rounding happens when +Specifically, joint matrix load performs float type memory access to +tf32 joint matrix using the following overloads. Note that it is +unspecified whether the implementation stores all 32 bits or only the +19 bits into the tf32 joint matrix. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + +template + void joint_matrix_load(Group g, + joint_matrix &res, + multi_ptr src, size_t stride, layout Layout); + +// Only available when Layout != layout::dynamic +template + void joint_matrix_load(Group g, + joint_matrix &res, + multi_ptr src, size_t stride); + +} // namespace sycl::ext::oneapi::experimental::matrix +``` +Joint matrix store is only available for the matrix accumulator for +which tf32 does not apply. Also, `Tv` type in joint matrix fill used +to initialize the tf32 joint matrix is `float`. Note that it is +unspecified whether the implementation stores all 32 bits or only the +19 bits into the tf32 joint matrix after the fill operation. + +Finally, the return type of element-wise accesses of a tf32 +`joint_matrix` is float. Consequently, general arithmetic is done on +`fp32` data. In this case, the type used in the function object passed +to `joint_matrix_apply` has to be `float`. In the example below, `C` is a +joint matrix of type `precision::tf32`. + +```c++ +joint_matrix_apply(sg, C, [=](float &x) { + x *= alpha; +}); +``` + +Joint matrix APIs operate on floats. 
No implicit rounding happens when users load or store data to/from joint matrices. By default, `joint_matrix_mad` works on truncated values (13 bits set to zero). If -users want joint matrix data to be actually rounded to `tf32` instead of -truncated, an explicit rounding function should be used. A new function -`round_to_tf32` is added to perform the rounding to `tf32`. +users want the joint matrix mantissas rounded from 23 bits (`float`) to +10 bits `tf32` instead of truncated, an explicit rounding function +should be used. A new function `round_to_tf32` is added to perform +the rounding to `tf32`. ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -486,9 +527,9 @@ struct matrix_params { // compilation error when the matrix types or shapes are not // supported by the device identified by "Dev". - using type_a = /* implementation defined */; - using type_b = /* implementation defined */; - using type_accumulator = /* implementation defined */; + using type_a = Ta; + using type_b = Tb; + using type_accumulator = Taccumulator; static constexpr size_t M = sM; static constexpr size_t N = sN; @@ -514,23 +555,23 @@ struct matrix_params { // compilation error when the matrix types are not supported by the // device identified by "Dev". 
-  using type_a = /* implementation defined */;
-  using type_b = /* implementation defined */;
-  using type_accumulator = /* implementation defined */;
+  using type_a = Ta;
+  using type_b = Tb;
+  using type_accumulator = Taccumulator;
 
   static constexpr size_t M = /* implementation defined */;
   static constexpr size_t N = /* implementation defined */;
   static constexpr size_t K = /* implementation defined */;
 
   template
-  using joint_matrix_a = joint_matrix;
+  using joint_matrix_a = joint_matrix;
 
   template
-  using joint_matrix_b = joint_matrix;
+  using joint_matrix_b = joint_matrix;
 
   template
   using joint_matrix_accumulator = joint_matrix;
+                                   use::accumulator, M, N>;
 };
 
 } // namespace sycl::ext::oneapi::experimental::matrix
 ```
@@ -660,23 +701,20 @@ the `T` template parameter as follows:
 
 + `uint64`: `unsigned long`
 |======================
 
-===== General Query Example:
+===== Runtime Query Example:
 ```c++
-constexpr int M = 1500; // with msize = 8 and msize = 4,
-    // M can be broken up to 125 sequence of 8-sized ops and
-    // remaining 500 using 125 sequence of 4-sized ops
-auto combinations = device.get_info();
-
-int {msize, nsize, ksize} = break_dimension(combinations, M);
-int msize_remainder = break_dimension_remainder(combinations, M);
-// device code:
-
-//joint_matrix sub_a;
-//joint_matrix sub_b;
-//joint_matrix sub_c;
-//Remainder handling
+// Ta, Tb, Taccumulator are the types used in applications
+std::vector<combination> combinations =
+  device.get_info<ext::oneapi::experimental::info::device::matrix::combinations>();
+for (int i = 0; i < combinations.size(); i++) {
+  if (Ta == combinations[i].atype &&
+      Tb == combinations[i].btype &&
+      Taccumulator == combinations[i].accumulatortype) {
+    // joint matrix GEMM kernel can be called using these sizes
+    joint_matrix_gemm(combinations[i].msize,
+        combinations[i].nsize, combinations[i].ksize);
+  }
+}
 ```
 
 === Appendix: Supported Combinations Per Hardware
@@ -687,6 +725,8 @@ XMX hardware. Note that these can be returned in a parametrized way
 using the `matrix_params` query class.
==== Intel AMX Supported Combinations +This is currently available in +`sycl::ext::oneapi::experimental::architecture::intel_cpu_spr`. [frame="none",options="header"] |====================== @@ -697,17 +737,26 @@ using the `matrix_params` query class. `matrix_type::fp32` | +<=+ 16 | +<=+ 16 | +<=+ 32 |====================== -==== Intel XMX Supported Combinations +==== Intel XMX Supported +This is currently available in +`sycl::ext::oneapi::experimental::architecture::intel_gpu_pvc` and +`sycl::ext::oneapi::experimental::architecture::intel_gpu_dg2`. [frame="none",options="header"] |====================== -| A type | B type | Accumulator type | M | N | K +| A type | B type | Accumulator type | M | N | K | device | `matrix_type::(u)int8` | `matrix_type::(u)int8` | -`matrix_type::int32` | +<=+ 8 | 16 | 32 +`matrix_type::int32` | +<=+ 8 | 16 | 32 | +sycl::ext::oneapi::experimental::architecture::intel_gpu_pvc +| | | | |8||sycl::ext::oneapi::experimental::architecture::intel_gpu_dg2 | `matrix_type::fp16` | `matrix_type::fp16` | -`matrix_type::fp32` | +<=+ 8 | 16 | 16 +`matrix_type::fp32` | +<=+ 8 | 16 | 16 | +sycl::ext::oneapi::experimental::architecture::intel_gpu_pvc +| | | | |8||sycl::ext::oneapi::experimental::architecture::intel_gpu_dg2 | `matrix_type::bf16` | `matrix_type::bf16` | -`matrix_type::fp32` | +<=+ 8 | 16 | 16 +`matrix_type::fp32` | +<=+ 8 | 16 | 16 | +sycl::ext::oneapi::experimental::architecture::intel_gpu_pvc +| | | | |8||sycl::ext::oneapi::experimental::architecture::intel_gpu_dg2 |====================== From e69ff85567f57b0bfb446b0f6edf424cf4384a8b Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 6 Mar 2023 14:15:29 -0800 Subject: [PATCH 23/51] typo --- .../sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc 
b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index cf1113bde5556..5b5176a87a842 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -737,7 +737,7 @@ This is currently available in `matrix_type::fp32` | +<=+ 16 | +<=+ 16 | +<=+ 32 |====================== -==== Intel XMX Supported +==== Intel XMX Supported Combinations This is currently available in `sycl::ext::oneapi::experimental::architecture::intel_gpu_pvc` and `sycl::ext::oneapi::experimental::architecture::intel_gpu_dg2`. From 6868a375f12b10201a48ce0b36ce681c5ae81832 Mon Sep 17 00:00:00 2001 From: Dounia Date: Fri, 10 Mar 2023 19:33:15 -0800 Subject: [PATCH 24/51] Address Greg's comments in the Intel extension --- .../sycl_ext_intel_matrix.asciidoc | 184 +++++++++--------- .../sycl_ext_oneapi_matrix.asciidoc | 2 +- 2 files changed, 97 insertions(+), 89 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index ef2c21d38e09a..6e12cdc933eb1 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -19,7 +19,7 @@ == Notice -Copyright (c) 2022-2022 Intel Corporation. All rights reserved. +Copyright (c) 2022-2023 Intel Corporation. All rights reserved. NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. @@ -55,19 +55,21 @@ This document describes the extra features and details for the implementation of `joint_matrix` extension on Intel AMX and Intel XMX. +The APIs in this extension may be used only on a device that has +`aspect::ext_intel_matrix`. 
The application must check that the device +has this aspect before submitting a kernel using any of the APIs in +this extension. If the application fails to do this, the +implementation throws a synchronous exception with the +`errc::kernel_not_supported` error code when the kernel is submitted to +the queue. + == Overview -The Intel backend implementations on both Intel AMX and Intel XMX -support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, -`joint_matrix_mad`, `joint_matrix_fill`, `joint_matrix_apply`, and the -query interface, as they are defined in the sycl_ext_oneapi_matrix -extension. Besides element-wise operations with mapping information, -there are additional specifics about the supported layouts that enable -extra performance and functionality listed in this document. -This extension presents some supplementary Intel AMX and Intel XMX -features not contained within the sycl_ext_oneapi_matrix -extension. The additional features are built on top of the -sycl_ext_oneapi_matrix extension but are only supported by the Intel -AMX and Intel XMX backends. +This extension provides additional APIs related to the `joint_matrix` +type that can be used only on Intel devices that have Intel AMX or +Intel XMX technology. These Intel devices also support all of the +generic matrix APIs specified in `sycl_ext_oneapi_matrix`, but +applications can make use of the extended Intel specific APIs in this +extension to gain additional performance and capabilities. == Specification @@ -95,43 +97,34 @@ extension's APIs the implementation supports. === Joint Matrix Intel-Specific Matrix Features ==== Layout -Besides row major and column major layouts, `layout` introduces the -custom layout packed layout that refers to the VNNI format descibed in -the following section. +This extension adds a new layout type named `ext_intel_packed` which +an application can use to indicate that the matrix data is loaded or +stored in VNNI "packed" format. 
```c++ namespace sycl::ext::intel::experimental::matrix { enum class layout { - packed + ext_intel_packed }; } // namespace sycl::ext::intel::experimental::matrix ``` -==== Layout argument in `joint_matrix_load` -`layout` in `joint_matrix_load` can take `packed` as argument to -specify that the data has already been transformed into VNNI format -(`packed`). in this case, `stride` argument of `joint_matrix_load` -describes the number of elements between consecutive rows for packed -layouts. +Consequently, the layout argument `layout` in `joint_matrix_load` can +take `ext_intel_packed` as argument to specify that the data has +already been transformed into VNNI format. In this case, the `stride` +argument of `joint_matrix_load` describes the number of elements +between consecutive rows for packed layouts. In order to get maximum performance on Intel AMX and Intel XMX, prepacking data in the memory is necessary. If users did not specify -the packed layouts, transforms done by the implementation will be slow -due to extra scatter/gather operations. Hence, we expose the `packed` -layout to the user to specify that A or B have already been -VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` -section below. - -IMPORTANT: In the current Intel AMX and Intel XMX implementations, the -layout in the load of matrix B (provided by the `layout memL` -parameter below) must be `packed` or `row_major`. Automatic VNNI -transform is supported on AMX. The layout in the load of matrices A -and C must be `row_major`, and the layout in the store of matrix C -(provided by the `layout memL` parameter below) must also be -`row_major`. +the packed layouts, transforms done by the implementation may be slow +due to extra scatter/gather operations. Hence, we expose the +`ext_intel_packed` layout to the user to specify that A or B have +already been VNNIed. The packed or VNNI layout is introduced in the +`VNNI layout` section below.
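NOTE: To make the `ext_intel_packed` (VNNI) transform concrete, the following standalone sketch repacks a row-major matrix in plain C++. It is illustrative only: the helper name `pack_vnni` and its interface are assumptions for this note and are not part of the extension. The pack factor is 4 for 8-bit element types and 2 for 16-bit element types, as described in the `Packed Layout Format` section.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Repack a row-major (rows x cols) matrix into VNNI "packed" layout.
// 'factor' consecutive rows are interleaved so that each group of
// 'factor' elements from one column becomes contiguous in memory.
// The result is a row-major (rows/factor) x (cols*factor) matrix,
// which is why the stride of a packed matrix is cols*factor elements.
template <typename T>
std::vector<T> pack_vnni(const std::vector<T>& src,
                         std::size_t rows, std::size_t cols,
                         std::size_t factor) {
  assert(rows % factor == 0);
  std::vector<T> dst(rows * cols);
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      dst[(r / factor) * cols * factor + c * factor + (r % factor)] =
          src[r * cols + c];
  return dst;
}
```

With an 8 x 4 `int8_t` matrix and factor 4, the first four elements of the packed buffer are column 0 of rows 0-3, matching the 8-bit example shown later in this document.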
==== Store Operation Besides store of matrix `accumulator`, the Intel implementation allows @@ -140,37 +133,33 @@ store on matrix `a` and `b` as well. ```c++ namespace sycl::ext::intel::experimental::matrix { -template +template void joint_matrix_store(Group g, - joint_matrix &res, - multi_ptr src, size_t stride); + joint_matrix &res, + multi_ptr src, size_t stride); + +template +void joint_matrix_store(Group g, + joint_matrix &res, + multi_ptr src, size_t stride); } // namespace sycl::ext::intel::experimental::matrix ``` ==== Element Indexing and Piece-Wise Operations -===== Background -Besides matrix multiply and add, this extension aims to make it -possible to perform piece-wise operations on matrices in a SPMD -manner. The mechanisms that are recommended to perform such piece-wise -operations depend upon which of the following classes the operation -falls into: - -Class 1- Element-wise operations where the same operation is performed -on every element of the matrix, such that the operation can be -performed without knowledge of the position of the element within the -matrix. Activation functions or adding a constant value to every -element of the matrix are two examples. In this case -`joint_matrix_apply` should be used. - -Class 2- Piece-wise operations where the operation depends on the -element index of the matrix or the operation takes multiple elements -as operands (such as a sum of all elements in a row for -example). Quantization that is needed for conversion between low -precision types like `int8_t` and `fp32` uses piece-wise operations. +The function `joint_matrix_apply` in `sycl_ext_oneapi_matrix` provides +a way for the application to apply the same operation on every element +of the matrix. However, some algorithms require the application to +know the coordinates of each element as it operates on +them. 
In this case, the operation depends on the element index of the +matrix or the operation takes multiple elements as operands (such as a +sum of all elements in a row for example). Quantization that is needed +for conversion between low precision types like `int8_t` and `fp32` +uses such piece-wise operations. ===== Explicit conversion with mapping from SIMD to SPMD The data elements in a `joint_matrix` are distributed or shared across @@ -231,7 +220,7 @@ template class wi_data { size_t length(); - wi_element operator[](size_t i); + wi_element operator[](size_t i); }; template Date: Mon, 20 Mar 2023 12:54:49 -0700 Subject: [PATCH 25/51] Add overload of joint matrix apply where row and col are provided --- .../sycl_ext_intel_matrix.asciidoc | 21 ++++++++++++++++--- .../sycl_ext_oneapi_matrix.asciidoc | 4 ++-- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 6e12cdc933eb1..0414e265cbd54 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -94,7 +94,7 @@ extension's APIs the implementation supports. |=== -=== Joint Matrix Intel-Specific Matrix Features +=== Joint Matrix Intel-Specific Features ==== Layout This extension adds a new layout type named `ext_intel_packed` which @@ -269,11 +269,26 @@ auto data = get_wi_data(sg, tA); // each WI calculates local sum of rows for (int i = 0; i < data.length(); ++i) { auto [row, col] = data[i].get_coord(); - sum_of_local_rows[row] += data[i]; + sum_local_rows[row] += data[i]; } ``` -IMPORTANT: `get_coord` is not implemented yet. 
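NOTE: To illustrate the kind of coordinate-dependent, piece-wise operation motivated above (quantization between `int8_t` and `fp32`), here is a minimal plain C++ sketch of per-row symmetric quantization. The helper name `quantize_rows` and the scaling scheme are assumptions for illustration; they are an analogy for what a kernel would do with per-element `(row, col)` coordinates, not part of the SYCL API.

```c++
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric per-row quantization: each row gets its own scale so that
// the row's largest magnitude maps to 127. Because the scale depends
// on the row, this operation needs the coordinates of every element,
// not just its value.
void quantize_rows(const std::vector<float>& src, std::size_t rows,
                   std::size_t cols, std::vector<int8_t>& dst,
                   std::vector<float>& scales) {
  dst.assign(rows * cols, 0);
  scales.assign(rows, 1.0f);
  for (std::size_t r = 0; r < rows; ++r) {
    float maxabs = 0.0f;
    for (std::size_t c = 0; c < cols; ++c)
      maxabs = std::max(maxabs, std::fabs(src[r * cols + c]));
    if (maxabs > 0.0f) scales[r] = maxabs / 127.0f;
    for (std::size_t c = 0; c < cols; ++c)
      dst[r * cols + c] =
          (int8_t)std::lround(src[r * cols + c] / scales[r]);
  }
}
```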
+===== Extending `joint_matrix_apply` with the mapping coordinates +For use cases in which the order of accessing the elements of +`wi_data` is important, indexing and the mapping coordinates API +provided above should be used. If access order is not important, +`joint_matrix_apply` is extended to provide the `row` and `col` +arguments as part of the function object. + +Consequently, the sum of rows example provided above can be written as +follows: + +```c++ +joint_matrix_apply(sg, A, [=](T &val, size_t row, size_t col) { + sum_local_rows[row] += val; +}); +``` + ==== VNNI/Packed Layout The `ext_intel_packed` layout (aka VNNI) is a special layout for diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 701059c693d0b..14848c9439abf 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -150,9 +150,9 @@ enum class layout { row_major, col_major, dynamic -}; // namespace sycl::ext::oneapi::experimental::matrix +}; -} +} // namespace sycl::ext::oneapi::experimental::matrix ``` ==== Group Memory Scope From 433e65a0ae54caf0d515f0e46c097bdfc7978885 Mon Sep 17 00:00:00 2001 From: Dounia Date: Thu, 23 Mar 2023 10:17:27 -0700 Subject: [PATCH 26/51] Address Greg's comments: change packed name, add tf32 rounding mode, other rewording and restructuring suggestions --- .../sycl_ext_intel_matrix.asciidoc | 97 +++++---- .../sycl_ext_oneapi_matrix.asciidoc | 198 ++++++++++-------- 2 files changed, 167 insertions(+), 128 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 0414e265cbd54..782a58ddbb8c8 100644 --- 
a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -93,22 +93,34 @@ extension's APIs the implementation supports. feature-test macro always has this value. |=== +=== New Aspect for Intel-Specific Matrix APIs +This extension adds a new device aspect: -=== Joint Matrix Intel-Specific Features +namespace sycl { -==== Layout +enum class aspect : /*unspecified*/ { + ext_intel_matrix +}; + +} // namespace sycl + +The `ext_intel_matrix` aspect indicates that the device is capable of +using the extended joint matrix APIs that are defined in the sections +that follow. + +=== New Layout Type This extension adds a new layout type named `ext_intel_packed` which an application can use to indicate that the matrix data is loaded or stored in VNNI "packed" format. ```c++ -namespace sycl::ext::intel::experimental::matrix { +namespace sycl::ext::oneapi::experimental::matrix { enum class layout { ext_intel_packed }; -} // namespace sycl::ext::intel::experimental::matrix +} // namespace sycl::ext::oneapi::experimental::matrix ``` @@ -126,7 +138,7 @@ due to extra scatter/gather operations. Hence, we expose the already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below. -==== Store Operation +=== Additional Store Operations Besides store of matrix `accumulator`, the Intel implementation allows store on matrix `a` and `b` as well.
@@ -136,7 +148,7 @@ namespace sycl::ext::intel::experimental::matrix { template - void joint_matrix_store(Group g, +void joint_matrix_store(Group g, joint_matrix &res, multi_ptr src, size_t stride); @@ -150,7 +162,7 @@ void joint_matrix_store(Group g, } // namespace sycl::ext::intel::experimental::matrix ``` -==== Element Indexing and Piece-Wise Operations +=== Per-element Operations with Coordinates The function `joint_matrix_apply` in `sycl_ext_oneapi_matrix` provides a way for the application to apply the same operation on every element of the matrix. However, some algorithms require the application to @@ -161,7 +173,7 @@ sum of all elements in a row for example). Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses such piece-wise operations. -===== Explicit conversion with mapping from SIMD to SPMD +==== Explicit conversion with mapping from SIMD to SPMD The data elements in a `joint_matrix` are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` @@ -253,7 +265,7 @@ for (int i = 0; i < wi_data_c.length(); i++) //is in the vector owned by a WI, not in the matrix C ``` -===== Work-item data to joint matrix mapping coordinates +==== Work-item data to joint matrix mapping coordinates The `wi_data` and `wi_element` classes provide access to the matrix elements that are local to the calling work-item. However, the distribution of matrix elements to each work-item is @@ -273,24 +285,50 @@ for (int i = 0; i < data.length(); ++i) { } ``` -===== Extending `joint_matrix_apply` with the mapping coordinates -For use cases in which the order of accessing the elements of -`wi_data` is important, indexing and the mapping coordinates API -provided above should be used. If access order is not important, -`joint_matrix_apply` is extended to provide the `row` and `col` -arguments as part of the function object. 
+==== Extending `joint_matrix_apply` with the mapping coordinates +This extension adds a new form of the `joint_matrix_apply` function in +the `sycl::ext::intel::experimental::matrix` namespace that allows the application +to perform an operation on each element of the matrix. This function +is similar to the form in `sycl_ext_oneapi_joint_matrix`, but it also +provides the matrix coordinates of each element to the callback +function: +```c++ +namespace sycl::ext::intel::experimental::matrix { -Consequently, the sum of rows example provided above can be written as -follows: +template +void joint_matrix_apply(Group g, joint_matrix& C, F&& func); + +} // namespace sycl::ext::intel::experimental::matrix +``` +The `func` callback is invoked with three parameters `(T& element, +size_t row, size_t col)`, where `row` and `col` give the coordinates +of the element in the joint matrix. To illustrate, the following example +shows how you can use this API to sum the rows of a matrix: ```c++ joint_matrix_apply(sg, A, [=](T &val, size_t row, size_t col) { sum_local_rows[row] += val; }); ``` +=== New Device Information Descriptor +Besides the query we provide in +link:../experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], +some device descriptors are Intel hardware specific. These are +provided in the `ext::intel::experimental::info::device::matrix` +namespace: +[frame="none",options="header"] +|====================== +| Device descriptors | Return type| Description +|`ext::intel::experimental::info::device::matrix::numtiles`| `int` +|If the matrix hardware in the device has separate storage (register +files or tiles) from the rest of the processing units (e.g. Intel +AMX), returns the number of tiles. For other devices, returns 0.
+|====================== -==== VNNI/Packed Layout +=== Packed Layout Format The `ext_intel_packed` layout (aka VNNI) is a special layout for matrix data that allows Intel AMX and Intel XMX devices to load matrices more efficiently (packing in 32 bits). This layout applies @@ -307,7 +345,7 @@ elements for rows 0 - 3 have been stored this way, the process repeats, starting with the next four elements of column 0. The diagram below illustrates this layout for a 8 x 4 matrix. -===== Example 1: 8-bit elements +==== Example 1: 8-bit elements // Example of a 8 row x 4 column matrix using a 8-bit data // element, in row-major layout, rows are shown horizontally. @@ -336,7 +374,7 @@ elements for rows 0 - 1 have been stored this way, the process repeats, starting with the next two elements of column 0. The diagram below illustrates this layout for a 4 x 4 matrix. -===== Example 2: 16-bit elements +==== Example 2: 16-bit elements // Example of a 4 row x 4 column matrix using a 16-bit data // element, in row-major layout. // Element a1 is contiguous in memory with element b1, etc. @@ -376,7 +414,7 @@ q.submit([&](sycl::handler& cgh) { sub_group sg = item.get_sub_group(); joint_matrix tA; joint_matrix tB; + layout::ext_intel_packed> tB; joint_matrix tC; joint_matrix_fill(sg, tC, 0); for (int k = 0; k < K; k += tK) { @@ -393,23 +431,6 @@ q.submit([&](sycl::handler& cgh) { }); q.wait(); ``` - -=== Intel-Specific Runtime Query -Besides the query we provide in -link:../experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], -some device descriptors are Intel hardware specific. 
These are -provided as part of `ext::intel::experimental::info::device::matrix` -namespace: - -[frame="none",options="header"] -|====================== -| Device descriptors | Return type| Description -|`ext::intel::experimental::info::device::matrix::numtiles`| `int` -|If the matrix hardware in the device has separate storage (register -files or tiles) from the rest of the processing units (e.g. Intel -AMX), returns the number of tiles. For other devices, returns 0. -|====================== - == Revision History [frame="none",options="header"] diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 14848c9439abf..e756cdcea74b3 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -95,10 +95,11 @@ features the implementation supports. |=== === New `joint_matrix` class -We introduce a new class called `joint_matrix`. The user needs to -specify the group memory scope, the type of the elements, the shape, -the matrix use, and the memory layout of the matrix. This results in -the following description: +This extension adds a new class named `joint_matrix`, which represents +a small 2-dimensional matrix that supports native operations in +hardware. There are a number of template parameters, namely the group +scope, the type of the elements, the matrix use, the shape, and the +memory layout of the matrix. 
This results in the following description: ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -106,20 +107,40 @@ namespace sycl::ext::oneapi::experimental::matrix { template -struct joint_matrix; +struct joint_matrix { + joint_matrix(); +}; } // namespace sycl::ext::oneapi::experimental::matrix ``` -When the `Use` parameter is `use::accumulator`, the `Layout` parameter -defaults to `layout::dynamic`, and it is invalid to specify any other -value for `Layout`. When `Use` has any other value, there is no default -for `Layout`, and the application must specify one explicitly. +==== Group Memory Scope +In this API, we use the terminology of `joint_matrix` instead of plain +`matrix` to emphasize that the matrix is shared among a group of work +items and is not private to each work item. The group scope is added +as an additional template parameter. This extension currently supports +only the sub-group scope, so the `Group` template parameter must be +`sycl::sub_group`. In this case, a matrix is declared as follows: -==== Use -The main operation performed by the matrix hardware is `D=C+A*B`. `Use` -argument specifies the usage of the matrix: matrix left (`A`), matrix -right (`B`) or accumulator (`C`) and `D`. This is required by backend -implementations to reason about the layout of the matrix in registers. +```c++ +joint_matrix tA; +``` + +==== Element Type +The `T` template parameter specifies the type of each element in the +matrix. Each device supports only certain element types, so the +application must use the query operations (defined below) to ensure +that the element type is supported on the device where the kernel +using this `joint_matrix` runs. + +==== Matrix Use +The main operation performed by the matrix hardware is `D=C+A*B`. The +`Use` template parameter specifies which of these terms (A, B, C, or D) +corresponds to the `joint_matrix` object. The `use` enumeration defines +the set of legal values. The A matrix must have the value `use::a`.
The +B matrix must have the value `use::b`. The C and D matrices must both +have the value `use::accumulator`. This is used by backend +implementations to reason about the layout of the matrix in +registers. ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -133,15 +154,18 @@ enum class use { } // namespace sycl::ext::oneapi::experimental::matrix ``` -==== Shape -The shape of a `joint_matrix` refers to its number of rows `Rows` and -number of columns `Cols`. +==== Matrix Shape +The `Rows` and `Cols` template parameters tell the number of rows and +columns in the joint matrix. Each device supports only certain +combinations of row and column sizes, so the application must use the +query operations (defined below) to ensure that the matrix shape is +supported on the device where the kernel using this `joint_matrix` runs. -==== Layout -This specifies the memory layout and it can be row major or column -major. `dynamic` layout is used on the `joint_matrix` type when this -is specified on the memory operations instead for the `accumulator` -matrix. +==== Matrix Layout +The `Layout` template parameter specifies the memory layout of the +matrix, using one of the values in the layout enumeration. The A and B +matrices can be either `layout::row_major` or `layout::col_major` (but not +`layout::dynamic`). The C and D matrices must be `layout::dynamic`. ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -154,28 +178,22 @@ enum class layout { } // namespace sycl::ext::oneapi::experimental::matrix ``` - -==== Group Memory Scope -In this API, we use the terminology of `joint_matrix` instead of plain -`matrix` to emphasize that the matrix is shared among a group of work -items and is not private to each work item. The group scope is added -as an additional template parameter. `Group` template parameter must -be `sycl::sub_group`. 
In this case, a matrix is declared as follows: - -```c++ -joint_matrix tA; -``` - -=== Matrix Operations and their Execution Scope -We define three new functions needed to perform the main and common -operations on matrices, namely load, store, and the actual multiply -and add operation. This set of functions can be easily extended if the -matrix hardware implements new features. - -Since the matrix functions are group operations (as defined in Section -4.17.3 of the SYCL specification), the matrix API has to be accessed -by all the work-items in the group in a convergent control flow. The -`Group` template argument must be `sycl::sub_group`. +Note that the `Layout` template parameter defaults to `layout::dynamic` +when `Use` is `use::accumulator`, so applications need not specify this +template parameter for the C or D matrices, and it is invalid to +specify any other value for `Layout`. When `Use` has any other value, +there is no default for `Layout`, and the application must specify one +explicitly. + +=== Collective matrix operations +The following operations (load, store, multiply-and-add, fill, and +element-wise operations) are group functions as defined in section +4.17.3 of the core SYCL specification. As such, they must be +encountered in convergent control flow by the work-items in the group +that performs the group operation. This extension currently supports +only the sub-group scope, so the `Group` template parameter must be +`sycl::sub_group`.
==== Load ```c++ @@ -184,7 +202,7 @@ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_load(Group g, +void joint_matrix_load(Group g, joint_matrix &res, multi_ptr src, size_t stride, layout Layout); @@ -194,7 +212,7 @@ template - void joint_matrix_load(Group g, +void joint_matrix_load(Group g, joint_matrix &res, multi_ptr src, size_t stride); @@ -224,7 +242,7 @@ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_store(Group g, +void joint_matrix_store(Group g, joint_matrix &res, multi_ptr dest, size_t stride, layout Layout); @@ -236,7 +254,7 @@ tiles back to memory. The base pointer `dest` here determines the starting address of the matrix to be stored. `Layout` determines whether the data is being -written in a row (`row_major`), column major (`column_major`) +written in a row (`row_major`), column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout. @@ -249,8 +267,8 @@ namespace sycl::ext::oneapi::experimental::matrix { template - joint_matrix - joint_matrix_mad(Group g, +joint_matrix +joint_matrix_mad(Group g, joint_matrix A, joint_matrix B, joint_matrix C); @@ -261,8 +279,7 @@ The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C` and returns the result. 
- -==== Matrix Initialization: `joint_matrix_fill` +==== Fill (Initialization) Unlike `joint_matrix_load` that assumes that all the matrices are directly loaded from memory, `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather @@ -276,7 +293,7 @@ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_fill(Group g, joint_matrix &m, Tv v); } // namespace sycl::ext::oneapi::experimental::matrix @@ -305,7 +322,7 @@ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_apply(Group g, joint_matrix& C, F&& func); } // namespace sycl::ext::oneapi::experimental::matrix @@ -362,7 +379,7 @@ namespace sycl::ext::oneapi::experimental::matrix { template - void joint_matrix_load(Group g, +void joint_matrix_load(Group g, joint_matrix &res, multi_ptr src, size_t stride, layout Layout); @@ -371,7 +388,7 @@ template - void joint_matrix_load(Group g, +void joint_matrix_load(Group g, joint_matrix &res, multi_ptr src, size_t stride); @@ -401,7 +418,7 @@ users load or store data to/from joint matrices. If users want the joint matrix mantissas rounded from 23 bits (`float`) to 10 bits (`tf32`) instead of truncated, an explicit rounding function should be used. A new function `round_to_tf32` is added to perform -the rounding to `tf32`. +the round to nearest even (RTE) rounding mode. ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -411,7 +428,7 @@ float round_to_tf32(float elem); } // namespace sycl::ext::oneapi::experimental::matrix ``` -=== Example using int8_t type +=== Example using `int8_t` type ```c++ using namespace sycl::ext::oneapi::experimental::matrix; @@ -488,9 +505,6 @@ defined.
[frame="none",options="header"] |====================== | Member/type alias in `matrix_params` | Description -|`type_a`| type alias for the type of matrix A -|`type_b`| type alias for the type of matrix B -|`type_accumulator`| type alias for the type of matrix accumulator |`M`|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does @@ -527,10 +541,6 @@ struct matrix_params { // compilation error when the matrix types or shapes are not // supported by the device identified by "Dev". - using type_a = Ta; - using type_b = Tb; - using type_accumulator = Taccumulator; - static constexpr size_t M = sM; static constexpr size_t N = sN; static constexpr size_t K = sK; @@ -555,10 +565,6 @@ struct matrix_params { // compilation error when the matrix types are not supported by the // device identified by "Dev". - using type_a = Ta; - using type_b = Tb; - using type_accumulator = Taccumulator; - static constexpr size_t M = /* implementation defined */; static constexpr size_t N = /* implementation defined */; static constexpr size_t K = /* implementation defined */; @@ -691,14 +697,14 @@ the `T` template parameter as follows: + `tf32`: `sycl::ext::oneapi::experimental::matrix::precision::tf32` + `fp32`: `float` + `fp64`: `double` + -`sint8`: signed 8 bits signed integer + -`sint16`: `signed short` + -`sint32`: `signed int` + -`sint64`: `signed long` + -`uint8`: unsigned 8 bits integer + -`uint16`: `unsigned short` + -`uint32`: `unsigned int` + -`uint64`: `unsigned long` +`sint8`: `int8_t` + +`sint16`: `int16_t` + +`sint32`: `int32_t` + +`sint64`: `int64_t` + +`uint8`: `uint8_t` + +`uint16`: `uint16_t` + +`uint32`: `uint32_t` + +`uint64`: `uint64_t` |====================== ===== Runtime Query Example: @@ -721,8 +727,8 @@ for (int i = 0; sizeof(combinations); i++) { The table below provides a list of the combinations that `joint_matrix` 
implementations support on each of Intel AMX and Intel -XMX hardware. Note that these can be returned in a parametrized way -using the `matrix_params` query class. +XMX hardware. Note that these can be returned using +`ext::oneapi::experimental::info::device::matrix::combinations`. ==== Intel AMX Supported Combinations This is currently available in @@ -731,7 +737,13 @@ This is currently available in [frame="none",options="header"] |====================== | A type | B type | Accumulator type | M | N | K -| `matrix_type::(u)int8` | `matrix_type::(u)int8` | +| `matrix_type::uint8` | `matrix_type::uint8` | +`matrix_type::sint32` | +<=+ 16 | +<=+ 16 | +<=+ 64 +| `matrix_type::uint8` | `matrix_type::int8` | +`matrix_type::sint32` | +<=+ 16 | +<=+ 16 | +<=+ 64 +| `matrix_type::int8` | `matrix_type::uint8` | +`matrix_type::sint32` | +<=+ 16 | +<=+ 16 | +<=+ 64 +| `matrix_type::int8` | `matrix_type::int8` | `matrix_type::sint32` | +<=+ 16 | +<=+ 16 | +<=+ 64 | `matrix_type::bf16` | `matrix_type::bf16` | `matrix_type::fp32` | +<=+ 16 | +<=+ 16 | +<=+ 32 @@ -745,18 +757,24 @@ This is currently available in [frame="none",options="header"] |====================== | A type | B type | Accumulator type | M | N | K | device -| `matrix_type::(u)int8` | `matrix_type::(u)int8` | -`matrix_type::int32` | +<=+ 8 | 16 | 32 | -sycl::ext::oneapi::experimental::architecture::intel_gpu_pvc -| | | | |8||sycl::ext::oneapi::experimental::architecture::intel_gpu_dg2 +| `matrix_type::uint8` | `matrix_type::uint8` | +`matrix_type::int32` | +<=+ 8 | 16 | 32 | architecture::intel_gpu_pvc +| | | | |8||architecture::intel_gpu_dg2 +| `matrix_type::uint8` | `matrix_type::int8` | +`matrix_type::int32` | +<=+ 8 | 16 | 32 | architecture::intel_gpu_pvc +| | | | |8||architecture::intel_gpu_dg2 +| `matrix_type::int8` | `matrix_type::uint8` | +`matrix_type::int32` | +<=+ 8 | 16 | 32 | architecture::intel_gpu_pvc +| | | | |8||architecture::intel_gpu_dg2 +| `matrix_type::int8` | `matrix_type::int8` | 
+`matrix_type::int32` | +<=+ 8 | 16 | 32 | architecture::intel_gpu_pvc +| | | | |8||architecture::intel_gpu_dg2 | `matrix_type::fp16` | `matrix_type::fp16` | -`matrix_type::fp32` | +<=+ 8 | 16 | 16 | -sycl::ext::oneapi::experimental::architecture::intel_gpu_pvc -| | | | |8||sycl::ext::oneapi::experimental::architecture::intel_gpu_dg2 +`matrix_type::fp32` | +<=+ 8 | 16 | 16 | architecture::intel_gpu_pvc +| | | | |8|| architecture::intel_gpu_dg2 | `matrix_type::bf16` | `matrix_type::bf16` | -`matrix_type::fp32` | +<=+ 8 | 16 | 16 | -sycl::ext::oneapi::experimental::architecture::intel_gpu_pvc -| | | | |8||sycl::ext::oneapi::experimental::architecture::intel_gpu_dg2 +`matrix_type::fp32` | +<=+ 8 | 16 | 16 | architecture::intel_gpu_pvc +| | | | |8|| architecture::intel_gpu_dg2 |====================== From f5694eb97c47d4da19a3da854f15e07b5428ddb8 Mon Sep 17 00:00:00 2001 From: Dounia Date: Thu, 23 Mar 2023 10:33:00 -0700 Subject: [PATCH 27/51] fix formatting --- .../sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 782a58ddbb8c8..9e392db4464eb 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -95,7 +95,7 @@ extension's APIs the implementation supports. === New Aspect for Intel-Specific Matrix APIs This extension adds a new device aspect: - +```c++ namespace sycl { enum class aspect : /*unspecified*/ { @@ -103,7 +103,7 @@ enum class aspect : /*unspecified*/ { }; } // namespace sycl - +``` The `ext_intel_matrix` aspect indicates that the device is capable of using the extended joint matrix APIs that are defined in the sections that follow. 
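NOTE: As a standalone, bit-level illustration of the round-to-nearest-even conversion to `tf32` precision (10 mantissa bits) that `round_to_tf32` is described above as performing, here is a plain C++ sketch. The helper name `round_to_tf32_model` is an assumption; this models only the rounding behavior (NaN/Inf handling omitted) and is not the implementation.

```c++
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Round an IEEE-754 float to tf32 precision (10 mantissa bits) with
// round-to-nearest-even: add half of the dropped range plus the
// round-to-even tie-break bit, then clear the 13 low mantissa bits.
float round_to_tf32_model(float x) {
  std::uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits += 0x00000FFFu + ((bits >> 13) & 1u); // RNE increment
  bits &= 0xFFFFE000u;                       // drop 13 mantissa bits
  float out;
  std::memcpy(&out, &bits, sizeof(out));
  return out;
}
```

Values that are already representable in tf32 (such as 1.0f) pass through unchanged, and any result has its 13 low mantissa bits cleared.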
From 862880e873715a32bdddf77cd40dee598cd34a37 Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 24 Apr 2023 14:17:42 -0700 Subject: [PATCH 28/51] Address Greg's comments: remove loop-based indexing, add Td and default to Tc in mad and query, add copy constructor and assignment op --- .../sycl_ext_intel_matrix.asciidoc | 129 ++---------------- .../sycl_ext_oneapi_matrix.asciidoc | 57 +++++--- 2 files changed, 44 insertions(+), 142 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 9e392db4464eb..0a0cc6f549927 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -123,7 +123,6 @@ enum class layout { } // namespace sycl::ext::oneapi::experimental::matrix ``` - Consequently, the layout argument `layout` in `joint_matrix_load` can take `ext_intel_packed` as argument to specify that the data has already been transformed into VNNI format. In this case, the `stride` argument of `joint_matrix_load` describes the number of elements between consecutive rows for packed layouts. @@ -166,132 +165,20 @@ The function `joint_matrix_apply` in `sycl_ext_oneapi_matrix` provides a way for the application to apply the same operation on every element of the matrix. However, some algorithms require the application to -know the coordinates of each element as it operates on -them. In this case, the operation depends on the element index of the -matrix or the operation takes multiple elements as operands (such as a -sum of all elements in a row for example). Quantization that is needed -for conversion between low precision types like `int8_t` and `fp32` -uses such piece-wise operations. +know the coordinates of each element as it operates on them. In this +case, the joint matrix index must be known in order to reason about +the matrix view and extract the relevant piece (such as the sum of all +elements in a row). For instance, quantization that is +needed for conversion between low precision types like `int8_t` and `fp32` +uses such logic. -==== Explicit conversion with mapping from SIMD to SPMD -The data elements in a `joint_matrix` are distributed or shared across -the work-items in the Group in an implementation-defined way.
There is -no fixed allocation of matrix elements owned by a `joint_matrix` -instance to the WIs comprising the group used to instantiate it. For -instance, the matrix is a shared entity among the work items in the -case of the AMX backend because the AMX tile that holds the matrix -data is a 2d register that is shared among the work items. Therefore -the partitioning among the WIs is implementation defined. However, it -is necessary to allocate WIs to specific elements of the matrix in -order to perform element-wise operations. In order to be able to -perform element-wise operations in a general and efficient way, we -provide a conversion function from the `joint_matrix` domain that is -owned by a group of work items to the portion that is owned by each -work item. This enables the WI to perform piece-wise operations on the -matrix within the SYCL SPMD programming model. - -We introduce a new function `get_wi_data` that provides a view of the -portion of the matrix that is owned by the current WI. The indexing -provided inside the `wi_data` class accesses only the portion of the -current WI and returns `wi_element`. This latter holds a reference to -the original joint_matrix that `wi_data` was constructed from. This -means that modifying `wi_data` also modifies the corresponding joint -matrix elements. Users can use the `=` operator to update the element -of the `joint_matrix` represented by the `wi_element` after the -element-wise operation. - -Using `get_wi_data`, it is not possible to know which portions of data -are owned by each thread in the group as this is implementation -defined and changes from one backend to the other. For general -piece-wise operations such as summing the rows of a matrix, the WI -data to joint matrix mapping coordinates information must be known in -order to reason about the matrix view and extract the relevant -piece. 
However, for element-wise operations where the same operation -is performed on all the elements of the matrix, having all the WIs in -the group apply the operation inside a loop iterating over the -`length` of `wi_data` guarantees the whole matrix element-wise operation. - -Note that `get_wi_data` cannot return a fixed size array length -because the length of the WI portion is a runtime variable for the -following reasons: - -1- The main compilation mode of SYCL is JIT compilation and -partitioning among WIs is implementation defined. - -2- Sub group size is not generally fixed. - -The code listing below shows a synopsis of these new APIs. - -```c++ -namespace sycl::ext::intel::experimental::matrix { - -wi_data get_wi_data(Group g, - joint_matrix Mat); - -template -class wi_data { - size_t length(); - wi_element operator[](size_t i); -}; -template -class wi_element { - operator T(); - wi_element &operator=(const T &rhs); - wi_element &operator+=(const T &rhs); - wi_element &operator-=(const T &rhs); - wi_element &operator*=(const T &rhs); - wi_element &operator/=(const T &rhs); - - std::tuple get_coord(); -}; - -} // namespace sycl::ext::intel::experimental::matrix -``` +know the coordinates of each element as it operates on them. In this +case, the joint matrix index must be known in order to reason about +the matrix view and extract the relevant piece such as a sum of all +elements in a row for example. For instance, quantization that is +needed for conversion between low precision types like `int8_t` and `fp32` +uses such logic. -In the following example `wi_data_c` is a reference to the WI owned -portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` -updates the corresponding matrix element in the joint_matrix `matC`. -Vectorization along the sub group dimension will get enabled -automatically to vectorize the contiguous portion of the matrix. 
- - -```c++ -auto wi_data_c = get_wi_data(sg, matC); -for (int i = 0; i < wi_data_c.length(); i++) - wi_data_c[i] *= alpha; // Note that the indexing here "i" - //is in the vector owned by a WI, not in the matrix C -``` - -==== Work-item data to joint matrix mapping coordinates -The `wi_data` and `wi_element` classes provide access to the matrix -elements that are local to the calling work-item. However, the -distribution of matrix elements to each work-item is -implementation-defined, so application code cannot assume any fixed -distribution. Instead, application code can use the `get_coord` method -to query the matrix coordinates of an individual `wi_element`. - -`get_coord` returns [row,col] coordinates of the current object -`wi_element` of the joint matrix. The code above results into the following: - -```c++ -auto data = get_wi_data(sg, tA); -// each WI calculates local sum of rows -for (int i = 0; i < data.length(); ++i) { - auto [row, col] = data[i].get_coord(); - sum_local_rows[row] += data[i]; -} -``` - -==== Extending `joint_matrix_apply` with the mapping coordinates This extension adds a new form of the `joint_matrix_apply` function in the `sycl::ext::intel::matrix` namespace that allows the application to perform an operation on each element of the matrix. 
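The coordinate-based form of `joint_matrix_apply` introduced here can be modeled in plain C++ to show why per-element coordinates matter: once the callback receives a `(row, col)` position, reductions such as per-row sums no longer depend on the implementation-defined distribution of elements across work-items. A standalone sketch (no SYCL; `apply_with_coords` and `row_sums` are hypothetical stand-ins for illustration only):

```cpp
#include <array>
#include <cassert>
#include <functional>

// Standalone model (no SYCL): apply_with_coords is a hypothetical
// stand-in for the coordinate-aware joint_matrix_apply. The callback
// receives each element together with its (row, col) position, so a
// caller can build reductions such as per-row sums without knowing
// how an implementation distributes elements across work-items.
constexpr int Rows = 4, Cols = 8;

void apply_with_coords(std::array<float, Rows * Cols> &m,
                       const std::function<void(float &, int, int)> &f) {
  // The visiting order here is arbitrary; a real implementation may
  // visit elements in any implementation-defined order.
  for (int r = 0; r < Rows; ++r)
    for (int c = 0; c < Cols; ++c)
      f(m[r * Cols + c], r, c);
}

// Per-row sums driven purely by the reported coordinates.
std::array<float, Rows> row_sums(std::array<float, Rows * Cols> m) {
  std::array<float, Rows> sums{};
  apply_with_coords(m, [&](float &x, int row, int /*col*/) { sums[row] += x; });
  return sums;
}
```

For an all-ones 4x8 matrix, every row sum is 8.0 regardless of the order in which elements are visited, which is exactly the property the coordinate callback provides.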
This function is similar to the form in `sycl_ext_oneapi_joint_matrix`, but it also provides the matrix coordinates of each element to the callback function: + ```c++ namespace sycl::ext::intel::experimental::matrix { @@ -414,7 +301,7 @@ q.submit([&](sycl::handler& cgh) { sub_group sg = item.get_sub_group(); joint_matrix tA; joint_matrix tB; + layout::ext_intel_packed> tB; joint_matrix tC; joint_matrix_fill(sg, tC, 0); for (int k = 0; k < K; k += tK) { diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index e756cdcea74b3..4406379651bb6 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -99,7 +99,7 @@ This extension adds a new class named `joint_matrix`, which represents a small 2-dimensional matrix that supports native operations in hardware. There are a number of template parameters, namely the group scope, the type of the elements, the matrix use, the shape, and the -memory layout of the matrix. This results in the following description: +memory layout of the matrix. This results in the following description: ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -109,17 +109,25 @@ template struct joint_matrix { joint_matrix(); + joint_matrix(const joint_matrix &); + joint_matrix &operator=(const joint_matrix &); }; } // namespace sycl::ext::oneapi::experimental::matrix ``` +Note that the declaration of the matrix along with the copy +constructor and assignment must appear in converged control flow. + ==== Group Memory Scope -In this API, we use the terminology of `joint_matrix` instead of plain -`matrix` to emphasize that the matrix is shared among a group of work -items and is not private to each work item. The group scope is added -as an additional template parameter. 
This extension currently supports -only the sub-group scope, so the `Group` template parameter must be -`sycl::sub_group`. In this case, a matrix is declared as follows: +Most operations on the joint_matrix are group functions, meaning that +all work items in a group collectively perform an operation on the +same matrix. The `Group` template parameter specifies the execution +scope of the work-items in the group. The `joint_matrix` is shared among the +work items in the group and is not private to each work item. This +extension currently supports only the sub-group scope, so the `Group` +template parameter must be `sycl::sub_group`, and group operations for +the joint matrix must be done collectively by the work-items in a +single sub-group. In this case, a matrix is declared as follows: ```c++ joint_matrix tA; @@ -266,7 +274,7 @@ namespace sycl::ext::oneapi::experimental::matrix { template + layout LayoutA, layout LayoutB, typename Td = Tc> joint_matrix joint_matrix_mad(Group g, joint_matrix A, @@ -534,7 +542,7 @@ namespace sycl::ext::oneapi::experimental::matrix { // This is the validation form, when all template parameters are // specified. template struct matrix_params { // An implementation typically uses static_assert here to trigger a @@ -552,15 +560,17 @@ struct matrix_params { using joint_matrix_b = joint_matrix; template - using joint_matrix_accumulator = joint_matrix; + using joint_matrix_c = joint_matrix; + + template + using joint_matrix_d = joint_matrix; }; // This is the default values form, where the matrix dimensions are // omitted. template -struct matrix_params { +Ta, typename Tb, typename Tc, typename Td> +struct matrix_params { // An implementation typically uses static_assert here to trigger a // compilation error when the matrix types are not supported by the // device identified by "Dev". 
@@ -576,8 +586,11 @@ struct matrix_params { using joint_matrix_b = joint_matrix; template - using joint_matrix_accumulator = joint_matrix; + using joint_matrix_c = joint_matrix; + + template + using joint_matrix_d = joint_matrix; + template }; } // namespace sycl::ext::oneapi::experimental::matrix @@ -606,7 +619,7 @@ size_t NDRangeN = N / myparams::N; // device code: the matrices are constructed using the default dimensions myparams::joint_matrix_a sub_a; myparams::joint_matrix_b sub_b; -myparams::joint_matrix_accumulator sub_c; +myparams::joint_matrix_c sub_c; ``` ==== Runtime Query @@ -663,7 +676,8 @@ struct combination { uint32_t ksize; matrix_type atype; matrix_type btype; - matrix_type accumulatortype; + matrix_type ctype; + matrix_type dtype; }; } // namespace sycl::ext::oneapi::experimental::matrix @@ -688,7 +702,7 @@ discrete number of element sizes, each of these members is non-zero, and the value tells one of the supported element sizes. By contrast, if the matrix hardware supports a continuous number of element sizes, each of these members has the value zero -|`atype`, `btype`, `accumulatortype`| indicates the types supported in +|`atype`, `btype`, `ctype`, `dtype`| indicates the types supported in the combination. 
these are of type `matrix_type` which tells the list of types that are supported for the A, B, and accumulator matrices in the `T` template parameter as follows: + @@ -709,13 +723,14 @@ the `T` template parameter as follows: + ===== Runtime Query Example: ```c++ -// Ta, Tb, Taccumulator are the types used in applications +// Ta, Tb, Tc, and Td are the types used in applications std::vector combinations = device.get_info(); for (int i = 0; sizeof(combinations); i++) { if (Ta == combinations[i].atype && Tb == combinations[i].btype && - Tc == combinations[i].accumulatortype) { + Tc == combinations[i].ctype && + Td == combinations[i].dtype) { // joint matrix GEMM kernel can be called using these sizes joint_matrix_gemm(combinations[i].msize, combinations[i].nsize, combinations[i].ksize); @@ -736,7 +751,7 @@ This is currently available in [frame="none",options="header"] |====================== -| A type | B type | Accumulator type | M | N | K +| A type | B type | Accumulator type (`ctype` = `dtype`) | M | N | K | `matrix_type::uint8` | `matrix_type::uint8` | `matrix_type::sint32` | +<=+ 16 | +<=+ 16 | +<=+ 64 | `matrix_type::uint8` | `matrix_type::int8` | From 885cf09d4d76354cb1ff1334a72d7143141abcbc Mon Sep 17 00:00:00 2001 From: Dounia Date: Tue, 23 May 2023 10:08:37 -0700 Subject: [PATCH 29/51] Incorporate Greg's suggestions --- .../sycl_ext_intel_matrix.asciidoc | 2 +- .../sycl_ext_oneapi_matrix.asciidoc | 271 ++++++++++-------- 2 files changed, 150 insertions(+), 123 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 0a0cc6f549927..187e5a7ce5a03 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -161,7 +161,7 @@ void joint_matrix_store(Group g, } // namespace 
sycl::ext::intel::experimental::matrix ``` -=== Per-element Operations with Coordinates +=== Per-element Access with Coordinates The function `joint_matrix_apply` in `sycl_ext_oneapi_matrix` provides a way for the application to apply the same operation on every element of the matrix. However, some algorithms require the application to diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 4406379651bb6..2a5851e5d3a24 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -106,7 +106,7 @@ namespace sycl::ext::oneapi::experimental::matrix { template + layout::dynamic : /*unspecified*/ > struct joint_matrix { joint_matrix(); joint_matrix(const joint_matrix &); @@ -115,8 +115,10 @@ struct joint_matrix { } // namespace sycl::ext::oneapi::experimental::matrix ``` -Note that the declaration of the matrix along with the copy -constructor and assignment must appear in converged control flow. +The constructors for the `joint_matrix` type and the assignment +operator are group functions as defined in section 4.17.3 of the core +SYCL specification. They must be encountered in converged control flow +by all work-items in the `Group`. ==== Group Memory Scope Most operations on the joint_matrix are group functions, meaning that @@ -198,10 +200,7 @@ The following operations (load, store, multiply-and-add, fill, and element-wise operations) are group functions as defined in section 4.17.3 of the core SYCL specification. As such, they must be encountered in convergent control flow by the work-items in the group -that performs the group operation. The `Group` template argument must -be `sycl::sub_group`. 
This extension currently supports -only the sub-group scope, so the `Group` template parameter must be -`sycl::sub_group`. +that performs the group operation. ==== Load ```c++ @@ -211,9 +210,8 @@ template void joint_matrix_load(Group g, - joint_matrix &res, - multi_ptr src, size_t stride, layout Layout); + joint_matrix &res, + multi_ptr src, size_t stride, layout Layout); // Only available when Layout != layout::dynamic template void joint_matrix_load(Group g, joint_matrix &res, - multi_ptr src, size_t stride); + multi_ptr src, size_t stride); } // namespace sycl::ext::oneapi::experimental::matrix ``` -`joint_matrix_load` loads data from memory to the 2d tiles/registers -of the matrix hardware. +`joint_matrix_load` loads data from memory to the registers of the +matrix hardware. We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not. The first overload that takes memory layout as an argument is only @@ -251,14 +249,13 @@ namespace sycl::ext::oneapi::experimental::matrix { template void joint_matrix_store(Group g, - joint_matrix &res, - multi_ptr dest, size_t stride, layout Layout); + const joint_matrix &res, + multi_ptr dest, size_t stride, layout Layout); } // namespace sycl::ext::oneapi::experimental::matrix ``` -This function stores the data in the accumulator matrix from the 2d -tiles back to memory. +This function stores the data in the accumulator matrix from the +registers back to memory. The base pointer `dest` here determines the starting address of the matrix to be stored. 
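The base pointer and `stride` arguments of `joint_matrix_load` and `joint_matrix_store` follow the usual tiled addressing convention: the tile element at position `(r, c)` lives at `base[r * stride + c]`, with `stride` measured in elements of the enclosing matrix, not of the tile. A standalone model of that addressing (plain C++; `store_tile` is a hypothetical helper, not part of the extension):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Standalone model (no SYCL) of the addressing convention the load
// and store functions use: `base` points at the first element of the
// tile inside a larger row-major matrix, and `stride` is the number
// of elements between consecutive rows of that larger matrix.
void store_tile(float *base, std::size_t stride, std::size_t rows,
                std::size_t cols, float value) {
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      base[r * stride + c] = value;  // element (r, c) of the tile
}
```

For an 8x8 row-major matrix, storing a 2x3 tile whose top-left corner is at row 1, column 2 uses `store_tile(&big[1 * 8 + 2], 8, 2, 3, v)`; only the six elements of that tile are written.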
`Layout` determines whether the data is being @@ -277,9 +274,9 @@ template joint_matrix joint_matrix_mad(Group g, - joint_matrix A, - joint_matrix B, - joint_matrix C); + const joint_matrix &A, + const joint_matrix &B, + const joint_matrix &C); } // namespace sycl::ext::oneapi::experimental::matrix ``` @@ -287,14 +284,17 @@ The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C` and returns the result. +Each device supports only certain combinations of types for the `A`, +`B`, and `C` matrices. The application must use the query operations +(defined below) to ensure that the combination of types is supported +on the device where the kernel calling `joint_matrix_mad` runs. + ==== Fill (Initialization) Unlike `joint_matrix_load` that assumes that all the matrices are directly loaded from memory, `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather -initialized directly in the register. On Intel AMX, if the -initialization constant is zero, this would map to the `_tile_zero` -intrinsic. Note that the value type `Tv` must be convertible to the -matrix elements type `T`. +initialized directly in the register. Note that the value type `Tv` +must be convertible to the matrix elements type `T`. ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -317,7 +317,7 @@ of the position of the element within the matrix. Activation functions or adding a constant value to every element of the matrix are two examples of this usage. 
When the operation depends on the element index of the matrix, an Intel-specific extension is available as part -of the * link:sycl_ext_intel_matrix.asciidoc[sycl_ext_intel_matrix] +of the link:sycl_ext_intel_matrix.asciidoc[sycl_ext_intel_matrix] Besides the `Group` and the `joint_matrix` arguments, `joint_matrix_apply` takes a C++ Callable object which is invoked once @@ -346,18 +346,17 @@ joint_matrix_apply(sg, C, [=](T &x) { relu(x); }); ``` -IMPORTANT: `joint_matrix_apply` is not implemented yet. - -=== Support for `tf32` Floating Point Type -Besides C++ `half`, `float`, `double` types, and `sycl::bfloat16` types, joint -matrix implementations may support other low-precision floating-point types -such as `tf32`. `tf32` type has a 19 bit format with one sign bit, 8 -exponent bits offering the same range as `fp32`, and 10 mantissa bits -offering same precision as half type. The usage of `tf32` type is -restricted to `joint_matrix` using: -`sycl::ext::oneapi::experimental::matrix::precision::tf32`. - -Joint matrix type `tf32` is defined as an empty class with no member functions. + +=== Support for the TF32 Data Type +Some devices support the TF32 floating point type for matrix +elements. This type has a 19 bit format with one sign bit, 8 exponent +bits (offering the same range as float), and 10 mantissa bits +(offering the same precision as sycl::half). Use of this type can +accelerate the joint_matrix_mad operation by reducing its +precision. In order to declare a `joint_matrix` object with this +element type, use `matrix::precision::tf32` in place of the `T` +template parameter. 
+ ```c++ namespace sycl::ext::oneapi::experimental::matrix::precision { @@ -365,22 +364,23 @@ class tf32; } // namespace sycl::ext::oneapi::experimental::matrix::precision ``` -In this case, a `tf32` joint matrix type is declared by using the -`precision::tf32` type for the `T` template parameter as follows: + +For example: ```c++ joint_matrix tA; ``` -The purpose of this support is to accelerate the `joint_matrix_mad` -operation while reducing its precision. The rest of the application -uses `fp32` type. +Whenever the application loads, stores, fills, or accesses the +elements of a TF32 matrix, the application sees the elements as +float. There are special overloads of these functions for TF32 for +this purpose. -Specifically, joint matrix load performs float type memory access to -tf32 joint matrix using the following overloads. Note that it is -unspecified whether the implementation stores all 32 bits or only the -19 bits into the tf32 joint matrix. +==== TF32 load +These overloads of `joint_matrix_load` load float values into a TF32 +matrix. It is unspecified whether the implementation loads all 32 bits +into the joint matrix or if it only loads the relevant 19 bits. ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -402,31 +402,52 @@ void joint_matrix_load(Group g, } // namespace sycl::ext::oneapi::experimental::matrix ``` -Joint matrix store is only available for the matrix accumulator for -which tf32 does not apply. Also, `Tv` type in joint matrix fill used -to initialize the tf32 joint matrix is `float`. Note that it is -unspecified whether the implementation stores all 32 bits or only the -19 bits into the tf32 joint matrix after the fill operation. - -Finally, the return type of element-wise accesses of a tf32 -`joint_matrix` is float. Consequently, general arithmetic is done on -`fp32` data. In this case, the type used in the function object passed -to `joint_matrix_apply` has to be `float`. 
In the example below, `C` is a
-joint matrix of type `precision::tf32`.
+
+==== TF32 store
+This overload of `joint_matrix_store` stores float values from a TF32
+matrix.
+
+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+
+template
+void joint_matrix_store(Group g,
+ const joint_matrix &res,
+ multi_ptr dest, size_t stride, layout Layout);
+
+} // namespace sycl::ext::oneapi::experimental::matrix
+```
+
+==== TF32 fill
+When `joint_matrix_fill` is called for a TF32 matrix, the type `Tv`
+(the type of the fill value) must be implicitly convertible to
+`float`. It is unspecified whether the implementation writes all 32
+bits of the value into the joint matrix or if it only writes the
+relevant 19 bits.
+
+==== TF32 element-wise operations
+When `joint_matrix_apply` is called for a TF32 matrix, the Callable
+object `func` is called with a single argument of type `float &`. When the
+application changes this value, it is unspecified whether the
+implementation writes back all 32 bits of the element into the joint
+matrix or if it only writes the relevant 19 bits.
+
+In the example below, `C` is a joint matrix of type `precision::tf32`.
```c++
joint_matrix_apply(sg, C, [=](float &x) { x *= alpha; });
```
-
-Joint matrix APIs operate on floats. No implicit rounding happens when
-users load or store data to/from joint matrices. By default,
-`joint_matrix_mad` works on truncated values (13 bits set to zero). If
-users want the joint matrix mantissas rounded from 23 bits (`float`) to
-10 bits `tf32` instead of truncated, an explicit rounding function
-should be used. A new function `round_to_tf32` is added to perform
-the round to nearest even (RTE) rounding mode.
+==== Rounding TF32 values
+The functions `joint_matrix_load`, `joint_matrix_fill`, and
+`joint_matrix_apply` do not define any rounding mode when the float
+values are converted to TF32, and the implementation may either round
+or truncate these conversions. 
If an application wants more control
+over this rounding, it can use the `round_to_tf32` function. This
+performs the round to nearest even (RTE) rounding mode.
```c++
namespace sycl::ext::oneapi::experimental::matrix {
@@ -479,19 +500,26 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
Most devices support only certain values for the `Rows` and `Cols`
template parameters and only certain types for the `T` template
parameter. Moreover, most devices support only certain combinations of
-these template parameter for the A, B, and accumulator matrices (see
-Appendix: Supported Combinations Per Hardware). This extension adds
-two query APIs that can be used to determine the set of legal
-parameters for a particular device. One form provides `constexpr`
-values for these parameters, which can be used when the application
-knows the specific device architecture on which it will run. The other
-form uses the standard information descriptor queries for the device
-object.
+these template parameters for the A, B, C, and D matrices in the
+`joint_matrix_mad` function (see Appendix: Supported Combinations Per
+Hardware). This extension adds two query APIs that can be used to
+determine the set of legal parameters for a particular device. One
+form provides `constexpr` values for these parameters, which can be
+used when the application knows the specific device architecture on
+which it will run. The other form uses the standard information
+descriptor queries for the device object.
+
+The description below uses the terms `M`, `N`, and `K` to identify the
+matrix dimensions of a multiply and add operation `D = C + A*B`. The
+`D` and `C` matrices are `M` rows by `N` columns. The `A` matrix is
+`M` rows by `K` columns, and the `B` matrix is `K` rows by `N` columns. 
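The difference between truncation (the 13 low mantissa bits are simply zeroed) and the round-to-nearest-even behavior of `round_to_tf32` can be modeled with ordinary float bit manipulation. A standalone sketch (assumes IEEE-754 `float`; `truncate_to_tf32` and `round_to_tf32_model` are illustrative helpers, not the extension's API):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Standalone model (assumes IEEE-754 float) of the two conversion
// behaviors described above. TF32 keeps 10 mantissa bits, so the low
// 13 bits of a float's 23-bit mantissa are dropped. Truncation zeroes
// them; round_to_tf32 rounds to nearest, ties to even, first.
float truncate_to_tf32(float x) {
  std::uint32_t u;
  std::memcpy(&u, &x, sizeof u);
  u &= ~std::uint32_t(0x1FFF);  // clear the 13 dropped mantissa bits
  std::memcpy(&x, &u, sizeof u);
  return x;
}

float round_to_tf32_model(float x) {
  std::uint32_t u;
  std::memcpy(&u, &x, sizeof u);
  // Round to nearest, ties to even: add 0x0FFF plus the lowest kept
  // bit, then clear the dropped bits.
  u += 0x0FFFu + ((u >> 13) & 1u);
  u &= ~std::uint32_t(0x1FFF);
  std::memcpy(&x, &u, sizeof u);
  return x;
}
```

For `1.0f` (zero mantissa) both helpers return `1.0f`; for `1.0f` with all 13 dropped mantissa bits set, truncation returns `1.0f` while the rounding model returns the next larger TF32 value.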
==== Compile-Time Query This returns `constexpr` values to use in `joint_matrix` template arguments but depends on an enumeration of the matrix hardware (See -`sycl::ext::oneapi::experimental::architecture`) that can be tested. +`sycl::ext::oneapi::experimental::architecture`) in the +link:../sycl_ext_oneapi_device_architecture.asciidoc[sycl_ext_oneapi_device_architecture] +extension that can be tested. The compile-time query interface proposed here consists of two functionalities: @@ -502,9 +530,9 @@ functionalities: - Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely - `joint_matrix_a/b/accumulator` where no additional argument is - needed. This form happens when the user specifies all template - parameters except the sizes of the matrices (`tiles`) M, N, and K. + `joint_matrix_a/b/c/d` where no additional argument is needed. This + form happens when the user specifies all template parameters except + the sizes of the matrices M, N, and K. The table below provides a description for each of the member variables in `matrix_params` class and the forms in which they are @@ -528,12 +556,14 @@ default size for K; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation -|`template using joint_matrix_a;`| type -alias for `joint_matrix` for matrix A -|`template using joint_matrix_b;`| type -alias for `joint_matrix` for matrix B -|`template using joint_matrix_accumulator;`| type -alias for `joint_matrix` for matrix accumulator +|`template + +using joint_matrix_a;`| type alias for `joint_matrix` for matrix A +|`template + +using joint_matrix_b;`| type alias for `joint_matrix` for matrix B +|`template + +using joint_matrix_c;`| type alias for `joint_matrix` for the input matrix accumulator +|`template + +using joint_matrix_d;`| type alias for `joint_matrix` for the output matrix accumulator |====================== ```c++ @@ -541,13 +571,12 @@ namespace sycl::ext::oneapi::experimental::matrix { // This is the validation form, when all template parameters are // specified. -template +template struct matrix_params { // An implementation typically uses static_assert here to trigger a // compilation error when the matrix types or shapes are not - // supported by the device identified by "Dev". + // supported by the device identified by the architecture "Arch". static constexpr size_t M = sM; static constexpr size_t N = sN; @@ -568,12 +597,11 @@ struct matrix_params { // This is the default values form, where the matrix dimensions are // omitted. -template -struct matrix_params { +template +struct matrix_params { // An implementation typically uses static_assert here to trigger a // compilation error when the matrix types are not supported by the - // device identified by "Dev". + // device identified by the architecture "Arch". 
static constexpr size_t M = /* implementation defined */; static constexpr size_t N = /* implementation defined */; @@ -590,7 +618,6 @@ struct matrix_params { template using joint_matrix_d = joint_matrix; - template }; } // namespace sycl::ext::oneapi::experimental::matrix @@ -600,18 +627,15 @@ struct matrix_params { // User can provide sizes besides the types and matrix_params can assert // if they are supported or not // in this case, an assertion will happens as 16 is not a supported size for M -using myparams = -matrix_params; +using myparams = matrix_params; size_t NDRangeM = M / myparams::M; //Assertion would happen at this line size_t NDRangeN = N / myparams::N; ``` ===== Default Values Example: ```c++ -using myparams = -matrix_params; +using myparams = matrix_params; // use this to construct the ranges on the host side size_t NDRangeM = M / myparams::M; size_t NDRangeN = N / myparams::N; @@ -623,10 +647,10 @@ myparams::joint_matrix_c sub_c; ``` ==== Runtime Query -This provides a more general query interface with information about -sizes and types that are supported by a specific matrix -implementation. This is needed to avoid padding by the user, for -tuning, and efficient code generation if used by a library. +The runtime query does not require the application to hard-code a +specific device type, but it also returns values that are not +`constexpr`. It provides similar information as the compile time query +API via an extended device information descriptor. The table below provides a description for each of the device matrix descriptors that can be queried using `get_info` API. @@ -639,9 +663,9 @@ descriptors that can be queried using `get_info` API. and types on this device |====================== -The general query returns a vector of `combinations` of `combination` +The runtime query returns a vector of `combinations` of `combination` type. Each combination includes the sizes and the types for the -matrices A, B, and accumulator. 
Note that for each matrix hardware, +matrices A, B, C, and D. Note that for each matrix hardware, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. If a device support a @@ -668,12 +692,12 @@ enum class matrix_type { uint64 }; struct combination { - uint32_t max_msize; - uint32_t max_nsize; - uint32_t max_ksize; - uint32_t msize; - uint32_t nsize; - uint32_t ksize; + size_t max_msize; + size_t max_nsize; + size_t max_ksize; + size_t msize; + size_t nsize; + size_t ksize; matrix_type atype; matrix_type btype; matrix_type ctype; @@ -684,9 +708,9 @@ struct combination { ``` Each combination of the `combinations` vector composes the types and -sizes of A, B, accumulator matrices supported by the device -implementation. The -table below provides a description of each member of the `combination` struct. +sizes of A, B, C, and D matrices supported by the device +implementation. The table below provides a description of each member +of the `combination` struct. [frame="none",options="header"] |====================== @@ -704,7 +728,7 @@ if the matrix hardware supports a continuous number of element sizes, each of these members has the value zero |`atype`, `btype`, `ctype`, `dtype`| indicates the types supported in the combination. these are of type `matrix_type` which tells the list -of types that are supported for the A, B, and accumulator matrices in +of types that are supported for the A, B, C, and D matrices in the `T` template parameter as follows: + `bf16`: `sycl::bfloat16` + `fp16`: `sycl::half` + @@ -746,12 +770,14 @@ XMX hardware. Note that these can be returned using `ext::oneapi::experimental::info::device::matrix::combinations`. ==== Intel AMX Supported Combinations -This is currently available in -`sycl::ext::oneapi::experimental::architecture::intel_cpu_spr`. 
+This is currently available in devices with the architecture +`architecture::intel_cpu_spr`. In this architecture's implementation, +the C and D matrices support the same set of types, so they are shown +in a single column. [frame="none",options="header"] |====================== -| A type | B type | Accumulator type (`ctype` = `dtype`) | M | N | K +| A type | B type | C and D type | M | N | K | `matrix_type::uint8` | `matrix_type::uint8` | `matrix_type::sint32` | +<=+ 16 | +<=+ 16 | +<=+ 64 | `matrix_type::uint8` | `matrix_type::int8` | @@ -765,13 +791,14 @@ This is currently available in |====================== ==== Intel XMX Supported Combinations -This is currently available in -`sycl::ext::oneapi::experimental::architecture::intel_gpu_pvc` and -`sycl::ext::oneapi::experimental::architecture::intel_gpu_dg2`. +This is currently available in devices with the architecture +`architecture::intel_gpu_pvc` and `architecture::intel_gpu_dg2`. In +these architectures' implementation, the C and D matrices support the +same set of types, so they are shown in a single column. 
[frame="none",options="header"] |====================== -| A type | B type | Accumulator type | M | N | K | device +| A type | B type | C and D type | M | N | K | device | `matrix_type::uint8` | `matrix_type::uint8` | `matrix_type::int32` | +<=+ 8 | 16 | 32 | architecture::intel_gpu_pvc | | | | |8||architecture::intel_gpu_dg2 From d0a81af1268170f5579d6a44f74224248fb5c86c Mon Sep 17 00:00:00 2001 From: Dounia Date: Tue, 23 May 2023 11:31:04 -0700 Subject: [PATCH 30/51] Incorporate Greg's small comments in intel-specific spec --- .../sycl_ext_intel_matrix.asciidoc | 11 +++++------ .../sycl_ext_oneapi_matrix.asciidoc | 15 ++++++++------- 2 files changed, 13 insertions(+), 13 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc index 187e5a7ce5a03..69a6f4459a87d 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -78,7 +78,7 @@ extension to gain additional performance and capabilities. This extension provides a feature-test macro as described in the core SYCL specification. An implementation supporting this extension must predefine the macro `SYCL_EXT_INTEL_MATRIX` to one of the values -defined in the table below.Applications can test for the existence of +defined in the table below. Applications can test for the existence of this macro to determine if the implementation supports this feature, or applications can test the macro's value to determine which of the extension's APIs the implementation supports. @@ -125,7 +125,7 @@ enum class layout { Consequently, the layout argument `layout` in `joint_matrix_load` can take `ext_intel_packed` as argument to specify that the data has -already been transformed into VNNI format. 
in this case, `stride` +already been transformed into VNNI format. In this case, the `stride` argument of `joint_matrix_load` describes the number of elements between consecutive rows for packed layouts. @@ -148,14 +148,14 @@ template void joint_matrix_store(Group g, - joint_matrix &res, + const joint_matrix &res, multi_ptr src, size_t stride); template void joint_matrix_store(Group g, - joint_matrix &res, + const joint_matrix &res, multi_ptr src, size_t stride); } // namespace sycl::ext::intel::experimental::matrix @@ -324,6 +324,5 @@ q.wait(); |====================== |Rev |Date |Author |Changes |1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API, -layout information, iterative-based element-wise operations, and -mapping +layout information, and per-element access with coordinates API |====================== diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc index 2a5851e5d3a24..4f623dd9fb983 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -270,8 +270,8 @@ rows for the row major layout, or between columns for the column major layout. namespace sycl::ext::oneapi::experimental::matrix { template + std::size_t M, std::size_t K, std::size_t N, layout LayoutA, layout + LayoutB, typename Td = Tc> joint_matrix joint_matrix_mad(Group g, const joint_matrix &A, @@ -763,7 +763,6 @@ for (int i = 0; sizeof(combinations); i++) { ``` === Appendix: Supported Combinations Per Hardware - The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned using @@ -772,8 +771,9 @@ XMX hardware. 
Note that these can be returned using ==== Intel AMX Supported Combinations This is currently available in devices with the architecture `architecture::intel_cpu_spr`. In this architecture's implementation, -the C and D matrices support the same set of types, so they are shown -in a single column. +the type of the C matrix must be the same as the type of the D +matrix. Therefore, that common type is shown in a single column in the +table below. [frame="none",options="header"] |====================== @@ -793,8 +793,9 @@ in a single column. ==== Intel XMX Supported Combinations This is currently available in devices with the architecture `architecture::intel_gpu_pvc` and `architecture::intel_gpu_dg2`. In -these architectures' implementation, the C and D matrices support the -same set of types, so they are shown in a single column. +these architectures' implementation, the type of the C matrix must be +the same as the type of the D matrix. Therefore, that common type is +shown in a single column in the table below. 
[frame="none",options="header"] |====================== From cd4158866b6ac2968b30f66168777ebcb40185f0 Mon Sep 17 00:00:00 2001 From: Dounia Date: Thu, 25 May 2023 14:00:31 -0700 Subject: [PATCH 31/51] Rename folder name, add primary definition of matrix_params --- .../sycl_ext_intel_matrix.asciidoc | 0 .../sycl_ext_oneapi_matrix.asciidoc | 6 +++++- 2 files changed, 5 insertions(+), 1 deletion(-) rename sycl/doc/extensions/experimental/{sycl_ext_oneapi_matrix => sycl_ext_matrix}/sycl_ext_intel_matrix.asciidoc (100%) rename sycl/doc/extensions/experimental/{sycl_ext_oneapi_matrix => sycl_ext_matrix}/sycl_ext_oneapi_matrix.asciidoc (99%) diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc similarity index 100% rename from sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc rename to sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc similarity index 99% rename from sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc rename to sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index 4f623dd9fb983..a2bfd770af8f5 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -569,6 +569,10 @@ using joint_matrix_d;`| type alias for `joint_matrix` for the output matrix accu ```c++ namespace sycl::ext::oneapi::experimental::matrix { +template +struct matrix_params; + // This is the validation form, when all template parameters are // specified. 
template +template struct matrix_params { // An implementation typically uses static_assert here to trigger a // compilation error when the matrix types are not supported by the From 0bf47c91ce1e3c0872d366d6fa5d16c9aa16eb58 Mon Sep 17 00:00:00 2001 From: Dounia Date: Thu, 25 May 2023 14:06:18 -0700 Subject: [PATCH 32/51] Add missing const to multi_ptr --- .../sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index a2bfd770af8f5..b54b9dc03cad1 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -390,7 +390,7 @@ template &res, - multi_ptr src, size_t stride, layout Layout); + multi_ptr src, size_t stride, layout Layout); // Only available when Layout != layout::dynamic template void joint_matrix_load(Group g, joint_matrix &res, - multi_ptr src, size_t stride); + multi_ptr src, size_t stride); } // namespace sycl::ext::oneapi::experimental::matrix ``` From 15306d63b0e47b2e556ff2bce641a8ee70246282 Mon Sep 17 00:00:00 2001 From: Dounia Date: Tue, 30 May 2023 09:34:38 -0700 Subject: [PATCH 33/51] - Add copy function; - Add clarification about copy constructor and assignment op; - correct validation specialization --- .../sycl_ext_oneapi_matrix.asciidoc | 44 ++++++++++++++----- 1 file changed, 32 insertions(+), 12 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index b54b9dc03cad1..4a12500f121b9 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ 
-118,7 +118,8 @@ struct joint_matrix {
 The constructors for the `joint_matrix` type and the assignment
 operator are group functions as defined in section 4.17.3 of the core
 SYCL specification. They must be encountered in converged control flow
-by all work-items in the `Group`.
+by all work-items in the `Group`. Note that the assignment operator
+and the copy constructor do not copy the entire matrix content.
 
 ==== Group Memory Scope
 Most operations on the joint_matrix are group functions, meaning that
@@ -206,21 +207,23 @@ that performs the group operation.
 
 ```c++
 namespace sycl::ext::oneapi::experimental::matrix {
 
-template >
+template
 void joint_matrix_load(Group g,
-    joint_matrix &res,
-    multi_ptr src, size_t stride, layout Layout);
+    joint_matrix &res,
+    multi_ptr src, size_t stride, layout Layout);
 
 // Only available when Layout != layout::dynamic
-template >
+template
 void joint_matrix_load(Group g,
-    joint_matrix &res,
-    multi_ptr src, size_t stride);
+    joint_matrix &res,
+    multi_ptr src, size_t stride);
 
 } // namespace sycl::ext::oneapi::experimental::matrix
 ```
@@ -307,6 +310,23 @@ void joint_matrix_fill(Group g, joint_matrix
+void joint_matrix_assign(Group g, joint_matrix &dest, joint_matrix &src);
+
+} // namespace sycl::ext::oneapi::experimental::matrix
+```
+This function copies `RowsxCols` elements of type `T` from joint
+matrix `src` to joint matrix `dest`. The two matrices must have the
+same scope, type, shape, and layout. The `use` can differ, so this
+function also converts between matrices of different `use`.
+ ==== Element-Wise Operations Besides matrix multiply and add, this extension aims to make it possible to perform element-wise operations on matrices in a SPMD @@ -390,7 +410,7 @@ template &res, - multi_ptr src, size_t stride, layout Layout); + multi_ptr src, size_t stride, layout Layout); // Only available when Layout != layout::dynamic template void joint_matrix_load(Group g, joint_matrix &res, - multi_ptr src, size_t stride); + multi_ptr src, size_t stride); } // namespace sycl::ext::oneapi::experimental::matrix ``` @@ -569,15 +589,15 @@ using joint_matrix_d;`| type alias for `joint_matrix` for the output matrix accu ```c++ namespace sycl::ext::oneapi::experimental::matrix { -template +template struct matrix_params; // This is the validation form, when all template parameters are // specified. template -struct matrix_params { +struct matrix_params { // An implementation typically uses static_assert here to trigger a // compilation error when the matrix types or shapes are not // supported by the device identified by the architecture "Arch". 
From bee344ed80788da896222cb0649e3a0634eeb0ca Mon Sep 17 00:00:00 2001
From: Dounia
Date: Wed, 31 May 2023 08:13:19 -0700
Subject: [PATCH 34/51] small typo correction

---
 .../sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
index 4a12500f121b9..ebfc3d599f251 100644
--- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
+++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
@@ -304,8 +304,8 @@ namespace sycl::ext::oneapi::experimental::matrix {
 template
-void joint_matrix_fill(Group g, joint_matrix &m, Tv v);
+void joint_matrix_fill(Group g, joint_matrix &m, Tv v);
 } // namespace sycl::ext::oneapi::experimental::matrix
 ```
@@ -316,13 +316,13 @@ namespace sycl::ext::oneapi::experimental::matrix {
 template
-void joint_matrix_assign(Group g, joint_matrix &dest, joint_matrix &src);
 } // namespace sycl::ext::oneapi::experimental::matrix
 ```
-This function copies `RowsxCols` elements of type `T` from joint
+This function copies `Rows x Cols` elements of type `T` from joint
 matrix `src` to joint matrix `dest`. The two matrices must have the
 same scope, type, shape, and layout. The `use` can differ, so this
 function also converts between matrices of different `use`.
From e5648e4b252f23ab9a5dfcc15a2bdf2484b600a6 Mon Sep 17 00:00:00 2001 From: Dounia Date: Wed, 7 Jun 2023 13:45:23 -0700 Subject: [PATCH 35/51] Remove default copy constructor and assign op --- .../sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index ebfc3d599f251..e054211b3aacc 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -109,17 +109,16 @@ template struct joint_matrix { joint_matrix(); - joint_matrix(const joint_matrix &); - joint_matrix &operator=(const joint_matrix &); + joint_matrix(const joint_matrix &) = delete; + joint_matrix &operator=(const joint_matrix &) = delete; }; } // namespace sycl::ext::oneapi::experimental::matrix ``` -The constructors for the `joint_matrix` type and the assignment -operator are group functions as defined in section 4.17.3 of the core -SYCL specification. They must be encountered in converged control flow -by all work-items in the `Group`. Note that the assignment operator -and the copy constructor do not copy the entire matrix content. +The constructor for the `joint_matrix` type is a group function as +defined in section 4.17.3 of the core SYCL specification. It must be +encountered in converged control flow by all work-items in the +`Group`. 
==== Group Memory Scope
 Most operations on the joint_matrix are group functions, meaning that
From e22d057a7f31be42d090757598440a16624102cf Mon Sep 17 00:00:00 2001
From: Dounia
Date: Thu, 8 Jun 2023 11:28:27 -0700
Subject: [PATCH 36/51] fixed merge conflicts without merging and add Jack's
 Nvidia combinations table

---
 .../sycl_ext_oneapi_matrix.asciidoc           |  64 ++
 .../sycl_ext_intel_matrix.asciidoc            | 155 +++++
 .../sycl_ext_oneapi_matrix.asciidoc           | 650 ++++++++++++++++++
 3 files changed, 869 insertions(+)
 create mode 100644 sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc
 create mode 100644 sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc

diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
index e054211b3aacc..b7ff8eb629473 100644
--- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
+++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc
@@ -843,6 +843,70 @@ shown in a single column in the table below.
 | | | | |8|| architecture::intel_gpu_dg2
 |======================
 
+==== Nvidia Tensor Cores Supported Combinations
+
+The complete set of matrix data types and shapes that are supported by
+the `ext_oneapi_cuda` backend is represented in the following
+table. Tm indicates the matrix element data type held by a
+"multiplicand" `joint_matrix`: i.e. requiring `use::a` or `use::b`. Tc
+indicates the matrix element data type held by an "accumulator"
+`joint_matrix`: i.e. requiring `use::accumulator`.
+
+IMPORTANT: When compiling for the `ext_oneapi_cuda` backend, the target
+arch backend flag, `-Xsycl-target-backend --cuda-gpu-arch=sm_xx`, must
+be used, where `sm_xx` must be a Compute Capability that is equal to
+or greater than the appropriate Minimum Compute Capability.
When an +executable has been compiled for `sm_xx`, if the executable is run on +a device with compute capability less than `sm_xx` then an error will +be thrown. The mapping to Minimum Compute Capability from each +supported parameter combination is specified in the following table. + +-- +[.center] +|====================== +|Tm (`use::a` or `use::b`) |Tc (`use::accumulator`) |M |N |K | Minimum Compute Capability +.3+|half .3+|float +|16 |16 |16 .6+| sm_70 +|8 |32 |16 +|32 |8 |16 +.3+|half .3+|half +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +.3+|int8_t .3+|int32_t +|16 |16 |16 .6+| sm_72 +|8 |32 |16 +|32 |8 |16 +.3+|uint8_t .3+|int32_t +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +|precision::tf32 |float |16 |16 |8 .5+| sm_80 +.3+|bfloat16 .3+|float +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +|double |double |8 |8 |4 +|====================== +-- + +The M, N, K triple from the above table defines the complete set of +matrix shapes constructible: +-- +[.center] +|====================== +|use |NumRows | NumCols +|a |M |K +|b |K |N +|accumulator | M| N +|====================== +-- + +IMPORTANT: The `stride` argument to `joint_matrix_load` and +`joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a +multiple of 4 when `T` is `float`; where `T` is the type of the +`joint_matrix` elements. When `T` is not `half` or `float` there are +no restrictions to `stride`. === Revision History diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc new file mode 100644 index 0000000000000..883c73c655217 --- /dev/null +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -0,0 +1,155 @@ +# Additional Intel-only specifics about matrix extension for DPC++ + +:source-highlighter: coderay +:coderay-linenums-mode: table +:dpcpp: pass:[DPC++] + +// This section needs to be after the document title. 
+:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en + +:blank: pass:[ +] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +Copyright (c) 2021-2022 Intel Corporation. All rights reserved. + +NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are +trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. +used by permission by Khronos. + +This extension is written against the SYCL 2020 revision 5 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + +**_NOTE:_** This document describes the extra features and details for the implementation of `joint_matrix` extension on Intel AMX and Intel XMX. + This is an initial experimental version to try out functionality +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. + +## Introduction +The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are additional specifics about the supported layouts that enable extra performance and functionality listed in this document. +This extension presents some supplementary Intel AMX and Intel XMX features not contained within the sycl_ext_oneapi_matrix extension. The additional features are built on top of the sycl_ext_oneapi_matrix extension but are only supported by the Intel AMX and Intel XMX backends. + +## Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification section 6.3.3 "Feature test macros". 
Therefore, an
+implementation supporting this extension must predefine the macro
+`SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below.
+Applications can test for the existence of this macro to determine if the
+implementation supports this feature, or applications can test the macro's
+value to determine which of the extension's APIs the implementation supports.
+
+[frame="none",options="header"]
+|======================
+|Value |Description
+|1 |Introduce `packed` layout and extend `joint_matrix_store` to Matrix A and B.
+|======================
+
+
+## Extra Functionality
+
+### Layout
+Besides the row major and column major layouts, `layout` introduces the custom `packed` layout, which refers to the VNNI format described in the following section.
+
+```c++
+namespace sycl::ext::intel::experimental::matrix {
+enum class layout {
+  packed
+};
+}
+```
+
+
+### Layout argument in `joint_matrix_load`
+`layout` in `joint_matrix_load` can take `packed` as an argument to specify that the data has already been transformed into VNNI format (`packed`). In this case, the `stride` argument of `joint_matrix_load` describes the number of elements between consecutive rows for packed layouts.
+
+In order to get maximum performance on Intel AMX and Intel XMX, prepacking data in memory is necessary. If users do not specify the packed layouts, the transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout so the user can specify that A or B has already been VNNI-transformed. The packed or VNNI layout is introduced in the `VNNI layout` section below.
+
+IMPORTANT: In the current Intel AMX and Intel XMX implementations, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX.
The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`.
+
+### Store Operation
+Besides stores of the `accumulator` matrix, the Intel implementation allows stores of matrices `a` and `b` as well.
+
+#### Store
+```c++
+namespace sycl::ext::intel::experimental::matrix {
+  template
+  void joint_matrix_store(Group sg,
+    joint_matrix &res,
+    multi_ptr src, size_t stride);
+}
+```
+
+
+## VNNI/Packed Layout
+Intel AMX and Intel XMX compute assumes that the B tile register (src1) is in the VNNI format, as they need 32 bits of K-data in A and B to be contiguous in memory.
+The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has already been packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type.
+
+#### Example 1: 16-bit elements
+    // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout.
+    // Element a1 is contiguous in memory with element b1, etc.
+    // ---------------------------------
+    // a1, b1, c1, d1
+    // a2, b2, c2, d2
+    // a3, b3, c3, d3
+    // a4, b4, c4, d4
+    // ---------------------------------
+    // The same matrix reformatted in packed layout.
+    // Here, packing of 2 elements is needed to form 32 bits.
+    // Element a1 is contiguous in memory with element a2, etc.
+    // ---------------------------------
+    // a1, a2, b1, b2, c1, c2, d1, d2
+    // a3, a4, b3, b4, c3, c4, d3, d4
+
+#### Example 2: 8-bit elements
+
+    // Example of a 4 row x 4 column matrix using an 8-bit data element, in row-major layout.
+    // Element a1 is contiguous in memory with element b1, etc.
+ // --------------------------------- + // a1, b1, c1, d1 + // a2, b2, c2, d2 + // a3, b3, c3, d3 + // a4, b4, c4, d4 + // --------------------------------- + // The same matrix reformatted in packed layout. + // Here, packing of 4 elements is needed to form 32 bits. + // Elements a1, a2, a3, a4 are contiguous in memory, etc. + // --------------------------------- + // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4 + +## Supported Combinations Per Hardware + +The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class. + +### Intel AMX Supported Combinations + +[frame="none",options="header"] +|====================== +| A type | B type | Accumulator type | M | N | K +| (u)int8_t | (u)int8_t | int32_t | +<=+ 16 | +<=+ 16 | +<=+ 64 +| bf16 | bf16 | fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 +|====================== + +### Intel XMX Supported Combinations + +[frame="none",options="header"] +|====================== +| A type | B type | Accumulator type | M | N | K +| (u)int8_t | (u)int8_t | int32_t | +<=+ 8 | 16 | 32 +| fp16 | fp16 | fp32 | +<=+ 8 | 16 | 16 +| bf16 | bf16 | fp32 | +<=+ 8 | 16 | 16 +|====================== + +## Open Questions +- Should the same class, `joint_matrix`, handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case)? Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes that can be variable. The ability to define only one interface for both would make it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes. But since this was not implemented at all, we decided to remove it. 
We can revisit this design choice if this comes up as part of a customer request or if SPIRV matrix extension extends its support to dynamic sizes. diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc new file mode 100644 index 0000000000000..cb430e7c794ef --- /dev/null +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -0,0 +1,650 @@ +# Matrix Programming Extension for DPC++: sycl_ext_oneapi_matrix +:source-highlighter: coderay +:coderay-linenums-mode: table +:dpcpp: pass:[DPC++] + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en + +:blank: pass:[ +] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +Copyright (c) 2021-2022 Intel Corporation. All rights reserved. + +NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are +trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. +used by permission by Khronos. + +This extension is written against the SYCL 2020 revision 5 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + + +**_NOTE:_** _This document describes the current design and API for the matrix +extension to {dpcpp}. This is an initial experimental version to try out functionality +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. 
The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) XMX) and Nvidia(R) Tensor Cores._
+
+## Introduction
+This document presents ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Intel XMX in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia Tensor Cores, IBM Power MMA. All of these hardware units provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance this different hardware can offer.
+
+## Feature test macro
+
+This extension provides a feature-test macro as described in the core SYCL
+specification section 6.3.3 "Feature test macros". Therefore, an
+implementation supporting this extension must predefine the macro
+`SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the table below.
+Applications can test for the existence of this macro to determine if the
+implementation supports this feature, or applications can test the macro's
+value to determine which of the extension's APIs the implementation supports.
+
+[frame="none",options="header"]
+|======================
+|Value |Description
+|1 |The APIs of this experimental extension are not versioned, so the feature-test macro always has this value.
+|======================
+
+## Matrix API Versions
+
+While this document presents the core API that unifies Intel AMX, Intel XMX, and Nvidia Tensor Cores, the implementations support slightly different versions of the API. For this reason, we introduce a new macro, namely `SYCL_EXT_ONEAPI_MATRIX_VERSION`, to distinguish between these different implementations. The goal in the next few months is to get rid of this implementation versioning macro. These are the current values for this macro.
+ +[frame="none",options="header"] +|====================== +|Value |Description +|1 |Initial extension JIT implementation on Intel AMX and Intel XMX. load, store, mad, fill, piece-wise operations, and the query interface are supported. The old API used for this implementation is detailed in link:../../deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc[matrix extension] +|2 |JIT implementation on Intel AMX and Intel XMX. load, store, mad, fill, piece-wise operations, and the query interface are supported +|3 |Implementation on Nvidia Tensor Cores +|====================== + +## New `joint_matrix` class +We introduce a new class called `joint_matrix`. The user needs to specify the group memory scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description: + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +template +struct joint_matrix { + joint_matrix() {} +}; +} +``` + +IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrix with `use::accumulator` + +#### Use +Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator +(C)+ is required by backend implementations to reason about the layout of the matrix in registers. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +enum class use { + a, + b, + accumulator +}; +} +``` + +#### Shape +The shape of a `joint_matrix` refers to its number of rows `Rows` and number of columns `Cols`. + +#### Layout +This specifies the memory layout and it can be row major or column major. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +enum class layout { + row_major, + col_major, + dynamic + }; +} +``` + + +#### Group Memory Scope +In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. 
The group scope is added as an additional template parameter and is also part of the constructor arguments. + +IMPORTANT: In the current implementation, only the `sub_group` scope is supported + +When the group is a `sycl::sub_group`, a matrix is declared as follows: + +```c++ +joint_matrix tA; +``` + + +## Matrix Operations and their Execution Scope +We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation. This set of functions can be easily extended if the matrix hardware implements new features. + +Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group. + +To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: + +IMPORTANT: In the current implementation, only the `sub_group` scope is supported. + +#### Load +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template + void joint_matrix_load(Group sg, + joint_matrix &res, + multi_ptr src, size_t stride, layout Layout); + + template + void joint_matrix_load(Group sg, + joint_matrix &res, + multi_ptr src, size_t stride); +} +``` + +`joint_matrix_load` loads data from memory to the 2d tiles/registers of the tensor hardware. +We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not. +The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that used the default value `layout::dynamic`. 
+The second overload, without a memory layout argument, must not be used with a `joint_matrix` type that used the default value `layout::dynamic`. + +The base pointer `src` determines the starting address of the matrix to be loaded from. `Layout` determines whether the data is read in a row major (`row_major`) or column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between consecutive columns for the column major layout. + + +#### Store +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template <typename Group, typename T, size_t Rows, size_t Cols, +           access::address_space Space> + void joint_matrix_store(Group sg, + joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res, + multi_ptr<T, Space> dest, size_t stride, layout Layout); +} +``` +This function stores the data in the accumulator matrix from the 2d tiles back to memory. + +The base pointer `dest` determines the starting address of the matrix to be stored. `Layout` determines whether the data is written in a row major (`row_major`) or column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between consecutive columns for the column major layout. + + +#### Multiply and Add + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template <typename Group, typename Ta, typename Tb, typename Tc, +           size_t M, size_t K, size_t N, layout LayoutA, layout LayoutB> + joint_matrix<Group, Tc, use::accumulator, M, N> joint_matrix_mad(Group sg, + joint_matrix<Group, Ta, use::a, M, K, LayoutA> A, + joint_matrix<Group, Tb, use::b, K, N, LayoutB> B, + joint_matrix<Group, Tc, use::accumulator, M, N> C); +} +``` +The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C`, and returns the result. + + +#### Matrix Initialization: `joint_matrix_fill` +The interface presented above assumes that all the matrices are directly loaded from memory. The `joint_matrix_fill` function makes it possible to multiply a matrix that is not loaded from memory but rather initialized directly in the register.
On Intel AMX, if the initialization constant is zero, this would map to the `_tile_zero` intrinsic: + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template <typename Group, typename T, use Use, size_t Rows, size_t Cols, +           layout Layout, typename Tv> + void joint_matrix_fill(Group sg, joint_matrix<Group, T, Use, Rows, Cols, Layout> &m, Tv v); +} +``` +IMPORTANT: In the current implementation, only the `sub_group` scope is supported. + +#### Element Indexing and Piece-Wise Operations +##### Background +Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in an SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into: + +Class 1- Element-wise operations where the same operation is performed on every element of the matrix, such that the operation can be performed without knowledge of the position of the element within the matrix. Activation functions and adding a constant value to every element of the matrix are two examples. + +Class 2- Piece-wise operations where the operation depends on the element index of the matrix or the operation takes multiple elements as operands (such as the sum of all elements in a row, for example). Quantization, which is needed for conversion between low precision types like `int8_t` and `fp32`, uses piece-wise operations. + +// We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result in slow indexing on the GPU. 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc.) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today, like the ones involved in quantization, but it is not general to any problem that may occur in the future.
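The two classes can be made concrete with a small backend-independent sketch operating on a plain buffer (illustrative only, not the matrix API): the class 1 operation below never consults an element's position, while the class 2 operation needs the row coordinate of every element.

```cpp
#include <cstddef>

// Class 1: element-wise op (here, ReLU). The position of an element is
// irrelevant, so each work-item can simply loop over whatever slice of
// the matrix it happens to own.
void relu(float *data, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    if (data[i] < 0.0f) data[i] = 0.0f;
}

// Class 2: the result depends on element coordinates (the row index),
// so whoever owns an element must know where it sits in the matrix --
// exactly the mapping that is implementation defined.
void sum_of_rows(const float *data, std::size_t rows, std::size_t cols,
                 float *row_sums) {
  for (std::size_t r = 0; r < rows; ++r) {
    row_sums[r] = 0.0f;
    for (std::size_t c = 0; c < cols; ++c)
      row_sums[r] += data[r * cols + c];
  }
}
```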
+ +##### Explicit conversion with mapping from SIMD to SPMD +The data elements in a `joint_matrix` are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the work items in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore, the partitioning among the WIs is implementation defined. However, it is necessary to allocate WIs to specific elements of the matrix in order to perform element-wise operations. In order to be able to perform element-wise operations in a general and efficient way, we provide a conversion function from the `joint_matrix` domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. + +We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI. The indexing provided inside the `wi_data` class accesses only the portion of the current WI and returns `wi_element`. The latter holds a reference to the original `joint_matrix` that `wi_data` was constructed from. This means that modifying `wi_data` also modifies the corresponding joint matrix elements. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation. + +Using `get_wi_data`, it is not possible to know which portion of the data is owned by each WI in the group, as this is implementation defined and changes from one backend to the other.
For general piece-wise operations such as summing the rows of a matrix, the WI data to joint matrix mapping coordinates information must be known in order to reason about the matrix view and extract the relevant piece. However, for element-wise operations, where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees that the operation is applied to the whole matrix. + +Therefore, this extension currently only supports class 1 operations, because the mapping between `get_wi_data` and `joint_matrix` elements is not required to be known for these operations. However, general piece-wise operations will be supported in the future, as a new API will be provided to convey the mapping from the `joint_matrix` domain to the WI domain (see the section "WI data to joint matrix mapping coordinates information for piece-wise operations" for more information). + +Also, note that `get_wi_data` cannot return a fixed-size array, because the length of the WI portion is a runtime variable for the following reasons: + +1- The main compilation mode of SYCL is JIT compilation, and the partitioning among WIs is implementation defined. + +2- The sub-group size is not generally fixed. + +The code listing below shows a synopsis of these new APIs. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +template <typename Group, typename T, use Use, size_t Rows, size_t Cols, +          layout Layout> +wi_data<Group, T, Use, Rows, Cols, Layout> get_wi_data(Group sg, +  joint_matrix<Group, T, Use, Rows, Cols, Layout> Mat); + +template <typename Group, typename T, use Use, size_t Rows, size_t Cols, +          layout Layout> +class wi_data { + size_t length(); + wi_element<Group, T, Use, Rows, Cols, Layout> operator[](size_t i); +}; +template <typename Group, typename T, use Use, size_t Rows, size_t Cols, +          layout Layout> +class wi_element { + operator T(); + wi_element &operator=(const T &rhs); +… +}; +} +``` + +In the following example `wi_data_c` is a reference to the WI owned portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`. +Vectorization along the sub-group dimension will get enabled automatically to vectorize the contiguous portion of the matrix.
+ + +```c++ +auto wi_data_c = get_wi_data(sg, matC); +for (int i = 0; i < wi_data_c.length(); i++) + wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C +``` + +IMPORTANT: In the current implementation, only the `sub_group` scope is supported. + +IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet. + +## Example using int8_t type +```c++ +using namespace sycl::ext::oneapi::experimental::matrix; + +queue q; +range<2> G = {M/tM, N}; +range<2> L = {1, SG_SIZE}; +int8_t *memA = malloc_shared<int8_t>(M*K, q); +int8_t *memB = malloc_shared<int8_t>(K*N, q); +int32_t *memC = malloc_shared<int32_t>(M*N, q); +q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) + [[sycl::reqd_sub_group_size(SG_SIZE)]] { + const auto global_idx = item.get_global_id(0); + const auto global_idy = item.get_global_id(1); + const auto sg_startx = global_idx - item.get_local_id(0); + const auto sg_starty = global_idy - item.get_local_id(1); + sub_group sg = item.get_sub_group(); + joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA; + joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB; + joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC; + joint_matrix_fill(sg, tC, 0); + for (int k = 0; k < K; k += tK) { + joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K); + joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N); + tC = joint_matrix_mad(sg, tA, tB, tC); + } + auto wi_data_c = get_wi_data(sg, tC); + for (int i = 0; i < wi_data_c.length(); i++) + wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C + joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); +}).wait(); +``` + +== Query Interface +Intel AMX, Intel XMX, and Nvidia TPUs support different sizes and types. +The query interface is used to validate user code and inform users about the types, sizes, scopes, and layouts supported by the implementation. +It also improves development and tuning productivity for both scientists and library developers.
The query interface we are proposing here is a compile-time query, +so there will be no runtime errors. +The query interface proposed here consists of three functionalities: + +- Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters. + +- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/accumulator` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K. + +- General query: the general query interface provides information about sizes, types, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and accumulator. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The Intel XMX implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. + +The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which they are defined. 
+ +[frame="none",options="header"] +|====================== +| Member/type alias in `tpu_params` | Forms they are defined in |Description +|`type_a`| validation, default values|type alias for the type of matrix A +|`type_b`| validation, default values|type alias for the type of matrix B +|`type_accumulator`| validation, default values|type alias for the type of matrix accumulator +|`M`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation +|`N`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation +|`K`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation +|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A +|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B +|`joint_matrix_accumulator`| validation, default values| type alias for `joint_matrix` for matrix accumulator +|`numtiles`| validation, default values, general query|indicates the number of tiles in Intel AMX (does not apply to Intel XMX) +|`scopes`| validation, default values, general query| indicates the memory and execution scopes supported by the TPU implementation +|`combination` | validation, default values, general query|composes the types and sizes of A, B, accumulator matrices allowed in one combination +|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query| if the TPU implementation supports a continuous number of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete number of element sizes, each of these members has the value zero +|`msize`, `nsize`, `ksize`| validation, default values, general query| if the TPU implementation supports a discrete number of element sizes, each of these members is non-zero, and the value tells one of the supported element sizes. By contrast, if the TPU supports a continuous number of element sizes, each of these members has the value zero +|`atype`, `btype`, `accumulatortype`| validation, default values, general query| indicates the types supported in the combination +|`combinations` | validation, default values, general query| tells the set of supported matrix sizes and types according to the template parameters that are provided.
In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes, so only this specific combination is returned in the combinations array. +|`num_combinations`| validation, default values, general query|indicates the number of combinations supported by the TPU implementation, which corresponds to the size of the `combinations` array +|====================== + + + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +template <tpu u, typename Ta = void, typename Tb = void, typename Tc = void, +          int sM = 0, int sN = 0, int sK = 0, typename Enabled = void> +struct tpu_params; + +// Validation form: Valid or not +// Specialization when both types and sizes are given +template <typename Ta, typename Tb, typename Tc, int sM, int sN, int sK> +struct tpu_params< + tpu::amx, Ta, Tb, Tc, sM, sN, sK, + typename std::enable_if<( + !std::is_same_v<Ta, void> && !std::is_same_v<Tb, void> && + !std::is_same_v<Tc, void> && sM != 0 && sN != 0 && sK != 0)>::type> { + // Validate that parameters are supported + static_assert( + (sM == 0 && sN == 0 && sK == 0) || + (is_combination_valid_amx<Ta, Tb, Tc>(sM, sN, sK)), + "Invalid parameters for Intel AMX, query valid types and maximum sizes " + "using: " + "tpu_params<tpu::amx, Ta, Tb, Tc> myparams; and then check out " + "myparams.combinations array"); + + + using type_a = Ta; // this type alias is not available in the current implementation + using type_b = Tb; // this type alias is not available in the current implementation + using type_accumulator = Tc; // this type alias is not available in the current implementation + + // if the combination is valid, construct the matrices + + static constexpr std::size_t M = (sM != 0) ? sM : 16; + static constexpr std::size_t N = (sN != 0) ? sN : 16; + static constexpr std::size_t K = + (sK != 0) ? sK : ((sizeof(Ta) == 1) ?
64 : 32); + + template <typename Group> + using joint_matrix_a = joint_matrix<Group, Ta, use::a, M, K>; + template <typename Group> + using joint_matrix_b = joint_matrix<Group, Tb, use::b, K, N>; + template <typename Group> + using joint_matrix_accumulator = joint_matrix<Group, Tc, use::accumulator, M, N>; + + static constexpr uint32_t numtiles = 8; + static constexpr scope_t scopes[] = {scope_t::sub_group}; + static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); + struct combination { + uint32_t max_msize; + uint32_t max_nsize; + uint32_t max_ksize; + uint32_t msize; + uint32_t nsize; + uint32_t ksize; + matrix_type atype; + matrix_type btype; + matrix_type accumulatortype; + }; + // In this case, the combinations array contains only the combination that the user provided + static constexpr combination combinations[] = { + {16, 16, (sizeof(Ta) == 1) ? 64 : 32, sM, sN, sK}}; + static constexpr int num_combinations = + sizeof(combinations) / sizeof(combination); +}; + +// Default values form: Sizes-only query +// Specialization for when only types are given, need to query only sizes +template <typename Ta, typename Tb, typename Tc> +struct tpu_params<tpu::amx, Ta, Tb, Tc, 0, 0, 0, + typename std::enable_if<(!std::is_same_v<Ta, void> && + !std::is_same_v<Tb, void> && + !std::is_same_v<Tc, void>)>::type> { + static_assert((are_types_valid_amx<Ta, Tb, Tc>()), + "Invalid types for Intel AMX, supported types are int8_t, uint8_t, " + "and bf16 (Note that unsigned short should be used in the " + "DPC++ code to implement bf16)"); + + using type_a = Ta; // this type alias is not available in the current implementation + using type_b = Tb; // this type alias is not available in the current implementation + using type_accumulator = Tc; // this type alias is not available in the current implementation + + // construct the matrices using the default sizes + static constexpr std::size_t M = 16; + static constexpr std::size_t N = 16; + static constexpr std::size_t K = ((sizeof(Ta) == 1) ?
64 : 32); + + template <typename Group> + using joint_matrix_a = joint_matrix<Group, Ta, use::a, M, K>; + template <typename Group> + using joint_matrix_b = joint_matrix<Group, Tb, use::b, K, N>; + template <typename Group> + using joint_matrix_accumulator = joint_matrix<Group, Tc, use::accumulator, M, N>; + + static constexpr uint32_t numtiles = 8; + static constexpr scope_t scopes[] = {scope_t::sub_group}; + static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); + struct combination { + uint32_t max_msize; + uint32_t max_nsize; + uint32_t max_ksize; + uint32_t msize; + uint32_t nsize; + uint32_t ksize; + matrix_type atype; + matrix_type btype; + matrix_type accumulatortype; + }; + // In this case, the combinations array contains only the combinations that correspond to the Ta, Tb, and Tc + // types that the user provided + static constexpr combination combinations[] = { + {16, 16, (sizeof(Ta) == 1) ? 64 : 32}}; + static constexpr int num_combinations = + sizeof(combinations) / sizeof(combination); +}; + +// General query form: +// types are not given, no default sizes and no implicit matrix construction +template <> +struct tpu_params<tpu::amx> { + static constexpr uint32_t numtiles = 8; + static constexpr scope_t scopes[] = {scope_t::sub_group}; + static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); + struct combination { + uint32_t max_msize; + uint32_t max_nsize; + uint32_t max_ksize; + uint32_t msize; + uint32_t nsize; + uint32_t ksize; + matrix_type atype; + matrix_type btype; + matrix_type accumulatortype; + }; + + static constexpr combination combinations[] = { + {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32}, + {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32}, + {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32}, + {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32}, + {16, 16, 32, 0, 0, 0, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}}; + static constexpr int num_combinations = + sizeof(combinations) / sizeof(combination); +}; + +
+enum class tpu { + xmx8, + xmx16, + amx +}; + +enum class matrix_type { + bf16, + fp16, + tf32, + fp32, + fp64, + sint2, + sint4, + sint8, + sint16, + sint32, + sint64, + uint2, + uint4, + uint8, + uint16, + uint32, + uint64 +}; + +enum class scope_t { + sub_group, + work_group +}; +} +``` + + +=== Validation Example: +```c++ +// The user can provide sizes besides the types, and tpu_params can assert whether they are supported or not +// in this case, an assertion will happen as 16 is not a supported size for M +using myparams = tpu_params; +size_t NDRangeM = M / myparams::M; // the assertion would happen at this line +size_t NDRangeN = N / myparams::N; +``` + +=== Default Values Example: +```c++ +using myparams = tpu_params_both; +// use this to construct the ranges on the host side +size_t NDRangeM = M / myparams::M; +size_t NDRangeN = N / myparams::N; +// if M, N, K are not multiples of the default sizes, padding has to be done +// device code: the matrices are constructed using the default dimensions +myparams::joint_matrix_a<sub_group> sub_a; +myparams::joint_matrix_b<sub_group> sub_b; +myparams::joint_matrix_accumulator<sub_group> sub_c; + +``` + +=== General Query Example: +```c++ +constexpr int M = 1500; // with msize = 8 and msize = 4, + // M can be broken into 125 sequences of 8-sized ops, with the remaining 500 handled by 125 sequences of 4-sized ops +tpu_params params; +constexpr int msize = break_dimension(params, M); +constexpr int msize_remainder = break_dimension_remainder(params, M); +constexpr int nsize = params.combinations[0].nsize; +constexpr int ksize = params.combinations[0].ksize; +// device code: +joint_matrix sub_a; +joint_matrix sub_b; +joint_matrix sub_c; +// Remainder handling +``` + +## Future-looking API + +### Memory scope +The current experimental API uses `joint_` semantics to define the memory scope of the matrix.
The long-term solution is to use the proposed link:../supported/sycl_ext_oneapi_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group, as shown in the example below. + + +```c++ +multi_ptr<joint_matrix<sub_group, int8_t, use::a, tM, tK>, address_space::local_space> tA_ptr = group_local_memory<joint_matrix<sub_group, int8_t, use::a, tM, tK>>(sg); +``` +We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet. + +### WI data to joint matrix mapping coordinates information for piece-wise operations +The indexing provided inside the `wi_data` class accesses only the portion of the matrix held by the current WI. It is not possible to know the location of this portion in the original matrix. This coordinates mapping is implementation defined and changes from one backend to the other. For general piece-wise operations like the sum of rows of a matrix, the WI data to joint matrix mapping information is needed to reason about the matrix view. +Within the joint matrix extension, we want to write, as much as possible, one code to run on different backends. If backend X states that a WI owns one exact row of the matrix, for instance, writing the following code will work only on that backend for that version of the hardware. If different hardware or a different implementation is used, the same WI may own only half of the row if, for example, the SG size increased. + +```c++ +auto data = get_wi_data(sg, C); +for (int i = 0; i < data.length(); ++i) { + sum_of_local_rows[row] += data[i]; // "row" assumes backend X's mapping: each WI owns exactly one row +} +``` + +We want to keep backward compatibility in the joint matrix code when implementations or hardware change. To that end, instead of hard-coding this mapping, we use general backend and target-agnostic functionality, especially in the JIT compilation mode of SYCL.
For this reason we would like to be able to query this mapping so that code does not have to change from one version to the other. + +So for the mapping problem, since this mapping is implementation defined, one of the proposals is to add runtime functions like: +```c++ +auto data = get_wi_data(sg, C); +for (int i = 0; i < data.length(); ++i) { + auto [row, col] = data[i].get_coord(); + sum_of_local_rows[row] += data[i]; +} +``` + +=== Appendix: Supported Parameter Combinations Per Hardware + +The tables below provide a list of the parameter combinations that +`joint_matrix` implementations support on each supported vendor's hardware type. + +==== Nvidia Tensor Cores Supported Combinations + +The complete set of matrix data types and shapes that are supported by the `ext_oneapi_cuda` backend are represented in the following table. Tm indicates the matrix element data type held by a "multiplicand" `joint_matrix`: i.e. requiring `use::a` or `use::b`. Tc indicates the matrix element data type held by an "accumulator" `joint_matrix`: i.e. requiring `use::accumulator`. + +IMPORTANT: When compiling for the `ext_oneapi_cuda` backend the target arch backend flag, `-Xsycl-target-backend --cuda-gpu-arch=sm_xx`, must be used, where `sm_xx` must be a Compute Capability that is equal to or greater than the appropriate Minimum Compute Capability. When an executable has been compiled for `sm_xx`, if the executable is run on a device with a compute capability less than `sm_xx` then an error will be thrown. The mapping to Minimum Compute Capability from each supported parameter combination is specified in the following table.
+ +-- +[.center] +|====================== +|Tm (`use::a` or `use::b`) |Tc (`use::accumulator`) |M |N |K | Minimum Compute Capability +.3+|half .3+|float +|16 |16 |16 .6+| sm_70 +|8 |32 |16 +|32 |8 |16 +.3+|half .3+|half +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +.3+|int8_t .3+|int32_t +|16 |16 |16 .6+| sm_72 +|8 |32 |16 +|32 |8 |16 +.3+|uint8_t .3+|int32_t +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +|precision::tf32 |float |16 |16 |8 .5+| sm_80 +.3+|bfloat16 .3+|float +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +|double |double |8 |8 |4 +|====================== +-- + +The M, N, K triple from the above table defines the complete set of matrix shapes constructible: +-- +[.center] +|====================== +|use |NumRows | NumCols +|a |M |K +|b |K |N +|accumulator | M| N +|====================== +-- + +IMPORTANT: The `stride` argument to `joint_matrix_load` and `joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a multiple of 4 when `T` is `float`; where `T` is the type of the `joint_matrix` elements. When `T` is not `half` or `float` there are no restrictions to `stride`. + +## TODO List +- Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. +- Add a more realistic and complete example that shows the value of the general query. + + +## Revision History + +[frame="none",options="header"] +|====================== +|Rev |Date |Author |Changes +|1 |2021-04-13 |Dounia Khaldi |Initial public working draft. 
+|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS +|3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support +|4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the new matrix use parameter and remove reference to the AOT AMX initial implementation +|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia tensor Cores, and move the Intel-specifics to a separate extension document. +|====================== From 0b4eecc83451825bc9bfdf131a159e876029a1e7 Mon Sep 17 00:00:00 2001 From: Dounia Date: Thu, 8 Jun 2023 11:32:23 -0700 Subject: [PATCH 37/51] Remove the oneapi matrix folder that is replaced here by matrix folder. It resulted from a merge conflict --- .../sycl_ext_intel_matrix.asciidoc | 155 ----- .../sycl_ext_oneapi_matrix.asciidoc | 650 ------------------ 2 files changed, 805 deletions(-) delete mode 100644 sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc delete mode 100644 sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc deleted file mode 100644 index 883c73c655217..0000000000000 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ /dev/null @@ -1,155 +0,0 @@ -# Additional Intel-only specifics about matrix extension for DPC++ - -:source-highlighter: coderay -:coderay-linenums-mode: table -:dpcpp: pass:[DPC++] - -// This section needs to be after the document title. -:doctype: book -:toc2: -:toc: left -:encoding: utf-8 -:lang: en - -:blank: pass:[ +] - -// Set the default source code type in this document to C++, -// for syntax highlighting purposes. This is needed because -// docbook uses c++ and html5 uses cpp. 
-:language: {basebackend@docbook:c++:cpp} - - -== Notice - -Copyright (c) 2021-2022 Intel Corporation. All rights reserved. - -NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are -trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. -used by permission by Khronos. - -This extension is written against the SYCL 2020 revision 5 specification. All -references below to the "core SYCL specification" or to section numbers in the -SYCL specification refer to that revision. - -**_NOTE:_** This document describes the extra features and details for the implementation of `joint_matrix` extension on Intel AMX and Intel XMX. - This is an initial experimental version to try out functionality -and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. - -## Introduction -The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are additional specifics about the supported layouts that enable extra performance and functionality listed in this document. -This extension presents some supplementary Intel AMX and Intel XMX features not contained within the sycl_ext_oneapi_matrix extension. The additional features are built on top of the sycl_ext_oneapi_matrix extension but are only supported by the Intel AMX and Intel XMX backends. - -## Feature test macro - -This extension provides a feature-test macro as described in the core SYCL -specification section 6.3.3 "Feature test macros". Therefore, an -implementation supporting this extension must predefine the macro -`SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below. 
-Applications can test for the existence of this macro to determine if the -implementation supports this feature, or applications can test the macro's -value to determine which of the extension's APIs the implementation supports. - -[frame="none",options="header"] -|====================== -|Value |Description -|1 |Introduce `packed` layout and extend `joint_matrix_store` to Matrix A and B. -|====================== - - -## Extra Functionality - -### Layout -Besides the row major and column major layouts, `layout` introduces a custom `packed` layout that refers to the VNNI format described in the following section. - -```c++ -namespace sycl::ext::intel::experimental::matrix { -enum class layout { - packed -}; -} -``` - - -### Layout argument in `joint_matrix_load` -`layout` in `joint_matrix_load` can take `packed` as an argument to specify that the data has already been transformed into the VNNI format (`packed`). In this case, the `stride` argument of `joint_matrix_load` describes the number of elements between consecutive rows for the packed layout. - -In order to get maximum performance on Intel AMX and Intel XMX, prepacking the data in memory is necessary. If the user does not specify the `packed` layout, the transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout so the user can specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below. - -IMPORTANT: In the current Intel AMX and Intel XMX implementations, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`.
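For illustration, the prepacking transform that produces the `packed` (VNNI) layout from a row-major matrix can be sketched in plain, self-contained C++. `vnni_pack` is a hypothetical helper, not part of the extension, and a blocking factor of 2 (16-bit elements) is assumed:

```cpp
#include <cstddef>
#include <cstdint>

// Repack a row-major (rows x cols) matrix of 16-bit elements into the
// packed/VNNI layout with blocking factor 2: element (r, c) moves into
// block row r/2, where the vertical neighbors (r, c) and (r+1, c)
// become contiguous in memory. rows is assumed to be even.
void vnni_pack(const std::uint16_t *src, std::uint16_t *dst,
               std::size_t rows, std::size_t cols) {
  const std::size_t factor = 2; // would be 4 for 8-bit element types
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      dst[(r / factor) * cols * factor + c * factor + (r % factor)] =
          src[r * cols + c];
}
```

Applied to the 4x4 example in the `VNNI layout` section below, this places a1, a2, b1, b2, ... contiguously, matching the layout the hardware expects for the B matrix.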
- -### Store Operation -Besides the store of the `accumulator` matrix, the Intel implementation allows stores of matrices `a` and `b` as well. - -#### Store -```c++ -namespace sycl::ext::intel::experimental::matrix { - template - void joint_matrix_store(Group sg, - joint_matrix &res, - multi_ptr src, size_t stride); -} -``` - - -## VNNI/Packed Layout -Intel AMX and Intel XMX compute assumes that the B tile register (src1) is in the VNNI format, as it needs 32 bits of K-data in A and B to be contiguous in memory. -The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has already been packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type. - -#### Example 1: 16-bit elements - // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout. - // Element a1 is contiguous in memory with element b1, etc. - // --------------------------------- - // a1, b1, c1, d1 - // a2, b2, c2, d2 - // a3, b3, c3, d3 - // a4, b4, c4, d4 - // --------------------------------- - // The same matrix reformatted in packed layout. - // Here, packing of 2 elements is needed to form 32 bits. - // Element a1 is contiguous in memory with element a2, etc. - // --------------------------------- - // a1, a2, b1, b2, c1, c2, d1, d2 - // a3, a4, b3, b4, c3, c4, d3, d4 - -#### Example 2: 8-bit elements - - // Example of a 4 row x 4 column matrix using an 8-bit data element, in row-major layout. - // Element a1 is contiguous in memory with element b1, etc. - // --------------------------------- - // a1, b1, c1, d1 - // a2, b2, c2, d2 - // a3, b3, c3, d3 - // a4, b4, c4, d4 - // --------------------------------- - // The same matrix reformatted in packed layout. 
- // Here, packing of 4 elements is needed to form 32 bits. - // Elements a1, a2, a3, a4 are contiguous in memory, etc. - // --------------------------------- - // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4 - -## Supported Combinations Per Hardware - -The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class. - -### Intel AMX Supported Combinations - -[frame="none",options="header"] -|====================== -| A type | B type | Accumulator type | M | N | K -| (u)int8_t | (u)int8_t | int32_t | +<=+ 16 | +<=+ 16 | +<=+ 64 -| bf16 | bf16 | fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 -|====================== - -### Intel XMX Supported Combinations - -[frame="none",options="header"] -|====================== -| A type | B type | Accumulator type | M | N | K -| (u)int8_t | (u)int8_t | int32_t | +<=+ 8 | 16 | 32 -| fp16 | fp16 | fp32 | +<=+ 8 | 16 | 16 -| bf16 | bf16 | fp32 | +<=+ 8 | 16 | 16 -|====================== - -## Open Questions -- Should the same class, `joint_matrix`, handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case)? Note that an Intel AMX 2d tile register permits variable sizes up to 1024 bytes (16 rows x 64 cols). Defining a single interface for both would let the user exploit the flexibility offered by the CPU while still saving resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes. But since this was not implemented at all, we decided to remove it. We can revisit this design choice if this comes up as part of a customer request or if the SPIR-V matrix extension extends its support to dynamic sizes. 
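The repacking illustrated in the two examples above can be sketched in plain, hardware-agnostic C++. This helper, `pack_vnni`, is illustrative only and not part of the extension; in practice the data is expected to be pre-packed by the user or a library before the kernel runs:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative helper (not part of the extension): repack a row-major
// matrix into the VNNI ("packed") layout. vnni_factor is 2 for 16-bit
// element types and 4 for 8-bit types, so that 32 bits of K-data from
// consecutive rows become contiguous in memory.
template <typename T>
std::vector<T> pack_vnni(const std::vector<T> &src, std::size_t rows,
                         std::size_t cols, std::size_t vnni_factor) {
  assert(rows % vnni_factor == 0);
  std::vector<T> dst(src.size());
  std::size_t out = 0;
  for (std::size_t r = 0; r < rows; r += vnni_factor) // one packed row per group
    for (std::size_t c = 0; c < cols; ++c)            // walk the columns
      for (std::size_t v = 0; v < vnni_factor; ++v)   // interleave the row group
        dst[out++] = src[(r + v) * cols + c];
  return dst;
}
```

For the 4 row x 4 column, 16-bit matrix of Example 1 (`vnni_factor = 2`), this produces exactly the `a1, a2, b1, b2, ...` ordering shown above.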
diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc deleted file mode 100644 index cb430e7c794ef..0000000000000 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ /dev/null @@ -1,650 +0,0 @@ -# Matrix Programming Extension for DPC++: sycl_ext_oneapi_matrix -:source-highlighter: coderay -:coderay-linenums-mode: table -:dpcpp: pass:[DPC++] - -// This section needs to be after the document title. -:doctype: book -:toc2: -:toc: left -:encoding: utf-8 -:lang: en - -:blank: pass:[ +] - -// Set the default source code type in this document to C++, -// for syntax highlighting purposes. This is needed because -// docbook uses c++ and html5 uses cpp. -:language: {basebackend@docbook:c++:cpp} - - -== Notice - -Copyright (c) 2021-2022 Intel Corporation. All rights reserved. - -NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are -trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. -used by permission by Khronos. - -This extension is written against the SYCL 2020 revision 5 specification. All -references below to the "core SYCL specification" or to section numbers in the -SYCL specification refer to that revision. - - -**_NOTE:_** _This document describes the current design and API for the matrix -extension to {dpcpp}. This is an initial experimental version to try out functionality -and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) XMX) and Nvidia(R) Tensor Cores._ - -## Introduction -This document presents an ongoing work towards defining a unified matrix interface. 
This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Intel XMX in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer. - -## Feature test macro - -This extension provides a feature-test macro as described in the core SYCL -specification section 6.3.3 "Feature test macros". Therefore, an -implementation supporting this extension must predefine the macro -`SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the table below. -Applications can test for the existence of this macro to determine if the -implementation supports this feature, or applications can test the macro's -value to determine which of the extension's APIs the implementation supports. - -[frame="none",options="header"] -|====================== -|Value |Description -|1 |The APIs of this experimental extension are not versioned, so the feature-test macro always has this value. -|====================== - -## Matrix API Versions - -While this document presents the core API that unifies Intel AMX, Intel XMX, and Nvidia Tensor Cores, the implementations support slightly different versions of the API. For this reason, we introduce a new macro, namely `SYCL_EXT_ONEAPI_MATRIX_VERSION` to distinguish between these different implementations. The goal in the next few months is to get rid of this implementation versioning macro. These are the current values for this macro. - -[frame="none",options="header"] -|====================== -|Value |Description -|1 |Initial extension JIT implementation on Intel AMX and Intel XMX. load, store, mad, fill, piece-wise operations, and the query interface are supported. 
The old API used for this implementation is detailed in link:../../deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc[matrix extension] -|2 |JIT implementation on Intel AMX and Intel XMX. load, store, mad, fill, piece-wise operations, and the query interface are supported -|3 |Implementation on Nvidia Tensor Cores -|====================== - -## New `joint_matrix` class -We introduce a new class called `joint_matrix`. The user needs to specify the group memory scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description: - -```c++ -namespace sycl::ext::oneapi::experimental::matrix { -template -struct joint_matrix { - joint_matrix() {} -}; -} -``` - -IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrix with `use::accumulator` - -#### Use -Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator +(C)+ is required by backend implementations to reason about the layout of the matrix in registers. - -```c++ -namespace sycl::ext::oneapi::experimental::matrix { -enum class use { - a, - b, - accumulator -}; -} -``` - -#### Shape -The shape of a `joint_matrix` refers to its number of rows `Rows` and number of columns `Cols`. - -#### Layout -This specifies the memory layout and it can be row major or column major. - -```c++ -namespace sycl::ext::oneapi::experimental::matrix { -enum class layout { - row_major, - col_major, - dynamic - }; -} -``` - - -#### Group Memory Scope -In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The group scope is added as an additional template parameter and is also part of the constructor arguments. 
- -IMPORTANT: In the current implementation, only the `sub_group` scope is supported - -When the group is a `sycl::sub_group`, a matrix is declared as follows: - -```c++ -joint_matrix tA; -``` - - -## Matrix Operations and their Execution Scope -We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation. This set of functions can be easily extended if the matrix hardware implements new features. - -Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group. - -To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: - -IMPORTANT: In the current implementation, only the `sub_group` scope is supported. - -#### Load -```c++ -namespace sycl::ext::oneapi::experimental::matrix { - template - void joint_matrix_load(Group sg, - joint_matrix &res, - multi_ptr src, size_t stride, layout Layout); - - template - void joint_matrix_load(Group sg, - joint_matrix &res, - multi_ptr src, size_t stride); -} -``` - -`joint_matrix_load` loads data from memory to the 2d tiles/registers of the tensor hardware. -We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not. -The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that used the default value `layout::dynamic`. -The second overload without a memory layout must not be used with a `joint_matrix` type that used the default value `layout::dynamic`. 
- -The base pointer `src` here determines the starting address of the matrix to be loaded from. `Layout` determines whether the data is being read in a row major (`row_major`) or column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout. - - -#### Store -```c++ -namespace sycl::ext::oneapi::experimental::matrix { - template - void joint_matrix_store(Group sg, - joint_matrix &res, - multi_ptr dest, size_t stride, layout Layout); -} -``` -This function stores the data in the accumulator matrix from the 2d tiles back to memory. - -The base pointer `dest` here determines the starting address of the matrix to be stored. `Layout` determines whether the data is being written in a row major (`row_major`) or column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout. - - -#### Multiply and Add - -```c++ -namespace sycl::ext::oneapi::experimental::matrix { - template - joint_matrix joint_matrix_mad(Group sg, - joint_matrix A, - joint_matrix B, - joint_matrix C); -} -``` -The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C`, and returns the result. - - -#### Matrix Initialization: `joint_matrix_fill` -The current interface presented above assumes that all the matrices are directly loaded from memory. This new function called `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. 
On Intel AMX, if the initialization constant is zero, this would map to the `_tile_zero` intrinsic: - -```c++ -namespace sycl::ext::oneapi::experimental::matrix { - template - void joint_matrix_fill(Group sg, joint_matrix &m, Tv v); -} -``` -IMPORTANT: In the current implementation, only the `sub_group` scope is supported. - -#### Element Indexing and Piece-Wise Operations -##### Background -Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in a SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into: - -Class 1- Element-wise operations where the same operation is performed on every element of the matrix, such that the operation can be performed without knowledge of the position of the element within the matrix. Activation functions or adding a constant value to every element of the matrix are two examples. - -Class 2- Piece-wise operations where the operation depends on the element index of the matrix or the operation takes multiple elements as operands (such as a sum of all elements in a row for example). Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. - -// We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future. 
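The distinction between the two classes can be sketched in plain C++. This is illustrative only and not the SYCL API: `wi_slice` stands for the work-item-owned portion that `wi_data` would expose, and the `row_of` coordinate array is a hypothetical stand-in for the mapping information that class 2 operations require:

```cpp
#include <cstddef>
#include <vector>

// Class 1: element-wise ReLU on the WI-owned slice. The operation is
// position-independent, so no coordinate information is required.
void relu_slice(std::vector<float> &wi_slice) {
  for (float &e : wi_slice)
    if (e < 0.0f) e = 0.0f;
}

// Class 2: summing rows. Each element's contribution depends on which
// matrix row it came from, so per-element coordinates (modeled here by
// the hypothetical row_of array) are required.
std::vector<float> sum_rows(const std::vector<float> &wi_slice,
                            const std::vector<std::size_t> &row_of,
                            std::size_t num_rows) {
  std::vector<float> sums(num_rows, 0.0f);
  for (std::size_t i = 0; i < wi_slice.size(); ++i)
    sums[row_of[i]] += wi_slice[i];
  return sums;
}
```

Class 1 needs only a loop over the slice length; class 2 is exactly the case where the implementation-defined WI-to-matrix mapping must be queryable.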
- -##### Explicit conversion with mapping from SIMD to SPMD -The data elements in a `joint_matrix` are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the work items in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. However, it is necessary to allocate WIs to specific elements of the matrix in order to perform element-wise operations. In order to be able to perform element-wise operations in a general and efficient way, we provide a conversion function from the `joint_matrix` domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. - -We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI. The indexing provided inside the `wi_data` class accesses only the portion of the current WI and returns `wi_element`. This latter holds a reference to the original joint_matrix that `wi_data` was constructed from. This means that modifying `wi_data` also modifies the corresponding joint matrix elements. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation. - -Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group as this is implementation defined and changes from one backend to the other. 
For general piece-wise operations such as summing the rows of a matrix, the WI data to joint matrix mapping coordinates information must be known in order to reason about the matrix view and extract the relevant piece. However, for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees the whole matrix element-wise operation. - -Therefore, this extension currently only supports class 1 of operations because the mapping between `get_wi_data` and `joint_matrix` elements is not required to be known for these operations. However, general piece-wise operations will be supported in the future as a new API will be provided to convey the mapping from `joint_matrix` domain to WI Domain (See Section "WI data to joint matrix mapping coordinates information for piece-wise operations for more information"). - -Also, note that `get_wi_data` cannot return a fixed size array length because the length of the WI portion is a runtime variable for the following reasons: - -1- The main compilation mode of SYCL is JIT compilation and partitioning among WIs is implementation defined. - -2- Sub group size is not generally fixed. - -The code listing below shows a synopsis of these new APIs. - -```c++ -namespace sycl::ext::oneapi::experimental::matrix { - wi_data get_wi_data(Group sg, joint_matrix Mat); - -template -class wi_data { - size_t length(); - wi_element operator[](size_t i); -}; -template -class wi_element { - operator T(); - wi_element &operator=(const T &rhs); -… -}; -} -``` - -In the following example `wi_data_c` is a reference to the WI owned portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`. -Vectorization along the sub group dimension will get enabled automatically to vectorize the contiguous portion of the matrix. 
- - -```c++ -auto wi_data_c = get_wi_data(sg, matC); -for (int i = 0; i < wi_data_c.length(); i++) - wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C -``` - -IMPORTANT: In the current implementation, only the `sub_group` scope is supported. - -IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet. - -## Example using int8_t type -```c++ -using namespace sycl::ext::oneapi::experimental::matrix; - -queue q; -range<2> G = {M/tM, N}; -range<2> L = {1, SG_SIZE}; -int8_t *memA = malloc_shared(M*K, q); -int8_t *memB = malloc_shared(K*N, q); -int32_t *memC = malloc_shared(M*N, q); -q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) - [[sycl::reqd_sub_group_size(SG_SIZE)]] { - const auto global_idx = item.get_global_id(0); - const auto global_idy = item.get_global_id(1); - const auto sg_startx = global_idx - item.get_local_id(0); - const auto sg_starty = global_idy - item.get_local_id(1); - sub_group sg = item.get_sub_group(); - joint_matrix tA; - joint_matrix tB; - joint_matrix tC; - joint_matrix_fill(sg, tC, 0); - for (int k = 0; k < K; k += tK) { - joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K); - joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N); - tC = joint_matrix_mad(sg, tA, tB, tC); - } - auto wi_data_c = get_wi_data(sg, tC); - for (int i = 0; i < wi_data_c.length(); i++) - wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C - joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); -}).wait(); -``` - -== Query Interface -Intel AMX, Intel XMX and Nvidia TPUs support different sizes and types. -The query interface is used to validate user code and inform them about supported types, sizes, scope, and layouts by the implementation. -This also offers development and tuning productivity by both scientists and library developers. 
The query interface we are proposing here is a compile-time query, -so there will be no runtime errors. -The query interface proposed here consists of three functionalities: - -- Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters. - -- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/accumulator` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K. - -- General query: the general query interface provides information about sizes, types, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and accumulator. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The Intel XMX implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. - -The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which they are defined. 
- -[frame="none",options="header"] -|====================== -| Member/type alias in `tpu_params` | Forms they are defined in |Description -|`type_a`| validation, default values|type alias for the type of matrix A -|`type_b`| validation, default values|type alias for the type of matrix B -|`type_accumulator`| validation, default values|type alias for the type of matrix accumulator -|`M`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation -|`N`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation -|`K`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation -|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A -|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B -|`joint_matrix_accumulator`| validation, default values| type alias for `joint_matrix` for matrix accumulator -|numtiles| validation, default values, general query|indicates number of tiles in Intel AMX (does not apply to Intel XMX) -|scopes| validation, default values, general query| indicates the memory and execution scopes supported by the TPU implementation -|`combination` | validation, default values, general query|composes the types and sizes of A, B, accumulator matrices allowed in one combination -|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query| if the TPU implementation supports a continuous number of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete number of element sizes, each of these members has the value zero -|`msize`, `nsize`, `ksize`| validation, default values, general query| if the TPU implementation supports a discrete number of element sizes, each of these members is non-zero, and the value tells one of the supported element sizes. By contrast, if the TPU supports a continuous number of element sizes, each of these members has the value zero -|`atype`, `btype`, `accumulatortype`| validation, default values, general query| indicates the types supported in the combination -|`combinations` | validation, default values, general query| tells the set of supported matrix sizes and types according to the template parameters that are provided. 
In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes so only this specific combination is returned in the combinations array. -|`num_combinations`| validation, default values, general query|indicates number of combinations supported by the TPU implementation which corresponds to the size of the `combinations` array -|====================== - - - -```c++ -namespace sycl::ext::oneapi::experimental::matrix { -template -struct tpu_params; - -// Validation form: Valid or not -// Specialization when both types and sizes are given -template -struct tpu_params< - tpu::amx, Ta, Tb, Tc, sM, sN, sK, - typename std::enable_if<( - !std::is_same_v && !std::is_same_v && - !std::is_same_v && sM != 0 && sN != 0 && sK != 0)>::type> { - // Validate that parameters are supported - static_assert( - (sM == 0 && sN == 0 && sK == 0) || - (is_combination_valid_amx(sM, sN, sK)), - "Invalid parameters for Intel AMX, query valid types and maximum sizes " - "using: " - "tpu_params myparams; and then check out myparams.combinations array"); - - - using type_a = Ta; // this type alias is not available in the current implementation - using type_b = Tb; // this type alias is not available in the current implementation - using type_accumulator = Tc; // this type alias is not available in the current implementation - - // if combination is valid, construct the matrices - - static constexpr std::size_t M = (sM != 0) ? sM : 16; - static constexpr std::size_t N = (sN != 0) ? sN : 16; - static constexpr std::size_t K = - (sK != 0) ? sK : ((sizeof(Ta) == 1) ? 
64 : 32); - - template - using joint_matrix_a = joint_matrix; - template - using joint_matrix_b = joint_matrix; - template - using joint_matrix_accumulator = joint_matrix; - - static constexpr uint32_t numtiles = 8; - static constexpr scope_t scopes[] = {scope_t::sub_group}; - static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); - struct combination { - uint32_t max_msize; - uint32_t max_nsize; - uint32_t max_ksize; - uint32_t msize; - uint32_t nsize; - uint32_t ksize; - matrix_type atype; - matrix_type btype; - matrix_type accumulatortype; - }; - // In this case, the combinations array contains only the combination that the user provided - static constexpr combination combinations[] = { - {16, 16, (sizeof(Ta) == 1) ? 64 : 32, sM, sN, sK}}; - static constexpr int num_combinations = - sizeof(combinations) / sizeof(combination); -}; - -// Default values form: Sizes-only query -// Specialization for when only types are given, need to query only sizes -template -struct tpu_params && - !std::is_same_v && - !std::is_same_v)>::type> { - static_assert((are_types_valid_amx()), - "Invalid types for Intel AMX, supported types are int8_t, uint8_t, " - "and bf16 (Note that unsigned short should be used in the" - "DPC++ code to implement bf16) "); - - using type_a = Ta; // this type alias is not available in the current implementation - using type_b = Tb; // this type alias is not available in the current implementation - using type_accumulator = Tc; // this type alias is not available in the current implementation - - // construct the matrices using the default sizes - static constexpr std::size_t M = 16; - static constexpr std::size_t N = 16; - static constexpr std::size_t K = ((sizeof(Ta) == 1) ? 
64 : 32); - - template - using joint_matrix_a = joint_matrix; - template - using joint_matrix_b = joint_matrix; - template - using joint_matrix_accumulator = joint_matrix; - - static constexpr uint32_t numtiles = 8; - static constexpr scope_t scopes[] = {scope_t::sub_group}; - static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); - struct combination { - uint32_t max_msize; - uint32_t max_nsize; - uint32_t max_ksize; - uint32_t msize; - uint32_t nsize; - uint32_t ksize; - matrix_type atype; - matrix_type btype; - matrix_type accumulatortype; - }; - // In this case, the combinations array contain only the combinations that correspond to the Ta, Tb, and Tc - // types that the user provided - static constexpr combination combinations[] = { - {16, 16, (sizeof(Ta) == 1) ? 64 : 32}}; - static constexpr int num_combinations = - sizeof(combinations) / sizeof(combination); -}; - -// General query form: -// types are not given, no default sizes and no implicit matrix construction -template -struct tpu_params { - static constexpr uint32_t numtiles = 8; - static constexpr scope_t scopes[] = {scope_t::sub_group}; - static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); - struct combination { - uint32_t max_msize; - uint32_t max_nsize; - uint32_t max_ksize; - uint32_t msize; - uint32_t nsize; - uint32_t ksize; - matrix_type atype; - matrix_type btype; - matrix_type accumulatortype; - }; - - static constexpr combination combinations[] = { - {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32}, - {16, 16, 32, 0, 0,0, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}}; - static constexpr int num_combinations = - sizeof(combinations) / sizeof(combination); -}; - - 
-enum class tpu { - xmx8, - xmx16, - amx -}; - -enum class matrix_type { - bf16, - fp16, - tf32, - fp32, - fp64, - sint2, - sint4, - sint8, - sint16, - sint32, - sint64, - uint2, - uint4, - uint8, - uint16, - uint32, - uint64 -}; - -enum class scope_t { - sub_group, - work_group -}; -} -``` - - -=== Validation Example: -```c++ -// The user can provide sizes besides the types, and tpu_params asserts if they are not supported -// In this case, an assertion will happen as 16 is not a supported size for M -using myparams = tpu_params; -size_t NDRangeM = M / myparams::M; // Assertion would happen at this line -size_t NDRangeN = N / myparams::N; -``` - -=== Default Values Example: -```c++ -using myparams = tpu_params_both; -// use this to construct the ranges on the host side -size_t NDRangeM = M / myparams::M; -size_t NDRangeN = N / myparams::N; -// if M, N, K are not multiples of the default sizes, padding has to be done -// device code: the matrices are constructed using the default dimensions -myparams::joint_matrix_a sub_a; -myparams::joint_matrix_b sub_b; -myparams::joint_matrix_accumulator sub_c; - -``` - -=== General Query Example: -```c++ -constexpr int M = 1500; // with msize = 8 and msize = 4, - // M can be broken into a sequence of 125 8-sized ops, with the remaining 500 handled by a sequence of 125 4-sized ops -tpu_params params; -constexpr int msize = break_dimension(params, M); -constexpr int msize_remainder = break_dimension_remainder(params, M); -constexpr int nsize = params.combinations[0].nsize; -constexpr int ksize = params.combinations[0].ksize; -// device code: -joint_matrix sub_a; -joint_matrix sub_b; -joint_matrix sub_c; -// Remainder handling -``` - -## Future-looking API - -### Memory scope -The current experimental API uses `joint_` semantics to define the memory scope of the matrix. 
The long term solution is to use the proposed link:../supported/sycl_ext_oneapi_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below. - - -```c++ -multi_ptr, address_space::local_space> tA_ptr = group_local_memory>(sg); -``` -We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet. - -### WI data to joint matrix mapping coordinates information for piece-wise operations -The indexing provided inside the `wi_data` class accesses only the portion of the matrix held by the current WI. It is not possible to know the location of this portion in the original matrix. This coordinates mapping is implementation defined and changes from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping information is needed to reason about the matrix view. -Within the joint matrix extension, we want to write, as much as possible, one code to run on different backends. If backend X states that a WI owns one exact row of the matrix for instance, writing the following code will work only on that backend for that version of hardware. If a different hardware and implementation is used, the same WI may own only half of the row if, for example, the SG size increased. - -```c++ -auto data = get_wi_data(sg, C); -for (int i = 0; i < data.length(); ++i) { - sum_of_local_rows[row] += data[i]; -} -``` - -We want to keep backward compatibility in the joint matrix code when implementations or hardware change. To that end, instead of hard-coding this mapping, we use general backend and target-agnostic functionality, especially in the JIT compilation mode of SYCL. 
For this reason we would like to be able to query this mapping so that code does not have to change from one version to the other. - -So for the mapping problem, since this mapping is implementation-defined, one of the proposals is to add runtime functions like: -```c++ -auto data = get_wi_data(sg, C); -for (int i = 0; i < data.length; ++i) { - auto row, col = data[i].get_coord(); - sum_of_local_rows[row] += data[i]; -} -``` - -=== Appendix: Supported Parameter Combinations Per Hardware - -The tables below provide a list of the parameter combinations that -`joint_matrix` implementations support on each supported vendors hardware type. - -==== Nvidia Tensor Cores Supported Combinations - -The complete set of matrix data types and shapes that are supported by the `ext_oneapi_cuda` backend are represented in the following table. Tm indicates the matrix element data type held by a "multiplicand" `joint_matrix`: i.e requiring `use::a` or `use::b`. Tc indicates the matrix element data type held by an "accumulator" `joint_matrix`: i.e requiring `use::accumulator`. - -IMPORTANT: When compiling for the `ext_oneapi_cuda` backend the target arch backend flag, `-Xsycl-target-backend --cuda-gpu-arch=sm_xx`, must be used, where `sm_xx` must be a Compute Capability that is equal to or greater than the appropriate Minimum Compute Capability. When an executable has been compiled for `sm_xx`, if the executable is run on a device with compute capability less than `sm_xx` then an error will be thrown. The mapping to Minimum Compute Capability from each supported parameter combination is specified in the following table. 
- --- -[.center] -|====================== -|Tm (`use::a` or `use::b`) |Tc (`use::accumulator`) |M |N |K | Minimum Compute Capability -.3+|half .3+|float -|16 |16 |16 .6+| sm_70 -|8 |32 |16 -|32 |8 |16 -.3+|half .3+|half -|16 |16 |16 -|8 |32 |16 -|32 |8 |16 -.3+|int8_t .3+|int32_t -|16 |16 |16 .6+| sm_72 -|8 |32 |16 -|32 |8 |16 -.3+|uint8_t .3+|int32_t -|16 |16 |16 -|8 |32 |16 -|32 |8 |16 -|precision::tf32 |float |16 |16 |8 .5+| sm_80 -.3+|bfloat16 .3+|float -|16 |16 |16 -|8 |32 |16 -|32 |8 |16 -|double |double |8 |8 |4 -|====================== --- - -The M, N, K triple from the above table defines the complete set of matrix shapes constructible: --- -[.center] -|====================== -|use |NumRows | NumCols -|a |M |K -|b |K |N -|accumulator | M| N -|====================== --- - -IMPORTANT: The `stride` argument to `joint_matrix_load` and `joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a multiple of 4 when `T` is `float`; where `T` is the type of the `joint_matrix` elements. When `T` is not `half` or `float` there are no restrictions to `stride`. - -## TODO List -- Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. -- Add a more realistic and complete example that shows the value of the general query. - - -## Revision History - -[frame="none",options="header"] -|====================== -|Rev |Date |Author |Changes -|1 |2021-04-13 |Dounia Khaldi |Initial public working draft. 
-|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS -|3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support -|4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the new matrix use parameter and remove reference to the AOT AMX initial implementation -|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia tensor Cores, and move the Intel-specifics to a separate extension document. -|====================== From 8d80ad6499bbb49ca3bb2f86ebdb5458bc29d980 Mon Sep 17 00:00:00 2001 From: Dounia Date: Fri, 9 Jun 2023 10:21:52 -0700 Subject: [PATCH 38/51] Add old folder to try to fix conflicts --- .../sycl_ext_intel_matrix.asciidoc | 155 +++++ .../sycl_ext_oneapi_matrix.asciidoc | 650 ++++++++++++++++++ 2 files changed, 805 insertions(+) create mode 100644 sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc create mode 100644 sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc new file mode 100644 index 0000000000000..883c73c655217 --- /dev/null +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc @@ -0,0 +1,155 @@ +# Additional Intel-only specifics about matrix extension for DPC++ + +:source-highlighter: coderay +:coderay-linenums-mode: table +:dpcpp: pass:[DPC++] + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en + +:blank: pass:[ +] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. 
+:language: {basebackend@docbook:c++:cpp} + + +== Notice + +Copyright (c) 2021-2022 Intel Corporation. All rights reserved. + +NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are +trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. +used by permission by Khronos. + +This extension is written against the SYCL 2020 revision 5 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + +**_NOTE:_** This document describes the extra features and details for the implementation of the `joint_matrix` extension on Intel AMX and Intel XMX. + This is an initial experimental version to try out functionality +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. + +## Introduction +The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. This document lists additional specifics about the supported layouts that enable extra performance and functionality. +This extension presents some supplementary Intel AMX and Intel XMX features not contained within the sycl_ext_oneapi_matrix extension. The additional features are built on top of the sycl_ext_oneapi_matrix extension but are only supported by the Intel AMX and Intel XMX backends. + +## Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification section 6.3.3 "Feature test macros". Therefore, an +implementation supporting this extension must predefine the macro +`SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below.
+Applications can test for the existence of this macro to determine if the +implementation supports this feature, or applications can test the macro's +value to determine which of the extension's APIs the implementation supports. + +[frame="none",options="header"] +|====================== +|Value |Description +|1 |Introduce `packed` layout and extend `joint_matrix_store` to Matrix A and B. +|====================== + + +## Extra Functionality + +### Layout +Besides the row major and column major layouts, `layout` introduces a custom layout, `packed`, which refers to the VNNI format described in the following section. + +```c++ +namespace sycl::ext::intel::experimental::matrix { +enum class layout { + packed +}; +} +``` + + +### Layout argument in `joint_matrix_load` +`layout` in `joint_matrix_load` can take `packed` as an argument to specify that the data has already been transformed into the VNNI format (`packed`). In this case, the `stride` argument of `joint_matrix_load` describes the number of elements between consecutive rows for packed layouts. + +In order to get maximum performance on Intel AMX and Intel XMX, prepacking the data in memory is necessary. If users do not specify the packed layout, the transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout so that the user can specify that A or B has already been transformed into the VNNI format. The packed or VNNI layout is introduced in the `VNNI layout` section below. + +IMPORTANT: In the current Intel AMX and Intel XMX implementations, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`.
+ +### Store Operation +Besides the store of the `accumulator` matrix, the Intel implementation allows stores of matrices `a` and `b` as well. + +#### Store +```c++ +namespace sycl::ext::intel::experimental::matrix { + template <typename Group, typename T, use Use, size_t NumRows, size_t NumCols, + layout Layout, access::address_space Space> + void joint_matrix_store(Group sg, + joint_matrix<Group, T, Use, NumRows, NumCols, Layout> &res, + multi_ptr<T, Space> src, size_t stride); +} +``` + + +## VNNI/Packed Layout +The Intel AMX and Intel XMX compute units assume that the B tile register (src1) is in the VNNI format, as they need 32 bits of K-data in A and B to be contiguous in memory. +The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has already been packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type. + +#### Example 1: 16-bit elements + // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout. + // Element a1 is contiguous in memory with element b1, etc. + // --------------------------------- + // a1, b1, c1, d1 + // a2, b2, c2, d2 + // a3, b3, c3, d3 + // a4, b4, c4, d4 + // --------------------------------- + // The same matrix reformatted in packed layout. + // Here, packing of 2 elements is needed to form 32 bits. + // Element a1 is contiguous in memory with element a2, etc. + // --------------------------------- + // a1, a2, b1, b2, c1, c2, d1, d2 + // a3, a4, b3, b4, c3, c4, d3, d4 + +#### Example 2: 8-bit elements + + // Example of a 4 row x 4 column matrix using an 8-bit data element, in row-major layout. + // Element a1 is contiguous in memory with element b1, etc. + // --------------------------------- + // a1, b1, c1, d1 + // a2, b2, c2, d2 + // a3, b3, c3, d3 + // a4, b4, c4, d4 + // --------------------------------- + // The same matrix reformatted in packed layout.
+ + // Here, packing of 4 elements is needed to form 32 bits. + // Elements a1, a2, a3, a4 are contiguous in memory, etc. + // --------------------------------- + // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4 + +## Supported Combinations Per Hardware + +The table below provides a list of the combinations that `joint_matrix` implementations support on each of the Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class. + +### Intel AMX Supported Combinations + +[frame="none",options="header"] +|====================== +| A type | B type | Accumulator type | M | N | K +| (u)int8_t | (u)int8_t | int32_t | +<=+ 16 | +<=+ 16 | +<=+ 64 +| bf16 | bf16 | fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 +|====================== + +### Intel XMX Supported Combinations + +[frame="none",options="header"] +|====================== +| A type | B type | Accumulator type | M | N | K +| (u)int8_t | (u)int8_t | int32_t | +<=+ 8 | 16 | 32 +| fp16 | fp16 | fp32 | +<=+ 8 | 16 | 16 +| bf16 | bf16 | fp32 | +<=+ 8 | 16 | 16 +|====================== + +## Open Questions +- Should the same class, `joint_matrix`, handle both the case where sizes are constant (GPU case) and the case where sizes are variable (CPU case)? Note that an Intel AMX 2d tile register permits variable sizes of up to 1024 bytes (16 rows x 64 cols). The ability to define only one interface for both would make it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes. But since this was not implemented at all, we decided to remove it. We can revisit this design choice if this comes up as part of a customer request or if the SPIR-V matrix extension extends its support to dynamic sizes.
diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc new file mode 100644 index 0000000000000..cb430e7c794ef --- /dev/null +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -0,0 +1,650 @@ +# Matrix Programming Extension for DPC++: sycl_ext_oneapi_matrix +:source-highlighter: coderay +:coderay-linenums-mode: table +:dpcpp: pass:[DPC++] + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en + +:blank: pass:[ +] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +Copyright (c) 2021-2022 Intel Corporation. All rights reserved. + +NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are +trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. +used by permission by Khronos. + +This extension is written against the SYCL 2020 revision 5 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + + +**_NOTE:_** _This document describes the current design and API for the matrix +extension to {dpcpp}. This is an initial experimental version to try out functionality +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) XMX) and Nvidia(R) Tensor Cores._ + +## Introduction +This document presents an ongoing work towards defining a unified matrix interface. 
This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Intel XMX in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All of these hardware units provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance these different hardware units can offer. + +## Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification section 6.3.3 "Feature test macros". Therefore, an +implementation supporting this extension must predefine the macro +`SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the table below. +Applications can test for the existence of this macro to determine if the +implementation supports this feature, or applications can test the macro's +value to determine which of the extension's APIs the implementation supports. + +[frame="none",options="header"] +|====================== +|Value |Description +|1 |The APIs of this experimental extension are not versioned, so the feature-test macro always has this value. +|====================== + +## Matrix API Versions + +While this document presents the core API that unifies Intel AMX, Intel XMX, and Nvidia Tensor Cores, the implementations support slightly different versions of the API. For this reason, we introduce a new macro, namely `SYCL_EXT_ONEAPI_MATRIX_VERSION`, to distinguish between these different implementations. The goal in the next few months is to get rid of this implementation versioning macro. These are the current values for this macro. + +[frame="none",options="header"] +|====================== +|Value |Description +|1 |Initial extension JIT implementation on Intel AMX and Intel XMX. load, store, mad, fill, piece-wise operations, and the query interface are supported.
The old API used for this implementation is detailed in link:../../deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc[matrix extension] +|2 |JIT implementation on Intel AMX and Intel XMX. load, store, mad, fill, piece-wise operations, and the query interface are supported +|3 |Implementation on Nvidia Tensor Cores +|====================== + +## New `joint_matrix` class +We introduce a new class called `joint_matrix`. The user needs to specify the group memory scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description: + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +template <typename Group, typename T, use Use, size_t Rows, size_t Cols, + layout Layout = layout::dynamic> +struct joint_matrix { + joint_matrix() {} +}; +} +``` + +IMPORTANT: The matrix layout default of `layout::dynamic` applies only to matrices with `use::accumulator`. + +#### Use +Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator +(C)+ is required by backend implementations to reason about the layout of the matrix in registers. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +enum class use { + a, + b, + accumulator +}; +} +``` + +#### Shape +The shape of a `joint_matrix` refers to its number of rows `Rows` and number of columns `Cols`. + +#### Layout +This specifies the memory layout, which can be row major or column major; the additional `dynamic` value defers the layout choice to the load and store operations. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +enum class layout { + row_major, + col_major, + dynamic + }; +} +``` + + +#### Group Memory Scope +In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The group scope is added as an additional template parameter and is also part of the constructor arguments.
+ +IMPORTANT: In the current implementation, only the `sub_group` scope is supported. + +When the group is a `sycl::sub_group`, a matrix is declared as follows: + +```c++ +joint_matrix<sub_group, int8_t, use::a, tM, tK> tA; +``` + + +## Matrix Operations and their Execution Scope +We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation. This set of functions can be easily extended if the matrix hardware implements new features. + +Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group. + +To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: + +IMPORTANT: In the current implementation, only the `sub_group` scope is supported. + +#### Load +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template <typename Group, typename T, size_t NumRows, size_t NumCols, + access::address_space Space> + void joint_matrix_load(Group sg, + joint_matrix<Group, T, use::accumulator, NumRows, NumCols> &res, + multi_ptr<T, Space> src, size_t stride, layout Layout); + + template <typename Group, typename T, use Use, size_t NumRows, size_t NumCols, + layout Layout, access::address_space Space> + void joint_matrix_load(Group sg, + joint_matrix<Group, T, Use, NumRows, NumCols, Layout> &res, + multi_ptr<T, Space> src, size_t stride); +} +``` + +`joint_matrix_load` loads data from memory to the 2d tiles/registers of the tensor hardware. +We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not. +The first overload, which takes the memory layout as an argument, is only available for a `joint_matrix` type that used the default value `layout::dynamic`. +The second overload, without a memory layout, must not be used with a `joint_matrix` type that used the default value `layout::dynamic`.
+ +The base pointer `src` here determines the starting address of the matrix to be loaded from. `Layout` determines whether the data is being read in a row major (`row_major`) or column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout. + + +#### Store +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template <typename Group, typename T, size_t NumRows, size_t NumCols, + access::address_space Space> + void joint_matrix_store(Group sg, + joint_matrix<Group, T, use::accumulator, NumRows, NumCols> &res, + multi_ptr<T, Space> dest, size_t stride, layout Layout); +} +``` +This function stores the data in the accumulator matrix from the 2d tiles back to memory. + +The base pointer `dest` here determines the starting address of the matrix to be stored. `Layout` determines whether the data is being written in a row major (`row_major`) or column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout. + + +#### Multiply and Add + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template <typename Group, typename Ta, typename Tb, typename Tc, size_t M, + size_t K, size_t N, layout LayoutA, layout LayoutB> + joint_matrix<Group, Tc, use::accumulator, M, N> + joint_matrix_mad(Group sg, + joint_matrix<Group, Ta, use::a, M, K, LayoutA> A, + joint_matrix<Group, Tb, use::b, K, N, LayoutB> B, + joint_matrix<Group, Tc, use::accumulator, M, N> C); +} +``` +The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C`, and returns the result. + + +#### Matrix Initialization: `joint_matrix_fill` +The interface presented above assumes that all the matrices are directly loaded from memory. The new `joint_matrix_fill` function makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register.
On Intel AMX, if the initialization constant is zero, this would map to the `_tile_zero` intrinsic: + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template <typename Group, typename T, use Use, size_t NumRows, size_t NumCols, + layout Layout, typename Tv> + void joint_matrix_fill(Group sg, + joint_matrix<Group, T, Use, NumRows, NumCols, Layout> &m, Tv v); +} +``` +IMPORTANT: In the current implementation, only the `sub_group` scope is supported. + +#### Element Indexing and Piece-Wise Operations +##### Background +Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in an SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into: + +Class 1- Element-wise operations where the same operation is performed on every element of the matrix, such that the operation can be performed without knowledge of the position of the element within the matrix. Activation functions or adding a constant value to every element of the matrix are two examples. + +Class 2- Piece-wise operations where the operation depends on the element index of the matrix or the operation takes multiple elements as operands (such as a sum of all elements in a row for example). Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations. + +// We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result in slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future.
+ +##### Explicit conversion with mapping from SIMD to SPMD +The data elements in a `joint_matrix` are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the work items in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore, the partitioning among the WIs is implementation defined. However, it is necessary to allocate WIs to specific elements of the matrix in order to perform element-wise operations. In order to be able to perform element-wise operations in a general and efficient way, we provide a conversion function from the `joint_matrix` domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. + +We introduce a new function, `get_wi_data`, that provides a view of the portion of the matrix that is owned by the current WI. The indexing provided inside the `wi_data` class accesses only the portion of the current WI and returns a `wi_element`. The latter holds a reference to the original `joint_matrix` that `wi_data` was constructed from. This means that modifying `wi_data` also modifies the corresponding joint matrix elements. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation. + +Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group, as this is implementation defined and changes from one backend to the other.
For general piece-wise operations such as summing the rows of a matrix, the WI data to joint matrix mapping coordinates information must be known in order to reason about the matrix view and extract the relevant piece. However, for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees the whole matrix element-wise operation. + +Therefore, this extension currently only supports class 1 operations, because the mapping between `get_wi_data` and `joint_matrix` elements is not required to be known for these operations. However, general piece-wise operations will be supported in the future, as a new API will be provided to convey the mapping from the `joint_matrix` domain to the WI domain (see the section "WI data to joint matrix mapping coordinates information for piece-wise operations" for more information). + +Also, note that `get_wi_data` cannot return a fixed-size array, because the length of the WI portion is a runtime variable for the following reasons: + +1- The main compilation mode of SYCL is JIT compilation and partitioning among WIs is implementation defined. + +2- Sub group size is not generally fixed. + +The code listing below shows a synopsis of these new APIs. + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { + template <typename Group, typename T, use Use, size_t Rows, size_t Cols, + layout Layout> + wi_data<Group, T, Use, Rows, Cols, Layout> + get_wi_data(Group sg, joint_matrix<Group, T, Use, Rows, Cols, Layout> &Mat); + +template <typename Group, typename T, use Use, size_t Rows, size_t Cols, + layout Layout> +class wi_data { + size_t length(); + wi_element operator[](size_t i); +}; +template <typename Group, typename T, use Use, size_t Rows, size_t Cols, + layout Layout> +class wi_element { + operator T(); + wi_element &operator=(const T &rhs); +… +}; +} +``` + +In the following example, `wi_data_c` is a reference to the WI owned portion of the joint matrix `matC`. As such, `wi_data_c[i] OP rhs` updates the corresponding matrix element in the `joint_matrix` `matC`. +Vectorization along the sub group dimension will get enabled automatically to vectorize the contiguous portion of the matrix.
+ + +```c++ +auto wi_data_c = get_wi_data(sg, matC); +for (int i = 0; i < wi_data_c.length(); i++) + wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C +``` + +IMPORTANT: In the current implementation, only the `sub_group` scope is supported. + +IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet. + +## Example using int8_t type +```c++ +using namespace sycl::ext::oneapi::experimental::matrix; + +queue q; +range<2> G = {M/tM, N}; +range<2> L = {1, SG_SIZE}; +int8_t *memA = malloc_shared<int8_t>(M*K, q); +int8_t *memB = malloc_shared<int8_t>(K*N, q); +int32_t *memC = malloc_shared<int32_t>(M*N, q); +q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) + [[sycl::reqd_sub_group_size(SG_SIZE)]] { + const auto global_idx = item.get_global_id(0); + const auto global_idy = item.get_global_id(1); + const auto sg_startx = global_idx - item.get_local_id(0); + const auto sg_starty = global_idy - item.get_local_id(1); + sub_group sg = item.get_sub_group(); + joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA; + joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB; + joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC; + joint_matrix_fill(sg, tC, 0); + for (int k = 0; k < K; k += tK) { + joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K); + joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N); + tC = joint_matrix_mad(sg, tA, tB, tC); + } + auto wi_data_c = get_wi_data(sg, tC); + for (int i = 0; i < wi_data_c.length(); i++) + wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C + joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major); +}).wait(); +``` + +== Query Interface +Intel AMX, Intel XMX and Nvidia TPUs support different sizes and types. +The query interface is used to validate user code and inform users about the types, sizes, scope, and layouts supported by the implementation. +This also offers development and tuning productivity for both scientists and library developers.
The query interface we are proposing here is a compile-time query, +so there will be no runtime errors. +The query interface proposed here consists of three functionalities: + +- Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters. + +- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/accumulator` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K. + +- General query: the general query interface provides information about sizes, types, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and accumulator. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The Intel XMX implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. + +The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which they are defined. 
+ +[frame="none",options="header"] +|====================== +| Member/type alias in `tpu_params` | Forms they are defined in |Description +|`type_a`| validation, default values|type alias for the type of matrix A +|`type_b`| validation, default values|type alias for the type of matrix B +|`type_accumulator`| validation, default values|type alias for the type of matrix accumulator +|`M`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation +|`N`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation +|`K`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation
+|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A
+|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B
+|`joint_matrix_accumulator`| validation, default values| type alias for `joint_matrix` for matrix accumulator
+|`numtiles`| validation, default values, general query|indicates the number of tiles in Intel AMX (does not apply to Intel XMX)
+|`scopes`| validation, default values, general query| indicates the memory and execution scopes supported by the TPU implementation
+|`combination` | validation, default values, general query|composes the types and sizes of the A, B, and accumulator matrices allowed in one combination
+|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query| if the TPU implementation supports a continuous number of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete number of element sizes, each of these members has the value zero
+|`msize`, `nsize`, `ksize`| validation, default values, general query| if the TPU implementation supports a discrete number of element sizes, each of these members is non-zero, and each value is one of the supported element sizes. By contrast, if the TPU supports a continuous number of element sizes, each of these members has the value zero
+|`atype`, `btype`, `accumulatortype`| validation, default values, general query| indicate the types supported in the combination
+|`combinations` | validation, default values, general query| gives the set of supported matrix sizes and types according to the template parameters that are provided.
In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes, so only this specific combination is returned in the combinations array.
+|`num_combinations`| validation, default values, general query|indicates the number of combinations supported by the TPU implementation, which corresponds to the size of the `combinations` array
+|======================
+
+
+
+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+template <tpu u, typename Ta = void, typename Tb = void, typename Tc = void,
+          int sM = 0, int sN = 0, int sK = 0, typename Enabled = void>
+struct tpu_params;
+
+// Validation form: Valid or not
+// Specialization when both types and sizes are given
+template <typename Ta, typename Tb, typename Tc, int sM, int sN, int sK>
+struct tpu_params<
+    tpu::amx, Ta, Tb, Tc, sM, sN, sK,
+    typename std::enable_if<(
+        !std::is_same_v<Ta, void> && !std::is_same_v<Tb, void> &&
+        !std::is_same_v<Tc, void> && sM != 0 && sN != 0 && sK != 0)>::type> {
+  // Validate that parameters are supported
+  static_assert(
+      (sM == 0 && sN == 0 && sK == 0) ||
+          (is_combination_valid_amx<Ta, Tb, Tc>(sM, sN, sK)),
+      "Invalid parameters for Intel AMX, query valid types and maximum sizes "
+      "using: "
+      "tpu_params<tpu::amx> myparams; and then check out myparams.combinations array");
+
+
+  using type_a = Ta; // this type alias is not available in the current implementation
+  using type_b = Tb; // this type alias is not available in the current implementation
+  using type_accumulator = Tc; // this type alias is not available in the current implementation
+
+  // if combination is valid, construct the matrices
+
+  static constexpr std::size_t M = (sM != 0) ? sM : 16;
+  static constexpr std::size_t N = (sN != 0) ? sN : 16;
+  static constexpr std::size_t K =
+      (sK != 0) ? sK : ((sizeof(Ta) == 1) ? 64 : 32);
+
+  template <typename Group, layout Layout>
+  using joint_matrix_a = joint_matrix<Group, Ta, use::a, M, K, Layout>;
+  template <typename Group, layout Layout>
+  using joint_matrix_b = joint_matrix<Group, Tb, use::b, K, N, Layout>;
+  template <typename Group>
+  using joint_matrix_accumulator = joint_matrix<Group, Tc, use::accumulator, M, N>;
+
+  static constexpr uint32_t numtiles = 8;
+  static constexpr scope_t scopes[] = {scope_t::sub_group};
+  static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t);
+  struct combination {
+    uint32_t max_msize;
+    uint32_t max_nsize;
+    uint32_t max_ksize;
+    uint32_t msize;
+    uint32_t nsize;
+    uint32_t ksize;
+    matrix_type atype;
+    matrix_type btype;
+    matrix_type accumulatortype;
+  };
+  // In this case, the combinations array contains only the combination that the user provided
+  static constexpr combination combinations[] = {
+      {16, 16, (sizeof(Ta) == 1) ? 64 : 32, sM, sN, sK}};
+  static constexpr int num_combinations =
+      sizeof(combinations) / sizeof(combination);
+};
+
+// Default values form: Sizes-only query
+// Specialization for when only types are given, need to query only sizes
+template <typename Ta, typename Tb, typename Tc>
+struct tpu_params<tpu::amx, Ta, Tb, Tc, 0, 0, 0,
+                  typename std::enable_if<(!std::is_same_v<Ta, void> &&
+                                           !std::is_same_v<Tb, void> &&
+                                           !std::is_same_v<Tc, void>)>::type> {
+  static_assert((are_types_valid_amx<Ta, Tb, Tc>()),
+                "Invalid types for Intel AMX, supported types are int8_t, uint8_t, "
+                "and bf16 (Note that unsigned short should be used in the "
+                "DPC++ code to implement bf16) ");
+
+  using type_a = Ta; // this type alias is not available in the current implementation
+  using type_b = Tb; // this type alias is not available in the current implementation
+  using type_accumulator = Tc; // this type alias is not available in the current implementation
+
+  // construct the matrices using the default sizes
+  static constexpr std::size_t M = 16;
+  static constexpr std::size_t N = 16;
+  static constexpr std::size_t K = ((sizeof(Ta) == 1) ? 64 : 32);
+
+  template <typename Group, layout Layout>
+  using joint_matrix_a = joint_matrix<Group, Ta, use::a, M, K, Layout>;
+  template <typename Group, layout Layout>
+  using joint_matrix_b = joint_matrix<Group, Tb, use::b, K, N, Layout>;
+  template <typename Group>
+  using joint_matrix_accumulator = joint_matrix<Group, Tc, use::accumulator, M, N>;
+
+  static constexpr uint32_t numtiles = 8;
+  static constexpr scope_t scopes[] = {scope_t::sub_group};
+  static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t);
+  struct combination {
+    uint32_t max_msize;
+    uint32_t max_nsize;
+    uint32_t max_ksize;
+    uint32_t msize;
+    uint32_t nsize;
+    uint32_t ksize;
+    matrix_type atype;
+    matrix_type btype;
+    matrix_type accumulatortype;
+  };
+  // In this case, the combinations array contains only the combinations that correspond to the Ta, Tb, and Tc
+  // types that the user provided
+  static constexpr combination combinations[] = {
+      {16, 16, (sizeof(Ta) == 1) ? 64 : 32}};
+  static constexpr int num_combinations =
+      sizeof(combinations) / sizeof(combination);
+};
+
+// General query form:
+// types are not given, no default sizes and no implicit matrix construction
+template <int sM, int sN, int sK>
+struct tpu_params<tpu::amx, void, void, void, sM, sN, sK> {
+  static constexpr uint32_t numtiles = 8;
+  static constexpr scope_t scopes[] = {scope_t::sub_group};
+  static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t);
+  struct combination {
+    uint32_t max_msize;
+    uint32_t max_nsize;
+    uint32_t max_ksize;
+    uint32_t msize;
+    uint32_t nsize;
+    uint32_t ksize;
+    matrix_type atype;
+    matrix_type btype;
+    matrix_type accumulatortype;
+  };
+
+  static constexpr combination combinations[] = {
+      {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32},
+      {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32},
+      {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32},
+      {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32},
+      {16, 16, 32, 0, 0, 0, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}};
+  static constexpr int num_combinations =
+      sizeof(combinations) / sizeof(combination);
+};
+
+enum class tpu {
+  xmx8,
+  xmx16,
+  amx
+};
+
+enum class matrix_type {
+  bf16,
+  fp16,
+  tf32,
+  fp32,
+  fp64,
+  sint2,
+  sint4,
+  sint8,
+  sint16,
+  sint32,
+  sint64,
+  uint2,
+  uint4,
+  uint8,
+  uint16,
+  uint32,
+  uint64
+};
+
+enum class scope_t {
+  sub_group,
+  work_group
+};
+}
+```
+
+
+=== Validation Example:
+```c++
+// Users can provide sizes besides the types, and tpu_params can assert whether they are supported or not.
+// In this case, an assertion will happen, as 16 is not a supported size for M on this TPU.
+using myparams = tpu_params<tpu::xmx8, int8_t, int8_t, int, 16, 16, 32>;
+size_t NDRangeM = M / myparams::M; // The assertion would happen at this line
+size_t NDRangeN = N / myparams::N;
+```
+
+=== Default Values Example:
+```c++
+using myparams = tpu_params<tpu::amx, int8_t, int8_t, int>;
+// use this to construct the ranges on the host side
+size_t NDRangeM = M / myparams::M;
+size_t NDRangeN = N / myparams::N;
+// if M, N, K are not multiples of the default sizes, padding has to be done
+// device code: the matrices are constructed using the default dimensions
+myparams::joint_matrix_a<sub_group, layout::row_major> sub_a;
+myparams::joint_matrix_b<sub_group, layout::row_major> sub_b;
+myparams::joint_matrix_accumulator<sub_group> sub_c;
+
+```
+
+=== General Query Example:
+```c++
+constexpr int M = 1500; // with msize = 8 and msize = 4,
+                        // M can be broken up into 125 sequences of 8-sized ops and the remaining 500 into 125 sequences of 4-sized ops
+tpu_params<tpu::xmx8> params;
+constexpr int msize = break_dimension(params, M);
+constexpr int msize_remainder = break_dimension_remainder(params, M);
+constexpr int nsize = params.combinations[0].nsize;
+constexpr int ksize = params.combinations[0].ksize;
+// device code:
+joint_matrix<sub_group, int8_t, use::a, msize, ksize> sub_a;
+joint_matrix<sub_group, int8_t, use::b, ksize, nsize> sub_b;
+joint_matrix<sub_group, int, use::accumulator, msize, nsize> sub_c;
+// Remainder handling
+```
+
+## Future-looking API
+
+### Memory scope
+The current experimental API uses `joint_` semantics to define the memory scope of the matrix.
The long term solution is to use the proposed link:../supported/sycl_ext_oneapi_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group, as shown in the example below.
+
+
+```c++
+multi_ptr<joint_matrix<sub_group, int8_t, use::a, tM, tK>, address_space::local_space> tA_ptr =
+    group_local_memory<joint_matrix<sub_group, int8_t, use::a, tM, tK>>(sg);
+```
+We did not use this extension in this version of the matrix API because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not yet clear.
+
+### WI data to joint matrix mapping coordinates information for piece-wise operations
+The indexing provided by the `wi_data` class accesses only the portion of the matrix held by the current WI. It is not possible to know the location of this portion in the original matrix. This coordinate mapping is implementation-defined and differs from one backend to another. For general piece-wise operations, such as a sum of the rows of a matrix, the WI data to joint matrix mapping information is needed to reason about the matrix view.
+Within the joint matrix extension, we want to write, as much as possible, one piece of code that runs on different backends. If backend X states that a WI owns exactly one row of the matrix, for instance, the following code will work only on that backend and that version of the hardware. If different hardware or a different implementation is used, the same WI may own only half of the row if, for example, the sub-group size increases.
+
+```c++
+auto data = get_wi_data(sg, C);
+for (int i = 0; i < data.length(); ++i) {
+  sum_of_local_rows[row] += data[i]; // assumes a backend-specific mapping from i to a known row
+}
+```
+
+We want to keep joint matrix code backward compatible when implementations or hardware change. To that end, instead of hard-coding this mapping, we rely on general, backend- and target-agnostic functionality, which is especially important in the JIT compilation mode of SYCL.
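To illustrate why querying the mapping restores portability, the following stand-alone C++ sketch simulates such a coordinate query. The function `get_coord_sim` and its column-cyclic ownership are purely hypothetical stand-ins for the implementation-defined distribution; the point is that the row sum stays correct no matter which mapping the simulated query reports.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for the implementation-defined WI-data-to-matrix
// mapping: element i of work item `wi` lives at some (row, col). Here we
// model a column-cyclic ownership; a real backend may choose anything, and
// only the queried coordinates make the code below correct.
std::pair<int, int> get_coord_sim(int wi, int i, int cols, int n_wi) {
  int flat = i * n_wi + wi; // column-cyclic ownership (assumption)
  return {flat / cols, flat % cols};
}

// Distribution-agnostic sum of rows: each "work item" walks only its local
// elements and scatters them into the right row via the queried coordinates.
std::vector<float> sum_rows(const std::vector<float> &M, int rows, int cols, int n_wi) {
  std::vector<float> sums(rows, 0.0f);
  int len = rows * cols / n_wi;       // elements owned per work item
  for (int wi = 0; wi < n_wi; ++wi)   // this loop is implicit (SPMD) in real SYCL code
    for (int i = 0; i < len; ++i) {
      auto [row, col] = get_coord_sim(wi, i, cols, n_wi);
      sums[row] += M[row * cols + col];
    }
  return sums;
}
```

Swapping `get_coord_sim` for any other bijective mapping leaves `sum_rows` unchanged, which is the portability property a standardized coordinate query would give user code.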
For this reason, we would like to be able to query this mapping so that code does not have to change from one version to the other.
+
+Since this mapping is implementation-defined, one of the proposals is to add runtime functions like:
+```c++
+auto data = get_wi_data(sg, C);
+for (int i = 0; i < data.length(); ++i) {
+  auto [row, col] = data[i].get_coord();
+  sum_of_local_rows[row] += data[i];
+}
+```
+
+=== Appendix: Supported Parameter Combinations Per Hardware
+
+The tables below provide a list of the parameter combinations that
+`joint_matrix` implementations support on each supported vendor's hardware.
+
+==== Nvidia Tensor Cores Supported Combinations
+
+The complete set of matrix data types and shapes that are supported by the `ext_oneapi_cuda` backend is represented in the following table. Tm indicates the matrix element data type held by a "multiplicand" `joint_matrix`, i.e. one requiring `use::a` or `use::b`. Tc indicates the matrix element data type held by an "accumulator" `joint_matrix`, i.e. one requiring `use::accumulator`.
+
+IMPORTANT: When compiling for the `ext_oneapi_cuda` backend, the target arch backend flag, `-Xsycl-target-backend --cuda-gpu-arch=sm_xx`, must be used, where `sm_xx` must be a Compute Capability that is equal to or greater than the appropriate Minimum Compute Capability. If an executable compiled for `sm_xx` is run on a device with a compute capability lower than `sm_xx`, an error will be thrown. The mapping to the Minimum Compute Capability from each supported parameter combination is specified in the following table.
+ +-- +[.center] +|====================== +|Tm (`use::a` or `use::b`) |Tc (`use::accumulator`) |M |N |K | Minimum Compute Capability +.3+|half .3+|float +|16 |16 |16 .6+| sm_70 +|8 |32 |16 +|32 |8 |16 +.3+|half .3+|half +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +.3+|int8_t .3+|int32_t +|16 |16 |16 .6+| sm_72 +|8 |32 |16 +|32 |8 |16 +.3+|uint8_t .3+|int32_t +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +|precision::tf32 |float |16 |16 |8 .5+| sm_80 +.3+|bfloat16 .3+|float +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +|double |double |8 |8 |4 +|====================== +-- + +The M, N, K triple from the above table defines the complete set of matrix shapes constructible: +-- +[.center] +|====================== +|use |NumRows | NumCols +|a |M |K +|b |K |N +|accumulator | M| N +|====================== +-- + +IMPORTANT: The `stride` argument to `joint_matrix_load` and `joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a multiple of 4 when `T` is `float`; where `T` is the type of the `joint_matrix` elements. When `T` is not `half` or `float` there are no restrictions to `stride`. + +## TODO List +- Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. +- Add a more realistic and complete example that shows the value of the general query. + + +## Revision History + +[frame="none",options="header"] +|====================== +|Rev |Date |Author |Changes +|1 |2021-04-13 |Dounia Khaldi |Initial public working draft. 
+|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS +|3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support +|4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the new matrix use parameter and remove reference to the AOT AMX initial implementation +|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia tensor Cores, and move the Intel-specifics to a separate extension document. +|====================== From 35c8744ba2f7899c96f5bfa472a7d259d4db2df6 Mon Sep 17 00:00:00 2001 From: Dounia Date: Fri, 9 Jun 2023 11:48:43 -0700 Subject: [PATCH 39/51] remove the old folder that resulted from the merge with sycl branch --- .../sycl_ext_intel_matrix.asciidoc | 155 ----- .../sycl_ext_oneapi_matrix.asciidoc | 650 ------------------ 2 files changed, 805 deletions(-) delete mode 100644 sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc delete mode 100644 sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc deleted file mode 100644 index 883c73c655217..0000000000000 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_intel_matrix.asciidoc +++ /dev/null @@ -1,155 +0,0 @@ -# Additional Intel-only specifics about matrix extension for DPC++ - -:source-highlighter: coderay -:coderay-linenums-mode: table -:dpcpp: pass:[DPC++] - -// This section needs to be after the document title. -:doctype: book -:toc2: -:toc: left -:encoding: utf-8 -:lang: en - -:blank: pass:[ +] - -// Set the default source code type in this document to C++, -// for syntax highlighting purposes. This is needed because -// docbook uses c++ and html5 uses cpp. 
-:language: {basebackend@docbook:c++:cpp} - - -== Notice - -Copyright (c) 2021-2022 Intel Corporation. All rights reserved. - -NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are -trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. -used by permission by Khronos. - -This extension is written against the SYCL 2020 revision 5 specification. All -references below to the "core SYCL specification" or to section numbers in the -SYCL specification refer to that revision. - -**_NOTE:_** This document describes the extra features and details for the implementation of `joint_matrix` extension on Intel AMX and Intel XMX. - This is an initial experimental version to try out functionality -and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. - -## Introduction -The Intel backend implementations on both Intel AMX and Intel XMX support `joint_matrix`, `joint_matrix_load`, `joint_matrix_store`, `joint_matrix_mad`, `joint_matrix_fill`, `get_wi_data`, and the query interface, as they are defined in the sycl_ext_oneapi_matrix extension. There are additional specifics about the supported layouts that enable extra performance and functionality listed in this document. -This extension presents some supplementary Intel AMX and Intel XMX features not contained within the sycl_ext_oneapi_matrix extension. The additional features are built on top of the sycl_ext_oneapi_matrix extension but are only supported by the Intel AMX and Intel XMX backends. - -## Feature test macro - -This extension provides a feature-test macro as described in the core SYCL -specification section 6.3.3 "Feature test macros". Therefore, an -implementation supporting this extension must predefine the macro -`SYCL_EXT_INTEL_MATRIX` to one of the values defined in the table below. 
-Applications can test for the existence of this macro to determine if the -implementation supports this feature, or applications can test the macro's -value to determine which of the extension's APIs the implementation supports. - -[frame="none",options="header"] -|====================== -|Value |Description -|1 |Introduce `packed` layout and extend `joint_matrix_store` to Matrix A and B. -|====================== - - -## Extra Functionality - -### Layout -Besides row major and column major layouts, `layout` introduces the custom layout packed layout that refers to the VNNI format descibed in the following section. - -```c++ -namespace sycl::ext::intel::experimental::matrix { -enum class layout { - packed -}; -} -``` - - -### Layout argument in `joint_matrix_load` -`layout` in `joint_matrix_load` can take `packed` as argument to specify that the data has already been transformed into VNNI format (`packed`). in this case, `stride` argument of `joint_matrix_load` describes the number of elements between consecutive rows for packed layouts. - -In order to get maximum performance on Intel AMX and Intel XMX, prepacking data in the memory is necessary. If users did not specify the packed layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the `packed` layout to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below. - -IMPORTANT: In the current Intel AMX and Intel XMX implementations, the layout in the load of matrix B (provided by the `layout memL` parameter below) must be `packed` or `row_major`. Automatic VNNI transform is supported on AMX. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C (provided by the `layout memL` parameter below) must also be `row_major`. 
- -### Store Operation -Besides store of matrix `accumulator`, the Intel implementation allows store on matrix `a` and `b` as well. - -#### Store -```c++ -namespace sycl::ext::intel::experimental::matrix { - template - void joint_matrix_store(Group sg, - joint_matrix &res, - multi_ptr src, size_t stride); -} -``` - - -## VNNI/Packed Layout -Intel AMX and Intel XMX compute assumes that the B tile register (src1) is in the VNNI format as they need 32bit of K-data in A and B to be contiguous in memory. -The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transformation. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed` layout for a 16-bit type. - -#### Example 1: 16-bit elements - // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout. - // Element a1 is contiguous in memory with element b1, etc. - // --------------------------------- - // a1, b1, c1, d1 - // a2, b2, c2, d2 - // a3, b3, c3, d3 - // a4, b4, c4, d4 - // --------------------------------- - // The same matrix reformatted in packed layout. - // Here, packing of 2 elements is needed to form 32 bits. - // Element a1 is contiguous in memory with element a2, etc. - // --------------------------------- - // a1, a2, b1, b2, c1, c2, d1, d2 - // a3, a4, b3, b4, c3, c4, d3, d4 - -#### Example 2: 8-bit elements - - // Example of a 4 row x 4 column matrix using a 8-bit data element, in row-major layout. - // Element a1 is contiguous in memory with element b1, etc. - // --------------------------------- - // a1, b1, c1, d1 - // a2, b2, c2, d2 - // a3, b3, c3, d3 - // a4, b4, c4, d4 - // --------------------------------- - // The same matrix reformatted in packed layout. 
- // Here, packing of 4 elements is needed to form 32 bits. - // Elements a1, a2, a3, a4 are contiguous in memory, etc. - // --------------------------------- - // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4 - -## Supported Combinations Per Hardware - -The table below provides a list of the combinations that `joint_matrix` implementations support on each of Intel AMX and Intel XMX hardware. Note that these can be returned in a parametrized way using the `tpu_params` query class. - -### Intel AMX Supported Combinations - -[frame="none",options="header"] -|====================== -| A type | B type | Accumulator type | M | N | K -| (u)int8_t | (u)int8_t | int32_t | +<=+ 16 | +<=+ 16 | +<=+ 64 -| bf16 | bf16 | fp32 | +<=+ 16 | +<=+ 16 | +<=+ 32 -|====================== - -### Intel XMX Supported Combinations - -[frame="none",options="header"] -|====================== -| A type | B type | Accumulator type | M | N | K -| (u)int8_t | (u)int8_t | int32_t | +<=+ 8 | 16 | 32 -| fp16 | fp16 | fp32 | +<=+ 8 | 16 | 16 -| bf16 | bf16 | fp32 | +<=+ 8 | 16 | 16 -|====================== - -## Open Questions -- Should the same class, `joint_matrix`, handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case)? Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes that can be variable. The ability to define only one interface for both would make it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. In a previous version of the design, we used `sycl::dynamic_extent` to differentiate between static and dynamic sizes. But since this was not implemented at all, we decided to remove it. We can revisit this design choice if this comes up as part of a customer request or if SPIRV matrix extension extends its support to dynamic sizes. 
diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc deleted file mode 100644 index cb430e7c794ef..0000000000000 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc +++ /dev/null @@ -1,650 +0,0 @@ -# Matrix Programming Extension for DPC++: sycl_ext_oneapi_matrix -:source-highlighter: coderay -:coderay-linenums-mode: table -:dpcpp: pass:[DPC++] - -// This section needs to be after the document title. -:doctype: book -:toc2: -:toc: left -:encoding: utf-8 -:lang: en - -:blank: pass:[ +] - -// Set the default source code type in this document to C++, -// for syntax highlighting purposes. This is needed because -// docbook uses c++ and html5 uses cpp. -:language: {basebackend@docbook:c++:cpp} - - -== Notice - -Copyright (c) 2021-2022 Intel Corporation. All rights reserved. - -NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are -trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. -used by permission by Khronos. - -This extension is written against the SYCL 2020 revision 5 specification. All -references below to the "core SYCL specification" or to section numbers in the -SYCL specification refer to that revision. - - -**_NOTE:_** _This document describes the current design and API for the matrix -extension to {dpcpp}. This is an initial experimental version to try out functionality -and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) XMX) and Nvidia(R) Tensor Cores._ - -## Introduction -This document presents an ongoing work towards defining a unified matrix interface. 
This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Intel XMX in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer. - -## Feature test macro - -This extension provides a feature-test macro as described in the core SYCL -specification section 6.3.3 "Feature test macros". Therefore, an -implementation supporting this extension must predefine the macro -`SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the table below. -Applications can test for the existence of this macro to determine if the -implementation supports this feature, or applications can test the macro's -value to determine which of the extension's APIs the implementation supports. - -[frame="none",options="header"] -|====================== -|Value |Description -|1 |The APIs of this experimental extension are not versioned, so the feature-test macro always has this value. -|====================== - -## Matrix API Versions - -While this document presents the core API that unifies Intel AMX, Intel XMX, and Nvidia Tensor Cores, the implementations support slightly different versions of the API. For this reason, we introduce a new macro, namely `SYCL_EXT_ONEAPI_MATRIX_VERSION` to distinguish between these different implementations. The goal in the next few months is to get rid of this implementation versioning macro. These are the current values for this macro. - -[frame="none",options="header"] -|====================== -|Value |Description -|1 |Initial extension JIT implementation on Intel AMX and Intel XMX. load, store, mad, fill, piece-wise operations, and the query interface are supported. 
The old API used for this implementation is detailed in link:../../deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc[matrix extension] -|2 |JIT implementation on Intel AMX and Intel XMX. load, store, mad, fill, piece-wise operations, and the query interface are supported -|3 |Implementation on Nvidia Tensor Cores -|====================== - -## New `joint_matrix` class -We introduce a new class called `joint_matrix`. The user needs to specify the group memory scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description: - -```c++ -namespace sycl::ext::oneapi::experimental::matrix { -template -struct joint_matrix { - joint_matrix() {} -}; -} -``` - -IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrix with `use::accumulator` - -#### Use -Specifying the usage of the matrix: matrix left (A), matrix right (B) or accumulator +(C)+ is required by backend implementations to reason about the layout of the matrix in registers. - -```c++ -namespace sycl::ext::oneapi::experimental::matrix { -enum class use { - a, - b, - accumulator -}; -} -``` - -#### Shape -The shape of a `joint_matrix` refers to its number of rows `Rows` and number of columns `Cols`. - -#### Layout -This specifies the memory layout and it can be row major or column major. - -```c++ -namespace sycl::ext::oneapi::experimental::matrix { -enum class layout { - row_major, - col_major, - dynamic - }; -} -``` - - -#### Group Memory Scope -In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The group scope is added as an additional template parameter and is also part of the constructor arguments. 
- -IMPORTANT: In the current implementation, only the `sub_group` scope is supported - -When the group is a `sycl::sub_group`, a matrix is declared as follows: - -```c++ -joint_matrix tA; -``` - - -## Matrix Operations and their Execution Scope -We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation. This set of functions can be easily extended if the matrix hardware implements new features. - -Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group. - -To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: - -IMPORTANT: In the current implementation, only the `sub_group` scope is supported. - -#### Load -```c++ -namespace sycl::ext::oneapi::experimental::matrix { - template - void joint_matrix_load(Group sg, - joint_matrix &res, - multi_ptr src, size_t stride, layout Layout); - - template - void joint_matrix_load(Group sg, - joint_matrix &res, - multi_ptr src, size_t stride); -} -``` - -`joint_matrix_load` loads data from memory to the 2d tiles/registers of the tensor hardware. -We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not. -The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that used the default value `layout::dynamic`. -The second overload without a memory layout must not be used with a `joint_matrix` type that used the default value `layout::dynamic`. 
-
-The base pointer `src` here determines the starting address of the matrix to be loaded from. `Layout` determines whether the data is being read in a row-major (`row_major`) or column-major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row-major layout, or between consecutive columns for the column-major layout.
-
-
-#### Store
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-  template <typename Group, typename T, size_t Rows, size_t Cols,
-            access::address_space Space, access::decorated IsDecorated>
-  void joint_matrix_store(Group sg,
-    joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res,
-    multi_ptr<T, Space, IsDecorated> dest, size_t stride, layout Layout);
-}
-```
-This function stores the data in the accumulator matrix from the 2d tiles back to memory.
-
-The base pointer `dest` here determines the starting address of the matrix to be stored. `Layout` determines whether the data is being written in a row-major (`row_major`) or column-major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row-major layout, or between consecutive columns for the column-major layout.
-
-
-#### Multiply and Add
-
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-  template <typename Group, typename Ta, typename Tb, typename Tc,
-            size_t M, size_t K, size_t N, layout LayoutA, layout LayoutB>
-  joint_matrix<Group, Tc, use::accumulator, M, N>
-  joint_matrix_mad(Group sg,
-    joint_matrix<Group, Ta, use::a, M, K, LayoutA> A,
-    joint_matrix<Group, Tb, use::b, K, N, LayoutB> B,
-    joint_matrix<Group, Tc, use::accumulator, M, N> C);
-}
-```
-The matrix multiply-and-add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C`, and returns the result.
-
-
-#### Matrix Initialization: `joint_matrix_fill`
-The interface presented above assumes that all the matrices are directly loaded from memory. The new function `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in registers.
On Intel AMX, if the initialization constant is zero, this maps to the `_tile_zero` intrinsic:
-
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-  template <typename Group, typename T, typename Tv,
-            use Use, size_t Rows, size_t Cols, layout Layout>
-  void joint_matrix_fill(Group sg,
-    joint_matrix<Group, T, Use, Rows, Cols, Layout> &m, Tv v);
-}
-```
-IMPORTANT: In the current implementation, only the `sub_group` scope is supported.
-
-#### Element Indexing and Piece-Wise Operations
-##### Background
-Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in an SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into:
-
-Class 1- Element-wise operations where the same operation is performed on every element of the matrix, such that the operation can be performed without knowledge of the position of the element within the matrix. Activation functions or adding a constant value to every element of the matrix are two examples.
-
-Class 2- Piece-wise operations where the operation depends on the element index of the matrix or the operation takes multiple elements as operands (such as a sum of all elements in a row, for example). Quantization, which is needed for conversion between low-precision types like `int8_t` and `fp32`, uses piece-wise operations.
-
-// We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result in slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc.) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today, like the ones involved in quantization, but it is not general to any problem that may occur in the future.
-
-##### Explicit conversion with mapping from SIMD to SPMD
-The data elements in a `joint_matrix` are distributed or shared across the work-items in the group in an implementation-defined way. There is no fixed allocation of the matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, in the case of the AMX backend the matrix is a shared entity among the work-items, because the AMX tile that holds the matrix data is a 2d register that is shared among the work-items. The partitioning among the WIs is therefore implementation defined. However, it is necessary to allocate WIs to specific elements of the matrix in order to perform element-wise operations. To be able to perform element-wise operations in a general and efficient way, we provide a conversion function from the `joint_matrix` domain that is owned by a group of work-items to the portion that is owned by each work-item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model.
-
-We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI. The indexing provided inside the `wi_data` class accesses only the portion of the current WI and returns a `wi_element`. The latter holds a reference to the original `joint_matrix` that `wi_data` was constructed from. This means that modifying `wi_data` also modifies the corresponding joint matrix elements. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation.
-
-Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group, as this is implementation defined and changes from one backend to the other.
For general piece-wise operations such as summing the rows of a matrix, the WI data to joint matrix mapping coordinates information must be known in order to reason about the matrix view and extract the relevant piece. However, for element-wise operations, where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` covers the whole matrix.
-
-Therefore, this extension currently only supports class 1 operations, because the mapping between `get_wi_data` and `joint_matrix` elements is not required to be known for these operations. General piece-wise operations will be supported in the future, as a new API will be provided to convey the mapping from the `joint_matrix` domain to the WI domain (see the section "WI data to joint matrix mapping coordinates information for piece-wise operations" for more information).
-
-Also, note that `get_wi_data` cannot return a fixed-size array length, because the length of the WI portion is a runtime variable for the following reasons:
-
-1- The main compilation mode of SYCL is JIT compilation, and the partitioning among WIs is implementation defined.
-
-2- The sub-group size is not generally fixed.
-
-The code listing below shows a synopsis of these new APIs.
-
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-template <typename Group, typename T, use Use,
-          size_t Rows, size_t Cols, layout Layout>
-wi_data<Group, T, Use, Rows, Cols, Layout>
-get_wi_data(Group sg, joint_matrix<Group, T, Use, Rows, Cols, Layout> &Mat);
-
-template <typename Group, typename T, use Use,
-          size_t Rows, size_t Cols, layout Layout>
-class wi_data {
-  size_t length();
-  wi_element<T, Use, Rows, Cols, Layout> operator[](size_t i);
-};
-
-template <typename T, use Use, size_t Rows, size_t Cols, layout Layout>
-class wi_element {
-  operator T();
-  wi_element &operator=(const T &rhs);
-  …
-};
-}
-```
-
-In the following example, `wi_data_c` is a reference to the WI-owned portion of the joint matrix `matC`. As such, `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint matrix `matC`.
-Vectorization along the sub-group dimension will be enabled automatically to vectorize the contiguous portion of the matrix.
-
-
-```c++
-auto wi_data_c = get_wi_data(sg, matC);
-for (int i = 0; i < wi_data_c.length(); i++)
-  wi_data_c[i] *= alpha; // Note that the index "i" here is into the vector owned by a WI, not into the matrix C
-```
-
-IMPORTANT: In the current implementation, only the `sub_group` scope is supported.
-
-IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet.
-
-## Example using int8_t type
-```c++
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-queue q;
-range<2> G = {M/tM, N};
-range<2> L = {1, SG_SIZE};
-int8_t *memA = malloc_shared<int8_t>(M*K, q);
-int8_t *memB = malloc_shared<int8_t>(K*N, q);
-int32_t *memC = malloc_shared<int32_t>(M*N, q);
-q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
-  [[sycl::reqd_sub_group_size(SG_SIZE)]] {
-  const auto global_idx = item.get_global_id(0);
-  const auto global_idy = item.get_global_id(1);
-  const auto sg_startx = global_idx - item.get_local_id(0);
-  const auto sg_starty = global_idy - item.get_local_id(1);
-  sub_group sg = item.get_sub_group();
-  joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;
-  joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB;
-  joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC;
-  joint_matrix_fill(sg, tC, 0);
-  for (int k = 0; k < K; k += tK) {
-    joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K);
-    joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N);
-    tC = joint_matrix_mad(sg, tA, tB, tC);
-  }
-  auto wi_data_c = get_wi_data(sg, tC);
-  for (int i = 0; i < wi_data_c.length(); i++)
-    wi_data_c[i] *= alpha; // The index "i" here is into the vector owned by a WI, not into the matrix C
-  joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major);
-}).wait();
-```
-
-== Query Interface
-Intel AMX, Intel XMX, and Nvidia TPUs support different sizes and types.
-The query interface is used to validate user code and inform users about the types, sizes, scopes, and layouts supported by the implementation.
-This also offers development and tuning productivity to both scientists and library developers.
The query interface we are proposing here is a compile-time query, -so there will be no runtime errors. -The query interface proposed here consists of three functionalities: - -- Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters. - -- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/accumulator` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K. - -- General query: the general query interface provides information about sizes, types, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and accumulator. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The Intel XMX implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. - -The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which they are defined. 
- -[frame="none",options="header"] -|====================== -| Member/type alias in `tpu_params` | Forms they are defined in |Description -|`type_a`| validation, default values|type alias for the type of matrix A -|`type_b`| validation, default values|type alias for the type of matrix B -|`type_accumulator`| validation, default values|type alias for the type of matrix accumulator -|`M`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation -|`N`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation -|`K`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation -|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A -|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B -|`joint_matrix_accumulator`| validation, default values| type alias for `joint_matrix` for matrix accumulator -|numtiles| validation, default values, general query|indicates number of tiles in Intel AMX (does not apply to Intel XMX) -|scopes| validation, default values, general query| indicates the memory and execution scopes supported by the TPU implementation -|`combination` | validation, default values, general query|composes the types and sizes of A, B, accumulator matrices allowed in one combination -|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query| if the TPU implementation supports a continuous number of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete number of element sizes, each of these members has the value zero -|`msize`, `nsize`, `ksize`| validation, default values, general query| if the TPU implementation supports a discrete number of element sizes, each of these members is non-zero, and the value tells one of the supported element sizes. By contrast, if the TPU supports a continuous number of element sizes, each of these members has the value zero -|`atype`, `btype`, `accumulatortype`| validation, default values, general query| indicates the types supported in the combination -|`combinations` | validation, default values, general query| tells the set of supported matrix sizes and types according to the template parameters that are provided. 
In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes so only this specific combination is returned in the combinations array. -|`num_combinations`| validation, default values, general query|indicates number of combinations supported by the TPU implementation which corresponds to the size of the `combinations` array -|====================== - - - -```c++ -namespace sycl::ext::oneapi::experimental::matrix { -template -struct tpu_params; - -// Validation form: Valid or not -// Specialization when both types and sizes are given -template -struct tpu_params< - tpu::amx, Ta, Tb, Tc, sM, sN, sK, - typename std::enable_if<( - !std::is_same_v && !std::is_same_v && - !std::is_same_v && sM != 0 && sN != 0 && sK != 0)>::type> { - // Validate that parameters are supported - static_assert( - (sM == 0 && sN == 0 && sK == 0) || - (is_combination_valid_amx(sM, sN, sK)), - "Invalid parameters for Intel AMX, query valid types and maximum sizes " - "using: " - "tpu_params myparams; and then check out myparams.combinations array"); - - - using type_a = Ta; // this type alias is not available in the current implementation - using type_b = Tb; // this type alias is not available in the current implementation - using type_accumulator = Tc; // this type alias is not available in the current implementation - - // if combination is valid, construct the matrices - - static constexpr std::size_t M = (sM != 0) ? sM : 16; - static constexpr std::size_t N = (sN != 0) ? sN : 16; - static constexpr std::size_t K = - (sK != 0) ? sK : ((sizeof(Ta) == 1) ? 
64 : 32); - - template - using joint_matrix_a = joint_matrix; - template - using joint_matrix_b = joint_matrix; - template - using joint_matrix_accumulator = joint_matrix; - - static constexpr uint32_t numtiles = 8; - static constexpr scope_t scopes[] = {scope_t::sub_group}; - static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); - struct combination { - uint32_t max_msize; - uint32_t max_nsize; - uint32_t max_ksize; - uint32_t msize; - uint32_t nsize; - uint32_t ksize; - matrix_type atype; - matrix_type btype; - matrix_type accumulatortype; - }; - // In this case, the combinations array contains only the combination that the user provided - static constexpr combination combinations[] = { - {16, 16, (sizeof(Ta) == 1) ? 64 : 32, sM, sN, sK}}; - static constexpr int num_combinations = - sizeof(combinations) / sizeof(combination); -}; - -// Default values form: Sizes-only query -// Specialization for when only types are given, need to query only sizes -template -struct tpu_params && - !std::is_same_v && - !std::is_same_v)>::type> { - static_assert((are_types_valid_amx()), - "Invalid types for Intel AMX, supported types are int8_t, uint8_t, " - "and bf16 (Note that unsigned short should be used in the" - "DPC++ code to implement bf16) "); - - using type_a = Ta; // this type alias is not available in the current implementation - using type_b = Tb; // this type alias is not available in the current implementation - using type_accumulator = Tc; // this type alias is not available in the current implementation - - // construct the matrices using the default sizes - static constexpr std::size_t M = 16; - static constexpr std::size_t N = 16; - static constexpr std::size_t K = ((sizeof(Ta) == 1) ? 
64 : 32); - - template - using joint_matrix_a = joint_matrix; - template - using joint_matrix_b = joint_matrix; - template - using joint_matrix_accumulator = joint_matrix; - - static constexpr uint32_t numtiles = 8; - static constexpr scope_t scopes[] = {scope_t::sub_group}; - static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); - struct combination { - uint32_t max_msize; - uint32_t max_nsize; - uint32_t max_ksize; - uint32_t msize; - uint32_t nsize; - uint32_t ksize; - matrix_type atype; - matrix_type btype; - matrix_type accumulatortype; - }; - // In this case, the combinations array contain only the combinations that correspond to the Ta, Tb, and Tc - // types that the user provided - static constexpr combination combinations[] = { - {16, 16, (sizeof(Ta) == 1) ? 64 : 32}}; - static constexpr int num_combinations = - sizeof(combinations) / sizeof(combination); -}; - -// General query form: -// types are not given, no default sizes and no implicit matrix construction -template -struct tpu_params { - static constexpr uint32_t numtiles = 8; - static constexpr scope_t scopes[] = {scope_t::sub_group}; - static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t); - struct combination { - uint32_t max_msize; - uint32_t max_nsize; - uint32_t max_ksize; - uint32_t msize; - uint32_t nsize; - uint32_t ksize; - matrix_type atype; - matrix_type btype; - matrix_type accumulatortype; - }; - - static constexpr combination combinations[] = { - {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32}, - {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32}, - {16, 16, 32, 0, 0,0, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}}; - static constexpr int num_combinations = - sizeof(combinations) / sizeof(combination); -}; - - 
-enum class tpu { - xmx8, - xmx16, - amx -}; - -enum class matrix_type { - bf16, - fp16, - tf32, - fp32, - fp64, - sint2, - sint4, - sint8, - sint16, - sint32, - sint64, - uint2, - uint4, - uint8, - uint16, - uint32, - uint64 -}; - -enum class scope_t { - sub_group, - work_group -}; -} -``` - - -=== Validation Example: -```c++ -// User can provide sizes besides the types and tpu_params can assert if they are supported or not -// in this case, an assertion will happens as 16 is not a supported size for M -using myparams = tpu_params; -size_t NDRangeM = M / myparams::M; //Assertion would happen at this line -size_t NDRangeN = N / myparams::N; -``` - -=== Default Values Example: -```c++ -using myparams = tpu_params_both; -// use this to construct the ranges on the host side -size_t NDRangeM = M / myparams::M; -size_t NDRangeN = N / myparams::N; -//if M, N, K do not multiply the default sizes, padding has to be done -// device code: the matrices are constructed using the default dimensions -myparams::joint_matrix_a sub_a; -myparams::joint_matrix_b sub_b; -myparams::joint_matrix_accumulator sub_c; - -``` - -=== General Query Example: -```c++ -constexpr int M = 1500; // with msize = 8 and msize = 4, - // M can be broken up to 125 sequence of 8-sized ops and remaining 500 using 125 sequence of 4-sized ops -tpu_params params; -constexpr int msize = break_dimension(params, M); -constexpr int msize_remainder = break_dimension_remainder(params, M); -constexpr int nsize = params.combinations[0].nsize; -constexpr int ksize = params.combinations[0].ksize; -// device code: -joint_matrix sub_a; -joint_matrix sub_b; -joint_matrix sub_c; -//Remainder handling -``` - -## Future-looking API - -### Memory scope -The current experimental API uses `joint_` semantics to define the memory scope of the matrix. 
The long term solution is to use the proposed link:../supported/sycl_ext_oneapi_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below. - - -```c++ -multi_ptr, address_space::local_space> tA_ptr = group_local_memory>(sg); -``` -We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet. - -### WI data to joint matrix mapping coordinates information for piece-wise operations -The indexing provided inside the `wi_data` class accesses only the portion of the matrix held by the current WI. It is not possible to know the location of this portion in the original matrix. This coordinates mapping is implementation defined and changes from one backend to the other. For general piece-wise operations like sum of rows of a matrix, the WI data to joint matrix mapping information is needed to reason about the matrix view. -Within the joint matrix extension, we want to write, as much as possible, one code to run on different backends. If backend X states that a WI owns one exact row of the matrix for instance, writing the following code will work only on that backend for that version of hardware. If a different hardware and implementation is used, the same WI may own only half of the row if, for example, the SG size increased. - -```c++ -auto data = get_wi_data(sg, C); -for (int i = 0; i < data.length(); ++i) { - sum_of_local_rows[row] += data[i]; -} -``` - -We want to keep backward compatibility in the joint matrix code when implementations or hardware change. To that end, instead of hard-coding this mapping, we use general backend and target-agnostic functionality, especially in the JIT compilation mode of SYCL. 
For this reason we would like to be able to query this mapping so that code does not have to change from one version to the other.
-
-So, for the mapping problem, since this mapping is implementation defined, one of the proposals is to add runtime functions like:
-```c++
-auto data = get_wi_data(sg, C);
-for (int i = 0; i < data.length(); ++i) {
-  auto [row, col] = data[i].get_coord();
-  sum_of_local_rows[row] += data[i];
-}
-```
-
-=== Appendix: Supported Parameter Combinations Per Hardware
-
-The tables below provide a list of the parameter combinations that
-`joint_matrix` implementations support on each supported vendor's hardware type.
-
-==== Nvidia Tensor Cores Supported Combinations
-
-The complete set of matrix data types and shapes that are supported by the `ext_oneapi_cuda` backend is represented in the following table. Tm indicates the matrix element data type held by a "multiplicand" `joint_matrix`, i.e. one requiring `use::a` or `use::b`. Tc indicates the matrix element data type held by an "accumulator" `joint_matrix`, i.e. one requiring `use::accumulator`.
-
-IMPORTANT: When compiling for the `ext_oneapi_cuda` backend, the target arch backend flag, `-Xsycl-target-backend --cuda-gpu-arch=sm_xx`, must be used, where `sm_xx` must be a Compute Capability that is equal to or greater than the appropriate Minimum Compute Capability. When an executable has been compiled for `sm_xx`, if the executable is run on a device with a compute capability less than `sm_xx`, then an error will be thrown. The mapping to the Minimum Compute Capability for each supported parameter combination is specified in the following table.
- --- -[.center] -|====================== -|Tm (`use::a` or `use::b`) |Tc (`use::accumulator`) |M |N |K | Minimum Compute Capability -.3+|half .3+|float -|16 |16 |16 .6+| sm_70 -|8 |32 |16 -|32 |8 |16 -.3+|half .3+|half -|16 |16 |16 -|8 |32 |16 -|32 |8 |16 -.3+|int8_t .3+|int32_t -|16 |16 |16 .6+| sm_72 -|8 |32 |16 -|32 |8 |16 -.3+|uint8_t .3+|int32_t -|16 |16 |16 -|8 |32 |16 -|32 |8 |16 -|precision::tf32 |float |16 |16 |8 .5+| sm_80 -.3+|bfloat16 .3+|float -|16 |16 |16 -|8 |32 |16 -|32 |8 |16 -|double |double |8 |8 |4 -|====================== --- - -The M, N, K triple from the above table defines the complete set of matrix shapes constructible: --- -[.center] -|====================== -|use |NumRows | NumCols -|a |M |K -|b |K |N -|accumulator | M| N -|====================== --- - -IMPORTANT: The `stride` argument to `joint_matrix_load` and `joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a multiple of 4 when `T` is `float`; where `T` is the type of the `joint_matrix` elements. When `T` is not `half` or `float` there are no restrictions to `stride`. - -## TODO List -- Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. -- Add a more realistic and complete example that shows the value of the general query. - - -## Revision History - -[frame="none",options="header"] -|====================== -|Rev |Date |Author |Changes -|1 |2021-04-13 |Dounia Khaldi |Initial public working draft. 
-|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS -|3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support -|4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the new matrix use parameter and remove reference to the AOT AMX initial implementation -|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia tensor Cores, and move the Intel-specifics to a separate extension document. -|====================== From d63bdb829694a33af96a0a2383f1121cbcb2a590 Mon Sep 17 00:00:00 2001 From: Dounia Date: Thu, 29 Jun 2023 08:07:19 -0700 Subject: [PATCH 40/51] address Greg's comments: change Nvidia table, minor formatting --- sycl/ReleaseNotes.md | 14 +++---- .../sycl_ext_intel_matrix.asciidoc | 4 +- .../sycl_ext_oneapi_matrix.asciidoc | 39 ++++++++++--------- 3 files changed, 29 insertions(+), 28 deletions(-) diff --git a/sycl/ReleaseNotes.md b/sycl/ReleaseNotes.md index 2e4243e7a887d..7c704f9bf3201 100644 --- a/sycl/ReleaseNotes.md +++ b/sycl/ReleaseNotes.md @@ -60,7 +60,7 @@ extension. [1d993446] [4f7787c8] - Implemented `ext::oneapi::experimental::radix_sorter` from the [`sycl_ext_oneapi_group_sort`](doc/extensions/proposed/sycl_ext_oneapi_group_sort.asciidoc) extension proposal. [86ba1809] -- Implemented a new unified interface for the [`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) +- Implemented a new unified interface for the [`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) extension for CUDA. [166bbc36] - Added support for sorting over sub-groups. [168767c6] - Added C++ API wrappers for the Intel math functions `ceil`, `floor`, `rint`, @@ -174,7 +174,7 @@ extension proposal to allow the compiler to determine the initiation interval. 
- Updated the [`sycl_ext_intel_usm_address_spaces`](doc/extensions/supported/sycl_ext_intel_usm_address_spaces.asciidoc) extension to adhere to SYCL 2020 `multi_ptr`. [4a9e9a0e] - Added a new matrix use parameter to `joint_matrix` from the -[`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) +[`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) extension specification. [52f34fd5] - Removed `queue::size` and `queue::get_wait_list` functions from the `sycl_ext_oneapi_queue_status_query` extension due to performance overhead @@ -421,7 +421,7 @@ Release notes for commit range [`4043dda3..0f579bae`](https://github.com/intel/l to mark `has_property` API as `noexcept`. [7805aa3f] - Updated [`sycl_ext_intel_device_info`](doc/extensions/supported/sycl_ext_intel_device_info.md) to support querying free device memory. [0eeef2b3] -- Updated [`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc) +- Updated [`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) with description of new matrix features. 
[770f540d] - Moved [`sycl_ext_oneapi_invoke_simd`](doc/extensions/experimental/sycl_ext_oneapi_invoke_simd.asciidoc) extensions specification from `proposed` to `experimental` because @@ -1067,7 +1067,7 @@ Release notes for commit range 23ca0c2..27f59d8 Level Zero, ESIMD emulator, HIP [2b0ebab376dc] - Added support for `sycl::ext::intel::experimental::esimd_ballot` function [0bbb091c1baa] - - Added initial support for [Tensorcore matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) + - Added initial support for [Tensorcore matrix extension](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) [711ba58c30a8] ### Documentation @@ -1459,7 +1459,7 @@ Release notes for commit range 4fc5ebe..bd68232 - Added [sRGBA support](doc/extensions/supported/sycl_ext_oneapi_srgb.asciidoc) [e488327][191efdd] - Added a preview feature implementation for the DPC++ experimental - [matrix extension](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) + [matrix extension](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) [7f218531] [a95f46d] - Added support for SYCL 2020 exceptions [5c0f748][eef07606][5af8c43d] - Added support for [sycl_ext_intel_bf16_conversion extension](doc/extensions/experimental/sycl_ext_intel_bf16_conversion.asciidoc) @@ -1723,7 +1723,7 @@ Release notes for commit range 6a49170027fb..962909fe9e78 for querying of free device memory in LevelZero backend extension [fa428bf] - Added [InvokeSIMD](doc/extensions/proposed/sycl_ext_oneapi_invoke_simd.asciidoc) and [Uniform](doc/extensions/proposed/sycl_ext_oneapi_uniform.asciidoc) extensions [72e1611] - - Added [Matrix Programming Extension for DPC++ document](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) [ace4c733] + - Added [Matrix Programming Extension for DPC++ document](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) 
[ace4c733] - Implemented SYCL 2020 `sycl::span` [9356d53] - Added [device-if](doc/extensions/proposed/sycl_ext_oneapi_device_if.asciidoc) extension [4fb95fc] @@ -1869,7 +1869,7 @@ Release notes for commit range 6a49170027fb..962909fe9e78 - Fixed build issue when CUDA 11 is used [f7224f1] - Fixed caching of sub-devices in Level Zero backend[4c34f93] - Fixed requesting of USM memory allocation info on CUDA [691f842] - - Fixed [`joint_matrix_mad`](doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) + - Fixed [`joint_matrix_mad`](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) behaviour to return `A*B+C` instead of assigning the result to `C` [ea59c2b] - Workaround an issue in Level Zero backend when event isn't waited upon its completion but is queried for its status in an infinite loop [bfef316] diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc index 69a6f4459a87d..aed4826e7238c 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc @@ -39,7 +39,7 @@ SYCL specification refer to that revision. This extension also depends on the following other SYCL extensions: -* link:../experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc[ +* link:../experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc[ sycl_ext_oneapi_matrix] == Status @@ -201,7 +201,7 @@ joint_matrix_apply(sg, A, [=](T &val, size_t row, size_t col) { ``` === New Device Information Descriptor Besides the query we provide in -link:../experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], +link:../experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], some device descriptors are Intel hardware specific. 
These are provided as part of `ext::intel::experimental::info::device::matrix` namespace: diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index b7ff8eb629473..a5483a394f3ab 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -315,9 +315,9 @@ namespace sycl::ext::oneapi::experimental::matrix { template -void joint_matrix_copy(Group g, joint_matrix &dest, joint_matrix &src); +void joint_matrix_copy(Group g, + joint_matrix &dest, + joint_matrix &src); } // namespace sycl::ext::oneapi::experimental::matrix ``` @@ -589,13 +589,13 @@ using joint_matrix_d;`| type alias for `joint_matrix` for the output matrix accu namespace sycl::ext::oneapi::experimental::matrix { template + typename Td=Tc, size_t sM=0, size_t sN=0, size_t sK=0> struct matrix_params; // This is the validation form, when all template parameters are // specified. template + typename Td, size_t sM, size_t sN, size_t sK> struct matrix_params { // An implementation typically uses static_assert here to trigger a // compilation error when the matrix types or shapes are not @@ -844,13 +844,12 @@ shown in a single column in the table below. |====================== ==== Nvidia Tensor Cores Supported Combinations - The complete set of matrix data types and shapes that are supported by the `ext_oneapi_cuda` backend are represented in the following -table. Tm indicates the matrix element data type held by a -"multiplicand" `joint_matrix`: i.e requiring `use::a` or `use::b`. Tc -indicates the matrix element data type held by an "accumulator" -`joint_matrix`: i.e requiring `use::accumulator`. +table. In this architecture's implementation, +the type of the A matrix must be the same as the type of the B +matrix. 
Also, the type of the C matrix must be the same as the type of the D +matrix. IMPORTANT: When compiling for the `ext_oneapi_cuda` backend the target arch backend flag, `-Xsycl-target-backend --cuda-gpu-arch=sm_xx`, must @@ -864,16 +863,16 @@ supported parameter combination is specified in the following table. -- [.center] |====================== -|Tm (`use::a` or `use::b`) |Tc (`use::accumulator`) |M |N |K | Minimum Compute Capability -.3+|half .3+|float +| A and B type | C and D type | M | N | K | Minimum Compute Capability +.3+| `matrix_type::fp16` .3+| `matrix_type::fp32` |16 |16 |16 .6+| sm_70 |8 |32 |16 |32 |8 |16 -.3+|half .3+|half +.3+| `matrix_type::fp16` .3+| `matrix_type::fp16` |16 |16 |16 |8 |32 |16 |32 |8 |16 -.3+|int8_t .3+|int32_t +.3+| `matrix_type::int8` .3+| `matrix_type::int32` |16 |16 |16 .6+| sm_72 |8 |32 |16 |32 |8 |16 @@ -881,12 +880,12 @@ supported parameter combination is specified in the following table. |16 |16 |16 |8 |32 |16 |32 |8 |16 -|precision::tf32 |float |16 |16 |8 .5+| sm_80 -.3+|bfloat16 .3+|float +| `matrix_type::tf32` | `matrix_type::fp32` |16 |16 |8 .5+| sm_80 +.3+|`matrix_type::bf16` .3+| `matrix_type::fp32` |16 |16 |16 |8 |32 |16 |32 |8 |16 -|double |double |8 |8 |4 +| `matrix_type::fp64` | `matrix_type::fp64` |8 |8 |4 |====================== -- @@ -922,7 +921,9 @@ new matrix use parameter and remove reference to the AOT AMX initial implementation |5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia Tensor Cores, and move -the Intel-specifics to a separate extension document. +the Intel-specifics to a separate extension document |6 |2023-01-09 |Dounia Khaldi |Add `joint_matrix_apply` API, tf32 -type, runtime query, and supported combinations appendix. 
+type, runtime query, and supported combinations appendix for Intel AMX +and Intel XMX +|7 |2023-04-11 |Jack Kirk |Add Nvidia Tensor Cores supported combinations |====================== From 7bfb8e550e0c70268abe0498e089426e124a29cb Mon Sep 17 00:00:00 2001 From: Dounia Date: Thu, 29 Jun 2023 08:12:26 -0700 Subject: [PATCH 41/51] corrected two types in the Nvidia table --- .../sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index a5483a394f3ab..afbbde9db12a8 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -876,7 +876,7 @@ supported parameter combination is specified in the following table. |16 |16 |16 .6+| sm_72 |8 |32 |16 |32 |8 |16 -.3+|uint8_t .3+|int32_t +.3+|`matrix_type::uint8` .3+|`matrix_type::int32` |16 |16 |16 |8 |32 |16 |32 |8 |16 From 08fd2dbc01d154639f4c3453915d0073ad93974e Mon Sep 17 00:00:00 2001 From: Dounia Date: Fri, 28 Jul 2023 07:56:20 -0700 Subject: [PATCH 42/51] address Greg, Jack, and Alexey comments --- sycl/ReleaseNotes.md | 14 ++++----- .../sycl_ext_oneapi_matrix.asciidoc | 30 ++++++------------- 2 files changed, 16 insertions(+), 28 deletions(-) diff --git a/sycl/ReleaseNotes.md b/sycl/ReleaseNotes.md index 7c704f9bf3201..fec0756e8d491 100644 --- a/sycl/ReleaseNotes.md +++ b/sycl/ReleaseNotes.md @@ -60,7 +60,7 @@ extension. [1d993446] [4f7787c8] - Implemented `ext::oneapi::experimental::radix_sorter` from the [`sycl_ext_oneapi_group_sort`](doc/extensions/proposed/sycl_ext_oneapi_group_sort.asciidoc) extension proposal. 
[86ba1809] -- Implemented a new unified interface for the [`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) +- Implemented a new unified interface for the [`sycl_ext_oneapi_matrix`](https://github.com/intel/llvm/blob/7dab76e1d33341b1e6bf339ab933552281abb3e2/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc) extension for CUDA. [166bbc36] - Added support for sorting over sub-groups. [168767c6] - Added C++ API wrappers for the Intel math functions `ceil`, `floor`, `rint`, @@ -174,7 +174,7 @@ extension proposal to allow the compiler to determine the initiation interval. - Updated the [`sycl_ext_intel_usm_address_spaces`](doc/extensions/supported/sycl_ext_intel_usm_address_spaces.asciidoc) extension to adhere to SYCL 2020 `multi_ptr`. [4a9e9a0e] - Added a new matrix use parameter to `joint_matrix` from the -[`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) +[`sycl_ext_oneapi_matrix`](https://github.com/intel/llvm/blob/f2983fc0d8fcd7bd6022a7006ad489c591838041/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) extension specification. [52f34fd5] - Removed `queue::size` and `queue::get_wait_list` functions from the `sycl_ext_oneapi_queue_status_query` extension due to performance overhead @@ -421,7 +421,7 @@ Release notes for commit range [`4043dda3..0f579bae`](https://github.com/intel/l to mark `has_property` API as `noexcept`. [7805aa3f] - Updated [`sycl_ext_intel_device_info`](doc/extensions/supported/sycl_ext_intel_device_info.md) to support querying free device memory. [0eeef2b3] -- Updated [`sycl_ext_oneapi_matrix`](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) +- Updated [`sycl_ext_oneapi_matrix`](https://github.com/intel/llvm/blob/770f540d8b600c8c16df12dfccbf38fa780cf77a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix.asciidoc) with description of new matrix features. 
[770f540d] - Moved [`sycl_ext_oneapi_invoke_simd`](doc/extensions/experimental/sycl_ext_oneapi_invoke_simd.asciidoc) extensions specification from `proposed` to `experimental` because @@ -1067,7 +1067,7 @@ Release notes for commit range 23ca0c2..27f59d8 Level Zero, ESIMD emulator, HIP [2b0ebab376dc] - Added support for `sycl::ext::intel::experimental::esimd_ballot` function [0bbb091c1baa] - - Added initial support for [Tensorcore matrix extension](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) + - Added initial support for [Tensor Cores matrix extension](https://github.com/intel/llvm/blob/f2983fc0d8fcd7bd6022a7006ad489c591838041/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc) [711ba58c30a8] ### Documentation @@ -1459,7 +1459,7 @@ Release notes for commit range 4fc5ebe..bd68232 - Added [sRGBA support](doc/extensions/supported/sycl_ext_oneapi_srgb.asciidoc) [e488327][191efdd] - Added a preview feature implementation for the DPC++ experimental - [matrix extension](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) + [matrix extension](https://github.com/intel/llvm/blob/467ef25a309ec882027052f3d4c3df58c11ee2ac/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc) [7f218531] [a95f46d] - Added support for SYCL 2020 exceptions [5c0f748][eef07606][5af8c43d] - Added support for [sycl_ext_intel_bf16_conversion extension](doc/extensions/experimental/sycl_ext_intel_bf16_conversion.asciidoc) @@ -1723,7 +1723,7 @@ Release notes for commit range 6a49170027fb..962909fe9e78 for querying of free device memory in LevelZero backend extension [fa428bf] - Added [InvokeSIMD](doc/extensions/proposed/sycl_ext_oneapi_invoke_simd.asciidoc) and [Uniform](doc/extensions/proposed/sycl_ext_oneapi_uniform.asciidoc) extensions [72e1611] - - Added [Matrix Programming Extension for DPC++ document](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) [ace4c733] + - Added [Matrix 
Programming Extension for DPC++ document](https://github.com/intel/llvm/blob/ce12ec028681aa90133c518126014b0881d9e6bc/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc) [ace4c733] - Implemented SYCL 2020 `sycl::span` [9356d53] - Added [device-if](doc/extensions/proposed/sycl_ext_oneapi_device_if.asciidoc) extension [4fb95fc] @@ -1869,7 +1869,7 @@ Release notes for commit range 6a49170027fb..962909fe9e78 - Fixed build issue when CUDA 11 is used [f7224f1] - Fixed caching of sub-devices in Level Zero backend[4c34f93] - Fixed requesting of USM memory allocation info on CUDA [691f842] - - Fixed [`joint_matrix_mad`](doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc) + - Fixed [`joint_matrix_mad`](https://github.com/intel/llvm/blob/ce12ec028681aa90133c518126014b0881d9e6bc/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc) behaviour to return `A*B+C` instead of assigning the result to `C` [ea59c2b] - Workaround an issue in Level Zero backend when event isn't waited upon its completion but is queried for its status in an infinite loop [bfef316] diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index afbbde9db12a8..a71d2c820eeee 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -69,7 +69,7 @@ Joint matrix is a SYCL extension for matrix hardware programming. It unifies targets like Intel AMX in CPUs, Intel XMX in Intel GPUs and Nvidia Tensor Cores. This provides a portable and performant API for users who want to build their own neural networks applications, -perform custom optimzations, or experiment with new operations in a +perform custom optimizations, or experiment with new operations in a timely and performing manner. 
== Specification @@ -138,9 +138,10 @@ joint_matrix tA; ==== Element Type The `T` template parameter specifies the type of each element in the matrix. Each device supports only certain element types, so the -application must use the query operations (defined below) to ensure -that the element type is supported on the device where the kernel -using this `joint_matrix` runs. +application must ensure that the element type is supported on the +device where the kernel using this joint_matrix runs. The query +functions (defined below) may be used to determine the set of element +types that are supported on a device. ==== Matrix Use The main operation performed by the matrix hardware is `D=C+A*B`. The @@ -165,7 +166,7 @@ enum class use { ``` ==== Matrix Shape -The `Rows` and `Cols` template parameters tell the number of rows and +The `Rows` and `Cols` template parameters provide the number of rows and columns in the joint matrix. Each device supports only certain combinations of row and column sizes, so the application must use the query operations (defined below) to ensure that the matrix shape is @@ -322,7 +323,7 @@ void joint_matrix_copy(Group g, } // namespace sycl::ext::oneapi::experimental::matrix ``` This function copies `Rows x Cols` elements of type `T` from joint -matrix `src` to joint matrix `dest`. The two matrcies must have the +matrix `src` to joint matrix `dest`. The two matrices must have the same scope, type, shape, and layout. Use can be different so this function converts between different `use` of matrices. @@ -860,8 +861,8 @@ a device with compute capability less than `sm_xx` then an error will be thrown. The mapping to Minimum Compute Capability from each supported parameter combination is specified in the following table. 
--- -[.center] + +[frame="none",options="header"] |====================== | A and B type | C and D type | M | N | K | Minimum Compute Capability .3+| `matrix_type::fp16` .3+| `matrix_type::fp32` @@ -887,19 +888,6 @@ supported parameter combination is specified in the following table. |32 |8 |16 | `matrix_type::fp64` | `matrix_type::fp64` |8 |8 |4 |====================== --- - -The M, N, K triple from the above table defines the complete set of -matrix shapes constructible: --- -[.center] -|====================== -|use |NumRows | NumCols -|a |M |K -|b |K |N -|accumulator | M| N -|====================== --- IMPORTANT: The `stride` argument to `joint_matrix_load` and `joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a From d7d0a70090e9b0448c49e8152c25e8880f83ff31 Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 31 Jul 2023 08:29:57 -0700 Subject: [PATCH 43/51] Clarify use of must when referring to the query interface --- .../sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index a71d2c820eeee..0cb5e0b48f2a8 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -166,11 +166,13 @@ enum class use { ``` ==== Matrix Shape -The `Rows` and `Cols` template parameters provide the number of rows and -columns in the joint matrix. Each device supports only certain -combinations of row and column sizes, so the application must use the -query operations (defined below) to ensure that the matrix shape is -supported on the device where the kernel using this `joint_matrix` runs. +The `Rows` and `Cols` template parameters provide the number of rows +and columns in the joint matrix. 
Each device supports only certain +combinations of row and column sizes, so the application must ensure +that the combination is supported on the device where the kernel using +this `joint_matrix` runs. The query functions (defined below) may be +used to determine the set of combinations that are supported on a +device. ==== Matrix Layout The `Layout` template parameter specifies the memory layout of the From bf8e00c1c7645b09c575e150e81137b8a8501d82 Mon Sep 17 00:00:00 2001 From: Dounia Date: Wed, 2 Aug 2023 11:49:29 -0700 Subject: [PATCH 44/51] Address Greg's comments: fix 2 broken lines, const multi_ptr, line wrap --- .../sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc | 5 ++--- .../sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc | 10 ++++++---- 2 files changed, 8 insertions(+), 7 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc index aed4826e7238c..c01241b25afcb 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc @@ -39,8 +39,7 @@ SYCL specification refer to that revision. This extension also depends on the following other SYCL extensions: -* link:../experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc[ - sycl_ext_oneapi_matrix] +* link:sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix] == Status This is an experimental extension specification, intended to provide early @@ -201,7 +200,7 @@ joint_matrix_apply(sg, A, [=](T &val, size_t row, size_t col) { ``` === New Device Information Descriptor Besides the query we provide in -link:../experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], +link:sycl_ext_oneapi_matrix.asciidoc[sycl_ext_oneapi_matrix], some device descriptors are Intel hardware specific. 
These are provided as part of `ext::intel::experimental::info::device::matrix` namespace: diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index 0cb5e0b48f2a8..7328f68709680 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -412,7 +412,7 @@ template &res, - multi_ptr src, size_t stride, layout Layout); + multi_ptr src, size_t stride, layout Layout); // Only available when Layout != layout::dynamic template void joint_matrix_load(Group g, joint_matrix &res, - multi_ptr src, size_t stride); + multi_ptr src, size_t stride); } // namespace sycl::ext::oneapi::experimental::matrix ``` @@ -583,9 +583,11 @@ using joint_matrix_a;`| type alias for `joint_matrix` for matrix A |`template + using joint_matrix_b;`| type alias for `joint_matrix` for matrix B |`template + -using joint_matrix_c;`| type alias for `joint_matrix` for the input matrix accumulator +using joint_matrix_c;`| type alias for `joint_matrix` for the input +matrix accumulator |`template + -using joint_matrix_d;`| type alias for `joint_matrix` for the output matrix accumulator +using joint_matrix_d;`| type alias for `joint_matrix` for the output +matrix accumulator |====================== ```c++ From 84af291b30998b83e42695c61f410b8410c1481f Mon Sep 17 00:00:00 2001 From: Dounia Date: Wed, 2 Aug 2023 12:02:23 -0700 Subject: [PATCH 45/51] Add clarifications about joint_matrix_copy --- .../sycl_ext_oneapi_matrix.asciidoc | 21 ++++++++++++------- 1 file changed, 13 insertions(+), 8 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index 7328f68709680..04af15f6e99e3 100644 --- 
a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -316,18 +316,23 @@ void joint_matrix_fill(Group g, joint_matrix +template void joint_matrix_copy(Group g, - joint_matrix &dest, - joint_matrix &src); + joint_matrix &src, + joint_matrix &dest); } // namespace sycl::ext::oneapi::experimental::matrix ``` -This function copies `Rows x Cols` elements of type `T` from joint -matrix `src` to joint matrix `dest`. The two matrices must have the -same scope, type, shape, and layout. Use can be different so this -function converts between different `use` of matrices. +This function copies `Rows x Cols` elements of type `T1` from joint +matrix `src` to `Rows x Cols` elements of type `T2` of joint matrix +`dest`. The two matrices must have the same scope, shape, and +layout. Use and type can be different so this function converts +between different `use` of matrices. The application must ensure that +the resulting matrix type is supported on the device where the kernel +using this `joint_matrix` runs. The query functions (defined below) +may be used to determine the set of combinations that are supported on +a device. 
==== Element-Wise Operations Besides matrix multiply and add, this extension aims to make it From 2c2af7d7bb2d48fbc97d47483ebe3e8fed3e877c Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Mon, 7 Aug 2023 09:11:24 -0700 Subject: [PATCH 46/51] Add non const overload to tf32 load as implicit conversion for multi_ptr is not supported --- .../sycl_ext_oneapi_matrix.asciidoc | 21 ++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index 04af15f6e99e3..8492ac6bd16b0 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -328,11 +328,7 @@ This function copies `Rows x Cols` elements of type `T1` from joint matrix `src` to `Rows x Cols` elements of type `T2` of joint matrix `dest`. The two matrices must have the same scope, shape, and layout. Use and type can be different so this function converts -between different `use` of matrices. The application must ensure that -the resulting matrix type is supported on the device where the kernel -using this `joint_matrix` runs. The query functions (defined below) -may be used to determine the set of combinations that are supported on -a device. +between different `use` of matrices. 
==== Element-Wise Operations Besides matrix multiply and add, this extension aims to make it @@ -419,6 +415,13 @@ void joint_matrix_load(Group g, layout::dynamic> &res, multi_ptr src, size_t stride, layout Layout); +template +void joint_matrix_load(Group g, + joint_matrix &res, + multi_ptr src, size_t stride, layout Layout); + // Only available when Layout != layout::dynamic template &res, multi_ptr src, size_t stride); +// Only available when Layout != layout::dynamic +template +void joint_matrix_load(Group g, + joint_matrix &res, + multi_ptr src, size_t stride); + } // namespace sycl::ext::oneapi::experimental::matrix ``` From e8bde89e839761a2024883265b3fc00f74244d95 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Wed, 9 Aug 2023 09:39:36 -0700 Subject: [PATCH 47/51] minor clarification --- .../experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc index c01241b25afcb..d6de22bdae391 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_intel_matrix.asciidoc @@ -323,5 +323,5 @@ q.wait(); |====================== |Rev |Date |Author |Changes |1 |2022-11-07 |Dounia Khaldi |Add Intel-specific store API, -layout information, and per-element access with coordinates API +layout information, and `joint_matrix_apply` with coordinates API |====================== From a7f92ce2dc006914878f8dcc913afcb9a8a97213 Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Wed, 23 Aug 2023 07:05:16 -0700 Subject: [PATCH 48/51] fix width of query table --- .../experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc | 1 + 1 file changed, 1 insertion(+) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc 
b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index 8492ac6bd16b0..d7dbd3e21760d 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -883,6 +883,7 @@ supported parameter combination is specified in the following table. [frame="none",options="header"] +[cols="40%, 60%"] |====================== | A and B type | C and D type | M | N | K | Minimum Compute Capability .3+| `matrix_type::fp16` .3+| `matrix_type::fp32` From 789b59341398ad13c192317d67be6c81fbfdb10b Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Fri, 25 Aug 2023 12:47:16 -0700 Subject: [PATCH 49/51] fix the width for the right table --- .../sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index d7dbd3e21760d..0e1688c96a603 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -576,7 +576,7 @@ The table below provides a description for each of the member variables in `matrix_params` class and the forms in which they are defined. -[frame="none",options="header"] +[frame="none",options="header",cols="40%,60%"] |====================== | Member/type alias in `matrix_params` | Description |`M`|when no sizes are provided by the user, indicates the suggested @@ -883,7 +883,6 @@ supported parameter combination is specified in the following table. 
[frame="none",options="header"] -[cols="40%, 60%"] |====================== | A and B type | C and D type | M | N | K | Minimum Compute Capability .3+| `matrix_type::fp16` .3+| `matrix_type::fp32` From ee282500c36070319828ff5d62e585f5fa5e887d Mon Sep 17 00:00:00 2001 From: Greg Lueck Date: Fri, 25 Aug 2023 16:03:47 -0400 Subject: [PATCH 50/51] Avoid line breaks in table by using source block --- .../sycl_ext_oneapi_matrix.asciidoc | 65 +++++++++++++++---- 1 file changed, 52 insertions(+), 13 deletions(-) diff --git a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc index 0e1688c96a603..94c2bebe04906 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -579,31 +579,70 @@ defined. [frame="none",options="header",cols="40%,60%"] |====================== | Member/type alias in `matrix_params` | Description -|`M`|when no sizes are provided by the user, indicates the suggested +a| +[source] +---- +static constexpr size_t M +---- +|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation -|`N`|when no sizes are provided by the user, indicates the suggested + +a| +[source] +---- +static constexpr size_t N +---- +|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. 
In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation -|`K`| when no sizes are provided by the user, indicates the suggested + +a| +[source] +---- +static constexpr size_t K +---- +|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation -|`template + -using joint_matrix_a;`| type alias for `joint_matrix` for matrix A -|`template + -using joint_matrix_b;`| type alias for `joint_matrix` for matrix B -|`template + -using joint_matrix_c;`| type alias for `joint_matrix` for the input -matrix accumulator -|`template + -using joint_matrix_d;`| type alias for `joint_matrix` for the output -matrix accumulator + +a| +[source] +---- +template +using joint_matrix_a +---- +|type alias for `joint_matrix` for matrix A + +a| +[source] +---- +template +using joint_matrix_b +---- +|type alias for `joint_matrix` for matrix B + +a| +[source] +---- +template +using joint_matrix_c +---- +|type alias for `joint_matrix` for the input matrix accumulator + +a| +[source] +---- +template +using joint_matrix_d +---- +|type alias for `joint_matrix` for the output matrix accumulator |====================== ```c++ From 2d80d16a2a2fdd9d306f82c5e0609f64708900cf Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Mon, 28 Aug 2023 08:25:12 -0700 Subject: [PATCH 51/51] add the conflicted file first in order to resolve the conflict --- .../sycl_ext_oneapi_matrix.asciidoc | 650 ++++++++++++++++++ 1 file changed, 650 insertions(+) create mode 100644 sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc 
b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc new file mode 100644 index 0000000000000..d93f1a1598ded --- /dev/null +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_matrix/sycl_ext_oneapi_matrix.asciidoc @@ -0,0 +1,650 @@ +# Matrix Programming Extension for DPC++: sycl_ext_oneapi_matrix +:source-highlighter: coderay +:coderay-linenums-mode: table +:dpcpp: pass:[DPC++] + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en + +:blank: pass:[ +] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +Copyright (c) 2021-2022 Intel Corporation. All rights reserved. + +NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are +trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. +used by permission by Khronos. + +This extension is written against the SYCL 2020 revision 5 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + + +**_NOTE:_** _This document describes the current design and API for the matrix +extension to {dpcpp}. This is an initial experimental version to try out functionality +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support for the matrix interface on Intel(R) Advanced Matrix Extensions (Intel(R) AMX), Intel(R) Xe Matrix Extensions (Intel(R) XMX) and Nvidia(R) Tensor Cores._ + +## Introduction +This document presents ongoing work towards defining a unified matrix interface.
+This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Intel XMX in Intel GPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All of these hardware units provide low-level intrinsics or assembly for accessing and performing matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance these different hardware implementations can offer.
+
+## Feature test macro
+
+This extension provides a feature-test macro as described in the core SYCL
+specification section 6.3.3 "Feature test macros". Therefore, an
+implementation supporting this extension must predefine the macro
+`SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the table below.
+Applications can test for the existence of this macro to determine if the
+implementation supports this feature, or applications can test the macro's
+value to determine which of the extension's APIs the implementation supports.
+
+[frame="none",options="header"]
+|======================
+|Value |Description
+|1 |The APIs of this experimental extension are not versioned, so the feature-test macro always has this value.
+|======================
+
+## Matrix API Versions
+
+While this document presents the core API that unifies Intel AMX, Intel XMX, and Nvidia Tensor Cores, the implementations support slightly different versions of the API. For this reason, we introduce a new macro, namely `SYCL_EXT_ONEAPI_MATRIX_VERSION`, to distinguish between these different implementations. The goal in the next few months is to get rid of this implementation versioning macro. These are the current values for this macro.
+
+[frame="none",options="header"]
+|======================
+|Value |Description
+|1 |Initial extension JIT implementation on Intel AMX and Intel XMX. Load, store, mad, fill, piece-wise operations, and the query interface are supported.
+The old API used for this implementation is detailed in link:../../deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc[matrix extension]
+|2 |JIT implementation on Intel AMX and Intel XMX. Load, store, mad, fill, piece-wise operations, and the query interface are supported
+|3 |Implementation on Nvidia Tensor Cores
+|======================
+
+## New `joint_matrix` class
+We introduce a new class called `joint_matrix`. The user needs to specify the group memory scope, the type of the elements, the shape, the matrix use, and the memory layout of the matrix. This results in the following description:
+
+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+template <typename Group, typename T, use Use, size_t Rows, size_t Cols,
+          layout Layout = layout::dynamic>
+struct joint_matrix {
+  joint_matrix() {}
+};
+}
+```
+
+IMPORTANT: Matrix layout defaulting to `layout::dynamic` applies only to matrices with `use::accumulator`
+
+#### Use
+Specifying the usage of the matrix (matrix left (A), matrix right (B), or accumulator (C)) is required by backend implementations to reason about the layout of the matrix in registers.
+
+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+enum class use {
+  a,
+  b,
+  accumulator
+};
+}
+```
+
+#### Shape
+The shape of a `joint_matrix` refers to its number of rows `Rows` and number of columns `Cols`.
+
+#### Layout
+This specifies the memory layout, which can be row major or column major.
+
+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+enum class layout {
+  row_major,
+  col_major,
+  dynamic
+};
+}
+```
+
+
+#### Group Memory Scope
+In this API, we use the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The group scope is added as an additional template parameter and is also part of the constructor arguments.
+
+IMPORTANT: In the current implementation, only the `sub_group` scope is supported
+
+When the group is a `sycl::sub_group`, a matrix is declared as follows:
+
+```c++
+joint_matrix<sub_group, int8_t, use::a, tM, tK> tA;
+```
+
+
+## Matrix Operations and their Execution Scope
+We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation. This set of functions can be easily extended if the matrix hardware implements new features.
+
+Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group.
+
+To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following:
+
+IMPORTANT: In the current implementation, only the `sub_group` scope is supported.
+
+#### Load
+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+  template <typename Group, typename T, size_t Rows, size_t Cols,
+            access::address_space Space>
+  void joint_matrix_load(Group sg,
+    joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res,
+    multi_ptr<T, Space> src, size_t stride, layout Layout);
+
+  template <typename Group, typename T, use Use, size_t Rows, size_t Cols,
+            layout Layout, access::address_space Space>
+  void joint_matrix_load(Group sg,
+    joint_matrix<Group, T, Use, Rows, Cols, Layout> &res,
+    multi_ptr<T, Space> src, size_t stride);
+}
+```
+
+`joint_matrix_load` loads data from memory to the 2d tiles/registers of the tensor hardware.
+We define two overloads of the load function depending on whether the memory layout was declared as part of the `joint_matrix` type or not.
+The first overload that takes memory layout as an argument is only available for a `joint_matrix` type that used the default value `layout::dynamic`.
+The second overload without a memory layout must not be used with a `joint_matrix` type that used the default value `layout::dynamic`.
+
+The base pointer `src` here determines the starting address of the matrix to be loaded from. `Layout` determines whether the data is being read in a row major (`row_major`) or column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout.
+
+
+#### Store
+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+  template <typename Group, typename T, size_t Rows, size_t Cols,
+            access::address_space Space>
+  void joint_matrix_store(Group sg,
+    joint_matrix<Group, T, use::accumulator, Rows, Cols, layout::dynamic> &res,
+    multi_ptr<T, Space> dest, size_t stride, layout Layout);
+}
+```
+This function stores the data in the accumulator matrix from the 2d tiles back to memory.
+
+The base pointer `dest` here determines the starting address of the matrix to be stored. `Layout` determines whether the data is being written in a row major (`row_major`) or column major (`col_major`) fashion. `stride` describes the number of elements between consecutive rows for the row major layout, or between columns for the column major layout.
+
+
+#### Multiply and Add
+
+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+  template <typename Group, typename Ta, typename Tb, typename Tc,
+            size_t M, size_t K, size_t N, layout LayoutA, layout LayoutB>
+  joint_matrix<Group, Tc, use::accumulator, M, N> joint_matrix_mad(Group sg,
+    joint_matrix<Group, Ta, use::a, M, K, LayoutA> A,
+    joint_matrix<Group, Tb, use::b, K, N, LayoutB> B,
+    joint_matrix<Group, Tc, use::accumulator, M, N> C);
+}
+```
+The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C`, and returns the result.
+
+
+#### Matrix Initialization: `joint_matrix_fill`
+The current interface presented above assumes that all the matrices are directly loaded from memory. This new function called `joint_matrix_fill` makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register.
+On Intel AMX, if the initialization constant is zero, this would map to the `_tile_zero` intrinsic:
+
+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+  template <typename Group, typename T, use Use, size_t Rows, size_t Cols,
+            layout Layout, typename Tv>
+  void joint_matrix_fill(Group sg,
+    joint_matrix<Group, T, Use, Rows, Cols, Layout> &m, Tv v);
+}
+```
+IMPORTANT: In the current implementation, only the `sub_group` scope is supported.
+
+#### Element Indexing and Piece-Wise Operations
+##### Background
+Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in an SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into:
+
+Class 1- Element-wise operations where the same operation is performed on every element of the matrix, such that the operation can be performed without knowledge of the position of the element within the matrix. Activation functions or adding a constant value to every element of the matrix are two examples.
+
+Class 2- Piece-wise operations where the operation depends on the element index of the matrix or the operation takes multiple elements as operands (such as a sum of all elements in a row for example). Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations.
+
+// We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result in slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future.
+
+##### Explicit conversion with mapping from SIMD to SPMD
+The data elements in a `joint_matrix` are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the work items in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. However, it is necessary to allocate WIs to specific elements of the matrix in order to perform element-wise operations. To be able to perform element-wise operations in a general and efficient way, we provide a conversion function from the `joint_matrix` domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model.
+
+We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI. The indexing provided inside the `wi_data` class accesses only the portion of the current WI and returns `wi_element`. The latter holds a reference to the original `joint_matrix` that `wi_data` was constructed from. This means that modifying `wi_data` also modifies the corresponding joint matrix elements. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation.
+
+Using `get_wi_data`, it is not possible to know which portions of data are owned by each WI in the group as this is implementation defined and changes from one backend to the other.
+For general piece-wise operations such as summing the rows of a matrix, the WI data to joint matrix mapping coordinates information must be known in order to reason about the matrix view and extract the relevant piece. However, for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees the whole matrix element-wise operation.
+
+Therefore, this extension currently only supports class 1 operations because the mapping between `get_wi_data` and `joint_matrix` elements is not required to be known for these operations. However, general piece-wise operations will be supported in the future as a new API will be provided to convey the mapping from the `joint_matrix` domain to the WI domain (see Section "WI data to joint matrix mapping coordinates information for piece-wise operations" for more information).
+
+Also, note that `get_wi_data` cannot return a fixed-size array because the length of the WI portion is a runtime variable for the following reasons:
+
+1- The main compilation mode of SYCL is JIT compilation and partitioning among WIs is implementation defined.
+
+2- The sub-group size is not generally fixed.
+
+The code listing below shows a synopsis of these new APIs.
+
+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+template <typename Group, typename T, use Use, size_t Rows, size_t Cols,
+          layout Layout>
+wi_data<Group, T, Use, Rows, Cols, Layout>
+get_wi_data(Group sg, joint_matrix<Group, T, Use, Rows, Cols, Layout> &Mat);
+
+template <typename Group, typename T, use Use, size_t Rows, size_t Cols,
+          layout Layout>
+class wi_data {
+  size_t length();
+  wi_element<T, Use, Rows, Cols, Layout, Group> operator[](size_t i);
+};
+template <typename T, use Use, size_t Rows, size_t Cols, layout Layout,
+          typename Group>
+class wi_element {
+  operator T();
+  wi_element &operator=(const T &rhs);
+…
+};
+}
+```
+
+In the following example `wi_data_c` is a reference to the WI owned portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` updates the corresponding matrix element in the `joint_matrix` `matC`.
+Vectorization along the sub-group dimension will get enabled automatically to vectorize the contiguous portion of the matrix.
+
+
+```c++
+auto wi_data_c = get_wi_data(sg, matC);
+for (int i = 0; i < wi_data_c.length(); i++)
+  wi_data_c[i] *= alpha; // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C
+```
+
+IMPORTANT: In the current implementation, only the `sub_group` scope is supported.
+
+IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet.
+
+## Example using int8_t type
+```c++
+using namespace sycl::ext::oneapi::experimental::matrix;
+
+queue q;
+range<2> G = {M/tM, N};
+range<2> L = {1, SG_SIZE};
+int8_t *memA = malloc_shared<int8_t>(M*K, q);
+int8_t *memB = malloc_shared<int8_t>(K*N, q);
+int32_t *memC = malloc_shared<int32_t>(M*N, q);
+q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
+  [[sycl::reqd_sub_group_size(SG_SIZE)]] {
+   const auto global_idx = item.get_global_id(0);
+   const auto global_idy = item.get_global_id(1);
+   const auto sg_startx = global_idx - item.get_local_id(0);
+   const auto sg_starty = global_idy - item.get_local_id(1);
+   sub_group sg = item.get_sub_group();
+   joint_matrix<sub_group, int8_t, use::a, tM, tK, layout::row_major> tA;
+   joint_matrix<sub_group, int8_t, use::b, tK, tN, layout::row_major> tB;
+   joint_matrix<sub_group, int32_t, use::accumulator, tM, tN> tC;
+   joint_matrix_fill(sg, tC, 0);
+   for (int k = 0; k < K; k += tK) {
+     joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K);
+     joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN, N);
+     tC = joint_matrix_mad(sg, tA, tB, tC);
+   }
+   auto wi_data_c = get_wi_data(sg, tC);
+   for (int i = 0; i < wi_data_c.length(); i++)
+     wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C
+   joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, layout::row_major);
+}).wait();
+```
+
+== Query Interface
+Intel AMX, Intel XMX and Nvidia TPUs support different sizes and types.
+The query interface is used to validate user code and inform users about the types, sizes, scopes, and layouts supported by the implementation.
+It also improves development and tuning productivity for both scientists and library developers.
The query interface we are proposing here is a compile-time query, +so there will be no runtime errors. +The query interface proposed here consists of three functionalities: + +- Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters. + +- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/accumulator` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K. + +- General query: the general query interface provides information about sizes, types, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and accumulator. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The Intel XMX implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. + +The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which they are defined. 
+ +[frame="none",options="header"] +|====================== +| Member/type alias in `tpu_params` | Forms they are defined in |Description +|`type_a`| validation, default values|type alias for the type of matrix A +|`type_b`| validation, default values|type alias for the type of matrix B +|`type_accumulator`| validation, default values|type alias for the type of matrix accumulator +|`M`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation +|`N`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation +|`K`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. 
+In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation
+|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A
+|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B
+|`joint_matrix_accumulator`| validation, default values| type alias for `joint_matrix` for matrix accumulator
+|`numtiles`| validation, default values, general query|indicates the number of tiles in Intel AMX (does not apply to Intel XMX)
+|`scopes`| validation, default values, general query| indicates the memory and execution scopes supported by the TPU implementation
+|`combination` | validation, default values, general query|composes the types and sizes of A, B, accumulator matrices allowed in one combination
+|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query| if the TPU implementation supports a continuous number of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete number of element sizes, each of these members has the value zero
+|`msize`, `nsize`, `ksize`| validation, default values, general query| if the TPU implementation supports a discrete number of element sizes, each of these members is non-zero, and each value gives one of the supported element sizes. By contrast, if the TPU supports a continuous number of element sizes, each of these members has the value zero
+|`atype`, `btype`, `accumulatortype`| validation, default values, general query| indicates the types supported in the combination
+|`combinations` | validation, default values, general query| tells the set of supported matrix sizes and types according to the template parameters that are provided.
+In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes so only this specific combination is returned in the combinations array.
+|`num_combinations`| validation, default values, general query|indicates the number of combinations supported by the TPU implementation, which corresponds to the size of the `combinations` array
+|======================
+
+
+
+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+template <tpu u, typename Ta = void, typename Tb = void, typename Tc = void,
+          int sM = 0, int sN = 0, int sK = 0, typename Enabled = void>
+struct tpu_params;
+
+// Validation form: Valid or not
+// Specialization when both types and sizes are given
+template <typename Ta, typename Tb, typename Tc, int sM, int sN, int sK>
+struct tpu_params<
+    tpu::amx, Ta, Tb, Tc, sM, sN, sK,
+    typename std::enable_if<(
+        !std::is_same_v<Ta, void> && !std::is_same_v<Tb, void> &&
+        !std::is_same_v<Tc, void> && sM != 0 && sN != 0 && sK != 0)>::type> {
+  // Validate that parameters are supported
+  static_assert(
+      (sM == 0 && sN == 0 && sK == 0) ||
+          (is_combination_valid_amx<Ta, Tb, Tc>(sM, sN, sK)),
+      "Invalid parameters for Intel AMX, query valid types and maximum sizes "
+      "using: "
+      "tpu_params<tpu::amx> myparams; and then check out myparams.combinations array");
+
+
+  using type_a = Ta; // this type alias is not available in the current implementation
+  using type_b = Tb; // this type alias is not available in the current implementation
+  using type_accumulator = Tc; // this type alias is not available in the current implementation
+
+  // if combination is valid, construct the matrices
+
+  static constexpr std::size_t M = (sM != 0) ? sM : 16;
+  static constexpr std::size_t N = (sN != 0) ? sN : 16;
+  static constexpr std::size_t K =
+      (sK != 0) ? sK : ((sizeof(Ta) == 1) ? 64 : 32);
+
+  template <typename Group, layout Layout>
+  using joint_matrix_a = joint_matrix<Group, Ta, use::a, M, K, Layout>;
+  template <typename Group, layout Layout>
+  using joint_matrix_b = joint_matrix<Group, Tb, use::b, K, N, Layout>;
+  template <typename Group>
+  using joint_matrix_accumulator = joint_matrix<Group, Tc, use::accumulator, M, N>;
+
+  static constexpr uint32_t numtiles = 8;
+  static constexpr scope_t scopes[] = {scope_t::sub_group};
+  static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t);
+  struct combination {
+    uint32_t max_msize;
+    uint32_t max_nsize;
+    uint32_t max_ksize;
+    uint32_t msize;
+    uint32_t nsize;
+    uint32_t ksize;
+    matrix_type atype;
+    matrix_type btype;
+    matrix_type accumulatortype;
+  };
+  // In this case, the combinations array contains only the combination that the user provided
+  static constexpr combination combinations[] = {
+      {16, 16, (sizeof(Ta) == 1) ? 64 : 32, sM, sN, sK}};
+  static constexpr int num_combinations =
+      sizeof(combinations) / sizeof(combination);
+};
+
+// Default values form: Sizes-only query
+// Specialization for when only types are given, need to query only sizes
+template <typename Ta, typename Tb, typename Tc>
+struct tpu_params<tpu::amx, Ta, Tb, Tc, 0, 0, 0,
+                  typename std::enable_if<(!std::is_same_v<Ta, void> &&
+                                           !std::is_same_v<Tb, void> &&
+                                           !std::is_same_v<Tc, void>)>::type> {
+  static_assert((are_types_valid_amx<Ta, Tb, Tc>()),
+                "Invalid types for Intel AMX, supported types are int8_t, uint8_t, "
+                "and bf16 (Note that unsigned short should be used in the"
+                "DPC++ code to implement bf16) ");
+
+  using type_a = Ta; // this type alias is not available in the current implementation
+  using type_b = Tb; // this type alias is not available in the current implementation
+  using type_accumulator = Tc; // this type alias is not available in the current implementation
+
+  // construct the matrices using the default sizes
+  static constexpr std::size_t M = 16;
+  static constexpr std::size_t N = 16;
+  static constexpr std::size_t K = ((sizeof(Ta) == 1) ? 64 : 32);
+
+  template <typename Group, layout Layout>
+  using joint_matrix_a = joint_matrix<Group, Ta, use::a, M, K, Layout>;
+  template <typename Group, layout Layout>
+  using joint_matrix_b = joint_matrix<Group, Tb, use::b, K, N, Layout>;
+  template <typename Group>
+  using joint_matrix_accumulator = joint_matrix<Group, Tc, use::accumulator, M, N>;
+
+  static constexpr uint32_t numtiles = 8;
+  static constexpr scope_t scopes[] = {scope_t::sub_group};
+  static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t);
+  struct combination {
+    uint32_t max_msize;
+    uint32_t max_nsize;
+    uint32_t max_ksize;
+    uint32_t msize;
+    uint32_t nsize;
+    uint32_t ksize;
+    matrix_type atype;
+    matrix_type btype;
+    matrix_type accumulatortype;
+  };
+  // In this case, the combinations array contains only the combinations that correspond to the Ta, Tb, and Tc
+  // types that the user provided
+  static constexpr combination combinations[] = {
+      {16, 16, (sizeof(Ta) == 1) ? 64 : 32}};
+  static constexpr int num_combinations =
+      sizeof(combinations) / sizeof(combination);
+};
+
+// General query form:
+// types are not given, no default sizes and no implicit matrix construction
+template <int sM, int sN, int sK>
+struct tpu_params<tpu::amx, void, void, void, sM, sN, sK> {
+  static constexpr uint32_t numtiles = 8;
+  static constexpr scope_t scopes[] = {scope_t::sub_group};
+  static constexpr int num_scopes = sizeof(scopes) / sizeof(scope_t);
+  struct combination {
+    uint32_t max_msize;
+    uint32_t max_nsize;
+    uint32_t max_ksize;
+    uint32_t msize;
+    uint32_t nsize;
+    uint32_t ksize;
+    matrix_type atype;
+    matrix_type btype;
+    matrix_type accumulatortype;
+  };
+
+  static constexpr combination combinations[] = {
+      {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32},
+      {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32},
+      {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32},
+      {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32},
+      {16, 16, 32, 0, 0, 0, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}};
+  static constexpr int num_combinations =
+      sizeof(combinations) / sizeof(combination);
+};
+
+
+enum class tpu {
+  xmx8,
+  xmx16,
+  amx
+};
+
+enum class matrix_type {
+  bf16,
+  fp16,
+  tf32,
+  fp32,
+  fp64,
+  sint2,
+  sint4,
+  sint8,
+  sint16,
+  sint32,
+  sint64,
+  uint2,
+  uint4,
+  uint8,
+  uint16,
+  uint32,
+  uint64
+};
+
+enum class scope_t {
+  sub_group,
+  work_group
+};
+}
+```
+
+
+=== Validation Example:
+```c++
+// Users can provide sizes besides the types and tpu_params can assert if they are supported or not
+// in this case, an assertion will happen as 16 is not a supported size for M on Intel XMX
+using myparams = tpu_params<tpu::xmx8, int8_t, int8_t, int, 16, 8, 32>;
+size_t NDRangeM = M / myparams::M; // Assertion would happen at this line
+size_t NDRangeN = N / myparams::N;
+```
+
+=== Default Values Example:
+```c++
+using myparams = tpu_params<tpu::amx, int8_t, int8_t, int>;
+// use this to construct the ranges on the host side
+size_t NDRangeM = M / myparams::M;
+size_t NDRangeN = N / myparams::N;
+// if M, N, K are not multiples of the default sizes, padding has to be done
+// device code: the matrices are constructed using the default dimensions
+myparams::joint_matrix_a<sub_group, layout::row_major> sub_a;
+myparams::joint_matrix_b<sub_group, layout::row_major> sub_b;
+myparams::joint_matrix_accumulator<sub_group> sub_c;
+
+```
+
+=== General Query Example:
+```c++
+constexpr int M = 1500; // with msize = 8 and msize = 4,
+                        // M can be broken up into 125 sequences of 8-sized ops and the remaining 500 into 125 sequences of 4-sized ops
+tpu_params<tpu::xmx8> params;
+constexpr int msize = break_dimension(params, M);
+constexpr int msize_remainder = break_dimension_remainder(params, M);
+constexpr int nsize = params.combinations[0].nsize;
+constexpr int ksize = params.combinations[0].ksize;
+// device code:
+joint_matrix<sub_group, int8_t, use::a, msize, ksize, layout::row_major> sub_a;
+joint_matrix<sub_group, int8_t, use::b, ksize, nsize, layout::row_major> sub_b;
+joint_matrix<sub_group, int, use::accumulator, msize, nsize> sub_c;
+// Remainder handling
+```
+
+## Future-looking API
+
+### Memory scope
+The current experimental API uses `joint_` semantics to define the memory scope of the matrix.
+The long term solution is to use the proposed link:../../supported/sycl_ext_oneapi_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below.
+
+
+```c++
+multi_ptr<joint_matrix<sub_group, int8_t, use::a, tM, tK>, address_space::local_space> tA_ptr =
+    group_local_memory<joint_matrix<sub_group, int8_t, use::a, tM, tK>>(sg);
+```
+We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet.
+
+### WI data to joint matrix mapping coordinates information for piece-wise operations
+The indexing provided inside the `wi_data` class accesses only the portion of the matrix held by the current WI. It is not possible to know the location of this portion in the original matrix. This coordinates mapping is implementation defined and changes from one backend to the other. For general piece-wise operations like a sum of the rows of a matrix, the WI data to joint matrix mapping information is needed to reason about the matrix view.
+Within the joint matrix extension, we want, as much as possible, to write one code base that runs on different backends. If backend X states, for instance, that a WI owns exactly one row of the matrix, the following code will work only on that backend and that version of the hardware. If different hardware and a different implementation are used, the same WI may own only half of the row if, for example, the SG size increased.
+
+```c++
+auto data = get_wi_data(sg, C);
+for (int i = 0; i < data.length(); ++i) {
+  // `row` is the row this WI owns under backend X's assumed mapping
+  sum_of_local_rows[row] += data[i];
+}
+```
+
+We want to keep backward compatibility in the joint matrix code when implementations or hardware change. To that end, instead of hard-coding this mapping, we use general backend and target-agnostic functionality, especially in the JIT compilation mode of SYCL.
+For this reason we would like to be able to query this mapping so that code does not have to change from one version to the other.
+
+Since this mapping is implementation defined, one of the proposals is to add runtime functions like:
+```c++
+auto data = get_wi_data(sg, C);
+for (int i = 0; i < data.length(); ++i) {
+  auto [row, col] = data[i].get_coord();
+  sum_of_local_rows[row] += data[i];
+}
+```
+
+=== Appendix: Supported Parameter Combinations Per Hardware
+
+The tables below provide a list of the parameter combinations that
+`joint_matrix` implementations support on each supported vendor's hardware type.
+
+==== Nvidia Tensor Cores Supported Combinations
+
+The complete set of matrix data types and shapes that are supported by the `ext_oneapi_cuda` backend is represented in the following table. Tm indicates the matrix element data type held by a "multiplicand" `joint_matrix`: i.e. requiring `use::a` or `use::b`. Tc indicates the matrix element data type held by an "accumulator" `joint_matrix`: i.e. requiring `use::accumulator`.
+
+IMPORTANT: When compiling for the `ext_oneapi_cuda` backend the target arch backend flag, `-Xsycl-target-backend --cuda-gpu-arch=sm_xx`, must be used, where `sm_xx` must be a Compute Capability that is equal to or greater than the appropriate Minimum Compute Capability. When an executable has been compiled for `sm_xx`, if the executable is run on a device with a compute capability less than `sm_xx` then an error will be thrown. The mapping to Minimum Compute Capability from each supported parameter combination is specified in the following table.
+ +-- +[.center] +|====================== +|Tm (`use::a` or `use::b`) |Tc (`use::accumulator`) |M |N |K | Minimum Compute Capability +.3+|half .3+|float +|16 |16 |16 .6+| sm_70 +|8 |32 |16 +|32 |8 |16 +.3+|half .3+|half +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +.3+|int8_t .3+|int32_t +|16 |16 |16 .6+| sm_72 +|8 |32 |16 +|32 |8 |16 +.3+|uint8_t .3+|int32_t +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +|precision::tf32 |float |16 |16 |8 .5+| sm_80 +.3+|bfloat16 .3+|float +|16 |16 |16 +|8 |32 |16 +|32 |8 |16 +|double |double |8 |8 |4 +|====================== +-- + +The M, N, K triple from the above table defines the complete set of matrix shapes constructible: +-- +[.center] +|====================== +|use |NumRows | NumCols +|a |M |K +|b |K |N +|accumulator | M| N +|====================== +-- + +IMPORTANT: The `stride` argument to `joint_matrix_load` and `joint_matrix_store` must be a multiple of 8 when `T` is `half`, and a multiple of 4 when `T` is `float`; where `T` is the type of the `joint_matrix` elements. When `T` is not `half` or `float` there are no restrictions to `stride`. + +## TODO List +- Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. +- Add a more realistic and complete example that shows the value of the general query. + + +## Revision History + +[frame="none",options="header"] +|====================== +|Rev |Date |Author |Changes +|1 |2021-04-13 |Dounia Khaldi |Initial public working draft. 
+|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS
+|3 |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support
+|4 |2022-08-25 |Dounia Khaldi |Update the matrix spec by adding the new matrix use parameter and removing the reference to the AOT AMX initial implementation
+|5 |2022-11-07 |Dounia Khaldi |Update the matrix spec by making it portable across Intel AMX, Intel XMX and Nvidia Tensor Cores, and moving the Intel-specifics to a separate extension document.
+|======================