async copy code with latest tip #42
base: main
Conversation
qianfengz
left a comment
Rename nhead_radio_qk to nhead_ratio_qk
example/91_tile_program/fmha_fwd.cpp
const ck::index_t batch_stride_q    = (nhead * seqlen_q * hdim_q);
const ck::index_t batch_stride_k    = (nhead_k * seqlen_k * hdim_q);
const ck::index_t batch_stride_v    = (nhead_k * hdim_v * seqlen_k);
const ck::index_t batch_stride_bias = (0 * nhead * seqlen_q * seqlen_k);
const ck::index_t batch_stride_o    = (nhead * seqlen_q * hdim_v);
Use long_index_t for the calculated offset, and ensure the expression is computed in long_index_t.
@qianfengz Since we store strides as index_t in the kernel arguments, does that mean we also have to pass long_index_t to the kernel? If not, I think using index_t here is more reasonable.
Sorry, you are right; I thought this code was in the kernel.
…le into fmha_attemp_async_copy_unify
…+ extension building
Add bhalf2_t, bhalf4_t inner_product
Speed up fMHA kernel in bias mode (avoid unnecessary conversion to f64)
* WIP add generic masking
* now local is not correct
* fix bug in local attn
* support when a whole row is masked
* fix a bug in local attn
* Re-organize example directories
* Move reference operation into sub-folder
* Move mask types into dedicated files
* Separate utility interface & implementation
* Resume pipeline changes in fmha_fwd.cpp
* Rename folder 'fmha_fwd' to 'fmha'
* Move more functions to utils.*
* Remove 'fmha_fwd_kernel_invoker.hpp'
* Re-format files
* Move Kargs types into dedicated file
* Fix formatting
* Fix compilation errors
* Avoid instantiating unused types
* Extract configurable codes
* Add missing include directive
* Instantiate template functions outside fmha_fwd.cpp
* Separate implementation files
* Merge config files
* Merge duplicated code
* Remove no-longer-used file
* Unify enum name
* Extract no_mask kernel
* Further separate template specializations
* Use file(GLOB) to get file list
* Include needed config file only once
* Remove debug message
* Add comment to explain template specializations
* Move impl files under 'kernels' sub-folder
* Only include *.inc in *.inc files
* Add extra type arg to control selected kernel
* Add kernel specializations for bf16
* Switch kernel according to cmdline options
* Re-order type parameters
* Reduce loop indent level
* Instantiate launch_kernel()
* Rename source files
* Remove duplicated codes
* Remove more duplicated codes
* Clean up codes
* Rename 'FmhaMaskType' to 'FmhaMasks'
* Remove no-longer-used include directive
* Move template declarations into dedicated header
* use python codegen
* modify validation logic
* format print and add smoke_test script
* modify bf16 elimit, add benchmark script
---------
Co-authored-by: carlushuang <[email protected]>
…gfix Fix in block_masking.hpp
* Support more head dim values
* Change the head dim check logic according to config
* Add missing include directive
* Use fewer occupancy to prevent register spills
* Rename closure object
* Make expr more readable
Extract logic as new helper function: get_x_indices_from_distributed_indices()
* add lse parameters to kernel
* add store lse in kernel
* add lse host ref and check result
* add parameter to control store lse or not
* fix kernel template kStoreLSE value
* move lse store to pipeline
* fix output err info
* fix output err info2
* change lse 4 dim to 3 dim
* mask storing lse in example
* remove divide in kernel
* remove pointer in function reference_batched_softmax
* set LSE as template parameter in FmhaFwdKernelSelector
* remove parameter stride_lse
* fix bug for using nullopt in function reference_batched_softmax
---------
Co-authored-by: letaoqin <[email protected]>
* Add new fmha pipeline: BlockFmhaPipelineQSKSVS
* Update tile size for hdim=256
* Use BlockFmhaPipelineQRKSVS for hdim=256
* Revert "Update tile size for hdim=256" (this reverts commit 8cd70c2)
* Remove FMHA_FWD_SUPPORT_HDIM_256 option and run hdim=256 tests
* compute correct
* improve perf, but pipeline seems to have duplicated ISA
* refactor generate_kernel
* remove duplicated GetBlockQKGemm
* finalize a more generic codegen
* refactor into autogen API
* fix some comments
* Use occupancy=1 for hdim=256
* support hdim=256
* modify some comments
* we no longer need to change target inside a file
* update bench script
* add readme
* modify
---------
Co-authored-by: Po Yen, Chen <[email protected]>
Fix wrong argument order of transform_tensor_view()
* Validate m values before using them
* Ensure closure return type is same as param
* Replace Lowest() calls with Infinity()
* Fix format
* Update all the pipelines
* Only validate m if FmhaMask::IsMasking is true
…sion (#77)
* Rename TileFmhaTraits<> padding flags
* Rename FmhaFwdKernel padding attributes
* Separate kPadHeadDimQV into kPadHeadDimQ/V
* Allow setting random seed for uniform filler
* Add seed option to fmha example
* Add normal distribution tensor filler
* Align tensor init method with xformers
* Fix variable type
* Validate m if we have bias tensor
* Add comment to explain why we validate m under bias mode
* Remove blank line
* Support no-seed scenario for fillers
* Do not apply random seed if user sets seed=0
* Check padding edge in IsEdgeTile()
* Rename variables in IsEdgeTile()
* Rename template parameter & fix compilation error
* Only check right boundary for now
* Add "i_" prefix for indicating index variables
* Validate v_max before use in reference_batched_softmax()
* Dump LSE even if nothing to do in pipeline
* Allow infinity reference value for check_err()
* Allow infinity reference value while checking LSE
* Rename variable
* Generate independent seqlens_k in group mode (seqlens_k is no longer required to be greater than the corresponding element in seqlens_q)
* Use std::isinf() to check -INF value
* Remove check for NaN
* Do not clear buffer before use