async copy code with latest tip #42
base: main
Conversation
qianfengz
left a comment
Rename nhead_radio_qk to nhead_ratio_qk
example/91_tile_program/fmha_fwd.cpp
const ck::index_t batch_stride_q    = (nhead * seqlen_q * hdim_q);
const ck::index_t batch_stride_k    = (nhead_k * seqlen_k * hdim_q);
const ck::index_t batch_stride_v    = (nhead_k * hdim_v * seqlen_k);
const ck::index_t batch_stride_bias = (0 * nhead * seqlen_q * seqlen_k);
const ck::index_t batch_stride_o    = (nhead * seqlen_q * hdim_v);
Use long_index_t for the calculated offset, and ensure the expression is computed in long_index_t.
@qianfengz Since we store strides as index_t in the kernel arguments, does that mean we also have to pass long_index_t to the kernel? If not, I think using index_t here is more reasonable.
Sorry, you are right; I thought this code was in the kernel.
…le into fmha_attemp_async_copy_unify
…+ extension building
Add bhalf2_t, bhalf4_t inner_product
Speed up fMHA kernel in bias mode (avoid unnecessary conversion to f64)
* WIP add generic masking
* now local is not correct
* fix bug in local attn
* support when a whole row is masked
* fix a bug in local attn
* Re-organize example directories
* Move reference operation into sub-folder
* Move mask types into dedicated files
* Separate utility interface & implementation
* Resume pipeline changes in fmha_fwd.cpp
* Rename folder 'fmha_fwd' to 'fmha'
* Move more functions to utils.*
* Remove 'fmha_fwd_kernel_invoker.hpp'
* Re-format files
* Move Kargs types into dedicated file
* Fix formatting
* Fix compilation errors
* Avoid instantiating unused types
* Extract configurable codes
* Add missing include directive
* Instantiate template functions outside fmha_fwd.cpp
* Separate implementation files
* Merge config files
* Merge duplicated code
* Remove no-longer-used file
* Unify enum name
* Extract no_mask kernel
* Further separate template specializations
* Use file(GLOB) to get file list
* Include needed config file only once
* Remove debug message
* Add comment to explain template specializations
* Move impl files under 'kernels' sub-folder
* Only include *.inc in *.inc files
* Add extra type arg to control selected kernel
* Add kernel specializations for bf16
* Switch kernel according to cmdline options
* Re-order type parameters
* Reduce loop indent level
* Instantiate launch_kernel()
* Rename source files
* Remove duplicated codes
* Remove more duplicated codes
* Clean up codes
* Rename 'FmhaMaskType' to 'FmhaMasks'
* Remove no-longer-used include directive
* Move template declarations into dedicated header
* use python codegen
* modify validation logic
* format print and add smoke_test script
* modify bf16 elimit, add benchmark script
---------
Co-authored-by: carlushuang <[email protected]>
…gfix Fix in block_masking.hpp
* Support more head dim values
* Change the head dim check logic according to config
* Add missing include directive
* Use fewer occupancy to prevent register spills
* Rename closure object
* Make expr more readable
Extract logic as new helper function: get_x_indices_from_distributed_indices()
* add lse parameters to kernel
* add store lse in kernel
* add lse host ref and check result
* add parameter to control store lse or not
* fix kernel template kStoreLSE value
* move lse store to pipeline
* fix output err info
* fix output err info2
* change lse 4 dim to 3 dim
* mask storing lse in example
* remove divide in kernel
* remove pointer in function reference_batched_softmax
* set LSE as template parameter in FmhaFwdKernelSelector
* remove parameter stride_lse
* fix bug for using nullopt in function reference_batched_softmax
---------
Co-authored-by: letaoqin <[email protected]>
* Add new fmha pipeline: BlockFmhaPipelineQSKSVS
* Update tile size for hdim=256
* Use BlockFmhaPipelineQRKSVS for hdim=256
* Revert "Update tile size for hdim=256" (this reverts commit 8cd70c2)
* Remove FMHA_FWD_SUPPORT_HDIM_256 option and run hdim=256 tests
* compute correct
* improve perf, but pipeline seems to have duplicated ISA
* refactor generate_kernel
* remove duplicated GetBlockQKGemm
* finalize a more generic codegen
* refactor into autogen API
* fix some comments
* Use occupancy=1 for hdim=256
* support hdim=256
* modify some comments
* we no longer need to change target inside a file
* update bench script
* add readme
* modify
---------
Co-authored-by: Po Yen, Chen <[email protected]>
Fix wrong argument order of transform_tensor_view()
* Validate m values before using them
* Ensure closure return type is same as param
* Replace Lowest() calls with Infinity()
* Fix format
* Update all the pipelines
* Only validate m if FmhaMask::IsMasking is true
…sion (#77)
* Rename TileFmhaTraits<> padding flags
* Rename FmhaFwdKernel padding attributes
* Separate kPadHeadDimQV into kPadHeadDimQ/V
* Allow setting random seed for uniform filler
* Add seed option to fmha example
* Add normal distribution tensor filler
* Align tensor init method with xformers
* Fix variable type
* Validate m if we have bias tensor
* Add comment to explain why we validate m under bias mode
* Remove blank line
* Support no-seed scenario for fillers
* Do not apply random seed if user sets seed=0
* Check padding edge in IsEdgeTile()
* Rename variables in IsEdgeTile()
* Rename template parameter & fix compilation error
* Only check right boundary for now
* Add "i_" prefix for indicating index variables
* Validate v_max before use in reference_batched_softmax()
* Dump LSE even if nothing to do in pipeline
* Allow infinity reference value for check_err()
* Allow infinity reference value while checking LSE
* Rename variable
* Generate independent seqlens_k in group mode (seqlens_k is no longer required to be greater than the corresponding element in seqlens_q)
* Use std::isinf() to check -INF value
* Remove check for NaN
* Do not clear buffer before use