Bugfix: Attention computation is not equivalent if fused_attn is false #145

kandelak · 2025-05-16T12:52:46Z

There is a bug in the attention calculation when fused_attn is set to false.

To reproduce the bug, set fused_attn to false: reconstruction quality becomes very poor, whereas with fused_attn set to true (the default) it works. With this change, the non-fused path works again. The fix essentially reimplements the fused attention computation in a non-efficient way.

Possible reason: training was done with F.scaled_dot_product_attention, which internally differs from the "else" branch, where attention is computed in a non-efficient way.
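For reference, here is a minimal sketch of what a non-fused path needs to compute in order to match F.scaled_dot_product_attention with no mask and no dropout: softmax(q @ k^T * scale) @ v, where scale defaults to 1/sqrt(head_dim). It is written in plain Python so that it runs without PyTorch. The names reference_attention and max_abs_diff and the toy inputs are hypothetical and not taken from this repository. The unscaled variant only illustrates one way two paths can diverge; it is not a claim about what the "else" branch actually did.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def reference_attention(q, k, v, scale=None):
    """Non-fused attention for a single head: softmax(q @ k^T * scale) @ v.

    q is [L][d], k is [S][d], v is [S][dv], all nested lists of floats.
    When scale is None it defaults to 1/sqrt(d), the documented default
    of F.scaled_dot_product_attention.
    """
    d = len(q[0])
    if scale is None:
        scale = 1.0 / math.sqrt(d)
    out = []
    for q_row in q:
        # Scaled dot product of this query with every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two nested lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy inputs (hypothetical): 2 queries, 3 keys/values, head_dim 4.
q = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0]]
k = [[0.0, 1.0, 1.0, 0.0], [2.0, 0.0, -1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

expected = reference_attention(q, k, v)             # default 1/sqrt(d) scaling
unscaled = reference_attention(q, k, v, scale=1.0)  # deliberately mismatched scaling

# A non-zero gap means the two variants are not interchangeable at inference time.
print(max_abs_diff(expected, unscaled))
```

In the actual model, the same kind of check could be run on real tensors by comparing the output of the "else" branch against F.scaled_dot_product_attention on identical q, k, v and asserting that the difference stays within floating-point tolerance.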

Bugfix: Attention computation is not equivalent if fused_attn is false

0528872

facebook-github-bot added the CLA Signed label on May 16, 2025
