Describe the bug TransformerEncoder does not receive or apply masks generated by Embedding(mask_zero=True), despite the documentation stating that it does:
"This layer will correctly compute an attention mask from an implicit Keras padding mask (for example, by passing mask_zero=True to a keras.layers.Embedding layer).”
In practice, the call() method of TransformerEncoder does not accept mask as an argument, and therefore Keras does not pass the automatic mask to it.
As a result, attention is incorrectly computed on padding tokens, and training is impacted.
To Reproduce
You can reproduce the issue with this minimal example:
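The original snippet was not preserved in this report, so the following is a hedged reconstruction of the kind of minimal example described: an Embedding(mask_zero=True) layer feeding a TransformerEncoder (assumed here to be keras_hub.layers.TransformerEncoder, keras_nlp in older releases), with placeholder vocabulary size, dimensions, and token ids.

```python
# Hedged reconstruction of the minimal example (the original snippet was not
# preserved); keras_hub and the toy dimensions/token ids are assumptions.
import numpy as np
import keras
import keras_hub  # or keras_nlp in older releases

token_ids = keras.Input(shape=(8,), dtype="int32")
# mask_zero=True should generate an implicit padding mask for token id 0.
x = keras.layers.Embedding(input_dim=100, output_dim=16, mask_zero=True)(token_ids)
x = keras_hub.layers.TransformerEncoder(intermediate_dim=32, num_heads=2)(x)
model = keras.Model(token_ids, x)

# A small batch whose tails are padded with token id 0.
batch = np.array([[5, 6, 7, 0, 0, 0, 0, 0],
                  [9, 6, 0, 0, 0, 0, 0, 0]])
out = model.predict(batch)

# Inspect the encoder outputs at the padding positions.
print(out[0, 3:])  # positions 3..7 of the first sequence are padding
print(out[1, 2:])  # positions 2..7 of the second sequence are padding
```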
Expected output
The output vectors corresponding to the padding positions (token 0) should either:
Be masked (e.g., filled with zeros or left unchanged), or
Remain constant if attention is completely masked out.
Actual output
The output for the padding tokens is non-zero and distinct, which indicates that the model applied attention to padding positions, violating the intended masking behavior.
Expected behavior
The attention mechanism inside TransformerEncoder should ignore padded tokens (zeros), based on the implicit mask generated by Embedding(mask_zero=True).
Either:
The call() method should accept mask=None and internally map it to padding_mask, OR
The documentation should clearly state that automatic masking is not supported and that the user must pass padding_mask=... manually.
Would you like to help us fix it?
Yes, I’d be happy to help. A minimal fix (sketched after this list) could involve:
Adding mask=None to the call() signature
Automatically mapping it to padding_mask inside the layer
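A hedged sketch of what that change could look like, not the library's actual source; the padding_mask/attention_mask/training arguments are taken from the documented call() signature, everything else is a placeholder:

```python
import keras

# Hypothetical sketch of the proposed change, not the library's real code.
class TransformerEncoder(keras.layers.Layer):
    # ... __init__ / build unchanged ...

    def call(self, inputs, padding_mask=None, attention_mask=None,
             training=None, mask=None):
        # Accept the implicit Keras mask (e.g. from Embedding(mask_zero=True))
        # and fall back to it when no explicit padding_mask is provided.
        if padding_mask is None and mask is not None:
            padding_mask = mask
        ...  # rest of the original call() unchanged
```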
After a deeper inspection, it turns out that TransformerEncoder does correctly receive the mask generated by Embedding(mask_zero=True), but in a somewhat non-obvious way:
The mask is not passed via the mask= argument to the call() method.
Instead, Keras internally attaches the mask as a private attribute named _keras_mask to the input tensor.
Inside the TransformerEncoder, the merge_padding_and_attention_mask() utility function retrieves this mask from inputs._keras_mask if padding_mask is not explicitly passed.
So the actual behavior is correct:
Attention masking is applied properly, even without explicitly passing padding_mask=....
This mechanism relies on Keras’ internal mask propagation and works as long as the model is built using the Functional API (or the Sequential API with proper layer chaining).
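For illustration, here is a small hedged sketch of how this propagation can be observed in the Functional API (keras_hub and the layer sizes are assumptions; _keras_mask is a private attribute and may change between Keras versions):

```python
import keras
import keras_hub  # or keras_nlp in older releases

token_ids = keras.Input(shape=(None,), dtype="int32")
x = keras.layers.Embedding(input_dim=100, output_dim=16, mask_zero=True)(token_ids)

# Keras attaches the boolean padding mask to the symbolic tensor as a private
# attribute; merge_padding_and_attention_mask() inside the encoder reads it
# when padding_mask is not passed explicitly.
print(getattr(x, "_keras_mask", None))

outputs = keras_hub.layers.TransformerEncoder(intermediate_dim=32, num_heads=2)(x)
model = keras.Model(token_ids, outputs)
```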
However, if the encoder is wrapped in a custom layer or called manually, the user must manually pass mask=... and route it to padding_mask.
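In that case, a hedged sketch of the manual routing might look like this (the wrapper layer, its name, and its sizes are hypothetical):

```python
import keras
import keras_hub  # or keras_nlp in older releases

# Hypothetical wrapper layer: when the encoder is called manually, forward the
# incoming Keras mask to its explicit padding_mask argument.
class EncoderBlock(keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True  # keep the mask propagating downstream
        self.encoder = keras_hub.layers.TransformerEncoder(
            intermediate_dim=32, num_heads=2
        )

    def call(self, inputs, mask=None):
        # Route the implicit mask (e.g. from Embedding(mask_zero=True))
        # to the encoder's padding_mask argument.
        return self.encoder(inputs, padding_mask=mask)
```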
Suggestion:
There’s no need to modify the code itself, but it would be helpful to clarify the documentation, e.g., by mentioning that:
"Mask propagation happens implicitly through inputs._keras_mask, and the layer will respect it via internal utilities. However, when using custom wrappers or subclassing, you may need to pass the mask manually as padding_mask=...."
Let me know if you'd like me to submit a small PR with this doc clarification!