Describe the bug TransformerEncoder does not receive or apply masks generated by Embedding(mask_zero=True), despite the documentation stating that it does:
"This layer will correctly compute an attention mask from an implicit Keras padding mask (for example, by passing mask_zero=True to a keras.layers.Embedding layer).”
In practice, the call() method of TransformerEncoder does not accept mask as an argument, and therefore Keras does not pass the automatic mask to it.
As a result, attention is incorrectly computed on padding tokens, and training is impacted.
To Reproduce
You can reproduce the issue with this minimal example:
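The original snippet was not preserved in this report, so the following is a hedged reconstruction of the kind of minimal example described: an Embedding(mask_zero=True) layer feeding a TransformerEncoder (assumed here to be keras_hub.layers.TransformerEncoder, keras_nlp in older releases), with placeholder vocabulary size, dimensions, and token ids.

```python
# Hedged reconstruction of the minimal example (the original snippet was not
# preserved); keras_hub and the toy dimensions/token ids are assumptions.
import numpy as np
import keras
import keras_hub  # or keras_nlp in older releases

token_ids = keras.Input(shape=(8,), dtype="int32")
# mask_zero=True should generate an implicit padding mask for token id 0.
x = keras.layers.Embedding(input_dim=100, output_dim=16, mask_zero=True)(token_ids)
x = keras_hub.layers.TransformerEncoder(intermediate_dim=32, num_heads=2)(x)
model = keras.Model(token_ids, x)

# A small batch whose tails are padded with token id 0.
batch = np.array([[5, 6, 7, 0, 0, 0, 0, 0],
                  [9, 6, 0, 0, 0, 0, 0, 0]])
out = model.predict(batch)

# Inspect the encoder outputs at the padding positions.
print(out[0, 3:])  # positions 3..7 of the first sequence are padding
print(out[1, 2:])  # positions 2..7 of the second sequence are padding
```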
Expected output
The output vectors corresponding to the padding positions (token 0) should either:
Be masked (e.g., filled with zeros or left unchanged), or
Remain constant if attention is completely masked out.
Actual output
The output for the padding tokens is non-zero and distinct, which indicates that the model applied attention to padding positions, violating the intended masking behavior.
Expected behavior
The attention mechanism inside TransformerEncoder should ignore padded tokens (zeros), based on the implicit mask generated by Embedding(mask_zero=True).
Either:
The call() method should accept mask=None and internally map it to padding_mask, OR
The documentation should clearly state that automatic masking is not supported and that the user must pass padding_mask=... manually.
Would you like to help us fix it?
Yes, I’d be happy to help. A minimal fix (sketched after this list) could involve:
Adding mask=None to the call() signature
Automatically mapping it to padding_mask inside the layer
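A hedged sketch of what that change could look like, not the library's actual source; the padding_mask/attention_mask/training arguments are taken from the documented call() signature, everything else is a placeholder:

```python
import keras

# Hypothetical sketch of the proposed change, not the library's real code.
class TransformerEncoder(keras.layers.Layer):
    # ... __init__ / build unchanged ...

    def call(self, inputs, padding_mask=None, attention_mask=None,
             training=None, mask=None):
        # Accept the implicit Keras mask (e.g. from Embedding(mask_zero=True))
        # and fall back to it when no explicit padding_mask is provided.
        if padding_mask is None and mask is not None:
            padding_mask = mask
        ...  # rest of the original call() unchanged
```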
After a deeper inspection, it turns out that TransformerEncoder does correctly receive the mask generated by Embedding(mask_zero=True), but in a somewhat non-obvious way:
The mask is not passed via the mask= argument to the call() method.
Instead, Keras internally attaches the mask as a private attribute named _keras_mask to the input tensor.
Inside the TransformerEncoder, the merge_padding_and_attention_mask() utility function retrieves this mask from inputs._keras_mask if padding_mask is not explicitly passed.
So the actual behavior is correct:
Attention masking is applied properly, even without explicitly passing padding_mask=....
This mechanism relies on Keras’ internal mask propagation and works as long as the model is built using the Functional API (or the Sequential API with proper layer chaining).
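For illustration, here is a small hedged sketch of how this propagation can be observed in the Functional API (keras_hub and the layer sizes are assumptions; _keras_mask is a private attribute and may change between Keras versions):

```python
import keras
import keras_hub  # or keras_nlp in older releases

token_ids = keras.Input(shape=(None,), dtype="int32")
x = keras.layers.Embedding(input_dim=100, output_dim=16, mask_zero=True)(token_ids)

# Keras attaches the boolean padding mask to the symbolic tensor as a private
# attribute; merge_padding_and_attention_mask() inside the encoder reads it
# when padding_mask is not passed explicitly.
print(getattr(x, "_keras_mask", None))

outputs = keras_hub.layers.TransformerEncoder(intermediate_dim=32, num_heads=2)(x)
model = keras.Model(token_ids, outputs)
```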
However, if the encoder is wrapped in a custom layer or called manually, the user must manually pass mask=... and route it to padding_mask.
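In that case, a hedged sketch of the manual routing might look like this (the wrapper layer, its name, and its sizes are hypothetical):

```python
import keras
import keras_hub  # or keras_nlp in older releases

# Hypothetical wrapper layer: when the encoder is called manually, forward the
# incoming Keras mask to its explicit padding_mask argument.
class EncoderBlock(keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True  # keep the mask propagating downstream
        self.encoder = keras_hub.layers.TransformerEncoder(
            intermediate_dim=32, num_heads=2
        )

    def call(self, inputs, mask=None):
        # Route the implicit mask (e.g. from Embedding(mask_zero=True))
        # to the encoder's padding_mask argument.
        return self.encoder(inputs, padding_mask=mask)
```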
Suggestion:
There’s no need to modify the code itself, but it would be helpful to clarify the documentation, e.g., by mentioning that:
"Mask propagation happens implicitly through inputs._keras_mask, and the layer will respect it via internal utilities. However, when using custom wrappers or subclassing, you may need to pass the mask manually as padding_mask=...."
Let me know if you'd like me to submit a small PR with this doc clarification!