@rfejgin rfejgin commented Sep 19, 2025

Problem: We have observed a very rare but persistent issue where Magpie can generate a zero-length output (for a standard-length input text). Here's what was happening: for particular utterances, the EOS logit at the very first timestep would become very large after application of CFG. Interestingly, before CFG neither the conditional nor the unconditional EOS logit was particularly large, but CFG amplified the difference between them, which resulted in the post-CFG EOS logit being the maximum among all logits.
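
To make the amplification effect concrete, here is a toy sketch with made-up numbers. It assumes the commonly used CFG combination `uncond + scale * (cond - uncond)`; the exact formula and scale used in Magpie are not stated in this description, and the logit values, `EOS_ID`, and `cfg_scale` below are invented for illustration.

```python
# Toy illustration only: the logit values, vocabulary size, and cfg_scale
# are made up, and the CFG formula is the commonly used one (an assumption,
# not taken from the Magpie code).
EOS_ID = 3  # hypothetical index of the EOS token in a 4-token vocabulary


def apply_cfg(cond, uncond, cfg_scale):
    """Combine conditional and unconditional logits with classifier-free guidance."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# EOS is the smallest logit in both the conditional and unconditional outputs,
# but it is also the entry where the two disagree the most.
cond_logits = [2.0, 1.5, 1.0, 0.5]
uncond_logits = [2.0, 1.6, 1.2, -1.0]

cfg_logits = apply_cfg(cond_logits, uncond_logits, cfg_scale=2.5)

print("argmax before CFG (cond):  ", argmax(cond_logits))
print("argmax before CFG (uncond):", argmax(uncond_logits))
print("argmax after CFG:          ", argmax(cfg_logits))
```

In this toy case the EOS entry is not the maximum in either input, yet it becomes the maximum after guidance because the conditional/unconditional gap is scaled up.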

Fix: In this PR we avoid this particular early termination issue by disallowing EOS in the first 4 timesteps, corresponding to about 186ms (with a 21.5 Hz codec). That helps the model avoid first-frame termination and it then actually generates the rest of the utterance correctly. The value of 4 steps was chosen because the codec requires a minimum of 4 frames to decode (albeit at a batch level). This way we sample those 4 steps rather than potentially force-replace them with zero-index tokens.
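
The idea of the fix can be sketched as follows. This is a minimal standalone illustration, not the code in this PR: `forbid_early_eos`, `MIN_FRAMES`, and the example logits are hypothetical names and values, and the real implementation operates on batched tensors inside the inference loop.

```python
import math

CODEC_FRAME_RATE_HZ = 21.5  # codec frame rate quoted in this PR
MIN_FRAMES = 4              # number of initial steps in which EOS is forbidden


def forbid_early_eos(logits, step, eos_id, min_frames=MIN_FRAMES):
    """Return a copy of `logits` with EOS masked out while step < min_frames.

    Hypothetical helper for illustration; `step` is the 0-based decoding step.
    """
    masked = list(logits)
    if step < min_frames:
        # -inf gives EOS zero probability after softmax and excludes it from argmax.
        masked[eos_id] = -math.inf
    return masked


# Duration covered by the forbidden window: 4 frames / 21.5 Hz, about 186 ms.
min_duration_ms = MIN_FRAMES / CODEC_FRAME_RATE_HZ * 1000

# Example: post-CFG logits where EOS (index 3) would otherwise win at step 0.
post_cfg_logits = [2.0, 1.35, 0.7, 2.75]
first_step = forbid_early_eos(post_cfg_logits, step=0, eos_id=3)
later_step = forbid_early_eos(post_cfg_logits, step=MIN_FRAMES, eos_id=3)
```

Masking with negative infinity works for both greedy decoding and sampling, and once `step` reaches `min_frames` the logits pass through unchanged.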

Note that I also examined the logits for an instance of mid-sentence termination and the logits did not follow the pattern of EOS becoming large only after CFG in that case. So CFG doesn't appear to be at the root of all early-termination issues, just the start-of-utterance one.

(Still, it's probably worthwhile thinking about how to improve the underlying CFG mechanism; but for now we need a safeguard so that these zero-length generations don't happen.)

I ran evaluations and got similar metrics with and without this constraint, which should kick in very rarely.

The number of frames is configurable via a parameter to infer_batch().

This is a workaround for the observation that when CFG is on
we sometimes terminate after zero tokens. It appears to be an artifact
of CFG, since the EOS logit is not particularly large for the conditional
logits; only post-CFG.

Signed-off-by: Fejgin, Roy <[email protected]>
@github-actions github-actions bot added the TTS label Sep 19, 2025
@rfejgin rfejgin marked this pull request as ready for review September 24, 2025 01:25
@rfejgin rfejgin force-pushed the magpietts_2508_forbid_eos_near_start branch from 925661f to 1005d1b Compare September 24, 2025 03:11
(to aid in debugging rare issues)

@paarthneekhara paarthneekhara left a comment

Looks good to me.

@rfejgin rfejgin enabled auto-merge (squash) September 25, 2025 01:23
@rfejgin rfejgin disabled auto-merge September 25, 2025 03:49
@rfejgin rfejgin enabled auto-merge (squash) September 26, 2025 20:45
@rfejgin rfejgin added Run CICD and removed Run CICD labels Sep 27, 2025
@rfejgin rfejgin added Run CICD and removed Run CICD labels Sep 27, 2025
@rfejgin rfejgin merged commit f3878d7 into NVIDIA-NeMo:magpietts_2508 Sep 27, 2025
95 of 103 checks passed