Don't allow EOS until 4 frames have been generated #14761
Merged: rfejgin merged 13 commits into NVIDIA-NeMo:magpietts_2508 from rfejgin:magpietts_2508_forbid_eos_near_start on Sep 27, 2025
Conversation
The number of frames is configurable via a parameter to infer_batch(). This is a workaround for the observation that when CFG is on we sometimes terminate after zero tokens. It appears to be an artifact of CFG, since the EOS logit is not particularly large among the conditional logits; it becomes large only post-CFG. Signed-off-by: Fejgin, Roy <[email protected]>
925661f to 1005d1b
…8_forbid_eos_near_start Signed-off-by: Fejgin, Roy <[email protected]>
(to aid in debugging rare issues) Signed-off-by: Fejgin, Roy <[email protected]>
paarthneekhara approved these changes on Sep 24, 2025
Looks good to me.
Problem: We have observed a very rare but persistent issue where Magpie can generate a zero-length output for a standard-length input text. Here's what was happening: for particular utterances, the EOS logit at the very first timestep would become very large after application of CFG. Interestingly, before CFG neither the conditional nor the unconditional EOS logit was particularly large; but CFG amplified the difference between them, which resulted in the post-CFG EOS logit being the maximum among all logits.
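For concreteness, here is a minimal sketch of the standard CFG combination of logits, which illustrates how a modest gap between the conditional and unconditional EOS logits can be amplified. The function name and signature are illustrative only, not the actual Magpie/NeMo code.

```python
import torch

def apply_cfg(cond_logits: torch.Tensor, uncond_logits: torch.Tensor, cfg_scale: float) -> torch.Tensor:
    """Standard classifier-free guidance on logits (illustrative sketch).

    Even if neither the conditional nor the unconditional EOS logit is large,
    a modest positive (cond - uncond) gap on the EOS index is multiplied by
    cfg_scale, which can push the post-CFG EOS logit above every other token.
    """
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)
```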
Fix: In this PR we avoid this particular early-termination issue by disallowing EOS in the first 4 timesteps, corresponding to about 186 ms (with a 21.5 Hz codec). This prevents first-frame termination, after which the model generates the rest of the utterance correctly. The value of 4 steps was chosen because the codec requires a minimum of 4 frames to decode (albeit at the batch level). This way we sample those 4 steps rather than potentially force-replace them with zero-index tokens.
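As a rough illustration of the guard described above (not the actual implementation; in the PR the threshold is exposed as a parameter to infer_batch()), a hypothetical masking helper might look like this:

```python
import torch

def mask_eos_logits(logits: torch.Tensor, step: int, eos_id: int, min_frames: int = 4) -> torch.Tensor:
    """Disallow EOS until `min_frames` frames have been generated (hypothetical helper).

    Setting the EOS logit to -inf before sampling means EOS can never be
    selected, whether by argmax or by sampling from the softmax distribution.
    """
    if step < min_frames:
        logits = logits.clone()
        logits[..., eos_id] = float("-inf")
    return logits
```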
Note that I also examined the logits for an instance of mid-sentence termination; in that case they did not follow the pattern of EOS becoming large only after CFG. So CFG doesn't appear to be at the root of all early-termination issues, just the start-of-utterance one.
(Still, it's probably worthwhile thinking about how to improve the underlying CFG mechanism; but for now we need a safeguard so that these zero-length generations don't happen.)
I ran evaluations and got similar metrics with and without this constraint, which should kick in very rarely.