[Feature]: Evaluate prompt presence on subsequent audio chunks #19772

Open
@NickLucche

🚀 The feature, motivation and pitch

Starting with #19597, vLLM supports chunking audio longer than 30s when serving Whisper.
The logic is pretty simple right now: the audio is chunked at semi-fixed intervals, looking for "silence" in a small window around the chunk limit.
The request is then executed in a "concurrent mode", batching the audio chunks.

for i, chunk in enumerate(chunks):
    prompt = {
        "encoder_prompt": {
            "prompt": "",
            "multi_modal_data": {
                "audio": (chunk, sr),
            },
        },
        "decoder_prompt":
        f"<|startoftranscript|>{lang_token}<|transcribe|><|notimestamps|>{request.prompt}"
        if i == 0 else ""
    }

Hence there's no sequential dependency at the moment. In particular, the transcription of chunk_i is not piped as a prompt to chunk_i+1, which would be the optimal strategy as per the Whisper paper.
In this regard, it would be nice to assess, with longer audio samples, whether feeding the original prompt to the chunks after the first one is actually beneficial to the quality of the generated output.
My understanding is that the prompt conditions the model on the text that appeared in the past 30s, and hence it may actually be harmful to the final quality of the transcription, given a long enough input.
This task requires evaluating the precision/error rate on longer sequences, similar to what is done in https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/correctness/test_transcription_api_correctness.py, with and without the original prompt for all chunks but the first one.
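
To make the comparison concrete, here is a rough sketch of how the two settings could be scored against each other. It is not the method used by the linked test: `word_error_rate` is a plain word-level edit distance written here for illustration, and `transcribe` is a hypothetical callable (it would wrap whatever client hits the transcription endpoint) that takes an audio sample plus a `prompt_all_chunks` flag and returns text.

```python
from typing import Any, Callable, Dict, Sequence, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def compare_prompt_strategies(
    samples: Sequence[Tuple[Any, str]],
    transcribe: Callable[[Any, bool], str],
) -> Dict[str, float]:
    """Mean per-sample WER with the prompt on the first chunk only vs. on all chunks.

    `transcribe(audio, prompt_all_chunks)` is a hypothetical hook, not a vLLM API.
    """
    if not samples:
        raise ValueError("need at least one (audio, reference) pair")
    totals = {False: 0.0, True: 0.0}
    for audio, reference in samples:
        for prompt_all_chunks in (False, True):
            hypothesis = transcribe(audio, prompt_all_chunks)
            totals[prompt_all_chunks] += word_error_rate(reference, hypothesis)
    n = len(samples)
    return {
        "first_chunk_only": totals[False] / n,
        "all_chunks": totals[True] / n,
    }
```

This averages WER per sample; a corpus-level WER (total edits over total reference words) may be preferable when sample lengths vary a lot, and text normalization would need to match whatever the existing correctness test applies.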

Here's the detail of the line I am referring to:

f"<|startoftranscript|>{lang_token}<|transcribe|><|notimestamps|>{request.prompt}"
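
For reference, the two configurations under comparison could be expressed with a small helper like the one below. This is only a sketch: `build_decoder_prompts` and its `prompt_all_chunks` flag are hypothetical names, not part of vLLM. With the flag off it mirrors the snippet above (full decoder prompt on chunk 0, empty string afterwards); with the flag on, every chunk gets the full string including the user prompt.

```python
from typing import List


def build_decoder_prompts(
    num_chunks: int,
    lang_token: str,
    user_prompt: str,
    prompt_all_chunks: bool,
) -> List[str]:
    """Return one decoder prompt per audio chunk (hypothetical helper)."""
    full = (
        f"<|startoftranscript|>{lang_token}<|transcribe|><|notimestamps|>"
        f"{user_prompt}"
    )
    # Flag off: only the first chunk carries the prompt, as in the snippet above.
    # Flag on: the same prompt is repeated for every chunk.
    return [
        full if (prompt_all_chunks or i == 0) else ""
        for i in range(num_chunks)
    ]
```

Whether later chunks should instead keep the task tokens and drop only the user prompt is a separate design question that this sketch does not settle.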
