Description
🚀 The feature, motivation and pitch
Starting with #19597, vLLM supports chunking audio longer than 30s when serving Whisper.
The current logic is fairly simple: the audio is chunked at semi-fixed intervals, looking for "silence" in a small window around the nominal chunk boundary.
The request is then executed concurrently, with the audio chunks batched together.
vllm/vllm/entrypoints/openai/serving_transcription.py, lines 215 to 226 (at cda9230)
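For reference, a minimal sketch of that strategy as I understand it (the helper names below, e.g. `find_split_point` and `chunk_audio`, are hypothetical and not the actual vLLM implementation): split roughly every 30s, nudging the boundary to the quietest sample inside a small search window.

```python
import numpy as np

def find_split_point(wav: np.ndarray, target: int, window: int) -> int:
    """Hypothetical helper: pick the lowest-amplitude sample index
    inside [target - window, target + window] as the chunk boundary."""
    lo, hi = max(0, target - window), min(len(wav), target + window)
    return lo + int(np.argmin(np.abs(wav[lo:hi])))

def chunk_audio(wav: np.ndarray, sr: int = 16_000,
                chunk_s: float = 30.0, window_s: float = 1.0) -> list[np.ndarray]:
    """Split audio into ~30s chunks, cutting at the quietest point
    near each nominal boundary (the "silence" heuristic)."""
    chunk_len, window = int(chunk_s * sr), int(window_s * sr)
    chunks, start = [], 0
    while start < len(wav):
        if len(wav) - start <= chunk_len:
            chunks.append(wav[start:])
            break
        split = find_split_point(wav, start + chunk_len, window)
        chunks.append(wav[start:split])
        start = split
    return chunks
```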
Hence there is no sequential dependency at the moment; in particular, the transcription of chunk_i is not fed as the prompt to chunk_i+1, which is the optimal strategy according to the Whisper paper.
In this regard, it would be nice to assess, on longer audio samples, whether feeding the original prompt to the chunks after the first one is actually beneficial to the quality of the generated output.
My understanding is that the prompt conditions the model on the text that appeared in the previous 30s, so forwarding the original prompt to every chunk may actually hurt the final transcription quality on a long enough input.
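To make the two conditioning strategies concrete, here is a hedged client-side sketch that emulates them without patching the server (it assumes the `chunk_audio` helper above and a vLLM OpenAI-compatible server; the endpoint URL and model name are placeholders): the sequential variant feeds each chunk's transcription as the decoder prompt for the next chunk, while the other variant forwards the same original prompt to every chunk, which is roughly the current behaviour.

```python
import io
import soundfile as sf
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder server
MODEL = "openai/whisper-large-v3"  # placeholder model name

def transcribe_chunk(chunk, sr: int, prompt: str) -> str:
    """Send one chunk to /audio/transcriptions with a decoder prompt."""
    buf = io.BytesIO()
    sf.write(buf, chunk, sr, format="WAV")
    result = client.audio.transcriptions.create(
        model=MODEL, file=("chunk.wav", buf.getvalue()), prompt=prompt)
    return result.text

def transcribe_sequential(chunks, sr: int, original_prompt: str = "") -> str:
    """Whisper-paper style: condition chunk i+1 on the transcription of chunk i."""
    texts, prompt = [], original_prompt
    for chunk in chunks:
        text = transcribe_chunk(chunk, sr, prompt)
        texts.append(text)
        prompt = text  # previous output becomes the next decoder prompt
    return " ".join(texts)

def transcribe_with_fixed_prompt(chunks, sr: int, original_prompt: str = "") -> str:
    """Behaviour under test: the same original prompt goes to every chunk."""
    return " ".join(transcribe_chunk(chunk, sr, original_prompt) for chunk in chunks)
```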
This task requires evaluating the error rate (e.g. WER) on longer sequences, similarly to what was done in https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/correctness/test_transcription_api_correctness.py, with and without the original prompt being forwarded to all chunks after the first one.
The chunk-batching code I am referring to is the serving_transcription.py snippet linked above (lines 215 to 226 at cda9230).
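For the evaluation itself, a hedged sketch along the lines of the linked correctness test (the dataset name and its column names are illustrative assumptions, and the helpers come from the sketches above): transcribe each long sample with the original prompt forwarded to every chunk vs. only to the first chunk, then compare word error rates against the reference text.

```python
import jiwer
from datasets import load_dataset

def evaluate_prompt_strategies(samples, sr: int = 16_000, original_prompt: str = ""):
    """Compare WER when the original prompt is forwarded to every chunk
    vs. only to the first one; reuses chunk_audio/transcribe_* sketched above."""
    refs, hyp_all_chunks, hyp_first_only = [], [], []
    for sample in samples:
        chunks = chunk_audio(sample["audio"]["array"], sr)
        refs.append(sample["text"])
        # Variant A: same original prompt sent with every chunk.
        hyp_all_chunks.append(transcribe_with_fixed_prompt(chunks, sr, original_prompt))
        # Variant B: prompt only sent with the first chunk, later chunks get none.
        hyp_first_only.append(" ".join(
            transcribe_chunk(chunk, sr, original_prompt if i == 0 else "")
            for i, chunk in enumerate(chunks)))
    return jiwer.wer(refs, hyp_all_chunks), jiwer.wer(refs, hyp_first_only)

# Illustrative long-form dataset choice; any ASR set with long audio and
# reference transcripts would do, and column names may differ.
dataset = load_dataset("distil-whisper/tedlium-long-form", split="validation")
wer_all, wer_first = evaluate_prompt_strategies(dataset.select(range(10)))
print(f"WER (prompt on all chunks): {wer_all:.3f} | WER (prompt on first only): {wer_first:.3f}")
```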
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.