Description
🚀 The feature, motivation and pitch
Starting with #19597, vLLM supports chunking audio longer than 30s when serving Whisper.
The current logic is fairly simple: the audio is chunked at semi-fixed intervals, looking for "silence" in a small window around the nominal chunk boundary.
The request is then executed concurrently, with the audio chunks batched together.
vllm/vllm/entrypoints/openai/serving_transcription.py, lines 215 to 226 (at cda9230)
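For reference, a minimal sketch of that strategy as I understand it (the helper names below, e.g. `find_split_point` and `chunk_audio`, are hypothetical and not the actual vLLM implementation): split roughly every 30s, nudging the boundary to the quietest sample inside a small search window.

```python
import numpy as np

def find_split_point(wav: np.ndarray, target: int, window: int) -> int:
    """Hypothetical helper: pick the lowest-amplitude sample index
    inside [target - window, target + window] as the chunk boundary."""
    lo, hi = max(0, target - window), min(len(wav), target + window)
    return lo + int(np.argmin(np.abs(wav[lo:hi])))

def chunk_audio(wav: np.ndarray, sr: int = 16_000,
                chunk_s: float = 30.0, window_s: float = 1.0) -> list[np.ndarray]:
    """Split audio into ~30s chunks, cutting at the quietest point
    near each nominal boundary (the "silence" heuristic)."""
    chunk_len, window = int(chunk_s * sr), int(window_s * sr)
    chunks, start = [], 0
    while start < len(wav):
        if len(wav) - start <= chunk_len:
            chunks.append(wav[start:])
            break
        split = find_split_point(wav, start + chunk_len, window)
        chunks.append(wav[start:split])
        start = split
    return chunks
```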
Hence there is no sequential dependency at the moment; in particular, the transcription of chunk_i is not fed as the prompt to chunk_i+1, which is the optimal strategy according to the Whisper paper.
In this regard, it would be nice to assess, on longer audio samples, whether feeding the original prompt to the chunks after the first one is actually beneficial to the quality of the generated output.
My understanding is that the prompt conditions the model on the text that appeared in the previous 30s, so forwarding the original prompt to every chunk may actually hurt the final transcription quality on a long enough input.
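To make the two conditioning strategies concrete, here is a hedged client-side sketch that emulates them without patching the server (it assumes the `chunk_audio` helper above and a vLLM OpenAI-compatible server; the endpoint URL and model name are placeholders): the sequential variant feeds each chunk's transcription as the decoder prompt for the next chunk, while the other variant forwards the same original prompt to every chunk, which is roughly the current behaviour.

```python
import io
import soundfile as sf
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder server
MODEL = "openai/whisper-large-v3"  # placeholder model name

def transcribe_chunk(chunk, sr: int, prompt: str) -> str:
    """Send one chunk to /audio/transcriptions with a decoder prompt."""
    buf = io.BytesIO()
    sf.write(buf, chunk, sr, format="WAV")
    result = client.audio.transcriptions.create(
        model=MODEL, file=("chunk.wav", buf.getvalue()), prompt=prompt)
    return result.text

def transcribe_sequential(chunks, sr: int, original_prompt: str = "") -> str:
    """Whisper-paper style: condition chunk i+1 on the transcription of chunk i."""
    texts, prompt = [], original_prompt
    for chunk in chunks:
        text = transcribe_chunk(chunk, sr, prompt)
        texts.append(text)
        prompt = text  # previous output becomes the next decoder prompt
    return " ".join(texts)

def transcribe_with_fixed_prompt(chunks, sr: int, original_prompt: str = "") -> str:
    """Behaviour under test: the same original prompt goes to every chunk."""
    return " ".join(transcribe_chunk(chunk, sr, original_prompt) for chunk in chunks)
```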
This task requires evaluating the error rate (e.g. WER) on longer sequences, similarly to what was done in https://github.com/vllm-project/vllm/blob/main/tests/entrypoints/openai/correctness/test_transcription_api_correctness.py, with and without the original prompt being forwarded to all chunks after the first one.
The chunk-batching code I am referring to is the serving_transcription.py snippet linked above (lines 215 to 226 at cda9230).
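For the evaluation itself, a hedged sketch along the lines of the linked correctness test (the dataset name and its column names are illustrative assumptions, and the helpers come from the sketches above): transcribe each long sample with the original prompt forwarded to every chunk vs. only to the first chunk, then compare word error rates against the reference text.

```python
import jiwer
from datasets import load_dataset

def evaluate_prompt_strategies(samples, sr: int = 16_000, original_prompt: str = ""):
    """Compare WER when the original prompt is forwarded to every chunk
    vs. only to the first one; reuses chunk_audio/transcribe_* sketched above."""
    refs, hyp_all_chunks, hyp_first_only = [], [], []
    for sample in samples:
        chunks = chunk_audio(sample["audio"]["array"], sr)
        refs.append(sample["text"])
        # Variant A: same original prompt sent with every chunk.
        hyp_all_chunks.append(transcribe_with_fixed_prompt(chunks, sr, original_prompt))
        # Variant B: prompt only sent with the first chunk, later chunks get none.
        hyp_first_only.append(" ".join(
            transcribe_chunk(chunk, sr, original_prompt if i == 0 else "")
            for i, chunk in enumerate(chunks)))
    return jiwer.wer(refs, hyp_all_chunks), jiwer.wer(refs, hyp_first_only)

# Illustrative long-form dataset choice; any ASR set with long audio and
# reference transcripts would do, and column names may differ.
dataset = load_dataset("distil-whisper/tedlium-long-form", split="validation")
wer_all, wer_first = evaluate_prompt_strategies(dataset.select(range(10)))
print(f"WER (prompt on all chunks): {wer_all:.3f} | WER (prompt on first only): {wer_first:.3f}")
```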
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.