---
description: 'Create responsive voice applications with end of turn detection'
keywords:
  [
    speechmatics,
    end of utterance,
    end of turn,
    transcription,
    speech recognition,
    asr,
    voice ai,
    conversation,
    turn-taking
  ]
toc_max_heading_level: 2
---
import CodeBlock from "@theme/CodeBlock";
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import eouStreamingPythonExample from "./assets/end-of-utterance-streaming-example.py"
import eouFilePythonExample from "./assets/end-of-utterance-file-example.py"

# Turn detection

Build responsive voice applications by detecting when users finish speaking.

## Benefits

- Create natural conversational experiences with proper turn-taking
- Reduce response latency in voice assistants and chatbots
- Improve user experience with timely system responses
- Enable more human-like interactions in voice applications

## Use cases

- **Voice AI** - Detect when to generate responses in conversational agents
- **Real-time translation** - Deliver translations as soon as speakers complete thoughts
- **Dictation** - Determine when users have finished speaking to finalize transcription

## How it works

A **turn**, or **utterance**, is a continuous piece of speech from a single speaker, typically separated by pauses. In conversation systems, detecting the end of an utterance helps determine when it's appropriate for another speaker (or AI system) to respond.

Speechmatics offers two complementary approaches to detect when a speaker has finished their turn:

1. **Silence-based detection** - Identifies pauses between speech
2. **Semantic detection** - Analyzes linguistic context to identify natural endpoints

## Silence-based detection

Detect natural pauses in speech by configuring the silence threshold in your transcription request.

### Configuration

Add the `end_of_utterance_silence_trigger` parameter to your [StartRecognition](/api-ref/realtime-transcription-websocket#startrecognition) message:

```json
{
  "type": "transcription",
  "transcription_config": {
    "conversation_config": {
      "end_of_utterance_silence_trigger": 0.5
    },
    "language": "en"
  }
}
```

The `end_of_utterance_silence_trigger` parameter specifies the silence duration (0-2s) that triggers end of utterance detection.

:::info
Setting `end_of_utterance_silence_trigger` to 0 disables detection.
:::
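
If you're driving the WebSocket yourself rather than using an SDK, this configuration is sent when the session opens. Below is a minimal sketch, assuming the Python `websockets` package; the endpoint URL, auth header, and audio format are placeholder assumptions to adapt from the [API reference](/api-ref/realtime-transcription-websocket):

```python
import asyncio
import json

import websockets  # assumed dependency: pip install websockets

API_KEY = "YOUR_API_KEY"                  # placeholder
URL = "wss://eu2.rt.speechmatics.com/v2"  # example endpoint; verify for your deployment

# StartRecognition message carrying the configuration shown above.
# The audio_format values are assumptions -- match them to your capture pipeline.
START_RECOGNITION = {
    "message": "StartRecognition",
    "audio_format": {"type": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
    "transcription_config": {
        "language": "en",
        "conversation_config": {
            # Wait 0.5s of silence before emitting EndOfUtterance
            "end_of_utterance_silence_trigger": 0.5,
        },
    },
}

async def main() -> None:
    # Note: on websockets < 14 this argument is named extra_headers
    async with websockets.connect(
        URL, additional_headers={"Authorization": f"Bearer {API_KEY}"}
    ) as ws:
        await ws.send(json.dumps(START_RECOGNITION))
        # ...stream audio with AddAudio messages and read transcripts here...

asyncio.run(main())
```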

### Recommended settings

- **Voice AI applications**: 0.5-0.8 seconds
- **Dictation applications**: 0.8-1.2 seconds

### Response format

When an end of utterance is detected, you'll receive:

1. A [`Final` transcript](/speech-to-text/real-time/quickstart#final-transcripts) message
2. An `EndOfUtterance` message

```json
{
  "message": "EndOfUtterance",
  "metadata": {
    "start_time": 2.07,
    "end_time": 2.07
  }
}
```
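
A minimal sketch of a receive loop that acts on these messages, assuming the Final transcript arrives as an `AddTranscript` message per the API reference; the `on_turn_end` callback is a hypothetical hook for your application logic:

```python
import json

async def listen(ws, on_turn_end) -> None:
    """Collect Final transcripts and fire on_turn_end at each EndOfUtterance."""
    utterance: list[str] = []
    async for raw in ws:
        msg = json.loads(raw)
        if msg["message"] == "AddTranscript":        # a Final transcript segment
            utterance.append(msg["metadata"]["transcript"])
        elif msg["message"] == "EndOfUtterance":     # the silence trigger fired
            on_turn_end("".join(utterance).strip())  # hand the completed turn off
            utterance.clear()                        # start collecting the next turn
```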

:::tip
- Keep `end_of_utterance_silence_trigger` lower than the `max_delay` value
- Messages are only sent after speech is recognized
- Duplicate messages are never sent for the same silence period
- Messages don't contain speaker information from [diarization](/speech-to-text/output-enhancements/diarization)
:::
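
For example, a configuration keeping the 0.5 s trigger safely below an illustrative `max_delay` of 1 second:

```json
{
  "transcription_config": {
    "language": "en",
    "max_delay": 1.0,
    "conversation_config": {
      "end_of_utterance_silence_trigger": 0.5
    }
  }
}
```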

## Semantic end of turn

For more natural conversations, combine silence detection with linguistic context analysis. This approach understands when a speaker has completed their thought based on the content of their speech.

Semantic end of turn detection is available through our [Flow service](/voice-agents-flow), which combines multiple signals for optimal turn detection:

- Silence duration
- Linguistic completeness
- Question detection
- Prosodic features

Try semantic end of turn detection with our free [Flow service demo](https://www.speechmatics.com/flow) or read our [implementation guide](https://blog.speechmatics.com/semantic-turn-detection).
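
Purely as an illustration of the idea (Flow's actual implementation combines the richer signals above), a client-side heuristic might accept an `EndOfUtterance` as a true end of turn only when the transcript looks linguistically complete:

```python
# Illustrative heuristic only -- the Flow service uses richer signals
# (prosody, question detection, linguistic models) than this sketch.
TRAILING_FILLERS = {"and", "but", "so", "or", "um", "uh", "because"}

def looks_complete(transcript: str) -> bool:
    """Rough guess at whether a turn's final transcript is a finished thought."""
    text = transcript.strip()
    words = text.rstrip(".!?").split()
    if not words:
        return False
    if words[-1].lower() in TRAILING_FILLERS:  # speaker trailed off mid-thought
        return False
    return text.endswith((".", "!", "?"))      # terminal punctuation suggests completion

def on_end_of_utterance(transcript: str) -> bool:
    """Treat the silence trigger as end of turn only if the text looks complete."""
    return looks_complete(transcript)
```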

## Code examples

<Tabs groupId="eou-examples">
<TabItem value="streaming" label="Python - Live streaming">

Real-time streaming from microphone - ideal for voice AI applications.
