SLT 2024 Challenge: Track-2 Post-ASR-Speaker-Tagging Baseline and Instructions for Track-2
Check out the leaderboard, which automatically evaluates both the dev set and the eval set.
Task-2 Speaker Tagging Leaderboard
SLT 2024 Challenge GenSEC Track 2: Post-ASR-Speaker-Tagging
- Track-2 is a challenge track that aims to correct the speaker tagging of ASR-generated transcripts tagged by a speaker diarization system.
- Traditional speaker diarization systems cannot take lexical cues into account, leading to errors that disrupt the context of human conversations.
- In the provided dataset, we refer to these erroneous transcripts as `err_source_text` (Error source text). Here is an example.
- Erroneous Original Transcript `err_source_text`:
[
{"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about well i"},
{"session_id":"session_gen1sec2", "start_time":13.32, "end_time":17.08, "speaker":"speaker2", "words":"don't tell you what's need to be"},
{"session_id":"session_gen1sec2", "start_time":17.11, "end_time":17.98, "speaker":"speaker1", "words":"discussed"},
{"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker2", "words":"because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":20.10, "end_time":21.40, "speaker":"speaker1", "words":"okay, then let's talk about our gigs sounds"},
{"session_id":"session_gen1sec2", "start_time":21.65, "end_time":23.92, "speaker":"speaker2", "words":"good do you have any specific ideas"}
]
- Note that the words `well i`, `discussed` and `sounds` are tagged with the wrong speakers.
- We expect Track-2 participants to generate the corrected speaker tagging.
- Corrected Transcript Example (hypothesis):
[
{"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"what should we talk about"},
{"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"well i don't tell you what's need to be discussed"},
{"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"okay then let's talk about our gigs"},
{"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"sounds good do you have any specific ideas"}
]
- Note that `start_time` and `end_time` cannot be estimated, so the timestamps are all assigned as `0.0`.
- Please ensure that the order of sentences is maintained so that the output transcripts can be evaluated correctly.
- Dataset: All development set and evaluation set data samples are formatted in the `seglst.json` format, which is a list containing dictionary variables with the keys specified above:
{
"session_id": str,
"start_time": float,
"end_time": float,
"speaker": str,
"words": str,
}
- The participants should use text (transcripts) as the only modality. We do not provide any speech (audio) signal for the transcripts.
- The participants are allowed to correct the words (e.g. `spk1:hi are wow` to `spk1:how are you`) without changing the speaker labels. That is, this involves Track-1 in a way.
- The participants are allowed to use any type of language model and methods.
  - The model does not need to be an instruct (chat-based) large language model such as GPTs or LLaMa.
  - There are no restrictions on the parameter size of the LLM.
  - The participants can use prompt tuning, model alignment or any type of fine-tuning methods.
  - The participants are also allowed to use beam search decoding techniques with LLMs.
- The submitted system output should be in session-by-session `seglst.json` format and is evaluated by the `cpwer` metric.
- The participants will submit two json files, for the dev and eval sets respectively:
  (1) `err_dev.hyp.seglst.json`
  (2) `err_eval.hyp.seglst.json`
- In each of `err_dev.hyp.seglst.json` and `err_eval.hyp.seglst.json`, there is only one list containing all 142 (dev) or 104 (eval) sessions, and each session is separated by the `session_id` key.
- Example of the final submission form `err_dev.hyp.seglst.json` and `err_eval.hyp.seglst.json`:
[
{"session_id":"session_abc123ab", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"well it is what it is"},
{"session_id":"session_abc123ab", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"yeah so be it"},
{"session_id":"session_xyz456cd", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"wow you are late again"},
{"session_id":"session_xyz456cd", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"sorry traffic jam"},
{"session_id":"session_xyz456cd", "start_time":0.0, "end_time":0.0, "speaker":"speaker3", "words":"hey how was the last night"}
]
The baseline system is based on the system proposed in Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach (we refer to this method as Contextual Beam Search (CBS)). Note that the Track-2 GenSEC challenge only allows the text modality, so this method injects placeholder probabilities represented by `peak_prob`.
The proposed CBS method brings the beam search technique used for ASR language models to speaker diarization.
In the CBS method, the following three probability values are needed:
- P(E|S): the speaker diarization posterior probability (given speaker S, acoustic observation E)
- P(W): the probability of the next word W
- P(S|W): the conditional probability of the speaker S given the next word W
Note that the CBS approach assumes that one word is spoken by one speaker. In this baseline system, a placeholder speaker probability `peak_prob` is added since we do not have access to an acoustic-only speaker diarization system.
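As a rough illustration of how these three terms could interact, the sketch below builds a placeholder speaker posterior from `peak_prob` and combines it with a lexical speaker probability in the log domain. The function names, the even split of the remaining probability mass, and the way `alpha` and `beta` enter the score are assumptions made for illustration (loosely modeled on shallow-fusion ASR decoding); they are not taken from the paper or from `pydiardecode`.

```python
import math

def placeholder_posterior(tagged_speaker, speakers, peak_prob=0.96):
    """Stand-in for P(E|S) when no acoustic diarization output is available.

    Assumption: the speaker tagged in the source transcript receives
    peak_prob and the remaining mass is split evenly among the others.
    """
    others = [s for s in speakers if s != tagged_speaker]
    if not others:
        return {tagged_speaker: 1.0}
    rest = (1.0 - peak_prob) / len(others)
    posterior = {s: rest for s in others}
    posterior[tagged_speaker] = peak_prob
    return posterior

def combined_log_score(p_acoustic, p_word, p_speaker_given_word,
                       alpha=0.7, beta=0.03):
    """Hypothetical log-domain score for assigning one word to one speaker."""
    return (
        math.log(p_acoustic)
        + alpha * math.log(p_speaker_given_word)
        + beta * math.log(p_word)
    )

speakers = ["speaker1", "speaker2"]
# The source transcript tagged the word with speaker1.
posterior = placeholder_posterior("speaker1", speakers)

# Made-up lexical probabilities: the language model strongly prefers speaker2.
p_word = 0.01
lexical = {"speaker1": 0.001, "speaker2": 0.999}

scores = {
    s: combined_log_score(posterior[s], p_word, lexical[s]) for s in speakers
}
best_speaker = max(scores, key=scores.get)
```

With these made-up numbers the lexical term outweighs the placeholder posterior, so the word is reassigned to `speaker2`. A lower `peak_prob` makes such reassignments easier, a higher one makes the decoder trust the original tags more.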
The following diagram explains how beam search decoding works with speaker diarization and ASR.
The overall data flow is shown as follows. Note that the speaker probability values are fixed.
Run the following commands at the top level of this repository.
The baseline system works in a conda environment with Python 3.10.
conda create --name llmspk python=3.10
You need to install the following packages
kenlm
arpa
numpy
hydra-core
meeteval
tqdm
requests
simplejson
pydiardecode @ git+https://github.com/tango4j/pydiardecode@main
Simply install all the requirements.
pip install -r requirements.txt
mkdir -p arpa_model
cd arpa_model
wget https://kaldi-asr.org/models/5/4gram_small.arpa.gz
gunzip 4gram_small.arpa.gz
Clone the dataset from the Hugging Face server.
git clone https://huggingface.co/datasets/GenSEC-LLM/SLT-Task2-Post-ASR-Speaker-Tagging
In the cloned folder, you will see the following folder structure.
.
├── err_source_text
│ ├── train
│ │ ├── session_aa01bbcc.seglst.json
│.
│..
│ │ ├── session_bb00bbcc.seglst.json
│ │ └── session_dd01bbcc.seglst.json
│ ├── dev
│ │ ├── session_02d73d95.seglst.json
│.
│..
│ │ ├── session_fcd0a550.seglst.json
│ │ └── session_ff16b903.seglst.json
│ └── eval
│ ├── session_0bea34fa.seglst.json
│..
│...
│ ├── session_f84edf1f.seglst.json
│ └── session_febfa7aa.seglst.json
├── ref_annotated_text
│ ├── train
│..
│ │ └── session_dd01bbcc.seglst.json
│ ├── dev
│ ├── session_0259446c.seglst.json
│..
│ └── eval
│..
│ │ └── session_ff16b903.seglst.json
The file counts are as follows:
- `err_source_text`: train 222 files, dev 13 files, eval 11 files
- `ref_annotated_text`: train 222 files, dev 13 files, eval 11 files (only accessible through leaderboard evaluations)
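Each session file above is expected to follow the `seglst.json` schema described earlier. A minimal sketch of a schema check is shown below; the helper names `validate_seglst` and `validate_file` are hypothetical and not part of the baseline repository, and the check assumes timestamps are stored as numbers, as the schema states.

```python
import json

# Keys and accepted types, following the schema given in this document.
REQUIRED_KEYS = {
    "session_id": str,
    "start_time": (int, float),
    "end_time": (int, float),
    "speaker": str,
    "words": str,
}

def validate_seglst(segments):
    """Return a list of human-readable problems; an empty list means OK."""
    if not isinstance(segments, list):
        return ["top-level JSON value must be a list"]
    problems = []
    for i, seg in enumerate(segments):
        if not isinstance(seg, dict):
            problems.append(f"entry {i}: not a dictionary")
            continue
        for key, expected in REQUIRED_KEYS.items():
            if key not in seg:
                problems.append(f"entry {i}: missing key '{key}'")
            elif not isinstance(seg[key], expected):
                found = type(seg[key]).__name__
                problems.append(f"entry {i}: '{key}' has type {found}")
    return problems

def validate_file(path):
    """Load one *.seglst.json file and validate its contents."""
    with open(path, encoding="utf-8") as f:
        return validate_seglst(json.load(f))
```

Running `validate_file` on every path under `err_source_text/dev` before launching a system is a cheap way to catch malformed inputs early.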
Run the following commands to construct the input list files err_dev.src.list and err_dev.ref.list.
find $PWD/SLT-Task2-Post-ASR-Speaker-Tagging/err_source_text/dev -maxdepth 1 -type f -name "*.seglst.json" > err_dev.src.list
find $PWD/SLT-Task2-Post-ASR-Speaker-Tagging/ref_annotated_text/dev -maxdepth 1 -type f -name "*.seglst.json" > err_dev.ref.list
For the eval set, construct `err_eval.src.list` and `err_eval.ref.list` in the same way.
find $PWD/SLT-Task2-Post-ASR-Speaker-Tagging/err_source_text/eval -maxdepth 1 -type f -name "*.seglst.json" > err_eval.src.list
find $PWD/SLT-Task2-Post-ASR-Speaker-Tagging/ref_annotated_text/eval -maxdepth 1 -type f -name "*.seglst.json" > err_eval.ref.list
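The list files are plain text with one path per line. For participants who produce corrected per-session files with their own system, the sketch below shows one way to merge such files into the single-list submission form described earlier, with timestamps set to `0.0`. The function names and the idea of a hypothesis list file (for example a hypothetical `err_dev.hyp.list`) are assumptions for illustration; the baseline script has its own output logic.

```python
import json
from pathlib import Path

def read_list_file(list_path):
    """Read one path per line, skipping blank lines, in sorted order."""
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    return sorted(line.strip() for line in lines if line.strip())

def merge_sessions(session_paths):
    """Concatenate per-session seglst files into one list.

    Segment order within each session is preserved, and timestamps are
    zeroed because they cannot be estimated from text alone.
    """
    merged = []
    for path in session_paths:
        with open(path, encoding="utf-8") as f:
            for seg in json.load(f):
                merged.append({
                    "session_id": seg["session_id"],
                    "start_time": 0.0,
                    "end_time": 0.0,
                    "speaker": seg["speaker"],
                    "words": seg["words"],
                })
    return merged

def write_submission(list_path, out_path):
    """Merge every session named in list_path and write one JSON list."""
    merged = merge_sessions(read_list_file(list_path))
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2)
    return len(merged)
```

For example, `write_submission("err_dev.hyp.list", "err_dev.hyp.seglst.json")` would write the dev submission, assuming the list file points at corrected session files.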
Now you are ready to launch the baseline script.
Launch the baseline script run_speaker_tagging_beam_search.sh
### Speaker Tagging Task-2 Parameters
BASEPATH=${PWD}
DIAR_LM_PATH=$BASEPATH/arpa_model/4gram_small.arpa
ASRDIAR_FILE_NAME=err_dev
OPTUNA_STUDY_NAME=speaker_beam_search_${ASRDIAR_FILE_NAME}
WORKSPACE=$BASEPATH/SLT-Task2-Post-ASR-Speaker-Tagging
INPUT_ERROR_SRC_LIST_PATH=$BASEPATH/$ASRDIAR_FILE_NAME.src.list
GROUNDTRUTH_REF_LIST_PATH=$BASEPATH/$ASRDIAR_FILE_NAME.ref.list
DIAR_OUT_DOWNLOAD=$WORKSPACE/$ASRDIAR_FILE_NAME
mkdir -p $DIAR_OUT_DOWNLOAD
python $BASEPATH/speaker_tagging_beamsearch.py \
asrdiar_file_name=$ASRDIAR_FILE_NAME \
hyper_params_optim=false \
arpa_language_model=$BASEPATH/arpa_model/4gram_small.arpa \
groundtruth_ref_list_path=$BASEPATH/$ASRDIAR_FILE_NAME.ref.list \
input_error_src_list_path=$BASEPATH/$ASRDIAR_FILE_NAME.src.list \
alpha=0.7378102172641824 \
beta=0.029893025590158093 \
beam_width=9 \
word_window=50 \
parallel_chunk_word_len=175 \
out_dir=$WORKSPACE \
peak_prob=0.96 || exit 1
The baseline system results in a cpWER of 0.2454, while the original erroneous source has a cpWER of 0.2465. You have successfully run the baseline system if you get the following text printed on your screen.
[YYYY-MM-DD HH:MM:SS,782][root][INFO] - -> HYPOTHESIS cpWER=0.2454
[YYYY-MM-DD HH:MM:SS,783][root][INFO] - -> SOURCE cpWER=0.2465
[YYYY-MM-DD HH:MM:SS,783][root][INFO] - -> Average cpWER DIFF=-0.0147
[YYYY-MM-DD HH:MM:SS,783][root][INFO] - -> HYPOTHESIS Improved cpWER=-0.0011
We use MeetEval software to evaluate cpWER.
cpWER measures both speaker tagging and word error rate (WER) by testing all the permutations of the transcripts and choosing the permutation that gives the lowest error.
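To make the permutation step concrete, here is a brute-force sketch of cpWER for a single session. It is an illustration only and not MeetEval's implementation: it concatenates each speaker's words, tries every assignment of hypothesis speakers to reference speakers, and keeps the assignment with the fewest word errors. The function names are hypothetical, and official scores should always come from `meeteval-wer`.

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def words_per_speaker(segments):
    """Concatenate each speaker's words in segment order."""
    streams = {}
    for seg in segments:
        streams.setdefault(seg["speaker"], []).extend(seg["words"].split())
    return streams

def cpwer(ref_segments, hyp_segments):
    """Brute-force cpWER for one session (exponential in speaker count)."""
    ref = list(words_per_speaker(ref_segments).values())
    hyp = list(words_per_speaker(hyp_segments).values())
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(ref), len(hyp))
    ref += [[] for _ in range(n - len(ref))]
    hyp += [[] for _ in range(n - len(hyp))]
    total_ref_words = sum(len(r) for r in ref)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(ref, perm))
        for perm in permutations(hyp)
    )
    return best_errors / total_ref_words
```

Because the best permutation is chosen, renaming `speaker1` to `speaker2` consistently throughout a session does not change the score, whereas a single word attached to the wrong speaker typically counts as one deletion plus one insertion.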
echo "Evaluating the original source transcript."
meeteval-wer cpwer -h $WORKSPACE/$ASRDIAR_FILE_NAME.src.seglst.json -r $WORKSPACE/$ASRDIAR_FILE_NAME.ref.seglst.json
echo "Source cpWER: " $(jq '.error_rate' $WORKSPACE/$ASRDIAR_FILE_NAME.src.seglst_cpwer.json)
echo "Evaluating the original hypothesis transcript."
meeteval-wer cpwer -h $WORKSPACE/$ASRDIAR_FILE_NAME.hyp.seglst.json -r $WORKSPACE/$ASRDIAR_FILE_NAME.ref.seglst.json
echo "Hypothesis cpWER: " $(jq '.error_rate' $WORKSPACE/$ASRDIAR_FILE_NAME.hyp.seglst_cpwer.json)
The cpWER result will be stored in the ./SLT-Task2-Post-ASR-Speaker-Tagging/err_dev.hyp.seglst_cpwer.json file.
cat ./SLT-Task2-Post-ASR-Speaker-Tagging/err_dev.hyp.seglst_cpwer.json
The result file contains a JSON dictionary. "error_rate" is the cpWER value we want to minimize.
{
"error_rate": 0.24536675570166427,
"errors": 5971,
"length": 24335,
"insertions": 842,
"deletions": 3454,
"substitutions": 1675,
"reference_self_overlap": null,
"hypothesis_self_overlap": null,
"missed_speaker": 0,
"falarm_speaker": 0,
"scored_speaker": 26,
"assignment": null
}
This is an example of GPT-based speaker tagging correction. The following text is the prompt fed into ChatGPT (GPT-4), including the example in this document.
- Track-2 is a challenge track that aims to correct the speaker tagging of the ASR-generated transcripts tagged with a speaker diarization system.
- Traditional speaker diarization systems cannot take lexical cues into account, leading to errors that disrupt the context of human conversations.
- In the provided dataset, we refer to these erroneous transcripts as `err_source_text` (Error source text). Here is an example.
- Erroneous Original Transcript `err_source_text`:
[
{"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about well i"},
{"session_id":"session_gen1sec2", "start_time":13.32, "end_time":17.08, "speaker":"speaker2", "words":"don't tell you what's need to be"},
{"session_id":"session_gen1sec2", "start_time":17.11, "end_time":17.98, "speaker":"speaker1", "words":"discussed"},
{"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker2", "words":"because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":20.10, "end_time":21.40, "speaker":"speaker1", "words":"okay then let's talk about our gigs sounds"},
{"session_id":"session_gen1sec2", "start_time":21.65, "end_time":23.92, "speaker":"speaker2", "words":"good do you have any specific ideas"},
]
Relocate the words to make the conversation looks and sounds natural.
- Do not add or remove words.
- Do not switch the speaker labels.
- Do not change speaker1 to speaker2 or speaker2 to speaker 1.
- Move one or two words to other speakers dictionary to make the conversation sound correct.
- The answer from GPT-4 (Tested on 05/01/2024).
To correct the conversation's flow in the transcript, I'll rearrange the words between the speakers while keeping the speakers consistent with the rules you provided. Here is the corrected version:
[
{"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about"},
{"session_id":"session_gen1sec2", "start_time":13.32, "end_time":17.08, "speaker":"speaker2", "words":"well i don't tell you what's need to be"},
{"session_id":"session_gen1sec2", "start_time":17.11, "end_time":17.98, "speaker":"speaker1", "words":"discussed"},
{"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker2", "words":"because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":20.10, "end_time":21.40, "speaker":"speaker1", "words":"okay then let's talk about our gigs"},
{"session_id":"session_gen1sec2", "start_time":21.65, "end_time":23.92, "speaker":"speaker2", "words":"sounds good do you have any specific ideas"}
]
Here are the changes I made:
I moved "well i" from speaker1's first utterance to the beginning of speaker2's first utterance to better connect the dialogue.
"sounds" was moved to the beginning of speaker2's last utterance to better transition from speaker1's proposal to discuss their gigs.
In this example, note that the words `well i` and `sounds` are corrected properly, while GPT-4 missed `discussed` and left it assigned to speaker1.
- The answer from Llama3 (Tested on 05/01/2024).
Here is the relocated transcript to make the conversation sound natural:
[
{"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about"},
{"session_id":"session_gen1sec2", "start_time":13.32, "end_time":15.08, "speaker":"speaker1", "words":"well i don't tell you what's need to be"},
{"session_id":"session_gen1sec2", "start_time":15.11, "end_time":17.08, "speaker":"speaker2", "words":"discussed because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker1", "words":"okay then let's talk about our gigs"},
{"session_id":"session_gen1sec2", "start_time":21.65, "end_time":22.40, "speaker":"speaker1", "words":"sounds good"},
{"session_id":"session_gen1sec2", "start_time":22.45, "end_time":23.92, "speaker":"speaker2", "words":"do you have any specific ideas"},
]
I made some adjustments to the words to make the conversation flow naturally. Let me know if this meets your requirements!
In this Llama3 example, note that the words `well i` and `sounds` are corrected, but nearly all sentences got swapped speaker labels.
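Since LLM answers like the two above can silently drop words or change labels, it may help to check them programmatically. The sketch below is a hypothetical helper, not part of the baseline: it verifies that the word sequence is unchanged and lists the words whose speaker label differs from the source.

```python
def word_speaker_pairs(segments):
    """Flatten segments into (word, speaker) pairs in transcript order."""
    return [
        (word, seg["speaker"])
        for seg in segments
        for word in seg["words"].split()
    ]

def compare_tagging(source, hypothesis):
    """Return (same_words, moved).

    same_words is True when both transcripts contain exactly the same
    word sequence. moved lists (word, source_speaker, new_speaker) for
    every word whose label changed; it is only filled when same_words
    is True, because positions are not comparable otherwise.
    """
    src = word_speaker_pairs(source)
    hyp = word_speaker_pairs(hypothesis)
    same_words = [w for w, _ in src] == [w for w, _ in hyp]
    moved = []
    if same_words:
        moved = [
            (word, src_spk, hyp_spk)
            for (word, src_spk), (_, hyp_spk) in zip(src, hyp)
            if src_spk != hyp_spk
        ]
    return same_words, moved

source = [
    {"speaker": "speaker1", "words": "what should we talk about well i"},
    {"speaker": "speaker2", "words": "don't tell you what's need to be"},
]
answer = [
    {"speaker": "speaker1", "words": "what should we talk about"},
    {"speaker": "speaker2", "words": "well i don't tell you what's need to be"},
]
same_words, moved = compare_tagging(source, answer)
```

Note that this check compares labels literally, so a consistent swap of `speaker1` and `speaker2` would be reported as many moved words even though cpWER, which searches over permutations, would not penalize it.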
@inproceedings{park2024enhancing,
  title={Enhancing speaker diarization with large language models: A contextual beam search approach},
  author={Park, Tae Jin and Dhawan, Kunal and Koluguri, Nithin and Balam, Jagadeesh},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={10861--10865},
  year={2024},
  organization={IEEE}
}




