Unofficial reference implementation and reproducible pipeline for
"Two-Stage Movie Script Summarization: An Efficient Method for Low-Resource Long Document Summarization" (CreativeSumm @ COLING 2022).
- Paper: https://aclanthology.org/2022.creativesumm-1.9/
- PDF: https://aclanthology.org/2022.creativesumm-1.9.pdf
- BibTeX: in the Citation section below
This repository implements a two-stage pipeline for summarizing movie scripts → movie plots:

- Stage A – Script Condensation (Heuristic Extraction): extract actions and salient dialogues from the screenplay format to drastically shorten the input while keeping the core narrative content (a minimal sketch follows this list).
- Stage B – Abstractive Summarization (LED + Efficient Fine-tuning): use the Longformer Encoder-Decoder (LED) with parameter-efficient strategies (BitFit, NoisyTune) to generate coherent plot summaries from the condensed script (see the setup sketch after the design note below).
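As a concrete example of Stage A, the sketch below keeps scene headings, character cues, and short action/dialogue lines while dropping transitions. The regexes, length threshold, and keep/drop rules are illustrative assumptions, not the exact heuristics from the paper.

```python
import re

SCENE = re.compile(r"^(INT\.|EXT\.)", re.IGNORECASE)     # scene headings
TRANSITION = re.compile(r"^(CUT TO:|FADE IN|FADE OUT)")  # editing directions
CUE = re.compile(r"^[A-Z][A-Z .'\-]+$")                  # character cue, e.g. "RIPLEY"

def condense(script: str, max_len: int = 220) -> str:
    """Keep scene headings, character cues, and short action/dialogue lines."""
    kept = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line or TRANSITION.match(line):
            continue  # drop blank lines and editing directions
        if SCENE.match(line) or CUE.match(line) or len(line) <= max_len:
            kept.append(line)
    return "\n".join(kept)
```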
Why this design?
Movie scripts are long (tens of thousands of tokens) and structurally idiosyncratic. The heuristic pass reduces length and noise; LED then focuses computation on what matters.
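To make Stage B concrete, here is a minimal sketch of the two parameter-efficient strategies on top of Hugging Face LED. The allenai/led-base-16384 checkpoint and the noise intensity are assumptions, not the paper's exact configuration.

```python
import torch
from transformers import LEDForConditionalGeneration

model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

# NoisyTune: perturb each pretrained weight with uniform noise scaled by
# that parameter's own standard deviation (lambda is an assumed intensity).
noise_lambda = 0.15
with torch.no_grad():
    for param in model.parameters():
        if param.numel() > 1:  # skip degenerate single-element parameters
            param.add_((torch.rand_like(param) - 0.5) * noise_lambda * param.std())

# BitFit: train only the bias terms; everything else stays frozen.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters")
```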
We release a simplified version of our code intended for quick experimentation with the approach; it is not meant to reproduce the exact results reported in the paper.
Python: 3.9–3.11
git clone https://github.com/<you>/two-stage-script-sum.git
cd two-stage-script-sum
# Create env (conda or venv)
python -m venv .venv && source .venv/bin/activate
# Install deps
pip install -r requirements.txt
# or, minimal:
pip install torch transformers datasets accelerate evaluate rouge-score nltk sentencepiece
python -c "import nltk; nltk.download('punkt')"- Use gradient accumulation for long contexts.
- Enable mixed precision (
--fp16) if your hardware supports it. - For especially long scripts, chunk condensed text into overlapping windows and merge beams.
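A minimal sketch of the chunking tip, assuming the public allenai/led-base-16384 checkpoint. The window and stride sizes are illustrative, and the per-window summaries are simply concatenated rather than merged with any beam-level heuristic.

```python
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

name = "allenai/led-base-16384"  # assumed public checkpoint
tok = LEDTokenizer.from_pretrained(name)
model = LEDForConditionalGeneration.from_pretrained(name).eval()

def summarize_long(text: str, window: int = 8192, stride: int = 6144) -> str:
    """Generate a summary per overlapping token window, then join the pieces."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    pieces = []
    for start in range(0, max(len(ids) - window + stride, 1), stride):
        chunk = ids[start:start + window].unsqueeze(0)
        global_mask = torch.zeros_like(chunk)
        global_mask[:, 0] = 1  # LED needs global attention on at least one token
        with torch.no_grad():
            out = model.generate(chunk, global_attention_mask=global_mask,
                                 num_beams=4, max_new_tokens=256)
        pieces.append(tok.decode(out[0], skip_special_tokens=True))
    return " ".join(pieces)
```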
- The two-stage pipeline substantially reduces token length before generation.
- With LED + BitFit/NoisyTune, you should see improvements over zero-shot LED baselines on standard automatic metrics; exact numbers depend on dataset split, preprocessing choices, and compute budget.
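For the automatic metrics above, ROUGE can be computed with the evaluate package from the install list; the prediction/reference strings here are placeholders.

```python
import evaluate  # backed by the rouge-score package installed above

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["generated plot summary ..."],  # model outputs (placeholders)
    references=["reference plot summary ..."],   # gold summaries
)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum F-measures
```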
If you use this repo, please cite the original paper:
Pu, Dongqi; Hong, Xudong; Lin, Pin-Jie; Chang, Ernie; Demberg, Vera (2022).
Two-Stage Movie Script Summarization: An Efficient Method For Low-Resource Long Document Summarization.
In Proceedings of the Workshop on Automatic Summarization for Creative Writing (CreativeSumm @ COLING 2022), pp. 57–66.
BibTeX
@inproceedings{pu-etal-2022-two,
title = {Two-Stage Movie Script Summarization: An Efficient Method For Low-Resource Long Document Summarization},
author = {Pu, Dongqi and Hong, Xudong and Lin, Pin-Jie and Chang, Ernie and Demberg, Vera},
editor = {Mckeown, Kathleen},
booktitle = {Proceedings of the Workshop on Automatic Summarization for Creative Writing},
month = {oct},
year = {2022},
address = {Gyeongju, Republic of Korea},
publisher = {Association for Computational Linguistics},
url = {https://aclanthology.org/2022.creativesumm-1.9/},
pages = {57--66}
}

- CreativeSumm 2022 Shared Task (overview & data pointers): https://creativesumm.github.io/sharedtask
- Longformer Encoder-Decoder (LED) model: available via Hugging Face Transformers
We thank the CreativeSumm organizers and the ACL community.
This repo builds on the open-source NLP ecosystem (PyTorch, Transformers, Datasets, Evaluate).
- Xudong Hong ([email protected]): open issues/PRs welcome!