# French Transitions Extractor

The French Transitions Extractor is a Streamlit-based Python application designed to streamline the extraction of structured linguistic data from `.docx` documents. Specifically, it identifies "triplets" composed of a preceding paragraph (`paragraph_a`), a transitional phrase (`transition`), and a succeeding paragraph (`paragraph_b`). It also compiles a comprehensive list of all unique transitional phrases encountered across the processed documents.

This tool is particularly useful for Natural Language Processing (NLP) tasks, linguistic analysis, or dataset generation where specific contextual examples of transitions are required.
## Features

- `.docx` Document Processing: Upload and parse Word documents.
- Article Structure Recognition: Identifies distinct articles based on a predefined header and marker-based structure.
- Triplet Extraction: Extracts `(paragraph_a, transition, paragraph_b)` triplets suitable for few-shot learning or fine-tuning language models.
- Unique Transition Listing: Compiles a sorted list of all unique transitional phrases found.
- Multiple Output Formats: Generates data in `.json`, `.jsonl`, and `.txt` formats for various use cases.
- Rejection Logging: Provides logs for examples or transitions that fail validation criteria during extraction.
- User-Friendly Interface: Intuitive web interface built with Streamlit for ease of use.
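The triplet structure described above can be sketched as a small Python function. This is an illustrative simplification, not the app's actual extraction logic: the transition phrase set and the helper name here are hypothetical, and the real code in `data_processing.py` also handles article headers, markers, and validation.

```python
# Illustrative sketch only -- the app's real logic lives in data_processing.py
# and additionally handles article structure and rejection logging.
TRANSITIONS = {"cependant", "en revanche", "par ailleurs"}  # hypothetical phrase set

def extract_triplets(paragraphs):
    """Return (paragraph_a, transition, paragraph_b) dicts from consecutive paragraphs."""
    triplets = []
    for i in range(1, len(paragraphs) - 1):
        candidate = paragraphs[i].strip().rstrip(",").lower()
        if candidate in TRANSITIONS:
            triplets.append({
                "paragraph_a": paragraphs[i - 1],
                "transition": paragraphs[i],
                "paragraph_b": paragraphs[i + 1],
            })
    return triplets

paras = ["Il pleuvait hier.", "Cependant", "Le match a eu lieu."]
print(extract_triplets(paras)[0]["transition"])  # → Cependant
```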
## Getting Started

Follow these instructions to set up and run the application on your local machine.

### Prerequisites

Before you begin, ensure you have the following installed:

- Python 3.8+
- Git
### Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/elvis07jr/streamlit-transition-extractor.git
   cd streamlit-transition-extractor
   ```

2. Create and activate a virtual environment (recommended):

   ```bash
   python -m venv venv
   # On Windows:
   .\venv\Scripts\activate
   # On macOS/Linux:
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
## Project Structure

The project is organized as follows:

```
French-Transitions-Extractor/
├── app.py               # Main Streamlit application
├── data_processing.py   # Contains logic for parsing DOCX and extracting articles/triplets
├── utils.py             # Helper functions for generating various output file formats
├── requirements.txt     # List of Python dependencies
└── README.md            # This README file
```
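For the curious: a `.docx` file is just a zip archive whose body text lives in `word/document.xml`, so paragraph text can be pulled out with the standard library alone. The sketch below is illustrative only; `data_processing.py` may instead rely on a dedicated parser such as `python-docx`, and the function name here is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(path):
    """Return the non-empty paragraph texts of a .docx file (stdlib-only sketch)."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{W_NS}p"):          # each <w:p> is a paragraph
        text = "".join(t.text or "" for t in p.iter(f"{W_NS}t"))  # <w:t> holds the runs' text
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```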
## Usage

1. Place your `.docx` files: Ensure the Word documents you wish to process (e.g., `word10.docx`, `word374.docx`) are accessible on your computer. You can place them directly in the project directory for convenience, or browse to their location during upload.

2. Run the Streamlit application: With your virtual environment activated, navigate to the project's root directory in your terminal and run:

   ```bash
   streamlit run app.py
   ```

   This will open the application in your default web browser.

3. Process documents:
   - In the web interface, use the "1. Upload Your Document" section to upload a `.docx` file.
   - Click "2. Start Extraction" to initiate the data processing.
   - Review the "3. Extraction Summary" for a high-level overview of the extracted examples.
   - In "4. Generate and Save Outputs," select the desired output file formats (e.g., `fewshot_examples.json`, `transitions_only.txt`).
   - Click "Generate and Save Selected Outputs" to download the files to your local machine.
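As a rough illustration of what the generated outputs contain, the sketch below writes triplets to `.json` and `.jsonl` files and the unique transitions to a sorted `.txt` list. The function name and signature are hypothetical and not taken from `utils.py`; only the file names match those shown in the interface.

```python
import json

# Illustrative sketch of output generation; the actual helpers in utils.py
# may have different names and signatures.
def save_outputs(triplets, json_path="fewshot_examples.json",
                 jsonl_path="fewshot_examples.jsonl",
                 txt_path="transitions_only.txt"):
    """Write triplets as JSON and JSONL, plus a sorted list of unique transitions."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(triplets, f, ensure_ascii=False, indent=2)
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for t in triplets:           # JSONL: one JSON object per line
            f.write(json.dumps(t, ensure_ascii=False) + "\n")
    transitions = sorted({t["transition"] for t in triplets})
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(transitions) + "\n")
    return transitions
```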
## Contact

Compiled by: https://github.com/elvis07jr

Email: [email protected]
## License

This project is licensed under the MIT License. See the `LICENSE` file (if present in the repository) for details.