Commit e2e16f9

Merge pull request #6 from jeremymanning/main
Repository reorganization and training improvements
2 parents 01b142c + 542a37a commit e2e16f9

File tree

18 files changed (+623, -701680 lines)


.github/workflows/tests.yml

Lines changed: 2 additions & 2 deletions

@@ -63,11 +63,11 @@ jobs:
       - name: Test figure generation (single)
         run: |
-          python generate_figures.py --figure 1a --data tests/data/test_model_results.pkl --output tests/output_single
+          python code/generate_figures.py --figure 1a --data tests/data/test_model_results.pkl --output tests/output_single

       - name: Test figure generation (all)
         run: |
-          python generate_figures.py --data tests/data/test_model_results.pkl --output tests/output_all
+          python code/generate_figures.py --data tests/data/test_model_results.pkl --output tests/output_all
         timeout-minutes: 5

       - name: Upload test artifacts

README.md

Lines changed: 14 additions & 13 deletions

@@ -21,15 +21,16 @@ llm-stylometry/
 │   ├── utils/                        # Helper utilities
 │   ├── visualization/                # Plotting and visualization
 │   └── cli_utils.py                  # CLI helper functions
-├── code/                             # Original analysis scripts
-│   ├── main.py                       # Model training script
+├── code/                             # Training and CLI scripts
+│   ├── generate_figures.py           # Main CLI entry point
+│   ├── consolidate_model_results.py  # Result consolidation
+│   ├── main.py                       # Model training orchestration
 │   ├── clean.py                      # Data preprocessing
-│   └── ...                           # Various analysis scripts
+│   └── ...                           # Supporting training modules
 ├── data/                             # Datasets and results
 │   ├── raw/                          # Original texts from Project Gutenberg
 │   ├── cleaned/                      # Preprocessed texts by author
-│   ├── model_results.pkl             # Consolidated model training results
-│   └── model_results.csv             # Model results in CSV format
+│   └── model_results.pkl             # Consolidated model training results
 ├── models/                           # Trained models (80 total)
 │   └── {author}_tokenizer=gpt2_seed={0-9}/
 ├── paper/                            # LaTeX paper and figures

@@ -40,7 +41,6 @@ llm-stylometry/
 │   ├── data/                         # Test data and fixtures
 │   ├── test_*.py                     # Test modules
 │   └── check_outputs.py              # Output validation script
-├── generate_figures.py               # Main CLI entry point
 ├── run_llm_stylometry.sh             # Shell wrapper for easy setup
 ├── LICENSE                           # MIT License
 ├── README.md                         # This file

@@ -168,16 +168,17 @@ fig = generate_all_losses_figure(
 **Note**: Training requires a CUDA-enabled GPU and takes significant time (~80 models total).

 ```bash
-# Using the CLI (recommended)
+# Using the CLI (recommended - handles all steps automatically)
 ./run_llm_stylometry.sh --train
-
-# Or manually
-conda activate llm-stylometry
-python code/clean.py                 # Clean data
-python code/main.py                  # Train models
-python consolidate_model_results.py  # Consolidate results
 ```

+This command will:
+1. Clean and prepare the data if needed
+2. Train all 80 models (8 authors × 10 seeds)
+3. Consolidate results into `data/model_results.pkl`
+
+The training pipeline automatically handles data preparation, model training across available GPUs, and result consolidation. Individual model checkpoints and loss logs are saved in the `models/` directory.

 ### Model Configuration

 Each model uses:

code/all_losses.py

Lines changed: 0 additions & 121 deletions
This file was deleted.

code/confusion_matrix.py

Lines changed: 0 additions & 35 deletions
This file was deleted.
