LLMjp v4 Midtraining (ref: OLMo2 midtraining) #79

Merged: 33 commits, merged on Jun 12, 2025

Commits
8684527
Add README.md for v4 midtraining
koshieguchi May 5, 2025
0a25b7c
Add dolmino uncompression script
so298 May 5, 2025
79823e8
Add v4-high-quality-abci training recipe (will fix later)
koshieguchi May 5, 2025
a357ec4
Update midtrain dir for olmo2-midtraining
koshieguchi May 5, 2025
c272503
Fix task directory
koshieguchi May 5, 2025
273ed44
Rename uncompress to extract
so298 May 5, 2025
bd99856
Fix extract.sh
so298 May 7, 2025
54f109e
Add merge file script
so298 May 7, 2025
1170220
Add tokenize.sh
so298 May 9, 2025
479d13e
Update scripts
koshieguchi May 9, 2025
71bee41
Fix merge_files.sh
so298 May 9, 2025
046449d
Update preprocess/build_train_data.sh
koshieguchi May 9, 2025
73582af
Add extracted train_data.sh
koshieguchi May 9, 2025
c766d8f
Fix merge
so298 May 9, 2025
f32a683
Update train_data
koshieguchi May 9, 2025
b9bdabd
Update iterations
koshieguchi May 9, 2025
aeb5ab0
Update README.md
koshieguchi May 9, 2025
dcdabbd
Update iterations
koshieguchi May 9, 2025
8d5136c
Change `environments` to `environment`
koshieguchi May 9, 2025
a3264b8
Update parameter files
koshieguchi May 12, 2025
6244a5c
Set up 1.3b midtraining script
so298 May 12, 2025
dee9ea5
Fix checkpoint convert script
koshieguchi May 13, 2025
fa2134c
Add wandb
koshieguchi May 14, 2025
07f8cc5
Add 50B, 100B, and 300B data
koshieguchi May 15, 2025
9ba82f4
Add dataset_size
koshieguchi May 15, 2025
f2498b2
Add script with job dependency
so298 May 15, 2025
19f3e20
Add dataset size option
so298 May 15, 2025
925e8ac
Add train_iters.txt to gitignore
so298 May 15, 2025
9b0902a
Update optimizer settings
koshieguchi May 16, 2025
1d8e96e
Update convert script
koshieguchi May 16, 2025
e526432
Update scripts
so298 May 28, 2025
96bb339
Add mergekit config
so298 May 28, 2025
bfe460f
Update README.md
koshieguchi May 29, 2025
145 changes: 145 additions & 0 deletions pretrain/scripts/v4-midtraining/README.md
@@ -0,0 +1,145 @@
# LLMjp-v4 Midtraining

## Overview

This experiment reproduces the midtraining stage of OLMo2 with the LL-jp-4-en model.

## Dataset proportions

Total token count: 55,797,411,281 tokens

| Datasets | Tokens | Source(%) | Mix(%) | Original OLMo2 Mix (%) |
|---------------|----------------|-----------|--------|------------------------|
| DCLM | 26,540,912,669 | 3.23% | 47.57% | 47.20% |
| FLAN | 9,242,742,021 | 50.00% | 16.56% | 16.60% |
| peS2o | 3,236,969,300 | 5.15% | 5.80% | 5.85% |
| Wikipedia | 3,896,965,449 | 100.00% | 6.98% | 7.11% |
| Stackexchange | 1,464,772,187 | 100.00% | 2.63% | 2.45% |
| Math | 11,415,049,655 | 100.00% | 20.46% | 20.80% |
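
Mix (%) is each dataset's share of the total token count; for DCLM, 26,540,912,669 / 55,797,411,281 ≈ 47.57%. Source (%) is presumably the fraction of the original source dataset sampled into the mix (100% meaning the full source is used).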

### Tokenization

```bash
export EXP_DIR="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/"
export EXP_SCRIPT_DIR="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining"
cd $EXP_DIR

# 1. Download dolmino-mix-1124 from Hugging Face
huggingface-cli download allenai/dolmino-mix-1124 --local-dir "$EXP_DIR/dolmino-mix-1124"

cd $EXP_SCRIPT_DIR
# 2. Extract the dataset (into `$EXP_DIR/dolmino-mix-1124-extracted`)
bash ./preprocess/extract.sh

# 3. Merge the dataset files (merged files are written to `$EXP_DIR/dolmino-mix-1124-extracted-merged`)
qsub ./preprocess/merge_files.sh

# (after step 3 finishes)
# 4. Tokenize the dataset (tokenized files are written to `$EXP_DIR/dolmino-mix-1124-tokenized`)
qsub ./preprocess/tokenize.sh

# (optional) Remove intermediate files
rm -rf $EXP_DIR/dolmino-mix-1124-extracted $EXP_DIR/dolmino-mix-1124-extracted-merged
```
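
For reference, the sketch below shows the kind of per-shard command that `preprocess/tokenize.sh` presumably wraps, assuming Megatron-LM's `tools/preprocess_data.py` and the llm-jp tokenizer in Hugging Face format; the `ENV_DIR` variable, paths, and options here are illustrative, not taken from the actual script.

```bash
# Hypothetical sketch only: tokenize one merged JSONL shard with Megatron-LM.
# The real paths, tokenizer settings, and parallelism live in preprocess/tokenize.sh.
ENV_DIR=/path/to/megatron/environment  # assumed install target of run_setup.sh
python ${ENV_DIR}/src/Megatron-LM/tools/preprocess_data.py \
    --input ${EXP_DIR}/dolmino-mix-1124-extracted-merged/dclm/part-00.jsonl \
    --output-prefix ${EXP_DIR}/dolmino-mix-1124-tokenized/dclm/part-00 \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model ${ENV_DIR}/src/llm-jp-tokenizer/hf/ver3.0/llm-jp-tokenizer-100k.ver3.0b2 \
    --append-eod \
    --workers 64
```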

### Building the training data

Tokenization must be completed before the training data files are built.

```sh
# Create ./tasks/v4-dolmino-mix-1124/train_data.all.sh
# Token counts are computed automatically and each "TOKEN_COUNT PATH" entry is written to train_data.all.sh
./preprocess/build_train_data.sh

# Create ./tasks/v4-dolmino-mix-1124/train_data_50B.sh from ./tasks/v4-dolmino-mix-1124/train_data.all.sh
# Token counts are scaled so the dataset totals 50B tokens with the same mixture ratios as the dolmino midtraining mix
./preprocess/update_train_data_to_50B.sh
# The 100B and 300B variants are built the same way
```
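
As an illustration of the format (the token counts and paths below are made up), each entry written to `train_data*.sh` pairs a token count with a tokenized dataset path, matching the weighted-blend form that Megatron-LM's `--data-path` accepts:

```sh
# Hypothetical excerpt of tasks/v4-dolmino-mix-1124/train_data_50B.sh
# (token counts and paths are illustrative only).
TRAIN_DATA_PATH=""
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 23785000000 ${EXP_DIR}/dolmino-mix-1124-tokenized/dclm/part-00_text_document"
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 8280000000 ${EXP_DIR}/dolmino-mix-1124-tokenized/flan/part-00_text_document"
```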

## Environment setup

ref: [scripts/pretrain/installers/v4-megatron-abci at 0130-instruct-pretrain · llm-jp/scripts](https://github.com/llm-jp/scripts/tree/0130-instruct-pretrain/pretrain/installers/v4-megatron-abci)

```sh
cd /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/install-scripts/pretrain/installers/v4-megatron-abci
bash run_setup.sh /path/to/target_dir
# example:
# bash run_setup.sh /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/environment
```

> [!CAUTION]
> Transformer Engine v1.10 or later raises an error, so environment2 (with Transformer Engine downgraded to v1.9) is used for this experiment.
> ref: https://docs.nvidia.com/nemo-framework/user-guide/24.07/knownissues.html
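
A minimal sketch of one way such a downgrade could be pinned inside the training venv; the actual procedure used for environment2 is not part of this PR, and building Transformer Engine from source requires the CUDA modules to be loaded:

```sh
# Hypothetical: rebuild Transformer Engine at v1.9 inside the existing venv.
source /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/environment2/venv/bin/activate
git clone --branch v1.9 --recursive https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
NVTE_FRAMEWORK=pytorch pip install --no-build-isolation .
```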

> [!CAUTION]
> `weights_only=False` was added to line 72 of `environment/src/Megatron-LM/megatron/core/dist_checkpointing/strategies/common.py`.
> ref: https://github.com/huggingface/accelerate/issues/3539
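
A sketch of applying the same one-line patch non-interactively, assuming line 72 of `common.py` contains the failing `torch.load(...)` call; inspect the line first, since the `sed` pattern simply rewrites the first `)` on it:

```sh
# Hypothetical: append weights_only=False to the torch.load(...) call on line 72.
sed -n '72p' environment/src/Megatron-LM/megatron/core/dist_checkpointing/strategies/common.py
sed -i '72s/)/, weights_only=False)/' environment/src/Megatron-LM/megatron/core/dist_checkpointing/strategies/common.py
```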


## Running jobs

```sh
cd /path/to/v4-midtraining

# example:
# 1.3b-llama3-ecjk
bash midtrain/run_train.sh $(realpath tasks/v4-dolmino-mix-1124) 1.3b-llama3-ecjk 50B 16

# 7.7b-llama3-ecjk
bash midtrain/run_train.sh $(realpath tasks/v4-dolmino-mix-1124) 7.7b-llama3-ecjk 50B 16
```
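
The positional arguments appear to be TASK_DIR, PARAM_NAME, DATASET_SIZE, and the number of nodes (the trailing `16` above); expressed generically (the placeholder names below are descriptive, not taken from the script):

```sh
# Generic form of the invocation (placeholder names are descriptive only)
bash midtrain/run_train.sh <TASK_DIR> <PARAM_NAME> <DATASET_SIZE: 50B|100B|300B> <NUM_NODES>
```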

### [Optional] Running jobs with dependencies

A helper script uses qsub's `-W depend=...` option to submit jobs with inter-job dependencies.
Run `run_train_with_deps.sh` instead of `run_train.sh`.

```sh
# Pass the value for `-W depend=` as the last argument
bash midtrain/run_train_with_deps.sh $(realpath tasks/v4-dolmino-mix-1124) 7.7b-llama3-ecjk 50B 16 afterok:xxxx.pbs1:yyyy.pbs1
```

See `man qsub` on ABCI 3.0 for the full dependency syntax.
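
For example, a 100B run could be chained behind a 50B run roughly as follows, assuming the submission wrapper prints the PBS job ID of the training job it submits (check the actual output of the scripts and `man qsub` for the exact format):

```sh
# Hypothetical chaining sketch: start the 100B job only after the 50B job succeeds.
job_50b=$(bash midtrain/run_train.sh $(realpath tasks/v4-dolmino-mix-1124) 7.7b-llama3-ecjk 50B 16)
bash midtrain/run_train_with_deps.sh $(realpath tasks/v4-dolmino-mix-1124) 7.7b-llama3-ecjk 100B 16 "afterok:${job_50b}"
```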

## Checkpoint conversion

> [!CAUTION]
> Before running the script below, remove `--no-load-optim` from the parameter files under `scripts/pretrain/scripts/v4-midtraining/midtrain/params`.

```sh
cd /path/to/v4-midtraining

bash convert/convert_latest.sh {TASK_DIR} {PARAM_NAME} {DATASET_SIZE}

# example:
bash convert/convert_latest.sh $(realpath tasks/v4-dolmino-mix-1124) 1.3b-llama3-ecjk 50B
```

> [!CAUTION]
> The following code was added at the top of `/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/environment2/src/Megatron-LM/tools/checkpoint/loader_mcore.py`:
> ```
> import json, os, sys, torch, functools
> torch.load = functools.partial(torch.load, weights_only=False)
> ```

## Model soup

Models are merged using [arcee-ai/mergekit](https://github.com/arcee-ai/mergekit).

The environment for model merging is prepared at `$EXP_DIR/venv-mergekit`.

```sh
source $EXP_DIR/venv-mergekit/bin/activate

# Install mergekit on first use
pip install mergekit
```

Merge configuration files are placed under `./merge/`.

Run the merge with:

```sh
mergekit-yaml merge/your_config.yaml model/output/path/
```
23 changes: 23 additions & 0 deletions pretrain/scripts/v4-midtraining/convert/convert_latest.sh
@@ -0,0 +1,23 @@
#!/bin/bash

# LLM-jp v4 model converter (PBS version)
# Usage:
#   bash convert_latest.sh \
#     /path/to/task \       ... TASK_DIR: path to the task directory where the model is saved
#     1.3b-llama3-ecjk \    ... PARAM_NAME: model config; a corresponding file in `params/` must exist
#     50B                   ... DATASET_SIZE: 50B, 100B, or 300B

set -eu -o pipefail

task_dir=$1; shift
param_name=$1; shift
dataset_size=$1; shift # 50B or 100B or 300B
iter=$(cat ${task_dir}/${param_name}/${dataset_size}/checkpoints/latest_checkpointed_iteration.txt)

script_root=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining

qsub \
-v TASK_DIR=${task_dir},PARAM_NAME=${param_name},DATASET_SIZE=${dataset_size},ITER=${iter},RTYPE=rt_HF \
-m n \
-o /dev/null \
-e /dev/null \
${script_root}/convert/qsub_convert.sh
154 changes: 154 additions & 0 deletions pretrain/scripts/v4-midtraining/convert/qsub_convert.sh
@@ -0,0 +1,154 @@
#!/bin/bash
#PBS -P gcg51557
#PBS -q R9920251000
#PBS -N 0156_convert
#PBS -l select=1
#PBS -o /dev/null
#PBS -e /dev/null
#PBS -m n

cd $PBS_O_WORKDIR

JOBID=${PBS_JOBID%%.*}
mkdir -p ${TASK_DIR}/logs
LOGFILE=${TASK_DIR}/logs/convert-$JOBID.out
ERRFILE=${TASK_DIR}/logs/convert-$JOBID.err
exec > $LOGFILE 2> $ERRFILE

set -eu -o pipefail

# Arguments
EXPERIMENT_DIR=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction
SCRIPT_DIR=${EXPERIMENT_DIR}/scripts/pretrain/scripts/v4-midtraining/midtrain
# ENV_DIR=${EXPERIMENT_DIR}/environment2
ENV_DIR=${EXPERIMENT_DIR}/environment3
echo "EXPERIMENT_DIR=${EXPERIMENT_DIR}"
echo "SCRIPT_DIR=${SCRIPT_DIR}"
echo "TASK_DIR=${TASK_DIR}"
echo "PARAM_NAME=${PARAM_NAME}"
echo "DATASET_SIZE=${DATASET_SIZE}"
echo "ITER=${ITER}"

# Setup environment
source ${SCRIPT_DIR}/common/setup.sh

export MASTER_ADDR=$(head -n 1 $PBS_NODEFILE | hostname -f)
export MASTER_PORT=$((10000 + RANDOM % 1000))
echo "hostname: ${MASTER_ADDR}"

ITER_NAME=iter_$(printf %07d ${ITER}) # iter_0123456

MEGATRON_PATH=${ENV_DIR}/src/Megatron-LM
TOKENIZER_MODEL_PATH=${ENV_DIR}/src/llm-jp-tokenizer/hf/ver3.0/llm-jp-tokenizer-100k.ver3.0b2
OUTPUT_DIR=${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/checkpoints_hf/${ITER_NAME}
echo "OUTPUT_DIR=${OUTPUT_DIR}"

# Setup working directory
TEMP_DIR=$(mktemp -d "${HOME}/converter_${JOBID}_XXXXXX")
echo "TEMP_DIR=${TEMP_DIR}"
function rm_tempdir {
    if [ -e ${TEMP_DIR} ]; then
        echo "Removing temporary directory: ${TEMP_DIR}"
        rm -rf ${TEMP_DIR}
        echo "Done removing"
    fi
}
trap rm_tempdir EXIT
trap 'trap - EXIT; rm_tempdir; exit 1' INT PIPE TERM

########
# Step 1: Convert `torch_dist` format to `torch`
# This step requires launching the trainer script with the same parallelism configuration.
########
echo "Start converting: torch_dist --> torch"

# Prepare source model at specific iteration
mkdir ${TEMP_DIR}/torch_dist
echo ${ITER} > ${TEMP_DIR}/torch_dist/latest_checkpointed_iteration.txt
ln -s ${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/checkpoints/${ITER_NAME} ${TEMP_DIR}/torch_dist/${ITER_NAME}

# Load ALL_PARAMS
source ${SCRIPT_DIR}/params/${PARAM_NAME}.sh
# Remove wandb params
EXCLUDE_KEYS=("--wandb-entity" "--wandb-project" "--wandb-exp-name")
NEW_PARAMS=()
skip_next=0
for param in "${ALL_PARAMS[@]}"; do
    if [[ $skip_next -eq 1 ]]; then
        skip_next=0
        continue
    fi
    for key in "${EXCLUDE_KEYS[@]}"; do
        if [[ "$param" == "$key" ]]; then
            skip_next=1
            continue 2
        fi
    done
    NEW_PARAMS+=("$param")
done
ALL_PARAMS=("${NEW_PARAMS[@]}")

# Add params specific to model conversion
ALL_PARAMS+=(
--load ${TEMP_DIR}/torch_dist
--ckpt-convert-format torch
--ckpt-convert-save ${TEMP_DIR}
)
echo "ALL_PARAMS: ${ALL_PARAMS[@]}"

NUM_NODES=$(wc -l < $PBS_NODEFILE)
NUM_GPUS_PER_NODE=8
NUM_GPUS=$((${NUM_NODES} * ${NUM_GPUS_PER_NODE}))
echo "nnodes: ${NUM_NODES}; ngpus: ${NUM_GPUS}"
echo NUM_NODES=$NUM_NODES
echo NUM_GPUS_PER_NODE=$NUM_GPUS_PER_NODE
echo NUM_GPUS=$NUM_GPUS

export NVTE_FUSED_ATTN=0
# Launch trainer script to convert the checkpoint
mpirun \
--display-allocation \
--report-bindings \
--oversubscribe \
-np ${NUM_GPUS} \
--npernode ${NUM_GPUS_PER_NODE} \
-bind-to none \
-map-by slot \
python ${MEGATRON_PATH}/pretrain_gpt.py \
${ALL_PARAMS[@]}

echo "Files created by Step 1:"
find ${TEMP_DIR}/torch | sort

########
# Step 2: Convert `torch` to `Hugging Face Llama2`
########

echo "Start converting: torch --> hf"

python ${MEGATRON_PATH}/tools/checkpoint/convert.py \
--model-type GPT \
--loader mcore \
--saver llmjp4_hf \
--load-dir ${TEMP_DIR}/torch \
--save-dir ${OUTPUT_DIR} \
--hf-tokenizer-path ${TOKENIZER_MODEL_PATH} \
--save-dtype bfloat16 \
--loader-transformer-impl transformer_engine \
--megatron-path ${MEGATRON_PATH}

echo "Files created by Step 2:"
find ${OUTPUT_DIR} | sort

########
# Step 3: Replace tokenizer model
########

echo "Start replacing tokenizer"

cp ${TOKENIZER_MODEL_PATH}/* ${OUTPUT_DIR}

echo "Final model files:"
find ${OUTPUT_DIR} | sort

echo "Done processing"
@@ -0,0 +1,14 @@
# Merge configuration for the 1.7B model with a fixed 3e-5 learning rate, at iteration 1866317

merge_method: linear
models:
- model: /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/checkpoints_hf/3e-5-fix/seed42/iter_1866317/
parameters:
weight: 1.0
- model: /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/checkpoints_hf/3e-5-fix/seed666/iter_1866317/
parameters:
weight: 1.0
- model: /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/checkpoints_hf/3e-5-fix/seed42069/iter_1866317/
parameters:
weight: 1.0
dtype: bfloat16
32 changes: 32 additions & 0 deletions pretrain/scripts/v4-midtraining/midtrain/common/setup.sh
@@ -0,0 +1,32 @@
# Script to set up the trainer environment.

source /etc/profile.d/modules.sh
# module load cuda/12.1/12.1.1
module load cuda/12.4/12.4.1
module load cudnn/9.5/9.5.1
module load hpcx/2.20
# module load nccl/2.23/2.23.4-1
module load nccl/2.25/2.25.1-1
# echo $(module list)
loaded=$(module -t list 2>&1)
echo "-----"
echo "Modules: $loaded"
echo "-----"

# ENV_DIR=${EXPERIMENT_DIR}/environments
# ENV_DIR=${EXPERIMENT_DIR}/environment2
ENV_DIR=${EXPERIMENT_DIR}/environment3

source ${ENV_DIR}/venv/bin/activate
# source ${ENV_DIR}/scripts/environment.sh # ADD

## Debug/logging flags
export LOGLEVEL=INFO
# export NCCL_DEBUG=WARN
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=WARN
export PYTHONFAULTHANDLER=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_LAUNCH_BLOCKING=0
export CUDNN_LOGDEST_DBG=stderr
export CUDNN_LOGERR_DBG=1