
Commit d39d900

koshieguchi and so298 authored
LLMjp v4 Midtraining (ref: OLMo2 midtraining) (#79)
Co-authored-by: Sosuke Hosokawa <[email protected]>
1 parent cb0bb7e commit d39d900

28 files changed: +1403 -0 lines changed
Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
# LLMjp-v4 Midtraining

## Overview

Experiments to reproduce OLMo2's midtraining with the LL-jp-4-en model.

## Dataset proportions

Total token count: 55,797,411,281 tokens

| Datasets      | Tokens         | Source (%) | Mix (%) | Original OLMo2 Mix (%) |
|---------------|----------------|------------|---------|------------------------|
| DCLM          | 26,540,912,669 | 3.23%      | 47.57%  | 47.20%                 |
| FLAN          | 9,242,742,021  | 50.00%     | 16.56%  | 16.60%                 |
| peS2o         | 3,236,969,300  | 5.15%      | 5.80%   | 5.85%                  |
| Wikipedia     | 3,896,965,449  | 100.00%    | 6.98%   | 7.11%                  |
| Stackexchange | 1,464,772,187  | 100.00%    | 2.63%   | 2.45%                  |
| Math          | 11,415,049,655 | 100.00%    | 20.46%  | 20.80%                 |

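The Mix (%) column is each dataset's token count divided by the 55,797,411,281-token total; a quick sanity check for the DCLM row:

```sh
# Mix(%) = tokens / total * 100 (values taken from the table above)
awk 'BEGIN { printf "DCLM mix: %.2f%%\n", 26540912669 / 55797411281 * 100 }'  # -> 47.57%
```
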
### tokenize

```bash
export EXP_DIR="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/"
export EXP_SCRIPT_DIR="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining"
cd $EXP_DIR

# 1. Download dolmino-mix-1124 from Hugging Face
huggingface-cli download allenai/dolmino-mix-1124 --local-dir "$EXP_DIR/dolmino-mix-1124"

cd $EXP_SCRIPT_DIR
# 2. Extract the dataset (into `$EXP_DIR/dolmino-mix-1124-extracted`)
bash ./preprocess/extract.sh

# 3. Merge the dataset files (merged files are written to `$EXP_DIR/dolmino-mix-1124-extracted-merged`)
qsub ./preprocess/merge_files.sh

# (once step 3 has finished)
# 4. Tokenize the dataset (tokenized files are written to `$EXP_DIR/dolmino-mix-1124-tokenized`)
qsub ./preprocess/tokenize.sh

# (optional) remove intermediate files
rm -rf $EXP_DIR/dolmino-mix-1124-extracted $EXP_DIR/dolmino-mix-1124-extracted-merged
```

### Creating the training dataset

Tokenization must be finished before the dataset can be created.

```sh
# Create ./tasks/v4-dolmino-mix-1124/train_data.all.sh
# Token counts are computed automatically and written as "TOKEN_COUNT PATH" entries
./preprocess/build_train_data.sh

# Create ./tasks/v4-dolmino-mix-1124/train_data_50B.sh from ./tasks/v4-dolmino-mix-1124/train_data.all.sh
# Token counts are rescaled to a 50B-token dataset with the same mix as the dolmino midtraining
./preprocess/update_train_data_to_50B.sh
# Likewise for 100B and 300B
```

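For orientation, a hypothetical excerpt of the generated `train_data.all.sh`, based only on the "TOKEN_COUNT PATH" description above; the variable name and the tokenized-data paths are assumptions, not the actual generated content:

```sh
# Hypothetical excerpt of ./tasks/v4-dolmino-mix-1124/train_data.all.sh
# Each entry pairs a token count with a tokenized-data path prefix.
TRAIN_DATA_PATH=""
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 26540912669 ${EXP_DIR}/dolmino-mix-1124-tokenized/dclm/dclm_text_document"
TRAIN_DATA_PATH="${TRAIN_DATA_PATH} 9242742021 ${EXP_DIR}/dolmino-mix-1124-tokenized/flan/flan_text_document"
```

`update_train_data_to_50B.sh` then presumably rescales each count by roughly 50B / 55.8B (about 0.90) so that the mix ratios stay fixed.
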
## Environment setup

ref: [scripts/pretrain/installers/v4-megatron-abci at 0130-instruct-pretrain · llm-jp/scripts](https://github.com/llm-jp/scripts/tree/0130-instruct-pretrain/pretrain/installers/v4-megatron-abci)

```sh
cd /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/install-scripts/pretrain/installers/v4-megatron-abci
bash run_setup.sh /path/to/target_dir
# example:
# bash run_setup.sh /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/environment
```

> [!CAUTION]
> Transformer Engine v1.10 or later raises an error, so environment2 is used for this work (Transformer Engine was downgraded to version 1.9).
> ref: https://docs.nvidia.com/nemo-framework/user-guide/24.07/knownissues.html

> [!CAUTION]
> Added `weights_only=False` at line 72 of `environment/src/Megatron-LM/megatron/core/dist_checkpointing/strategies/common.py`.
> ref: https://github.com/huggingface/accelerate/issues/3539

## Running jobs

```sh
cd /path/to/v4-midtraining

# example:
# 1.3b-llama3-ecjk
bash midtrain/run_train.sh $(realpath tasks/v4-dolmino-mix-1124) 1.3b-llama3-ecjk 50B 16

# 7.7b-llama3-ecjk
bash midtrain/run_train.sh $(realpath tasks/v4-dolmino-mix-1124) 7.7b-llama3-ecjk 50B 16
```

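The same call with the positional arguments spelled out; the first three match the `{TASK_DIR} {PARAM_NAME} {DATASET_SIZE}` placeholders used for checkpoint conversion below, while reading the final `16` as a node count is an assumption:

```sh
# Hypothetical expansion of the example above; verify the meaning of the last argument in run_train.sh
TASK_DIR=$(realpath tasks/v4-dolmino-mix-1124)
PARAM_NAME=7.7b-llama3-ecjk
DATASET_SIZE=50B
NUM_NODES=16   # assumed to be the node count
bash midtrain/run_train.sh ${TASK_DIR} ${PARAM_NAME} ${DATASET_SIZE} ${NUM_NODES}
```
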
### [Option] Running jobs with dependencies

A script is provided for running jobs with dependencies between them, using qsub's `-W depend=...` feature.
Use `run_train_with_deps.sh` instead of `run_train.sh`.

```sh
# Pass the value for `-W depend=` as the last argument
bash midtrain/run_train_with_deps.sh $(realpath tasks/v4-dolmino-mix-1124) 7.7b-llama3-ecjk 50B 16 afterok:xxxx.pbs1:yyyy.pbs1
```

For the detailed dependency syntax, see `man qsub` on ABCI 3.0.

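A minimal chaining sketch; the job ID below is a placeholder, and how you record the ID of the earlier job depends on your submission flow (qsub prints the ID of a submitted job, e.g. `12345.pbs1`, on stdout):

```sh
# Hypothetical: chain a run behind an already-submitted job
PREV_JOB_ID="12345.pbs1"   # placeholder; replace with the ID reported for the earlier job
bash midtrain/run_train_with_deps.sh $(realpath tasks/v4-dolmino-mix-1124) 7.7b-llama3-ecjk 50B 16 afterok:${PREV_JOB_ID}
```
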
## Checkpoint conversion

> [!CAUTION]
> Before running the script below, remove `--no-load-optim` from the param file under `scripts/pretrain/scripts/v4-midtraining/midtrain/params`.

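One way to locate and drop the flag, assuming it sits on its own line in the sourced param file (a sketch only; check the file by hand):

```sh
# Find where the flag is set (param files are sourced as params/${PARAM_NAME}.sh)
grep -rn -- '--no-load-optim' midtrain/params/

# Delete the line in place, keeping a .bak backup (e.g. for the 7.7b config)
sed -i.bak '/--no-load-optim/d' midtrain/params/7.7b-llama3-ecjk.sh
```
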
```sh
cd /path/to/v4-midtraining

bash convert/convert_latest.sh {TASK_DIR} {PARAM_NAME} {DATASET_SIZE}

# example:
bash convert/convert_latest.sh $(realpath tasks/v4-dolmino-mix-1124) 1.3b-llama3-ecjk 50B
```

> [!CAUTION]
> The following code was added at the top of `/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/environment2/src/Megatron-LM/tools/checkpoint/loader_mcore.py`:
> ```
> import json, os, sys, torch, functools
> torch.load = functools.partial(torch.load, weights_only=False)
> ```

## Model soup

Models are merged with [arcee-ai/mergekit](https://github.com/arcee-ai/mergekit).

A virtual environment for model merging is provided at `$EXP_DIR/venv-mergekit`.

```sh
source $EXP_DIR/venv-mergekit/bin/activate

# install mergekit on first use
pip install mergekit
```

Merge configuration files are placed under `./merge/`.

Command to run the merge:

```sh
mergekit-yaml merge/your_config.yaml model/output/path/
```
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
#!/bin/bash

# LLM-jp v4 model converter (PBS version)
# Usage:
#   bash convert_latest.sh \
#     /path/to/task \   ... TASK_DIR: path to the model to save
#     v3-13b \          ... PARAM_NAME: model config; corresponding file in `params/` should exist
#     50B               ... DATASET_SIZE: 50B, 100B, or 300B

set -eu -o pipefail

task_dir=$1; shift
param_name=$1; shift
dataset_size=$1; shift  # 50B or 100B or 300B
iter=$(cat ${task_dir}/${param_name}/${dataset_size}/checkpoints/latest_checkpointed_iteration.txt)

script_root=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining

qsub \
    -v TASK_DIR=${task_dir},PARAM_NAME=${param_name},DATASET_SIZE=${dataset_size},ITER=${iter},RTYPE=rt_HF \
    -m n \
    -o /dev/null \
    -e /dev/null \
    ${script_root}/convert/qsub_convert.sh
Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
#!/bin/bash
#PBS -P gcg51557
#PBS -q R9920251000
#PBS -N 0156_convert
#PBS -l select=1
#PBS -o /dev/null
#PBS -e /dev/null
#PBS -m n

cd $PBS_O_WORKDIR

JOBID=${PBS_JOBID%%.*}
mkdir -p ${TASK_DIR}/logs
LOGFILE=${TASK_DIR}/logs/convert-$JOBID.out
ERRFILE=${TASK_DIR}/logs/convert-$JOBID.err
exec > $LOGFILE 2> $ERRFILE

set -eu -o pipefail

# Arguments
EXPERIMENT_DIR=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction
SCRIPT_DIR=${EXPERIMENT_DIR}/scripts/pretrain/scripts/v4-midtraining/midtrain
# ENV_DIR=${EXPERIMENT_DIR}/environment2
ENV_DIR=${EXPERIMENT_DIR}/environment3
echo "EXPERIMENT_DIR=${EXPERIMENT_DIR}"
echo "SCRIPT_DIR=${SCRIPT_DIR}"
echo "TASK_DIR=${TASK_DIR}"
echo "PARAM_NAME=${PARAM_NAME}"
echo "DATASET_SIZE=${DATASET_SIZE}"
echo "ITER=${ITER}"

# Setup environment
source ${SCRIPT_DIR}/common/setup.sh

export MASTER_ADDR=$(head -n 1 $PBS_NODEFILE | hostname -f)
export MASTER_PORT=$((10000 + RANDOM % 1000))
echo "hostname: ${MASTER_ADDR}"

ITER_NAME=iter_$(printf %07d ${ITER}) # iter_0123456

MEGATRON_PATH=${ENV_DIR}/src/Megatron-LM
TOKENIZER_MODEL_PATH=${ENV_DIR}/src/llm-jp-tokenizer/hf/ver3.0/llm-jp-tokenizer-100k.ver3.0b2
OUTPUT_DIR=${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/checkpoints_hf/${ITER_NAME}
echo "OUTPUT_DIR=${OUTPUT_DIR}"

# Setup working directory
TEMP_DIR=$(mktemp -d "${HOME}/converter_${JOBID}_XXXXXX")
echo "TEMP_DIR=${TEMP_DIR}"
function rm_tempdir {
    if [ -e ${TEMP_DIR} ]; then
        echo "Removing temporary directory: ${TEMP_DIR}"
        rm -rf ${TEMP_DIR}
        echo "Done removing"
    fi
}
trap rm_tempdir EXIT
trap 'trap - EXIT; rm_tempdir; exit 1' INT PIPE TERM

########
# Step 1: Convert `torch_dist` format to `torch`
# This process requires launching the trainer script with the same parallelism configuration.
########
echo "Start converting: torch_dist --> torch"

# Prepare source model at specific iteration
mkdir ${TEMP_DIR}/torch_dist
echo ${ITER} > ${TEMP_DIR}/torch_dist/latest_checkpointed_iteration.txt
ln -s ${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/checkpoints/${ITER_NAME} ${TEMP_DIR}/torch_dist/${ITER_NAME}

# Load ALL_PARAMS
source ${SCRIPT_DIR}/params/${PARAM_NAME}.sh
# Remove wandb params
EXCLUDE_KEYS=("--wandb-entity" "--wandb-project" "--wandb-exp-name")
NEW_PARAMS=()
skip_next=0
for param in "${ALL_PARAMS[@]}"; do
    if [[ $skip_next -eq 1 ]]; then
        skip_next=0
        continue
    fi
    for key in "${EXCLUDE_KEYS[@]}"; do
        if [[ "$param" == "$key" ]]; then
            skip_next=1
            continue 2
        fi
    done
    NEW_PARAMS+=("$param")
done
ALL_PARAMS=("${NEW_PARAMS[@]}")

# Add params specific to model conversion
ALL_PARAMS+=(
    --load ${TEMP_DIR}/torch_dist
    --ckpt-convert-format torch
    --ckpt-convert-save ${TEMP_DIR}
)
echo "ALL_PARAMS: ${ALL_PARAMS[@]}"

NUM_NODES=$(wc -l < $PBS_NODEFILE)
NUM_GPUS_PER_NODE=8
NUM_GPUS=$((${NUM_NODES} * ${NUM_GPUS_PER_NODE}))
echo "nnodes: ${NUM_NODES}; ngpus: ${NUM_GPUS}"
echo NUM_NODES=$NUM_NODES
echo NUM_GPUS_PER_NODE=$NUM_GPUS_PER_NODE
echo NUM_GPUS=$NUM_GPUS

export NVTE_FUSED_ATTN=0
# Launch trainer script to convert the checkpoint
mpirun \
    --display-allocation \
    --report-bindings \
    --oversubscribe \
    -np ${NUM_GPUS} \
    --npernode ${NUM_GPUS_PER_NODE} \
    -bind-to none \
    -map-by slot \
    python ${MEGATRON_PATH}/pretrain_gpt.py \
        ${ALL_PARAMS[@]}

echo "Files created by Step 1:"
find ${TEMP_DIR}/torch | sort

########
# Step 2: Convert `torch` to `Hugging Face Llama2`
########

echo "Start converting: torch --> hf"

python ${MEGATRON_PATH}/tools/checkpoint/convert.py \
    --model-type GPT \
    --loader mcore \
    --saver llmjp4_hf \
    --load-dir ${TEMP_DIR}/torch \
    --save-dir ${OUTPUT_DIR} \
    --hf-tokenizer-path ${TOKENIZER_MODEL_PATH} \
    --save-dtype bfloat16 \
    --loader-transformer-impl transformer_engine \
    --megatron-path ${MEGATRON_PATH}

echo "Files created by Step 2:"
find ${OUTPUT_DIR} | sort

########
# Step 3: Replace tokenizer model
########

echo "Start replacing tokenizer"

cp ${TOKENIZER_MODEL_PATH}/* ${OUTPUT_DIR}

echo "Final model files:"
find ${OUTPUT_DIR} | sort

echo "Done processing"
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
# Merge configuration for 1.7B model with fixed 3e-5 learning rate and iteration 1866317

merge_method: linear
models:
  - model: /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/checkpoints_hf/3e-5-fix/seed42/iter_1866317/
    parameters:
      weight: 1.0
  - model: /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/checkpoints_hf/3e-5-fix/seed666/iter_1866317/
    parameters:
      weight: 1.0
  - model: /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/checkpoints_hf/3e-5-fix/seed42069/iter_1866317/
    parameters:
      weight: 1.0
dtype: bfloat16
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
# Script for setting up the trainer environment.

source /etc/profile.d/modules.sh
# module load cuda/12.1/12.1.1
module load cuda/12.4/12.4.1
module load cudnn/9.5/9.5.1
module load hpcx/2.20
# module load nccl/2.23/2.23.4-1
module load nccl/2.25/2.25.1-1
# echo $(module list)
loaded=$(module -t list 2>&1)
echo "-----"
echo "Modules: $loaded"
echo "-----"

# ENV_DIR=${EXPERIMENT_DIR}/environments
# ENV_DIR=${EXPERIMENT_DIR}/environment2
ENV_DIR=${EXPERIMENT_DIR}/environment3

source ${ENV_DIR}/venv/bin/activate
# source ${ENV_DIR}/scripts/environment.sh # ADD

## Debug/logging flags
export LOGLEVEL=INFO
# export NCCL_DEBUG=WARN
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=WARN
export PYTHONFAULTHANDLER=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_LAUNCH_BLOCKING=0
export CUDNN_LOGDEST_DBG=stderr
export CUDNN_LOGERR_DBG=1
