diff --git a/pretrain/scripts/v4-midtraining/README.md b/pretrain/scripts/v4-midtraining/README.md new file mode 100644 index 0000000..303743a --- /dev/null +++ b/pretrain/scripts/v4-midtraining/README.md @@ -0,0 +1,145 @@ +# LLMjp-v4 Midtraining + +## Overview + +This experiment reproduces the midtraining stage of OLMo2 using the LLM-jp-4-en models. + +## Dataset mix + +Total tokens: 55,797,411,281 + +| Dataset | Tokens | Source (%) | Mix (%) | Original OLMo2 Mix (%) | +|---------------|----------------|------------|---------|------------------------| +| DCLM | 26,540,912,669 | 3.23% | 47.57% | 47.20% | +| FLAN | 9,242,742,021 | 50.00% | 16.56% | 16.60% | +| peS2o | 3,236,969,300 | 5.15% | 5.80% | 5.85% | +| Wikipedia | 3,896,965,449 | 100.00% | 6.98% | 7.11% | +| Stackexchange | 1,464,772,187 | 100.00% | 2.63% | 2.45% | +| Math | 11,415,049,655 | 100.00% | 20.46% | 20.80% | + +### tokenize + +```bash +export EXP_DIR="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/" +export EXP_SCRIPT_DIR="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining" +cd $EXP_DIR + +# 1. Download dolmino-mix-1124 from Hugging Face (it is a dataset repository) +huggingface-cli download allenai/dolmino-mix-1124 --repo-type dataset --local-dir "$EXP_DIR/dolmino-mix-1124" + +cd $EXP_SCRIPT_DIR +# 2. Extract the dataset (output goes to `$EXP_DIR/dolmino-mix-1124-extracted`) +bash ./preprocess/extract.sh + +# 3. Merge the dataset files (merged files are written to `$EXP_DIR/dolmino-mix-1124-extracted-merged`) +qsub ./preprocess/merge_files.sh + +# (after step 3 has finished) +# 4. 
Tokenize the dataset (tokenized files are written to `$EXP_DIR/dolmino-mix-1124-tokenized`) +qsub ./preprocess/tokenize.sh + +# (optional) Remove intermediate files +rm -rf $EXP_DIR/dolmino-mix-1124-extracted $EXP_DIR/dolmino-mix-1124-extracted-merged +``` + +### Building the dataset + +Tokenization must be complete before the dataset is built. + +```sh +# Create ./tasks/v4-dolmino-mix-1124/train_data.all.sh +# Token counts are computed automatically and written to train_data.all.sh as "TOKEN_COUNT PATH" entries +./preprocess/build_train_data.sh + +# Create ./tasks/v4-dolmino-mix-1124/train_data_50B.sh from ./tasks/v4-dolmino-mix-1124/train_data.all.sh +# Rescale the token counts so that the dataset totals 50B tokens with the same mix as the dolmino midtraining +./preprocess/update_train_data_to_50B.sh +# 100B and 300B work the same way +``` + +## Environment setup + +ref: [scripts/pretrain/installers/v4-megatron-abci at 0130-instruct-pretrain · llm-jp/scripts](https://github.com/llm-jp/scripts/tree/0130-instruct-pretrain/pretrain/installers/v4-megatron-abci) + +```sh +cd /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/install-scripts/pretrain/installers/v4-megatron-abci +bash run_setup.sh /path/to/target_dir +# e.g. +# bash run_setup.sh /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/environment +``` + +> [!CAUTION] +> Transformer Engine v1.10 and later raises an error, so this experiment uses environment2, in which Transformer Engine is downgraded to 1.9. +> ref: https://docs.nvidia.com/nemo-framework/user-guide/24.07/knownissues.html + +> [!CAUTION] +> `weights_only=False` was added at line 72 of `environment/src/Megatron-LM/megatron/core/dist_checkpointing/strategies/common.py`. +> ref: https://github.com/huggingface/accelerate/issues/3539 + + +## Running jobs + +```sh +cd /path/to/v4-midtraining + +# example: +# 1.3b-llama3-ecjk +bash midtrain/run_train.sh $(realpath tasks/v4-dolmino-mix-1124) 1.3b-llama3-ecjk 50B 16 + +# 7.7b-llama3-ecjk +bash midtrain/run_train.sh $(realpath tasks/v4-dolmino-mix-1124) 7.7b-llama3-ecjk 50B 16 +``` + +### [Option] Running jobs with dependencies + +A script is provided for submitting jobs with inter-job dependencies through the `-W depend=...` option of qsub. +Use `run_train_with_deps.sh` instead of `run_train.sh`. + +```sh +# 
Pass the value for `-W depend=` as the last argument +bash midtrain/run_train_with_deps.sh $(realpath tasks/v4-dolmino-mix-1124) 7.7b-llama3-ecjk 50B 16 afterok:xxxx.pbs1:yyyy.pbs1 +``` + +For the full dependency syntax, see `man qsub` on ABCI 3.0. + +## Checkpoint conversion + +> [!CAUTION] +> Before running the script below, remove `--no-load-optim` from `scripts/pretrain/scripts/v4-midtraining/midtrain/params`. + +```sh +cd /path/to/v4-midtraining + +bash convert/convert_latest.sh {TASK_DIR} {PARAM_NAME} {DATASET_SIZE} + +# example: +bash convert/convert_latest.sh $(realpath tasks/v4-dolmino-mix-1124) 1.3b-llama3-ecjk 50B +``` + +> [!CAUTION] +> The following code was added at the top of `/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/environment2/src/Megatron-LM/tools/checkpoint/loader_mcore.py`: +> ``` +> import json, os, sys, torch, functools +> torch.load = functools.partial(torch.load, weights_only=False) +> ``` + +## Model soup + +Models are merged with [arcee-ai/mergekit](https://github.com/arcee-ai/mergekit). + +The environment for model merging is set up in `$EXP_DIR/venv-mergekit`. + +```sh +source $EXP_DIR/venv-mergekit/bin/activate + +# Install mergekit on first use +pip install mergekit +``` + +Merge configuration files are placed under `./merge/`. + +Command to run the merge: + +```sh +mergekit-yaml merge/your_config.yaml model/output/path/ +``` diff --git a/pretrain/scripts/v4-midtraining/convert/convert_latest.sh b/pretrain/scripts/v4-midtraining/convert/convert_latest.sh new file mode 100644 index 0000000..d239db2 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/convert/convert_latest.sh @@ -0,0 +1,23 @@ +#!/bin/bash + +# LLM-jp v4 model converter (PBS version) +# Usage: +# bash convert_latest.sh \ +# /path/to/task \ ... TASK_DIR: path to the model to save +# v3-13b \ ... 
PARAM_NAME: model config; corresponding file in `params/` should exist + +set -eu -o pipefail + +task_dir=$1; shift +param_name=$1; shift +dataset_size=$1; shift # 50B or 100B or 300B +iter=$(cat ${task_dir}/${param_name}/${dataset_size}/checkpoints/latest_checkpointed_iteration.txt) + +script_root=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining + +qsub \ + -v TASK_DIR=${task_dir},PARAM_NAME=${param_name},DATASET_SIZE=${dataset_size},ITER=${iter},RTYPE=rt_HF \ + -m n \ + -o /dev/null \ + -e /dev/null \ + ${script_root}/convert/qsub_convert.sh diff --git a/pretrain/scripts/v4-midtraining/convert/qsub_convert.sh b/pretrain/scripts/v4-midtraining/convert/qsub_convert.sh new file mode 100644 index 0000000..e5871c0 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/convert/qsub_convert.sh @@ -0,0 +1,154 @@ +#!/bin/bash +#PBS -P gcg51557 +#PBS -q R9920251000 +#PBS -N 0156_convert +#PBS -l select=1 +#PBS -o /dev/null +#PBS -e /dev/null +#PBS -m n + +cd $PBS_O_WORKDIR + +JOBID=${PBS_JOBID%%.*} +mkdir -p ${TASK_DIR}/logs +LOGFILE=${TASK_DIR}/logs/convert-$JOBID.out +ERRFILE=${TASK_DIR}/logs/convert-$JOBID.err +exec > $LOGFILE 2> $ERRFILE + +set -eu -o pipefail + +# Arguments +EXPERIMENT_DIR=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction +SCRIPT_DIR=${EXPERIMENT_DIR}/scripts/pretrain/scripts/v4-midtraining/midtrain +# ENV_DIR=${EXPERIMENT_DIR}/environment2 +ENV_DIR=${EXPERIMENT_DIR}/environment3 +echo "EXPERIMENT_DIR=${EXPERIMENT_DIR}" +echo "SCRIPT_DIR=${SCRIPT_DIR}" +echo "TASK_DIR=${TASK_DIR}" +echo "PARAM_NAME=${PARAM_NAME}" +echo "DATASET_SIZE=${DATASET_SIZE}" +echo "ITER=${ITER}" + +# Setup environment +source ${SCRIPT_DIR}/common/setup.sh + +export MASTER_ADDR=$(head -n 1 $PBS_NODEFILE | hostname -f) +export MASTER_PORT=$((10000 + RANDOM % 1000)) +echo "hostname: ${MASTER_ADDR}" + +ITER_NAME=iter_$(printf %07d ${ITER}) # iter_0123456 + +MEGATRON_PATH=${ENV_DIR}/src/Megatron-LM 
+TOKENIZER_MODEL_PATH=${ENV_DIR}/src/llm-jp-tokenizer/hf/ver3.0/llm-jp-tokenizer-100k.ver3.0b2 +OUTPUT_DIR=${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/checkpoints_hf/${ITER_NAME} +echo "OUTPUT_DIR=${OUTPUT_DIR}" + +# Setup working directory +TEMP_DIR=$(mktemp -d "${HOME}/converter_${JOBID}_XXXXXX") +echo "TEMP_DIR=${TEMP_DIR}" +function rm_tempdir { + if [ -e ${TEMP_DIR} ]; then + echo "Removing temporary directory: ${TEMP_DIR}" + rm -rf ${TEMP_DIR} + echo "Done removing" + fi +} +trap rm_tempdir EXIT +trap 'trap - EXIT; rm_tempdir; exit 1' INT PIPE TERM + +######## +# Step 1: Convert `torch_dist` format to `torch` +# This step requires launching the trainer script with the same parallelism configs. +######## +echo "Start converting: torch_dist --> torch" + +# Prepare source model at specific iteration +mkdir ${TEMP_DIR}/torch_dist +echo ${ITER} > ${TEMP_DIR}/torch_dist/latest_checkpointed_iteration.txt +ln -s ${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/checkpoints/${ITER_NAME} ${TEMP_DIR}/torch_dist/${ITER_NAME} + +# Load ALL_PARAMS +source ${SCRIPT_DIR}/params/${PARAM_NAME}.sh +# Remove wandb params +EXCLUDE_KEYS=("--wandb-entity" "--wandb-project" "--wandb-exp-name") +NEW_PARAMS=() +skip_next=0 +for param in "${ALL_PARAMS[@]}"; do + if [[ $skip_next -eq 1 ]]; then + skip_next=0 + continue + fi + for key in "${EXCLUDE_KEYS[@]}"; do + if [[ "$param" == "$key" ]]; then + skip_next=1 + continue 2 + fi + done + NEW_PARAMS+=("$param") +done +ALL_PARAMS=("${NEW_PARAMS[@]}") + +# Add params specific to model conversion +ALL_PARAMS+=( + --load ${TEMP_DIR}/torch_dist + --ckpt-convert-format torch + --ckpt-convert-save ${TEMP_DIR} +) +echo "ALL_PARAMS: ${ALL_PARAMS[@]}" + +NUM_NODES=$(wc -l < $PBS_NODEFILE) +NUM_GPUS_PER_NODE=8 +NUM_GPUS=$((${NUM_NODES} * ${NUM_GPUS_PER_NODE})) +echo "nnodes: ${NUM_NODES}; ngpus: ${NUM_GPUS}" +echo NUM_NODES=$NUM_NODES +echo NUM_GPUS_PER_NODE=$NUM_GPUS_PER_NODE +echo NUM_GPUS=$NUM_GPUS + +export NVTE_FUSED_ATTN=0 +# Launch trainer script to 
convert the checkpoint +mpirun \ + --display-allocation \ + --report-bindings \ + --oversubscribe \ + -np ${NUM_GPUS} \ + --npernode ${NUM_GPUS_PER_NODE} \ + -bind-to none \ + -map-by slot \ + python ${MEGATRON_PATH}/pretrain_gpt.py \ + ${ALL_PARAMS[@]} + +#echo "Files created by the Step 1:" +find ${TEMP_DIR}/torch | sort + +######## +# Step 2: Convert `torch` to `Hugging Face Llama2` +######## + +echo "Start converting: torch --> hf" + +python ${MEGATRON_PATH}/tools/checkpoint/convert.py \ + --model-type GPT \ + --loader mcore \ + --saver llmjp4_hf \ + --load-dir ${TEMP_DIR}/torch \ + --save-dir ${OUTPUT_DIR} \ + --hf-tokenizer-path ${TOKENIZER_MODEL_PATH} \ + --save-dtype bfloat16 \ + --loader-transformer-impl transformer_engine \ + --megatron-path ${MEGATRON_PATH} + +echo "Files created by the Step 2:" +find ${OUTPUT_DIR} | sort + +######## +# Step 3: Replace tokenizer model +######## + +echo "Start replacing tokenizer" + +cp ${TOKENIZER_MODEL_PATH}/* ${OUTPUT_DIR} + +echo "Final model files:" +find ${OUTPUT_DIR} | sort + +echo "Done processing" diff --git a/pretrain/scripts/v4-midtraining/merge/1.7b_3e-5-fix_iter1866317.yaml b/pretrain/scripts/v4-midtraining/merge/1.7b_3e-5-fix_iter1866317.yaml new file mode 100644 index 0000000..f5f5293 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/merge/1.7b_3e-5-fix_iter1866317.yaml @@ -0,0 +1,14 @@ +# Merge configuration for 1.7B model with fixed 3e-5 learning rate and iteration 1866317 + +merge_method: linear +models: + - model: /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/checkpoints_hf/3e-5-fix/seed42/iter_1866317/ + parameters: + weight: 1.0 + - model: /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/checkpoints_hf/3e-5-fix/seed666/iter_1866317/ + parameters: + weight: 1.0 + - model: 
/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/checkpoints_hf/3e-5-fix/seed42069/iter_1866317/ + parameters: + weight: 1.0 +dtype: bfloat16 diff --git a/pretrain/scripts/v4-midtraining/midtrain/common/setup.sh b/pretrain/scripts/v4-midtraining/midtrain/common/setup.sh new file mode 100644 index 0000000..b9d5272 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/midtrain/common/setup.sh @@ -0,0 +1,32 @@ +# Script for setup trainer environment. + +source /etc/profile.d/modules.sh +# module load cuda/12.1/12.1.1 +module load cuda/12.4/12.4.1 +module load cudnn/9.5/9.5.1 +module load hpcx/2.20 +# module load nccl/2.23/2.23.4-1 +module load nccl/2.25/2.25.1-1 +# echo $(module list) +loaded=$(module -t list 2>&1) +echo "-----" +echo "Modules: $loaded" +echo "-----" + +# ENV_DIR=${EXPERIMENT_DIR}/environments +# ENV_DIR=${EXPERIMENT_DIR}/environment2 +ENV_DIR=${EXPERIMENT_DIR}/environment3 + +source ${ENV_DIR}/venv/bin/activate +# source ${ENV_DIR}/scripts/environment.sh # ADD + +## Debug/logging flags +export LOGLEVEL=INFO +# export NCCL_DEBUG=WARN +export NCCL_DEBUG=INFO +export NCCL_DEBUG_SUBSYS=WARN +export PYTHONFAULTHANDLER=1 +export CUDA_DEVICE_MAX_CONNECTIONS=1 +export CUDA_LAUNCH_BLOCKING=0 +export CUDNN_LOGDEST_DBG=stderr +export CUDNN_LOGERR_DBG=1 diff --git a/pretrain/scripts/v4-midtraining/midtrain/params/1.3b-llama3-ecjk.sh b/pretrain/scripts/v4-midtraining/midtrain/params/1.3b-llama3-ecjk.sh new file mode 100644 index 0000000..b8fe350 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/midtrain/params/1.3b-llama3-ecjk.sh @@ -0,0 +1,157 @@ +# Pretraining hyperparameters for v4 1.3B. 
+# Model card: https://github.com/llm-jp/model-cards/pull/31 +# Ref: https://github.com/llm-jp/scripts/blob/ec3516a38f93047b7bc0d8305879d62a375e6ee2/pretrain/scripts/v4-training/params/1.3b-cont1.sh + +ALL_PARAMS=() + +# Model hyperparameters +ALL_PARAMS+=( + --num-layers 16 + --hidden-size 2048 + --ffn-hidden-size 7168 + --num-attention-heads 16 + --group-query-attention + --num-query-groups 8 + --seq-length 8192 + --max-position-embeddings 8192 + --position-embedding-type rope + --rotary-base 500000 + --untie-embeddings-and-output-weights + --swiglu + --normalization RMSNorm + --norm-epsilon 1e-5 + --disable-bias-linear +) + +# Tokenizer +ALL_PARAMS+=( + --tokenizer-type Llama2Tokenizer + --tokenizer-model ${ENV_DIR}/src/llm-jp-tokenizer/models/ver3.0/llm-jp-tokenizer-100k.ver3.0b1.model +) + +# Optimizer hyperparameters +ALL_PARAMS+=( + --optimizer adam + # --lr 3e-4 # will be defined later + # --min-lr 3e-5 # will be defined later + --adam-beta1 0.9 + --adam-beta2 0.95 + --adam-eps 1e-8 + --clip-grad 1.0 + --weight-decay 0.1 + --init-method-std 0.02 + --attention-dropout 0.0 + --hidden-dropout 0.0 + --override-opt_param-scheduler + # --no-load-optim +) + +# pretrain_iters: 1,859,665 +# 50B: ceil( 55,797,411,281 / 8192 / 1024 ) == 6652 +# 50B sum: 1,859,665 + 6,652 = 1,866,317 +# 100B: ceil( 113,460,356,693 / 8192 / 1024 ) == 13,526 +# 100B sum: 1,859,665 + 13,526 = 1,873,191 +# 300B: ceil( 337,681,167,151 / 8192 / 1024 ) == 40,255 +# 300B sum: 1,859,665 + 40,255 = 1,899,920 +MIDTRAIN_START=1859665 +TRAIN_ITERS=$(cat ${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/train_iters.txt) +MIDTRAIN_ITERS=$((TRAIN_ITERS - MIDTRAIN_START)) + +# Scheduler +ALL_PARAMS+=( + --lr 3e-5 # Start LR + --min-lr 3e-5 # End LR + # --min-lr 0 # End LR + # --lr-warmup-iters ${MIDTRAIN_START} # No warmup + --lr-warmup-iters 0 # No warmup + # --lr-decay-iters ${TRAIN_ITERS} + --lr-decay-iters ${MIDTRAIN_ITERS} + --lr-decay-style linear + --train-iters ${TRAIN_ITERS} + --eval-interval 
999999999 + --eval-iters 0 +) + +# Batch sizes +ALL_PARAMS+=( + --micro-batch-size 4 + --global-batch-size 1024 +) + +# Parallelism +ALL_PARAMS+=( + --tensor-model-parallel-size 1 + --pipeline-model-parallel-size 1 + --context-parallel-size 1 + --sequence-parallel + --use-distributed-optimizer + --distributed-backend nccl + # NOTE(odashi): Increasing timeout is required to prepare 15.6T dataset. + --distributed-timeout-minutes 120 + --use-mpi +) + +# Load TRAIN_DATA_PATH +source ${TASK_DIR}/train_data_${DATASET_SIZE}.sh # options: 50B, 100B, and 300B +SEED=42069 +# Dataset +ALL_PARAMS+=( + --data-path ${TRAIN_DATA_PATH[@]} + --data-cache-path ${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/cache + --split 1,0,0 + --seed ${SEED} +) + + TASK_CHECKPOINT_DIR=${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/checkpoints +mkdir -p ${TASK_CHECKPOINT_DIR} + +if [ -e ${TASK_CHECKPOINT_DIR}/latest_checkpointed_iteration.txt ]; then + # Continue existing training + ALL_PARAMS+=( + --load ${TASK_CHECKPOINT_DIR} + --save ${TASK_CHECKPOINT_DIR} + ) + echo "Continue existing training" +else + # Start new training from scratch + ALL_PARAMS+=( + --load ${TASK_CHECKPOINT_DIR} + --save ${TASK_CHECKPOINT_DIR} + ) + echo "Start new training from scratch" +fi +ALL_PARAMS+=( + --save-interval 1000 +) + +# Other implementation-related parameters +ALL_PARAMS+=( + --bf16 + --use-mcore-models + --no-masked-softmax-fusion + --use-flash-attn + + # NOTE(odashi): For adjusting throughput + #--recompute-activations + #--recompute-granularity selective + #--overlap-grad-reduce + #--overlap-param-gather + + --attention-softmax-in-fp32 + --transformer-impl transformer_engine + + # NOTE(odashi): Newer implementation requires to set attention backend by parameter. + #--attention-backend flash +) + +# NOTE(odashi): Disable fused attention for Sakura cluster due to some inconsistency. 
+export NVTE_FUSED_ATTN=0 + +# Logging +ALL_PARAMS+=( + --log-interval 1 + --log-throughput + --wandb-entity llm-jp + --wandb-project 0156_midtrain + --wandb-exp-name train_$(basename ${TASK_DIR}) +) diff --git a/pretrain/scripts/v4-midtraining/midtrain/params/7.7b-llama3-ecjk.sh b/pretrain/scripts/v4-midtraining/midtrain/params/7.7b-llama3-ecjk.sh new file mode 100644 index 0000000..d17d279 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/midtrain/params/7.7b-llama3-ecjk.sh @@ -0,0 +1,158 @@ +# Pretraining hyperparameters for v4 7.7B. +# Model card: https://github.com/llm-jp/model-cards/pull/30 +# Ref: https://github.com/llm-jp/scripts/blob/ec3516a38f93047b7bc0d8305879d62a375e6ee2/pretrain/scripts/v4-training/params/7.7b-cont1.sh + +ALL_PARAMS=() + +# Model hyperparameters +ALL_PARAMS+=( + --num-layers 32 + --hidden-size 4096 + --ffn-hidden-size 14336 + --num-attention-heads 32 + --group-query-attention + --num-query-groups 8 + --seq-length 8192 + --max-position-embeddings 8192 + --position-embedding-type rope + --rotary-base 500000 + --untie-embeddings-and-output-weights + --swiglu + --normalization RMSNorm + --norm-epsilon 1e-5 + --disable-bias-linear +) + +# Tokenizer +ALL_PARAMS+=( + --tokenizer-type Llama2Tokenizer + --tokenizer-model ${ENV_DIR}/src/llm-jp-tokenizer/models/ver3.0/llm-jp-tokenizer-100k.ver3.0b1.model +) + +# Optimizer hyperparameters +ALL_PARAMS+=( + --optimizer adam + # --lr 3e-4 # will be defined later + # --min-lr 3e-5 # will be defined later + --adam-beta1 0.9 + --adam-beta2 0.95 + --adam-eps 1e-8 + --clip-grad 1.0 + --weight-decay 0.1 + --init-method-std 0.02 + --attention-dropout 0.0 + --hidden-dropout 0.0 + --override-opt_param-scheduler + # --no-load-optim +) + +# pretrain_iters: 1,859,665 +# 50B: ceil( 55,797,411,281 / 8192 / 1024 ) == 6652 +# 50B sum: 1,859,665 + 6,652 = 1,866,317 +# 100B: ceil( 113,460,356,693 / 8192 / 1024 ) == 13,526 +# 100B sum: 1,859,665 + 13,526 = 1,873,191 +# 300B: ceil( 337,681,167,151 / 8192 / 1024 ) == 
40,255 +# 300B sum: 1,859,665 + 40,255 = 1,899,920 +MIDTRAIN_START=1859665 +TRAIN_ITERS=$(cat ${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/train_iters.txt) +MIDTRAIN_ITERS=$((TRAIN_ITERS - MIDTRAIN_START)) + +# Scheduler +# Scheduler +ALL_PARAMS+=( + --lr 3e-5 # Start LR + --min-lr 3e-5 # End LR + # --min-lr 0 # End LR + # --lr-warmup-iters ${MIDTRAIN_START} # No warmup + --lr-warmup-iters 0 # No warmup + # --lr-decay-iters ${TRAIN_ITERS} + --lr-decay-iters ${MIDTRAIN_ITERS} + --lr-decay-style linear + --train-iters ${TRAIN_ITERS} + --eval-interval 999999999 + --eval-iters 0 +) + +# Batch sizes +ALL_PARAMS+=( + --micro-batch-size 2 + --global-batch-size 1024 +) + +# Parallelism +ALL_PARAMS+=( + --tensor-model-parallel-size 1 + --pipeline-model-parallel-size 2 + --context-parallel-size 1 + --sequence-parallel + --use-distributed-optimizer + --distributed-backend nccl + # NOTE(odashi): Increasing timeout is required to prepare 15.6T dataset. + --distributed-timeout-minutes 120 + --use-mpi +) + +# Load TRAIN_DATA_PATH +source ${TASK_DIR}/train_data_${DATASET_SIZE}.sh # options: 50B, 100B, and 300B +SEED=42 +# Dataset +ALL_PARAMS+=( + --data-path ${TRAIN_DATA_PATH[@]} + --data-cache-path ${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/cache + --split 1,0,0 + --seed ${SEED} +) + + TASK_CHECKPOINT_DIR=${TASK_DIR}/${PARAM_NAME}/${DATASET_SIZE}/checkpoints +mkdir -p ${TASK_CHECKPOINT_DIR} + +if [ -e ${TASK_CHECKPOINT_DIR}/latest_checkpointed_iteration.txt ]; then + # Continue existing training + ALL_PARAMS+=( + --load ${TASK_CHECKPOINT_DIR} + --save ${TASK_CHECKPOINT_DIR} + ) + echo "Continue existing training" +else + # Start new training from scratch + ALL_PARAMS+=( + --load ${TASK_CHECKPOINT_DIR} + --save ${TASK_CHECKPOINT_DIR} + ) + echo "Start new training from scratch" +fi +ALL_PARAMS+=( + --save-interval 1000 +) + +# Other implementation-related parameters +ALL_PARAMS+=( + --bf16 + --use-mcore-models + --no-masked-softmax-fusion + 
--use-flash-attn + + # NOTE(odashi): For adjusting throughput + #--recompute-activations + #--recompute-granularity selective + #--overlap-grad-reduce + #--overlap-param-gather + + --attention-softmax-in-fp32 + --transformer-impl transformer_engine + + # NOTE(odashi): Newer implementation requires to set attention backend by parameter. + #--attention-backend flash +) + +# NOTE(odashi): Disable fused attention for Sakura cluster due to some inconsistency. +export NVTE_FUSED_ATTN=0 + +# Logging +ALL_PARAMS+=( + --log-interval 1 + --log-throughput + --wandb-entity llm-jp + --wandb-project 0156_midtrain + --wandb-exp-name train_$(basename ${TASK_DIR}) +) diff --git a/pretrain/scripts/v4-midtraining/midtrain/qsub_train.sh b/pretrain/scripts/v4-midtraining/midtrain/qsub_train.sh new file mode 100644 index 0000000..cc4b5cf --- /dev/null +++ b/pretrain/scripts/v4-midtraining/midtrain/qsub_train.sh @@ -0,0 +1,63 @@ +#!/bin/bash +#PBS -P gcg51557 +#PBS -q R9920251000 +#PBS -N 0156_olmo2-midtrain-reproduction +#PBS -l select=16 +#PBS -l walltime=10000:00:00 +#PBS -m n + +cd $PBS_O_WORKDIR + +JOBID=${PBS_JOBID%%.*} +mkdir -p ${TASK_DIR}/logs +LOGFILE=${TASK_DIR}/logs/train-${JOBID}.out +ERRFILE=${TASK_DIR}/logs/train-${JOBID}.err +exec > $LOGFILE 2> $ERRFILE + +set -eu -o pipefail + +EXPERIMENT_DIR=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction +SCRIPT_DIR=${EXPERIMENT_DIR}/scripts/pretrain/scripts/v4-midtraining/midtrain +# ENV_DIR=${EXPERIMENT_DIR}/environments +# ENV_DIR=${EXPERIMENT_DIR}/environment2 +ENV_DIR=${EXPERIMENT_DIR}/environment3 + +# Setup environment +source ${SCRIPT_DIR}/common/setup.sh + +source ${ENV_DIR}/venv/bin/activate + +export MASTER_ADDR=$(head -n 1 $PBS_NODEFILE | hostname -f) +export MASTER_PORT=$((10000 + RANDOM % 1000)) +echo "hostname: ${MASTER_ADDR}" + +NUM_NODES=$(wc -l < $PBS_NODEFILE) +NUM_GPUS_PER_NODE=8 +NUM_GPUS=$((${NUM_NODES} * ${NUM_GPUS_PER_NODE})) +echo "nnodes: ${NUM_NODES}; ngpus: ${NUM_GPUS}" +echo 
NUM_NODES=$NUM_NODES +echo NUM_GPUS_PER_NODE=$NUM_GPUS_PER_NODE +echo NUM_GPUS=$NUM_GPUS + +cat $PBS_NODEFILE + +# Load TRAIN_DATA_PATH +source ${TASK_DIR}/train_data_${DATASET_SIZE}.sh # options: 50B, 100B, and 300B +echo "TRAIN_DATA_PATH: ${TRAIN_DATA_PATH}" + +# Load ALL_PARAMS +source ${SCRIPT_DIR}/params/${PARAM_NAME}.sh +echo "ALL_PARAMS: ${ALL_PARAMS[@]}" + +export NVTE_FUSED_ATTN=0 + +mpirun \ + --display-allocation \ + --report-bindings \ + --oversubscribe \ + -np $NUM_GPUS \ + --npernode $NUM_GPUS_PER_NODE \ + -bind-to none \ + -map-by slot \ + python ${ENV_DIR}/src/Megatron-LM/pretrain_gpt.py \ + ${ALL_PARAMS[@]} diff --git a/pretrain/scripts/v4-midtraining/midtrain/run_train.sh b/pretrain/scripts/v4-midtraining/midtrain/run_train.sh new file mode 100644 index 0000000..709679b --- /dev/null +++ b/pretrain/scripts/v4-midtraining/midtrain/run_train.sh @@ -0,0 +1,22 @@ +#!/bin/bash + +set -eu -o pipefail + +if [ $# -ne 4 ]; then + >&2 echo "Usage: $0 TASK_DIR PARAM_NAME DATASET_SIZE NUM_NODES" + >&2 echo "Example: $0 v4-high-quality v3-13b 50B 32" + exit 1 +fi + +task_dir=$1; shift +param_name=$1; shift +dataset_size=$1; shift # 50B or 100B or 300B +num_nodes=$1; shift + +script_root=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining + +qsub -l select=${num_nodes} \ + -v TASK_DIR=${task_dir},PARAM_NAME=${param_name},DATASET_SIZE=${dataset_size},RTYPE=rt_HF \ + -o /dev/null -e /dev/null \ + -m n \ + ${script_root}/midtrain/qsub_train.sh diff --git a/pretrain/scripts/v4-midtraining/midtrain/run_train_with_deps.sh b/pretrain/scripts/v4-midtraining/midtrain/run_train_with_deps.sh new file mode 100644 index 0000000..f17a759 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/midtrain/run_train_with_deps.sh @@ -0,0 +1,27 @@ +#!/bin/bash + +set -eu -o pipefail + +if [ $# -ne 5 ]; then + >&2 echo "Usage: $0 TASK_DIR PARAM_NAME DATASET_SIZE NUM_NODES JOB_DEPENDENCY" + >&2 echo "Example: $0 v4-high-quality v3-13b 50B 32 afterok:xxxx.pbs1" + exit 1 +fi + +task_dir=$1; shift +param_name=$1; shift +dataset_size=$1; shift # 
50B or 100B or 300B +num_nodes=$1; shift + +# qsub -W depend="$job_dependency" ... +# See `man qsub` +job_dependency=$1; shift + +script_root=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining + +qsub -l select=${num_nodes} \ + -v TASK_DIR=${task_dir},PARAM_NAME=${param_name},DATASET_SIZE=${dataset_size},RTYPE=rt_HF \ + -o /dev/null -e /dev/null \ + -m n \ + -W depend="$job_dependency" \ + ${script_root}/midtrain/qsub_train.sh diff --git a/pretrain/scripts/v4-midtraining/preprocess/build_train_data.sh b/pretrain/scripts/v4-midtraining/preprocess/build_train_data.sh new file mode 100755 index 0000000..3ec6a64 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/preprocess/build_train_data.sh @@ -0,0 +1,25 @@ +#!/usr/bin/env bash + +set -euo pipefail + +ROOT_DIR="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized" +OUT_DIR="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124" +OUT_FILE="${OUT_DIR}/train_data.all.sh" + +mkdir -p "${OUT_DIR}" + +{ + echo "# Auto-generated: $(date '+%F %T')" + echo "export TRAIN_DATA_PATH=(" + + find "${ROOT_DIR}" -type f -name '*_text_document.bin' | sort | while read -r BIN; do + BYTES=$(stat -c%s "${BIN}") + TOKENS=$(( BYTES / 4 )) # assumes 4-byte (int32) token IDs in the .bin file + PREFIX="${BIN%.bin}" + printf " %s %s\n" "${TOKENS}" "${PREFIX}" + done + + echo ")" +} > "${OUT_FILE}" + +echo "Generated ${OUT_FILE}" diff --git a/pretrain/scripts/v4-midtraining/preprocess/extract.sh b/pretrain/scripts/v4-midtraining/preprocess/extract.sh new file mode 100755 index 0000000..471a815 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/preprocess/extract.sh @@ -0,0 +1,97 @@ +#!/bin/bash + +set -eu -o pipefail + +DATA_ROOT="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124/data" +OUTPUT_ROOT="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-extracted" + +mkdir -p 
"$OUTPUT_ROOT" + +extract_zstd() { + local input_file="$1" + local output_file="${input_file%.zst}" + output_file="${output_file/$DATA_ROOT/$OUTPUT_ROOT}" + mkdir -p "$(dirname "$output_file")" + echo zstd -f -d "$input_file" -o "$output_file" + zstd -f -d "$input_file" -o "$output_file" +} + +extract_gzip() { + local input_file="$1" + local output_file="${input_file%.gz}" + output_file="${output_file/$DATA_ROOT/$OUTPUT_ROOT}" + mkdir -p "$(dirname "$output_file")" + echo gunzip -c "$input_file" \> "$output_file" + gunzip -c "$input_file" > "$output_file" +} + +copy_only() { + local input_file="$1" + local output_file="${input_file}" + output_file="${output_file/$DATA_ROOT/$OUTPUT_ROOT}" + mkdir -p "$(dirname "$output_file")" + echo cp "$input_file" "$output_file" + cp "$input_file" "$output_file" +} + +# DCLM +for file in $(find "$DATA_ROOT/dclm" -name "*.json.zst" -type f); do + extract_zstd "$file" +done + +# flan +for file in $(find "$DATA_ROOT/flan" -name "*.json.gz" -type f); do + extract_gzip "$file" +done + +# pes2o +for file in $(find "$DATA_ROOT/pes2o" -name "*.json.gz" -type f); do + extract_gzip "$file" +done + +# stackexchange +for file in $(find "$DATA_ROOT/stackexchange" -name "*.json.gz" -type f); do + extract_gzip "$file" +done + +# wiki +for file in $(find "$DATA_ROOT/wiki" -name "*.json.gz" -type f); do + extract_gzip "$file" +done + +# math +## codesearchnet-owmfilter +for file in $(find "$DATA_ROOT/math/codesearchnet-owmfilter" -name "*.jsonl.gz" -type f); do + extract_gzip "$file" +done + +## gsm8k (train only) +for file in $(find $DATA_ROOT/math/gsm8k/**/train -name "*.jsonl.zst" -type f); do + extract_zstd "$file" +done + +## metamath-owmfilter +for file in $(find "$DATA_ROOT/math/metamath-owmfilter" -name "*.jsonl.gz" -type f); do + extract_gzip "$file" +done + +## tulu_math +for file in $(find "$DATA_ROOT/math/tulu_math" -name "*.jsonl" -type f); do + copy_only "$file" +done + +## dolmino_math_synth +for file in $(find 
"$DATA_ROOT/math/dolmino_math_synth" -name "*.jsonl" -type f); do + copy_only "$file" +done + +## mathcoder2-synthmath +for file in $(find "$DATA_ROOT/math/mathcoder2-synthmath" -name "*.jsonl" -type f); do + copy_only "$file" +done + +## tinyGSM-MIND +for file in $(find "$DATA_ROOT/math/tinyGSM-MIND" -name "*.jsonl.gz" -type f); do + extract_gzip "$file" +done + diff --git a/pretrain/scripts/v4-midtraining/preprocess/merge_files.sh b/pretrain/scripts/v4-midtraining/preprocess/merge_files.sh new file mode 100755 index 0000000..d56d6c0 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/preprocess/merge_files.sh @@ -0,0 +1,139 @@ +#!/bin/bash +#PBS -P gcg51557 +#PBS -q R9920251000 +#PBS -N 0156_preprocess_merge_files +#PBS -l select=1 +#PBS -o /dev/null +#PBS -e /dev/null +#PBS -m n +#PBS -v RTYPE=rt_HC + +set -eu -o pipefail +shopt -s globstar +shopt -s nullglob +shopt -s failglob + +EXP_DIR="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction" +DATA_ROOT="${EXP_DIR}/dolmino-mix-1124-extracted" +OUTPUT_ROOT="${EXP_DIR}/dolmino-mix-1124-extracted-merged" + +JOBID=${PBS_JOBID:-shell} +JOBID=${JOBID%%.*} +LOG_DIR="${EXP_DIR}/task/logs" +mkdir -p "$LOG_DIR" +exec > "$LOG_DIR/merge_files-$JOBID.log" 2>&1 + +min() { + a="$1" + b="$2" + if [ "$a" -lt "$b" ]; then + echo "$a" + else + echo "$b" + fi +} + +# Workaround for `codesearchnet-owmfilter` and `dolmino_math_synth` +merge_jsonl_nl() { + for f in "$@"; do + cat "$f" + # Add a newline only if the last byte is not one: "$(...)" strips a trailing newline, so a non-empty result means it is missing. + if [ -n "$(tail -c1 "$f")" ]; then printf '\n'; fi + done +} + +# DCLM +## 0000 - 0009, 0010 - 0019, ..., 0240 - 0246 +DCLM_DIR="$DATA_ROOT/dclm" +max_num=246 +increment=10 +for i in $(seq 0 $increment $max_num); do + # cat "$DCLM_DIR/0000/*.json" "$DCLM_DIR/0001/*.json" ... 
"$DCLM_DIR/0009/*.json" > "$OUTPUT_ROOT/dclm/dclm-0000-0009.jsonl" + start=$i + end=$(min $(($i + $increment - 1)) $max_num) + echo "Merging DCLM files from $start to $end" + dir_list=$(seq -f "${DCLM_DIR}/%04g" -s " " $start $end) + concat_files=$(find $dir_list -name "*.json" | sort) + output_file="$OUTPUT_ROOT/dclm/dclm-$(printf '%04d' $start)-$(printf '%04d' $end).jsonl" + mkdir -p "$(dirname "$output_file")" + cat $concat_files > $output_file + echo "Output file: $output_file" +done + +# flan +echo "Merging FLAN files" +output_flan="$OUTPUT_ROOT/flan/flan-all.jsonl" +mkdir -p "$(dirname "$output_flan")" +cat $DATA_ROOT/flan/*.json > "$output_flan" +echo "Output file: $output_flan" + +# pes2o +echo "Merging PES2O files" +output_pes2o="$OUTPUT_ROOT/pes2o/pes2o-all.jsonl" +mkdir -p "$(dirname "$output_pes2o")" +cat $DATA_ROOT/pes2o/*.json > "$output_pes2o" +echo "Output file: $output_pes2o" + +# stackexchange +echo "Merging StackExchange files" +output_stackexchange="$OUTPUT_ROOT/stackexchange/stackexchange-all.jsonl" +mkdir -p "$(dirname "$output_stackexchange")" +cat $DATA_ROOT/stackexchange/*.json > "$output_stackexchange" +echo "Output file: $output_stackexchange" + +# wiki +echo "Merging Wiki files" +output_wiki="$OUTPUT_ROOT/wiki/wiki-all.jsonl" +mkdir -p "$(dirname "$output_wiki")" +cat $DATA_ROOT/wiki/*.json > "$output_wiki" +echo "Output file: $output_wiki" + +# math +## codesearchnet-owmfilter +echo "Merging codesearchnet-owmfilter files" +output_codesearchnet="$OUTPUT_ROOT/math/codesearchnet-owmfilter-all.jsonl" +mkdir -p "$(dirname "$output_codesearchnet")" +merge_jsonl_nl $DATA_ROOT/math/codesearchnet-owmfilter/**/*.jsonl > "$output_codesearchnet" +echo "Output file: $output_codesearchnet" + +## gsm8k +echo "Merging gsm8k files" +output_gsm8k="$OUTPUT_ROOT/math/gsm8k-all.jsonl" +mkdir -p "$(dirname "$output_gsm8k")" +merge_jsonl_nl $DATA_ROOT/math/gsm8k/**/*.jsonl > "$output_gsm8k" +echo "Output file: $output_gsm8k" + +## metamath-owmfilter +echo 
"Merging metamath-owmfilter files" +output_metamath="$OUTPUT_ROOT/math/metamath-owmfilter-all.jsonl" +mkdir -p "$(dirname "$output_metamath")" +merge_jsonl_nl $DATA_ROOT/math/metamath-owmfilter/**/*.jsonl > "$output_metamath" +echo "Output file: $output_metamath" + +## tulu_math +echo "Merging tulu_math files" +output_tulu_math="$OUTPUT_ROOT/math/tulu_math-all.jsonl" +mkdir -p "$(dirname "$output_tulu_math")" +merge_jsonl_nl $DATA_ROOT/math/tulu_math/**/*.jsonl > "$output_tulu_math" +echo "Output file: $output_tulu_math" + +## dolmino_math_synth +echo "Merging dolmino_math_synth files" +output_dolmino_math_synth="$OUTPUT_ROOT/math/dolmino_math_synth-all.jsonl" +mkdir -p "$(dirname "$output_dolmino_math_synth")" +merge_jsonl_nl $DATA_ROOT/math/dolmino_math_synth/**/*.jsonl > "$output_dolmino_math_synth" +echo "Output file: $output_dolmino_math_synth" + +## mathcoder2-synthmath +echo "Merging mathcoder2-synthmath files" +output_mathcoder2="$OUTPUT_ROOT/math/mathcoder2-synthmath-all.jsonl" +mkdir -p "$(dirname "$output_mathcoder2")" +merge_jsonl_nl $DATA_ROOT/math/mathcoder2-synthmath/**/*.jsonl > "$output_mathcoder2" +echo "Output file: $output_mathcoder2" + +## tinyGSM-MIND +echo "Merging tinyGSM-MIND files" +output_tinygsm_mind="$OUTPUT_ROOT/math/tinyGSM-MIND-all.jsonl" +mkdir -p "$(dirname "$output_tinygsm_mind")" +merge_jsonl_nl $DATA_ROOT/math/tinyGSM-MIND/**/*.jsonl > "$output_tinygsm_mind" +echo "Output file: $output_tinygsm_mind" diff --git a/pretrain/scripts/v4-midtraining/preprocess/tokenize.sh b/pretrain/scripts/v4-midtraining/preprocess/tokenize.sh new file mode 100755 index 0000000..c0c7b32 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/preprocess/tokenize.sh @@ -0,0 +1,73 @@ +#!/bin/bash +#PBS -P gcg51557 +#PBS -q R9920251000 +#PBS -l walltime=12:00:00 +#PBS -N 0156_tokenize +#PBS -l select=1 +#PBS -o /dev/null +#PBS -e /dev/null +#PBS -m n +#PBS -v RTYPE=rt_HF + +cd $PBS_O_WORKDIR + 
+EXPERIMENT_DIR=/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction +SCRIPT_DIR=${EXPERIMENT_DIR}/scripts/pretrain/scripts/v4-midtraining +# ENV_DIR=${EXPERIMENT_DIR}/environments +ENV_DIR=${EXPERIMENT_DIR}/environment2 +MEGATRON_PATH=${ENV_DIR}/src/Megatron-LM + +JOBID=${PBS_JOBID%%.*} +TASK_DIR="$EXPERIMENT_DIR/task" +TOKENIZE_LOG_DIR="${TASK_DIR}/logs/tokenize-$JOBID/" +mkdir -p ${TOKENIZE_LOG_DIR} +LOGFILE=${TOKENIZE_LOG_DIR}/stdout.log +ERRFILE=${TOKENIZE_LOG_DIR}/stderr.log +exec > $LOGFILE 2> $ERRFILE + +set -eu -o pipefail + +# Arguments +echo "EXPERIMENT_DIR=${EXPERIMENT_DIR}" +echo "SCRIPT_DIR=${SCRIPT_DIR}" + +# Load environments +source ${ENV_DIR}/venv/bin/activate +source ${ENV_DIR}/scripts/environment.sh + +# Tokenizer config +export TOKENIZER_MODEL="${ENV_DIR}/src/llm-jp-tokenizer/models/ver3.0/llm-jp-tokenizer-100k.ver3.0b1.model" +export TOKENIZER_TYPE=Llama2Tokenizer + +export WORKERS_PER_PROC=16 +N_PROCS=$(($(nproc) / $WORKERS_PER_PROC)) + +export DATA_DIR=${EXPERIMENT_DIR}/dolmino-mix-1124-extracted-merged +export OUTPUT_DIR=${EXPERIMENT_DIR}/dolmino-mix-1124-tokenized +mkdir -p ${OUTPUT_DIR} +export MEGATRON_PATH +export TOKENIZE_LOG_DIR + +# Tokenize +find ${DATA_DIR} -name "*.jsonl" -print0 | \ + sort -z | \ + xargs -0 -P${N_PROCS} -I "{}" bash -c ' + file="{}" + echo "Tokenizing ${file}" + relative_path="${file#${DATA_DIR}/}" + output_path="${OUTPUT_DIR}/${relative_path}" + tokenize_log_file="${TOKENIZE_LOG_DIR}/${relative_path}.log" + mkdir -p "$(dirname "$output_path")" + mkdir -p "$(dirname "$tokenize_log_file")" + + python $MEGATRON_PATH/tools/preprocess_data.py \ + --input "$file" \ + --output-prefix "${output_path%.jsonl}" \ + --tokenizer-model "$TOKENIZER_MODEL" \ + --tokenizer-type "$TOKENIZER_TYPE" \ + --workers "$WORKERS_PER_PROC" \ + --append-eod > "$tokenize_log_file" 2>&1 + + echo "Tokenization completed for ${file}" + ' + diff --git a/pretrain/scripts/v4-midtraining/preprocess/update_train_data_to_100B.sh 
b/pretrain/scripts/v4-midtraining/preprocess/update_train_data_to_100B.sh new file mode 100755 index 0000000..f4a8be0 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/preprocess/update_train_data_to_100B.sh @@ -0,0 +1,36 @@ +#!/bin/bash + +update_train_data () { + local IN="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data.all.sh" + local OUT="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_100B.sh" + + awk ' + BEGIN { + FS = OFS = " " + } + function ceil(x) { return (x == int(x) ? x : int(x) + 1) } + + /^[[:space:]]*[0-9]/ { + tok = $1 + path = $2 + + ratio = 1 # default 100% + if (path ~ /\/dclm\//) ratio = 0.0685 # 6.85 % + else if (path ~ /\/pes2o\//) ratio = 0.167 # 16.7 % + else if (path ~ /\/math\//) ratio = 2.0 # 200 % + # flan / stackexchange / wiki: ratio = 1 + + newtok = ceil(tok * ratio) + $1 = newtok + print + next + } + + { print } + ' "$IN" > "$OUT" + + echo "Created $OUT" +} + +update_train_data + diff --git a/pretrain/scripts/v4-midtraining/preprocess/update_train_data_to_300B.sh b/pretrain/scripts/v4-midtraining/preprocess/update_train_data_to_300B.sh new file mode 100755 index 0000000..680f6aa --- /dev/null +++ b/pretrain/scripts/v4-midtraining/preprocess/update_train_data_to_300B.sh @@ -0,0 +1,38 @@ +#!/bin/bash + +update_train_data () { + local IN="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data.all.sh" + local OUT="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_300B.sh" + + awk ' + BEGIN { + FS = OFS = " " + } + function ceil(x) { return (x == int(x) ?
x : int(x) + 1) } + + /^[[:space:]]*[0-9]/ { + tok = $1 + path = $2 + + ratio = 1 # default 100% + if (path ~ /\/dclm\//) ratio = 0.2078 # 20.78 % + else if (path ~ /\/flan\//) ratio = 2.0 # 200 % + else if (path ~ /\/stackexchange\//) ratio = 4.0 # 400 % + else if (path ~ /\/math\//) ratio = 4.0 # 400 % + else if (path ~ /\/wiki\//) ratio = 4.0 # 400 % + # peS2o: ratio = 1 + + newtok = ceil(tok * ratio) + $1 = newtok + print + next + } + + { print } + ' "$IN" > "$OUT" + + echo "Created $OUT" +} + +update_train_data + diff --git a/pretrain/scripts/v4-midtraining/preprocess/update_train_data_to_50B.sh b/pretrain/scripts/v4-midtraining/preprocess/update_train_data_to_50B.sh new file mode 100755 index 0000000..746c828 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/preprocess/update_train_data_to_50B.sh @@ -0,0 +1,36 @@ +#!/bin/bash + +update_train_data () { + local IN="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data.all.sh" + local OUT="/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/scripts/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_50B.sh" + + awk ' + BEGIN { + FS = OFS = " " + } + function ceil(x) { return (x == int(x) ?
x : int(x) + 1) } + + /^[[:space:]]*[0-9]/ { + tok = $1 + path = $2 + + ratio = 1 # default 100% + if (path ~ /\/dclm\//) ratio = 0.0323 # 3.23 % + else if (path ~ /\/flan\//) ratio = 0.5 # 50 % + else if (path ~ /\/pes2o\//) ratio = 0.0515 # 5.15 % + # math / stackexchange / wiki: ratio = 1 + + newtok = ceil(tok * ratio) + $1 = newtok + print + next + } + + { print } + ' "$IN" > "$OUT" + + echo "Created $OUT" +} + +update_train_data + diff --git a/pretrain/scripts/v4-midtraining/tasks/.gitignore b/pretrain/scripts/v4-midtraining/tasks/.gitignore new file mode 100644 index 0000000..0d5b44a --- /dev/null +++ b/pretrain/scripts/v4-midtraining/tasks/.gitignore @@ -0,0 +1,6 @@ +cache/ +checkpoints/ +checkpoints_hf/ +logs/ +checkpoints_bak/ +train_iters.txt diff --git a/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/100B/train_iters.txt b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/100B/train_iters.txt new file mode 100644 index 0000000..bf42282 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/100B/train_iters.txt @@ -0,0 +1 @@ +1873191 diff --git a/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/300B/train_iters.txt b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/300B/train_iters.txt new file mode 100644 index 0000000..9950a88 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/300B/train_iters.txt @@ -0,0 +1 @@ +1899920 diff --git a/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/train_iters.txt b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/train_iters.txt new file mode 100644 index 0000000..1e68821 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/1.3b-llama3-ecjk/50B/train_iters.txt @@ -0,0 +1 @@ +1866317 diff --git
a/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/7.7b-llama3-ecjk/100B/train_iters.txt b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/7.7b-llama3-ecjk/100B/train_iters.txt new file mode 100644 index 0000000..bf42282 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/7.7b-llama3-ecjk/100B/train_iters.txt @@ -0,0 +1 @@ +1873191 diff --git a/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/7.7b-llama3-ecjk/300B/train_iters.txt b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/7.7b-llama3-ecjk/300B/train_iters.txt new file mode 100644 index 0000000..9950a88 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/7.7b-llama3-ecjk/300B/train_iters.txt @@ -0,0 +1 @@ +1899920 diff --git a/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/7.7b-llama3-ecjk/50B/train_iters.txt b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/7.7b-llama3-ecjk/50B/train_iters.txt new file mode 100644 index 0000000..1e68821 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/7.7b-llama3-ecjk/50B/train_iters.txt @@ -0,0 +1 @@ +1866317 diff --git a/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data.all.sh b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data.all.sh new file mode 100644 index 0000000..6f7b7aa --- /dev/null +++ b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data.all.sh @@ -0,0 +1,38 @@ +export TRAIN_DATA_PATH=( + 33257644821 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0000-0009_text_document + 33578224849 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0010-0019_text_document + 33376091045 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0020-0029_text_document + 33395497290 
/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0030-0039_text_document + 33691505776 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0040-0049_text_document + 33254542614 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0050-0059_text_document + 33261965033 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0060-0069_text_document + 33191959688 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0070-0079_text_document + 33471144652 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0080-0089_text_document + 33537413045 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0090-0099_text_document + 33611132762 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0100-0109_text_document + 33551029590 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0110-0119_text_document + 32990519564 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0120-0129_text_document + 33411440903 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0130-0139_text_document + 33306772969 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0140-0149_text_document + 33501253112 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0150-0159_text_document + 33428263069 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0160-0169_text_document + 33529729657 
/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0170-0179_text_document + 33332439633 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0180-0189_text_document + 33359204624 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0190-0199_text_document + 33403446961 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0200-0209_text_document + 33449850162 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0210-0219_text_document + 33106746850 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0220-0229_text_document + 33027126637 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0230-0239_text_document + 20675136940 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0240-0246_text_document + 18485484042 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/flan/flan-all_text_document + 2174159 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/codesearchnet-owmfilter-all_text_document + 31677007 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/dolmino_math_synth-all_text_document + 2841494 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/gsm8k-all_text_document + 4098243004 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/mathcoder2-synthmath-all_text_document + 85423408 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/metamath-owmfilter-all_text_document + 6944299886 
/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/tinyGSM-MIND-all_text_document + 250390697 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/tulu_math-all_text_document + 62853772802 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/pes2o/pes2o-all_text_document + 1464772187 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/stackexchange/stackexchange-all_text_document + 3896965449 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/wiki/wiki-all_text_document +) diff --git a/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_100B.sh b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_100B.sh new file mode 100644 index 0000000..05bc48f --- /dev/null +++ b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_100B.sh @@ -0,0 +1,38 @@ +export TRAIN_DATA_PATH=( +2278148671 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0000-0009_text_document +2300108403 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0010-0019_text_document +2286262237 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0020-0029_text_document +2287591565 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0030-0039_text_document +2307868146 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0040-0049_text_document +2277936170 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0050-0059_text_document +2278444605 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0060-0069_text_document +2273649239 
/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0070-0079_text_document +2292773409 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0080-0089_text_document +2297312794 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0090-0099_text_document +2302362595 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0100-0109_text_document +2298245527 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0110-0119_text_document +2259850591 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0120-0129_text_document +2288683702 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0130-0139_text_document +2281513949 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0140-0149_text_document +2294835839 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0150-0159_text_document +2289836021 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0160-0169_text_document +2296786482 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0170-0179_text_document +2283272115 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0180-0189_text_document +2285105517 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0190-0199_text_document +2288136117 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0200-0209_text_document +2291314737 
/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0210-0219_text_document +2267812160 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0220-0229_text_document +2262358175 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0230-0239_text_document +1416246881 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0240-0246_text_document +18485484042 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/flan/flan-all_text_document +4348318 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/codesearchnet-owmfilter-all_text_document +63354014 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/dolmino_math_synth-all_text_document +5682988 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/gsm8k-all_text_document +8196486008 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/mathcoder2-synthmath-all_text_document +170846816 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/metamath-owmfilter-all_text_document +13888599772 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/tinyGSM-MIND-all_text_document +500781394 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/tulu_math-all_text_document +10496580058 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/pes2o/pes2o-all_text_document +1464772187 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/stackexchange/stackexchange-all_text_document +3896965449 
/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/wiki/wiki-all_text_document +) diff --git a/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_300B.sh b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_300B.sh new file mode 100644 index 0000000..6744b62 --- /dev/null +++ b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_300B.sh @@ -0,0 +1,38 @@ +export TRAIN_DATA_PATH=( +6910938594 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0000-0009_text_document +6977555124 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0010-0019_text_document +6935551720 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0020-0029_text_document +6939584337 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0030-0039_text_document +7001094901 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0040-0049_text_document +6910293956 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0050-0059_text_document +6911836334 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0060-0069_text_document +6897289224 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0070-0079_text_document +6955303859 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0080-0089_text_document +6969074431 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0090-0099_text_document +6984393388 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0100-0109_text_document +6971903949 
/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0110-0119_text_document +6855429966 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0120-0129_text_document +6942897420 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0130-0139_text_document +6921147423 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0140-0149_text_document +6961560397 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0150-0159_text_document +6946393066 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0160-0169_text_document +6967477823 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0170-0179_text_document +6926480956 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0180-0189_text_document +6932042721 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0190-0199_text_document +6941236279 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0200-0209_text_document +6950878864 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0210-0219_text_document +6879581996 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0220-0229_text_document +6863036916 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0230-0239_text_document +4296293457 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0240-0246_text_document +36970968084 
/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/flan/flan-all_text_document +8696636 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/codesearchnet-owmfilter-all_text_document +126708028 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/dolmino_math_synth-all_text_document +11365976 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/gsm8k-all_text_document +16392972016 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/mathcoder2-synthmath-all_text_document +341693632 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/metamath-owmfilter-all_text_document +27777199544 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/tinyGSM-MIND-all_text_document +1001562788 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/tulu_math-all_text_document +62853772802 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/pes2o/pes2o-all_text_document +5859088748 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/stackexchange/stackexchange-all_text_document +15587861796 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/wiki/wiki-all_text_document +) diff --git a/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_50B.sh b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_50B.sh new file mode 100644 index 0000000..307b78c --- /dev/null +++ b/pretrain/scripts/v4-midtraining/tasks/v4-dolmino-mix-1124/train_data_50B.sh @@ -0,0 +1,38 @@ +export TRAIN_DATA_PATH=( +1074221928 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0000-0009_text_document 
+1084576663 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0010-0019_text_document +1078047741 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0020-0029_text_document +1078674563 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0030-0039_text_document +1088235637 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0040-0049_text_document +1074121727 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0050-0059_text_document +1074361471 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0060-0069_text_document +1072100298 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0070-0079_text_document +1081117973 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0080-0089_text_document +1083258442 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0090-0099_text_document +1085639589 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0100-0109_text_document +1083698256 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0110-0119_text_document +1065593782 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0120-0129_text_document +1079189542 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0130-0139_text_document +1075808767 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0140-0149_text_document +1082090476 
/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0150-0159_text_document +1079732898 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0160-0169_text_document +1083010268 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0170-0179_text_document +1076637801 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0180-0189_text_document +1077502310 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0190-0199_text_document +1078931337 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0200-0209_text_document +1080430161 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0210-0219_text_document +1069347924 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0220-0229_text_document +1066776191 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0230-0239_text_document +667806924 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/dclm/dclm-0240-0246_text_document +9242742021 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/flan/flan-all_text_document +2174159 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/codesearchnet-owmfilter-all_text_document +31677007 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/dolmino_math_synth-all_text_document +2841494 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/gsm8k-all_text_document +4098243004 
/groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/mathcoder2-synthmath-all_text_document +85423408 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/metamath-owmfilter-all_text_document +6944299886 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/tinyGSM-MIND-all_text_document +250390697 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/math/tulu_math-all_text_document +3236969300 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/pes2o/pes2o-all_text_document +1464772187 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/stackexchange/stackexchange-all_text_document +3896965449 /groups/gcg51557/experiments/0156_olmo2-midtrain-reproduction/dolmino-mix-1124-tokenized/wiki/wiki-all_text_document +)
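The generated `train_data_*.sh` files can be checked against the intended mix by summing the token counts per dataset. The sketch below is a hypothetical helper (`summarize_mix` is not part of this PR). It assumes every data line has the form `<tokens> <path>` and that the dataset name is the directory directly above the tokenized file, as in the paths above.

```bash
#!/bin/bash
# Hypothetical helper, not part of this PR: sum the token counts per dataset
# in a train_data_*.sh file and print each dataset's share of the total.
summarize_mix() {
  awk '
    # Data lines look like "<tokens> <path>"; the "export ...=(" and ")" lines are skipped.
    /^[[:space:]]*[0-9]/ {
      n = split($2, parts, "/")
      # Assumption: the dataset name is the parent directory of the file.
      name = parts[n - 1]
      sum[name] += $1
      total += $1
    }
    END {
      for (name in sum)
        printf "%s %.0f %.2f%%\n", name, sum[name], 100 * sum[name] / total
    }
  ' "$1" | sort
}
```

For example, `summarize_mix tasks/v4-dolmino-mix-1124/train_data_50B.sh` should print one line per dataset. The seven `math` entries listed above sum to 11,415,049,655 tokens, which matches the Math row of the table in the README.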