Conversation

@Charles2530

No description provided.

@copy-pr-bot

copy-pr-bot bot commented Sep 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

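- Renamed files for readability:
  - corrected_config.py → train_config.py
  - corrected_fa_config.py → flash_attention_config.py
  - modify_quant_type.py → quant_type_modifier.py

- Consolidated all markdown documentation into README.md
- Removed duplicate and outdated documentation files
- Added a complete set of quantized-training scripts:
  - 3 models: llama31-8b, llama32-1b, deepseek2_lite
  - 2 datasets: wikipedia, dolma
  - 10 quantization modes: bf16, FA, Linear, and FA+Linear combinations
  - 60 generated training scripts covering every combination

- Core functionality:
  - Quantization type modifier tool
  - Unified training configuration management
  - Dedicated Flash Attention configuration
  - Complete documentation and usage guide
- Added the improved_script_template.sh template, which supports changing the quantization type dynamically
- Added update_scripts_with_pattern_v2.py, a tool that automates script updates
- Created the examples/deepseek2_lite/train_deepseek2_lite_h100_fp8.sh training script
- Refactored all training scripts to share one execution pattern:
  * Export the HOST_TENSORBOARD_LOGS_PATH environment variable
  * Use sed to change the quantization type dynamically (hifp8/mxfp8/mxfp4)
  * Use tee to record timestamped training logs
- Cleaned up unused files: removed backup files and outdated configuration scripts
- Supports three models: llama32-1b, llama31-8b, deepseek2_lite
- Supports multiple quantization types: bf16, hifp8, mxfp8, mxfp4, fp8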
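The sed step listed above rewrites the quantization type in a source file before training starts. A minimal Python equivalent is sketched here, assuming the target file holds an assignment of the form `custom_quant_type = 'hifp8'`; that variable name and the helper `set_quant_type` are illustrative assumptions, not names confirmed by this PR.

```python
import re

# Quantization types named in this PR.
VALID_QUANT_TYPES = {"bf16", "hifp8", "mxfp8", "mxfp4", "fp8"}

# Hypothetical assignment pattern; the real variable name may differ.
_PATTERN = re.compile(r"(custom_quant_type\s*=\s*)(['\"])[A-Za-z0-9_]+\2")


def set_quant_type(source: str, quant_type: str):
    """Return (new_source, replacement_count) with the quant type swapped in."""
    if quant_type not in VALID_QUANT_TYPES:
        raise ValueError(f"unknown quant type: {quant_type!r}")
    # \g<1> keeps the left-hand side, \g<2> keeps the original quote character.
    return _PATTERN.subn(rf"\g<1>\g<2>{quant_type}\g<2>", source)
```

Returning the replacement count lets a caller fail loudly when the pattern matched nothing, which a bare `sed -i` would not report.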
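Main improvements:
- Added the tensor_saver.py module for fine-grained tensor saving
- Improved the tensor naming scheme with phase (pre/post) and component (FA/linear) identifiers
- Modified the attention and linear layer code to save forward/backward input and output tensors
- Added tensor collection scripts supporting the mxfp8/mxfp4/hifp8 quantization types
- Reorganized the script directory and added data processing and visualization tools
- Added complete documentation and a usage guide

New features:
- Tensor saving for the pre/post phases
- Tensor saving for the FA/linear components
- Tensor collection for multiple quantization types
- Tensor visualization and analysis
- Data processing scripts

File changes:
- Added: megatron/core/tensor_saver.py
- Modified: megatron/core/tensor_parallel/layers.py
- Modified: megatron/core/transformer/dot_product_attention.py
- Added: several tensor collection and visualization scripts
- Added: complete documentation and a usage guide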
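The phase and component identifiers described above can be folded into a filename along these lines. This is only a sketch: the field order, the `TensorTag` class, and the `.pt` layout are assumptions for illustration, and the naming actually used in tensor_saver.py may differ.

```python
from dataclasses import dataclass

PHASES = ("pre", "post")          # phases named in the commit message
COMPONENTS = ("FA", "linear")     # components named in the commit message


@dataclass(frozen=True)
class TensorTag:
    quant_type: str   # e.g. "mxfp8"
    component: str    # "FA" or "linear"
    direction: str    # "forward" or "backward"
    phase: str        # "pre" or "post"
    name: str         # e.g. "input" or "output"

    def filename(self) -> str:
        """Build a filename; the field order here is an assumed convention."""
        if self.phase not in PHASES:
            raise ValueError(f"unknown phase: {self.phase!r}")
        if self.component not in COMPONENTS:
            raise ValueError(f"unknown component: {self.component!r}")
        return (f"{self.quant_type}_{self.component}_{self.direction}"
                f"_{self.phase}_{self.name}.pt")
```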
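Main improvements:
- Unified the tensor save directory as enhanced_tensor_logs
- Added command-line arguments to control tensor saving (--save-tensors, --tensor-save-dir)
- Added enhanced visualization tools (enhanced_tensor_visualizer.py, quick_visualize_enhanced.py)
- Updated the one-click visualization script to support the enhanced tools
- Removed duplicate test files and documentation
- Moved test files to the tests/unit_tests/ directory
- Updated .gitignore to ignore the old tensor_logs directory

Features:
- High-quality image output (300 DPI)
- Multi-dimensional analysis: quantization type comparison, attention analysis, layer type comparison
- Detailed statistical report generation
- Automatic error handling and fallback
- Runs in the megatron environment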
- Translate README.md documentation to English
- Translate one_click_visualize.sh shell script messages
- Translate quick_visualize_enhanced.py Python script
- Translate quick_visualize.py Python script
- Translate enhanced_tensor_visualizer.py Python script
- Translate visualize_tensors.py Python script

All Chinese text in visualization directory has been converted to English
for better international accessibility and chart display compatibility.
- Implemented unified tensor indexing for same-layer tensors (input, weight, output)
- Added attention weight (P distribution) saving functionality
- Enhanced rank detection with multiple fallback mechanisms
- Added comprehensive tensor analysis and visualization features
- Implemented overflow detection and analysis
- Added FP8 distribution analysis (hifp8, mxfp8)
- Added BF16 special distribution analysis
- Added backward pass tensor collection
- Added layer-wise and batch analysis capabilities
- Added rank-aware tensor storage with metadata
- Enhanced visualization with multiple analysis types
- Added comprehensive test scripts and documentation
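The unified indexing mentioned above (one index shared by the input, weight, and output of the same layer call) could be sketched as follows. The class body and the shape of the group key are assumptions for illustration; only the name TensorIndexManager comes from this PR.

```python
class TensorIndexManager:
    """Hands out one group index per layer call, shared by all of its tensors."""

    def __init__(self):
        self._next_index = 0
        self._groups = {}

    def index_for(self, group_key):
        """Return the index for group_key, allocating a new one on first use."""
        if group_key not in self._groups:
            self._groups[group_key] = self._next_index
            self._next_index += 1
        return self._groups[group_key]

    def reset(self):
        """Forget all groups, e.g. at the start of a new iteration."""
        self._next_index = 0
        self._groups.clear()


manager = TensorIndexManager()
key = ("linear", 3, "forward")  # hypothetical (component, layer, pass) key
# input, weight and output of the same call all resolve to the same index
indices = [manager.index_for(key) for _ in ("input", "weight", "output")]
```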

Key features:
- TensorIndexManager for consistent indexing across layer components
- Global state management for rank, sample, and iteration tracking
- Enhanced filename generation with group indexing
- Comprehensive metadata collection and analysis
- Multiple visualization types for different analysis needs
- Add --control-iter argument to control number of iterations for tensor collection
- Modify TensorSaver class to stop collection after specified iterations
- Update run_wikipedia_tensor_collection.sh to support --control-iter parameter
- Update train_llama32_1b_h100_fp8.sh to pass through additional arguments
- Add documentation and test script for the new feature
- Default value is 1 iteration, can be customized via command line or environment variable
- Add iteration update in training loop to sync with actual training iteration
- Remove incorrect iteration_data_count check that was preventing proper control
- Now tensor collection will properly stop after control_iter iterations
- Add check for control_iter limit in training loop
- Exit training immediately when iteration >= control_iter
- This prevents unnecessary training after tensor collection is complete
- Improves efficiency by stopping training when tensor collection goal is met
- Replace hardcoded assumption of 10 tensors per iteration
- Use actual iteration pattern from tensor filenames (iter{iteration:03d})
- Count unique iterations instead of total file count
- More accurate detection of when control_iter is reached
- Add test script to verify iteration detection logic
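Counting unique iterations from the `iter{iteration:03d}` filename pattern, as described above, might look like the sketch below. The example filenames and the function names are hypothetical; only the `iter` plus zero-padded number pattern is taken from the commit message.

```python
import re

# Matches the iter{iteration:03d} fragment; also tolerates more than 3 digits.
_ITER_RE = re.compile(r"iter(\d{3,})")


def count_unique_iterations(filenames):
    """Count distinct iteration numbers found in tensor filenames."""
    iterations = set()
    for name in filenames:
        match = _ITER_RE.search(name)
        if match:
            iterations.add(int(match.group(1)))
    return len(iterations)


def control_iter_reached(filenames, control_iter):
    """True once files from at least control_iter distinct iterations exist."""
    return count_unique_iterations(filenames) >= control_iter
```

Because the count is over distinct iteration numbers, the result does not depend on how many tensors each iteration happens to produce.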
- Add sample idx update in forward_step for each micro batch
- Reset sample idx to 0 at the start of each iteration
- Now sample_idx correctly ranges from 0 to 127 for 128 samples per iteration
- This ensures tensor files have proper sample indexing in filenames
- Add test script to verify sample idx update logic
- Move control_iter check to after iteration increment
- Now correctly exits after exactly control_iter iterations
- Fixes issue where it was executing one extra iteration
- control_iter=1 now executes only iteration 0, not 0 and 1
- Add test script to verify correct behavior
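The ordering fix above (check the limit only after the iteration counter is incremented) can be illustrated with a toy loop. `run_training` and its `step` callback are stand-ins, not the actual Megatron-LM training loop.

```python
def run_training(control_iter, train_iters, step):
    """Run up to train_iters steps, stopping once control_iter steps have run."""
    executed = []
    iteration = 0
    while iteration < train_iters:
        step(iteration)           # one training step (stand-in callback)
        executed.append(iteration)
        iteration += 1
        # Checking after the increment means `iteration` already equals the
        # number of completed steps, so >= stops after exactly control_iter.
        if iteration >= control_iter:
            break
    return executed
```

With `control_iter=1` this runs only iteration 0, which is the behavior the commit message describes.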
- Verify all tensor-related parameters work correctly
- Check --save-tensors -> args.save_tensors conversion
- Check --tensor-save-dir -> args.tensor_save_dir conversion
- Check --control-iter -> args.control_iter conversion
- Confirm argparse automatically converts hyphens to underscores
- Add comprehensive test scripts for parameter validation
- Create detailed parameter consistency report
- All tensor collection parameters are working correctly
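The hyphen-to-underscore conversion checked above is standard argparse behavior and can be reproduced in isolation. The flag names come from this PR; the `action`, `type`, and default values below are assumptions for the sketch.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Flag names are from the PR; types and defaults here are assumed.
    parser.add_argument("--save-tensors", action="store_true")
    parser.add_argument("--tensor-save-dir", type=str,
                        default="enhanced_tensor_logs")
    parser.add_argument("--control-iter", type=int, default=1)
    return parser


# argparse derives the attribute name from the long option and replaces
# interior hyphens with underscores.
args = build_parser().parse_args(["--save-tensors", "--control-iter", "2"])
```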
- Add layer_idx detection logic using inspect module to get layer_number from call stack
- Fix linear layer tensor saving to include proper layer_idx in filenames
- Improve sample_idx fallback logic with default value 0
- Add comprehensive test script to analyze tensor collection issues
- Address issues: 56 tensors too few, linear layers missing layer numbers
- Expected ~131K tensors for 128 samples, 32 layers, 8 ranks
- Linear layers now should have L{layer_idx} in filenames like attention layers
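- Training loop: exit when iteration > control_iter (instead of >=)
- Tensor saver: stop saving when current_iteration > control_iter (instead of >=)
- Increased the wait time in the monitoring script and added detailed statistics
- After the fix, a large number of tensor files should be collected rather than 56

Before the fix: only 1 iteration ran and tensor collection was incomplete
After the fix: 1 full iteration runs and tensor collection is complete
- Training loop: exit when iteration >= control_iter
- Tensor saver: stop saving when current_iteration >= control_iter
- With control_iter=1:
  - Run iteration 0 (tensors collected)
  - Run iteration 1 (no tensors collected, then exit)
- Ensures 1 full iteration runs and all tensors are collected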
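Main fixes:
- Run the training process in the background (&)
- Added process PID management and monitoring
- Added a complete tensor collection monitoring loop
- Added detailed statistics (attention/linear/sample distribution)
- Added logic to stop the training process
- Removed an empty code block

Problems before the fix:
- No monitoring logic; the script ended right after launching training
- An empty code block (lines 193-197)
- No process management

Results after the fix:
- Starts and monitors the training process correctly
- Waits for tensor collection to finish
- Reports detailed progress
- Stops training once collection is complete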
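Main changes:
- Removed background execution of training (&)
- Removed process PID management and the monitoring loop
- Removed the process kill logic
- Let control_iter end training naturally

Advantages:
- Simpler: no complex process management
- More reliable: training ends naturally, with no forced termination
- Clearer: control_iter directly controls the training flow
- Safer: avoids data loss and inconsistent state

The script now:
- Runs the training script directly (synchronously)
- Relies on control_iter to end training after the specified iteration
- Summarizes the collection results after training ends
- Moves on to the next quantization type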
Deleted files:
- Temporary test scripts: test_*.py, test_*.sh
- Temporary documents: CONTROL_ITER_IMPLEMENTATION.md, PARAMETER_CONSISTENCY_REPORT.md
- Temporary data files: generate_mock_tensor_data.py, simple_iteration_test.py
- Old tensor files: enhanced_tensor_logs/*.pt

Retained files:
- Core functionality: run_wikipedia_tensor_collection.sh
- Core code: the changes under megatron/
- Formal tests: the tests under tests/unit_tests/

After the cleanup the codebase is tidier and keeps only the necessary files
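New features:
1. Multi-threaded visualization tool (multi_threaded_visualizer.py)
   - Supports four quantization types: bf16, mxfp8, mxfp4, hifp8
   - Supports multi-dimensional comparison across Sample (0,1,2) and Layer (1-16)
   - Uses multiple threads to speed up plotting
   - Generates quantization type, sample, layer, and combined comparison plots

2. Overflow detection analyzer (overflow_detection_analyzer.py)
   - Detects tensor overflow based on the characteristic limits of each quantization type
   - Supports precise overflow detection for all four quantization types
   - Processes large numbers of tensor files in parallel threads
   - Generates detailed overflow reports and charts

3. Updated run_tensor_draw.sh
   - Supports the new data layout (bf16/mxfp8/mxfp4/hifp8)
   - Adds a check for the quantization type directories
   - Uses the multi-threaded visualization tool
   - Prints detailed output

4. Added run_overflow_analysis.sh
   - Dedicated to overflow detection analysis
   - Displays the limit values for each quantization type
   - Generates an overflow statistics summary

5. Test script (test_overflow_detection.py)
   - Verifies the quantization type limit values
   - Tests the overflow detection logic
   - Verifies the expected behavior

Limit values used per quantization type:
- bf16: max 65504, min 6.103e-05
- hifp8: max 32768, min 3.052e-05
- mxfp8: max 448, min 0.015625
- mxfp4: max 57344, min 6.104e-05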
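A limit-based overflow check of the kind described above can be sketched in plain Python. The limit values are copied verbatim from the list in this commit message; the function name, the treatment of non-finite values as overflow, and the use of a plain iterable instead of a tensor are assumptions for the sketch.

```python
import math

# (max, min) magnitudes exactly as listed in the commit message.
QUANT_LIMITS = {
    "bf16": (65504.0, 6.103e-05),
    "hifp8": (32768.0, 3.052e-05),
    "mxfp8": (448.0, 0.015625),
    "mxfp4": (57344.0, 6.104e-05),
}


def overflow_stats(values, quant_type):
    """Count values above the max limit or below the min limit (non-zero)."""
    max_val, min_val = QUANT_LIMITS[quant_type]
    total = overflow = underflow = 0
    for value in values:
        total += 1
        magnitude = abs(value)
        # Assumption: NaN and infinity are counted as overflow.
        if not math.isfinite(magnitude) or magnitude > max_val:
            overflow += 1
        elif 0.0 < magnitude < min_val:
            underflow += 1
    return {"total": total, "overflow": overflow, "underflow": underflow}
```

A real analyzer would apply the same comparisons to whole tensors at once; the per-element loop here only keeps the sketch dependency-free.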
- Removed all [Pipeline DEBUG] print statements from schedules.py
- Cleaned up verbose debug output that was cluttering the training logs
- Kept essential [Pipeline] messages for important status updates
- Improves log readability during tensor collection process
- Fixed IndentationError caused by empty else blocks after removing debug prints
- Removed orphaned 'else:' statements that had no content
- Corrected try-except block structure in pipeline schedule functions
- Resolves syntax error that prevented module import
…ample-based

- Changed from --train-samples to --train-iters (369844 iterations)
- Changed from --lr-decay-samples to --lr-decay-iters (369103 iterations)
- Changed from --lr-warmup-samples to --lr-warmup-iters (740 iterations)
- Resolves AssertionError: Only backward compatibility support for iteration-based training
- Maintains same training schedule by converting sample counts to iteration counts
- Global batch size of 128 is used for conversion calculations
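The conversion above is integer division by the global batch size. In the sketch below, the iteration counts and batch size are from this commit message; the sample count 47245280 is taken from the checkpoint value quoted in an error message later in this log, and treating it as the original `--lr-decay-samples` value is an assumption.

```python
GLOBAL_BATCH_SIZE = 128  # stated in the commit message


def samples_to_iters(num_samples, global_batch_size=GLOBAL_BATCH_SIZE):
    """Convert a sample count to whole iterations (rounds down)."""
    return num_samples // global_batch_size


def iters_to_samples(num_iters, global_batch_size=GLOBAL_BATCH_SIZE):
    """Convert an iteration count back to the number of samples consumed."""
    return num_iters * global_batch_size


decay_iters = samples_to_iters(47245280)              # assumed original value
decay_samples_roundtrip = iters_to_samples(decay_iters)
```

Because the division rounds down, converting back does not recover the original sample count exactly, so a scheduler state saved under sample-based training can differ slightly from one rebuilt from iteration counts.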
- Only add --load parameter if checkpoint directory or files exist
- Prevents AssertionError when trying to load non-existent checkpoints
- Supports both fresh training and resuming from checkpoints
- Adds informative messages about checkpoint loading status
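The conditional `--load` logic above lives in a shell script; the same decision is sketched here in Python for consistency with the other examples. The function name and message wording are hypothetical, and "directory exists and is non-empty" is one reading of "checkpoint directory or files exist".

```python
import os


def checkpoint_args(checkpoint_dir):
    """Return ['--load', dir] only when the directory exists and has content."""
    if os.path.isdir(checkpoint_dir) and os.listdir(checkpoint_dir):
        print(f"Resuming from checkpoint in {checkpoint_dir}")
        return ["--load", checkpoint_dir]
    print(f"No checkpoint found in {checkpoint_dir}; starting fresh training")
    return []
```

The returned list can be appended to the training command, so a fresh run simply omits `--load` instead of pointing it at a missing path.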
- Resolves optimizer_param_scheduler state_dict mismatch errors
- Add --use-checkpoint-opt-param-scheduler flag to use checkpoint values
- Resolves AssertionError: class input value 47245184 and checkpoint value 47245280
- Allows training to resume from checkpoints with different parameter configurations
- Maintains compatibility with existing checkpoints while using iteration-based training
- Change --use-checkpoint-opt-param-scheduler to --use-checkpoint-opt_param-scheduler
- Resolves 'unrecognized arguments' error
- Matches the actual parameter name defined in arguments.py
- Enables proper checkpoint optimizer parameter scheduler loading
- Remove bf16_linear function as it's not used anywhere in the codebase
- Megatron-LM uses bf16_matmul for linear layers (matrix multiplication + bias)
- Reduces code maintenance burden by removing unused code
- BF16Linear class and bf16_matmul/bf16_baddbmm functions remain for actual usage
- Remove BF16Linear class as it's not used anywhere in the codebase
- Megatron-LM uses bf16_matmul for linear operations (matrix multiplication + bias)
- Reduces code maintenance burden by removing unused code
- Only bf16_matmul and bf16_baddbmm functions remain for actual usage
- Maintains clean and focused codebase with only necessary components
- Change 'if int(rank) is not in (0,1):' to 'if rank not in [0, 1]:'
- Fixes the '此处应有表达式' ("expected expression") syntax error
- Uses proper membership testing with 'not in' operator
- Maintains same logic: only save tensors for rank 0 and 1
- Change should_collect_tensor() to check collection_completed instead of tensor_collected_in_warmup
- This allows backward tensors to be collected after forward tensors are saved
- Add should_exit_after_backward() method to exit after backward completion
- Remove forward-only exit check, wait for backward completion instead
- Ensures both forward and backward tensors are properly saved before exiting
- Move rank filtering check after rank detection logic
- This ensures rank is properly detected before filtering
- Only save tensors for rank 0 and 1 as intended
- Improves code readability and logic flow
- Move backward exit check inside the microbatch loop where break is valid
- Remove duplicate backward exit check outside the loop
- Ensures backward tensor collection works properly in no_pipelining mode
- Fixes SyntaxError: 'break' outside loop
- Save layer_idx to ctx in forward method for backward phase use
- Ensures backward tensors have proper layer numbering (L0, L1, L2, etc.)
- Fixes issue where some linear backward tensors were saved without layer numbers
- Maintains consistency between forward and backward tensor naming
- Change linear_with_grad_accumulation_and_async_allreduce to use CustomLinearWithGradAccumulationAndAsyncCommunication
- This ensures all linear layers use the same Function with tensor saving capabilities
- Fixes issue where some linear backward tensors were saved without layer numbers
- Now all linear backward tensors will have proper layer_idx (L0, L1, L2, etc.)
- Remove unused 'from megatron.core.tensor_saver import save_tensor' imports
- Remove unused 'import os' statements
- Clean up code by removing imports that are no longer needed
- Tensor saving is now handled by quantization operators, making these imports redundant
- Remove duplicate attention tensor saving in dot_product_attention.py
- Update BF16, MXFP, and HIFP operators to use semantic tensor names:
  * Attention operations: query, key, value, attention_probs
  * Linear operations: input_A, input_B (unchanged)
- Forward tensors: query, key, value, attention_probs, matmul_input_buffer
- Backward tensors: grad_query, grad_key, grad_value, grad_attention_probs, grad_matmul_input_buffer
- All operators now intelligently determine tensor names based on component type
- Eliminates duplicate tensor saving and provides clearer naming convention
- Remove print statements from TensorCollectionState setter methods
- Remove print statements from TensorIndexManager
- Remove initialization debug log
- Clean up console output for better log readability
…ools

- Update mxfp_scaling_test.py to organize output by elem_format subdirectory
- Add visualization/overflow/ directory with enhanced analysis tools:
  * enhanced_overflow_analyzer.py - Advanced overflow analysis
  * enhanced_scaling_analyzer.py - Comprehensive scaling analysis
  * enhanced_scaling_analysis_results.json - Analysis results data
  * run_analysis.py - Analysis execution script
  * README_scaling_analysis.md - Documentation
  * mxfp_scaling_test.py - Copy of scaling test script
- Better organize scaling analysis outputs by quantization format
- Change args.elem-format to args.elem_format (use underscore instead of hyphen)
- Fix attribute access syntax in output directory generation
- Ensure proper variable naming consistency
- Correct the number of None values returned by MXFPBAddBmm.backward from 14 to 13
- MXFPBAddBmm.forward has 13 parameters (excluding ctx), so backward should return 13 None values
- This fixes RuntimeError: function MXFPBAddBmmBackward returned an incorrect number of gradients
- MXFPBAddBmm.forward has 14 parameters (excluding ctx)
- MXFPBAddBmm.backward should return 14 None values
- This matches PyTorch autograd expectation for gradient count
- Change --control-iter default from 1 to None
- This prevents training from exiting after just 1 iteration
- Users can still specify --control-iter N for tensor collection mode
- Fixes issue where training would exit with 'Reached control_iter limit (1)'
- Remove all Chinese debug prints and comments from training code
- Clean up tensor_saver module outputs to be production-ready
- Remove verbose debug logs from training.py and pipeline schedules
- Clean up quantization operator error messages
- Ensure training output is consistent with original Megatron-LM
- All tensor saving functionality preserved but with silent error handling
- Fix missing content in except block that caused IndentationError
- Add proper error handling for distributed rank detection
- Ensure clean syntax after Chinese output cleanup
- Add --scaling-control command line argument with choices: max, max_minus_1
- Implement scaling control in _quantize_mx and _shared_exponents functions
- Support max_minus_1 strategy to use (max_value - 1) for scaling to avoid overflow
- Update MXFPMatMul and MXFPBAddBmm operators to support scaling_control parameter
- Update all MXFP operator calls in layers.py and dot_product_attention.py
- Fix gradient count for updated operators (12 for MXFPMatMul, 15 for MXFPBAddBmm)
- Update llama32-1b script to support scaling_control parameter in log naming
- Enable differentiation between different scaling strategies in checkpoint and log names
…ability

- Change default scaling_control from 'max' to 'max_minus_1' in pretrain_llama32-1b_wikipedia_FA_linear_mxfp4.sh
- This provides better numerical stability by avoiding potential overflow issues
- Users can still override by passing different scaling_control parameter
Features:
- Dynamic precision switching between quantized (fp8/fp4) and bf16 training
- Loss-based threshold triggering for precision changes
- Asynchronous checkpoint saving to minimize training interruption
- Window-based training management with configurable parameters
- Automatic recovery system with multiple checkpoint buffers

Core Components:
- AdaptiveQuantizationManager: Main coordinator for adaptive training
- Asynchronous checkpoint saving with thread-based implementation
- Integration with existing quantization operators (MXFP, HIFP, BF16)
- Modified training loop with precision switching logic

Command Line Arguments:
- --time-resume: Enable adaptive quantization training
- --quant-loss-threshold: Loss threshold for precision switching (default: 0.1)
- --quant-window-size: Training window size in iterations (default: 5)
- --quant-checkpoint-interval: Checkpoint save frequency (default: 1)
- --quant-fallback-strategy: Fallback precision strategy (bf16/fp16)
- --quant-recovery-buffer: Number of checkpoints to maintain (default: 2)
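The arguments above suggest a small configuration object plus a windowed loss check. The sketch below mirrors the listed defaults; the switching rule (fall back when the latest loss exceeds the window minimum by more than the threshold), the `bf16` fallback default, and both class names are assumptions, since the commit message does not spell out the exact trigger.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class AdaptiveQuantConfig:
    loss_threshold: float = 0.1       # --quant-loss-threshold
    window_size: int = 5              # --quant-window-size
    checkpoint_interval: int = 1      # --quant-checkpoint-interval
    fallback_strategy: str = "bf16"   # --quant-fallback-strategy (default assumed)
    recovery_buffer: int = 2          # --quant-recovery-buffer


class PrecisionSwitcher:
    """Tracks recent losses and picks a precision (hypothetical rule)."""

    def __init__(self, config):
        self.config = config
        self.losses = deque(maxlen=config.window_size)
        self.precision = "quantized"

    def observe(self, loss):
        """Record a loss and return the precision to use next."""
        self.losses.append(loss)
        if len(self.losses) == self.config.window_size:
            rise = self.losses[-1] - min(self.losses)
            if rise > self.config.loss_threshold:
                self.precision = self.config.fallback_strategy
            else:
                self.precision = "quantized"
        return self.precision
```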

Implementation Details:
- Seamless integration with existing tensor saving and scaling control
- Thread-safe asynchronous operations
- Comprehensive error handling and recovery mechanisms
- Detailed logging and monitoring capabilities
- Example script and documentation for easy adoption

Benefits:
- Automatic optimization of quantization parameters
- Improved training stability through adaptive precision
- Reduced manual tuning requirements
- Enhanced fault tolerance and recovery capabilities
- Remove unsupported 'tag' parameter from save_checkpoint() calls
- Fix load_checkpoint() calls to use correct parameter names
- Use temporary modification of args.save/args.load for unique checkpoint paths
- Ensure proper restoration of original paths after checkpoint operations

This resolves the error:
[AdaptiveQuantization] Error saving checkpoint: save_checkpoint() got an unexpected keyword argument 'tag'
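The workaround described above (temporarily changing args.save or args.load, then restoring it) fits a context manager. This is a sketch with a hypothetical helper name and a stand-in `args` object; the real checkpoint calls are only indicated by a comment.

```python
import os
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_checkpoint_path(args, attr, tag):
    """Temporarily point args.<attr> at a tag-specific subdirectory."""
    original = getattr(args, attr)
    setattr(args, attr, os.path.join(original, tag))
    try:
        yield args
    finally:
        # Restore the original path even if the checkpoint call raises.
        setattr(args, attr, original)


args = SimpleNamespace(save="/ckpt", load="/ckpt")  # stand-in for parsed args
with temporary_checkpoint_path(args, "save", "window_3"):
    path_during_save = args.save  # the save call would run here
```

Using `try`/`finally` keeps the restore step from being skipped when saving fails, which matters because later checkpoints rely on the original path.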