-
Notifications
You must be signed in to change notification settings - Fork 142
[build_tools] Add build time analysis tool for CI #2399
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add a new tool to analyze Ninja build times and generate HTML reports, helping identify slow-building components in the ROCm build. Changes: - Add analyze_build_times.py: parses .ninja_log and generates HTML report - Add report_build_time_template.html: HTML template for the report - Integrate into CI workflow to auto-generate reports after builds
| # ============================================================================= | ||
|
|
||
|
|
||
| def get_system_info() -> Dict[str, str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for adding this to the report.
|
If possible we also want to add "Resource utilization" per component to the same report i n future. That will help with the work we are doing with @geomin12 on dockers dying due to lack of resources. for ASAN work. |
|
@WBobby i had one more feedback pls if we can also incorporate. can we also add a total build time info at the end for components and dependencies ? thanks! |
Calculate wall clock time from ninja log and display it in the Build Information section alongside CPU, memory details. Signed-off-by: Wang, Yanyao <[email protected]>
|
Link the issue #2399 in PR a mandatory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank You for adding comments to use the script.
ScottTodd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm running ~1 week behind on reviews, sorry. Here's some post-merge feedback.
| # Analyze ninja build log to generate per-component timing report | ||
| - name: Analyze Build Times | ||
| if: ${{ !cancelled() }} | ||
| run: | | ||
| python3 build_tools/analyze_build_times.py --build-dir build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work on Windows too?
Please keep Linux and Windows workflows in sync: https://github.com/ROCm/TheRock/blob/main/.github/workflows/build_windows_artifacts.yml
| # Build Time Analysis is only generated on Linux | ||
| if PLATFORM == "linux": | ||
| analysis_url = f"{bucket_url}/logs/{artifact_group}/build_time_analysis.html" | ||
| gha_append_step_summary(f"[Build Time Analysis]({analysis_url})") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? We should have these for Windows too.
Also note that multiple workflows call this post_build_upload.py script, so this file may not exist. See this for example:
TheRock/.github/workflows/release_portable_linux_packages.yml
Lines 232 to 240 in f9e3277
| # TODO: upload logs even if the build fails (separate from artifact upload?) | |
| - name: Post Build Upload | |
| if: ${{ github.repository_owner == 'ROCm' }} | |
| run: | | |
| python3 build_tools/github_actions/post_build_upload.py \ | |
| --run-id ${{ github.run_id }} \ | |
| --artifact-group "${{ matrix.target_bundle.amdgpu_family }}" \ | |
| --build-dir ${{ env.OUTPUT_DIR }}/build \ | |
| --upload |
This should check if the file was found and uploaded, not if the current platform is Linux
| # Build name -> Display name mapping | ||
| NAME_MAPPING = { | ||
| "clr": "core-hip", | ||
| "ocl-clr": "core-ocl", | ||
| "ROCR-Runtime": "core-runtime", | ||
| "blas": "rocBLAS", | ||
| "prim": "rocPRIM", | ||
| "fft": "rocFFT", | ||
| "rand": "rocRAND", | ||
| "miopen": "MIOpen", | ||
| "hipdnn": "hipDNN", | ||
| "composable-kernel": "composable_kernel", | ||
| "support": "mxDataGenerator", | ||
| "host-suite-sparse": "SuiteSparse", | ||
| "rocwmma": "rocWMMA", | ||
| "miopen-plugin": "miopen_plugin", | ||
| } | ||
|
|
||
| # Top-level directories for ROCm components | ||
| ROCM_COMPONENT_DIRS = { | ||
| "base", | ||
| "compiler", | ||
| "core", | ||
| "comm-libs", | ||
| "dctools", | ||
| "profiler", | ||
| "ml-libs", | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need these mappings in this file? This may be difficult to maintain in the future as projects are moved around (https://github.com/ROCm/TheRock/blob/main/docs/rfcs/RFC0003-Build-Tree-Normalization.md)
| # Get CPU model from /proc/cpuinfo | ||
| try: | ||
| with open("/proc/cpuinfo", "r") as f: | ||
| for line in f: | ||
| if line.startswith("model name"): | ||
| info["cpu_model"] = line.split(":")[1].strip() | ||
| break | ||
| except (FileNotFoundError, IOError): | ||
| pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely Linux-specific.
We have some code for this that is more portable here:
TheRock/build_tools/hack/env_check/device.py
Lines 159 to 199 in f9e3277
| def device_cpu_status(self): | |
| """ | |
| **Warning:** This function may broken in Cluster systems. | |
| Return CPU status, include its name, architecture, total cpu count. | |
| -> `(CPU_NAME, CPU_ARCH, CPU_CORES)` | |
| """ | |
| import os, platform, subprocess, re | |
| if self.is_windows: | |
| _cpu_name = get_regedit( | |
| "HKLM", | |
| r"HARDWARE\DESCRIPTION\System\CentralProcessor\0", | |
| "ProcessorNameString", | |
| ) | |
| _cpu_arch = platform.machine() | |
| _cpu_core = os.cpu_count() | |
| return (_cpu_name, _cpu_core, _cpu_arch) | |
| elif self.is_linux: | |
| _cpu_name = ( | |
| re.search( | |
| r"^\s*Model name:\s*(.+)$", | |
| subprocess.run( | |
| ["lscpu"], capture_output=True, text=True, check=True | |
| ).stdout, | |
| re.MULTILINE, | |
| ) | |
| .group(1) | |
| .strip() | |
| ) | |
| _cpu_arch = platform.machine() | |
| _cpu_core = os.cpu_count() | |
| return (_cpu_name, _cpu_core, _cpu_arch) | |
| else: | |
| # <ADD BSD/Intel_MAC ???> | |
| ... | |
| pass |
Can we just run that env_check and pull that information into these analysis reports instead of duplicating code? Or we can pull those functions into a common module used for both scripts.
| def build_table_rows( | ||
| data: Dict[str, Dict[str, int]], phase_columns: List[str] | ||
| ) -> List[tuple]: | ||
| """Build sorted table rows from project phase data.""" | ||
| rows = [] | ||
| for name, phases in data.items(): | ||
| total = sum(phases.values()) | ||
| cols = [format_duration(phases.get(p, 0)) for p in phase_columns] | ||
| rows.append((name, cols, format_duration(total), total)) | ||
| rows.sort(key=lambda x: x[3], reverse=True) | ||
| return [(r[0], r[1], r[2]) for r in rows] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to be able to sort https://therock-ci-artifacts.s3.amazonaws.com/19685707430-linux/logs/gfx94X-dcgpu/build_time_analysis.html based on project name too. Let's add "make the table dynamic" as a feature request for later
| def build_table_rows( | ||
| data: Dict[str, Dict[str, int]], phase_columns: List[str] | ||
| ) -> List[tuple]: | ||
| """Build sorted table rows from project phase data.""" | ||
| rows = [] | ||
| for name, phases in data.items(): | ||
| total = sum(phases.values()) | ||
| cols = [format_duration(phases.get(p, 0)) for p in phase_columns] | ||
| rows.append((name, cols, format_duration(total), total)) | ||
| rows.sort(key=lambda x: x[3], reverse=True) | ||
| return [(r[0], r[1], r[2]) for r in rows] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More feature requests:
- Link to ninja logs from this report
- Link to https://ui.perfetto.dev/ from this report
- Embed https://ui.perfetto.dev/ into this report directly
Motivation
Add a new tool to analyze Ninja build times and generate HTML reports, helping identify slow-building components in the ROCm build.
Issue Link: #2214
Technical Details
Changes:
Test Plan
Initiate Continuous Integration (CI) to validate the code.
Execute the test script in the local environment.
Test Result
https://therock-ci-artifacts.s3.amazonaws.com/19685707430-linux/logs/gfx94X-dcgpu/build_time_analysis.html
The script successfully generates the HTML file.
Submission Checklist