# Docs Update: Perf Model Add and Local Test Run Instructions #389
TreePerf is a Python SDK that works in conjunction with the Trace2Tree project. PyTorch generates a trace JSON file during profiling, which Trace2Tree parses into a tree data structure representing the hierarchical dependencies between CPU operations and GPU kernel executions. TreePerf builds on this tree structure to compute performance metrics at both the model and operation levels. It enables users to analyze, interpret, and optimize AI models by providing the detailed performance insights essential for architectural design and performance optimization.

---

## Key Ideas

1. Tree Structure for GPU Execution Analysis: The hierarchical tree structure, generated by Trace2Tree, enables straightforward computation of GPU execution times. By linking CPU operations to their corresponding GPU kernel launches, it allows seamless aggregation of kernel execution times and performance metrics at various levels of granularity.
2. Performance Metrics from JSON Parsing: Metrics such as FLOPS and FLOPS/Byte are derived by extracting operation parameters from the JSON event data structure. These parameters are parsed and fed into performance models, enabling precise computation of performance metrics.
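The first key idea can be sketched in a few lines. This is an illustrative toy, not Trace2Tree's actual implementation: the field names (`cat`, `dur`, `correlation`) follow the general Chrome-trace layout that Kineto emits, but the exact linkage fields in a real trace should be checked against Trace2Tree, and a real CPU op can launch kernels under several correlation ids.

```python
# Toy sketch: link GPU kernels back to the CPU op that launched them via a
# shared correlation id, then aggregate kernel time per CPU op.
from collections import defaultdict

def gpu_time_per_cpu_op(events):
    # Map correlation id -> launching CPU op name.
    launch_op = {e["correlation"]: e["name"]
                 for e in events if e["cat"] == "cpu_op"}
    # Sum kernel durations under the CPU op that launched them.
    totals = defaultdict(float)
    for e in events:
        if e["cat"] == "kernel":
            totals[launch_op[e["correlation"]]] += e["dur"]
    return dict(totals)

# Synthetic mini-trace: one matmul launching two kernels, plus one copy.
trace = [
    {"cat": "cpu_op", "name": "aten::mm",    "correlation": 1},
    {"cat": "kernel", "name": "gemm_k1",     "correlation": 1, "dur": 40.0},
    {"cat": "kernel", "name": "gemm_k2",     "correlation": 1, "dur": 20.0},
    {"cat": "cpu_op", "name": "aten::copy_", "correlation": 2},
    {"cat": "kernel", "name": "memcpy",      "correlation": 2, "dur": 5.0},
]
print(gpu_time_per_cpu_op(trace))  # {'aten::mm': 60.0, 'aten::copy_': 5.0}
```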

---

## Key Features

### 1. GPU timeline breakdown

Provides a high-level view of GPU activity, including busy time, idle time, communication time, etc.
Example output:

| type | time ms | percent |
|---|---|---|
| idle_time | 4.717306 | 0.072283 |
| total_time | 6526.175517 | 100.000000 |
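A timeline breakdown like the one above boils down to interval arithmetic. The sketch below is a minimal illustration (not TreePerf's actual code): merge possibly overlapping GPU activity intervals to get busy time, and take the remainder of the profiled window as idle time.

```python
# Minimal sketch: busy time = union of activity intervals; idle = the rest.
def timeline_breakdown(intervals, window):
    """intervals: (start, end) pairs of GPU activity; window: (t0, t1)."""
    busy = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:      # gap: close the current run
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                       # overlap: extend the run
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    total = window[1] - window[0]
    return {"busy_time": busy, "idle_time": total - busy, "total_time": total}

print(timeline_breakdown([(0, 4), (2, 6), (8, 9)], (0, 10)))
# {'busy_time': 7.0, 'idle_time': 3.0, 'total_time': 10}
```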

### 2. GPU compute time breakdown by CPU op

Analyze the performance breakdown by examining the lowest-level CPU operations (from the call-stack perspective) and the time they "induce" on the GPU.
Unlike traditional methods that directly look at CPU and GPU durations, this feature provides insight into how CPU operations translate into GPU kernel launches, offering a more stable and interpretable abstraction of performance.
Example output showing the top 5 ops sorted by the total GPU time each op induces:

| aten::copy_ | 4 | 69.96 | 1.11 | 95.15 |
| triton_poi_fused_add_fill_mul_sigmoid_silu_sub_0 | 8 | 43.19 | 0.69 | 95.84 |

### 3. Roofline metrics

Example output for aten::mm showing the top 5 param combos sorted by the total GPU time each param combo induces:

| name | param: M | param: N | param: K | param: bias | FLOPS/Byte_first | TFLOPS/s_mean |
|---|---|---|---|---|---|---|
| aten::mm | 73728 | 128256 | 8192 | False | 6972.01 | 628.10 |
| aten::mm | 8192 | 28672 | 73728 | False | 5864.73 | 599.95 |
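The FLOPS/Byte column above follows from the standard GEMM roofline formula. Assuming fp16 operands (2 bytes per element) and a naive memory model that reads A (M×K) and B (K×N) once and writes C (M×N) once, arithmetic intensity is 2·M·N·K / (2·(M·K + K·N + M·N)). The sketch below reproduces the first aten::mm row; it is the textbook formula, not necessarily TreePerf's exact code path.

```python
# GEMM arithmetic intensity under a naive read-A, read-B, write-C model.
def gemm_intensity(M, N, K, bytes_per_element=2):
    flops = 2 * M * N * K                              # multiply-accumulate
    moved = bytes_per_element * (M * K + K * N + M * N)  # bytes touched
    return flops / moved

ai = gemm_intensity(73728, 128256, 8192)
print(round(ai, 2))  # 6972.01, matching the first aten::mm row above
```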

- Adding new operations is simple (contributions are welcome!):
  - `perf_model.py`:
    - Parse operation shapes from the JSON trace
    - Write the performance model (FLOPS, bytes) or reuse an existing one
  - `torch_op_mapping.py`: Map the operation to the perf model here

For more details and a walkthrough, check out the `base_example.ipynb` notebook.
**Replace the profile path in `base_example.ipynb` with your profile file and get insights instantly!**

Once you are familiar with the workflows, you can directly use or modify the **generate_perf_report.py** script to quickly generate an Excel report.

## How to Add a New Perf Model

To add a new performance model (perf model) to TreePerf, follow these steps. This guide uses an attention-based perf model as an example, since these are frequently added. Although this example is for PyTorch, perf models for JAX can be added in a similar fashion.

**Most important:** For attention and similar ops, the main change is usually in the `get_param_details` method. This method extracts parameters (such as input shapes, dropout, etc.) from the trace event, and the indices/positions of these parameters often differ between ops. You typically do NOT need to overload `flops` or `bytes` unless the computation itself changes.

### 1. Implement the Perf Model Class

- Go to `perf_model.py`.
- Create a new class for your operation, inheriting from the appropriate base class (`SDPA` for attention, `GEMM` for matrix multiplication, etc.).
- Implement the required methods:
  - `get_param_details`: **This is usually the only method you need to change.** Update the indices/positions to correctly extract parameters (e.g., input shapes, dropout) from the event, based on your op's argument order in the Kineto trace.
  - `__init__`: Call `get_param_details` and store the parsed parameters.
  - `flops`, `bytes`: Usually do not need to be changed unless the computation logic is different.
  - (Optional) `flops_bwd`, `bytes_bwd`: For backward-pass metrics.

### Example: Adding a new attention-based perf model (with custom `get_param_details`)

```python
class my_attention_forward(SDPA):
    def get_param_details(self, event):
        # Example: change indices to match your op's argument order
        q_shape = event['args'][0]['shape']  # index may differ
        k_shape = event['args'][1]['shape']
        v_shape = event['args'][2]['shape']
        dropout = event['args'][5]  # index may differ
        # ... parse other params as needed
        return {'q_shape': q_shape, 'k_shape': k_shape, 'v_shape': v_shape, 'dropout': dropout}

    # flops and bytes are inherited from SDPA. Do not re-declare them as empty
    # stubs -- that would override the base-class computation. Override them
    # only if the computation itself differs from the base class.
```
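Where do the indices come from? Read the source signature of the op you are tracing and match it against the positional argument records in your trace. As a concrete illustration, Tri Dao's flash-attention forward has a signature roughly of the form `flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=False, ...)`, so q/k/v shapes sit at positions 0-2 and `dropout_p` at position 3. The event below is synthetic and the `Input Dims` / `Concrete Inputs` layout is an assumption based on typical PyTorch profiler output; always confirm against your own Kineto trace.

```python
# Hypothetical standalone parser showing signature-position -> trace-index
# mapping for a flash-attention-style op. Not TreePerf code.
def get_param_details(event):
    dims = event["args"]["Input Dims"]          # positional input shapes
    scalars = event["args"]["Concrete Inputs"]  # positional scalar args
    return {
        "q_shape": dims[0],   # (batch, seqlen, nheads, headdim)
        "k_shape": dims[1],
        "v_shape": dims[2],
        "dropout": float(scalars[3]),  # dropout_p is the 4th positional arg
    }

# Synthetic event mimicking a traced call with q = k = v = (2, 1024, 16, 64)
# and dropout_p = 0.1.
event = {"args": {
    "Input Dims": [[2, 1024, 16, 64], [2, 1024, 16, 64], [2, 1024, 16, 64], [], []],
    "Concrete Inputs": ["", "", "", "0.1", "None"],
}}
print(get_param_details(event))
```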

### 2. Map the Operation Name to the Perf Model

- Go to `torch_op_mapping.py`.
- Add your operation name and class to `op_to_perf_model_class_map`. Note that this is the name under which the op appears in the PyTorch Kineto trace.

```python
op_to_perf_model_class_map['my_attention::forward'] = perf_model.my_attention_forward
```

### 3. Categorize the Operation (Optional)

- If your operation fits a new category, update `dict_base_class2category` and `dict_cat2names` accordingly.

### 4. Test and Validate

- Ensure your perf model parses parameters correctly and computes metrics as expected.
- Use the example notebook or scripts to verify integration.

---

## Tips

- Reuse parameter-parsing logic from similar attention models.
- Follow the structure and naming conventions used in existing models.
- If your operation is a variant (e.g., backward, varlen), inherit from the closest existing perf model and override only the necessary methods.

For more details, see the comments and examples in `perf_model.py` and `torch_op_mapping.py`.