# Docs Update: Perf Model Add and Local Test Run Instructions #389
TreePerf is a Python SDK that works in conjunction with the Trace2Tree project. PyTorch generates a trace JSON file during profiling, which Trace2Tree parses into a tree data structure representing the hierarchical dependencies between CPU operations and GPU kernel executions. TreePerf builds on this tree structure to compute performance metrics at both the model and operation levels. It enables users to analyze, interpret, and optimize AI models by providing the detailed performance insights essential for architectural design and performance optimization.

---

## Key Ideas

1. Tree Structure for GPU Execution Analysis: The hierarchical tree structure, generated by Trace2Tree, enables straightforward computation of GPU execution times. By linking CPU operations to their corresponding GPU kernel launches, it allows seamless aggregation of kernel execution times and performance metrics at various levels of granularity.
2. Performance Metrics from JSON Parsing: Metrics such as FLOPS and FLOPS/Byte are derived by extracting operation parameters from the JSON event data structure. These parameters are parsed and fed into performance models, enabling precise computation of performance metrics.
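The first key idea can be sketched in a few lines. This is an illustrative toy, not Trace2Tree's actual implementation: the field names (`cat`, `dur`, `correlation`) follow the general Chrome-trace layout that Kineto emits, but the exact linkage fields in a real trace should be checked against Trace2Tree, and a real CPU op can launch kernels under several correlation ids.

```python
# Toy sketch: link GPU kernels back to the CPU op that launched them via a
# shared correlation id, then aggregate kernel time per CPU op.
from collections import defaultdict

def gpu_time_per_cpu_op(events):
    # Map correlation id -> launching CPU op name.
    launch_op = {e["correlation"]: e["name"]
                 for e in events if e["cat"] == "cpu_op"}
    # Sum kernel durations under the CPU op that launched them.
    totals = defaultdict(float)
    for e in events:
        if e["cat"] == "kernel":
            totals[launch_op[e["correlation"]]] += e["dur"]
    return dict(totals)

# Synthetic mini-trace: one matmul launching two kernels, plus one copy.
trace = [
    {"cat": "cpu_op", "name": "aten::mm",    "correlation": 1},
    {"cat": "kernel", "name": "gemm_k1",     "correlation": 1, "dur": 40.0},
    {"cat": "kernel", "name": "gemm_k2",     "correlation": 1, "dur": 20.0},
    {"cat": "cpu_op", "name": "aten::copy_", "correlation": 2},
    {"cat": "kernel", "name": "memcpy",      "correlation": 2, "dur": 5.0},
]
print(gpu_time_per_cpu_op(trace))  # {'aten::mm': 60.0, 'aten::copy_': 5.0}
```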

---

## Key Features

### 1. GPU timeline breakdown

Provides a high-level view of GPU activity, including busy time, idle time, communication time, etc.
Example output:

| type | time ms | percent |
|---|---|---|
| idle_time | 4.717306 | 0.072283 |
| total_time | 6526.175517 | 100.000000 |
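A timeline breakdown like the one above boils down to interval arithmetic. The sketch below is a minimal illustration (not TreePerf's actual code): merge possibly overlapping GPU activity intervals to get busy time, and take the remainder of the profiled window as idle time.

```python
# Minimal sketch: busy time = union of activity intervals; idle = the rest.
def timeline_breakdown(intervals, window):
    """intervals: (start, end) pairs of GPU activity; window: (t0, t1)."""
    busy = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:      # gap: close the current run
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                       # overlap: extend the run
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    total = window[1] - window[0]
    return {"busy_time": busy, "idle_time": total - busy, "total_time": total}

print(timeline_breakdown([(0, 4), (2, 6), (8, 9)], (0, 10)))
# {'busy_time': 7.0, 'idle_time': 3.0, 'total_time': 10}
```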

### 2. GPU compute time breakdown by CPU op

Analyze the performance breakdown by examining the lowest-level CPU operations (from the call-stack perspective) and the time they "induce" on the GPU.
Unlike traditional methods that directly look at CPU and GPU durations, this feature provides insight into how CPU operations translate into GPU kernel launches, offering a more stable and interpretable abstraction of performance.
Example output showing the top 5 ops sorted by the total GPU time each op induces:

| aten::copy_ | 4 | 69.96 | 1.11 | 95.15 |
| triton_poi_fused_add_fill_mul_sigmoid_silu_sub_0 | 8 | 43.19 | 0.69 | 95.84 |

### 3. Roofline metrics

Example output for aten::mm showing the top 5 param combos sorted by the total GPU time each param combo induces:

| name | param: M | param: N | param: K | param: bias | FLOPS/Byte_first | TFLOPS/s_mean |
|---|---|---|---|---|---|---|
| aten::mm | 73728 | 128256 | 8192 | False | 6972.01 | 628.10 |
| aten::mm | 8192 | 28672 | 73728 | False | 5864.73 | 599.95 |
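The FLOPS/Byte column above follows from the standard GEMM roofline formula. Assuming fp16 operands (2 bytes per element) and a naive memory model that reads A (M×K) and B (K×N) once and writes C (M×N) once, arithmetic intensity is 2·M·N·K / (2·(M·K + K·N + M·N)). The sketch below reproduces the first aten::mm row; it is the textbook formula, not necessarily TreePerf's exact code path.

```python
# GEMM arithmetic intensity under a naive read-A, read-B, write-C model.
def gemm_intensity(M, N, K, bytes_per_element=2):
    flops = 2 * M * N * K                              # multiply-accumulate
    moved = bytes_per_element * (M * K + K * N + M * N)  # bytes touched
    return flops / moved

ai = gemm_intensity(73728, 128256, 8192)
print(round(ai, 2))  # 6972.01, matching the first aten::mm row above
```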

- Adding new operations is simple (contributions are welcome!):
  - `perf_model.py`:
    - Parse operation shapes from the JSON trace
    - Write the performance model (FLOPS, bytes) or reuse an existing one
  - `torch_op_mapping.py`: Map the operation to the perf model here

For more details and a walkthrough, check out the `base_example.ipynb` notebook.
**Replace the profile path in `base_example.ipynb` with your profile file and get insights instantly!**

Once you are familiar with the workflows, you can directly use or modify the **generate_perf_report.py** script to quickly generate an Excel report.

## How to Add a New Perf Model

To add a new performance model (perf model) to TreePerf, follow these steps. This guide uses an attention-based perf model as an example, since these are frequently added. Although this example is for PyTorch, perf models for JAX can be added in a similar fashion.

**Most important:** For attention and similar ops, the main change is usually in the `get_param_details` method. This method extracts parameters (such as input shapes, dropout, etc.) from the trace event, and the indices/positions of these parameters often differ between ops. You typically do NOT need to overload `flops` or `bytes` unless the computation itself changes.

### 1. Implement the Perf Model Class

- Go to `perf_model.py`.
- Create a new class for your operation, inheriting from the appropriate base class (`SDPA` for attention, `GEMM` for matrix multiplication, etc.).
- Implement the required methods:
  - `get_param_details`: **This is usually the only method you need to change.** Update the indices/positions to correctly extract parameters (e.g., input shapes, dropout) from the event, based on your op's argument order in the Kineto trace.
  - `__init__`: Call `get_param_details` and store the parsed parameters.
  - `flops`, `bytes`: Usually do not need to be changed unless the computation logic is different.
  - (Optional) `flops_bwd`, `bytes_bwd`: For backward-pass metrics.

### Example: Adding a new attention-based perf model (with custom `get_param_details`)

```python
class my_attention_forward(SDPA):
    def get_param_details(self, event):
        # Example: change indices to match your op's argument order
        q_shape = event['args'][0]['shape']  # index may differ
        k_shape = event['args'][1]['shape']
        v_shape = event['args'][2]['shape']
        dropout = event['args'][5]  # index may differ
        # ... parse other params as needed
        return {'q_shape': q_shape, 'k_shape': k_shape, 'v_shape': v_shape, 'dropout': dropout}

    # flops and bytes are inherited from SDPA. Do not re-declare them as empty
    # stubs -- that would override the base-class computation. Override them
    # only if the computation itself differs from the base class.
```
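Where do the indices come from? Read the source signature of the op you are tracing and match it against the positional argument records in your trace. As a concrete illustration, Tri Dao's flash-attention forward has a signature roughly of the form `flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=None, causal=False, ...)`, so q/k/v shapes sit at positions 0-2 and `dropout_p` at position 3. The event below is synthetic and the `Input Dims` / `Concrete Inputs` layout is an assumption based on typical PyTorch profiler output; always confirm against your own Kineto trace.

```python
# Hypothetical standalone parser showing signature-position -> trace-index
# mapping for a flash-attention-style op. Not TreePerf code.
def get_param_details(event):
    dims = event["args"]["Input Dims"]          # positional input shapes
    scalars = event["args"]["Concrete Inputs"]  # positional scalar args
    return {
        "q_shape": dims[0],   # (batch, seqlen, nheads, headdim)
        "k_shape": dims[1],
        "v_shape": dims[2],
        "dropout": float(scalars[3]),  # dropout_p is the 4th positional arg
    }

# Synthetic event mimicking a traced call with q = k = v = (2, 1024, 16, 64)
# and dropout_p = 0.1.
event = {"args": {
    "Input Dims": [[2, 1024, 16, 64], [2, 1024, 16, 64], [2, 1024, 16, 64], [], []],
    "Concrete Inputs": ["", "", "", "0.1", "None"],
}}
print(get_param_details(event))
```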

### 2. Map the Operation Name to the Perf Model

- Go to `torch_op_mapping.py`.
- Add your operation name and class to `op_to_perf_model_class_map`. Note that this is the name under which the op appears in the PyTorch Kineto trace.

```python
op_to_perf_model_class_map['my_attention::forward'] = perf_model.my_attention_forward
```

### 3. Categorize the Operation (Optional)

- If your operation fits a new category, update `dict_base_class2category` and `dict_cat2names` accordingly.

### 4. Test and Validate

- Ensure your perf model parses parameters correctly and computes metrics as expected.
- Use the example notebook or scripts to verify integration.

---

## Tips

- Reuse parameter-parsing logic from similar attention models.
- Follow the structure and naming conventions used in existing models.
- If your operation is a variant (e.g., backward, varlen), inherit from the closest existing perf model and override only the necessary methods.

For more details, see the comments and examples in `perf_model.py` and `torch_op_mapping.py`.