- 
                Notifications
    You must be signed in to change notification settings 
- Fork 0
Feat/jax rccl analyser #390
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I'm not sure if JAXRcclAnalyser is the proper place for this code. Why not put this be under NcclAnalyser?
- Remove unnecessary files that are not meant to be included in the PR.
- rccl_complete_analysis.py should be refactored and only the essential parts should be kept.
- Time differences between ranks in different nodes might not be meaningful. These should be investigated further. Probably out of the scope of this work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this folder the proper place for Jupyter notebooks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this folder the proper place for Jupyter notebooks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clean this out if not needed.
        
          
                TraceLens/JAXRcclAnalyser/notes.txt
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this file?
        
          
                TraceLens/JAXRcclAnalyser/test_1.csv
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this file?
        
          
                TraceLens/JAXRcclAnalyser/test_2.csv
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you try to assimilate (rewrite/clean/refactor) the essential parts from this file into the current feature (multinode RCCL analysis) and place those in locations that make sense in TraceLense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably not correct to compare timestamps between different nodes / processes as they might not be synced. I think this requires further investigations. The safe option would be to remove any calculations / metrics that depend on comparing timestamps on different nodes / processes and add these in a future PR if they make sense.
Pull Request Template