-
Notifications
You must be signed in to change notification settings - Fork 1.7k
[Mellanox] Retry mechanism to EEPROM accesses #23866
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@prgeor FYI. |
| CMIS_MCI_EEPROM_OFFSET = 2 | ||
| CMIS_MCI_MASK = 0b00001100 | ||
|
|
||
| MAX_ATTEMPTS = 50 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tshalvi 50 retries for an i2c r/w seems too much. won't this mask the real hw/driver issue from being debugged? A reasonable retry number would be 2 or 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 50 retry attempts, each spaced 100 ms apart. Once one succeeds, the process stops and no redundant retries are performed- Could you please clarify what’s the concern here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tshalvi worst case retry for one i2c transaction could be 5 * 100msec = 5 secs. This retry will simply mask the actual issue. Thats why I said we should investigate any retries more than 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tshalvi What's the motivation with 50 retries? vs 100 retires vs 1000 retires? How did you arrive at the upper bound of 50?
…rted by the kernel
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
hi @prgeor , should we wait for master PR to be merged before merging the 202412 one? |
|
202412 is approved by PrinceG hence merged now: Azure/sonic-buildimage-msft#1679. |
Why I did it
Currently, if an EEPROM read or write attempt fails, it is not retried. To make EEPROM access more robust and reliable, this PR introduces a retry mechanism for both read and write operations.
Work item tracking
How I did it
Added retry attempts for EEPROM read/write operations when the failure is specifically due to an I²C error (errno EIO). In such cases, the operation is retried up to 50 times with 100 ms intervals. For any other failure type, no additional attempts are made.
Introduced corresponding NOTICE logs to indicate retry attempts and successes.
How to verify it
Manual testing.
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)