-
Notifications
You must be signed in to change notification settings - Fork 96
[Mellanox] Retry mechanism to EEPROM accesses #1679
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…rted by the kernel
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@microsoft-github-policy-service agree company="NVIDIA" |
| except (OSError, IOError) as e: | ||
| return False, ctypes.get_errno() | ||
|
|
||
| logger.log_debug( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tshalvi can we move this debug log to the beginning so that we do'nt miss the log if there is error a line 674?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’d prefer to keep the debug log after the write, since the goal is to record the actual completed write.
Moving it before the write would log an attempted write, even if it later fails, which could be misleading.
| if utils.read_int_from_file(presence_sysfs) != 1: | ||
| return False | ||
| eeprom_raw = self._read_eeprom(0, 1, log_on_error=False) | ||
| eeprom_raw = self.read_eeprom(0, 1, log_on_error=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tshalvi why do we need to mix hardware presence (which is what get_presence() is meant to return) with i2c read ? What is the motivation behind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation is that get_presence() returning True (presence = 1) only indicates that the hardware module is physically detected. It doesn’t necessarily mean that the module’s EEPROM is ready for access. The expectation is to report present=True only when both the hardware is connected and the EEPROM is ready and accessible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@prgeor , there are other threads in xcvrd and other process such as thermalctld which calls get_presence before querying EEPROM. So, if we just check the presence sysfs without checking the eeprom readness, it will cause error in other threads / processes/
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Why I did it
Currently, if an EEPROM read or write attempt fails, it is not retried. To make EEPROM access more robust and reliable, this PR introduces a retry mechanism for both read and write operations.
Work item tracking
How I did it
Added retry attempts for EEPROM read/write operations when the failure is specifically due to an I²C error (errno EIO). In such cases, the operation is retried up to 5 times with 100 ms intervals. For any other failure type, no additional attempts are made.
Introduced corresponding NOTICE logs to indicate retry attempts and successes.
How to verify it
Manual testing.
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)