Skip to content

Conversation

kakwok
Copy link

@kakwok kakwok commented Sep 15, 2025

Rebased #23 to CMSSW_15_1_0_pre6.

PR description:

  • Implement RetryActionDiffServer for SonicTriton using TritonService’s server registry; remove per-action alternate server parameters.
  • Use TritonClient::updateServer(TritonService::Server::fallbackName) to switch servers on retry, per review guidance.
  • Extend test coverage:
    • Add Catch2 unit test HeterogeneousCore/SonicTriton/test/test_RetryActionDiffServer.cc (arms → updateServer(fallback) → no-op on second retry → exception path is caught).
    • Extend HeterogeneousCore/SonicTriton/test/tritonTest_cfg.py with --retryAction {same,diff} and a verbose confirmation line.
    • Add scripted test TestHeterogeneousCoreSonicTritonRetryActionDiff_Log to assert the selected retry policy.
  • Move all test definitions to HeterogeneousCore/SonicTriton/test/BuildFile.xml (remove package-level test entries; use proper with catch2).
  • Provide a protected default constructor for TritonClient (test double support), removing the unused testing flag constructor.
  • Aligns with the Retry framework expectations discussed in review (Test PR for new Retry Framework #19 ); no physics changes expected, functionality is exercised only when retry is configured.

PR validation:

  • Built and ran unit/integration tests in CMSSW_15_1_0_pre6 area:

    • scram b -j 8
  • TODO/To verify:

    • scram b runtests TEST=HeterogeneousCore/SonicTriton (passes)
    • Catch2 test validates action behavior;
    • cmsRun-based tests validate configuration wiring for both same and diff policies.
    • No changes to standard workflows unless Client.Retry is explicitly configured.

if (client_) {
client_->evaluate();
} else {
edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an exception rather than a LogError. (It may actually need to be a return false or similar, because the call chain is client->finish() -> action->retry() -> action->eval(), and only finish() should actually emit an exception.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment has not been addressed yet

} catch (std::exception& e) {
edm::LogError("RetryActionDiffServer") << "Failed to retry with alternative server: " << e.what();
} catch (...) {
edm::LogError("RetryActionDiffServe: rUnknownFailure") << "An unknown exception was thrown";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

@@ -0,0 +1,22 @@
#!/bin/bash

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment has not been addressed

if (client_) {
client_->evaluate();
} else {
edm::LogError("RetryActionBase") << "Client pointer is null, cannot evaluate.";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment has not been addressed yet

parser.add_argument("--device", default="auto", type=str.lower, choices=allowed_devices, help="specify device for fallback server")
parser.add_argument("--container", default="apptainer", type=str.lower, choices=allowed_containers, help="specify container for fallback server")
parser.add_argument("--tries", default=0, type=int, help="number of retries for failed request")
parser.add_argument("--retryAction", default="same", type=str, choices=["same","diff"], help="retry policy: same server or different server")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be added to getParser() in customize.py

Comment on lines +24 to +29
parser.add_argument("--noShm", default=False, action="store_true", help="disable shared memory")
parser.add_argument("--compression", default="", type=str, choices=allowed_compression, help="enable I/O compression")
parser.add_argument("--ssl", default=False, action="store_true", help="enable SSL authentication for server communication")
parser.add_argument("--device", default="auto", type=str.lower, choices=allowed_devices, help="specify device for fallback server")
parser.add_argument("--container", default="apptainer", type=str.lower, choices=allowed_containers, help="specify container for fallback server")
parser.add_argument("--tries", default=0, type=int, help="number of retries for failed request")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are redundant with customize.py and should be removed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants