Skip to content

Conversation

aparajon
Copy link
Collaborator

@aparajon aparajon commented Sep 29, 2025

Description

Update the shard targeting logic to use a single connection throughout the entire lifecycle and align the logic with an internal version of this component that we use on a different service framework that's more battle tested for Vitess. This is necessary to prevent issues where connections don't properly get reset after shard targeting operations, which pollutes the underlying connection pool and can lead to queries picking up a connection that is targeting an unintended destination.

Additionally, introduce a new interface session.target(destination: Destination, block: (session: Session) -> T, which allows for more explicit routing than the current interface which only accepts Shard, as Destination can also include @primary or @replica.

Also moved over the shard targeting tests in TransacterTest into a new suite VitessShardTargetIntegrationTest for better organization. Also added new tests there as well which test the new session.target(Destination...) interface.

Lastly, removed fun <T> Transacter.failSafeRead(shard: Shard, block: (session: Session) -> T): T, which is not in use.

Testing Strategy

Existing shard targeting tests were moved to VitessShardTargetIntegrationTest

Canary tested in a service

Definition of Done

Shard targeting correctly works in Misk service consumers and connection reset issues don't happen

Rollout Strategy and Risk Mitigation

Checklist

  • I have reviewed this PR with relevant experts and/or impacted teams.
  • I have added tests to have confidence my changes work as expected.
  • I have a rollout plan that minimizes risks and includes monitoring for potential issues.

Thank you for contributing to Misk! 🎉

@aparajon aparajon force-pushed the armand/09.25.25-update-shard-targetting-logic branch from 1527d47 to 9c1d3fd Compare September 29, 2025 16:05
@aparajon aparajon marked this pull request as ready for review September 29, 2025 16:13
if (config.type.isVitess && config.database != Destination.replica().toString()) {
session.target(Destination.replica(), block)
} else {
block(session)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove this condition: && config.database != Destination.replica().toString()
If I understand session.target correctly, the current target could include a specific shard (for ex. loanstar/-80@replica), in which case we probably want to update the connection to @replica

val previous = currentTarget()
val targetHasChanged = previous != destination

Copy link
Collaborator Author

@aparajon aparajon Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment here and a test case in 76b9daf. This logic helps avoid an unnecessary target call if the connection is already pointing to @replica, e.g. from a readerTransacter.replicaRead call

The test cases in VitessShardTargetIntegrationTest prove that shard targeting still works within a replicaRead scope. Otherwise, callers can also do (pseudo-code):

 session.target(Destination(shard = shard(loanstar/-80), tabletType = TabletType.REPLICA)) {
     // query with session
 }

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the shard targeting works if you specify a shard, but what if you're assuming that a read is simply hitting @replica? Here's an example of the case I'm thinking of:

// 1. initial example from test, target a specific shard 
transacter.replicaRead { session ->
      session.target(jp.shard(session)) {
        assertThat(
          queryFactory.newQuery<MovieQuery>().allowTableScan()
            .name("Jurassic Park").uniqueResult(session)
        ).isNotNull
        assertThat(
          queryFactory.newQuery<MovieQuery>().allowTableScan()
            .name("Star Wars").uniqueResult(session)
        ).isNull()
}

// 2. subsequent ReplicaRead. Let's say the caller is assuming that since 
// shard targeting was not explicitly stated,  that the shard targeting
// will be determined in VTgate. 
transacter.replicaRead { session ->
      // not calling session.target here, and this would fail because we are not updating the destination to just
      // @replica
     assertThat(
          queryFactory.newQuery<MovieQuery>().allowTableScan()
            .name("Star Wars").uniqueResult(session)
        ).isNotNull()
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the second case should still work since session.target(Destination.replica(), block) would activate for Vitess and for the typical case of database = @primary, which should set the destination to @replica. I'll add an explicit test for this to verify

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the tests in 2ea00cf address case 2?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for all the back and forth, I'm still not sure if the tests address the case I'm thinking about, and lmk if you want to hop on a huddle if that's easier :/

My understanding is that config.database comes from the data source config, and if we are using the reader transacter, it will be set to @replica. Assume that we're only talking about the reader transacter in the above example, such that config.database != Destination.replica().toString() is true throughout.

If we happened to set shard targeting while in step 1, won't 2 enter the else block and execute with the shard set in step 1? I guess this is not the typical case, where we generally expect not to set the shard targeting in the reader, but there's nothing explicitly guarding against this right?

Copy link
Collaborator Author

@aparajon aparajon Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we tested this scenario out, this fortunately exposed out an edge case that restoration was not happening properly for shard targeting inside replicaRead, since the underlying connection would get closed before reaching the finally block in target where the restoration was supposed to happen.

This should now be fixed in 6e601c6!

@aparajon aparajon force-pushed the armand/09.25.25-update-shard-targetting-logic branch from dca139c to 76b9daf Compare September 30, 2025 00:13
@aparajon aparajon force-pushed the armand/09.25.25-update-shard-targetting-logic branch 2 times, most recently from 6e601c6 to 3adb14c Compare October 3, 2025 22:36
@aparajon aparajon force-pushed the armand/09.25.25-update-shard-targetting-logic branch from 3adb14c to 2edb43f Compare October 6, 2025 23:16
@aparajon aparajon merged commit 2cccd73 into cashapp:master Oct 7, 2025
14 checks passed
@aparajon aparajon deleted the armand/09.25.25-update-shard-targetting-logic branch October 7, 2025 15:24
ean5533 added a commit that referenced this pull request Oct 9, 2025
…tric

* origin/master: (22 commits)
  feat: add UrlMatcher to inspect if the Misk Application has a handler for a given URL (#3457)
  Fix edge-case issue with `RetryingTransacter` (#3456)
  Add http_response_header_size to WebConfig (#3453)
  add blpop to redis (#3441)
  fix post message request field name (#3450)
  Support getChatPermalink and threaded responses on slack api (#3447)
  misk-hibernate: use single connection in Vitess shard targeting and improve interface (#3409)
  [bug fix] Fix  `del` operation in FakeRedis.kt (#3449)
  fix: Use TCCL to load classes (#3448)
  adds support for server instructions (#3430)
  misk-testing: create TestFixtureBindingValidator (#3440)
  fix: flaky ExclusiveTimingInterceptorTest (#3446)
  fix: some more flakiness in SubscriptionTest (#3445)
  Bump form-data in /misk-admin/web-actions/development-proxy (#3432)
  fix: flaky SqsJobQueueTest (#3438)
  Misk servlet embeddable to an external servlet containers (#3435)
  feat: reattempt in introducing configurable http/2 control frame limiter with observability (#3422)
  fix: flaky SubscriptionTest (#3437)
  Bump tar-fs from 2.1.1 to 2.1.4 (#3431)
  Bump sha.js from 2.4.11 to 2.4.12 in /misk-admin/web/tabs/database (#3434)
  ...

# Conflicts:
#	misk-mcp/src/main/kotlin/misk/mcp/McpServerModule.kt
#	misk-mcp/src/main/kotlin/misk/mcp/MiskMcpServer.kt
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants