Async multi packet fixes #3534
Conversation
@ErikEJ this might reduce memory usage for string reads. It might be worth benching the artifacts if the CI runs green.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
I've added an additional fix, which is the same as the zero-length-left-in-terminator case and which occurs on the varchar (not nvarchar) read path.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
@dotnet/sqlclientdevteam can I get a CI run on this please. I've added a new commit which forces ProcessSni to compatibility mode (and, by extension, disables async-continue mode) and adds a fix for the pending read counter imbalance that we discussed and that @rhuijben has been helping to track down today. This is a possible stable current-codebase state to evaluate.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
I've aligned the AppContext switch test with the new defaults. Can I get another run please @dotnet/sqlclientdevteam.
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
Asking for some clarity on `>> 1` versus `/ 2`.
```diff
@@ -13206,7 +13206,7 @@ bool writeDataSizeToSnapshot
 if (stateObj._longlen == 0)
 {
     Debug.Assert(stateObj._longlenleft == 0);
-    totalCharsRead = 0;
+    totalCharsRead = startOffsetByteCount >> 1;
```
Is this a division by 2 in disguise? Are you using a special property of right bit-shift that divide-by-2 doesn't have? Something else?
If the former, please use `startOffsetByteCount / 2` for clarity. If either of the latter, please document why.
No magic, just using the same idiom as the containing methods. I've changed it to use division instead of a shift.
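As an aside (not from the PR itself): in C# the two idioms only diverge for negative signed integers, and byte counts are non-negative here, so the choice is purely stylistic.

```csharp
int n = -3;
Console.WriteLine(n >> 1); // -2: arithmetic shift rounds toward negative infinity
Console.WriteLine(n / 2);  // -1: integer division truncates toward zero

int byteCount = 10;
Console.WriteLine(byteCount >> 1); // 5
Console.WriteLine(byteCount / 2);  // 5: identical for non-negative values
```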
I've also changed the multiplexer test's detection of compatibility mode to match the library, which should now skip the multiplexer tests correctly.
@Wraith2 when I run the testcase from the other issue against this branch I get a failure in DEBUG mode. In release mode the test passes.
Multitasking here; can you link me to the exact repro you're talking about?
The testcase from the other issue. I'm currently trying to get things reproduced against a Docker instance of SQL Server 2019 so we can look at the same thing (and maybe even test this on GitHub Actions, like I do in the RepoDB project).
This case rhuijben@1964bc1 fails for me on this Docker setup. Too bad it is not the error I'm seeing myself, but it is still a valid testcase; I'm trying to extend it to include my case. It fails on the first (smallest) packet size of 512.
`Debug.Assert(TdsEnums.HEADER_LEN + Packet.GetDataLengthFromHeader(buffer) == read, "partially read packets cannot be appended to the snapshot");` fires with read = 512 while 503 + 8 = 511, so there is a mismatch. It looks like the first byte of the next packet is already in the buffer here.
That assert will fire periodically when packet multiplexing is disabled; we should add the context switch to the assertion. The data might actually be correct. I saw something similar while looking at multipart XML reads with a weird packet size: if the packet status does not include the last-packet bit and the required length is less than the total packet, then as long as the transferred data amount matches the buffer size it's technically correct, I think. I'm referring to these as padded packets. I hadn't seen them before two weeks ago, but the spec doesn't preclude them. When I saw them, the remaining data space in the packet buffer was filled with FF. This is part of the reason I added the DumpPackets and DumpInBuff functions to my debug branch.
With packet size configured as 512 I see 511-byte packets (which fail these asserts), but also one really large packet (>= 60 KB), so I'm not sure the debug assert does the right thing. The demultiplexer itself looks like it handles these cases just fine. With packet code you also always have to handle short reads caused by network security layers and TCP segmentation; there are standard proxies for the latter, so you can always receive smaller (or larger) chunks from the network layer. The .NET project uses fuzzing of this kind to catch HTTP errors, as do a lot of other libraries. These asserts look like they're on the wrong layer: from the network you can receive much smaller or larger chunks than the TDS packets (smaller when processing really fast, and much larger when the network has already delivered more data than a single packet, which can also happen on slow networks when one packet is lost and re-delivered while others are already in the queue).
With the multiplexer active, every packet appended to the snapshot is required to have a correct header and be a complete packet. If you look at the locations where packets are assigned to _inBuff from the snapshot, you'll see that the status and data length are taken from the packet, which requires a correct packet header. With the multiplexer inactive (ProcessSni compatibility mode) those requirements no longer hold, so the correct thing to do is to gate the asserts on the compatibility switch:
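A minimal sketch of what that gated assert could look like; the UseCompatibilityProcessSni switch name is taken from later in this thread, and the exact condition is an assumption rather than the committed code:

```csharp
// require complete, well-formed packets only when the multiplexer is active;
// in ProcessSni compatibility mode partially read packets are legal here
Debug.Assert(
    LocalAppContextSwitches.UseCompatibilityProcessSni ||
    TdsEnums.HEADER_LEN + Packet.GetDataLengthFromHeader(buffer) == read,
    "partially read packets cannot be appended to the snapshot");
```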
In terms of packet lengths, I refer to the TDS definition of a packet, which is that all packets are of the negotiated packet size unless they are the last packet in a stream. Even though in network terms it's possible to receive any number of bytes up to the buffer size passed to the read call, we will generally see full packets unless the server is quite slow. Unfortunately the slow case is very hard to replicate reliably, so there have been holes in the code. The demultiplexer currently does not handle padded packets (physical length > logical length) correctly. I need to get some example packets with padding and then write some new test infrastructure for the multiplexer to make sure that CurrentLength and RequiredLength are used correctly.
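For context, a small sketch of how a packet's logical length is derived from the 8-byte TDS header (this reflects the TDS wire format, not code from the PR); a padded packet is one where the bytes physically received exceed that value:

```csharp
// TDS packet header: byte 0 = type, byte 1 = status, bytes 2-3 = total packet
// length in big-endian order, including the 8 header bytes
static int GetLogicalPacketLength(byte[] header)
{
    return (header[2] << 8) | header[3];
}

// a packet is "padded" when more bytes arrived than the header accounts for
static bool IsPadded(byte[] buffer, int bytesReceived)
{
    return bytesReceived > GetLogicalPacketLength(buffer);
}
```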
On https://github.com/rhuijben/SqlClient/tree/test/repro-1 (your branch + 1 commit) I now have the tests running against a Docker container with a clean SQL Server 2019 instance. (The create script is in that commit.)
(Don't look at the timings. My laptop is in silent mode, which is very slow with today's weather.)
Token packet, last packet of the stream. No idea what the content is supposed to represent. This would silently fail in release mode because all of that diagnostic output is inside an #if DEBUG block.
Not sure if I read this correctly, but it looks like:
```csharp
while (counter < 10)

SqlConnectionStringBuilder builder = new SqlConnectionStringBuilder(DataTestUtility.TCPConnectionString);
builder.PersistSecurityInfo = true;
builder.Pooling = false;
```
I had the most luck reproducing issues by using pooling and reusing a connection just 2-3 times. Disabling pooling hides some backend errors, as the connection is closed anyway.
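A minimal sketch of that repro pattern, assuming a standard connection string and a placeholder query: keep pooling enabled (the default) and cycle the same pooled connection a few times so backend state carries across iterations.

```csharp
using Microsoft.Data.SqlClient;

// each Open() after the first reuses the same physical connection from the
// pool, which is what surfaces the backend errors mentioned above
for (int i = 0; i < 3; i++)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var command = new SqlCommand("SELECT 1", connection)) // placeholder
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read()) { /* consume */ }
        }
    }
}
```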
Yes, that was my interpretation. Since I've no idea what the bytes that have been read were a part of, I have no way to tell if the extra bytes are meaningful or not; it seems likely that they should be. If you've got my additional logging enabled from the -dev branch then you could look at the output of DumpLog() and see what operation consumed those bytes.
/azp run
This PR needs a merge from main to pick up the macOS test failure fixes.
Sure, but that doesn't block review of the code changes. As I mentioned in #3534 (comment), I'm happy to rebase when there is a clear code reason to do so, but failing test infra does not affect the code changes that I've made. If the changes pass review then I'll rebase in readiness for a merge.
Just one question and one suggestion.
src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/TdsParser.cs (review thread resolved; outdated)
src/Microsoft.Data.SqlClient/src/Microsoft/Data/SqlClient/TdsParserStateObject.Multiplexer.cs (review thread resolved)
@edwardneal thanks for your work on this. Did you have a chance to run your stress tests on this + feature_disabled?
Commits:
- fix 0 length read at the start of a packet in plp stream returning 0 when continuing
- handle char array sizing better
- change existing test to use multiple packet sizes
```csharp
// so we must make sure that even though we are not making a network call that we do
// not cause an incorrect decrement which will cause disconnection from the native
// component
IncrementPendingCallbacks();
```
Note to self, readFromNetwork is always true in compat mode because partial packets are not supported.
```diff
@@ -5753,6 +5751,10 @@ private static Task<T> GetFieldValueAsyncExecute<T>(Task task, object state)
 {
     return Task.FromResult<T>(reader.GetFieldValueFromSqlBufferInternal<T>(reader._data[columnIndex], reader._metaData[columnIndex], isAsync: true));
 }
+else
+{
+    return reader.ExecuteAsyncCall(context);
```
Can you help me better understand this change? It's not clear to me which issue this addresses.
From the edit I made to the OP:
Fixes #3572.
When reading data sequentially and calling GetFieldValueAsync, it is possible for the call to TryReadColumnInternal to fail because there is not enough data. When this happens an async read has been started, so we should wait for it to return, but the code does not do this and instead falls through to attempting a call to TryReadColumnValue. Under a debugger it's clear that there is a problem because you hit an assert, but at runtime this was silently failing, which could cause data loss as an async read collided with a sync read. The fix is to prevent the fallthrough and queue the wait for the next packet.
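A self-contained sketch of the bug shape, with illustrative names rather than SqlClient internals: when the buffered read reports that an async network read is in flight, the method must return the pending task instead of falling through to a synchronous retry.

```csharp
using System.Threading.Tasks;

class FallthroughDemo
{
    readonly TaskCompletionSource<int> _pendingRead = new TaskCompletionSource<int>();

    // true if the value was already buffered; false means a network read has
    // been started and the data is not yet available
    bool TryReadBuffered(out int value)
    {
        value = 0;
        return false; // simulate "not enough data, async read in flight"
    }

    Task<int> GetValueAsync()
    {
        if (TryReadBuffered(out int value))
        {
            return Task.FromResult(value);
        }
        else
        {
            // the fix: wait on the in-flight read; falling through to another
            // synchronous read here would collide with it
            return _pendingRead.Task;
        }
    }
}
```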
Thanks @mdaigle. I've re-run the stress tests with both of those AppContext switches set, and the behaviour is consistent from end to end: #3572 is fixed in (almost) all cases, irrespective of the AppContext switch settings. I've noted one bug which appears under specific circumstances. It appears when run against M.D.S 5.1.7, 5.2.3, 6.0.2, 6.1.1 and this PR (with the UseCompatibilityProcessSni switch set to true). The circumstances in question are:
When this bug appears, disposing of the SqlDataReader throws the exception below:
I think it might be a reproduction of #1044. You're welcome to cannibalise my test harness (link is here). A few notes on it though:
Results from my dev branch:
/azp run
Azure Pipelines successfully started running 2 pipeline(s).
FYI, we're having issues with macOS tests at the moment, separate from this PR.
Fixes #3519
Fixes #3572
Fix 1:
When reading multi-packet strings it is possible for multiple such strings to occur in a single row. When reading asynchronously a snapshot is used, which contains a linked list of packets. The current codebase has logic which keeps a cleared spare linked list node when the snapshot is cleared. The logic to clear the spare packet was faulty and did not clear all the fields, leaving the data length in the node. In specific circumstances it is possible to re-use the spare linked list node containing an old data value as the first packet in a new linked list of packets. When this happens in a read which reaches the continue stage (3 or more packets) the size calculation is incorrect and various errors can occur.
The spare packet functionality is not very useful because it can store only a single node; it doesn't retain the byte[] buffer, so the memory saving is tiny. I have removed it and changed the linked list node fields to be readonly. This resolves the bug.
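A minimal sketch of the shape of the fix, using illustrative names rather than the actual SqlClient types: with the data-carrying fields readonly, a recycled node can never carry a stale data length into a new snapshot.

```csharp
// illustrative stand-in for the snapshot's linked-list node
sealed class PacketNode
{
    public readonly byte[] Buffer;
    public readonly int DataLength; // set once at construction; never stale
    public PacketNode Next;

    public PacketNode(byte[] buffer, int dataLength)
    {
        Buffer = buffer;
        DataLength = dataLength;
    }
}
```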
Fix 2:
When reading a multi-packet string, the plp chunks are read from each packet and the end is signalled by a terminator. It is possible for the data to align such that the contents of a string complete exactly at the end of a packet and the terminator is in the next packet. In this case some pre-existing logic checks for 0 chars remaining and exits early.
This logic needed to be updated so that when continuing it returns the entire existing length read and not a 0 value.
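This is the change shown in the diff earlier in the thread; a sketch of the updated early exit (the surrounding control flow is abbreviated), halving the byte count because the offset is in bytes and nvarchar chars are two bytes each:

```csharp
if (stateObj._longlen == 0)
{
    Debug.Assert(stateObj._longlenleft == 0);
    // when continuing, report the chars already consumed instead of 0;
    // startOffsetByteCount is in bytes, UTF-16 chars are two bytes each
    totalCharsRead = startOffsetByteCount / 2;
}
```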
Fix 3:
While debugging the first two issues, the buffer sizes and calculations were confusing me. I eventually realised that the code was directly using _longlenleft, which is measured in bytes, to size a char array, meaning that all char arrays were twice as long as needed. I have updated the code to handle that and use smaller, appropriately sized arrays.
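A sketch of the sizing mismatch (illustrative values, not the committed code): the remaining PLP length is a byte count, and UTF-16 data needs one char per two bytes.

```csharp
ulong bytesLeft = 1000;                             // stands in for stateObj._longlenleft (bytes)
char[] oversized  = new char[bytesLeft];            // before: twice as large as needed
char[] rightSized = new char[(bytesLeft + 1) / 2];  // after: one char per two bytes
```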
I have updated the existing test to iterate packet sizes from 512 (the minimum) to 2048 bytes. This can cause lots of interesting alignments in the data, exercising the paths through the string-reading code more effectively. The range could be increased, but the runtime needed to be low enough not to time out CI runs, and most higher packet sizes will behave like lower-sized runs due to factoring.
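A hypothetical shape for that loop (PacketSize is a real SqlConnectionStringBuilder property; connectionString and the test body are placeholders):

```csharp
using Microsoft.Data.SqlClient;

for (int packetSize = 512; packetSize <= 2048; packetSize++)
{
    var builder = new SqlConnectionStringBuilder(connectionString)
    {
        PacketSize = packetSize // shifts where strings cross packet boundaries
    };
    using (var connection = new SqlConnection(builder.ConnectionString))
    {
        connection.Open();
        // run the multi-packet string read scenario at this packet size
    }
}
```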
Thanks to @erenes and @Suchiman for their help finding the reproduction that worked on my machine; without that I would have been unable to fix anything.
@dotnet/sqlclientdevteam can I get a CI run please.
/cc @Jakimar
Edits:
Fixes #3572.
When reading data sequentially and calling GetFieldValueAsync, it is possible for the call to TryReadColumnInternal to fail because there is not enough data. When this happens an async read has been started, so we should wait for it to return, but the code does not do this and instead falls through to attempting a call to TryReadColumnValue. Under a debugger it's clear that there is a problem because you hit an assert, but at runtime this was silently failing, which could cause data loss as an async read collided with a sync read. The fix is to prevent the fallthrough and queue the wait for the next packet.