fix: AsyncAbstractResponse might lose part of send buffer #316
Conversation
BTW, isn't that the same problem as #242?
It looks like the same indeed! We can ask the user to test with this fix...
Force-pushed from 2387abc to 226c4a5
src/WebResponses.cpp (outdated):

```cpp
  --_in_flight_credit; // take a credit
#endif
  request->client()->send();
  _send_buffer.erase(_send_buffer.begin(), _send_buffer.begin() + written);
```
@vortigont : could this call be expensive ?
I'm not sure; depending on the implementation it could be compiler-optimized to something like memmove, but I'm not sure how this is done in Espressif's toolchain. Actually I do not expect this part to run frequently under normal conditions; the buffer should be aligned with the available space.
Another option could be to add an extra member variable and do index-offset calculations instead of erasing.
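For illustration, a minimal sketch of that index-offset idea (hypothetical names, not code from this PR): keep the already-sent prefix in place, remember how far the socket got, and only reset the vector once it has been fully drained.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of the index-offset alternative to vector::erase().
struct OffsetSendBuffer {
  std::vector<uint8_t> data;
  size_t offset = 0;  // bytes already accepted by the socket

  const uint8_t* pending() const { return data.data() + offset; }
  size_t pending_len() const { return data.size() - offset; }

  // 'accepted' is what AsyncClient::add() reported for this round.
  void consume(size_t accepted) {
    offset += accepted;
    if (offset == data.size()) {  // fully drained: cheap reset, no memmove
      data.clear();
      offset = 0;
    }
  }
};
```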
Actually I do not like this buffer approach at all: it's too heavy in general to create a buffer matching the window size, then copy data there, then copy from the buffer to TCP's pcbs. Arduino's default window is 5.7k, but with a custom-built lwIP it could become a memory hog. We should think of something else: a small fixed-size circular buffer maybe, or another type of handler for objects that could avoid copying.
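For reference, a minimal sketch of what such a small fixed-size circular buffer could look like (illustrative names only, not a concrete proposal for the library):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical fixed-size circular byte buffer.
class ByteRing {
public:
  explicit ByteRing(size_t capacity) : _buf(new uint8_t[capacity]), _cap(capacity) {}

  size_t size() const { return _size; }
  size_t free_space() const { return _cap - _size; }

  // Copy up to 'len' bytes in; returns how many were accepted.
  size_t push(const uint8_t* data, size_t len) {
    len = std::min(len, free_space());
    for (size_t i = 0; i < len; ++i)
      _buf[(_head + _size + i) % _cap] = data[i];
    _size += len;
    return len;
  }

  // Copy up to 'len' bytes out (e.g. straight into the TCP send buffer); returns how many were produced.
  size_t pop(uint8_t* out, size_t len) {
    len = std::min(len, _size);
    for (size_t i = 0; i < len; ++i)
      out[i] = _buf[(_head + i) % _cap];
    _head = (_head + len) % _cap;
    _size -= len;
    return len;
  }

private:
  std::unique_ptr<uint8_t[]> _buf;
  size_t _cap;
  size_t _head = 0;
  size_t _size = 0;
};
```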
I agree. That's what I saw also: adding a vector field to handle this situation. I was wondering if the same thing could be done without it. I was also going to propose a circular buffer, because it cannot be bigger than the pcb space anyway, right?
I don't think there's an alternative solution at the architectural level -- the interface of AsyncAbstractResponse requires that it consume bytes from the implementation only once; and we can't know for sure how many bytes the socket will accept until we send it some; so to be correct, AsyncAbstractResponse is going to have to cache the bytes it couldn't send. Since the API requires it to have a temporary buffer anyways, "just keep the buffer until we've sent it all" is the least bad solution.
Performance-wise, std::vector<> does hurt a bit though: it both (a) insists on zeroing the memory, and (b) doesn't have easy-to-use release/reallocate semantics. I tried using a default_init_allocator<> to speed it up, but it didn't help much. Ultimately, in the solution I put together for the fork I've been maintaining for WLED, I ended up making a more explicit buffer data structure. I also wound up doing some gymnastics to avoid allocating so much memory that LwIP couldn't allocate a packet buffer.
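For readers unfamiliar with the trick, here is a sketch of the default_init_allocator idea mentioned above: an allocator adaptor that default-initializes instead of value-initializing, so vector::resize() stops zeroing new bytes. This is the generic idiom, not the exact code from the fork.

```cpp
#include <cstdint>
#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Generic "default-init" allocator adaptor: resize() no longer zeroes new elements.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
  using a_t = std::allocator_traits<A>;

public:
  using A::A;

  template <typename U>
  struct rebind {
    using other = default_init_allocator<U, typename a_t::template rebind_alloc<U>>;
  };

  template <typename U>
  void construct(U* ptr) noexcept(std::is_nothrow_default_constructible<U>::value) {
    ::new (static_cast<void*>(ptr)) U;  // default-init: uint8_t stays uninitialized
  }

  template <typename U, typename... Args>
  void construct(U* ptr, Args&&... args) {
    a_t::construct(static_cast<A&>(*this), ptr, std::forward<Args>(args)...);
  }
};

// A vector of bytes whose resize() skips the memset.
using raw_buffer = std::vector<uint8_t, default_init_allocator<uint8_t>>;
```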
See my attempt at a whole solution here: https://github.com/Aircoookie/ESPAsyncWebServer/blob/39b830c852054444ea12a8b5d7fcb4fa004a89d7/src/WebResponses.cpp#L317
Sorry I'm a bit backlogged on pulling this out and pushing it forward...
Some design notes:
- I opted to release the assembly buffer as soon as the data was sent, and reallocate every `_ack`; this keeps the "static" memory usage down and lets it better multiplex between many connections when under memory pressure.
- If I was doing it again, I'd give serious thought to capping the buffer at `TCP_MSS` and looping over `_fillBufferAndProcessTemplates` -> `client()->write()` (a rough sketch of that loop follows below). The upside is a guarantee that it'd never be buffering more than one packet; the downside is that it would make ArduinoJson very sad...
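A rough sketch of that cap-and-loop idea, assuming the existing internals (`_fillBufferAndProcessTemplates`, `request->client()`) keep their current meaning; the fill step is passed in as a callback to keep the example self-contained, so this is illustrative only:

```cpp
#include <AsyncTCP.h>
#include <algorithm>
#include <cstdint>
#include <functional>
#include <lwip/opt.h>  // for TCP_MSS

// Illustrative only: 'fill' stands in for _fillBufferAndProcessTemplates(),
// 'client' for request->client(). Not the fork's actual code.
size_t send_capped(AsyncClient* client, const std::function<size_t(uint8_t*, size_t)>& fill) {
  uint8_t chunk[TCP_MSS];  // never assemble more than one packet's worth of data
  size_t total = 0;
  while (client->canSend() && client->space() > 0) {
    size_t want = std::min(static_cast<size_t>(client->space()), sizeof(chunk));
    size_t filled = fill(chunk, want);
    if (filled == 0)
      break;  // nothing left to send
    size_t written = client->add(reinterpret_cast<const char*>(chunk), filled);
    total += written;
    if (written < filled)
      break;  // the socket took less than we produced: the remainder would still need
              // to be retained somewhere, which is exactly what this PR is about
  }
  client->send();
  return total;
}
```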
@willmmiles : I understand a bit more. I was wondering why we needed to add a buffer instead of just using indexes, but the fact is that this implementation, being in the abstract class like you say, has to work both with and without a content buffer. Thanks!
|
@vortigont : FYI, I opened PR #317 to add an example about large responses, which the project did not have. I was hoping to get the opportunity to reproduce these 2 issues, but no, everything goes fine. Questions:
Thanks!
I have verified that the bug has been fixed.
I do not understand the code, but I pointed out some typos in the changes.
thanks! I'll fix typos :)
Force-pushed from 92da8c7 to 0f6f725
@vortigont @willmmiles : I did some testing of this PR compared to main. I am using the new example merged yesterday:

❯ for i in {1..20}; do ( curl -s http://192.168.4.1/2 | wc -c ) & done;

main: => OK: everything works fine and I receive all 20x 16000 characters.

this PR: => CRASH at `_send_buffer.resize(std::min(space, _contentLength - _sentLength));`

So I tried to reduce the concurrency to 16 (the lwIP limit on connections):

❯ for i in {1..16}; do ( curl -s http://192.168.4.1/2 | wc -c ) & done;

And I am not able to reproduce anymore, except if I keep going and going. But curl + bash like that do not do as good a job as autocannon... So I am spawning it: 32 requests, 16 threads and 16 concurrent connections (so the lwIP limit):

❯ autocannon -w 16 -c 16 -a 32 http://192.168.4.1/2

=> CRASH

Strangely, the crash also happens when using a lower number of connections:

❯ autocannon -w 16 -c 5 -a 32 http://192.168.4.1/2

So as long as the threads are correctly aligned and the requests are executed pretty much at the same time, the buffer allocations / resizes are also done pretty much at the same time, I think. That explains why it is easier to reproduce with autocannon than with curl. So that's not good, because it kills the concurrency level of the library.

@willmmiles : how did you solve that in your fork? You might have the same issue as well if you are buffering? Is this what you are solving thanks to your
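For context, a hedged sketch of the kind of heap-aware cap that could avoid this concurrent-allocation crash (names and margin are illustrative, not willmmiles' actual approach):

```cpp
#include <esp_heap_caps.h>
#include <algorithm>
#include <cstddef>

// Hypothetical helper: cap a response's assembly-buffer allocation by what the heap
// can actually provide, keeping some headroom so lwIP can still allocate its pbufs.
static size_t capped_buffer_size(size_t wanted) {
  const size_t kLwipHeadroom = 8 * 1024;  // arbitrary safety margin, tune per target
  size_t largest = heap_caps_get_largest_free_block(MALLOC_CAP_8BIT);
  if (largest <= kLwipHeadroom)
    return 0;  // caller could skip this ack and retry later instead of crashing
  return std::min(wanted, largest - kLwipHeadroom);
}
```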
@vortigont : follow-up from #317 (comment). I just pushed the MRE in the PR and tested it: 51f4472 => 16000 OK. Console:

In the main branch, I receive only 15572 bytes indeed:

❯ curl http://192.168.4.1/3 | grep -o '.' | sort | uniq -c
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 15572    0 15572    0     0  18329      0 --:--:-- --:--:-- --:--:-- 18320
   5332 A
   4308 B
   5760 C
    172 D

=> 15572. Console:

So we have a MRE in the project showing that this is fixed 👍 The only issue now remaining is to fix the crash with concurrent requests...
That is... unexpected 8-0, not that it crashes on alloc but that it does not drop reqs on the main branch. Here are my results and those are interesting. Yes, this PR version crashes at high concurrency levels, but somehow it is much faster when not crashing.

== this PR ==

== main ==

So it is more than 2.5 times faster, do not ask me how :) that crappy ... I'll think about replacing this
yeah, I'm using ... Autocannon does not give me any readable stats at all, it just runs the requests then quits with zeroes in the output. It works, but is quite useless for any analysis.
@vortigont just to clarify: examples/LargeResponse is what I used for my testing:

In main: the /2 handler works (16000 bytes). I did not use the /3 handler for my testing: it just acts as a MRE to reproduce the issue. Using ... But in both cases, it is random. I have to restart ab several times to trigger it, while I manage to reproduce it more easily with autocannon.

examples/PerfTests - For request serving: this is insanely fast! We were barely reaching 13 req/s before on average! This is more like 5-6 times faster!

❯ autocannon -c 16 -w 16 -d 20 --renderStatusCodes http://192.168.4.1
Running 20s test @ http://192.168.4.1
16 connections
16 workers
\
┌─────────┬────────┬─────────┬──────────┬──────────┬────────────┬────────────┬──────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼─────────┼──────────┼──────────┼────────────┼────────────┼──────────┤
│ Latency │ 206 ms │ 4246 ms │ 11578 ms │ 12129 ms │ 4749.38 ms │ 3246.55 ms │ 14444 ms │
└─────────┴────────┴─────────┴──────────┴──────────┴────────────┴────────────┴──────────┘
┌───────────┬────────┬────────┬────────┬────────┬────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Req/Sec │ 44 │ 44 │ 69 │ 80 │ 68.16 │ 7.75 │ 44 │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Bytes/Sec │ 193 kB │ 193 kB │ 302 kB │ 350 kB │ 298 kB │ 33.9 kB │ 193 kB │
└───────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┘
┌──────┬───────┐
│ Code │ Count │
├──────┼───────┤
│ 200 │ 1363 │
└──────┴───────┘
Req/Bytes counts sampled once per second.
# of samples: 320
3k requests in 20.03s, 5.97 MB read
200 errors (0 timeouts)
Yes, it is bad at interpreting the response, I guess because of the way the response is crafted with this subclass. But anyway this is not important. What's important is a tool that triggers concurrent requests with workers/threads.
it is important to understand how fast it works, otherwise I would not have noticed :)
The request ptr is not passed to the public api: One of the useful things is:
I might not understand your question?
yeah, sorry, I mean here
3rd arg was |
I see! Use cases:
This makes me think that I forgot to add it in AsyncWebSocketMessageHandler. The onConnect callback should be:
yeah, you're right, that makes sense to get extensions from the headers.
This is only for connect, yes... After that, the arg is cast to an AwsFrameInfo for data events
should be good this way :)
I will have a look, test, and also update the callback to add the request: this class was added recently by me as a way to simplify WS usage, so it is not used a lot and this is an acceptable API break I think, provided we can bump the next version to 3.9.0 considering all the things that will be released.
As of commit 55b984b, this PR causes a regression in template processing.

sunny@sunnyB:~/Downloads$ wget http://192.168.5.85/dynamic.html
--2025-11-02 08:12:32-- http://192.168.5.85/dynamic.html
Connecting to 192.168.5.85:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘dynamic.html’
dynamic.html [ <=> ] 77 --.-KB/s in 0s
2025-11-02 08:12:33 (5.28 MB/s) - ‘dynamic.html’ saved [77]
sunny@sunnyB:~/Downloads$ hexdump -C dynamic.html
00000000 00 3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e |.<!DOCTYPE html>|
00000010 0a 3c 68 74 6d 6c 3e 0a 3c 62 6f 64 79 3e 0a 20 |.<html>.<body>. |
00000020 20 20 20 3c 68 31 3e 48 65 6c 6c 6f 2c 20 42 6f | <h1>Hello, Bo|
00000030 62 20 31 31 36 35 32 3c 2f 68 31 3e 0a 3c 2f 62 |b 11652</h1>.</b|
00000040 6f 64 79 3e 0a 3c 2f 68 74 6d 6c 3e 0a |ody>.</html>.|
0000004d
Thanks a lot for participating in this PR's testing!
good catch @yoursunny, |
You mean it was more impactful than initially thought 😀 I will take time this morning to test the other parts again. So far we have kept discovering bad side effects / crashes, so it's worth making sure that this (complex) fix is well tested and does not crash or impact other features.
I mean I could have just fixed the problem without further enhancements and regressions :)))
I don't understand either why your change impacts templating and not other endpoints. That's what I will try to check. |
Force-pushed from 55b984b to 46e984c
@vortigont : |
@vortigont : test results: ❯ autocannon -c 16 -w 16 -d 20 --renderStatusCodes http://192.168.4.1/
Running 20s test @ http://192.168.4.1/
16 connections
16 workers
/
┌─────────┬────────┬─────────┬──────────┬──────────┬────────────┬────────────┬──────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼────────┼─────────┼──────────┼──────────┼────────────┼────────────┼──────────┤
│ Latency │ 220 ms │ 4809 ms │ 11013 ms │ 11514 ms │ 4935.47 ms │ 3134.86 ms │ 12016 ms │
└─────────┴────────┴─────────┴──────────┴──────────┴────────────┴────────────┴──────────┘
┌───────────┬────────┬────────┬────────┬────────┬────────┬───────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼────────┼───────┼────────┤
│ Req/Sec │ 56 │ 56 │ 68 │ 79 │ 67.66 │ 5.48 │ 56 │
├───────────┼────────┼────────┼────────┼────────┼────────┼───────┼────────┤
│ Bytes/Sec │ 245 kB │ 245 kB │ 298 kB │ 346 kB │ 296 kB │ 24 kB │ 245 kB │
└───────────┴────────┴────────┴────────┴────────┴────────┴───────┴────────┘
┌──────┬───────┐
│ Code │ Count │
├──────┼───────┤
│ 200 │ 1353 │
└──────┴───────┘
Req/Bytes counts sampled once per second.
# of samples: 320
3k requests in 20.14s, 5.92 MB read
180 errors (0 timeouts)

❯ ab -c 16 -t 20 http://192.168.4.1/
This is ApacheBench, Version 2.3 <$Revision: 1913912 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 192.168.4.1 (be patient)
Finished 1377 requests
Server Software:
Server Hostname: 192.168.4.1
Server Port: 80
Document Path: /
Document Length: 4272 bytes
Concurrency Level: 16
Time taken for tests: 20.001 seconds
Complete requests: 1377
Failed requests: 0
Total transferred: 6003946 bytes
HTML transferred: 5886816 bytes
Requests per second: 68.85 [#/sec] (mean)
Time per request: 232.398 [ms] (mean)
Time per request: 14.525 [ms] (mean, across all concurrent requests)
Transfer rate: 293.15 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 9 145 429.6 28 5073
Processing: 31 79 22.5 76 226
Waiting: 14 46 18.1 43 197
Total: 46 224 434.2 105 5162
Percentage of the requests served within a certain time (ms)
50% 105
66% 118
75% 131
80% 139
90% 177
95% 1120
98% 1160
99% 2149
100% 5162 (longest request)

❯ for i in {1..16}; do ( count=$(gtimeout 30 curl -s -N -H "Accept: text/event-stream" http://192.168.4.1/events 2>&1 | grep -c "^data:"); echo "Total: $count events, $(echo "$count / 4" | bc -l) events / second" ) & done;
Total: 1284 events, 321.00000000000000000000 events / second
Total: 1398 events, 349.50000000000000000000 events / second
Total: 1398 events, 349.50000000000000000000 events / second
Total: 1311 events, 327.75000000000000000000 events / second
Total: 1398 events, 349.50000000000000000000 events / second
Total: 1481 events, 370.25000000000000000000 events / second
Total: 1481 events, 370.25000000000000000000 events / second
Total: 1334 events, 333.50000000000000000000 events / second
Total: 1334 events, 333.50000000000000000000 events / second
Total: 1329 events, 332.25000000000000000000 events / second
Total: 1334 events, 333.50000000000000000000 events / second
Total: 1481 events, 370.25000000000000000000 events / second
Total: 1399 events, 349.75000000000000000000 events / second
Total: 1333 events, 333.25000000000000000000 events / second
Total: 1398 events, 349.50000000000000000000 events / second
Total: 1481 events, 370.25000000000000000000 events / second

I have updated the comments on the test doing a slow response to explain. Anyway, nobody should do that and stall the async_tcp task. I have also tried other endpoints and I am not able to reproduce @yoursunny's issue. To me the PR is good to go. Also, no memory leak: I have monitored the heap.
Commits in this PR:
- AsyncAbstractResponse::_ack could allocate a temp buffer with a size larger than the available sock buffer (i.e. to fit headers) and eventually lose the remainder on transfer due to not checking if the complete data was added to the sock buff. Refactoring code in favor of having a dedicated std::vector object acting as accumulating buffer and more careful control of the amount of data actually copied to the sockbuff. Closes #315
- Added back MRE
- added overrides
- add AsyncWebServerRequest::clientRelease() method: this will explicitly release ownership of the AsyncClient* object. Make the ownership change for SSE/WebSocket clearer
- ci(pre-commit): Apply automatic fixes
- AsyncWebSocketResponse - keep request object till the WS_EVT_CONNECT event is executed; user code might use HTTP header information from the request
- ci(pre-commit): Apply automatic fixes
- fix typo
- Add comment for slow response
- Cleanup wrong log line
- ci(pre-commit): Apply automatic fixes
Force-pushed from fbc9801 to 747223f
@vortigont : |
Templates.ino regression still occurs as of commit 747223f. @mathieucarbou's test report also shows the regression.
Still this 0x00 byte for template to fix
…e which was overriding the first template byte by null string terminator
The regression also affects the ChunkResponse.ino example. main branch, commit c127402:

sunny@sunnyB:~$ curl -fsLS --output - http://192.168.5.85/ | hexdump -C | head -8
curl: (18) transfer closed with outstanding read data remaining
00000000 0a 3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e |.<!DOCTYPE html>|
00000010 0a 3c 68 74 6d 6c 3e 0a 3c 68 65 61 64 3e 0a 20 |.<html>.<head>. |
00000020 20 20 20 3c 74 69 74 6c 65 3e 53 61 6d 70 6c 65 | <title>Sample|
00000030 20 48 54 4d 4c 3c 2f 74 69 74 6c 65 3e 0a 3c 2f | HTML</title>.</|
00000040 68 65 61 64 3e 0a 3c 62 6f 64 79 3e 0a 20 20 20 |head>.<body>. |
00000050 20 3c 68 31 3e 48 65 6c 6c 6f 2c 20 57 6f 72 6c | <h1>Hello, Worl|
00000060 64 21 3c 2f 68 31 3e 0a 20 20 20 20 3c 70 3e 4c |d!</h1>. <p>L|
6f 72 65 6d 20 69 70 73 75 6d 20 64 6f 6c 6f 72 |orem ipsum dolor|

wresp_315 branch, commit 747223f:

sunny@sunnyB:~$ curl -fsLS --output - http://192.168.5.85/ | hexdump -C | head -8
00000000 00 3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e |.<!DOCTYPE html>|
00000010 0a 3c 68 74 6d 6c 3e 0a 3c 68 65 61 64 3e 0a 20 |.<html>.<head>. |
00000020 20 20 20 3c 74 69 74 6c 65 3e 53 61 6d 70 6c 65 | <title>Sample|
00000030 20 48 54 4d 4c 3c 2f 74 69 74 6c 65 3e 0a 3c 2f | HTML</title>.</|
00000040 68 65 61 64 3e 0a 3c 62 6f 64 79 3e 0a 20 20 20 |head>.<body>. |
00000050 20 3c 68 31 3e 48 65 6c 6c 6f 2c 20 57 6f 72 6c | <h1>Hello, Worl|
00000060 64 21 3c 2f 68 31 3e 0a 20 20 20 20 3c 70 3e 4c |d!</h1>. <p>L|
00000070 6f 72 65 6d 20 69 70 73 75 6d 20 64 6f 6c 6f 72 |orem ipsum dolor| |
@yoursunny this is fixed in bb0dd46. This was caused by a bug in @vortigont's refactoring where sprintf was used to convert the chunk size into hex characters, but this added a null string terminator into the buffer, which was overwriting the first byte from the template (see the short sketch after the dump below). I will let you verify. Now I have:

❯ curl -v --output - http://192.168.4.1/uptime.html | hexdump -C
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 192.168.4.1:80...
* Established connection to 192.168.4.1 (192.168.4.1 port 80) from 192.168.4.2 port 62281
* using HTTP/1.x
> GET /uptime.html HTTP/1.1
> Host: 192.168.4.1
> User-Agent: curl/8.16.0
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 200 OK
< Content-Disposition: inline
< Last-Modified: Thu, 01 Jan 1970 00:01:00 GMT
< Cache-Control: no-cache
< Connection: close
< Accept-Ranges: none
< Transfer-Encoding: chunked
< Content-Type: text/html
<
{ [97 bytes data]
100 81 0 81 0 0 1016 0 --:--:-- --:--:-- --:--:-- 1025
* shutting down connection #0
00000000 0a 3c 21 44 4f 43 54 59 50 45 20 68 74 6d 6c 3e |.<!DOCTYPE html>|
00000010 0a 3c 68 74 6d 6c 3e 0a 3c 62 6f 64 79 3e 0a 20 |.<html>.<body>. |
00000020 20 20 20 3c 68 31 3e 48 65 6c 6c 6f 2c 20 42 6f | <h1>Hello, Bo|
00000030 62 20 31 20 6d 69 6e 75 74 65 73 3c 2f 68 31 3e |b 1 minutes</h1>|
00000040 0a 3c 2f 62 6f 64 79 3e 0a 3c 2f 68 74 6d 6c 3e |.</body>.</html>|
00000050 0a |.|
00000051 |
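To make the root cause concrete, here is a hedged sketch of the problem and one possible fix; the function name and buffer layout are illustrative, not the PR's exact code.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// sprintf()/snprintf() always append '\0', so writing the hex chunk length directly
// in front of the payload overwrites the payload's first byte -- the stray 0x00 seen
// in the hexdumps above. Writing into a scratch buffer and copying only the characters
// avoids it.
size_t write_chunk_header(uint8_t* out, size_t payload_len) {
  char tmp[16];
  int n = snprintf(tmp, sizeof(tmp), "%x\r\n", static_cast<unsigned>(payload_len));
  memcpy(out, tmp, static_cast<size_t>(n));  // no terminating '\0' copied
  return static_cast<size_t>(n);
}
```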
ah! that |

AsyncAbstractResponse::_ack could allocate a temp buffer with a size larger than the available sock buffer (i.e. to fit headers) and eventually lose the remainder on transfer, due to not checking whether the complete data was added to the sock buff.

Refactoring the code in favor of having a dedicated std::vector object acting as an accumulating buffer, and more careful control of the amount of data actually copied to the sockbuff.

Closes #315
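For readers skimming the description, a hedged sketch of the accumulating-buffer pattern it describes (illustrative names, not the actual WebResponses.cpp code): top the vector up to the socket's free space, hand it to the client, and erase only what the socket actually accepted, so the remainder is kept for the next ack instead of being lost.

```cpp
#include <AsyncTCP.h>
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

struct SendBufferSketch {
  std::vector<uint8_t> buf;

  // 'produce' stands in for whatever fills the body (file, callback, template engine).
  void on_ack(AsyncClient* client, const std::function<size_t(uint8_t*, size_t)>& produce) {
    size_t space = client->space();
    if (buf.size() < space) {            // refill up to the available window
      size_t old = buf.size();
      buf.resize(space);
      buf.resize(old + produce(buf.data() + old, space - old));
    }
    if (buf.empty())
      return;                            // nothing to send on this ack
    size_t written = client->add(reinterpret_cast<const char*>(buf.data()), buf.size());
    client->send();
    buf.erase(buf.begin(), buf.begin() + written);  // keep only what was not taken
  }
};
```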