[BUG] redis-plus-plus core dump / crash at AsyncRedisCluster reset #578


Closed
jzkiss opened this issue Jul 5, 2024 · 13 comments

jzkiss commented Jul 5, 2024

Describe the bug
Resetting AsyncRedisCluster causes a core dump if one of the Redis masters was killed beforehand.

To Reproduce
[1.] The async client is defined / used in the following way:

::std::shared_ptr<::sw::redis::AsyncRedisCluster> m_redis_cluster;
m_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(opts, pool_opts, ::sw::redis::Role::MASTER));

[2.] Continuous traffic is generated

[3.] One redis master exits (kill -9 redis-server-pid or execute kubernetes rolling upgrade for the redis pods)

[4.] The user code of redis-plus-plus detects that for 4 seconds there is no response to the requests directed (based on hash slot) to the unreachable Redis instance

[5.] The user code of redis-plus-plus initiates an AsyncRedisCluster reset with the IP address / port of a reachable Redis master (see the sketch after the backtrace):
m_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(opts, pool_opts, ::sw::redis::Role::MASTER));

[6.] After ~0.6 sec (restart: 14:59:39.710346104Z, core dump: 14:59:40.398587602Z) a core dump is detected:

[New LWP 1407]
[New LWP 1486]
[New LWP 1484]
[New LWP 1483]
[New LWP 1485]
[New LWP 1405]
[New LWP 1404]
[New LWP 1400]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `'.
Program terminated with signal SIGABRT, Aborted.
#0 0x0000000009625acf in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x122c1700 (LWP 1407))]
...
(gdb) bt full
#0 0x0000000009625acf in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00000000095f8ea5 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x0000000007e7d96a in uv_async_send.cold () from /lib64/libuv.so.1
No symbol table info available.
#3 0x0000000007c31856 in sw::redis::AsyncConnection::send (this=0xe82ee20, event=std::unique_ptr<sw::redis::AsyncEvent> = {...})
at /usr/include/c++/8/bits/shared_ptr_base.h:251
No locals.
#4 0x0000000007c42d96 in sw::redis::AsyncShardsPool::_redeliver_events (this=0xfdbbf90,
events=std::queue wrapping: std::deque with 6 elements = {...}) at /usr/include/c++/8/bits/move.h:74
async_event =
pool = std::shared_ptr<sw::redis::AsyncConnectionPool> (use count 3, weak count 1) = {get() = 0xfe40990}
connection = {_pool = std::shared_ptr<sw::redis::AsyncConnectionPool> (use count 3, weak count 1) = {get() = 0xfe40990},
_connection = std::shared_ptr<sw::redis::AsyncConnection> (use count 3, weak count 1) = {get() = 0xe82ee20}}
event =
should_stop_worker = false
#5 0x0000000007c44530 in sw::redis::AsyncShardsPool::_run (this=0xfdbbf90)
at /.../redis++/rpm/BUILD/src/sw/redis++/async_shards_pool.cpp:191
events = std::queue wrapping: std::deque with 6 elements = {{key = "USER_KEY_297",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0x0}}, {key = "USER_KEY_302",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe8b2d30}}, {
key = "USER_KEY_303", event = std::unique_ptr<sw::redis::AsyncEvent> = {
get() = 0xe8a38a0}}, {key = "USER_KEY_522",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe8d7a60}}, {
key = "USER_KEY_306", event = std::unique_ptr<sw::redis::AsyncEvent> = {
get() = 0xe6d7e20}}, {key = "", event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe6c8dc0}}}
#6 0x0000000008d6ab23 in execute_native_thread_routine () from /lib64/libstdc++.so.6
No symbol table info available.
#7 0x00000000083591ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#8 0x0000000009610e73 in clone () from /lib64/libc.so.6
No symbol table info available.
(gdb) Quit

USER_KEY_297, ..., USER_KEY_306 are anonymized keys, but all of them belong to the slot range of the killed Redis master.
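For context, the detection in step [4.] and the reset in step [5.] are driven by user-side logic roughly like the following (a simplified sketch: on_response(), watchdog() and the timeout bookkeeping are illustrative names, not the exact production code):

// Simplified sketch of the user-side detection / reset logic.
#include <sw/redis++/async_redis++.h>
#include <chrono>
#include <memory>

using Clock = std::chrono::steady_clock;

std::shared_ptr<::sw::redis::AsyncRedisCluster> m_redis_cluster;
Clock::time_point m_last_response = Clock::now();

// Called from the completion callback of every request.
void on_response() {
    m_last_response = Clock::now();
}

// Called periodically: if nothing answered for 4 seconds, rebuild the
// client against a reachable master (step [5.] above).
void watchdog(const sw::redis::ConnectionOptions &opts,
              const sw::redis::ConnectionPoolOptions &pool_opts) {
    if (Clock::now() - m_last_response > std::chrono::seconds(4)) {
        m_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(
                opts, pool_opts, ::sw::redis::Role::MASTER));
    }
}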

Expected behavior
No crash; traffic should stabilize.

Environment:
OS: Rocky Linux 8.2-20.el8.0.1
Compiler: gcc version 8.5.0
hiredis version: hiredis 1.2.0
redis-plus-plus version: 1.3.12

Additional context
Redis cluster is used with 3 masters and 3 slaves.


jzkiss commented Dec 5, 2024

Hello,

Update:
The problem is solved by patching redis-plus-plus 1.3.12 in the following way (the test logs were, of course, only added for debugging):

void EventLoop::_notify() {
    // assert(_event_async);
    std::cout << "ctrlog redispp 1.3.12: In EventLoop::_notify()\n";
    if (!_event_async) {
        std::cout << "ctrlog redispp 1.3.12: In EventLoop::_notify(): NULL _event_async, ASSERT WOULD CRASH THE APPLICATION!!!\n";
        return;
    }

    uv_async_send(_event_async.get());
}
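In the core dump above the abort happens inside uv_async_send(), so the guard simply avoids touching the libuv async handle once it has been torn down during the reset.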

Unfortunately this is not enough for redis-plus-plus 1.3.13, because sometimes this assertion is also hit:
ctrlog /redis-plus-plus-1.3.13/src/sw/redis++/async_connection.cpp:608: sw::redis::GuardedAsyncConnection::GuardedAsyncConnection(const AsyncConnectionPoolSPtr&): Assertion `!_connection->broken()' failed.

In my case this is line 608:

GuardedAsyncConnection::GuardedAsyncConnection(const AsyncConnectionPoolSPtr &pool) :
    _pool(pool), _connection(_pool->fetch()) {
    std::cout << "ctrlog redispp 1.3.13: In GuardedAsyncConnection::GuardedAsyncConnection() entered\n";
    assert(!_connection->broken());   // line 608
}
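That is, even with the _notify() guard in place, the pool can still hand out a connection that is already marked broken after the shard went away, and the assert aborts the process.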

Br, Jozsef


sewenew commented Mar 17, 2025

@jzkiss I fixed the second problem you mentioned (#625), i.e. sw::redis::GuardedAsyncConnection::GuardedAsyncConnection(const AsyncConnectionPoolSPtr&): Assertion `!_connection->broken()' failed.

However, I still cannot reproduce your first problem. Could you please try the latest code on master branch to see if the problem still exists? Thanks!

Regards


jzkiss commented Mar 18, 2025

Thank you very much @sewenew. I will try to re-establish my testing environment and re-execute the related tests in the next few days.

Br, Jozsef


jzkiss commented Mar 24, 2025

Hello!

I have executed 3 tests with the master branch. I attached my test code and scripts.

Before executing the test, I create the Redis cluster with

../../../_scripts/redispp_create_cluster.sh 172.17.0.2

where 172.17.0.2 is the IP address of the Docker container I used for testing (from /etc/hosts).

The test uses 2 shell scripts:
redispp_cut_wire.sh - for the wire-cut test (network failure simulation)
redispp_graceful_kill_and_restart_redis_instance.sh - for the graceful-kill test (~In-Service Upgrade simulation)

I added 4 new options to the test setup:
-d: start async_mass_traffic_test
-w: wire-cut test
-f: enable async Redis cluster reset
-g <int>: number of requests

Test 1: In-Service Upgrade simulation:
./test/test_redis++ -n 172.17.0.2 -c 6380 -d -g 1200 > gracefull_kill_test.txt
OK:
Monday Mon Mar 24 11:51:58 2025 GENERATOR: Test is over: transactions generated:1210,requestSentCount:1247,successResponseCount:1195,errorResponseCount:52

Test 2: Network failure test, async Redis cluster reset disabled:
./test/test_redis++ -n 172.17.0.2 -c 6380 -d -g 1200 -w > tcp_cut_test.txt
OK
Monday Mon Mar 24 11:28:10 2025 GENERATOR: Test is over: transactions generated:1210,requestSentCount:1280,successResponseCount:1160,errorResponseCount:120

Test 3: Network failure test, async Redis cluster reset enabled:
[root@f73f5f91c574 build]# ./test/test_redis++ -n 172.17.0.2 -c 6380 -d -g 1200 -w -f > tcp_cut_async_reset_test.txt
test_redis++: /sources/redispp/redis-plus-plus/src/sw/redis++/event_loop.cpp:236: void sw::redis::EventLoop::_notify(): Assertion `_event_async' failed.
Aborted (core dumped)
NOK: Coredump

I can provide the compressed Wireshark logs if you need them.

Thanks,
Br, Jozsef

_redispp_scripts.zip

logs.zip


jzkiss commented Mar 25, 2025

@sewenew I have executed the wire-cut tests several more times. In 10 test executions the "EventLoop::_notify()" crash appeared 6 times (60%); in the other cases there was no crash. The good news is that there was no other kind of crash; with version 1.3.13 the "EventLoop::_notify()" patch alone was not enough. So master now contains very important fixes (#625).

The IPTABLES test works only on Linux, I guess.

And I had to add a try / catch block to the test code where I examine the cluster state after the reset:
_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(opts, pool_opts, ::sw::redis::Role::MASTER));
bool clusterOk = false;
try {
    clusterOk = checkClusterState();
    printTimestamp();
    std::cout << " GENERATOR: cluster restarted, cluster state:" << clusterOk << std::endl;
}
catch (const sw::redis::Error &e) {
    printTimestamp();
    std::cout << " GENERATOR: exception during cluster restart:" << e.what() << std::endl;
}
for (int i = 0; i < 30 && !clusterOk; i++) {
    printTimestamp();
    std::cout << " GENERATOR: cluster restarted, cluster state:" << clusterOk << std::endl;
    _wait();
    try {
        clusterOk = checkClusterState();
    }
    catch (const sw::redis::Error &e) {
        printTimestamp();
        std::cout << " GENERATOR: exception during cluster restart (2):" << e.what() << std::endl;
    }
}
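For reference, checkClusterState() is only a thin helper; a minimal sketch of what it does (hypothetical: it assumes a plain sync client pointed at a known node and parses CLUSTER INFO):

// Hypothetical sketch of checkClusterState(): ask a known node for
// CLUSTER INFO and look for "cluster_state:ok" in the reply.
#include <sw/redis++/redis++.h>
#include <string>

bool checkClusterState() {
    sw::redis::ConnectionOptions opts;
    opts.host = "172.17.0.2";   // node address from the test setup
    opts.port = 6380;
    sw::redis::Redis node(opts);
    auto info = node.command<std::string>("CLUSTER", "INFO");
    return info.find("cluster_state:ok") != std::string::npos;
}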

When do you plan to release the new version that contains the fix?

Thanks, Jozsef


sewenew commented Mar 27, 2025

Thanks for the info!

I'll take a look ASAP. Whether or not I can figure out the root cause, I'll publish a release with the fix you mentioned this weekend.

Sorry for the late reply, too busy these days.

Regards

sewenew added a commit that referenced this issue Mar 28, 2025

sewenew commented Mar 28, 2025

@jzkiss I tried a fix, can you run your test with the code on dev branch?

Thanks a lot!

Regards

sewenew added a commit that referenced this issue Mar 30, 2025

sewenew commented Mar 30, 2025

@jzkiss I ran the test you gave with the latest fix more than 10 times, and there was no core dump. Before the fix, I ran the test more than 10 times, and it cored once.

I think the root cause has been fixed, and I've published a new release with the fix, i.e. 1.3.14.

Could you please run your test with the latest release? If you still have problems, feel free to let me know.

Regards


jzkiss commented Mar 31, 2025

@sewenew Thank you very much for the fix and the new release! I have executed many tests with 1.3.14, and the core dump issues are fixed. I did not encounter any Valgrind memory leak problems either.

I have found one issue; it is not new, we saw it with earlier releases, too. When I reset the async client, the old TCP connections remain hanging on the redis-plus-plus side. I set tcp-timeout and keep-alive on the Redis side, and Redis closes the connection towards redis-plus-plus after the TCP timeout. The Redis server sends [FIN,ACK] on TCP, but the redis-plus-plus side answers only with [ACK] instead of [FIN,ACK]. As a consequence, the TCP connection gets stuck in CLOSE_WAIT:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 172.17.0.2:40784 172.17.0.3:6384 TIME_WAIT -
tcp 0 0 172.17.0.2:49290 172.17.0.3:6381 TIME_WAIT -
tcp 1 0 172.17.0.2:48900 172.17.0.3:6382 CLOSE_WAIT 3518/valgrind
tcp 0 0 172.17.0.2:33602 172.17.0.3:6384 TIME_WAIT -
tcp 0 0 172.17.0.2:50310 172.17.0.3:6381 TIME_WAIT -
tcp 0 0 172.17.0.2:37552 172.17.0.3:6381 ESTABLISHED 3518/valgrind
tcp 0 0 172.17.0.2:39912 172.17.0.3:6382 TIME_WAIT -
tcp 0 0 172.17.0.2:39898 172.17.0.3:6382 TIME_WAIT -
tcp 1 0 172.17.0.2:58408 172.17.0.3:6381 CLOSE_WAIT 3518/valgrind
tcp 0 0 172.17.0.2:47428 172.17.0.3:6382 TIME_WAIT -
tcp 0 0 172.17.0.2:49948 172.17.0.3:6382 TIME_WAIT -
tcp 0 0 172.17.0.2:41260 172.17.0.3:6381 TIME_WAIT -
tcp 0 0 172.17.0.2:34984 172.17.0.3:6384 TIME_WAIT -
tcp 0 0 172.17.0.2:52170 172.17.0.3:6382 ESTABLISHED 3518/valgrind
tcp 0 0 172.17.0.2:49888 172.17.0.3:6384 ESTABLISHED 3518/valgrind
tcp 0 0 172.17.0.2:53662 172.17.0.3:6381 TIME_WAIT -
tcp 0 0 172.17.0.2:59096 172.17.0.3:6382 TIME_WAIT -
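A socket stuck in CLOSE_WAIT means the peer's FIN was received and ACKed, but the local side never called close() on the descriptor, so these connections linger until the process exits.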

The problem is that the async client does not provide a method for reinitialization or for closing the socket connections (pool).

This is the original connection before async reset:

[image]

This connection is closed correctly, this is the new connection after async reset:

[image]

I will open a new issue about this one.

Thanks again: Jozsef

jzkiss closed this as completed Mar 31, 2025

jzkiss commented Mar 31, 2025

@sewenew One question: how can I create the dev branch? My guess is:

git clone https://github.com/sewenew/redis-plus-plus.git
git checkout -b <my_branch_name>
git push origin <my_branch_name>:<my_branch_name>

Thanks, Jozsef


sewenew commented Apr 2, 2025

Do you mean that you want to contribute? In that case, you can fork redis-plus-plus, clone your fork, create a new branch, make your changes, and create a pull request.
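Roughly (the clone URL is a placeholder for your own fork):

git clone https://github.com/<your-username>/redis-plus-plus.git
cd redis-plus-plus
git checkout -b <my_branch_name>
# ...edit, git add, git commit...
git push origin <my_branch_name>
# then open a pull request against sewenew/redis-plus-plus on GitHub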

Regards


jzkiss commented Apr 2, 2025

No. You wrote earlier:

_"@jzkiss I tried a fix, can you run your test with the code on dev branch?

Thanks a lot!

Regards"_

Here, what do you mean by the dev branch? I thought I should create a test branch, put the modified test_main.cpp and the other files into it, push the branch, and then you could pull and execute my tests.


sewenew commented Apr 2, 2025

Sorry, my bad! That message is out of date. I've already merged the code to master and published a new release, i.e. 1.3.14. Also, I've tested your modified version of test_main.cpp, and all looks good.

Regards
