[BUG] redis-plus-plus core dump / crash at AsyncRedisCluster reset #578


Closed
jzkiss opened this issue Jul 5, 2024 · 13 comments

jzkiss commented Jul 5, 2024

Describe the bug
Resetting AsyncRedisCluster causes a core dump if one of the Redis masters was killed beforehand.

To Reproduce
[1.] The async client is defined / used in the following way:

::std::shared_ptr<::sw::redis::AsyncRedisCluster> m_redis_cluster;
m_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(opts, pool_opts, ::sw::redis::Role::MASTER));

[2.] Continuous traffic is generated

[3.] One redis master exits (kill -9 redis-server-pid or execute kubernetes rolling upgrade for the redis pods)

[4.] The user code of redis-plus-plus detects that for 4 seconds there is no response to the requests directed (based on hash slot) to the unreachable Redis instance

[5.] The user code of redis-plus-plus initiates an AsyncRedisCluster reset with the IP address / port of a reachable Redis master (see the sketch after the backtrace):
m_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(opts, pool_opts, ::sw::redis::Role::MASTER));

[6.] After ~0.6 sec (restart: 14:59:39.710346104Z, core dump: 14:59:40.398587602Z) a core dump is detected:

[New LWP 1407]
[New LWP 1486]
[New LWP 1484]
[New LWP 1483]
[New LWP 1485]
[New LWP 1405]
[New LWP 1404]
[New LWP 1400]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `'.
Program terminated with signal SIGABRT, Aborted.
#0 0x0000000009625acf in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x122c1700 (LWP 1407))]
...
(gdb) bt full
#0 0x0000000009625acf in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00000000095f8ea5 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x0000000007e7d96a in uv_async_send.cold () from /lib64/libuv.so.1
No symbol table info available.
#3 0x0000000007c31856 in sw::redis::AsyncConnection::send (this=0xe82ee20, event=std::unique_ptr<sw::redis::AsyncEvent> = {...})
at /usr/include/c++/8/bits/shared_ptr_base.h:251
No locals.
#4 0x0000000007c42d96 in sw::redis::AsyncShardsPool::_redeliver_events (this=0xfdbbf90,
events=std::queue wrapping: std::deque with 6 elements = {...}) at /usr/include/c++/8/bits/move.h:74
async_event =
pool = std::shared_ptr<sw::redis::AsyncConnectionPool> (use count 3, weak count 1) = {get() = 0xfe40990}
connection = {_pool = std::shared_ptr<sw::redis::AsyncConnectionPool> (use count 3, weak count 1) = {get() = 0xfe40990},
_connection = std::shared_ptr<sw::redis::AsyncConnection> (use count 3, weak count 1) = {get() = 0xe82ee20}}
event =
should_stop_worker = false
#5 0x0000000007c44530 in sw::redis::AsyncShardsPool::_run (this=0xfdbbf90)
at /.../redis++/rpm/BUILD/src/sw/redis++/async_shards_pool.cpp:191
events = std::queue wrapping: std::deque with 6 elements = {{key = "USER_KEY_297",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0x0}}, {key = "USER_KEY_302",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe8b2d30}}, {
key = "USER_KEY_303", event = std::unique_ptr<sw::redis::AsyncEvent> = {
get() = 0xe8a38a0}}, {key = "USER_KEY_522",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe8d7a60}}, {
key = "USER_KEY_306", event = std::unique_ptr<sw::redis::AsyncEvent> = {
get() = 0xe6d7e20}}, {key = "", event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe6c8dc0}}}
#6 0x0000000008d6ab23 in execute_native_thread_routine () from /lib64/libstdc++.so.6
No symbol table info available.
#7 0x00000000083591ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#8 0x0000000009610e73 in clone () from /lib64/libc.so.6
No symbol table info available.
(gdb) Quit

USER_KEY_297, ..., USER_KEY_306 are anonymized keys, but all of them belong to the slot range of the killed Redis master.
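For context, the detection in step [4.] and the reset in step [5.] are driven by user-side logic roughly like the following (a simplified sketch: on_response(), watchdog() and the timeout bookkeeping are illustrative names, not the exact production code):

// Simplified sketch of the user-side detection / reset logic.
#include <sw/redis++/async_redis++.h>
#include <chrono>
#include <memory>

using Clock = std::chrono::steady_clock;

std::shared_ptr<::sw::redis::AsyncRedisCluster> m_redis_cluster;
Clock::time_point m_last_response = Clock::now();

// Called from the completion callback of every request.
void on_response() {
    m_last_response = Clock::now();
}

// Called periodically: if nothing answered for 4 seconds, rebuild the
// client against a reachable master (step [5.] above).
void watchdog(const sw::redis::ConnectionOptions &opts,
              const sw::redis::ConnectionPoolOptions &pool_opts) {
    if (Clock::now() - m_last_response > std::chrono::seconds(4)) {
        m_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(
                opts, pool_opts, ::sw::redis::Role::MASTER));
    }
}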

Expected behavior
No crash; traffic should stabilize.

Environment:
OS: Rocky Linux 8.2-20.el8.0.1
Compiler: gcc version 8.5.0
hiredis version: hiredis 1.2.0
redis-plus-plus version: 1.3.12

Additional context
Redis cluster is used with 3 masters and 3 slaves.


jzkiss commented Dec 5, 2024

Hello,

Update:
The problem is solved by patching redis-plus-plus 1.3.12 in the following way (the test logs were, of course, only added for debugging):

void EventLoop::_notify() {
    // assert(_event_async);
    std::cout << "ctrlog redispp 1.3.12: In EventLoop::_notify()\n";
    if (!_event_async) {
        std::cout << "ctrlog redispp 1.3.12: In EventLoop::_notify(): NULL _event_async, ASSERT WOULD CRASH THE APPLICATION!!!\n";
        return;
    }

    uv_async_send(_event_async.get());
}
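In the core dump above the abort happens inside uv_async_send(), so the guard simply avoids touching the libuv async handle once it has been torn down during the reset.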

Unfortunately this is not enough for redis-plus-plus 1.3.13, because sometimes this assertion is also hit:
ctrlog /redis-plus-plus-1.3.13/src/sw/redis++/async_connection.cpp:608: sw::redis::GuardedAsyncConnection::GuardedAsyncConnection(const AsyncConnectionPoolSPtr&): Assertion `!_connection->broken()' failed.

In my case this is line 608:

GuardedAsyncConnection::GuardedAsyncConnection(const AsyncConnectionPoolSPtr &pool) :
    _pool(pool), _connection(_pool->fetch()) {
    std::cout << "ctrlog redispp 1.3.13: In GuardedAsyncConnection::GuardedAsyncConnection() entered\n";
    assert(!_connection->broken());   // line 608
}
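That is, even with the _notify() guard in place, the pool can still hand out a connection that is already marked broken after the shard went away, and the assert aborts the process.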

Br, Jozsef


sewenew commented Mar 17, 2025

@jzkiss I fixed the second problem you mentioned (#625), i.e. sw::redis::GuardedAsyncConnection::GuardedAsyncConnection(const AsyncConnectionPoolSPtr&): Assertion `!_connection->broken()' failed.

However, I still cannot reproduce your first problem. Could you please try the latest code on master branch to see if the problem still exists? Thanks!

Regards


jzkiss commented Mar 18, 2025

Thank you very much @sewenew. I will try to re-establish my testing environment and re-execute the related tests in the next few days.

Br, Jozsef


jzkiss commented Mar 24, 2025

Hello!

I have executed 3 tests with the master branch. I attached my test code and scripts.

Before executing the test, I create the Redis cluster with

../../../_scripts/redispp_create_cluster.sh 172.17.0.2

where 172.17.0.2 is the IP address of the Docker container I used for testing (from /etc/hosts).

The test uses 2 shell scripts:
redispp_cut_wire.sh - for the wire-cut test (network failure simulation)
redispp_graceful_kill_and_restart_redis_instance.sh - for the graceful-kill test (~In-Service Upgrade simulation)

I added 4 new options to the test setup:
-d: start async_mass_traffic_test
-w: wire-cut test
-f: enable async Redis cluster reset
-g <int>: number of requests

Test 1: In-Service Upgrade simulation:
./test/test_redis++ -n 172.17.0.2 -c 6380 -d -g 1200 > gracefull_kill_test.txt
OK:
Monday Mon Mar 24 11:51:58 2025 GENERATOR: Test is over: transactions generated:1210,requestSentCount:1247,successResponseCount:1195,errorResponseCount:52

Test 2: Network failure test, async Redis cluster reset disabled:
./test/test_redis++ -n 172.17.0.2 -c 6380 -d -g 1200 -w > tcp_cut_test.txt
OK
Monday Mon Mar 24 11:28:10 2025 GENERATOR: Test is over: transactions generated:1210,requestSentCount:1280,successResponseCount:1160,errorResponseCount:120

Test 3: Network failure test, async Redis cluster reset enabled:
[root@f73f5f91c574 build]# ./test/test_redis++ -n 172.17.0.2 -c 6380 -d -g 1200 -w -f > tcp_cut_async_reset_test.txt
test_redis++: /sources/redispp/redis-plus-plus/src/sw/redis++/event_loop.cpp:236: void sw::redis::EventLoop::_notify(): Assertion `_event_async' failed.
Aborted (core dumped)
NOK: Coredump

I can provide the compressed Wireshark logs if you need them.

Thanks,
Br, Jozsef

_redispp_scripts.zip

logs.zip


jzkiss commented Mar 25, 2025

@sewenew I have executed the wire-cut tests several more times. In 10 test executions the "EventLoop::_notify()" crash appeared 6 times (60%); in the other cases there was no crash. The good news is that there was no other kind of crash; with version 1.3.13 the "EventLoop::_notify()" patch alone was not enough. So master now contains very important fixes (#625).

The IPTABLES test works only on Linux, I guess.

And I had to add a try / catch block to the test code where I examine the cluster state after the reset:
_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(opts, pool_opts, ::sw::redis::Role::MASTER));
bool clusterOk = false;
try {
    clusterOk = checkClusterState();
    printTimestamp();
    std::cout << " GENERATOR: cluster restarted, cluster state:" << clusterOk << std::endl;
}
catch (const sw::redis::Error &e) {
    printTimestamp();
    std::cout << " GENERATOR: exception during cluster restart:" << e.what() << std::endl;
}
for (int i = 0; i < 30 && !clusterOk; i++) {
    printTimestamp();
    std::cout << " GENERATOR: cluster restarted, cluster state:" << clusterOk << std::endl;
    _wait();
    try {
        clusterOk = checkClusterState();
    }
    catch (const sw::redis::Error &e) {
        printTimestamp();
        std::cout << " GENERATOR: exception during cluster restart (2):" << e.what() << std::endl;
    }
}
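For reference, checkClusterState() is only a thin helper; a minimal sketch of what it does (hypothetical: it assumes a plain sync client pointed at a known node and parses CLUSTER INFO):

// Hypothetical sketch of checkClusterState(): ask a known node for
// CLUSTER INFO and look for "cluster_state:ok" in the reply.
#include <sw/redis++/redis++.h>
#include <string>

bool checkClusterState() {
    sw::redis::ConnectionOptions opts;
    opts.host = "172.17.0.2";   // node address from the test setup
    opts.port = 6380;
    sw::redis::Redis node(opts);
    auto info = node.command<std::string>("CLUSTER", "INFO");
    return info.find("cluster_state:ok") != std::string::npos;
}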

When do you plan to release the new version that contains the fix?

Thanks, Jozsef


sewenew commented Mar 27, 2025

Thanks for the info!

I'll take a look ASAP. Whether or not I can figure out the root cause, I'll publish a release with the fix you mentioned this weekend.

Sorry for the late reply, too busy these days.

Regards

sewenew added a commit that referenced this issue Mar 28, 2025

sewenew commented Mar 28, 2025

@jzkiss I tried a fix, can you run your test with the code on dev branch?

Thanks a lot!

Regards

sewenew added a commit that referenced this issue Mar 30, 2025

sewenew commented Mar 30, 2025

@jzkiss I ran the test you gave with the latest fix more than 10 times, and there was no core dump. Before the fix, I ran the test more than 10 times, and it cored once.

I think the root cause has been fixed, and I've published a new release with the fix, i.e. 1.3.14.

Could you please run your test with the latest release? If you still have problems, feel free to let me know.

Regards


jzkiss commented Mar 31, 2025

@sewenew Thank you very much for the fix and the new release! I have executed many tests with 1.3.14, and the core dump issues are fixed. I did not encounter any Valgrind memory leak problems either.

I have found one issue; it is not new, we saw it with earlier releases, too. When I reset the async client, the old TCP connections remain hanging on the redis-plus-plus side. I set tcp-timeout and keep-alive on the Redis side, and Redis closes the connection towards redis-plus-plus after the TCP timeout. The Redis server sends [FIN,ACK] on TCP, but the redis-plus-plus side answers only with [ACK] instead of [FIN,ACK]. As a consequence, the TCP connection gets stuck in CLOSE_WAIT:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 172.17.0.2:40784 172.17.0.3:6384 TIME_WAIT -
tcp 0 0 172.17.0.2:49290 172.17.0.3:6381 TIME_WAIT -
tcp 1 0 172.17.0.2:48900 172.17.0.3:6382 CLOSE_WAIT 3518/valgrind
tcp 0 0 172.17.0.2:33602 172.17.0.3:6384 TIME_WAIT -
tcp 0 0 172.17.0.2:50310 172.17.0.3:6381 TIME_WAIT -
tcp 0 0 172.17.0.2:37552 172.17.0.3:6381 ESTABLISHED 3518/valgrind
tcp 0 0 172.17.0.2:39912 172.17.0.3:6382 TIME_WAIT -
tcp 0 0 172.17.0.2:39898 172.17.0.3:6382 TIME_WAIT -
tcp 1 0 172.17.0.2:58408 172.17.0.3:6381 CLOSE_WAIT 3518/valgrind
tcp 0 0 172.17.0.2:47428 172.17.0.3:6382 TIME_WAIT -
tcp 0 0 172.17.0.2:49948 172.17.0.3:6382 TIME_WAIT -
tcp 0 0 172.17.0.2:41260 172.17.0.3:6381 TIME_WAIT -
tcp 0 0 172.17.0.2:34984 172.17.0.3:6384 TIME_WAIT -
tcp 0 0 172.17.0.2:52170 172.17.0.3:6382 ESTABLISHED 3518/valgrind
tcp 0 0 172.17.0.2:49888 172.17.0.3:6384 ESTABLISHED 3518/valgrind
tcp 0 0 172.17.0.2:53662 172.17.0.3:6381 TIME_WAIT -
tcp 0 0 172.17.0.2:59096 172.17.0.3:6382 TIME_WAIT -
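A socket stuck in CLOSE_WAIT means the peer's FIN was received and ACKed, but the local side never called close() on the descriptor, so these connections linger until the process exits.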

The problem is that the async client does not provide a method for reinitialization or for closing the socket connections (pool).

This is the original connection before async reset:

[image]

This connection is closed correctly, this is the new connection after async reset:

[image]

I will open a new issue about this one.

Thanks again: Jozsef

jzkiss closed this as completed Mar 31, 2025

jzkiss commented Mar 31, 2025

@sewenew One question: how can I create the dev branch? My guess is:

git clone https://github.com/sewenew/redis-plus-plus.git
git checkout -b <my_branch_name>
git push origin <my_branch_name>:<my_branch_name>

Thanks, Jozsef


sewenew commented Apr 2, 2025

Do you mean that you want to contribute? In that case, you can fork redis-plus-plus, clone your fork, create a new branch, make your changes, and create a pull request.
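Roughly (the clone URL is a placeholder for your own fork):

git clone https://github.com/<your-username>/redis-plus-plus.git
cd redis-plus-plus
git checkout -b <my_branch_name>
# ...edit, git add, git commit...
git push origin <my_branch_name>
# then open a pull request against sewenew/redis-plus-plus on GitHub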

Regards


jzkiss commented Apr 2, 2025

No. You wrote earlier:

_"@jzkiss I tried a fix, can you run your test with the code on dev branch?

Thanks a lot!

Regards"_

Here, what do you mean by the dev branch? I thought I should create a test branch, put the modified test_main.cpp and the other files into it, push the branch, and then you could pull and execute my tests.


sewenew commented Apr 2, 2025

Sorry, my bad! That message is out of date. I've already merged the code to master and published a new release, i.e. 1.3.14. Also, I've tested your modified version of test_main.cpp, and all looks good.

Regards
