[BUG] redis-plus-plus core dump / crash at AsyncRedisCluster reset #578
Comments
Hello, Update: I patched void EventLoop::_notify() { ... }. Unfortunately this is not enough for redis-plus-plus 1.3.13, because sometimes this line is also hit. In my case it is line 608: GuardedAsyncConnection::GuardedAsyncConnection(const AsyncConnectionPoolSPtr &pool). Br, Jozsef
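For context, a minimal sketch of the kind of defensive guard around uv_async_send() discussed above. The member name _event_async, the use of uv_is_closing(), and the silent early returns are assumptions for illustration; this is not the actual redis-plus-plus patch:

    // Hypothetical sketch only. libuv can abort() the process when the async
    // wakeup cannot be delivered (e.g. the loop is already being torn down),
    // so guard the send instead of letting it crash.
    void EventLoop::_notify() {
        auto *handle = reinterpret_cast<uv_handle_t *>(&_event_async);
        if (uv_is_closing(handle)) {
            return;     // loop is shutting down; drop the wakeup
        }

        if (uv_async_send(&_event_async) != 0) {
            return;     // send failed; nothing useful to do here
        }
    }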
@jzkiss I fixed the second problem you mentioned (#625), i.e. sw::redis::GuardedAsyncConnection::GuardedAsyncConnection(const AsyncConnectionPoolSPtr&): Assertion `!_connection->broken()' failed. However, I still cannot reproduce your first problem. Could you please try the latest code on the master branch to see if the problem still exists? Thanks! Regards
Thank you very much @sewenew. I will try to re-establish my testing environment and re-execute the related tests in the next few days. Br, Jozsef
Hello! I have executed 3 tests with master. I attached my test code and scripts. Before executing the tests, I create the Redis cluster; 172.17.0.2 is the IP address of the docker container I used for testing (from /etc/hosts). The test uses 2 shell scripts, and I added 4 new options to the test setup:
Test 1: in-service upgrade simulation.
Test 2: network failure test, with async Redis cluster reset disabled.
Test 3: network failure test, with async Redis cluster reset enabled.
I can provide the compressed Wireshark logs if you need them. Thanks!
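As a rough illustration only, this is the kind of continuous-traffic loop such tests rely on; the key pattern, loop shape, and error handling are assumptions, not the attached test code:

    #include <sw/redis++/async_redis++.h>
    #include <atomic>
    #include <memory>
    #include <string>

    // Purely illustrative traffic generator (not the attached test code).
    void run_traffic(std::shared_ptr<sw::redis::AsyncRedisCluster> cluster,
                     std::atomic<bool> &keep_running) {
        for (long i = 0; keep_running; ++i) {
            auto key = "USER_KEY_" + std::to_string(i % 1000);
            try {
                // set() returns a future; get() blocks until the reply (or an error) arrives.
                cluster->set(key, "value").get();
            } catch (const sw::redis::Error &err) {
                // Requests routed to the killed master may fail or time out here.
            }
        }
    }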
@sewenew I have executed the wire-cut tests several more times. In 10 test executions the "EventLoop::_notify()" crash appeared 6 times (60%); in the other cases there was no crash. The good news is that there was no other kind of crash, whereas in version 1.3.13 the "EventLoop::_notify()" patch alone was not enough. So master now contains very important fixes (#625). The iptables test works only on Linux, I guess. I also have to add a try/catch block to the test script where I examine the cluster state after the reset. When do you plan to release the new version that contains the fix? Thanks, Jozsef
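A minimal sketch of the kind of try/catch guard mentioned above, assuming the cluster state is probed with an ordinary command right after the reset; the probe key and the choice of command are illustrative:

    #include <sw/redis++/async_redis++.h>

    // Illustrative only: right after resetting the async client, commands may
    // still fail while the new instance rediscovers the cluster topology.
    void check_cluster_state(sw::redis::AsyncRedisCluster &cluster) {
        try {
            auto value = cluster.get("USER_KEY_297").get();   // hypothetical probe key
            if (value) {
                // inspect *value / the cluster state here
            }
        } catch (const sw::redis::Error &err) {
            // Expected occasionally right after a reset; log and retry later.
        }
    }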
Thanks for the info! I'll take a look ASAP. Whether or not I can figure out the root cause, I'll publish a release with the fix you mentioned this weekend. Sorry for the late reply; too busy these days. Regards
@jzkiss I tried a fix; can you run your test with the code on the dev branch? Thanks a lot! Regards
@jzkiss I ran the test you gave with the latest fix more than 10 times, and there was no core dump. Before the fix, I ran the test more than 10 times, and it cored once. I think the root cause has been fixed, and I've published a new release with the fix, i.e. 1.3.14. Could you please run your test with the latest release? If you still have problems, feel free to let me know. Regards
@sewenew Thank you very much for the fix and the new release! I have executed many tests with 1.3.14, and the core dump issues are fixed. I did not encounter any valgrind memory leak problem. I have found one issue; it is not new, we saw it with the earlier releases too. When I reset the async client, the old TCP connections remain hanging on the redis-plus-plus side. I set the timeout and keep-alive on the Redis side, and Redis closes the connection towards redis-plus-plus after the TCP timeout. The Redis server sends [FIN,ACK] on TCP, but the redis-plus-plus side answers only with [ACK] instead of [FIN,ACK]. As a consequence, the TCP connection state is CLOSE_WAIT. The problem is that the async client does not provide a method for reinitialisation or for closing the socket connections (pool). The first screenshot (omitted here) shows the original connection before the async reset; the second shows the new connection after the async reset, which is closed correctly. I will open a new issue about this one. Thanks again, Jozsef
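For illustration, a rough sketch of the reset pattern being described, reusing m_redis_cluster, opts, and pool_opts from the issue description below; the staged shared_ptr handling is an assumption, and there is no claim that it avoids the CLOSE_WAIT sockets:

    // Illustrative only: swapping in a new AsyncRedisCluster instance.
    // Per the report above, connections owned by the old instance can remain in
    // CLOSE_WAIT after the server's FIN, because the async client exposes no
    // explicit close()/disconnect() call to force those sockets shut.
    auto old_cluster = m_redis_cluster;    // keep the old instance alive for in-flight requests
    m_redis_cluster = std::make_shared<sw::redis::AsyncRedisCluster>(
            opts, pool_opts, sw::redis::Role::MASTER);
    old_cluster.reset();                   // drop our reference to the old instance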
@sewenew One question: how can I create the dev branch? My guess is: git clone https://github.com/sewenew/redis-plus-plus.git. Thanks, Jozsef
Do you mean that you want to make a contribution? In that case, you can fork redis-plus-plus, clone it, create a new branch, make some changes, and create a pull request. Regards
No. You wrote earlier: "@jzkiss I tried a fix, can you run your test with the code on dev branch? Thanks a lot! Regards". Here, what do you mean by the dev branch? I thought I should create a test branch, put the modified test_main.cpp and the other files there, push them to the branch, and then you could pull and execute my tests.
Sorry, my bad! That message is out of date. I've already merged the code into master and published a new release, i.e. 1.3.14. Also, I've tested your modified version of test_main.cpp, and everything looks good. Regards
Describe the bug
Resetting the AsyncRedisCluster causes a core dump if one of the Redis masters was killed beforehand.
To Reproduce
[1.] The async client is defined / used in the following way:
::std::shared_ptr<::sw::redis::AsyncRedisCluster> m_redis_cluster;
m_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(opts, pool_opts, ::sw::redis::Role::MASTER));
[2.] Continuous traffic is generated
[3.] One Redis master exits (kill -9 <redis-server-pid>, or a Kubernetes rolling upgrade of the Redis pods is executed)
[4.] The user code on top of redis-plus-plus detects that, for 4 seconds, there is no response to the requests directed to the unreachable Redis node (based on hash slot)
[5.] The user code then resets the AsyncRedisCluster with the IP address / port of a reachable Redis master (see the sketch after the backtrace below):
m_redis_cluster.reset(new ::sw::redis::AsyncRedisCluster(opts, pool_opts, ::sw::redis::Role::MASTER));
[6.] After ~0.6 sec (restart: 14:59:39.710346104Z, core dump: 14:59:40.398587602Z) a core dump is detected:
[New LWP 1407]
[New LWP 1486]
[New LWP 1484]
[New LWP 1483]
[New LWP 1485]
[New LWP 1405]
[New LWP 1404]
[New LWP 1400]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `'.
Program terminated with signal SIGABRT, Aborted.
#0 0x0000000009625acf in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x122c1700 (LWP 1407))]
...
(gdb) bt full
#0 0x0000000009625acf in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00000000095f8ea5 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x0000000007e7d96a in uv_async_send.cold () from /lib64/libuv.so.1
No symbol table info available.
#3 0x0000000007c31856 in sw::redis::AsyncConnection::send (this=0xe82ee20, event=std::unique_ptr<sw::redis::AsyncEvent> = {...})
at /usr/include/c++/8/bits/shared_ptr_base.h:251
No locals.
#4 0x0000000007c42d96 in sw::redis::AsyncShardsPool::_redeliver_events (this=0xfdbbf90,
events=std::queue wrapping: std::deque with 6 elements = {...}) at /usr/include/c++/8/bits/move.h:74
async_event =
pool = std::shared_ptr<sw::redis::AsyncConnectionPool> (use count 3, weak count 1) = {get() = 0xfe40990}
connection = {_pool = std::shared_ptr<sw::redis::AsyncConnectionPool> (use count 3, weak count 1) = {get() = 0xfe40990},
_connection = std::shared_ptr<sw::redis::AsyncConnection> (use count 3, weak count 1) = {get() = 0xe82ee20}}
event =
should_stop_worker = false
#5 0x0000000007c44530 in sw::redis::AsyncShardsPool::_run (this=0xfdbbf90)
at /.../redis++/rpm/BUILD/src/sw/redis++/async_shards_pool.cpp:191
events = std::queue wrapping: std::deque with 6 elements = {{key = "USER_KEY_297",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0x0}}, {key = "USER_KEY_302",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe8b2d30}}, {
key = "USER_KEY_303", event = std::unique_ptr<sw::redis::AsyncEvent> = {
get() = 0xe8a38a0}}, {key = "USER_KEY_522",
event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe8d7a60}}, {
key = "USER_KEY_306", event = std::unique_ptr<sw::redis::AsyncEvent> = {
get() = 0xe6d7e20}}, {key = "", event = std::unique_ptr<sw::redis::AsyncEvent> = {get() = 0xe6c8dc0}}}
#6 0x0000000008d6ab23 in execute_native_thread_routine () from /lib64/libstdc++.so.6
No symbol table info available.
#7 0x00000000083591ca in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#8 0x0000000009610e73 in clone () from /lib64/libc.so.6
No symbol table info available.
(gdb) Quit
USER_KEY_297, ..., USER_KEY_306 are anonymized keys, but all of them belong to the slot range of the killed Redis master.
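For reference, a self-contained sketch of the construct / detect / reset pattern from the steps above. The host addresses, port, pool size, and the placeholder timeout flag are assumptions for illustration; this is not the exact user code of this report:

    #include <sw/redis++/async_redis++.h>
    #include <memory>

    int main() {
        sw::redis::ConnectionOptions opts;
        opts.host = "172.17.0.2";          // assumed address of a cluster node
        opts.port = 7000;                  // assumed port

        sw::redis::ConnectionPoolOptions pool_opts;
        pool_opts.size = 3;                // assumed pool size

        auto m_redis_cluster = std::make_shared<sw::redis::AsyncRedisCluster>(
                opts, pool_opts, sw::redis::Role::MASTER);

        // Steps [4.] and [5.]: if no response has arrived for ~4 seconds
        // (detection logic lives in the real user code), rebuild the client
        // against a node that is still reachable.
        bool no_response_for_4s = false;   // placeholder for the real detection logic
        if (no_response_for_4s) {
            opts.host = "172.17.0.3";      // assumed address of a reachable master
            m_redis_cluster = std::make_shared<sw::redis::AsyncRedisCluster>(
                    opts, pool_opts, sw::redis::Role::MASTER);
        }

        return 0;
    }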
Expected behavior
No crash; traffic should stabilize.
Environment:
OS: Rocky Linux 8.2-20.el8.0.1
Compiler: gcc version 8.5.0
hiredis version: hiredis 1.2.0
redis-plus-plus version: 1.3.12
Additional context
A Redis cluster with 3 masters and 3 slaves is used.