
Fix replica not able to initiate election in time when epoch fails #1009

Merged
enjoy-binbin merged 10 commits into valkey-io:unstable from the epoch_timeout branch on Nov 11, 2024

Conversation

enjoy-binbin
Member

If multiple primary nodes go down at the same time, their replica nodes will
initiate elections at the same time. There is a certain probability that
the replicas will initiate their elections in the same epoch.

In our current election mechanism, only one replica node can eventually get
enough votes in that epoch; the other replica will fail to win due to the
lack of a majority, its election will time out, and we have to wait for the
retry, which results in a long failure time.

If another node has already won the election in our failover epoch, we can
assume that our election has failed and retry as soon as possible.
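For context, a minimal self-contained sketch of the idea (not the actual diff: election_state is a stand-in for the relevant fields of clusterState in src/cluster_legacy.c, and reset_stale_election is a hypothetical helper; the real patch applies the check where incoming claims are processed):

#include <stdint.h>

/* Stand-in for the election-related fields of clusterState
 * (src/cluster_legacy.c); only what this sketch needs. */
typedef struct {
    uint64_t failover_auth_epoch; /* Epoch our ongoing election runs in. */
    long long failover_auth_time; /* Election start time, 0 = no election. */
} election_state;

/* If another node already claimed an epoch equal to or higher than the one
 * our election runs in, our vote request can no longer reach a majority in
 * that epoch. Clearing the election state lets the next cron iteration start
 * a fresh attempt immediately instead of waiting for the election timeout.
 * Returns 1 if the election was reset. */
static int reset_stale_election(election_state *st, uint64_t seen_epoch) {
    if (st->failover_auth_time == 0) return 0;          /* No election running. */
    if (seen_epoch < st->failover_auth_epoch) return 0; /* Our epoch can still win. */
    st->failover_auth_time = 0;
    return 1;
}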


codecov bot commented Sep 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.73%. Comparing base (e972d56) to head (6291ed6).
Report is 6 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1009      +/-   ##
============================================
+ Coverage     70.70%   70.73%   +0.02%     
============================================
  Files           114      114              
  Lines         63147    63151       +4     
============================================
+ Hits          44648    44669      +21     
+ Misses        18499    18482      -17     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.43% <100.00%> (+0.20%) ⬆️

... and 11 files with indirect coverage changes

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Sep 10, 2024
Member

@PingXie PingXie left a comment


This is a great bug, @enjoy-binbin! The fix LGTM overall.

Member

@PingXie PingXie left a comment


LGTM!

@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Sep 24, 2024
@enjoy-binbin
Member Author

@madolson @zuiderkwast do you guys want to take a look at this?

Contributor

@zuiderkwast zuiderkwast left a comment


Fix looks good!

The PR title says "Optimize ..." but it is more than an optimization. Actually a bug fix? Please improve the title. :)

Co-authored-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Binbin <[email protected]>
@enjoy-binbin
Member Author

The PR title says "Optimize ..." but it is more than an optimization. Actually a bug fix? Please improve the title. :)

I think it is indeed more of an optimization, making the election fail ASAP and retry ASAP. But it can also be considered a bug fix, maybe: Fix replica not able to initiate election in time when epoch fails?

@zuiderkwast
Contributor

The macOS job failed. It's probably not related to this PR. It's a fascinating crash log though:

                .+^+.                                                
            .+#########+.                                            
        .+########+########+.           Valkey 255.255.255 (125c71fe/0) 64 bit
    .+########+'     '+########+.                                    
 .########+'     .+.     '+########.    Running in standalone mode
 |####+'     .+#######+.     '+####|    Port: 21995
I/O error reading reply
 |###|   .+###############+.   |###|    PID: 30605                     
    while executing
 |###|   |#####*'' ''*#####|   |###|                                 
"$r set [expr rand()] [expr rand()]"
 |###|   |####'  .-.  '####|   |###|                                 
    (procedure "gen_write_load" line 8)
 |###|   |###(  (@@@)  )###|   |###|          https://valkey.io/      
    invoked from within
 |###|   |####.  '-'  .####|   |###|                                 
"gen_write_load [lindex $argv 0] [lindex $argv 1] [lindex $argv 2] [lindex $argv 3] [lindex $argv 4]"
 |###|   |#####*.   .*#####|   |###|                                 
    (file "tests/helpers/gen_write_load.tcl" line 24)I/O error reading reply
 |###|   '+#####|   |#####+'   |###|                                 
    while executing
 |####+.     +##|   |#+'     .+####|                                 
"$r set [expr rand()] [expr rand()]"
 '#######+   |##|        .+########'                                 
    (procedure "gen_write_load" line 8)
    '+###|   |##|    .+########+'                                    
    invoked from within
        '|   |####+########+'                                        
"gen_write_load [lindex $argv 0] [lindex $argv 1] [lindex $argv 2] [lindex $argv 3] [lindex $argv 4]"
             +#########+'                                            
    (file "tests/helpers/gen_write_load.tcl" line 24)
                '+v+'                                                


I/O error reading reply
30605:M 09 Nov 2024 14:12:32.606 # WARNING: The TCP backlog setting of 511 cannot be enforced because kern.ipc.somaxconn is set to the lower value of 128.
    while executing
30605:M 09 Nov 2024 14:12:32.607 * Server initialized
"$r set [expr rand()] [expr rand()]"
30605:M 09 Nov 2024 14:12:32.607 * Ready to accept connections tcp
    (procedure "gen_write_load" line 8)
30605:M 09 Nov 2024 14:12:32.607 * Ready to accept connections unix
    invoked from within
"gen_write_load [lindex $argv 0] [lindex $argv 1] [lindex $argv 2] [lindex $argv 3] [lindex $argv 4]"
    (file "tests/helpers/gen_write_load.tcl" line 24)
30605:M 09 Nov 2024 14:12:32.894 - Accepted 127.0.0.1:64082
30605:M 09 Nov 2024 14:12:32.894 - Client closed connection id=3 addr=127.0.0.1:64082 laddr=127.0.0.1:21995 fd=14 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16890 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=34176 events=r cmd=ping user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=7 tot-net-out=7 tot-cmds=1

@enjoy-binbin
Member Author

I'm trying to fix the test in #1288.

@enjoy-binbin enjoy-binbin changed the title Optimize cluster election when nodes initiate elections at the same epoch Fix replica not able to initiate election in time when epoch fails Nov 11, 2024
@enjoy-binbin enjoy-binbin merged commit a2d22c6 into valkey-io:unstable Nov 11, 2024
56 of 57 checks passed
@enjoy-binbin enjoy-binbin deleted the epoch_timeout branch November 11, 2024 14:12
zvi-code pushed a commit to zvi-code/valkey-z that referenced this pull request Nov 13, 2024
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Nov 22, 2024
After valkey-io#1009, we reset the election when we receive a claim
with an equal or higher epoch, since another node may already have
won an election in that epoch.

But we need to consider the window before the node actually
obtains its failover_auth_epoch. The failover_auth_epoch
defaults to 0, so before the node has actually obtained the
failover epoch, we might wrongly reset the election.

This is probably harmless, but it produces misleading log
output and may delay the election by a cron cycle or a beforeSleep.
Now we only reset the election once the node has actually
obtained the failover epoch.

Signed-off-by: Binbin <[email protected]>
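Building on the sketch shown after the PR description above, the extra guard described here would look roughly like this (again hypothetical code, with failover_auth_sent standing in for "the node has actually obtained its failover epoch"; the real change lives in src/cluster_legacy.c):

#include <stdint.h>

/* Same stand-in struct as before, extended with the flag that is set once
 * the replica has requested votes (and therefore has a real failover epoch). */
typedef struct {
    uint64_t failover_auth_epoch; /* Still 0 before the election really starts. */
    long long failover_auth_time; /* Election start time, 0 = no election. */
    int failover_auth_sent;       /* Set once votes were actually requested. */
} election_state;

static int reset_stale_election(election_state *st, uint64_t seen_epoch) {
    if (st->failover_auth_time == 0) return 0;
    /* failover_auth_epoch keeps its default of 0 until the failover epoch is
     * obtained, so without this guard almost any claim would compare as
     * "equal or higher" and reset an election that has not really begun. */
    if (st->failover_auth_sent == 0) return 0;
    if (seen_epoch < st->failover_auth_epoch) return 0;
    st->failover_auth_time = 0;
    st->failover_auth_sent = 0;
    return 1;
}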
enjoy-binbin added a commit that referenced this pull request Dec 9, 2024
vudiep411 pushed a commit to Autxmaton/valkey that referenced this pull request Dec 15, 2024
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Feb 11, 2025
We may rely on auth_time to determine whether a failover is
in progress, like valkey-io#1009, so it is best to reset it.

Signed-off-by: Binbin <[email protected]>
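In terms of the earlier stand-in struct, the point is simply that any code path that concludes a failover attempt should clear the timestamp, because other logic (such as the reset added in #1009) reads a non-zero failover_auth_time as "election in progress". A tiny illustrative helper (hypothetical name):

/* Hypothetical cleanup helper: clear the election state when a failover
 * attempt concludes, so a stale failover_auth_time is not later misread
 * as an election still being in progress. */
static void election_finished(election_state *st) {
    st->failover_auth_time = 0;
    st->failover_auth_sent = 0;
}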
enjoy-binbin added a commit that referenced this pull request Feb 12, 2025
Labels
release-notes - This issue should get a line item in the release notes
run-extra-tests - Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)