
Fix replica not able to initiate election in time when epoch fails #1009

Merged
enjoy-binbin merged 10 commits into valkey-io:unstable from the epoch_timeout branch on Nov 11, 2024

Conversation

enjoy-binbin
Member

If multiple primary nodes go down at the same time, their replica nodes will
initiate elections at the same time. There is a certain probability that
the replicas will initiate their elections in the same epoch.

In our current election mechanism, only one replica node can eventually get
enough votes in that epoch; the other replica will fail to win due to the
lack of a majority, its election will time out, and we have to wait for the
retry, which results in a long failure time.

If another node has already won the election in our failover epoch, we can
assume that our election has failed and retry as soon as possible.
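For context, a minimal self-contained sketch of the idea (not the actual diff: election_state is a stand-in for the relevant fields of clusterState in src/cluster_legacy.c, and reset_stale_election is a hypothetical helper; the real patch applies the check where incoming claims are processed):

#include <stdint.h>

/* Stand-in for the election-related fields of clusterState
 * (src/cluster_legacy.c); only what this sketch needs. */
typedef struct {
    uint64_t failover_auth_epoch; /* Epoch our ongoing election runs in. */
    long long failover_auth_time; /* Election start time, 0 = no election. */
} election_state;

/* If another node already claimed an epoch equal to or higher than the one
 * our election runs in, our vote request can no longer reach a majority in
 * that epoch. Clearing the election state lets the next cron iteration start
 * a fresh attempt immediately instead of waiting for the election timeout.
 * Returns 1 if the election was reset. */
static int reset_stale_election(election_state *st, uint64_t seen_epoch) {
    if (st->failover_auth_time == 0) return 0;          /* No election running. */
    if (seen_epoch < st->failover_auth_epoch) return 0; /* Our epoch can still win. */
    st->failover_auth_time = 0;
    return 1;
}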


codecov bot commented Sep 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.73%. Comparing base (e972d56) to head (6291ed6).
Report is 6 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1009      +/-   ##
============================================
+ Coverage     70.70%   70.73%   +0.02%     
============================================
  Files           114      114              
  Lines         63147    63151       +4     
============================================
+ Hits          44648    44669      +21     
+ Misses        18499    18482      -17     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.43% <100.00%> (+0.20%) ⬆️

... and 11 files with indirect coverage changes

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Sep 10, 2024
Member

@PingXie PingXie left a comment


This is a great bug, @enjoy-binbin! The fix LGTM overall.

Member

@PingXie PingXie left a comment


LGTM!

@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Sep 24, 2024
@enjoy-binbin
Member Author

@madolson @zuiderkwast do you guys want to take a look at this?

Contributor

@zuiderkwast zuiderkwast left a comment


Fix looks good!

The PR title says "Optimize ..." but it is more than an optimization. Actually a bug fix? Please improve the title. :)

Co-authored-by: Viktor Söderqvist <[email protected]>
Signed-off-by: Binbin <[email protected]>
@enjoy-binbin
Member Author

The PR title says "Optimize ..." but it is more than an optimization. Actually a bug fix? Please improve the title. :)

I think it is indeed more of an optimization, making the election fail ASAP and retry ASAP. But it can also be considered a bug fix, maybe: Fix replica not able to initiate election in time when epoch fails?

@zuiderkwast
Contributor

The macOS job failed. It's probably not related to this PR. It's a fascinating crash log though:

                .+^+.                                                
            .+#########+.                                            
        .+########+########+.           Valkey 255.255.255 (125c71fe/0) 64 bit
    .+########+'     '+########+.                                    
 .########+'     .+.     '+########.    Running in standalone mode
 |####+'     .+#######+.     '+####|    Port: 21995
I/O error reading reply
 |###|   .+###############+.   |###|    PID: 30605                     
    while executing
 |###|   |#####*'' ''*#####|   |###|                                 
"$r set [expr rand()] [expr rand()]"
 |###|   |####'  .-.  '####|   |###|                                 
    (procedure "gen_write_load" line 8)
 |###|   |###(  (@@@)  )###|   |###|          https://valkey.io/      
    invoked from within
 |###|   |####.  '-'  .####|   |###|                                 
"gen_write_load [lindex $argv 0] [lindex $argv 1] [lindex $argv 2] [lindex $argv 3] [lindex $argv 4]"
 |###|   |#####*.   .*#####|   |###|                                 
    (file "tests/helpers/gen_write_load.tcl" line 24)I/O error reading reply
 |###|   '+#####|   |#####+'   |###|                                 
    while executing
 |####+.     +##|   |#+'     .+####|                                 
"$r set [expr rand()] [expr rand()]"
 '#######+   |##|        .+########'                                 
    (procedure "gen_write_load" line 8)
    '+###|   |##|    .+########+'                                    
    invoked from within
        '|   |####+########+'                                        
"gen_write_load [lindex $argv 0] [lindex $argv 1] [lindex $argv 2] [lindex $argv 3] [lindex $argv 4]"
             +#########+'                                            
    (file "tests/helpers/gen_write_load.tcl" line 24)
                '+v+'                                                


I/O error reading reply
30605:M 09 Nov 2024 14:12:32.606 # WARNING: The TCP backlog setting of 511 cannot be enforced because kern.ipc.somaxconn is set to the lower value of 128.
    while executing
30605:M 09 Nov 2024 14:12:32.607 * Server initialized
"$r set [expr rand()] [expr rand()]"
30605:M 09 Nov 2024 14:12:32.607 * Ready to accept connections tcp
    (procedure "gen_write_load" line 8)
30605:M 09 Nov 2024 14:12:32.607 * Ready to accept connections unix
    invoked from within
"gen_write_load [lindex $argv 0] [lindex $argv 1] [lindex $argv 2] [lindex $argv 3] [lindex $argv 4]"
    (file "tests/helpers/gen_write_load.tcl" line 24)
30605:M 09 Nov 2024 14:12:32.894 - Accepted 127.0.0.1:64082
30605:M 09 Nov 2024 14:12:32.894 - Client closed connection id=3 addr=127.0.0.1:64082 laddr=127.0.0.1:21995 fd=14 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16890 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=34176 events=r cmd=ping user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=7 tot-net-out=7 tot-cmds=1

@enjoy-binbin
Member Author

I'm trying to fix the test in #1288.

@enjoy-binbin enjoy-binbin changed the title Optimize cluster election when nodes initiate elections at the same epoch Fix replica not able to initiate election in time when epoch fails Nov 11, 2024
@enjoy-binbin enjoy-binbin merged commit a2d22c6 into valkey-io:unstable Nov 11, 2024
56 of 57 checks passed
@enjoy-binbin enjoy-binbin deleted the epoch_timeout branch November 11, 2024 14:12
zvi-code pushed a commit to zvi-code/valkey-z that referenced this pull request Nov 13, 2024
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Nov 22, 2024
After valkey-io#1009, we reset the election when we receive a claim
with an equal or higher epoch, since another node may already have
won an election in that epoch.

But we need to consider the window before the node actually
obtains its failover_auth_epoch. The failover_auth_epoch
defaults to 0, so before the node has actually obtained the
failover epoch, we might wrongly reset the election.

This is probably harmless, but it produces misleading log
output and may delay the election by a cron cycle or a beforeSleep.
Now we only reset the election once the node has actually
obtained the failover epoch.

Signed-off-by: Binbin <[email protected]>
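Building on the sketch shown after the PR description above, the extra guard described here would look roughly like this (again hypothetical code, with failover_auth_sent standing in for "the node has actually obtained its failover epoch"; the real change lives in src/cluster_legacy.c):

#include <stdint.h>

/* Same stand-in struct as before, extended with the flag that is set once
 * the replica has requested votes (and therefore has a real failover epoch). */
typedef struct {
    uint64_t failover_auth_epoch; /* Still 0 before the election really starts. */
    long long failover_auth_time; /* Election start time, 0 = no election. */
    int failover_auth_sent;       /* Set once votes were actually requested. */
} election_state;

static int reset_stale_election(election_state *st, uint64_t seen_epoch) {
    if (st->failover_auth_time == 0) return 0;
    /* failover_auth_epoch keeps its default of 0 until the failover epoch is
     * obtained, so without this guard almost any claim would compare as
     * "equal or higher" and reset an election that has not really begun. */
    if (st->failover_auth_sent == 0) return 0;
    if (seen_epoch < st->failover_auth_epoch) return 0;
    st->failover_auth_time = 0;
    st->failover_auth_sent = 0;
    return 1;
}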
enjoy-binbin added a commit that referenced this pull request Dec 9, 2024
vudiep411 pushed a commit to Autxmaton/valkey that referenced this pull request Dec 15, 2024
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Feb 11, 2025
We may rely on auth_time to determine whether a failover is
in progress, like valkey-io#1009, so it is best to reset it.

Signed-off-by: Binbin <[email protected]>
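In terms of the earlier stand-in struct, the point is simply that any code path that concludes a failover attempt should clear the timestamp, because other logic (such as the reset added in #1009) reads a non-zero failover_auth_time as "election in progress". A tiny illustrative helper (hypothetical name):

/* Hypothetical cleanup helper: clear the election state when a failover
 * attempt concludes, so a stale failover_auth_time is not later misread
 * as an election still being in progress. */
static void election_finished(election_state *st) {
    st->failover_auth_time = 0;
    st->failover_auth_sent = 0;
}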
enjoy-binbin added a commit that referenced this pull request Feb 12, 2025
Labels
release-notes - This issue should get a line item in the release notes
run-extra-tests - Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)