Fix crash where command duration is not reset when client is blocked … #526
LGTM. However, you need to sign off some commits.
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files

@@            Coverage Diff             @@
##           unstable     #526    +/-   ##
==========================================
- Coverage     70.22%   70.02%   -0.20%
==========================================
  Files           109      109
  Lines         59956    59914     -42
==========================================
- Hits          42104    41957    -147
- Misses        17852    17957    +105
The code change makes sense to me. I think we can merge if you could address the comments on the test code.
The assert though sounds harsh to me. I am not sure if the tradeoff (accurate metrics vs server crash) is justified but agreed it is a separate issue.
@PingXie, agreed that the assert is a bit harsh. When @madolson added it originally she did point out that it can potentially be removed in the future. I can remove it in this PR as it's been over a year. What do you think?
I think we should address the assert separately. If we remove that assert, this PR would become moot :-). As for the assert, I would like to hear the community's thoughts on the following:
|
I think there is a well-known difference between assert and panic. panic means "abort now no matter what", while assert is a conditional abort: "test, and abort if the condition is not met".
I strongly support introducing debug_assert in valkey. I think the logic of assert/panic is to abort when the program encounters a condition that prevents it from continuing to operate correctly or from recovering. I think it will also help improve stability, and we could be looser about adding debug_asserts, which will help identify bugs during daily runs.
I am not 100% sure that this should not become a debug_assert. I find value in what you propose, since it may be the result of a real consistency issue (?), but I still feel the purpose here was to catch issues during tests.
Is there really a reason to run non-debug daily tests? I agree performance tests should run "release" builds, but maybe we can run the daily only with debug builds in order to reduce the extra resource utilization?
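The three kinds of checks being debated can be sketched as follows. This is a toy Python model, not Valkey code: `panic`, `server_assert`, `debug_assert`, the `ServerAbort` exception, and the `DEBUG_BUILD` flag are all hypothetical names chosen for illustration.

```python
class ServerAbort(Exception):
    """Stands in for the process aborting; a real server would crash here."""


# Hypothetical switch: True for test/debug builds, False for release builds.
DEBUG_BUILD = True


def panic(message):
    # Unconditional abort: "abort now no matter what".
    raise ServerAbort(message)


def server_assert(condition, message):
    # Conditional abort that is always active, even in release builds.
    if not condition:
        panic(message)


def debug_assert(condition, message, debug_build=None):
    # Conditional abort that only fires when debug checks are enabled.
    enabled = DEBUG_BUILD if debug_build is None else debug_build
    if enabled and not condition:
        panic(message)
```

Under this model, a violated invariant such as a non-zero command duration would take the server down through `server_assert` in every build, but through `debug_assert` only in test runs.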
…on XREADGROUP and the stream's slot is migrated Signed-off-by: Nitai Caro <[email protected]>
I meant to say "conditional" panic, something like
I feel that we are talking about two different things. What you described here aligns with my definition of the
I agree with you (and I call
Yes I think running the daily with |
We don't use assert that way, so I don't know if it's worth arguing about what
At the time I was also advocating for the addition |
I understand Redis/Valkey never used "asserts" in that way, which is unconventional IMO, but I agree the debate on semantics is sufficient at this moment. What I'm trying to say is that with today's code base we either have to take down the server on a non-critical invariant violation like this one, or we have no way to detect it at all, and we can do better.
"assert" that only fires during tests would totally work for me. Do you have an issue on it already?
Agreed |
This already exists today, it's |
… of cli Signed-off-by: Nitai Caro <[email protected]>
Just a minor comment; LGTM otherwise.
valkey-io#526)

In #11012, we changed the way command durations were computed to handle the same command being executed multiple times. In #11970, we added an assert if the duration is not properly reset, potentially indicating that a call to report statistics was missed.

I found an edge case where this happens, easily reproduced by blocking a client on `XREADGROUP` and migrating the stream's slot. This causes the engine to process the `XREADGROUP` command twice:

1. First time, we are blocked on the stream, so we wait for unblock to come back to it a second time. In most cases, when we come back to process the command a second time after unblock, we process the command normally, which includes recording the duration and then resetting it.
2. After unblocking we come back to process the command, and this is where we hit the edge case: at this point, we had already migrated the slot to another node, so we return a `MOVED` response. But when we do that, we don’t reset the duration field.

Fix: also reset the duration when returning a `MOVED` response. I think this is right, because the client should redirect the command to the right node, which in turn will calculate the execution duration.

Also wrote a test which reproduces this; it fails without the fix and passes with it.

---------

Signed-off-by: Nitai Caro <[email protected]>
Co-authored-by: Nitai Caro <[email protected]>
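The edge case and the fix described above can be illustrated with a small simulation. This is a toy Python model written for this explanation, not the Valkey implementation: `Client`, `ToyServer`, `process`, `reset_on_moved`, and `run_scenario` are hypothetical names, and the point at which the real server checks the duration is an assumption of the model.

```python
class Client:
    def __init__(self):
        self.duration = 0     # time accumulated by the command in flight
        self.blocked = False


class ToyServer:
    def __init__(self, reset_on_moved):
        self.reset_on_moved = reset_on_moved  # True models the fix
        self.owns_slot = True
        self.reported = []                    # durations sent to statistics

    def process(self, client, elapsed, data_ready):
        reprocessing = client.blocked
        if not reprocessing and client.duration != 0:
            # Models the assert: a fresh command must start from zero.
            raise AssertionError("duration was not reset")
        client.blocked = False
        if not self.owns_slot:
            # Redirect path: the command is rejected before it executes.
            if self.reset_on_moved:
                client.duration = 0
            return "MOVED"
        client.duration += elapsed
        if not data_ready:
            client.blocked = True             # wait to be unblocked
            return "BLOCKED"
        self.reported.append(client.duration) # record, then reset
        client.duration = 0
        return "OK"


def run_scenario(reset_on_moved):
    server, client = ToyServer(reset_on_moved), Client()
    replies = [server.process(client, 5, data_ready=False)]     # client blocks
    server.owns_slot = False                                    # slot migrates away
    replies.append(server.process(client, 1, data_ready=True))  # unblocked, redirected
    server.owns_slot = True   # assumption: the next command targets an owned slot
    try:
        replies.append(server.process(client, 1, data_ready=True))
    except AssertionError:
        replies.append("CRASH")
    return replies
```

In this model, `run_scenario(False)` ends with the assert firing on the client's next command, while `run_scenario(True)` lets that command complete and report its own duration.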