
Fix crash where command duration is not reset when client is blocked … #526

Merged
9 commits merged on May 30, 2024

Conversation

@nitaicaro (Contributor) commented May 21, 2024

In #11012, we changed the way command durations were computed to handle the same command being executed multiple times. In #11970, we added an assert if the duration is not properly reset, potentially indicating that a call to report statistics was missed.

I found an edge case where this happens, easily reproduced by blocking a client on `XREADGROUP` and migrating the stream's slot. This causes the engine to process the `XREADGROUP` command twice:

  1. The first time, the client blocks on the stream, so the engine waits for an unblock in order to come back to the command a second time. In most cases, when the command is reprocessed after the unblock, it is handled normally, which includes recording the duration and then resetting it.
  2. After unblocking, we come back to process the command, and this is where we hit the edge case: at this point, the slot has already been migrated to another node, so we return a `MOVED` response. But when we do that, we don't reset the duration field.

Fix: also reset the duration when returning a `MOVED` response. I think this is right, because the client should redirect the command to the right node, which will in turn calculate the execution duration.

I also wrote a test that reproduces the issue; it fails without the fix and passes with it.

@ranshid ranshid self-requested a review May 21, 2024 15:13
@nitaicaro nitaicaro force-pushed the unstable branch 2 times, most recently from 1e9435d to 0b1e1e3 Compare May 27, 2024 11:30
@ranshid (Member) left a comment

LGTM, but you need to sign off some commits.


codecov bot commented May 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.02%. Comparing base (168da8b) to head (ec15779).
Report is 2 commits behind head on unstable.

Current head ec15779 differs from pull request most recent head 9ef9b91

Please upload reports for the commit 9ef9b91 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #526      +/-   ##
============================================
- Coverage     70.22%   70.02%   -0.20%     
============================================
  Files           109      109              
  Lines         59956    59914      -42     
============================================
- Hits          42104    41957     -147     
- Misses        17852    17957     +105     
| Files | Coverage Δ |
|---|---|
| src/cluster.c | 86.60% <100.00%> (+0.01%) ⬆️ |

... and 16 files with indirect coverage changes

@PingXie (Member) left a comment

The code change makes sense to me. I think we can merge if you could address the comments on the test code.

The assert, though, sounds harsh to me. I am not sure the tradeoff (accurate metrics vs. a server crash) is justified, but I agree it is a separate issue.

@nitaicaro (Contributor, Author)

@PingXie, agreed that the assert is a bit harsh. When @madolson added it originally she did point out that it can potentially be removed in the future. I can remove it in this PR as it's been over a year, what do you think?

@PingXie (Member) commented May 28, 2024

> @PingXie, agreed that the assert is a bit harsh. When @madolson added it originally she did point out that it can potentially be removed in the future. I can remove it in this PR as it's been over a year, what do you think?

I think we should address the assert separately. If we remove that assert, this PR would become moot :-).

As for the assert, I would like to hear the community's thoughts on the following:

  1. Rename assert to panic: assert in Redis/Valkey is a misnomer; it is actually a panic.
  2. Introduce debug builds and real asserts, which exist in debug builds only.
  3. Convert this assert (c->duration == 0) to a (real) runtime assert; there is still value in checking this condition.
  4. Add daily test runs against debug builds as well (in addition to the current "release" builds).

@ranshid (Member) commented May 28, 2024

> @PingXie, agreed that the assert is a bit harsh. When @madolson added it originally she did point out that it can potentially be removed in the future. I can remove it in this PR as it's been over a year, what do you think?

> I think we should address the assert separately. If we remove that assert, this PR would become moot :-).

> As for the assert, I would like to hear the community's thoughts on the following:
>
> 1. rename assert to panic - assert in Redis/Valkey is a misnomer; it is actually panic.

I think the difference between assert and panic is well known: panic means "abort now, no matter what", while assert is a conditional abort, as in "test and abort if the condition is not met".

> 2. introduce debug builds and real asserts, which exist in debug builds only;

I strongly support introducing debug_assert in Valkey. I think the rationale for assert/panic is to abort when the program encounters a condition that it cannot fix and that prevents it from continuing to operate correctly. I think it will also help improve stability, and we could be more liberal about adding debug_asserts, which would help identify bugs during daily runs.

> 3. convert this assert (c->duration == 0) to a (real) runtime assert - there is still value in checking this condition

I am not 100% sure that this should not become a debug_assert. I see value in what you propose, since it may be the result of a real consistency issue (?), but I still feel the purpose here was to catch issues during tests.

> 4. add daily test runs against debug builds as well (in addition to the current "release" builds)

Is there really a reason to run non-debug daily tests? I agree performance tests should run "release" builds, but maybe we can just run the daily with debug builds, to avoid the extra resource utilization?

Nitai Caro added 4 commits May 28, 2024 10:15
…on XREADGROUP and the stream's slot is migrated

Signed-off-by: Nitai Caro <[email protected]>
Signed-off-by: Nitai Caro <[email protected]>
Signed-off-by: Nitai Caro <[email protected]>
@PingXie (Member) commented May 28, 2024

> I think that it is a well known difference between assert and panic. panic means "abort now no matter what" and assert is a conditional abort like "test and abort if condition is not met"

I meant to say a "conditional" panic, something like `serverPanicIf(cond)`. assert is not supposed to take down a running process/system.

> I strongly support introducing debug_assert in valkey. I think the logic of assert/panic is to abort in case the program encounter a condition which prevents it from continue to operate correctly and/or fix. I think it will also help improve the stability and we might be more lose with adding debug_asserts which will help identify bugs during daily runs.

I feel that we are talking about two different things. What you described here aligns with my definition of panic behavior: the violation of the invariant has severe consequences, such as data loss, if it does not hold. The violation of c->duration == 0 does not fit this bill. However, there is still value in making sure we don't just generate random metrics, and this is why I think assert is the right tool: it is only active in special debug builds that are not used in any prod env.

> I am not 100% sure that this should not become a debug_assert. I find value in what you propose since it maybe a result of a real consistency issue (?) but I still feel the purpose here was to catch issues during tests.

I agree with you (and I call debug_assert serverPanicIf). That said, I still think we should try our best to catch this invariant violation (using assert).

> Is there really a reason to run non debug daily tests? I agree performance tests should run "release" builds but maybe we can just run the daily with debug builds in order to reduce the extra resource utilization?

Yes, I think running the daily with debug builds makes sense. Performance tests can run weekly, or on demand right after any PR marked "performance enhancement".

@madolson (Member) commented May 28, 2024

> I meant to say "conditional" panic, something like serverPanicIf(cond). assert is not supposed to take down a running process/system.

We don't use assert that way, so I don't know if it's worth arguing about what assert should and shouldn't do. BTW, we don't use assert anywhere in Valkey; it is always replaced by serverAssert(), which is always active in production.

> I can remove it in this PR as it's been over a year, what do you think?

At the time I was also advocating for the addition of debugServerAssertWithInfo, which more closely fits what we were looking for (specifically, what Ping refers to as assert), but we hadn't committed to it at the time. It fills the role of "only assert during tests". I think we can move from a regular serverAssert to the debug variant. We should also reset the duration to zero, if we aren't already, to make sure we handle production cases gracefully.

@PingXie (Member) commented May 28, 2024

> We don't use assert that way, so I don't know if it's worth arguing about what assert should and shouldn't do. BTW, we don't use assert anywhere in Valkey, it is always replaced by serverAssert() which is always active in production.

I understand Redis/Valkey never used "asserts" in that way, which is unconventional IMO, but I agree we can set the semantics debate aside at this moment. What I'm trying to say is that with today's code base we either have to take down the server on a non-critical invariant violation like this one, or have no way to detect it at all, and we can do better.

> At the time I was also advocating for the addition debugServerAssertWithInfo which more closely fits what we were looking for (specifically what Ping refers to with assert), but hadn't committed to at the time. It fills the role of "only assert during tests". I think we can move from a regular serverAssert to the debug variant.

"assert" that only fires during tests would totally work for me. Do you have an issue on it already?

> We should also reset the duration to zero if we aren't already, to make sure we are gracefully handling production cases.

Agreed

@madolson (Member)

> "assert" that only fires during tests would totally work for me. Do you have an issue on it already?

This already exists today: it's `debugServerAssertWithInfo`. By default `debugServerAssertWithInfo` is compiled out, but it's compiled in for some tests. See `-DDEBUG_ASSERTIONS`, e.g. https://github.com/valkey-io/valkey/blob/fd58b73f0ae895bf9de3810d799da20bb75a2b4f/.github/workflows/daily.yml#L601C102-L601C120.

@madolson madolson added the release-notes This issue should get a line item in the release notes label May 30, 2024
@madolson (Member) left a comment

Just a minor comment; otherwise LGTM.

Nitai Caro added 2 commits May 30, 2024 15:26
Signed-off-by: Nitai Caro <[email protected]>
@madolson madolson added the bug Something isn't working label May 30, 2024
@madolson madolson merged commit 6fb90ad into valkey-io:unstable May 30, 2024
16 checks passed
naglera pushed a commit to naglera/placeholderkv that referenced this pull request Jun 10, 2024
PingXie pushed a commit to PingXie/valkey that referenced this pull request Jul 8, 2024
PingXie pushed a commit to PingXie/valkey that referenced this pull request Jul 9, 2024
PingXie pushed a commit to PingXie/valkey that referenced this pull request Jul 9, 2024
PingXie pushed a commit that referenced this pull request Jul 10, 2024
Labels: bug (Something isn't working), release-notes (This issue should get a line item in the release notes)
Project status: Backported
4 participants