
Fix crash where command duration is not reset when client is blocked … #526

Merged
9 commits merged on May 30, 2024

Conversation

@nitaicaro (Contributor) commented May 21, 2024

In #11012, we changed the way command durations were computed to handle the same command being executed multiple times. In #11970, we added an assert if the duration is not properly reset, potentially indicating that a call to report statistics was missed.

I found an edge case where this happens, easily reproduced by blocking a client on `XREADGROUP` and migrating the stream's slot. This causes the engine to process the `XREADGROUP` command twice:

  1. The first time, the client blocks on the stream, so the engine waits for an unblock in order to come back to the command a second time. In most cases, when the command is reprocessed after the unblock, it is handled normally, which includes recording the duration and then resetting it.
  2. After unblocking, we come back to process the command, and this is where we hit the edge case: at this point, the slot has already been migrated to another node, so we return a `MOVED` response. But when we do that, we don't reset the duration field.

Fix: also reset the duration when returning a `MOVED` response. I think this is right, because the client should redirect the command to the right node, which will in turn calculate the execution duration.

I also wrote a test that reproduces the issue; it fails without the fix and passes with it.

@ranshid ranshid self-requested a review May 21, 2024 15:13
@nitaicaro nitaicaro force-pushed the unstable branch 2 times, most recently from 1e9435d to 0b1e1e3 Compare May 27, 2024 11:30
@ranshid (Member) left a comment

LGTM, but you need to sign off some commits.


codecov bot commented May 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.02%. Comparing base (168da8b) to head (ec15779).
Report is 2 commits behind head on unstable.

Current head ec15779 differs from pull request most recent head 9ef9b91

Please upload reports for the commit 9ef9b91 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #526      +/-   ##
============================================
- Coverage     70.22%   70.02%   -0.20%     
============================================
  Files           109      109              
  Lines         59956    59914      -42     
============================================
- Hits          42104    41957     -147     
- Misses        17852    17957     +105     
| Files | Coverage Δ |
|---|---|
| src/cluster.c | 86.60% <100.00%> (+0.01%) ⬆️ |

... and 16 files with indirect coverage changes

@PingXie (Member) left a comment

The code change makes sense to me. I think we can merge if you could address the comments on the test code.

The assert, though, sounds harsh to me. I am not sure the tradeoff (accurate metrics vs. a server crash) is justified, but I agree it is a separate issue.

@nitaicaro (Contributor, Author)

@PingXie, agreed that the assert is a bit harsh. When @madolson added it originally she did point out that it can potentially be removed in the future. I can remove it in this PR as it's been over a year, what do you think?

@PingXie (Member) commented May 28, 2024

> @PingXie, agreed that the assert is a bit harsh. When @madolson added it originally she did point out that it can potentially be removed in the future. I can remove it in this PR as it's been over a year, what do you think?

I think we should address the assert separately. If we remove that assert, this PR would become moot :-).

As for the assert, I would like to hear the community's thoughts on the following:

  1. Rename assert to panic: assert in Redis/Valkey is a misnomer; it is actually a panic.
  2. Introduce debug builds and real asserts, which exist in debug builds only.
  3. Convert this assert (c->duration == 0) to a (real) runtime assert; there is still value in checking this condition.
  4. Add daily test runs against debug builds as well (in addition to the current "release" builds).

@ranshid (Member) commented May 28, 2024

> @PingXie, agreed that the assert is a bit harsh. When @madolson added it originally she did point out that it can potentially be removed in the future. I can remove it in this PR as it's been over a year, what do you think?

> I think we should address the assert separately. If we remove that assert, this PR would become moot :-).

> As for the assert, I would like to hear the community's thoughts on the following:
>
> 1. rename assert to panic - assert in Redis/Valkey is a misnomer; it is actually panic.

I think the difference between assert and panic is well known: panic means "abort now, no matter what", while assert is a conditional abort, as in "test and abort if the condition is not met".

> 2. introduce debug builds and real asserts, which exist in debug builds only;

I strongly support introducing debug_assert in Valkey. I think the rationale for assert/panic is to abort when the program encounters a condition that it cannot fix and that prevents it from continuing to operate correctly. I think it will also help improve stability, and we could be more liberal about adding debug_asserts, which would help identify bugs during daily runs.

> 3. convert this assert (c->duration == 0) to a (real) runtime assert - there is still value in checking this condition

I am not 100% sure that this should not become a debug_assert. I see value in what you propose, since it may be the result of a real consistency issue (?), but I still feel the purpose here was to catch issues during tests.

> 4. add daily test runs against debug builds as well (in addition to the current "release" builds)

Is there really a reason to run non-debug daily tests? I agree performance tests should run "release" builds, but maybe we can just run the daily with debug builds, to avoid the extra resource utilization?

Nitai Caro added 4 commits May 28, 2024 10:15
…on XREADGROUP and the stream's slot is migrated

Signed-off-by: Nitai Caro <[email protected]>
Signed-off-by: Nitai Caro <[email protected]>
Signed-off-by: Nitai Caro <[email protected]>
@PingXie (Member) commented May 28, 2024

> I think that it is a well known difference between assert and panic. panic means "abort now no matter what" and assert is a conditional abort like "test and abort if condition is not met"

I meant to say a "conditional" panic, something like `serverPanicIf(cond)`. assert is not supposed to take down a running process/system.

> I strongly support introducing debug_assert in valkey. I think the logic of assert/panic is to abort in case the program encounter a condition which prevents it from continue to operate correctly and/or fix. I think it will also help improve the stability and we might be more lose with adding debug_asserts which will help identify bugs during daily runs.

I feel that we are talking about two different things. What you described here aligns with my definition of panic behavior: the violation of the invariant has severe consequences, such as data loss, if it does not hold. The violation of c->duration == 0 does not fit this bill. However, there is still value in making sure we don't just generate random metrics, and this is why I think assert is the right tool: it is only active in special debug builds that are not used in any prod env.

> I am not 100% sure that this should not become a debug_assert. I find value in what you propose since it maybe a result of a real consistency issue (?) but I still feel the purpose here was to catch issues during tests.

I agree with you (and I call debug_assert serverPanicIf). That said, I still think we should try our best to catch this invariant violation (using assert).

> Is there really a reason to run non debug daily tests? I agree performance tests should run "release" builds but maybe we can just run the daily with debug builds in order to reduce the extra resource utilization?

Yes, I think running the daily with debug builds makes sense. Performance tests can run weekly, or on demand right after any PR marked "performance enhancement".

@madolson (Member) commented May 28, 2024

> I meant to say "conditional" panic, something like serverPanicIf(cond). assert is not supposed to take down a running process/system.

We don't use assert that way, so I don't know if it's worth arguing about what assert should and shouldn't do. BTW, we don't use assert anywhere in Valkey; it is always replaced by serverAssert(), which is always active in production.

> I can remove it in this PR as it's been over a year, what do you think?

At the time I was also advocating for the addition of debugServerAssertWithInfo, which more closely fits what we were looking for (specifically, what Ping refers to as assert), but we hadn't committed to it at the time. It fills the role of "only assert during tests". I think we can move from a regular serverAssert to the debug variant. We should also reset the duration to zero, if we aren't already, to make sure we handle production cases gracefully.

@PingXie (Member) commented May 28, 2024

> We don't use assert that way, so I don't know if it's worth arguing about what assert should and shouldn't do. BTW, we don't use assert anywhere in Valkey, it is always replaced by serverAssert() which is always active in production.

I understand Redis/Valkey never used "asserts" in that way, which is unconventional IMO, but I agree we can set the semantics debate aside at this moment. What I'm trying to say is that with today's code base we either have to take down the server on a non-critical invariant violation like this one, or have no way to detect it at all, and we can do better.

> At the time I was also advocating for the addition debugServerAssertWithInfo which more closely fits what we were looking for (specifically what Ping refers to with assert), but hadn't committed to at the time. It fills the role of "only assert during tests". I think we can move from a regular serverAssert to the debug variant.

"assert" that only fires during tests would totally work for me. Do you have an issue on it already?

> We should also reset the duration to zero if we aren't already, to make sure we are gracefully handling production cases.

Agreed

@madolson (Member)

> "assert" that only fires during tests would totally work for me. Do you have an issue on it already?

This already exists today: it's `debugServerAssertWithInfo`. By default `debugServerAssertWithInfo` is compiled out, but it's compiled in for some tests. See `-DDEBUG_ASSERTIONS`, e.g. https://github.com/valkey-io/valkey/blob/fd58b73f0ae895bf9de3810d799da20bb75a2b4f/.github/workflows/daily.yml#L601C102-L601C120.

@madolson madolson added the release-notes This issue should get a line item in the release notes label May 30, 2024
@madolson (Member) left a comment

Just a minor comment; otherwise LGTM.

Nitai Caro added 2 commits May 30, 2024 15:26
Signed-off-by: Nitai Caro <[email protected]>
@madolson madolson added the bug Something isn't working label May 30, 2024
@madolson madolson merged commit 6fb90ad into valkey-io:unstable May 30, 2024
16 checks passed
naglera pushed a commit to naglera/placeholderkv that referenced this pull request Jun 10, 2024
PingXie pushed a commit to PingXie/valkey that referenced this pull request Jul 8, 2024
PingXie pushed a commit to PingXie/valkey that referenced this pull request Jul 9, 2024
PingXie pushed a commit to PingXie/valkey that referenced this pull request Jul 9, 2024
PingXie pushed a commit that referenced this pull request Jul 10, 2024
Labels: bug (Something isn't working), release-notes (This issue should get a line item in the release notes)
Project status: Backported
4 participants