Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
What happened?
We observe a slow negative trend on the etcd_debugging_mvcc_watcher_total gauge, implying there are unpaired increment/decrement operations.
What did you expect to happen?
Logically the active watcher gauge should never be negative.
How can we reproduce it (as minimally and precisely as possible)?
I think this issue is very closely related to this PR, which previously attempted to address negative watch counts from double decrements, but this report specifically covers the case where wa.compacted is true and there is a cancel/close race. It is not easy to reproduce because I believe it is a race condition, but the scenario can be provoked by contriving a unit test.
Consider for example adding this test to mvcc/watcher_test.go:
I provoke the wa.compacted state by setting it directly here. I'm not sure whether, in practice, the act of compaction might somehow change this scenario, but from reading the code I struggle to see how a double decrement could occur on any other branch, so I surmise this must be the state we're in.
It seems as though the wa.ch == nil case is designed to prevent re-processing a cancelled watcher, and both the if s.synced.delete and if s.unsynced.delete branches are "idempotent": once the watch is deleted, subsequent invocations return false. However, the wa.compacted check is not idempotent, and because it is placed before the wa.ch == nil case, it seems it can be reached on a subsequent invocation of cancelWatch.
Possibly a fix would be to move the "has this already been cancelled" check higher, e.g.:
  } else if s.synced.delete(wa) {
      watcherGauge.Dec()
      break
- } else if wa.compacted {
-     watcherGauge.Dec()
-     break
  } else if wa.ch == nil {
      // already canceled (e.g., cancel/close race)
      break
+ } else if wa.compacted {
+     watcherGauge.Dec()
+     break
  }
This would avoid performing the decrement twice.
Discussed during our bi-weekly triage meeting. Thanks, @kjgorman for raising this issue. This looks like a valid issue. Given that you have invested in tracing where the possible culprit is, and wrote a test case, would you be interested in opening a pull request addressing it? Thanks.
In addition to the unit test in mvcc/watcher_test.go, to provoke the race add a delay in the unlocked portion of watcher.go's Cancel(id WatchID) method. It is then possible to observe the following sequence:
1. One caller invokes the Cancel method.
2. A concurrent caller invokes the Close method which, with our specifically placed Sleeps, runs while Cancel is paused.
3. Close will find a synced watcher, delete it, and decrement the counter in this branch.
4. Cancel wakes up and resumes, and invokes the cancelFunc again.
5. This second invocation hits the wa.compacted branch and decrements again.
Anything else we need to know?
No response
Etcd version (please run commands below)
Etcd configuration (command line flags or environment variables)
No response
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
No response
Relevant log output