eth/catalyst: avoid a race in the SimulatedBeacon Stop call #31328

eljobe · 2025-03-06T10:57:30Z

eth/catalyst/simulated_beacon_api.go

This allows the inner loop in the case of "on-demand" commits to bail out if the txPool has been terminated before the doCommit channel is closed. This has been tested with a known-flaky test which, before this commit, was spinlooping and logging unfathomable reams of warning messages. Now, when the race occurs, only a single instance of the warning is logged.

eljobe · 2025-03-07T12:19:14Z

The irony of requiring two commits instead of one to introduce a function called fallibleCommit is not lost on me.

jwasinger · 2025-03-07T14:15:40Z

So the thing is... a failure from sealBlock doesn't necessarily indicate that the client is closed/closing. For example, if we rewind via debug_setHead concurrently while trying to commit, we can fail attempting to build a payload.

eljobe · 2025-03-07T14:24:04Z

Well-spotted. So, this makes me want to go back to calling Sync() because that's the code that actually can only error when the txPool has already terminated.

Another option would be to create a special error type to be returned from Sync() when the pool is already terminated, then, conditionally break from the spinloop only if the error from fallibleCommit is of that new type. Sound like a plan? If so, I'll get to coding.

jwasinger · 2025-03-07T14:33:31Z

Let's just use a call to Sync. That should be fine.

jwasinger · 2025-03-07T14:40:29Z

actually, I would:

- create a function commit which returns (common.Hash, error) (same as fallibleCommit but less of a mouthful)
- return a special package-private error type if the invocation to Sync within returns an error (I think Sync only errors if the pool is terminated)
- if the result of the commit indicates that the pool terminated, use that as the break condition for the loop.

This way, we only break the spinloop when it is definitely because the transaction pool has already been terminated.
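
The steps above can be sketched roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the code from this PR: `fakePool`, `buildFailures`, and `drain` are hypothetical stand-ins, a plain `string` replaces `common.Hash`, and only the name `errTxPoolTerminated` and the idea of wrapping the `Sync` error come from the discussion.

```go
package main

import (
	"errors"
	"fmt"
)

// errTxPoolTerminated marks a commit failure caused by a terminated pool.
// The name comes from the discussion; the struct shape is an assumption.
type errTxPoolTerminated struct{ err error }

func (e *errTxPoolTerminated) Error() string { return e.err.Error() }
func (e *errTxPoolTerminated) Unwrap() error { return e.err }

// fakePool is a hypothetical stand-in for the transaction pool.
type fakePool struct {
	terminated    bool
	pending       int
	buildFailures int // transient payload-build failures still to simulate
}

// Sync is assumed to fail only when the pool has been terminated.
func (p *fakePool) Sync() error {
	if p.terminated {
		return errors.New("pool already terminated")
	}
	return nil
}

// commit seals one simulated block. Only a Sync failure is wrapped in the
// special type; any other failure is returned as an ordinary error.
func commit(p *fakePool) (string, error) {
	if err := p.Sync(); err != nil {
		return "", &errTxPoolTerminated{fmt.Errorf("failed to sync txpool: %w", err)}
	}
	if p.buildFailures > 0 {
		p.buildFailures--
		return "", errors.New("failed to build payload")
	}
	p.pending--
	return fmt.Sprintf("block with %d txs left", p.pending), nil
}

// drain keeps committing while work is pending. It stops early only when
// the error chain contains errTxPoolTerminated; other errors are retried.
func drain(p *fakePool, maxAttempts int) (attempts int, stopped bool) {
	for p.pending > 0 && attempts < maxAttempts {
		attempts++
		_, err := commit(p)
		var term *errTxPoolTerminated
		if errors.As(err, &term) {
			return attempts, true
		}
	}
	return attempts, false
}

func main() {
	attempts, stopped := drain(&fakePool{pending: 3, buildFailures: 2}, 10)
	fmt.Println(attempts, stopped) // 5 false

	attempts, stopped = drain(&fakePool{pending: 3, terminated: true}, 10)
	fmt.Println(attempts, stopped) // 1 true
}
```

In the first call the two transient build failures are retried and all three pending items are committed; in the second call the loop exits after a single attempt because the pool is already terminated.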

eljobe · 2025-03-07T16:49:37Z

Okay. Now, it's doing what you suggested. Thanks for the advice.

eth/catalyst/simulated_beacon.go

This addresses some review feedback.

eth/catalyst/simulated_beacon_api.go

MariusVanDerWijden · 2025-03-10T10:34:29Z

eth/catalyst/simulated_beacon.go

@@ -181,7 +192,7 @@ func (c *SimulatedBeacon) sealBlock(withdrawals []*types.Withdrawal, timestamp u
 	// behavior, the pool will be explicitly blocked on its reset before
 	// continuing to the block production below.
 	if err := c.eth.APIBackend.TxPool().Sync(); err != nil {
-		return fmt.Errorf("failed to sync txpool: %w", err)
+		return &errTxPoolTerminated{fmt.Errorf("failed to sync txpool: %w", err)}


I would just turn it into `type errTxPoolTerminated error`; there is not really a need for wrapping another error.

I fear that I have not clearly understood this suggestion. But, I did something to move the "failed to sync txpool:" part of the message into the errTxPoolTerminated.Error implementation.
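
For illustration, moving the message prefix into the `Error` method could look roughly like this. It is a sketch under assumptions, not the code that was pushed: the `cause` field, the `Unwrap` method, and the `newTerminated` helper are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errTxPoolTerminated holds the underlying Sync error. The message prefix
// lives in Error instead of at the call site (assumed shape).
type errTxPoolTerminated struct{ cause error }

func (e *errTxPoolTerminated) Error() string {
	return fmt.Sprintf("failed to sync txpool: %v", e.cause)
}

// Unwrap keeps the underlying error reachable through errors.Is/errors.As.
func (e *errTxPoolTerminated) Unwrap() error { return e.cause }

// newTerminated is a hypothetical helper used only for this example.
func newTerminated(msg string) error {
	return &errTxPoolTerminated{cause: errors.New(msg)}
}

func main() {
	err := newTerminated("pool already terminated")
	fmt.Println(err) // failed to sync txpool: pool already terminated
}
```

With this shape the call site shrinks to something like `return &errTxPoolTerminated{err}`, since the formatting is owned by the type.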

jwasinger · 2025-03-10T10:47:31Z

I think the general approach here is fine, and we should merge this (after cleaning up the error checking, RE comments from Marius). But just a thought: maybe a cleaner solution that accomplishes the same thing would be to add a member method Closed that returns if the txpool has been closed, and we poll that instead of checking for a specific error returned when committing.

Use `errors.Is` instead of `errors.As` since we don't need to access special fields in the custom error type. Rather than generating a new error to wrap, just capture the existing error.
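
One way to read this suggestion is a package-private sentinel value checked with `errors.Is`. The sketch below assumes that reading; `syncPool`, `commit`, and `shouldStop` are hypothetical names, and the sentinel's message text is invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errTxPoolTerminated is a sentinel value here rather than a struct type
// (an assumption about what the suggestion intends).
var errTxPoolTerminated = errors.New("txpool terminated")

// syncPool stands in for a call to Sync that fails only after termination.
func syncPool(terminated bool) error {
	if terminated {
		return errors.New("pool already terminated")
	}
	return nil
}

// commit tags a Sync failure with the sentinel via %w and keeps the text
// of the existing error via %v.
func commit(terminated bool) error {
	if err := syncPool(terminated); err != nil {
		return fmt.Errorf("%w: %v", errTxPoolTerminated, err)
	}
	return nil
}

// shouldStop reports whether the loop should break. errors.Is suffices
// because no fields of a custom error type are needed.
func shouldStop(err error) bool {
	return errors.Is(err, errTxPoolTerminated)
}

func main() {
	fmt.Println(shouldStop(commit(true)))  // true
	fmt.Println(shouldStop(commit(false))) // false
}
```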

eljobe · 2025-03-10T11:55:49Z

> I think the general approach here is fine, and we should merge this (after cleaning up the error checking, RE comments from Marius). But just a thought: maybe a cleaner solution that accomplishes the same thing would be to add a member method Closed that returns if the txpool has been closed, and we poll that instead of checking for a specific error returned when committing.

I'm definitely open to switching to that implementation instead. But, I "slightly" prefer the existing one. I think library APIs are best when you don't need to remember to make multiple calls to the library to get the information the calling code needs to make progress. With the error-handling approach, it is clear that the work we're trying to do is commit and that there is a special error condition of committing in which we want to stop. I find it slightly less clear if we check TxPool.Closed() or !TxPool.Running() in each pass through the inner loop. Because it feels like this is somehow a completely separate state that can transition without affecting the outcome of the call to commit (which it isn't.)
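
For comparison, the polling alternative discussed here might look like the sketch below. Everything in it is hypothetical: no `Closed` method is shown anywhere in this thread, and `fakePool`, `newFakePool`, and `drainPolling` exist only for illustration. The `closeAfter` parameter simulates a concurrent Stop arriving after that many commits.

```go
package main

import "fmt"

// fakePool is a hypothetical pool whose termination is signalled by
// closing a channel.
type fakePool struct{ term chan struct{} }

func newFakePool() *fakePool { return &fakePool{term: make(chan struct{})} }

// Close terminates the pool. It must be called at most once.
func (p *fakePool) Close() { close(p.term) }

// Closed reports whether the pool has been terminated, without blocking.
func (p *fakePool) Closed() bool {
	select {
	case <-p.term:
		return true
	default:
		return false
	}
}

// drainPolling checks Closed on every pass before doing any work. The
// state check and the commit are two separate steps, which is the
// readability concern raised above.
func drainPolling(p *fakePool, pending, closeAfter int) int {
	commits := 0
	for pending > 0 {
		if p.Closed() {
			break
		}
		pending--
		commits++
		if commits == closeAfter {
			p.Close() // simulate Stop racing with the loop
		}
	}
	return commits
}

func main() {
	fmt.Println(drainPolling(newFakePool(), 5, 2)) // 2
	fmt.Println(drainPolling(newFakePool(), 3, 0)) // 3
}
```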

Avoid a race in the SimulatedBeacon Stop call

8d2920b

Fixes: ethereum#31327

eljobe requested review from MariusVanDerWijden, lightclient, fjl and jwasinger as code owners March 6, 2025 10:57

jwasinger reviewed Mar 7, 2025

fjl changed the title ~~Avoid a race in the SimulatedBeacon Stop call~~ eth/catalyst: avoid a race in the SimulatedBeacon Stop call Mar 7, 2025

eljobe added 2 commits March 7, 2025 13:13

Also commit the call to fallibleCommit

f6d5b76

Try using a custom error type

71d4441

This way, we only break the spinloop when it is definitely because the transaction pool has already been terminated.