
[NEW] QoS of ae-loop #1789

Open
artikell opened this issue Feb 27, 2025 · 3 comments

@artikell
Contributor

The problem/use-case that the feature addresses

In high-concurrency scenarios, Valkey's AE event handling mechanism may process some events with high delay, so operations that require high real-time performance (such as primary-replica replication and serverCron tasks) cannot respond in a timely manner.

In particular, when a large number of low-priority events occupy the event loop, high-priority AE events may be blocked for a long time, degrading the overall performance and user experience of the system.

Description of the feature

Introduce an event priority management mechanism in Valkey's AE event processing module, supporting the configuration of a weight for each AE event. Treat core primary-replica synchronization tasks and serverCron tasks as high-priority tasks to guarantee their execution within a bounded period of time.

Alternatives you've considered

Rate limiting, CPU throttling.

Additional information

This proposal has been discussed elsewhere as well. The proposal comes from: @xbasel @PingXie

@xbasel
Member

xbasel commented Feb 27, 2025

This is a good idea, not just for replication but also for admin clients (especially those monitoring the engine). How do you plan to prevent low-priority file descriptors from starving?

See also #1596

@madolson
Member

+1, I think we should prioritize. @JimB123 do you want to document what we did internally for this as an alternative?

@PingXie
Member

PingXie commented Feb 28, 2025

Thanks for starting this thread @artikell!

not just for replication but also for admin clients (especially those monitoring the engine). How do you plan to prevent low-priority file descriptors from starving?

On a high level, we can likely manage with two classes: internal connections (cluster bus, replication, and admin port connections) and external connections (normal clients). I don't foresee a risk of internal connections hogging the main thread. However, for the QoS idea to work, we'll need to limit the number of active external connections processed in each batch. The starvation risk exists here, but a simple solution might be to save the previous epoll return and keep a counter for the remaining active external connections, invoking another epoll only when the counter reaches 0. We then maintain the same round-robin strategy for normal clients.

Looking forward to a detailed design :)
