
[NEW] QoS of ae-loop #1789

Open
artikell opened this issue Feb 27, 2025 · 3 comments

@artikell
Contributor

The problem/use-case that the feature addresses

In high-concurrency scenarios, Valkey's AE event handling mechanism may process some events with high delay, so operations that require high real-time performance (such as primary-replica replication and serverCron tasks) cannot respond in a timely manner.

In particular, when a large number of low-priority events occupy the event loop, high-priority AE events may be blocked for a long time, degrading the overall performance and user experience of the system.

Description of the feature

Introduce an event priority management mechanism in Valkey's AE event processing module, supporting the configuration of a weight for each AE event. Treat core primary-replica synchronization tasks and serverCron tasks as high-priority tasks to guarantee their execution within a bounded period of time.

Alternatives you've considered

Rate limiting, CPU throttling.

Additional information

This proposal has been discussed elsewhere as well. The proposal comes from: @xbasel @PingXie

@xbasel
Member

xbasel commented Feb 27, 2025

This is a good idea, not just for replication but also for admin clients (especially those monitoring the engine). How do you plan to prevent low-priority file descriptors from starving?

See also #1596

@madolson
Member

+1, I think we should prioritize. @JimB123 do you want to document what we did internally for this as an alternative?

@PingXie
Member

PingXie commented Feb 28, 2025

Thanks for starting this thread @artikell!

not just for replication but also for admin clients (especially those monitoring the engine). How do you plan to prevent low-priority file descriptors from starving?

On a high level, we can likely manage with two classes: internal connections (cluster bus, replication, and admin port connections) and external connections (normal clients). I don't foresee a risk of internal connections hogging the main thread. However, for the QoS idea to work, we'll need to limit the number of active external connections processed in each batch. The starvation risk exists here, but a simple solution might be to save the previous epoll return and keep a counter for the remaining active external connections, invoking another epoll only when the counter reaches 0. We then maintain the same round-robin strategy for normal clients.

Looking forward to a detailed design :)
