
Loadavg calculation inside containers stops for about 70 min after live migration when running on a GCP VM #678

DeyanSG opened this issue Mar 17, 2025

When running lxcfs inside a GCP VM, we encountered a strange situation after the VM was live-migrated to a different host.

It seems that the kernel missed some interrupts due to the live migration, which caused the thread responsible for calculating the loadavg values inside containers to fail to wake up. As a result, the loadavg values reported inside the containers stayed frozen for about 70 minutes.
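For context, the loadavg thread is essentially a periodic loop around a relative sleep. Below is a minimal sketch of that pattern, illustrative only and not lxcfs's actual code (lxcfs uses usleep; the interval and helper names here are made up):

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Illustrative sketch only, not lxcfs's actual code. */

static void refresh_loadavg(void)
{
    /* placeholder for recomputing the per-container loadavg values */
    puts("refreshing loadavg");
}

static void *loadavg_refresh_loop(void *arg)
{
    struct timespec interval = { .tv_sec = 5, .tv_nsec = 0 }; /* made-up period */
    (void)arg;

    for (;;) {
        refresh_loadavg();
        /* Relative sleep: if the kernel misses this wake-up, the loop
         * (and therefore the reported loadavg) stalls until it fires. */
        nanosleep(&interval, NULL);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, loadavg_refresh_loop, NULL);
    pthread_join(tid, NULL);
    return 0;
}
```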

Initially, when we observed the issue, we tried reloading lxcfs, hoping this would restart the loadavg thread. Instead of resolving the issue, the reload never completed, and all subsequent attempts to read files managed by lxcfs hung in the read() call. We had to restart the service and all containers running on the VM to resume normal operation.

A kernel backtrace of the thread before the restart showed that it was sleeping:

crash> bt 4467
PID: 4467     TASK: ffff88a052bf8000  CPU: 43   COMMAND: "lxcfs"
 #0 [ffffc9000eea7d90] __schedule at ffffffff81bd7161
 #1 [ffffc9000eea7e38] schedule at ffffffff81bd785f
 #2 [ffffc9000eea7e58] do_nanosleep at ffffffff81bddcae
 #3 [ffffc9000eea7ea0] hrtimer_nanosleep_restart at ffffffff81bdde26
 #4 [ffffc9000eea7f18] __x64_sys_restart_syscall at ffffffff810c05df
 #5 [ffffc9000eea7f28] x64_sys_call at ffffffff81003316
 #6 [ffffc9000eea7f38] do_syscall_64 at ffffffff81bc2a35
 #7 [ffffc9000eea7f50] entry_SYSCALL_64_after_hwframe at ffffffff81c000af
    RIP: 00007fa4742d4355  RSP: 00007fa473fffd80  RFLAGS: 00000293
    RAX: ffffffffffffffda  RBX: 00007fa474000640  RCX: 00007fa4742d4355
    RDX: 00007fa473fffdc0  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: 00007fa473fffe50   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000293  R12: 0000000000000001
    R13: 0000000000000000  R14: 00007fa474289bc0  R15: 0000000000000000
    ORIG_RAX: 00000000000000db  CS: 0033  SS: 002b
crash>

We had a few more nodes with the same issue, and they all recovered (started recalculating the loadavg values again) within 70 minutes without any additional intervention.

Further investigation showed the following message in the kernel logs, indicating that hrtimer interrupt handling took about 0.21 seconds:

kernel: hrtimer: interrupt took 210767669 ns

This led us to conclude that we missed the wake-up time for the timer started by the usleep in the loadavg thread. We calculated that it would take about 70 minutes for the timer to overflow and reach the same value again, at which point normal operation resumes.
This issue does not occur with every live migration, so we believe it only happens when the timer is about to expire at the moment the kernel skips time due to the migration.
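For reference, if the quantity that wraps is a 32-bit count of microseconds (an assumption on my part; I have not pinned down the exact counter), the arithmetic lines up with the observed recovery time:

```
2^32 µs = 4294967296 µs ≈ 4295 s ≈ 71.6 min
```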

I am opening a PR to address the reload hanging for up to 70 minutes. Normally the reload signal interrupts a sleep only in the main thread (since the signal is handled there), so explicitly sending a signal to the loadavg thread interrupts its sleep as well and is enough for the reload to remediate the situation (see the sketch below). We’ve tested this on another node that experienced the same issue, and it works.
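Here is a minimal sketch of the idea, illustrative only and not the actual patch (the signal numbers and names are assumptions): the reload handler, which runs in the main thread, forwards a signal to the loadavg thread with pthread_kill(), and delivery of that signal makes the thread's sleep return early with EINTR.

```c
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Illustrative sketch only, not the actual lxcfs patch. */

static pthread_t loadavg_tid;                  /* id of the loadavg thread */
static volatile sig_atomic_t reload_requested;

static void wakeup_handler(int sig)
{
    (void)sig; /* nothing to do: delivery alone interrupts the sleep */
}

static void reload_handler(int sig)            /* runs in the main thread */
{
    (void)sig;
    reload_requested = 1;
    pthread_kill(loadavg_tid, SIGUSR2);        /* wake the loadavg thread */
}

static void *loadavg_refresh_loop(void *arg)
{
    struct timespec interval = { .tv_sec = 5, .tv_nsec = 0 };
    sigset_t block;
    (void)arg;

    /* Keep the reload signal out of this thread so main handles it. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &block, NULL);

    while (!reload_requested) {
        puts("refreshing loadavg");            /* placeholder for the real work */
        nanosleep(&interval, NULL);            /* returns early with EINTR on SIGUSR2 */
    }
    puts("loadavg thread exiting for reload");
    return NULL;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));

    sa.sa_handler = wakeup_handler;
    sigaction(SIGUSR2, &sa, NULL);
    sa.sa_handler = reload_handler;
    sigaction(SIGUSR1, &sa, NULL);             /* SIGUSR1 stands in for the reload signal */

    pthread_create(&loadavg_tid, NULL, loadavg_refresh_loop, NULL);
    pthread_join(loadavg_tid, NULL);
    return 0;
}
```

If I read signal(7) correctly, nanosleep()/usleep() are never restarted after a handled signal regardless of SA_RESTART, so delivering any handled signal to the sleeping thread should be enough to wake it.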

I would appreciate any advice on handling this situation in general (besides catching the kernel message and reloading lxcfs).

Regards,
Deyan
