Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcd reading member-k0-key.pem every second (again) #19505

Open
4 tasks done
kaleal opened this issue Mar 1, 2025 · 4 comments
Open
4 tasks done

etcd reading member-k0-key.pem every second (again) #19505

kaleal opened this issue Mar 1, 2025 · 4 comments
Labels
priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. stage/triaged type/bug

Comments

@kaleal
Copy link

kaleal commented Mar 1, 2025

Bug report criteria

What happened?

I have observed that etcd is reading the private key from disk continuously. The issue observed when an audit rule has been applied to catch the private keys being read.
The issue was discovered running a kubernetes cluster with three nodes deployed with kubespray in three virtual machines managed by vagrant.

What did you expect to happen?

etcd should read the key at start up and keep it in memory for further use.

My previous ticket was closed without any chance of justifying my point of view.

Here goes the replica:
Supporting dynamically loading the certificate does not imply on reading it from disk at every call.
The certificate can be held for a specific amount of time, and replaced when that time is expired. That could be even configurable.

Reading the file at every call can cause a bottleneck and harm the performance.

How can we reproduce it (as minimally and precisely as possible)?

On a k8s control plane node install auditd.

apt-get install auditd

Create an audit rules for monitoring the key file

for key in $(ls /etc/ssl/etcd/ssl/*key.pem); do
  echo -a always,exit -F arch=b64 -F path=${key} -F perm=rwa -k etcd -k etcd-keys
done >/etc/audit/rules.d/etcd-keys.rules
systemctl restart auditd

Inspect audit logs

ausearch -k etcd-keys -i

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.16
Git SHA: f20bbad
Go Version: go1.22.7
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.16
API version: 3.5

Etcd configuration (command line flags or environment variables)

root@k0:~# cat /etc/etcd.env 
# Environment file for etcd v3.5.16
ETCD_DATA_DIR=/var/lib/etcd
ETCD_ADVERTISE_CLIENT_URLS=https://192.168.121.89:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://192.168.121.89:2380
ETCD_INITIAL_CLUSTER_STATE=existing
ETCD_METRICS=basic
ETCD_LISTEN_CLIENT_URLS=https://192.168.121.89:2379,https://127.0.0.1:2379
ETCD_ELECTION_TIMEOUT=5000
ETCD_HEARTBEAT_INTERVAL=250
ETCD_INITIAL_CLUSTER_TOKEN=k8s_etcd
ETCD_LISTEN_PEER_URLS=https://192.168.121.89:2380
ETCD_NAME=etcd1
ETCD_PROXY=off
ETCD_INITIAL_CLUSTER=etcd1=https://192.168.121.89:2380,etcd2=https://192.168.121.79:2380,etcd3=https://192.168.121.32:2380
ETCD_AUTO_COMPACTION_RETENTION=8
ETCD_SNAPSHOT_COUNT=10000
# Flannel need etcd v2 API
ETCD_ENABLE_V2=true

# TLS settings
ETCD_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_CERT_FILE=/etc/ssl/etcd/ssl/member-k0.pem
ETCD_KEY_FILE=/etc/ssl/etcd/ssl/member-k0-key.pem
ETCD_CLIENT_CERT_AUTH=true

ETCD_PEER_TRUSTED_CA_FILE=/etc/ssl/etcd/ssl/ca.pem
ETCD_PEER_CERT_FILE=/etc/ssl/etcd/ssl/member-k0.pem
ETCD_PEER_KEY_FILE=/etc/ssl/etcd/ssl/member-k0-key.pem
ETCD_PEER_CLIENT_CERT_AUTH=True



# CLI settings
ETCDCTL_ENDPOINTS=https://127.0.0.1:2379
ETCDCTL_CACERT=/etc/ssl/etcd/ssl/ca.pem
ETCDCTL_KEY=/etc/ssl/etcd/ssl/admin-k0-key.pem
ETCDCTL_CERT=/etc/ssl/etcd/ssl/admin-k0.pem

# ETCD 3.5.x issue
# https://groups.google.com/a/kubernetes.io/g/dev/c/B7gJs88XtQc/m/rSgNOzV2BwAJ?utm_medium=email&utm_source=footer
ETCD_EXPERIMENTAL_INITIAL_CORRUPT_CHECK=True


ETCD_EXPERIMENTAL_WATCH_PROGRESS_NOTIFY_INTERVAL=5s

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
+------------------+---------+-------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  | NAME  |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+-------+-----------------------------+-----------------------------+------------+
| 4a694b80a6468586 | started | etcd2 | https://192.168.121.79:2380 | https://192.168.121.79:2379 |      false |
| 6a2e077545146330 | started | etcd1 | https://192.168.121.89:2380 | https://192.168.121.89:2379 |      false |
| a0ffe7ca1e9739ea | started | etcd3 | https://192.168.121.32:2380 | https://192.168.121.32:2379 |      false |
+------------------+---------+-------+-----------------------------+-----------------------------+------------+


$ etcdctl --endpoints=<member list> endpoint status -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.121.79:2379 | 4a694b80a6468586 |  3.5.16 |  7.8 MB |     false |      false |         8 |      60238 |              60238 |        |
| https://192.168.121.89:2379 | 6a2e077545146330 |  3.5.16 |  7.7 MB |      true |      false |         8 |      60238 |              60238 |        |
| https://192.168.121.32:2379 | a0ffe7ca1e9739ea |  3.5.16 |  7.7 MB |     false |      false |         8 |      60238 |              60238 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

----
type=PROCTITLE msg=audit(02/26/25 16:58:57.139:6265) : proctitle=/usr/local/bin/etcd 
type=PATH msg=audit(02/26/25 16:58:57.139:6265) : item=0 name=/etc/ssl/etcd/ssl/member-k0-key.pem inode=1966155 dev=fe:01 mode=file,700 ouid=etcd ogid=root rdev=00:00 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(02/26/25 16:58:57.139:6265) : cwd=/ 
type=SYSCALL msg=audit(02/26/25 16:58:57.139:6265) : arch=x86_64 syscall=openat success=yes exit=207 a0=AT_FDCWD a1=0xc001efdbc0 a2=O_RDONLY|O_CLOEXEC a3=0x0 items=1 ppid=1 pid=854 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=etcd exe=/usr/local/bin/etcd subj=unconfined key=etcd key=etcd-keys 
----
type=PROCTITLE msg=audit(02/26/25 16:58:58.159:6266) : proctitle=/usr/local/bin/etcd 
type=PATH msg=audit(02/26/25 16:58:58.159:6266) : item=0 name=/etc/ssl/etcd/ssl/member-k0-key.pem inode=1966155 dev=fe:01 mode=file,700 ouid=etcd ogid=root rdev=00:00 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(02/26/25 16:58:58.159:6266) : cwd=/ 
type=SYSCALL msg=audit(02/26/25 16:58:58.159:6266) : arch=x86_64 syscall=openat success=yes exit=207 a0=AT_FDCWD a1=0xc003b7d650 a2=O_RDONLY|O_CLOEXEC a3=0x0 items=1 ppid=1 pid=854 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=etcd exe=/usr/local/bin/etcd subj=unconfined key=etcd key=etcd-keys 
----
type=PROCTITLE msg=audit(02/26/25 16:58:58.303:6267) : proctitle=/usr/local/bin/etcd 
type=PATH msg=audit(02/26/25 16:58:58.303:6267) : item=0 name=/etc/ssl/etcd/ssl/member-k0-key.pem inode=1966155 dev=fe:01 mode=file,700 ouid=etcd ogid=root rdev=00:00 nametype=NORMAL cap_fp=none cap_fi=none cap_fe=0 cap_fver=0 cap_frootid=0 
type=CWD msg=audit(02/26/25 16:58:58.303:6267) : cwd=/ 
type=SYSCALL msg=audit(02/26/25 16:58:58.303:6267) : arch=x86_64 syscall=openat success=yes exit=207 a0=AT_FDCWD a1=0xc003b7dad0 a2=O_RDONLY|O_CLOEXEC a3=0x0 items=1 ppid=1 pid=854 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=etcd exe=/usr/local/bin/etcd subj=unconfined key=etcd key=etcd-keys
@kaleal kaleal added the type/bug label Mar 1, 2025
@ahrtr
Copy link
Member

ahrtr commented Mar 2, 2025

You raised this as a bug, what's the exact issue that impacts your business?

You also mentions the performance impact. If it's only related to performance, this ticket should be a feature request, also it'd be better to demo how much performance may be affected with or without the dynamically loading certificate feature.

@ivanvc ivanvc added stage/triaged priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Mar 13, 2025
@kaleal
Copy link
Author

kaleal commented Mar 26, 2025

@ahrtr @ivanvc

I don't have a business. It was just observation, disks are slow, ram is fast. That is not a guess, that is a fact. I have never mentioned of disabling dynamically loading the certificate. I just mentioned on doing it using another method, to avoid reading the certificates from disk at every call, that is expensive. You are not expecting to have the certificate replaced multiple times per second, are you? Why not keeping it in memory and loading it from disk at a scheduled manner, once per 10 seconds, for instance?

Talking about the bug, you are right. That can be a feature request. A request could be:

  • Load the certificate and keep it loaded in memory for a configurable amount of time before reloading it from disk again.

@kaleal
Copy link
Author

kaleal commented Mar 26, 2025

My intention with this was doing a contribution. Nothing is broken but there is nothing that can't be improved. I have included my evidences that the file is being read from disk multiple times a second on an idle k8s cluster. I don't think that writing code to prove that keeping the certificate in memory is quicker than reading it from disk is going to discover something that can't be predicted.

I am not versed with Go and the etcd code, but if someone get interested and agree with me that a little of performance increment can be archived with this I am more than happy to have contributed, and perhaps, with some help, do a code contribution.

@serathius
Copy link
Member

serathius commented Mar 26, 2025

Please note that if rereading private key had significant footprint on etcd, issue would have been discovered earlier. I suspect that after first call, the file is just cached by operating system and we never need to hit the disk. The overhead would only come from syscall which is much smaller.

There is always things that can be optimized, but doing so is always a tradeoff. A tradeoff with complexity, code readability or with other execution path.

Before we optimize here we need to:

  • measure the overhead
  • understand the tradeoffs

To measure overhead would be good to see some profiles showing that this cost is not negligible.

For this case I expect that caching private key might result in two things: delay in dynamic certificate reload, create potential inconsistency with reload of certificate. We need to cache not only private key, but certificate too to ensure they match.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. stage/triaged type/bug
Development

No branches or pull requests

4 participants