Close file streams if MinIO upload fails #1531

Merged: 1 commit into gristlabs:main, Mar 24, 2025

Conversation

@fflorent (Collaborator)

Context

Quoting @vviers in #1526:

> we noticed that the disk usage (as monitored by df) on the docWorker increases steadily throughout the day, while the files in /persist (as monitored by du) stay flat. Looking deeper into this, we noticed that there were stale deleted documents in /proc/{node_pid}/fd that would accumulate until the node process is killed. Correlating logs, it looks like these two things are related. Additionally, multiplying the number of open deleted file descriptors for one document by its size corresponds exactly to the additional disk usage listed by df.

We successfully reproduced the bug:

  1. Run a Grist instance with MinIO (like the one in this docker-compose setup);
  2. Create a document;
  3. Monitor the file descriptors opened by the node process as the grist user: su grist and watch "ls -l /proc/$(pgrep -u grist -f 'server.js')/fd | grep persist";
  4. ⚠️ Stop the MinIO server (to simulate a problem);
  5. Then edit the document you created through the UI: on the server filesystem under /persist, the upload should create a copy of the document (a backup with the -backup suffix);
  6. Check the logs of the Grist server; after approximately 30 seconds it fails to upload the document to MinIO:
grist-1        | 2025-03-20 19:51:38.000 - error: ext doc upload: dryp2XSXceaxb4qYZ1Snhb failure to send, error getaddrinfo ENOTFOUND grist-minio
grist-1        | 2025-03-20 19:51:38.001 - error: HostedStorageManager error pushing dryp2XSXceaxb4qYZ1Snhb (3): Error: getaddrinfo ENOTFOUND grist-minio
grist-1        |     at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:120:26) {
grist-1        |   errno: -3008,
grist-1        |   code: 'ENOTFOUND',
grist-1        |   syscall: 'getaddrinfo',
grist-1        |   hostname: 'grist-minio'
grist-1        | } docId=null
grist-1        | 2025-03-20 19:51:38.109 - info: heartbeat [email protected], userId=5, age=107, org=docs, altSessionId=9n1KdkfmZoDDZCHKf32s5X, clientId=812b6630e11beb7f, counter=1, url=http://localhost:8484/o/docs/dryp2XSXceax/New-empty-doc-copy, docId=dryp2XSXceaxb4qYZ1Snhb
grist-1        | 2025-03-20 19:52:23.105 - info: heartbeat [email protected], userId=5, age=107, org=docs, altSessionId=9n1KdkfmZoDDZCHKf32s5X, clientId=812b6630e11beb7f, counter=1, url=http://localhost:8484/o/docs/dryp2XSXceax/New-empty-doc-copy, docId=dryp2XSXceaxb4qYZ1Snhb
grist-1        | 2025-03-20 19:52:38.004 - debug: backupSqliteDatabase: starting copy of /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist (backup) docId=dryp2XSXceaxb4qYZ1Snhb
grist-1        | 2025-03-20 19:52:38.005 - info: backupSqliteDatabase: copying /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist (backup) using source connection docId=dryp2XSXceaxb4qYZ1Snhb
grist-1        | 2025-03-20 19:52:38.022 - info: backupSqliteDatabase: copy of /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist (backup) completed successfully docId=dryp2XSXceaxb4qYZ1Snhb
grist-1        | 2025-03-20 19:52:38.022 - debug: backupSqliteDatabase: stopped copy of /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist (backup) docId=dryp2XSXceaxb4qYZ1Snhb, finalStepTimeMs=17, maxStepTimeMs=17, maxNonFinalStepTimeMs=0, numSteps=1
grist-1        | 2025-03-20 19:52:38.034 - error: ext doc upload: dryp2XSXceaxb4qYZ1Snhb failure to send, error getaddrinfo ENOTFOUND grist-minio
grist-1        | 2025-03-20 19:52:38.034 - error: HostedStorageManager error pushing dryp2XSXceaxb4qYZ1Snhb (4): Error: getaddrinfo ENOTFOUND grist-minio
grist-1        |     at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:120:26) {
grist-1        |   errno: -3008,
grist-1        |   code: 'ENOTFOUND',
grist-1        |   syscall: 'getaddrinfo',
grist-1        |   hostname: 'grist-minio'
grist-1        | } docId=null

➡️ And, through the above watch command, you should see that each time the server attempts to send the copy of the document, the copy remains open even though it has been unlinked from the filesystem:

Every 2.0s: ls -l /proc/25/fd | grep persist                                                                                                                                  58e5b15f60ec: Thu Mar 20 19:58:57 2025

lr-x------ 1 grist grist 64 Mar 20 19:48 25 -> /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist-backup (deleted)
lr-x------ 1 grist grist 64 Mar 20 19:48 27 -> /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist-backup (deleted)
lr-x------ 1 grist grist 64 Mar 20 19:48 29 -> /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist-backup (deleted)
lr-x------ 1 grist grist 64 Mar 20 19:48 31 -> /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist-backup (deleted)
lr-x------ 1 grist grist 64 Mar 20 19:48 32 -> /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist-backup (deleted)
lr-x------ 1 grist grist 64 Mar 20 19:48 33 -> /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist-backup (deleted)
lrwx------ 1 grist grist 64 Mar 20 19:48 36 -> /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist
lr-x------ 1 grist grist 64 Mar 20 19:48 37 -> /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist-backup (deleted)
lr-x------ 1 grist grist 64 Mar 20 19:48 40 -> /persist/docs/dryp2XSXceaxb4qYZ1Snhb.grist-backup (deleted)

This leads to the result described in issue #1526.

Proposed solution

We solve the issue by ensuring that the stream opened when uploading the document is closed whether or not the MinIO client managed to send the file.
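
For illustration, a minimal sketch of that idea (not the exact patch in this PR), where sendToMinio stands in for the real MinIO client call:

```ts
import type { Readable } from 'node:stream';

// Sketch only: `sendToMinio` stands in for the MinIO client call used by Grist.
async function uploadAndAlwaysClose(
  stream: Readable,
  sendToMinio: (s: Readable) => Promise<void>,
): Promise<void> {
  try {
    await sendToMinio(stream);
  } finally {
    // Runs on success and failure alike, so the file descriptor backing the
    // stream is released and deleted backups stop pinning disk space.
    stream.destroy();
  }
}
```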

Credit goes mostly to @jonathanperret for this solution.

Related issues

Fixes #1526

Has this been tested?

  • 👍 yes, I added tests to the test suite
  • 💭 no, because this PR is a draft and still needs work
  • 🙅 no, because this is not relevant here
  • 🙋 no, because I need help

@georgegevoian (Contributor) left a comment

Thanks @fflorent, @jonathanperret, and @vviers.

@paulfitz, should we file an issue in our internal backlog to check S3 and Azure in grist-ee, or do you already happen to know how they handle cleanup?

@paulfitz (Member)

Just cross-checking with @Spoffy that uploadStream calling destroy is OK in the attachment use case as well. I expect it is. Responsibility seems fuzzy here, though. It would be good to add a docstring to uploadStream in ExternalStorage.ts saying that it will destroy the stream after upload. An alternative choice could have been to leave the fate of the stream in the hands of the caller and fix the problem there. I'm not advocating for that, except that it would save having to update other back-ends, but that's just a Grist Labs problem currently :)

@Spoffy (Contributor) commented Mar 21, 2025

Firstly - some really nice sleuthing to find this issue.

With respect to closing the stream, I think there are two considerations here: one is responsibility, the other is attachment archives.

I think the code that creates the stream from the file should take responsibility for ensuring the FD is closed properly. uploadStream doesn't know anything about the source of the stream (or even its exact type), and therefore shouldn't be deciding that the stream should be closed.

Attachment archive uploads are one situation where this might be problematic. The entire file stream needs to be consumed, so that the next file in the .tar archive can be processed. Destroying the stream here would prevent the .tar unpacking function from doing that. This would likely cause attachment archive uploads to fail if any individual S3 upload failed for any reason.

For comparison, the current behaviour is that on a failed upload, the file is skipped and it tries to upload the next file.
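
To illustrate the distinction raised here, a sketch with a generic entry stream rather than Grist's actual archive code: on a failed upload, the entry can be drained so the archive reader can move on to the next file, whereas destroying it would tear down the whole pipeline.

```ts
import { Readable } from 'node:stream';
import { finished } from 'node:stream/promises';

// Sketch only: `send` stands in for the per-file S3/MinIO upload.
async function uploadArchiveEntry(entry: Readable, send: (s: Readable) => Promise<void>) {
  try {
    await send(entry);
  } catch {
    // Skip this file but keep the archive usable: drain the remaining bytes
    // instead of destroying the stream, so the unpacker can emit the next entry.
    entry.resume();
    await finished(entry).catch(() => {});
  }
}
```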


Looking at the code, I think the best place to handle this is in the upload method here:

public async upload(key: string, fname: string, metadata?: ObjMetadata) {
  // calling putObject with a file size will let MinIO be clever about uploading in multiple parts or not.
  const stat = await fse.lstat(fname);
  const filestream = fse.createReadStream(fname);
  return this.uploadStream(key, filestream, stat.size, metadata);
}

Which creates the stream from the file.
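
Concretely, folding the cleanup into upload could look roughly like this (a sketch of the suggestion, not necessarily the exact change that was merged):

```ts
public async upload(key: string, fname: string, metadata?: ObjMetadata) {
  // calling putObject with a file size will let MinIO be clever about uploading in multiple parts or not.
  const stat = await fse.lstat(fname);
  const filestream = fse.createReadStream(fname);
  try {
    // `return await` (rather than a bare `return`) keeps the finally block from
    // destroying the stream before the upload has actually finished.
    return await this.uploadStream(key, filestream, stat.size, metadata);
  } finally {
    filestream.destroy();  // release the file descriptor on success and failure alike
  }
}
```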

That does mean this issue might also exist on our other providers @paulfitz, if they're not cleaning up the file stream. Weirdly, it seems like Node doesn't release the FD when the stream is garbage collected?

@jonathanperret (Collaborator)

> With respect to closing the stream, I think there are two considerations here: one is responsibility, the other is attachment archives.
>
> I think the code that creates the stream from the file should take responsibility for ensuring the FD is closed properly. uploadStream doesn't know anything about the source of the stream (or even its exact type), and therefore shouldn't be deciding that the stream should be closed.

I agree. Actually our first implementation was in upload for exactly the reasons you mention, but somehow we forgot about those when faced with writing a test that would have to check for a file handle leak.

I'll move the code back into upload and find a way to test it anyway.

> Attachment archive uploads are one situation where this might be problematic. The entire file stream needs to be consumed, so that the next file in the .tar archive can be processed. Destroying the stream here would prevent the .tar unpacking function from doing that. This would likely cause attachment archive uploads to fail if any individual S3 upload failed for any reason.
>
> For comparison, the current behaviour is that on a failed upload, the file is skipped and it tries to upload the next file.

I guess it could be debated whether skipping an attachment is better than failing the upload outright, but this doesn't take away from your main point.

> That does mean this issue might also exist on our other providers @paulfitz, if they're not cleaning up the file stream. Weirdly, it seems like Node doesn't release the FD when the stream is garbage collected?

This is a known behavior of Node streams; see for example nodejs/node#38681. But as far as I can gather, even language runtimes that do attempt to release external resources on garbage collection generally warn against relying on it (see, e.g., Python), despite this being a significant footgun.

@jonathanperret (Collaborator)

The branch was updated with a change that only touches upload. As expected, this required some acrobatics in the tests, but hopefully they are acceptable.

@paulfitz requested a review from georgegevoian on March 24, 2025 at 13:06
@Spoffy (Contributor) left a comment

Looks good to me :)

@georgegevoian merged commit e9840a9 into gristlabs:main on Mar 24, 2025
12 checks passed
@jonathanperret deleted the close-upload-streams branch on March 26, 2025 at 08:57
Development

Successfully merging this pull request may close these issues.

MinIO client infinite retry can fill up docWorker disk (#1526)