feat: add hg.fuse for http-based filesystem #6
Example

from rich import print
import hg.fuse

if __name__ == "__main__":
    # create a filesystem
    fs = hg.fuse.GlobalHTTPFileSystem()

    # any path that isn't terminated with ".." is an empty directory
    print(fs.info("/http/example.com"), fs.ls('/http/example.com'))
    # {'name': '/http/example.com', 'size': None, 'type': 'directory'}
    # []

    # except for root which has two "directories"
    print(fs.ls("/"))
    # ["http/", "https/"]

    # a path terminated with ".." will be inspected as a file
    print(fs.info("/http/example.com.."))
    # {'name': '/http/example.com..', 'size': 648, 'ETag': '"3147526947"', 'type': 'file'}

    try:
        fs.ls("/http/example.com..")
    except NotADirectoryError:
        # path is for a file
        pass

    print(fs.cat("/http/example.com.."))
    # b'<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n\n <meta charset="utf-8" />\n

    # filelike objects
    f = fs.open("/http/example.com..")
    print(f)
    # <File-like object GlobalHTTPFileSystem, /http/example.com..>
    f.close()

    # context manager
    with fs.open("/http/example.com..") as f:
        print(f.readline())
        # b'<!doctype html>\n'
Perhaps a silly question, but why can't we just use fsspec.implementations.http.HTTPFileSystem and have
In other words, why do we need the filesystem path mapping?
Not a silly question. The issue is that we can't pass a file object to clodius. We need to pass a local file path, so we need to handle the compatibility at the OS level with FUSE.

import fsspec
from clodius.cooler import tiles

tiles(fsspec.open('http://example.com/data.mcool'), ...) # error
tiles('/my/mount/dir/http/example.com/data.mcool..', ...) # ok if FUSE is running

I talked to @nvictus about this last week, but ideally all clodius tile implementations could take a file-like object or a local path. Not sure how feasible that would be, but if even some of the implementations could take a file-like object, fsspec could provide a lot of flexibility in where data can live.

fh = fsspec.open('s3://path/to/data.mcool')
fh = fsspec.open('dropbox://path/to/data.mcool')
fh = fsspec.open('ftp://path/to/data.mcool')
tiles(fh, ...)
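For illustration, here is a minimal sketch of how a reader could accept either a local path or an already-open file-like object. The names `as_binary_file` and `read_magic` are hypothetical and are not part of clodius; `read_magic` merely stands in for a tile fetcher that only needs `seek` and `read`.

```python
import contextlib
import io
import os
from typing import BinaryIO, Iterator, Union

Source = Union[str, os.PathLike, BinaryIO]


@contextlib.contextmanager
def as_binary_file(source: Source) -> Iterator[BinaryIO]:
    """Yield a readable binary file for a local path or an open file-like object."""
    if isinstance(source, (str, os.PathLike)):
        # We opened the file ourselves, so we are responsible for closing it.
        with open(source, "rb") as f:
            yield f
    else:
        # The caller owns the object; leave it open when we are done.
        yield source


def read_magic(source: Source, n: int = 4) -> bytes:
    """Hypothetical stand-in for a tile fetcher: read the first n bytes."""
    with as_binary_file(source) as f:
        f.seek(0)
        return f.read(n)


if __name__ == "__main__":
    # An in-memory buffer behaves like any other file-like object here.
    print(read_magic(io.BytesIO(b"\x89HDF\r\n\x1a\n")))
```

With fsspec, the file-like object would come from entering the `OpenFile` returned by `fsspec.open`, for example `with fsspec.open(url) as f: read_magic(f)`. This only helps readers whose underlying library can work on Python file objects.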
We can only circumvent FUSE for those tile fetchers that can operate on a Python file-like object directly, in which case the built-in

However, if the underlying file access happens in a low-level library (bbi, bam, etc.), we have to either delegate to its internal http handling if it exists (htslib and bbi can do this, but not flexibly or gracefully) or rely on FUSE's filesystem path mapping (+ add caching layers, etc.), which is what we currently do.

One exception to this rule is HDF5/h5py, which recently implemented a low-level Python file-like object driver that calls back into Python. However, clodius doesn't exploit it at the moment and only accepts a path.

So the idea @manzt is proposing is that clodius tile fetchers could be extended so that, when possible, they can accept either a path string or a file-like object, which would increase the flexibility of what kind of resource they can accept, without introducing specific fs implementation logic into clodius itself. Unfortunately, for most bioinformatic format libraries, this option isn't available and we'll have to keep relying on FUSE and file paths.

I personally like keeping the FUSE process lifecycle independent of the

hg.fuse.start('/my/mount/dir/') # global instance of FuseProcess
ts = hg.tilesets.bigwig(hg.fuse.path('http://mydataset.com/data.bigwig'))

But I can also see
Agreed. FUSE could be set up totally outside of hg.
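FYI the fsspec-based FUSE operator is muuuuuuch slower than simple-httpfs. I played around with some of the caching, but I think I'll probably wrap simple-httpfs in a follow up PR to replace the fsspec usage.
@pkerpedjiev is it possible to configure a single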
All those points make sense. But I am really curious why simple-httpfs
would be faster than fsspec. Is it just the caching? Because simple-httpfs
should come with a big warning that it assumes http files are immutable so
it basically does caching with an infinite TTL. Big no no but for the
purposes for which it’s used that’s rarely a problem.
…On Mon, Feb 7, 2022 at 6:42 PM Trevor Manz ***@***.***> wrote:
FYI the fsspec-based FUSE operator is muuuuuuch slower that simple-httpfs.
I played around with some of the caching, but I think I'll probably wrap
simple-httpfs in a follow up PR to replace the fsspec usage.
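To make the infinite-TTL point concrete, here is a minimal sketch of a byte-range cache where `ttl=None` means entries never expire. This is not how simple-httpfs or fsspec implement caching; `BlockCache` and its `fetch` callback are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, int, int]


class BlockCache:
    """Cache fetched byte ranges, optionally expiring them after `ttl` seconds."""

    def __init__(
        self,
        fetch: Callable[[str, int, int], bytes],
        ttl: Optional[float] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl  # None means "assume the remote file is immutable"
        self._clock = clock
        self._entries: Dict[Key, Tuple[float, bytes]] = {}

    def get(self, url: str, start: int, end: int) -> bytes:
        key = (url, start, end)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None:
            stored_at, data = hit
            # With no TTL a hit is always served, even if the remote changed.
            if self._ttl is None or now - stored_at < self._ttl:
                return data
        data = self._fetch(url, start, end)
        self._entries[key] = (now, data)
        return data
```

With `ttl=None` every repeated range read is served locally, which is fast but returns stale bytes if the remote file changes. A finite `ttl` trades some of that speed for freshness.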
I'm really not sure without digging further. I just tried both and the former was painfully slow compared to the latter. Basically unusable, so I deferred to using
I know that
Screen.Recording.2022-02-08.at.11.03.40.AM.mov
Screen.Recording.2022-02-08.at.11.00.37.AM.mov
This PR adds a special fsspec filesystem which extends the builtin fsspec.implementations.http.HTTPFileSystem so that it can be mapped to a local filesystem via FUSE.

Motivation
The issue with the builtin fsspec HTTPFileSystem is that "paths" for the file system are complete URLs, so they cannot be mapped to a filesystem. Additionally, ls has some special behavior to try to guess additional files.

Special handling of URLs is required to distinguish between a "directory" and a "file" in our virtual file system. I used the path style from https://github.com/higlass/simple-httpfs for this filesystem, where file paths are suffixed with "..". (NOTE: we could expose this as a parameter in the FS constructor.)

Why not simple-httpfs?
This is mostly an experiment to see if we can lean on some of the sophisticated behavior from fsspec. The file objects inherit from fsspec.spec.AbstractBufferedFile, which allows for configurable caching (https://filesystem-spec.readthedocs.io/en/latest/features.html#file-buffering-and-random-access). We will have the ability to create an individual cache per file.

In theory, it shouldn't matter if we use this or simple-httpfs because mapping FUSE is something that occurs at the OS level (and I'm not sure if this is something that needs to be implemented in this library at all). Really what we could expose are some convenience functions for mapping a URL to the filesystem-style path.

It would be worth looking into clodius to see how many tilesets could accept a file-like object, and then this type of functionality would only be necessary for the readers which require a local filepath.
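As a rough sketch of the convenience functions mentioned above, the mapping between a URL and the ".."-suffixed path style could look like the following. The function names `url_to_fuse_path` and `fuse_path_to_url` are hypothetical, and carrying the query string along is an assumption rather than something this PR or simple-httpfs is known to do.

```python
from urllib.parse import urlsplit

FILE_SUFFIX = ".."  # file marker, following the simple-httpfs path style
SCHEMES = ("http", "https")


def url_to_fuse_path(url: str, mount_dir: str = "/") -> str:
    """Map an http(s) URL to a file path under a FUSE mount directory."""
    parts = urlsplit(url)
    if parts.scheme not in SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    remainder = parts.netloc + parts.path
    if parts.query:
        # Assumption: keep the query so the URL can be recovered later.
        remainder += "?" + parts.query
    return f"{mount_dir.rstrip('/')}/{parts.scheme}/{remainder}{FILE_SUFFIX}"


def fuse_path_to_url(path: str, mount_dir: str = "/") -> str:
    """Inverse mapping; only valid for paths that name a file."""
    if not path.endswith(FILE_SUFFIX):
        raise ValueError("not a file path (missing '..' suffix)")
    prefix = mount_dir.rstrip("/") + "/"
    if not path.startswith(prefix):
        raise ValueError(f"path is not under {mount_dir!r}")
    inner = path[len(prefix):-len(FILE_SUFFIX)]
    scheme, _, remainder = inner.partition("/")
    if scheme not in SCHEMES:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return f"{scheme}://{remainder}"
```

For example, `url_to_fuse_path('http://example.com/data.mcool', '/my/mount/dir/')` gives `/my/mount/dir/http/example.com/data.mcool..`, the path form used earlier in this thread.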
Setup/teardown
If we do decide to add this functionality to hg, we could export a class which globally starts the fuse in a subprocess. The end user needs to run this once at the top of their file. This way we decouple the fuse mapping from the server and make users be explicit about when fuse is needed. The main issue here is gracefully shutting down / cleaning up.
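One possible shape for the setup/teardown, sketched under assumptions: a small class that owns a single background process and registers cleanup with `atexit`. The class body is illustrative only, and `PLACEHOLDER_CMD` is a hypothetical stand-in for whatever command would actually run the FUSE mount. Unmounting the directory is not handled here.

```python
import atexit
import subprocess
import sys
from typing import List, Optional


class FuseProcess:
    """Own one background process and make sure it is stopped on exit."""

    def __init__(self) -> None:
        self._proc: Optional[subprocess.Popen] = None
        # Best-effort cleanup when the interpreter exits normally.
        atexit.register(self.stop)

    @property
    def running(self) -> bool:
        return self._proc is not None and self._proc.poll() is None

    def start(self, cmd: List[str]) -> None:
        self.stop()  # restart cleanly if a process is already running
        self._proc = subprocess.Popen(cmd)

    def stop(self, timeout: float = 5.0) -> None:
        proc, self._proc = self._proc, None
        if proc is None or proc.poll() is not None:
            return
        proc.terminate()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # The process ignored the polite request; force it.
            proc.kill()
            proc.wait()


# Hypothetical placeholder: a real command would run the FUSE mount.
PLACEHOLDER_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]
```

A user would create one instance at the top of their script and call `start` once. `atexit` does not run if the interpreter is killed abruptly, so a stale mount can still be left behind in that case.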
cc. @pkerpedjiev @nvictus