This repository was archived by the owner on May 27, 2023. It is now read-only.

feat: add hg.fuse for http-based filesystem #6

Merged
merged 9 commits into from
Feb 7, 2022

Conversation

manzt
Owner

@manzt manzt commented Feb 5, 2022

This PR adds a special fsspec filesystem which extends the built-in fsspec.implementations.http.HTTPFileSystem so that it can be mapped to a local filesystem via FUSE.

import hg.fuse

# use in Python
fs = hg.fuse.GlobalHTTPFileSystem()
with fs.open('/http/example.com..') as f:
    f.readline()

# mount to a local directory with FUSE and then access it from the OS
import fsspec.fuse

# blocking by default; pass foreground=False to run in the background
fsspec.fuse.run(hg.fuse.GlobalHTTPFileSystem(), '/', '/my/mount/dir')

# open from OS
with open('/my/mount/dir/http/example.com..') as f:
    f.readline()

# different root (only http)
fsspec.fuse.run(hg.fuse.GlobalHTTPFileSystem(), '/http/', '/my/mount/dir')
with open('/my/mount/dir/example.com..') as f:
    f.readline()

Motivation

The issue with the built-in fsspec HTTPFileSystem is that "paths" for the file system are complete URLs, so they cannot be mapped to filesystem paths. Additionally, ls has some special behavior to try to guess additional files.

from fsspec.implementations.http import HTTPFileSystem

fs = HTTPFileSystem()
fs.info("http://example.com") # { "type": "file", ... }
fs.ls("http://example.com") # will try to crawl HTML if available and extract hrefs, not what we want!

Special handling of URLs is required to distinguish between a "directory" and a "file" in our virtual file system. I used the path style from https://github.com/higlass/simple-httpfs for this filesystem, where file paths are suffixed with "..". (NOTE: we could expose this as a parameter in the FS constructor).
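
As a rough illustration of that path style, the sketch below converts between a URL and a virtual path, treating a trailing ".." as the file marker. The helper names (url_to_path, path_to_url, is_file_path) are hypothetical and are not part of hg.fuse or simple-httpfs; query strings and fragments are ignored for simplicity.

```python
from urllib.parse import urlsplit

FILE_SUFFIX = ".."  # marker that distinguishes a "file" from a "directory"


def url_to_path(url: str, suffix: str = FILE_SUFFIX) -> str:
    """Map 'http://example.com/a.txt' to '/http/example.com/a.txt..'."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    return f"/{parts.scheme}/{parts.netloc}{parts.path}{suffix}"


def is_file_path(path: str, suffix: str = FILE_SUFFIX) -> bool:
    """Only paths carrying the suffix are treated as files."""
    return path.endswith(suffix)


def path_to_url(path: str, suffix: str = FILE_SUFFIX) -> str:
    """Inverse of url_to_path; rejects paths without the file suffix."""
    if not is_file_path(path, suffix):
        raise IsADirectoryError(path)
    scheme, _, rest = path[: -len(suffix)].lstrip("/").partition("/")
    return f"{scheme}://{rest}"
```

Passing the suffix as an argument mirrors the note above about exposing it as a constructor parameter.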

Why not simple-httpfs?

This is mostly an experiment to see if we can lean on some of the sophisticated behavior from fsspec. The file objects inherit from fsspec.spec.AbstractBufferedFile, which allows for configurable caching (https://filesystem-spec.readthedocs.io/en/latest/features.html#file-buffering-and-random-access), so we can create an individual cache per file.

In theory, it shouldn't matter if we use this or simple-httpfs because mapping FUSE is something that occurs at the OS level (and I'm not sure if this is something that needs to be implemented in this library at all). Really what we could expose are some convenience functions for mapping a URL to the filesystem-style path.

It would be worth looking into clodius to see how many tilesets could accept a file-like object; this type of functionality would then only be necessary for the readers which require a local filepath.

Setup/teardown

If we do decide to add this functionality to hg, we could export a class which globally starts FUSE in a subprocess. The end user would need to run this once at the top of their file. This way we decouple the FUSE mapping from the server and make users be explicit about when FUSE is needed. The main issue here is gracefully shutting down and cleaning up.

import hg
import hg.fuse
import hg.tilesets

hg.fuse.start('/my/mount/dir/') # global instance of FuseProcess
path = hg.fuse.path('http://mydataset.com/data.bigwig') # /my/mount/dir/http/mydataset.com/data.bigwig..
ts = hg.tilesets.bigwig(path)
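
One way to approach the cleanup problem is to tie the background process to a small wrapper that terminates it at interpreter exit. The sketch below is only an assumption about how such a wrapper might look: this FuseProcess class is hypothetical, and the command it launches is a stand-in sleeper rather than a real FUSE mount. A real version would also need to unmount the directory (for example with fusermount -u on Linux).

```python
import atexit
import subprocess
import sys


class FuseProcess:
    """Hypothetical wrapper that owns a single background process."""

    def __init__(self, cmd):
        self._cmd = list(cmd)
        self._proc = None

    @property
    def running(self):
        return self._proc is not None and self._proc.poll() is None

    def start(self):
        if self.running:
            return  # starting twice is a no-op
        self._proc = subprocess.Popen(self._cmd)
        atexit.register(self.stop)  # best-effort cleanup at interpreter exit

    def stop(self, timeout=5.0):
        if self._proc is None:
            return
        if self._proc.poll() is None:
            self._proc.terminate()  # ask politely first
            try:
                self._proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                self._proc.kill()  # then force it
                self._proc.wait()
        self._proc = None
        atexit.unregister(self.stop)


# Stand-in for the real mount command (assumption: any long-running process).
STAND_IN_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]
```

Calling stop() explicitly (or from a context manager) is still preferable, since atexit handlers do not run if the interpreter is killed by a signal.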

cc. @pkerpedjiev @nvictus

@manzt
Owner Author

manzt commented Feb 6, 2022

Example

from rich import print

import hg.fuse

if __name__ == "__main__":

    # create a filesystem
    fs = hg.fuse.GlobalHTTPFileSystem()

    # any path that isn't terminated with ".." is an empty directory
    print(fs.info("/http/example.com"), fs.ls('/http/example.com'))
    # {'name': '/http/example.com', 'size': None, 'type': 'directory'}
    # []

    # except for root which has two "directories"
    print(fs.ls("/"))
    # ["http/", "https/"]

    # a path terminated with ".." will be inspected as a file
    print(fs.info("/http/example.com.."))
    # {'name': '/http/example.com..', 'size': 648, 'ETag': '"3147526947"', 'type': 'file'}
    try:
        fs.ls("/http/example.com..")
    except NotADirectoryError:
        # path is for a file
        pass

    print(fs.cat("/http/example.com.."))
    # b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n

    # filelike objects
    f = fs.open("/http/example.com..")
    print(f)
    # <File-like object GlobalHTTPFileSystem, /http/example.com..>
    f.close()

    # context manager
    with fs.open("/http/example.com..") as f:
        print(f.readline())
        # b'<!doctype html>\n'

@manzt manzt changed the title feat: try fsspec fuse impl feat: add hg.fuse for http-based filesystem Feb 7, 2022
@manzt manzt merged commit 57ca1e9 into main Feb 7, 2022
@manzt manzt deleted the fuse branch February 7, 2022 02:03
@pkerpedjiev

pkerpedjiev commented Feb 7, 2022

Perhaps a silly question, but why can't we just use fsspec.implementations.http.HTTPFileSystem and have hg intercept whichever URL or file is passed in for a tileset and return a file object?

In other words, why do we need the filesystem path mapping?

@manzt
Owner Author

manzt commented Feb 7, 2022

Not a silly question. The issue is that we can't pass a file object to clodius. It needs a local file path, so we have to handle compatibility at the OS level with FUSE.

import fsspec
from clodius.cooler import tiles

tiles(fsspec.open('http://example.com/data.mcool'), ...) # error
tiles('/my/mount/dir/http/example.com/data.mcool..', ...) # ok if FUSE is running

I talked to @nvictus about this last week, but ideally all clodius tile implementations could take a file-like object or a local path. Not sure how feasible that would be, but if even some of the implementations could take a file-like object, fsspec could provide a lot of flexibility with where data can live.

fh = fsspec.open('s3://path/to/data.mcool')
fh = fsspec.open('dropbox://path/to/data.mcool')
fh = fsspec.open('ftp://path/to/data.mcool')
tiles(fh, ...)

@nvictus

nvictus commented Feb 7, 2022

We can only circumvent FUSE for those tile fetchers that can operate on a Python file-like object directly, in which case the built-in HTTPFileSystem or any other fsspec file system implementation could be readily used.

However, if the underlying file access happens in a low-level library (bbi, bam, etc.), we have to either delegate to its internal HTTP handling if it exists (htslib and bbi can do this, but not flexibly or gracefully) or rely on FUSE's filesystem path mapping (plus caching layers, etc.), which is what we currently do.

One exception to this rule is HDF5/h5py, which recently implemented a low-level Python file-like object driver that calls back into Python. However, clodius doesn't exploit it at the moment and only accepts a path.

So the idea @manzt is proposing is that clodius tile fetchers could be extended so that, when possible, they can accept either a path string or a file-like object, which would increase the flexibility of what kind of resource they can accept, without introducing specific fs implementation logic into clodius itself.
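
A minimal sketch of that "path or file-like object" pattern, using only the standard library, might look like the following. The names open_resource and read_magic are hypothetical; read_magic merely stands in for a tile fetcher and is not a clodius function.

```python
import contextlib
import os


@contextlib.contextmanager
def open_resource(path_or_file, mode="rb"):
    """Yield a file-like object for a path or an already-open file.

    Files opened here are closed on exit; objects passed in by the caller
    are left open, since the caller owns them.
    """
    if isinstance(path_or_file, (str, bytes, os.PathLike)):
        with open(path_or_file, mode) as f:
            yield f
    elif hasattr(path_or_file, "read"):
        yield path_or_file
    else:
        raise TypeError(
            f"expected a path or file-like object, got {type(path_or_file).__name__}"
        )


def read_magic(path_or_file, n=4):
    """Hypothetical stand-in for a tile fetcher: read the first n bytes."""
    with open_resource(path_or_file) as f:
        return f.read(n)
```

With a helper like this, a fetcher written against file-like objects keeps working for local paths, while remote sources can be handed in as whatever file object the caller has already opened.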

Unfortunately, for most bioinformatic format libraries, this option isn't available and we'll have to keep relying on FUSE and file paths. I personally like keeping the FUSE process lifecycle independent of the HgServer, so that it can be used to explicitly provide a file path mapping:

hg.fuse.start('/my/mount/dir/') # global instance of FuseProcess
ts = hg.tilesets.bigwig(hg.fuse.path('http://mydataset.com/data.bigwig'))

But I can also see hg tileset wrappers intercepting URLs and doing the mapping for you. Though keeping things separate initially can help us think about lifecycle management of FUSE alongside a background HgServer: e.g. what if two notebooks are running hg and each using FUSE mounted at the same location?

@manzt
Owner Author

manzt commented Feb 7, 2022

I personally like keeping the FUSE process lifecycle independent of the HgServer, so that it can be used to explicitly provide a file path mapping

Agreed. FUSE could be set up totally outside of hg.

@manzt
Owner Author

manzt commented Feb 8, 2022

FYI the fsspec-based FUSE operator is muuuuuuch slower than simple-httpfs. I played around with some of the caching, but I think I'll probably wrap simple-httpfs in a follow-up PR to replace the fsspec usage.

@pkerpedjiev is it possible to configure a single simple_httpfs.HttpFs to handle both /http/ and /https/? It would be nice to only spin up a single process instead of multiple.

@pkerpedjiev

pkerpedjiev commented Feb 8, 2022 via email

@manzt
Owner Author

manzt commented Feb 8, 2022

I'm really not sure without digging further. I just tried both and the former was painfully slow compared to the latter. Basically unusable, so I deferred to using simple-httpfs which is snappy out of the box.

I know that fsspec.HTTPFileSystem wraps an async event-loop and then provides methods that sync-ify the async implementations. I did my best to configure GlobalHTTPFileSystem similarly to simple-httpfs (LRU block cache with similar block size):

GlobalHTTPFileSystem:

Screen.Recording.2022-02-08.at.11.03.40.AM.mov

simple-httpfs

Screen.Recording.2022-02-08.at.11.00.37.AM.mov
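
For readers unfamiliar with the term, an LRU block cache of the kind mentioned above can be sketched roughly as follows. This is an illustrative toy, not the implementation used by fsspec or simple-httpfs: the LRUBlockCache class and its fetch(start, end) callback are assumptions, with fetch standing in for an HTTP range request.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Cache fixed-size blocks of a remote file, evicting the least recently used."""

    def __init__(self, fetch, block_size=2**16, max_blocks=32):
        self._fetch = fetch  # fetch(start, end) -> bytes for the range [start, end)
        self._block_size = block_size
        self._max_blocks = max_blocks
        self._blocks = OrderedDict()

    def _block(self, index):
        if index in self._blocks:
            self._blocks.move_to_end(index)  # mark as most recently used
            return self._blocks[index]
        start = index * self._block_size
        data = self._fetch(start, start + self._block_size)
        self._blocks[index] = data
        if len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)  # evict the least recently used block
        return data

    def read(self, offset, length):
        """Return `length` bytes starting at `offset`, fetching only missing blocks."""
        if length <= 0:
            return b""
        first = offset // self._block_size
        last = (offset + length - 1) // self._block_size
        data = b"".join(self._block(i) for i in range(first, last + 1))
        skip = offset - first * self._block_size
        return data[skip : skip + length]
```

Repeated small reads within the same block then cost a single fetch, which is why block size and cache capacity matter so much for FUSE workloads that issue many small reads.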
