
Input fna file of 50G, resulting prefilter of estimated 9T? #972

Open
hellopeccat opened this issue Mar 16, 2025 · 2 comments

@hellopeccat

Hi,

I started using mmseqs2 to functionally annotate genes and was surprised by how much disk space it requires. I tested the swissprot db with a concatenated fna file:

nohup mmseqs easy-search /mnt/8T_2/zuo/gene_cluster_cohort/27_genes_cohort.fna /mnt/16T_2/mmseqs_db/swissprot alnResult.m8 tmp -e 0.01 --min-seq-id 0.3 --cov-mode 2 -c 0.8

The process reported:

prefilter tmp/5432758783232164347/search_tmp/7264814417130636468/q_orfs_aa /mnt/16T_2/mmseqs_db/swissprot tmp/5432758783232164347/search_tmp/7264814417130636468/search/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 2 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3 -s 5.7 

Query database size: 797809035 type: Aminoacid
Estimated memory consumption: 4G
Target database size: 572970 type: Aminoacid
Index table k-mer threshold: 112 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 572.97K 2s 147ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 572.97K 1s 889ms
Index statistics
Entries:          197513212
DB size:          1618 MB
Avg k-mer size:   3.086144
Top 10 k-mers
    GPGGTL	1851
    GQSWTV	1705
    WGMFAT	1637
    PGVFEV	1637
    VLWQFW	1622
    AYIRPN	1586
    RSPKGV	1584
    TPHKWY	1559
    KPWFAY	1551
    ITLSPY	1540
Time for index table init: 0h 0m 5s 636ms
Hard disk might not have enough free space (717G left).The prefilter result might need up to 9T.
Process prefiltering step 1 of 1

Is there any way to reduce the disk space requirement? Such a large footprint for a mere 50G input seems unrealistic to me. Any suggestion would be greatly appreciated. /(T o T)/~~

@martin-steinegger
Member

By default the prefilter will produce up to 300 results per query, so we estimate the disk usage as 300 (--max-seqs) × number of queries × 16 bytes. 9TB is an upper bound; you could reduce --max-seqs.
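As a rough back-of-the-envelope check with the numbers from your log: 797,809,035 translated ORF queries × 300 results × 16 bytes ≈ 3.8 TB; the 9T in the warning is the tool's own, more conservative upper bound. Either way the estimate scales linearly with --max-seqs, so e.g. --max-seqs 100 would cut it to roughly a third.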

@hellopeccat
Author

Hi @martin-steinegger ,

Thanks a lot. I have read the user guide and learned about the option --max-seqs.

The option --max-seqs controls the maximum number of prefiltering results per query sequence.
For very large databases (tens of millions of sequences), it is a good advice to keep this number at reasonable values (i.e. the default value 300).

I searched the literature for a reasonable reduced value: some studies used 100 for functional annotation, and others as low as 10, though for decontamination. I have no idea how small --max-seqs can reasonably be. Besides, is it advisable to translate the fna to faa first to save disk usage?
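For the record, here is what I plan to try next. This is only a sketch under two assumptions: --max-seqs 100 is a guess borrowed from the functional-annotation studies above, and the module names and paths should be double-checked with mmseqs <module> -h for your version:

    # Re-run the search with a smaller prefilter result list per query.
    nohup mmseqs easy-search /mnt/8T_2/zuo/gene_cluster_cohort/27_genes_cohort.fna \
        /mnt/16T_2/mmseqs_db/swissprot alnResult.m8 tmp \
        -e 0.01 --min-seq-id 0.3 --cov-mode 2 -c 0.8 --max-seqs 100

    # Alternatively, translate the fna input to amino acids once up front, so the
    # ORF extraction/translation is not redone for every search with these queries.
    mmseqs createdb 27_genes_cohort.fna queryNucDB
    mmseqs extractorfs queryNucDB queryOrfDB
    mmseqs translatenucs queryOrfDB queryAaDB
    mmseqs search queryAaDB /mnt/16T_2/mmseqs_db/swissprot resultDB tmp \
        -e 0.01 --min-seq-id 0.3 --cov-mode 2 -c 0.8 --max-seqs 100
    mmseqs convertalis queryAaDB /mnt/16T_2/mmseqs_db/swissprot resultDB alnResult.m8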
