
Input fna file of 50G, resulting prefilter of estimated 9T? #972

Open
hellopeccat opened this issue Mar 16, 2025 · 2 comments

@hellopeccat

Hi,

I started using mmseqs2 to functionally annotate genes and was surprised by how much disk space it requires. I tested the swissprot db with a concatenated fna file:

nohup mmseqs easy-search /mnt/8T_2/zuo/gene_cluster_cohort/27_genes_cohort.fna /mnt/16T_2/mmseqs_db/swissprot alnResult.m8 tmp -e 0.01 --min-seq-id 0.3 --cov-mode 2 -c 0.8

The process reported:

prefilter tmp/5432758783232164347/search_tmp/7264814417130636468/q_orfs_aa /mnt/16T_2/mmseqs_db/swissprot tmp/5432758783232164347/search_tmp/7264814417130636468/search/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 2 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3 -s 5.7 

Query database size: 797809035 type: Aminoacid
Estimated memory consumption: 4G
Target database size: 572970 type: Aminoacid
Index table k-mer threshold: 112 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 572.97K 2s 147ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 572.97K 1s 889ms
Index statistics
Entries:          197513212
DB size:          1618 MB
Avg k-mer size:   3.086144
Top 10 k-mers
    GPGGTL	1851
    GQSWTV	1705
    WGMFAT	1637
    PGVFEV	1637
    VLWQFW	1622
    AYIRPN	1586
    RSPKGV	1584
    TPHKWY	1559
    KPWFAY	1551
    ITLSPY	1540
Time for index table init: 0h 0m 5s 636ms
Hard disk might not have enough free space (717G left).The prefilter result might need up to 9T.
Process prefiltering step 1 of 1

Is there any way to reduce the disk space requirement? Such a large footprint for a mere 50G input seems unrealistic to me. Any suggestion would be greatly appreciated. /(T o T)/~~

@martin-steinegger
Member

By default the prefilter will produce up to 300 results per query, so we estimate the disk usage as 300 (--max-seqs) × number of queries × 16 bytes. 9TB is an upper bound; you could reduce --max-seqs.
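As a rough back-of-the-envelope check with the numbers from your log: 797,809,035 translated ORF queries × 300 results × 16 bytes ≈ 3.8 TB; the 9T in the warning is the tool's own, more conservative upper bound. Either way the estimate scales linearly with --max-seqs, so e.g. --max-seqs 100 would cut it to roughly a third.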

@hellopeccat
Author

Hi @martin-steinegger ,

Thanks a lot. I have read the user guide and learned about the option --max-seqs.

The option --max-seqs controls the maximum number of prefiltering results per query sequence.
For very large databases (tens of millions of sequences), it is a good advice to keep this number at reasonable values (i.e. the default value 300).

I searched the literature for a reasonable reduced value: some studies used 100 for functional annotation, and others as low as 10, though for decontamination. I have no idea how small --max-seqs can reasonably be. Besides, is it advisable to translate the fna to faa first to save disk usage?
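For the record, here is what I plan to try next. This is only a sketch under two assumptions: --max-seqs 100 is a guess borrowed from the functional-annotation studies above, and the module names and paths should be double-checked with mmseqs <module> -h for your version:

    # Re-run the search with a smaller prefilter result list per query.
    nohup mmseqs easy-search /mnt/8T_2/zuo/gene_cluster_cohort/27_genes_cohort.fna \
        /mnt/16T_2/mmseqs_db/swissprot alnResult.m8 tmp \
        -e 0.01 --min-seq-id 0.3 --cov-mode 2 -c 0.8 --max-seqs 100

    # Alternatively, translate the fna input to amino acids once up front, so the
    # ORF extraction/translation is not redone for every search with these queries.
    mmseqs createdb 27_genes_cohort.fna queryNucDB
    mmseqs extractorfs queryNucDB queryOrfDB
    mmseqs translatenucs queryOrfDB queryAaDB
    mmseqs search queryAaDB /mnt/16T_2/mmseqs_db/swissprot resultDB tmp \
        -e 0.01 --min-seq-id 0.3 --cov-mode 2 -c 0.8 --max-seqs 100
    mmseqs convertalis queryAaDB /mnt/16T_2/mmseqs_db/swissprot resultDB alnResult.m8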
