I started using MMseqs2 to functionally annotate genes and was surprised by how much disk space it requires. I tested against the Swiss-Prot database with a concatenated fna file using nohup mmseqs easy-search /mnt/8T_2/zuo/gene_cluster_cohort/27_genes_cohort.fna /mnt/16T_2/mmseqs_db/swissprot alnResult.m8 tmp -e 0.01 --min-seq-id 0.3 --cov-mode 2 -c 0.8.
The process reported:
prefilter tmp/5432758783232164347/search_tmp/7264814417130636468/q_orfs_aa /mnt/16T_2/mmseqs_db/swissprot tmp/5432758783232164347/search_tmp/7264814417130636468/search/pref_0 --sub-mat 'aa:blosum62.out,nucl:nucleotide.out' --seed-sub-mat 'aa:VTML80.out,nucl:nucleotide.out' -k 0 --target-search-mode 0 --k-score seq:2147483647,prof:2147483647 --alph-size aa:21,nucl:5 --max-seq-len 65535 --max-seqs 300 --split 0 --split-mode 2 --split-memory-limit 0 -c 0.8 --cov-mode 2 --comp-bias-corr 1 --comp-bias-corr-scale 1 --diag-score 1 --exact-kmer-matching 0 --mask 1 --mask-prob 0.9 --mask-lower-case 0 --min-ungapped-score 15 --add-self-matches 0 --spaced-kmer-mode 1 --db-load-mode 0 --pca substitution:1.100,context:1.400 --pcb substitution:4.100,context:5.800 --threads 96 --compressed 0 -v 3 -s 5.7
Query database size: 797809035 type: Aminoacid
Estimated memory consumption: 4G
Target database size: 572970 type: Aminoacid
Index table k-mer threshold: 112 at k-mer size 6
Index table: counting k-mers
[=================================================================] 572.97K 2s 147ms
Index table: Masked residues: 0
Index table: fill
[=================================================================] 572.97K 1s 889ms
Index statistics
Entries: 197513212
DB size: 1618 MB
Avg k-mer size: 3.086144
Top 10 k-mers
GPGGTL 1851
GQSWTV 1705
WGMFAT 1637
PGVFEV 1637
VLWQFW 1622
AYIRPN 1586
RSPKGV 1584
TPHKWY 1559
KPWFAY 1551
ITLSPY 1540
Time for index table init: 0h 0m 5s 636ms
Hard disk might not have enough free space (717G left). The prefilter result might need up to 9T.
Process prefiltering step 1 of 1
Is there any way to reduce the disk space requirement? Such a large footprint for merely ~50G of input seems unrealistic to me. Any suggestion would be greatly appreciated. /(T o T)/~~
By default the prefilter produces up to 300 results per query, so we estimate the disk usage as 300 (--max-seqs) × number of queries × 16 bytes. 9TB is an upper bound; you could reduce --max-seqs.
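The formula above can be sketched as a back-of-the-envelope calculation. This is only an illustration of the maintainer's estimate (function and variable names are made up, not part of MMseqs2), and it will not reproduce the exact 9T figure from the log, which presumably includes additional overhead:

```python
# Rough upper bound on MMseqs2 prefilter disk usage:
# max-seqs hits per query, ~16 bytes per hit.
# Names here are illustrative, not part of the MMseqs2 API.

def prefilter_upper_bound_bytes(num_queries, max_seqs=300, bytes_per_hit=16):
    """Upper bound in bytes on the prefilter result size."""
    return num_queries * max_seqs * bytes_per_hit

num_queries = 797_809_035  # "Query database size" from the log above
for max_seqs in (300, 100, 10):
    tb = prefilter_upper_bound_bytes(num_queries, max_seqs) / 1e12
    print(f"--max-seqs {max_seqs:>3}: ~{tb:.1f} TB")
```

Since the estimate is linear in --max-seqs, dropping it from 300 to 100 cuts the bound to a third, and to 10 cuts it to a thirtieth.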
Thanks a lot. I have read the user guide and learned about the --max-seqs option:
The option --max-seqs controls the maximum number of prefiltering results per query sequence.
For very large databases (tens of millions of sequences), it is good advice to keep this number at a reasonable value (i.e. the default value 300).
I searched the literature for a reasonable reduced value; some studies used 100 for functional annotation, or even 10 for decontamination. I have no idea how small --max-seqs can safely be. Also, would translating the fna to faa beforehand help save disk space?