using MMseqs2 to get the totally different proteins from one dataset compared to another dataset #973

liyue9129 · 2025-03-21T13:00:35Z

Hi!

I want to use MMseqs2 to obtain protein sequences from the query dataset that are completely dissimilar to the protein sequences in the target dataset (e.g., with a similarity threshold of 0.3). What should I do? Can I achieve this goal using the following code:

```
mmseqs search queryDB targetDB resultDB tmp

mmseqs filterresult queryDB targetDB resultDB resultDB0.3 --max-seq-id 0.3

mmseqs createtsv queryDB targetDB resultDB0.3 resultDB0.3.tsv
```

The results I obtained with the above commands (resultDB0.3.txt) only show the query proteins that fall below the threshold against certain proteins in the target dataset, which is confusing to me. Did I make a mistake?

Thank you!
Best wishes!

RPINerd · 2025-03-21T17:20:07Z

I was coming to the issues section just now to ask almost exactly the same question!

Like OP, I've got a queryDB and a targetDB.
I ran `mmseqs search queryDB targetDB resultDB ./tmp` and am now trying to figure out how to extract everything from queryDB that is NOT in resultDB.

Looking through the documentation on the structure of the database, I can sort of see how things are linked together, but I don't see a clear way to pick out the entries that are not in the result DB.

In my case, these are FASTA entries, so I guess I could brute-force it by converting resultDB to a *.m8 file and then parsing it together with all the input sequences from queryDB, but that is a massively intensive and inefficient process. I hope we can find an integrated way to achieve this!
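For reference, the brute-force route described above amounts to a set difference between all query IDs and the query IDs that appear in the hit table. The following is only a rough sketch on made-up data: the file names, IDs, and values are hypothetical, and the mock *.m8 table is truncated to three columns. The only thing it relies on is that the first column holds the query ID.

```shell
# Mock inputs; file names and contents are made up for illustration.
# all_queries.txt: one query ID per line (e.g. taken from the FASTA headers).
printf 'protA\nprotB\nprotC\nprotD\n' > all_queries.txt
# hits.m8: tab-separated hit table; column 1 is the query ID.
printf 'protA\ttarget1\t0.85\nprotC\ttarget7\t0.42\nprotA\ttarget3\t0.55\n' > hits.m8

# Collect the unique query IDs that have at least one hit.
cut -f1 hits.m8 | sort -u > queries_with_hits.txt

# Keep the query IDs that never appear in the hit list
# (-F fixed strings, -x whole-line match, -v invert, -f read patterns from file).
grep -vxFf queries_with_hits.txt all_queries.txt > queries_without_hits.txt

cat queries_without_hits.txt
```

On real data the two intermediate files scale with the number of queries and hits, which is the overhead this approach carries.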

milot-mirdita · 2025-03-24T05:53:33Z

The easiest way is to take all result entries that are empty and create a new database from them:

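```
# Search, keeping only hits with at least 0.3 sequence identity
mmseqs search queryDB targetDB resultDB tmp --min-seq-id 0.3
# Column 3 of the index is the entry length; 1 byte means an empty entry
awk '$3 == 1' resultDB.index > no_hits_queries.tsv
# Build a sub-database from those queries and export it as FASTA
mmseqs createsubdb no_hits_queries.tsv queryDB query_subset
mmseqs convert2fasta query_subset query_subset.fasta
```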

resultDB will contain an empty entry (an entry length of 1 byte in resultDB.index) for every query that did not have any hits in the target database.
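To see what the awk filter selects without running a search, here is a minimal sketch on a made-up index file. It assumes the three-column index layout of key, offset, and length in bytes. The file names and all values are hypothetical.

```shell
# Mock index file: key <TAB> offset <TAB> length (bytes).
# Entries 1 and 3 have length 1, i.e. they are empty (no hits).
printf '0\t0\t154\n1\t154\t1\n2\t155\t78\n3\t233\t1\n' > mock_resultDB.index

# Keep only the index lines whose third column (entry length) equals 1.
awk '$3 == 1' mock_resultDB.index > mock_no_hits.tsv

# Show the keys of the queries without hits.
cut -f1 mock_no_hits.tsv
```

The selected lines keep all three columns. That matches the pipeline above, where the filtered index lines are passed directly to `createsubdb`.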
