using MMseqs2 to get the totally different proteins from one dataset compared to another dataset #973

Open
liyue9129 opened this issue Mar 21, 2025 · 2 comments


@liyue9129

Hi!

I want to use MMseqs2 to obtain protein sequences from the query dataset that are completely dissimilar to the protein sequences in the target dataset (e.g., with a similarity threshold of 0.3). What should I do? Can I achieve this goal using the following code:

```
mmseqs search queryDB targetDB resultDB tmp
mmseqs filterresult queryDB targetDB resultDB resultDB0.3 --max-seq-id 0.3
mmseqs createtsv queryDB targetDB resultDB0.3 resultDB0.3.tsv
```

The results I obtained with the above code only show query proteins that fall below the threshold against certain proteins in the target dataset (resultDB0.3.txt), which is confusing to me. Did I make a mistake?

Thank you!
Best wishes!

@RPINerd

RPINerd commented Mar 21, 2025

I was coming to the issues section just now to ask almost exactly the same question!

Like OP, I've got a queryDB and a targetDB.
I ran `mmseqs search queryDB targetDB resultDB ./tmp` and am now trying to figure out how to extract everything from queryDB that is NOT in resultDB.

Looking through the documentation on the structure of the database, I can sort of see how things are linked together, but not a clear way to pick out things that are not in resultDB.

In my case, these are FASTA entries, so I guess I could brute-force it by converting resultDB to a *.m8 file, then parsing it together with all the input sequences from queryDB. But that is a massively intensive and inefficient process, and I hope we can find an integrated way to achieve this!
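
For reference, the brute-force route described above amounts to a set difference between all query IDs and the query IDs that appear in the hit table. A minimal sketch with standard tools follows; the file names and toy contents are made up for illustration, and it assumes the hit table is tab-separated with the query ID in its first column.

```shell
# Toy inputs (made-up IDs and values): a FASTA file and an m8-style hit table.
# Only the first column of the hit table (the query ID) matters here.
printf '>q1\nMKV\n>q2\nMAA\n>q3\nMLL\n' > toy_query.fasta
printf 'q1\tt9\t0.45\nq3\tt2\t0.38\n' > toy_result.m8

# All query IDs: header lines with the leading '>' and any description removed.
grep '^>' toy_query.fasta | sed 's/^>//; s/[[:space:]].*$//' | sort -u > toy_all_ids.txt

# Query IDs that have at least one hit.
cut -f1 toy_result.m8 | sort -u > toy_hit_ids.txt

# Lines found only in the first list, i.e. queries without any hit.
comm -23 toy_all_ids.txt toy_hit_ids.txt > toy_no_hit_ids.txt
cat toy_no_hit_ids.txt
```

Both ID lists have to be sorted the same way for `comm` to work, which is why each goes through `sort -u` first.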

@milot-mirdita
Member

The easiest way is to take all result entries that are empty and create a new database from them:

```
mmseqs search queryDB targetDB resultDB tmp --min-seq-id 0.3
awk '$3 == 1' resultDB.index > no_hits_queries.tsv
mmseqs createsubdb no_hits_queries.tsv queryDB query_subset
mmseqs convert2fasta query_subset query_subset.fasta
```

resultDB contains an empty entry (entry length of 1 byte, which is the third column of resultDB.index) for every query that did not have any hits in the target database. The awk filter selects exactly those entries.
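
To see what the `$3 == 1` filter picks out without running a search, here is a small sketch on a hand-written stand-in for an index file. The file names and values are made up; it only assumes the three tab-separated columns key, offset, and length.

```shell
# Toy stand-in for resultDB.index (made-up values): key, offset, length.
printf '0\t0\t152\n1\t152\t1\n2\t153\t97\n3\t250\t1\n' > toy_resultDB.index

# Keep rows whose third column (entry length) is 1, i.e. empty result entries.
awk '$3 == 1' toy_resultDB.index > toy_no_hits_queries.tsv

# Show the selected rows; the first column holds the keys of queries without hits.
cat toy_no_hits_queries.tsv
```

In this toy file the rows with keys 1 and 3 are kept, since they are the only ones with a length of 1.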
