General proposal feedback #2

Draft
wants to merge 1 commit into
base: main

@trvrb trvrb commented Mar 19, 2025

Not intended for merging, but I thought that a PR would be the easiest way to surface comments.

@trvrb trvrb requested a review from matsen March 19, 2025 23:38

@matsen matsen left a comment

Thanks, Trevor!

We will provide proof of concept by prioritizing alignments based on structural relevance and novelty. Starting with our viral datasets, we will gradually expand to include additional sequences. First, we'll focus on FoldSeek clusters containing relevant viral proteins, then move to other clusters that are closely related according to structural similarity metrics such as TM-score and the local distance difference test (lDDT). We hypothesize that including these related structures will improve performance according to our metrics. We will track performance improvements as we incorporate additional MSAs.

_TB: The beauty of ESM is that you can train one giant model on all proteins and then query this model. If I'm following the proposal, you would still need buckets and there would be an "HA" model and a "spike" model. So I'm assuming the outputs would be a bunch of distinctly trained models? Each model on a different FoldSeek cluster? You'd have hundreds or thousands of small(ish) models? Some clarity here would be helpful._

Owner

No, we'd have one big model. I have modified this as follows:

If it proves useful to add related sequence alignments to our model training for viral proteins, we will take this work to its logical conclusion and develop a single model for all proteins.

...

We will provide proof of concept by prioritizing alignments to add as training data to this single model based on structural relevance and novelty.

I just don't want to get to the end of a big ESM-scale training exercise and realize that it didn't actually help with our original viral training goals, or that we should have done something different. This strategy will allow us to iterate.
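
The iterative strategy above (add alignments in priority order, then check the viral metrics) can be sketched as a simple ranking step. This is only an illustration: the proposal does not say how structural relevance and novelty are combined, so the weighted sum, the `CandidateCluster` fields, and the cluster names are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateCluster:
    name: str        # hypothetical cluster label
    tm_score: float  # structural similarity to the viral target, in [0, 1]
    novelty: float   # assumed: fraction of sequences not yet in training, in [0, 1]


def prioritize(clusters, relevance_weight=0.5):
    """Order clusters from highest to lowest priority.

    The weighted sum is an assumed scoring rule, not the proposal's.
    """
    def score(cluster):
        return (relevance_weight * cluster.tm_score
                + (1 - relevance_weight) * cluster.novelty)

    return sorted(clusters, key=score, reverse=True)


candidates = [
    CandidateCluster("cluster_a", tm_score=0.9, novelty=0.1),
    CandidateCluster("cluster_b", tm_score=0.7, novelty=0.8),
    CandidateCluster("cluster_c", tm_score=0.3, novelty=0.9),
]
ordered = [c.name for c in prioritize(candidates)]
print(ordered)
```

Alignments would then be added to the training set in this order, with the viral evaluation metrics re-checked after each batch.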

The EvolutionaryScale startup company, which develops the ESM class of protein language models, is, in a sense, a competitor. They use masked language modeling, as does the rest of the community, and I suspect their models would perform better on functional tasks if they used a DASM-like framework.

_TB: I don't know if you want to mention Evo and genome language models? I understand that you're restricting yourself to proteins and mutation/selection in protein sequences, but does mutational input matter for DNA prediction? I'd think it would._

Owner

Interesting idea but feels well beyond my knowledge base!

@@ -7,6 +7,7 @@ Evolution is fundamentally a process of mutation and selection, yet current prot
## Success with Antibodies

Our work with antibodies demonstrates the importance of this approach. We've shown that antibody language models like AbLang2 estimate amino acids encoded by codon neighbors as two orders of magnitude more likely than non-neighbors, and that their predictions show clear effects of neutral mutation probability (Figures 4A and 4B). These factors negatively impact functional prediction (Figure 4C). Because current models do not have access to nucleotide sequence, they must learn mutation-level processes implicitly and will always be deficient.
_TB: In reading this as a proposal, I'm confused by the figure hierarchy. I'm used to being able to get a sense of a proposal / paper by looking at the figures, but the ordering in main.pdf doesn't seem to match the logical ordering. Here in Vision you start by referencing Figure 4, and you don't reference Figure 2 (which would seem to be the closest to a summary figure for your new approach). I'd suggest ordering figures by first use in the text, which is how a reader is going to encounter them._

Owner

Hm, Figure 1 summarizes the DASM approach. Figure 2 is the next step and is referred to in this markdown document. The other figures are supporting. So I think that order makes sense.

You are right that they are out of order according to references. Part of that is the page limit. I can't put Figure 4 first without needing another page.
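
For readers unfamiliar with the "codon neighbor" effect mentioned in the Success with Antibodies paragraph, here is a minimal sketch. It assumes that "codon neighbors" means amino acids reachable from the current codon by a single nucleotide substitution under the standard genetic code; the function name is hypothetical and this is not code from the DASM project.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def neighbor_amino_acids(codon):
    """Amino acids one nucleotide substitution away from `codon`.

    Excludes the codon's own amino acid and stop codons.
    """
    current = CODON_TABLE[codon]
    neighbors = set()
    for position in range(3):
        for base in BASES:
            if base == codon[position]:
                continue
            mutated = codon[:position] + base + codon[position + 1:]
            aa = CODON_TABLE[mutated]
            if aa != current and aa != "*":
                neighbors.add(aa)
    return neighbors


print(sorted(neighbor_amino_acids("TGG")))  # tryptophan codon
```

A model that only sees amino acids has no way to know which substitutions fall in this set, because synonymous codons for the same residue can have different neighbors.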

@@ -12,6 +12,7 @@ Also, the protein embeddings for DASMs, free from the confounding effects of mut
## DASMs for viral evolution

Evolutionary analysis of viral sequences has led to insights about viral adaptation, but conclusions are limited because evolutionary models give overall inferences for entire sequence alignments. For a given virus, one can make per-site selection statements with sufficient data, but one cannot learn per-sequence, per-site selection using existing methods (Figure 2, top). In other words, existing methods do not account for epistasis.
_TB: I basically agree with the sentence as written. My reaction is that you can get at epistasis by counting mutations on different backgrounds, i.e., Hugh's work calculating Bloom/Neher fitness effects on different clades. This requires bucketing variation, however._

Owner

Right. Bucketing reduces the resolution of the model.
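
The bucketing trade-off in this thread can be illustrated with a toy count. This sketch only tallies raw mutation counts per clade. It is not the Bloom/Neher fitness estimate (which compares observed counts to expected ones), and the clade and mutation labels are made up.

```python
from collections import Counter, defaultdict


def count_by_background(observations):
    """Tally mutation occurrences separately for each clade (background)."""
    counts = defaultdict(Counter)
    for clade, mutation in observations:
        counts[clade][mutation] += 1
    return counts


# Hypothetical (clade, mutation) observations.
observations = [
    ("clade_1", "A10T"), ("clade_1", "A10T"), ("clade_1", "G20S"),
    ("clade_2", "G20S"), ("clade_2", "G20S"), ("clade_2", "G20S"),
]

pooled = Counter(mutation for _, mutation in observations)
per_clade = count_by_background(observations)

print(pooled)
print({clade: dict(counts) for clade, counts in per_clade.items()})
```

The pooled tally hides that `A10T` appears only on the `clade_1` background, while the per-clade tally shows it at the cost of fewer observations per bucket. Per-sequence resolution would require a bucket of size one, which counting cannot support.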
