General proposal feedback #2

trvrb · 2025-03-19T23:38:45Z

Not intended for merging, but I thought that a PR would be the easiest way to surface comments.

matsen

Thanks, Trevor!

matsen · 2025-03-20T09:34:33Z

expanded/experimental-design-750-words.md

+We will provide proof of concept by prioritizing alignments based on structural relevance and novelty.  Starting with our viral datasets, we will gradually expand to include additional sequences.  First, we'll focus on FoldSeek clusters containing relevant viral proteins, then move to other clusters that are closely related according to structural similarity metrics such as TM-score and local distance difference test.  We hypothesize that including these related structures will improve performance according to our metrics.  We will track performance improvements as we incorporate additional MSAs.
+
+_TB: The beauty of ESM is that you can train one giant model on all proteins and then query this model. If I'm following the proposal, you would still need buckets and there would be an "HA" model and a "spike" model. So I'm assuming the outputs would be a bunch of distinctly trained models? Each model on a different FoldSeek cluster? You'd have hundreds or thousands of small(ish) models? Some clarity here would be helpful._


No, we'd have one big model. I have modified this as follows:

If it proves useful to add related sequence alignments to our model training for viral proteins, we will take this work to its logical conclusion and develop a single model for all proteins.

...

We will provide proof of concept by prioritizing alignments to add as training data to this single model based on structural relevance and novelty.

I just don't want to get to the end of a big ESM-scale training exercise and realize that it didn't actually help with our original viral training goals, or that we should have done something different. This strategy will allow us to iterate.

matsen · 2025-03-20T09:35:15Z

expanded/related-goals-250-words.md

+The EvolutionaryScale startup company, which develops the ESM class of protein language models, is in a sense a competitor.  They use masked language modeling as does the rest of the community, and I suspect their models would perform better on functional tasks if they used a DASM-like framework.
+
+_TB: I don't know if you want to mention Evo and genome language models? I understand that you're restricting yourself to proteins and mutation/selection in protein sequences, but does mutational input matter for DNA prediction? I'd think it would._


Interesting idea but feels well beyond my knowledge base!

matsen · 2025-03-20T09:41:51Z

expanded/vision-500-words.md

@@ -7,6 +7,7 @@ Evolution is fundamentally a process of mutation and selection, yet current prot
 ## Success with Antibodies

 Our work with antibodies demonstrates the importance of this approach.  We've shown that antibody language models like AbLang2 estimate amino acids coded by codon neighbors as being two orders more likely than non-neighbors, and show clear effects of neutral mutation probability (Figures 4A and 4B).  These factors negatively impact functional prediction (Figure 4C).  Because current models must implicitly learn mutation-level processes, they will always be deficient because they do not have access to nucleotide sequence.
+_TB: In reading this as a proposal, I'm confused about the figure hierarchy. I'm used to being able to get a sense of a proposal / paper by looking at the figures, but the ordering in main.pdf doesn't seem to match the logical ordering. Here in Vision you start by referencing Figure 4, and you don't reference Figure 2 (which would seem to be the closest to a summary figure for your new approach). I'd suggest ordering figures by first use in the text, in the order a reader will encounter them._


Hm, Figure 1 summarizes the DASM approach. Figure 2 is the next step and is referred to in this markdown document. The other figures are supporting. So I think that order makes sense.

You are right that they are out of order according to references. Part of that is the page limit. I can't put Figure 4 first without needing another page.
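
An illustrative aside on the "codon neighbors" notion in the quoted Vision passage: the sketch below is not code from the proposal. It assumes "codon neighbor" means an amino acid reachable by a single nucleotide substitution under the standard genetic code, and the function name `codon_neighbors` is hypothetical.

```python
from itertools import product

# Assumption: standard genetic code (NCBI translation table 1), with codons
# enumerated in T, C, A, G order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_neighbors(codon):
    """Amino acids reachable from `codon` by exactly one nucleotide change.

    Hypothetical helper for illustration; stop codons and synonymous
    changes are excluded from the result.
    """
    original = CODON_TABLE[codon]
    neighbors = set()
    for pos, base in product(range(3), BASES):
        if base == codon[pos]:
            continue  # not a substitution
        mutant = codon[:pos] + base + codon[pos + 1:]
        aa = CODON_TABLE[mutant]
        if aa != "*" and aa != original:
            neighbors.add(aa)
    return neighbors


# Tryptophan (TGG) can reach only five other amino acids in one step.
print(sorted(codon_neighbors("TGG")))
```

Under this definition, every other amino acid needs at least two nucleotide changes, which is the kind of mutation-level asymmetry the passage says a protein-only model has to learn implicitly.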

matsen · 2025-03-20T09:44:41Z

expanded/examples-500-words.md

@@ -12,6 +12,7 @@ Also, the protein embeddings for DASMs, free from the confounding effects of mut
 ## DASMs for viral evolution

 Evolutionary analysis of viral sequences has led to insights about viral adaptation, but conclusions are limited because evolutionary models give overall inferences for entire sequence alignments.  For a given virus, one can make per-site selection statements with sufficient data, but cannot learn per-sequence per-site using existing methods (Figure 2, top).  In other words, existing methods do not account for epistasis.
+_TB: I basically agree with the sentence as written. My reaction is that you can get at epistasis by counting mutations on different backgrounds, e.g., Hugh's work calculating Bloom/Neher fitness effects on different clades. This requires bucketing variation, however._


Right. Bucketing reduces the resolution of the model.

General proposal feedback

59e2715

trvrb requested a review from matsen March 19, 2025 23:38

matsen reviewed Mar 20, 2025
