Comparing taxonomic assignment across merging methods #2091

Open
alexis-roy5 opened this issue Mar 5, 2025 · 1 comment

Comments

@alexis-roy5

Hi @benjjneb,
We have a project that involves the merge step of the dada2 pipeline. We wanted to evaluate which read-merging strategy yields the best taxonomic assignments, so that we can make an informed choice between them. We evaluated three approaches: default merging, forward reads only, and justConcatenate.

Here we used a dataset of 376 samples of ITS1-ITS2 amplicons from the apple tree phyllosphere, sequenced on a MiSeq in 300x300 paired-end mode. We realise our reads won't always overlap because ITS amplicons vary in length. Looking at the number of ASVs and unique taxa at various ranks, we found it interesting that the Forward method had a lower ASV count than the Concatenated method but assigned more species, genera, families, orders, and classes, followed closely by the Concatenated method.

| nb.ASV| nb.species| nb.genus| nb.family| nb.order| nb.class| nb.phylum|TaxTable     |
|------:|----------:|--------:|---------:|--------:|--------:|---------:|:------------|
|   7880|       1404|     1135|       477|      178|       59|        13|forward      |
|   5602|       1148|      943|       421|      163|       55|        12|merged       |
|  12976|       1339|     1060|       447|      171|       57|        13|concatenated |
|  13051|       1233|      996|       434|      168|       57|        13|rescued      |

We think there are more ASVs with Concat because this method could create different ASVs from the same true biological sequence if the reads are not trimmed in exactly the same way, which seems possible. Nonetheless, it is surprising that the Fwd method yields a higher taxonomic diversity.

To visualize the overall agreement of assignments, we made Venn diagrams using the unique taxa identified at different taxonomic ranks:

[Image: Venn diagrams of unique taxa per method at different taxonomic ranks]

Most interesting is the fact that the Fwd and Concat methods each find several unique genera or families that the other does not. Naturally, we wonder which is most likely to be the right answer.
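The comparison behind the Venn diagrams comes down to set operations on the unique taxon labels each method produces. A minimal sketch in Python, assuming the labels have been exported as plain lists; the genus lists below are invented for illustration and are not our data, and `compare_taxa` is a hypothetical helper:

```python
def compare_taxa(a, b):
    """Return (shared, only_a, only_b) for two collections of taxon labels."""
    a, b = set(a), set(b)
    return a & b, a - b, b - a


# Hypothetical genus labels, for illustration only (not the real results).
fwd_genera = ["Aureobasidium", "Cladosporium", "Alternaria", "Vishniacozyma"]
concat_genera = ["Aureobasidium", "Cladosporium", "Filobasidium"]

shared, only_fwd, only_concat = compare_taxa(fwd_genera, concat_genera)
print("shared:", sorted(shared))
print("Fwd only:", sorted(only_fwd))
print("Concat only:", sorted(only_concat))
```

Listing the method-specific labels this way makes it possible to look up how many reads support each of them, rather than only counting them.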

The following shows the percentage of ASVs in each sample that were assigned a label at various taxonomic ranks:

[Image: percentage of ASVs per sample assigned a label at each taxonomic rank, by method]

As you can see, the concatenation method has an overall higher assignment rate than the other methods. It is higher, but is it better?
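For clarity on what this percentage measures, here is a rough Python sketch of the per-sample calculation at one rank. It also computes a read-weighted version, which is not what the figure shows but would indicate how much of the difference comes from low-abundance ASVs. The function name, the ASV names, and the counts are all made up for illustration.

```python
def percent_assigned(counts, labels):
    """Percent of ASVs, and percent of reads, assigned at one rank in one sample.

    counts: dict mapping ASV id -> read count in the sample
    labels: dict mapping ASV id -> taxon label at the rank, or None if unassigned
    """
    present = [asv for asv, n in counts.items() if n > 0]
    if not present:
        return 0.0, 0.0
    assigned = [asv for asv in present if labels.get(asv) is not None]
    total_reads = sum(counts[asv] for asv in present)
    pct_asvs = 100.0 * len(assigned) / len(present)
    pct_reads = 100.0 * sum(counts[asv] for asv in assigned) / total_reads
    return pct_asvs, pct_reads


# Hypothetical sample: ASV4 is absent from this sample and is ignored.
sample = {"ASV1": 900, "ASV2": 90, "ASV3": 10, "ASV4": 0}
genus = {"ASV1": "Aureobasidium", "ASV2": None,
         "ASV3": "Cladosporium", "ASV4": "Alternaria"}

pct_asvs, pct_reads = percent_assigned(sample, genus)
print(f"{pct_asvs:.1f}% of ASVs assigned, {pct_reads:.1f}% of reads assigned")
```

In this toy sample two of the three ASVs present are assigned, yet those two carry 910 of the 1000 reads, so the two percentages can diverge considerably.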

Of note, we also tested this on two other independent datasets (another ITS and a trnL experiment) with similar findings.

Here are the questions that come to mind when looking at all of this.
First, why do you think the Fwd method identifies more Genera (or Families) than Concat, given that it essentially uses a subset of the information contained in the Concat method? Could it be that Fwd is prone to lower accuracy and somehow recalls taxa that aren’t really there?

Second, we are puzzled by the fact that the Concat and Fwd methods each find a fairly high number of Genera or Families that the other doesn't find. Can you think of a reason for that? Which would you trust more, Concat or Forward?

Finally, given that Concat will preserve and append some reads that should have been overlapped and merged, the resulting sequences will contain duplicate information for some portion of the amplicon. Could this have an impact on taxonomic assignment? Maybe it causes false-positive assignments and explains the higher number of ASVs for this method.
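To make the duplication and trimming concerns concrete, here is a toy Python sketch (not dada2 code). It assumes concatenation joins the forward read and the reverse-complemented reverse read with a spacer of 10 Ns, which is our understanding of justConcatenate; the amplicon, the read lengths, and the helper names are invented, and the overlap is assumed known and error-free.

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def concatenate(fwd, rev, spacer="N" * 10):
    """Join forward read and reverse-complemented reverse read with a spacer."""
    return fwd + spacer + revcomp(rev)


def merge(fwd, rev, overlap):
    """Merge two reads given a known, error-free overlap length."""
    rc = revcomp(rev)
    assert fwd[-overlap:] == rc[:overlap], "reads do not overlap as assumed"
    return fwd + rc[overlap:]


amplicon = "ACGTTGCAAGGCTTACCGATGCATTGACCGTA"  # hypothetical 32 bp amplicon
rev = revcomp(amplicon[12:])                   # reverse read covers the last 20 bp

# Two forward reads from the same amplicon, truncated at different lengths.
fwd_20, fwd_18 = amplicon[:20], amplicon[:18]

merged_20 = merge(fwd_20, rev, overlap=8)
merged_18 = merge(fwd_18, rev, overlap=6)
concat_20 = concatenate(fwd_20, rev)
concat_18 = concatenate(fwd_18, rev)

# Bases present twice in the concatenated sequence (spacer excluded).
duplicated = len(concat_20) - 10 - len(amplicon)

print(merged_20 == merged_18 == amplicon)  # merging recovers one sequence
print(concat_20 == concat_18)              # concatenation gives two sequences
print(duplicated)                          # size of the duplicated overlap
```

In this toy case merging collapses both read pairs to the same sequence, while concatenation yields two distinct sequences, each carrying the overlapping region twice.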

We are looking forward to reading your thoughts on this. Thanks in advance for your time. Cheers!

Note: The filterAndTrim parameters were maxEE = c(4, 4), truncQ = 2, minLen = 100. We used Cutadapt to remove primers.

@benjjneb
Owner

benjjneb commented Mar 5, 2025

I would use forward reads alone.

Concatenation keeps more bases, so it will result in a higher rate of taxonomic assignments. The quality of those additional assignments is unclear, and my guess is that much of what you are observing at the ASV/taxa level is driven by very low-abundance variants.

More unique taxonomic assignments != better description of the measured community.
