Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed how index reads and non-index reads are identified in error m… #122

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

nkongenelly
Copy link
Contributor

@nkongenelly nkongenelly commented Feb 24, 2025

Ticket: DATAOPS-999

Updated the terminology in CheckQC so that we refer to all reads and indexes as their “read number”, and then specify if it is an index or not. This would correspond to the way InterOp stats are displayed in MultiQC. Example: A Paried-End run with dual index would have: Read 1, Read 2 (I), Read 3 (I) and Read 4.

@nkongenelly nkongenelly force-pushed the DATAOPS-999_reads_and_index_reads branch from 9951f0e to 646203f Compare February 24, 2025 10:41
Copy link
Collaborator

@Aratz Aratz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link

codecov bot commented Feb 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93%. Comparing base (df787ed) to head (a516659).
Report is 63 commits behind head on master.

Additional details and impacted files
@@          Coverage Diff           @@
##           master   #122    +/-   ##
======================================
+ Coverage      92%    93%    +1%     
======================================
  Files          22     33    +11     
  Lines        1088   1313   +225     
======================================
+ Hits         1000   1215   +215     
- Misses         88     98    +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nkongenelly
Copy link
Contributor Author

Thanks for the review. I have also converted other string formatting to f-string in these 2 files

@nkongenelly
Copy link
Contributor Author

Nice! Could you also apply the same changes to https://github.com/Molmed/checkQC/blob/master/checkQC/qc_checkers/error_rate.py ?

Oh, I didn't realise I could make this change here in this branch. I will update this then update you when done.

@nkongenelly
Copy link
Contributor Author

@Aratz , I have now completed the changes and the code is ready for your review.

Copy link
Collaborator

@Aratz Aratz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thank you for addressing my comments 🙏

@matrulda
Copy link
Collaborator

matrulda commented Mar 4, 2025

Great work, Nelly!

As of today, error rate is never computed for index reads, i.e. error rate for a index read will be NAN. The error rate is based on comparing the (non-indexed) reads of a known sequence (PhiX) to a reference. This is done by the Illumina software that creates the Interop files. PhiX can be indexed (SNP&SEQ does not use that currently), but in those cases I'm still quite certain that no error rate would be calculated for them since the sequence is so short. However, we do not know if this will change in the future, so I think it still makes sense for the handler to take them into account. However, just reading the code, it would look like error rate is expected for index reads. Do you think this should be communicated somehow? Or would it be better to just skip index reads in the error rate handler / qc checker? To be clear, I think the current implementation works as well, I just wonder from a developer's point of view if this should be addressed.

Also, to be clear, index reads are expected to have a Q30 value, so this concern only applies to the error rate handler/checker.

Sorry for not realizing this sooner! I should have added this information to the ticket.

@nkongenelly
Copy link
Contributor Author

nkongenelly commented Mar 6, 2025

Great work, Nelly!

As of today, error rate is never computed for index reads, i.e. error rate for a index read will be NAN. The error rate is based on comparing the (non-indexed) reads of a known sequence (PhiX) to a reference. This is done by the Illumina software that creates the Interop files. PhiX can be indexed (SNP&SEQ does not use that currently), but in those cases I'm still quite certain that no error rate would be calculated for them since the sequence is so short. However, we do not know if this will change in the future, so I think it still makes sense for the handler to take them into account. However, just reading the code, it would look like error rate is expected for index reads. Do you think this should be communicated somehow? Or would it be better to just skip index reads in the error rate handler / qc checker? To be clear, I think the current implementation works as well, I just wonder from a developer's point of view if this should be addressed.

Also, to be clear, index reads are expected to have a Q30 value, so this concern only applies to the error rate handler/checker.

Sorry for not realizing this sooner! I should have added this information to the ticket.

Oooh, this is great information. I think maybe we can provide this information as a comment

@nkongenelly
Copy link
Contributor Author

Great work, Nelly!

As of today, error rate is never computed for index reads, i.e. error rate for a index read will be NAN. The error rate is based on comparing the (non-indexed) reads of a known sequence (PhiX) to a reference. This is done by the Illumina software that creates the Interop files. PhiX can be indexed (SNP&SEQ does not use that currently), but in those cases I'm still quite certain that no error rate would be calculated for them since the sequence is so short. However, we do not know if this will change in the future, so I think it still makes sense for the handler to take them into account. However, just reading the code, it would look like error rate is expected for index reads. Do you think this should be communicated somehow? Or would it be better to just skip index reads in the error rate handler / qc checker? To be clear, I think the current implementation works as well, I just wonder from a developer's point of view if this should be addressed.

Also, to be clear, index reads are expected to have a Q30 value, so this concern only applies to the error rate handler/checker.

Sorry for not realizing this sooner! I should have added this information to the ticket.

@Aratz , what do you think about this? Will having a comment be enough to explain that error rate is not expected for index reads? Or is it better to do away with index reads for error_rate handler?

@Aratz
Copy link
Collaborator

Aratz commented Mar 10, 2025

From what I understand, with the current technology it doesn't make sense to talk about error rate for index reads, so I think the code should reflect that and ignore them too. Otherwise it might create some unnecessary confusion.

@nkongenelly
Copy link
Contributor Author

@matrulda , I have now removed index reads from error rate handler.

Copy link
Collaborator

@matrulda matrulda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, added some comments. Let me know if you think.

@@ -163,8 +163,7 @@ def run(self):
self._send_to_subscribers(("error_rate",
{"lane": lane+1,
"read": read_nbr+1,
"error_rate": error_rate,
"is_index_read":is_index_read}))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to keep this, but in the handler skip reads where is_index_read is true.
That makes it extra clear that the handler is not supposed to handle index reads.


return [
qc_report
for lane, lane_data in qc_data.sequencing_metrics.items()
for read, read_data in lane_data["reads"].items()
if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read, read_data["is_index"]))
if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, can we make sure that reads where is_index_read = True are skipped?

@nkongenelly
Copy link
Contributor Author

Thanks @matrulda for the review. i have now pushed the updated code

@@ -49,6 +49,7 @@ def _qualify_error(error, lane, read):
for lane, lane_data in qc_data.sequencing_metrics.items()
for read, read_data in lane_data["reads"].items()
if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, for performance reasons, it would be better to write the conditional in the opposite order:

if not read_data["is_index"] and (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read)

This way the qc_report is only computed when is_index is false, see https://en.wikipedia.org/wiki/Short-circuit_evaluation for details

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed 👍

Copy link
Collaborator

@matrulda matrulda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some final comments :)

else:
continue
# error_rate handler is not supposed to handle index reads.
if not is_index_read:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing this, I think it would be cleaner to handle it at the start of the for loop (line 46). Maybe something like this:

for error_dict in filter(lambda x: not x["is_index_read"], self.error_results):

@@ -49,6 +49,7 @@ def _qualify_error(error, lane, read):
for lane, lane_data in qc_data.sequencing_metrics.items()
for read, read_data in lane_data["reads"].items()
if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed 👍

@nkongenelly
Copy link
Contributor Author

Wooow! I have learnt a lot from this. Thank you @Aratz and @matrulda for looking into this and for the great tips.

@matrulda , I have now pushed the updated code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants