Changed how index reads and non-index reads are identified in error m… #122

nkongenelly · 2025-02-24T10:40:18Z

Updated the terminology in CheckQC so that we refer to all reads and index reads by their “read number”, and then specify whether each one is an index read or not. This corresponds to the way InterOp stats are displayed in MultiQC. Example: A Paired-End run with dual index would have: Read 1, Read 2 (I), Read 3 (I) and Read 4.
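The naming scheme described above can be sketched roughly as follows. This is an illustration only, not CheckQC's actual code: `format_read_label` and `index_flags` are hypothetical names introduced for the example.

```python
def format_read_label(read_number, is_index_read):
    """Return 'Read N (I)' for an index read and 'Read N' otherwise."""
    suffix = " (I)" if is_index_read else ""
    return f"Read {read_number}{suffix}"


# Paired-end run with dual index: reads 2 and 3 are the index reads.
# (Hypothetical input; the real flags would come from the run's InterOp data.)
index_flags = [False, True, True, False]
labels = [
    format_read_label(number, is_index)
    for number, is_index in enumerate(index_flags, start=1)
]
print(", ".join(labels))  # Read 1, Read 2 (I), Read 3 (I), Read 4
```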

…essage

Aratz

Nice! Could you also apply the same changes to https://github.com/Molmed/checkQC/blob/master/checkQC/qc_checkers/error_rate.py ?

checkQC/handlers/error_rate_handler.py

codecov · 2025-02-27T12:24:23Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93%. Comparing base (df787ed) to head (a516659).
Report is 63 commits behind head on master.

Additional details and impacted files

@@          Coverage Diff           @@
##           master   #122    +/-   ##
======================================
+ Coverage      92%    93%    +1%     
======================================
  Files          22     33    +11     
  Lines        1088   1313   +225     
======================================
+ Hits         1000   1215   +215     
- Misses         88     98    +10

nkongenelly · 2025-02-27T12:24:39Z

Thanks for the review. I have also converted the other string formatting to f-strings in these 2 files.

nkongenelly · 2025-02-27T12:43:12Z

Nice! Could you also apply the same changes to https://github.com/Molmed/checkQC/blob/master/checkQC/qc_checkers/error_rate.py ?

Oh, I didn't realise I could make this change here in this branch. I will update this then update you when done.

nkongenelly · 2025-02-27T12:58:06Z

@Aratz , I have now completed the changes and the code is ready for your review.

Aratz

Perfect, thank you for addressing my comments 🙏

matrulda · 2025-03-04T07:43:22Z

Great work, Nelly!

As of today, error rate is never computed for index reads, i.e. the error rate for an index read will be NaN. The error rate is based on comparing the (non-indexed) reads of a known sequence (PhiX) to a reference. This is done by the Illumina software that creates the InterOp files. PhiX can be indexed (SNP&SEQ does not use that currently), but in those cases I'm still quite certain that no error rate would be calculated for them since the sequence is so short. However, we do not know if this will change in the future, so I think it still makes sense for the handler to take them into account. On the other hand, just reading the code, it would look like error rate is expected for index reads. Do you think this should be communicated somehow? Or would it be better to just skip index reads in the error rate handler / qc checker? To be clear, I think the current implementation works as well, I just wonder from a developer's point of view if this should be addressed.

Also, to be clear, index reads are expected to have a Q30 value, so this concern only applies to the error rate handler/checker.

Sorry for not realizing this sooner! I should have added this information to the ticket.
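As background on how a NaN error rate behaves in Python: every ordering comparison against NaN evaluates to False, so a threshold check never triggers for such a value. The snippet below is a generic sketch, not CheckQC's threshold logic; the threshold value and the form of the comparison are assumptions made for illustration.

```python
import math

error_rate = float("nan")  # the value an index read is described as reporting
threshold = 1.5            # hypothetical error-rate threshold

# Ordering comparisons with NaN are always False, so a check of the form
# "error_rate > threshold" never fires for a NaN value.
exceeds_threshold = error_rate > threshold

# NaN has to be detected explicitly, for example with math.isnan.
is_missing = math.isnan(error_rate)

print(exceeds_threshold, is_missing)  # False True
```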

nkongenelly · 2025-03-06T15:08:00Z

Oooh, this is great information. I think maybe we can provide this information as a comment

nkongenelly · 2025-03-10T09:57:52Z

@Aratz , what do you think about this? Will having a comment be enough to explain that error rate is not expected for index reads? Or is it better to do away with index reads for error_rate handler?

Aratz · 2025-03-10T11:31:55Z

From what I understand, with the current technology it doesn't make sense to talk about error rate for index reads, so I think the code should reflect that and ignore them too. Otherwise it might create some unnecessary confusion.

…t expected

nkongenelly · 2025-03-10T13:17:04Z

@matrulda , I have now removed index reads from error rate handler.

matrulda

Looks good, added some comments. Let me know what you think.

matrulda · 2025-03-13T14:43:20Z

checkQC/parsers/interop_parser.py

@@ -163,8 +163,7 @@ def run(self):
                self._send_to_subscribers(("error_rate",
                                        {"lane": lane+1, 
                                         "read": read_nbr+1, 
-                                         "error_rate": error_rate,
-                                         "is_index_read":is_index_read}))


I think it would make sense to keep this, but in the handler skip reads where is_index_read is true.
That makes it extra clear that the handler is not supposed to handle index reads.

matrulda · 2025-03-13T14:44:47Z

checkQC/qc_checkers/error_rate.py


    return [
        qc_report
        for lane, lane_data in qc_data.sequencing_metrics.items()
        for read, read_data in lane_data["reads"].items()
-        if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read, read_data["is_index"]))
+        if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read))


Same here, can we make sure that reads where is_index_read = True are skipped?

nkongenelly · 2025-03-13T15:27:55Z

Thanks @matrulda for the review. I have now pushed the updated code.

Aratz · 2025-03-13T15:42:10Z

checkQC/qc_checkers/error_rate.py

@@ -49,6 +49,7 @@ def _qualify_error(error, lane, read):
        for lane, lane_data in qc_data.sequencing_metrics.items()
        for read, read_data in lane_data["reads"].items()
        if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read))


Actually, for performance reasons, it would be better to write the conditional in the opposite order:

if not read_data["is_index"] and (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read))

This way the qc_report is only computed when is_index is false, see https://en.wikipedia.org/wiki/Short-circuit_evaluation for details
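A small self-contained sketch of the short-circuit behaviour suggested here. `qualify_error` is a hypothetical stand-in for `_qualify_error`, and the sample data and the 1.0 cut-off are made up; the call log shows that the function is never invoked for the index read.

```python
calls = []


def qualify_error(error, lane, read):
    """Hypothetical stand-in for _qualify_error; records every call."""
    calls.append((lane, read))
    # Made-up rule: report any error rate above 1.0.
    return f"lane {lane} read {read}: error rate {error}" if error > 1.0 else None


# Made-up data for a single lane; read 2 is an index read.
reads = {
    1: {"mean_error_rate": 2.0, "is_index": False},
    2: {"mean_error_rate": float("nan"), "is_index": True},
    3: {"mean_error_rate": 0.5, "is_index": False},
}

reports = [
    qc_report
    for read, read_data in reads.items()
    # "and" evaluates its right operand only when the left one is truthy,
    # so qualify_error is skipped entirely for index reads.
    if not read_data["is_index"]
    and (qc_report := qualify_error(read_data["mean_error_rate"], 1, read))
]

print(calls)    # [(1, 1), (1, 3)]
print(reports)  # ['lane 1 read 1: error rate 2.0']
```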

matrulda

Some final comments :)

matrulda · 2025-03-14T08:04:01Z

checkQC/handlers/error_rate_handler.py

-            else:
-                continue
+            # error_rate handler is not supposed to handle index reads.
+            if not is_index_read:


Instead of doing this, I think it would be cleaner to handle it at the start of the for loop (line 46). Maybe something like this:

for error_dict in filter(lambda x: not x["is_index_read"], self.error_results):
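A runnable sketch of this suggestion, using made-up data shaped like the dictionaries the parser sends (`lane`, `read`, `error_rate`, `is_index_read`). `error_results` is a plain list here rather than the handler attribute, and `handled_reads` is a hypothetical name used only to show which entries reach the loop body.

```python
# Made-up sample data; in the handler this would be self.error_results.
error_results = [
    {"lane": 1, "read": 1, "error_rate": 0.4, "is_index_read": False},
    {"lane": 1, "read": 2, "error_rate": float("nan"), "is_index_read": True},
    {"lane": 1, "read": 3, "error_rate": 0.7, "is_index_read": False},
]

handled_reads = []
# filter() yields only the entries for which the lambda returns True,
# so index reads never enter the loop body.
for error_dict in filter(lambda x: not x["is_index_read"], error_results):
    handled_reads.append(error_dict["read"])

print(handled_reads)  # [1, 3]
```

Filtering in the loop header keeps the body free of an extra `if`/`continue` branch, which is the readability gain the comment is after.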

matrulda · 2025-03-14T08:08:19Z

checkQC/qc_checkers/error_rate.py

@@ -49,6 +49,7 @@ def _qualify_error(error, lane, read):
        for lane, lane_data in qc_data.sequencing_metrics.items()
        for read, read_data in lane_data["reads"].items()
        if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read))


Agreed 👍

nkongenelly · 2025-03-14T12:30:38Z

Wooow! I have learnt a lot from this. Thank you @Aratz and @matrulda for looking into this and for the great tips.

@matrulda , I have now pushed the updated code

nkongenelly force-pushed the DATAOPS-999_reads_and_index_reads branch from 9951f0e to 646203f on February 24, 2025 10:41

Changed how index reads and non-index reads are identified in error m…

646203f

…essage

Aratz reviewed Feb 26, 2025

Changed format to f-string in errorr_rate and q_30 handler

8d1b752

Updated qc_checkers/error_rate file to show if index reads

c24929c

Bumped up version

69f89ee

Aratz approved these changes Feb 28, 2025

DATAOPS-999: Removed index reads in error rate handler as they are no…

1e06cd7

…t expected

matrulda reviewed Mar 13, 2025

Ensuring error_rate handler skips index reads

32b7e6c

Aratz reviewed Mar 13, 2025

matrulda reviewed Mar 14, 2025

Optimised code after review

a516659
