-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changed how index reads and non-index reads are identified in error m… #122
base: master
Are you sure you want to change the base?
Changed how index reads and non-index reads are identified in error m… #122
Conversation
9951f0e
to
646203f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Could you also apply the same changes to https://github.com/Molmed/checkQC/blob/master/checkQC/qc_checkers/error_rate.py ?
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #122 +/- ##
======================================
+ Coverage 92% 93% +1%
======================================
Files 22 33 +11
Lines 1088 1313 +225
======================================
+ Hits 1000 1215 +215
- Misses 88 98 +10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Thanks for the review. I have also converted other string formatting to f-string in these 2 files |
Oh, I didn't realise I could make this change here in this branch. I will update this then update you when done. |
@Aratz , I have now completed the changes and the code is ready for your review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect, thank you for addressing my comments 🙏
Great work, Nelly! As of today, error rate is never computed for index reads, i.e. error rate for a index read will be NAN. The error rate is based on comparing the (non-indexed) reads of a known sequence (PhiX) to a reference. This is done by the Illumina software that creates the Interop files. PhiX can be indexed (SNP&SEQ does not use that currently), but in those cases I'm still quite certain that no error rate would be calculated for them since the sequence is so short. However, we do not know if this will change in the future, so I think it still makes sense for the handler to take them into account. However, just reading the code, it would look like error rate is expected for index reads. Do you think this should be communicated somehow? Or would it be better to just skip index reads in the error rate handler / qc checker? To be clear, I think the current implementation works as well, I just wonder from a developer's point of view if this should be addressed. Also, to be clear, index reads are expected to have a Q30 value, so this concern only applies to the error rate handler/checker. Sorry for not realizing this sooner! I should have added this information to the ticket. |
Oooh, this is great information. I think maybe we can provide this information as a comment |
@Aratz , what do you think about this? Will having a comment be enough to explain that error rate is not expected for index reads? Or is it better to do away with index reads for error_rate handler? |
From what I understand, with the current technology it doesn't make sense to talk about error rate for index reads, so I think the code should reflect that and ignore them too. Otherwise it might create some unnecessary confusion. |
@matrulda , I have now removed index reads from error rate handler. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, added some comments. Let me know if you think.
checkQC/parsers/interop_parser.py
Outdated
@@ -163,8 +163,7 @@ def run(self): | |||
self._send_to_subscribers(("error_rate", | |||
{"lane": lane+1, | |||
"read": read_nbr+1, | |||
"error_rate": error_rate, | |||
"is_index_read":is_index_read})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make sense to keep this, but in the handler skip reads where is_index_read is true.
That makes it extra clear that the handler is not supposed to handle index reads.
checkQC/qc_checkers/error_rate.py
Outdated
|
||
return [ | ||
qc_report | ||
for lane, lane_data in qc_data.sequencing_metrics.items() | ||
for read, read_data in lane_data["reads"].items() | ||
if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read, read_data["is_index"])) | ||
if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, can we make sure that reads where is_index_read = True are skipped?
Thanks @matrulda for the review. i have now pushed the updated code |
checkQC/qc_checkers/error_rate.py
Outdated
@@ -49,6 +49,7 @@ def _qualify_error(error, lane, read): | |||
for lane, lane_data in qc_data.sequencing_metrics.items() | |||
for read, read_data in lane_data["reads"].items() | |||
if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, for performance reasons, it would be better to write the conditional in the opposite order:
if not read_data["is_index"] and (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read)
This way the qc_report is only computed when is_index
is false, see https://en.wikipedia.org/wiki/Short-circuit_evaluation for details
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some final comments :)
else: | ||
continue | ||
# error_rate handler is not supposed to handle index reads. | ||
if not is_index_read: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing this, I think it would be cleaner to handle it at the start of the for loop (line 46). Maybe something like this:
for error_dict in filter(lambda x: not x["is_index_read"], self.error_results):
checkQC/qc_checkers/error_rate.py
Outdated
@@ -49,6 +49,7 @@ def _qualify_error(error, lane, read): | |||
for lane, lane_data in qc_data.sequencing_metrics.items() | |||
for read, read_data in lane_data["reads"].items() | |||
if (qc_report := _qualify_error(read_data["mean_error_rate"], lane, read)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed 👍
Ticket: DATAOPS-999
Updated the terminology in CheckQC so that we refer to all reads and indexes as their “read number”, and then specify if it is an index or not. This would correspond to the way InterOp stats are displayed in MultiQC. Example: A Paried-End run with dual index would have: Read 1, Read 2 (I), Read 3 (I) and Read 4.