UI - display no of passed/skipped/failed Nemesis runs #597

Open
mykaul opened this issue Feb 26, 2025 · 5 comments
Labels: Argus, enhancement

Comments

mykaul commented Feb 26, 2025

This is what I see today - pass or fail per run:

Image

It'd be much more useful to get a clearer picture from the overview: how many passed/failed/skipped. That would let me weigh how important it is to investigate a run.

Moreover, having the 1st Nemesis run would be even more helpful: if I see it's the same Nemesis, I may decide to de-prioritize the investigation.

mykaul added the Argus and enhancement labels Feb 26, 2025

fruch (Contributor) commented Feb 26, 2025

I'm not sure I understand what you want here.

This panel was introduced for the sake of selecting a set of jobs for further investigation.

I don't see the point of putting more details into it.

mykaul (Author) commented Feb 26, 2025

> I'm not sure I understand what you want here.
>
> This panel was introduced for the sake of selecting a set of jobs for further investigation.
>
> I don't see the point of putting more details into it.

Since we are not doing a great job at investigating all failures, we need to prioritize. This (along with runtime information, btw) would help me do that:

  1. I'd start with all failed executions that took 15 minutes or less - an infra / SCT issue most likely.
  2. I'd continue with those that have a single failure.
  3. I'll skip those that look similar (see my screenshot above - 599-603 all have the exact same failure)
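The three steps above can be sketched as a small ordering function. This is only an illustration under assumptions: the `Run` fields, the `triage_order` name, and the idea of comparing runs by their set of failed nemesis names are hypothetical and are not taken from Argus's actual data model.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Run:
    # Hypothetical fields for illustration; not Argus's real schema.
    build_number: int
    failed: bool
    duration: timedelta
    failed_nemeses: list[str]  # names of the nemeses that failed in this run


SHORT_RUN = timedelta(minutes=15)


def triage_order(runs: list[Run]) -> list[Run]:
    """Return failed runs in a suggested investigation order."""
    seen_signatures: set[tuple[str, ...]] = set()
    short, single, rest = [], [], []
    for run in sorted(runs, key=lambda r: r.build_number):
        if not run.failed:
            continue
        # Step 1: short failures first (most likely an infra / SCT issue).
        # They are not de-duplicated, since they may have no nemesis data.
        if run.duration <= SHORT_RUN:
            short.append(run)
            continue
        # Step 3: skip runs whose failed nemeses match an earlier run.
        signature = tuple(sorted(run.failed_nemeses))
        if signature in seen_signatures:
            continue
        seen_signatures.add(signature)
        # Step 2: single-failure runs come before multi-failure runs.
        (single if len(run.failed_nemeses) == 1 else rest).append(run)
    return short + single + rest


runs = [
    Run(599, True, timedelta(hours=2), ["NemesisA"]),
    Run(600, True, timedelta(hours=2), ["NemesisA"]),
    Run(601, True, timedelta(minutes=10), []),
    Run(602, False, timedelta(hours=3), []),
    Run(603, True, timedelta(hours=3), ["NemesisA", "NemesisB"]),
]
print([r.build_number for r in triage_order(runs)])  # [601, 599, 603]
```

Matching on the sorted names of failed nemeses is a crude stand-in for "looks similar"; a real comparison would probably need the error events themselves.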

fruch (Contributor) commented Feb 26, 2025

> > I'm not sure I understand what you want here.
> >
> > This panel was introduced for the sake of selecting a set of jobs for further investigation.
> >
> > I don't see the point of putting more details into it.
>
> Since we are not doing a great job at investigating all failures, we need to prioritize. This (along with runtime information, btw) would help me do that:
>
>   1. I'd start with all failed executions that took 15 minutes or less - an infra / SCT issue most likely.

You asked for that in a different issue; I don't see how it's relevant to what is asked in this issue.

>   2. I'd continue with those that have a single failure.

In most cases you'll have more than one error event; I'm not sure how this metric helps to triage.

>   3. I'll skip those that look similar (see my screenshot above - 599-603 all have the exact same failure)

As for "look similar", we are working on something that can help classify events as having happened in other runs; once it's operational, we might be able to show indications of it.

As for where this kind of thing should be shown, I'm not sure; maybe a widget with a table would be a better fit for this kind of requirement.

mykaul (Author) commented Feb 27, 2025

> > > I'm not sure I understand what you want here.
> > > This panel was introduced for the sake of selecting a set of jobs for further investigation.
> > > I don't see the point of putting more details into it.
> >
> > Since we are not doing a great job at investigating all failures, we need to prioritize. This (along with runtime information, btw) would help me do that:
> >
> >   1. I'd start with all failed executions that took 15 minutes or less - an infra / SCT issue most likely.
>
> You asked for that in a different issue; I don't see how it's relevant to what is asked in this issue.

I did. It did not happen (yet?), so I'm asking for alternatives, which do not contradict each other, btw.

> >   2. I'd continue with those that have a single failure.
>
> In most cases you'll have more than one error event; I'm not sure how this metric helps to triage.

More than a single Nemesis failure?
Then show me the first. Or the last. Or just a number if it's >1. I'll know that there's more work to analyze that one than the others.
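A minimal sketch of what such a per-run summary could look like, assuming the failed nemesis names are available as an ordered list. The `failure_summary` helper and the nemesis names are hypothetical, for illustration only:

```python
def failure_summary(failed_nemeses: list[str]) -> str:
    """Show the first failed nemesis, plus a count when there are more."""
    if not failed_nemeses:
        return "no failed nemesis"
    first = failed_nemeses[0]
    extra = len(failed_nemeses) - 1
    # A bare name signals a single failure; "(+N more)" signals more work.
    return first if extra == 0 else f"{first} (+{extra} more)"


print(failure_summary(["NemesisA"]))                          # NemesisA
print(failure_summary(["NemesisA", "NemesisB", "NemesisC"]))  # NemesisA (+2 more)
```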

> >   3. I'll skip those that look similar (see my screenshot above - 599-603 all have the exact same failure)
>
> As for "look similar", we are working on something that can help classify events as having happened in other runs; once it's operational, we might be able to show indications of it.

That's great. I was hoping we can have AI helping us here.

> As for where this kind of thing should be shown, I'm not sure; maybe a widget with a table would be a better fit for this kind of requirement.

Yes, I'm open to better UI suggestions.

k0machi (Collaborator) commented Feb 27, 2025

I could experiment with adding this information both to the cards and to the selector. We could fit a simple counter: v 15 / x 3 for nemeses, and a duration field: took 15 minutes. I think it should look nice and not clutter things too much.
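For reference, the two labels could be produced along these lines. This is a rough sketch, not the Argus implementation: the function names are made up, and the `s` marker for skipped nemeses is an assumption, since only the `v` / `x` form is mentioned above.

```python
from datetime import timedelta


def nemesis_counter(passed: int, failed: int, skipped: int = 0) -> str:
    """Format a compact counter such as 'v 15 / x 3'."""
    parts = [f"v {passed}", f"x {failed}"]
    if skipped:
        parts.append(f"s {skipped}")  # hypothetical marker for skipped
    return " / ".join(parts)


def duration_label(duration: timedelta) -> str:
    """Format a run duration such as 'took 15 minutes'."""
    total_minutes = int(duration.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    if hours:
        return f"took {hours}h {minutes}m"
    return f"took {minutes} minute{'' if minutes == 1 else 's'}"


print(nemesis_counter(15, 3))                # v 15 / x 3
print(duration_label(timedelta(minutes=15)))  # took 15 minutes
```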


3 participants