Improve results for pending pods (DO NOT MERGE YET) #77

Open · wants to merge 1 commit into base: master
29 changes: 26 additions & 3 deletions holmes/plugins/prompts/generic_ask.jinja2
For example, for deployments first run kubectl on the deployment, then a replicaset, then the pods
When investigating a pod that crashed, fetch the pod's logs with --previous so you see logs from before the crash.

Do not fetch logs for a pod that crashed with kubectl_logs; use the kubectl_previous_logs tool instead
If dealing with pending pods due to insufficient resources, run the kubectl_view_allocations tool if available before giving your final answer

If asked about problems, do not stop investigating until you reach the deepest root cause you are able to find.
Use the "five whys" methodology to find the root cause.

For example, if you found a problem in microservice A that is due to an error in microservice B, look at microservice B too and find the error there.
If there are incompatibilities between the versions of microservice A and microservice B, state the exact version on each side.
Do not give an answer like "The pod is pending", as that doesn't state why the pod is pending or how to fix it.
Do not give an answer like "Insufficient CPU" if you are able to provide more details like "0/X nodes have the required Y CPU to run this pod"

Reply with terse output. Be painfully concise. Leave out "the" and filler words when possible. Be terse, but not at the expense of leaving out important data like the root cause and how to fix it.
If there is a bash one-liner that would fix the issue, suggest it. If there is a patch to the code or YAML that would fix the issue, suggest it.

Examples:

User: Why did the webserver-example app crash?
(Call tool kubectl_find_resource kind=pod keyword=webserver)
(Call tool kubectl_previous_logs namespace=demos pod=webserver-example-1299492-d9g9d # this pod name was found from the previous tool call)

AI: `webserver-example-1299492-d9g9d` crashed due to email validation error during HTTP request for /api/create_user
Relevant logs:
```
2021-01-01T00:00:00.000Z [ERROR] Missing required field 'email' in request body
```

Validation error led to unhandled Java exception causing a crash.
Suggested fix: update create_user() in Server.java or update the client to send the email field.

--

User: What is wrong with the FooBar deployment?
(Call tool kubectl_find_resource kind=deployment keyword=foo)
(Call tool kubectl_find_resource kind=pod keyword=foo-bar)
(Call tool kubectl_describe kind=pod name=foo-bar-1299492-d9g9d namespace=demos # this pod name was found from the previous tool call)
(Call tool kubectl_view_allocations resource_type=cpu) # we called this tool even though we already had enough information to answer; calling it helped us provide more detailed numbers in the answer

AI: `foo-bar` deployment has 1 pod that cannot be scheduled.
foo-bar needs 4 CPU but no node in the cluster has 4 CPU available. Adding more nodes of the same type won't help because the maximum CPU on any node is 3 CPU.

CPU usage in the cluster:
```
(output of kubectl_view_allocations)
```
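The file changed above, generic_ask.jinja2, is a Jinja2 template, so prompt edits like these take effect when the template is rendered into the system prompt. A minimal sketch of that rendering step, using the jinja2 library (the template text and the `tool_names` variable here are illustrative placeholders, not the real file's contents):

```python
from jinja2 import Environment, BaseLoader

# Hypothetical prompt fragment modeled on generic_ask.jinja2;
# the real template lives in holmes/plugins/prompts/.
TEMPLATE = """You are a Kubernetes troubleshooting assistant.
If dealing with pending pods due to insufficient resources, run the
kubectl_view_allocations tool if available before giving your final answer.
Available tools: {{ tool_names | join(", ") }}"""

env = Environment(loader=BaseLoader(), trim_blocks=True, lstrip_blocks=True)

# Render with the tool names the agent currently has access to.
prompt = env.from_string(TEMPLATE).render(
    tool_names=["kubectl_describe", "kubectl_view_allocations"]
)
print(prompt)
```

Because the tool list is injected at render time, enabling a new toolset (like the view-allocations one below) only requires the template to mention it conditionally, not a code change.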
11 changes: 10 additions & 1 deletion holmes/plugins/toolsets/kubernetes.yaml
toolsets:
        description: "Fetch the definition of a Prometheus target"
        command: "kubectl get --raw '/api/v1/namespaces/{{prometheus_namespace}}/services/{{prometheus_service_name}}:9090/proxy/api/v1/targets' | jq '.data.activeTargets[] | select(.labels.job == \"{{ target_name }}\")'"

  - name: "kubernetes/kube-lineage"
    tools:
      - name: "kubectl_lineage"
        description: "Get all children of a Kubernetes resource, recursively, including their status"
        command: "kubectl lineage {{ kind }} {{ name }} -n {{ namespace }}"
        prerequisites:
          - command: "kubectl lineage --version"

  - name: "kubernetes/view-allocations"
    tools:
      - name: "kubectl_view_allocations"
        description: "Get a report of resource allocation, to troubleshoot insufficient resources and pending pods (resource_type can be cpu, mem, or gpu)"
        command: "kubectl view-allocations -r {{ resource_type }}"
        prerequisites:
          - command: "kubectl view-allocations --version"
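Each tool's `command` is itself a Jinja2 template, and a tool is only offered to the model when its `prerequisites` commands succeed. A hedged sketch of how a loader might render and gate such an entry — the dictionary mirrors the YAML above, but the loader functions are illustrative, not HolmesGPT's actual implementation:

```python
import subprocess
from jinja2 import Environment, BaseLoader

# Illustrative tool entry mirroring the kubernetes/view-allocations toolset.
TOOL = {
    "name": "kubectl_view_allocations",
    "command": "kubectl view-allocations -r {{ resource_type }}",
    "prerequisites": [{"command": "kubectl view-allocations --version"}],
}

def prerequisites_met(tool) -> bool:
    """A tool is usable only if every prerequisite command exits 0
    (e.g. the kubectl plugin is actually installed)."""
    for prereq in tool.get("prerequisites", []):
        try:
            subprocess.run(prereq["command"].split(), check=True,
                           capture_output=True)
        except (OSError, subprocess.CalledProcessError):
            return False
    return True

def render_command(tool, **params) -> str:
    """Substitute the model-chosen parameters into the command template."""
    env = Environment(loader=BaseLoader())
    return env.from_string(tool["command"]).render(**params)

print(render_command(TOOL, resource_type="cpu"))
```

The prerequisite gate is what makes the prompt's "run the kubectl_view_allocations tool if available" phrasing safe: on clusters without the plugin, the tool simply never appears in the rendered tool list.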