Network error rate is too high

How to troubleshoot network errors

Classification:

Public

What to do if NetworkErrorRateTooHigh or NetworkCheckErrorRateTooHigh

Check for cilium errors

Check the following recipe to see if there are any cilium errors.

Check machine network connection

To quickly check whether the machine itself has network issues, run:

while true; do echo -n "`date +'%F_%H%M%S%Z'` ";curl https://github.com -o /dev/null --silent -w '%{http_code}\n'; sleep 1; done

Every response should be 200 when the network is healthy. A code of 000 means curl received no HTTP response at all, which indicates a general network issue. Once you are reasonably certain that the network issue also affects the machine, continue debugging from there.
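When the result has to be pasted into an incident channel, a bounded run with a summary can be easier to read than an endless loop. The following is a minimal sketch of that idea, not part of the original recipe: the function names probe and summarize_codes, the 10 second timeout, and the output format are all assumptions made for illustration.

```shell
# probe COUNT [URL]: print one HTTP status code per line, one request per second.
# The body runs in a subshell "( ... )" so its variables do not leak into the caller.
probe() (
  count="$1"
  url="${2:-https://github.com}"
  i=0
  while [ "$i" -lt "$count" ]; do
    # --max-time is an added assumption: it stops a hung connection from stalling the loop.
    curl "$url" -o /dev/null --silent --max-time 10 -w '%{http_code}\n'
    sleep 1
    i=$((i + 1))
  done
)

# summarize_codes: read status codes on stdin and count 200s versus everything else.
summarize_codes() {
  awk '{ total++; if ($1 == "200") ok++ }
       END { printf "ok=%d failed=%d total=%d\n", ok, total - ok, total }'
}
```

For example, `probe 30 | summarize_codes` sends 30 requests and prints a single summary line. Any non-zero failed count is a reason to keep investigating the machine.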

What to do if NetworkCheckErrorRateTooHigh or DNSCheckErrorRateTooHigh

A NetworkCheckErrorRateTooHigh or DNSCheckErrorRateTooHigh alert means that net-exporter is not able to connect to the Kubernetes API service.

Check the logs in net-exporter pods to confirm.

If the error is related to looking up kubernetes service, it’s likely a problem with cilium or coredns. If it’s a problem related to looking up giantswarm.io, it’s likely a problem with external connectivity or coredns.
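The log check and the triage rule above could be sketched roughly as follows. This is an illustration under several assumptions: the namespace kube-system and the label selector app=net-exporter may not match your installation (both can be overridden through the hypothetical NET_EXPORTER_NAMESPACE and NET_EXPORTER_SELECTOR variables), and the exact log format of net-exporter is not known here, so the filter only looks for the substrings "error", "giantswarm.io" and "kubernetes".

```shell
# net_exporter_logs: print recent log lines from the net-exporter pods.
# Namespace and selector defaults are assumptions; override them if they differ.
net_exporter_logs() {
  kubectl logs --namespace "${NET_EXPORTER_NAMESPACE:-kube-system}" \
    --selector "${NET_EXPORTER_SELECTOR:-app=net-exporter}" \
    --tail=200 --prefix
}

# triage_errors: read log lines on stdin and count the error lines by lookup target.
# Lines mentioning giantswarm.io point at external connectivity or coredns;
# lines mentioning kubernetes point at cilium or coredns.
triage_errors() {
  awk '
    tolower($0) ~ /error/ {
      if (index($0, "giantswarm.io")) ext++
      else if (index($0, "kubernetes")) svc++
      else other++
    }
    END { printf "kubernetes_service=%d giantswarm_io=%d other=%d\n", svc, ext, other }
  '
}
```

Running `net_exporter_logs | triage_errors` gives a rough count per category. Treat it as a first hint only and read the matching log lines before drawing conclusions.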

Get the pods with errors in Prometheus:

rate(network_error_total[10m]) > 0 for NetworkCheckErrorRateTooHigh

rate(dns_error_total[10m]) > 0 for DNSCheckErrorRateTooHigh
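If the Prometheus UI is not at hand, the same queries can be sent to the Prometheus HTTP API with curl. The sketch below assumes that Prometheus is reachable at a URL you pass in (for example through a port-forward); the function names error_query and prom_query are hypothetical helpers, not part of any existing tooling.

```shell
# error_query ALERT: print the PromQL expression this recipe gives for the alert.
error_query() {
  case "$1" in
    NetworkCheckErrorRateTooHigh) echo 'rate(network_error_total[10m]) > 0' ;;
    DNSCheckErrorRateTooHigh)     echo 'rate(dns_error_total[10m]) > 0' ;;
    *) echo "unknown alert: $1" >&2; return 1 ;;
  esac
}

# prom_query PROMETHEUS_URL ALERT: run the expression as an instant query.
# The response is the raw JSON returned by /api/v1/query.
prom_query() (
  prom_url="$1"
  query="$(error_query "$2")" || exit 1
  # --get with --data-urlencode lets curl handle the URL encoding of the expression.
  curl --silent --get --data-urlencode "query=${query}" "${prom_url}/api/v1/query"
)
```

For example, `prom_query http://localhost:9090 DNSCheckErrorRateTooHigh` returns the matching series, where http://localhost:9090 is a placeholder address. The labels on each returned series identify the affected pods.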

If the error appears at the same time in many pods of the cluster, the API may be taking a long time to respond to requests. Check the logs of the Kubernetes API pods for more information.

If there are no logs related to the request, it may just be a networking problem between the pod and the API. You can verify this by trying to curl the cluster IP of the kubernetes service.
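A minimal sketch of that curl check follows. It makes several assumptions: the helper names api_url and probe_api are hypothetical, the /version path is used only because it normally answers without credentials, and --insecure is set because the goal is to test reachability rather than to validate the certificate. Inside a pod, Kubernetes exposes the service address through the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables, which the sketch uses as defaults.

```shell
# api_url IP [PORT]: build the URL to probe on the kubernetes service.
api_url() {
  echo "https://${1}:${2:-443}/version"
}

# probe_api [IP] [PORT]: print the HTTP status code returned by the API.
# Run this from inside the affected pod or from its node.
# Without arguments it falls back to the in-pod environment variables.
probe_api() {
  curl --insecure --silent --max-time 5 -o /dev/null -w '%{http_code}\n' \
    "$(api_url "${1:-$KUBERNETES_SERVICE_HOST}" "${2:-${KUBERNETES_SERVICE_PORT:-443}}")"
}
```

From a workstation, the cluster IP can be looked up with `kubectl get service kubernetes --namespace default -o jsonpath='{.spec.clusterIP}'` and then passed to probe_api. Any HTTP status, including 401 or 403, shows that the API is reachable over the network, while 000 means no response was received.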

Last modified December 9, 2024: Fix metadata and index for ops recipes (#302) (b50ecd3)