How to manage an incident FAQ

FAQ

  1. What are the P1/P2 severities?

For simplicity, there are only two severities: critical (P1) and routine (P2). Critical incidents are those that impair a customer production system. Routine incidents are all other incidents, which don’t impact production and follow a straightforward process.

  1. What do the different roles mean? When should they be used? What does setting them do?

There are two roles: the Incident Coordinator, who acts as communicator and driver of the incident, and the Operation Engineer(s), who are responsible for investigating the problem. For non-critical incidents (P2), an Incident Coordinator is not needed and the Operation Engineer fills both roles.

More info
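The severity and role rules described so far can be summarized in a short sketch. It is only an illustration of the rules stated in this FAQ, not part of any tooling; the names `Severity`, `classify` and `required_roles` are hypothetical.

```python
from enum import Enum


class Severity(Enum):
    P1 = "critical"
    P2 = "routine"


def classify(impairs_customer_production: bool) -> Severity:
    """An incident is critical only if it impairs a customer production system."""
    return Severity.P1 if impairs_customer_production else Severity.P2


def required_roles(severity: Severity) -> list[str]:
    """P1 needs both roles; for P2 the Operation Engineer covers both."""
    if severity is Severity.P1:
        return ["Incident Coordinator", "Operation Engineer"]
    return ["Operation Engineer"]


# Example: an incident that does not touch customer production.
severity = classify(impairs_customer_production=False)
roles = required_roles(severity)
```

Here `severity` is `Severity.P2` and `roles` contains only the Operation Engineer.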

  1. What is the correct way to contact a customer if needed?

Every declared incident includes a link to the Customer Escalation Matrix, which redirects you to the customer's intranet page.

  1. When is it appropriate to escalate or page another area's on-caller? How do you do that?

In critical incidents (P1), the Incident Coordinator decides together with the Operation Engineer when someone else is needed. In a routine incident (P2), the Operation Engineer decides when to escalate, based on the impact and their knowledge of the problem. To escalate, use /inc escalate (or the button) in the incident channel and select the team or person you want to target.

Note: To call for an Incident Coordinator, select the incident_coordinators group in the Who I need? popup window when you escalate through incident.io.

  1. How do I silence an alert out of hours?

Silences are managed in a single repo, and each one is a special Custom Resource. Create a new resource by copying an existing one and modifying the cluster/installation names. If it is the middle of the night, don’t hesitate to merge without approval. More info.
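The copy-and-modify step can be pictured with a minimal sketch. It assumes the resource is loaded as a nested dictionary; the field names (`metadata`, `spec`, `matchers`, `cluster`, `installation`) and all values are hypothetical placeholders, so check an existing silence in the repo for the real schema.

```python
import copy

# Hypothetical stand-in for an existing silence resource; the real schema
# in the silences repo may differ.
EXISTING_SILENCE = {
    "metadata": {"name": "example-alert-cluster-a"},
    "spec": {
        "matchers": [
            {"name": "alertname", "value": "ExampleAlert"},
            {"name": "cluster", "value": "cluster-a"},
            {"name": "installation", "value": "install-a"},
        ]
    },
}


def derive_silence(existing, name, cluster, installation):
    """Return a copy of `existing` retargeted at another cluster/installation."""
    new = copy.deepcopy(existing)  # leave the template untouched
    new["metadata"]["name"] = name
    overrides = {"cluster": cluster, "installation": installation}
    for matcher in new["spec"]["matchers"]:
        # Only the cluster/installation matchers change; the alert stays the same.
        if matcher["name"] in overrides:
            matcher["value"] = overrides[matcher["name"]]
    return new


new_silence = derive_silence(
    EXISTING_SILENCE, "example-alert-cluster-b", "cluster-b", "install-b"
)
```

Using a deep copy keeps the existing silence unchanged, so the new resource differs only in its name and in the cluster/installation values.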

  1. How should I raise issues I discover during debugging? Do all of them need an issue created?

Since every incident should end with a postmortem, check that one does not already exist before creating a new one. In the postmortem, you can point to the problems found or suggestions raised, and the target team will address them based on priority.

Further links