Helping Humans Troubleshoot

Language is key during incidents

“Cognitive load” refers to the amount of working memory someone is using at a given time. Working memory is limited in capacity, and it is tied to how we learn and process information. If we have a lot of distractions (noise) or are constantly shifting tasks, our ability to learn suffers. Why does this matter? During troubleshooting, people are actively trying to learn what is happening in a system. They need to gather facts and understand what happened. Decisions need to be made. Depending on incident severity, the pressure to get things working can also be strong. Other factors can pile on, like whether it’s their first on-call shift or whether they think they are responsible for breaking the system. Whatever the factors, there’s a lot going on in an incident, and it all compounds against the troubleshooter’s ability to learn what’s happening. This is why we need to avoid adding noise to the already noisy working memory of everyone involved. Ensure that the people in the room (real or virtual) aren’t carrying extra cognitive load and can stay focused on the problem.

While it’s great that we know what cognitive load is and that we have limited memory resources, what do we do with this information? We need to think about how we can provide clear information during troubleshooting and avoid burdening others with acting on our words when it isn’t necessary. Think about how you ask questions, how you provide information, and how you contribute to the troubleshooting. Additionally, think about how you can avoid stirring up unnecessary emotions just because you wanted to share your opinion. Not only does that add noise, it may hurt feelings, and that’s easy to avoid.

Here are some tips you can use to reduce the cognitive “noise” in an incident:

  • Think about how you phrase your questions to gain information for yourself and the group.
  • Don’t introduce opinions or “hot takes”.
  • Make statements about what happened or what needs to happen next.
  • State who is working on problems, such as incident roles or volunteers for tasks.
  • If you have a question and can get the answer on your own, do the research. Let the incident response team know about your question, along with your research.
  • Ask open questions when a line of research appears to have been ignored or exhausted.
  • If you don’t have the right information, either through access or experience, find someone who can answer the questions.
  • Write information down in shared notes!
  • Volunteer for work if you have a particular expertise.

Examples of Noise Reduction:

  • “We think a bunch of new users are causing…” becomes “there was an observed increase in new user signup correlating to…”.
  • “Would adding more instances fix this problem…” becomes “due to an increase in requests, adding 10 more instances addresses one known problem…”.
  • “If we had just done what I said we should have done six months ago…” becomes “I researched this issue six months ago and have a more permanent fix we can investigate after we address the current problem…”.
  • “Would updating the volume type address the provisioned IOPS issue we are experiencing…” becomes “I researched available volume types and according to the documentation a change in volume type would not address our current problem…”.
  • “I think this works this way based on what I’ve done before…” becomes “I have experience with this problem, and I can help…”.
  • “Can we throw money at this…” becomes “do we have permission to add larger instances to alleviate this problem…”.
  • “What would happen if we rolled back the last deployment…” becomes “what is the risk of rolling back the last deployment”.
  • “Increasing the memory should fix the full heap size…” becomes “the heap is full and memory needs to be added…”.

Start by thinking about how you ask questions. Do they introduce guesswork? Do they include a lot of modal verbs, like “should” and “could”? Try removing some of those words from your normal line of questioning and use more declarative statements. Spend a minute on a cursory search for the answer before asking your question. Pull back from emotional takes and try to be supportive. Whatever you choose to begin with, it’ll be a great first step in reducing troubleshooting noise. I’m sure the humans in the room will appreciate it.
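
If you want a quick way to practice spotting modal verbs and guesswork, one option is to check a draft message before you send it. The following is only a minimal sketch: the word list and the function name find_speculative_words are assumptions made up for this illustration, not part of any incident tooling, and a short word list can't judge whether a message is actually helpful.

```python
import re

# Hypothetical word list (an assumption for this sketch): modal verbs and
# guess phrases that tend to signal speculation rather than observation.
SPECULATIVE_WORDS = ("should", "could", "would", "might", "maybe", "i think", "we think")


def find_speculative_words(message):
    """Return the entries of SPECULATIVE_WORDS that appear in the message."""
    lowered = message.lower()
    return [
        word
        for word in SPECULATIVE_WORDS
        # \b keeps us from matching inside longer words such as "shoulder".
        if re.search(r"\b" + re.escape(word) + r"\b", lowered)
    ]


draft = "Increasing the memory should fix the full heap size"
rewrite = "The heap is full and memory needs to be added"

print(find_speculative_words(draft))    # ['should']
print(find_speculative_words(rewrite))  # []
```

A match isn't a verdict. It's a nudge to reread the message and decide whether a statement of fact, or a question about risk, would serve the group better.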