Process Design: Roll Out a Risky Change

An example interview on rolling out a risky change.

Question

You need to roll out a risky change, such as a system-wide configuration change or a new binary deployment. How would you do that?

Background

Managing change in an ambiguous environment is a core part of the TPM skillset, whether it is a process, system, or feature change. This question is designed to exercise this skill and provide a framework to help you do so.

As in a previous question, we’ll use a role-playing approach to simulate a live interview.

Solution approach

We will use the following structured approach for this question:

  • Clarify the scope of the change.
  • Outline the potential risks.
  • Propose mitigations for these risks.

Sample answer

Interviewee: First, I’d like to understand the scale and scope of the change so I can gauge its potential impact. Here are my questions:

  • How will the change be deployed? How fast will the change be felt?
  • How many machines will be impacted?
  • What are the potential negative effects of this change? How many users would be impacted?
  • How fast can we roll back if something goes wrong?
  • Do we have access to a test environment to validate these changes?

Interviewer: Good questions. Here is some additional information:

  • The change will originate from our deployment server. Once a machine is targeted, the change takes effect on the order of minutes.
  • This change will need to go to thousands of servers.
  • Please brainstorm things that could go wrong. The machines in scope serve millions of users.
  • Rollback is as fast as the rollout: on the order of minutes.
  • Yes, you will have access to a test environment.
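
To make the scale concrete, a back-of-envelope estimate can relate the number of machines changed at once to the number of users exposed. The sketch below assumes traffic is spread evenly across servers. The figures (5,000 servers, 10 million users) are hypothetical stand-ins for "thousands" and "millions", not numbers given by the interviewer, and `users_exposed` is an illustrative helper name.

```python
def users_exposed(servers_changed: int, total_servers: int, total_users: int) -> int:
    """Estimate users served by the changed machines, assuming even load."""
    if total_servers <= 0:
        raise ValueError("total_servers must be positive")
    if not 0 <= servers_changed <= total_servers:
        raise ValueError("servers_changed must be between 0 and total_servers")
    return total_users * servers_changed // total_servers


# Hypothetical figures standing in for "thousands" of servers and "millions" of users.
TOTAL_SERVERS = 5_000
TOTAL_USERS = 10_000_000

for changed in (5, 50, 500, 5_000):
    exposed = users_exposed(changed, TOTAL_SERVERS, TOTAL_USERS)
    print(f"{changed:>5} servers changed -> roughly {exposed:,} users exposed")
```

Under these assumptions, the number of users exposed grows in direct proportion to the number of machines changed at once, which is why the scale questions matter.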

Interviewee: Great. Let me enumerate some things that could go wrong:

  • The change could cause machines to crash, taking down our application. This could be due to a bug in the configuration or binary.
  • The change could introduce a user-facing bug that affects some or all users. Depending on the nature of the service, the impact could include lost usage or lost revenue.
  • The change could degrade performance and introduce unacceptable latency, for example through more expensive computations.
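
Each of these three risks corresponds to a signal that can be observed. As a minimal sketch, the check below compares a snapshot of metrics taken before and after the change against one threshold per risk. The `Metrics` fields, the `detect_regressions` helper, and the threshold values are all hypothetical and chosen only for illustration; real values would depend on the service.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    """Hypothetical health snapshot for a group of machines."""
    healthy_fraction: float  # share of machines up and serving (crash risk)
    error_rate: float        # share of requests that fail (user-facing bug risk)
    p99_latency_ms: float    # 99th percentile latency (performance risk)


def detect_regressions(before: Metrics, after: Metrics,
                       min_healthy: float = 0.99,
                       max_error_increase: float = 0.01,
                       max_latency_ratio: float = 1.2) -> List[str]:
    """Return the risks that appear to have materialized (illustrative thresholds)."""
    problems = []
    if after.healthy_fraction < min_healthy:
        problems.append("crash")
    if after.error_rate - before.error_rate > max_error_increase:
        problems.append("user-facing errors")
    if after.p99_latency_ms > before.p99_latency_ms * max_latency_ratio:
        problems.append("latency")
    return problems


baseline = Metrics(healthy_fraction=1.0, error_rate=0.001, p99_latency_ms=200.0)
changed = Metrics(healthy_fraction=1.0, error_rate=0.002, p99_latency_ms=320.0)
print(detect_regressions(baseline, changed))  # prints ['latency']
```

In this example only the latency check trips, because 320 ms exceeds 1.2 times the 200 ms baseline while the other two signals stay within their thresholds.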

Interviewer: Sounds like a good list. How would we go about mitigating these risks?

Interviewee: There are several steps we can take to mitigate these risks:
