Google - Site Reliability Engineering
sre.google
Exercises like “the Five Why’s” (see page 108) begin with chronic problems. As a team, can you identify fundamental causes?
Poorly performing patterns are often merely symptoms of an underlying problem. Addressing symptoms may ease the pain, but it does little to ensure sustainability. For that we need to expose the problem’s root cause. This can be done using simple yet robust techniques called “root cause analyses.” While there are a great many to choose from, we ...
For example, instead of looking at incidents through arbitrary categories (P1 to P4), System of Profound Knowledge could be used to identify common-cause and special-cause patterns across all incidents. Leadership would be those same supervisors using these incidents as on-the-job training opportunities. John Allspaw says incidents are “unplanned ...