How Google Does Planet-Scale Engineering for Planet-Scale Infra

Insightful talk from Melisa on the culture Google built to achieve high availability. Many of us check whether we are online by loading google.com. That's how much we trust Google's uptime.

Key takeaways

  • A dedicated role, the 'Site Reliability Engineer' (SRE), is responsible for availability. SREs are software engineers who act as custodians of site availability. They take part in design alongside devs and help build reliable software.
  • Devs want to push features, and every code push puts the availability of the site at risk. The SRE's goal is to maintain high uptime. The two are opposing forces, and ultimately the SREs evaluate and decide whether new code is ready to be pushed to prod.
  • If code fails in prod, the problem lies with the process, not with the person.
  • Each team gets an error budget per quarter. The error budget is the number of errors permitted in production, and it determines how many features the team can push to production. The more features pushed, the lower the site reliability and the more errors. Great thought, Google.
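To make the error budget idea concrete, here is a minimal sketch in Python. The talk summary gives no formula, so this assumes the budget is counted as failed requests allowed under an availability target for the quarter. The class name, field names, and the 99.9% target are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical per-team, per-quarter error budget (illustrative only)."""

    slo_target: float     # assumed availability target, e.g. 0.999 for 99.9%
    total_requests: int   # requests served so far this quarter
    failed_requests: int  # requests that failed so far this quarter

    @property
    def allowed_failures(self) -> int:
        # The budget is the share of requests allowed to fail under the target.
        return round((1 - self.slo_target) * self.total_requests)

    @property
    def remaining(self) -> int:
        # Negative once the team has overspent its budget.
        return self.allowed_failures - self.failed_requests

    @property
    def can_release(self) -> bool:
        # Assumed policy: feature pushes continue only while budget is left.
        return self.remaining > 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(budget.allowed_failures, budget.remaining, budget.can_release)
```

Under these assumptions, a team that has used 400 of its 1,000 allowed failures can keep shipping, while a team that has already logged 1,200 failures would hold new pushes until reliability recovers.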
