In Site Reliability Engineering, our mission is to protect, provide for, and progress the software and systems behind all of Google’s public services, with an ever-watchful eye on their availability, latency, performance, and capacity.
Site Reliability Engineer at Network Telemetry
- Enhanced the resilience of two globally distributed systems (Sinapse and gTIB), focusing personally on Chaos Engineering, Capacity Management, and Resource Management.
- Simple Network Abstraction and Simulation Engine (Sinapse) is the software model of Google’s network, used for live Bandwidth Enforcement as well as network analysis, including PCR (Production Change Request) vetting and capacity planning.
- Google Traffic Information Base (gTIB) provides traffic telemetry, i.e., traffic matrices, packet samples, and connection events to consumers, who are often real-time control loops. Telemetry is also branched off for non-real-time monitoring or offline access.
- Co-led the creation of our own diagnostic service (Network Expert) to bridge the gap between oncall tooling (playbooks, graphs, logs) and large-scale, fully automated systems (mitigation, self-healing, etc.), with the goal of reducing the MTTR (mean time to repair) for outages and bugs. This automation tool aims to increase feature velocity, decrease toil for the internal team, and help external teams troubleshoot outages.
- Implemented several automation workflows to improve oncaller life, decreasing the time to handle some issues by 60%.
- Implemented the plugin foundation used to create automation workflows; these plugins widened the range of problems that Network Expert can tackle.
- Customer-first approach for NT-SRE
- Process: Interviewed all customers of our services, and created an impact assessment to understand how our outages would impact our customers and how to mitigate that risk.
- Output: OKRs based on the interview findings
- Creator & Owner of Capacity Management Roadmap 2022/2023
- Results:
- Saved 100K vCPUs
- Implemented dashboards & alerts to identify clusters at risk of Resource Stockout
- Used Machine Learning to forecast resource usage
- Defined a mitigation strategy to deal with Resource Stockout
- Improved the rollout strategy by creating a better distribution system across network domains
- We use progressive rollouts distributed in waves to decrease the blast radius of errors
- This system did not take the different networking domains into account, which increased the risk of a bug causing an outage for a single domain
- Deprecation of canary environment (Sinapse2)
- Getting agreement from Devs and influencing them to stop using the canary environment after the rollout strategy was improved
- Cross-team coordination and collaboration to turn down the environment
- Saving 70K vCPUs, 260TB RAM, 2PB HDD
- Innovative Roadmap
- Research, Roadmap, Execution, Delegation
- Creator & Owner of Chaos Engineering Roadmap 2022
- Research, Roadmap
- Proposal of KRs to achieve better reliability and test hypotheses about fail-over options
- SRE Exchange with Bigtable
- Helped both teams to improve their processes
- One-week exchange with the Bigtable team to learn how they work and collaborate; at the end, created a document with suggestions for both sides
Community work:
- Judge @ The 17 Sustainable Development Goals of the United Nations Solution Challenge 2022 & 2023.
- Mentoring startups @ Digital Health Innovation Accelerator Programme 2022 & 2023. (https://www.wfp.org/)
- Mentoring people @ Manara Tech (https://manara.tech/) 2022
- Speaker @ SRE Global Conference 2022.
- Translator & Reviewer for SRE courses on Coursera.