Course Summary
With Site Reliability Engineering (SRE) Practitioner, you will learn about:
• Practical view of how to successfully implement a flourishing SRE culture in your organisation
• The underlying principles of SRE, and what SRE is not, expressed as anti-patterns
• Organisational impact of introducing SRE
• SLIs and SLOs in a distributed ecosystem and extending the usage of Error Budgets
• Building security and resilience by design in a distributed, zero-trust environment
• Implementing full-stack observability, distributed tracing and Observability-driven development culture
• Curating data using AI to move from reactive to proactive and predictive incident management
• Using DataOps to build clean data lineage
• Why Platform Engineering is important in building consistency and predictability
• Implementing practical Chaos Engineering
• Major incident response responsibilities
• SRE Execution model
Course Content
Module 1: SRE Anti-patterns
• Rebranding Ops or DevOps or Dev as SRE
• Users notice an issue before you do
• Measuring until my Edge
• False positives are worse than no alerts
• Configuration management trap for snowflakes
• The Dogpile: Mob incident response
• Point fixing
• Production Readiness Gatekeeper
• Fail-Safe, really?
Module 2: SLO is a Proxy for Customer Happiness
• Define SLIs that meaningfully measure the reliability of a service from a user’s perspective
• Defining System boundaries in a distributed ecosystem for defining correct SLIs
• Use error budgets to help your team have better discussions and make better data-driven decisions
• Overall reliability is only as good as the weakest link on your service graph
• Error thresholds when 3rd party services are used
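As an illustration of the "weakest link" and error budget topics in this module, here is a minimal sketch. It assumes every dependency is a hard, serial dependency (a request succeeds only if all of them succeed). The service names and availability figures are hypothetical and are not taken from the course material.

```python
from math import prod


def serial_availability(availabilities):
    """Composite availability when every dependency must succeed."""
    return prod(availabilities)


def error_budget_minutes(slo, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


# Hypothetical service graph; figures are illustrative only.
dependencies = {
    "frontend": 0.999,
    "payments_api": 0.995,
    "third_party_kyc": 0.99,  # assumed third-party SLA
}

composite = serial_availability(dependencies.values())
weakest = min(dependencies.values())

print(f"Weakest link:           {weakest:.4f}")
print(f"Composite availability: {composite:.4f}")
print(f"Budget at 99.9% SLO:    {error_budget_minutes(0.999):.1f} min per 30 days")
```

Under these assumptions the composite availability can never exceed that of the weakest dependency, so an SLO stricter than a third party's availability cannot be met without mitigation such as caching, fallbacks or redundancy.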
Module 3: Building Secure and Reliable Systems
• SREs and their role in building secure and reliable systems
• Design for Changing Architecture
• Fault tolerant Design
• Design for Security
• Design for Resiliency
• Design for Scalability
• Design for Performance
• Design for Reliability
• Ensuring Data Security and Privacy
Module 4: Full-Stack Observability
• Modern Apps are Complex & Unpredictable
• Slow is the new down
• Pillars of Observability
• Implementing Synthetic and End user monitoring
• Observability-driven development
• Distributed Tracing
• What happens to Monitoring?
• Instrumenting using Libraries and Agents
Module 5: Platform Engineering and AIOps
• How taking a platform-centric view solves organisational scalability challenges such as fragmentation, inconsistency and unpredictability
• How do you use AIOps to improve Resiliency?
• How can DataOps help you in the journey?
• A simple recipe to implement AIOps
• Indicative measurement of AIOps
Module 6: SRE & Incident Response Management
• Key SRE responsibilities in incident response
• DevOps, SRE and ITIL
• OODA and SRE Incident Response
• Closed Loop Remediation and the Advantages
• Swarming – Food for Thought
• AI/ML for better incident management
Module 7: Chaos Engineering
• Navigating Complexity
• Chaos Engineering Defined
• Quick Facts about Chaos Engineering
• Chaos Monkey Origin Story
• Who is adopting Chaos Engineering
• Myths of Chaos
• Chaos Engineering Experiments
• GameDay Exercises
• Security Chaos Engineering
• Chaos Engineering Resources
Module 8: SRE is the Purest form of DevOps
• Key Principles of SRE
• SREs help increase Reliability across the product spectrum
• Metrics for Success
• Selection of Target areas
• SRE Execution Model
• Culture and Behavioral Skills are key
• SRE Case study