Speaker
Kavin Arvind Ragavan is a Cloud Performance Architect with 12 years of experience in Performance and Reliability Engineering.
Kavin has strong expertise in architecting and designing solutions for validating cloud performance (server side and client side) and resilience. He specializes in AWS and GCP cloud performance and resilience engineering. He has architected cloud chaos frameworks for GCP using GCP Cloud Workflows and open-source tools, and has presented them as a solution at conferences.
Kavin has designed and presented CI/CD-based frameworks that perform early performance, resilience, and accessibility validation in the CI/CD pipeline and identify potential performance bottlenecks during the development phase. He has extensive experience with cloud performance testing solutions and with monitoring strategies that use performance and APM tools for cloud migrations and new application development in the cloud.
He has also presented technical whitepapers, participated in interactive talks, and conducted workshops on cloud performance testing, resilience testing, and microservices at various software conferences. He has published blogs on chaos and observability frameworks at various events and on platforms such as Medium.com.
Applying SRE Principles to Containers: Kubernetes Chaos Engineering
SRE Best Practices for Containers
1. Shift Left into Dev cycle
Dev, Perf, and Kubernetes SRE teams can identify weaknesses and potential outages in infrastructure earlier by inducing chaos tests in a controlled way in the CI pipeline. Chaos tests can be run at any stage of the DevOps cycle.
2. Shift Right into Production
Resilience can be validated in the staging environment and eventually in production under actual user load. Finding and fixing bugs and vulnerabilities this way increases the resilience of the system. The extent of chaos tests varies from lower-level environments to production.
3. Testing for Kubernetes Changes
Test across all change scenarios, such as deploying new code, adding dependencies, observing changes in usage patterns, mitigating problems, certifying Kubernetes upgrades, and validating services after an upgrade.
4. Validate Application/Service Resilience
Verify application resilience whenever the underlying stack changes. This can also take the form of continuous resilience: the process of continuously verifying that the application service is resilient against faults.
5. Validate Infrastructure Resilience
Application resilience often depends more on the underlying stack than on the application itself. Once the application is stabilized, the resilience of a service running on Kubernetes depends, most of the time, on other components and the infrastructure.
6. Resilience Benchmarking
Chaos workflows support the user in defining the expected result, observing the actual result, analyzing overall system behavior, and deciding whether the system needs to be tuned to improve resilience. The same workflows can be used for resilience benchmarking.
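The define, observe, and decide loop described under resilience benchmarking can be sketched in a few lines of Python. This is only an illustration: the `SteadyState` class, the `evaluate` function, and the threshold values are hypothetical names and numbers chosen for the example, not part of any chaos tool.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class SteadyState:
    """Expected result for one experiment (thresholds are illustrative)."""
    max_error_rate: float      # fraction of failed requests tolerated
    max_p95_latency_ms: float  # 95th percentile latency tolerated


def evaluate(latencies_ms, errors, total, expected):
    """Compare measurements observed during chaos with the expected result."""
    error_rate = errors / total
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(latencies_ms, n=20)[-1]
    resilient = (error_rate <= expected.max_error_rate
                 and p95 <= expected.max_p95_latency_ms)
    return {"error_rate": error_rate, "p95_ms": p95, "resilient": resilient}


# Hypothetical measurements collected while a fault was active.
expected = SteadyState(max_error_rate=0.02, max_p95_latency_ms=250.0)
result = evaluate([100.0] * 20, errors=1, total=100, expected=expected)
print(result)
```

Recording the returned dictionary for each run gives a simple baseline against which later runs can be compared.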
Platform | Pod Chaos
Platform | Node Chaos
Network | Network Chaos
Network | DNS Chaos
Infrastructure | Stress Chaos
Infrastructure | File IO Chaos
Infrastructure | Time Chaos
Infrastructure | Kernel Chaos
Application | HTTP Chaos
Application | JVM Chaos
Resilience/Chaos Engineering for Containers
Resiliency is the ability of a system to gracefully handle and recover from hardware and software failures while providing an acceptable level of service to the business.
Resilience/chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Applying chaos engineering experiments to cloud services and Kubernetes helps continuously improve an application’s performance, observability, and resiliency through different fault simulations.
- SPOF failures: the failure of one service or component should not have a cascading impact on other components.
- Dependency failures: the failure of a dependent service, such as a database or cache, should not bring the application down.
- App-level failure injections: introduce resource, state, and network-level faults into the application.
- Data failures: data should remain available to applications if the system that originally hosted it fails.
- Canary deployment failures: verify the automated rollback mechanism for code in production in case of failure.
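As an illustration of the dependency-failure case in the list above, the sketch below injects a fault into a stand-in cache and checks that the request is still served from the database. `FlakyCache`, `CacheDown`, and `load_profile` are hypothetical names invented for this example; a real experiment would inject the fault with a chaos tool rather than a flag.

```python
class CacheDown(Exception):
    """Raised by the hypothetical cache client when the dependency is unavailable."""


class FlakyCache:
    """Stand-in for a cache dependency; `fail=True` simulates an injected fault."""

    def __init__(self, fail=False):
        self.fail = fail
        self.data = {}

    def get(self, key):
        if self.fail:
            raise CacheDown(key)
        return self.data.get(key)


def load_profile(user_id, cache, database):
    """Serve from the cache when possible, degrading to the database on cache failure."""
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except CacheDown:
        pass  # dependency fault: fall through instead of failing the request
    return database[user_id]


database = {"u1": {"name": "example"}}
healthy = load_profile("u1", FlakyCache(fail=False), database)
degraded = load_profile("u1", FlakyCache(fail=True), database)
print(healthy == degraded)
```

The check of interest is that the degraded path returns the same answer as the healthy one, which is what "should not bring the application down" means in practice.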
Tools for Kubernetes Chaos
Chaos Mesh is an open-source, cloud-native chaos engineering platform. It offers various types of fault simulation and extensive capabilities for orchestrating fault scenarios. Using Chaos Mesh, we can simulate abnormalities that might occur in development, testing, and production environments and find potential problems in the system.
Litmus is a cloud-native chaos engineering framework with cross-cloud support. It helps Kubernetes SREs and developers find weaknesses both in non-Kubernetes platforms and in platforms and applications running on Kubernetes by providing a complete chaos engineering framework and associated chaos experiments.
Chaos Mesh Framework and Features
- Authenticated login: role-based access control (RBAC) for logging in to clusters.
- Cloud native: Chaos Mesh supports Kubernetes environments and draws its automation capability from its CRDs (CustomResourceDefinitions).
- Workflow orchestration: design your own chaos experiment scenarios on the platform, including combinations of multiple experiments and application status checks.
- High security: Chaos Mesh is designed with multiple layers of security control.
- Community support: Chaos Mesh is an incubating project hosted by CNCF, with a growing number of contributors and adopters worldwide.
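Because Chaos Mesh experiments are expressed as Kubernetes custom resources, an experiment can be generated programmatically. The following is a minimal sketch that builds a `PodChaos` manifest as a Python dictionary. The field names follow the `chaos-mesh.org/v1alpha1` API as commonly documented and should be checked against the installed version; the experiment name, namespaces, and labels are assumptions made up for the example.

```python
import json


def pod_chaos_manifest(name, namespace, target_namespace, labels, action="pod-kill"):
    """Build a PodChaos resource that targets one pod matching the given labels."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": action,  # e.g. "pod-kill" or "pod-failure"
            "mode": "one",     # pick a single matching pod to limit the blast radius
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": labels,
            },
        },
    }


# All names below are placeholders for illustration only.
manifest = pod_chaos_manifest(
    name="kill-one-checkout-pod",
    namespace="chaos-testing",
    target_namespace="staging",
    labels={"app": "checkout"},
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.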
Litmus Framework and Features
Users & Teams
- Creation of Users with Role Based Access Control
- Creating a Team of multiple Users
- Authenticating Users
Monitoring & Observability
- Connecting a data source (from any agent) and monitoring workflows
- Monitoring the effect of chaos in real time with interleaved events and metrics from a Prometheus data source
Customized Workflows
- Creation of scenario templates and custom workflows, either from scratch (using Chaos Hubs) or from pre-created YAMLs
- Attaching priority to Chaos Experiments based on your use cases
Modern fault scenarios
- Many new Kubernetes-native chaos scenarios for fault simulation when testing distributed systems
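A Litmus experiment is typically triggered through a `ChaosEngine` resource. The sketch below builds one for the `pod-delete` experiment as a Python dictionary. The field names reflect the `litmuschaos.io/v1alpha1` API as commonly documented and may differ between Litmus releases, so treat them as an assumption to verify; the engine name, namespace, label, and service account are hypothetical.

```python
import json


def chaos_engine(name, app_namespace, app_label, service_account, duration_s=30):
    """Build a ChaosEngine that runs pod-delete against one labelled Deployment."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": app_namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": app_namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": service_account,
            "experiments": [{
                "name": "pod-delete",
                "spec": {"components": {"env": [
                    # Experiment tunables are passed as string-valued env entries.
                    {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                ]}},
            }],
        },
    }


# Placeholder values for illustration only.
engine = chaos_engine("checkout-chaos", "staging", "app=checkout", "pod-delete-sa")
print(json.dumps(engine, indent=2))
```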