Kavin Arvind Ragavan - Global Testing Retreat 2022

Speaker

Kavin Arvind Ragavan is a Cloud Performance Architect with 12 years of experience in Performance and Reliability Engineering.

Kavin has strong expertise in architecting and designing solutions for validating cloud performance (server side and client side) and resilience. He specializes in AWS and GCP cloud performance and resilience engineering. He has architected cloud chaos frameworks for GCP using GCP Cloud Workflows and open-source tools, and has presented them as solutions at conferences.

Kavin has designed and presented CI/CD-based frameworks that perform early performance, resilience, and accessibility validation in the CI/CD pipeline and identify potential performance bottlenecks during the development phase. He has extensive experience in cloud performance testing solutions and in monitoring strategies using performance and APM tools, both for cloud migrations and for new application development in the cloud.

He has also presented technical whitepapers, participated in interactive talks, and conducted workshops on cloud performance testing, resilience testing, and microservices at various software conferences. He has published blogs on chaos and observability frameworks at various events and on platforms such as Medium.com.

Day 1: Interactive Session on Applying SRE Principles to Containers: Kubernetes Chaos Engineering

Applying SRE Principles to Containers: Kubernetes Chaos Engineering

SRE Best Practices for Containers

1. Shift Left into Dev cycle

Dev, Perf, and Kubernetes SRE teams can identify weaknesses and potential outages in infrastructure earlier by running chaos tests in a controlled way in the CI pipeline. Chaos tests can be run at any stage of the DevOps cycle.

2. Shift Right into Production

Resilience can be validated in the staging environment and eventually in production under actual user load. Finding and fixing bugs and vulnerabilities this way increases the resilience of the system. The extent of chaos testing varies from lower-level environments to production.

3. Testing for Kubernetes Changes

Test in all change scenarios, such as deploying new code, adding dependencies, observing changes in usage patterns, mitigating problems, certifying Kubernetes upgrades, and validating services after an upgrade.

4. Validate Application/Service Resilience

Verify application resilience whenever the underlying stack changes. This can also take the form of continuous resilience: the process of continuously verifying that the application service is resilient against faults.

5. Validate Infrastructure Resilience

Application resilience depends more on the underlying stack than on the application itself. Once the application is stabilized, the resilience of a service that runs on Kubernetes depends mostly on the other components and the infrastructure.

6. Resilience Benchmarking

Chaos workflows support the user in defining the expected result, observing the actual result, analyzing overall system behavior, and deciding whether the system needs to be tuned to improve resilience. The same results can serve as a resilience benchmark.
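The benchmarking step (define the expected result, observe the actual result, decide whether tuning is needed) can be sketched roughly as follows. This is a minimal illustration, not something shown in the session: the class names, metric names, and thresholds are all assumptions chosen for the example, and in practice the observed values would come from a monitoring or APM tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectation:
    """Expected result for one chaos experiment (illustrative thresholds)."""
    max_error_rate: float      # tolerated fraction of failed requests
    max_p95_latency_ms: float  # latency ceiling while the fault is active
    max_recovery_s: float      # time allowed to return to steady state


@dataclass(frozen=True)
class Observation:
    """Result actually measured during the experiment."""
    error_rate: float
    p95_latency_ms: float
    recovery_s: float


def violations(expected: Expectation, observed: Observation) -> list:
    """Return the names of the metrics that exceeded their expected limit."""
    checks = [
        ("error_rate", observed.error_rate, expected.max_error_rate),
        ("p95_latency_ms", observed.p95_latency_ms, expected.max_p95_latency_ms),
        ("recovery_s", observed.recovery_s, expected.max_recovery_s),
    ]
    return [name for name, value, limit in checks if value > limit]


def needs_tuning(expected: Expectation, observed: Observation) -> bool:
    """Decision step: the system needs tuning if any expectation is violated."""
    return bool(violations(expected, observed))


if __name__ == "__main__":
    # Hypothetical numbers for a single experiment run.
    expected = Expectation(max_error_rate=0.01, max_p95_latency_ms=500.0, max_recovery_s=60.0)
    observed = Observation(error_rate=0.02, p95_latency_ms=420.0, recovery_s=90.0)
    print("violated:", violations(expected, observed))
    print("needs tuning:", needs_tuning(expected, observed))
```

A check like this could also gate a CI stage (for example by exiting non-zero when `needs_tuning` returns true), which is one way to apply the shift-left practice described in item 1.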

Sr. No.	Category	Type	Faults
1	Platform	Pod Chaos	Simulates Pod failures, such as Pod node restart, Pod’s persistent unavailability, and certain container failures in a specific Pod.
2	Platform	Node Chaos	Simulates GCP platform failures, such as the GCP node restart.
3	Network	Network Chaos	Simulates network failures, such as network latency, packet loss, packet disorder, and network partitions.
4	Network	DNS Chaos	Simulates DNS failures, such as a failure to parse a DNS domain name or a wrong IP address being returned.
5	Infrastructure	Stress Chaos	Simulates CPU or memory stress.
6	Infrastructure	File IO Chaos	Simulates I/O failures of an application file, such as I/O delays and read and write failures.
7	Infrastructure	Time Chaos	Simulates a time jump exception.
8	Infrastructure	Kernel Chaos	Simulates kernel failures, such as an exception in application memory allocation.
9	Application	HTTP Chaos	Simulates HTTP communication failures, such as HTTP communication latency.
10	Application	JVM Chaos	Simulates JVM application failures, such as a function call delay.
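To make two rows of the table concrete, the sketch below builds Chaos Mesh manifests for a Pod Chaos fault (kill one Pod) and a Network Chaos fault (inject latency) as Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts. The field names follow the `chaos-mesh.org/v1alpha1` CRDs as commonly documented and should be checked against the installed Chaos Mesh version. The experiment names, namespace, and `app` label are hypothetical.

```python
import json

API_VERSION = "chaos-mesh.org/v1alpha1"


def _selector(namespace: str, app_label: str) -> dict:
    """Select target Pods by namespace and by an assumed `app` label."""
    return {"namespaces": [namespace], "labelSelectors": {"app": app_label}}


def pod_kill(name: str, namespace: str, app_label: str) -> dict:
    """Pod Chaos: kill one randomly chosen Pod that matches the selector."""
    return {
        "apiVersion": API_VERSION,
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",
            "selector": _selector(namespace, app_label),
        },
    }


def network_delay(name: str, namespace: str, app_label: str,
                  latency: str = "100ms", duration: str = "60s") -> dict:
    """Network Chaos: add latency to all matching Pods for a fixed duration."""
    return {
        "apiVersion": API_VERSION,
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "all",
            "selector": _selector(namespace, app_label),
            "delay": {"latency": latency},
            "duration": duration,
        },
    }


if __name__ == "__main__":
    # Hypothetical target: Pods labelled app=checkout in the "staging" namespace.
    for manifest in (
        pod_kill("checkout-pod-kill", "staging", "checkout"),
        network_delay("checkout-net-delay", "staging", "checkout"),
    ):
        print(json.dumps(manifest, indent=2))
```

Generating manifests from code rather than hand-editing YAML is only one option; it makes it easier to vary the fault parameters between lower-level environments and production.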

Resilience/ Chaos Engineering for Containers

Resiliency is the ability of the system to gracefully handle and recover from hardware and software failures and provide an acceptable level of service to the business.
Resilience/Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Applying Chaos Engineering experiments to Cloud Services and Kubernetes helps to continuously improve an application’s performance, observability, and resiliency through different fault simulations.

SPOF (Single Point of Failure) Failures: Failure of one service or component should not have a cascading impact on other components.
Dependency Failures: Failure of a dependent service, such as the database or cache, should not bring the application down.
App-level Failure Injections: Introduce resource, state, and network-level faults into the application.
Data Failures: Data should remain available to applications if the system that originally hosted it fails.
Canary Deployment Failures: Verify the automated rollback mechanism for code in production in case of failure.
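The dependency-failure case can be illustrated with a small in-process fault injection. The sketch below is an assumption-laden toy, not code from the session: `Cache`, `CacheDown`, and `get_profile` are hypothetical names, and the "database" is a plain dictionary. It shows the property being tested: when the cache fails, the application degrades to the database instead of going down.

```python
class CacheDown(Exception):
    """Raised when the (simulated) cache dependency is unavailable."""


class Cache:
    """Toy in-memory cache with a switch for injecting a failure."""

    def __init__(self):
        self._data = {}
        self.fail = False  # set to True to inject the fault

    def get(self, key):
        if self.fail:
            raise CacheDown("injected fault")
        return self._data.get(key)

    def set(self, key, value):
        if self.fail:
            raise CacheDown("injected fault")
        self._data[key] = value


def get_profile(user_id, cache, database):
    """Read through the cache, falling back to the database if the cache fails."""
    try:
        cached = cache.get(user_id)
        if cached is not None:
            return cached
    except CacheDown:
        pass  # degrade gracefully: continue without the cache

    value = database[user_id]
    try:
        cache.set(user_id, value)
    except CacheDown:
        pass  # a failed cache write must not fail the request
    return value


if __name__ == "__main__":
    database = {"u1": "Asha"}  # hypothetical data
    cache = Cache()
    cache.fail = True          # inject the dependency failure
    print(get_profile("u1", cache, database))
```

In a real experiment the fault would be injected from outside the process (for example with the Pod or Network faults listed in the table above), but the assertion is the same: the request still succeeds while the dependency is down.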

Tools for Kubernetes Chaos

Chaos Mesh is an open-source, cloud-native Chaos Engineering platform. It offers various types of fault simulation and can orchestrate complex fault scenarios. Using Chaos Mesh, we can simulate various abnormalities that might occur in development, testing, and production environments and find potential problems in the system.

Litmus is a cloud-native Chaos Engineering framework with cross-cloud support. Its purpose is to help Kubernetes SREs and developers find weaknesses both in non-Kubernetes platforms and in platforms and applications running on Kubernetes, by providing a complete Chaos Engineering framework and associated chaos experiments.

Chaos Mesh Framework and Features

Authenticated Login: Role-based access control (RBAC) for logging in to clusters.
Cloud Native: Chaos Mesh supports every Kubernetes environment, with powerful automation built on its CRDs.
Workflow Orchestration: Design your own chaos experiment scenarios on the platform, including combinations of multiple experiments and application status checks.
High Security: Chaos Mesh is designed with multiple layers of security control.
Community Support: Chaos Mesh is an incubating project hosted by CNCF and has a growing number of contributors and adopters all over the world.

Litmus Framework and Features

Users & Teams

Creation of Users with Role Based Access Control
Creating a Team of multiple Users
Authenticating Users

Monitoring & Observability

Connecting a data source (from any agent) and monitoring workflows
Monitoring the effect of chaos in real time with interleaved events and metrics from a Prometheus data source

Customized Workflows

Creation of scenarios from templates, from scratch as custom workflows (using Chaos Hubs), or from pre-created YAMLs
Attaching priority to Chaos Experiments based on your use cases

Modern fault scenarios

Many new Kubernetes-native chaos scenarios for fault simulation when testing distributed systems
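As an illustration of what a pre-created Litmus YAML might contain, the sketch below builds a `ChaosEngine` resource for the `pod-delete` experiment as a Python dictionary and prints it as JSON. The structure follows the `litmuschaos.io/v1alpha1` ChaosEngine schema as commonly documented and should be verified against the installed Litmus version. The engine name, namespace, label, and service account are hypothetical.

```python
import json


def pod_delete_engine(name: str, namespace: str, app_label: str,
                      service_account: str,
                      duration_s: int = 30, interval_s: int = 10) -> dict:
    """Build a ChaosEngine that runs the pod-delete experiment on one deployment."""
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,   # e.g. "app=checkout"
                "appkind": "deployment",
            },
            "chaosServiceAccount": service_account,
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            "env": [
                                # Tunables are passed to the experiment as string env values.
                                {"name": "TOTAL_CHAOS_DURATION", "value": str(duration_s)},
                                {"name": "CHAOS_INTERVAL", "value": str(interval_s)},
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical target and service account.
    engine = pod_delete_engine("checkout-chaos", "staging", "app=checkout", "pod-delete-sa")
    print(json.dumps(engine, indent=2))
```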

