posted on Nov 14, 2022 by Dominique Raviart
Tags: Tata Consultancy Services, Application Testing Management, IT outsourcing
TCS recently briefed NelsonHall on its approach to site reliability engineering (SRE) in the context of quality engineering (QE).
SRE emerged almost a decade ago as part of the shift-right move, targeting production environments beyond traditional IT infrastructure activities such as service desk and monitoring. While no single definition of SRE has fully emerged, TCS points out that SRE focuses on two topics, resiliency and reliability, with observability and AIOps, automation, and chaos engineering as key services.
TCS prioritizes cloud-hosted applications for its SRE services. Cloud hosting increases the likelihood of application outages because migrated applications were not initially designed and configured for cloud or multi-cloud hosting.
Generally, there has been very little SRE in QE activity, even though the industry has emphasized shift-right for several years. The shift-right notion in QE refers to feeding back production information to dev and test teams, breaking down the traditional silos between build and run activities. Activities such as application monitoring (relying on APM tools), associated AI use cases (to make sense of APM-triggered events), the classification of defects found in production, and sentiment analysis have become common.
We think shift-right activities can still be improved, building on monitoring activities. Chaos engineering is a good example of a developing proactive service. More importantly, the feedback from production to dev and test needs to be improved, and we think SRE will help here.
Observability/Monitoring, AIOps, and Chaos Engineering
TCS' approach to SRE relies on application monitoring, AIOps, automation, and chaos engineering. Application monitoring ('observability') remains at the core of TCS' portfolio. For this, the company will deploy APM tools, collect logs and traces, and provide reporting. One of the challenges in application monitoring is that monitoring data is dispersed across different applications and databases. Accordingly, data centralization is a priority for TCS.
Once it has collected monitoring data, TCS deploys AI models (AIOps) to automate event detection and correlation and eventually move to a prediction phase. TCS' main AI use cases are predictive alerts, root cause analysis, event prioritization, and outage likelihood. The company will use third-party tools such as Dynatrace (combined with application monitoring) or deploy its own IP, depending on the client's tool usage.
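To make the idea of event correlation concrete, here is a minimal Python sketch. It is not TCS' implementation or the API of any AIOps product. It assumes a simple time-window rule: events for the same service that arrive within a set interval of each other are treated as one incident. The names `Event`, `correlate`, and `window_seconds` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    timestamp: float  # seconds since some reference point (assumed unit)
    service: str      # name of the monitored application or service
    message: str


def correlate(events: List[Event], window_seconds: float = 60.0) -> List[List[Event]]:
    """Group events per service when each follows the previous one within window_seconds."""
    incidents: List[List[Event]] = []
    latest_incident: Dict[str, int] = {}  # service -> index of its most recent incident

    for event in sorted(events, key=lambda e: e.timestamp):
        idx = latest_incident.get(event.service)
        if idx is not None and event.timestamp - incidents[idx][-1].timestamp <= window_seconds:
            # Close enough in time to the last event of this service: same incident.
            incidents[idx].append(event)
        else:
            # First event for this service, or too far apart: start a new incident.
            incidents.append([event])
            latest_incident[event.service] = len(incidents) - 1
    return incidents


events = [
    Event(0, "checkout", "latency above threshold"),
    Event(30, "checkout", "error rate rising"),
    Event(40, "search", "slow query"),
    Event(200, "checkout", "latency above threshold"),
]
incidents = correlate(events)
```

In this example the first two checkout events fall into one incident, while the search event and the later checkout event each form their own. Production AIOps tools typically use richer signals (topology, traces, learned models) than a fixed time window.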
For deployment and recoverability, its next step after AIOps, TCS will complement application deployment with automated rollbacks and ticket creation. At this stage, when facing application defects, the SRE team will also involve the development teams to conduct RCA and fix application defects.
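As an illustration of pairing a deployment with automated rollback and ticket creation, the sketch below shows one possible control flow in Python. It is an assumption-laden outline, not a description of TCS' tooling: the `deploy`, `health_check`, `rollback`, and `open_ticket` callables are hypothetical placeholders for whatever deployment, monitoring, and ITSM tools a client uses.

```python
from typing import Callable, List


def deploy_with_rollback(
    deploy: Callable[[str], None],
    health_check: Callable[[], bool],
    rollback: Callable[[str], None],
    open_ticket: Callable[[str], None],
    new_version: str,
    previous_version: str,
) -> bool:
    """Deploy new_version; if the health check fails, roll back and raise a ticket."""
    deploy(new_version)
    if health_check():
        return True
    # Health check failed: restore the last known good version and record the failure.
    rollback(previous_version)
    open_ticket(
        f"Deployment of {new_version} failed health check; rolled back to {previous_version}"
    )
    return False


# Usage with in-memory stand-ins for the real tools (hypothetical).
log: List[str] = []
tickets: List[str] = []
succeeded = deploy_with_rollback(
    deploy=lambda v: log.append(f"deploy {v}"),
    health_check=lambda: False,  # simulate an unhealthy release
    rollback=lambda v: log.append(f"rollback {v}"),
    open_ticket=tickets.append,
    new_version="2.1.0",
    previous_version="2.0.3",
)
```

The ticket created on failure is what would bring the development team into the root cause analysis described above.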
TCS will also conduct chaos engineering. Chaos engineering complements performance engineering and testing in that it evaluates applications' behavior under more strenuous conditions. With chaos engineering, TCS will conduct attacks such as instance shutdown, increased CPU usage, and black holes (dropping network traffic) to assess how the applications being tested behave. TCS has integrated tools such as Gremlin and Azure Chaos Studio in its DevOps portfolio to embed chaos engineering as part of continuous testing.
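The structure of an instance-shutdown experiment can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not how Gremlin or Azure Chaos Studio work internally: `ServicePool` is a hypothetical in-memory stand-in for a set of application instances, and the "steady state" is simply the share of requests that succeed.

```python
import random
from typing import Dict, Iterable


class ServicePool:
    """Toy model: a request succeeds as long as at least one instance is healthy."""

    def __init__(self, instances: Iterable[str]) -> None:
        self.healthy = set(instances)

    def shutdown(self, instance: str) -> None:
        self.healthy.discard(instance)

    def handle_request(self) -> bool:
        return len(self.healthy) > 0


def success_rate(pool: ServicePool, requests: int) -> float:
    return sum(pool.handle_request() for _ in range(requests)) / requests


def run_instance_shutdown_experiment(
    pool: ServicePool, rng: random.Random, requests: int = 100
) -> Dict[str, object]:
    """Measure steady state, shut down one random instance, and measure again."""
    baseline = success_rate(pool, requests)
    victim = rng.choice(sorted(pool.healthy))  # sorted for reproducible selection
    pool.shutdown(victim)
    after = success_rate(pool, requests)
    return {
        "victim": victim,
        "baseline": baseline,
        "after": after,
        # Hypothesis: losing one instance does not reduce the success rate.
        "hypothesis_holds": after >= baseline,
    }


redundant = run_instance_shutdown_experiment(ServicePool(["a", "b", "c"]), random.Random(0))
single = run_instance_shutdown_experiment(ServicePool(["a"]), random.Random(0))
```

With three instances the hypothesis holds after one shutdown; with a single instance it does not, which is the kind of weakness such an experiment is meant to surface before it occurs in production.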
Demand Is Still Nascent
TCS typically deploys SRE teams of six engineers for monitoring applications. It highlights that SRE adoption is still nascent, and it will lead such programs with marquee clients initially.
In broad terms, the future of SRE lies in DevOps and in becoming part of continuous testing, where all activities are scheduled and automated for each new build or release. TCS is an early mover in this area and is currently honing its tools and consulting capabilities. Platforms combining tools and targeting comprehensive services as part of continuous testing are the company's next step.