A major challenge today in AI deployment is the trust deficit. Engineers are naturally cautious about handing over control to AI, especially in high-stakes reliability domains. In a freewheeling discussion with CXO Media, Hemant Burman, Director of Engineering, Walmart, outlines how AI models must be explainable, tested for edge cases, and rollback-safe before they can be integrated into production workflows.
How do you see AI and machine learning reshaping core engineering practices over the next few years?
AI and ML are evolving from being ancillary tools to becoming foundational layers of engineering practice—especially in reliability engineering. Traditionally, SRE teams relied on static thresholds and reactive mechanisms. Now, we are entering a phase where AI augments core workflows, from incident detection and mitigation to deployment risk assessment and capacity planning.
Over the next few years, there is likely to be widespread adoption of autonomous observability systems, where ML models continuously learn system baselines and proactively flag deviations with high precision. AI will also fuel real-time root cause analysis, surfacing contributing factors across layers (network, app, infra) within seconds. Additionally, AI will influence intelligent release gating, enabling systems to prevent bad deploys without human intervention.
This shift will empower engineers to focus more on system design and proactive resilience rather than firefighting.
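To make the contrast with static thresholds concrete, here is a minimal sketch of a detector that learns a rolling baseline and flags deviations from it. It is an illustration only, not the system described in the interview; the class name `BaselineDetector`, the window size, and the z-score threshold are all assumptions.

```python
from collections import deque
from statistics import mean, stdev


class BaselineDetector:
    """Learns a rolling baseline and flags values that deviate from it.

    Hypothetical illustration: window and z_threshold are assumed defaults.
    """

    def __init__(self, window=30, z_threshold=3.0):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value):
        """Return True if value deviates from the learned baseline."""
        is_anomaly = False
        if len(self.history) >= 5:  # wait for a few points before judging
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                is_anomaly = True
        if not is_anomaly:
            # Only normal points update the baseline, so a spike
            # does not immediately become the new "normal".
            self.history.append(value)
        return is_anomaly


detector = BaselineDetector()
for latency_ms in [100, 102, 98, 101, 99, 100, 103, 97]:
    detector.observe(latency_ms)
spike_flagged = detector.observe(250)
```

Unlike a fixed threshold, the limit here moves with the recent history of the metric, which is the basic property a production model would need in a far more robust form.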
What are the biggest challenges engineering teams face when integrating AI/ML solutions into existing systems?
There are both technical and organizational hurdles. From a technical standpoint, many legacy systems do not expose clean telemetry or structured operational data needed to train and maintain reliable models. Observability gaps, inconsistent labels, or missing context can lead to inaccurate inferences or model drift.
Organizationally, a major challenge is the trust deficit. Engineers are naturally cautious about handing over control to AI, especially in high-stakes reliability domains. And rightly so—models must be explainable, tested for edge cases, and rollback-safe before they can be integrated into production workflows.
To bridge these gaps, we have built AI-assisted tools that augment, rather than replace, human judgment. For example, we deploy AI systems that suggest but do not enforce actions, or that sit behind a manual approval flow during early adoption. We also build robust evaluation pipelines that simulate incidents, drift, and false positives before greenlighting production use.
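The "suggest but do not enforce" pattern can be sketched roughly as follows. This is a hypothetical illustration, not the interviewee's tooling: the `Mode` enum, the `Suggestion` type, and the `handle` function are assumed names.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ADVISORY = "advisory"  # suggest only, never act
    APPROVAL = "approval"  # act only after a human approves


@dataclass
class Suggestion:
    action: str
    rationale: str


def handle(suggestion, mode, approved_by=None):
    """Return a description of what happened to the AI's suggestion."""
    if mode is Mode.ADVISORY:
        return f"SUGGESTED: {suggestion.action} ({suggestion.rationale})"
    if mode is Mode.APPROVAL and approved_by:
        return f"EXECUTED: {suggestion.action} (approved by {approved_by})"
    # In approval mode nothing runs until a named human signs off.
    return f"PENDING APPROVAL: {suggestion.action}"


s = Suggestion("roll back checkout-service", "error rate rose after deploy")
print(handle(s, Mode.ADVISORY))
print(handle(s, Mode.APPROVAL))
print(handle(s, Mode.APPROVAL, approved_by="on-call engineer"))
```

The point of the design is that no code path executes an action without either being in a purely advisory mode or carrying the name of an approver.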
How do you prioritize AI/ML initiatives within your technology roadmap—what factors drive those decisions?
We prioritize based on impact, feasibility, and alignment with SRE principles. A high-priority AI/ML initiative typically demonstrates the ability to reduce toil, improve MTTR/MTTD, or drive measurable improvement in SLIs.
For instance, if we observe repetitive incidents tied to post-deployment regressions, that is a strong candidate for an AI-driven solution like deployment risk scoring or change intelligence. We also consider data maturity—whether we have historical data with a strong signal-to-noise ratio—and operational integrability, i.e., how easily the AI output can tie into existing workflows like on-call tools, dashboards, or CI/CD gates.
Another key factor is long-term maintainability. We avoid prioritizing “one-off” ML experiments that cannot be monitored, retrained, or governed at scale. Instead, we invest in reusable ML infrastructure like shared feature stores and evaluation harnesses.
Can you share an example of a successful AI/ML deployment and the key factors that made it work?
One example was an AI-powered change intelligence system we built to score deployment risk across thousands of services. The model ingested metadata such as change frequency, past incident correlations, service criticality, and historical success/failure patterns. Based on this, it predicted the likelihood of a change causing an outage.
The key to its success was not just the algorithm—it was the tight integration into the SRE ecosystem. Engineers could see a risk score directly in the CI/CD pipeline, along with rationale and suggested mitigation steps (e.g., enable progressive rollout, add test coverage). We also added feedback loops where post-deployment outcomes were used to retrain the model.
What made it stick was:
- Clear ROI (30–40% reduction in incident-triggering deploys)
- Human-centric design (no “black box” automation)
- Embedded trust loops (engineer overrides and CoE feedback)
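As a rough illustration of risk scoring with a visible rationale, the sketch below combines the kinds of inputs mentioned above into a single score and ranks the factors behind it. The feature names, weights, and bias are invented for the example; a real model would learn them from historical deployment outcomes, and the interview does not say which algorithm was used.

```python
import math

# Hypothetical weights on features normalized to [0, 1].
WEIGHTS = {
    "past_incident_rate": 3.0,
    "historical_failure_rate": 2.5,
    "service_criticality": 1.5,
    "change_frequency": 1.0,
}
BIAS = -3.0  # assumed baseline: most changes are low risk


def score_change(features):
    """Return (risk, rationale) for a proposed change.

    risk is a value in (0, 1) from a logistic function; rationale lists
    feature names ordered from largest to smallest contribution.
    """
    contributions = {
        name: weight * features.get(name, 0.0)
        for name, weight in WEIGHTS.items()
    }
    logit = BIAS + sum(contributions.values())
    risk = 1.0 / (1.0 + math.exp(-logit))
    rationale = sorted(contributions, key=contributions.get, reverse=True)
    return risk, rationale


risk, rationale = score_change({
    "past_incident_rate": 0.8,
    "historical_failure_rate": 0.4,
    "service_criticality": 1.0,
    "change_frequency": 0.2,
})
print(f"risk={risk:.2f}, top factor={rationale[0]}")
```

Returning the ranked factors alongside the score is what lets a pipeline show engineers why a change was flagged instead of presenting a bare number.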
How do you ensure the ethical and responsible use of AI in product development and decision-making?
In SRE, responsibility and transparency are non-negotiable—especially when systems operate at scale and in production environments. To ensure responsible AI use, we follow three core principles:
- Explainability: Every AI decision must be traceable and justifiable. For example, if an incident correlation tool ranks an event as high priority, it must show the underlying signals and weights that led to that decision.
- Human override and safety controls: We always provide engineers with the ability to override or disable AI actions. No auto-remediation or rollout gating operates without manual fail-safes.
- Governance and auditability: All AI outputs are logged, versioned, and periodically reviewed. We also ensure that training data is stripped of PII and complies with internal privacy policies and relevant regulatory frameworks like GDPR.
By embedding these principles in design, we prevent the AI from becoming a risk vector and instead make it a responsible co-pilot in our reliability journey.
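One way to picture these three principles working together is a decision record that carries its own signals, accepts a human override, and is appended to an audit log. This is a minimal sketch under assumed names (`DecisionRecord`, `finalize`); it is not a description of the actual internal tooling.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DecisionRecord:
    model_version: str
    ai_decision: str
    signals: dict  # signal name -> weight behind the decision
    overridden_by: Optional[str] = None
    final_decision: str = ""


def finalize(record, audit_log, override_decision=None, engineer=None):
    """Apply an optional human override, then log the record as JSON."""
    if override_decision is not None:
        record.overridden_by = engineer
        record.final_decision = override_decision
    else:
        record.final_decision = record.ai_decision
    audit_log.append(json.dumps(asdict(record), sort_keys=True))
    return record.final_decision


audit_log = []
record = DecisionRecord(
    model_version="v1",  # hypothetical version label
    ai_decision="high_priority",
    signals={"error_rate_spike": 0.7, "recent_deploy": 0.3},
)
finalize(record, audit_log, override_decision="low_priority", engineer="alice")
```

Each log entry then holds the model version, the signals, the AI's original decision, and who changed it, which is the minimum a later review would need.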
What is your approach to building and scaling teams with strong AI/ML capabilities?
We have found success in a hybrid team model: integrating applied ML engineers within core infrastructure and SRE teams. This ensures AI practitioners understand the context of the system—SLIs, incident patterns, scaling bottlenecks—while SREs gain fluency in ML workflows and tooling.
Our hiring focuses on cross-functional talent—people who can reason about distributed systems and think statistically. Beyond hiring, we build organizational muscle by:
- Running AI/SRE boot camps internally
- Publishing internal reliability+ML playbooks
- Hosting “ML for reliability” hack weeks
- Creating safe experimentation zones where SREs can trial AI tooling in shadow mode
This approach ensures that AI/ML is not a side-project, but a core competency within the reliability domain.
With the rise of generative AI, how do you evaluate which tools or models are worth adopting in your tech stack?
Generative AI is incredibly promising but must be evaluated through an SRE lens of trust, safety, and control. We assess tools using these criteria:
- Domain fit: Is the model grounded in infrastructure concepts (e.g., logs, metrics, topology)? Can it understand SRE-specific contexts like postmortem formats or error logs?
- Control and observability: Can we monitor hallucination rates, configure token limits, and enforce output boundaries?
- Workflow integration: Does it meaningfully reduce toil or enhance outcomes—like summarizing incident chats, generating RCA drafts, or correlating metric anomalies with recent changes?
We pilot these tools in read-only or advisory modes first. Only after measuring performance, cost, and risk do we consider production integration. For example, our internal tooling now uses LLMs to prefill incident review templates, which improves speed without compromising accuracy.
How do you balance innovation in AI/ML with the need for system reliability, data privacy, and security?
We treat innovation and reliability as complementary goals, not opposing forces. Every AI/ML system is subject to the same SRE-grade rigour as any production system. That includes:
- Progressive rollout strategies: Canary deployments, red/black deployments, and shadow evaluation environments
- SLO-based gating: AI systems are held to reliability standards themselves—e.g., latency, correctness, or false positive thresholds
- Data governance: We use anonymization, access control, and clear lineage tracking across all AI training pipelines
- Resilience drills: We simulate degraded AI behaviour (drift, misclassification) to test fail-safes and fallback mechanisms
This approach allows us to innovate safely, without compromising the trust we’ve built around platform stability, privacy, or security.
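SLO-based gating with a fallback can be sketched as a check on the model's measured false positive rate, reverting to a simpler mechanism when the model drifts out of bounds. The 5% limit, the function names, and the "static_fallback" label are assumptions made for illustration.

```python
def false_positive_rate(predictions, labels):
    """Fraction of true negatives that the model wrongly flagged."""
    false_positives = sum(1 for p, y in zip(predictions, labels) if p and not y)
    negatives = sum(1 for y in labels if not y)
    return false_positives / negatives if negatives else 0.0


def choose_detector(predictions, labels, max_fpr=0.05):
    """Keep the model only while it meets its own SLO (assumed 5% limit)."""
    if false_positive_rate(predictions, labels) <= max_fpr:
        return "model"
    return "static_fallback"


# Simulated drill: labels are ground truth, predictions are model output.
labels = [True, False, False, False]
healthy = [True, False, False, False]
drifted = [True, True, True, False]  # degraded model flags normal events

print(choose_detector(healthy, labels))
print(choose_detector(drifted, labels))
```

Feeding deliberately degraded predictions through the same gate is a simple form of the resilience drill described above: it verifies that the fallback path actually triggers.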

