The hybrid cloud, once promoted as a source of flexibility and scalability, has become a complex operational challenge for enterprises. Integrating on-premises infrastructure with multiple public and private cloud environments has produced a labyrinth of interconnected systems and siloed teams, making it increasingly difficult to see, understand, and quickly remediate operational issues. The pace of AI adoption compounds the problem: already stretched site reliability and operations teams must maintain existing systems while also adapting to and using new AI-driven capabilities. The instability and security vulnerabilities of modern enterprise stacks amplify these challenges, pushing traditional operational models to their breaking point.
To address this evolving landscape, enterprise operations must undergo a fundamental transformation, moving beyond the reactive, siloed approaches of the past. The notion that maintaining rigid boundaries between services or divisions offers security is now a relic, as these very silos impede the potential return on investment from AI initiatives. The path forward lies in adopting a closed-loop operations model, where orchestration, observability, and remediation are integrated into a continuous feedback cycle, powered by intelligent AI agents.
Phanidhar Koganti, Senior Distinguished Technologist in Hewlett Packard Enterprise (HPE) Hybrid Cloud, emphasizes this paradigm shift. "You’ve got to understand that Day 2 and Day 1 are in a closed loop because what you provision, you need to understand what the current Brownfield is. And you might not want to make a lot of changes when there are a lot of issues going on," Koganti explained to The New Stack. This interconnectedness is crucial for managing the "Brownfield" – the existing, often complex, operational environment – and ensuring that new deployments do not introduce further instability.
The current constraints facing operations teams necessitate a strategic application of AI techniques to extract meaningful signals from the overwhelming volume of data. This transition from manual, "DIY" approaches to autonomous remediation requires more than just an infusion of AI tools; it demands a holistic platform engineering strategy, advanced predictive analytics, and the redefinition of operational metrics. The goal is not to replace human operators but to augment their capabilities, enabling them to be more efficient, strategic, and less stressed in an increasingly demanding environment.
The Era of Constrained Resources
Some, such as the Outcome Engineering Manifesto, have heralded AI as a catalyst that will "limit creation only by the cost of compute, not capacity." For many operations teams, however, the reality has been the opposite: AI has increased the demands on their time and resources, leaving them feeling more constrained rather than less.
Sridhar Katere, VP of Engineering for HPE’s Data Center Business Unit, observes this pressure firsthand. "Our customers are under pressure to continue to offer the same SLAs [service level agreements] with a lot less resources," Katere stated. This translates into leaner operations teams tasked with troubleshooting increasingly complex issues that arise during the critical "Day 2" operations phase.
Operations agents, managed by these teams, offer a way to scale troubleshooting and remediation without adding headcount. HPE OpsRamp Software, which recently achieved General Availability (GA), introduces an agentic operations copilot. Koganti elaborates on its capabilities: "You can express what you are trying to achieve in a very high-level intent, and that will be converted to detailed deployment plans, which will include data center, networking-related automation, storage, and various other components for the whole infrastructure to come together." Translating high-level intent into detailed plans in this way streamlines complex deployments and ongoing management.
Beyond immediate remediation, AI can help operations teams shift from a reactive to a proactive stance, including making the case for right-sized resource budgets. "When we say issues, it doesn’t mean the nature of the failure is service-disruptive only," Koganti noted. "If you’re going to go out of capacity, meaning optics could fail, that’s a binary failure that you’re trying to predict."
The ability to predict impending failures, such as a critical switch nearing the end of its operational life, provides invaluable lead time. This foresight not only allows for the prevention of costly outages but also enables more accurate hardware resource planning. In an era of persistent supply chain uncertainties, this predictive capability is a significant advantage. "This is where we are building predictive analytics to help customers plan and procure the necessary hardware and components ahead of time," Katere added.
HPE’s strategic investments and recent acquisitions are geared towards building a comprehensive operations control center designed to define this next-generation hybrid, multi-cloud operational model. Their CloudOps Software suite, anchored by HPE OpsRamp Software and HPE Morpheus Software, with HPE Apstra Data Center Director available as an add-on, aims to provide organizations with the tools to map out their complete cloud operating model. Following the GA release of OpsRamp, other HPE products are anticipated to integrate agentic systems with conversational interfaces, further enhancing user interaction and operational efficiency.
"Whatever we are talking about in optimizing the full stack will be applicable to capacity issues or even performance issues across the whole stack," Koganti asserted, underscoring the multi-layered nature of enterprise systems and the interconnectedness of operational challenges.
Redefining Operations Metrics for the AI Age
While traditional metrics like Mean Time To Resolution (MTTR) – calculated as total downtime divided by the total number of incidents – remain relevant, they are increasingly insufficient in the face of enterprise-grade complexity. With AI in the loop, how quickly and by what method an issue is detected and identified matters as well.
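As a worked example of the MTTR definition above (total downtime divided by the number of incidents), the following minimal sketch computes the value for three made-up incidents. The `Incident` class and the timestamps are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with open and resolve timestamps."""
    opened: datetime
    resolved: datetime

    @property
    def downtime(self) -> timedelta:
        return self.resolved - self.opened


def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    """MTTR = total downtime / number of incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined when there are no incidents")
    total_downtime = sum((i.downtime for i in incidents), timedelta())
    return total_downtime / len(incidents)


# Made-up incidents lasting 30, 90, and 60 minutes.
incidents = [
    Incident(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 30)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
    Incident(datetime(2024, 1, 3, 22, 0), datetime(2024, 1, 3, 23, 0)),
]
mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00 (180 minutes of downtime over 3 incidents)
```

Note that the single average hides how each incident was detected and which layer caused it, which is the gap the metrics discussed next try to close.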
AI-driven operations must prioritize "time to correlate," leveraging AI operations tooling to analyze logs, metrics, and traces. The objective is to connect disparate signals and distill them into a single, traceable, and actionable recommendation. This process is critical for accurately diagnosing issues in complex, interconnected environments.
The network is often the first suspect in an incident. Yet in a full-stack environment, the symptom of a failure and its root cause rarely reside in the same layer. "First instinct is often that network is the culprit, so it’s very important for networking to self-diagnose and come back to say: Hey, I’m not the problem," Katere stated, highlighting the need for components to accurately report their status.
This self-diagnostic capability is crucial for optimizing "mean time to innocence," a concept that addresses the time taken to confirm that a particular component or layer is not the source of the problem. Koganti elaborated on this phenomenon: "What is common that we see among the majority of our customers is that the symptom of a failure and the cause of the failure are never in the same layer. What I mean by that is, for example, the symptom can appear in the application layer as my application transactions are timing out, and so on. While the actual cause of it could be somewhere down in the network layer or the storage layer, most of the time, the network is the culprit."
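One way to make "mean time to innocence" concrete is to record, for each incident, when each layer was ruled out as the cause and then average the delay per layer. The sketch below assumes a hypothetical record shape (an incident open time plus a mapping from layer name to the time that layer was cleared); neither the shape nor the function name comes from HPE's tooling.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical record: (incident_opened, {layer_name: time_layer_was_cleared}).
Incident = tuple[datetime, dict[str, datetime]]


def mean_time_to_innocence(incidents: list[Incident], layer: str) -> timedelta:
    """Average delay between an incident opening and `layer` being ruled out."""
    seconds = [
        (cleared[layer] - opened).total_seconds()
        for opened, cleared in incidents
        if layer in cleared  # skip incidents where the layer was never cleared
    ]
    if not seconds:
        raise ValueError(f"no incidents in which {layer!r} was cleared")
    return timedelta(seconds=mean(seconds))


# Made-up data: the network is cleared after 5 and 15 minutes.
incidents = [
    (datetime(2024, 1, 1, 9, 0),
     {"network": datetime(2024, 1, 1, 9, 5),
      "storage": datetime(2024, 1, 1, 9, 20)}),
    (datetime(2024, 1, 2, 14, 0),
     {"network": datetime(2024, 1, 2, 14, 15)}),
]
network_mtti = mean_time_to_innocence(incidents, "network")
print(network_mtti)  # 0:10:00
```

Tracking the figure per layer, rather than as one global number, shows which teams spend the longest proving they are not the source of a problem.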
The escalating complexity of IT stacks necessitates enhanced collaboration across various layers, both through agentic interactions and traditional AIOps collaboration. In scenarios where multiple failures occur concurrently, as is common in enterprise environments, tighter integrations across the full stack are paramount. AI plays a vital role in filtering out extraneous noise and identifying the critical signals that point to the true cause. The subsequent step involves AI agents autonomously remediating issues, but crucially, only when their actions are explainable and verifiable by human operators.
HPE tracks its own efficacy metrics, including quarterly Key Performance Indicators (KPIs) designed to measure the platform’s ability to predict and troubleshoot issues. Currently, the HPE platform, which uses graph databases and GraphQL with a highly contextualized data stack, demonstrates approximately 40% accuracy in troubleshooting. The company aims to exceed 70% accuracy by the end of the year. This ongoing development integrates nearly a century of HPE’s enterprise experience into its CloudOps Software, alongside agentic skills trained across various layers, from the platform and drivers to replaceable units and the overall system.
"We are not living in a static world. It’s a very dynamic world," Katere acknowledged, recognizing that the pursuit of operational excellence is a continuous journey. He further noted that this dynamic environment presents a "moving target in light of how fast things are changing even in the slower enterprise software development lifecycle." This acknowledgment underscores the agile and adaptive approach required for modern enterprise operations.
The continuous evolution of hybrid cloud environments, coupled with the rapid integration of AI, presents both significant challenges and transformative opportunities. Organizations that embrace a closed-loop, AI-driven operational model, focusing on intelligent automation, predictive analytics, and redefined metrics, will be best positioned to navigate this complexity, ensure stability, and unlock the full potential of their technology investments. The journey from a reactive, siloed past to a proactive, integrated future is well underway, and AI is emerging as the indispensable compass for enterprise operations.
