Copilots to Operators: The Agentic Evolution of Enterprise IT

Edi Susilo Dewantoro, March 27, 2026

The relentless march of technological advancement, particularly the rapid integration of artificial intelligence, has placed unprecedented pressure on enterprise IT operations teams. Faced with escalating complexity, an ever-increasing volume of alerts, and the persistent threat of cyberattacks, these teams are teetering on the brink of operational fatigue. Hewlett Packard Enterprise (HPE) is proposing a significant shift in how these challenges are met, heralding the dawn of an "Agentic Era" in enterprise IT, where specialized AI agents, imbued with "agent skills," will collaborate with human operators to navigate the intricate landscape of modern infrastructure.

This transition, outlined in HPE’s whitepaper "Copilots to Operators: The Agentic Evolution of Enterprise IT," suggests a future where AI moves beyond simple assistance to become a proactive, intelligent partner in managing and securing complex IT environments. The core of this evolution lies in the development of agentic AI systems that possess specialized knowledge and workflows, capable of bridging persistent data and operational silos. When granted explicit permission and operating within auditable frameworks, these AI agents can undertake autonomous actions based on goal-oriented reasoning, fundamentally altering the dynamics of incident response and IT management.

The Growing Strain on Operations Teams

The urgency for such advancements is underscored by a stark reality: fewer than half of enterprises feel operationally prepared for the widespread adoption of AI across their infrastructure, data management, risk assessment, and talent development. This preparedness gap places an immense burden on already stretched operations teams, who are increasingly tasked with managing the complexities introduced by AI itself. The exponential growth in AI-generated code, while offering potential benefits, also introduces new security risks and amplifies the existing challenge of keeping pace with technological change.

A recent study of cybersecurity and operations leaders highlights the most pressing concerns: a significant percentage cited the overwhelming volume of alerts as a primary issue, leading to a situation where a substantial portion of these alerts are never investigated. Osterman Research data further illustrates this critical failure, revealing that 40% of alerts in large enterprises go unexamined due to sheer volume. This inaction has tangible consequences, with 73% of organizations reporting outages in 2025 directly linked to these overlooked or suppressed alerts.

The complexity inherent in modern hybrid and multi-cloud environments exacerbates these challenges. A staggering two-thirds of enterprises utilizing these architectures express a lack of confidence in their real-time threat detection and response capabilities. This technological complexity is not merely a technical hurdle; it is a direct driver of emotional exhaustion among IT professionals. While engineers may push through in the short term, this persistent cognitive drag contributes to long-term attrition. The specialized nature of these roles makes them difficult to fill, leading to a loss of invaluable institutional knowledge as experienced professionals depart. The consequences of this burnout extend beyond employee retention, negatively impacting productivity and extending incident response times, thereby increasing the likelihood of avoidable errors. All of this occurs against a backdrop of escalating cybersecurity threats and the accelerating pace of AI-driven code generation, creating a perfect storm of more code, more alerts, and insufficient human resources.

The Promise of Agentic Root Cause Analysis

Agentic AI for DevOps, which refers to the application of autonomous AI solutions to operational tasks, presents a compelling opportunity to alleviate the workload of human operators, reduce alert noise, and significantly enhance response times. However, it is crucial to recognize that AI is not a panacea. Many existing AI tools, rather than streamlining triage, have inadvertently increased alert noise, further eroding trust in the technology. A concerning 66% of AI tools are known to generate false positives, a statistic that directly contributes to increased stress and errors among IT personnel. These inaccuracies often stem from stale data within AI models and a lack of transparency in how decisions are reached.

To foster transparency in complex, distributed systems, any enterprise-grade operational agentic AI solution must be designed to dismantle cross-organizational data silos. Platform engineering has emerged as a critical pathway to not only consolidate disparate datasets but also to establish essential guardrails and gates for quality, security, and compliance, benefiting both human and AI-driven development processes.

The HPE whitepaper posits that when implemented correctly, agentic operations can deliver several key benefits:

  • Reduced Operational Fatigue: By automating routine tasks and assisting with complex analysis, AI agents can free up human operators to focus on higher-level strategic initiatives.
  • Faster Incident Resolution: Agentic AI’s ability to rapidly process vast amounts of data and identify patterns can significantly shorten the time it takes to pinpoint the root cause of an incident.
  • Enhanced Proactive Maintenance: AI agents can identify potential issues before they escalate into critical failures, enabling a shift from reactive to proactive system management.
  • Improved Accuracy and Consistency: AI agents, free from human biases and fatigue, can perform analysis with a high degree of accuracy and consistency.
  • Knowledge Augmentation: Agentic AI can act as an extension of human expertise, providing access to a broader range of information and analytical capabilities.

During the beta program for HPE’s agentic operations copilot, early adopters have reported significant improvements, particularly in root cause analysis. AI agents have proven adept at overcoming blind spots that human teams might encounter. For instance, a single enterprise might deploy numerous code releases weekly, making it impossible for any human operator to track every change. AI, however, can meticulously analyze these changes, identify correlations with emerging issues, and leverage cross-organizational memory to provide crucial context.

Phanidhar Koganti, senior distinguished technologist in HPE hybrid cloud, elaborates on this point: "During our beta program, a lot of our customers have told us that many issues that happen will typically be related to a change they made four or five days previously. They explicitly want us to track the changes they are making and take that as an additional context when agentically root cause analyzing a particular issue." This capability highlights the power of AI agents in providing a comprehensive historical and contextual understanding that is often beyond human capacity to recall or process in real time.
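Koganti's point — that incidents often trace back to a change made days earlier — amounts to a windowed lookup against the change log. A minimal sketch in Python; the field names and data are assumed for illustration, not taken from HPE's product:

```python
from datetime import datetime, timedelta

def correlate_changes(incident_time, change_events, lookback_days=7):
    """Return change events inside the lookback window, most recent first.

    change_events: list of dicts with 'timestamp' (datetime) and 'summary'.
    Purely illustrative; a real agent would also weight by affected component.
    """
    window_start = incident_time - timedelta(days=lookback_days)
    candidates = [
        ev for ev in change_events
        if window_start <= ev["timestamp"] <= incident_time
    ]
    return sorted(candidates, key=lambda ev: ev["timestamp"], reverse=True)

# Example: an incident at noon, one change five days earlier (in window),
# one change nineteen days earlier (outside the window).
incident = datetime(2026, 3, 20, 12, 0)
changes = [
    {"timestamp": datetime(2026, 3, 15, 9, 30), "summary": "rolled out release v2.4.1"},
    {"timestamp": datetime(2026, 3, 1, 8, 0), "summary": "upgraded hypervisor"},
]
suspects = correlate_changes(incident, changes)
```

With a seven-day window, only the release from five days prior survives the filter and becomes the top suspect for further investigation.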

The whitepaper details the stages an agentic operator works through during root cause analysis, including:

  • Data Ingestion and Correlation: Gathering logs, metrics, traces, and change event data from across the IT environment.
  • Pattern Recognition: Identifying anomalies and correlations within the ingested data.
  • Hypothesis Generation: Formulating potential causes for the observed issues based on recognized patterns.
  • Hypothesis Testing: Actively querying systems and running diagnostics to validate or refute generated hypotheses.
  • Contextualization: Integrating external information, such as known vulnerabilities or ongoing maintenance windows, into the analysis.
  • Root Cause Identification: Pinpointing the most probable cause of the incident.
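The six stages above can be sketched as a single pipeline. This is an illustrative toy under invented assumptions (event schema, counting heuristic, stubbed context), not HPE's implementation:

```python
def run_rca(events):
    """Toy walk through the whitepaper's six RCA stages; all logic is illustrative."""
    # 1. Data ingestion and correlation: merge alerts and changes on one timeline
    timeline = sorted(events, key=lambda e: e["time"])

    # 2. Pattern recognition: count alerts per component to surface anomalies
    counts = {}
    for e in timeline:
        if e["kind"] == "alert":
            counts[e["component"]] = counts.get(e["component"], 0) + 1

    # 3. Hypothesis generation: the noisiest component is a candidate cause
    suspect = max(counts, key=counts.get)

    # 4. Hypothesis testing: did a recent change touch the suspect component?
    evidence = [e for e in timeline
                if e["component"] == suspect and e["kind"] == "change"]

    # 5. Contextualization: fold in external facts (stubbed here)
    context = {"maintenance_window": False}

    # 6. Root cause identification: report the most probable cause with evidence
    return {"root_cause": suspect, "evidence": evidence, "context": context}

# Toy event stream: two storage alerts preceded by a storage-tier change
events = [
    {"time": 1, "kind": "change", "component": "storage", "detail": "firmware update"},
    {"time": 2, "kind": "alert", "component": "storage", "detail": "io_latency"},
    {"time": 3, "kind": "alert", "component": "storage", "detail": "io_errors"},
    {"time": 4, "kind": "alert", "component": "network", "detail": "packet_loss"},
]
report = run_rca(events)
```

Here the storage tier accumulates the most alerts, and the preceding firmware update surfaces as the supporting evidence — the same change-to-symptom linkage the beta customers asked for.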

As SREs, DevOps, and sysadmin teams contribute their crucial institutional knowledge, this information is fed back into the agentic memory, enhancing the collective understanding of both agents and humans across the organization.

The Power of Skills-Based AI Agents

A critical distinction highlighted by Koganti is the necessity of moving beyond general-purpose large language models (LLMs) for enterprise operations. The effectiveness of agentic AI in this domain hinges on the concept of "agent skills." These skills represent specialized knowledge, capabilities, and workflows tailored to specific operational tasks. Instead of providing an LLM with exhaustive details, operators can offer high-level guidance, akin to outlining a skeleton.

"You are not giving it 100% of the details, but you’re giving it high-level guidance on the skeleton," Koganti explains. "In the operations world, let’s say you get a particular type of alert with a particular symptom, like virtualization issues, then you know you have a knowledge or a skill saying that: For these kinds of alerts related to virtualization, you want to go and look at the CPU utilization in the VM and look at the storage IO with respect to a particular other detail and so on. Providing high-level directional guidance, captured in skills, is necessary, because all this agentic stuff, if you leave it 100% to LLMs, they hallucinate anything."

This structured approach ensures that AI agents operate within defined parameters, mitigating the risk of hallucinations and improving the reliability of their outputs. Agent skills are already gaining traction among developers, and HPE’s initiative aims to bring this paradigm to the operations space. "That’s a unique thing, and we believe it’s only a matter of time until the rest of the vendors in the market will also align with that, similar to how Infrastructure as Code was adopted primarily from the developer side of the ecosystem at first," Koganti anticipates, as HPE seeks to encode curated ops skills beyond root cause analysis and incident investigation to encompass specific domains like virtualization and networking.
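One plausible way to encode such a skill is as structured, declarative guidance mapping alert symptoms to the checks an agent should run, rather than free-form prompt text. A hypothetical sketch — the skill names, symptoms, and check lists are invented for illustration:

```python
# Each skill captures high-level, directional guidance ("the skeleton"),
# not 100% of the details. The LLM fills in execution within these bounds.
SKILLS = {
    "virtualization": {
        "match_symptoms": ["vm_unresponsive", "vm_slow"],
        "checks": ["cpu_utilization", "storage_io", "memory_ballooning"],
    },
    "networking": {
        "match_symptoms": ["packet_loss", "high_latency"],
        "checks": ["interface_errors", "routing_table", "dns_resolution"],
    },
}

def select_skill(alert_symptom):
    """Return the (skill name, checks) whose symptom list matches the alert."""
    for name, skill in SKILLS.items():
        if alert_symptom in skill["match_symptoms"]:
            return name, skill["checks"]
    return None, []  # no skill applies: escalate to a human rather than guess

name, checks = select_skill("vm_slow")
```

Because the agent only chooses among human-curated checks, an unmatched symptom falls through to a human instead of inviting the open-ended hallucination Koganti warns about.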

Agentic Auditability: The Cornerstone of Trust

For AI in operations to gain widespread acceptance, particularly within highly regulated industries, it must address the inherent trust gap. Compliance, cybersecurity mandates, and the demands of operators necessitate that AI agents can clearly explain and substantiate their decision-making processes. HPE’s approach to autonomous operators is therefore built with a strong emphasis on audit trails, transparent reasoning, and comprehensive observability.

  • Full Audit Trail: Every action taken by an AI agent is meticulously logged, creating a complete record of its operations. This includes:

    • Event Sequencing: A clear chronological record of all actions performed.
    • Data Provenance: Traceability of the data used to inform decisions.
    • Configuration Changes: Documentation of any modifications made to agent parameters or skills.
  • Transparent Reasoning: The logic behind an AI agent’s conclusions is made accessible, allowing human operators to understand how a particular outcome was reached. This involves:

    • Decision Pathways: Visualization of the steps taken by the agent to arrive at a conclusion.
    • Confidence Scores: Indication of the AI’s certainty in its analysis or recommendations.
    • Rule Explanations: Clarity on the specific "agent skills" or rules that were triggered.
  • Observability and Traceability: The operational state and performance of AI agents are continuously monitored, providing insights into their behavior and enabling swift identification of any anomalies. This includes:

    • Real-time Monitoring: Continuous tracking of agent activity and performance metrics.
    • Performance Dashboards: Visual representations of agent efficiency and effectiveness.
    • Alerting on Agent Anomalies: Proactive notification if an agent deviates from expected behavior.
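A single audit-trail entry combining the elements above — event sequencing, data provenance, the skill triggered, and a confidence score — might look like the following sketch. The field names are illustrative, not HPE's schema:

```python
from datetime import datetime, timezone

def audit_record(agent_id, action, inputs, skill, confidence):
    """Build one audit-trail entry: sequencing, provenance, reasoning, certainty."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # event sequencing
        "agent_id": agent_id,
        "action": action,
        "data_provenance": inputs,   # traceability of the evidence consulted
        "skill_triggered": skill,    # which agent skill or rule fired
        "confidence": confidence,    # the agent's certainty in its conclusion
    }

entry = audit_record(
    agent_id="rca-agent-01",
    action="hypothesis_tested",
    inputs=["metrics:vm-cluster-3", "logs:hypervisor"],
    skill="virtualization",
    confidence=0.87,
)
```

Appending one such record per agent action yields the complete, replayable history that regulated industries require before trusting an autonomous operator.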

"Operators do get burnt out, especially in high pressure moments when these issues typically happen, and they do make a lot of mistakes, whereas the machine doesn’t miss a piece of data, doesn’t make any mistakes in gathering the right pieces of data, as well as doing a very fast and objective analysis," Koganti emphasizes, underscoring the value of agentic root cause analysis in mitigating human error and improving efficiency.

Despite the advanced capabilities of agentic AI, HPE is adopting a cautious approach to autonomous remediation. The AI operations agent will provide suggestions and recommendations, but it will not execute actions without explicit human authorization. This "human-in-the-loop" model ensures that critical remediation steps remain under human control, even as the AI significantly accelerates the discovery of the root cause, potentially cutting the time required by up to half.

"The actual remediation, which involves, perhaps, touching the particular deployment – let’s say you want to reboot something – is up to the operator. OpsRamp does have the ability to automatically trigger selective fixes, that must be configured by the human. None of our agents will take autonomous actions. It is policy-driven, and that policy will be that it is human-configured," Koganti clarifies. This policy-driven framework, where human operators define the parameters for any automated actions, is central to building trust and ensuring safe adoption of agentic AI in critical operational environments.
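The policy-driven gate Koganti describes can be illustrated as a simple decision function: an action executes only if a human pre-authorized it in policy or explicitly approves the specific request; otherwise it remains a suggestion. A hypothetical sketch, not OpsRamp's actual mechanism:

```python
def execute_remediation(action, policy, approved_by=None):
    """Human-in-the-loop gate: agents suggest; humans authorize.

    policy["auto_approved"] lists actions a human operator has explicitly
    pre-configured for automatic execution (e.g. selective fixes in OpsRamp).
    """
    if action in policy.get("auto_approved", []):
        return f"executed (pre-authorized by policy): {action}"
    if approved_by:
        return f"executed (approved by {approved_by}): {action}"
    return f"suggested, awaiting human authorization: {action}"

# Policy content is human-configured; the agent never widens it on its own.
policy = {"auto_approved": ["restart_service"]}
```

A call like `execute_remediation("reboot_host", policy)` therefore stops at a suggestion, while `execute_remediation("reboot_host", policy, approved_by="alice")` executes — the authorization decision always traces back to a named human or a human-written policy entry.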

As the report suggests, by embracing agentic skills, enterprises are initiating a fundamental shift from a reactive approach to problem-solving towards the proactive construction of self-healing and self-optimizing systems. This evolution marks a significant step towards a more resilient, efficient, and secure enterprise IT future, where human expertise and AI capabilities converge to meet the challenges of tomorrow.
