Why Network and IT Management Are Failing
Nobody's happy with how their network and IT management is handled. Application performance monitoring (APM) or "observability" could be the key to everything.
This article originally appeared on No Jitter.
According to almost 100 CIOs, 39% of line managers are not fully satisfied with the applications that IT has provided. CIOs say that the thing that creates the most problems for line managers is application failure, which to them means both actual outages and situations where performance doesn't meet business goals. Just under three-quarters of these same CIOs say that satisfying line managers is the responsibility of their network and IT management systems and staff, and only 10% of CIOs say they're satisfied that the responsibility is being met. Why? Because, CIOs say, they're managing the wrong thing.
Part of the problem is the familiar gap between network and IT management goals and what matters to users. Users care about how well their applications run. You can imagine senior managers sitting around talking about key performance indicators (KPIs) and quality of service (QoS) while operations people are talking about things like quality of experience (QoE) and uptime. Are we in the same universe here? Probably not, because from the start it's difficult to find a conversion factor that turns operations management metrics into user satisfaction metrics. Only a third of CIOs think they've mastered that, and another third aren't even sure it's possible. I don't have broad data on line management attitudes, but what's available to me says that well over 80% of line managers believe the IT and network organizations don't "see" all the problems that impact their operation.
I don't want to talk about how many QoS's can fit on a KPI here; that argument has pretty much proven useless to both sides. Instead, I want to focus on that last point, which is the fact that operations people don't even see the things that matter. How can that be, given all the stuff that's collected in management systems and logs? In many enterprises, surprisingly, there's nobody really asking that question. In fact, only 35 of the 198 enterprises that told me they had a line/IT disconnect said that there was a specific team or even a person who was supposed to be crossing the divide. Those that did cited a single issue as the primary cause of line dissatisfaction, which was management system polarization.
On average, those 35 enterprises said that they had five different management systems, each with its own operations staff: one running the data center equipment, one running the cloud, one running the wide-area network, one for support of connectivity in remote offices, and one for applications. The order in which I've presented them is the average of the order in which the 35 enterprises offered them, and most admitted that they were listed as they came to mind, meaning in the order of their intrinsic importance. The problem, according to almost all of the 35, is that no single system reflects the state of the user's essential IT resources, and there's no consistent strategy for creating a user-side view.
There's a reason for this, of course. Nearly all IT and network operations people agree that their management processes are driven by alerts from problems. These professionals get an alert, an error message that says something is wrong with the IT elements under their control. They perform problem determination/isolation, and they fix it. This is a daily process for most enterprises, and not surprisingly it drives both the selection of tools and the way that operations practices are defined. Of course, the great majority of these alerts are only indirectly related to user QoE, and many things that impact QoE don't cause alerts. Further, some conditions that are visible to IT and network operations personnel cause QoE issues that go unrecognized, because management separation keeps those conditions from being correlated.
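To make the correlation point concrete, here is a minimal sketch of what joining alerts across separate management systems could look like. It is not any vendor's method; the `Alert` record, the domain labels, and the 60-second window are all assumptions chosen for illustration. The idea is simply that alerts arriving close together from different domains may share a cause that no single system would see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    domain: str       # hypothetical label, e.g. "wan", "cloud", "data_center"
    timestamp: float  # seconds on a clock shared by all management systems
    message: str


def correlate(alerts, window_seconds=60.0):
    """Group alerts that arrive within window_seconds of the previous alert,
    then keep only the groups that span more than one management domain."""
    groups = []
    current = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        # A gap longer than the window closes the current group.
        if current and alert.timestamp - current[-1].timestamp > window_seconds:
            groups.append(current)
            current = []
        current.append(alert)
    if current:
        groups.append(current)
    # Single-domain groups are already visible to one operations team.
    return [g for g in groups if len({a.domain for a in g}) > 1]
```

Time proximity is a crude stand-in for real causality, so a sketch like this can only suggest where two teams should compare notes; it cannot say which domain is at fault.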
So we need a single pane of glass, as they say? The problem with that, say the 35 enterprises, is that the single pane of glass usually turns out to be frosted glass. Or, perhaps, one viewed through frosted eyeballs. A common management console that receives alert notifications from our five hypothetical management systems can display them, but to whom? Who will, as the saying goes, watch the guards themselves? According to the 35 enterprises who have actual people responsible for overall QoE, that doesn't mean they have people who could understand all those management views. They need another strategy, or better yet, multiple strategic steps to take.
Where is the best view of QoE obtained? According to our 35 enterprises, it's at the application level, meaning that application performance monitoring (APM) or "observability" should be the key to everything. With APM, we should be able to track work as it moves through network and cloud and data center, seeing every point of fault or delay, tracking every trend. The ultimate visibility from the perspective of the user, a way of measuring the experience behind QoE.
Well, the good news is that there aren't five different APM systems for our professionals to have to monitor. The bad news is that there's usually one in each of the technology areas I mentioned in the last paragraph, and none of them have full visibility. Good APM needs to be based on "probes" inserted in software to log activity and support time-based analysis of real application workflows. Many enterprises say they use APM but fewer than a fifth of them say they have probes in all their key applications, capable of tracking work through the domains of the other four management systems. The great equalizing management strategy, APM, is mostly blind so it can't make much use of single panes of glass either.
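As a rough illustration of what a "probe" means here, the sketch below wraps a step of application work, records how long it took and in which management domain, and then totals time per domain for one unit of work. Everything in it is hypothetical: the `probe` helper, the `SPANS` in-memory log, and the domain names are assumptions for illustration, and a real APM product would ship these records to its own backend rather than a Python list.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

SPANS = []  # in-memory stand-in for an APM backend (assumption)


@contextmanager
def probe(trace_id, domain, step, clock=time.monotonic):
    """Record the elapsed time of one step of work, tagged with the
    unit of work it belongs to (trace_id) and the domain it ran in."""
    start = clock()
    try:
        yield
    finally:
        SPANS.append({
            "trace_id": trace_id,
            "domain": domain,
            "step": step,
            "seconds": clock() - start,
        })


def time_by_domain(trace_id, spans):
    """Total recorded time per domain for one unit of work.
    Assumes probes are not nested; nested probes would be double counted."""
    totals = defaultdict(float)
    for span in spans:
        if span["trace_id"] == trace_id:
            totals[span["domain"]] += span["seconds"]
    return dict(totals)
```

A caller would write something like `with probe("order-123", "network", "send request"): ...` around each step. The value comes from the shared `trace_id`: it is what lets one view follow a single user transaction across the network, cloud, and data center, which is exactly what probes in only some applications cannot provide.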
Where this leaves enterprises is simple and sad. Even the 35 enterprises who have some sense of overall performance as it relates to line users say they get line user views from ... the line users. Somebody calls a help desk. Which one? That depends on the judgment of the caller. They can't connect with their application, so maybe the network is down, or maybe the application, or the cloud or the data center... or maybe it's space aliens shooting cosmic rays? The point is that the "alert" here is generated by a tech amateur and yet it is almost surely handled by whoever the amateur decides is responsible. Pause here while the Chorus of the Thirty-Five Enterprises sings, "Two different worlds, we live in two different worlds," talking about network and IT -- or maybe even "Five different worlds" if we want management systems precision.
Separate management strategies create separate failures. You cannot unify network and IT management without something to unify around, which means that we have to make APM real and complete, capable of being the universal help-desk contact point for users and the response coordinator for all those disparate management teams. How many of those 35 enterprises offer their user response teams that APM-centric perspective? You guessed it ... zero. That has to change or we're spinning our wheels when we talk about supporting the end users with our networks and IT.