Our just-in-time, highly interconnected economy depends on digital technology to support an unparalleled level of speed, convenience, and prosperity. But that means the malfunction of even a small component can cause massive problems.
In the latest example, thousands of flights across the U.S. were canceled or delayed on Jan. 11, 2023, in an FAA “ground stop” – the first such nationwide aviation halt since 9/11. The cause? A key file in the Notice to Air Missions (NOTAM) database was unintentionally deleted by a contractor who was trying to correct synchronization problems with the backup database.
NOTAM delivers vital safety updates to pilots about potential problems such as storms, airport construction, even air shows that might cross their paths. When the NOTAM system failed early on the morning of Jan. 11, the U.S. Federal Aviation Administration (FAA) called a ground stop, pausing takeoffs until it could restore service (about two hours later). The result: delayed flights for most of the day and another set of frustrated customers simply trying to get from one place to another.
After the FAA ground stop, the agency’s next steps followed the usual script: investigations that found no evidence of a cyberattack, a promise to make its systems more resilient, requests for more IT funding, and preparation for tough questioning from Congress about its poor planning.
Technology vulnerabilities aren’t unique to aviation
Now that the dust has settled, let’s look at the underlying problem: complex, vulnerable, often aging IT systems that need extensive monitoring. Airlines, railways, HMOs, banks, factories, utilities, retailers, and more – just about every industry counts on highly available core systems, business applications, hardware, and networks.
And these IT components depend on a swift, uninterrupted file feed. These files convey the details of everyday transactions, such as inventory listings, flight manifests, sales data, and commodity prices. But many things can disrupt or delay file feeds, from user error to application glitches or missing data. As we’ve seen with the FAA ground stop (and the Southwest Airlines meltdown over the holidays), even a small technical flaw can interrupt downstream processes, cost millions of dollars, and ultimately make whole industries grind to a halt.
To avoid this, it’s essential to spot and deal with any glitches proactively.
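As a rough illustration of what spotting glitches proactively can mean for a file feed, the sketch below checks each feed for three of the failure modes mentioned above: a file that never arrived, a file that arrived late, and a file that arrived empty. The `FeedStatus` structure, the feed names, and the thresholds are all hypothetical, chosen only for this example; a real monitor would read these values from the file system or a transfer log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class FeedStatus:
    """Last observed state of one file feed (hypothetical structure)."""
    name: str
    last_arrival: Optional[datetime]  # None if no file has arrived at all
    size_bytes: int                   # size of the most recent file
    max_age: timedelta                # how stale the feed may get before alerting


def check_feed(feed: FeedStatus, now: datetime) -> List[str]:
    """Return the problems found for one feed; an empty list means healthy."""
    problems: List[str] = []
    if feed.last_arrival is None:
        problems.append("missing")
        return problems
    if now - feed.last_arrival > feed.max_age:
        problems.append("late")
    if feed.size_bytes == 0:
        problems.append("empty")
    return problems


# Illustrative data only: names, times, and sizes are made up.
now = datetime(2024, 1, 1, 8, 0)
feeds = [
    FeedStatus("flight_manifests", datetime(2024, 1, 1, 7, 55), 20_480, timedelta(minutes=15)),
    FeedStatus("inventory_listings", datetime(2024, 1, 1, 6, 0), 1_024, timedelta(minutes=30)),
    FeedStatus("commodity_prices", None, 0, timedelta(minutes=5)),
]

report: Dict[str, List[str]] = {feed.name: check_feed(feed, now) for feed in feeds}
for name, problems in report.items():
    print(name, "OK" if not problems else ", ".join(problems))
```

Running a check like this on a schedule turns a silent failure into an explicit signal before downstream processes consume stale or missing data.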
IT operations need more efficiency
The next question is how to streamline the routine and unglamorous duty of searching for file feed problems. Merely drafting more IT people into manual monitoring isn’t scalable, and it’s an inefficient use of staff who could be directing their skills toward more sophisticated problems and product improvements.
Single-purpose file monitoring and IT operations management (ITOM) tools can raise alerts when a file feed fails. But these narrowly defined tools may either overlook issues or – possibly worse – raise too many alerts, creating “alert fatigue” that can overwhelm staff.
Organizations require a bird’s-eye view of all their file feeds across every application, as well as of the overall health of their IT infrastructure. What’s more, the monitoring solution should not only report on problems but foster a zero-touch environment that proactively fixes or even prevents them.
Selecting the right file monitoring solution can be challenging. It needs to offer:
- A holistic view of the health of all business technology functions.
- Quick setup and customization.
- Minimal impact on system availability.
- Scalability to accommodate changing operational needs.
- Correlation of current incidents with other alerts. This can be very useful in helping to identify real problems, reduce false positives, and analyze root causes.
- Easy integration with third-party ITOM and change management solutions.
- And don’t forget affordability – more important than ever in this economy.
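The correlation item in the list above can be made concrete with a small sketch. One simple approach, assumed here purely for illustration and not drawn from any particular product, is to group alerts that fire close together in time into a single incident, so staff see one event instead of a burst of separate notifications. The alert sources, messages, and the 60-second window are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Alert = Tuple[datetime, str, str]  # (timestamp, source, message)


def correlate(alerts: List[Alert], window: timedelta) -> List[List[Alert]]:
    """Group alerts into incidents.

    An alert joins the current incident if it fires within `window`
    of the previous alert; otherwise it starts a new incident.
    """
    incidents: List[List[Alert]] = []
    for alert in sorted(alerts):  # tuples sort by timestamp first
        if incidents and alert[0] - incidents[-1][-1][0] <= window:
            incidents[-1].append(alert)
        else:
            incidents.append([alert])
    return incidents


# Illustrative alerts only: sources and messages are made up.
alerts = [
    (datetime(2024, 1, 1, 6, 0, 40), "feed_monitor", "expected file missing"),
    (datetime(2024, 1, 1, 6, 0, 5), "backup_db", "synchronization lag"),
    (datetime(2024, 1, 1, 9, 30, 0), "disk", "usage above 85%"),
    (datetime(2024, 1, 1, 6, 1, 10), "app_server", "read error"),
]

incidents = correlate(alerts, window=timedelta(seconds=60))
for number, incident in enumerate(incidents, start=1):
    sources = [source for _, source, _ in incident]
    print(f"incident {number}: {len(incident)} alert(s) from {sources}")
```

Time proximity is only one possible correlation key. Grouping by shared host, application, or dependency would follow the same pattern, and the earliest alert in each group is often a reasonable starting point for root-cause analysis.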
The ideal outcome is an all-in-one solution that addresses a range of IT operations needs, including infrastructure and application monitoring, event management, analysis, and automated self-healing, even preventive maintenance.
Even better, an ideal solution would draw on the power of AIOps – that is, leverage AI and machine learning and use advanced analytics on the big data sets that modern applications generate. That further streamlines IT operations and helps prevent these issues with event correlation, anomaly detection, and causality determination.
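To give a sense of what anomaly detection means at its simplest, the sketch below flags a file whose size deviates sharply from the sizes seen recently, using a plain z-score. This is a deliberately basic statistical stand-in, not the machine learning an AIOps platform would apply, and the sample sizes and the threshold of three standard deviations are assumptions for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], value: float, threshold: float = 3.0) -> bool:
    """Flag `value` if it lies more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # history is constant, so any change stands out
    return abs(value - mu) / sigma > threshold


# Illustrative file sizes in kilobytes: made-up values.
recent_sizes = [1020, 980, 1005, 995, 1010, 990]

print(is_anomalous(recent_sizes, 1000))  # typical size
print(is_anomalous(recent_sizes, 12))    # nearly empty file
```

A nearly empty file is flagged even though it technically arrived on time, which is the kind of problem a simple presence check would miss.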
Using a machine-first approach, comprehensive solutions such as Digitate Business Health Monitoring can reduce the mean time to detection (MTTD) and mean time to resolution (MTTR). By combining AI/ML, real-time analytics, and proactive fault fixing, such solutions can diagnose and resolve issues automatically, maximizing customer safety and satisfaction while helping to prevent massive halts such as the recent FAA ground stop.