Workload management is ubiquitous in the automation of critical business processes. Over time, the technology has been evolving gradually from ‘just automation’ into an orchestrator of intelligent automation. Moving from workload automation to workload management therefore requires a layer of observability and intelligence.
However, the scheduler-based monitoring tools available in the market offer limited observability, insights, and even automated resolution, which exposes enterprises to severe business risks.
For instance, a US-based payment system provider was using workload automation to enable services across diverse lines of business, serving varied customers, transactions, and platforms. However, they faced multiple issues with the stability of their batch jobs. Every faulty batch job in the environment posed a significant risk to the company's reputation with its customers. They needed the right tool to observe the workloads and analyze their dynamic patterns, and selected Digitate's ignio AI.Workload Management product.
Upon investigation, the team discovered that despite 2.4 million failure events occurring over two months, they were attributable to just 500 jobs, comprising a mere 2.5% of total jobs. This revelation startled management, leading to a determination to stabilize the environment and ensure timely compliance with SLAs.
Similarly, an American retail giant with over 20,000 stores relies on timely batch job processing to ensure that various business-critical SLAs are met. These include daily processing of application families covering critical areas such as Store Closure, IT Financial Systems, IT Supply Chain, POS and COM systems, Distribution, HR & Payroll, and Legal. Each of these processes has a defined time by which all batch jobs must be completed.
However, even with multiple homegrown batch monitoring dashboards powered by a well-known observability tool, the company faced frequent SLA misses that impacted business deliverables, and it had no way to predict critical SLA misses or get an early look-ahead. This changed after they adopted ignio, which provides automated notifications on job failures and long-running jobs, with AI/ML-based probable cause analysis. More importantly, using AI-powered insights, it now provides a 90-minute look-ahead window for all business-critical SLAs, along with probable cause analysis.
At Digitate, we have been working with more than 75 large enterprises to build more resilient processes backed by workloads or batch jobs, and in many ways we are at the forefront of this evolution. This has given us a deeper understanding of what is driving the change. The drivers include the adoption of AI in processes, the need to pick out critical events from a flood of events, a shift from relying on tacit knowledge to relying on machines, diversity in technology (from schedulers to the underlying infrastructure), and more dynamic behavior due to cloud adoption and containerization, among many others. Together, these create the need for unified observability into not just batch jobs but also the associated business processes and IT infrastructure, as well as AI-based analytics that can make sense of the increased volume of data and enable action.
Our latest release, Flamingo, is a step in this direction: it enables better AI adoption for managing batch jobs, as well as better observability. Let us take a quick look at what’s in store:
Predicting workload issues with AI-powered insights
When it comes to AI-powered prediction, most schedulers in the market have focused on predicting the progress of batches. The more advanced ones can also predict the SLAs associated with the underlying jobs, and with every release more schedulers claim better accuracy in these predictions. This helps make batches stable and easy to monitor, but the stability that predictions bring introduces a new kind of operational problem. With such strong insights, it is no longer the SLAs that are most at risk; unexpected failures and delays now have the greater impact on batch monitoring and operational readiness.
Because batches at many enterprises run smoothly on most days, batch support teams risk missing failures and their impact when they do occur. Even planning support coverage becomes unpredictable. For example, a delay may occur during a night shift when the right support group is unavailable or unreachable. Such situations can create havoc in support and operations, and thereby impact the business.
To address such scenarios, ignio AI.Workload Management now offers a feature that predicts not only SLAs but also failures and delays in advance. This means that:
- a delay that can occur because of a large or invalid file can be identified before it occurs
- a failure because a job is going to start at the wrong time can be identified based on the predicted batch run time
- an upstream job that is running longer than usual, and may delay downstream jobs, can be flagged well in advance so that support users can jump into action.
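The last case can be sketched in a few lines of Python. This is a minimal illustration and not ignio's actual model: it assumes a simple "mean plus k standard deviations" rule over past run times, and the function names, the threshold factor `k`, and the sample history are all hypothetical.

```python
from statistics import mean, stdev

def is_running_long(history_minutes, elapsed_minutes, k=3.0):
    """Return True when elapsed time exceeds mean + k * stdev of past runs.

    Assumption: a plain statistical threshold stands in for whatever
    model a real product would use.
    """
    threshold = mean(history_minutes) + k * stdev(history_minutes)
    return elapsed_minutes > threshold

def projected_downstream_delay(history_minutes, elapsed_minutes):
    """Lower bound on the delay passed downstream: the time already
    spent beyond the job's usual (mean) duration."""
    return max(0, elapsed_minutes - mean(history_minutes))

# Hypothetical past run times of one upstream job, in minutes.
history = [30, 32, 29, 31, 28]

if is_running_long(history, elapsed_minutes=45):
    delay = projected_downstream_delay(history, elapsed_minutes=45)
    print(f"Upstream job is running long; downstream delayed by at least {delay} min")
```

The delay figure is a lower bound only, because the upstream job is still running when the check is made.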
Another comparatively small but challenging case is the prediction of month-end batches. Month-end and quarter-end batches are difficult to predict because few data points are available and because they do not run on a standard schedule or a fixed day. The Flamingo release of ignio AI.Workload Management attempts to predict such batches accurately by:
- predicting only once the batch starts executing, which removes the dependency on calendars
- adapting the prediction at run time based on the jobs that are executing
Improving observability of batch jobs
One of the major challenges in observability for batch jobs is the limited ability of schedulers to show the progress of batches over a period of time. For example, in the insurance segment, payment files received at regular intervals are processed by the underlying batch jobs. A classic executive view in this case would let a CIO see that the payment files received on an hourly basis have been processed on time, without obstruction, for the last x hours.
The Flamingo release of ignio AI.Workload Management addresses these challenges.
The ignio watch is a mechanism that makes operations on the Workload Automation systems visible in a single pane. This helps the operations user find the right events to focus on and proactively take action where needed.
With this release, ignio additionally gives executives a clear view of progress and the ability to identify immediately when things start going wrong. Furthermore, the ability to drill horizontally into the underlying business functions helps them identify the right people to reach out to and assess the impact. In the payment example above, a single file may also contain information from multiple payment parties, so vertical observability inside the file makes it possible to reach out proactively to the affected parties before the noise starts.
Another very important aspect, touched upon above but deserving more emphasis, is the ability to observe files and jobs together. Almost every industry has batch jobs that process files. These files may be generated by third-party systems and, if they are not present at the desired location on time, they affect the completion of the entire batch. Besides availability, the processing time of a file also depends on its size. For example, end-of-day batch processing in a clearing system depends on the number of trade records in the trade file received from the trading system.
Thus, the observability of a batch improves many times over when an operations user is:
- able to view all the jobs impacted by the availability of the files, and the severity of that impact
- able to view in advance how long the batch is expected to run, based on the number of records or the size of the file.
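The second point can be illustrated with a least-squares fit of run time against record count. This is only a sketch under stated assumptions: the history values are made up, the names `fit_line` and `predict_runtime` are hypothetical, and a straight-line relationship between file size and run time is assumed. It does not describe how ignio computes its predictions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_runtime(records, slope, intercept):
    """Expected batch run time (minutes) for a file with `records` rows."""
    return slope * records + intercept

# Hypothetical history: trade records per file and batch run time in minutes.
record_counts = [100_000, 200_000, 300_000, 400_000]
runtimes = [15.0, 25.0, 35.0, 45.0]

slope, intercept = fit_line(record_counts, runtimes)
estimate = predict_runtime(500_000, slope, intercept)
print(f"Expected run time for 500,000 records: {estimate:.1f} min")
```

In practice the relationship may not be linear, and a real estimate would also account for factors such as system load, which this sketch ignores.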
This feature is one of the most important aspects of this release of ignio AI.Workload Management.
Closing the loop with automation
While automation has always been the forte of the entire ignio product line, the batch space has a unique characteristic: jobs may fail for different reasons, and depending on the reason codes and parameters, a different set of actions may be needed to fix the failure.
The product now offers a mechanism that lets the user drive this configuration. For example, if a job fails because a file is unavailable, retry after 5 minutes. However, if the job fails because the file format is wrong, check the file format and send an email asking the relevant stakeholder to correct it. Today, in most cases, schedulers offer only a single option: restart on failure. With this release, the product can read the error code and switch the course of action accordingly.
The product takes these configurations from the user in a simple CSV format and gives the user extensive ways to define how an issue should be fixed, based on any set of parameters.
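To make the idea concrete, here is a minimal Python sketch of a rule table keyed on error code. The CSV columns (`error_code`, `action`, `parameter`), the error code names, the email address, and the function names are all hypothetical, chosen for illustration only; the actual ignio configuration format is not described here.

```python
import csv
import io

# Hypothetical rule file: one row per error code.
RULES_CSV = """error_code,action,parameter
FILE_NOT_FOUND,retry,5
BAD_FILE_FORMAT,notify,file-owner@example.com
"""

def load_rules(text):
    """Parse the CSV text into {error_code: (action, parameter)}."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["error_code"]: (row["action"], row["parameter"]) for row in reader}

def choose_action(rules, error_code):
    """Pick the configured action; fall back to a plain restart,
    the single option the article says most schedulers offer."""
    return rules.get(error_code, ("restart", ""))

rules = load_rules(RULES_CSV)
print(choose_action(rules, "FILE_NOT_FOUND"))   # retry, with a 5-minute wait
print(choose_action(rules, "BAD_FILE_FORMAT"))  # notify the file owner
print(choose_action(rules, "UNKNOWN_ERROR"))    # default restart
```

Keeping the rules in a flat table means new failure reasons can be handled by adding rows rather than changing code.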
Conclusion:
The above are just a few glimpses of the enhancements in the Flamingo release; many further enhancements have been made to improve the usability of the product.