Batch Workloads: Part 1

As we introduced in our recent webinar, batch job management is a crucial component of cognitive computing. Today we take a deeper dive into batch jobs (also known as batch workloads) and discuss both their advantages and their risks.

In general, there are two types of workloads used within enterprises: transactional and batch. The processes that manage these workloads have a variety of names, including job scheduling, batch scheduling, distributed resource management, and workload automation. They often orchestrate data from real-time and daily activities, taking into account informational dependencies and resource availability, to provide continuity across the business.

Transactional workloads usually depend on real-time or near real-time processing of events and requests, requiring systems to be up and ready. Examples include stock purchases and sales at a financial institution, which are highly dependent upon the time of the requests. The frequency of the requests, as well as the need for timely results, can create severe resource loading during operational hours.

Batch workloads, on the other hand, are less time-dependent and are often deferred to run when system demand is lower and more resources are available. Examples include calculating daily P&L reports or clearing trade settlements. While the immediacy of the results is less critical, expectations for their accuracy are no lower, and not having the results when they are needed has just as much impact on the business as missed transactions. Though seemingly less glamorous than transactional workloads, batch workloads can make up the bulk of the underlying processes that support daily business and keep it going over the long term.

Batch workloads at Digitate are characterized by the following:

  1. Precedence: Batch jobs often depend upon the completion of other batch jobs. For example, the creation of a summary of performance across many business units requires individual summaries from each unit to be calculated first. Because the individual summaries must precede the overall performance summary, the precedence relations must be known and defined in the system before the processes are automated.
  2. Grouping: Not every data stream in a business is related to every other data stream. Grouping the data into streams, or sub-streams, by function or dependencies helps focus the processing and isolate and filter unrelated data.
  3. Constraints: In addition to constraints imposed by precedence relations and dependencies, business needs can dictate when results are required. These constraints affect when the batch processes can start and, more importantly, when their results need to be available. For example, the close of the business day at a bank branch determines the earliest time that branch's batch processes can start, and their results are needed before regional batch processes can run. If results are not available on time, the downstream batch processes will not run smoothly, or at all.
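To make precedence and timing constraints concrete, here is a minimal Python sketch. The job names, durations, and deadline are invented for illustration, and the sketch assumes that any job can start the moment its predecessors finish (that is, no resource limits). It uses the standard library's graphlib module to order the jobs, then computes the earliest time each one can finish.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: name -> (duration in minutes, jobs that must finish first).
JOBS = {
    "branch_a_close": (30, set()),
    "branch_b_close": (45, set()),
    "regional_rollup": (20, {"branch_a_close", "branch_b_close"}),
}


def earliest_finish(jobs, start=0):
    """Earliest finish time of each job, in minutes after `start`.

    Assumes unlimited parallelism: a job starts as soon as all of its
    predecessors have finished.
    """
    graph = {name: deps for name, (_, deps) in jobs.items()}
    finish = {}
    # static_order() yields every job after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        duration, deps = jobs[name]
        ready = max((finish[d] for d in deps), default=start)
        finish[name] = ready + duration
    return finish


finish = earliest_finish(JOBS)

# Hypothetical business constraint: results are needed within 60 minutes.
deadline = 60
late = sorted(name for name, t in finish.items() if t > deadline)
```

In this example the regional job cannot finish before minute 65, so it is the only job that misses the 60-minute deadline. A real scheduler would also have to account for resource availability, which this sketch ignores.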

Let’s take a look at some batch workload issues and ways to reduce the risks involved.

The number of jobs included in a batch workload can be in the tens to hundreds of thousands, without any set size. A failure—or even a delay—in a single process can affect the completion of thousands of downstream jobs. Simply monitoring the jobs is not enough; hundreds of thousands of processes may generate thousands of alerts per day. The effort required to filter, process, and resolve these alerts could be a full-time job in itself.
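One way to see how a single failure spreads is to walk the dependency graph forward from the failed job. The sketch below is illustrative only: the job names and the dependency map are hypothetical, and it assumes every job that depends on a failed job, directly or indirectly, is blocked.

```python
from collections import deque

# Hypothetical dependency map: job -> jobs it depends on.
DEPENDS_ON = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load"],
    "audit": ["extract"],
    "archive": [],
}


def downstream_of(failed, depends_on):
    """Return every job that directly or transitively depends on `failed`."""
    # Invert the map so we can look up the dependents of each job.
    dependents = {job: [] for job in depends_on}
    for job, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(job)

    # Breadth-first walk from the failed job through its dependents.
    affected, queue = set(), deque([failed])
    while queue:
        for child in dependents[queue.popleft()]:
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


blocked = downstream_of("extract", DEPENDS_ON)
```

Here a failure in the hypothetical "extract" job blocks four of the other five jobs; only "archive" is unaffected. With hundreds of thousands of jobs, the same walk shows which alerts are consequences of one failure and which are independent.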

At that scale, it is difficult to understand how processes are related to one another. Without knowing the relevant precedence rules and dependency relations to help narrow the search, it becomes even more difficult to find the root causes of failures. Finding the dependencies of distributed systems is especially challenging since they can span several geographies, time zones, and operations teams (who are themselves limited by poor system visibility and an incomplete understanding of those systems).
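When the dependency relations are known, they can narrow the search for a root cause considerably. The following Python sketch shows one simple heuristic, not a complete diagnostic method: among the jobs currently raising alerts, keep only those with no alerting job anywhere upstream. The job names and dependency map are hypothetical.

```python
# Hypothetical dependency map: job -> jobs it depends on.
DEPENDS_ON = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load"],
    "audit": ["extract"],
    "archive": [],
}


def upstream_of(job, depends_on):
    """All jobs that `job` directly or transitively depends on."""
    seen, stack = set(), list(depends_on[job])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on[current])
    return seen


def root_cause_candidates(alerting, depends_on):
    """Alerting jobs that have no alerting job upstream of them."""
    alerting = set(alerting)
    return sorted(
        job for job in alerting
        if not (upstream_of(job, depends_on) & alerting)
    )


candidates = root_cause_candidates(
    ["report", "load", "transform", "archive"], DEPENDS_ON
)
```

Of the four alerting jobs, only "transform" and "archive" remain as candidates, because the alerts on "load" and "report" can be explained by the failure upstream of them. The heuristic assumes the dependency map is complete, which is exactly what is hard to guarantee in distributed systems.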

While batch workloads pose a number of concerns that require careful consideration, over time they are usually fairly deterministic compared to transactional workloads. Also, because they tend to run often, there is usually a great deal of data that has already been collected that can be used to help keep operations in check. Their repetitive nature creates a relatively static system, with outcomes, operational timings, and even alerts that are well-known and expected, even if they are not well-understood.
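Because the same jobs run again and again, their history can serve as a baseline. As a rough illustration, the Python sketch below flags a run whose duration deviates from the historical mean by more than a chosen number of standard deviations. The durations, the threshold of 3, and the function name are assumptions made for this example, not a recommended production rule.

```python
from statistics import mean, stdev


def is_unusual(history, latest, threshold=3.0):
    """True if `latest` is more than `threshold` standard deviations
    away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical durations, in minutes, of one nightly job's past runs.
history = [42.0, 40.5, 41.2, 43.1, 39.8, 41.7]

normal_run = is_unusual(history, 41.0)  # close to the usual duration
slow_run = is_unusual(history, 58.0)    # far outside the usual range
```

A run of 41 minutes is consistent with the history and is not flagged, while a run of 58 minutes is. Simple baselines like this only describe what is expected; they do not explain why the system behaves that way.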

Paradoxically, this stability sometimes creates business issues that take longer to fix than they would in less stable systems. When things are working smoothly and changes are seen as risks to the achieved harmony, the desire to maintain that tranquility often inhibits needed changes.

Implementing changes and upgrades can seem like a regression when the once well-behaved system begins to break as it adapts to the new behavior and updated dependencies and precedence rules. The time and resources required to understand and fix the new state could justify holding off on future changes, or even avoiding them altogether. In this case, stability is not the desired end result, but merely an indicator of a higher level of achievement.

The underlying issue is a lack of understanding of the system. Whether the result is a chaotic environment in which problems cannot be resolved, or one in which the processes cannot keep up with the required speed of change, business needs will not be met.

Attaining stability without the underlying understanding of the system is like learning how to take off without learning how to land: it may get you into the sky faster, but you will eventually run into problems that could impact you far worse than a simple delay.

