BLOG

“Small data” – the untapped gold mine!

By Satya Samudrala, Data Scientist at Digitate

The AI rush is driving bigger and bolder advances in data analytics, and data-driven decision making has become mainstream. The focus of data analytics, in industry as well as academia, has been on “Big Data,” traditionally defined by the four V’s: high Volume (quantity), high Velocity (speed of change), high Variety (sources and types), and high Veracity (quality). Many organizations are investing in data-hungry AI initiatives that analyze this Big Data to understand customer behavior, assess risks, manage infrastructure, predict demand, and more. Their repositories may hold petabytes or even exabytes (one exabyte is a billion gigabytes).

But in this Big Data revolution, we forget that many valuable observations about an organization are often quite small, in the range of megabytes or even mere kilobytes. In fact, this “small data” is often defined as data that is small enough to be processed by a single machine or understood by a single individual. Formally, it meets the minimum size requirements for statistical analysis, typically around 5 to 10 observations for every variable. Small data used to be known simply as data!

The everyday presence of small data

We see small data around us every day. It could be the contacts list in our phone, our calendar, or our monthly bank statement. Even in enterprise IT systems, small data is quite prevalent. It might even hold more business-impacting insights than Big Data!

You may wonder why small data sets that are apparently so simple need to be analyzed with advanced statistical and AI techniques. If you’re only looking at a few dozen rows of data, can’t you just intuitively perceive patterns, like noticing the neighborhoods where most of your customers live? Or use elementary algebra to plot the data points and draw a trend line?

Well, important business decisions need a firmer foundation than intuition. Excel Chart Wizard won’t answer all the questions you might have about a data set. And in fact, you risk inaccurately interpreting small data if you don’t take due care during data processing and analysis.

Consider a runner who has run the Mumbai marathon for the last five years, whose finish time we want to predict this year. If we only use the data points of the last five finish times and extrapolate using simple techniques such as averaging them or drawing trend lines, we might be far from accurate. We can improve our predictions with sophisticated analytic techniques, such as treating outliers and bias in the history, augmenting data with other factors (such as terrain, weather, route, etc.), or trying an ensemble of modeling approaches.
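To make the naive approaches concrete, here is a minimal sketch in Python, using made-up finish times rather than any real runner’s history, that compares a plain average with a straight trend line extrapolated one year ahead:

    import numpy as np

    # Hypothetical finish times (in minutes) for the last five races
    years = np.array([1, 2, 3, 4, 5])
    finish_times = np.array([262, 255, 270, 249, 246])

    # Simple technique 1: average the past results
    avg_prediction = finish_times.mean()

    # Simple technique 2: extrapolate a straight trend line to year 6
    slope, intercept = np.polyfit(years, finish_times, deg=1)
    trend_prediction = slope * 6 + intercept

    print(f"Average-based prediction: {avg_prediction:.0f} min")
    print(f"Trend-line prediction:    {trend_prediction:.0f} min")
    # Neither estimate knows about terrain, weather, or an unusually bad year,
    # which is why outlier treatment, extra features, and ensembles help.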

One of the biggest areas where analyzing small data pays off is predicting rare phenomena – whether benign (winners of awards or elections, bumper crops, or even big sales deals), ominous (business-critical utility outages or high-severity incidents), or simply unusual, such as behavioral anomalies. Anomalies by their very definition are few. If they are frequent, then they stop being anomalies because they define normal behavior!

Small data situations also arise from inadequate monitoring. Here, data for the question you’re interested in simply isn’t collected as often as you’d like, or it’s spread thinly across broad dimensions, such as an entire century or country.

Challenges with small data

The key problem in analyzing small data is striking a balance between bias and variance, the two major sources of error in predictive data models. We want a model that accurately captures the range and quirks of the source data, but that also generalizes well enough to make good predictions about new, wider data sets.

Bias refers to how well the model captures the variations in the training data. Models with high bias error are very generic: they fail to capture the patterns present in the training data. Such models are referred to as underfitting.

Variance refers to how well the model applies to new data (test data). When a model captures too much of the patterns in the training data and therefore fails to apply to any other data, it has high variance error. These models are referred to as overfitting.

That’s why we need a balanced model that avoids the worst of both errors.

Examples of overfitting, underfitting, and desired (balanced) models are shown in the figure below.


Fig. 1. Overfitting vs. underfitting vs. desired (balanced) models
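As a rough illustration of the same trade-off, the following sketch (synthetic data and illustrative polynomial degrees, not any particular production model) fits models of increasing complexity to a dozen noisy points and compares training and test error:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 12)                       # only a dozen points
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

    x_test = np.linspace(0.05, 0.95, 7)             # unseen points
    y_test = np.sin(2 * np.pi * x_test)             # true (noise-free) values

    for degree in (1, 3, 9):                        # underfit, balanced, overfit
        coeffs = np.polyfit(x, y, degree)
        train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
    # Degree 1 misses the pattern (high bias), degree 9 chases the noise
    # (high variance), and degree 3 strikes the balance.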

The key challenges in dealing with small data can be broadly grouped into the following categories:

  • Missing data: A common problem in many real-world scenarios. You can’t simply delete the incomplete rows, because the data set would become even smaller. The alternative is filling in the gaps. However, this requires some sort of pattern in the data to fill the gaps with, such as the most frequent value, the mean, or the median, which is a challenge when the data set is too small to provide much guidance. (A minimal sketch of one mitigation follows this list.)

  • Outliers: The first step of any analysis is to remove the anomalies or outliers in the data. With a small data set, it is statistically difficult to detect abnormal patterns, and even a few outliers can form a significant proportion of the sample and drastically alter the model!

  • Too few samples for training and testing: When building machine learning (ML) models, it is good practice to split the data into three samples: one for training the model to know what to look for (training data), one for tuning the hyper-parameters (validation data), and one for final evaluation (test data). But if we perform this split on small data, the resulting samples will be too small to yield optimal results.

  • Overfitting: When the data is small, there is a high chance that ML models will over-fit it, which can lead to poor performance on unseen data. Cross-validation is one remedy, but it is risky with small data because it further splits an already small data set, leaving very few values for validation.

  • Magnified measurement errors: Data collection often involves measurement errors due to various issues such as calibration of devices, delays in data collection, or environmental noise. The effect of even small errors can add up quickly in small data sets and distort the models derived from them.
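As a small illustration of the missing-data point above, the sketch below (made-up values, pandas assumed available) fills gaps with the column median, which is less sensitive than the mean when an extreme row makes up a large share of a small sample:

    import pandas as pd

    # A tiny, illustrative data set with gaps we cannot afford to drop
    df = pd.DataFrame({
        "response_time_ms": [120, 135, None, 128, 410, None, 131],
        "cpu_load_pct":     [35, 40, 38, None, 95, 37, 36],
    })

    # The median is more robust than the mean when a single extreme row
    # (the 410 ms / 95% spike) dominates a small sample
    filled = df.fillna(df.median(numeric_only=True))
    print(filled)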

How to work with small data effectively

Solving small data challenges demands lateral thinking and creative ideation. Here are the most common ways to analyze small data for insights.

Get creative with the data

  • Get more features by derivation: Often small data doesn’t have many variables or features, which limits the features to work with and leads to inaccurate predictions. This can happen because users are hesitant to share the full set of variables, or simply don’t store them. One way to deal with this situation is to derive additional features from the data itself. Existing features such as age and salary can be grouped to create aggregate features; for example, individual age values can be grouped into an aggregate feature named “age group” with values such as 0-10, 10-20, 20-30, and so on. Timestamp features can be broken down to convey day of week, week of year, day of month, hour of day, holidays, and weekends. Derive only relevant features that genuinely help the model; creating irrelevant features will lead to overfitting or inaccurate predictions. (A sketch of such derivations appears after this list.)

  • Get more features from other data sources: Publicly available data sources can be leveraged to add more relevant variables. For instance, demographic information can be used to expand zip codes into city, state, country, weather, traffic conditions, crime rate, literacy rate, etc. Let’s say that you want to predict the finish time of a marathon runner but you only have the times of previous runs. Since marathon performance relies on many factors, you can’t predict a finish time by simply averaging previous results. But you can find out other relevant conditions, including details of the racetrack, its terrain, temperature, and so forth, and add in these variables to accurately predict the finish time.

  • Use simulations: Techniques such as Monte Carlo simulation take the statistical properties of the small data set and use randomness to generate many possible outcomes, approximating a larger population.

  • Get more rows by data augmentation: Often we just don’t have enough records in the data. This could be due to limitations in monitoring, the inherent rarity of the events, or data collection constraints. Data augmentation can be used creatively to add more records: it creates synthetic data (close to the actual data) from the existing data, which can then be used for training. In fact, it improves model estimation by providing more information about the data without disturbing its features.

    For example, suppose we need a model that identifies the dog in an image. We can create synthetic data as shown in the figure below and train the model on it. The resulting model can then identify the dog regardless of its orientation, color gradient, or size. Common augmentations for images include cropping, flipping, zooming, rotation, noise injection, color changes, and edge changes.

Fig. 2. Augmentation: multiple images created from a single image

Similarly, with text, we can create new samples by synonym replacement, context-aware replacement, and so forth.
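A minimal sketch of such image augmentation, using plain NumPy array operations on a stand-in image rather than any specific imaging library:

    import numpy as np

    rng = np.random.default_rng(42)
    image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in photo

    augmented = [
        np.fliplr(image),                               # horizontal flip
        np.flipud(image),                               # vertical flip
        np.rot90(image),                                # 90-degree rotation
        image[8:56, 8:56],                              # centre crop
        np.clip(image.astype(np.int16)                  # noise injection
                + rng.normal(0, 10, image.shape), 0, 255).astype(np.uint8),
    ]
    print(f"1 original image -> {len(augmented)} synthetic training images")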

  • Remove data imbalance by data sampling: These techniques are used when the data is imbalanced, i.e. one class dominates the rest. Examples are fraud/non-fraud or churn/non-churn financial records, where the anomalies (fraud or churn) are far fewer than the majority class. This leads the model to predict only the majority class and fail to estimate the other. Up-sampling (over-sampling) increases the sample size of the minority class relative to the majority, making analysis easier. (A sketch of up-sampling appears after this list.)

  • Data inversion: In some cases, flipping the data gives a very different perspective. It is easier to predict when a rare event does not occur than when it does, because the rarer the event, the larger its inverse. This technique is more effective at finding complex patterns. For example, an event that occurs every second Monday for five months will have only five data points, but its inverse will have roughly 145 data points (all the days that aren’t the second Monday of a month).
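A minimal sketch of the feature-derivation idea above, with purely illustrative column names and values:

    import pandas as pd

    # Hypothetical raw columns: an event timestamp and a customer age
    df = pd.DataFrame({
        "created_at": pd.to_datetime(["2022-11-07 09:15",
                                      "2022-12-25 18:40",
                                      "2023-01-02 02:05"]),
        "age": [23, 47, 68],
    })

    # Aggregate feature: age grouped into buckets
    df["age_group"] = pd.cut(df["age"],
                             bins=[0, 10, 20, 30, 40, 50, 60, 70, 120],
                             labels=["0-10", "10-20", "20-30", "30-40",
                                     "40-50", "50-60", "60-70", "70+"])

    # Timestamp decomposition: day of week, hour of day, weekend flag
    df["day_of_week"] = df["created_at"].dt.dayofweek
    df["hour_of_day"] = df["created_at"].dt.hour
    df["is_weekend"] = df["created_at"].dt.dayofweek >= 5
    print(df)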
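And a minimal sketch of the up-sampling idea above, assuming scikit-learn is available; the amounts and fraud flags are made up:

    import pandas as pd
    from sklearn.utils import resample

    # Illustrative imbalanced data: 2 fraud rows vs. 8 normal rows
    df = pd.DataFrame({
        "amount": [12, 15, 14, 9500, 11, 16, 13, 8700, 10, 12],
        "fraud":  [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    })

    minority = df[df["fraud"] == 1]
    majority = df[df["fraud"] == 0]

    # Up-sample the minority class (with replacement) to match the majority
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=0)
    balanced = pd.concat([majority, minority_up])
    print(balanced["fraud"].value_counts())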

Get creative with analytics strategy

  • Prefer simple models: Choose simple models when possible and limit the number of parameters to estimate. Increasing complexity by including irrelevant features will lead the model to over-fit.

  • Adopt an ensemble approach: Build generalized models by training an ensemble of multiple algorithms instead of a single model. Combining those models (by weighting or voting) significantly reduces variance and gives accurate results from small data. Bagging and boosting are powerful ensemble methods: bagging is a parallel technique that builds all models independently and then aggregates them, while boosting is a sequential technique that builds models one after another, each adapting to the errors of the previous runs. (A minimal sketch appears after this list.)

  • Go for ranges over point estimates: Predicting point estimates is often hard, and misleading in the case of small data. So give yourself a margin of error and recommend ranges and confidence intervals instead. The example below shows two ways of predicting an item’s weight: with only a few data points, it is safer to predict the range [151 – 158 pounds] instead of an exact value (155 pounds).

Fig. 3. Range over point estimate
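The same idea as a minimal sketch: with a handful of made-up weight measurements, a t-based confidence interval gives a defensible range instead of a single number (SciPy assumed available):

    import numpy as np
    from scipy import stats

    weights = np.array([152, 154, 157, 153, 156, 155])   # a few measurements

    mean = weights.mean()
    sem = stats.sem(weights)                              # standard error of the mean
    low, high = stats.t.interval(0.95, df=len(weights) - 1, loc=mean, scale=sem)

    print(f"Point estimate: {mean:.0f} lb")
    print(f"95% confidence interval: [{low:.0f} - {high:.0f}] lb")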

  • Opt for pattern mining rather than prediction: Correlation and prediction techniques might not be effective on small data. In such cases, the analysis strategy should shift to finding patterns, similarities, or influences using various statistical tests.
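A minimal sketch of the ensemble idea from this list, using scikit-learn’s bagging and boosting regressors on a small synthetic data set (the data and hyper-parameters are illustrative, not a recommended configuration):

    import numpy as np
    from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 10, size=(30, 2))                  # only 30 rows
    y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, 30)

    # Bagging: shallow trees trained independently in parallel, then averaged
    bagging = BaggingRegressor(DecisionTreeRegressor(max_depth=3),
                               n_estimators=50, random_state=0).fit(X, y)

    # Boosting: trees built sequentially, each correcting the previous ones
    boosting = GradientBoostingRegressor(n_estimators=50, max_depth=2,
                                         random_state=0).fit(X, y)

    print("bagging prediction: ", bagging.predict([[5.0, 3.0]]))
    print("boosting prediction:", boosting.predict([[5.0, 3.0]]))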

Get creative with external helpers

  • Accommodate domain expertise: Small data often does not have the luxury of traditional feature selection techniques. Domain expertise can be used to identify features of interest, derive more features, or select the best model.

  • Apply transfer learning: Transfer learning helps build robust models even on small data by reusing knowledge from widely available pre-trained models. We apply transfer learning in our day-to-day life: the balancing skills acquired while learning to ride a bicycle help when learning to ride any motorized two-wheeler. The same concept can be applied to many real-world small data problems of classification, prediction, or anomaly detection, among others.
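A minimal sketch of the transfer-learning idea, assuming PyTorch and torchvision (0.13 or later) are available: start from a model pre-trained on ImageNet, freeze its layers, and train only a small new head on the small data set:

    import torch.nn as nn
    from torchvision import models

    # Start from a model pre-trained on ImageNet (knowledge acquired elsewhere)
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze the pre-trained layers so the tiny data set only tunes a new head
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final layer for our own small, two-class problem
    model.fc = nn.Linear(model.fc.in_features, 2)
    # Training now only estimates the parameters of this last layer,
    # which is feasible even with a handful of labelled examples.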

ignio’s take

ignio can effectively handle small data and provide accurate predictions. To cite one example, small data empowered ignio to predict the finish times of runners in a leading international marathon! Since this race takes place only once a year, we had only one observation per runner per year. Even if a runner has run this marathon for 10 years, that still adds up to only 10 observations. Moreover, the results captured only individual run times, with no additional information.

ignio adopted some creative approaches to this small data. It used data augmentation, joining other publicly available data sources to add features such as terrain, temperature, humidity, number of runners, and race time. It used an ensemble of algorithms, applying various models to get accurate predictions. And it used transfer learning to extend the model developed on known runners to unknown runners. For more details, please refer to this blog post.

With similar strategies, ignio has handled many real-world scenarios, including forecasting the end time of a process that runs only once a month, predicting an outage that occurs only three or four times a year, and detecting fraudulent transactions with very limited details about them.

Conclusion: Small (data) can be beautiful

No definition of AI or ML requires Big Data to unleash their full power. In fact, the potential of small data has been underplayed because of the hype around its massive counterpart.

The lack of sufficient data has been the single biggest challenge in industry’s adoption of AI. Many businesses need AI solutions for targeted problems where information is limited. We strongly believe that small data is an untapped gold mine. It can significantly increase the use of AI in businesses. All you need is creativity!
