“Small data” – the untapped gold mine!

By – Satya Samudrala

(Data scientist | Digitate)

The AI rush is leading to bigger and bolder advancements in data analytics, and data-driven decision making has become mainstream. Much of the focus of data analytics, in both industry and academia, has been on “Big Data,” traditionally defined as data with Four V’s: high Volume (quantity), high Velocity (speed of change), high Variety (sources and types), and high Veracity (quality). Many organizations are investing in data-hungry AI initiatives that analyze this Big Data to understand customer behavior, assess risks, manage infrastructure, predict demand, and more. Their repositories may hold petabytes or even exabytes of data (one exabyte is a billion gigabytes).

But in this Big Data revolution, we forget that many valuable data sets about an organization are quite small, in the range of megabytes or even mere kilobytes. This “small data” is often defined as data that is small enough to be processed by a single machine or understood by a single individual. More formally, it just meets the minimum size required for statistical analysis, typically around 5 to 10 occurrences for every variable. Small data used to be known simply as data!

The everyday presence of small data

We see small data around us every day. It could be the contacts list in our phone, our calendar, or our monthly bank statement. Even in enterprise IT systems, small data is quite prevalent. It might even hold more business-impacting insights than Big Data!

You may wonder why small data sets that are apparently so simple need to be analyzed with advanced statistical and AI techniques. If you’re only looking at a few dozen rows of data, can’t you just intuitively perceive patterns, like noticing the neighborhoods where most of your customers live? Or use elementary algebra to plot the data points and draw a trend line?

Well, important business decisions need a firmer foundation than intuition. Excel Chart Wizard won’t answer all the questions you might have about a data set. And in fact, you risk inaccurately interpreting small data if you don’t take due care during data processing and analysis.

Consider a runner who has run the Mumbai marathon for the last five years, whose finish time we want to predict this year. If we only use the data points of the last five finish times and extrapolate using simple techniques such as averaging them or drawing trend lines, we might be far from accurate. We can improve our predictions with sophisticated analytic techniques, such as treating outliers and bias in the history, augmenting data with other factors (such as terrain, weather, route, etc.), or trying an ensemble of modeling approaches.

One of the biggest areas where analyzing small data pays off is predicting rare phenomena – whether benign (winners of awards or elections, bumper crops, or even big sales deals), ominous (business-critical utility outages or high-severity incidents), or simply unusual, such as behavioral anomalies. Anomalies by their very definition are few. If they are frequent, then they stop being anomalies because they define normal behavior!

Small data situations also arise due to inadequate monitoring. Here, data for the question you’re interested in simply isn’t collected as often as you’d like. Or it’s thinly scattered across broad parameters, such as a century or country.

Challenges with small data

The key problem in analyzing small data is striking a balance between bias and variance, two major sources of potential error in predictive data models. We want to create a model that accurately captures the range and quirks of the source data but can also be generalized to make good predictions about new, wider data sets.

Bias refers to how well the model captures the variations in the training data. Models with high bias error are very generic and fail to capture the patterns present in the training data. Such models are said to be underfitting.

Variance refers to how well the model applies to new data (test data). When the model captures too many of the patterns in the training data, including noise, and therefore fails to apply to any other data, it has high variance error. Such models are said to be overfitting.

That’s why we need a balanced model that avoids the worst of both errors.

Examples of overfitting, underfitting, and desired (balanced) models are shown in the figure below.

 

Fig. 1. Overfitting models vs. underfitting vs. desired
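The trade-off can be made concrete with a rough sketch. The data set, the three toy models, and the use of leave-one-out scoring as a stand-in for separate test data are all assumptions chosen for illustration, not part of any particular product or library.

```python
from statistics import mean

# Hypothetical small data set: y is roughly 2 * x plus some noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.5, 3.3, 6.3, 7.6, 10.8, 11.4, 14.2, 15.7]


def fit_constant(x_train, y_train):
    """Underfitting candidate: ignores x and always predicts the mean."""
    avg = mean(y_train)
    return lambda x: avg


def fit_line(x_train, y_train):
    """Balanced candidate: ordinary least-squares straight line."""
    x_bar, y_bar = mean(x_train), mean(y_train)
    sxx = sum((x - x_bar) ** 2 for x in x_train)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_train, y_train))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_memorizer(x_train, y_train):
    """Overfitting candidate: returns the y of the closest training x."""
    pairs = list(zip(x_train, y_train))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, x_eval, y_eval):
    """Mean squared error of a fitted model on the given points."""
    return mean((model(x) - y) ** 2 for x, y in zip(x_eval, y_eval))


def leave_one_out_mse(fit, x_all, y_all):
    """Hold out each point in turn, train on the rest, score the held-out point."""
    errors = []
    for i in range(len(x_all)):
        x_train = x_all[:i] + x_all[i + 1:]
        y_train = y_all[:i] + y_all[i + 1:]
        model = fit(x_train, y_train)
        errors.append((model(x_all[i]) - y_all[i]) ** 2)
    return mean(errors)


results = {}
for name, fit in [("constant", fit_constant), ("line", fit_line), ("memorizer", fit_memorizer)]:
    train_error = mse(fit(xs, ys), xs, ys)
    held_out_error = leave_one_out_mse(fit, xs, ys)
    results[name] = (train_error, held_out_error)
    print(f"{name:10s} train MSE = {train_error:6.2f}   held-out MSE = {held_out_error:6.2f}")
```

On this data, the memorizer reproduces the training points perfectly yet does much worse than the straight line on held-out points, while the constant model does poorly on both. That is the overfitting, balanced, and underfitting pattern in miniature.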

The key challenges in dealing with small data can be broadly grouped into the following categories:

  • Missing data: A common problem in many real-world scenarios. You can’t simply delete the incomplete lines of data because the data set will be even smaller. The alternative is filling in the gaps. However, this requires some sort of pattern in the data, such as most frequent value, mean, or median to fill the gaps with, which is a challenge when the data set is too small to provide much guidance.

  • Outliers: The first step of any analysis is to remove the anomalies or outliers in the data. With a small data set, it is statistically difficult to detect abnormal patterns. Even a few outliers can form a significant proportion of the population and significantly alter the model!

  • Too few samples for training and testing: When creating a machine learning (ML) model, it is good practice to split the data into three samples: one for training the model to know what to look for (training data), one for tuning the hyper-parameters (validation data), and one for final evaluation (test data). But if we perform this split on small data, the resulting samples will be too small to yield reliable results.

  • Overfitting: When the data is small, there is a high chance that ML models will overfit to it, which could lead the model to perform poorly on unseen data sets. Cross-validation is one remedy. But it is risky when the data is small because it further splits an already small data set, leaving very few values for validation.

  • Magnified measurement errors: Data collection often involves measurement errors due to various issues such as calibration of devices, delays in data collection, or environmental noise. The effect of even small errors can add up quickly in small data sets and distort the models derived from them.
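To illustrate the missing-data and outlier challenges from the list above, here is a minimal sketch. The readings are made up, and both the median fill and the cut-off of five times the median absolute deviation are assumed choices rather than fixed rules.

```python
from statistics import mean, median

# Hypothetical response times in seconds; None marks a missing reading
# and 95.0 is a suspiciously large value.
readings = [12.1, 11.8, None, 12.4, 95.0, 12.0, None, 11.9]

observed = [r for r in readings if r is not None]

# With so few points, a single extreme value drags the mean far more than the median.
print("mean:", round(mean(observed), 2), "median:", median(observed))

# Fill gaps with the median so the outlier does not leak into the imputed values.
fill_value = median(observed)
imputed = [fill_value if r is None else r for r in readings]

# Flag outliers using the median absolute deviation (MAD), a robust spread estimate.
mad = median(abs(r - fill_value) for r in observed)
threshold = 5 * mad  # assumed cut-off; tune for the data at hand
outliers = [r for r in observed if abs(r - fill_value) > threshold]

print("imputed:", imputed)
print("outliers:", outliers)
```

Median-based statistics are used here because, with only six observed values, one extreme reading is enough to distort the mean and the standard deviation.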

How to work with it effectively

Solving small data challenges demands lateral thinking and creative ideation. Here are the most common ways to analyze small data for insights.

Get creative with the data

  • Get more features by derivation: Small data often doesn’t have many variables or features, either because users are hesitant to share the full set of variables or because they don’t store them. This limits the number of features to work with and leads to inaccurate predictions. One way to deal with this situation is to derive additional features from the data itself. Existing features such as age and salary can be grouped to create aggregate features. For example, individual age values can be grouped into an aggregate feature named “age group” with values such as 0-9, 10-19, 20-29, and so on. Timestamp features can be broken down into day of week, week of year, day of month, hour of day, holidays, and weekends. Derive only relevant features that really help the model; irrelevant features lead to overfitting or inaccurate predictions.

  • Get more features from other data sources: Publicly available data sources can be leveraged to add more relevant variables. For instance, demographic information can be used to expand zip codes into city, state, country, weather, traffic conditions, crime rate, literacy rate, etc. Let’s say that you want to predict the finish time of a marathon runner but you only have the times of previous runs. Since marathon performance relies on many factors, you can’t predict a finish time by simply averaging previous results. But you can find out other relevant conditions, including details of the racetrack, its terrain, temperature, and so forth, and add in these variables to accurately predict the finish time.

  • Use simulations: Techniques such as Monte Carlo simulations consider the properties of small data and use randomness to generate possible outcomes and try to represent a larger population.

  • Get more rows by data augmentation: Often, we just don’t have enough records in the data. This could either be due to limitation in monitoring, the inherent rarity of the events, or data collection constraints. Data augmentation can be used creatively to add more records. It creates more synthetic data (close to actual) from the existing data. This type of synthetic data can be used for training. In fact, it improves the model estimation by providing more information about the data without disturbing the features in it.

    For example, suppose we need a model that identifies the dog in an image. We create synthetic data as shown in the figure below and train the model on it. The model can then identify the dog regardless of its orientation, color gradient, or size. Some common augmentations for images include cropping, flipping, zooming, rotation, noise injection, color changes, and edge changes.

Fig. 2. Augmentation: multiple images created from one image using augmentations.

Similarly, with text, we can create new samples by synonym replacement, context-aware replacement, and so forth.
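A few of the image augmentations mentioned above can be sketched in plain Python. The 3x3 grid of pixel intensities is a hypothetical stand-in for a real image, and the helper names are illustrative; a real project would typically rely on an image-processing library instead.

```python
# A tiny hypothetical 3x3 grayscale "image"; each number is a pixel intensity.
image = [
    [0, 50, 100],
    [10, 60, 110],
    [20, 70, 120],
]


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rotate_clockwise(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def brighten(img, amount, max_value=255):
    """Shift every pixel intensity, clipping to the valid range."""
    return [[min(max_value, max(0, p + amount)) for p in row] for row in img]


# One original plus four synthetic variants that keep the same label.
augmented = [
    image,
    flip_horizontal(image),
    flip_vertical(image),
    rotate_clockwise(image),
    brighten(image, 40),
]
print(f"{len(augmented)} training samples from 1 original image")
```

Each variant shows the same subject, so it carries the same label as the original while exposing the model to a different orientation or brightness.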

  • Remove data imbalance by data sampling: These techniques are used when the data is imbalanced, i.e. one class of the data dominates the rest. Examples are fraud/non-fraud or churn/non-churn financial records, where there are far fewer anomalies (fraud or churn) than normal records. Such imbalance leads the model to predict only the majority class and fail to estimate the other. Up-sampling (also called over-sampling) increases the sample size of the minority class relative to the majority, making analysis easier.

  • Data inversion: In some cases, flipping the data gives a very different perspective. It can be easier to predict when a rare event does not occur than when it does, because the rarer the event is, the larger its inverse will be. Working with the inverse can make complex patterns easier to find. For example, an event that occurs on the second Monday of every month will have only five data points over five months. But its inverse, all the days that aren’t the second Monday of the month, will have about 145 data points.
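As one illustration of the sampling idea in the list above, the sketch below performs simple random over-sampling. The transaction records, the labels, and the `random_oversample` helper are hypothetical; dedicated libraries offer more refined variants.

```python
import random
from collections import Counter

# Hypothetical labelled transactions: 2 fraud records among 20.
records = [({"amount": 100 + i}, "non-fraud") for i in range(18)]
records += [({"amount": 9000}, "fraud"), ({"amount": 12000}, "fraud")]


def random_oversample(rows, seed=0):
    """Duplicate randomly chosen minority rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in rows:
        by_class.setdefault(label, []).append((features, label))
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to top the class up to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


balanced = random_oversample(records)
print(Counter(label for _, label in records))
print(Counter(label for _, label in balanced))
```

Because over-sampling duplicates records, it is usually applied to the training portion only, so that copies of the same record do not also appear in the data used for evaluation.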

Get creative with analytics strategy

  • Prefer simple models: Use simple models when possible and limit the number of parameters to estimate. Increasing complexity by including irrelevant features leads the model to overfit.

  • Adopt an ensemble approach: Build generalized models by training an ensemble of multiple algorithms instead of a single model. Combining those models (by weighting or voting) can significantly reduce variance and give more accurate results from small data. Bagging and boosting are two powerful ensemble methods. Bagging is a parallel technique that builds all models independently and then aggregates them. Boosting is a sequential technique that builds models one after another, each adapting to the errors of the previous ones.

  • Go for ranges over point estimates: Predicting point estimates is often hard, and misleading in the case of small data. So give yourself a margin of error and recommend ranges and confidence intervals instead. The example below shows two ways of predicting an item’s weight. With only a few data points, it is safer to predict the range [151 – 158 pounds] instead of an exact value (155 pounds).

Fig. 3. Range over point estimate

  • Opt for pattern mining rather than prediction: Techniques of correlation and prediction might not be effective on small data. In such cases, the analysis strategy should change to finding patterns, similarities, or influences using various statistical tests.
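The idea of reporting a range instead of a point estimate can be sketched with a percentile bootstrap. The weights, the 90% confidence level, and the number of resamples are assumptions for illustration; the bootstrap is one of several ways to obtain such an interval.

```python
import random
from statistics import mean

# Hypothetical weight measurements in pounds.
weights = [151, 153, 154, 155, 156, 157, 158]


def bootstrap_interval(sample, confidence=0.90, n_resamples=5000, seed=0):
    """Percentile bootstrap interval for the mean of a small sample."""
    rng = random.Random(seed)
    # Resample with replacement many times and record the mean of each resample.
    means = sorted(
        mean(rng.choices(sample, k=len(sample))) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper


point_estimate = mean(weights)
lower, upper = bootstrap_interval(weights)
print(f"point estimate: {point_estimate:.1f} lb")
print(f"90% interval:   {lower:.1f} to {upper:.1f} lb")
```

The interval makes the uncertainty visible: with only seven measurements, the plausible range for the average is wide enough that quoting a single number would overstate what the data supports.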

Get creative with external helpers

  • Accommodate domain expertise: Small data often does not have the luxury of using traditional feature selection techniques. Domain expertise can be used to identify features of interest, derive more features, or select the best model.

  • Apply transfer learning: Transfer learning helps build robust models even on small data by reusing the knowledge captured in widely available pre-trained models. We apply transfer learning in our day-to-day life. For example, the balancing skills acquired while learning to ride a bicycle help in learning to ride a motorized two-wheeler. The same concept can be applied to many real-world small data problems of classification, prediction, or anomaly detection, among others.

ignio’s take

ignio can effectively handle small data and provide accurate predictions. To quote one example, small data empowered ignio to predict the finish time of runners in a leading international marathon! Since this race takes place only once a year, we had only one observation per year for a runner. Even if a runner has run this marathon for 10 years, it still adds up to only 10 observations. Besides, results only captured individual run times, with no additional information.

ignio adopted some creative approaches to small data. It used data augmentation to join other publicly available data sources to add features such as terrain, temperature, humidity, number of runners, race time, etc. It used the ensemble algorithms approach to apply various models to get accurate predictions. It used transfer learning to extend the model developed on known runners to unknown runners. For more details, please refer to this blog post.

With similar strategies, ignio has handled many real-world scenarios that include forecasting the end time of a process that runs only once a month, predicting an outage that occurs only three or four times a year, or detecting a fraud transaction with very limited details about the transactions.

Conclusion: Small (data) can be beautiful

No definition of AI or ML requires Big Data to fully unleash their power. In fact, the potential of small data has been underplayed because of the hype around its massive counterpart.

The lack of sufficient data has been the single biggest challenge in industry’s adoption of AI. Many businesses need AI solutions for targeted problems where information is limited. We strongly believe that small data is an untapped gold mine. It can significantly increase the use of AI in businesses. All you need is creativity!
