Good data is more than just a matter of volume. It’s a means of understanding your business and finding answers that were previously beyond your reach. Having a lot of data is not enough on its own. So what makes big data useful? What properties should you look for to feel confident you have the right data?
Big data comes from three primary sources:
1) Social data – from messages, tweets, retweets, likes, shared pictures and other files, and comments from various social media platforms.
2) Machine data – data generated by devices such as phones, cash registers, IoT sensors, and servers. This includes not only the output of the applications they run but also their monitoring data, such as CPU and memory utilization, error logs, and crash reports. (Medical imaging accounts for a growing share of machine data; a single CT study can run to hundreds of megabytes.)
3) Transactional data – data generated by the transactions that happen every day in areas such as banking, finance, retail, and supply chain processes, a source that keeps growing.
To derive value from data, you need to look at it from different perspectives. This is where the Four Vs of big data come in: volume, velocity, variety, and veracity.
Volume, as the name implies, describes the quantity of the data. Volume is important because we need enough data to derive insights that are statistically reliable. For use cases that depend on time-based patterns, the data also needs to cover a long enough time span.
Consider the use case of deriving trends and patterns in CPU utilization. Daily, weekly, or monthly patterns can be detected accurately only when there are sufficient data points across many days, weeks, or months. Similarly, comprehensive anomaly detection relies on an accurate model of normal behavior. Too few data points, or too short a duration, will very likely lead to large numbers of false positives or false negatives.
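To make the CPU utilization example concrete, here is a minimal Python sketch of a volume check. It asks whether a set of sample timestamps spans enough full cycles (days, weeks) and has enough points per cycle to look for a pattern at that scale. The function name and both thresholds are hypothetical; what counts as "enough" depends on your use case.

```python
from datetime import datetime, timedelta


def covers_enough_cycles(timestamps, cycle, min_cycles=4, min_points_per_cycle=24):
    """Return True if the samples span at least `min_cycles` cycles and
    average at least `min_points_per_cycle` samples per cycle.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if len(timestamps) < 2:
        return False
    span = max(timestamps) - min(timestamps)
    cycles = span / cycle  # dividing two timedeltas yields a float
    if cycles < min_cycles:
        return False
    return len(timestamps) / cycles >= min_points_per_cycle


# Hypothetical data set: hourly CPU samples collected for 10 days.
start = datetime(2024, 1, 1)
hourly = [start + timedelta(hours=h) for h in range(10 * 24)]

daily_ok = covers_enough_cycles(hourly, timedelta(days=1))
weekly_ok = covers_enough_cycles(hourly, timedelta(weeks=1))
```

With these assumed thresholds, ten days of hourly samples is enough to look for a daily pattern but not a weekly one, because the data covers fewer than two full weeks.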
Velocity refers to the speed at which data is generated and consumed. Velocity helps assess how long a data point retains its value, which varies with the use case. For example, stock market data can be very short-lived, losing relevance in milliseconds, while procurement data can stay valid for weeks. In the IT world, applications and infrastructure frequently go through a tech refresh or get decommissioned, so many older observations about their performance, capacity, and availability lose their relevance for enterprise architects.
To measure the impact of velocity, you can use factors such as persistence and recency. Persistence indicates how long a value has been observed, which helps confirm that it is not a temporary, transient observation. Recency indicates how recently the data point was captured, which helps confirm that it is not stale or obsolete. These factors can then be used to assess the relevance of the observations derived from the data.
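One way to turn persistence and recency into numbers is sketched below. This is an assumption for illustration, not a standard formula: persistence is the observed duration capped at a minimum required duration, recency decays exponentially with age, and the relevance score is their product. The function names, `min_duration`, and `half_life` are all hypothetical.

```python
from datetime import datetime, timedelta


def persistence(first_seen, last_seen, min_duration):
    """Score in [0, 1]: how long the value was observed, relative to
    the (assumed) minimum duration needed to rule out a transient."""
    return min(1.0, (last_seen - first_seen) / min_duration)


def recency(last_seen, now, half_life):
    """Score in (0, 1]: 1.0 for a value seen right now, halving every
    `half_life` of age (an assumed exponential decay)."""
    age = max(now - last_seen, timedelta(0))
    return 0.5 ** (age / half_life)


def relevance(first_seen, last_seen, now, min_duration, half_life):
    """Combine both factors; a simple product is one possible choice."""
    return persistence(first_seen, last_seen, min_duration) * recency(
        last_seen, now, half_life
    )


now = datetime(2024, 6, 1)
# Observed for 23 days, last seen one week ago.
steady = relevance(
    first_seen=now - timedelta(days=30),
    last_seen=now - timedelta(days=7),
    now=now,
    min_duration=timedelta(days=14),
    half_life=timedelta(days=7),
)
# Observed for one day only, but seen just now.
transient = relevance(
    first_seen=now - timedelta(days=1),
    last_seen=now,
    now=now,
    min_duration=timedelta(days=14),
    half_life=timedelta(days=7),
)
```

Under these assumptions the long-running but week-old observation scores higher than the brand-new, one-day observation, which matches the idea that a value should be both sustained and current to count as relevant.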
One of the everyday uses of velocity is in making online recommendations to customers based on their purchase history. A customer’s most recent and consistent purchases carry more weight in guiding recommendations for similar or complementary items.
Variety refers to different formats of data, such as CSVs, relational databases, videos, audio, or event and error logs. It also refers to the different sources from which the data is collected. Having a sufficient variety of formats and sources lets us build a holistic understanding of the problem at hand.
For example, suppose a cab service provider wants to use data analysis to optimize its operations. It can collect data about traffic conditions, cab demand, weather, travel patterns, crime rates, accidents, and other parameters. This data can come in various forms: maps, videos, and live feeds, among others. As you can imagine, a greater variety of data can lead to better understanding, better insights, and better decision-making.
High-variety data is becoming more common as IoT devices spread through our everyday lives. Today, an urban consumer uses many connected devices in the household, such as a mobile phone, smart watch, smart TV, or digital voice assistant. Data from all of these devices can be combined to create a rich user profile.
One more aspect of variety is the content of the data. Consider an example of events data collected from an enterprise IT system that captures various errors, anomalies, and abnormal conditions. Events data coming from different layers of technology might have the same format, but the content varies. We would need to understand the range of events that are captured as part of this data, such as availability problems, performance levels, or IT policy non-compliance.
Furthermore, we’d want to know which part of the technology stack is capturing these events: applications, servers, storage devices, network devices, or others. A wide variety of content ensures holistic, comprehensive analysis.
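A quick way to check variety of content is to count which combinations of technology layer and event category actually appear in the data. The sketch below assumes each event is a dictionary with hypothetical `layer` and `category` fields; the expected layers and categories are taken from the examples above and would differ in a real environment.

```python
from collections import Counter

# Assumed taxonomy, based on the examples in the text.
EXPECTED_LAYERS = {"application", "server", "storage", "network"}
EXPECTED_CATEGORIES = {"availability", "performance", "compliance"}


def coverage_gaps(events):
    """Return the sorted (layer, category) pairs that never occur in `events`."""
    seen = Counter((e["layer"], e["category"]) for e in events)
    return sorted(
        (layer, category)
        for layer in EXPECTED_LAYERS
        for category in EXPECTED_CATEGORIES
        if seen[(layer, category)] == 0
    )


# Hypothetical sample of events data.
events = [
    {"layer": "application", "category": "availability"},
    {"layer": "server", "category": "performance"},
    {"layer": "server", "category": "performance"},
    {"layer": "network", "category": "availability"},
]

gaps = coverage_gaps(events)
```

Every pair in `gaps` is a blind spot: a kind of event, at some layer, that the data set says nothing about.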
Veracity refers to the completeness, consistency, and trustworthiness of the data. This aspect becomes all the more important as data comes from a wider range of sources and drives more outcomes. The technology research firm Gartner estimated in 2021 that poor data quality costs organizations an average of $12.9 million a year. Factors such as bias, abnormalities, inconsistencies, and data gaps can affect the overall quality of data. Identifying and correcting these issues improves the data’s value, and also the accuracy of the results derived from analyzing it.
Lack of veracity can cause many kinds of problems. Let’s say we want to analyze vehicle pictures taken by automated traffic cameras. This data often contains blurred images, images in which no license plate number is visible, and other gaps or inconsistencies. Poor image quality can get in the way of automatically and efficiently detecting traffic violations.
Or consider incidents in enterprise IT. These incidents are analyzed to mine patterns, correlations, and problem signatures so that IT teams can predict problems, fix them quickly, or even prevent them from occurring. However, incident data often contains time gaps where data was not collected because of glitches in monitoring or collection. In other cases, attributes in the data, such as OS versions, host names, and host locations, are stale or inconsistent with the infrastructure repository.
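Both of these veracity problems can be checked for early. The following sketch, with hypothetical function names, field names, and tolerance, flags time gaps in a series of collection timestamps and lists incident attributes that disagree with an infrastructure repository, modeled here as a plain dictionary keyed by host name.

```python
from datetime import datetime, timedelta


def find_time_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (before, after) pairs where consecutive samples are more than
    `tolerance` times the expected interval apart (tolerance is an assumption)."""
    ordered = sorted(timestamps)
    limit = expected_interval * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


def find_stale_attributes(incident, repository, fields=("os_version", "host_location")):
    """Return the names of incident fields that disagree with the repository.
    An unknown host is reported as a mismatch on 'host_name'."""
    record = repository.get(incident["host_name"])
    if record is None:
        return ["host_name"]
    return [f for f in fields if incident.get(f) != record.get(f)]


# Hypothetical monitoring samples, expected every 5 minutes.
start = datetime(2024, 1, 1)
samples = [start + timedelta(minutes=m) for m in (0, 5, 10, 30, 35)]
gaps = find_time_gaps(samples, timedelta(minutes=5))

# Hypothetical repository and incident record.
repository = {"web-01": {"os_version": "22.04", "host_location": "fra1"}}
incident = {"host_name": "web-01", "os_version": "20.04", "host_location": "fra1"}
stale = find_stale_attributes(incident, repository)
```

Here the check reports a single 20-minute collection gap and one stale attribute, `os_version`. Findings like these are worth resolving before the incident data is mined for patterns.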
So you can see that if the underlying data isn’t complete or trustworthy, the insights derived from it aren’t very useful.
Selecting the right data set is the most vital step to ensure the utility and trustworthiness of insights derived from it. Analyzing big data that is incomplete, stale, or insufficient leads to a lot of wasted effort. Worse, it carries the risk of misleading the user with incorrect insights and recommendations. Looking at the Four Vs of volume, velocity, variety, and veracity can help find limitations of your data set much earlier in the process.
Despite the advancements in artificial intelligence, many organizations still don’t trust AI-driven insights enough to base business decisions on them. An early data quality assessment can play a vital role in deriving meaningful insights and actionable recommendations, and thereby help increase the adoption of AI solutions in the industry.
Big data is one of the factors enabling the rise of AI.