The story of data science goes hand-in-hand with the story of data management. This story has developed rapidly over the last decade, to the point that data and data-driven decisions have become an integral part of the operations, approach, and character of an organization. Data penetration is so deep and the opportunities so vast that Professor Yuval Harari, in his book “Homo Deus: A Brief History of Tomorrow”, refers to a “data religion” that promises happiness, peace, prosperity, and even eternal life with the help of data-processing technology.
Data management is the process of ingesting, storing, retrieving, organizing, and maintaining the data created and collected by an organization. It forms the base for creating a data-driven organization. The performance, effectiveness, and even trustworthiness of data-driven decisions largely depend on the soundness of the data management process. Ironically, however, the bulk of data science initiatives focus on AI/ML techniques, and data management is often treated as an afterthought.
The massive growth of data and the equally massive opportunities presented by data-driven insights have led to a phenomenal story of the evolution of data management. The last few decades have seen impressive innovations not just in the way data is analyzed but also in the way data is stored, accessed, and maintained. This blog is an attempt to present some of the major highlights in the data management story. It is also the start of a series of blogs discussing the fundamentals of data management, covering topics such as types of databases, data bodies, data pipelines, and data governance, among others.
Databases
Databases are a good starting point for the data management story. Relational Database Management Systems (RDBMS) came into the picture in the 1970s and were used to store data as tables with rows and columns. Each table had a well-defined set of attributes with well-defined data types and constraints. Relational databases were very appealing to businesses for their ACID properties (Atomicity, Consistency, Isolation, and Durability). These databases became a perfect home for storing inventory, orders, transactions, etc.
However, the well-defined structure of RDBMS came with some tradeoffs. First, the rigid schema makes RDBMS difficult to set up, maintain, and grow; post-facto schema changes are difficult and time-consuming. Second, RDBMS is a poor choice for storing unstructured and semi-structured data. Thus, to better cater to the requirements of modern big data, non-relational (or NoSQL) databases started to emerge.
A NoSQL database provides the flexibility to store unstructured and semi-structured data. Users do not need to define the data types during the setup and the system can easily accommodate changes in data types or schema. NoSQL databases are also designed to distribute data across different nodes and are horizontally scalable to support large data volumes.
However, this power of NoSQL databases comes at a cost. They are not ACID compliant and do not guarantee strong data consistency. Instead, they guarantee “eventual consistency”: informally, if no new updates are made to a distributed database, then eventually all reads will return the last updated value. For example, consider a search engine built on a distributed NoSQL database. While the database is being updated, search queries may not return the most up-to-date response; they return the best output currently available and eventually converge to the latest data. This mode of operation will not work for business use cases where consistency is of utmost importance (e.g., trades, orders, transactions), but it works fine where a quick response is needed and some staleness can be tolerated (e.g., search engine queries).
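To make “eventual consistency” concrete, here is a toy, in-memory Python sketch. All names are illustrative; a real distributed store replicates across nodes over a network, whereas this only mimics the timing: writes reach a primary copy immediately and reach a replica only after a delay, so reads from the replica can briefly be stale.

```python
import threading
import time

# Toy sketch of eventual consistency: writes land on a "primary" copy
# immediately and are propagated to a "replica" copy after a delay, so
# reads from the replica can briefly return stale data.
class ToyReplicatedStore:
    def __init__(self, replication_delay=0.5):
        self.primary = {}
        self.replica = {}
        self.replication_delay = replication_delay

    def write(self, key, value):
        self.primary[key] = value
        # Propagate to the replica asynchronously, as a distributed store would.
        threading.Timer(self.replication_delay,
                        self.replica.__setitem__, args=(key, value)).start()

    def read_from_replica(self, key):
        return self.replica.get(key, "<stale: not yet replicated>")

store = ToyReplicatedStore()
store.write("top_result", "freshly indexed page")
print(store.read_from_replica("top_result"))  # likely still stale
time.sleep(1)                                  # let replication catch up
print(store.read_from_replica("top_result"))  # now returns the latest value
```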
Over a period of time, specialized databases came into existence, each serving a specific purpose.
- Key-value stores: these are probably the simplest form of databases; they store only pairs of keys and values. Such simple systems are usually inadequate for complex applications, but this simplicity makes them a preferred choice for scenarios that demand high performance with limited resources (a minimal usage sketch follows this list). Cosmos DB and Redis are examples of key-value stores.
- Wide-column stores: they store data in records with the ability to hold a very large number of dynamic columns. Google’s BigTable is considered the origin of this class of databases. Cassandra and HBase are two popular databases in this category.
- Document stores: instead of storing data in fixed rows and columns, they store data as documents. A document is usually a JSON or XML record that holds the information of one object and its associated metadata. These databases are known for their schema-free organization of data and are well suited for semi-structured data. MongoDB and DynamoDB are examples of document stores.
- Graph databases: they are purpose-built to store and navigate relationships. Entities are stored as nodes and relationships as edges, and edges can be of different types to capture different kinds of relationships. Traversing relationships in a graph database is very fast, which makes them well suited for social networks or recommendation engines that need to create and query many data relationships. Neo4j and ArangoDB are examples of graph databases.
- Time-series databases: they are designed to efficiently store and retrieve time-series data, i.e., metrics and events captured against a timestamp. InfluxDB is an example.
- Event stores: these databases are optimized for storing events. While most databases store the current state of an object, event stores persist all state-changing events of an object together with a timestamp. For example, for a shopping cart object, an event store records the addition of each item to the cart as a separate event.
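As a minimal illustration of the simplest of these, the key-value store, here is a short sketch using the redis-py client. It assumes a Redis server is reachable at localhost:6379; the keys and values are illustrative.

```python
# Minimal key-value store interaction using the redis-py client.
# Assumes a Redis server is running locally on the default port.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store and retrieve simple key-value pairs, e.g., a session cache.
r.set("session:42", "user_id=7;theme=dark")
print(r.get("session:42"))

# Keys can carry a time-to-live, a common pattern for caches and tokens.
r.set("otp:42", "914523", ex=300)  # expires after 300 seconds
print(r.ttl("otp:42"))
```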
Given the wide variety of databases, each with its own strengths and weaknesses, most applications are designed using an ensemble of databases to best leverage those strengths. For example, consider modeling an enterprise IT estate. The entities and relationships can be best captured in a graph database; the historical performance metrics of servers, storage, and network can be stored in a time-series database; the data on events, incidents, change requests, etc. can be kept in an event store; and the various analysis reports and insights can be captured in a document store.
Another interesting direction is database-agnostic application design. The idea is to design applications such that interactions with the data layer are abstract enough to switch from one database to another. This is very enticing when an application needs to be deployed in customer environments where customers have constraints on, or preferences for, certain databases. It is also appealing when the application needs to support different scales of deployment: a customer with a small estate and features that are not data-intensive might need a simple, single-database design, while a customer with a large estate hosting a whole host of data-intensive features might need a complex, multi-database design.
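One common way to achieve this abstraction is the repository pattern: the application depends on an abstract data-access interface, and concrete backends can be swapped per deployment. The sketch below is a minimal Python illustration with made-up names; an in-memory backend stands in for whichever database a given customer prefers.

```python
# Minimal sketch of a database-agnostic data layer using the repository pattern.
# Business logic depends only on the abstract interface; backends can be swapped.
from abc import ABC, abstractmethod
from typing import Dict, Optional


class DeviceRepository(ABC):
    @abstractmethod
    def save(self, device_id: str, record: Dict) -> None: ...

    @abstractmethod
    def find(self, device_id: str) -> Optional[Dict]: ...


class InMemoryDeviceRepository(DeviceRepository):
    """Simple backend for small-scale deployments or tests."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict] = {}

    def save(self, device_id: str, record: Dict) -> None:
        self._data[device_id] = record

    def find(self, device_id: str) -> Optional[Dict]:
        return self._data.get(device_id)


def register_device(repo: DeviceRepository, device_id: str, attrs: Dict) -> None:
    # Business logic talks to the abstract repository, not a specific database.
    repo.save(device_id, {"id": device_id, **attrs})


repo = InMemoryDeviceRepository()
register_device(repo, "srv-001", {"type": "server", "site": "dc1"})
print(repo.find("srv-001"))
```

A production deployment could replace InMemoryDeviceRepository with, say, a document-store-backed implementation of the same interface without touching the rest of the application.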
Data Bodies
For many years, RDBMS was sufficient for businesses: data volumes were small, and RDBMS offered performance and reliability. However, with the rise of digitization and observability, the collected data grew to very large volumes. As a result, organizations ended up with multiple databases for different business functions. These databases were disconnected from each other and served different types of users and purposes. This led to data silos: decentralized, fragmented storage of data across the organization. This problem led to the inception of the Data Warehouse.
The data warehouse emerged as a technology that brings together a collection of relational databases, allowing the data to be viewed and queried as a whole. At first, data warehouses typically ran on expensive on-premises hardware; later they became available in the cloud. They offer many advantages, such as integration of multiple data sources, optimized read access, fast queries, and data governance. However, data warehouses are primarily designed for structured data, and with the rise of unstructured data they started falling short of the data and analytics needs of many organizations. This gave rise to the concept of Data Lakes.
Data lakes offer storage for unstructured, semi-structured, and structured data taken from multiple data sources, without requiring a predefined schema. Following are some key differences between data warehouses and data lakes.
- Data warehouses can only ingest structured data, whereas data lakes can ingest unstructured and semi-structured data as well.
- Data warehouses require ETL (Extract-Transform-Load) tools to clean and structure data before ingestion, whereas data lakes are used with ELT (Extract-Load-Transform) tools: data is first loaded into the lake and then transformed as and when required (a short sketch follows this list).
- Data warehouses offer good data quality through de-duplication and data validations, whereas data lakes may contain unverified and erroneous data.
- Data warehouses offer fast query performance, whereas data lakes prioritize data flexibility and volume over performance.
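To make the ETL-versus-ELT distinction concrete, here is a minimal pandas sketch. The file names, directories, and columns are illustrative assumptions, not a prescribed layout.

```python
# Minimal pandas sketch contrasting ETL (warehouse style) with ELT (lake style).
# Assumes incidents_raw.csv exists and warehouse/ and lake/ directories are writable.
import pandas as pd

raw = pd.read_csv("incidents_raw.csv")

# ETL: transform first, then load only the curated columns into the warehouse.
curated = (raw.dropna(subset=["incident_id"])
              .assign(opened_at=lambda df: pd.to_datetime(df["opened_at"]))
              [["incident_id", "severity", "opened_at"]])
curated.to_parquet("warehouse/incidents.parquet")

# ELT: land the raw data in the lake as-is, and transform later,
# only when a specific use case needs it.
raw.to_parquet("lake/incidents_raw.parquet")
lake = pd.read_parquet("lake/incidents_raw.parquet")
per_severity = lake.groupby("severity")["incident_id"].count()
print(per_severity)
```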
Most organizations now maintain a data lake as the first entry point for raw data and then create different data warehouses for different business use cases.
Whether you use a database, a data warehouse, or a data lake, the value that data can offer largely depends on how clean and pollution-free you keep it. It is also important to design these data bodies with a purpose. The “data first, questions later” approach tends to produce very large volumes of data with no analytical value and no meaningful insights. In such cases, bigger data is not better; bigger is just bigger, more costly, and harder to derive meaningful insights from.
Getting the Data Right
While databases and data bodies offer the power to efficiently store and analyze data, the utility of these initiatives depends heavily on the quality of the data being processed. Poor data quality can take many forms, such as incompleteness, inconsistency, duplication, and staleness. This has made data quality management increasingly important. Data quality management offers ways to identify data flaws, assess their impact, and mitigate the issues. Following are some of the crucial aspects of data quality management.
- Data cleaning, also called data scrubbing, is the process of detecting, removing, or rectifying incorrect, inconsistent, duplicate, or wrongly formatted data. The problem is usually aggravated when data from multiple sources is combined, leading to duplicate or mislabeled records.
- Data transformation is the process of converting data from one format to another, making it easily accessible for visualization and analysis. This process is also referred to as data wrangling or data munging.
- Data quality assessment is the process of evaluating data along the dimensions of volume, velocity, variety, and veracity. It creates useful summaries of the data and surfaces data quality issues, enabling early detection of data gaps and saving the time and effort of costly analysis.
- Data rollup is the process of aggregating data over time to condense the data set. It is most useful for time-series data: rolling granular data up into coarser time buckets reduces the bulk (a small pandas sketch of cleaning and rollup follows this list).
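The following is a minimal pandas sketch of data cleaning and rollup for time-series metrics; the input file and column names are illustrative assumptions.

```python
# Minimal pandas sketch of cleaning and rolling up time-series metrics.
# Assumes cpu_metrics.csv has timestamp, server_id, and cpu_pct columns.
import pandas as pd

metrics = pd.read_csv("cpu_metrics.csv", parse_dates=["timestamp"])

# Data cleaning: drop duplicates and missing readings, discard invalid values.
clean = (metrics.drop_duplicates()
                .dropna(subset=["cpu_pct"])
                .query("0 <= cpu_pct <= 100"))

# Data rollup: condense per-minute readings into hourly averages and peaks.
hourly = (clean.set_index("timestamp")
               .groupby("server_id")["cpu_pct"]
               .resample("1h")
               .agg(["mean", "max"]))
print(hourly.head())
```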
Today, industries are consumed with the need to collect more and more data about their business, IT, and operations. Yet, too often, the insights derived from the data are biased or over-simplified. At times, the interpretations are too broad, or reasoned backwards to confirm an already-believed hypothesis. Darrell Huff’s “How to Lie with Statistics” uncovers several cases of how data gets misunderstood or manipulated.
Selecting the right data set is the most vital step to ensure the utility and trustworthiness of insights derived from it. Analyzing big data that is incomplete, stale, or insufficient leads to a lot of wasted effort. Worse, it carries the risk of misleading the user with incorrect insights and recommendations.
Data Governance
Insights derived from data analytics are highly influenced by the input data. In a world where data is everywhere and data-driven decisions are taking centerstage, data governance becomes absolutely essential. Data governance typically entails the following key aspects:
- Data security enforces governance policies to guard sensitive data.
- Data accessibility dictates which data should be accessible for what analysis.
- Data compliance ensures adherence to laws and regulations and defines the scope of permissible analysis.
- Data auditability establishes lineage of data origin and transformation and plays an important role in establishing trust in the insights derived from data.
- Data privacy deals with handling data in compliance with the data protection laws, regulations, and privacy best practices. It addresses how data should be collected, stored, and shared with third parties.
Closing Thoughts and Upcoming Blogs
Data management is a continuous journey: it does not happen in one day; it evolves over time. With ever-increasing data volumes and growing reliance on data-driven decisions, a strong data management foundation is vital for today’s enterprise.
Over the next few weeks, we will publish a series of blogs covering different aspects of data management. They are intended to present a simple explanation of the data management concepts, as well as give a perspective into how our product features are built.