In today’s data-driven world, we use a variety of data to infer insights and make decisions. From assessing people’s sentiment on a social media platform, to predicting sales of a product, to automatically driving a car, everything involves data and data-driven decision making.
When it comes to data, there are producers of data and there are consumers of data. Some examples of producers are humans, IoT devices, and monitoring tools. Some examples of consumers are AI/ML programs, dashboards, and data warehouses. Consider the following examples:
- Point-of-sale servers collect sales data which is used to forecast future sales.
- Monitoring tools collect performance metrics from applications, which are used to render real-time dashboards of application health.
- Customers post reviews about a product, which are used to assess product fit.
- Various sensors in a car collect data which is used by an AI engine to automatically drive a car.
- Users search and view videos, the history of which is used by video sharing platforms to provide recommendations.
- Applications collect biometric data, which is used by criminal investigation bodies to identify victims or criminals.
The producers and consumers often look at the data in different ways. The data generated by producers often cannot be consumed as-is. It needs cleaning, pre-processing, and transformation before it can be used by the consumers. Hence the need for data pipelines.
What is a data pipeline?
A data pipeline is a process that performs a set of data processing steps and moves data from a source to a destination. It often involves steps to clean, pre-process, and transform the data generated by the producer to make it usable by the consumer.
In simpler terms, data pipelines are like sophisticated data-processing factories. They take in raw, often messy, information and put it through a series of carefully orchestrated steps to turn it into something organized, valuable, and ready for analysis.
Consider the example of a social media data pipeline for a restaurant. The customers post their reviews on social media and the restaurant can use these posts to make operational improvements to enhance the overall customer experience.
- The producers of data here are the customers, who generate unstructured blobs of text in the form of reviews.
- The consumer of the data is a business user who expects recommendations to improve customer experiences.
- The data pipeline connecting the producer to the consumer needs to aggregate data from social media platforms, perform deduplication, pre-process data to remove noise, assess sentiments, derive problem areas, and generate recommendations.
- The data in each step takes different forms from blobs of text to word embeddings, from cost-benefit equations to optimization functions.
Below are some processing patterns frequently realized using data pipelines:
- Collect activity data from an application, mask personally identifiable information, and store in a data warehouse.
- Collect data from multiple local data sources and migrate to cloud data warehouses.
- Collect logs and events from a Kafka stream, flatten the fields, standardize data attributes such as date, distance, and temperature, perform data-type conversions, and store in an Elasticsearch database for further analysis.
- Query data warehouse to create specialized data slices and store as data lakes.
- Read data from local storage, remove duplicates, add fields such as identifiers or ingestion dates, handle cases of missing data, and store it back to the same storage.
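The last pattern above can be sketched as a minimal pipeline. This is an illustrative example only, with hypothetical field names, using Python’s standard library:

```python
import datetime
import uuid

def run_pipeline(rows):
    """Deduplicate rows, add an identifier and ingestion date,
    and fill in missing fields with defaults."""
    seen = set()
    cleaned = []
    for row in rows:
        key = (row.get("customer"), row.get("amount"))
        if key in seen:          # drop duplicate records
            continue
        seen.add(key)
        cleaned.append({
            "id": str(uuid.uuid4()),                       # new identifier
            "ingested_at": datetime.date.today().isoformat(),
            "customer": row.get("customer") or "unknown",  # handle missing data
            "amount": row.get("amount") or 0.0,
        })
    return cleaned

raw = [
    {"customer": "alice", "amount": 10.0},
    {"customer": "alice", "amount": 10.0},   # duplicate
    {"customer": None, "amount": 5.0},       # missing customer field
]
result = run_pipeline(raw)
```

A real pipeline would read from and write back to storage; the cleaning steps themselves look much the same.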
What are the key steps to perform data pipelining?
To gain a deeper understanding of how data pipelines function, let’s take a closer look at their fundamental building blocks:
1. Data Capturing from Sources
This is the first step of any data processing. The ability to capture data from different sources in different formats and store it appropriately plays a key role.
- Once the data is captured from various sources, it needs to be stored in a format that can be used for further analysis. This storage could be a simple file, a relational database, or a specialized database such as a graph or time-series database.
- The data must be pre-processed to make it usable. For example, CPU utilization captured from different machines and operating systems comes in different time formats and granularities. Converting them into one format and granularity to make the data uniform could be part of the pre-processing.
- The data must be stored in a secure way.
- The data storage should be designed such that it adapts and expands as the data grows.
In the digital world, storage services such as Amazon S3 or Azure Blob Storage safeguard your data until it is ready for analysis. Consider an e-commerce platform that collects customer clickstream data and stores it in one place, to polish it further and generate personalized product recommendations.
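The CPU-utilization example above can be sketched as a small pre-processing step. This is a hedged illustration, not a production implementation: it assumes samples arrive either as Unix epoch seconds or as ISO-8601 strings, and normalises them to UTC at a 5-minute granularity:

```python
from datetime import datetime, timezone

def normalise_sample(timestamp, value):
    """Convert a CPU-utilisation sample to a uniform representation:
    a UTC ISO timestamp rounded down to 5-minute granularity."""
    if isinstance(timestamp, (int, float)):      # Unix epoch seconds
        ts = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    else:                                        # ISO-8601 string
        ts = datetime.fromisoformat(timestamp)
        if ts.tzinfo is None:                    # assume naive times are UTC
            ts = ts.replace(tzinfo=timezone.utc)
    # Round down to the start of the 5-minute bucket
    ts = ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)
    return {"ts": ts.isoformat(), "cpu_pct": float(value)}

# Samples from two machines, in different formats and granularities,
# end up in one uniform shape.
a = normalise_sample(1700000123, 42)               # epoch seconds, int percent
b = normalise_sample("2023-11-14T22:17:30", 37.5)  # naive ISO string
```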
2. Data Transformation
Data transformation involves the steps wherein the collected data is formatted and transformed before being stored for analysis. A classic example is when data received from different sources, businesses, and departments of the same enterprise is to be loaded into the data warehouse. For example, a stock exchange supports trading in equity stocks, and in futures and options on equities and currencies. The orders and trade transactions for each of these segments have different details and nuances. Storing all of them in the same exchange warehouse database involves a process of transformation to bring the diverse data into a common format.
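A minimal sketch of such a transformation step might map segment-specific records onto one common schema. The field names here are purely illustrative, not an actual exchange format:

```python
def to_common_schema(record, segment):
    """Map segment-specific trade records to a common warehouse schema.
    Field names are hypothetical, for illustration only."""
    if segment == "equity":
        return {"segment": "equity",
                "symbol": record["ticker"],
                "qty": record["shares"],
                "price": record["price"]}
    if segment == "futures":
        # Futures trade in lots, so quantity is derived differently
        return {"segment": "futures",
                "symbol": record["contract"],
                "qty": record["lots"] * record["lot_size"],
                "price": record["settle_price"]}
    raise ValueError(f"unknown segment: {segment}")

eq = to_common_schema({"ticker": "ACME", "shares": 100, "price": 12.5},
                      "equity")
fut = to_common_schema({"contract": "ACME-DEC", "lots": 2, "lot_size": 50,
                        "settle_price": 12.6}, "futures")
```

Both records now share the same columns and can be loaded into a single warehouse table.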
When it comes to data transformation and loading strategies, two prominent approaches take centre stage: ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). Let’s unravel the differences and the ideal scenarios for each:
ETL (Extract, Transform, Load): In the ETL approach, data is first extracted from source systems, then transformed into the desired format on a separate processing server, and finally loaded into the data warehouse. Following are some scenarios where the ETL approach is often relevant:
- Data workflows requiring compute-intensive transformations.
- Data workflows on legacy systems with limited data processing capabilities.
- Data workflows requiring manipulation before entering the target system.
Imagine a healthcare organization needing to anonymize patient data for compliance. ETL enables them to extract patient records, anonymize the data on a separate server, and load the transformed, compliant data into their warehouse. Similarly, an insurance company may use ETL to process claims data, standardizing it before storing it in a data warehouse for fraud detection.
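The anonymization example can be sketched as an ETL flow in miniature. This is an assumption-laden illustration (hypothetical record fields, a one-way hash standing in for a full anonymization policy), where the transform happens outside the warehouse:

```python
import hashlib

def etl_anonymise(records):
    """Extract -> Transform (hash PII) -> Load.
    The transform runs on a separate processing server, so identifiable
    data never reaches the warehouse."""
    loaded = []
    for rec in records:                           # extract
        transformed = {
            # one-way hash replaces the patient's name (hypothetical field)
            "patient_id": hashlib.sha256(rec["name"].encode()).hexdigest()[:12],
            "diagnosis": rec["diagnosis"],
        }
        loaded.append(transformed)                # load into the warehouse
    return loaded

warehouse = etl_anonymise([{"name": "Jane Doe", "diagnosis": "flu"}])
```

The key ETL property: by the time data is loaded, the sensitive field is already gone.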
ELT (Extract, Load, Transform): ELT flips the script, with data being extracted, loaded directly into the data warehouse, and then transformed within the warehouse itself. This approach is praised for its flexibility and scalability. Following are some scenarios where the ELT approach is often relevant:
- Data workflows involving large volumes of data.
- Data workflows processing datasets containing structured and unstructured data.
- Data workflows developing diverse business intelligence.
Imagine an e-commerce platform continuously collecting customer interactions. With ELT, these interactions can be loaded into the data warehouse as raw data. This raw data acts as a rich historical archive, allowing the business intelligence team to run new transformations and queries as strategies evolve. For example, the company can adapt to changing customer behaviour by running sentiment analysis directly from the data warehouse.
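The ELT pattern can be sketched with an in-memory SQLite database standing in for a cloud data warehouse (the events and columns are hypothetical): raw data is loaded untouched, and the transformation is just a query that can be re-run or changed later.

```python
import sqlite3

# Extract + Load: raw interaction events land in the warehouse unchanged.
conn = sqlite3.connect(":memory:")   # stand-in for a real data warehouse
conn.execute("CREATE TABLE raw_events (user TEXT, action TEXT, product TEXT)")
raw_events = [
    ("u1", "view", "shoes"),
    ("u1", "buy", "shoes"),
    ("u2", "view", "hat"),
]
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw_events)

# Transform: runs inside the warehouse, and can be re-run with new queries
# as business strategies evolve, because the raw data is preserved.
purchases = dict(conn.execute(
    "SELECT product, COUNT(*) FROM raw_events "
    "WHERE action = 'buy' GROUP BY product"
).fetchall())
```

Because `raw_events` keeps everything, tomorrow’s team can write a completely different transformation over the same history.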
3. Data Analysis
The data collected, transformed, and stored as a part of the above steps becomes insightful and much more useful once analysis is run on it. This analysis could generate results that can turn data observations into insights and recommendations.
Analysis on historical data: Traditional data analytics has been about gathering historical data over time and generating insights from it for decision making. It relies mostly on batch data pipelines, where huge blocks of historic data are collected on regular schedules and fed to the system for analysis. This process is often carried out during periods of low user activity, such as on weekends or at night, to avoid overloading the systems. For example, performing sales trend analysis by collecting data for a whole quarter from different sources such as transaction systems and websites.
Analysis in real-time: Real-time analytics is about deriving insights from a continuous data flow within seconds or milliseconds. Streaming pipelines are used here to capture real-time events, ingest the data as it is received, and accordingly provide real-time summary statistics, updated metrics, or reports. It allows organisations to get up-to-date information and take quick decisions without delay. For example, ride-hailing service providers use real-time data from driver and passenger apps, such as current location, time, traffic, and demand, to run dynamic pricing, calculate estimated time of arrival, etc.
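The core of real-time analysis is incremental updating: each arriving event refreshes a summary statistic rather than triggering a full recomputation. A minimal sketch, assuming a fixed-size sliding window over the latest events:

```python
from collections import deque

class RollingMetric:
    """Maintain an up-to-date summary statistic over the last `window`
    events, updating incrementally as each event arrives."""
    def __init__(self, window=3):
        self.values = deque(maxlen=window)   # old events fall off automatically

    def ingest(self, value):
        self.values.append(value)
        return self.average()                # fresh metric after every event

    def average(self):
        return sum(self.values) / len(self.values)

# Hypothetical example: vehicle speeds streaming in, feeding an ETA estimate
eta_speed = RollingMetric(window=3)
for speed in [30, 40, 50, 60]:
    latest = eta_speed.ingest(speed)
```

A production streaming pipeline (Kafka consumers, windowed aggregations) follows the same shape at much larger scale.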
4. Data Presentation
The final step is delivery, where valuable insights are presented in easy-to-understand formats such as reports and dashboards.
Real-time dashboards: A live dashboard shows real-time data with the help of interactive visualisation. It provides instant access to critical data by automatically updating over time. For example, consider a portfolio management dashboard, where a user can view all the holdings and the changes happening to them with respect to market price, live orders, overall and present-day return, changes in stock market indices, etc.
Data slice and dice views: Slice and dice views break the information into smaller parts so it can be analysed from different perspectives. The main goal of these views is to let users examine the same data along different dimensions. For example, a user’s mutual fund portfolio can be viewed through different views showing the region-wise, sector-wise, asset allocation-wise, and market cap-wise categorisation of a fund.
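Slicing the same data along different dimensions is essentially a group-and-aggregate operation. A small sketch with a hypothetical portfolio:

```python
def slice_by(holdings, dimension):
    """Group portfolio holdings along one dimension and sum their values."""
    out = {}
    for h in holdings:
        key = h[dimension]
        out[key] = out.get(key, 0) + h["value"]
    return out

portfolio = [
    {"fund": "A", "region": "US", "sector": "tech",   "value": 600},
    {"fund": "B", "region": "EU", "sector": "tech",   "value": 250},
    {"fund": "C", "region": "US", "sector": "energy", "value": 150},
]
by_region = slice_by(portfolio, "region")   # {'US': 750, 'EU': 250}
by_sector = slice_by(portfolio, "sector")   # {'tech': 850, 'energy': 150}
```

The same holdings, sliced two ways, total the same value; only the perspective changes.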
Pre-canned reports: The other interesting form of data visualisation is pre-canned reports. These are pre-built, standardized reports that are ready to use. Many businesses rely on them, as they help in making important business decisions by providing a clear and concise summary of data. Some examples of pre-canned reports are the balance sheet, profit and loss statement, expenses, and cash flow statements used by finance managers.
Data Pipelining — Batch vs. Stream Processing: Timing Matters
In the realm of data pipelining, it’s not just about what you do but when you do it. Batch and stream processing represent two distinct timing strategies. The size of data and the frequency at which the data is being collected and analysed often play a role in deciding if the data pipeline is implemented in an offline batch mode or a near real-time streaming mode.
Batch Pipelines
Batch processing involves collecting data over a period and then processing it as a single batch. This approach is ideal for scenarios where real-time analytics isn’t imperative but processing large volumes of data takes priority. Generating weekly sales reports, computing end-of-day inventory reports, and monthly payroll processing are some suitable use cases for the batch mode of data pipelines.
Stream Processing
Stream processing, on the other hand, processes data as it arrives, making it perfect for scenarios where immediate insights are crucial. Detecting fraudulent transactions and monitoring applications to detect performance anomalies are some suitable uses for the streaming mode of data pipelines.
Industries and Use Cases where Data Pipelines Shine
Data pipelines are not limited to a single industry or use case. Their applications are many and diverse:
1. Retail: E-commerce platforms use data pipelines for personalized recommendations based on customers’ purchase behaviour. They also use data pipelines for inventory management, ensuring that popular items are always in stock.
2. Healthcare: Healthcare providers use data pipelines to securely process sensitive patient information and ensure compliance with privacy regulations. Additionally, data pipelines support medical research by efficiently processing and analysing large volumes of medical data, enabling researchers to identify trends and develop new treatments.
3. Finance: Financial institutions use data pipelines to continuously monitor transactions for any unusual or fraudulent activities. They employ data pipelines for analysing market trends to make informed investment decisions.
4. Manufacturing: Manufacturing industries use data pipelines for predictive maintenance and supply chain optimization. Data pipelines analyse sensor data from machinery and predict when equipment maintenance is needed to prevent breakdowns and production stoppages. Additionally, data pipelines optimize the supply chain by monitoring inventory levels, predicting demand, and ensuring that raw materials are available when needed, minimizing disruptions in production.
These examples illustrate how data pipelines are versatile tools with applications across various industries, contributing to improved efficiency, better decision-making, and cost savings.
The Road Ahead for Modern Data Pipelines
The future of data pipelines is all about exciting advancements. Here are some key directions where modern data pipelines are headed:
1. Real-time Everything: In a world filled with IoT devices (smart gadgets that talk to each other) and real-time applications (things happening instantly), data pipelines will become even faster. They will focus on giving immediate insights and information as things happen, rather than waiting for a while.
2. AI and Machine Learning Integration: Data pipelines will get even smarter with advancements in AI/ML. They play an important role in machine learning model training and evaluation, helping ML models make accurate predictions from raw data by ensuring data integrity. This integration will enable faster, AI-assisted decision making on top of big data storage.
3. Serverless Architecture: Serverless computing is a model where infrastructure such as servers and their underlying software is owned and managed by cloud computing platforms. It is moving data pipelines to cloud-based environments such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure. With the cloud, you pay only for what you use and scale as per demand.
These changes in data pipelines will make our lives easier, more connected, and full of quick, smart decisions. It’s an exciting journey into the future of data!
ignio’s Take
ignio employs data pipelines across its suite of products to deliver efficient and intelligent solutions for various IT and business operations. It uses both streaming and batch pipelines. It uses streaming data pipelines for monitoring, anomaly detection, prediction, and automated event and incident management. ignio uses batch data pipelines for maintaining a data warehouse of raw metrics, events, and logs. This historical data is then analysed to profile normal behaviour, mine trends and patterns, and identify opportunities for continuous improvement.
ignio follows the ELT process to perform analysis of historical data. Data is first loaded as-is with basic validations, and is transformed into the required format, granularity, and scope before any analysis. An ELT approach is followed because data transformation is a time-consuming activity and different transformations are required for different types of analysis. For example, profiling metric data needs down-sampling into uniform time intervals, while profiling events needs text processing of event descriptions before analysis.
On the other hand, ignio follows an ETL approach for real-time analysis. Here, the data size is usually small, so transformation is done beforehand and the data is stored in the various formats required for analysis. For example, real-time information on job runs is captured, transformed, and stored in a staging area, from where different analyses are run to perform forecasting and predictions.
By harnessing the power of data pipelines, ignio helps organizations drive efficiency, reduce operational costs, and enhance the overall quality of their services and applications.
Conclusion
Data pipelines are like the quiet champions in the world of data. They work behind the scenes, turning messy data into valuable insights that help businesses make smart choices and succeed. By picking the right way to do things (ETL or ELT) and when to do them (batch or stream), organizations can make the most of their data pipelines and stay ahead in the ever-changing world of data. As we move forward, we can expect data pipelines to play an even bigger role in how businesses work and come up with new ideas. It’s an exciting journey into the future of data-driven innovation!