In today’s data-driven world, organizations constantly generate or accumulate vast amounts of data from various sources. Data can be compared to a staple crop: either it is grown in-house with significant effort, or it is exported or imported for others to consume. Data is an asset that organizations must collect, organize, and maintain to provide insights, improve decision making, and drive innovation. However, data does not stay fresh forever, and no organization has an infinite pool of storage and resources to keep collecting it without thinking about maintenance.
In previous blogs, we talked about how different databases can be used to store this data. This blog explores data maintenance practices, focusing on the types of data and their archival and retention periods, with consideration of the quality, security, and compliance aspects of any industry.
What is Data Maintenance?
Data maintenance covers the journey of data before it becomes obsolete. During this journey, the data is:
- Collected and stored
- Formatted and cleansed
- Backed up/archived
- Purged
Any data that is generated or collected follows its own journey before becoming obsolete or unusable. How long the data remains valid may be driven by:
- Compliance — for example, a regulatory board deciding on how long the transactions should be maintained for clients before taking action
- Historical evidence and reporting — for example, a bank providing the ability to view a transaction that is not older than 6 years.
- Analysis — for example, a retail store maintaining its sales for the last 3 years to do analysis on the footfall and take strategic actions.
- Debugging — a system maintaining logs for the last 30 days to debug any discrepancy or issue reported.
In each of the above cases, the retention duration is decided by a different owner. There are also cases where the same data is used for different purposes and therefore follows different maintenance durations. Before diving into the details of the data maintenance process, let us first look at the data types this blog touches upon and the maintenance relevance of each.
Reference data
What data and what is it used for: This is the categorical or master data that becomes the classification source for other data in any system. Most of the time this does not change with time or changes minimally over a prolonged period of time. For example, all supported time zones in any application could be reference data.
How long does it remain valid: Typically, reference data remains valid until the system using it is retired.
Operational data
What data and what is it used for: The data that is used actively in day-to-day operations. This kind of data is generated in real time or near real time. For example, it can be employee data, inventory data in the case of a retail store, or the sensor data captured daily when an employee logs in at the office.
How long does it remain valid: The validity of this data varies from months to years, depending on organization policy and compliance requirements.
Transactional data
What data and what is it used for: Transactional data records the transactions performed within the application. It can be a financial transaction in a banking application or a sale of goods in a retail store. Transactional data consists of a timestamp, transaction ID, type, details of the transaction, and status. Typically, transactional data is huge in volume.
How long does it remain valid: The validity of transactions differs by purpose. For example, a monitoring tool may be interested only in the transactions happening in real time, but the same transactions become the backbone of historical analysis and hence need to be maintained for years for some business purposes.
Analytical data
What data and what is it used for: Analytical data comes from one or more sources and is processed, transformed, and aggregated to make it suitable for analysis. The insights gained can be used to make strategic decisions for an organization. This data is also used in reporting, mining, business intelligence, and similar activities. It is organized and optimized for querying and reporting purposes.
How long does it remain valid: The retention period of analytical data is typically longer than that of operational data. It can span several years, as the data is used for analyzing trends and patterns, historical comparisons, and reporting.
Timeseries data
What data and what is it used for: Timeseries data is collected at specific time intervals, say every minute, hour, day, or week. This type of data is also used to analyze trends, patterns, and changes in data over a period of time.
How long does it remain valid: The retention period of timeseries data can range from weeks to months to years, depending on the organization’s business requirements.
Logs, audit, or event data
What data and what is it used for: Log or audit data consists of entries generated by various components of the application. These entries cover several types of events, such as service start and stop logs, user action logs, security events, performance logs, application logs, and authentication and authorization events. The entries are recorded in the order in which the events occur. This type of data is especially useful for debugging issues in the application, for security investigations, and during audits.
How long does it remain valid: The retention period for log/audit data can range from months to years or more, as these logs are needed for investigations, compliance, and similar purposes. This is a classic example where recent data needs to be more accessible for debugging, while older data can be archived.
Observations
What data and what is it used for: This data holds the observations derived from the analytical data. It is typically used for gaining insights into system or business behavior and making decisions based on them. For example, a sales report showing a major drop in sales in the Europe region would help an organization strategize its marketing plan for Europe.
How long does it remain valid: The retention period for this data typically varies from months to years. Most of the time, this data becomes obsolete as soon as a fresh analysis is available. Hence, the retention also depends on the frequency at which the analysis is run.
Can Data just be stored for infinite time?
Often, there is a tendency to collect data without worrying about cleansing or purging it. For example, timeseries data about CPU utilization collected by a monitoring tool becomes unusable after a certain period, once the applications running on the server change their behavior. Similarly, job run durations from a scheduler tool, stored for years after the job behavior has changed, may be needed only for auditing purposes and not for any analysis.
In any case, information cannot be stored forever, as it adds not only to the storage size but also to the costs. Hence, it is extremely important to give enough thought to tracking the age of data and making decisions based on it.
Unfortunately, planning for data purging is often the last point in the design decisions, or is sometimes ignored altogether. It then arrives as an afterthought, at a stage when changing the design to accommodate it has become much harder.
Diverse ways of letting data go
Setting up as a configuration in database
Typically, NoSQL databases used for storing big data can be configured to automatically delete data that is older than a certain period. For example, setting the TTL (time to live) in HBase for a column family or for a cell ensures that data older than the configured period is automatically deleted.
This allows logs, timeseries, or observation data stored in such databases to be purged automatically once it loses its relevance.
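The TTL idea can be illustrated with a small in-memory sketch. This is not the HBase API; the class `TtlStore` and its methods are hypothetical names chosen for illustration, and the sketch only mimics the behavior that a value is treated as deleted once its age exceeds the configured TTL.

```python
import time
from typing import Any, Dict, Optional, Tuple


class TtlStore:
    """Hypothetical in-memory store that mimics TTL-based expiry.

    This is NOT the HBase API; it only illustrates the idea that a value
    is treated as deleted once it is older than the configured TTL.
    """

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._rows: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any, now: Optional[float] = None) -> None:
        # Record the write time so the age can be computed later.
        written_at = time.time() if now is None else now
        self._rows[key] = (written_at, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        current = time.time() if now is None else now
        row = self._rows.get(key)
        if row is None:
            return None
        written_at, value = row
        if current - written_at > self.ttl_seconds:
            # Expired: drop the row lazily on read.
            del self._rows[key]
            return None
        return value

    def purge_expired(self, now: Optional[float] = None) -> int:
        """Remove all expired rows and return how many were removed."""
        current = time.time() if now is None else now
        expired = [key for key, (written_at, _) in self._rows.items()
                   if current - written_at > self.ttl_seconds]
        for key in expired:
            del self._rows[key]
        return len(expired)
```

The optional `now` argument exists only so that the expiry logic can be exercised without waiting for real time to pass.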
Partitioning of database
Partitioning a database splits the data so that different sections or periods can be accessed, managed, or maintained separately. A classic use is to divide transaction data by period, for example storing each quarter’s data in its own partition. Thus, if an organization has a policy to delete anything older than a year, the maintenance policy can be applied to the relevant quarters, for example by dropping whole partitions.
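As a rough sketch of the quarterly scheme, the following Python maps a transaction date to a partition name and lists the partitions that fall entirely outside a one-year retention window. The naming format (`2023_Q2`) and the function names are assumptions made for illustration; real databases have their own partition syntax.

```python
from datetime import date, timedelta
from typing import Iterable, List


def quarter_key(d: date) -> str:
    """Return a partition name such as '2023_Q2' for a transaction date."""
    quarter = (d.month - 1) // 3 + 1
    return f"{d.year}_Q{quarter}"


def quarter_end(key: str) -> date:
    """Return the last calendar day covered by a partition key."""
    year_text, quarter_text = key.split("_Q")
    year, quarter = int(year_text), int(quarter_text)
    if quarter == 4:
        return date(year, 12, 31)
    # First day of the next quarter, minus one day.
    return date(year, quarter * 3 + 1, 1) - timedelta(days=1)


def partitions_to_drop(existing: Iterable[str], today: date,
                       retention_days: int = 365) -> List[str]:
    """List partitions whose newest possible row is older than the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(key for key in existing if quarter_end(key) < cutoff)
```

A partition is selected only when its last day is already past the cutoff, so a quarter that still contains some in-policy rows is kept until all of it has aged out.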
Handling in the application
This mechanism is most effective when the knowledge of data maintenance is embedded in the application. For example, a banking application may be designed so that transactions up to 6 months old are served from the real-time application and older ones from the archive. In such cases, the application supporting the transactions includes logic to archive and purge any data older than 6 months from the real-time system.
Automatic rolling up of data to reduce the data volume
This mechanism reduces data volume without purging the data outright. It is applicable to timeseries data. A typical time series is collected at a regular interval, but as the data gets older, its granularity can be reduced for analysis. For example, CPU utilization collected every 5 minutes can be rolled up to an hourly maximum once it is older than 2 months. This reduces the data stored and enables quicker analysis and better observations.
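The CPU utilization example can be sketched as follows. This is a minimal illustration, assuming samples are held as `(timestamp, value)` pairs and that "2 months" is approximated as 60 days; the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Sample = Tuple[datetime, float]


def roll_up_old_samples(samples: List[Sample], now: datetime,
                        keep_raw_for: timedelta = timedelta(days=60)) -> List[Sample]:
    """Keep recent samples as-is; replace older ones with an hourly maximum."""
    cutoff = now - keep_raw_for
    recent: List[Sample] = []
    hourly_max: Dict[datetime, float] = defaultdict(lambda: float("-inf"))

    for ts, value in samples:
        if ts >= cutoff:
            recent.append((ts, value))
        else:
            # Truncate the timestamp to the start of its hour.
            hour = ts.replace(minute=0, second=0, microsecond=0)
            hourly_max[hour] = max(hourly_max[hour], value)

    # Rolled-up points are all older than the cutoff, so they come first.
    return sorted(hourly_max.items()) + sorted(recent)
```

With 5-minute collection, each hour of old data shrinks from twelve points to one, while the peak value that matters for capacity analysis is preserved.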
Periodic purging as a scheduled activity
This is the most commonly used mechanism when purging or maintenance is not built into the application, or when the data is used by multiple applications. In this case, external scripts or jobs are scheduled to purge the data from the databases. These run at a scheduled interval to archive and delete any data that has been marked as obsolete under the retention strategy.
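A purge job of this kind might look like the sketch below, which uses SQLite from the Python standard library purely for illustration. The table names `app_logs` and `app_logs_archive`, the column names, and the 30-day default are assumptions; scheduling itself (cron or any job scheduler) is outside the sketch.

```python
import sqlite3
from datetime import datetime, timedelta


def archive_and_purge(conn: sqlite3.Connection, now: datetime,
                      retention_days: int = 30) -> int:
    """Copy rows older than the cutoff to an archive table, then delete them."""
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO app_logs_archive (logged_at, message) "
            "SELECT logged_at, message FROM app_logs WHERE logged_at < ?",
            (cutoff,),
        )
        deleted = conn.execute(
            "DELETE FROM app_logs WHERE logged_at < ?", (cutoff,)
        ).rowcount
    return deleted


def demo() -> tuple:
    """Build a throwaway in-memory database and run one purge cycle."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE app_logs (logged_at TEXT, message TEXT)")
    conn.execute("CREATE TABLE app_logs_archive (logged_at TEXT, message TEXT)")
    now = datetime(2024, 6, 1)
    # Timestamps are stored as ISO 8601 text, which sorts chronologically.
    rows = [((now - timedelta(days=age)).isoformat(), f"event {age}")
            for age in (1, 10, 45, 90)]
    conn.executemany("INSERT INTO app_logs VALUES (?, ?)", rows)
    deleted = archive_and_purge(conn, now)
    live = conn.execute("SELECT COUNT(*) FROM app_logs").fetchone()[0]
    archived = conn.execute("SELECT COUNT(*) FROM app_logs_archive").fetchone()[0]
    conn.close()
    return deleted, live, archived
```

Archiving and deleting inside a single transaction means a failure partway through leaves the live table untouched, so no row is deleted without first being archived.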
Making it mandatory for customers to take a call
When a product or application consumes data created at the customer’s end, the ownership of the data remains with the customer. The retention period of such data is decided by the customer rather than the product. So, it is important to provide a configuration option that lets the customer set the retention period.
Importance of backing up/archiving before purging
Data archival is the process of moving data that is no longer actively used to separate storage for long-term retention. This helps organizations keep historical data for compliance, regulatory, and R&D requirements. It also allows such data to be placed on different storage media, for example tape or cloud storage.
Transferring data to tape is slow, whereas transferring it to cloud storage is generally faster. Cloud storage can be accessed from anywhere in the world, whereas a tape has to be carried physically. Cloud storage is therefore usually the more convenient option, but it is more expensive than tape storage. If cost is a concern, choose the storage medium accordingly.
Many tools are available in the market for data archival, such as Google Vault, Bloomberg Vault, Mimecast Vault archive, and many more.
Working on historical or archived data
Data that is seldom accessed but needs to be retained for an extended period to meet compliance, security, and other requirements is typically archived. Archived data must occasionally be retrieved for these purposes.
Archived data can be retrieved through the same tool that was used to archive it. For example, all the above-mentioned tools provide a way to retrieve the archived data as well as export it to Excel, PDF, or any other supported format.
To retrieve data from a tape, insert the tape into a tape drive, mount the drive, and extract the data using the tar command. The extracted files can then be moved to the required location.
To retrieve data from the cloud, Azure, Google, and Amazon clouds offer retrieval and restore options, with which the data can be restored to the required location. Various cloud providers also offer an option to download the data. In addition, a date range (start date to end date) can be specified to retrieve only the required data.
The time required for retrieval can vary from seconds to minutes to hours or more, depending on the data volume and the source, such as tape, cloud storage, or another device. Retrieval also involves cost, depending on the storage type and the time required for retrieval or restore.
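Retrieval from a tar archive restricted to a date range can be sketched with Python’s standard `tarfile` module. This is only an illustration: it assumes the archive is already readable as a tar file at `archive_path`, and it interprets "date range" as each file’s modification time, which may differ from how a given archival tool filters data. The function names are hypothetical.

```python
import io
import tarfile
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def extract_by_date_range(archive_path: str, dest_dir: str,
                          start: datetime, end: datetime) -> List[str]:
    """Extract regular files whose modification time lies in [start, end]."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted: List[str] = []
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            modified = datetime.fromtimestamp(member.mtime, tz=timezone.utc)
            if not (start <= modified <= end):
                continue
            # Use only the base name so a crafted path cannot escape dest_dir.
            target = dest / Path(member.name).name
            source = tar.extractfile(member)
            if source is None:
                continue
            with source, open(target, "wb") as out:
                out.write(source.read())
            extracted.append(target.name)
    return sorted(extracted)


def demo() -> List[str]:
    """Build a tiny archive in a temp directory and retrieve one month of it."""
    samples = (("march.log", datetime(2023, 3, 10, tzinfo=timezone.utc)),
               ("may.log", datetime(2023, 5, 10, tzinfo=timezone.utc)))
    with tempfile.TemporaryDirectory() as tmp:
        archive = str(Path(tmp) / "logs.tar")
        with tarfile.open(archive, "w") as tar:
            for name, when in samples:
                payload = name.encode()
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                info.mtime = int(when.timestamp())
                tar.addfile(info, io.BytesIO(payload))
        return extract_by_date_range(
            archive, str(Path(tmp) / "restored"),
            datetime(2023, 5, 1, tzinfo=timezone.utc),
            datetime(2023, 5, 31, tzinfo=timezone.utc),
        )
```

Filtering before extraction keeps the restore small, which matters because retrieval time and cost grow with the volume of data pulled back.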
ignio’s take
ignio deals with a variety of structured and unstructured data. It ranges from information collected from customer servers/nodes for defined metrics, such as performance metrics (memory, CPU utilization), to the observations generated from it. It also ranges from daily tickets for taking actions on a target machine to weekly analysis runs that generate governance reports.
As a part of the data maintenance policy, ignio believes that:
- any observations older than 30 days become obsolete and hence analysis must be re-run to generate fresh observations.
- any customer estate information should be controlled by the customer policy which is to be defined and communicated to ignio.
- any customer timeseries data loses its relevance typically after a period of 2 years and hence must no longer be used for analysis.
- any tickets generated in ignio must be readily available for analysis for 30 days and beyond that must be fetched from the archive.
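The policy above can be expressed as a small lookup, shown here as a sketch. This is not ignio’s implementation: the function name, the data type labels, and the returned handling labels are hypothetical, and "2 years" is approximated as 730 days. Only the durations come from the policy points above.

```python
from typing import Optional

# Ages in days, taken from the policy points above; "2 years" is approximated as 730.
OBSERVATION_MAX_AGE = 30
TIMESERIES_MAX_AGE = 730
TICKET_LIVE_AGE = 30


def disposition(data_type: str, age_days: int,
                customer_retention_days: Optional[int] = None) -> str:
    """Return a hypothetical handling label for a record of the given age."""
    if data_type == "observation":
        return "use" if age_days <= OBSERVATION_MAX_AGE else "rerun_analysis"
    if data_type == "timeseries":
        return "use" if age_days <= TIMESERIES_MAX_AGE else "exclude_from_analysis"
    if data_type == "ticket":
        return "live" if age_days <= TICKET_LIVE_AGE else "fetch_from_archive"
    if data_type == "customer_estate":
        if customer_retention_days is None:
            # The customer has not communicated a policy yet.
            return "await_customer_policy"
        return "use" if age_days <= customer_retention_days else "expired_per_customer"
    raise ValueError(f"unknown data type: {data_type}")
```

Keeping the durations as named constants in one place makes the policy easy to review and change without touching the logic that applies it.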
Conclusion
In conclusion, effective data maintenance involves understanding the types of data, defining retention periods, and then leveraging the capabilities of your chosen database. It also involves defining a policy for data archival and retention. Regular cleanup of the database helps an organization optimize database performance, which in turn improves application performance and helps with compliance and auditing.
By using the archival and cleanup features of a database, one can also develop a customized data maintenance strategy to suit the organization’s needs.