In last week’s article, we discussed some of the potential risks in failing to proactively manage ROT and over-retained data in a world where data is growing at an exponential rate. As promised, in this week’s article, we’ll explore some of the ways to identify, manage and delete ROT data.
If managing ROT data manually, without the help of software, were easy and time-efficient, there would be no need for this blog post. As it stands, many organizations rely on end users to manage the very same data they create. Since those end users all have day jobs with competing priorities, data management tasks often fall by the wayside – and that’s not even accounting for the data of departing employees or legacy organizations and systems. It should be no surprise, then, that ROT data accumulates quickly over time.
Getting Started
As you begin your quest to free your organization of ROT data, it helps to have a few resources at hand: a recent data map, an understanding of your organization’s data retention and legal hold schedules and processes, or, even better, direct support and coordination from your records management stakeholders.
Now that you’re prepped and ready to go with your data map and GRC schedules, it’s time to identify which data is ROT. The harsh reality is that unless your organization has few employees and a small data footprint, you’re going to need some help from technology to identify ROT data.
Using Technology to Locate, Identify, and Classify ROT Data
While many organizations have the benefit of platforms like Office 365, even Office 365 falls short of managing all organizational data. Enterprise governance software like Rational Governance can be used to scan and search all data sources within your organization. Once your data is searchable from a central location, you can determine what data qualifies as ROT for the purposes of your organization.
Locating duplicative and outdated data is the first step in eliminating ROT content across your organization. Most enterprise governance platforms provide some level of search and data visualization tools that can help:
- 1) identify documents with key ROT data considerations such as document age, documents with exact or near duplicates, or departed custodians; and
- 2) locate where these redundant and obsolete documents reside, as well as the people or departments with which those documents are associated.
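To make those considerations concrete, here is a minimal sketch of how a script might flag documents by exact-duplicate hash, age, and departed custodian. It is an illustration only, not how any particular governance platform works: the inventory records, the `DEPARTED` set, and the seven-year threshold are all assumptions made up for the example, and it does not attempt near-duplicate detection.

```python
import hashlib
from collections import defaultdict
from datetime import date

# Hypothetical inventory records; a real governance index would supply these.
documents = [
    {"path": "/hr/policy_v1.docx", "content": b"Leave policy",
     "modified": date(2012, 3, 1), "custodian": "alice"},
    {"path": "/archive/policy_v1.docx", "content": b"Leave policy",
     "modified": date(2012, 3, 1), "custodian": "bob"},
    {"path": "/hr/policy_v2.docx", "content": b"Leave policy, revised",
     "modified": date(2024, 1, 15), "custodian": "alice"},
]
DEPARTED = {"bob"}   # assumed list of departed custodians
MAX_AGE_YEARS = 7    # assumed threshold, not a recommendation


def flag_rot_candidates(docs, today, max_age_years=MAX_AGE_YEARS, departed=DEPARTED):
    """Return {path: sorted list of reasons} for documents worth a closer look."""
    # Group paths by content hash so exact duplicates share a bucket.
    by_hash = defaultdict(list)
    for doc in docs:
        by_hash[hashlib.sha256(doc["content"]).hexdigest()].append(doc["path"])

    flags = defaultdict(set)
    for doc in docs:
        digest = hashlib.sha256(doc["content"]).hexdigest()
        if len(by_hash[digest]) > 1:
            flags[doc["path"]].add("exact_duplicate")
        if (today - doc["modified"]).days > max_age_years * 365:
            flags[doc["path"]].add("old")
        if doc["custodian"] in departed:
            flags[doc["path"]].add("departed_custodian")
    return {path: sorted(reasons) for path, reasons in flags.items()}


candidates = flag_rot_candidates(documents, today=date(2025, 1, 1))
```

A flag here only marks a document as a candidate for review. Retention schedules and legal holds still decide whether anything can actually be deleted.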
PRO-TIP
Use dashboards and data visualizations to assess data source locations and custodians associated with large amounts of redundant or obsolete data. Chances are that locations or custodians with a significant amount of old data will also be home to many other categories of data that have outlived their purpose.
Similarly, locations associated with large amounts of redundant data may represent legacy data that has been migrated to another location but never removed from the original one. It may be possible to delete this data in its entirety.
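If you do not have a dashboard handy, the same idea can be approximated with a simple count of flagged files per location. The sketch below assumes a hypothetical list of already-flagged paths and groups them by their first two folder levels; the paths and the `depth` setting are invented for illustration.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical paths already flagged as redundant or obsolete.
flagged_paths = [
    "/shares/legacy_crm/export1.csv",
    "/shares/legacy_crm/export2.csv",
    "/shares/legacy_crm/old/export3.csv",
    "/shares/finance/budget_2011.xlsx",
]


def count_by_location(paths, depth=2):
    """Count flagged files per location, keyed by the first `depth` folders."""
    counts = Counter()
    for p in paths:
        # Use the parent folder so the file name never ends up in the key.
        folders = PurePosixPath(p).parent.parts[1:1 + depth]
        counts["/" + "/".join(folders)] += 1
    return counts


# Locations sorted from most to least flagged files.
hotspots = count_by_location(flagged_paths).most_common()
```

Locations at the top of `hotspots` are the natural places to start a closer review.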
Once you’ve identified redundant and obsolete data, you might want to learn more about the content of this data. In this circumstance, unsupervised learning tools like clustering can be especially helpful.
Clustering organizes documents that the algorithm determines to be conceptually similar into groups (or “clusters”) and names the clusters based on themes that are pertinent to the conceptual nature of documents in each cluster. These clusters can be used to gain quick insight into the redundant and/or obsolete data that you’ve identified and may provide pointers to further exploration prior to deleting data.
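For a feel of what clustering does, here is a deliberately tiny sketch that groups documents by word overlap (Jaccard similarity) and names each group by its most common terms. This is a toy stand-in, not the algorithm any commercial tool uses: the stopword list, the 0.3 threshold, and the sample titles are all assumptions, and real conceptual clustering works on much richer representations of the text.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "for", "and", "of", "to", "in", "on"}  # assumed, minimal


def tokens(text):
    """Lowercased set of words, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(texts, threshold=0.3):
    """Greedy clustering: join the first cluster whose seed document is similar enough."""
    token_sets = [tokens(t) for t in texts]
    clusters = []  # each cluster is a list of document indices; index 0 is the seed
    for i, ts in enumerate(token_sets):
        for members in clusters:
            if jaccard(ts, token_sets[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters, token_sets


def label(members, token_sets, n=2):
    """Name a cluster by its n most frequent terms (ties broken alphabetically)."""
    counts = Counter(w for i in members for w in token_sets[i])
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


texts = [
    "Quarterly budget forecast for finance",
    "Finance budget forecast revised",
    "Holiday party invitation and menu",
    "Holiday party menu options",
]
clusters, token_sets = cluster(texts)
labels = [label(members, token_sets) for members in clusters]
```

Here the four titles fall into two groups, one about budgets and one about the holiday party, which is the kind of quick thematic overview clustering is meant to provide.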
In addition to, or as an alternative to, unsupervised learning, you might consider reviewing statistical samples of both the redundant data results and the obsolete data results. Reviewing a statistical sample of data can give you a better understanding of the content you are preparing to delete so that you can delete with confidence.
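How big should a sample be? One common approach, sketched below, is Cochran's sample size formula with a finite population correction, followed by a simple random draw. The 95% confidence level, 5% margin of error, population of 10,000 documents, and the `doc-N` identifiers are all assumptions chosen for illustration; your records or legal team may require different parameters.

```python
import math
import random
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.05, proportion=0.5):
    """Cochran's formula with finite population correction."""
    # z-score for a two-sided confidence interval (about 1.96 at 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    adjusted = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(adjusted))


# Hypothetical set of documents flagged as obsolete.
doc_ids = [f"doc-{i}" for i in range(10_000)]

n = sample_size(len(doc_ids))
# A fixed seed makes the draw reproducible, which helps if the review is audited.
review_sample = random.Random(42).sample(doc_ids, n)
```

With these assumed settings, a reviewer would look at 370 documents rather than all 10,000.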
The Big T in the Room: Trivial Data
You may have noticed we’ve yet to address the big T in the ROT room: Trivial Data. Identifying trivial data requires more nuance than the age or duplicate hash value of a document can provide. In our last post, we defined trivial data as purposeless, untouched data that merely takes up space on servers.
When we think of trivial data, we can think of an email inbox – and not just junk mail. Attend one conference and suddenly the number of sales emails increases by 10-20 per day – and that doesn’t even include the related notification emails from social media sites like LinkedIn. Other trivial data may include old daily to-do lists, agendas and notes from recurring meetings that happened years ago, company newsletters, pre-writing, and notes that went into presentations or articles that served no other purpose. Trivial data is a broad category, and its nature can vary greatly across employees and organizations.
Tools like sampling and clustering are great ways to start evaluating your data for potentially trivial documents. Consider also starting with targeted searches for terms known to indicate triviality (e.g., emails that contain the word “unsubscribe”). Tools like clustering and sampling can then be used to gain further insight into the results and identify samples of trivial documents that can be used to teach machine learning algorithms to help identify more of these trivial documents across your organization.
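A targeted search of this kind can be as simple as a case-insensitive pattern match. In the sketch below, only “unsubscribe” comes from the example above; the other two terms, the message records, and their IDs are hypothetical additions for illustration.

```python
import re

# "unsubscribe" is the example term from the text; the other two are assumed.
TRIVIAL_TERMS = re.compile(
    r"\b(unsubscribe|view in browser|manage preferences)\b", re.IGNORECASE
)

# Hypothetical email records.
messages = [
    {"id": "m1", "body": "Big savings this week! Click here to Unsubscribe."},
    {"id": "m2", "body": "Attached is the signed vendor contract for your records."},
    {"id": "m3", "body": "Trouble reading this? View in browser."},
]


def find_trivial_candidates(msgs):
    """Return IDs of messages containing any term that hints at triviality."""
    return [m["id"] for m in msgs if TRIVIAL_TERMS.search(m["body"])]


seed_ids = find_trivial_candidates(messages)
```

The hits are only candidates. A person should still confirm them before they are used as training examples.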
Unfortunately, there is a second harsh reality: any technology company that promises its tool will do all the work for you out of the box, or hand you an “easy” button for ROT data cleanup, is lying to you. Effectively identifying and deleting ROT data requires both a technical solution and humans who understand your organization’s data and can tailor the tool to the needs of the enterprise. There is no single classifier and no magic search bullet that will find what you need. The most accurate results come from combining keyword search, metadata search, pattern recognition, and machine learning output.
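One way to picture that combination is a weighted score that blends the four kinds of evidence. The sketch below is purely illustrative: the `Signals` fields, the weights, the seven-year age cap, and the 0.6 review threshold are all invented, and choosing sensible values is exactly the tailoring that people who know your data must do.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    keyword_hit: bool      # keyword search, e.g. body contains "unsubscribe"
    age_years: float       # metadata search, e.g. years since last modified
    bulk_sender: bool      # pattern recognition, e.g. sender looks like a mailing list
    model_score: float     # machine learning output between 0.0 and 1.0 (hypothetical)


# Assumed weights that sum to 1.0; real values would need tuning and review.
WEIGHTS = {"keyword": 0.3, "age": 0.2, "pattern": 0.2, "model": 0.3}
MAX_AGE_YEARS = 7.0        # age contribution is capped at this point
REVIEW_THRESHOLD = 0.6     # assumed cut-off for routing a document to review


def rot_score(s):
    """Blend the four signals into a single score between 0.0 and 1.0."""
    age_component = min(s.age_years / MAX_AGE_YEARS, 1.0)
    return (
        WEIGHTS["keyword"] * s.keyword_hit
        + WEIGHTS["age"] * age_component
        + WEIGHTS["pattern"] * s.bulk_sender
        + WEIGHTS["model"] * s.model_score
    )


def needs_review(s):
    return rot_score(s) >= REVIEW_THRESHOLD


newsletter = Signals(keyword_hit=True, age_years=10, bulk_sender=True, model_score=0.9)
contract = Signals(keyword_hit=False, age_years=1, bulk_sender=False, model_score=0.1)
```

In this example the old newsletter scores well above the threshold, while the recent contract does not. No single signal decides the outcome on its own.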
ROT Eliminated, for now…
Congratulations! By prioritizing time and resources to identify and delete ROT data, you’ve neutralized untold legal, regulatory, and cyber risks. In the process, you’ve also reduced storage costs, opened disk space, and perhaps even increased network speed along the way.
Not so fast, we aren’t done yet.
We may have cleaned up our current data mess, but as we referenced in our post last week, data continues to grow exponentially year over year. The last thing you want to do is repeat this exercise in a few years. Proactive data management is a critical step in preventing future ROT data accumulation. In the next post, we’ll explore proactive (and even automated) measures your organization can implement today to avoid another time-consuming and labor-intensive ROT data cleanup exercise in the future.