At the time of writing, there is no universal standard for data hygiene. Data hygiene is the process of cleaning datasets or groups of data to ensure they’re accurate and organised. There isn’t an International Organization for Standardization (ISO) reference number you can point to to show you’re ‘doing data right’. The most reliable indicator of success you have is whether your AI projects are delivering accurate results or insights. Our decision intelligence platform is only as good as the information it is trained on, so it’s essential to be confident in your data sources.
You’d be surprised where some of the data behind projects of global importance came from. After the December 2001 collapse of energy company Enron, a copy of a database of 600,000 of its emails was purchased from the Federal Energy Regulatory Commission (FERC) by computer scientist Andrew McCallum for $10,000. The material, considered to be in the public domain after the FERC investigation, was dubbed the ‘Enron Corpus’ and it became the bedrock for many data science and natural language processing projects. At first the Corpus (totalling 1.7 million messages and 160GB of data) was circulated only on hard drives, but it was later made available as a MySQL database and stored in the cloud on Amazon S3. It’s there right now if you want to wade through a mix of meeting requests and office politics.
If you think that’s odd, you’ll be amazed to know that the data used to train large language models like OpenAI’s GPT, Meta’s Llama and Google’s Bard included conversations from the internet’s favourite discussion website, Reddit.
Your organisation holds data falling into two broad categories: structured data, held in common formats like spreadsheets for interrogation, and unstructured data, made up of text, facts and media that were not designed to be gathered in a database, like the material used to train the aforementioned large language models.
Galvia’s decision intelligence platform focuses on the value that can be unlocked through structured data designed to be searched and analysed. But structured data has its own problems, and if you fail to pay attention to ‘data hygiene’ you can end up with bad inputs that lead to the failure of projects. There are ways to ensure this does not become a problem.
1. Stay up to date with policy
In the five years since it came into effect, GDPR has become a global model for how data is stored and processed. Despite leaving the EU, the UK has largely retained it, and in the US the California Consumer Privacy Act has introduced the right to access and delete personally identifiable data.
One of GDPR’s core issues is how long data can be retained. The answer is for as long as you need it, but only ever for its stated purpose. Your criminal record can travel with you for the rest of your life, but the fact that you were a person of interest in a case ceases to be useful at the conclusion of an investigation.
From an organisational perspective, information on the purchasing history of a client is useful but as soon as that person retires, passes away or their business folds then that information is no longer useful and has to be deleted.
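As a rough illustration of that rule, here is a minimal Python sketch of a retention check. The record fields, the `RETENTION_DAYS` table and the two-year window are all hypothetical; your own retention periods should come from your data protection policy, not from this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class CustomerRecord:
    customer_id: str
    purpose: str              # the stated purpose the data was collected for
    last_active: date         # last date the data was used for that purpose
    relationship_ended: bool  # e.g. client retired or the business folded


# Hypothetical retention windows per stated purpose, in days.
RETENTION_DAYS = {"purchasing_history": 365 * 2}


def is_due_for_deletion(record: CustomerRecord, today: date) -> bool:
    """Return True when a record should be flagged for deletion."""
    if record.relationship_ended:
        return True  # the stated purpose no longer applies
    limit = RETENTION_DAYS.get(record.purpose)
    if limit is None:
        return True  # no recognised purpose on file, so flag it
    return today - record.last_active > timedelta(days=limit)
```

Running a check like this on a schedule gives you a list of records to review, which is safer than deleting automatically on the first pass.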
The spectre of a GDPR fine (up to €20 million or 4% of global annual turnover, whichever is higher) far outstrips the potential benefits of holding on to data you no longer need. There’s no excuse not to keep up with regulatory changes, which inform our other advice on data hygiene.
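To make the ‘whichever is higher’ rule concrete, the upper limit of a fine can be sketched in a few lines of Python. The function name is ours, and the figure it returns is only the ceiling; the actual amount of any fine is decided by the regulator.

```python
def max_gdpr_fine_eur(global_annual_turnover_eur: int) -> float:
    """Upper limit of a GDPR fine: the higher of EUR 20 million or 4% of turnover."""
    fixed_cap_eur = 20_000_000
    turnover_cap_eur = global_annual_turnover_eur * 4 / 100
    return max(fixed_cap_eur, turnover_cap_eur)


# A company turning over EUR 100 million: 4% is EUR 4 million, so the EUR 20 million cap applies.
small = max_gdpr_fine_eur(100_000_000)

# A company turning over EUR 1 billion: 4% is EUR 40 million, which is the higher figure.
large = max_gdpr_fine_eur(1_000_000_000)
```

The crossover point is a turnover of €500 million. Below that, the €20 million figure is the ceiling; above it, the 4% figure takes over.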
2. Watch for simple errors
Databases are only as reliable as the people using them. Problems range from duplicate entries that have to be resolved to simple errors in inputting information. Something as simple as an errant comma at the end of a number can create an accidental value that falls outside the boundaries of what’s useful and gets cast aside as an outlier.
Simple human errors happen, and employees can develop bad habits over time in the absence of supervision. Regular audits of data collection practices, covering checks such as deduplication and ACID compliance, can be the difference between a reliable and an unreliable data source. As with many things, the problem can be somewhere between the chair and the keyboard.
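The kind of checks described above can be sketched in plain Python. This is an illustration only, not how any particular platform does it: the `(customer_id, amount)` row layout, the function names and the accepted range are assumptions made for the example, and commas inside a number are assumed to be thousands separators.

```python
def parse_amount(raw: str):
    """Return the number as a float, or None if the text is not a clean number."""
    text = raw.strip().strip(",").strip()  # drop stray leading/trailing commas
    text = text.replace(",", "")           # assumption: inner commas are thousands separators
    try:
        return float(text)
    except ValueError:
        return None


def clean_rows(rows, low, high):
    """Split (customer_id, raw_amount) rows into kept rows and rejected rows with a reason."""
    seen = set()
    kept, rejected = [], []
    for customer_id, raw_amount in rows:
        key = customer_id.strip().lower()  # normalise IDs so 'A001' and 'a001 ' match
        amount = parse_amount(raw_amount)
        if key in seen:
            rejected.append((customer_id, raw_amount, "duplicate"))
        elif amount is None:
            rejected.append((customer_id, raw_amount, "not a number"))
        elif not (low <= amount <= high):
            rejected.append((customer_id, raw_amount, "out of range"))
        else:
            seen.add(key)
            kept.append((key, amount))
    return kept, rejected


rows = [("A001", "1,250"), ("a001 ", "1250"), ("B002", "300,"),
        ("C003", "abc"), ("D004", "9999999")]
kept, rejected = clean_rows(rows, low=0, high=100_000)
```

Keeping the rejected rows, along with the reason each one was rejected, matters as much as the clean output. It is the rejected list that tells you which bad habits are creeping into data entry.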
3. Avoid the rot
Storage media and applications come and go, and the equipment required to access them can become more specialised and harder to find. To maintain data hygiene, make sure that the data you do keep (or are allowed to keep) is accessible and not reliant on physical media or applications whose time has passed. Whatever side you take in the debate over whether to store your data in the cloud or on physical media, be sure that whatever you use is usable now. Data stored on degrading media can be subject to ‘disc rot’, and information corrupted over time can fall victim to its digital equivalent, ‘bit rot’.
Some experts recommend auditing your physical storage media every five years, and it’s no harm to adopt a similar strategy for software to stay current.
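One common way to spot silent corruption between audits is to record a checksum for every file and compare it later. The sketch below uses SHA-256 from Python’s standard library. The function names and the idea of keeping a ‘manifest’ of checksums are our own illustration, and a changed checksum only tells you a file is different or missing, not why.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its SHA-256 checksum."""
    root = Path(folder)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_changed(folder, manifest):
    """List files from an earlier manifest that are now missing or have different contents."""
    current = build_manifest(folder)
    return sorted(name for name, digest in manifest.items()
                  if current.get(name) != digest)
```

In practice you would save the manifest somewhere separate from the data itself when the archive is created, then run `find_changed` at each audit and restore anything it lists from a backup.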
Treating data as an asset requiring regular maintenance doesn’t have to be a big job, but it is essential.
Talk to us today about leveraging business growth from your data.