When learning how to go through the data cleansing process, practice really does make perfect. Although you will have to tweak your approach depending on the project you are working on, some critical checks will always be relevant.
Why is data quality important?
Data analysis can have a significant impact on a business. From analysing target audiences to deciding whether to expand a product or service offering, clean data is essential to pretty much every decision a business makes. The negative ramifications of working from dirty data are far-reaching.
Which first step should a data analyst take to clean their data?
The first two steps in this data cleansing method should be taken for every single project. They will ensure you don’t have to redo the work because of a miscommunication, or go back and add to the analysis after the fact:
- Save a backup of the original data
Trust us; even the most experienced analysts make mistakes. You’ll want the original data in case any errors are made in the cleaning process. It may be that you remove information that you deem unimportant, only to gain new information later down the line that reveals the deleted data was in fact useful.
Keeping the original also offers peace of mind. Rather than lying awake at night wondering if you made a mistake, you can hop into the original, double-check the information and either rest assured that you have done the right thing or fix the error.
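As a minimal sketch of this step, assuming your raw data sits in a single file, the backup can be as simple as copying that file into a separate folder before you touch it. The function name, the `backups` folder and the `_original` suffix are our own illustrative choices, not a standard.

```python
import shutil
from pathlib import Path


def back_up_original(source: str, backup_dir: str = "backups") -> Path:
    """Copy the raw file into a backup folder and return the copy's path."""
    src = Path(source)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme: sales.csv -> backups/sales_original.csv
    target = target_dir / f"{src.stem}_original{src.suffix}"
    if target.exists():
        # Never overwrite an existing backup: it is the only untouched copy.
        raise FileExistsError(f"Backup already exists: {target}")

    shutil.copy2(src, target)  # copy2 also preserves file timestamps
    return target
```

Refusing to overwrite an existing backup is deliberate: if you rerun the script halfway through cleaning, the untouched copy stays untouched.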
- Understand the reason behind the project and confirm the values you’ll be working with
It may feel as though you are being a nuisance, but a brief can be interpreted very broadly. Whenever you are asked to analyse a data set, ask enough questions to understand why the project is being undertaken, what data sets are essential AND what sets would be useful. Imagine how frustrating it would be to complete a project only to be asked a question that forces you to filter through all the data again to find the metrics that would answer it.
- Remove metrics that aren’t relevant
Okay, now it’s time to get to the good stuff, the stuff you got into the field to do! Data analysis projects start with sifting through information and ascertaining key metrics you want to base your evaluation and recommendations on. So, now is the time to remove anything that won’t help you. That includes information that is valuable, but not in the context of this project.
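To illustrate, here is a small sketch that keeps only the columns a project needs. It assumes the data has been loaded as a list of dictionaries (one per row); the column names are made up for the example.

```python
def keep_columns(rows, wanted):
    """Return new rows containing only the wanted columns."""
    return [{col: row[col] for col in wanted} for row in rows]


# Hypothetical data: favourite_colour may be valuable elsewhere,
# but it does not help an analysis of spend by region.
rows = [
    {"customer_id": "001", "region": "North", "spend": "120", "favourite_colour": "blue"},
    {"customer_id": "002", "region": "South", "spend": "85", "favourite_colour": "red"},
]

trimmed = keep_columns(rows, ["customer_id", "region", "spend"])
```

Because the function builds new rows rather than editing the originals, the full data set is still available if the brief changes.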
- Check for duplication and structural errors
Duplication happens for many reasons, particularly if you’ve had to combine data from different sources, so it’s essential to go through the motions of checking and removing copies in every project you undertake.
When we say structural errors, we are referring to things like typos, inconsistent capitalisation, stray white space and padding problems (for example, if all totals must have three digits, anything with fewer than three digits needs zeros added to the front), all of which can really mess up an analysis.
Something else to look out for is odd naming conventions, which can happen if you are merging datasets. An example would be one department using ‘N/A’ and another using ‘Not applicable’. They both mean the same thing but will be categorised separately. You must double-check consistency so that you don’t end up with split data.
The good news is that there are multiple ways to locate and fix things like typos rather than checking through them manually, which increases the chances of missing some; we are all human after all.
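One way to automate these fixes is sketched below. It is only an illustration: the column names, the list of ‘N/A’ variants and the three-digit rule are assumptions for the example, and your own rules will differ.

```python
# Assumed spellings that should all be treated as the same label.
NA_VARIANTS = {"n/a", "na", "not applicable"}


def normalise(row):
    """Fix white space, naming variants, capitalisation and padding in one row."""
    clean = {}
    for key, value in row.items():
        value = " ".join(value.split())  # trim and collapse white space
        if value.lower() in NA_VARIANTS:
            value = "N/A"  # one agreed label instead of several
        clean[key] = value
    if clean["department"] != "N/A":
        clean["department"] = clean["department"].title()
    clean["code"] = clean["code"].zfill(3)  # pad to three digits: "7" -> "007"
    return clean


def drop_duplicates(rows):
    """Keep the first copy of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"department": "  sales ", "code": "7", "notes": "N/A"},
    {"department": "Sales", "code": "007", "notes": "Not applicable"},
    {"department": "human  resources", "code": "42", "notes": "renewal due"},
]

cleaned = drop_duplicates([normalise(row) for row in rows])
```

Note the order: the first two rows only become identical once they have been normalised, so removing duplicates before fixing structural errors would have missed them.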
Missing values – how do you clean inconsistent data?
Make sure you consider missing values too, which you will come across more often than not. Have a plan in place on how to react to these. For example, if a particular column is missing a lot of values, it may be better to remove this column rather than working with information you can’t fully trust.
Missing values will contaminate your data, so unless you remove them from the picture altogether, you need to react to them rather than ignore them. That could mean something as simple as going back to the project leader, explaining the situation and asking how you should proceed.
Alternatively, you can make an educated assumption and fill in the missing values with an approximation. Tread carefully here, though, and make sure you flag when you do this. Another option is to simply ensure that any modelling showcases that the data is missing and displays it accordingly.
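The options above can be sketched roughly as follows. We assume missing values are stored as `None`, that a column missing more than half its values is too sparse to trust, and that the median is a reasonable approximation. All three are judgment calls to agree with your project leader, not rules.

```python
from statistics import median

MAX_MISSING = 0.5  # assumed threshold: drop columns missing more than 50%


def missing_share(rows, column):
    """Fraction of rows in which the column has no value."""
    missing = sum(1 for row in rows if row[column] is None)
    return missing / len(rows)


def fill_with_median(rows, column):
    """Fill gaps with the median and flag every value that was filled in."""
    known = [row[column] for row in rows if row[column] is not None]
    fill = median(known)
    filled = []
    for row in rows:
        was_missing = row[column] is None
        new_row = dict(row)
        new_row[column] = fill if was_missing else row[column]
        new_row[f"{column}_imputed"] = was_missing  # keep the approximation visible
        filled.append(new_row)
    return filled


# Hypothetical data set with gaps.
rows = [
    {"age": 34, "income": 30000},
    {"age": None, "income": None},
    {"age": 41, "income": None},
    {"age": 29, "income": None},
]

too_sparse = [c for c in ("age", "income") if missing_share(rows, c) > MAX_MISSING]
rows = [{k: v for k, v in row.items() if k not in too_sparse} for row in rows]
rows = fill_with_median(rows, "age")
```

The extra `age_imputed` column is the flag: anyone reading the results later can see exactly which values were approximated.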
- Consider data outliers
If you come across information that differs widely from everything else, you will need to work out whether it’s a mistake. If your investigations find that a piece of information is wrong, it can be filtered out. However, an outlier may end up being useful, so don’t simply filter it out because it doesn’t look right. Dig a little deeper to ascertain whether it holds any hidden value.
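A common way to spot candidates for investigation is the interquartile range (IQR) rule, sketched below. The multiplier of 1.5 is a widely used convention rather than a law, and the figures are invented. The function only flags values; it deliberately does not delete them.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the middle 50% of the data."""
    q1, _, q3 = quantiles(values, n=4)  # first, second and third quartiles
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical order values: one of them looks very different from the rest.
order_values = [52, 48, 55, 50, 47, 53, 49, 51, 500]

suspects = flag_outliers(order_values)  # values to investigate, not to discard
```

Whether 500 is a typing error or your most important customer is exactly the question the flag is there to prompt.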
- Data type conversion
Each column needs to hold a single, consistent data type (numbers stored as numbers, dates stored as dates) so that any automatic changes you make are accurate. Anything that needs to be converted but can’t be should be flagged so that you can account for it in any evaluations you carry out. Converting your data will make analysis much more straightforward.
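Here is a small sketch of converting text to numbers while flagging anything that won’t convert. The cleaning rules (stripping commas and pound signs) are assumptions about how the totals happen to be written in this made-up example.

```python
def to_number(text):
    """Return (value, ok); ok is False when the text cannot be converted."""
    cleaned = text.strip().replace(",", "").replace("£", "")
    try:
        return float(cleaned), True
    except ValueError:
        return None, False


# Hypothetical column of totals that arrived as text.
raw_totals = ["1,200", " 85.50", "£40", "forty", ""]

converted = [to_number(t) for t in raw_totals]

# Anything that failed is kept aside for review instead of being silently dropped.
needs_review = [t for t, (_, ok) in zip(raw_totals, converted) if not ok]
```

Returning a flag alongside each value means a failed conversion can never be mistaken for a genuine zero.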
- Final checks
Your data probably looks immaculate now, so much so that you’ll be raring to jump in and start your actual analysis. But hold on! Take the time to carry out some final checks to make 100% sure your data is not just clean, but spic and span.
It may be worth creating a checking template, particularly if there are other data analysts in your team, so that you are all singing from the same hymn sheet. The template can include questions such as:
- Does the data make sense?
- Are all appropriate rules being followed?
- Can you tell at a cursory glance whether the data can prove or disprove the theory you are testing?
This can mean the difference between continuing with the analysis or improving the data to ensure quality and accurate conclusions are drawn.
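Part of a checking template like this can be automated. The sketch below is one possible shape for it, assuming rows are dictionaries; the required columns and the two rules are hypothetical stand-ins for whatever your team agrees on. Questions such as “does the data make sense?” still need a human eye.

```python
def run_final_checks(rows, required, rules):
    """Return a list of readable problems; an empty list means every check passed."""
    problems = []
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) in (None, ""):
                problems.append(f"row {i}: missing {col}")
        for name, rule in rules.items():
            if not rule(row):
                problems.append(f"row {i}: failed rule '{name}'")
    return problems


# Hypothetical team rules, written once and shared by every analyst.
rules = {
    "spend is not negative": lambda r: r["spend"] is None or r["spend"] >= 0,
    "code has three digits": lambda r: len(r["code"]) == 3 and r["code"].isdigit(),
}

rows = [
    {"code": "007", "spend": 120.0},
    {"code": "42", "spend": -5.0},
]

problems = run_final_checks(rows, required=["code", "spend"], rules=rules)
```

An empty `problems` list is your green light; anything else tells you which row to revisit and why.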
Now it’s time to dive in and start your actual analysis with peace of mind that you are working with high-quality clean data.
If you are intrigued by a data analysis career and want to train to get into the industry quickly, apply for a place on our Data Science Bootcamp.