Data can boldly be described as the oil of the 21st century. By data we mean a huge, often chaotic collection of information with enormous business value. Entrepreneurs who draw knowledge from data run profitable and fast-growing businesses. In the last few years alone we have produced more data than in the whole of human civilisation before that.
Rapid technological development provides business with new data processing tools almost every day. Despite progressing automation, human knowledge of how to handle this valuable resource is still needed. Qualified analysts can be compared to 19th century gold prospectors, because their work depends on skilfully sifting valuable nuggets of information from worthless sand.
It does not matter whether you run a one-person business or help build a huge corporation. Every company processes data today, and business decisions are made on the basis of it. That makes correct data cleaning a key issue for business. What matters most is the quality of this valuable raw material, which can be damaged at almost every stage of processing, from acquisition to final analysis. I would like to devote the next part of the article to the basics of data cleaning, and it is meant not only for programmers. Find out the importance of data analytics for business success.
Missing values are values in a dataset that we do not know. What they are called depends on the programming environment: in SQL they are NULL and in Python None. Empty text (“”) or placeholder values that you define yourself (e.g. “.”, “none”) are also often used.
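Because the same gap can be written in several ways, it often helps to map every variant to a single marker before doing anything else. The snippet below is a minimal sketch of that idea. The name `normalize_missing` and the contents of `MISSING_MARKERS` are assumptions made for illustration, based only on the placeholders mentioned above.

```python
# Placeholder strings treated as "missing" (an assumed list; adjust it to your dataset).
MISSING_MARKERS = {"", ".", "none"}


def normalize_missing(value):
    """Return None for any value that stands for a missing entry, else the value itself."""
    if value is None:
        return None
    # Compare text case-insensitively and ignore surrounding whitespace.
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return None
    return value


raw = ["Anna", "", "none", ".", None, "42"]
clean = [normalize_missing(v) for v in raw]
print(clean)
```

After this step every missing entry is a plain None, so later cleaning code only has to check for one thing.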
Where do missing values come from? The most common cause is an incomplete or faulty data collection system. However, missing values may also result from external constraints, such as a legal ban on storing data without the consent of the data owners. Unfortunately, you cannot always afford to ignore a row with missing values, all the more so when an entire data sample is missing and not just a single value. So how do we find a missing value? It may surprise you, but you can guess it!
This is the most popular method. It consists of filling the empty space with the value that occurs most often in the given column. Missing values can also be replaced with the average of the column. Today it is also possible to guess them using artificial intelligence.
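The two simple strategies just described can be sketched in a few lines of Python. This is only an illustration: the function name `impute` and the sample columns are made up, and missing entries are assumed to be stored as None.

```python
from statistics import mean, mode


def impute(column, strategy="mode"):
    """Fill None entries with the most frequent value ("mode") or the average ("mean")."""
    known = [v for v in column if v is not None]
    # Note: statistics.mode/mean raise StatisticsError if no value is known at all.
    fill = mode(known) if strategy == "mode" else mean(known)
    return [fill if v is None else v for v in column]


cities = ["Warsaw", None, "Krakow", "Warsaw", None]  # hypothetical text column
ages = [30, None, 40, 50]                            # hypothetical numeric column

filled_cities = impute(cities)
filled_ages = impute(ages, strategy="mean")
print(filled_cities)
print(filled_ages)
```

The most frequent value works for both text and numbers, while the average only makes sense for numeric columns.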
Anomalies are values that we know, but that differ so much from the rest of the set that we doubt whether they are true. An example makes this easier to understand. Imagine that we conducted a survey and, on one of the sheets, under the heading ‘age’, the number 100 was recorded. We would certainly question the veracity of this information, so we can safely call it an anomaly.
Dedicated machine learning models are used to detect anomalies in huge data sets. One example is Isolation Forest, which uses decision trees to find data samples that differ significantly from the others.
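To give a feeling for how this works, here is a heavily simplified, one-dimensional sketch of the idea behind Isolation Forest, written from scratch for illustration. It is not the full algorithm and not a library API: the function names, the sample ages, and the parameters `n_trees` and `max_depth` are all assumptions. The intuition is that a value far from the rest gets separated by fewer random splits, so a low average depth suggests an anomaly.

```python
import random

# Hypothetical survey ages, including the suspicious value 100.
AGES = [23, 25, 31, 35, 38, 41, 44, 52, 58, 100]


def isolation_depth(values, x, rng, depth=0, max_depth=10):
    """Count how many random splits are needed until x is alone in its group."""
    if len(values) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(values), max(values)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as x.
    same_side = [v for v in values if (v < split) == (x < split)]
    return isolation_depth(same_side, x, rng, depth + 1, max_depth)


def average_depth(values, x, n_trees=200, seed=0):
    """Average isolation depth of x over many random splitting sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(values, x, rng) for _ in range(n_trees)) / n_trees


depths = {age: average_depth(AGES, age) for age in AGES}
most_anomalous = min(depths, key=depths.get)  # lowest depth = easiest to isolate
print(most_anomalous)
```

In real projects you would normally rely on a tested implementation, for example the `IsolationForest` class in scikit-learn, which also handles many columns and large data sets.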
Detected anomalies are usually removed, treated as missing data and replaced with the most suitable values. It is also possible to leave an anomaly in place (if we believe that the number “100” quoted earlier was not a mistake). Machine learning algorithms are among the most effective methods not only for detecting anomalies but also for transforming them.
Data transformation is very important. It may mean any change in the shape of the data, but here we will focus on transformation as a way of cleaning the data and preparing it for analysis.
The key issue is to maintain a balance between the amount of information retained and the ease of analysing it. What does that mean? Let me take another example. When analysing social media posts, we cannot limit ourselves to whether a post is positive or negative, because we would lose a lot of information. On the other hand, analysing the full content of every post will prove difficult and too time-consuming.
Examples of correct data transformation are changing a client’s age from days to years, or changing the place of birth from city to country.
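Both transformations can be sketched in a few lines of Python. Everything named here is hypothetical: the field names, the sample record, and above all the tiny `CITY_TO_COUNTRY` table, which in practice would come from a proper reference source.

```python
# Hypothetical lookup table; a real project would use a complete reference dataset.
CITY_TO_COUNTRY = {"Warsaw": "Poland", "Lyon": "France"}


def age_in_years(age_days):
    """Convert an age in days to whole years (approximation that ignores leap days)."""
    return age_days // 365


def birth_country(city):
    """Map a city to its country; returns None when the city is not in the table."""
    return CITY_TO_COUNTRY.get(city)


record = {"age_days": 12775, "birth_city": "Lyon"}  # made-up client record
transformed = {
    "age_years": age_in_years(record["age_days"]),
    "birth_country": birth_country(record["birth_city"]),
}
print(transformed)
```

Both steps trade detail for simplicity: the result carries less information than the original record, but it is much easier to group and compare.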
I hope that I have been able to bring the business potential of data analysis and the technical aspects of data cleaning a little closer to you. To explore the subject of cleaning key data for financial institutions further, I invite you to watch another episode of our popular science series.