Data can fairly be described as the oil of the 21st century. By data we mean a huge, often chaotic collection of information with enormous business value. Entrepreneurs who draw knowledge from data run profitable, fast-growing businesses. In the last few years alone we have produced more data than in all of human history before them.
Rapid technological development gives business new data processing tools almost every day. Despite progressing automation, human knowledge of how to handle this valuable resource is still needed. Qualified analysts can be compared to 19th century gold prospectors, because their work depends on skilfully sifting valuable nuggets of information from worthless sand.
Gold exploration technique
It does not matter whether you run a one-person business or help build a huge corporation. Every company processes data today, and business decisions are made on the basis of it. The key issue is therefore correct data cleaning. What matters most is the quality of this valuable raw material, which can be damaged at almost every stage of processing, from acquisition to final analysis. I would like to devote the next part of the article to the basics of data cleaning, and I invite not only programmers to read it.
Missing values
Missing values are values in a dataset that we do not know. What they are called depends on the programming environment: in SQL they are NULL, and in Python None. Empty text (“”) or placeholder values that you define yourself (e.g. “.”, “none”) are also often used.
Where do missing values come from? The most common cause is an incomplete or faulty data collection system. However, missing values may also result from external constraints, such as a legal ban on storing data without the consent of the data owners. Unfortunately, you cannot always afford to ignore a row with missing values, all the more so if a whole data sample is missing and not just a single value. So how do we find the missing value? It may surprise you, but you can guess it!
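Because the same "unknown" can be spelled in several ways, a useful first step is to map all of them to a single representation. The sketch below uses only the Python standard library; the list of placeholder spellings and the sample rows are my own assumptions, not taken from any particular system.

```python
# Hypothetical placeholder spellings; adjust them to what your sources use.
MISSING_MARKERS = {"", ".", "none", "null", "n/a"}


def normalize_missing(value):
    """Return None for anything that stands for 'unknown', else the value itself."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in MISSING_MARKERS:
        return None
    return value


# Illustrative rows, as they might arrive from a form or a CSV export.
raw_rows = [
    {"name": "Anna", "age": "34", "city": "Warsaw"},
    {"name": "Piotr", "age": "", "city": "none"},
    {"name": "Ewa", "age": ".", "city": "Gdansk"},
]

# Apply the normalisation to every cell.
clean_rows = [
    {key: normalize_missing(value) for key, value in row.items()}
    for row in raw_rows
]

# Count how many values are missing in each column.
missing_per_column = {
    key: sum(row[key] is None for row in clean_rows) for key in raw_rows[0]
}
print(missing_per_column)
```

Once every placeholder has become None, counting or filling the gaps no longer depends on how each source happened to write them.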
Guessing – a method for dealing with missing values
This is the most popular method. It consists in filling the empty space with the value that occurs most often in the given column. Missing values can also be replaced with the column average. It is now also possible to guess using artificial intelligence technology.
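Both simple variants of guessing can be sketched in a few lines of standard-library Python. The function names and sample columns are illustrative assumptions; the most frequent value suits categories, while the average only makes sense for numbers.

```python
from statistics import mean, mode


def fill_with_mode(values):
    """Replace None with the most frequent known value in the column."""
    known = [v for v in values if v is not None]
    most_common = mode(known)  # raises StatisticsError if nothing is known
    return [most_common if v is None else v for v in values]


def fill_with_mean(values):
    """Replace None with the average of the known numeric values."""
    known = [v for v in values if v is not None]
    average = mean(known)  # raises StatisticsError if nothing is known
    return [average if v is None else v for v in values]


# Illustrative columns with gaps.
cities = ["Warsaw", None, "Krakow", "Warsaw", None]
ages = [30, None, 40, 50, None]

filled_cities = fill_with_mode(cities)
filled_ages = fill_with_mean(ages)

print(filled_cities)
print(filled_ages)
```

Keep in mind that every filled-in value is a guess: the more gaps a column has, the more the result reflects the filling rule and the less it reflects the data.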
Anomaly
Anomalies are values that we do know, but that differ so markedly from the rest of the set that we have doubts about their veracity. To make this clearer, let me explain with an example. Imagine that we have conducted a survey and, on one of the sheets, under the heading ‘age’, we have noted the number 100. We would certainly question the veracity of this information, so we can safely call it an anomaly.
Dedicated machine learning models are used to detect anomalies in huge data sets. One example is Isolation Forest, which uses decision trees to detect data samples that differ significantly from the others.
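As a rough illustration, here is how the survey example could be run through the Isolation Forest implementation in scikit-learn. The ages are invented, and the `contamination` value (the share of samples we expect to be anomalous) is an assumption you would tune for real data.

```python
from sklearn.ensemble import IsolationForest

# Invented survey answers under the heading "age".
ages = [23, 31, 35, 29, 41, 38, 27, 33, 100, 30]

# scikit-learn expects a 2-D input: one row per sample, one column per feature.
samples = [[age] for age in ages]

# contamination=0.1 is an assumption: we expect roughly 1 in 10 answers to be odd.
# random_state makes the run repeatable.
model = IsolationForest(contamination=0.1, random_state=0)

# fit_predict returns -1 for samples judged anomalous and 1 for the rest.
labels = model.fit_predict(samples)

suspicious = [age for age, label in zip(ages, labels) if label == -1]
print(suspicious)
```

The model only flags the value as unusual. Whether a respondent aged 100 is a typing error or a genuine centenarian remains a human decision.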
Detected anomalies are usually removed, then treated as missing data and replaced by the most suitable values. However, it is also possible to leave an anomaly in place (if we believe that the previously quoted number “100” was not a mistake). Machine learning algorithms are often the most effective way not only to detect anomalies but also to transform them.
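The remove-then-replace routine can be sketched without any machine learning at all. In the example below the rule "older than 90" is a deliberately naive, hypothetical stand-in for a real detector, and the median is just one possible choice of replacement value.

```python
from statistics import median


def replace_anomalies(values, is_anomaly):
    """Blank out flagged values, then fill the gaps with the median of the rest."""
    # Step 1: treat every detected anomaly as a missing value.
    with_gaps = [None if is_anomaly(v) else v for v in values]
    # Step 2: fill the gaps using only the values we still trust.
    known = [v for v in with_gaps if v is not None]
    fallback = median(known)
    return [fallback if v is None else v for v in with_gaps]


ages = [23, 31, 35, 29, 100]

# Hypothetical rule standing in for a real anomaly detector.
cleaned = replace_anomalies(ages, is_anomaly=lambda age: age > 90)
print(cleaned)
```

If we decide that the value was not a mistake after all, we simply do not flag it and it stays in the data unchanged.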
Transformation of data
Data transformation is very important. It can mean any change in the shape of the data, but here we will focus on transformation as part of cleaning and preparing data for analysis.
The key issue is to balance the amount of information retained against the ease of analysing it. What does that mean? Let me give another example. When analysing social media posts, we cannot limit ourselves to whether a post is positive or negative, because we would lose a lot of information. On the other hand, analysing the full content of every post would be difficult and too time-consuming.
Examples of sound data transformation are converting a client’s age from days to years, or recording the place of birth as a country instead of a city.
I hope I have been able to bring you closer to the business potential of data analysis and the technical aspects of data cleaning. To explore the subject of cleaning key data for financial institutions in more depth, I invite you to watch the next episode of our popular science series.
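Both transformations are easy to sketch in Python. The field names, the sample clients and the small city-to-country table are assumptions made up for the illustration; in practice the lookup would come from a proper reference dataset.

```python
# Illustrative lookup table; a real one would come from a reference dataset.
CITY_TO_COUNTRY = {
    "Warsaw": "Poland",
    "Krakow": "Poland",
    "Berlin": "Germany",
}


def age_in_years(age_in_days):
    """Convert an age in days to completed years (365.25 days per year on average)."""
    return int(age_in_days // 365.25)


def birth_country(city):
    """Map a city to its country; returns None when the city is not in the table."""
    return CITY_TO_COUNTRY.get(city)


# Hypothetical client records in their raw, overly detailed form.
clients = [
    {"age_days": 12784, "birth_city": "Warsaw"},
    {"age_days": 7305, "birth_city": "Berlin"},
]

transformed = [
    {
        "age_years": age_in_years(client["age_days"]),
        "birth_country": birth_country(client["birth_city"]),
    }
    for client in clients
]
print(transformed)
```

Each record loses some detail, but the values that remain are easier to group and compare. A city missing from the table comes back as None, which turns it into an ordinary missing value.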