Data analysis is a long process with several steps. First, we need to get to know the data and understand every feature in the dataset. Then we must detect the missing values and clean them up. We can fill these NaN values with a statistic (mean, median, etc.) or write our own function to fill them. We can also drop columns that are not helpful or that have far more NaN values than the others.
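As a minimal sketch of the clean-up step described above, here is one way to drop a mostly empty column and fill the remaining gaps with pandas. The toy DataFrame, its column names, and the 50% threshold are assumptions made for illustration only; they are not taken from the diamonds dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are placeholders.
df = pd.DataFrame({
    "price": [326.0, np.nan, 334.0, 335.0],
    "depth": [61.5, 59.8, np.nan, 62.4],
    "note": [np.nan, np.nan, np.nan, 1.0],  # mostly missing
})

# Share of missing values in each column.
missing_share = df.isna().mean()

# Keep only columns where at most half of the values are missing
# (the 0.5 threshold is an assumption, pick what suits your data).
kept = df.loc[:, missing_share <= 0.5]

# Fill the remaining NaN values with each column's median.
cleaned = kept.fillna(kept.median())

print(cleaned)
```

Swapping `kept.median()` for `kept.mean()` fills with the mean instead. The median is usually the safer choice when the column itself contains outliers.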
This process can change depending on the data and the target. But sooner or later we have to fight the outliers: detect them and handle them. Every dataset has its own outliers, and whether a value counts as one depends on the rule we use, for example whether it lies within 1.5 IQR of the quartiles. Sometimes these outliers aren’t harmful, so we leave them alone. But if we want good results from our models or our analysis, we usually need to handle them. There are three commonly used methods to deal with outliers:
1. Dropping the outliers.
2. Winsorize method.
3. Log transformation.
Let’s look at these methods with Python.
In this demo, we will use the Seaborn diamonds dataset.
We will work with the table feature of the diamonds dataset and assume all NaN values have already been handled (we simply dropped them). Let’s look at its boxplot and histogram.
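A rough sketch of how the two plots could be drawn is shown here. It uses plain matplotlib and a small made-up sample so that it runs without downloading anything; the helper name `plot_box_and_hist` and the sample values are assumptions. With the real data, the Series would come from `sns.load_dataset("diamonds")["table"]`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_box_and_hist(values, bins=10):
    """Draw a boxplot and a histogram of one numeric column side by side."""
    data = values.dropna().to_numpy()
    fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))
    ax_box.boxplot(data)  # points beyond the whiskers are drawn individually
    ax_box.set_title(f"Boxplot of {values.name}")
    ax_hist.hist(data, bins=bins)  # a long right tail indicates right skew
    ax_hist.set_title(f"Histogram of {values.name}")
    return fig


# Made-up sample standing in for the table column.
table = pd.Series(
    [53.0, 55.0, 56.0, 56.0, 57.0, 57.0, 58.0, 59.0, 60.0, 62.0, 70.0, 79.0],
    name="table",
)
fig = plot_box_and_hist(table)
```

In a notebook, drop the `matplotlib.use("Agg")` line and the figure will render inline.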
As you can see, this column has outliers (shown in the boxplot) and it is right-skewed (easy to see in the histogram). A boxplot is one of the best ways to spot outliers.
Before handling outliers, we need to detect them. We will use Tukey’s rule, also known as the IQR rule. First, we calculate the interquartile range of the data (IQR = Q3 - Q1). Then we use the IQR to set our outlier boundaries.
The lower boundary is Q1 - 1.5 * IQR and the upper boundary is Q3 + 1.5 * IQR.
According to this rule, values between the boundaries are acceptable, and values below the lower boundary or above the upper boundary are outliers. We can also use a multiplier of 2 or 2.5 instead of 1.5; it depends on our data and analysis. But the most commonly used multiplier is 1.5, and that is what we will use in this analysis.
With the describe method of pandas, we can see our data’s Q1 (25%) and Q3 (75%) percentiles. From these we can calculate the IQR and the boundaries (with a multiplier of 1.5).
Our upper boundary is 63.5 and our lower boundary is 51.5. This means that values between 51.5 and 63.5 are acceptable, and values outside that range are outliers. We need to handle them because they can distort our analysis.
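The calculation above can be sketched as a small helper. The function name `tukey_fences` and the nine sample values are assumptions for illustration; the sample is chosen so that Q1 = 56 and Q3 = 59, which gives the same 51.5 and 63.5 boundaries as in the text.

```python
import pandas as pd


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) outlier boundaries of Tukey's IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up sample standing in for the table column.
table = pd.Series([50, 56, 56, 57, 57, 58, 59, 59, 70], dtype=float, name="table")

lower, upper = tukey_fences(table)  # default multiplier of 1.5
print(lower, upper)

# A larger multiplier widens the boundaries and flags fewer values.
wide_lower, wide_upper = tukey_fences(table, k=2.5)
print(wide_lower, wide_upper)
```

Keeping the multiplier as a parameter makes it easy to compare 1.5, 2, and 2.5 on the same column before settling on one.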
Let’s handle outliers.
1 — Dropping the outliers;
We can easily remove outliers, but this shrinks our data. If we have a lot of rows, we may be able to afford the loss. But remember, when we drop a value, we delete the whole record (row). If those records are valuable, the information in their other columns is lost as well.
We build a condition to find the outliers and their index. As you can see, if we drop the outliers, we will lose 605 records. Line 27 shows the outliers and line 28 shows the data without them. Check the length of the tables. In the beginning, we had 53940 rows.
We removed the outliers and our data dropped to 53335 rows. After dropping outliers, let’s check the boxplot and histogram of our data.
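Here is a minimal sketch of the dropping step on a toy DataFrame. The values and the extra `price` column are made up; the point is only to show that filtering on one column removes whole rows.

```python
import pandas as pd

# Made-up data standing in for the diamonds DataFrame.
df = pd.DataFrame({
    "table": [50.0, 56.0, 56.0, 57.0, 57.0, 58.0, 59.0, 59.0, 70.0],
    "price": [326, 327, 334, 335, 336, 337, 338, 339, 340],
})

# Tukey boundaries for the table column.
q1 = df["table"].quantile(0.25)
q3 = df["table"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# True for rows whose table value lies inside the boundaries.
is_inlier = df["table"].between(lower, upper)

outliers = df[~is_inlier]       # rows we are about to lose
df_no_outliers = df[is_inlier]  # rows we keep

print(len(df), len(outliers), len(df_no_outliers))
```

Inspecting `outliers` before discarding it is a cheap way to check whether any of those rows are worth keeping.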
Now, we don’t have any outliers.
2 — Winsorize Method;
Our second method is the Winsorize Method. Here we cap outliers at an upper and a lower limit that we set ourselves. Values beyond the limits are replaced by the limit values, so the limits become the new maximum and minimum of the data.
We will use the table column of the diamonds dataset again. Let’s check the boxplot again.
We still have the outliers we detected at the beginning. Our outlier boundaries are 63.5 (upper) and 51.5 (lower).
For the Winsorize Method, we import winsorize from SciPy (scipy.stats.mstats). We need boundaries to apply winsorize. We will limit our data to the range 53 to 63, which lies inside the outlier boundaries. winsorize takes its limits as fractions of the data rather than as values, so we need to find which percentiles correspond to 53 and 63, and the quantile method of pandas helps here.
We will create a new variable with the Winsorize Method. To apply it, we pass the limits as a tuple of fractions. For example, we write (0.01, 0.02). This means the lowest 1% and the highest 2% of the values are capped, so quantile(0.01) and quantile(0.98) act as the boundaries. The first number is the fraction counted from the bottom of the data, and the second is the fraction counted from the top.
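A small sketch of this step follows, using `scipy.stats.mstats.winsorize` on 100 made-up values (one low extreme, two high extremes) so the effect of `limits=(0.01, 0.02)` is easy to follow. The data and variable names are assumptions. Note that `quantile` interpolates between neighbouring values, so it only approximates the value that `winsorize` actually caps at.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Made-up sample: 97 ordinary values plus one low and two high extremes.
table = pd.Series(
    np.concatenate([[43.0], np.linspace(55.0, 60.0, 97), [70.0, 95.0]]),
    name="table",
)

# Values sitting near the chosen percentiles (interpolated by pandas).
print(table.quantile(0.01), table.quantile(0.98))

# Cap the lowest 1% and the highest 2% of the values.
capped = winsorize(table.to_numpy(), limits=(0.01, 0.02))

# winsorize returns a masked array; convert it back to a Series.
table_win = pd.Series(np.asarray(capped), name="table_win")

print(table_win.min(), table_win.max(), len(table_win))
```

No rows are removed. The extreme values are replaced by the nearest value that stays inside the limits, so the length of the data is unchanged.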
We applied the Winsorize Method; let’s check the data on graphs.
As you can see, there are no outliers. Notice the minimum and maximum values on the boxplot: 53 and 63, the boundaries we applied. Now we can compare the descriptive statistics of the old and new data with the describe method.
I converted df_table_win to a Series in order to call the describe method on it. Notice that the mean and standard deviation of df_table_win have changed. The minimum and maximum also changed, but these changes do not affect median-based statistics such as the median and the quartiles. Therefore, we should apply the Winsorize Method carefully: as you can see, mean-based statistics can change, which can distort our data, damage our analysis, or have adverse effects on models.
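To make the comparison concrete, here is a hedged sketch that puts the two `describe` outputs side by side. It uses made-up values rather than the diamonds data, so the numbers differ from the article’s, but the pattern is the same: the mean and standard deviation move while the quartiles stay put.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Made-up sample with one low and two high extremes.
original = pd.Series(
    np.concatenate([[43.0], np.linspace(55.0, 60.0, 97), [70.0, 95.0]])
)

# Winsorize, then convert the masked array back to a Series for describe().
winsorized = pd.Series(
    np.asarray(winsorize(original.to_numpy(), limits=(0.01, 0.02)))
)

# One column per version of the data, one row per statistic.
summary = pd.DataFrame({
    "original": original.describe(),
    "winsorized": winsorized.describe(),
})
print(summary)
```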
3 — Log Transformation;
Our last method is Log Transformation. We use log transformation on skewed data. It reduces the skewness of right-skewed data and pulls the distribution closer to normal. It doesn’t always succeed; sometimes it makes the data more skewed. So it depends on the data: we have to apply the transformation and check the result.
For this method, we will use the carat column of the diamonds dataset. Let’s check the data and graphs.
There are many outliers and the data is right-skewed. Log transformation should bring the data to a normal or close-to-normal shape. Let’s apply the log transformation to reduce the variability of the data.
We applied the log transformation with NumPy’s np.log. It changed the scale of our data completely and removed the outliers, as we can see in the boxplot.
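Here is a rough sketch of the transformation and of the check that follows it. The values are made up to resemble a right-skewed, strictly positive column like carat; the helper `count_tukey_outliers` is a hypothetical name. Keep in mind that `np.log` needs positive values, since zeros and negatives produce `-inf` or NaN.

```python
import numpy as np
import pandas as pd


def count_tukey_outliers(values, k=1.5):
    """Count values outside the Tukey boundaries Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return int(outside.sum())


# Made-up right-skewed, strictly positive sample standing in for carat.
carat = pd.Series(np.exp(np.linspace(-1.6, 1.6, 101)), name="carat")

# Natural log of every value; the result is still a pandas Series.
log_carat = np.log(carat)

# Check the result instead of trusting the transformation blindly.
print("skewness:", carat.skew(), "->", log_carat.skew())
print("outliers:", count_tukey_outliers(carat), "->", count_tukey_outliers(log_carat))
```

If the skewness grows in absolute value after the transformation, that is the signal that the log is not a good fit for that column.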
Log transformation is commonly used in machine learning. Be careful: it removes outliers, but it also changes our values. It brings our data closer to a normal distribution, and many machine learning algorithms work better with normally distributed data. There are related techniques in machine learning such as scaling and normalization. We will talk about these terms in our next stories.
Dealing with outliers takes a long time, but it’s worth it. We have to choose the best method for the data, so it is important to know the data very well. With this domain knowledge, we can decide how to handle the outliers. Apart from these methods, we can also treat outliers as missing values and use the methods for filling missing values to get rid of them.
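That last idea can be sketched in a few lines: mark the outliers as NaN, then fill them like any other missing value. The sample values are made up, and filling with the median is just one possible choice.

```python
import pandas as pd

# Made-up sample standing in for the table column.
table = pd.Series([50.0, 56.0, 56.0, 57.0, 57.0, 58.0, 59.0, 59.0, 70.0], name="table")

# Tukey boundaries with the usual 1.5 multiplier.
q1 = table.quantile(0.25)
q3 = table.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep values inside the boundaries; everything else becomes NaN.
masked = table.where(table.between(lower, upper))

# Fill the new NaN values with the median of the remaining values.
filled = masked.fillna(masked.median())

print(filled.tolist())
```

Unlike dropping, this keeps every row, so the other columns of those records are preserved.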