Encoding with Pandas get_dummies

Hasan Ersan YAĞCI
6 min read · Jan 24, 2021

We previously covered encoding and why it matters. In short, machine learning models, including neural networks, are mathematical models whose algorithms operate on numerical data. Therefore, we need encoding methods to convert non-numerical data into meaningful numerical data. We covered the encoding methods and the options for applying them at this link.

In this story, we will look at the Pandas get_dummies method. It is the easiest way to implement one hot encoding, and it has several useful parameters, of which we will cover the most important ones. With get_dummies we can get a one hot encoded data frame (dummy variables) in a single line of code.

Two Methods of One Hot Encoding

We will use car_price data for this demo. You can download data and notebook from this link.

The purpose of this data science process is to predict the prices of cars. We will use Linear Regression for this data, but the data is not ready for the machine learning model because it contains categorical (non-numerical) columns that we need to transform. For this, we will use get_dummies.

Before diving into get_dummies, let’s check the data.

As you can see, the data has no missing values; I arrived at this structure after handling outliers. You can check this link to see how to detect and handle outliers with pandas.

We have 5 categorical columns that we need to make numerical before applying a machine learning algorithm. These columns are “make_model”, “body_type”, “Body Color”, “Gearing Type” and “Extras”, and all of them are nominal data. For example, there is no hierarchy between the values of “Body Color”: black has no superiority over red. That’s why we have to use one hot encoding.

First we will apply get_dummies only to the “Body Color” column to see how it works in detail, then we will apply it to the whole dataframe.

1 — get_dummies()

The “Body Color” column has 13 unique values, which means we will get 13 columns after applying get_dummies.

With this syntax we can apply get_dummies to a single column of the dataframe:

pd.get_dummies(df['Body Color'])

We didn’t pass any parameters, so get_dummies used its defaults. As you can see, we got a 13-column dataframe with 4800 rows. Each row shows the color of one car: for example, the first car is black, the second car is red, and so on.

Our original data frame, df, keeps its shape, so we must merge the two dataframes. We can assign the dummy variables (the one hot encoding) of “Body Color” to a new dataframe and then merge it with the main dataframe.

We will now merge the two data frames using the join method. But first we should drop the “Body Color” column; we don’t need it anymore because it is not numerical and the dummy columns carry the same information.
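The same call can be tried on a small scale. The following is a minimal sketch on a made-up four-row frame (the name `toy` and its values are assumptions for illustration, not the car_price data). It passes `dtype=int` explicitly because newer pandas versions return boolean dummies by default.

```python
import pandas as pd

# Toy stand-in for the car data; the values are made up for illustration.
toy = pd.DataFrame({"Body Color": ["Black", "Red", "Beige", "Black"]})

# dtype=int forces 0/1 integers; newer pandas versions default to booleans.
color_dummies = pd.get_dummies(toy["Body Color"], dtype=int)

# One column per unique color, in alphabetical order, one row per car.
print(color_dummies)
```

Each row contains exactly one 1, in the column of that car's color.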

With this syntax, we dropped the “Body Color” column and added our dummy variables. As you can see, all colors have a column and column types are numerical.
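One way the drop-and-join step could look is sketched below on a made-up frame. The names `toy`, `color_dummies`, and `toy_encoded` and the price values are assumptions for illustration; the article's own notebook may differ.

```python
import pandas as pd

# Toy stand-in for the car data; prices are made up for illustration.
toy = pd.DataFrame({
    "Body Color": ["Black", "Red", "Beige"],
    "price": [15000, 18000, 21000],
})

# Build the dummy variables for the categorical column.
color_dummies = pd.get_dummies(toy["Body Color"], dtype=int)

# Drop the original text column, then attach the dummies by index.
toy_encoded = toy.drop("Body Color", axis=1).join(color_dummies)
print(toy_encoded)
```

`join` aligns on the index, so it works here because the dummies were built from the same frame and share its index.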

2 — get_dummies with ‘drop_first’ parameter

One column was transformed into 13 columns. With the ‘drop_first’ parameter we can reduce that by one and get 12 columns. The default value of this parameter is ‘False’; we just set it to ‘True’. Let’s see how it works.

pd.get_dummies(df['Body Color'], drop_first = True)

Check the number of columns: instead of 13 we got 12.

drop_first removes the first column of the get_dummies dataframe. For “Body Color” the first column is Beige, so a beige car has 0 in every remaining column. When all columns are 0, the model knows it’s a beige car. Check out the example below.

More columns mean more training time and can hurt performance. Imagine we have 20 columns that are not numerical: if we use ‘drop_first’, we get 20 fewer columns. For a linear regression model, dropping one dummy per column also avoids perfectly collinear features, since the full set of dummies for a column always sums to 1. So the drop_first parameter is useful for model performance.
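To make the all-zero row concrete, here is a minimal sketch on a made-up three-color series (the name `toy_colors` and its values are assumptions, not the article's data).

```python
import pandas as pd

# Toy series with three colors; "Beige" sorts first alphabetically.
toy_colors = pd.Series(["Beige", "Black", "Red"], name="Body Color")

# drop_first=True removes the first (alphabetical) dummy column.
reduced = pd.get_dummies(toy_colors, drop_first=True, dtype=int)

# The beige car (row 0) is now encoded as zeros in every remaining column.
print(reduced)
```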

3 — get_dummies with ‘prefix’ parameter

If the dataframe also had an “Upholstery Color” column, get_dummies would produce a black or a brown column for it as well, in addition to the ones from “Body Color”. Multiple columns with the same name can cause problems. We can use the ‘prefix’ parameter to avoid this situation.

pd.get_dummies(df['Body Color'], drop_first = True, prefix = 'BC')

This parameter adds the given word as a prefix, followed by an underscore. In the example above we used the BC prefix, so the columns are named BC_Black, BC_Red, and so on.

4 — get_dummies with ‘columns’ parameter
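The name clash described above can be sketched with a made-up frame that contains both color columns. The “Upholstery Color” values and the `UC` prefix are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical frame with two color columns that share category names.
toy = pd.DataFrame({
    "Body Color": ["Black", "Brown"],
    "Upholstery Color": ["Brown", "Black"],
})

# A distinct prefix per source column keeps the dummy names unique.
body = pd.get_dummies(toy["Body Color"], prefix="BC", dtype=int)
upholstery = pd.get_dummies(toy["Upholstery Color"], prefix="UC", dtype=int)

# Without the prefixes this join would fail on the overlapping names.
combined = body.join(upholstery)
print(combined.columns.tolist())
```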

We can apply get_dummies directly to a whole dataframe instead of applying it to each column individually. It will then automatically add the column name as a prefix to each dummy variable.

pd.get_dummies(df, drop_first = True)

But the dataframe grew to 349 columns because the “Extras” column has many unique values. We have to handle this column separately. To do that, we can use the columns parameter.

pd.get_dummies(df, columns = ['make_model', 'body_type', 'Body Color', 'Gearing Type'], drop_first = True)

With the columns parameter, we can choose which columns to encode. That way we can handle special columns such as “Extras” later with str.get_dummies.

5 — str.get_dummies

The str.get_dummies method is a string-handling version of get_dummies that is applied to a series through the .str accessor. It splits each string in the series on a separator and creates one dummy column per distinct piece. Its key parameter is the separator, sep (‘|’ by default).

In this dataframe, the “Extras” column lists some features of each car, for example alloy wheels, sports seats, voice control, etc. These features are important in terms of price. Check below: the column has 325 unique values, yet there are only 16 distinct extra features. The unique values are different combinations of those features, because each car has a different number of extras.
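A minimal sketch of the columns parameter on a made-up frame follows. The model names, gear values, extras, and prices are assumptions for illustration; only the column names come from the article.

```python
import pandas as pd

# Toy stand-in for the car data; all values are made up.
toy = pd.DataFrame({
    "make_model": ["Audi A1", "Audi A3", "Audi A1"],
    "Gearing Type": ["Manual", "Automatic", "Manual"],
    "Extras": ["Alloy wheels,Voice Control", "Alloy wheels", "Sport seats"],
    "price": [15000, 21000, 16000],
})

# Only the listed columns are encoded; "Extras" and "price" pass through.
encoded = pd.get_dummies(
    toy,
    columns=["make_model", "Gearing Type"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```

The untouched columns come first in the result, followed by the dummies, each prefixed with its source column name.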

If we apply get_dummies directly, it adds 325 more columns. So we have to use str.get_dummies. We use a comma (,) as the separator because the values in the data are separated by commas (,).

df['Extras'].str.get_dummies(',')

As you can see, there are 16 extra features. Check for the first row. It has alloy wheels, catalytic converter, and sound control; these are 1, others are 0.
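The difference between unique strings and unique features can be seen on a small scale. This is a minimal sketch with a made-up “Extras” series (the values are assumptions for illustration).

```python
import pandas as pd

# Toy "Extras" values: comma-separated feature lists, made up for illustration.
extras = pd.Series(
    [
        "Alloy wheels,Catalytic Converter,Voice Control",
        "Alloy wheels,Sport seats",
        "Voice Control",
    ],
    name="Extras",
)

# Every distinct string counts as its own category for plain get_dummies.
n_unique_strings = extras.nunique()

# str.get_dummies splits on the separator first, giving one column per feature.
extras_dummies = extras.str.get_dummies(",")
print(extras_dummies)
```

The split is on the exact separator, so if the real values had a space after each comma, a separator of `", "` would be needed to avoid labels with a leading space.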

Now let’s get the final version of the dataframe by merging them:

All columns are numeric. Our data is now ready for the model.
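Putting the steps together, the whole encoding could be sketched as follows on a made-up frame. All values and the names `toy`, `encoded`, and `final_df` are assumptions for illustration; the article's notebook may do this differently.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy stand-in for the car data; all values are made up.
toy = pd.DataFrame({
    "body_type": ["Sedan", "Compact", "Sedan"],
    "Gearing Type": ["Manual", "Automatic", "Manual"],
    "Extras": ["Alloy wheels,Voice Control", "Alloy wheels", "Sport seats"],
    "price": [15000, 21000, 16000],
})

# Step 1: one hot encode the ordinary categorical columns.
encoded = pd.get_dummies(
    toy, columns=["body_type", "Gearing Type"], drop_first=True, dtype=int
)

# Step 2: split the multi-valued "Extras" column into one column per feature.
extras_dummies = encoded["Extras"].str.get_dummies(",")

# Step 3: replace the text column with its dummy columns.
final_df = encoded.drop("Extras", axis=1).join(extras_dummies)

# Every remaining column should now be numeric.
all_numeric = all(is_numeric_dtype(final_df[col]) for col in final_df.columns)
print(final_df.dtypes)
```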

Conclusion

Pandas get_dummies is the final touch on the data before modeling, because we have to make all non-numerical columns numerical. We have to use it or some other method to get the encoding tables / dummy variables. But as you can see, get_dummies is the easiest way, and its parameters (drop_first, prefix, columns) let us keep the encoded data compact and readable.

Import the data, handle missing values and outliers, then apply get_dummies. The data is then ready for the model.
