https://hersanyagci.medium.com/random-resampling-methods-for-imbalanced-data-with-imblearn-1fbba4a0e6d3

Imbalanced data is a major problem for classification tasks. In Python, there is a library that provides many algorithms for handling imbalanced data and the harm it causes.

imbalanced-learn is a Python package offering several re-sampling techniques commonly used on datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects. You can read the introduction to the imbalanced-learn library to get an idea of its outline.

In this series, each technique will be handled separately. Re-sampling techniques are divided into two main categories:

● Under-sampling…


https://github.com/DistrictDataLabs/yellowbrick/tree/develop/examples

Visualization is essential to make our analysis or modeling process understandable. We need visualization to see results or workflow, especially in machine learning algorithms.

Yellowbrick was created for this. Yellowbrick is a visualization library that works with scikit-learn machine learning algorithms. It is not part of the scikit-learn-contrib projects, but it uses the scikit-learn API to support classification, clustering, hyperparameter selection, model selection, and more. Yellowbrick generates visualizations by wrapping Matplotlib, the most prominent Python scientific visualization library.

When we want to add a good visual to our workbook, or when we can’t…


https://freerangestock.com/

Feature selection is one of the most critical stages of a machine learning pipeline. We may have to deal with too many features or with useless ones, so we have to eliminate some of them. So why is feature selection important? Because it can improve the performance and quality of our model and shorten the training time. We do not want to train our model on unnecessary features, so we eliminate noisy (non-informative) features that do not contribute to our model or that cause overfitting. Feature selection can be the key to a successful machine learning process.

Feature selection…



We want the data we prepare or analyze for the model to be perfect. However, data may have missing values, outliers, or complex data types, so we always do some preprocessing. This is normal. Sometimes, however, we encounter imbalanced data in classification tasks. Many algorithms handle this situation poorly, so the imbalance needs to be addressed.

For example, you have a binary classification task and you have 100 records (rows) in your data, of which 90 rows are labeled as 1 and the remaining 10 rows are labeled as 0.
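
To make the 90/10 example concrete, here is a minimal sketch using only the Python standard library. The "always predict the majority class" baseline is an assumption added for illustration and is not part of the original example; it shows why plain accuracy can look good on imbalanced data while the minority class is never detected.

```python
from collections import Counter

# The split described above: 90 rows labeled 1 and 10 rows labeled 0.
labels = [1] * 90 + [0] * 10

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Hypothetical baseline: ignore the features and always predict the majority class.
predictions = [majority_class] * len(labels)

# Overall accuracy of that baseline.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on the minority class (label 0): share of 0-labeled rows predicted as 0.
minority_hits = sum(1 for p, y in zip(predictions, labels) if y == 0 and p == 0)
minority_recall = minority_hits / counts[0]

print("class counts:", dict(counts))
print("accuracy:", accuracy)
print("minority recall:", minority_recall)
```

The baseline is right on every majority row and wrong on every minority row, so a high accuracy here says nothing about how well the rare class is handled.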


Do you really want this to happen?

In the data science process, we need to do some preprocessing before applying machine learning algorithms. This includes basic data analysis steps such as handling missing values and outliers and cleaning the data. We also apply scaling (data transformation) to some data.

Scaling is not mandatory, but some machine learning algorithms perform better when the data is scaled first.

The main purpose of scaling is to avoid the effects of greater numeric ranges. …
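
As a rough illustration of what "greater numeric ranges" means, the sketch below scales two columns with very different ranges using NumPy. The sample values and column meanings are hypothetical, and min-max scaling and standardization are shown only as two common choices, not necessarily the ones the article goes on to cover.

```python
import numpy as np

# Hypothetical data: column 0 is age in years, column 1 is yearly income.
X = np.array([
    [25.0, 30000.0],
    [40.0, 60000.0],
    [55.0, 90000.0],
])

# Min-max scaling: map each column to the range [0, 1].
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Standardization: shift each column to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std)
```

After either transformation both columns live on a comparable scale, so the income column no longer dominates distance-based or gradient-based algorithms simply because its raw numbers are larger.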


We have already touched on the importance of model deployment and of sharing the model with others. We need to share our model with stakeholders to collaborate or get feedback. Therefore, we need a web app with powerful and interactive content for the colleagues or clients to whom we want to showcase our work. Streamlit is the most practical and fastest way to create a web app. You can browse below to learn more about Streamlit and see simple model deployment examples.

In this tutorial, we will create a virtual environment. In this virtual environment, we will prepare a python…


Streamlit is an open-source Python library that makes it easy to build beautiful custom web-apps for machine learning and data science.

Data scientists work through the data science process to arrive at a solution and a model. At the end of this process they have a product, and they have to share it with their stakeholders in order to collaborate or get feedback. Stakeholders can be customers, colleagues, or anyone else, including people located in another city or perhaps another country.

To share the model, it is necessary to deploy it over the web. Deploying the model is the most…


We previously covered the issue of encoding and its importance. In short, machine learning models are mathematical models whose algorithms work with numerical data types, and neural networks also work with numerical data types. Therefore, we need encoding methods to convert non-numerical data to meaningful numerical data. We have covered the encoding methods and the options for applying them at this link.

In this story, we will look at the pandas get_dummies function. get_dummies is the easiest way to implement one hot encoding, and it has very useful parameters, of which we…
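
A minimal sketch of one hot encoding with `pd.get_dummies` follows. The DataFrame and its column names are hypothetical, and the two parameters shown (`columns` and `drop_first`, plus `dtype` to get 0/1 integers) are just a sample of what the function accepts, not necessarily the ones the article discusses.

```python
import pandas as pd

# Hypothetical data with one categorical and one numerical column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1, 2, 3, 2],
})

# One 0/1 column per category of "color"; "size" is left untouched.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

# drop_first=True removes the first category to avoid a redundant column.
encoded_drop = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(encoded)
print(encoded_drop)
```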


Label Encoding vs One Hot Encoding

We need numerical data in data science techniques such as machine learning and deep learning models. We start our analysis with categorical and numerical data types. When preparing the data for the model, we drop some categorical columns if we don’t need them, or we use techniques such as regex to extract numerical values. By “categorical data” I mean non-numeric data such as text, object, datetime, etc. There will always be some columns that we need but from which we cannot extract numerical values with regex or another function. …
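
To contrast the two approaches named in the title, here is a small sketch with pandas. The city names are hypothetical sample data, and using category codes for label encoding is one possible implementation chosen for illustration, not necessarily the one used in the article.

```python
import pandas as pd

# Hypothetical categorical column.
cities = pd.Series(["Paris", "Ankara", "Berlin", "Ankara"], name="city")

# Label encoding: one integer per category (categories are sorted alphabetically).
label_encoded = cities.astype("category").cat.codes

# One hot encoding: one 0/1 column per category, exactly one 1 per row.
one_hot = pd.get_dummies(cities, dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

Label encoding keeps a single column but implies an order between the integers, while one hot encoding avoids that implied order at the cost of one column per category.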


Data analysis is a long process with several steps. First of all, we need to get to know the data: we have to understand every feature in the dataset. Then we must detect the missing values and clean these NaN values from our dataset. We can fill them with some value (mean, median, etc.) or write our own function to fill them. We can also drop columns that are not helpful or that have more NaN values than the others.
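
The missing-value steps above can be sketched with pandas as follows. The DataFrame, its column names, and the 50% threshold for dropping a column are all assumptions made for illustration; the right choices depend on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values in every column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "score": [1.0, 2.0, np.nan, 9.0],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Step 1: detect the missing values per column.
missing_counts = df.isna().sum()

# Step 2: drop columns where more than half of the values are missing
# (the 0.5 threshold is an arbitrary, illustrative choice).
keep = df.columns[df.isna().mean() <= 0.5]
cleaned = df[keep].copy()

# Step 3: fill the remaining gaps, here with the mean and the median.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())
cleaned["score"] = cleaned["score"].fillna(cleaned["score"].median())

print(missing_counts)
print(cleaned)
```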

This process can change. It depends on the data and target. But we must…

Hasan Ersan YAĞCI

Data Scientist | Machine Learning Proficiency
