Yellowbrick: Machine Learning Visualization
Visualization is essential to making an analysis or modeling process understandable. We need it to inspect results and workflows, especially when working with machine learning algorithms.
Yellowbrick was created for this purpose. It is a visualization library that works with scikit-learn machine learning algorithms. It is not part of the scikit-learn-contrib projects, but it builds on the scikit-learn API to visualize classification, clustering, hyperparameter selection, model selection, and more. Yellowbrick generates its visualizations by wrapping Matplotlib, the most prominent Python scientific visualization library.
When we want to add a good visual to our workbook, or when we can't decide on parameters or interpret scores, we can rely on Yellowbrick. It not only produces good visualizations but also offers guidance on parameter choices.
Using Yellowbrick
pip install yellowbrick
The Yellowbrick API is specially designed to play nicely with scikit-learn. The workflow is very similar to using a scikit-learn transformer and the visualizers are intended to be integrated with scikit-learn utilities. In Yellowbrick, the primary interface is a visualizer. Visualizers are scikit-learn estimator objects.
To use the visualizers, simply use the same workflow as with a scikit-learn model. Import the visualizer, instantiate it, call the visualizer’s fit() method, then to render the visualization, call the visualizer’s show() method.
from sklearn.linear_model import LinearRegression
from yellowbrick.regressor import ResidualsPlot
# X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split
model = LinearRegression()
visualizer = ResidualsPlot(model)
visualizer.fit(X_train, y_train) # Fit the training data
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show() # Finalize and render the figure
In the example above, we instantiate LinearRegression from scikit-learn and pass it to the ResidualsPlot visualizer from Yellowbrick. We then call the fit, score, and show methods, just as in a scikit-learn workflow. Note that we never fit LinearRegression on the training data ourselves: after splitting the data, we hand the unfitted model to Yellowbrick, and the visualizer's fit method fits it.
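The snippet above assumes that X_train, X_test, y_train, and y_test already exist. As a minimal sketch of the quantities a residuals plot is built from, the following uses only scikit-learn on a synthetic dataset and computes the residuals by hand. The make_regression data and its parameters are stand-ins chosen for illustration, not data from this article, and the code does not reproduce Yellowbrick's internals.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: any numeric regression dataset would do)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# Residual = actual target minus predicted target
train_residuals = y_train - model.predict(X_train)
test_residuals = y_test - model.predict(X_test)

# R^2 on the held-out data, the score a scikit-learn regressor's score() returns
test_r2 = model.score(X_test, y_test)
print(f"test R^2: {test_r2:.3f}")
print(f"mean train residual: {train_residuals.mean():.2e}")
```

Plotting these residuals against the predicted values gives the kind of picture a residuals plot shows; a well-behaved linear model leaves residuals scattered around zero without an obvious pattern.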
The image can be saved as follows. In addition to the .png extension, .pdf is also commonly used for high-quality, publication-ready images.
visualizer.show(outpath="pcoords.png")
Usage areas of Yellowbrick
- Classification Visualization
- Clustering Visualization
- Feature Visualization
- Model Selection Visualization
- Regression Visualization
- Target Visualization
- Text Visualization
We can use Yellowbrick visualizers in different areas and at different stages of the machine learning pipeline. Good visualizations help the user make better-informed decisions and make the process easier.
For example, choosing the number of clusters is a difficult decision in clustering problems. Without prior knowledge, we have to find the number of clusters for the K-means algorithm ourselves. The elbow method helps, but picking the number of clusters from the curve is still left to us. Yellowbrick can make that choice for us and show it in a clear visualization.
See the Yellowbrick API documentation for details.
Let's look at a few examples with Yellowbrick:
Elbow Method
In a clustering problem, we need to choose the number of clusters, and the elbow method can help. We generate a dataset with 8 random clusters and apply Yellowbrick's KElbowVisualizer, using the calinski_harabasz score to select the k parameter.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from yellowbrick.cluster import KElbowVisualizer
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(
    model, k=(4,12), metric='calinski_harabasz', timings=False)
visualizer.fit(X) # Fit the data to the visualizer
visualizer.show() # Finalize and render the figure
We got the plot above, and it identified the parameter k correctly. If we use the classical elbow method, the plot looks like the one below, where 8 appears to be the breaking point.
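To make the selection step concrete without any plotting, the scores behind such a curve can be computed directly with scikit-learn. This is a rough sketch, not Yellowbrick's implementation: it assumes that k=(4,12) means candidate values 4 through 11, fixes n_init and random_state for repeatability (both are choices made here, not in the article), and simply picks the k with the highest Calinski-Harabasz score, which may differ from how the visualizer locates its elbow.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Same synthetic data as in the example above
X, y = make_blobs(n_samples=1000, n_features=12, centers=8, random_state=42)

scores = {}
for k in range(4, 12):  # candidate values 4..11
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = calinski_harabasz_score(X, labels)

# For Calinski-Harabasz, a higher score means denser, better-separated clusters
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(s, 1))
print("best k:", best_k)
```

Printing the scores per k gives the same information the plot encodes on its vertical axis, which can be handy when a figure is not available, for example in a batch job.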
Prediction Error Plot
A prediction error plot shows the actual targets from the dataset against the predicted values generated by our model, which lets us see how much variance is in the model. The plot also reports the model's score.
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import PredictionError
# Load a regression dataset
X, y = load_concrete()
# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Instantiate the linear model and visualizer
model = Lasso()
visualizer = PredictionError(model)
visualizer.fit(X_train, y_train) # Fit the training data
visualizer.score(X_test, y_test) # Evaluate the model
visualizer.show() # Finalize and render the figure
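The numbers behind a prediction error plot can be sketched with scikit-learn and NumPy alone. The example below is an illustration under stated assumptions: it swaps the concrete dataset for synthetic make_regression data, and it compares a least-squares line through the (actual, predicted) points with the identity line that a perfect model would produce. It is not the visualizer's own code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a regression dataset (assumption, not the concrete data)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Lasso()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Each (actual, predicted) pair is one point on a prediction error plot
pairs = np.column_stack([y_test, y_pred])

# Least-squares line through the points; a perfect model gives slope 1, intercept 0
slope, intercept = np.polyfit(y_test, y_pred, 1)
r2 = r2_score(y_test, y_pred)

print(pairs[:5])
print(f"best-fit line: y_pred = {slope:.3f} * y_test + {intercept:.3f}")
print(f"R^2: {r2:.3f}")
```

The further the fitted slope and intercept drift from 1 and 0, the more systematically the model over- or under-predicts across the target range.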
Alpha Selection
Regularization is designed to penalize model complexity: the higher the alpha, the less complex the model, which decreases the error due to variance (overfitting). Alphas that are too high, on the other hand, increase the error due to bias (underfitting). It is therefore important to choose an optimal alpha such that the error is minimized in both directions.
The AlphaSelection Visualizer demonstrates how different values of alpha influence model selection during the regularization of linear models. Generally speaking, alpha increases the effect of regularization, e.g. if alpha is zero there is no regularization, and the higher the alpha, the more the regularization parameter influences the final model.
import numpy as np
from sklearn.linear_model import LassoCV
from yellowbrick.datasets import load_concrete
from yellowbrick.regressor import AlphaSelection
# Load the regression dataset
X, y = load_concrete()
# Create a list of alphas to cross-validate against
alphas = np.logspace(-10, 1, 400)
# Instantiate the linear model and visualizer
model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)
visualizer.fit(X, y)
visualizer.show()
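The trade-off described above can also be inspected numerically. The following sketch fits LassoCV on synthetic data and averages the cross-validated error per alpha, which is roughly the kind of curve an alpha selection plot draws. The dataset, the narrower alpha grid, and cv=5 are assumptions made here to keep the example fast; they are not taken from the article.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in data (assumption, not the concrete dataset)
X, y = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

# Candidate alphas to cross-validate against
alphas = np.logspace(-3, 1, 50)
model = LassoCV(alphas=alphas, cv=5)
model.fit(X, y)

# mse_path_ has one row per alpha (ordered like model.alphas_) and one column per fold
mean_mse = model.mse_path_.mean(axis=1)
best_index = int(np.argmin(mean_mse))

print("alpha chosen by LassoCV:", model.alpha_)
print("alpha with lowest mean CV error:", model.alphas_[best_index])
```

Both printed values should agree, since LassoCV selects the alpha whose mean cross-validated error is lowest.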
Rank Features
A two-dimensional ranking of features utilizes a ranking algorithm that takes into account pairs of features at a time (e.g. joint plot analysis). The pairs of features are then ranked by score and visualized using the lower-left triangle of a feature co-occurrence matrix.
By default, the Rank2D visualizer utilizes the Pearson correlation score to detect colinear relationships.
from yellowbrick.datasets import load_credit
from yellowbrick.features import Rank2D
# Load the credit dataset
X, y = load_credit()
# Instantiate the visualizer with the Pearson ranking algorithm
visualizer = Rank2D(algorithm='pearson')
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show() # Finalize and render the figure
Conclusion
Yellowbrick is a machine learning visualization library that makes things easy. Building the machine learning model itself through Yellowbrick adds little, because we can already set parameters meaningfully in scikit-learn, but Yellowbrick can serve as a consultant throughout the visualization and machine learning process.
The official Yellowbrick website has all of this code; you can take it and apply it to your project easily.