scikit-learn pipelines – Mihai's Blog

Author: Mihai Avram | Date: 5/17/2020

Machine Learning has evolved far beyond just training a model on data and running that trained model to return classification results. To build Machine Learning solutions that run effectively in production environments, we must expand them to provision, clean, train, validate, and monitor the data and model at scale. This requires a new skill set: building Machine Learning pipelines.

Scikit-learn is a very popular Machine Learning framework, so let’s frame this idea around it and start with a simple pipeline example.

A Simple scikit-learn Machine Learning Pipeline

Scikit-learn is one of the most popular machine learning libraries implemented in Python, and the key piece here is the Pipeline class from sklearn.pipeline, as shown in the code. We start with the following code.

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Retrieving our data using our custom function
x_vals, y_vals = load_in_raw_data()

# Building the pipeline
pipeline = Pipeline([
    ('scalar_step', StandardScaler()),
    # More data preprocessing steps can go here
    ('dimensionality_reduction_step', PCA(n_components = 3)),
    ('classification_step', LogisticRegression())
])

# Running our pipeline against our data to fit and create the model
pipeline.fit(x_vals, y_vals)

Let us go through the code step by step.

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Here we import everything needed from the scikit-learn library in order to build our pipeline. You may need to import more based on the problem you have at hand and the steps your pipeline involves. For instance, if your pipeline involves a Naive Bayes classifier, then the following import would be needed.

>>> from sklearn.naive_bayes import GaussianNB
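As a rough sketch of how that import might be used, the snippet below swaps a Naive Bayes classifier into a two-step pipeline. The Iris dataset and the name nb_pipeline are assumptions made for illustration; they are not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the Iris dataset stands in for whatever data you are working with
x_vals, y_vals = load_iris(return_X_y=True)

# Same pipeline idea, with GaussianNB as the final classification step
nb_pipeline = Pipeline([
    ('scalar_step', StandardScaler()),
    ('classification_step', GaussianNB()),
])

nb_pipeline.fit(x_vals, y_vals)

# Predict labels for the first five rows
nb_predictions = nb_pipeline.predict(x_vals[:5])
print(nb_predictions)
```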

Next,

# Retrieving our data using our custom function
x_vals, y_vals = load_in_raw_data()

We call load_in_raw_data(), a function of our own that is not included in this post because its implementation varies from case to case. For instance, this function could load the popular Iris Data Set (originally from the UCI ML Repository) via sklearn.datasets.load_iris(), or it could simply load a file from disk.
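For readers who want something runnable, here is one possible version of load_in_raw_data(). It is only a hypothetical stand-in that assumes the Iris dataset; your own function would likely read from a file or a database instead.

```python
from sklearn.datasets import load_iris


def load_in_raw_data():
    """Hypothetical stand-in: return (features, labels) from the Iris dataset."""
    # return_X_y=True makes load_iris return a (data, target) tuple
    x_vals, y_vals = load_iris(return_X_y=True)
    return x_vals, y_vals


x_vals, y_vals = load_in_raw_data()
print(x_vals.shape, y_vals.shape)
```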

Afterward,

# Building the pipeline
pipeline = Pipeline([
    ('scalar_step', StandardScaler()),
    # More data preprocessing steps can go here
    ('dimensionality_reduction_step', PCA(n_components = 3)),
    ('classification_step', LogisticRegression())
])

We build our pipeline by providing a sequence of transformations that our dataset will go through. These transformations happen in the order they are provided, so the scalar_step will happen before the dimensionality_reduction_step. Note that you can include different transformations, and as many as you would like, depending on the Machine Learning problem you are looking to solve.
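To confirm the ordering, you can inspect the pipeline object itself. The short sketch below rebuilds the same pipeline and reads back the step names through the steps and named_steps attributes; the variable names step_names and pca_step are just illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scalar_step', StandardScaler()),
    ('dimensionality_reduction_step', PCA(n_components=3)),
    ('classification_step', LogisticRegression()),
])

# pipeline.steps is a list of (name, estimator) tuples in execution order
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Individual steps can be looked up by the name they were given
pca_step = pipeline.named_steps['dimensionality_reduction_step']
print(pca_step.n_components)
```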

Finally,

# Running our pipeline against our data to fit and create the model
pipeline.fit(x_vals, y_vals)

We run our data through our pipeline to fit our model to the target values provided (y_vals). You can later use the fitted pipeline to predict y values for new x values.
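Here is a minimal end-to-end sketch of fitting and then predicting. It assumes the Iris dataset in place of load_in_raw_data() and holds out a quarter of the rows as "new" x values; the split and the random_state are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: Iris stands in for the data returned by load_in_raw_data()
x_vals, y_vals = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x_vals, y_vals, test_size=0.25, random_state=0
)

pipeline = Pipeline([
    ('scalar_step', StandardScaler()),
    ('dimensionality_reduction_step', PCA(n_components=3)),
    ('classification_step', LogisticRegression()),
])

# Fit every step on the training data only
pipeline.fit(x_train, y_train)

# The held-out rows pass through the same scaling and PCA before classification
predictions = pipeline.predict(x_test)
accuracy = pipeline.score(x_test, y_test)
print(predictions[:5], accuracy)
```

Because the scaler and PCA are fitted inside the pipeline, the test rows are transformed with parameters learned from the training rows only.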

And voila! That’s the skinny on scikit-learn pipelines. For more information, you can check out the following resources, which can fortify your knowledge of scikit-learn pipelines.

Scikit-learn documentation – (https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

Scikit-learn pipeline examples from Queirozf – (http://queirozf.com/entries/scikit-learn-pipeline-examples)

Creating Sklearn pipelines in Python (Video)

Creating Pipelines Using Sklearn in Python (Video)

Full-fledged Machine Learning Pipeline Frameworks

Now imagine having to run this Machine Learning task in a full-fledged production environment serving many stakeholders, where the system needs to do the following:

Have the flexibility to quickly be configured and re-configured
Be able to scale quickly
Retrieve and clean data
Perform feature extraction and selection
Train the Machine Learning model(s)
Test and validate the Machine Learning model(s)
Monitor the running Machine Learning model(s)
Take care of algorithm biases, fairness, and safety
Send alerts if there are any anomalies in the system
Follow security best practices
Be fault-tolerant

The simple scikit-learn pipeline does not have the features to take care of these problems. This is where other DevOps frameworks and pipelines come in, which we will discuss next.

TensorFlow Machine Learning Pipeline With TFX

TensorFlow Extended (also known as TFX) is a production-grade pipeline framework created by Google. It works by separating Machine Learning tasks into different components that run in a sequence. An example of such a component may be a code segment that takes in the input data and splits it into training and test sets, while another example may be a code segment that trains a Logistic Regression model. All of these components run in a Directed Acyclic Graph (DAG), which is a technical way of saying that each component runs only after the components it depends on, without forming any loops (e.g. step B will always follow step A just once).

A typical TFX pipeline consists of the following components shown below.

ExampleGen – reads in the input data and can split it into training and test sets

StatisticsGen – computes statistics about the input dataset

SchemaGen – creates a schema for the input data and infers data types, categories, ranges, and more

ExampleValidator – validates the input data and checks for training/test skews or anomalies in the data

Transform – creates features from the input data

Trainer – trains the model based on the data and features

Evaluator – tests the trained model and performs validation checks as well as an analysis of the model to assess whether it is ready to be deployed in production

Pusher – deploys the trained, tested, and polished model to production

Here’s a simple example illustrating the Directed Acyclic Graph (DAG) of these steps using Apache Airflow.

[Image: Directed Acyclic Graph of the TFX pipeline components in Apache Airflow]

As you can see, the first step here is the CsvExampleGen, which feeds into the other steps and the steps do not loop around to the top. This way it creates a dependency graph whereby a step such as the Trainer cannot run until the SchemaGen and Transform have completed.
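The dependency idea can be sketched in plain Python with the standard library's graphlib module (Python 3.9+). The edges below are my assumption of a typical wiring between these components, written only to illustrate the DAG concept; this is not the TFX API and not necessarily the exact graph in the image.

```python
from graphlib import TopologicalSorter

# Maps each component to the set of components it must wait for.
# Assumption: these edges are illustrative, not read from a real TFX pipeline.
dependencies = {
    'CsvExampleGen': set(),
    'StatisticsGen': {'CsvExampleGen'},
    'SchemaGen': {'StatisticsGen'},
    'ExampleValidator': {'StatisticsGen', 'SchemaGen'},
    'Transform': {'CsvExampleGen', 'SchemaGen'},
    'Trainer': {'Transform', 'SchemaGen'},
    'Evaluator': {'CsvExampleGen', 'Trainer'},
    'Pusher': {'Trainer', 'Evaluator'},
}

# static_order() yields one valid execution order and raises
# graphlib.CycleError if the graph contains a loop
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

Any order the sorter returns places the Trainer after both the SchemaGen and the Transform, which is exactly the guarantee an orchestrator needs.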

An important bit that needs to be highlighted is that after each component finishes, it stores any output artifacts in a metadata store, which then get picked up as input to the next component. This is how the components can execute sequentially by feeding off of each other. As you may imagine, this complex sequential runtime of a Machine Learning pipeline needs to run under an orchestration service. The orchestration service takes care of hosting the pipeline on various machine nodes or even clusters. It can allocate memory, disk space, and processing power to the various tasks and compute nodes, as well as direct the flow of the pipeline in a simple manner. Examples of orchestrator tools one could use are Kubeflow and Apache Airflow.
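To make the artifact hand-off concrete, here is a toy sketch in which a plain dictionary plays the role of the metadata store. The function names and the artifact keys are hypothetical and only mimic the idea; real TFX components and their metadata store are far richer than this.

```python
# Toy in-memory stand-in for a metadata/artifact store (not the TFX API)
artifact_store = {}


def example_gen(store):
    # Writes the raw examples as an output artifact
    store['examples'] = [1.0, 2.0, 3.0, 4.0]


def statistics_gen(store):
    # Reads the examples artifact and writes summary statistics
    examples = store['examples']
    store['statistics'] = {
        'count': len(examples),
        'mean': sum(examples) / len(examples),
    }


def schema_gen(store):
    # Reads the statistics artifact and writes a simple schema
    stats = store['statistics']
    store['schema'] = {'type': 'float', 'expected_count': stats['count']}


# Each component picks up what the previous one stored
for component in (example_gen, statistics_gen, schema_gen):
    component(artifact_store)

print(artifact_store['schema'])
```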

TFX is a vast and complete Machine Learning pipeline framework that is best covered in a full course. This blog post cannot do it justice beyond a very cursory introduction to what it can do.

For more information, the TensorFlow Extended site has some great starting guides, examples, and tutorials to get you started!

Azure Machine Learning Pipelines

In the same vein of Machine Learning pipelines, another powerful offering is the Microsoft Azure Machine Learning pipeline. While this is a deep topic that requires its own post, it consists of the following steps.

First, one must sign up for the Azure service and create an Azure Machine Learning workspace. Then, one needs to set up the Azure ML SDK in order to configure the pipeline. Afterward, one needs to set up a datastore for storing artifacts from the pipeline in persistent storage, and a PipelineData object to allow data to flow easily between steps and let the pipeline steps communicate with each other. The final step involves configuring the compute nodes or targets on which the pipeline will run. The rest consists of code to create and launch the pipeline steps for data preparation, model training, model storage, model validation, model deployment, and monitoring. Are you seeing some patterns? This is somewhat similar to the TensorFlow Extended example.

For specifics, check out the following two articles, as they explain this topic at length.

What are ML Azure pipelines – (https://docs.microsoft.com/en-us/azure/machine-learning/concept-ml-pipelines)

Creating ML pipelines with the Azure ML SDK – (https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline)

There are also other notable Machine Learning pipeline frameworks that we should be aware of, highlighted below.

Keras using scikit-learn pipelines – (https://www.kaggle.com/residentmario/using-keras-models-with-scikit-learn-pipelines)

Apache Spark pipelines – (https://spark.apache.org/docs/latest/ml-pipeline.html)

AWS Machine Learning pipelines using Amazon SageMaker and Apache Airflow – (https://aws.amazon.com/blogs/machine-learning/build-end-to-end-machine-learning-workflows-with-amazon-sagemaker-and-apache-airflow/)

d6tflow (can use PyTorch as well) – (https://www.kdnuggets.com/2019/09/5-step-guide-scalable-deep-learning-pipelines-d6tflow.html)

Some general Python pipeline packages – (https://medium.com/@Minyus86/comparison-of-pipeline-workflow-packages-airflow-luigi-gokart-metaflow-kedro-pipelinex-5daf57c17e7)

AutoML – A Simpler Way to Leverage Machine Learning Pipelines


In case you haven’t made this observation yet, creating a Machine Learning pipeline can be incredibly time consuming and complex. AutoML aims to simplify this process by automating the intermediary steps, such as feature selection and model training/tuning, and going from the initial raw data to final predictions about that data. This is great because one can essentially build a Machine Learning pipeline with very little effort and have it compute results in no time. This does come with drawbacks, however. AutoML frameworks typically emphasize only performance as the end goal (i.e. did it classify well or not?), and often there is more to Machine Learning than performance, such as bias/fairness, as well as space and time complexity. Finally, AutoML can build some very powerful standard models, but if you have a more custom or unique problem that requires combining some esoteric Machine Learning and statistical concepts, or if you need to maximize performance and accuracy, you will be better off building the Machine Learning pipeline yourself.
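To get a feel for the kind of work AutoML automates, here is a deliberately tiny sketch that tries a few candidate classifiers inside a scikit-learn pipeline and keeps the one with the best cross-validated score. This is my own toy illustration, assuming the Iris dataset and three arbitrary candidates; real AutoML frameworks search far larger spaces of models, features, and hyperparameters.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris stands in for your raw data
x_vals, y_vals = load_iris(return_X_y=True)

# Assumption: three arbitrary candidate models to choose from
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'naive_bayes': GaussianNB(),
    'decision_tree': DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in candidates.items():
    candidate_pipeline = Pipeline([
        ('scalar_step', StandardScaler()),
        ('classification_step', model),
    ])
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(candidate_pipeline, x_vals, y_vals, cv=5).mean()

# Keep the candidate with the highest mean score
best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))
```

Note that this loop only looks at accuracy, which mirrors the drawback mentioned above: nothing here checks for bias, fairness, or runtime cost.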

Some leaders in this space are the following:

Google Cloud AutoML – (https://cloud.google.com/automl)

Auto Sklearn – (https://automl.github.io/auto-sklearn/master/)

H2O AutoML – (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html)

Auto Keras – (https://autokeras.com/)

Final Remarks

As we conclude, I want to leave you with a final note. If you want to be proficient in quickly learning and using Machine Learning pipeline tools, it may be worthwhile to add Docker to your Machine Learning skillset. Moreover, you should be familiar with Object-Oriented Programming (OOP) principles and have a good understanding of how you will organize all the different components of your Machine Learning applications (e.g. your input files, trainers, optimizers, validators, hyperparameters, etc.). David Chong wrote a good post to help you learn how to do this.

I hope this post has shed light on a more complex and progressive topic of Machine Learning that we should soon pick up on as responsible Data Scientists and Machine Learning engineers.

Cheers and happy coding!


Why Use Machine Learning Pipelines and What Frameworks Exist for Them?