A Simple Introduction to Pipelines in Machine Learning

Victoria Akintomide
May 15, 2021

A pipeline is a simple way of automating the machine learning workflow. It chains sequential steps such as data extraction, preprocessing, modeling and even deployment. The workflow is split into independent, reusable, modular parts that can then be pipelined together to build models.
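To make the idea concrete, here is a minimal sketch of a two-step pipeline, assuming a small toy dataset (the data, step names and model choice are illustrative, not from the article):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 100 samples, 3 features
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0])

# Two modular steps chained into a single estimator
pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('model', LinearRegression()),
])

# fit() runs scale.fit_transform, then model.fit
pipe.fit(X, y)
print(pipe.predict(X[:1]))
```

Once fitted, the pipeline behaves like a single estimator: calling predict applies the scaler's transform and then the model's predict in one step.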

Pipelines have some important benefits when building machine learning models.

  1. Cleaner Code: With a pipeline, there is no need to manually keep track of training and validation data at each preprocessing step.
  2. Fewer Bugs: A pipeline leaves fewer opportunities to forget a step, particularly a preprocessing step, or to apply steps in the wrong order.
  3. Easier to Productionize: Pipelines make it easier to move a model from a prototype to a deployable model.
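The second benefit is easy to demonstrate with cross-validation. This is a hedged sketch on hypothetical toy data (the Ridge model and step names are illustrative): because scaling lives inside the pipeline, it is re-fit on each training fold only, so validation data never leaks into preprocessing.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical toy regression data
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = X.sum(axis=1)

pipe = Pipeline(steps=[('scale', StandardScaler()), ('model', Ridge())])

# The scaler is re-fit on each training fold only; the validation
# fold never influences preprocessing, a common source of bugs
# when scaling is done manually before splitting.
scores = cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_absolute_error')
print('MAE per fold:', -scores)
```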

A simple machine learning pipeline can be created in three steps. I utilized the Melbourne Housing Snapshot Dataset on Kaggle for this example.

  1. Preprocessing step: The ColumnTransformer class from the scikit-learn library is used to bundle together different preprocessing steps. In the code below, I impute missing values in both numerical and categorical columns and also apply a one-hot encoding to the categorical columns.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Fill missing numerical values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy='constant')

# Fill missing categorical values with the most frequent value,
# then one-hot encode the result
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# numerical_cols and categorical_cols are lists of column names
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
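The column lists numerical_cols and categorical_cols above are assumed to hold the relevant column names. One way to build them is with pandas select_dtypes; this sketch uses a small hypothetical frame standing in for the Melbourne data:

```python
import pandas as pd

# Hypothetical stand-in for the Melbourne training frame
X_train = pd.DataFrame({
    'Rooms': [2, 3, None, 4],
    'Landsize': [150.0, 200.0, 120.0, None],
    'Type': ['h', 'u', 'h', None],
})

# Numeric dtypes go to the numerical transformer,
# object (string) dtypes to the categorical one
numerical_cols = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_cols = X_train.select_dtypes(include=['object']).columns.tolist()
print(numerical_cols)    # -> ['Rooms', 'Landsize']
print(categorical_cols)  # -> ['Type']
```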

2. Model definition: I can then define a model. Since predicting house prices is a regression task evaluated with mean absolute error, I will be using the RandomForestRegressor model from the sklearn library (a classifier such as logistic regression cannot fit a continuous target).

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

3. Create and Evaluate the Pipeline: The Pipeline class is used to define a pipeline that bundles the preprocessing and modeling steps.

from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline

my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)


score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
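Putting the three steps together, here is a hedged end-to-end sketch that runs on synthetic data standing in for the Melbourne snapshot (the column names, value ranges and price formula are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the housing data
rng = np.random.RandomState(0)
n = 200
X = pd.DataFrame({
    'rooms': rng.randint(1, 6, n).astype(float),
    'landsize': rng.uniform(50, 500, n),
    'type': rng.choice(['h', 'u', 't'], n),
})
# Inject a few missing values so the imputers have work to do
X.loc[::25, 'rooms'] = np.nan
X.loc[::40, 'type'] = np.nan
price = X['rooms'].fillna(3) * 100 + X['landsize'] * 2 + rng.normal(0, 10, n)

X_train, X_valid, y_train, y_valid = train_test_split(X, price, random_state=0)

# Step 1: preprocessing bundled with ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='constant'), ['rooms', 'landsize']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['type']),
])

# Steps 2 and 3: model definition, then the full pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])
my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)
mae = mean_absolute_error(y_valid, preds)
print('MAE:', mae)
```

Note that the raw frames X_train and X_valid are passed straight to the pipeline; all imputation and encoding happen inside fit and predict.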

Pipelines can be very helpful in cleaning up code and in keeping preprocessing and model evaluation consistent from prototype to production.

References:

https://www.kaggle.com/alexisbcook/pipelines
