scikit-learn

Lecture 12

Dr. Colin Rundel

scikit-learn

Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

  • Simple and efficient tools for predictive data analysis
  • Accessible to everybody, and reusable in various contexts
  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable - BSD license
import sklearn
sklearn.__version__
'1.8.0'

Installation

As you may have noticed, the package is called scikit-learn but the module is called sklearn, which is a common source of confusion.

To install the package use the longer name:

uv add scikit-learn

Previously you could also use sklearn as the package name, but this is no longer supported and will result in an error.

Submodules

The sklearn package contains a large number of submodules, each specialized for different tasks or models:

  • sklearn.base - Base classes and utility functions
  • sklearn.calibration - Probability Calibration
  • sklearn.cluster - Clustering
  • sklearn.compose - Composite Estimators
  • sklearn.covariance - Covariance Estimators
  • sklearn.cross_decomposition - Cross decomposition
  • sklearn.datasets - Datasets
  • sklearn.decomposition - Matrix Decomposition
  • sklearn.discriminant_analysis - Discriminant Analysis
  • sklearn.ensemble - Ensemble Methods
  • sklearn.exceptions - Exceptions and warnings
  • sklearn.experimental - Experimental
  • sklearn.feature_extraction - Feature Extraction
  • sklearn.feature_selection - Feature Selection
  • sklearn.gaussian_process - Gaussian Processes
  • sklearn.impute - Impute
  • sklearn.inspection - Inspection
  • sklearn.isotonic - Isotonic regression
  • sklearn.kernel_approximation - Kernel Approximation
  • sklearn.kernel_ridge - Kernel Ridge Regression
  • sklearn.linear_model - Linear Models
  • sklearn.manifold - Manifold Learning
  • sklearn.metrics - Metrics
  • sklearn.mixture - Gaussian Mixture Models
  • sklearn.model_selection - Model Selection
  • sklearn.multiclass - Multiclass classification
  • sklearn.multioutput - Multioutput regression and classification
  • sklearn.naive_bayes - Naive Bayes
  • sklearn.neighbors - Nearest Neighbors
  • sklearn.neural_network - Neural network models
  • sklearn.pipeline - Pipeline
  • sklearn.preprocessing - Preprocessing and Normalization
  • sklearn.random_projection - Random projection
  • sklearn.semi_supervised - Semi-Supervised Learning
  • sklearn.svm - Support Vector Machines
  • sklearn.tree - Decision Trees
  • sklearn.utils - Utilities

Model Fitting

Sample data

To begin, we will examine a simple data set on the size and weight of a number of books. The goal is to model the weight of a book using some combination of the other features in the data.

The included columns are:

  • volume - book volumes in cubic centimeters

  • weight - book weights in grams

  • cover - a categorical variable with levels "hb" hardback, "pb" paperback

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

books = pd.read_csv("data/daag_books.csv")
books
    volume  weight cover
0      885     800    hb
1     1016     950    hb
2     1125    1050    hb
3      239     350    hb
4      701     750    hb
5      641     600    hb
6     1228    1075    hb
7      412     250    pb
8      953     700    pb
9      929     650    pb
10    1492     975    pb
11     419     350    pb
12    1010     950    pb
13     595     425    pb
14    1034     725    pb

g = sns.relplot(data=books, x="volume", y="weight", hue="cover")

Linear regression

scikit-learn uses an object oriented system for implementing the various modeling approaches. For example, the class LinearRegression is part of the linear_model submodule.

from sklearn.linear_model import LinearRegression 

Each modeling class needs to be constructed (potentially with options) and then the resulting object will provide attributes and methods for fitting and using the model.

lm = LinearRegression()

m = lm.fit(
  X = books[["volume"]],
  y = books.weight
)
m.coef_
array([0.70863714])
m.intercept_
np.float64(107.67931061376612)
lm.coef_
array([0.70863714])
lm.intercept_
np.float64(107.67931061376612)

Note that lm and m are labels for the same underlying LinearRegression object, because fit() returns the estimator it was called on.
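The point about lm and m can be checked directly. The sketch below rebuilds the books data from the table printed earlier (rather than reading the CSV, which is an assumption made so the example is self-contained) and confirms that fit() hands back the same object.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Values copied from the books table printed earlier in these notes
books = pd.DataFrame({
    "volume": [885, 1016, 1125, 239, 701, 641, 1228, 412,
               953, 929, 1492, 419, 1010, 595, 1034],
    "weight": [800, 950, 1050, 350, 750, 600, 1075, 250,
               700, 650, 975, 350, 950, 425, 725],
    "cover": ["hb"] * 7 + ["pb"] * 8,
})

lm = LinearRegression()
m = lm.fit(X=books[["volume"]], y=books["weight"])

# fit() returns the estimator itself, so both names refer to one object
same_object = m is lm
print(same_object)

# Because of this, construction and fitting can be chained in one expression
chained = LinearRegression().fit(X=books[["volume"]], y=books["weight"])
print(chained.coef_, chained.intercept_)
```

In practice this means there is no need to keep both names around; the chained form is what later examples in these notes use.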

A couple of considerations

When fitting a model, scikit-learn expects X to be a 2d array-like object (e.g. a np.array or pd.DataFrame), so it will not accept objects like a pd.Series or 1d np.array.

lm.fit(
  X = books.volume,
  y = books.weight
)
ValueError: Expected a 2-dimensional container but got <class 'pandas.Series'> instead. Pass a DataFrame containing a single row (i.e. single sample) or a single column (i.e. single feature) instead.
lm.fit(
  X = np.array(books.volume),
  y = books.weight
)
ValueError: Expected 2D array, got 1D array instead:
array=[ 885 1016 1125  239  701  641 1228  412  953  929 1492  419 1010  595
 1034].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
lm.fit(
  X = np.array(books.volume).reshape(-1,1),
  y = books.weight
)
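As a quick reference, here are a few equivalent ways to turn a single feature into the 2d shape that fit() expects. This is only a sketch using the first few book volumes; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First few book volumes, as a 1d Series
volume = pd.Series([885, 1016, 1125, 239], name="volume")

as_frame = volume.to_frame()                    # Series -> one-column DataFrame
as_reshaped = volume.to_numpy().reshape(-1, 1)  # 1d array -> (n, 1) array
as_newaxis = volume.to_numpy()[:, np.newaxis]   # same result via indexing

print(volume.shape, as_frame.shape, as_reshaped.shape, as_newaxis.shape)
```

Selecting with a list of column names, as in books[["volume"]], is usually the most convenient option because the resulting DataFrame keeps the column name, which scikit-learn then records in feature_names_in_.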

Model parameters

Depending on the model being used, there will be a number of parameters that can be configured when constructing the model object or via the set_params() method.

lm.get_params()
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False, 'tol': 1e-06}
lm.set_params(fit_intercept = False)
LinearRegression(fit_intercept=False)
lm = lm.fit(X = books[["volume"]], y = books.weight)
lm.intercept_
0.0
lm.coef_
array([0.81932487])
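One detail worth seeing in isolation: set_params() only changes the configuration, so fitted attributes such as intercept_ stay as they were until fit() is called again. The toy data below (exactly y = 1 + 2x) is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the line y = 1 + 2x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])

# Setting a parameter in the constructor or via set_params() is equivalent
a = LinearRegression(fit_intercept=False)
b = LinearRegression().set_params(fit_intercept=False)
same_config = a.get_params() == b.get_params()

# Changing a parameter does not refit the model
lm = LinearRegression().fit(X, y)
lm.set_params(fit_intercept=False)
stale_intercept = lm.intercept_   # still the value from the earlier fit

lm.fit(X, y)                      # refit so the new setting takes effect
new_intercept = lm.intercept_

print(same_config, stale_intercept, new_intercept)
```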

Model prediction

Once the model coefficients have been fit, it is possible to predict from the model via the predict() method. This method requires a matrix-like X as input and, in the case of LinearRegression, returns an array of predicted y values.

lm.predict(X = books[["volume"]])
array([ 725.10251417,  832.43407276,  921.74048411,  195.81864507,
        574.34673721,  525.18724472, 1006.13094621,  337.5618484 ,
        780.81660565,  761.15280865, 1222.43271315,  343.29712253,
        827.51812351,  487.49830048,  847.1819205 ])
books = books.assign(
  pred = lambda x: lm.predict(X = x[["volume"]])
)
books
    volume  weight cover         pred
0      885     800    hb   725.102514
1     1016     950    hb   832.434073
2     1125    1050    hb   921.740484
3      239     350    hb   195.818645
4      701     750    hb   574.346737
5      641     600    hb   525.187245
6     1228    1075    hb  1006.130946
7      412     250    pb   337.561848
8      953     700    pb   780.816606
9      929     650    pb   761.152809
10    1492     975    pb  1222.432713
11     419     350    pb   343.297123
12    1010     950    pb   827.518124
13     595     425    pb   487.498300
14    1034     725    pb   847.181921

plt.figure()
sns.scatterplot(data=books, x="volume", y="weight", hue="cover")
sns.lineplot(data=books, x="volume", y="pred", color="c")
plt.show()

Residuals?

There is no built-in functionality for calculating residuals, so this needs to be done by hand.

books["resid"] = books["weight"] - books["pred"]
plt.figure(layout="constrained")
ax = sns.scatterplot(data=books, x="volume", y="resid", hue="cover")
ax.axhline(c="k", ls="--", lw=1)
plt.show()
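Since residuals are computed by hand, it can be handy to wrap the subtraction in a small function. The helper residuals() below is hypothetical (it is not part of scikit-learn), and the data are copied from the books table. As a sanity check, the residuals of a least squares fit with an intercept should sum to (numerically) zero, which is not the case for the model fit without an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Volume and weight columns copied from the books table
volume = np.array([885, 1016, 1125, 239, 701, 641, 1228, 412,
                   953, 929, 1492, 419, 1010, 595, 1034])
weight = np.array([800, 950, 1050, 350, 750, 600, 1075, 250,
                   700, 650, 975, 350, 950, 425, 725])
X = volume.reshape(-1, 1)

def residuals(model, X, y):
    """Observed minus predicted values (hypothetical helper, not in sklearn)."""
    return y - model.predict(X)

with_int = LinearRegression().fit(X, weight)
no_int = LinearRegression(fit_intercept=False).fit(X, weight)

resid_with = residuals(with_int, X, weight)
resid_no = residuals(no_int, X, weight)

# With an intercept the residuals sum to zero up to floating point error
print(np.isclose(resid_with.sum(), 0.0, atol=1e-8))
# Without an intercept there is no such guarantee
print(resid_no.sum())
```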

Categorical variables?

Scikit-learn expects the model matrix to be numeric before fitting:

lm = lm.fit(
  X = books[["volume", "cover"]],
  y = books.weight
)
ValueError: could not convert string to float: 'hb'

The solution here is to dummy code the categorical variables. This can be done with pandas via pd.get_dummies() or with a scikit-learn preprocessor.

pd.get_dummies(books[["volume", "cover"]])
    volume  cover_hb  cover_pb
0      885      True     False
1     1016      True     False
2     1125      True     False
3      239      True     False
4      701      True     False
5      641      True     False
6     1228      True     False
7      412     False      True
8      953     False      True
9      929     False      True
10    1492     False      True
11     419     False      True
12    1010     False      True
13     595     False      True
14    1034     False      True

Dummy coded model

lm = LinearRegression().fit(
  X = pd.get_dummies(books[["volume", "cover"]]),
  y = books.weight
)
lm.intercept_
np.float64(105.93920788192202)
lm.coef_
array([  0.71795374,  92.02363569, -92.02363569])

Do the above results look reasonable? What went wrong?

Quick comparison with R

d = read.csv('data/daag_books.csv')
d['cover_hb'] = ifelse(d$cover == "hb", 1, 0)
d['cover_pb'] = ifelse(d$cover == "pb", 1, 0)
lm = lm(weight~volume+cover_hb+cover_pb, data=d)
summary(lm)

Call:
lm(formula = weight ~ volume + cover_hb + cover_pb, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-110.10  -32.32  -16.10   28.93  210.95 

Coefficients: (1 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.91557   59.45408   0.234 0.818887    
volume        0.71795    0.06153  11.669  6.6e-08 ***
cover_hb    184.04727   40.49420   4.545 0.000672 ***
cover_pb           NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 78.2 on 12 degrees of freedom
Multiple R-squared:  0.9275,    Adjusted R-squared:  0.9154 
F-statistic: 76.73 on 2 and 12 DF,  p-value: 1.455e-07

Avoiding collinearity

lm1 = LinearRegression(
  fit_intercept = False
).fit(
  X = pd.get_dummies(
    books[["volume", "cover"]]
  ),
  y = books.weight
)
lm2 = LinearRegression(
  fit_intercept = True
).fit(
  X = pd.get_dummies(
    books[["volume", "cover"]], 
    drop_first=True
  ),
  y = books.weight
)
lm1.intercept_
0.0
lm1.coef_
array([  0.71795374, 197.96284357,  13.91557219])
lm1.feature_names_in_
array(['volume', 'cover_hb', 'cover_pb'], dtype=object)
lm2.intercept_
np.float64(197.96284357271747)
lm2.coef_
array([   0.71795374, -184.04727138])
lm2.feature_names_in_
array(['volume', 'cover_pb'], dtype=object)
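To make the collinearity concrete, the sketch below (data copied from the books table; dtype=float is passed to pd.get_dummies() only to keep the matrices purely numeric) checks the rank of the full dummy matrix once an intercept column is added, and confirms that the two fixes above describe the same fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

books = pd.DataFrame({
    "volume": [885, 1016, 1125, 239, 701, 641, 1228, 412,
               953, 929, 1492, 419, 1010, 595, 1034],
    "weight": [800, 950, 1050, 350, 750, 600, 1075, 250,
               700, 650, 975, 350, 950, 425, 725],
    "cover": ["hb"] * 7 + ["pb"] * 8,
})

X_full = pd.get_dummies(books[["volume", "cover"]], dtype=float)
X_drop = pd.get_dummies(books[["volume", "cover"]], dtype=float, drop_first=True)

# Intercept + both dummy columns: cover_hb + cover_pb equals the column of ones,
# so the 4 column matrix only has rank 3
with_intercept = np.column_stack([np.ones(len(books)), X_full.to_numpy()])
rank = np.linalg.matrix_rank(with_intercept)
print(with_intercept.shape[1], rank)

# Fix 1: keep both dummies, drop the intercept
lm1 = LinearRegression(fit_intercept=False).fit(X_full, books["weight"])
# Fix 2: keep the intercept, drop one dummy
lm2 = LinearRegression(fit_intercept=True).fit(X_drop, books["weight"])

# Same model, different parameterization: the predictions agree
same_fit = np.allclose(lm1.predict(X_full), lm2.predict(X_drop))
# and the pb effect in lm2 is the difference of the two level coefficients in lm1
pb_minus_hb = lm1.coef_[2] - lm1.coef_[1]
print(same_fit, pb_minus_hb, lm2.coef_[1])
```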

Preprocessors

These are a collection of transformer classes present in the sklearn.preprocessing submodule that are designed to help with the preparation of raw feature data into quantities more suitable for downstream modeling tools.

Like the modeling classes, they have an object oriented design that shares a common interface (methods and attributes) for bringing in data, transforming it, and returning it.

OneHotEncoder

For dummy coding we can use the OneHotEncoder preprocessor. Its default is one-hot encoding (one column per level), but standard dummy coding (dropping a reference level) can be achieved via the drop parameter.

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(sparse_output=False)
enc.fit(X = books[["cover"]])
OneHotEncoder(sparse_output=False)
enc.transform(X = books[["cover"]])
array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.]])
enc = OneHotEncoder(
  sparse_output=False, drop="first"
)
enc.fit_transform(X = books[["cover"]])
array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.],
       [1.]])

Other useful bits

enc = OneHotEncoder(sparse_output=False).fit(X = books[["cover"]])
enc.get_feature_names_out()
array(['cover_hb', 'cover_pb'], dtype=object)
f = enc.transform(X = books[["cover"]])
f
array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [0., 1.]])
enc.inverse_transform(f)
array([['hb'],
       ['hb'],
       ['hb'],
       ['hb'],
       ['hb'],
       ['hb'],
       ['hb'],
       ['pb'],
       ['pb'],
       ['pb'],
       ['pb'],
       ['pb'],
       ['pb'],
       ['pb'],
       ['pb']], dtype=object)

A cautionary note

Unlike pd.get_dummies(), it is not safe to use OneHotEncoder with both numerical and categorical features, as the former will also be transformed.

enc = OneHotEncoder(sparse_output=False)
X = enc.fit_transform(X = books[["volume", "cover"]])
pd.DataFrame(data=X, columns = enc.get_feature_names_out())
    volume_239  volume_412  volume_419  ...  volume_1492  cover_hb  cover_pb
0          0.0         0.0         0.0  ...          0.0       1.0       0.0
1          0.0         0.0         0.0  ...          0.0       1.0       0.0
2          0.0         0.0         0.0  ...          0.0       1.0       0.0
3          1.0         0.0         0.0  ...          0.0       1.0       0.0
4          0.0         0.0         0.0  ...          0.0       1.0       0.0
5          0.0         0.0         0.0  ...          0.0       1.0       0.0
6          0.0         0.0         0.0  ...          0.0       1.0       0.0
7          0.0         1.0         0.0  ...          0.0       0.0       1.0
8          0.0         0.0         0.0  ...          0.0       0.0       1.0
9          0.0         0.0         0.0  ...          0.0       0.0       1.0
10         0.0         0.0         0.0  ...          1.0       0.0       1.0
11         0.0         0.0         1.0  ...          0.0       0.0       1.0
12         0.0         0.0         0.0  ...          0.0       0.0       1.0
13         0.0         0.0         0.0  ...          0.0       0.0       1.0
14         0.0         0.0         0.0  ...          0.0       0.0       1.0

[15 rows x 17 columns]

Putting it together

cover = OneHotEncoder(
  sparse_output=False
).fit_transform(
  books[["cover"]]
)
X = np.c_[books.volume, cover]

lm2 = LinearRegression(
  fit_intercept=False
).fit(
  X = X,
  y = books.weight
)

lm2.coef_
array([  0.71795374, 197.96284357,  13.91557219])
books["pred2"] = lm2.predict(X=X)
books.drop(
  ["pred", "resid"], 
  axis=1
)
    volume  weight cover        pred2
0      885     800    hb   833.351907
1     1016     950    hb   927.403847
2     1125    1050    hb  1005.660805
3      239     350    hb   369.553788
4      701     750    hb   701.248418
5      641     600    hb   658.171193
6     1228    1075    hb  1079.610041
7      412     250    pb   309.712515
8      953     700    pb   698.125490
9      929     650    pb   680.894600
10    1492     975    pb  1085.102558
11     419     350    pb   314.738191
12    1010     950    pb   739.048853
13     595     425    pb   441.098050
14    1034     725    pb   756.279743

Model fit

Model residuals

Model performance

Scikit-learn comes with a number of built-in functions for measuring model performance in the sklearn.metrics submodule. These are generally just functions that take the vectors y_true and y_pred and return a scalar score.

import sklearn.metrics as metrics 
metrics.r2_score(books.weight, books.pred)
0.7800969547785039
metrics.mean_squared_error(
  books.weight, books.pred
)
14833.682083774476
metrics.root_mean_squared_error(
  books.weight, books.pred
)
121.79360444528471
metrics.r2_score(books.weight, books.pred2)
0.927477575682168
metrics.mean_squared_error(
  books.weight, books.pred2
) 
4892.04042259509
metrics.root_mean_squared_error(
  books.weight, books.pred2
)
69.94312276839725
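To see what these scores measure, the sketch below recomputes R squared, MSE, and RMSE from their definitions with NumPy for the dummy coded model and compares them to the sklearn.metrics results. The data are copied from the books table, and the RMSE is taken as the square root of the MSE rather than calling root_mean_squared_error, only so the sketch also runs on older scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression

books = pd.DataFrame({
    "volume": [885, 1016, 1125, 239, 701, 641, 1228, 412,
               953, 929, 1492, 419, 1010, 595, 1034],
    "weight": [800, 950, 1050, 350, 750, 600, 1075, 250,
               700, 650, 975, 350, 950, 425, 725],
    "cover": ["hb"] * 7 + ["pb"] * 8,
})

X = pd.get_dummies(books[["volume", "cover"]], dtype=float)
y = books["weight"].to_numpy()
y_pred = LinearRegression(fit_intercept=False).fit(X, y).predict(X)

# Definitions written out by hand
ss_res = np.sum((y - y_pred) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot
mse_manual = ss_res / len(y)
rmse_manual = np.sqrt(mse_manual)

# The same quantities from sklearn.metrics
r2_sk = metrics.r2_score(y, y_pred)
mse_sk = metrics.mean_squared_error(y, y_pred)

print(r2_manual, r2_sk)
print(mse_manual, mse_sk, rmse_manual)
```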

Exercise 1

Create and fit a model for the books data that includes an interaction effect between volume and cover.

You will need to do this manually with pd.get_dummies() and some additional data munging.

The data can be read into pandas with,

books = pd.read_csv(
  "https://sta663-sp26.github.io/slides/data/daag_books.csv"
)

Other transformers

Polynomial regression

We will now look at another flavor of regression model that involves both preprocessing and a hyperparameter, namely polynomial regression.

df = pd.read_csv("data/gp.csv")
sns.relplot(data=df, x="x", y="y")

By hand

It is certainly possible to construct the necessary model matrix by hand (or even use a function to automate the process), but this is generally less than desirable, particularly if we want to do anything fancy (e.g. cross validation).

X = np.c_[
    np.ones(df.shape[0]),
    df.x,
    df.x**2,
    df.x**3
]

plm = LinearRegression(
  fit_intercept = False
).fit(
  X=X, y=df.y
)

plm.coef_
array([ 2.36985684, -8.49429068, 13.95066369, -8.39215284])
df["y_pred"] = plm.predict(X=X)

plt.figure(layout="constrained")
sns.scatterplot(data=df, x="x", y="y")
sns.lineplot(data=df, x="x", y="y_pred", color="k")
plt.show()

X = np.c_[
    np.ones(df.shape[0]), df.x,
    df.x**2, df.x**3,
    df.x**4, df.x**5
]

plm = LinearRegression(
  fit_intercept = False
).fit(
  X=X, y=df.y
)
df["y_pred"] = plm.predict(X=X)
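One way to automate the by-hand construction is NumPy's np.vander(), which builds the columns 1, x, x^2, ... in a single call. The sketch below uses simulated x values as a stand-in for df.x (an assumption, since gp.csv is not reproduced here), and poly_matrix() is a hypothetical helper name.

```python
import numpy as np

# Simulated stand-in for df.x
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)

def poly_matrix(x, degree):
    """Model matrix with columns x^0, x^1, ..., x^degree (hypothetical helper)."""
    return np.vander(x, N=degree + 1, increasing=True)

# The cubic model matrix written out by hand, as above
by_hand = np.c_[np.ones(x.shape[0]), x, x**2, x**3]
automated = poly_matrix(x, degree=3)

print(automated.shape, np.allclose(by_hand, automated))
```

This removes the repetition, but the degree is still managed outside of the model object, which is part of what makes cross validation awkward with this approach.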

PolynomialFeatures

This is another transformer class from sklearn.preprocessing that simplifies the process of constructing polynomial features for your model matrix. Usage is similar to that of OneHotEncoder.

from sklearn.preprocessing import PolynomialFeatures
X = np.array(range(6)).reshape(-1,1)
pf = PolynomialFeatures(degree=3)
pf = pf.fit(X)
pf.transform(X)
array([[  1.,   0.,   0.,   0.],
       [  1.,   1.,   1.,   1.],
       [  1.,   2.,   4.,   8.],
       [  1.,   3.,   9.,  27.],
       [  1.,   4.,  16.,  64.],
       [  1.,   5.,  25., 125.]])
pf.get_feature_names_out()
array(['1', 'x0', 'x0^2', 'x0^3'], dtype=object)
pf = PolynomialFeatures(
  degree=2, include_bias=False
)
pf.fit_transform(X)
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  4.],
       [ 3.,  9.],
       [ 4., 16.],
       [ 5., 25.]])
pf.get_feature_names_out()
array(['x0', 'x0^2'], dtype=object)

Interactions

If the feature matrix X has more than one column, then the PolynomialFeatures transformer will also include interaction terms, generating all terms with total degree up to degree.

X.reshape(-1, 2)
array([[0, 1],
       [2, 3],
       [4, 5]])
pf = PolynomialFeatures(
  degree=2, include_bias=False
)
pf.fit_transform(
  X.reshape(-1, 2)
)
array([[ 0.,  1.,  0.,  0.,  1.],
       [ 2.,  3.,  4.,  6.,  9.],
       [ 4.,  5., 16., 20., 25.]])
pf.get_feature_names_out()
array(['x0', 'x1', 'x0^2', 'x0 x1', 'x1^2'], dtype=object)
X.reshape(-1, 3)
array([[0, 1, 2],
       [3, 4, 5]])
pf = PolynomialFeatures(
  degree=2, include_bias=False
)
pf.fit_transform(
  X.reshape(-1, 3)
)
array([[ 0.,  1.,  2.,  0.,  0.,  0.,  1.,  2.,  4.],
       [ 3.,  4.,  5.,  9., 12., 15., 16., 20., 25.]])
pf.get_feature_names_out()
array(['x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2',
       'x2^2'], dtype=object)

Modeling with PolynomialFeatures

from sklearn.metrics import root_mean_squared_error as rmse
def poly_model(X, y, degree):
  X  = PolynomialFeatures(
    degree=degree, include_bias=False
  ).fit_transform(
    X=X
  )
  y_pred = LinearRegression(
  ).fit(
    X=X, y=y
  ).predict(
    X
  )
  return rmse(y, y_pred)
poly_model(X=df[["x"]], y=df.y, degree=2)
0.5449418707295371
poly_model(X=df[["x"]], y=df.y, degree=3)
0.5208157900621085
degrees = range(1,10)
rmses = [
  poly_model(X=df[["x"]], y=df.y, degree=d) 
  for d in degrees
]
g = sns.relplot(x=degrees, y=rmses)

Pipelines

You may have noticed that PolynomialFeatures takes a model matrix as input and returns a new model matrix as output, which is then used as the input for LinearRegression. This is not an accident: sklearn is structured so that such steps can be connected together into what it calls a pipeline.

from sklearn.pipeline import make_pipeline

p = make_pipeline(
  PolynomialFeatures(degree=4),
  LinearRegression()
)
p
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=4)),
                ('linearregression', LinearRegression())])
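The step names in the output above are generated by make_pipeline() from the lowercased class names. If shorter or more descriptive names are preferred, the Pipeline class can be used directly with explicit (name, estimator) pairs. The names "poly" and "lm" below are arbitrary choices for illustration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Names are derived automatically from the class names
p_auto = make_pipeline(
    PolynomialFeatures(degree=4),
    LinearRegression(),
)

# Names are supplied explicitly ("poly" and "lm" are arbitrary)
p_named = Pipeline(steps=[
    ("poly", PolynomialFeatures(degree=4)),
    ("lm", LinearRegression()),
])

auto_names = list(p_auto.named_steps)
explicit_names = list(p_named.named_steps)
print(auto_names, explicit_names)
```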

Using Pipelines

Once constructed, this object can be used just like our previous LinearRegression model (i.e. fit to our data and then used for prediction).

p = p.fit(X = df[["x"]], y = df.y)
p.predict(X = df[["x"]])
array([ 1.6295693 ,  1.65734929,  1.6610466 ,  1.67779767,  1.69667491,
        1.70475286,  1.75280126,  1.78471392,  1.79049912,  1.82690007,
        1.82966357,  1.83376043,  1.84494343,  1.86002819,  1.86228095,
        1.86619112,  1.86837909,  1.87065283,  1.88417882,  1.8844024 ,
        1.88527174,  1.88577463,  1.88544367,  1.86890805,  1.86365035,
        1.86252922,  1.86047349,  1.85377801,  1.84937708,  1.83754576,
        1.82623453,  1.82024199,  1.81799793,  1.79767794,  1.77255319,
        1.77034143,  1.76574288,  1.75371272,  1.74389585,  1.73804309,
        1.73356954,  1.65527727,  1.64812184,  1.61867613,  1.6041325 ,
        1.5960389 ,  1.56080881,  1.55036459,  1.54004364,  1.50903953,
        1.45096594,  1.43589836,  1.41886389,  1.39423307,  1.36180712,
        1.23072992,  1.21355164,  1.11776117,  1.11522002,  1.09595388,
        1.06449719,  1.04672121,  1.03662739,  1.01407206,  0.98208703,
        0.98081577,  0.96176797,  0.87491417,  0.87117573,  0.84223005,
        0.84171166,  0.82875003,  0.8085086 ,  0.79166069,  0.78167248,
        0.78078036,  0.73538157,  0.7181484 ,  0.70046945,  0.67233502,
        0.67229069,  0.64782899,  0.64050946,  0.63726823,  0.63526047,
        0.62323271,  0.61965166,  0.61705548,  0.6141438 ,  0.60978056,
        0.60347713,  0.5909255 ,  0.566617  ,  0.50905785,  0.44706202,
        0.44177711,  0.43291379,  0.40957833,  0.38480262,  0.38288511,
        0.38067928,  0.3791518 ,  0.37610476,  0.36932957,  0.36493067,
        0.35806518,  0.3475729 ,  0.3466828 ,  0.33332696,  0.30717941,
        0.3006981 ,  0.29675876,  0.29337641,  0.29333354,  0.27631567,
        0.26899076,  0.2676092 ,  0.2672602 ,  0.26716133,  0.26241605,
        0.25405246,  0.25334542,  0.25322869,  0.25322576,  0.25410989,
        0.25622496,  0.25808334,  0.25849729,  0.26029845,  0.26043195,
        0.26319956,  0.26466962,  0.26480578,  0.2648598 ,  0.26488966,
        0.28177285,  0.28525208,  0.28861016,  0.28917644,  0.29004253,
        0.29444629,  0.29559749,  0.30233373,  0.30622039,  0.31322114,
        0.31798208,  0.32104799,  0.32700307,  0.32822585,  0.32927281,
        0.3326599 ,  0.33397022,  0.33710573,  0.34110873,  0.34140708,
        0.34707419,  0.35926445,  0.37678278,  0.37774536,  0.38884519,
        0.39078249,  0.39517758,  0.40743395,  0.41040931,  0.42032703,
        0.43577431,  0.46157615,  0.46668313,  0.47144763,  0.47196742,
        0.47425178,  0.47510175,  0.47762453,  0.48381558,  0.48473821,
        0.4906733 ,  0.50202549,  0.50448149,  0.50674907,  0.50959756,
        0.51456778,  0.51694399,  0.51848152,  0.52576027,  0.53292675,
        0.53568264,  0.53601729,  0.53790775,  0.53878741,  0.53876248,
        0.53838784,  0.53822688,  0.53756849,  0.53748661,  0.53650016,
        0.53481469,  0.53372126,  0.53274257,  0.52871724,  0.52377536,
        0.52346188,  0.52313791,  0.52286872,  0.49655523,  0.49552641,
        0.47578596,  0.4669369 ,  0.43757684,  0.38609879,  0.38104404,
        0.31131919,  0.2984486 ,  0.28774333,  0.27189053,  0.25239709,
        0.2384553 ,  0.22915234,  0.17792316,  0.17355182,  0.09982541,
        0.09880754,  0.09413432,  0.09001771,  0.0844749 ,  0.01787073,
       -0.00849026, -0.03051945, -0.06842454, -0.09116713, -0.10695813,
       -0.13889128, -0.20217854, -0.2210452 , -0.23334664, -0.39045798,
       -0.46280636, -0.47155946, -0.48247123, -0.5697079 , -0.57972246,
       -0.68977946, -0.81351875, -0.83477874, -0.88303201, -0.91521502,
       -0.96937509, -0.99388351, -1.1634133 , -1.19336585, -1.21548881])

plt.figure(layout="constrained")
sns.scatterplot(data=df, x="x", y="y")
sns.lineplot(x=df.x, y=p.predict(X = df[["x"]]), color="k")
plt.show()
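To confirm that a pipeline is just the individual steps chained together, the sketch below fits a pipeline and the equivalent two manual steps and compares their predictions. The data are simulated (an assumption standing in for gp.csv), and X_new is a made-up set of new inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated data standing in for the x and y columns of gp.csv
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(40, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=40)

# Pipeline: transform and fit in one object
p = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X, y)

# The same two steps carried out manually
pf = PolynomialFeatures(degree=3).fit(X)
lm = LinearRegression().fit(pf.transform(X), y)

same_in_sample = np.allclose(p.predict(X), lm.predict(pf.transform(X)))

# For new data the pipeline applies the already fitted transformer for us
X_new = np.array([[0.25], [0.5], [0.75]])
same_new = np.allclose(p.predict(X_new), lm.predict(pf.transform(X_new)))
print(same_in_sample, same_new)
```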

Model coefficients (or other attributes)

The attributes of pipeline steps are not directly accessible from the pipeline object, but can be reached via the steps or named_steps attributes:

p.coef_
AttributeError: 'Pipeline' object has no attribute 'coef_'
p.steps
[('polynomialfeatures', PolynomialFeatures(degree=4)), ('linearregression', LinearRegression())]
p.steps[1][1].coef_
array([  0.        ,   7.39051417, -57.67175293, 102.72227443,
       -55.38181361])
p.named_steps["linearregression"].intercept_
np.float64(1.6136636604768198)

Other useful bits

p.steps[0][1].get_feature_names_out()
array(['1', 'x', 'x^2', 'x^3', 'x^4'], dtype=object)
p.steps[1][1].get_params()
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False, 'tol': 1e-06}

Anyone notice a problem?

p.steps[1][1].rank_
4
p.steps[1][1].n_features_in_
5

What about step parameters?

By accessing each step we can adjust its parameters (via set_params()):

p.named_steps["linearregression"].get_params()
{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False, 'tol': 1e-06}
p.named_steps["linearregression"].set_params(
  fit_intercept=False
)
LinearRegression(fit_intercept=False)
p.fit(X = df[["x"]], y = df.y)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=4)),
                ('linearregression', LinearRegression(fit_intercept=False))])
p.named_steps["linearregression"].intercept_
0.0
p.named_steps["linearregression"].coef_
array([  1.61366366,   7.39051417, -57.67175293, 102.72227443,
       -55.38181361])

Pipeline parameter names

These parameters can also be accessed directly at the pipeline level. Names are constructed as step name + __ + parameter name:

p.get_params()
{'memory': None, 'steps': [('polynomialfeatures', PolynomialFeatures(degree=4)), ('linearregression', LinearRegression(fit_intercept=False))], 'transform_input': None, 'verbose': False, 'polynomialfeatures': PolynomialFeatures(degree=4), 'linearregression': LinearRegression(fit_intercept=False), 'polynomialfeatures__degree': 4, 'polynomialfeatures__include_bias': True, 'polynomialfeatures__interaction_only': False, 'polynomialfeatures__order': 'C', 'linearregression__copy_X': True, 'linearregression__fit_intercept': False, 'linearregression__n_jobs': None, 'linearregression__positive': False, 'linearregression__tol': 1e-06}
p.set_params(
  linearregression__fit_intercept=True, 
  polynomialfeatures__include_bias=False
)
Pipeline(steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=4, include_bias=False)),
                ('linearregression', LinearRegression())])

p.fit(X = df[["x"]], y = df.y)
Pipeline(steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=4, include_bias=False)),
                ('linearregression', LinearRegression())])
p.named_steps["polynomialfeatures"].get_feature_names_out()
array(['x', 'x^2', 'x^3', 'x^4'], dtype=object)
p.named_steps["linearregression"].intercept_
np.float64(1.6136636604768482)
p.named_steps["linearregression"].coef_
array([  7.39051417, -57.67175293, 102.72227443, -55.38181361])
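This setting also addresses the rank issue seen earlier: when PolynomialFeatures adds a bias column and LinearRegression fits its own intercept, the constant is represented twice. The sketch below, on simulated data (an assumption standing in for gp.csv), compares rank_ with n_features_in_ before and after turning off include_bias through the pipeline level parameter name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated data standing in for gp.csv
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=50)

p = make_pipeline(PolynomialFeatures(degree=4), LinearRegression()).fit(X, y)

# Bias column plus fitted intercept: 5 input columns, but one is redundant
lm_step = p.named_steps["linearregression"]
before = (int(lm_step.rank_), int(lm_step.n_features_in_))

# Adjust the transformer through the pipeline, then refit
p.set_params(polynomialfeatures__include_bias=False).fit(X, y)
lm_step = p.named_steps["linearregression"]
after = (int(lm_step.rank_), int(lm_step.n_features_in_))

print(before, after)
```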

Column Transformers

Column transformers are a tool for selectively applying transformer(s) to column(s) of an array or DataFrame. They function similarly to a pipeline and likewise have a make_ helper function.

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
ct = make_column_transformer(
  (StandardScaler(), ["volume"]),
  (OneHotEncoder(), ["cover"]),
).fit(
  books
)
ct.get_feature_names_out()
array(['standardscaler__volume', 'onehotencoder__cover_hb',
       'onehotencoder__cover_pb'], dtype=object)
ct.transform(books)
array([[ 0.12100717,  1.        ,  0.        ],
       [ 0.51996539,  1.        ,  0.        ],
       [ 0.85192299,  1.        ,  0.        ],
       [-1.84637457,  1.        ,  0.        ],
       [-0.43936162,  1.        ,  0.        ],
       [-0.62209057,  1.        ,  0.        ],
       [ 1.1656077 ,  1.        ,  0.        ],
       [-1.31950608,  0.        ,  1.        ],
       [ 0.32809999,  0.        ,  1.        ],
       [ 0.25500841,  0.        ,  1.        ],
       [ 1.9696151 ,  0.        ,  1.        ],
       [-1.2981877 ,  0.        ,  1.        ],
       [ 0.5016925 ,  0.        ,  1.        ],
       [-0.76218277,  0.        ,  1.        ],
       [ 0.57478408,  0.        ,  1.        ]])

Keeping or dropping other columns

Another important argument is remainder, which determines what happens to unspecified columns. The default is "drop", which is why weight was removed above; the alternative is "passthrough", which retains the untransformed columns.

ct = make_column_transformer(
  (StandardScaler(), ["volume"]),
  (OneHotEncoder(), ["cover"]),
  remainder = "passthrough"
).fit(
  books
)
ct.get_feature_names_out()
array(['standardscaler__volume', 'onehotencoder__cover_hb',
       'onehotencoder__cover_pb', 'remainder__weight'], dtype=object)
ct.transform(books)
array([[ 1.2101e-01,  1.0000e+00,  0.0000e+00,  8.0000e+02],
       [ 5.1997e-01,  1.0000e+00,  0.0000e+00,  9.5000e+02],
       [ 8.5192e-01,  1.0000e+00,  0.0000e+00,  1.0500e+03],
       [-1.8464e+00,  1.0000e+00,  0.0000e+00,  3.5000e+02],
       [-4.3936e-01,  1.0000e+00,  0.0000e+00,  7.5000e+02],
       [-6.2209e-01,  1.0000e+00,  0.0000e+00,  6.0000e+02],
       [ 1.1656e+00,  1.0000e+00,  0.0000e+00,  1.0750e+03],
       [-1.3195e+00,  0.0000e+00,  1.0000e+00,  2.5000e+02],
       [ 3.2810e-01,  0.0000e+00,  1.0000e+00,  7.0000e+02],
       [ 2.5501e-01,  0.0000e+00,  1.0000e+00,  6.5000e+02],
       [ 1.9696e+00,  0.0000e+00,  1.0000e+00,  9.7500e+02],
       [-1.2982e+00,  0.0000e+00,  1.0000e+00,  3.5000e+02],
       [ 5.0169e-01,  0.0000e+00,  1.0000e+00,  9.5000e+02],
       [-7.6218e-01,  0.0000e+00,  1.0000e+00,  4.2500e+02],
       [ 5.7478e-01,  0.0000e+00,  1.0000e+00,  7.2500e+02]])

Column selection

One lingering issue with the above approach is that we’ve had to hard code the column names (or use indexes). Often we want to select columns based on their dtype (e.g. categorical vs numerical); this can be done via pandas or sklearn:

from sklearn.compose import make_column_selector
ct1 = make_column_transformer(
  ( StandardScaler(),
    make_column_selector(
      dtype_include=np.number
    )
  ),
  ( OneHotEncoder(),
    make_column_selector(
      dtype_include=[str, bool]
    )
  )
)
ct2 = make_column_transformer(
  ( StandardScaler(),
    books.select_dtypes(
      include=['number']
    ).columns
  ),
  ( OneHotEncoder(),
    books.select_dtypes(
      include=['str']
    ).columns
  )
)

ct1.fit_transform(books)
array([[ 0.121 ,  0.3594,  1.    ,  0.    ],
       [ 0.52  ,  0.9369,  1.    ,  0.    ],
       [ 0.8519,  1.3219,  1.    ,  0.    ],
       [-1.8464, -1.3733,  1.    ,  0.    ],
       [-0.4394,  0.1668,  1.    ,  0.    ],
       [-0.6221, -0.4107,  1.    ,  0.    ],
       [ 1.1656,  1.4182,  1.    ,  0.    ],
       [-1.3195, -1.7583,  0.    ,  1.    ],
       [ 0.3281, -0.0257,  0.    ,  1.    ],
       [ 0.255 , -0.2182,  0.    ,  1.    ],
       [ 1.9696,  1.0332,  0.    ,  1.    ],
       [-1.2982, -1.3733,  0.    ,  1.    ],
       [ 0.5017,  0.9369,  0.    ,  1.    ],
       [-0.7622, -1.0845,  0.    ,  1.    ],
       [ 0.5748,  0.0706,  0.    ,  1.    ]])
ct1.get_feature_names_out()
array(['standardscaler__volume', 'standardscaler__weight',
       'onehotencoder__cover_hb', 'onehotencoder__cover_pb'], dtype=object)
ct2.fit_transform(books)
array([[ 0.121 ,  0.3594,  1.    ,  0.    ],
       [ 0.52  ,  0.9369,  1.    ,  0.    ],
       [ 0.8519,  1.3219,  1.    ,  0.    ],
       [-1.8464, -1.3733,  1.    ,  0.    ],
       [-0.4394,  0.1668,  1.    ,  0.    ],
       [-0.6221, -0.4107,  1.    ,  0.    ],
       [ 1.1656,  1.4182,  1.    ,  0.    ],
       [-1.3195, -1.7583,  0.    ,  1.    ],
       [ 0.3281, -0.0257,  0.    ,  1.    ],
       [ 0.255 , -0.2182,  0.    ,  1.    ],
       [ 1.9696,  1.0332,  0.    ,  1.    ],
       [-1.2982, -1.3733,  0.    ,  1.    ],
       [ 0.5017,  0.9369,  0.    ,  1.    ],
       [-0.7622, -1.0845,  0.    ,  1.    ],
       [ 0.5748,  0.0706,  0.    ,  1.    ]])
ct2.get_feature_names_out()
array(['standardscaler__volume', 'standardscaler__weight',
       'onehotencoder__cover_hb', 'onehotencoder__cover_pb'], dtype=object)
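A selector created by make_column_selector() is a callable that takes a DataFrame and returns the matching column names, so it can be inspected on its own before being placed in a column transformer. The sketch below uses the books data copied from the table above; selecting the non-numeric columns with dtype_exclude is a choice made here to avoid depending on how a particular pandas version labels string columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

books = pd.DataFrame({
    "volume": [885, 1016, 1125, 239, 701, 641, 1228, 412,
               953, 929, 1492, 419, 1010, 595, 1034],
    "weight": [800, 950, 1050, 350, 750, 600, 1075, 250,
               700, 650, 975, 350, 950, 425, 725],
    "cover": ["hb"] * 7 + ["pb"] * 8,
})

numeric = make_column_selector(dtype_include=np.number)
non_numeric = make_column_selector(dtype_exclude=np.number)

# Calling a selector on a DataFrame returns a list of column names
numeric_cols = numeric(books)
other_cols = non_numeric(books)
print(numeric_cols, other_cols)

# Selection by dtype also picks up the response if it is still in the frame,
# so it is usually safer to pass only the feature columns
feature_cols = numeric(books.drop(columns="weight"))
print(feature_cols)
```

This is also why standardscaler__weight appears in the output above: weight is numeric, so the dtype based selection treats it like any other feature.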