sklearn.__version__: '1.8.0'
Lecture 12
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.
- Simple and efficient tools for predictive data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license
You may have noticed that the package is called scikit-learn while the module is called sklearn; this is a common source of confusion.
To install the package, use the longer name, i.e. pip install scikit-learn. Previously you could also use sklearn as the package name, but this is no longer supported and will result in an error.
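A minimal check that the right name is used in each place (the version printed is whatever happens to be installed):

```python
# Install with the *package* name:
#   pip install scikit-learn
# Import with the *module* name:
import sklearn

print(sklearn.__version__)
```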
The sklearn package contains a large number of submodules, each specialized for different tasks and models:
- sklearn.base - Base classes and utility functions
- sklearn.calibration - Probability Calibration
- sklearn.cluster - Clustering
- sklearn.compose - Composite Estimators
- sklearn.covariance - Covariance Estimators
- sklearn.cross_decomposition - Cross decomposition
- sklearn.datasets - Datasets
- sklearn.decomposition - Matrix Decomposition
- sklearn.discriminant_analysis - Discriminant Analysis
- sklearn.ensemble - Ensemble Methods
- sklearn.exceptions - Exceptions and warnings
- sklearn.experimental - Experimental
- sklearn.feature_extraction - Feature Extraction
- sklearn.feature_selection - Feature Selection
- sklearn.gaussian_process - Gaussian Processes
- sklearn.impute - Impute
- sklearn.inspection - Inspection
- sklearn.isotonic - Isotonic regression
- sklearn.kernel_approximation - Kernel Approximation
- sklearn.kernel_ridge - Kernel Ridge Regression
- sklearn.linear_model - Linear Models
- sklearn.manifold - Manifold Learning
- sklearn.metrics - Metrics
- sklearn.mixture - Gaussian Mixture Models
- sklearn.model_selection - Model Selection
- sklearn.multiclass - Multiclass classification
- sklearn.multioutput - Multioutput regression and classification
- sklearn.naive_bayes - Naive Bayes
- sklearn.neighbors - Nearest Neighbors
- sklearn.neural_network - Neural network models
- sklearn.pipeline - Pipeline
- sklearn.preprocessing - Preprocessing and Normalization
- sklearn.random_projection - Random projection
- sklearn.semi_supervised - Semi-Supervised Learning
- sklearn.svm - Support Vector Machines
- sklearn.tree - Decision Trees
- sklearn.utils - Utilities

To begin, we will examine a simple data set on the size and weight of a number of books. The goal is to model the weight of a book using some combination of the other features in the data.
The included columns are:
volume - book volumes in cubic centimeters
weight - book weights in grams
cover - a categorical variable with levels "hb" hardback, "pb" paperback
scikit-learn uses an object-oriented system for implementing the various modeling approaches; the class LinearRegression is part of the linear_model submodule.
Each modeling class needs to be constructed (potentially with options) and then the resulting object will provide attributes and methods for fitting and using the model.
When fitting a model, scikit-learn expects X to be a 2d array-like object (e.g. a np.array or pd.DataFrame), so it will not accept objects like a pd.Series or 1d np.array.
ValueError: Expected 2D array, got 1D array instead:
array=[ 885 1016 1125 239 701 641 1228 412 953 929 1492 419 1010 595
1034].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
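A small sketch of the fix suggested by the error message, using made-up volume and weight values in the spirit of the books data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 1d arrays of (hypothetical) volumes and weights
x = np.array([885, 1016, 1125, 239, 701])
y = np.array([800, 950, 1050, 350, 750])

# LinearRegression().fit(x, y) would raise the ValueError above -- X must be 2d.
# reshape(-1, 1) turns the single feature into an (n_samples, 1) matrix:
X = x.reshape(-1, 1)
m = LinearRegression().fit(X, y)
print(m.coef_.shape)  # one coefficient for the one feature
```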
Depending on the model being used, there will be a number of parameters that can be configured when constructing the model object or via the set_params() method.
LinearRegression(fit_intercept=False)
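For example, fit_intercept can be set either at construction or afterwards via set_params() (a sketch; the parameter values are illustrative):

```python
from sklearn.linear_model import LinearRegression

m = LinearRegression(fit_intercept=False)  # configured at construction
m.set_params(fit_intercept=True)           # or adjusted afterwards
print(m.get_params()["fit_intercept"])     # True
```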
Once the model coefficients have been fit, predictions can be made via the predict() method. This method requires a matrix-like X as input and, in the case of LinearRegression, returns an array of predicted y values.
array([ 725.10251417, 832.43407276, 921.74048411, 195.81864507,
574.34673721, 525.18724472, 1006.13094621, 337.5618484 ,
780.81660565, 761.15280865, 1222.43271315, 343.29712253,
827.51812351, 487.49830048, 847.1819205 ])
volume weight cover pred
0 885 800 hb 725.102514
1 1016 950 hb 832.434073
2 1125 1050 hb 921.740484
3 239 350 hb 195.818645
4 701 750 hb 574.346737
5 641 600 hb 525.187245
6 1228 1075 hb 1006.130946
7 412 250 pb 337.561848
8 953 700 pb 780.816606
9 929 650 pb 761.152809
10 1492 975 pb 1222.432713
11 419 350 pb 343.297123
12 1010 950 pb 827.518124
13 595 425 pb 487.498300
14 1034 725 pb 847.181921
There is no built-in functionality for calculating residuals, so this needs to be done by hand.
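A sketch of computing residuals by hand; the small stand-in DataFrame here uses illustrative values, not the lecture's books data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in for the books data (values are illustrative)
books = pd.DataFrame({
    "volume": [885, 1016, 239, 412],
    "weight": [800, 950, 350, 250],
})

m = LinearRegression().fit(books[["volume"]], books.weight)

# Residuals = observed - predicted, computed by hand
books["resid"] = books.weight - m.predict(books[["volume"]])
print(books.resid.sum())  # OLS residuals (with intercept) sum to ~0
```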
Scikit-learn expects the model matrix to be numeric before fitting; the solution here is to dummy code the categorical variables, which can be done with pandas via pd.get_dummies() or with a scikit-learn preprocessor.
volume cover_hb cover_pb
0 885 True False
1 1016 True False
2 1125 True False
3 239 True False
4 701 True False
5 641 True False
6 1228 True False
7 412 False True
8 953 False True
9 929 False True
10 1492 False True
11 419 False True
12 1010 False True
13 595 False True
14 1034 False True
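The dummy coding above can be reproduced with pd.get_dummies(); a sketch with a couple of illustrative rows:

```python
import pandas as pd

d = pd.DataFrame({"volume": [885, 412], "cover": ["hb", "pb"]})

# Numeric columns pass through untouched; categorical columns are expanded
X = pd.get_dummies(d)
print(list(X.columns))  # ['volume', 'cover_hb', 'cover_pb']
```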
Do the above results look reasonable? What went wrong?
Call:
lm(formula = weight ~ volume + cover_hb + cover_pb, data = d)
Residuals:
Min 1Q Median 3Q Max
-110.10 -32.32 -16.10 28.93 210.95
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.91557 59.45408 0.234 0.818887
volume 0.71795 0.06153 11.669 6.6e-08 ***
cover_hb 184.04727 40.49420 4.545 0.000672 ***
cover_pb NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 78.2 on 12 degrees of freedom
Multiple R-squared: 0.9275, Adjusted R-squared: 0.9154
F-statistic: 76.73 on 2 and 12 DF, p-value: 1.455e-07
These are a collection of transformer classes present in the sklearn.preprocessing submodule that are designed to help with the preparation of raw feature data into quantities more suitable for downstream modeling tools.
Like the modeling classes, they have an object-oriented design that shares a common interface (methods and attributes) for bringing in data, transforming it, and returning it.
For dummy coding we can use the OneHotEncoder preprocessor; the default is one hot encoding, but standard dummy coding can be achieved via the drop parameter.
OneHotEncoder(sparse_output=False)
array([[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.]])
Unlike pd.get_dummies(), it is not safe to use OneHotEncoder on a mix of numerical and categorical features, since the numerical columns will also be one-hot encoded.
volume_239 volume_412 volume_419 ... volume_1492 cover_hb cover_pb
0 0.0 0.0 0.0 ... 0.0 1.0 0.0
1 0.0 0.0 0.0 ... 0.0 1.0 0.0
2 0.0 0.0 0.0 ... 0.0 1.0 0.0
3 1.0 0.0 0.0 ... 0.0 1.0 0.0
4 0.0 0.0 0.0 ... 0.0 1.0 0.0
5 0.0 0.0 0.0 ... 0.0 1.0 0.0
6 0.0 0.0 0.0 ... 0.0 1.0 0.0
7 0.0 1.0 0.0 ... 0.0 0.0 1.0
8 0.0 0.0 0.0 ... 0.0 0.0 1.0
9 0.0 0.0 0.0 ... 0.0 0.0 1.0
10 0.0 0.0 0.0 ... 1.0 0.0 1.0
11 0.0 0.0 1.0 ... 0.0 0.0 1.0
12 0.0 0.0 0.0 ... 0.0 0.0 1.0
13 0.0 0.0 0.0 ... 0.0 0.0 1.0
14 0.0 0.0 0.0 ... 0.0 0.0 1.0
[15 rows x 17 columns]
volume weight cover pred2
0 885 800 hb 833.351907
1 1016 950 hb 927.403847
2 1125 1050 hb 1005.660805
3 239 350 hb 369.553788
4 701 750 hb 701.248418
5 641 600 hb 658.171193
6 1228 1075 hb 1079.610041
7 412 250 pb 309.712515
8 953 700 pb 698.125490
9 929 650 pb 680.894600
10 1492 975 pb 1085.102558
11 419 350 pb 314.738191
12 1010 950 pb 739.048853
13 595 425 pb 441.098050
14 1034 725 pb 756.279743
Scikit-learn comes with a number of built-in functions for measuring model performance in the sklearn.metrics submodule - these are generally just functions that take the vectors y_true and y_pred and return a scalar score.
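For example, mean squared error and R² both follow this (y_true, y_pred) to scalar pattern (the toy vectors here are made up):

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0]
y_pred = [2.8, 5.1, 7.2]

print(mean_squared_error(y_true, y_pred))  # mean of squared residuals
print(r2_score(y_true, y_pred))            # proportion of variance explained
```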
Create and fit a model for the books data that includes an interaction effect between volume and cover.
You will need to do this manually with pd.get_dummies() and some additional data munging.
The data can be read into pandas with,
We will now look at another flavor of regression model, one that involves preprocessing and a hyperparameter - namely polynomial regression.
It is certainly possible to construct the necessary model matrix by hand (or even use a function to automate the process), but this is generally undesirable - particularly if we want to do anything fancy (e.g. cross validation).
This is another transformer class from sklearn.preprocessing that simplifies the process of constructing polynomial features for your model matrix. Usage is similar to that of OneHotEncoder.
If the feature matrix X has more than one column, then the PolynomialFeatures transformer will include interaction terms with total degree up to degree.
array([[0, 1],
[2, 3],
[4, 5]])
array([[ 0., 1., 0., 0., 1.],
[ 2., 3., 4., 6., 9.],
[ 4., 5., 16., 20., 25.]])
array(['x0', 'x1', 'x0^2', 'x0 x1', 'x1^2'], dtype=object)
array([[0, 1, 2],
[3, 4, 5]])
array([[ 0., 1., 2., 0., 0., 0., 1., 2., 4.],
[ 3., 4., 5., 9., 12., 15., 16., 20., 25.]])
array(['x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2',
'x2^2'], dtype=object)
You may have noticed that PolynomialFeatures takes a model matrix as input and returns a new model matrix as output, which is then used as the input for LinearRegression. This is not an accident; by structuring the library this way, sklearn makes it possible to chain these steps together into what it calls a pipeline.
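A sketch of constructing such a pipeline with the make_pipeline() helper:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Step names are generated from the lowercased class names
p = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())
print([name for name, step in p.steps])
```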
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=4)),
('linearregression', LinearRegression())])
Once constructed, this object can be used just like our previous LinearRegression model (i.e. fit to our data and then used for prediction).
array([ 1.6295693 ,  1.65734929,  1.6610466 ,  1.67779767,  1.69667491,
        1.70475286,  1.75280126,  1.78471392,  1.79049912,  1.82690007,
        ...,
       -0.96937509, -0.99388351, -1.1634133 , -1.19336585, -1.21548881])
(250 predicted values; output truncated)
The attributes of pipeline steps are not directly accessible from the pipeline object, but can be reached via the steps or named_steps attributes.
Anyone notice a problem?
By accessing each step we can adjust its parameters (via set_params()).
LinearRegression(fit_intercept=False)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=4)),
('linearregression', LinearRegression(fit_intercept=False))])
These parameters can also be accessed directly at the pipeline level; names are constructed as step name + __ + parameter name:
{'memory': None, 'steps': [('polynomialfeatures', PolynomialFeatures(degree=4)), ('linearregression', LinearRegression(fit_intercept=False))], 'transform_input': None, 'verbose': False, 'polynomialfeatures': PolynomialFeatures(degree=4), 'linearregression': LinearRegression(fit_intercept=False), 'polynomialfeatures__degree': 4, 'polynomialfeatures__include_bias': True, 'polynomialfeatures__interaction_only': False, 'polynomialfeatures__order': 'C', 'linearregression__copy_X': True, 'linearregression__fit_intercept': False, 'linearregression__n_jobs': None, 'linearregression__positive': False, 'linearregression__tol': 1e-06}
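A sketch of both access patterns, fitting the pipeline on toy data (the data here is synthetic, not the lecture's):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy sine curve on [0, 1]
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * x.ravel()) + rng.normal(0, 0.1, 50)

p = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())

# Pipeline-level parameter names: step name + __ + parameter name
p.set_params(polynomialfeatures__include_bias=False)

p.fit(x, y)

# Fitted attributes are reached through named_steps
print(p.named_steps["linearregression"].coef_.shape)  # 4 features: x .. x^4
```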
Pipeline(steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=4, include_bias=False)),
                ('linearregression', LinearRegression())])
array(['x', 'x^2', 'x^3', 'x^4'], dtype=object)
np.float64(1.6136636604768482)
array([ 7.39051417, -57.67175293, 102.72227443, -55.38181361])
Column transformers are a tool for selectively applying transformer(s) to column(s) of an array or DataFrame. They function in a way that is similar to a pipeline and similarly have a make_ helper function.
array([[ 0.12100717, 1. , 0. ],
[ 0.51996539, 1. , 0. ],
[ 0.85192299, 1. , 0. ],
[-1.84637457, 1. , 0. ],
[-0.43936162, 1. , 0. ],
[-0.62209057, 1. , 0. ],
[ 1.1656077 , 1. , 0. ],
[-1.31950608, 0. , 1. ],
[ 0.32809999, 0. , 1. ],
[ 0.25500841, 0. , 1. ],
[ 1.9696151 , 0. , 1. ],
[-1.2981877 , 0. , 1. ],
[ 0.5016925 , 0. , 1. ],
[-0.76218277, 0. , 1. ],
[ 0.57478408, 0. , 1. ]])
Another important argument is remainder, which determines what happens to unspecified columns. The default is "drop", which is why weight was removed; the alternative is "passthrough", which retains untransformed columns.
array([[ 1.2101e-01, 1.0000e+00, 0.0000e+00, 8.0000e+02],
[ 5.1997e-01, 1.0000e+00, 0.0000e+00, 9.5000e+02],
[ 8.5192e-01, 1.0000e+00, 0.0000e+00, 1.0500e+03],
[-1.8464e+00, 1.0000e+00, 0.0000e+00, 3.5000e+02],
[-4.3936e-01, 1.0000e+00, 0.0000e+00, 7.5000e+02],
[-6.2209e-01, 1.0000e+00, 0.0000e+00, 6.0000e+02],
[ 1.1656e+00, 1.0000e+00, 0.0000e+00, 1.0750e+03],
[-1.3195e+00, 0.0000e+00, 1.0000e+00, 2.5000e+02],
[ 3.2810e-01, 0.0000e+00, 1.0000e+00, 7.0000e+02],
[ 2.5501e-01, 0.0000e+00, 1.0000e+00, 6.5000e+02],
[ 1.9696e+00, 0.0000e+00, 1.0000e+00, 9.7500e+02],
[-1.2982e+00, 0.0000e+00, 1.0000e+00, 3.5000e+02],
[ 5.0169e-01, 0.0000e+00, 1.0000e+00, 9.5000e+02],
[-7.6218e-01, 0.0000e+00, 1.0000e+00, 4.2500e+02],
[ 5.7478e-01, 0.0000e+00, 1.0000e+00, 7.2500e+02]])
One lingering issue with the above approach is that we've had to hard code the column names (or use indexes). Often we want to select columns based on their dtype (e.g. categorical vs. numerical); this can be done via pandas or sklearn.
array([[ 0.121 , 0.3594, 1. , 0. ],
[ 0.52 , 0.9369, 1. , 0. ],
[ 0.8519, 1.3219, 1. , 0. ],
[-1.8464, -1.3733, 1. , 0. ],
[-0.4394, 0.1668, 1. , 0. ],
[-0.6221, -0.4107, 1. , 0. ],
[ 1.1656, 1.4182, 1. , 0. ],
[-1.3195, -1.7583, 0. , 1. ],
[ 0.3281, -0.0257, 0. , 1. ],
[ 0.255 , -0.2182, 0. , 1. ],
[ 1.9696, 1.0332, 0. , 1. ],
[-1.2982, -1.3733, 0. , 1. ],
[ 0.5017, 0.9369, 0. , 1. ],
[-0.7622, -1.0845, 0. , 1. ],
[ 0.5748, 0.0706, 0. , 1. ]])
array(['standardscaler__volume', 'standardscaler__weight',
'onehotencoder__cover_hb', 'onehotencoder__cover_pb'], dtype=object)
array([[ 0.121 , 0.3594, 1. , 0. ],
[ 0.52 , 0.9369, 1. , 0. ],
[ 0.8519, 1.3219, 1. , 0. ],
[-1.8464, -1.3733, 1. , 0. ],
[-0.4394, 0.1668, 1. , 0. ],
[-0.6221, -0.4107, 1. , 0. ],
[ 1.1656, 1.4182, 1. , 0. ],
[-1.3195, -1.7583, 0. , 1. ],
[ 0.3281, -0.0257, 0. , 1. ],
[ 0.255 , -0.2182, 0. , 1. ],
[ 1.9696, 1.0332, 0. , 1. ],
[-1.2982, -1.3733, 0. , 1. ],
[ 0.5017, 0.9369, 0. , 1. ],
[-0.7622, -1.0845, 0. , 1. ],
[ 0.5748, 0.0706, 0. , 1. ]])
array(['standardscaler__volume', 'standardscaler__weight',
'onehotencoder__cover_hb', 'onehotencoder__cover_pb'], dtype=object)
Sta 663 - Spring 2026