sklearn.__version__: '1.8.0'
Lecture 12
Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.
- Simple and efficient tools for predictive data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license
You may have noticed that the package is called scikit-learn while the module is called sklearn; this is a common source of confusion.
To install the package, use the longer name, i.e. pip install scikit-learn. Previously you could also use sklearn as the package name, but this is no longer supported and will result in an error.
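A minimal check that the right name is used in each place (the version printed is whatever happens to be installed):

```python
# Install with the *package* name:
#   pip install scikit-learn
# Import with the *module* name:
import sklearn

print(sklearn.__version__)
```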
The sklearn package contains a large number of submodules, each specialized for different tasks and models:
- sklearn.base - Base classes and utility functions
- sklearn.calibration - Probability Calibration
- sklearn.cluster - Clustering
- sklearn.compose - Composite Estimators
- sklearn.covariance - Covariance Estimators
- sklearn.cross_decomposition - Cross decomposition
- sklearn.datasets - Datasets
- sklearn.decomposition - Matrix Decomposition
- sklearn.discriminant_analysis - Discriminant Analysis
- sklearn.ensemble - Ensemble Methods
- sklearn.exceptions - Exceptions and warnings
- sklearn.experimental - Experimental
- sklearn.feature_extraction - Feature Extraction
- sklearn.feature_selection - Feature Selection
- sklearn.gaussian_process - Gaussian Processes
- sklearn.impute - Impute
- sklearn.inspection - Inspection
- sklearn.isotonic - Isotonic regression
- sklearn.kernel_approximation - Kernel Approximation
- sklearn.kernel_ridge - Kernel Ridge Regression
- sklearn.linear_model - Linear Models
- sklearn.manifold - Manifold Learning
- sklearn.metrics - Metrics
- sklearn.mixture - Gaussian Mixture Models
- sklearn.model_selection - Model Selection
- sklearn.multiclass - Multiclass classification
- sklearn.multioutput - Multioutput regression and classification
- sklearn.naive_bayes - Naive Bayes
- sklearn.neighbors - Nearest Neighbors
- sklearn.neural_network - Neural network models
- sklearn.pipeline - Pipeline
- sklearn.preprocessing - Preprocessing and Normalization
- sklearn.random_projection - Random projection
- sklearn.semi_supervised - Semi-Supervised Learning
- sklearn.svm - Support Vector Machines
- sklearn.tree - Decision Trees
- sklearn.utils - Utilities

To begin, we will examine a simple data set on the size and weight of a number of books. The goal is to model the weight of a book using some combination of the other features in the data.
The included columns are:
volume - book volumes in cubic centimeters
weight - book weights in grams
cover - a categorical variable with levels "hb" hardback, "pb" paperback
scikit-learn uses an object-oriented system for implementing the various modeling approaches; the class LinearRegression is part of the linear_model submodule.
Each modeling class needs to be constructed (potentially with options) and then the resulting object will provide attributes and methods for fitting and using the model.
When fitting a model, scikit-learn expects X to be a 2d array-like object (e.g. a np.array or pd.DataFrame), so it will not accept objects like a pd.Series or 1d np.array.
ValueError: Expected 2D array, got 1D array instead:
array=[ 885 1016 1125 239 701 641 1228 412 953 929 1492 419 1010 595
1034].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
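A small sketch of the fix suggested by the error message, using made-up volume and weight values in the spirit of the books data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# 1d arrays of (hypothetical) volumes and weights
x = np.array([885, 1016, 1125, 239, 701])
y = np.array([800, 950, 1050, 350, 750])

# LinearRegression().fit(x, y) would raise the ValueError above -- X must be 2d.
# reshape(-1, 1) turns the single feature into an (n_samples, 1) matrix:
X = x.reshape(-1, 1)
m = LinearRegression().fit(X, y)
print(m.coef_.shape)  # one coefficient for the one feature
```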
Depending on the model being used, there will be a number of parameters that can be configured when constructing the model object or via the set_params() method.
LinearRegression(fit_intercept=False)
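For example, fit_intercept can be set either at construction or afterwards via set_params() (a sketch; the parameter values are illustrative):

```python
from sklearn.linear_model import LinearRegression

m = LinearRegression(fit_intercept=False)  # configured at construction
m.set_params(fit_intercept=True)           # or adjusted afterwards
print(m.get_params()["fit_intercept"])     # True
```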
Once the model coefficients have been fit, predictions can be made via the predict() method. This method requires a matrix-like X as input and, in the case of LinearRegression, returns an array of predicted y values.
array([ 725.10251417, 832.43407276, 921.74048411, 195.81864507,
574.34673721, 525.18724472, 1006.13094621, 337.5618484 ,
780.81660565, 761.15280865, 1222.43271315, 343.29712253,
827.51812351, 487.49830048, 847.1819205 ])
volume weight cover pred
0 885 800 hb 725.102514
1 1016 950 hb 832.434073
2 1125 1050 hb 921.740484
3 239 350 hb 195.818645
4 701 750 hb 574.346737
5 641 600 hb 525.187245
6 1228 1075 hb 1006.130946
7 412 250 pb 337.561848
8 953 700 pb 780.816606
9 929 650 pb 761.152809
10 1492 975 pb 1222.432713
11 419 350 pb 343.297123
12 1010 950 pb 827.518124
13 595 425 pb 487.498300
14 1034 725 pb 847.181921
There is no built-in functionality for calculating residuals, so this needs to be done by hand.
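A sketch of computing residuals by hand; the small stand-in DataFrame here uses illustrative values, not the lecture's books data:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in for the books data (values are illustrative)
books = pd.DataFrame({
    "volume": [885, 1016, 239, 412],
    "weight": [800, 950, 350, 250],
})

m = LinearRegression().fit(books[["volume"]], books.weight)

# Residuals = observed - predicted, computed by hand
books["resid"] = books.weight - m.predict(books[["volume"]])
print(books.resid.sum())  # OLS residuals (with intercept) sum to ~0
```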
Scikit-learn expects the model matrix to be numeric before fitting; the solution here is to dummy code the categorical variables, which can be done with pandas via pd.get_dummies() or with a scikit-learn preprocessor.
volume cover_hb cover_pb
0 885 True False
1 1016 True False
2 1125 True False
3 239 True False
4 701 True False
5 641 True False
6 1228 True False
7 412 False True
8 953 False True
9 929 False True
10 1492 False True
11 419 False True
12 1010 False True
13 595 False True
14 1034 False True
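The dummy coding above can be reproduced with pd.get_dummies(); a sketch with a couple of illustrative rows:

```python
import pandas as pd

d = pd.DataFrame({"volume": [885, 412], "cover": ["hb", "pb"]})

# Numeric columns pass through untouched; categorical columns are expanded
X = pd.get_dummies(d)
print(list(X.columns))  # ['volume', 'cover_hb', 'cover_pb']
```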
Do the above results look reasonable? What went wrong?
Call:
lm(formula = weight ~ volume + cover_hb + cover_pb, data = d)
Residuals:
Min 1Q Median 3Q Max
-110.10 -32.32 -16.10 28.93 210.95
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.91557 59.45408 0.234 0.818887
volume 0.71795 0.06153 11.669 6.6e-08 ***
cover_hb 184.04727 40.49420 4.545 0.000672 ***
cover_pb NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 78.2 on 12 degrees of freedom
Multiple R-squared: 0.9275, Adjusted R-squared: 0.9154
F-statistic: 76.73 on 2 and 12 DF, p-value: 1.455e-07
These are a collection of transformer classes present in the sklearn.preprocessing submodule that are designed to help with the preparation of raw feature data into quantities more suitable for downstream modeling tools.
Like the modeling classes, they have an object-oriented design that shares a common interface (methods and attributes) for bringing in data, transforming it, and returning it.
For dummy coding we can use the OneHotEncoder preprocessor; the default is one hot encoding, but standard dummy coding can be achieved via the drop parameter.
OneHotEncoder(sparse_output=False)
array([[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[1., 0.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.]])
Unlike pd.get_dummies(), it is not safe to use OneHotEncoder on a mix of numerical and categorical features, since the numerical columns will also be one-hot encoded.
volume_239 volume_412 volume_419 ... volume_1492 cover_hb cover_pb
0 0.0 0.0 0.0 ... 0.0 1.0 0.0
1 0.0 0.0 0.0 ... 0.0 1.0 0.0
2 0.0 0.0 0.0 ... 0.0 1.0 0.0
3 1.0 0.0 0.0 ... 0.0 1.0 0.0
4 0.0 0.0 0.0 ... 0.0 1.0 0.0
5 0.0 0.0 0.0 ... 0.0 1.0 0.0
6 0.0 0.0 0.0 ... 0.0 1.0 0.0
7 0.0 1.0 0.0 ... 0.0 0.0 1.0
8 0.0 0.0 0.0 ... 0.0 0.0 1.0
9 0.0 0.0 0.0 ... 0.0 0.0 1.0
10 0.0 0.0 0.0 ... 1.0 0.0 1.0
11 0.0 0.0 1.0 ... 0.0 0.0 1.0
12 0.0 0.0 0.0 ... 0.0 0.0 1.0
13 0.0 0.0 0.0 ... 0.0 0.0 1.0
14 0.0 0.0 0.0 ... 0.0 0.0 1.0
[15 rows x 17 columns]
volume weight cover pred2
0 885 800 hb 833.351907
1 1016 950 hb 927.403847
2 1125 1050 hb 1005.660805
3 239 350 hb 369.553788
4 701 750 hb 701.248418
5 641 600 hb 658.171193
6 1228 1075 hb 1079.610041
7 412 250 pb 309.712515
8 953 700 pb 698.125490
9 929 650 pb 680.894600
10 1492 975 pb 1085.102558
11 419 350 pb 314.738191
12 1010 950 pb 739.048853
13 595 425 pb 441.098050
14 1034 725 pb 756.279743
Scikit-learn comes with a number of built-in functions for measuring model performance in the sklearn.metrics submodule - these are generally just functions that take the vectors y_true and y_pred and return a scalar score.
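For example, mean squared error and R² both follow this (y_true, y_pred) to scalar pattern (the toy vectors here are made up):

```python
from sklearn.metrics import mean_squared_error, r2_score

y_true = [3.0, 5.0, 7.0]
y_pred = [2.8, 5.1, 7.2]

print(mean_squared_error(y_true, y_pred))  # mean of squared residuals
print(r2_score(y_true, y_pred))            # proportion of variance explained
```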
Create and fit a model for the books data that includes an interaction effect between volume and cover.
You will need to do this manually with pd.get_dummies() and some additional data munging.
The data can be read into pandas with,
We will now look at another flavor of regression model, one that involves preprocessing and a hyperparameter - namely polynomial regression.
It is certainly possible to construct the necessary model matrix by hand (or even use a function to automate the process), but this is generally undesirable - particularly if we want to do anything fancy (e.g. cross validation).
This is another transformer class from sklearn.preprocessing that simplifies the process of constructing polynomial features for your model matrix. Usage is similar to that of OneHotEncoder.
If the feature matrix X has more than one column, then the PolynomialFeatures transformer will include interaction terms with total degree up to degree.
array([[0, 1],
[2, 3],
[4, 5]])
array([[ 0., 1., 0., 0., 1.],
[ 2., 3., 4., 6., 9.],
[ 4., 5., 16., 20., 25.]])
array(['x0', 'x1', 'x0^2', 'x0 x1', 'x1^2'], dtype=object)
array([[0, 1, 2],
[3, 4, 5]])
array([[ 0., 1., 2., 0., 0., 0., 1., 2., 4.],
[ 3., 4., 5., 9., 12., 15., 16., 20., 25.]])
array(['x0', 'x1', 'x2', 'x0^2', 'x0 x1', 'x0 x2', 'x1^2', 'x1 x2',
'x2^2'], dtype=object)
You may have noticed that PolynomialFeatures takes a model matrix as input and returns a new model matrix as output, which is then used as the input for LinearRegression. This is not an accident; by structuring the library this way, sklearn makes it possible to chain these steps together into what it calls a pipeline.
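A sketch of constructing such a pipeline with the make_pipeline() helper:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Step names are generated from the lowercased class names
p = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())
print([name for name, step in p.steps])
```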
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=4)),
('linearregression', LinearRegression())])
Once constructed, this object can be used just like our previous LinearRegression model (i.e. fit to our data and then used for prediction).
array([ 1.6295693 ,  1.65734929,  1.6610466 ,  1.67779767,  1.69667491,
        1.70475286,  1.75280126,  1.78471392,  1.79049912,  1.82690007,
        ...,
       -0.96937509, -0.99388351, -1.1634133 , -1.19336585, -1.21548881])
(250 predicted values; output truncated)
The attributes of pipeline steps are not directly accessible from the pipeline object, but can be reached via the steps or named_steps attributes.
Anyone notice a problem?
By accessing each step we can adjust its parameters (via set_params()).
LinearRegression(fit_intercept=False)
Pipeline(steps=[('polynomialfeatures', PolynomialFeatures(degree=4)),
('linearregression', LinearRegression(fit_intercept=False))])
These parameters can also be accessed directly at the pipeline level; names are constructed as step name + __ + parameter name:
{'memory': None, 'steps': [('polynomialfeatures', PolynomialFeatures(degree=4)), ('linearregression', LinearRegression(fit_intercept=False))], 'transform_input': None, 'verbose': False, 'polynomialfeatures': PolynomialFeatures(degree=4), 'linearregression': LinearRegression(fit_intercept=False), 'polynomialfeatures__degree': 4, 'polynomialfeatures__include_bias': True, 'polynomialfeatures__interaction_only': False, 'polynomialfeatures__order': 'C', 'linearregression__copy_X': True, 'linearregression__fit_intercept': False, 'linearregression__n_jobs': None, 'linearregression__positive': False, 'linearregression__tol': 1e-06}
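A sketch of both access patterns, fitting the pipeline on toy data (the data here is synthetic, not the lecture's):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic data: a noisy sine curve on [0, 1]
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50).reshape(-1, 1)
y = np.sin(2 * np.pi * x.ravel()) + rng.normal(0, 0.1, 50)

p = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())

# Pipeline-level parameter names: step name + __ + parameter name
p.set_params(polynomialfeatures__include_bias=False)

p.fit(x, y)

# Fitted attributes are reached through named_steps
print(p.named_steps["linearregression"].coef_.shape)  # 4 features: x .. x^4
```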
Pipeline(steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=4, include_bias=False)),
                ('linearregression', LinearRegression())])
array(['x', 'x^2', 'x^3', 'x^4'], dtype=object)
np.float64(1.6136636604768482)
array([ 7.39051417, -57.67175293, 102.72227443, -55.38181361])
Column transformers are a tool for selectively applying transformer(s) to column(s) of an array or DataFrame. They function in a way that is similar to a pipeline and similarly have a make_ helper function.
array([[ 0.12100717, 1. , 0. ],
[ 0.51996539, 1. , 0. ],
[ 0.85192299, 1. , 0. ],
[-1.84637457, 1. , 0. ],
[-0.43936162, 1. , 0. ],
[-0.62209057, 1. , 0. ],
[ 1.1656077 , 1. , 0. ],
[-1.31950608, 0. , 1. ],
[ 0.32809999, 0. , 1. ],
[ 0.25500841, 0. , 1. ],
[ 1.9696151 , 0. , 1. ],
[-1.2981877 , 0. , 1. ],
[ 0.5016925 , 0. , 1. ],
[-0.76218277, 0. , 1. ],
[ 0.57478408, 0. , 1. ]])
Another important argument is remainder, which determines what happens to unspecified columns. The default is "drop", which is why weight was removed; the alternative is "passthrough", which retains untransformed columns.
array([[ 1.2101e-01, 1.0000e+00, 0.0000e+00, 8.0000e+02],
[ 5.1997e-01, 1.0000e+00, 0.0000e+00, 9.5000e+02],
[ 8.5192e-01, 1.0000e+00, 0.0000e+00, 1.0500e+03],
[-1.8464e+00, 1.0000e+00, 0.0000e+00, 3.5000e+02],
[-4.3936e-01, 1.0000e+00, 0.0000e+00, 7.5000e+02],
[-6.2209e-01, 1.0000e+00, 0.0000e+00, 6.0000e+02],
[ 1.1656e+00, 1.0000e+00, 0.0000e+00, 1.0750e+03],
[-1.3195e+00, 0.0000e+00, 1.0000e+00, 2.5000e+02],
[ 3.2810e-01, 0.0000e+00, 1.0000e+00, 7.0000e+02],
[ 2.5501e-01, 0.0000e+00, 1.0000e+00, 6.5000e+02],
[ 1.9696e+00, 0.0000e+00, 1.0000e+00, 9.7500e+02],
[-1.2982e+00, 0.0000e+00, 1.0000e+00, 3.5000e+02],
[ 5.0169e-01, 0.0000e+00, 1.0000e+00, 9.5000e+02],
[-7.6218e-01, 0.0000e+00, 1.0000e+00, 4.2500e+02],
[ 5.7478e-01, 0.0000e+00, 1.0000e+00, 7.2500e+02]])
One lingering issue with the above approach is that we've had to hard code the column names (or use indexes). Often we want to select columns based on their dtype (e.g. categorical vs. numerical); this can be done via pandas or sklearn.
array([[ 0.121 , 0.3594, 1. , 0. ],
[ 0.52 , 0.9369, 1. , 0. ],
[ 0.8519, 1.3219, 1. , 0. ],
[-1.8464, -1.3733, 1. , 0. ],
[-0.4394, 0.1668, 1. , 0. ],
[-0.6221, -0.4107, 1. , 0. ],
[ 1.1656, 1.4182, 1. , 0. ],
[-1.3195, -1.7583, 0. , 1. ],
[ 0.3281, -0.0257, 0. , 1. ],
[ 0.255 , -0.2182, 0. , 1. ],
[ 1.9696, 1.0332, 0. , 1. ],
[-1.2982, -1.3733, 0. , 1. ],
[ 0.5017, 0.9369, 0. , 1. ],
[-0.7622, -1.0845, 0. , 1. ],
[ 0.5748, 0.0706, 0. , 1. ]])
array(['standardscaler__volume', 'standardscaler__weight',
'onehotencoder__cover_hb', 'onehotencoder__cover_pb'], dtype=object)
array([[ 0.121 , 0.3594, 1. , 0. ],
[ 0.52 , 0.9369, 1. , 0. ],
[ 0.8519, 1.3219, 1. , 0. ],
[-1.8464, -1.3733, 1. , 0. ],
[-0.4394, 0.1668, 1. , 0. ],
[-0.6221, -0.4107, 1. , 0. ],
[ 1.1656, 1.4182, 1. , 0. ],
[-1.3195, -1.7583, 0. , 1. ],
[ 0.3281, -0.0257, 0. , 1. ],
[ 0.255 , -0.2182, 0. , 1. ],
[ 1.9696, 1.0332, 0. , 1. ],
[-1.2982, -1.3733, 0. , 1. ],
[ 0.5017, 0.9369, 0. , 1. ],
[-0.7622, -1.0845, 0. , 1. ],
[ 0.5748, 0.0706, 0. , 1. ]])
array(['standardscaler__volume', 'standardscaler__weight',
'onehotencoder__cover_hb', 'onehotencoder__cover_pb'], dtype=object)
Sta 663 - Spring 2026