'3.0.0'
Lecture 09
pandas is an implementation of data frames in Python - it takes much of its inspiration from R and NumPy.
pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.
The columns of a DataFrame are constructed using the Series class - these are a 1D array-like object containing values of the same type (similar to a numpy array).
Once constructed the components of a series can be accessed via the array and index attributes.
An index (row names) can also be explicitly provided when constructing a Series,
Series objects are compatible with NumPy-like functions (i.e. vectorized)
Series can be subset by their index values (not position) or by logical expressions (as with NumPy arrays or R vectors)
Series cannot be directly indexed by position, but the iloc attribute can be used for this purpose.
When performing operations with multiple series, generally pandas will attempt to align the operation by the index values,
Series can also be constructed from dictionaries, in which case the keys are used as the index.
Pandas encodes missing values using NaN (mostly),
pd.isna()?NoneIn some cases None can also be used as a missing value, for example:
If instead of using base dtypes we use Pandas’ built-in dtypes we get “native” support for missing values,
Series containing strings can have their strings accessed via the str attribute,
Just like R a DataFrame is a collection of vectors (Series) with a common length (and a common index)
Column dtypes can be heterogeneous
Columns have names stored in the columns index.
It can be useful to think of a dictionary of Series objects where the keys are the column names.
<class 'pandas.DataFrame'>
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
We just saw how to read a DataFrame via read_csv(), DataFrames can also be constructed via DataFrame(), in general this is done using a dictionary of columns / Series.
2d ndarrays can also be used to construct a DataFrame - generally it is a good idea to provide column and row names (indexes)
20
(5, 4)
<class 'pandas.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 5 non-null int64
1 weight 5 non-null float64
2 height 5 non-null float64
3 date 5 non-null datetime64[us]
dtypes: datetime64[us](1), float64(2), int64(1)
memory usage: 292.0 bytes
Selecting a column - use name or via the . accessor,
As with Series, position based indexing is done via the iloc attribute
id 995
weight 69.552249
height 196.166608
date 2026-02-02 00:00:00
Name: 1, dtype: object
| id | weight | height | date | |
|---|---|---|---|---|
| 1 | 995 | 69.552249 | 196.166608 | 2026-02-02 |
| id | weight | height | date | |
|---|---|---|---|---|
| 0 | 200 | 57.715447 | 156.243857 | 2026-02-01 |
| 1 | 995 | 69.552249 | 196.166608 | 2026-02-02 |
| weight | height | |
|---|---|---|
| 1 | 69.552249 | 196.166608 |
| 2 | 48.327161 | 178.808975 |
| id | date | |
|---|---|---|
| 0 | 200 | 2026-02-01 |
| 1 | 995 | 2026-02-02 |
| 2 | 238 | 2026-02-03 |
| id | weight | |
|---|---|---|
| 0 | 200 | 57.715447 |
| 1 | 995 | 69.552249 |
| 2 | 238 | 48.327161 |
| id | weight | height | date | |
|---|---|---|---|---|
| 1 | 995 | 69.552249 | 196.166608 | 2026-02-02 |
| 3 | 768 | 126.885781 | 168.055900 | 2026-02-04 |
In general most pandas operations will generate a new object. Previously some subset operations would return views, this has mostly been resolved with pandas 3.0.0.
When constructing a DataFrame we can specify the indexes for both the rows (index) and columns (columns),
pandas’ Index class and its subclasses provide the infrastructure necessary for lookups, data alignment, and other related tasks. You can think of them as being an immutable multiset (i.e. duplicate values are allowed).
Index objects can have names which are shown when printing the DataFrame or Index,
| cols | A | B | C |
|---|---|---|---|
| rows | |||
| x | -0.649065 | -0.881551 | 1.478708 |
| y | 1.534376 | 0.202521 | 0.195817 |
| z | 0.834751 | 1.269891 | -1.427342 |
Index(['A', 'B', 'C'], dtype='str', name='cols')
Index(['x', 'y', 'z'], dtype='str', name='rows')
It is possible for an index to contain missing values (e.g. np.nan) but this is generally a bad idea and should be avoided.
Existing columns can be made into an index via set_index() and removed via reset_index(),
New index values can be attached to a DataFrame via reindex(),
These are a hierarchical analog of standard Index objects and are used to represent nested indexes. There are a number of methods for constructing them based on the initial object
MultiIndexes can also be used for columns as well,
| c1 | A | B | |||
|---|---|---|---|---|---|
| c2 | x | y | x | y | |
| r1 | r2 | ||||
| m | l | 0.524746 | 0.096756 | 0.664596 | 0.450048 |
| p | 0.122310 | 0.842365 | 0.867014 | 0.858448 | |
| n | l | 0.465433 | 0.998055 | 0.875562 | 0.567632 |
| p | 0.911802 | 0.745366 | 0.449562 | 0.006176 | |
| c1 | A | B | |||
|---|---|---|---|---|---|
| c2 | x | y | x | y | |
| r1 | r2 | ||||
| m | l | 0.179763 | 0.787379 | 0.565183 | 0.665208 |
| p | 0.220432 | 0.483390 | 0.292793 | 0.589280 | |
| n | l | 0.823848 | 0.654454 | 0.494629 | 0.924918 |
| p | 0.344536 | 0.541508 | 0.544303 | 0.591460 | |
ilocc1 c2
A x 0.179763
y 0.787379
B x 0.565183
y 0.665208
Name: (m, l), dtype: float64
<class 'pandas.Series'>
np.float64(0.7873788607957514)
| c1 | A | B | |||
|---|---|---|---|---|---|
| c2 | x | y | x | y | |
| r1 | r2 | ||||
| m | l | 0.179763 | 0.787379 | 0.565183 | 0.665208 |
| p | 0.220432 | 0.483390 | 0.292793 | 0.589280 | |
loclocIndex slices can also be used with combinations of indexes and index tuples,
The previous methods don’t give easy access to indexing on nested index levels, this is possible via the cross-section method xs(),
It is also possible to construct a MultiIndex or modify an existing one using set_index() and reset_index(),
The query() method can be used for filtering rows, it evaluates a string expression in the context of the data frame.
Beyond the use of loc() and iloc() there is also the filter() method which can be used to select columns (or indices) by name with pattern matching
| weight | height | |
|---|---|---|
| anna | 82.953771 | 193.688192 |
| bob | 100.460597 | 181.511521 |
| carol | 65.316933 | 162.957884 |
| dave | 65.317261 | 178.138401 |
| id | weight | height | date | |
|---|---|---|---|---|
| anna | 202 | 82.953771 | 193.688192 | 2026-02-01 |
| carol | 960 | 65.316933 | 162.957884 | 2026-02-03 |
| dave | 370 | 65.317261 | 178.138401 | 2026-02-04 |
Indexing with assignment allows for inplace modification of a DataFrame, while assign() creates a new object (but is chainable)
| id | weight | height | date | student | age | |
|---|---|---|---|---|---|---|
| anna | 202 | 82.953771 | 193.688192 | 2026-02-01 | True | 19.0 |
| bob | 535 | 100.460597 | 181.511521 | 2026-02-02 | True | 22.0 |
| carol | 960 | 65.316933 | 162.957884 | 2026-02-03 | True | NaN |
| dave | 370 | 65.317261 | 178.138401 | 2026-02-04 | None | NaN |
| id | weight | height | date | student | age | rand | |
|---|---|---|---|---|---|---|---|
| anna | 202 | 82.953771 | 193.688192 | 2026-02-01 | yes | 19.0 | 0.181825 |
| bob | 535 | 100.460597 | 181.511521 | 2026-02-02 | yes | 22.0 | 0.183405 |
| carol | 960 | 65.316933 | 162.957884 | 2026-02-03 | yes | NaN | 0.304242 |
| dave | 370 | 65.317261 | 178.138401 | 2026-02-04 | no | NaN | 0.524756 |
As of pandas 3.0.0 there is also a pd.col() function which can be used within methods like assign() and query() to refer to columns by name,
| id | weight | height | date | student | age | bmi | |
|---|---|---|---|---|---|---|---|
| anna | 202 | 82.953771 | 193.688192 | 2026-02-01 | True | 19.0 | 22.112092 |
| bob | 535 | 100.460597 | 181.511521 | 2026-02-02 | True | 22.0 | 30.492102 |
| carol | 960 | 65.316933 | 162.957884 | 2026-02-03 | True | NaN | 24.596597 |
| dave | 370 | 65.317261 | 178.138401 | 2026-02-04 | None | NaN | 20.583199 |
Columns or rows can be removed via the drop() method,
KeyError: "['student'] not found in axis"
| id | weight | height | date | age | |
|---|---|---|---|---|---|
| anna | 202 | 82.953771 | 193.688192 | 2026-02-01 | 19.0 |
| bob | 535 | 100.460597 | 181.511521 | 2026-02-02 | 22.0 |
| carol | 960 | 65.316933 | 162.957884 | 2026-02-03 | NaN |
| dave | 370 | 65.317261 | 178.138401 | 2026-02-04 | NaN |
| id | weight | height | date | student | age | |
|---|---|---|---|---|---|---|
| bob | 535 | 100.460597 | 181.511521 | 2026-02-02 | True | 22.0 |
| carol | 960 | 65.316933 | 162.957884 | 2026-02-03 | True | NaN |
KeyError: '[False, False, False, False, False, True] not found in axis'
| id | weight | height | date | student | |
|---|---|---|---|---|---|
| anna | 202 | 82.953771 | 193.688192 | 2026-02-01 | True |
| bob | 535 | 100.460597 | 181.511521 | 2026-02-02 | True |
| carol | 960 | 65.316933 | 162.957884 | 2026-02-03 | True |
| dave | 370 | 65.317261 | 178.138401 | 2026-02-04 | None |
| id | date | student | age | |
|---|---|---|---|---|
| anna | 202 | 2026-02-01 | True | 19.0 |
| bob | 535 | 2026-02-02 | True | 22.0 |
| carol | 960 | 2026-02-03 | True | NaN |
| dave | 370 | 2026-02-04 | None | NaN |
DataFrames can be sorted on one or more columns via sort_values(),
| id | weight | height | date | student | age | |
|---|---|---|---|---|---|---|
| anna | 202 | 82.953771 | 193.688192 | 2026-02-01 | True | 19.0 |
| bob | 535 | 100.460597 | 181.511521 | 2026-02-02 | True | 22.0 |
| carol | 960 | 65.316933 | 162.957884 | 2026-02-03 | True | NaN |
| dave | 370 | 65.317261 | 178.138401 | 2026-02-04 | None | NaN |
All three can be used to combine data frames,
concat() stacks DataFrames on either axis, with basic alignment based on (row) indexes. join argument only supports “inner” and “outer”.
merge() aligns based on one or more shared columns. how supports “inner”, “outer”, “left”, “right”, and “cross”.
join() uses merge() behind the scenes, but prefers to join based on (row) indexes. Also has different default how compared to merge(), “left” vs “inner”.
Pivoting
Split-Apply-Combine (group by / summarize)
Pandas & pyarrow
Polars
Sta 663 - Spring 2026