pandas

Lecture 09

Dr. Colin Rundel

pandas

pandas is an implementation of data frames in Python - it takes much of its inspiration from R and NumPy.

pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

Key features:

DataFrame and Series (column) object classes
Reading and writing tabular data
Data munging (filtering, grouping, summarizing, joining, etc.)
Data reshaping

import pandas as pd
pd.__version__

'3.0.0'

Series

The columns of a DataFrame are constructed using the Series class - these are a 1D array-like object containing values of the same type (similar to a numpy array).

pd.Series([1,2,3,4])

0    1
1    2
2    3
3    4
dtype: int64

pd.Series(["C","B","A"])

0    C
1    B
2    A
dtype: str

pd.Series([True])

0    True
dtype: bool

pd.Series(range(5))

0    0
1    1
2    2
3    3
4    4
dtype: int64

pd.Series([1,"A",True])

0       1
1       A
2    True
dtype: object

Series methods

Once constructed the components of a series can be accessed via the array and index attributes.

s = pd.Series([4,2,1,3])

0    4
1    2
2    1
3    3
dtype: int64

s.array

<NumpyExtensionArray>
[4, 2, 1, 3]
Length: 4, dtype: int64

s.index

RangeIndex(start=0, stop=4, step=1)

An index (row names) can also be explicitly provided when constructing a Series,

t = pd.Series([4,2,1,3], index=["a","b","c","d"])

a    4
b    2
c    1
d    3
dtype: int64

t.array

<NumpyExtensionArray>
[4, 2, 1, 3]
Length: 4, dtype: int64

t.index

Index(['a', 'b', 'c', 'd'], dtype='str')

Series + NumPy

Series objects are compatible with NumPy-like functions (i.e. vectorized)

t = pd.Series([4,2,1,3], index=["a","b","c","d"])

t + 1

a    5
b    3
c    2
d    4
dtype: int64

t / 2 + 1

a    3.0
b    2.0
c    1.5
d    2.5
dtype: float64

np.log(t)

a    1.386294
b    0.693147
c    0.000000
d    1.098612
dtype: float64

np.exp(-t**2/2)

a    0.000335
b    0.135335
c    0.606531
d    0.011109
dtype: float64

Series indexing

Series can be subset by their index values (not position) or by logical expressions (as with NumPy arrays or R vectors)

t = pd.Series([4,2,1,3], index=["a","b","c","d"])

t["c"]

np.int64(1)

t[["a","d"]]

a    4
d    3
dtype: int64

t[t == 3]

d    3
dtype: int64

t[t % 2 == 0]

a    4
b    2
dtype: int64

t["d"] = 6; t

a    4
b    2
c    1
d    6
dtype: int64

Position based indexing

Series cannot be directly indexed by position, but the iloc attribute can be used for this purpose.

t[1]

KeyError: 1

t[[1,2]]

KeyError: "None of [Index([1, 2], dtype='int64')] are in the [index]"

t.iloc[1]

np.int64(2)

t.iloc[[1,2]]

b    2
c    1
dtype: int64

Index alignment

When performing operations with multiple series, generally pandas will attempt to align the operation by the index values,

m = pd.Series([1,2,3,4], index = ["a","b","c","d"])
n = pd.Series([4,3,2,1], index = ["d","c","b","a"])
o = pd.Series([1,1,1,1,1], index = ["b","d","a","c","e"])

m + n

a    2
b    4
c    6
d    8
dtype: int64

n + m

a    2
b    4
c    6
d    8
dtype: int64

n + o

a    2.0
b    3.0
c    4.0
d    5.0
e    NaN
dtype: float64

Series and dicts

Series can also be constructed from dictionaries, in which case the keys are used as the index.

d = {"anna": "A+", "bob": "B-", "carol": "C", "dave": "D+"}
pd.Series(d)

anna     A+
bob      B-
carol     C
dave     D+
dtype: str

Index order will follow key order, unless overridden by index,

pd.Series(d, index = ["dave","carol","bob","anna"])

dave     D+
carol     C
bob      B-
anna     A+
dtype: str

Missing values

Pandas encodes missing values using NaN (mostly),

s1 = pd.Series(
  {"anna": "A+", "bob": "B-", 
   "carol": "C", "dave": "D+"}, 
  index = ["erin","dave","carol",
           "bob","anna"]
)

s2 = pd.Series(
  {"anna": 97, "bob": 82, 
   "carol": 75, "dave": 68}, 
  index = ["erin","dave","carol",
           "bob","anna"],
  dtype = 'int64'
)

s1

erin     NaN
dave      D+
carol      C
bob       B-
anna      A+
dtype: str

pd.isna(s1)

erin      True
dave     False
carol    False
bob      False
anna     False
dtype: bool

s2

erin      NaN
dave     68.0
carol    75.0
bob      82.0
anna     97.0
dtype: float64

pd.isna(s2)

erin      True
dave     False
carol    False
bob      False
anna     False
dtype: bool

Aside - why `pd.isna()`?

s = pd.Series([1,2,3,None]); s

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

pd.isna(s)

0    False
1    False
2    False
3     True
dtype: bool

s == np.nan

0    False
1    False
2    False
3    False
dtype: bool

np.nan == np.nan

False

np.nan != np.nan

True

np.isnan(np.nan)

np.True_

np.isnan(0)

np.False_

Missing via `None`

In some cases None can also be used as a missing value, for example:

pd.Series([1,2,3,None])

0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

pd.Series([True,False,None])

0     True
1    False
2     None
dtype: object

pd.isna( pd.Series([1,2,3,None]) )

0    False
1    False
2    False
3     True
dtype: bool

pd.isna( pd.Series([True,False,None]) )

0    False
1    False
2     True
dtype: bool

This can have a side effect of changing the dtype of the series.

Native NAs

If instead of using base dtypes we use Pandas’ built-in dtypes we get “native” support for missing values,

pd.Series(
  [1,2,3,None], 
  dtype = pd.Int64Dtype()
)

0       1
1       2
2       3
3    <NA>
dtype: Int64

pd.Series(
  [True, False,None], 
  dtype = pd.BooleanDtype()
)

0     True
1    False
2     <NA>
dtype: boolean

String series

Series containing strings can have their strings accessed via the str attribute,

s = pd.Series(["the quick", "brown fox", "jumps over", "a lazy dog"])

0     the quick
1     brown fox
2    jumps over
3    a lazy dog
dtype: str

s.str.upper()

0     THE QUICK
1     BROWN FOX
2    JUMPS OVER
3    A LAZY DOG
dtype: str

s.str.split(" ")

0      [the, quick]
1      [brown, fox]
2     [jumps, over]
3    [a, lazy, dog]
dtype: object

s.str.split(" ").str[1]

0    quick
1      fox
2     over
3     lazy
dtype: object

pd.Series([1,2,3]).str

AttributeError: Can only use .str accessor with string values, not integer. Did you mean: 'std'?

Categorical Series

pd.Series(
  ["Mon", "Tue", "Wed", "Thu", "Fri"]
)

0    Mon
1    Tue
2    Wed
3    Thu
4    Fri
dtype: str

pd.Series(
  ["Mon", "Tue", "Wed", "Thu", "Fri"],
  dtype="category"
)

0    Mon
1    Tue
2    Wed
3    Thu
4    Fri
dtype: category
Categories (5, str): ['Fri', 'Mon', 'Thu', 'Tue', 'Wed']

pd.Series(
  ["Mon", "Tue", "Wed", "Thu", "Fri"], 
  dtype=pd.CategoricalDtype(ordered=True)
)

0    Mon
1    Tue
2    Wed
3    Thu
4    Fri
dtype: category
Categories (5, str): ['Fri' < 'Mon' < 'Thu' < 'Tue' < 'Wed']

Category orders

pd.Series(
  ["Tue", "Thu", "Mon", "Sat"], 
  dtype=pd.CategoricalDtype(
    categories=["Mon", "Tue", "Wed", "Thu", "Fri"], 
    ordered=True
  )
)

0    Tue
1    Thu
2    Mon
3    NaN
dtype: category
Categories (5, str): ['Mon' < 'Tue' < 'Wed' < 'Thu' < 'Fri']

DataFrames

DataFrame

Just like R a DataFrame is a collection of vectors (Series) with a common length (and a common index)
Column dtypes can be heterogeneous
Columns have names stored in the columns index.
It can be useful to think of a dictionary of Series objects where the keys are the column names.

iris = pd.read_csv("data/iris.csv")
type(iris)

<class 'pandas.DataFrame'>

iris

	Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

150 rows × 5 columns

Constructing DataFrames

We just saw how to read a DataFrame via read_csv(), DataFrames can also be constructed via DataFrame(), in general this is done using a dictionary of columns / Series.

n = 5
d = {
  "id":     np.random.randint(100, 999, n),
  "weight": np.random.normal(70, 20, n),
  "height": np.random.normal(170, 15, n),
  "date":   pd.date_range(start='2/1/2026', periods=n, freq='D')
}

df = pd.DataFrame(d); df

	id	weight	height	date
0	200	57.715447	156.243857	2026-02-01
1	995	69.552249	196.166608	2026-02-02
2	238	48.327161	178.808975	2026-02-03
3	768	126.885781	168.055900	2026-02-04
4	342	93.758617	181.371930	2026-02-05

DataFrame from nparray

2d ndarrays can also be used to construct a DataFrame - generally it is a good idea to provide column and row names (indexes)

pd.DataFrame(
  np.diag([1,2,3]),
  columns = ["x","y","z"]
)

	x	y	z
0	1	0	0
1	0	2	0
2	0	0	3

pd.DataFrame(
  np.diag([1,2,3]),
  index = ["x","y","z"]
)

	0	1	2
x	1	0	0
y	0	2	0
z	0	0	3

pd.DataFrame(
  np.tri(5,3,-1),
  columns = ["x","y","z"],
  index = ["a","b","c","d","e"]
)

	x	y	z
a	0.0	0.0	0.0
b	1.0	0.0	0.0
c	1.0	1.0	0.0
d	1.0	1.0	1.0
e	1.0	1.0	1.0

DataFrame attributes & methods

df.size

df.shape

(5, 4)

df.info()

<class 'pandas.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   id      5 non-null      int64         
 1   weight  5 non-null      float64       
 2   height  5 non-null      float64       
 3   date    5 non-null      datetime64[us]
dtypes: datetime64[us](1), float64(2), int64(1)
memory usage: 292.0 bytes

df.dtypes

id                 int64
weight           float64
height           float64
date      datetime64[us]
dtype: object

df.columns

Index(['id', 'weight', 'height', 'date'], dtype='str')

df.index

RangeIndex(start=0, stop=5, step=1)

df.axes

[RangeIndex(start=0, stop=5, step=1), Index(['id', 'weight', 'height', 'date'], dtype='str')]

DataFrame indexing

Selecting a column - use name or via the . accessor,

df["id"]

0    200
1    995
2    238
3    768
4    342
Name: id, dtype: int64

df.id

0    200
1    995
2    238
3    768
4    342
Name: id, dtype: int64

Selecting rows - use a single slice

df[1:3]

	id	weight	height	date
1	995	69.552249	196.166608	2026-02-02
2	238	48.327161	178.808975	2026-02-03

df[0::2]

	id	weight	height	date
0	200	57.715447	156.243857	2026-02-01
2	238	48.327161	178.808975	2026-02-03
4	342	93.758617	181.371930	2026-02-05

Integer based subsets are not allowed:

df[0]

KeyError: 0

Indexing by position

As with Series, position based indexing is done via the iloc attribute

df.iloc[1]

id                        995
weight              69.552249
height             196.166608
date      2026-02-02 00:00:00
Name: 1, dtype: object

df.iloc[[1]]

	id	weight	height	date
1	995	69.552249	196.166608	2026-02-02

df.iloc[0:2]

	id	weight	height	date
0	200	57.715447	156.243857	2026-02-01
1	995	69.552249	196.166608	2026-02-02

df.iloc[1:3,1:3]

	weight	height
1	69.552249	196.166608
2	48.327161	178.808975

df.iloc[0:3, [0,3]]

	id	date
0	200	2026-02-01
1	995	2026-02-02
2	238	2026-02-03

df.iloc[0:3, [True, True, False, False]]

	id	weight
0	200	57.715447
1	995	69.552249
2	238	48.327161

df.iloc[lambda x: x.index % 2 != 0]

	id	weight	height	date
1	995	69.552249	196.166608	2026-02-02
3	768	126.885781	168.055900	2026-02-04

Index by name

df.index = (["anna","bob","carol", "dave", "erin"]); df

	id	weight	height	date
anna	200	57.715447	156.243857	2026-02-01
bob	995	69.552249	196.166608	2026-02-02
carol	238	48.327161	178.808975	2026-02-03
dave	768	126.885781	168.055900	2026-02-04
erin	342	93.758617	181.371930	2026-02-05

df.loc["anna"]

id                        200
weight              57.715447
height             156.243857
date      2026-02-01 00:00:00
Name: anna, dtype: object

df.loc[["anna"]]

	id	weight	height	date
anna	200	57.715447	156.243857	2026-02-01

type(df.loc["anna"])

<class 'pandas.Series'>

type(df.loc[["anna"]])

<class 'pandas.DataFrame'>

df.loc["bob":"dave"]

	id	weight	height	date
bob	995	69.552249	196.166608	2026-02-02
carol	238	48.327161	178.808975	2026-02-03
dave	768	126.885781	168.055900	2026-02-04

df.loc[df.id < 300]

	id	weight	height	date
anna	200	57.715447	156.243857	2026-02-01
carol	238	48.327161	178.808975	2026-02-03

df.loc[:, "date"]

anna    2026-02-01
bob     2026-02-02
carol   2026-02-03
dave    2026-02-04
erin    2026-02-05
Name: date, dtype: datetime64[us]

df.loc[["bob","erin"], "weight":"height"]

	weight	height
bob	69.552249	196.166608
erin	93.758617	181.371930

df.loc[0:2, "weight":"height"]

TypeError: cannot do slice indexing on Index with these indexers [0] of type int

Views vs. Copies

In general most pandas operations will generate a new object. Previously some subset operations would return views, this has mostly been resolved with pandas 3.0.0.

d = pd.DataFrame(np.arange(6).reshape(3,2), columns = ["x","y"]); d

	x	y
0	0	1
1	2	3
2	4	5

v = d.iloc[0:2,0:2]; v

	x	y
0	0	1
1	2	3

d.iloc[0,1] = -1; d

	x	y
0	0	-1
1	2	3
2	4	5

	x	y
0	0	1
1	2	3

v.iloc[0,0] = 100; v

	x	y
0	100	1
1	2	3

	x	y
0	0	-1
1	2	3
2	4	5

Element access

df

	id	weight	height	date
anna	200	57.715447	156.243857	2026-02-01
bob	995	69.552249	196.166608	2026-02-02
carol	238	48.327161	178.808975	2026-02-03
dave	768	126.885781	168.055900	2026-02-04
erin	342	93.758617	181.371930	2026-02-05

df[0,0]

KeyError: (0, 0)

df.iat[0,0]

np.int64(200)

df.id[0]

KeyError: 0

df[0:1].id[0]

KeyError: 0

df["anna", "id"]

KeyError: ('anna', 'id')

df.at["anna", "id"]

np.int64(200)

df["id"]["anna"]

np.int64(200)

df["id"][0]

KeyError: 0

Index objects

Columns and index

When constructing a DataFrame we can specify the indexes for both the rows (index) and columns (columns),

df1 = pd.DataFrame(
  np.random.randn(5, 3), 
  columns=['A', 'B', 'C']
)
df1

	A	B	C
0	0.555298	1.237481	-1.096399
1	-0.457000	-0.817442	-1.658612
2	-0.943632	-0.190040	-0.850275
3	-0.325382	-0.152456	0.426736
4	1.276448	-1.010736	1.063402

df2 = pd.DataFrame(
  np.random.randn(3, 3), 
  index=['x','y','z'], 
  columns=['A', 'B', 'C']
)
df2

	A	B	C
x	-1.080576	-0.032645	-0.379919
y	-2.691178	-0.247386	0.541314
z	-0.217621	1.012104	-0.367918

df1.columns

Index(['A', 'B', 'C'], dtype='str')

df1.index

RangeIndex(start=0, stop=5, step=1)

df2.columns

Index(['A', 'B', 'C'], dtype='str')

df2.index

Index(['x', 'y', 'z'], dtype='str')

Index objects

pandas’ Index class and its subclasses provide the infrastructure necessary for lookups, data alignment, and other related tasks. You can think of them as being an immutable multiset (i.e. duplicate values are allowed).

pd.Index(['A','B','C'])

Index(['A', 'B', 'C'], dtype='str')

pd.Index(['A','B','C','A'])

Index(['A', 'B', 'C', 'A'], dtype='str')

pd.Index(range(5))

RangeIndex(start=0, stop=5, step=1)

pd.Index(list(range(5)))

Index([0, 1, 2, 3, 4], dtype='int64')

Index names

Index objects can have names which are shown when printing the DataFrame or Index,

df = pd.DataFrame(
  np.random.randn(3, 3), 
  index=pd.Index(['x','y','z'], name="rows"),
  columns=pd.Index(['A', 'B', 'C'], name="cols")
)
df

cols	A	B	C
rows
x	-0.649065	-0.881551	1.478708
y	1.534376	0.202521	0.195817
z	0.834751	1.269891	-1.427342

df.columns

Index(['A', 'B', 'C'], dtype='str', name='cols')

df.index

Index(['x', 'y', 'z'], dtype='str', name='rows')

Indexes and missing values

It is possible for an index to contain missing values (e.g. np.nan) but this is generally a bad idea and should be avoided.

pd.Index([1,2,3,np.nan,5])

Index([1.0, 2.0, 3.0, nan, 5.0], dtype='float64')

pd.Index(["A","B",np.nan,"D", None])

Index(['A', 'B', nan, 'D', nan], dtype='str')

Missing values can be replaced via the fillna() method,

pd.Index([1,2,3,np.nan,5]).fillna(0)

Index([1.0, 2.0, 3.0, 0.0, 5.0], dtype='float64')

pd.Index(["A","B",np.nan,"D", None]).fillna("Z")

Index(['A', 'B', 'Z', 'D', 'Z'], dtype='str')

Changing a DataFrame’s index

Existing columns can be made into an index via set_index() and removed via reset_index(),

data

	a	b	c	d
0	bar	one	z	1
1	bar	two	y	2
2	foo	one	x	3
3	foo	two	w	4

data.set_index('a')

	b	c	d
a
bar	one	z	1
bar	two	y	2
foo	one	x	3
foo	two	w	4

data.set_index('a').reset_index()

	a	b	c	d
0	bar	one	z	1
1	bar	two	y	2
2	foo	one	x	3
3	foo	two	w	4

drop argument

data.set_index('c', drop=False)

	a	b	c	d
c
z	bar	one	z	1
y	bar	two	y	2
x	foo	one	x	3
w	foo	two	w	4

data.set_index('c').reset_index(drop=True)

	a	b	d
0	bar	one	1
1	bar	two	2
2	foo	one	3
3	foo	two	4

Creating a new index

New index values can be attached to a DataFrame via reindex(),

data

	a	b	c	d
0	bar	one	z	1
1	bar	two	y	2
2	foo	one	x	3
3	foo	two	w	4

data.reindex(columns=["a","b","c","d","e"])

	a	b	c	d	e
0	bar	one	z	1	NaN
1	bar	two	y	2	NaN
2	foo	one	x	3	NaN
3	foo	two	w	4	NaN

data.reindex(["w","x","y","z"])

	a	b	c	d
w	NaN	NaN	NaN	NaN
x	NaN	NaN	NaN	NaN
y	NaN	NaN	NaN	NaN
z	NaN	NaN	NaN	NaN

data.index = ["w","x","y","z"]; data

	a	b	c	d
w	bar	one	z	1
x	bar	two	y	2
y	foo	one	x	3
z	foo	two	w	4

data.reindex(range(4,0,-1))

	a	b	c	d
4	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN

data.index = range(4,0,-1); data

	a	b	c	d
4	bar	one	z	1
3	bar	two	y	2
2	foo	one	x	3
1	foo	two	w	4

MultiIndexes

MultiIndex objects

These are a hierarchical analog of standard Index objects and are used to represent nested indexes. There are a number of methods for constructing them based on the initial object

tuples = [('A','x'), ('A','y'),
          ('B','x'), ('B','y'),
          ('C','x'), ('C','y')]
pd.MultiIndex.from_tuples(
  tuples, names=["1st","2nd"]
)

MultiIndex([('A', 'x'),
            ('A', 'y'),
            ('B', 'x'),
            ('B', 'y'),
            ('C', 'x'),
            ('C', 'y')],
           names=['1st', '2nd'])

pd.MultiIndex.from_product(
  [["A","B","C"],
   ["x","y"]], 
  names=["1st","2nd"]
)

MultiIndex([('A', 'x'),
            ('A', 'y'),
            ('B', 'x'),
            ('B', 'y'),
            ('C', 'x'),
            ('C', 'y')],
           names=['1st', '2nd'])

DataFrame with MultiIndex

idx = pd.MultiIndex.from_tuples(
  tuples, names=["1st","2nd"]
)

pd.DataFrame(
  np.random.rand(6,2), 
  index = idx, 
  columns=["m","n"]
)

		m	n
1st	2nd
A	x	0.432674	0.954728
A	y	0.519791	0.771165
B	x	0.949299	0.521805
B	y	0.809213	0.208332
C	x	0.777001	0.002249
C	y	0.762001	0.976363

Column MultiIndex

MultiIndexes can also be used for columns as well,

cidx = pd.MultiIndex.from_product(
  [["A","B"],["x","y"]], names=["c1","c2"]
)

pd.DataFrame(
  np.random.rand(4,4), columns = cidx
)

c1	A		B
c2	x	y	x	y
0	0.660493	0.181417	0.784601	0.691375
1	0.137302	0.545351	0.640257	0.525258
2	0.009936	0.646707	0.128306	0.315168
3	0.320345	0.371164	0.558344	0.280129

ridx = pd.MultiIndex.from_product(
  [["m","n"],["l","p"]], names=["r1","r2"]
)

pd.DataFrame(
  np.random.rand(4,4), 
  index= ridx, columns = cidx
)

	c1	A		B
	c2	x	y	x	y
r1	r2
m	l	0.524746	0.096756	0.664596	0.450048
m	p	0.122310	0.842365	0.867014	0.858448
n	l	0.465433	0.998055	0.875562	0.567632
n	p	0.911802	0.745366	0.449562	0.006176

MultiIndex indexing

data

	c1	A		B
	c2	x	y	x	y
r1	r2
m	l	0.179763	0.787379	0.565183	0.665208
m	p	0.220432	0.483390	0.292793	0.589280
n	l	0.823848	0.654454	0.494629	0.924918
n	p	0.344536	0.541508	0.544303	0.591460

data["A"]

	c2	x	y
r1	r2
m	l	0.179763	0.787379
m	p	0.220432	0.483390
n	l	0.823848	0.654454
n	p	0.344536	0.541508

data["x"]

KeyError: 'x'

data["m"]

KeyError: 'm'

data["m","A"]

KeyError: ('m', 'A')

data["A","x"]

r1  r2
m   l     0.179763
    p     0.220432
n   l     0.823848
    p     0.344536
Name: (A, x), dtype: float64

data["A"]["x"]

r1  r2
m   l     0.179763
    p     0.220432
n   l     0.823848
    p     0.344536
Name: x, dtype: float64

MultiIndex indexing via `iloc`

data.iloc[0]

c1  c2
A   x     0.179763
    y     0.787379
B   x     0.565183
    y     0.665208
Name: (m, l), dtype: float64

type(data.iloc[0])

<class 'pandas.Series'>

data.iloc[(0,1)]

np.float64(0.7873788607957514)

data.iloc[[0,1]]

	c1	A		B
	c2	x	y	x	y
r1	r2
m	l	0.179763	0.787379	0.565183	0.665208
m	p	0.220432	0.483390	0.292793	0.589280

data.iloc[:,0]

r1  r2
m   l     0.179763
    p     0.220432
n   l     0.823848
    p     0.344536
Name: (A, x), dtype: float64

type(data.iloc[:,0])

<class 'pandas.Series'>

data.iloc[0,1]

np.float64(0.7873788607957514)

data.iloc[0,[0,1]]

c1  c2
A   x     0.179763
    y     0.787379
Name: (m, l), dtype: float64

MultiIndex indexing via `loc`

data.loc["m"]

c1	A		B
c2	x	y	x	y
r2
l	0.179763	0.787379	0.565183	0.665208
p	0.220432	0.483390	0.292793	0.589280

data.loc["l"]

KeyError: 'l'

data.loc[:,"A"]

	c2	x	y
r1	r2
m	l	0.179763	0.787379
m	p	0.220432	0.483390
n	l	0.823848	0.654454
n	p	0.344536	0.541508

data.loc[("m","l")]

c1  c2
A   x     0.179763
    y     0.787379
B   x     0.565183
    y     0.665208
Name: (m, l), dtype: float64

data.loc[:,("A","y")]

r1  r2
m   l     0.787379
    p     0.483390
n   l     0.654454
    p     0.541508
Name: (A, y), dtype: float64

Fancier indexing with `loc`

Index slices can also be used with combinations of indexes and index tuples,

data.loc["m":"n"]

	c1	A		B
	c2	x	y	x	y
r1	r2
m	l	0.179763	0.787379	0.565183	0.665208
m	p	0.220432	0.483390	0.292793	0.589280
n	l	0.823848	0.654454	0.494629	0.924918
n	p	0.344536	0.541508	0.544303	0.591460

data.loc[("m","p"):"n"]

	c1	A		B
	c2	x	y	x	y
r1	r2
m	p	0.220432	0.483390	0.292793	0.589280
n	l	0.823848	0.654454	0.494629	0.924918
n	p	0.344536	0.541508	0.544303	0.591460

data.loc[("m","l"):("n","l")]

	c1	A		B
	c2	x	y	x	y
r1	r2
m	l	0.179763	0.787379	0.565183	0.665208
m	p	0.220432	0.483390	0.292793	0.589280
n	l	0.823848	0.654454	0.494629	0.924918

data.loc[[("m","p"),("n","l")]]

	c1	A		B
	c2	x	y	x	y
r1	r2
m	p	0.220432	0.483390	0.292793	0.589280
n	l	0.823848	0.654454	0.494629	0.924918

Selecting nested levels

The previous methods don’t give easy access to indexing on nested index levels, this is possible via the cross-section method xs(),

data.xs("p", level="r2")

c1	A		B
c2	x	y	x	y
r1
m	0.220432	0.483390	0.292793	0.58928
n	0.344536	0.541508	0.544303	0.59146

data.xs("y", level="c2", axis=1)

	c1	A	B
r1	r2
m	l	0.787379	0.665208
m	p	0.483390	0.589280
n	l	0.654454	0.924918
n	p	0.541508	0.591460

data.xs("m", level="r1")

c1	A		B
c2	x	y	x	y
r2
l	0.179763	0.787379	0.565183	0.665208
p	0.220432	0.483390	0.292793	0.589280

data.xs("B", level="c1", axis=1)

	c2	x	y
r1	r2
m	l	0.565183	0.665208
m	p	0.292793	0.589280
n	l	0.494629	0.924918
n	p	0.544303	0.591460

Setting MultiIndexes

It is also possible to construct a MultiIndex or modify an existing one using set_index() and reset_index(),

data

	a	b	c	d
0	bar	one	z	1
1	bar	two	y	2
2	foo	one	x	3

data.set_index(['a','b'])

		c	d
a	b
bar	one	z	1
bar	two	y	2
foo	one	x	3

data.set_index(['a','b']).reset_index()

	a	b	c	d
0	bar	one	z	1
1	bar	two	y	2
2	foo	one	x	3

data.set_index('c', append=True)

		a	b	d
	c
0	z	bar	one	1
1	y	bar	two	2
2	x	foo	one	3

data.set_index(['a','b']).reset_index(level=1)

	b	c	d
a
bar	one	z	1
bar	two	y	2
foo	one	x	3

Working with DataFrames

Filtering rows

The query() method can be used for filtering rows, it evaluates a string expression in the context of the data frame.

df.query('date == "2026-02-01"')

	id	weight	height	date

df.query('weight > 50')

	id	weight	height	date
anna	202	82.953771	193.688192	2026-02-01
bob	535	100.460597	181.511521	2026-02-02
carol	960	65.316933	162.957884	2026-02-03
dave	370	65.317261	178.138401	2026-02-04

df.query('weight > 50 & height < 165')

	id	weight	height	date
carol	960	65.316933	162.957884	2026-02-03

qid = 202
df.query('id == @qid')

	id	weight	height	date
anna	202	82.953771	193.688192	2026-02-01

Selecting Columns

Beyond the use of loc() and iloc() there is also the filter() method which can be used to select columns (or indices) by name with pattern matching

df.filter(items=["id","weight"])

	id	weight
anna	202	82.953771
bob	535	100.460597
carol	960	65.316933
dave	370	65.317261

df.filter(like = "i")

	id	weight	height
anna	202	82.953771	193.688192
bob	535	100.460597	181.511521
carol	960	65.316933	162.957884
dave	370	65.317261	178.138401

df.filter(regex="ght$")

	weight	height
anna	82.953771	193.688192
bob	100.460597	181.511521
carol	65.316933	162.957884
dave	65.317261	178.138401

df.filter(like="a", axis=0)

	id	weight	height	date
anna	202	82.953771	193.688192	2026-02-01
carol	960	65.316933	162.957884	2026-02-03
dave	370	65.317261	178.138401	2026-02-04

Adding columns

Indexing with assignment allows for inplace modification of a DataFrame, while assign() creates a new object (but is chainable)

df['student'] = [True, True, True, None]
df['age'] = [19, 22, None, None]
df

	id	weight	height	date	student	age
anna	202	82.953771	193.688192	2026-02-01	True	19.0
bob	535	100.460597	181.511521	2026-02-02	True	22.0
carol	960	65.316933	162.957884	2026-02-03	True	NaN
dave	370	65.317261	178.138401	2026-02-04	None	NaN

df.assign(
  student = lambda x: np.where(x.student, "yes", "no"),
  rand = np.random.rand(n)
)

	id	weight	height	date	student	age	rand
anna	202	82.953771	193.688192	2026-02-01	yes	19.0	0.181825
bob	535	100.460597	181.511521	2026-02-02	yes	22.0	0.183405
carol	960	65.316933	162.957884	2026-02-03	yes	NaN	0.304242
dave	370	65.317261	178.138401	2026-02-04	no	NaN	0.524756

pd.col()

As of pandas 3.0.0 there is also a pd.col() function which can be used within methods like assign() and query() to refer to columns by name,

df.assign(
  bmi = pd.col("weight") / np.power(pd.col("height")/100, 2)
)

	id	weight	height	date	student	age	bmi
anna	202	82.953771	193.688192	2026-02-01	True	19.0	22.112092
bob	535	100.460597	181.511521	2026-02-02	True	22.0	30.492102
carol	960	65.316933	162.957884	2026-02-03	True	NaN	24.596597
dave	370	65.317261	178.138401	2026-02-04	None	NaN	20.583199

Removing columns (and rows)

Columns or rows can be removed via the drop() method,

df.drop(['student'])

KeyError: "['student'] not found in axis"

df.drop(['student'], axis=1)

	id	weight	height	date	age
anna	202	82.953771	193.688192	2026-02-01	19.0
bob	535	100.460597	181.511521	2026-02-02	22.0
carol	960	65.316933	162.957884	2026-02-03	NaN
dave	370	65.317261	178.138401	2026-02-04	NaN

df.drop(['anna','dave'])

	id	weight	height	date	student	age
bob	535	100.460597	181.511521	2026-02-02	True	22.0
carol	960	65.316933	162.957884	2026-02-03	True	NaN

df.drop(columns = df.columns == "age")

KeyError: '[False, False, False, False, False, True] not found in axis'

df.drop(columns = df.columns[df.columns == "age"])

	id	weight	height	date	student
anna	202	82.953771	193.688192	2026-02-01	True
bob	535	100.460597	181.511521	2026-02-02	True
carol	960	65.316933	162.957884	2026-02-03	True
dave	370	65.317261	178.138401	2026-02-04	None

df.drop(columns = df.columns[df.columns.str.contains("ght")])

	id	date	student	age
anna	202	2026-02-01	True	19.0
bob	535	2026-02-02	True	22.0
carol	960	2026-02-03	True	NaN
dave	370	2026-02-04	None	NaN

Sorting

DataFrames can be sorted on one or more columns via sort_values(),

df

	id	weight	height	date	student	age
anna	202	82.953771	193.688192	2026-02-01	True	19.0
bob	535	100.460597	181.511521	2026-02-02	True	22.0
carol	960	65.316933	162.957884	2026-02-03	True	NaN
dave	370	65.317261	178.138401	2026-02-04	None	NaN

df.sort_values(by=["student","id"], ascending=[True,False])

	id	weight	height	date	student	age
carol	960	65.316933	162.957884	2026-02-03	True	NaN
bob	535	100.460597	181.511521	2026-02-02	True	22.0
anna	202	82.953771	193.688192	2026-02-01	True	19.0
dave	370	65.317261	178.138401	2026-02-04	None	NaN

join vs merge vs concat

All three can be used to combine data frames,

concat() stacks DataFrames on either axis, with basic alignment based on (row) indexes. join argument only supports “inner” and “outer”.
merge() aligns based on one or more shared columns. how supports “inner”, “outer”, “left”, “right”, and “cross”.
join() uses merge() behind the scenes, but prefers to join based on (row) indexes. Also has different default how compared to merge(), “left” vs “inner”.

Next week

Pivoting
Split-Apply-Combine (group by / summarize)
Pandas & pyarrow
Polars

pandas

pandas

Series

Series

Series methods

Series + NumPy

Series indexing

Position based indexing

Index alignment

Series and dicts

Missing values

Aside - why pd.isna()?

Missing via None

Native NAs

String series

Categorical Series

Category orders

DataFrames

DataFrame

Constructing DataFrames

DataFrame from nparray

DataFrame attributes & methods

DataFrame indexing

Indexing by position

Index by name

Views vs. Copies

Element access

Index objects

Columns and index

Index objects

Index names

Indexes and missing values

Changing a DataFrame’s index

drop argument

Creating a new index

MultiIndexes

MultiIndex objects

DataFrame with MultiIndex

Column MultiIndex

MultiIndex indexing

MultiIndex indexing via iloc

MultiIndex indexing via loc

Fancier indexing with loc

Selecting nested levels

Setting MultiIndexes

Working with DataFrames

Filtering rows

Selecting Columns

Adding columns

pd.col()

Removing columns (and rows)

Sorting

join vs merge vs concat

Next week

Aside - why `pd.isna()`?

Missing via `None`

MultiIndex indexing via `iloc`

MultiIndex indexing via `loc`

Fancier indexing with `loc`