跳转至

Topic 3.6 - Pandas 中的数据筛选

在本节开始前,我们先导入一下 Pandas 库:

import pandas as pd

1. 基于条件的数据筛选

在前面的章节中,我们介绍了按行和列的索引进行数据筛选,在实际的数据分析中,更常见的是基于条件进行数据筛选。

(1) 基于单个条件的筛选

在 Pandas 中,可以使用布尔索引来实现基于单个条件的筛选

  • 格式是 df[条件],其中 条件 的格式是:df["列名"] 后面跟上比较运算符和具体的值,注意这里不能只跟列名
  • 例如,在学生成绩表中筛选出数学成绩大于 90 的记录:df1[df1["math"] > 90]
df1 = pd.read_csv("../data/students.csv")
df1
name class gender age chinese math english
0 Alice Johnson 1 Female 18 85 88 90
1 Bob Smith 2 Male 19 78 75 80
2 Charlie Brown 3 Male 18 92 90 94
3 David Miller 1 Male 20 88 85 87
4 Emma Wilson 2 Female 19 76 78 74
... ... ... ... ... ... ... ...
86 Jonah Larson 3 Male 21 78 76 77
87 Kylie Newman 1 Female 18 85 83 84
88 Lucas Porter 2 Male 22 91 92 93
89 Megan Arnold 3 Female 19 79 77 78
90 Nick Keller 1 Male 20 87 85 88

91 rows × 7 columns

df1[df1["math"] > 90]
name class gender age chinese math english
5 Frank Taylor 3 Male 21 90 92 89
8 Ivy Moore 3 Female 22 91 93 90
11 Leo Walker 3 Male 21 95 94 96
17 Rachel Hill 3 Female 18 90 92 91
21 Vera Baker 1 Female 21 92 94 93
27 Brian Turner 1 Male 18 93 95 94
30 Ella Parker 1 Female 20 90 92 91
35 Jason Sanchez 3 Male 20 91 93 92
40 Ophelia Morgan 2 Female 20 92 94 93
43 Ryan Howard 2 Male 22 90 92 91
48 Wendy Wood 1 Female 22 93 95 94
51 Zoe Kelly 1 Female 21 91 92 93
55 Daisy Ross 2 Female 20 90 92 91
61 Julia Long 2 Female 21 92 94 93
65 Nina Washington 3 Female 20 91 92 93
69 Sara Alexander 1 Female 19 90 92 91
71 Ursula Griffin 3 Female 21 93 95 94
77 Abby Sullivan 3 Female 18 91 92 93
82 Felix Stone 2 Male 18 92 94 93
85 Irene Olson 2 Female 20 90 92 91
88 Lucas Porter 2 Male 22 91 92 93

事实上:

  • df1["math"] > 90 会创建一个新的 Series,是布尔类型的,表示每一行的数学成绩是否大于 90
  • 将这个布尔 Series 再传入 df1[...] 中,就会只保留对应为 True 的那些行
df1["math"] > 90
0     False
1     False
2     False
3     False
4     False
      ...  
86    False
87    False
88     True
89    False
90    False
Name: math, Length: 91, dtype: bool

(2) 基于多个条件的筛选

如果基于多个条件进行筛选,可以使用逻辑运算符将多个条件组合起来:

  • Pandas 中的逻辑运算符和 Python 中的有所不同:

  • 逻辑与(AND):&

  • 逻辑或(OR):|
  • 逻辑非(NOT):~

  • 并且,每个条件都需要用括号括起来

例如:

  • 我们想筛选出数学成绩大于 90 且英语成绩大于 85 的记录:
df1[(df1["math"] > 90) & (df1["english"] >= 85)]
name class gender age chinese math english
5 Frank Taylor 3 Male 21 90 92 89
8 Ivy Moore 3 Female 22 91 93 90
11 Leo Walker 3 Male 21 95 94 96
17 Rachel Hill 3 Female 18 90 92 91
21 Vera Baker 1 Female 21 92 94 93
27 Brian Turner 1 Male 18 93 95 94
30 Ella Parker 1 Female 20 90 92 91
35 Jason Sanchez 3 Male 20 91 93 92
40 Ophelia Morgan 2 Female 20 92 94 93
43 Ryan Howard 2 Male 22 90 92 91
48 Wendy Wood 1 Female 22 93 95 94
51 Zoe Kelly 1 Female 21 91 92 93
55 Daisy Ross 2 Female 20 90 92 91
61 Julia Long 2 Female 21 92 94 93
65 Nina Washington 3 Female 20 91 92 93
69 Sara Alexander 1 Female 19 90 92 91
71 Ursula Griffin 3 Female 21 93 95 94
77 Abby Sullivan 3 Female 18 91 92 93
82 Felix Stone 2 Male 18 92 94 93
85 Irene Olson 2 Female 20 90 92 91
88 Lucas Porter 2 Male 22 91 92 93
  • 数学成绩大于 90 且英语和语文至少一个大于 70 的记录:
df1[(df1["math"] > 90) & ( (df1["english"] >= 70) | (df1["chinese"] >= 70) )]
name class gender age chinese math english
5 Frank Taylor 3 Male 21 90 92 89
8 Ivy Moore 3 Female 22 91 93 90
11 Leo Walker 3 Male 21 95 94 96
17 Rachel Hill 3 Female 18 90 92 91
21 Vera Baker 1 Female 21 92 94 93
27 Brian Turner 1 Male 18 93 95 94
30 Ella Parker 1 Female 20 90 92 91
35 Jason Sanchez 3 Male 20 91 93 92
40 Ophelia Morgan 2 Female 20 92 94 93
43 Ryan Howard 2 Male 22 90 92 91
48 Wendy Wood 1 Female 22 93 95 94
51 Zoe Kelly 1 Female 21 91 92 93
55 Daisy Ross 2 Female 20 90 92 91
61 Julia Long 2 Female 21 92 94 93
65 Nina Washington 3 Female 20 91 92 93
69 Sara Alexander 1 Female 19 90 92 91
71 Ursula Griffin 3 Female 21 93 95 94
77 Abby Sullivan 3 Female 18 91 92 93
82 Felix Stone 2 Male 18 92 94 93
85 Irene Olson 2 Female 20 90 92 91
88 Lucas Porter 2 Male 22 91 92 93

(3) 高级筛选方法

除了常见的数值判断之外,Pandas 还支持一些更高级的筛选方法:

  • 范围筛选:使用 between() 方法筛选在某个范围内的值,例如筛选数学成绩在 80 到 90之 间的记录(包括边界值 80 和 90):
df1[df1["math"].between(80, 90)]
name class gender age chinese math english
0 Alice Johnson 1 Female 18 85 88 90
2 Charlie Brown 3 Male 18 92 90 94
3 David Miller 1 Male 20 88 85 87
6 Grace Anderson 1 Female 20 84 82 86
9 Jack Martin 1 Male 19 87 85 88
12 Mia Hall 1 Female 18 82 80 84
13 Nathan Young 2 Male 22 88 86 87
15 Paul Wright 1 Male 20 85 83 84
18 Sam Scott 1 Male 22 83 80 82
19 Tina Green 2 Female 19 86 88 85
23 Xander Carter 3 Male 22 89 88 90
24 Yara Mitchell 1 Female 19 84 82 83
25 Zack Perez 2 Male 20 88 90 87
26 Amy Roberts 3 Female 21 81 80 82
29 Dylan Campbell 3 Male 19 85 83 86
32 Gina Edwards 3 Female 18 82 80 81
33 Harry Collins 1 Male 22 87 85 88
36 Kelly Morris 1 Female 21 83 82 84
37 Liam Rogers 2 Male 18 88 87 89
39 Noah Cook 1 Male 19 86 85 87
42 Queen Foster 1 Female 18 84 82 83
45 Tom Cox 1 Male 20 85 83 84
46 Uma Diaz 2 Female 21 89 88 90
47 Victor Richardson 3 Male 18 81 80 82
50 Yusuf Brooks 3 Male 20 86 85 87
53 Bella Bennett 3 Female 22 84 82 83
54 Chris Barnes 1 Male 19 88 87 89
57 Fiona Coleman 1 Female 18 82 80 81
58 George Jenkins 2 Male 22 87 86 88
60 Ian Powell 1 Male 20 85 83 84
63 Laura Hughes 1 Female 22 89 88 90
64 Mike Flores 2 Male 19 83 82 84
67 Paula Simmons 2 Female 18 88 87 89
68 Ray Bryant 3 Male 22 84 82 83
70 Tony Russell 2 Male 20 81 80 82
73 Whitney Myers 2 Female 22 86 85 87
74 Xavier Ford 3 Male 19 82 80 81
75 Yvonne Hamilton 1 Female 20 88 87 89
76 Zane Graham 2 Male 21 85 83 84
79 Clara West 2 Female 19 84 82 83
80 Daniel Jordan 3 Male 20 89 88 90
83 Gloria Hawkins 3 Female 22 83 82 84
84 Howard Dunn 1 Male 19 86 85 87
87 Kylie Newman 1 Female 18 85 83 84
90 Nick Keller 1 Male 20 87 85 88
  • 集合筛选:使用 isin() 方法筛选某列中值属于指定集合的记录,例如筛选一班和二班的学生记录:
df1[df1["class"].isin([1, 2])]
name class gender age chinese math english
0 Alice Johnson 1 Female 18 85 88 90
1 Bob Smith 2 Male 19 78 75 80
3 David Miller 1 Male 20 88 85 87
4 Emma Wilson 2 Female 19 76 78 74
6 Grace Anderson 1 Female 20 84 82 86
... ... ... ... ... ... ... ...
84 Howard Dunn 1 Male 19 86 85 87
85 Irene Olson 2 Female 20 90 92 91
87 Kylie Newman 1 Female 18 85 83 84
88 Lucas Porter 2 Male 22 91 92 93
90 Nick Keller 1 Male 20 87 85 88

61 rows × 7 columns

  • 字符串筛选:

    • str.contains() 筛选包含特定子字符串的记录
    • str.startswith()str.endswith() 分别筛选以特定子字符串开头或结尾的记录
df1[df1["name"].str.contains("A")]
name class gender age chinese math english
0 Alice Johnson 1 Female 18 85 88 90
6 Grace Anderson 1 Female 20 84 82 86
20 Ulysses Adams 3 Male 20 79 77 78
26 Amy Roberts 3 Female 21 81 80 82
52 Aaron Price 2 Male 18 79 77 78
69 Sara Alexander 1 Female 19 90 92 91
77 Abby Sullivan 3 Female 18 91 92 93
89 Megan Arnold 3 Female 19 79 77 78
df1[df1["name"].str.startswith("A")]
name class gender age chinese math english
0 Alice Johnson 1 Female 18 85 88 90
26 Amy Roberts 3 Female 21 81 80 82
52 Aaron Price 2 Male 18 79 77 78
77 Abby Sullivan 3 Female 18 91 92 93
df1[df1["name"].str.endswith("Taylor")]
name class gender age chinese math english
5 Frank Taylor 3 Male 21 90 92 89

2. 基于索引的数据筛选

事实上,我们之前讲过的 .loc 方法同样可以用于基于条件的筛选:

  • 语法格式是:df.loc[条件, 列名列表],这里的 条件 和前面介绍的布尔索引是一样的,需要写成 df["列名"] 后面跟上比较运算符和具体的值
  • 这样做的好处是,可以同时指定要筛选的列,而不是将所有列都保留下来

我们来看以下例子:

  • 筛选出数学成绩大于 90 的学生的姓名和数学成绩:
df1.loc[df1["math"] > 90, ["name", "math"]]
name math
5 Frank Taylor 92
8 Ivy Moore 93
11 Leo Walker 94
17 Rachel Hill 92
21 Vera Baker 94
27 Brian Turner 95
30 Ella Parker 92
35 Jason Sanchez 93
40 Ophelia Morgan 94
43 Ryan Howard 92
48 Wendy Wood 95
51 Zoe Kelly 92
55 Daisy Ross 92
61 Julia Long 94
65 Nina Washington 92
69 Sara Alexander 92
71 Ursula Griffin 95
77 Abby Sullivan 92
82 Felix Stone 94
85 Irene Olson 92
88 Lucas Porter 92
  • 筛选出英数学成绩大于 85 且语文成绩大于 80 的学生的姓名、语文、数学、英语成绩:
df1.loc[(df1["english"] > 85) & (df1["chinese"] > 80), ["name", "chinese", "math", "english"]]
name chinese math english
0 Alice Johnson 85 88 90
2 Charlie Brown 92 90 94
3 David Miller 88 85 87
5 Frank Taylor 90 92 89
6 Grace Anderson 84 82 86
8 Ivy Moore 91 93 90
9 Jack Martin 87 85 88
11 Leo Walker 95 94 96
13 Nathan Young 88 86 87
17 Rachel Hill 90 92 91
21 Vera Baker 92 94 93
23 Xander Carter 89 88 90
25 Zack Perez 88 90 87
27 Brian Turner 93 95 94
29 Dylan Campbell 85 83 86
30 Ella Parker 90 92 91
33 Harry Collins 87 85 88
35 Jason Sanchez 91 93 92
37 Liam Rogers 88 87 89
39 Noah Cook 86 85 87
40 Ophelia Morgan 92 94 93
43 Ryan Howard 90 92 91
46 Uma Diaz 89 88 90
48 Wendy Wood 93 95 94
50 Yusuf Brooks 86 85 87
51 Zoe Kelly 91 92 93
54 Chris Barnes 88 87 89
55 Daisy Ross 90 92 91
58 George Jenkins 87 86 88
61 Julia Long 92 94 93
63 Laura Hughes 89 88 90
65 Nina Washington 91 92 93
67 Paula Simmons 88 87 89
69 Sara Alexander 90 92 91
71 Ursula Griffin 93 95 94
73 Whitney Myers 86 85 87
75 Yvonne Hamilton 88 87 89
77 Abby Sullivan 91 92 93
80 Daniel Jordan 89 88 90
82 Felix Stone 92 94 93
84 Howard Dunn 86 85 87
85 Irene Olson 90 92 91
88 Lucas Porter 91 92 93
90 Nick Keller 87 85 88
  • 筛选出数学成绩大于 90 且语文或英语至少一科成绩大于 85 的学生的姓名、语文、数学、英语成绩:
df1.loc[(df1["math"] > 90) & ( (df1["english"] > 85) | (df1["chinese"] > 85) ), ["name", "chinese", "math", "english"]]
name chinese math english
5 Frank Taylor 90 92 89
8 Ivy Moore 91 93 90
11 Leo Walker 95 94 96
17 Rachel Hill 90 92 91
21 Vera Baker 92 94 93
27 Brian Turner 93 95 94
30 Ella Parker 90 92 91
35 Jason Sanchez 91 93 92
40 Ophelia Morgan 92 94 93
43 Ryan Howard 90 92 91
48 Wendy Wood 93 95 94
51 Zoe Kelly 91 92 93
55 Daisy Ross 90 92 91
61 Julia Long 92 94 93
65 Nina Washington 91 92 93
69 Sara Alexander 90 92 91
71 Ursula Griffin 93 95 94
77 Abby Sullivan 91 92 93
82 Felix Stone 92 94 93
85 Irene Olson 90 92 91
88 Lucas Porter 91 92 93

3. 基于查询的数据筛选

Pandas 还提供了一种更简洁的基于查询的方法 query() 来实现数据筛选:

  • 语法格式是:df.query("条件表达式"),其中 条件表达式 是一个字符串,直接写列名和比较运算符
  • 大家如果学过 SQL,会发现这种语法和 SQL 的 WHERE 子句非常类似

在这个条件表达式中:

  • 列名可以直接使用,不需要加引号或方括号
  • 逻辑运算符和前面介绍的布尔索引中使用的是一样的:& 表示逻辑与,| 表示逻辑或,~ 表示逻辑非
  • 条件和条件之间不强制使用括号,括号的使用主要是表明优先级的

我们来看以下例子,和上面的例子其实是对应的:

  • 筛选出数学成绩大于 90 的记录:
df1.query("math > 90")
name class gender age chinese math english
5 Frank Taylor 3 Male 21 90 92 89
8 Ivy Moore 3 Female 22 91 93 90
11 Leo Walker 3 Male 21 95 94 96
17 Rachel Hill 3 Female 18 90 92 91
21 Vera Baker 1 Female 21 92 94 93
27 Brian Turner 1 Male 18 93 95 94
30 Ella Parker 1 Female 20 90 92 91
35 Jason Sanchez 3 Male 20 91 93 92
40 Ophelia Morgan 2 Female 20 92 94 93
43 Ryan Howard 2 Male 22 90 92 91
48 Wendy Wood 1 Female 22 93 95 94
51 Zoe Kelly 1 Female 21 91 92 93
55 Daisy Ross 2 Female 20 90 92 91
61 Julia Long 2 Female 21 92 94 93
65 Nina Washington 3 Female 20 91 92 93
69 Sara Alexander 1 Female 19 90 92 91
71 Ursula Griffin 3 Female 21 93 95 94
77 Abby Sullivan 3 Female 18 91 92 93
82 Felix Stone 2 Male 18 92 94 93
85 Irene Olson 2 Female 20 90 92 91
88 Lucas Porter 2 Male 22 91 92 93
  • 筛选出数学成绩大于 90 且英语成绩大于 85 的记录:
df1.query("math > 90 & english > 85")
name class gender age chinese math english
5 Frank Taylor 3 Male 21 90 92 89
8 Ivy Moore 3 Female 22 91 93 90
11 Leo Walker 3 Male 21 95 94 96
17 Rachel Hill 3 Female 18 90 92 91
21 Vera Baker 1 Female 21 92 94 93
27 Brian Turner 1 Male 18 93 95 94
30 Ella Parker 1 Female 20 90 92 91
35 Jason Sanchez 3 Male 20 91 93 92
40 Ophelia Morgan 2 Female 20 92 94 93
43 Ryan Howard 2 Male 22 90 92 91
48 Wendy Wood 1 Female 22 93 95 94
51 Zoe Kelly 1 Female 21 91 92 93
55 Daisy Ross 2 Female 20 90 92 91
61 Julia Long 2 Female 21 92 94 93
65 Nina Washington 3 Female 20 91 92 93
69 Sara Alexander 1 Female 19 90 92 91
71 Ursula Griffin 3 Female 21 93 95 94
77 Abby Sullivan 3 Female 18 91 92 93
82 Felix Stone 2 Male 18 92 94 93
85 Irene Olson 2 Female 20 90 92 91
88 Lucas Porter 2 Male 22 91 92 93
  • 筛选出数学成绩大于 90 且语文和英语至少一科成绩大于 85 的记录:
df1.query("math > 90 & (english > 85 | chinese > 85)")
name class gender age chinese math english
5 Frank Taylor 3 Male 21 90 92 89
8 Ivy Moore 3 Female 22 91 93 90
11 Leo Walker 3 Male 21 95 94 96
17 Rachel Hill 3 Female 18 90 92 91
21 Vera Baker 1 Female 21 92 94 93
27 Brian Turner 1 Male 18 93 95 94
30 Ella Parker 1 Female 20 90 92 91
35 Jason Sanchez 3 Male 20 91 93 92
40 Ophelia Morgan 2 Female 20 92 94 93
43 Ryan Howard 2 Male 22 90 92 91
48 Wendy Wood 1 Female 22 93 95 94
51 Zoe Kelly 1 Female 21 91 92 93
55 Daisy Ross 2 Female 20 90 92 91
61 Julia Long 2 Female 21 92 94 93
65 Nina Washington 3 Female 20 91 92 93
69 Sara Alexander 1 Female 19 90 92 91
71 Ursula Griffin 3 Female 21 93 95 94
77 Abby Sullivan 3 Female 18 91 92 93
82 Felix Stone 2 Male 18 92 94 93
85 Irene Olson 2 Female 20 90 92 91
88 Lucas Porter 2 Male 22 91 92 93

注意:

  • query() 方法只能用于筛选行,不能像 .loc 那样同时指定要筛选的列
  • 但是要想筛选列,直接在 query() 之后再使用列索引来筛选即可
  • 例如,筛选出数学成绩大于 90 的学生的姓名和数学成绩:
df1.query("math > 90")[["name", "math"]]
name math
5 Frank Taylor 92
8 Ivy Moore 93
11 Leo Walker 94
17 Rachel Hill 92
21 Vera Baker 94
27 Brian Turner 95
30 Ella Parker 92
35 Jason Sanchez 93
40 Ophelia Morgan 94
43 Ryan Howard 92
48 Wendy Wood 95
51 Zoe Kelly 92
55 Daisy Ross 92
61 Julia Long 94
65 Nina Washington 92
69 Sara Alexander 92
71 Ursula Griffin 95
77 Abby Sullivan 92
82 Felix Stone 94
85 Irene Olson 92
88 Lucas Porter 92