Pandas的学习记录
数据结构
Pandas有两种数据结构,DataFrame和Series,其中Series是标量数据的集合,DataFrame是一系列Series的集合。
Numpy只有一种数据结构,就是Array。DataFrame可以通过DataFrame.to_numpy()
转换成Array,转换过程中会损失index属性。
Series具有index属性
In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [4]: s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
DataFrame对象具有index和Columns属性
In [5]: dates = pd.date_range('20130101', periods=6)
In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [8]: df
Out[8]:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
主要用法
Index | Function | Description |
---|---|---|
1 | .head() | 查看头部数据 |
2 | .tail(n) | 查看尾部数据 |
3 | .index | 显示索引 |
4 | .columns | 显示列名 |
5 | .describe() | 统计摘要 |
6 | .T | 转置 |
7 | .sort_index(axis=1,ascending=False) | 按轴排序 |
8 | .sort_values(by=’B’) | 按值排序 |
9 | df[‘A’] | 获取数据 |
10 | df[0:3]或者df[‘20130102’:’20130104’] | 切片行 |
11 | df.loc[dates[0]] | 用标签提取一行数据 |
12 | df.loc[:,[‘A’,’B’]] | 用标签提取多列数据 |
13 | df.loc[‘20130102’:’20130104’,[‘A’,’B’]] | 用标签切片 |
14 | df.iloc(3) | 用整数位置选择 |
15 | df.iloc[3:5,0:2] or df.iloc[[1,2,4],[0,2]] | 用整数位置切片 |
16 | df.iloc[1:3, :] | 整行显示切片 |
17 | df[df.A >0] | 单列值选择 |
18 | df[df > 0] | 选择所有满足条件的值 |
19 | df2[df2[‘E’].isin([‘two’, ‘four’])] | isin()筛选 |
赋值
df.at[dates[0], 'A'] = 0
df.iat[0, 1] = 0
df.loc[:, 'D'] = np.array([5] * len(df))
df2[df2 > 0] = -df2
重建索引
df1= df.reindex(index=dates[0:4],columns=list(df.columns)+['E])
df1.loc[dates[0]:dates[1],'E']=1
删除所有含有缺失值的行
df1.dropna(how='any') # delete all rows contain NaN
df1.fillna(value=5) # fill Nan with 5
pd.isna(df1) # get nan mask value
连接
# 分解为多组
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
Database style merge
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
Out[81]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
追加
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]
df.append(s, ignore_index=True)
分组
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)})
df.groupby('A').sum()
Out[93]:
C D
A
bar -2.802588 2.42611
foo 3.146492 -0.63958
df.groupby(['A', 'B']).sum()
Out[94]:
C D
A B
bar one -1.814470 2.395985
three -0.595447 0.166599
two -0.392670 -0.136473
foo one -1.195665 -0.616981
three 1.928123 -1.623033
two 2.414034 1.600434
堆叠
stack()与unstack()。压缩后的DataFrame和Series具有多层索引。
数据透视表
In [105]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
.....: 'B': ['A', 'B', 'C'] * 4,
.....: 'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
.....: 'D': np.random.randn(12),
.....: 'E': np.random.randn(12)})
.....:
In [106]: df
Out[106]:
A B C D E
0 one A foo 1.418757 -0.179666
1 one B foo -1.879024 1.291836
2 two C foo 0.536826 -0.009614
3 three A bar 1.006160 0.392149
4 one B bar -0.029716 0.264599
5 one C bar -1.146178 -0.057409
6 two A foo 0.100900 -1.425638
7 three B foo -1.035018 1.024098
8 one C foo 0.314665 -0.106062
9 one A bar -0.773723 1.824375
10 two B bar -1.170653 0.595974
11 three C bar 0.648740 1.167115
In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[107]:
C bar foo
A B
one A -0.773723 1.418757
B -0.029716 -1.879024
C -1.146178 0.314665
three A 1.006160 NaN
B NaN -1.035018
C 0.648740 NaN
two A NaN 0.100900
B -1.170653 NaN
C NaN 0.536826
时间序列
In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01 25083
Freq: 5T, dtype: int64
Pandas 函数可以很方便地转换时间段与时间戳。下例把以 11 月为结束年份的季度频率转换为下一季度月末上午 9 点:
In [123]: prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
In [124]: ts = pd.Series(np.random.randn(len(prng)), prng)
In [125]: ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
In [126]: ts.head()
Out[126]:
1990-03-01 09:00 -0.902937
1990-06-01 09:00 0.068159
1990-09-01 09:00 -0.057873
1990-12-01 09:00 -0.368204
1991-03-01 09:00 -1.144073
Freq: H, dtype: float64
可视化
DataFrame 的 plot() 方法可以快速绘制所有带标签的列:
In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
.....: columns=['A', 'B', 'C', 'D'])
.....:
In [139]: df = df.cumsum()
In [140]: plt.figure()
Out[140]: <Figure size 640x480 with 0 Axes>
In [141]: df.plot()
Out[141]: <matplotlib.axes._subplots.AxesSubplot at 0x7f2b53a2d7f0>
In [142]: plt.legend(loc='best')
Out[142]: <matplotlib.legend.Legend at 0x7f2b539728d0>