728x90
Group Data¶
그룹 연산(Group by)¶
group by 후에는 다양한 연산을 할 수 있는데, 예시는 다음과 같습니다.
size, count: 각 그룹의 수
mean, median, min, max: 각 그룹의 평균, 중앙값, 최소, 최대
sum, prod, std, var, quantile : 각 그룹의 합계, 곱, 표준편차, 분산, 사분위수
first, last: 각 그룹의 가장 첫번째 값과 가장 마지막 값
In [1]:
import pandas as pd
import seaborn as sns
In [2]:
df = sns.load_dataset("diamonds")
df.head()
Out[2]:
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
1) 한 개 열을 기준으로 집계¶
In [3]:
df.groupby(by="cut").size()
Out[3]:
cut Ideal 21551 Premium 13791 Very Good 12082 Good 4906 Fair 1610 dtype: int64
- 다음과 같이 by에 집계하고 싶은 기준값을 입력합니다.
- size()는 각 그룹의 사이즈를 계산합니다.
In [4]:
df.groupby(by="cut").mean()
Out[4]:
carat | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|
cut | |||||||
Ideal | 0.702837 | 61.709401 | 55.951668 | 3457.541970 | 5.507451 | 5.520080 | 3.401448 |
Premium | 0.891955 | 61.264673 | 58.746095 | 4584.257704 | 5.973887 | 5.944879 | 3.647124 |
Very Good | 0.806381 | 61.818275 | 57.956150 | 3981.759891 | 5.740696 | 5.770026 | 3.559801 |
Good | 0.849185 | 62.365879 | 58.694639 | 3928.864452 | 5.838785 | 5.850744 | 3.639507 |
Fair | 1.046137 | 64.041677 | 59.053789 | 4358.757764 | 6.246894 | 6.182652 | 3.982770 |
- mean()은 각 그룹의 평균값을 계산합니다.
- max, min 등 다양한 통계값도 가능합니다.
In [5]:
df.groupby(by="cut").first()
Out[5]:
carat | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|
cut | |||||||||
Ideal | 0.23 | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
Premium | 0.21 | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
Very Good | 0.24 | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |
Good | 0.23 | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
Fair | 0.22 | E | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 |
- 각 그룹별로 가장 첫번째 값을 출력합니다.
In [6]:
df.groupby(by="cut").last()
Out[6]:
carat | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|
cut | |||||||||
Ideal | 0.75 | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
Premium | 0.86 | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
Very Good | 0.70 | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
Good | 0.72 | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
Fair | 0.71 | D | VS1 | 65.4 | 59.0 | 2747 | 5.62 | 5.58 | 3.66 |
- 각 그룹별로 가장 마지막 값을 출력합니다.
2) 여러 열을 기준으로 집계¶
In [7]:
df.groupby(by=['cut', 'color'], as_index=False).sum()
Out[7]:
cut | color | carat | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|
0 | Ideal | D | 1603.38 | 174796.5 | 158606.6 | 7450854 | 14699.12 | 14726.10 | 9073.90 |
1 | Ideal | E | 2257.50 | 240763.1 | 218441.0 | 10138238 | 20357.12 | 20421.33 | 12569.01 |
2 | Ideal | F | 2509.20 | 235970.7 | 213966.0 | 12912518 | 20693.88 | 20745.53 | 12773.99 |
3 | Ideal | G | 3422.29 | 301343.6 | 273027.2 | 18171930 | 26916.77 | 26979.25 | 16620.84 |
4 | Ideal | H | 2490.52 | 192298.9 | 174333.6 | 12115278 | 17853.24 | 17881.49 | 11026.84 |
5 | Ideal | I | 1910.97 | 129334.0 | 117252.7 | 9317974 | 12508.18 | 12535.98 | 7734.20 |
6 | Ideal | J | 952.98 | 55392.5 | 50187.3 | 4406695 | 5662.76 | 5673.56 | 3505.83 |
7 | Premium | D | 1156.64 | 98054.3 | 94126.5 | 5820962 | 8976.59 | 8931.07 | 5475.73 |
8 | Premium | E | 1677.37 | 142968.1 | 137367.6 | 8270443 | 13057.09 | 12989.87 | 7966.54 |
9 | Premium | F | 1927.82 | 142797.0 | 136781.4 | 10081319 | 13698.57 | 13623.89 | 8364.54 |
10 | Premium | G | 2460.51 | 179181.2 | 171645.7 | 13160170 | 17147.08 | 17056.70 | 10466.87 |
11 | Premium | H | 2398.82 | 144720.0 | 138749.2 | 12311428 | 14737.04 | 14698.34 | 8995.61 |
12 | Premium | I | 1634.97 | 87577.3 | 83926.2 | 8491146 | 9267.05 | 9215.27 | 5660.44 |
13 | Premium | J | 1044.82 | 49603.2 | 47570.8 | 5086030 | 5502.46 | 5470.68 | 3367.76 |
14 | Very Good | D | 1053.69 | 93428.0 | 87816.5 | 5250817 | 8323.33 | 8369.33 | 5151.90 |
15 | Very Good | E | 1623.16 | 148152.6 | 139293.3 | 7715165 | 13037.92 | 13111.71 | 8097.59 |
16 | Very Good | F | 1603.44 | 133565.6 | 125184.0 | 8177367 | 12141.13 | 12210.84 | 7515.36 |
17 | Very Good | G | 1762.87 | 142173.1 | 132846.4 | 8903461 | 13012.81 | 13078.44 | 8069.22 |
18 | Very Good | H | 1670.69 | 113030.3 | 105615.1 | 8272552 | 10911.69 | 10955.44 | 6774.06 |
19 | Very Good | I | 1260.53 | 74569.2 | 69958.6 | 6328079 | 7551.80 | 7584.03 | 4684.73 |
20 | Very Good | J | 768.32 | 41969.6 | 39512.3 | 3460182 | 4380.41 | 4403.66 | 2716.66 |
21 | Good | D | 492.87 | 41286.1 | 38754.5 | 2254363 | 3720.49 | 3729.64 | 2320.22 |
22 | Good | E | 695.21 | 58036.1 | 54841.7 | 3194260 | 5241.49 | 5255.08 | 3261.83 |
23 | Good | F | 705.32 | 56541.9 | 53550.0 | 3177637 | 5175.34 | 5190.08 | 3222.05 |
24 | Good | G | 741.13 | 54460.7 | 50929.1 | 3591553 | 5095.58 | 5106.52 | 3177.34 |
25 | Good | H | 642.14 | 43874.7 | 41145.0 | 3001931 | 4190.09 | 4193.92 | 2616.31 |
26 | Good | I | 551.87 | 32612.0 | 30680.0 | 2650994 | 3264.35 | 3268.00 | 2036.59 |
27 | Good | J | 337.56 | 19155.5 | 18055.6 | 1404271 | 1957.74 | 1960.51 | 1221.08 |
28 | Fair | D | 149.98 | 10439.9 | 9612.0 | 699443 | 980.99 | 972.00 | 625.90 |
29 | Fair | E | 191.88 | 14183.6 | 13297.7 | 824838 | 1323.63 | 1312.24 | 833.76 |
30 | Fair | F | 282.27 | 19814.5 | 18549.4 | 1194025 | 1869.04 | 1850.51 | 1181.80 |
31 | Fair | G | 321.48 | 20202.7 | 18454.8 | 1331126 | 1938.58 | 1919.82 | 1244.43 |
32 | Fair | H | 369.41 | 19569.3 | 17785.0 | 1556112 | 1993.55 | 1968.71 | 1278.47 |
33 | Fair | I | 209.66 | 11238.6 | 10366.5 | 819953 | 1148.78 | 1136.36 | 733.86 |
34 | Fair | J | 159.60 | 7658.5 | 7011.2 | 592103 | 802.93 | 794.43 | 514.04 |
- 여러개의 열을 기준으로 계산하고 싶을때는 다음과 같이 원하는 열을 두개 입력합니다.
In [8]:
df.groupby(by=['cut', 'color'], as_index=False).first()
Out[8]:
cut | color | carat | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Ideal | D | 0.30 | SI1 | 62.5 | 57.0 | 552 | 4.29 | 4.32 | 2.69 |
1 | Ideal | E | 0.23 | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
2 | Ideal | F | 0.81 | SI2 | 58.8 | 57.0 | 2761 | 6.14 | 6.11 | 3.60 |
3 | Ideal | G | 0.23 | VS1 | 61.9 | 54.0 | 404 | 3.93 | 3.95 | 2.44 |
4 | Ideal | H | 0.77 | VS2 | 62.0 | 56.0 | 2763 | 5.89 | 5.86 | 3.64 |
5 | Ideal | I | 0.30 | SI2 | 62.0 | 54.0 | 348 | 4.31 | 4.34 | 2.68 |
6 | Ideal | J | 0.23 | VS1 | 62.8 | 56.0 | 340 | 3.93 | 3.90 | 2.46 |
7 | Premium | D | 0.22 | VS2 | 59.3 | 62.0 | 404 | 3.91 | 3.88 | 2.31 |
8 | Premium | E | 0.21 | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
9 | Premium | F | 0.22 | SI1 | 60.4 | 61.0 | 342 | 3.88 | 3.84 | 2.33 |
10 | Premium | G | 0.31 | SI1 | 61.8 | 58.0 | 553 | 4.35 | 4.32 | 2.68 |
11 | Premium | H | 0.30 | SI1 | 62.9 | 59.0 | 554 | 4.28 | 4.24 | 2.68 |
12 | Premium | I | 0.29 | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
13 | Premium | J | 0.30 | SI2 | 59.3 | 61.0 | 405 | 4.43 | 4.38 | 2.61 |
14 | Very Good | D | 0.23 | VS2 | 60.5 | 61.0 | 357 | 3.96 | 3.97 | 2.40 |
15 | Very Good | E | 0.23 | VS2 | 63.8 | 55.0 | 352 | 3.85 | 3.92 | 2.48 |
16 | Very Good | F | 0.23 | VS1 | 60.9 | 57.0 | 357 | 3.96 | 3.99 | 2.42 |
17 | Very Good | G | 0.23 | VVS2 | 60.4 | 58.0 | 354 | 3.97 | 4.01 | 2.41 |
18 | Very Good | H | 0.26 | SI1 | 61.9 | 55.0 | 337 | 4.07 | 4.11 | 2.53 |
19 | Very Good | I | 0.24 | VVS1 | 62.3 | 57.0 | 336 | 3.95 | 3.98 | 2.47 |
20 | Very Good | J | 0.24 | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |
21 | Good | D | 0.26 | VS2 | 65.2 | 56.0 | 403 | 3.99 | 4.02 | 2.61 |
22 | Good | E | 0.23 | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
23 | Good | F | 0.23 | VS1 | 58.2 | 59.0 | 402 | 4.06 | 4.08 | 2.37 |
24 | Good | G | 0.72 | VS2 | 59.7 | 60.5 | 2776 | 5.80 | 5.84 | 3.47 |
25 | Good | H | 0.31 | SI1 | 64.0 | 54.0 | 402 | 4.29 | 4.31 | 2.75 |
26 | Good | I | 0.30 | SI2 | 63.3 | 56.0 | 351 | 4.26 | 4.30 | 2.71 |
27 | Good | J | 0.31 | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
28 | Fair | D | 0.75 | SI2 | 64.6 | 57.0 | 2848 | 5.74 | 5.72 | 3.70 |
29 | Fair | E | 0.22 | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 |
30 | Fair | F | 0.96 | SI2 | 66.3 | 62.0 | 2759 | 6.27 | 5.95 | 4.07 |
31 | Fair | G | 0.84 | SI1 | 55.1 | 67.0 | 2782 | 6.39 | 6.20 | 3.47 |
32 | Fair | H | 0.91 | SI2 | 64.4 | 57.0 | 2763 | 6.11 | 6.09 | 3.93 |
33 | Fair | I | 0.90 | SI1 | 67.3 | 59.0 | 2804 | 5.93 | 5.84 | 3.96 |
34 | Fair | J | 1.05 | SI2 | 65.8 | 59.0 | 2789 | 6.41 | 6.27 | 4.18 |
- 각 그룹별로 가장 첫번째 값을 출력합니다.
In [9]:
df.groupby(by=['cut', 'color'], as_index=False).last()
Out[9]:
cut | color | carat | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Ideal | D | 0.75 | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
1 | Ideal | E | 0.71 | SI1 | 61.9 | 56.0 | 2756 | 5.71 | 5.73 | 3.54 |
2 | Ideal | F | 0.70 | VS1 | 62.0 | 55.0 | 2751 | 5.74 | 5.71 | 3.55 |
3 | Ideal | G | 0.71 | VS1 | 61.4 | 56.0 | 2756 | 5.76 | 5.73 | 3.53 |
4 | Ideal | H | 0.73 | VS2 | 62.5 | 58.0 | 2752 | 5.71 | 5.75 | 3.58 |
5 | Ideal | I | 0.79 | SI1 | 61.6 | 56.0 | 2756 | 5.95 | 5.97 | 3.67 |
6 | Ideal | J | 0.83 | VS2 | 62.3 | 55.0 | 2742 | 6.01 | 6.03 | 3.75 |
7 | Premium | D | 0.72 | SI1 | 62.7 | 59.0 | 2757 | 5.69 | 5.73 | 3.58 |
8 | Premium | E | 0.71 | SI1 | 60.5 | 55.0 | 2756 | 5.79 | 5.74 | 3.49 |
9 | Premium | F | 0.71 | SI1 | 59.8 | 62.0 | 2756 | 5.74 | 5.73 | 3.43 |
10 | Premium | G | 0.83 | SI2 | 61.6 | 57.0 | 2749 | 6.06 | 6.02 | 3.72 |
11 | Premium | H | 0.86 | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
12 | Premium | I | 0.76 | VS1 | 59.3 | 62.0 | 2753 | 5.93 | 5.85 | 3.49 |
13 | Premium | J | 0.90 | SI2 | 63.0 | 59.0 | 2717 | 6.14 | 6.11 | 3.86 |
14 | Very Good | D | 0.70 | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
15 | Very Good | E | 0.70 | VS2 | 61.2 | 59.0 | 2757 | 5.69 | 5.72 | 3.49 |
16 | Very Good | F | 0.80 | SI1 | 61.6 | 59.0 | 2745 | 5.90 | 5.96 | 3.65 |
17 | Very Good | G | 0.70 | VS1 | 61.4 | 55.0 | 2748 | 5.70 | 5.74 | 3.51 |
18 | Very Good | H | 0.80 | SI1 | 63.8 | 56.0 | 2739 | 5.85 | 5.91 | 3.75 |
19 | Very Good | I | 0.75 | VVS2 | 62.0 | 59.0 | 2749 | 5.81 | 5.83 | 3.61 |
20 | Very Good | J | 0.90 | SI1 | 63.2 | 60.0 | 2753 | 6.12 | 6.09 | 3.86 |
21 | Good | D | 0.72 | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
22 | Good | E | 0.73 | SI1 | 57.9 | 55.0 | 2749 | 6.00 | 5.96 | 3.46 |
23 | Good | F | 0.79 | SI1 | 58.1 | 59.0 | 2756 | 6.06 | 6.13 | 3.54 |
24 | Good | G | 0.80 | VS2 | 64.2 | 58.0 | 2753 | 5.84 | 5.81 | 3.74 |
25 | Good | H | 1.02 | I1 | 64.3 | 63.0 | 2751 | 6.28 | 6.23 | 4.02 |
26 | Good | I | 0.84 | VS1 | 63.7 | 59.0 | 2753 | 5.94 | 5.90 | 3.77 |
27 | Good | J | 0.90 | SI1 | 63.2 | 60.0 | 2729 | 6.09 | 6.12 | 3.86 |
28 | Fair | D | 0.71 | VS1 | 65.4 | 59.0 | 2747 | 5.62 | 5.58 | 3.66 |
29 | Fair | E | 0.78 | SI2 | 66.9 | 57.0 | 2721 | 5.70 | 5.66 | 3.60 |
30 | Fair | F | 0.72 | VS2 | 55.4 | 64.0 | 2724 | 6.06 | 5.97 | 3.34 |
31 | Fair | G | 1.04 | SI2 | 65.2 | 57.0 | 2745 | 6.25 | 6.23 | 4.07 |
32 | Fair | H | 1.05 | SI2 | 64.8 | 55.0 | 2717 | 6.39 | 6.32 | 4.12 |
33 | Fair | I | 1.00 | SI2 | 66.8 | 56.0 | 2743 | 6.22 | 6.12 | 4.13 |
34 | Fair | J | 1.01 | SI2 | 66.9 | 58.0 | 2683 | 6.13 | 6.07 | 4.08 |
- 각 그룹별로 가장 마지막 값을 출력합니다.
In [10]:
from IPython.core.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))
'Python' 카테고리의 다른 글
dataprep EDA 패키지 (0) | 2021.07.03 |
---|---|
[Matplotlib] Matplotlib 기본 사용법 (0) | 2021.06.29 |
[Pandas] Pandas Cheat Sheet(Combine Data Sets) (0) | 2021.06.22 |
[Pandas] Pandas Cheat Sheet(Make New Columns) (0) | 2021.06.21 |
[Pandas] Pandas Cheat Sheet(Handling Missing Data) (0) | 2021.06.21 |