Hadley Wickham's layered graphics

import pandas as pd 
mpg=pd.read_csv("mpg.csv")

Basic scatter plot (2 dimensions)

from plotnine import * 
ggplot(mpg) + geom_point(aes(x = "displ", y = "hwy"))
<ggplot: (117623195251)>

$\to$ Engine displacement and fuel efficiency are negatively related $\to$ the larger the engine, the worse the highway mileage

fig=ggplot(mpg) 
a1=aes(x='displ',y='hwy') 
point1=geom_point(a1) 
fig+point1
<ggplot: (117623102492)>

Scatter plot variation (3 dimensions): varying point size

ggplot(mpg)+ geom_point(aes(x='displ',y='hwy',size= 'class'))
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\scales\scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
<ggplot: (117623142138)>

Scatter plot: varying transparency

ggplot(data=mpg)+ geom_point(mapping=aes(x='displ',y='hwy',alpha= 'class'))
# the data= and mapping= keywords can be omitted here
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\scales\scale_alpha.py:68: PlotnineWarning: Using alpha for a discrete variable is not advised.
<ggplot: (117623137118)>

4 aesthetics (+ point size, + transparency), both mapped to class

ggplot(data=mpg)+ geom_point(mapping=aes(x='displ',y='hwy',size= 'class',alpha='class'))
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\scales\scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\scales\scale_alpha.py:68: PlotnineWarning: Using alpha for a discrete variable is not advised.
<ggplot: (117623141988)>

3-dimensional scatter plot (+ shape)

ggplot(data=mpg)+ geom_point(mapping=aes(x='displ',y='hwy',shape='class'))
<ggplot: (117624346444)>

3-dimensional scatter plot (+ color)

ggplot(data=mpg)+ geom_point(mapping=aes(x='displ',y='hwy',color='class'))
<ggplot: (117624492963)>

- In an object-oriented style?

a2=aes(x='displ',y='hwy',color='class')
a1
{'x': 'displ', 'y': 'hwy'}
a2
{'x': 'displ', 'y': 'hwy', 'color': 'class'}
point2=geom_point(a2)
fig+point2
<ggplot: (117624600333)>

Adding another geom (fitted line, trend line)

fig+point1
<ggplot: (117624629044)>
sline1=geom_smooth(a1)
fig+point1+sline1
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\stats\smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (117624849797)>
fig+point2+sline1
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\stats\smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (117624864687)>

- Drawing everything in a single command

ggplot(mpg)\
+geom_point(aes(x='displ',y='hwy',color='class'))\
+geom_smooth(aes(x='displ',y='hwy')) # this aes(x='displ',y='hwy') can be dropped once the shared mapping moves into ggplot(); see below
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\stats\smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (117625399411)>

- Mapping rules shared by every layer can be moved into ggplot(), so that they are declared once, where the figure is created.

ggplot(mpg,aes(x='displ',y='hwy'))+geom_point(aes(color='class'))+geom_smooth()
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\stats\smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (117625531733)>

Scatter plot variation 2 (4 dimensions)

ggplot(mpg,aes(x='displ',y='hwy'))+geom_point(aes(size='class',color='drv'),alpha=0.2)
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\scales\scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
<ggplot: (117624905376)>
  • For almost every $x$, the red points lie mostly below the green and purple points $\to$ four-wheel drive has worse fuel economy

- Object-oriented style

a1,a2
({'x': 'displ', 'y': 'hwy'}, {'x': 'displ', 'y': 'hwy', 'color': 'class'})
a3=a2.copy()
a1,a2,a3
({'x': 'displ', 'y': 'hwy'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'})
a3['color']='drv'
a3['size']='class' # this adds a new key
a1,a2,a3
({'x': 'displ', 'y': 'hwy'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'drv', 'size': 'class'})
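The `.copy()` call matters because plain assignment would only give the same object a second name. A minimal sketch of the difference, using plain dicts as stand-ins for the `aes` objects (an assumption, based on the dict-like output printed above):

```python
# Plain dicts stand in for aes objects here (assumption: aes behaves like a dict).
a2 = {"x": "displ", "y": "hwy", "color": "class"}

alias = a2               # no copy: both names refer to the same dict
alias["color"] = "drv"
print(a2["color"])       # drv  (a2 changed as well)

a2["color"] = "class"    # restore the original mapping
a3 = a2.copy()           # shallow copy: an independent dict
a3["color"] = "drv"      # overwrite an existing key
a3["size"] = "class"     # add a new key
print(a2)                # {'x': 'displ', 'y': 'hwy', 'color': 'class'}
print(a3)                # {'x': 'displ', 'y': 'hwy', 'color': 'drv', 'size': 'class'}
```

With the copy, `a2` keeps its original mapping and can still be reused for other layers.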
  • Declaring it directly, as follows, also works:
    a3=aes(x='displ',y='hwy',color='drv',size='class')
    
point3=geom_point(a3,alpha=0.2)
fig+point3
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\scales\scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
<ggplot: (117624865663)>
fig+point3+sline1
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\scales\scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\stats\smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (117623265160)>

Can a separate line be drawn for each group?

a1,a2,a3
({'x': 'displ', 'y': 'hwy'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'drv', 'size': 'class'})
a4=a2.copy()
a4['color']='drv'
a4
{'x': 'displ', 'y': 'hwy', 'color': 'drv'}
sline2=geom_smooth(a4)
fig+sline2+point3
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\scales\scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\stats\smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (117623085858)>

- Could the groups instead be distinguished by line type, keeping a single line color?

a1,a2,a3,a4
({'x': 'displ', 'y': 'hwy'},
 {'x': 'displ', 'y': 'hwy', 'color': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'drv', 'size': 'class'},
 {'x': 'displ', 'y': 'hwy', 'color': 'drv'})
a5=a1.copy()
a5['linetype']='drv'
a5
{'x': 'displ', 'y': 'hwy', 'linetype': 'drv'}
sline3=geom_smooth(a5,size=1,color='gray')
# size sets the line width
fig+point3+sline3
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\scales\scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\stats\smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (117626080434)>
fig+point3+sline3+sline1
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\scales\scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\stats\smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (117626269175)>
sline2=geom_smooth(a4,size=1,linetype='dashed')
fig+point3+sline2+sline1
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\scales\scale_size.py:48: PlotnineWarning: Using size for a discrete variable is not advised.
C:\Users\ehfus\Anaconda3\envs\dv2021\lib\site-packages\plotnine\stats\smoothers.py:310: PlotnineWarning: Confidence intervals are not yet implementedfor lowess smoothings.
<ggplot: (117626079883)>

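- There are many tools for displaying higher-dimensional data.

  • Scatter plot (point geom): point size, point shape, point color, point transparency
  • Line plot (smooth geom, line geom): line type, line color, line width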

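Conclusion

: A graph can be drawn as a combination of seven components: data + geom + mapping (between variables and aesthetics) + stat (statistical transformation) + position + coordinate system (axes) + facet grid.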

Selecting columns in pandas

import numpy as np
dic={'X1':np.random.normal(0,1,5),
     'X2':np.random.normal(0,1,5),
     'X3':np.random.normal(0,1,5)}
df=pd.DataFrame(dic)
df
X1 X2 X3
0 -1.215979 -1.395564 0.193139
1 -0.311555 0.430770 0.044660
2 -0.464150 -0.188806 -0.456508
3 -0.476752 0.543144 1.066535
4 -1.246602 -0.812871 1.230598
df.X1
0   -1.215979
1   -0.311555
2   -0.464150
3   -0.476752
4   -1.246602
Name: X1, dtype: float64
df['X1']
0   -1.215979
1   -0.311555
2   -0.464150
3   -0.476752
4   -1.246602
Name: X1, dtype: float64
df[['X1']]
X1
0 -1.215979
1 -0.311555
2 -0.464150
3 -0.476752
4 -1.246602
  • df['X1'] returns a Series, whereas df[['X1']] returns a DataFrame.
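One quick way to confirm the difference is to inspect the type and shape of each result. The small frame below is a made-up stand-in for `df`, not the random data shown above:

```python
import numpy as np
import pandas as pd

# Hypothetical 5x3 frame with the same column names as above.
df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=["X1", "X2", "X3"])

s = df["X1"]      # single label -> Series (1-dimensional)
d = df[["X1"]]    # list of labels -> DataFrame (2-dimensional)

print(type(s).__name__, s.shape)  # Series (5,)
print(type(d).__name__, d.shape)  # DataFrame (5, 1)
```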
df.loc[:,'X1']
0   -1.215979
1   -0.311555
2   -0.464150
3   -0.476752
4   -1.246602
Name: X1, dtype: float64

- Method 5

df.loc[:,['X1']]
X1
0 -1.215979
1 -0.311555
2 -0.464150
3 -0.476752
4 -1.246602

- Method 6

df.loc[:,[True,False,False]] 
X1
0 -1.215979
1 -0.311555
2 -0.464150
3 -0.476752
4 -1.246602

- Method 7

df.iloc[:,0]
0   -1.215979
1   -0.311555
2   -0.464150
3   -0.476752
4   -1.246602
Name: X1, dtype: float64

- Method 8

df.iloc[:,[0]]
X1
0 -1.215979
1 -0.311555
2 -0.464150
3 -0.476752
4 -1.246602

- Method 9

df.iloc[:,[True,False,False]]
X1
0 -1.215979
1 -0.311555
2 -0.464150
3 -0.476752
4 -1.246602

Note: when the column names are integers

_df = pd.DataFrame(np.array([[1,2,3],[3,4,5],[5,6,7]])) 
_df
0 1 2
0 1 2 3
1 3 4 5
2 5 6 7

- All of the following work.

_df[0]
0    1
1    3
2    5
Name: 0, dtype: int32
_df[[0]]
0
0 1
1 3
2 5
_df.loc[:,0]
0    1
1    3
2    5
Name: 0, dtype: int32
_df.loc[:,[0]]
0
0 1
1 3
2 5
_df.iloc[:,0]
0    1
1    3
2    5
Name: 0, dtype: int32
_df.iloc[:,[0]]
0
0 1
1 3
2 5
dic={'X.1':np.random.normal(0,1,5),
     'X.2':np.random.normal(0,1,5),
     'X.3':np.random.normal(0,1,5)}
_df=pd.DataFrame(dic)
_df
X.1 X.2 X.3
0 -0.379475 0.328678 -1.183929
1 1.803752 -2.116759 0.436055
2 0.280973 -0.187272 -0.685104
3 -0.065296 0.383366 -0.498928
4 0.687288 -1.173613 -0.944841
_df['X.1']
0   -0.379475
1    1.803752
2    0.280973
3   -0.065296
4    0.687288
Name: X.1, dtype: float64
# Attribute-style access (the df.name form) is not available for this frame:
# it cannot be used when a column name contains a dot or a space.
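The restriction comes from Python syntax rather than from pandas: the name after the dot must be a valid identifier. A small sketch with a made-up frame (the column names `X.1` and `my col` are chosen only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"X1": [1, 2], "X.1": [3, 4], "my col": [5, 6]})

# Bracket indexing accepts any column label.
dotted = df["X.1"]
spaced = df["my col"]

# Attribute access needs a label that is a valid Python identifier.
plain = df.X1
print("X1".isidentifier(), "X.1".isidentifier(), "my col".isidentifier())
# True False False

# `df.X.1` is not valid Python at all, so it fails before pandas is involved.
try:
    compile("df.X.1", "<example>", "eval")
    parses = True
except SyntaxError:
    parses = False
print(parses)  # False
```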

Example 2: selecting multiple columns

dic={'X1':np.random.normal(0,1,5),
     'X2':np.random.normal(0,1,5),
     'X3':np.random.normal(0,1,5),
     'X4':np.random.normal(0,1,5)}
df=pd.DataFrame(dic)
df
X1 X2 X3 X4
0 1.132239 0.408603 1.545259 0.988863
1 1.141026 0.062062 0.347106 0.915165
2 0.833884 0.070472 -1.086375 -1.122671
3 -0.269858 1.385849 -0.786887 -1.641488
4 0.485986 -0.644105 -2.143558 0.987382

- Method 1

df[['X1','X2','X3']]
X1 X2 X3
0 1.132239 0.408603 1.545259
1 1.141026 0.062062 0.347106
2 0.833884 0.070472 -1.086375
3 -0.269858 1.385849 -0.786887
4 0.485986 -0.644105 -2.143558

- Method 2

df.loc[:,['X1','X2','X3']]
X1 X2 X3
0 1.132239 0.408603 1.545259
1 1.141026 0.062062 0.347106
2 0.833884 0.070472 -1.086375
3 -0.269858 1.385849 -0.786887
4 0.485986 -0.644105 -2.143558

- Method 3

df.loc[:,'X1':'X3'] 
X1 X2 X3
0 1.132239 0.408603 1.545259
1 1.141026 0.062062 0.347106
2 0.833884 0.070472 -1.086375
3 -0.269858 1.385849 -0.786887
4 0.485986 -0.644105 -2.143558

- Method 4

df.loc[:,[True,True,True,False]]
X1 X2 X3
0 1.132239 0.408603 1.545259
1 1.141026 0.062062 0.347106
2 0.833884 0.070472 -1.086375
3 -0.269858 1.385849 -0.786887
4 0.485986 -0.644105 -2.143558

- Method 5

df.iloc[:,[0,1,2]]
X1 X2 X3
0 1.132239 0.408603 1.545259
1 1.141026 0.062062 0.347106
2 0.833884 0.070472 -1.086375
3 -0.269858 1.385849 -0.786887
4 0.485986 -0.644105 -2.143558

- Method 6

df.iloc[:,:3]
X1 X2 X3
0 1.132239 0.408603 1.545259
1 1.141026 0.062062 0.347106
2 0.833884 0.070472 -1.086375
3 -0.269858 1.385849 -0.786887
4 0.485986 -0.644105 -2.143558
df.iloc[:,0:3]
X1 X2 X3
0 1.132239 0.408603 1.545259
1 1.141026 0.062062 0.347106
2 0.833884 0.070472 -1.086375
3 -0.269858 1.385849 -0.786887
4 0.485986 -0.644105 -2.143558
df.iloc[:,range(3)]
X1 X2 X3
0 1.132239 0.408603 1.545259
1 1.141026 0.062062 0.347106
2 0.833884 0.070472 -1.086375
3 -0.269858 1.385849 -0.786887
4 0.485986 -0.644105 -2.143558

- Method 7

df.iloc[:,[True,True,True,False]]
X1 X2 X3
0 1.132239 0.408603 1.545259
1 1.141026 0.062062 0.347106
2 0.833884 0.070472 -1.086375
3 -0.269858 1.385849 -0.786887
4 0.485986 -0.644105 -2.143558

Slicing with loc includes the end label; slicing with iloc excludes the end position

df.iloc[:,0:3] ## positions 0, 1, 2 are selected; the stop position 3 is excluded.
X1 X2 X3
0 1.132239 0.408603 1.545259
1 1.141026 0.062062 0.347106
2 0.833884 0.070472 -1.086375
3 -0.269858 1.385849 -0.786887
4 0.485986 -0.644105 -2.143558
df.loc[:,'X1':'X3'] ## the end label 'X3' is included.
X1 X2 X3
0 1.132239 0.408603 1.545259
1 1.141026 0.062062 0.347106
2 0.833884 0.070472 -1.086375
3 -0.269858 1.385849 -0.786887
4 0.485986 -0.644105 -2.143558

- As a result, integer column names often lead to confusing behavior.

_df = pd.DataFrame(np.array([[1,2,3,4],[3,4,5,6],[5,6,7,8]]))
_df
0 1 2 3
0 1 2 3 4
1 3 4 5 6
2 5 6 7 8
_df.loc[:,0:2]
0 1 2
0 1 2 3
1 3 4 5
2 5 6 7
_df.iloc[:,0:2]
0 1
0 1 2
1 3 4
2 5 6
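The two slices differ because loc interprets the integers as labels while iloc interprets them as positions. A minimal sketch, using a made-up frame whose integer labels deliberately differ from the positions:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: integer column labels 10..40 sit at positions 0..3.
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=[10, 20, 30, 40])

by_label = df.loc[:, 10:30]     # label slice: the end label 30 is included
by_position = df.iloc[:, 0:2]   # position slice: the stop position 2 is excluded

print(by_label.columns.tolist())     # [10, 20, 30]
print(by_position.columns.tolist())  # [10, 20]
```

Renaming integer columns to strings is one way to avoid mixing up the two readings.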

Example 3: movie data - selecting columns that satisfy a condition

df=pd.read_csv('https://raw.githubusercontent.com/PacktPublishing/Pandas-Cookbook/master/data/movie.csv')

- Print the column names.

df.columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')
pd.Series(df.columns)
0                         color
1                 director_name
2        num_critic_for_reviews
3                      duration
4       director_facebook_likes
5        actor_3_facebook_likes
6                  actor_2_name
7        actor_1_facebook_likes
8                         gross
9                        genres
10                 actor_1_name
11                  movie_title
12              num_voted_users
13    cast_total_facebook_likes
14                 actor_3_name
15         facenumber_in_poster
16                plot_keywords
17              movie_imdb_link
18         num_user_for_reviews
19                     language
20                      country
21               content_rating
22                       budget
23                   title_year
24       actor_2_facebook_likes
25                   imdb_score
26                 aspect_ratio
27         movie_facebook_likes
dtype: object
list(range(13))+[26]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 26]
df.iloc[:,list(range(13))+[26]] 
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres actor_1_name movie_title num_voted_users aspect_ratio
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 1.78
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy Johnny Depp Pirates of the Caribbean: At World's End 471220 2.35
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller Christoph Waltz Spectre 275868 2.35
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller Tom Hardy The Dark Knight Rises 1144337 2.35
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary Doug Walker Star Wars: Episode VII - The Force Awakens 8 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4911 Color Scott Smith 1.0 87.0 2.0 318.0 Daphne Zuniga 637.0 NaN Comedy|Drama Eric Mabius Signed Sealed Delivered 629 NaN
4912 Color NaN 43.0 43.0 NaN 319.0 Valorie Curry 841.0 NaN Crime|Drama|Mystery|Thriller Natalie Zea The Following 73839 16.00
4913 Color Benjamin Roberds 13.0 76.0 0.0 0.0 Maxwell Moody 0.0 NaN Drama|Horror|Thriller Eva Boehnke A Plague So Pleasant 38 NaN
4914 Color Daniel Hsia 14.0 100.0 0.0 489.0 Daniel Henney 946.0 10443.0 Comedy|Drama|Romance Alan Ruck Shanghai Calling 1255 2.35
4915 Color Jon Gunn 43.0 90.0 16.0 16.0 Brian Herzlinger 86.0 85222.0 Documentary John August My Date with Drew 4285 1.85

4916 rows × 14 columns
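As an alternative to concatenating lists, NumPy's `np.r_` can build the same position array from a slice plus individual positions. The sketch below uses a made-up 2x28 frame of zeros in place of the movie data:

```python
import numpy as np
import pandas as pd

# Slice 0..12 followed by the single position 26.
positions = np.r_[0:13, 26]
print(positions.tolist())
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 26]

# Hypothetical stand-in with 28 columns, like the movie data.
df = pd.DataFrame(np.zeros((2, 28)))
subset = df.iloc[:, positions]
print(subset.shape)  # (2, 14)
```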

- Check the column names again.

df.columns
Index(['color', 'director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'language', 'country',
       'content_rating', 'budget', 'title_year', 'actor_2_facebook_likes',
       'imdb_score', 'aspect_ratio', 'movie_facebook_likes'],
      dtype='object')

Select only the columns whose names contain the word actor.

list(map(lambda x : 'actor' in x, df.columns))
[False,
 False,
 False,
 False,
 False,
 True,
 True,
 True,
 False,
 False,
 True,
 False,
 False,
 False,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 True,
 False,
 False,
 False]
df.iloc[:,list(map(lambda x : 'actor' in x, df.columns))]
actor_3_facebook_likes actor_2_name actor_1_facebook_likes actor_1_name actor_3_name actor_2_facebook_likes
0 855.0 Joel David Moore 1000.0 CCH Pounder Wes Studi 936.0
1 1000.0 Orlando Bloom 40000.0 Johnny Depp Jack Davenport 5000.0
2 161.0 Rory Kinnear 11000.0 Christoph Waltz Stephanie Sigman 393.0
3 23000.0 Christian Bale 27000.0 Tom Hardy Joseph Gordon-Levitt 23000.0
4 NaN Rob Walker 131.0 Doug Walker NaN 12.0
... ... ... ... ... ... ...
4911 318.0 Daphne Zuniga 637.0 Eric Mabius Crystal Lowe 470.0
4912 319.0 Valorie Curry 841.0 Natalie Zea Sam Underwood 593.0
4913 0.0 Maxwell Moody 0.0 Eva Boehnke David Chandler 0.0
4914 489.0 Daniel Henney 946.0 Alan Ruck Eliza Coupe 719.0
4915 16.0 Brian Herzlinger 86.0 John August Jon Gunn 23.0

4916 rows × 6 columns

df.loc[:,list(map(lambda x : 'actor' in x, df.columns))]
actor_3_facebook_likes actor_2_name actor_1_facebook_likes actor_1_name actor_3_name actor_2_facebook_likes
0 855.0 Joel David Moore 1000.0 CCH Pounder Wes Studi 936.0
1 1000.0 Orlando Bloom 40000.0 Johnny Depp Jack Davenport 5000.0
2 161.0 Rory Kinnear 11000.0 Christoph Waltz Stephanie Sigman 393.0
3 23000.0 Christian Bale 27000.0 Tom Hardy Joseph Gordon-Levitt 23000.0
4 NaN Rob Walker 131.0 Doug Walker NaN 12.0
... ... ... ... ... ... ...
4911 318.0 Daphne Zuniga 637.0 Eric Mabius Crystal Lowe 470.0
4912 319.0 Valorie Curry 841.0 Natalie Zea Sam Underwood 593.0
4913 0.0 Maxwell Moody 0.0 Eva Boehnke David Chandler 0.0
4914 489.0 Daniel Henney 946.0 Alan Ruck Eliza Coupe 719.0
4915 16.0 Brian Herzlinger 86.0 John August Jon Gunn 23.0

4916 rows × 6 columns

Either loc or iloc can be used with the boolean list.

- Method 3

The map object does not have to be converted to a list first.

df.iloc[:,map(lambda x : 'actor' in x, df.columns)]
actor_3_facebook_likes actor_2_name actor_1_facebook_likes actor_1_name actor_3_name actor_2_facebook_likes
0 855.0 Joel David Moore 1000.0 CCH Pounder Wes Studi 936.0
1 1000.0 Orlando Bloom 40000.0 Johnny Depp Jack Davenport 5000.0
2 161.0 Rory Kinnear 11000.0 Christoph Waltz Stephanie Sigman 393.0
3 23000.0 Christian Bale 27000.0 Tom Hardy Joseph Gordon-Levitt 23000.0
4 NaN Rob Walker 131.0 Doug Walker NaN 12.0
... ... ... ... ... ... ...
4911 318.0 Daphne Zuniga 637.0 Eric Mabius Crystal Lowe 470.0
4912 319.0 Valorie Curry 841.0 Natalie Zea Sam Underwood 593.0
4913 0.0 Maxwell Moody 0.0 Eva Boehnke David Chandler 0.0
4914 489.0 Daniel Henney 946.0 Alan Ruck Eliza Coupe 719.0
4915 16.0 Brian Herzlinger 86.0 John August Jon Gunn 23.0

4916 rows × 6 columns

- Method 4 (the same pattern with loc, here matching 'face' instead of 'actor')

df.loc[:,map(lambda x : 'face' in x, df.columns)]
director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes cast_total_facebook_likes facenumber_in_poster actor_2_facebook_likes movie_facebook_likes
0 0.0 855.0 1000.0 4834 0.0 936.0 33000
1 563.0 1000.0 40000.0 48350 0.0 5000.0 0
2 0.0 161.0 11000.0 11700 1.0 393.0 85000
3 22000.0 23000.0 27000.0 106759 0.0 23000.0 164000
4 131.0 NaN 131.0 143 0.0 12.0 0
... ... ... ... ... ... ... ...
4911 2.0 318.0 637.0 2283 2.0 470.0 84
4912 NaN 319.0 841.0 1753 1.0 593.0 32000
4913 0.0 0.0 0.0 0 0.0 0.0 16
4914 0.0 489.0 946.0 2386 5.0 719.0 660
4915 16.0 16.0 86.0 163 0.0 23.0 456

4916 rows × 7 columns

Select only the columns whose names end with s.

df.iloc[:,map(lambda x: 's' == x[-1],df.columns )]
num_critic_for_reviews director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross genres num_voted_users cast_total_facebook_likes plot_keywords num_user_for_reviews actor_2_facebook_likes movie_facebook_likes
0 723.0 0.0 855.0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi 886204 4834 avatar|future|marine|native|paraplegic 3054.0 936.0 33000
1 302.0 563.0 1000.0 40000.0 309404152.0 Action|Adventure|Fantasy 471220 48350 goddess|marriage ceremony|marriage proposal|pi... 1238.0 5000.0 0
2 602.0 0.0 161.0 11000.0 200074175.0 Action|Adventure|Thriller 275868 11700 bomb|espionage|sequel|spy|terrorist 994.0 393.0 85000
3 813.0 22000.0 23000.0 27000.0 448130642.0 Action|Thriller 1144337 106759 deception|imprisonment|lawlessness|police offi... 2701.0 23000.0 164000
4 NaN 131.0 NaN 131.0 NaN Documentary 8 143 NaN NaN 12.0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
4911 1.0 2.0 318.0 637.0 NaN Comedy|Drama 629 2283 fraud|postal worker|prison|theft|trial 6.0 470.0 84
4912 43.0 NaN 319.0 841.0 NaN Crime|Drama|Mystery|Thriller 73839 1753 cult|fbi|hideout|prison escape|serial killer 359.0 593.0 32000
4913 13.0 0.0 0.0 0.0 NaN Drama|Horror|Thriller 38 0 NaN 3.0 0.0 16
4914 14.0 0.0 489.0 946.0 10443.0 Comedy|Drama|Romance 1255 2386 NaN 9.0 719.0 660
4915 43.0 16.0 16.0 86.0 85222.0 Documentary 4285 163 actress name in title|crush|date|four word tit... 84.0 23.0 456

4916 rows × 12 columns

df.loc[:,map(lambda x: 's' == x[-1],df.columns )]
num_critic_for_reviews director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross genres num_voted_users cast_total_facebook_likes plot_keywords num_user_for_reviews actor_2_facebook_likes movie_facebook_likes
0 723.0 0.0 855.0 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi 886204 4834 avatar|future|marine|native|paraplegic 3054.0 936.0 33000
1 302.0 563.0 1000.0 40000.0 309404152.0 Action|Adventure|Fantasy 471220 48350 goddess|marriage ceremony|marriage proposal|pi... 1238.0 5000.0 0
2 602.0 0.0 161.0 11000.0 200074175.0 Action|Adventure|Thriller 275868 11700 bomb|espionage|sequel|spy|terrorist 994.0 393.0 85000
3 813.0 22000.0 23000.0 27000.0 448130642.0 Action|Thriller 1144337 106759 deception|imprisonment|lawlessness|police offi... 2701.0 23000.0 164000
4 NaN 131.0 NaN 131.0 NaN Documentary 8 143 NaN NaN 12.0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
4911 1.0 2.0 318.0 637.0 NaN Comedy|Drama 629 2283 fraud|postal worker|prison|theft|trial 6.0 470.0 84
4912 43.0 NaN 319.0 841.0 NaN Crime|Drama|Mystery|Thriller 73839 1753 cult|fbi|hideout|prison escape|serial killer 359.0 593.0 32000
4913 13.0 0.0 0.0 0.0 NaN Drama|Horror|Thriller 38 0 NaN 3.0 0.0 16
4914 14.0 0.0 489.0 946.0 10443.0 Comedy|Drama|Romance 1255 2386 NaN 9.0 719.0 660
4915 43.0 16.0 16.0 86.0 85222.0 Documentary 4285 163 actress name in title|crush|date|four word tit... 84.0 23.0 456

4916 rows × 12 columns

Select only the columns whose names start with c or d.

df.iloc[:,map(lambda x: 'c' == x[0] or 'd' == x[0] ,df.columns )]
color director_name duration director_facebook_likes cast_total_facebook_likes country content_rating
0 Color James Cameron 178.0 0.0 4834 USA PG-13
1 Color Gore Verbinski 169.0 563.0 48350 USA PG-13
2 Color Sam Mendes 148.0 0.0 11700 UK PG-13
3 Color Christopher Nolan 164.0 22000.0 106759 USA PG-13
4 NaN Doug Walker NaN 131.0 143 NaN NaN
... ... ... ... ... ... ... ...
4911 Color Scott Smith 87.0 2.0 2283 Canada NaN
4912 Color NaN 43.0 NaN 1753 USA TV-14
4913 Color Benjamin Roberds 76.0 0.0 0 USA NaN
4914 Color Daniel Hsia 100.0 0.0 2386 USA PG-13
4915 Color Jon Gunn 90.0 16.0 163 USA PG

4916 rows × 7 columns
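The lambda-based masks used in this example can also be written with pandas' vectorized string methods on `df.columns`. The sketch below is one possible alternative, not the approach taken in these notes. It uses a made-up empty frame that borrows a few of the movie column names, so that it runs without downloading the data:

```python
import pandas as pd

# Hypothetical empty frame that reuses a few of the movie column names.
cols = ["color", "director_name", "duration", "actor_1_name",
        "actor_1_facebook_likes", "gross", "genres", "imdb_score"]
df = pd.DataFrame(columns=cols)

# Boolean masks built from the column names.
has_actor = df.columns.str.contains("actor")
ends_with_s = df.columns.str.endswith("s")
starts_c_or_d = df.columns.str.startswith("c") | df.columns.str.startswith("d")

print(df.loc[:, has_actor].columns.tolist())
# ['actor_1_name', 'actor_1_facebook_likes']
print(df.loc[:, ends_with_s].columns.tolist())
# ['actor_1_facebook_likes', 'gross', 'genres']
print(df.loc[:, starts_c_or_d].columns.tolist())
# ['color', 'director_name', 'duration']

# DataFrame.filter offers a shortcut for the substring case.
print(df.filter(like="actor").columns.tolist())
# ['actor_1_name', 'actor_1_facebook_likes']
```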