Grayscale $\to$ shape: height in pixels $\times$ width in pixels; values: 0 to 255 (larger values are closer to white)

Color $\to$ shape: height in pixels $\times$ width in pixels $\times$ 3; values: 0 to 255 (within each channel, a larger value means a more intense red, green, or blue)
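The two layouts above can be illustrated with small synthetic arrays. This is only a sketch: the 4 x 6 size and the names `gray` and `color` are made up for the example and do not come from the photo used below.

```python
import numpy as np

# Hypothetical 4x6 images, used only to illustrate the array layouts.
gray = np.zeros((4, 6), dtype=np.uint8)       # height x width
gray[:, 3:] = 255                             # right half white, left half black

color = np.zeros((4, 6, 3), dtype=np.uint8)   # height x width x 3 channels
color[:, :, 0] = 255                          # first channel at full intensity

print(gray.shape, color.shape)   # (4, 6) (4, 6, 3)
print(gray.min(), gray.max())    # 0 255
```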

import cv2 as cv

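bd=cv.imread('KakaoTalk_20210927_192557462.jpg')
bd=cv.cvtColor(bd, cv.COLOR_BGR2RGB) # cv.imread returns channels in BGR order; convert to RGB for matplotlib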

import matplotlib.pyplot as plt 
plt.imshow(bd)

<matplotlib.image.AxesImage at 0x1df8e565d90>

bd.shape

(1084, 1084, 3)

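(1084, 1084, 3) -> the image is three stacked 1084 $\times$ 1084 layers: one drawn in red only, one in green only, and one in blue only.

zeros_like -> creates an array of zeros with the same shape and dtype as its argument (initialization)

[ : , : , 0 ] => the first two positions index the rows (height) and columns (width) in pixels; the last index selects the channel. For an RGB array, 0, 1, 2 are R, G, B in that order. Note that cv.imread loads channels in BGR order, so the array must be converted to RGB before this indexing matches.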

import numpy as np
bd_red=np.zeros_like(bd) # start from an all-zero (black) image
bd_green=np.zeros_like(bd) # start from an all-zero (black) image
bd_blue=np.zeros_like(bd) # start from an all-zero (black) image
bd_red[:,:,0]=bd[:,:,0]
bd_green[:,:,1]=bd[:,:,1]
bd_blue[:,:,2]=bd[:,:,2]

plt.imshow(bd_red)

<matplotlib.image.AxesImage at 0x1df8fab19a0>

plt.imshow(bd_blue+bd_red)

<matplotlib.image.AxesImage at 0x1df901f81c0>
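The channel split can also be checked without a photo. The sketch below uses a made-up 2 x 2 random image (`img` and the fixed seed are assumptions for illustration) and confirms that the three single-channel images add back up to the original.

```python
import numpy as np

# Hypothetical 2x2 RGB image with random 8-bit values (a stand-in for a photo).
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(2, 2, 3), dtype=np.uint8)

# Keep one channel in each copy; the other two channels stay 0.
img_red = np.zeros_like(img)
img_green = np.zeros_like(img)
img_blue = np.zeros_like(img)
img_red[:, :, 0] = img[:, :, 0]
img_green[:, :, 1] = img[:, :, 1]
img_blue[:, :, 2] = img[:, :, 2]

# The pieces occupy disjoint channels, so the uint8 sum cannot overflow
# and reproduces the original image exactly.
rebuilt = img_red + img_green + img_blue
print(np.array_equal(rebuilt, img))  # True
```

Adding only two of the pieces, as in `bd_blue+bd_red`, gives the image with the remaining channel set to zero.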

Scatter plot

import matplotlib.pyplot as plt

x=[1,2,3,4]
y=[2,3,4,5] 
plt.plot(x,y,'o')

[<matplotlib.lines.Line2D at 0x1df8feb2ee0>]

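default: line (plt.plot connects the points with a line unless a marker format such as 'o' is given)
A scatter plot is usually drawn when we want to understand the relationship between X and Y.
A box plot or a histogram needs only one variable to draw.
A scatter plot, in contrast, needs two variables.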

x=[44,48,49,58,62,68,69,70,76,79]
y=[159,160,162,165,167,162,165,175,165,172]

plt.plot(x,y,'ok')

[<matplotlib.lines.Line2D at 0x1df8ff27940>]

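Taller people tend to weigh more (and vice versa).
Height and weight appear to be related (a positive, roughly linear relationship).

How strong is this relationship?
To answer this question we need the concept of the correlation coefficient.
The correlation coefficient is a key concept for understanding scatter plots.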

$$ \text{(Sample) correlation coefficient} $$

$$r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) }{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2 }} $$

$$r=\sum_{i=1}^{n}\frac{1}{c}(x_i-\bar{x})(y_i-\bar{y}), \quad c=\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2} $$

$$r=\sum_{i=1}^{n}\left( \frac{(x_i-\bar{x})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}\frac{(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \right)$$

$$\tilde{x}_i=\frac{(x_i-\bar{x})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}}, \tilde{y}_i=\frac{(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}$$

$$r=\sum_{i=1}^{n}\tilde{x}_i \tilde{y}_i $$
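As a sanity check on the rewriting above, the sketch below computes $r$ both as the single ratio and as the sum of products of the scaled values, using the ten data points from this section, and compares both with `np.corrcoef`. The variable names (`r_ratio`, `r_sum`, `x_tilde`, `y_tilde`) are chosen here for illustration.

```python
import numpy as np

x = np.array([44, 48, 49, 58, 62, 68, 69, 70, 76, 79])
y = np.array([159, 160, 162, 165, 167, 162, 165, 175, 165, 172])

dx = x - x.mean()   # deviations from the mean
dy = y - y.mean()

# First form: a single ratio.
r_ratio = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Last form: scale each variable first, then sum the products.
x_tilde = dx / np.sqrt(np.sum(dx**2))
y_tilde = dy / np.sqrt(np.sum(dy**2))
r_sum = np.sum(x_tilde * y_tilde)

# Reference value from NumPy's correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_ratio, r_sum, r_numpy)
```

All three printed values should agree up to floating-point rounding.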

import numpy as np
x=np.array(x)
y=np.array(y)

plt.plot(x-np.mean(x), y-np.mean(y),'o')

[<matplotlib.lines.Line2D at 0x1df8ff989a0>]

- $a=\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}, b=\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}$

a=np.sqrt(np.sum((x-np.mean(x))**2))
b=np.sqrt(np.sum((y-np.mean(y))**2))
a,b

(36.58004920718396, 15.218409903797438)

Since $a>b$, the $\{x_i\}$ are more spread out than the $\{y_i\}$.

$a=\sqrt{n}\times{\tt np.std(x)}$

$b=\sqrt{n}\times{\tt np.std(y)}$

std = standard deviation

n=len(x)
np.sqrt(n)*np.std(x), np.sqrt(n)*np.std(y)

(36.58004920718397, 15.21840990379744)

${\tt np.std(x)}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}$
${\tt np.std(y)}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2}$
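The identity $a=\sqrt{n}\times{\tt np.std(x)}$ relies on `np.std` dividing by $n$, which is its default (`ddof=0`). A short sketch, using the same `x` values as above, contrasts this with `ddof=1`, which divides by $n-1$:

```python
import numpy as np

x = np.array([44, 48, 49, 58, 62, 68, 69, 70, 76, 79])
n = len(x)

ss = np.sum((x - x.mean())**2)     # sum of squared deviations
std_pop = np.std(x)                # default ddof=0: divides by n
std_sample = np.std(x, ddof=1)     # ddof=1: divides by n - 1

print(np.isclose(std_pop, np.sqrt(ss / n)))            # True
print(np.isclose(std_sample, np.sqrt(ss / (n - 1))))   # True
print(np.isclose(np.sqrt(n) * std_pop, np.sqrt(ss)))   # True
```

With `ddof=1` the factor would be $\sqrt{n-1}$ rather than $\sqrt{n}$; either choice gives the same $r$, as long as the same one is used for both variables.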

- Now let's plot $(\tilde{x}_i,\tilde{y}_i)$.

xx= (x-np.mean(x))/a
yy= (y-np.mean(y))/b
plt.plot(xx,yy,'o')

[<matplotlib.lines.Line2D at 0x1df90009e80>]

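After this scaling both variables have mean 0 and the same spread (the squared values of each sum to 1).

- Question 1: Is $r$ positive or negative?

Let's draw it with plotly.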

# px.scatter(x=xx, y=yy)

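Let's check which of the products $\tilde{x}_i\tilde{y}_i$ are positive and which are negative.
Consider whether the positive side or the negative side dominates.
What is the sign of $r=\sum_{i=1}^{n}\tilde{x}_i \tilde{y}_i$? The positive products outnumber and outweigh the negative ones, so the correlation coefficient is positive.

- Question 2: Suppose we have the following two datasets.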
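The sign check can also be done numerically instead of by eye. The sketch below (names such as `n_pos` and `pos_total` are illustrative) counts the positive and negative products and also sums each group, since the sign of $r$ is decided by the totals rather than by the counts alone.

```python
import numpy as np

x = np.array([44, 48, 49, 58, 62, 68, 69, 70, 76, 79])
y = np.array([159, 160, 162, 165, 167, 162, 165, 175, 165, 172])

# Scaled deviations, as defined in this section.
xx = (x - x.mean()) / np.sqrt(np.sum((x - x.mean())**2))
yy = (y - y.mean()) / np.sqrt(np.sum((y - y.mean())**2))

prod = xx * yy                      # one term of r per data point
n_pos = int(np.sum(prod > 0))       # points in quadrants 1 and 3
n_neg = int(np.sum(prod < 0))       # points in quadrants 2 and 4
pos_total = prod[prod > 0].sum()    # total positive contribution
neg_total = prod[prod < 0].sum()    # total negative contribution
r = prod.sum()

print(n_pos, n_neg)
print(pos_total, neg_total, r)
```

A dataset could have more negative products than positive ones and still have $r>0$ if the positive terms are larger in magnitude, which is why the totals are printed as well.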

x1=np.arange(0,10,0.1)
y1=x1+np.random.normal(loc=0,scale=1.0,size=len(x1))

plt.plot(x1,y1,'o')

[<matplotlib.lines.Line2D at 0x1df92eb2a90>]

x2=np.arange(0,10,0.1)
y2=x2+np.random.normal(loc=0,scale=7.0,size=len(x2)) # larger standard deviation (more noise)
plt.plot(x2,y2,'x')

[<matplotlib.lines.Line2D at 0x1df92f209d0>]

Below, the two datasets are drawn on the same axes.

plt.plot(x1,y1,'o')
plt.plot(x2,y2,'x')

[<matplotlib.lines.Line2D at 0x1df92f901c0>]

n=len(x1)
xx1= (x1-np.mean(x1)) / (np.std(x1) * np.sqrt(n))
yy1= (y1-np.mean(y1)) / (np.std(y1) * np.sqrt(n))
xx2= (x2-np.mean(x2)) / (np.std(x2) * np.sqrt(n))
yy2= (y2-np.mean(y2)) / (np.std(y2) * np.sqrt(n))

plt.plot(xx1,yy1,'o')
plt.plot(xx2,yy2,'x')

[<matplotlib.lines.Line2D at 0x1df92feef70>]

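(1) Are $r_1$ and $r_2$ positive or negative?

$r_1$ is positive, and $r_2$ also looks positive. $\to$ Split the plane into quadrants 1 to 4: the more of the points that fall in quadrants 1 and 3, the more likely the correlation coefficient is positive.

(2) Which of $r_1,r_2$ has the larger absolute value?

$r_1$ should have the larger absolute value. In dataset 2 many points fall in quadrants 2 and 4, where the product $\tilde{x}_i\tilde{y}_i$ is negative; these cancel part of the positive products, so $|r_2|$ ends up smaller.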

sum(xx1*yy1),sum(xx2*yy2)

(0.9381086706782814, 0.36042715437479517)

The coefficient for the blue points (dataset 1) is larger, so dataset 1 shows the stronger linear relationship.
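The quadrant argument can be made concrete with a seeded version of the same experiment. This is a sketch under assumptions: the seed `0`, the helper `corr_and_quadrant_share`, and the names `y_low` and `y_high` are introduced here, so the printed numbers will differ from the unseeded run above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption, for reproducibility)

def corr_and_quadrant_share(x, y):
    """Return (r, share of points in quadrants 1 and 3) after centering."""
    dx = x - x.mean()
    dy = y - y.mean()
    r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
    share = np.mean(dx * dy > 0)   # fraction of points with a positive product
    return r, share

x = np.arange(0, 10, 0.1)
y_low = x + rng.normal(loc=0, scale=1.0, size=len(x))    # small noise
y_high = x + rng.normal(loc=0, scale=7.0, size=len(x))   # large noise

r1, share1 = corr_and_quadrant_share(x, y_low)
r2, share2 = corr_and_quadrant_share(x, y_high)

print(r1, share1)
print(r2, share2)
```

With the larger noise, more points should land in quadrants 2 and 4, and the correlation should drop accordingly.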

2022/01/02/SUN

`Scatter plot`