๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

๐“ก๐“ธ๐“ธ๐“ถ5: ๐’ฆ๐‘œ๐“‡๐‘’๐’ถ ๐’ฐ๐“ƒ๐’พ๐“‹/๊ฒฝ์˜์ •๋ณด์‹œ์Šคํ…œ(BUSS215)

[๊ฒฝ์˜์ •๋ณด์‹œ์Šคํ…œ] 2. Regression

Regression

Regression
X์™€ Y ์‚ฌ์ด์˜ relationship์„ ๋ถ„์„ํ•˜๋Š” statistical model

X = ํ•˜๋‚˜ ์ด์ƒ์˜ independent/explanatory variables
Y = dependent, target, explanatory variables, ์ฆ‰ ์šฐ๋ฆฌ๊ฐ€ ์˜ˆ์ธกํ•˜๊ณ  ์‹ถ์€ ๊ฒƒ

x๊ฐ’์˜ ๋ณ€ํ™”์— ๋”ฐ๋ฅธ Y๊ฐ’์„ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด, ๋” ๋‚˜์•„๊ฐ€ X์™€ Y ๊ฐ„์˜ relationship์„ explanationํ•˜๊ธฐ ์œ„ํ•ด!
๋ฐ์ดํ„ฐ์…‹์„ ํ•™์Šตํ•˜์—ฌ ์•„์ง๋ชจ๋ฅด๋Š” parameter a, b๊ฐ’์„ ๊ตฌํ•ด๋‚ธ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด Height = a + b(Age) ๋ผ๊ณ  ํ•ด๋ณด์ž.

Age๊ฐ€ ์ปค์งˆ ์ˆ˜๋ก Height๋„ ์–ด๋Š์ •๋„ ๋น„๋ก€ํ•ด์„œ ์ปค์งˆ ๊ฒƒ์ด๋‹ค. ์‹ค์ œ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ๋ถ„๋ฉด ์œ„์— ํ‘œ์‹œํ•ด๋ณด๋ฉด ๊ทธ๋ž˜ํ”„์˜ ๊ธฐ์šธ๊ธฐ๊ฐ€ ์–‘์ˆ˜์ธ ํ˜•ํƒœ์˜ ๊ทธ๋ž˜ํ”„๊ฐ€ ๊ทธ๋ ค์ง„๋‹ค.

์ฆ‰, X์™€ Y ์‚ฌ์ด์˜ relationship๋ฅผ ์ผ๋ฐ˜ํ™”ํ•˜๋Š” pattern์„ ๊ตฌํ•˜๋Š” ๊ฒƒ์ด regression์ด๋‹ค.

๊ทธ๋ฆฌ๊ณ  ํ•™์Šต์„ ํ†ตํ•ด ์Šค์Šค๋กœ relationship function์„ ์ฐพ๋Š” ๊ฒƒ์ด ๋ฐ”๋กœ Machine Learning์ด๋‹ค.

์šฐ๋ฆฌ๊ฐ€ ํ•˜๋Š” ๊ฒƒ์€ Linear regression ์ด๋ฏ€๋กœ input variable๊ณผ output variable ์‚ฌ์ด์˜ ๊ด€๊ณ„๊ฐ€ linear relationship์ด๋ผ๊ณ  ๊ฐ€์ •ํ•œ๋‹ค.

Process

  1. Hypothesis

์šฐ๋ฆฌ๊ฐ€ Weight์™€ Height ์‚ฌ์ด์˜ relationship์„ ์ฐพ๊ณ ์ž ํ•  ๋•Œ, ๊ทธ ๋‘˜์€ Linear relationship์„ ๊ฐ€์ง€๋ฏ€๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์€ basic hypothesis function์„ ์„ธ์šธ ์ˆ˜ ์žˆ๋‹ค.
H(x) = Wx + b

  1. Learning

์ตœ์ ์˜ relation function์„ ์ฐพ๊ธฐ ์œ„ํ•ด W, b๊ฐ’์„ ์ฐพ์•„์•ผ ํ•œ๋‹ค.
W,b๊ฐ’์„ ๋ฐ”๊ฟ”๊ฐ€๋ฉฐ BEST STRAIGHT LINE(regression function)์„ ์ฐพ๋Š”๋‹ค!

ML ๊ด€์ ์—์„œ ๋ณด๋ฉด

  • ๊ฐ variable X๊ฐ€ outcome Y์— ์–ด๋–ป๊ฒŒ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ๋ณด๊ณ 
  • (Prediction) variable๊ณผ outcome ์‚ฌ์ด์˜ ๊ด€๊ณ„๋ณด๋‹ค๋Š” ๋ฏธ๋ž˜์˜ events๋ฅผ ์ •ํ™•ํ•˜๊ฒŒ ์˜ˆ์ธกํ•˜๋Š”๋ฐ ์ง‘์ค‘ํ•œ๋‹ค!

ํ•ด์„ํ•˜๊ธฐ ์‰ฌ์›Œ์„œ ๋Œ€์ค‘์ ์œผ๋กœ, ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๊ธฐ๋ฒ•์ž„!
์ฃผ๋กœ SPSS, SAS, R Program, Python, Matlab, STATA, Excel ๋“ฑ์˜ ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•จ

Types

  • Linear Regression - input variables์˜ ์ˆ˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‚˜๋ˆ„์ž
    1. simple linear regression - input variable ํ•˜๋‚˜ (X)
    2. multiple linear regression - input variable ์—ฌ๋Ÿฌ๊ฐœ (X1, X2...)
  • Research์˜ ๋ชฉ์ ์„ ๊ธฐ์ค€์œผ๋กœ ๋‚˜๋ˆ ๋ณด์ž
    1. Regression Problem
      • ํ•œ ์ œํ’ˆ์ด ๋ช‡ ๋Œ€๋‚˜ ํŒ”๋ฆด๊นŒ
      • ์–ผ๋งˆ์— ํŒ”๋ฆด๊นŒ
    2. Classification Problem (=Logistic Regression)
      • ๊ณ ๊ฐ์ด ์ œํ’ˆ์„ ๊ตฌ๋งคํ• ๊นŒ ํ•˜์ง€ ์•Š์„๊นŒ
      • ๋น„๊ฐ€ ์˜ฌ๊นŒ, ์˜ค์ง€ ์•Š์„๊นŒ

Big Data Analytics

-> Variables๊ณผ Records๊ฐ€ ๋งค์šฐ๋งค์šฐ ๋งŽ์„ ๋•Œ!!!! ๋” ๋งŽ์€ Computations์„ ํ•ด์•ผํ•ด์„œ ๋”์šฑ powerfulํ•œ ์ปดํ“จํ„ฐ(CPU, GPU, Memory)๊ฐ€ ํ•„์š”ํ•จ

์–ผ๋งˆ๋‚˜ Big ํ•ด์•ผ Big data์ธ๊ฐ€
=> Petabyte

 

ํ•˜์ง€๋งŒ ๋‹จ์ˆœํžˆ ํฌ๊ธฐ๊ฐ€ ํฐ ๊ฒƒ์— ๋Œ€ํ•œ ๊ฐœ๋…์ด ์•„๋‹ˆ๋ผ Phenomenon Descriptive์— ๋Œ€ํ•œ ๊ฐœ๋…!!

3Vs : Variety, Velocity, Volume

 

General Procedures

 

cleaning&integration -> [Cleaned data] -> selection&transformation -> [Prepared data] ->data mining -> [Pattern] evaluation  -> [Knowledge]