scikit-learn

Data Preparation

๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๊ฐ€์ ธ์˜ค๊ธฐ

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

๋ฐ์ดํ„ฐ์„ธํŠธ ๋กœ๋“œ

data = pd.read_csv('/kaggle/input/ecommerce/advertising.csv')
data.head(2)

Exploratory Data Analysis

๋ฐ์ดํ„ฐ ํƒ์ƒ‰ ๋ฐ ์ „์ฒ˜๋ฆฌ

data.info()

Check the data types, row counts, and missing values: review the data types to decide whether any type conversion is needed, the row count to decide what ratio to use for the train/test split, and the missing values to decide whether to drop or impute them.

Descriptive Statistics

์ˆ˜์น˜ํ˜• ๋ฐ์ดํ„ฐ

data.describe()

Use basic statistics such as the mean, median, min, and max to anticipate the distribution and possible outliers. If the data sits near the mean and looks roughly normal, there is little to worry about. Comparing the 25% - min and max - 75% gaps gives a sense of skew: a large 25% - min gap suggests a long lower tail, and a large max - 75% gap suggests a long upper tail. If the min looks far too small or the max far too large relative to the rest, suspect outliers and draw a chart to check.
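
A minimal sketch of this quartile check, assuming the numeric Age column used in the chart below:

q = data['Age'].quantile([0, 0.25, 0.75, 1.0])
lower_gap = q[0.25] - q[0]    # 25% - min
upper_gap = q[1.0] - q[0.75]  # max - 75%
print(lower_gap, upper_gap)   # a much larger gap on one side hints at a long tail on that side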

Check the distribution with a chart

sns.histplot(data['Age'], kde=True)  # distplot is deprecated in recent seaborn versions

๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ

data['Country'].nunique()  # number of distinct values
data['Country'].unique()   # the distinct values themselves

Check for outliers.

plt.figure(figsize=(20,10))
sns.boxplot(x='productline', y='startprice', data=data)

Values outside Q3 + 1.5 * IQR and Q1 - 1.5 * IQR are treated as outliers.
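
As a sketch, the same rule can be computed directly with pandas (the startprice column from the boxplot above is used for illustration):

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers
q1 = data['startprice'].quantile(0.25)
q3 = data['startprice'].quantile(0.75)
iqr = q3 - q1
outliers = data[(data['startprice'] < q1 - 1.5 * iqr) | (data['startprice'] > q3 + 1.5 * iqr)]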

๋ฒ”์ฃผํ˜•์„ ์ˆ˜์น˜ํ˜•์œผ๋กœ ๋ณ€๊ฒฝํ•œ๋‹ค.

data = pd.get_dummies(data, columns=['zip_code', 'channel'], drop_first=True)  # get_dummies returns a new frame, so assign it back

Missing Value

data.isna().mean()  # fraction of missing values per column

์ „์ฒด ๋ฐ์ดํ„ฐ ๊ฑด์ˆ˜์—์„œ Missing Value๊ฐ€ ์ฐจ์ง€ํ•˜๋Š” ๋น„์œจ์„ ์ ๊ฒ€ํ•ด๋ณธ๋‹ค.

df = data.dropna()
df.isna().mean()

Missing values can simply be dropped, but dropping rows may also discard useful information, so prefer imputing with the mean, median, etc. where possible. For example, if about 20% of a column is missing, impute; even at 40-50%, when there are not many features, imputing is usually better than dropping. In practice, the mean and median are the most common imputation values.

data['Age'] = data['Age'].fillna(round(data['Age'].mean()))  # fill only the Age column, not every column
data.isna().sum()
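
For the same job across several columns, scikit-learn's SimpleImputer is an alternative; a minimal sketch (the column and strategy are illustrative):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')  # or 'mean'
data[['Age']] = imputer.fit_transform(data[['Age']])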

Data Representation

from sklearn.model_selection import train_test_split
X = data[['X1', 'X2', 'X3']]
y = data['Y'] #Series
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 100)

As a rule of thumb, when the dataset is small (about 1,000 rows or fewer), split it into training and test sets 8:2; with more data, split 7:3.

Data Modeling

Logistic Regression

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

model.coef_

์ ˆ๋Œ€๊ฐ’์ด ํด์ˆ˜๋ก ์˜ํ–ฅ๋„๊ฐ€ ๋†’์€ ๋ณ€์ˆ˜์ง€๋งŒ, ๋ฐ์ดํ„ฐ์˜ ์Šค์ผ€์ผ ๋‹ค๋ฅด๋ฉด ๊ผญ ๊ทธ๋ ‡์ง€๋„ ์•Š์œผ๋ฏ€๋กœ๋‹ˆ ์ฃผ์˜ํ•ด์•ผ ํ•œ๋‹ค.

Decision Tree

A Gini index of 0.5 means the node is maximally impure; the closer it is to 0, the purer the split.
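
A worked example of the formula, gini = 1 - sum(p^2), for a two-class node:

def gini(proportions):
    # Gini impurity: 1 minus the sum of squared class proportions
    return 1 - sum(p ** 2 for p in proportions)

print(gini([0.5, 0.5]))  # 0.5 -> maximally impure node
print(gini([1.0, 0.0]))  # 0.0 -> perfectly pure node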

from sklearn.tree import plot_tree
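
Once a tree has been fitted (for example, by the loop in the next block), plot_tree can draw it; a minimal sketch:

plt.figure(figsize=(20, 10))
plot_tree(model, feature_names=list(X_train.columns), filled=True)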

DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

acc_list = []
for i in range(2, 31):
    model = DecisionTreeClassifier(max_depth = i)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(i, round(accuracy_score(y_test, pred), 4))
    acc_list.append(accuracy_score(y_test, pred))

RandomForestRegressor

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=150, max_depth = 10, random_state = 100)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)
rf.feature_importances_
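
Pairing the importances with the column names makes them easier to read; a small sketch assuming X_train is a DataFrame:

pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)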

KMeans

from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(data)

Model Evaluation & Optimization

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

pred = model.predict(X_test)  # assumes a fitted classifier, e.g. the Logistic Regression model above
accuracy_score(y_test, pred)
confusion_matrix(y_test, pred)

When the classes are imbalanced, accuracy alone can misrepresent model performance, so inspect the confusion matrix; if the rates of wrong predictions (type 1/type 2 errors) are high, check precision, recall, and f1-score to judge the final model.
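
classification_report, imported above, prints precision, recall, and f1-score per class in one call:

print(classification_report(y_test, pred))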

The relative importance of type 1 and type 2 errors depends on the business problem. For example, predicting cancer for a patient who does not have it (type 1 error) is not a big problem, but predicting healthy for a patient who does have cancer (type 2 error) can be serious. Type 1 errors matter more in cases such as analyzing purchase response to marketing spend: a type 1 error means marketing money was spent on a customer who did not actually buy, so the spend is wasted.

Clustering ํ‰๊ฐ€

ํด๋Ÿฌ์Šคํ„ฐ๋งํ•œ label์„ ๋ฐ์ดํ„ฐ์„ธํŠธ์™€ ํ•ฉ์ณ์„œ ํด๋Ÿฌ์Šคํ„ฐ๋ณ„ ๋ณ€์ˆ˜๋“ค์˜ ํ‰๊ท ์„ ๊ตฌํ•˜๊ณ  ๋ถ„์„์— ํ™œ์šฉํ•œ๋‹ค. ์˜ˆ๋ฅผ๋“ค์–ด, ๋‚˜์ด, ์„ฑ๋ณ„, ์—ฐ๋ด‰, ์†Œ๋น„์ ์ˆ˜๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ ๊ณ ๊ฐ ๋ถ„๋ฅ˜ํ•˜๊ธฐ ์œ„ํ•ด ํด๋Ÿฌ์Šคํ„ฐํ•˜์—ฌ ๋‚˜์ด๊ฐ€ ์–ด๋ฆฌ๊ณ  ์ˆ˜์ž…์ด ๋†’์œผ๋ฉด ์ง€์ถœ์ด ๋†’๋‹ค. ๋‚˜์ด๊ฐ€ ๋งŽ๊ณ  ์ˆ˜์ž…์ด ๋†’์€ ๊ทธ๋ฃน์€ ์ ์€ ๊ทธ๋ฃน๋ณด๋‹ค ์ง€์ถœ์ด ์ ๋‹ค. ๋“ฑ์˜ ๋ถ„์„์„ ํ•  ์ˆ˜ ์žˆ๋‹ค.

Clustering ์ตœ์ ํ™”

distance = []
for i in range(2,11):
    model = KMeans(n_clusters=i)
    model.fit(df)
    distance.append(model.inertia_)

The elbow method can be ambiguous about the optimal number of clusters; in such cases, use the silhouette score instead. inertia_ is the sum of squared distances between the observations and their cluster centers, so smaller is better, whereas a larger silhouette score is better.

from sklearn.metrics import silhouette_score

sil = []
for i in range(2,11):
    model = KMeans(n_clusters=i)
    model.fit(data)
    sil.append(silhouette_score(data, model.labels_))

์‹œ๊ฐํ™”
plt.figure(figsize=(20, 10))

Clustering ์ตœ์ ํ™” ์‹œ๊ฐํ™”

from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # reduce to 2 components for a 2-D plot
pca.fit(data)
pca_df = pca.transform(data)
pca_df = pd.DataFrame(pca_df, columns=['PC1', 'PC2'])
pca_df.head(2)

์ฐจ์›์ถ•์†Œํ•˜๋ฉด ๋ฐ์ดํ„ฐ๊ฐ€ ์™œ๊ณก๋˜์–ด ํด๋Ÿฌ์Šคํ„ฐ๋ง์ด ์ž˜ ์•ˆ๋œ ๊ฒƒ ์ฒ˜๋Ÿผ ๋ณด์ผ ์ˆ˜ ์žˆ์ง€๋งŒ, ๊ฐ์•ˆํ•˜๊ณ  ์‹œ๊ฐํ™” ๋ถ„์„์„ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•œ๋‹ค.

Differences among Logistic Regression, Decision Tree, and Random Forest

Logistic Regression์€ Parameter์ด๊ณ , Decision Tree๋Š” Non Parameter์ด๋‹ค. Logistic Regression๋Š” Feature Power๋ฅผ ์ข€๋” ์ž˜ ๋“œ๋Ÿฌ๋‚ธ๋‹ค. Decision Tree๋Š” Categorical Value๋ฅผ ์ˆ˜์น˜ํ˜•์œผ๋กœ ๋ฐ”๊พธ์ง€ ์•Š์•„๋„ ๋œ๋‹ค. Tree ๊ณ„์—ด์€ LR์ฒ˜๋Ÿผ ๋ณ€์ˆ˜์˜ ์˜ํ–ฅ๋„๋ฅผ ๋ช…ํ™•ํ•˜๊ฒŒ ํŒŒ์•…ํ•  ์ˆ˜๋Š” ์—†์ง€๋งŒ ์ƒ๋Œ€์ ์œผ๋กœ ์–ด๋–ค ๋ณ€์ˆ˜๊ฐ€ ์ค‘์š”ํ•œ์ง€ ํŒŒ์•…ํ•  ์ˆ˜๋Š” ์žˆ๋‹ค.

Bagging and Random Forest are algorithms designed to address a single Decision Tree's tendency to overfit. Bagging splits the data into multiple bootstrap subsets, trains a tree on each, and averages the results to predict; however, if a few features dominate, the trees end up similar and performance may still suffer. Random Forest is a refinement of Bagging: where Bagging samples the observations to build its subsets, Random Forest additionally samples the independent variables, producing more diverse trees whose averaged predictions dampen the dominant variables and let the characteristics of the other variables show through.
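
A minimal side-by-side sketch of the two setups (BaggingClassifier's estimator parameter assumes scikit-learn >= 1.2):

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: bootstrap-sample the observations, one full tree per sample
bag = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=100)

# Random Forest: bootstrap the observations AND subsample the features at each split
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=100)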

Use Cases

๊ด‘๊ณ  ๋ฐ˜์‘๋ฅ  ์˜ˆ์ธกํ•˜๊ธฐ - ๋‚˜์ด, ์„ฑ๋ณ„, ์ˆ˜์ž…, ์ผํ‰๊ท  ์ธํ„ฐ๋„ท ์‚ฌ์šฉ์‹œ๊ฐ„์ด ์ฃผ์–ด์กŒ์„ ๋•Œ ๊ด‘๊ณ  ํด๋ฆญ ๋ฐ˜์‘ ๋ถ„์„ํ•˜๊ธฐ ๊ตฌ๋งค ์š”์ธ ๋ถ„์„ํ•˜๊ธฐ - ํ†ต์‹ ์‚ฌ, ์ œํ’ˆ์ƒ‰์ƒ, ๊ฐ€๊ฒฉ์ด ์ฃผ์–ด์กŒ์„ ๋•Œ ์ œํ’ˆ ๊ตฌ๋งค ์š”์ธ ํŒŒ์•…ํ•ด ๋ณด๊ธฐ ํ”„๋กœ๋ชจ์…˜์— ๋ฐ˜์‘ํ•  ๊ณ ๊ฐ ์˜ˆ์ธก - ์ตœ๊ทผ๋ฐฉ๋ฌธ์ผ, ์ฑ„๋„, ๋ฐ˜ํ’ˆ ๋…๋ฆฝ๋ณ€์ˆ˜๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ ๊ตฌ๋งค์—ฌ๋ถ€๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ๊ณ ๊ฐ๋ถ„๋ฅ˜ - ๋‚˜์ด, ์„ฑ๋ณ„, ์—ฐ๋ด‰, ์†Œ๋น„์ ์ˆ˜๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ ๊ณ ๊ฐ ๋ถ„๋ฅ˜
