Carl's Tech Blog

AdaBoost in Python

data-science-summary/summary 2020. 9. 20. 17:07

<An engineer who cannot implement something does not really understand it>

AdaBoost

A type of boosting method.

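It builds many decision trees that each have only two leaf nodes. => Because the tree is so small, it is called a stump. In each round one candidate stump can be built per feature, so there are as many candidates as features; the number of stumps kept in the final model equals the number of boosting rounds.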
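
A stump can be sketched with scikit-learn as a decision tree limited to depth 1. The toy data below is generated only for this illustration and is not part of the original post.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data, generated only for this illustration.
X, y = make_classification(n_samples=200, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0)

# max_depth=1 allows a single split, so the tree is a stump.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
stump.fit(X, y)

n_leaves = stump.get_n_leaves()              # number of leaf nodes
split_feature = int(stump.tree_.feature[0])  # feature index used at the root
print(n_leaves, split_feature)
```

When fitted on all columns, the tree itself searches every feature for the best single split, so the stump ends up using exactly one feature.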

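The stumps are trained one after another, and the final prediction combines their votes, each weighted by that stump's amount of say.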

=> When sampling the data, every sample is given a weight (the initial weight is 1 / total number of samples).

As the data passes through the stumps, the weights of misclassified samples keep growing.

The order of the stumps therefore matters: among the candidate stumps, the one with the lowest impurity is used first.
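
Picking the stump with the lowest impurity could look roughly like the sketch below: one candidate stump per feature, scored by weighted Gini impurity. The helper weighted_gini and the toy data are assumptions made for this illustration, and Gini is only one possible impurity measure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
weights = np.full(len(y), 1.0 / len(y))   # initial weight: 1 / number of samples


def weighted_gini(leaf_ids, y, weights):
    """Weighted average Gini impurity over the leaves of a fitted stump (hypothetical helper)."""
    total = 0.0
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        leaf_weight = weights[mask].sum()
        # Weighted class proportions inside this leaf.
        p = np.array([weights[mask & (y == c)].sum() for c in np.unique(y)]) / leaf_weight
        total += leaf_weight / weights.sum() * (1.0 - np.sum(p ** 2))
    return total


impurities = []
for j in range(X.shape[1]):
    # One candidate stump per feature, trained on that single column.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X[:, [j]], y, sample_weight=weights)
    impurities.append(weighted_gini(stump.apply(X[:, [j]]), y, weights))

best_feature = int(np.argmin(impurities))  # candidate with the lowest impurity
print(best_feature, [round(v, 3) for v in impurities])
```

For two classes the Gini impurity lies between 0 (pure leaves) and 0.5 (no separation at all).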

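Then, before the weights are updated, the stump's amount of say is computed. It plays the role of an accuracy score for the stump and is derived from the stump's weighted error. The weight of a misclassified sample is multiplied by e^(amount of say), which makes it larger, and the weight of a correctly classified sample is multiplied by e^(-amount of say), which makes it smaller.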
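
The update can be worked through on a tiny example. The post does not spell out how the amount of say is computed; the sketch assumes the usual discrete AdaBoost formula, 0.5 * ln((1 - total error) / total error), and a hypothetical stump that misclassifies one of eight samples.

```python
import numpy as np

n = 8
weights = np.full(n, 1.0 / n)                 # initial weight: 1 / number of samples
# Hypothetical outcome of one stump: only the sample at index 3 is misclassified.
misclassified = np.zeros(n, dtype=bool)
misclassified[3] = True

total_error = weights[misclassified].sum()    # weighted error of the stump
amount_of_say = 0.5 * np.log((1.0 - total_error) / total_error)

# Wrong samples: w * e^(+say); correct samples: w * e^(-say).
updated = np.where(misclassified,
                   weights * np.exp(amount_of_say),
                   weights * np.exp(-amount_of_say))
print(round(float(amount_of_say), 3))
print(np.round(updated, 4))
```

Here the total error is 1/8, so the amount of say is 0.5 * ln(7), about 0.973. A total error of exactly 0 or 1 would make the logarithm blow up, so practical code usually clips the error slightly away from those values.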

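The updated weights are then rescaled so that they sum to 1 again. This resembles softmax only in that the result sums to 1: each weight is simply divided by the sum of all weights, with no further exponential applied.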
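
Rescaling so that the weights sum to 1 can be done by dividing by their total. A minimal sketch with hypothetical numbers, with softmax computed alongside only for comparison:

```python
import numpy as np

# Hypothetical weights after one update: one misclassified sample, seven correct ones.
updated = np.array([0.3307] + [0.0472] * 7)

normalized = updated / updated.sum()                 # plain normalization
softmax = np.exp(updated) / np.exp(updated).sum()    # shown only for contrast

print(np.round(normalized, 4))
print(np.round(softmax, 4))
```

Both results sum to 1, but they are different distributions: plain normalization keeps the ratios between the weights, while softmax would change them.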

Next, a new data set is drawn, favoring the samples with large weights. => Misclassified samples are naturally drawn more often.
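
Drawing a new data set by weight can be sketched with NumPy's random generator. The weights below are hypothetical: sample 3 is assumed to have been misclassified and carries half of the total weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalized weights: sample 3 carries half of the total weight.
weights = np.array([1 / 14] * 3 + [0.5] + [1 / 14] * 4)
weights = weights / weights.sum()      # guard against rounding error
n = len(weights)

# Draw n row indices with replacement; each index is picked with probability = its weight.
indices = rng.choice(n, size=n, p=weights)
counts = np.bincount(indices, minlength=n)
print(indices, counts)

# Over many draws the share of sample 3 approaches its weight of 0.5.
many = rng.choice(n, size=10_000, p=weights)
share_of_sample_3 = float(np.mean(many == 3))
print(share_of_sample_3)
```

Because the draw is with replacement, a heavily weighted sample can appear several times in the new data set, which is what pushes the next stump toward it.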

The weights of the newly drawn data set are then reset (to the initial value, 1 / total number of samples), and training is repeated.
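
Put together, the procedure described so far could be sketched as the loop below. This is a rough illustration of the resampling variant from this post, not a reference implementation: the number of rounds, the toy data, the amount-of-say formula 0.5 * ln((1 - error) / error), and the -1/+1 label encoding are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
y_signed = np.where(y == 1, 1, -1)        # labels as -1 / +1 for the weighted vote

rng = np.random.default_rng(0)
n = len(y)
X_cur, y_cur = X, y_signed                # data set used to train the current stump
stumps, says = [], []

for _ in range(20):                       # 20 rounds, an arbitrary choice
    weights = np.full(n, 1.0 / n)         # weights start at 1 / n in every round
    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_cur, y_cur)
    wrong = stump.predict(X_cur) != y_cur

    # Clip the error so that the logarithm stays finite.
    total_error = np.clip(weights[wrong].sum(), 1e-10, 1 - 1e-10)
    say = 0.5 * np.log((1 - total_error) / total_error)

    # Raise the weights of wrong samples, lower the rest, then normalize.
    weights = weights * np.exp(np.where(wrong, say, -say))
    weights /= weights.sum()

    stumps.append(stump)
    says.append(say)

    # Resample n rows with replacement; heavily weighted rows are drawn more often.
    idx = rng.choice(n, size=n, p=weights)
    X_cur, y_cur = X_cur[idx], y_cur[idx]

# Final prediction: sign of the say-weighted vote of all stumps.
votes = sum(s * st.predict(X) for s, st in zip(says, stumps))
ensemble_pred = np.where(votes >= 0, 1, -1)
ensemble_acc = float(np.mean(ensemble_pred == y_signed))
first_stump_acc = float(np.mean(stumps[0].predict(X) == y_signed))
print(first_stump_acc, ensemble_acc)
```

The two printed numbers are the training accuracy of the first stump alone and of the whole weighted vote, which gives a quick feel for what the later rounds add.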

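from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data: 4 features, 2 of them informative.
X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# Hold out 20% of the samples for testing.
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=123)

# The default weak learner is a depth-1 decision tree, i.e. a stump.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(train_x, train_y)

# Predicted labels for the test set.
print(clf.predict(test_x))

# The accuracy turns out to be fairly high.
print(clf.score(test_x, test_y))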
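
To connect the library model back to the description above, the fitted stumps and the accuracy after each boosting round can be inspected. A small sketch that repeats the same setup so it runs on its own; note that scikit-learn passes the sample weights to each stump directly instead of drawing a new data set every round.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=123)

clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(train_x, train_y)

# Each fitted weak learner is a depth-1 tree by default, so it has at most two leaves.
leaves_per_stump = {est.get_n_leaves() for est in clf.estimators_}

# Test accuracy after 1, 2, ..., len(clf.estimators_) boosting rounds.
staged = list(clf.staged_score(test_x, test_y))
print(leaves_per_stump, staged[0], staged[-1])
```

Comparing staged[0] with staged[-1] shows how much the later stumps add on top of the first one for this data set.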
