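Association Rule: Market Basket Analysis, Association Analysis

Association rule analysis is an algorithm that generates a set of rules describing which sets of items frequently occur together.

Apriori Algorithm

Computing the metrics for every possible rule is inefficient, so Apriori searches only the itemsets whose support is at or above a minimum threshold. For example, if items {1, 3} rarely appear together, then any superset {1, 3, one more item} is at least as rare, so it is never evaluated.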
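
The pruning idea can be sketched in plain Python. This is a minimal illustration under assumptions, not mlxtend's implementation: the function name `apriori_sketch`, the `evaluated` bookkeeping list, and the toy baskets are all made up for this example.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Level-wise search; returns ({itemset: support}, [itemsets whose support was counted])."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    evaluated = []  # every itemset whose support was actually counted

    def support(itemset):
        evaluated.append(itemset)
        return sum(itemset <= b for b in baskets) / n

    # Level 1: single items that meet the threshold
    current = {}
    for item in sorted({i for b in baskets for i in b}):
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Candidates of size k are unions of two frequent (k-1)-itemsets
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune: skip a candidate if any (k-1)-subset is not frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        next_level = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                next_level[c] = s
        frequent.update(next_level)
        current = next_level
        k += 1
    return frequent, evaluated

# Hypothetical baskets: items 1 and 3 appear together in only 2 of 5 transactions
transactions = [[1, 2, 3], [1, 2], [1, 2, 4], [2, 3], [1, 2, 3]]
frequent, evaluated = apriori_sketch(transactions, min_support=0.5)
for itemset, s in frequent.items():
    print(sorted(itemset), s)
```

Because {1, 3} falls below the threshold, the candidate {1, 2, 3} is discarded before its support is ever counted, which is where the savings come from.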

Support

The probability that x and y occur together -> P(x∩y) = freq(x, y) / N, where N is the total number of transactions
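
As a small sketch of this definition, support can be counted directly from a list of baskets. The helper name `support` and the baskets below are hypothetical, chosen only for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(itemset <= set(t) for t in transactions)
    return hits / len(transactions)

# Hypothetical baskets, for illustration only
transactions = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["milk", "bread", "eggs"],
    ["bread"],
]

# milk and bread appear together in 2 of the 4 baskets
support_milk_bread = support(transactions, ["milk", "bread"])
print(support_milk_bread)  # 0.5
```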

Confidence

The probability that y appears given that x has appeared -> P(y|x) = P(x∩y) / P(x) = freq(x, y) / freq(x)
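
A minimal sketch of this ratio, using made-up baskets and hypothetical helper names (`count`, `confidence`). It also shows that confidence is directional: swapping antecedent and consequent changes the denominator.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions)

def confidence(transactions, antecedent, consequent):
    """freq(x, y) / freq(x) for the rule antecedent -> consequent."""
    both = count(transactions, set(antecedent) | set(consequent))
    return both / count(transactions, antecedent)

# Hypothetical baskets: milk appears 4 times, bread 3 times, both together twice
transactions = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["milk", "bread", "eggs"],
    ["bread"],
    ["milk"],
]

conf_milk_to_bread = confidence(transactions, ["milk"], ["bread"])  # 2 / 4
conf_bread_to_milk = confidence(transactions, ["bread"], ["milk"])  # 2 / 3
print(conf_milk_to_bread, conf_bread_to_milk)
```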

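Lift (the strength of association between items)

Uses the case where x and y are independent as the baseline

The probability that x and y occur together, divided by the probability that they would occur together if x and y were independent events

-> P(x∩y) / (P(x) * P(y)) = P(y|x) / P(y)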
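
The two forms of the formula can be checked against each other numerically. The sketch below assumes hypothetical baskets and helper names (`prob`, `lift`); it is only meant to show that both expressions give the same number.

```python
import math

def prob(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def lift(transactions, x, y):
    """P(x and y) / (P(x) * P(y))."""
    p_xy = prob(transactions, set(x) | set(y))
    return p_xy / (prob(transactions, x) * prob(transactions, y))

# Hypothetical baskets: P(milk) = 0.8, P(bread) = 0.6, P(milk and bread) = 0.4
transactions = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["milk", "bread", "eggs"],
    ["bread"],
    ["milk"],
]

lift_value = lift(transactions, ["milk"], ["bread"])

# Second form: confidence(milk -> bread) / P(bread)
conf = prob(transactions, ["milk", "bread"]) / prob(transactions, ["milk"])
lift_via_confidence = conf / prob(transactions, ["bread"])

print(lift_value, lift_via_confidence)
print(math.isclose(lift_value, lift_via_confidence))
```

In this toy data the lift is below 1, meaning milk and bread appear together slightly less often than independence would predict.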

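Lift is mainly used to choose between products y and z that have the same confidence with respect to a product x. For example, suppose the confidence of (x, y) and (x, z) are both 0.5, y appears in 10 out of 100 transactions, and z appears in 5 out of 100. Then (x, z) is judged the more valuable pair, so z is recommended when x is purchased.

-> lift(x, y) = 0.5 / P(y) = 0.5 / (10/100) = 5

-> lift(x, z) = 0.5 / P(z) = 0.5 / (5/100) = 10

Recommend the item with the larger lift.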
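
The arithmetic of this example can be reproduced in a few lines. The function name `lift_from_confidence` is a hypothetical helper; the numbers are the ones given in the text.

```python
import math

def lift_from_confidence(confidence, consequent_count, n_transactions):
    """lift = confidence / P(consequent)."""
    return confidence / (consequent_count / n_transactions)

# Both rules have confidence 0.5; y appears in 10 of 100 transactions, z in 5 of 100
candidates = {
    "y": lift_from_confidence(0.5, 10, 100),
    "z": lift_from_confidence(0.5, 5, 100),
}

# Recommend the consequent with the larger lift
recommended = max(candidates, key=candidates.get)
print(candidates, recommended)
```

Equal confidence does not imply equal value: z is rarer overall, so seeing it in half of the x baskets is the stronger signal.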

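If lift is 1, the two items are independent;

if it is greater than 1, they are positively correlated (they co-occur more often than chance);

if it is less than 1, they are negatively correlated (they co-occur less often than chance).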

# Preprocessing for association analysis
import pandas as pd
from mlxtend.frequent_patterns import apriori

# Convert the DataFrame into a list of items per transaction
df = pd.DataFrame([[1, 'banana'],[2, 'banana'],[2, 'apple'],[3, 'banana']], columns=['a','b'])
def toList(x):
  return list(set(x))
df1 = df.groupby('a').b.apply(lambda x: toList(x)).reset_index()
# Keep only transactions that contain 2 or more items
df1 = df1[df1.b.apply(len) >= 2]

# Convert to the format needed for association analysis (a list of lists)
dataset = list(df1.b)

# Switch to a larger dataset that shows the results more clearly
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

# Rearrange the data so that rows are transactions and columns are items
# One-hot encode whether each product is present and convert to an array
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)

# Convert the encoded array to a DataFrame and inspect it
df = pd.DataFrame(te_ary, columns=te.columns_)
print(df)

# Column names
print(te.columns_)

# If you prefer 1/0 instead of True/False
print(pd.DataFrame(te_ary.astype('int'), columns=te.columns_))

# Convert back to the original list of lists
te.inverse_transform(te_ary)

# Use the Apriori algorithm for association rule analysis
from mlxtend.frequent_patterns import apriori

# Compute support; there can be many itemsets, so min_support keeps only those at or above a threshold (default=0.5)
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)

# Keep only itemsets with at least a given number of items
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets[frequent_itemsets['length'] >= 2]

# Keep only itemsets that contain a specific item (Eggs)
frequent_itemsets[frequent_itemsets['itemsets'].apply(lambda x: 'Eggs' in x)]

# Derive association rules
from mlxtend.frequent_patterns import association_rules
# Keep only rules with confidence of at least 0.7
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
# Keep only rules with lift of at least 0.7
association_rules(frequent_itemsets, metric="lift", min_threshold=0.7)

# Keep only rules with exactly one antecedent
df = rules[rules.antecedents.apply(lambda x: len(x) == 1)]

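# Find the largest lift for a given antecedent, i.e. the largest pointwise mutual information (PMI = log of lift)
# Note: with min_support=0.5 on the dataset above, 'Apple' (support 0.2) produces no rules; try a frequent item such as 'Eggs'
df[df.antecedents == frozenset({'Apple'})].sort_values(by='lift', ascending=False)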

# Antecedent is Apple and the consequent is a single item, sorted by ascending lift: the lowest-lift item is the one that can be placed farthest away
rules[(rules.antecedents == frozenset({'Apple'})) & (rules.consequents.apply(lambda x: len(x) ==1))].sort_values(by='lift')

# Find rules whose antecedents contain a specific item ('Apple')
rules[rules.antecedents.apply(lambda x: 'Apple' in x)]

# Find rules whose antecedents are exactly a specific set
rules[rules.antecedents == frozenset({'Apple'})]

# Extract the values from a frozenset
[i for i in frozenset({'Apple', 'Banana'})]
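
To make the rule-derivation step less of a black box, here is a rough plain-Python sketch of how rules can be enumerated from frequent itemsets. It is not mlxtend's `association_rules` implementation: `rules_from_itemsets` is a hypothetical helper, and it assumes every subset of a frequent itemset is also present in the input dictionary. The supports below were counted by hand from the five-basket dataset above.

```python
from itertools import combinations

def rules_from_itemsets(frequent, min_confidence):
    """frequent: {frozenset: support}. Returns a list of rule dicts."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one antecedent and one consequent
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                antecedent = frozenset(ante)
                consequent = itemset - antecedent
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append({
                        "antecedents": antecedent,
                        "consequents": consequent,
                        "support": sup,
                        "confidence": conf,
                        "lift": conf / frequent[consequent],
                    })
    return rules

# Eggs is in 4 of 5 baskets, Kidney Beans in all 5, both together in 4
frequent = {
    frozenset({"Eggs"}): 0.8,
    frozenset({"Kidney Beans"}): 1.0,
    frozenset({"Eggs", "Kidney Beans"}): 0.8,
}

for rule in rules_from_itemsets(frequent, min_confidence=0.7):
    print(sorted(rule["antecedents"]), "->", sorted(rule["consequents"]),
          rule["confidence"], rule["lift"])
```

Both directions have lift 1 here because Kidney Beans is in every basket, so knowing that Eggs was bought adds no information about it.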


Carl's Tech Blog
