LightGBM: A Highly Efficient Gradient Boosting Decision Tree (2017)

Notice

Recent Posts

Recent Comments

Link

« 2025/10 »
일	월	화	수	목	금	토
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Tags more

Archives

Today

Total

관리 메뉴

언어 전공자의 NLP 로그

LightGBM: A Highly Efficient Gradient Boosting Decision Tree (2017) 본문

논문 읽기

LightGBM: A Highly Efficient Gradient Boosting Decision Tree (2017)

JohnnyNLP 2023. 8. 12. 16:56

논문 출처 : https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html

LightGBM: A Highly Efficient Gradient Boosting Decision Tree

Requests for name changes in the electronic proceedings will be accepted with no questions asked. However name changes may cause bibliographic tracking issues. Authors are asked to consider this carefully and discuss it with their co-authors prior to reque

proceedings.neurips.cc

information gain : DT 모델에서 특정 feature에 기반하여 dataset을 분리했을 떄 얻는 정보의 양을 정량화한 지표. 즉, 가장 효율적으로 tree split을 할 수 있는 지점을 결정하기 위한 값. The information gain is computed based on the sum of gradients within each bin or bucket for a given feature.
entropy : information gain을 계산하는 데 사용되는 지표. 이는 분류 모델에 해당하는 내용이며, 각 instance가 특정 class에 속할 확률과 그 값을 \( log2 \)에 취한 값을 곱해 모두 더한 값이다. 엔트로피가 낮다면 데이터들이 서로 같은 class에 속할 확률이 높다는 의미고, 엔트로피가 높으면 class 분포가 다양하다는 의미다.

Let's assume you have a dataset with 100 instances, and the class labels are either "A," "B," or "C." The distribution of class labels is as follows:

Class A: 40 instances

Class B: 30 instances

Class C: 30 instances

To calculate the entropy, you would perform the following calculations:

p(A) = 40 / 100 = 0.4 p(B) = 30 / 100 = 0.3 p(C) = 30 / 100 = 0.3

Entropy(S) = - (0.4 * log2(0.4) + 0.3 * log2(0.3) + 0.3 * log2(0.3))

Evaluate the expression and multiply the result by -1 to obtain the entropy value.

gradients : partial derivatives of the loss function with respected to the predicted values
data instance = single row or record of a dataset
feature space가 sparse하다 : 0이나 0에 가까운 값들이 대부분이다. 즉, feature에 0이 아닌 값보다 0인 값을 가진 원소가 더 많다. NLP, 이미지 인식, 추천 시스템, 고차원 데이터 등등에서 많이 발견된다. (PCA, One-hot-encoding)

0. Abstract

Gradient-based One-Side Sampling : 경사도가 낮은 (에러율이 적은) 데이터는 배제하고, 경사도가 높은 (에러율이 높은) 데이터에 집중하는 샘플링 방식. Random (uniform) sampling보다 더 효율적인 추정이 가능해진다.
Exclusive Feature Bundling (EFB): 서로 배타적인 feature (즉, 동시에 0이 아닌 값을 가지지 않는 feature)를 묶어 feature 수 간소화하자. => greedy algorithm

![Vertex definition - Math Insight](https://mathinsight.org/media/image/image/small_undirected_network_labeled.png)

1. Introduction

GBDT : 데이터의 크기가 증가하면서 정확도와 효율성 간의 tradeoff 발생
전통적인 GBDT의 경우, 분리 지점을 찾기 위해 모든 데이터 instance를 평가해야 한다.
그러면 instance 수와 feature 수를 줄이면 되지 않느냐? => 간단하지 않은 문제.
가중치에 기반한 데이터 샘플링 기법 => GBDT는 샘플링에 쓸 가중치 값이 없다.
GOSS : GBDT 자체 가중치는 없지만, 각 instance가 지니는 경사도 값에 집중.

2. Preliminaries

2-1. GBDT and Its Complexity Analysis

GBDT는 트리 기반 앙상블 모델로, 매 iteration마다 음의 경사도(= residual errors)를 fitting하여 학습한다.
이때 가장 많은 연산 비용이 발생하는 부분이 최적의 split point를 찾는 데서 온다.
pre-sorted algorithm : 쉽고 간단하지만, 오래 걸린다.
histogram-based algorithm : 연속값을 이산치 범위 (bin)에 담아 feature histogram을 생성한다. (쉽게 말해 그룹핑) (#data * #features => #bins * #features)

2-2. Related Work

XGBoost는 위 두 알고리즘 모두 차용하고 성능이 뛰어나 이를 baseline으로 삼는다.
학습 데이터의 사이즈를 줄이기 위해 가중치의 임계점을 정해 일부 instance를 필터링하는 방법 : GBDT에는 자체적으로 데이터에게 부여하는 가중치가 없다. (SGB/AdaBoost 등등 언급)
PCA나 투영을 이용해 weak feature를 필터링하는 방법 : 중복되는 요소가 많은 feature들의 경우는 유효하나, 각기 고유한 feature를 가지고 있는 경우 정보 소실 발생 => 정확도 영향
0값 feature를 무시하는 방법 : histogram-based인 GBDT는 sparse한 데이터의 최적화 솔루션이 충분하지 않다. 값이 0건 아니건 bin을 만들어내야 하기 때문.
이러한 한계점에 대한 대안으로 GOSS와 EFB 등장

3. GOSS

3-1. Algorithm Description

각 instance에 대하여 gradient가 작으면 해당 데이터는 학습이 충분히 잘 됐다.
단순히 이러한 instance를 제거해버리면 데이터의 분포가 영향을 받으므로, 샘플링 기법을 도입.
Gradient가 큰 instance를 다 가져가면서 일부 small gradient instance를 임의 추출.
a : 상위 a% gradient를 지닌 instance의 비율
b : 그 외 데이터의 임의로 추출할 instance의 비율
\( 1−ab\) : 데이터 분산을 유지하기 위해 곱해지는 상수
Variance gain : 특정 feature j의 특정 분리 지점 d를 채택했을 때 얻을 수 있는 분산의 감소량.
분산이 왜 작아야 하는가? 균질한 데이터 분류 지점을 찾기 위해서.

3.2 Theoretical Analysis

a=0.2, b=0.1이라 하자.
기본적으로 GBDT는 feature j에 대하여 variance gain을 최대화하는 d값을 찾기 위해 **모든 i에 대하여/혹은 랜덤 sampling하여** 모든 d값을 다 탐색했다면, 이 모델에서는 **일부만 특정 샘플링하여**, 즉, A에 속한 20%의 데이터 및 B에 속한 \(1−0.20.1\), 8%의 데이터에 대하여 d값을 탐색한다. 즉, 이 경우에 72%만큼의 데이터를 보지 않고도 유사한 정확도를 낼 수 있는 것.
기존 수식

GOSS 적용 수식

GPT 예시

Let's assume we have a dataset with 100 instances and we are considering splitting on feature j. Here are the variable assignments for the example:

n = 100: Total number of instances in the dataset.

a = 0.2: The fraction of instances selected for full gradient calculation (the remaining 80% are sampled for gradient estimation).

b = 0.4: The fraction of instances selected for gradient estimation from the remaining 80% of instances.

njl(d) = 60: Number of instances in the left child node (subset Al and Bl) after the split on feature j at point d.

njr(d) = 40: Number of instances in the right child node (subset Ar and Br) after the split on feature j at point d.

Σ(xi∈Al gi) = -15: Sum of target values in subset Al.

Σ(xi∈Bl gi) = 10: Sum of target values in subset Bl.

Σ(xi∈Ar gi) = 25: Sum of target values in subset Ar.

Σ(xi∈Br gi) = 5: Sum of target values in subset Br.

Now, let's calculate the variance gain using the formula:

V˜j(d) = (1/n) * [(Σ(xi∈Al gi) + (1-a)*b*njl(d)*Σ(xi∈Bl gi))^2 + (Σ(xi∈Ar gi) + (1-a)*b*njr(d)*Σ(xi∈Br gi))^2]

Variance Gain Calculation:

V˜j(d) = (1/100) * [(-15 + (1-0.2)*0.4*60*10)^2 + (25 + (1-0.2)\*0.4\*40*5)^2]

V˜j(d) = (1/100) * [(-15 + 0.8*0.4*60*10)^2 + (25 + 0.8*0.4*40*5)^2]

V˜j(d) = (1/100) * [(-15 + 19.2)^2 + (25 + 12.8)^2]

V˜j(d) = (1/100) * [(4.2)^2 + (37.8)^2]

V˜j(d) = (1/100) * [17.64 + 1428.84]

V˜j(d) = (1/100) * 1446.48

V˜j(d) ≈ 14.4648

4. EFB

Feature 수를 효율적으로 줄일 수 있는 방법.
Sparce한 data => 거의 손실 없이 feature 수를 줄일 수 있다. ex) one-hot-encoding
유사한 feature끼리 bundling

어떠한 feature들을 bundling할 것인가?

graph coloring problem

![Graph coloring - Wikipedia](https://upload.wikimedia.org/wikipedia/commons/thumb/9/90/Petersen_graph_3-coloring.svg/1200px-Petersen_graph_3-coloring.svg.png)

What is the minimal # of colors? graph coloring problem.
이 경우 각 feature가 vertice가 되고, 서로 수직이 아닌 feature 사이에 그어지는 선이 edge가 된다.
따라서 선이 그어지지 않은 feature들 끼리는 같은 색을 칠할 수 있는 것.
실제 데이터 상에 동시에 서로 0이 아닌 값을 가지는 경우가 생각보다 잘 없다.
동시에 0이 아닌 값을 지닌다 => 같은 벡터 공간에 존재한다 => 서로 배타적이지 않다.
알고리즘 구현 : 각 feature들 사이에 conflict가 작다면 함께 묶고, 크다면 새로운 bundle을 생성한다. 연산 시간은 O(#\(features2\))로, 학습 전 1번만 수행한다. (Feature가 너무 많아지면 문제 발생)

어떻게 bundling할 것인가?

feature를 0이 아닌 값의 count로 정렬한다. (더 많은 non-zero values => 더 많은 conflict)
서로 배타적인 feature를 함께 묶어(bundling) 서로 다른 bin에 넣는다 (offset을 더해줌) => one-hot-encoding의 반대
데이터 상의 0의 수를 줄이겠다!

5. Experiment

생략

6. Conclusion

GOSS와 EFB를 적용한 LightGBM은 연산 속도와 메모리 소모에서 XGBoost와 SGB를 능가하는 성능을 보인다.

참고 자료

https://www.youtube.com/watch?v=9uxWzeLglr0&ab_channel=UnfoldDataScience

Boosting -> \( M0 \)에 대해 학습한 결과를 \( M1...Mn \)에서 반복하고, 이를 최종 모델에 반영 ... 에러 계산
smart featuring + smart sampling => better result

500 records(=features) data ... decision tree => root column selection and find split points(=bins) => histogram-based algorithm
Exclusive Feature Bundling (EFB) ... bundling categorical features (one-hot-encoded data) => combine into one feature and reduce the # of features
Gradient-based One-Side Sampling (GOSS) ... gradient = error. Sampling based on errors. \(M0\) decides gradients of #features => sort them out => give different weights on features based on gradients => sample b% from large gradient groups and sample a% from small gradient groups => create sub-sample. By sampling groups with large errors, we can train the data that are under-trained

normally 10 times faster with similar accuracy
With good computational resources, we can still use XGBoost, as they are well-managed and supported.

https://www.youtube.com/watch?v=n_ZMQj09S6w&ab_channel=DigitalSreeni

Fast, distributed, high-performance gradient boosting framework based on decision tree algorithm.
Leaf-wise tree growth => can lead to overfitting (define max-depth)
Requires many hyperparameters
data : (569, 32) breast cancer dataset
https://github.com/bnsreenu/python_for_microscopists/blob/master/196_lightGBM_feature_selection_breast_cancer.py

lgbm_params = {'learning_rate':0.05, 'boosting_type':'gbdt',    #Try dart for better accuracy
              'objective':'binary',
              'metric':['auc', 'binary_logloss'],
              'num_leaves':100,
              'max_depth':10}

clf = lgbm.train(lgbm_params, d_train, 50) #50 iterations. Choose # based on the learning rate

objective -> binary, regression, multiple-classification etc.
Tweak parameters as you like

parameters={'max_depth':10, 
            'objective':'binary:logistic',
            'eval_metric':'auc',
            'learning_rate':.05}

Comparison to XGBoost

'논문 읽기' 카테고리의 다른 글

BLEU: a Method for Automatic Evaluation of Machine Translation (1)	2023.10.30
Chatbot의 장기 기억력을 높이는 방법 (0)	2023.08.28
GloVe: Global Vectors for Word Representation (2014) (0)	2023.08.12
XGBoost: A Scalable Tree Boosting System (2016) (0)	2023.08.12
Sequence to Sequence Learning with Neural Networks (2014) (0)	2023.08.12

'논문 읽기' Related Articles

언어 전공자의 NLP 로그

LightGBM: A Highly Efficient Gradient Boosting Decision Tree (2017) 본문

LightGBM: A Highly Efficient Gradient Boosting Decision Tree (2017)

1. Introduction

2. Preliminaries

2-1. GBDT and Its Complexity Analysis

2-2. Related Work

3. GOSS

3-1. Algorithm Description

3.2 Theoretical Analysis

4. EFB

어떠한 feature들을 bundling할 것인가?

어떻게 bundling할 것인가?

5. Experiment

6. Conclusion

참고 자료

'논문 읽기' 카테고리의 다른 글

티스토리툴바