QUICK REVIEW

[Paper Review] Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success

Lucas Mentch, Siyu Zhou | arXiv (Cornell University) | 2019-10-31

Gaussian Processes and Bayesian Inference | Citations: 42

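One-line summary

The paper argues that the additional randomness in random forests acts as implicit regularization, reducing the degrees of freedom (dof) and improving performance in low-SNR settings, and it supports this with simulations and a linear-model analogue.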

ABSTRACT

Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided.

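Research motivation and goals

Explain why random forests work well, going beyond accounts that rely only on interpolation or variance reduction.
Quantify how the mtry parameter affects model complexity (degrees of freedom) in random forests.
Show that randomness provides a larger benefit in low signal-to-noise ratio (SNR) settings.
Show that randomized forward selection in linear models mirrors the regularization effect observed in forests.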

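Proposed method

Formalize random forests with explicit sources of randomness, such as data resampling and feature subsampling (mtry).
Define the degrees of freedom of a fitted estimator f̂ as df(f̂) = (1/σ^2) Σ_{i=1}^{n} Cov(ŷ_i, y_i).
Estimate the degrees of freedom of forests via Monte Carlo experiments across varying maxnodes and mtry.
Compare against bagging and a randomized forward selection analogue for linear models.
Evaluate performance as a function of SNR on synthetic data (linear and MARS-like) and on real data.
Reference prior work on interpolation and regularization as context.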
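The covariance definition of degrees of freedom above can be estimated by simulation: hold the design fixed, redraw the noise many times, refit, and average the per-observation covariance between fitted and observed values. The sketch below is a minimal illustration under stated assumptions, not the authors' code. The names monte_carlo_dof, make_ridge, and ridge_dof_exact are hypothetical, and ridge regression is swapped in for the forest because its exact degrees of freedom (the trace of the hat matrix) give a known value to check the estimate against.

```python
import numpy as np

def monte_carlo_dof(fit_predict, X, f_true, sigma, n_reps=500, seed=0):
    """Estimate df = (1/sigma^2) * sum_i Cov(yhat_i, y_i) with X held fixed."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    Y = np.empty((n_reps, n))
    Yhat = np.empty((n_reps, n))
    for r in range(n_reps):
        y = f_true + sigma * rng.standard_normal(n)  # fresh noise, same design
        Y[r] = y
        Yhat[r] = fit_predict(X, y)                  # in-sample fitted values
    # Sample covariance of (yhat_i, y_i) across replications, one per observation.
    cov = ((Y - Y.mean(0)) * (Yhat - Yhat.mean(0))).sum(0) / (n_reps - 1)
    return float(cov.sum() / sigma**2)

def make_ridge(lam):
    """Hypothetical stand-in estimator: ridge regression without intercept."""
    def fit_predict(X, y):
        p = X.shape[1]
        beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
        return X @ beta
    return fit_predict

def ridge_dof_exact(X, lam):
    """Exact dof of ridge: trace of the hat matrix X (X'X + lam I)^{-1} X'."""
    p = X.shape[1]
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return float(np.trace(H))

# Illustrative data (assumed, not from the paper).
rng = np.random.default_rng(1)
n, p, sigma = 50, 5, 1.0
X = rng.standard_normal((n, p))
f_true = X @ np.ones(p)

dof_ols = monte_carlo_dof(make_ridge(0.0), X, f_true, sigma)     # near p
dof_ridge = monte_carlo_dof(make_ridge(50.0), X, f_true, sigma)  # shrunk below p
```

To apply the same estimator to a forest, one would pass a fit_predict that fits the forest and returns its in-sample predictions. In scikit-learn, for example, the max_features and max_leaf_nodes arguments of RandomForestRegressor play roughly the roles of mtry and maxnodes.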

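Experimental results

Research questions

RQ1: How does the mtry parameter affect the degrees of freedom of a random forest?
RQ2: In which SNR regimes do random forests gain the most predictive accuracy over approaches without the extra feature randomization, such as bagging?
RQ3: Does a randomized forward selection procedure for linear models show a regularization effect similar to that of random forests?
RQ4: In low-SNR settings, is the improvement from randomness due mainly to variance reduction, bias reduction, or a combination of the two?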
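Questions about SNR regimes presuppose a way to dial the noise level in synthetic data. The sketch below shows one common way to do this, assuming SNR is defined as Var(f(X)) / σ^2; this review does not state the definition used in the paper, so treat that as an assumption. The function names and the coefficient vector are hypothetical.

```python
import numpy as np

def noise_sd_for_snr(signal, snr):
    """Noise s.d. such that Var(signal) / sigma^2 equals the target SNR."""
    return float(np.sqrt(np.var(signal) / snr))

def simulate_response(signal, snr, rng):
    """Add Gaussian noise scaled to the target SNR; returns (y, sigma)."""
    sigma = noise_sd_for_snr(signal, snr)
    y = signal + sigma * rng.standard_normal(signal.shape[0])
    return y, sigma

# Illustrative linear signal (assumed, not from the paper).
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
signal = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0])

y_low, sigma_low = simulate_response(signal, snr=0.1, rng=rng)     # noisy regime
y_high, sigma_high = simulate_response(signal, snr=10.0, rng=rng)  # clean regime
```

Under this definition, moving from SNR 10 down to SNR 0.1 multiplies the noise standard deviation by 10, which is the kind of range over which the comparison with bagging is posed.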

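Key findings

Increasing maxnodes increases the forest's degrees of freedom, and the growth follows a concave pattern.
At a fixed maxnodes, higher mtry values yield more degrees of freedom than lower mtry values.
In low-SNR settings, random forests show a more pronounced performance advantage over bagging; the advantage shrinks at high SNR.
The optimal mtry is positively correlated with SNR, which points to a regularizing effect of randomness.
In noisy, low-dimensional settings, the randomized forward selection analogue for linear models shows similar regularization benefits.
Randomness acts as an implicit regularizer, analogous to the shrinkage penalty in explicit regularization methods.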
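The shrinkage analogy in the findings above can be made concrete with a small linear-model sketch. It assumes each base learner is fit on a bootstrap sample and that every forward step may only pick from mtry randomly chosen remaining features, with the final coefficients averaged over learners. This review does not spell out the paper's exact procedure, so the details here, along with the names forward_select and randomized_forward_selection, are illustrative assumptions.

```python
import numpy as np

def forward_select(X, y, depth, mtry, rng):
    """Greedy forward selection; each step sees only `mtry` random candidates."""
    n, p = X.shape
    selected = []
    for _ in range(depth):
        remaining = [j for j in range(p) if j not in selected]
        if not remaining:
            break
        k = min(mtry, len(remaining))
        candidates = rng.choice(remaining, size=k, replace=False)
        best_j, best_rss = None, np.inf
        for j in candidates:
            A = np.column_stack([np.ones(n), X[:, selected + [int(j)]]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            if rss < best_rss:
                best_j, best_rss = int(j), rss
        selected.append(best_j)
    # Refit on the chosen features; unselected coefficients stay at zero.
    A = np.column_stack([np.ones(n), X[:, selected]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    beta = np.zeros(p + 1)                      # beta[0] is the intercept
    beta[0] = coef[0]
    beta[1 + np.array(selected, dtype=int)] = coef[1:]
    return beta

def randomized_forward_selection(X, y, depth, mtry, n_models=200, seed=0):
    """Average coefficients of bootstrap-fitted randomized forward selections."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    betas = np.zeros((n_models, p + 1))
    for b in range(n_models):
        idx = rng.integers(0, n, size=n)        # bootstrap resample
        betas[b] = forward_select(X[idx], y[idx], depth, mtry, rng)
    return betas.mean(axis=0)

# Illustrative data (assumed): one true feature among ten, unit noise.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))
y = 2.0 * X[:, 0] + rng.standard_normal(100)

beta_bag = randomized_forward_selection(X, y, depth=3, mtry=10)   # no feature randomness
beta_rand = randomized_forward_selection(X, y, depth=3, mtry=3)   # restricted candidates
```

With mtry equal to the number of features, the procedure reduces to bagged forward selection and the averaged coefficient on the true feature stays near its least-squares value. With a smaller mtry, some learners never see that feature and contribute a zero, so the average is pulled toward zero, which is the shrinkage-like behavior the analogy refers to.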


This review was generated by AI and reviewed by a human editor.