QUICK REVIEW

[Paper Review] Bayesian Symbolic Regression

Ying Jin, Weilin Fu | arXiv (Cornell University) | Oct 20, 2019

Evolutionary Algorithms and Applications | References: 27 | Citations: 47

One-line summary

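Bayesian Symbolic Regression (BSR) models the signal as an additive combination of concise symbolic trees and uses MCMC to sample tree structures from the posterior, improving interpretability and robustness relative to genetic programming.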

ABSTRACT

Interpretability is crucial for machine learning in many scenarios such as quantitative finance, banking, healthcare, etc. Symbolic regression (SR) is a classic interpretable machine learning method by bridging X and Y using mathematical expressions composed of some basic functions. However, the search space of all possible expressions grows exponentially with the length of the expression, making it infeasible for enumeration. Genetic programming (GP) has been traditionally and commonly used in SR to search for the optimal solution, but it suffers from several limitations, e.g. the difficulty in incorporating prior knowledge; overly-complicated output expression and reduced interpretability etc. To address these issues, we propose a new method to fit SR under a Bayesian framework. Firstly, Bayesian model can naturally incorporate prior knowledge (e.g., preference of basis functions, operators and raw features) to improve the efficiency of fitting SR. Secondly, to improve interpretability of expressions in SR, we aim to capture concise but informative signals. To this end, we assume the expected signal has an additive structure, i.e., a linear combination of several concise expressions, whose complexity is controlled by a well-designed prior distribution. In our setup, each expression is characterized by a symbolic tree, and the proposed SR model could be solved by sampling symbolic trees from the posterior distribution using an efficient Markov chain Monte Carlo (MCMC) algorithm. Finally, compared with GP, the proposed BSR(Bayesian Symbolic Regression) method saves computer memory with no need to keep an updated 'genome pool'. Numerical experiments show that, compared with GP, the solutions of BSR are closer to the ground truth and the expressions are more concise. Meanwhile we find the solution of BSR is robust to hyper-parameter specifications such as the number of trees.

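Motivation and objectives

Incorporate prior knowledge into symbolic regression to improve fitting efficiency and interpretability.
Represent expressions as symbolic trees whose complexity is controlled by a prior distribution.
Compute the posterior over tree structures and parameters with an MCMC-based sampler.
Demonstrate that BSR produces expressions that are closer to the ground truth and more concise than those from GP.
Show that BSR is robust to hyperparameter choices such as the number of additive components.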

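Proposed method

Represent each expression as a symbolic tree, with features as terminal nodes and operators as non-terminal nodes.
Specify prior distributions over the tree structure, the terminal features, and the linear-transformation parameters to control complexity.
Model the response as a linear combination of several concise expressions, with the coefficients estimated by OLS.
Use Metropolis-Hastings and reversible jump MCMC (RJMCMC) to sample tree structures and their associated parameters.
Apply RJMCMC to handle the trans-dimensional moves that occur when a tree gains or loses lt() (linear transformation) nodes, since each such node changes the number of parameters.
Compare BSR with GP on simulated benchmarks and a financial dataset, evaluating accuracy and interpretability.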
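
The first three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the operator tables, and the synthetic data are all hypothetical, and the two trees are fixed by hand rather than sampled by MCMC. It only shows how trees with feature terminals, operator nodes, and an `lt` node (assumed here to compute `a * x + b`) can be evaluated, and how the additive coefficients can then be obtained by OLS.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Node:
    """One node of a symbolic tree (hypothetical structure for illustration)."""
    op: str                          # "x" = feature terminal, otherwise an operator name
    feature: Optional[int] = None    # column index, used only when op == "x"
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    a: float = 1.0                   # slope, used only when op == "lt"
    b: float = 0.0                   # intercept, used only when op == "lt"


# Assumed operator sets; the paper's actual basis functions may differ.
UNARY = {"sin": np.sin, "exp": np.exp}
BINARY = {"+": np.add, "*": np.multiply}


def evaluate(node: Node, X: np.ndarray) -> np.ndarray:
    """Recursively evaluate a tree on the rows of X."""
    if node.op == "x":
        return X[:, node.feature]
    if node.op == "lt":  # linear transformation node with its own parameters
        return node.a * evaluate(node.left, X) + node.b
    if node.op in UNARY:
        return UNARY[node.op](evaluate(node.left, X))
    return BINARY[node.op](evaluate(node.left, X), evaluate(node.right, X))


def size(node: Optional[Node]) -> int:
    """Number of nodes in a tree, a simple measure of expression complexity."""
    if node is None:
        return 0
    return 1 + size(node.left) + size(node.right)


def fit_coefficients(trees, X, y):
    """OLS fit of y on an intercept plus the outputs of the K trees."""
    design = np.column_stack([np.ones(len(X))] + [evaluate(t, X) for t in trees])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


# Synthetic, noise-free data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] * X[:, 1] + 3.0 * np.sin(2.0 * X[:, 0] + 0.5)

trees = [
    Node("*", left=Node("x", feature=0), right=Node("x", feature=1)),
    Node("sin", left=Node("lt", left=Node("x", feature=0), a=2.0, b=0.5)),
]
beta = fit_coefficients(trees, X, y)  # intercept followed by one weight per tree
```

In a full sampler the tree structures and the `lt` parameters would be proposed and accepted or rejected by MCMC, with the OLS step re-run for each candidate set of trees; that part is omitted here.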

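Experimental results

Research questions

RQ1: Can Bayesian priors combined with an additive tree representation yield more concise and interpretable symbolic expressions without sacrificing predictive accuracy?
RQ2: How does BSR compare with genetic programming across diverse tasks in terms of fit quality, generalization, and expression complexity?
RQ3: Is the BSR solution robust to the choice of the number of additive components (K) used in the model?

Key findings

On benchmark tasks, BSR tends to produce expressions that are closer to the ground truth, shorter, and easier to interpret than those from GP.
BSR achieves competitive RMSE on both training and validation data, with notable improvements on some tasks.
The additive tree structure yields markedly smaller expression sizes (node counts) than GP across tasks.
Increasing the number of additive components K improves RMSE up to a point, after which the gains diminish and surplus components are effectively pruned by receiving small coefficients.
On real financial data, BSR recovers useful signals, for example an expression that relates the opening and low prices to the sign of the next-day return.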
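
The finding about K can be illustrated with a small sketch. It rests on several assumptions: the ground-truth function, the three hand-picked component outputs, and the noise-free data are all invented for illustration, and nothing here is sampled by MCMC. It only shows the OLS mechanism by which RMSE stops improving once the needed components are present and a redundant component receives a coefficient near zero.

```python
import numpy as np


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic ground truth with two additive terms (assumed for illustration).
rng = np.random.default_rng(1)
x0 = rng.uniform(-2, 2, 300)
x1 = rng.uniform(-2, 2, 300)
y = 2.0 * x0 * x1 + 3.0 * np.sin(x0)

# Outputs of three candidate components; the third is deliberately redundant.
components = [x0 * x1, np.sin(x0), np.cos(x1)]

results = {}
for K in (1, 2, 3):
    # Design matrix: intercept plus the first K component outputs.
    design = np.column_stack([np.ones_like(y)] + components[:K])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    results[K] = (beta, rmse(y, design @ beta))

for K, (beta, err) in results.items():
    print(K, np.round(beta, 4), round(err, 6))
```

Going from K = 1 to K = 2 removes the remaining error, while K = 3 leaves the fit unchanged and assigns the extra component a negligible weight. With noisy data or sampled trees the pattern would be less clean than in this idealized case.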

This review was generated by AI and checked by a human editor.