QUICK REVIEW

[Paper Review] Bayesian Symbolic Regression

Ying Jin, Weilin Fu | arXiv (Cornell University) | Oct 20, 2019

Evolutionary Algorithms and Applications | References: 27 | Cited by: 47

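One-Sentence Summary

Bayesian Symbolic Regression (BSR) fits symbolic regression in a Bayesian framework, using an additive mixture of concise symbolic trees and MCMC to sample tree structures from the posterior, which improves interpretability and robustness relative to genetic programming.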

ABSTRACT

Interpretability is crucial for machine learning in many scenarios such as quantitative finance, banking, healthcare, etc. Symbolic regression (SR) is a classic interpretable machine learning method by bridging X and Y using mathematical expressions composed of some basic functions. However, the search space of all possible expressions grows exponentially with the length of the expression, making it infeasible for enumeration. Genetic programming (GP) has been traditionally and commonly used in SR to search for the optimal solution, but it suffers from several limitations, e.g. the difficulty in incorporating prior knowledge; overly-complicated output expression and reduced interpretability etc. To address these issues, we propose a new method to fit SR under a Bayesian framework. Firstly, Bayesian model can naturally incorporate prior knowledge (e.g., preference of basis functions, operators and raw features) to improve the efficiency of fitting SR. Secondly, to improve interpretability of expressions in SR, we aim to capture concise but informative signals. To this end, we assume the expected signal has an additive structure, i.e., a linear combination of several concise expressions, whose complexity is controlled by a well-designed prior distribution. In our setup, each expression is characterized by a symbolic tree, and the proposed SR model could be solved by sampling symbolic trees from the posterior distribution using an efficient Markov chain Monte Carlo (MCMC) algorithm. Finally, compared with GP, the proposed BSR(Bayesian Symbolic Regression) method saves computer memory with no need to keep an updated 'genome pool'. Numerical experiments show that, compared with GP, the solutions of BSR are closer to the ground truth and the expressions are more concise. Meanwhile we find the solution of BSR is robust to hyper-parameter specifications such as the number of trees.
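The additive structure described in the abstract can be written compactly as below. This is a sketch in our own notation, not the paper's: f_k denotes the expression encoded by the k-th symbolic tree, beta_k its linear coefficient, K the number of trees, and the Gaussian noise term is an assumption made for illustration.

```latex
y = \beta_0 + \sum_{k=1}^{K} \beta_k \, f_k(x) + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)
```

Under this reading, interpretability comes from keeping each f_k small, while the linear combination supplies the flexibility.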

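Motivation and Objectives

Incorporate prior knowledge into symbolic regression to improve fitting efficiency and interpretability.
Represent expressions as symbolic trees and control their complexity through priors.
Develop MCMC-based posterior computation for tree structures and parameters.
Show that BSR yields expressions that are closer to the ground truth and more concise than those from GP.
Demonstrate that BSR is robust to hyper-parameter choices, such as the number of additive components.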

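Proposed Method

Represent each mathematical expression as a symbolic tree whose terminal nodes are features and whose non-terminal nodes are operators.
Place priors on the tree structure, terminal features, and linear-transformation parameters to control complexity.
Model the response as a linear combination of several concise expressions, with coefficients estimated by ordinary least squares (OLS).
Sample tree structures and their associated parameters with Metropolis-Hastings and reversible-jump MCMC (RJMCMC).
Apply RJMCMC to handle the change in dimension when a tree gains or loses an lt() (linear transformation) node.
Compare BSR with GP on simulated benchmarks and a financial dataset to assess accuracy and interpretability.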
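The tree representation and the OLS step listed above can be illustrated with a small sketch. It is not the authors' implementation: the nested-tuple encoding, the operator tables, and the helper names (evaluate, count_nodes, fit_coefficients) are all hypothetical, and the MCMC sampling over trees is left out entirely. The trees here are fixed by hand, so the example only shows how a set of trees turns into a fitted additive model.

```python
import numpy as np

# Hypothetical operator tables; the paper's actual operator set may differ.
UNARY = {"sin": np.sin, "exp": np.exp}
BINARY = {"+": np.add, "*": np.multiply}


def evaluate(tree, X):
    """Evaluate a nested-tuple symbolic tree on the rows of X."""
    kind = tree[0]
    if kind == "x":  # terminal node: ("x", feature_index)
        return X[:, tree[1]]
    if kind == "lt":  # linear transformation: ("lt", a, b, child) -> a * child + b
        _, a, b, child = tree
        return a * evaluate(child, X) + b
    if kind in UNARY:  # ("sin", child)
        return UNARY[kind](evaluate(tree[1], X))
    if kind in BINARY:  # ("*", left, right)
        return BINARY[kind](evaluate(tree[1], X), evaluate(tree[2], X))
    raise ValueError(f"unknown node type: {kind!r}")


def count_nodes(tree):
    """Number of nodes in a tree, a simple proxy for expression complexity."""
    kind = tree[0]
    if kind == "x":
        return 1
    if kind == "lt":
        return 1 + count_nodes(tree[3])
    return 1 + sum(count_nodes(child) for child in tree[1:])


def fit_coefficients(trees, X, y):
    """OLS fit of y on an intercept plus the output of each tree."""
    columns = [np.ones(len(y))] + [evaluate(t, X) for t in trees]
    design = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


# Synthetic, noise-free data generated from a known additive expression.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] ** 2 + 3.0 * np.sin(X[:, 1])

# Two hand-written trees: x0 * x0 and sin(x1).
trees = [("*", ("x", 0), ("x", 0)), ("sin", ("x", 1))]
beta = fit_coefficients(trees, X, y)
print("coefficients:", np.round(beta, 6))
print("nodes per tree:", [count_nodes(t) for t in trees])
```

Because the data are noise-free and the two trees match the generating expression, the fitted coefficients should recover the intercept and the two weights up to floating-point error. In a sampler, a step like fit_coefficients would presumably be rerun each time a proposed tree is evaluated.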

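Experimental Results

Research Questions

RQ1: Can Bayesian priors and an additive tree representation yield more concise, interpretable symbolic expressions without sacrificing predictive accuracy?
RQ2: How does BSR compare with genetic programming across tasks in fit quality, generalization, and expression complexity?
RQ3: Are BSR's solutions robust to the choice of the number of additive components (K)?

Key Findings

On the benchmark tasks, BSR generally recovers expressions that are closer to the ground truth, shorter, and more interpretable than those from GP.
BSR achieves competitive RMSE on both training and test data, with marked improvements on several tasks.
The additive trees yield markedly smaller expressions than GP, with fewer nodes across tasks.
Increasing the number of additive components K improves RMSE up to a point, after which the gains diminish; redundant components are effectively dropped because their coefficients are close to zero.
BSR surfaces useful signals in real-world financial data, such as expressions that relate the opening and low prices to the sign of the next-day return.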
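The finding about K can be illustrated with a toy calculation. The sketch below is not a reproduction of the paper's experiments: the data, the candidate components, and the helper name ols are assumptions chosen for illustration, and the components are fixed by hand rather than sampled by MCMC. It only shows the mechanism by which an unnecessary component receives a coefficient near zero once the true components are already present.

```python
import numpy as np

# Synthetic, noise-free target built from two components.
rng = np.random.default_rng(1)
x0 = rng.uniform(-2.0, 2.0, 300)
x1 = rng.uniform(-2.0, 2.0, 300)
y = 2.0 * x0 ** 2 + 3.0 * np.sin(x1)

# Candidate components in a fixed order; the last two are redundant.
candidates = [x0 ** 2, np.sin(x1), x0 * x1, np.exp(x1)]


def ols(components, y):
    """Fit an intercept plus the given components by least squares."""
    design = np.column_stack([np.ones(len(y))] + list(components))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rmse = float(np.sqrt(np.mean((design @ beta - y) ** 2)))
    return beta, rmse


# Fit with K = 1, 2, 3, 4 components and record coefficients and RMSE.
results = {}
for K in range(1, len(candidates) + 1):
    results[K] = ols(candidates[:K], y)
    beta, rmse = results[K]
    print(f"K={K}  rmse={rmse:.6f}  beta={np.round(beta, 6)}")
```

In this setup the RMSE drops sharply when the second true component is added and stays flat afterwards, while the coefficients on the extra components should be zero up to floating-point error. With noisy data the extra coefficients would be small rather than exactly zero.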


This review was generated by AI and reviewed by a human editor.