QUICK REVIEW

[Paper Review] Machine Learning in Astronomy: a practical overview

Dalya Baron|arXiv (Cornell University)|Apr 15, 2019

Gamma-ray bursts and supernovae | References: 18 | Cited by: 136

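One-sentence summary

A practical overview of supervised and unsupervised machine learning techniques applied to astronomical data, emphasizing data challenges, evaluation, and key points for implementing commonly used algorithms, including probabilistic extensions and deep learning considerations.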

ABSTRACT

Astronomy is experiencing a rapid growth in data size and complexity. This change fosters the development of data-driven science as a useful companion to the common model-driven data analysis paradigm, where astronomers develop automatic tools to mine datasets and extract novel information from them. In recent years, machine learning algorithms have become increasingly popular among astronomers, and are now used for a wide variety of tasks. In light of these developments, and the promise and challenges associated with them, the IAC Winter School 2018 focused on big data in Astronomy, with a particular emphasis on machine learning and deep learning techniques. This document summarizes the topics of supervised and unsupervised learning algorithms presented during the school, and provides practical information on the application of such tools to astronomical datasets. In this document I cover basic topics in supervised machine learning, including selection and preprocessing of the input dataset, evaluation methods, and three popular supervised learning algorithms, Support Vector Machines, Random Forests, and shallow Artificial Neural Networks. My main focus is on unsupervised machine learning algorithms, that are used to perform cluster analysis, dimensionality reduction, visualization, and outlier detection. Unsupervised learning algorithms are of particular importance to scientific research, since they can be used to extract new knowledge from existing datasets, and can facilitate new discoveries.

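Research motivation and goals

Given the growth of large, complex datasets, promote machine learning as a data-driven complement to traditional model-driven analysis in astronomy.
Provide practical guidance on applying supervised and unsupervised ML to astronomical datasets, including preprocessing, evaluation, and algorithm selection.
Highlight popular algorithms (SVM, Random Forest, shallow neural networks) as well as unsupervised methods for clustering, dimensionality reduction, and outlier detection.
Discuss the data challenges of upcoming surveys and how machine learning can help detect, characterize, and classify astronomical objects.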

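Proposed approach

Describes evaluation metrics and model validation schemes for supervised learning, including training/validation/test splits and cross-validation.
Discusses input data handling: feature selection, normalization, scaling, and the treatment of imbalanced datasets.
Presents and explains the core supervised algorithms: Support Vector Machines, decision trees, Random Forests, and shallow artificial neural networks.
Uses the Probabilistic Random Forest to illustrate the probabilistic treatment of feature and label uncertainties.
Outlines unsupervised learning topics (distance metrics, clustering, dimensionality reduction, outlier detection) and their scientific relevance.
Discusses practical considerations for using shallow versus deep models, and the ability of convolutional architectures to extract features.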
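
The validation and scaling points above can be sketched in a few lines. The example below is a minimal illustration, not code from the paper: it runs k-fold cross-validation on synthetic data and fits the feature scaler on each training fold only. The nearest-centroid classifier, the function names, and the made-up dataset are all assumptions chosen to keep the sketch dependency-free.

```python
import random
import statistics


def fit_scaler(rows):
    """Per-feature (mean, std) estimated from the given rows only."""
    return [(statistics.fmean(col), statistics.pstdev(col) or 1.0)
            for col in zip(*rows)]


def apply_scaler(rows, stats):
    """Standardize rows with previously fitted (mean, std) pairs."""
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in rows]


def fit_centroids(X, y):
    """Toy classifier (an assumption of this sketch): one centroid per class."""
    centroids = {}
    for label in sorted(set(y)):
        members = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [statistics.fmean(col) for col in zip(*members)]
    return centroids


def predict(centroids, X):
    """Assign each row to the class with the nearest centroid."""
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [min(centroids, key=lambda lab: dist2(row, centroids[lab]))
            for row in X]


def k_fold_accuracy(X, y, k=5, seed=0):
    """Return the validation accuracy of each of the k folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        # Fit the scaler on the training fold only, so that no validation
        # statistics leak into the model.
        stats = fit_scaler([X[i] for i in train])
        model = fit_centroids(apply_scaler([X[i] for i in train], stats),
                              [y[i] for i in train])
        pred = predict(model, apply_scaler([X[i] for i in fold], stats))
        scores.append(sum(p == y[i] for p, i in zip(pred, fold)) / len(fold))
    return scores


# Synthetic two-class data: feature 0 separates the classes, feature 1 is
# pure noise on a much larger scale (all values are made up).
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1000)] for _ in range(100)]
X += [[rng.gauss(5, 1), rng.gauss(0, 1000)] for _ in range(100)]
y = [0] * 100 + [1] * 100

scores = k_fold_accuracy(X, y, k=5)
print([round(s, 2) for s in scores], round(statistics.fmean(scores), 2))
```

The spread of the per-fold scores gives a rough sense of how stable the estimate is. Standardizing matters here because the large-scale noise feature would otherwise dominate the distance computation, which is the same sensitivity to feature scaling that the summary notes for distance-based methods such as SVMs.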

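Experimental results

Research questions

RQ1: How can supervised machine learning be trained, validated, and tested effectively on astronomical data?
RQ2: What are the practical considerations for astronomical data in preprocessing, feature selection, and handling imbalanced datasets?
RQ3: How do common ML algorithms (SVM, Random Forest, shallow neural networks) perform on typical astronomical tasks, and what are their limitations?
RQ4: What advantages do unsupervised methods offer for discovering new knowledge in large astronomical datasets?
RQ5: How can measurement and label uncertainties be incorporated into ML models in astronomy?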
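
To make the last question concrete, one general way to account for a measurement uncertainty is to let a noisy feature value contribute to both sides of a decision-tree split, weighted by the probability that the true value lies on each side. The sketch below illustrates that idea for a single split with Gaussian errors. It is an assumption-laden toy, not the Probabilistic Random Forest implementation: the function names, the threshold, and the leaf probabilities are all hypothetical.

```python
import math


def prob_above(x, sigma, threshold):
    """P(true value > threshold) for a Gaussian measurement x +/- sigma."""
    if sigma <= 0:
        # No uncertainty: fall back to the usual hard split.
        return 1.0 if x > threshold else 0.0
    z = (threshold - x) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 - math.erf(z))


def soft_stump_predict(x, sigma, threshold, left_probs, right_probs):
    """Blend the class probabilities of both leaves of a one-split tree."""
    p_right = prob_above(x, sigma, threshold)
    return {label: (1.0 - p_right) * left_probs[label]
                   + p_right * right_probs[label]
            for label in left_probs}


# Hypothetical leaf statistics for a made-up split at threshold = 20.0.
left = {"star": 0.9, "galaxy": 0.1}
right = {"star": 0.2, "galaxy": 0.8}

on_boundary = soft_stump_predict(20.0, 0.5, 20.0, left, right)
far_above = soft_stump_predict(21.0, 0.5, 20.0, left, right)
print(on_boundary)
print(far_above)
```

A measurement sitting exactly on the threshold is split evenly between the two leaves, while one that lies two standard deviations above it is sent almost entirely to the right leaf. As `sigma` shrinks to zero the function reduces to an ordinary hard split.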

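Main findings

Compared with a conventional Random Forest, the Probabilistic Random Forest improves classification accuracy by up to 10% when the features are noisy and by up to 30% when the labels are noisy.
The Probabilistic Random Forest naturally handles missing values, as well as noise characteristics that differ between the training and test sets.
Random Forests usually generalize better than a single decision tree because they aggregate over many trees, but a standard RF does not natively handle feature or label uncertainties.
SVMs are simple and robust, but they are sensitive to feature scaling and can be affected by irrelevant features, so feature selection is recommended.
In some settings, ensemble methods and deep learning methods can work directly on raw data, reducing the need for extensive feature engineering.
The document emphasizes that unsupervised learning is especially important for extracting new knowledge from large datasets and enabling discoveries.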
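
As a small illustration of the unsupervised side, the sketch below scores each object by its mean distance to its nearest neighbours, so that isolated objects stand out as outlier candidates. This is one simple distance-based approach chosen for brevity, not a method attributed to the paper. The function name, the choice of k = 3, and the toy points are assumptions.

```python
import math


def knn_outlier_scores(points, k=3):
    """Mean Euclidean distance from each point to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Made-up 2D feature vectors: a tight 5x5 grid plus one isolated object.
points = [(0.1 * a, 0.1 * b) for a in range(5) for b in range(5)]
points.append((3.0, 3.0))

scores = knn_outlier_scores(points, k=3)
most_unusual = max(range(len(points)), key=scores.__getitem__)
print(most_unusual, round(scores[most_unusual], 2))
```

The isolated object receives by far the largest score. On real data the features would normally be scaled first, since the result depends on the distance metric, and this brute-force version compares every pair of points, so it only suits small samples.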


This review was generated by AI and checked by human editors.