[Paper Review] Random Features for Kernel Approximation: A Survey on Algorithms, Theory, and Beyond
This survey provides a comprehensive overview of random features for kernel approximation, covering algorithms, theory, and connections to deep learning. It evaluates methods like RFF, ORF, and SSF across large-scale datasets, showing that structured random features achieve superior approximation quality and competitive inference speed while maintaining strong generalization performance.
Random features is one of the most popular techniques to speed up kernel methods in large-scale problems. Related works have been recognized by the NeurIPS Test-of-Time award in 2017 and the ICML Best Paper Finalist in 2019. The body of work on random features has grown rapidly, and hence it is desirable to have a comprehensive overview on this topic explaining the connections among various algorithms and theoretical results. In this survey, we systematically review the work on random features from the past ten years. First, the motivations, characteristics and contributions of representative random features based algorithms are summarized according to their sampling schemes, learning procedures, variance reduction properties and how they exploit training data. Second, we review theoretical results that center around the following key question: how many random features are needed to ensure a high approximation quality or no loss in the empirical/expected risks of the learned estimator. Third, we provide a comprehensive evaluation of popular random features based algorithms on several large-scale benchmark datasets and discuss their approximation quality and prediction performance for classification. Last, we discuss the relationship between random features and modern over-parameterized deep neural networks (DNNs), including the use of high dimensional random features in the analysis of DNNs as well as the gaps between current theoretical and empirical results. This survey may serve as a gentle introduction to this topic, and as a users' guide for practitioners interested in applying the representative algorithms and understanding theoretical results under various technical assumptions. We hope that this survey will facilitate discussion on the open problems in this topic, and more importantly, shed light on future research directions.
Motivation & Objective
- To provide a systematic review of random feature methods for kernel approximation over the past decade.
- To clarify the connections among various algorithms, their sampling schemes, variance reduction, and data exploitation strategies.
- To analyze theoretical bounds on the number of random features needed to maintain high approximation and generalization quality.
- To evaluate the empirical performance of representative algorithms on large-scale benchmark datasets for classification tasks.
- To explore the relationship between random features and over-parameterized deep neural networks, including theoretical and empirical gaps.
Proposed method
- Categorizes random feature algorithms based on sampling schemes (e.g., i.i.d., structured, quasi-Monte Carlo), learning procedures, and variance reduction techniques.
- Reviews theoretical results on the required number of random features to ensure low empirical and expected risk, focusing on generalization bounds.
- Employs a unified evaluation framework on multiple large-scale datasets (e.g., MNIST-8M, covtype, letter) using kernel ridge regression and logistic regression.
- Introduces and evaluates structured random features (e.g., ORF, SORF, SSF) that improve approximation accuracy by leveraging structured sampling patterns.
- Applies the doubly stochastic framework for data streaming to handle ultra-large datasets like MNIST-8M under memory constraints.
- Compares time and accuracy trade-offs across methods including RFF, Fastfood, QMC, GQ, and LS-RFF, using metrics like approximation error, training/test error, and total time cost.
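To make the i.i.d. versus structured sampling distinction above concrete, here is a minimal NumPy sketch of random Fourier features (RFF) and orthogonal random features (ORF) for the Gaussian kernel. It is an illustration under stated assumptions, not the survey's evaluation code: the function names, the toy data, the bandwidth `sigma`, and the use of the relative Frobenius error as the approximation metric are all choices made here.

```python
import numpy as np


def gaussian_kernel(X, Y, sigma):
    """Exact Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def rff_weights(d, s, sigma, rng):
    """i.i.d. sampling: s frequencies drawn from N(0, I / sigma^2)."""
    return rng.standard_normal((s, d)) / sigma


def orf_weights(d, s, sigma, rng):
    """Structured sampling: stack d x d blocks whose rows are mutually
    orthogonal, with row norms redrawn from a chi distribution so each
    row has the same marginal norm distribution as a Gaussian vector."""
    blocks = []
    for _ in range(-(-s // d)):  # ceil(s / d) blocks
        Q, R = np.linalg.qr(rng.standard_normal((d, d)))
        Q = Q * np.sign(np.diag(R))  # sign fix so Q is uniformly distributed
        norms = np.sqrt(rng.chisquare(d, size=d))
        blocks.append(norms[:, None] * Q)
    return np.vstack(blocks)[:s] / sigma


def feature_map(X, W):
    """Map X to 2s features so that Phi @ Phi.T approximates the kernel."""
    Z = X @ W.T
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(W.shape[0])


def relative_error(K, K_approx):
    """Relative Frobenius-norm error (assumed metric for this sketch)."""
    return np.linalg.norm(K - K_approx, "fro") / np.linalg.norm(K, "fro")


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))  # toy data, not a benchmark dataset
sigma = 4.0
K = gaussian_kernel(X, X, sigma)
for name, make_weights in [("RFF", rff_weights), ("ORF", orf_weights)]:
    Phi = feature_map(X, make_weights(16, 256, sigma, rng))
    print(name, relative_error(K, Phi @ Phi.T))
```

Both schemes share the same feature map and differ only in how the frequency matrix is drawn, which is why the comparison isolates the effect of the sampling scheme.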
Experimental results
Research questions
- RQ1: How do different random feature sampling schemes (e.g., i.i.d., structured, quasi-Monte Carlo) compare in terms of approximation quality and computational efficiency?
- RQ2: What theoretical bounds exist on the number of random features required to achieve low generalization error in kernel approximation?
- RQ3: How do random feature methods perform empirically on large-scale classification tasks across diverse kernel types (Gaussian, arc-cosine, polynomial) and datasets?
- RQ4: What is the relationship between random features and over-parameterized deep neural networks, and how can random feature theory inform DNN analysis?
- RQ5: What are the key gaps between theoretical predictions and empirical results in random feature and deep learning settings?
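The question of how many random features are needed can be probed empirically before consulting any bound. The sketch below measures how the kernel approximation error of plain RFF shrinks as the number of features `s` grows. For i.i.d. sampling the error is a Monte Carlo average, so it is expected to decay roughly like 1/sqrt(s). The data, bandwidth, and feature counts are arbitrary assumptions for illustration, and this measures approximation error only, not the generalization error that the bounds in the survey address.

```python
import numpy as np


def rff_relative_error(X, sigma, s, rng):
    """Relative Frobenius error of an s-feature RFF approximation
    to the Gaussian kernel matrix of X."""
    sq = np.sum(X**2, axis=1)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-sq_dists / (2.0 * sigma**2))
    W = rng.standard_normal((s, X.shape[1])) / sigma  # i.i.d. frequencies
    Z = X @ W.T
    Phi = np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(s)
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return np.linalg.norm(K - Phi @ Phi.T) / np.linalg.norm(K)


rng = np.random.default_rng(0)
X = rng.standard_normal((300, 10))  # toy data, not a benchmark dataset
for s in [16, 64, 256, 1024]:
    errors = [rff_relative_error(X, 3.0, s, rng) for _ in range(5)]
    print(f"s={s:5d}  mean relative error={np.mean(errors):.4f}")
```

Each fourfold increase in `s` should roughly halve the error if the 1/sqrt(s) rate holds on this toy problem.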
Key findings
- On the MNIST-8M dataset, ORF and SORF achieve the lowest approximation error (0.0041) for the Gaussian kernel, outperforming RFF (0.0126) and Fastfood (0.0159).
- For the zero-order arc-cosine kernel, ORF and SORF achieve the best approximation error (0.0224 and 0.0231), while RM performs poorly (0.0448) due to suboptimal sketching for polynomial-like kernels.
- SSF achieves the best approximation error (0.0078) for the Gaussian kernel, though ORF and SORF are competitive with slightly higher time costs.
- On the arc-cosine kernels, ORF and SORF show consistent performance across datasets, with test errors around 2.7% for arccos0 and 1.5% for arccos1, outperforming RM and Fastfood.
- Time cost varies significantly: LS-RFF is the slowest method for the Gaussian kernel (15,725 sec.), while SORF is the fastest for arccos1 (8,861.6 sec.), indicating trade-offs between accuracy and speed.
- Despite high approximation error in some cases (e.g., RM at 0.0448 for arccos0), RM is computationally efficient due to its Maclaurin expansion-based sketching, making it suitable for low-latency applications.
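For readers unfamiliar with the arccos0 kernel in the findings above, here is a minimal sketch of its random feature map. It assumes the standard zero-order arc-cosine definition k(x, y) = 1 - theta / pi, where theta is the angle between x and y, approximated with Gaussian directions passed through a step function. The function names and toy data are hypothetical, and this is not the code used to produce the numbers reported above.

```python
import numpy as np


def arccos0_kernel(X, Y):
    """Exact zero-order arc-cosine kernel: 1 - angle(x, y) / pi."""
    norms_x = np.linalg.norm(X, axis=1)
    norms_y = np.linalg.norm(Y, axis=1)
    cos_theta = (X @ Y.T) / (norms_x[:, None] * norms_y[None, :])
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    return 1.0 - theta / np.pi


def arccos0_features(X, W):
    """Step-function features: sqrt(2 / s) * 1[w_i . x > 0].
    Two vectors land on the same side of a random hyperplane with
    probability 1 - theta / pi, split evenly between the positive and
    negative sides, hence the factor 2 on the positive-side indicator."""
    s = W.shape[0]
    return np.sqrt(2.0 / s) * (X @ W.T > 0).astype(float)


rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))   # toy data, not a benchmark dataset
W = rng.standard_normal((2000, 10))  # i.i.d. Gaussian directions
K = arccos0_kernel(X, X)
Phi = arccos0_features(X, W)
print(np.linalg.norm(K - Phi @ Phi.T) / np.linalg.norm(K))
```

Swapping `W` for an orthogonal construction would give an ORF-style variant of the same map, since only the sampling of directions changes.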
This review was created by AI and reviewed by human editors.