QUICK REVIEW

[Paper Review] Decision functions from supervised machine learning algorithms as collective variables for accelerating molecular simulations.

Mohammad M. Sultan, Vijay S. Pande|arXiv (Cornell University)|Feb 28, 2018

Protein Structure and Dynamics | 5 references | 2 citations

TL;DR

This paper proposes using decision functions from supervised machine learning algorithms (such as Support Vector Machines and Logistic Regression) as collective variables (CVs) to accelerate molecular simulations. Taking the distance to the decision hyperplane or the output probability as the CV, the method reversibly samples slow structural transitions in solvated alanine dipeptide and Chignolin.

ABSTRACT

Selection of appropriate collective variables for enhancing molecular simulations remains an unsolved problem in computational biophysics. In particular, picking initial collective variables (CVs) is particularly challenging in higher dimensions. Which atomic coordinates or transforms thereof from a list of thousands should one pick for enhanced sampling runs? How does a modeler even begin to pick starting coordinates for investigation? This remains true even in the case of simple two state systems and only increases in difficulty for multi-state systems. In this work, we attempt to solve the initial CV problem using a data-driven approach inspired by supervised machine learning literature. In particular, we show how the decision functions in supervised machine learning (SML) algorithms can be used as initial CVs for accelerated sampling. Using solvated alanine dipeptide and Chignolin mini-protein as our test cases, we illustrate how the distance to the Support Vector Machines decision hyperplane, the output probability estimates from Logistic Regression, and other classifiers may be used to reversibly sample slow structural transitions. We discuss the utility of other SML algorithms that might be useful for identifying CVs for accelerating molecular simulations.

Motivation & Objective

To address the persistent challenge of selecting initial collective variables (CVs) in high-dimensional molecular simulation spaces.
To explore whether decision functions from supervised machine learning (SML) models can serve as effective, data-driven CVs for enhanced sampling.
To evaluate the performance of SML-based CVs in accelerating the sampling of slow conformational transitions in biomolecular systems.
To identify which SML algorithms are most suitable for generating informative and reversible CVs in molecular simulations.

Proposed method

Utilizing the decision function of a trained Support Vector Machine (SVM) as a collective variable, specifically the signed distance to the SVM hyperplane.
Employing the output probability estimates from Logistic Regression as a continuous, reversible collective variable for enhanced sampling.
Applying other supervised learning classifiers to generate alternative decision functions that can act as CVs in enhanced sampling simulations.
Using the resulting SML-derived CVs in metadynamics or similar enhanced sampling methods to accelerate transitions between slow states.
Validating the reversibility and efficiency of sampling by analyzing the free energy landscapes reconstructed from simulations using SML-based CVs.
Testing the approach on two benchmark systems, solvated alanine dipeptide and the Chignolin mini-protein, both of which undergo slow structural transitions.
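
The idea behind the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the two-dimensional synthetic "features" stand in for featurized simulation frames, the classifier is a hand-written logistic regression trained by gradient descent, and every function name here is hypothetical. A linear SVM would yield the same functional form for the signed distance, only with different weights.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_two_state_data(n_per_state=200, seed=0):
    """Synthetic stand-in for featurized frames from two labeled states (assumption)."""
    rng = random.Random(seed)
    frames, labels = [], []
    for label, center in ((0, (-1.0, -0.5)), (1, (1.0, 0.5))):
        for _ in range(n_per_state):
            frames.append([rng.gauss(center[0], 0.4), rng.gauss(center[1], 0.4)])
            labels.append(label)
    return frames, labels


def fit_logistic_regression(frames, labels, lr=0.1, epochs=500):
    """Fit weights w and bias b by batch gradient descent on the logistic loss."""
    n_features = len(frames[0])
    n = len(frames)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(frames, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def signed_distance_cv(x, w, b):
    """Signed distance from frame x to the decision hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def probability_cv(x, w, b):
    """Predicted probability that frame x belongs to state 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


if __name__ == "__main__":
    frames, labels = make_two_state_data()
    w, b = fit_logistic_regression(frames, labels)
    for x in ([-1.0, -0.5], [0.0, 0.0], [1.0, 0.5]):
        print(x, signed_distance_cv(x, w, b), probability_cv(x, w, b))
```

Both functions map a frame to a single number that changes sign (or crosses 0.5) between the two labeled states, which is the property that makes a decision function usable as a one-dimensional CV. In an actual enhanced sampling run the same expression would have to be evaluated, with its gradient, inside the simulation engine.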

Experimental results

Research questions

RQ1: Can decision functions from supervised machine learning models serve as effective collective variables for accelerating molecular simulations?
RQ2: How do the performance and reversibility of SML-based CVs compare to traditional, manually selected CVs in sampling slow conformational transitions?
RQ3: Which supervised learning algorithms produce the most informative and stable decision functions when used as CVs in biomolecular simulations?
RQ4: To what extent can SML-derived CVs capture the essential reaction coordinates in two-state and multi-state systems like alanine dipeptide and Chignolin?

Key findings

The signed distance to the SVM decision hyperplane serves as an effective collective variable for solvated alanine dipeptide, enabling reversible sampling of its slow backbone conformational transition.
Probability estimates from Logistic Regression provide a smooth, continuous, and reversible collective variable that effectively accelerates conformational sampling in Chignolin.
SML-based CVs enable the reconstruction of free energy landscapes with improved convergence and reduced sampling time compared to random or heuristic CV choices.
The method is demonstrated on both test systems, covering two-state and multi-state conformational transitions.
Other SML algorithms such as Random Forests and Neural Networks show potential for generating alternative CVs, though their decision functions require further analysis for optimal use in sampling.
The approach provides a systematic, data-driven alternative to manual CV selection, particularly valuable in high-dimensional systems where intuition fails.
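
For reference, the two decision functions named in these findings have standard closed forms when the classifier is linear. The notation below is assumed for illustration and is not taken from the paper: \mathbf{x} is the feature vector of a frame, \mathbf{w} and b are the fitted weights and bias, and W and \sigma are the height and width of the Gaussians in a standard (non-tempered) metadynamics bias acting on a CV s.

```latex
s_{\mathrm{SVM}}(\mathbf{x}) = \frac{\mathbf{w}^{\top}\mathbf{x} + b}{\lVert \mathbf{w} \rVert},
\qquad
s_{\mathrm{LR}}(\mathbf{x}) = \frac{1}{1 + \exp\!\left[-\left(\mathbf{w}^{\top}\mathbf{x} + b\right)\right]}

V(s, t) = \sum_{t' < t} W \exp\!\left[-\frac{\left(s - s(t')\right)^{2}}{2\sigma^{2}}\right]
```

Both s_SVM and s_LR are differentiable in \mathbf{x}, so a bias V deposited along either one produces well-defined forces on the atoms.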

This review was created by AI and reviewed by human editors.