QUICK REVIEW

[Paper Review] Next-Step Conditioned Deep Convolutional Neural Networks Improve Protein Secondary Structure Prediction

Akosua Busia, Navdeep Jaitly | arXiv (Cornell University) | Feb 13, 2017

Machine Learning in Bioinformatics | 21 references | 18 citations

TL;DR

This paper introduces a next-step conditioned deep convolutional neural network for protein secondary structure prediction. Each prediction is conditioned on both local sequence features and previously predicted structure labels, with scheduled sampling used during training. A single model reaches 70.3% Q8 accuracy on the CB513 benchmark and an ensemble reaches 71.4%, which the authors report as improving on the previous overall state of the art for eight-class secondary structure prediction.

ABSTRACT

Recently developed deep learning techniques have significantly improved the accuracy of various speech and image recognition systems. In this paper we show how to adapt some of these techniques to create a novel chained convolutional architecture with next-step conditioning for improving performance on protein sequence prediction problems. We explore its value by demonstrating its ability to improve performance on eight-class secondary structure prediction. We first establish a state-of-the-art baseline by adapting recent advances in convolutional neural networks which were developed for vision tasks. This model achieves 70.0% per amino acid accuracy on the CB513 benchmark dataset without use of standard performance-boosting techniques such as ensembling or multitask learning. We then improve upon this state-of-the-art result using a novel chained prediction approach which frames the secondary structure prediction as a next-step prediction problem. This sequential model achieves 70.3% Q8 accuracy on CB513 with a single model; an ensemble of these models produces 71.4% Q8 accuracy on the same test set, improving upon the previous overall state of the art for the eight-class secondary structure problem. Our models are implemented using TensorFlow, an open-source machine learning software library available at TensorFlow.org; we aim to release the code for these experiments as part of the TensorFlow repository.

Motivation & Objective

To improve protein secondary structure prediction accuracy using deep learning techniques adapted from computer vision.
To address the limitations of standard convolutional networks in capturing sequential dependencies in protein structures.
To explore next-step conditioning, in which each prediction depends on previously predicted labels, as a way to enhance sequential modeling in secondary structure prediction.
To mitigate overfitting in next-step conditioned models through scheduled sampling during training.
To establish a new state of the art for eight-class secondary structure prediction using a single model and ensembled models.

Proposed method

A multi-scale, residual convolutional neural network is designed using techniques like batch normalization, dropout, and weight normalization to improve feature learning from amino acid sequences.
The model uses 1D convolutions with 3-filter kernels to extract local patterns from sequence embeddings, including one-hot and PSSM-encoded residues.
Next-step conditioning is introduced by feeding past predicted secondary structure labels as input to subsequent predictions, enabling autoregressive modeling.
Scheduled sampling is applied during training to reduce overfitting: ground-truth labels fed as conditioning input are randomly replaced with the model's own predicted labels.
The architecture is trained end-to-end using cross-entropy loss with label smoothing and early stopping to prevent overfitting.
Ensemble models are created by training multiple instances of the next-step conditioned network and averaging predictions to improve robustness and accuracy.
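
The conditioning and scheduled-sampling steps listed above can be illustrated with a minimal sketch. It is not the authors' TensorFlow implementation: the function names, the fixed `sample_prob`, the window size, and the `-` padding symbol are all assumptions made for illustration, and the network itself is left out. The sketch only shows how the "previous labels" input could be built and how ground-truth labels could be swapped for model predictions.

```python
import random


def scheduled_sampling_labels(true_labels, predicted_labels, sample_prob, rng):
    """Choose, per position, which label is fed back as conditioning input.

    With probability `sample_prob` the model's own prediction is used;
    otherwise the ground-truth label is kept. (Hypothetical helper.)
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have equal length")
    return [
        pred if rng.random() < sample_prob else true
        for true, pred in zip(true_labels, predicted_labels)
    ]


def previous_label_windows(labels, window, pad="-"):
    """For each position t, return the `window` labels that precede t.

    Positions before the start of the sequence are filled with `pad`.
    """
    padded = [pad] * window + list(labels)
    return [padded[t:t + window] for t in range(len(labels))]


if __name__ == "__main__":
    rng = random.Random(0)          # seeded only to make the run repeatable
    true = list("HHHEEC")           # toy ground-truth labels
    pred = list("HHEEEC")           # toy model predictions
    mixed = scheduled_sampling_labels(true, pred, sample_prob=0.5, rng=rng)
    # Each row would be concatenated with the sequence features at position t.
    for t, context in enumerate(previous_label_windows(mixed, window=2)):
        print(t, context)
```

With `sample_prob=0.0` this reduces to training on ground-truth labels only (the setting the review says overfits), and with `sample_prob=1.0` the model sees only its own predictions, which matches the situation at inference time.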

Experimental results

Research questions

RQ1: Can next-step conditioning improve protein secondary structure prediction beyond standard convolutional networks?
RQ2: How does scheduled sampling affect the generalization of next-step conditioned models in secondary structure prediction?
RQ3: To what extent does conditioning on predicted labels reduce overfitting compared to using ground-truth labels during training?
RQ4: Can a single deep convolutional model with residual connections and multi-scale filters outperform previous state-of-the-art models without ensembling?
RQ5: Does the integration of language modeling techniques into protein sequence modeling lead to measurable gains in secondary structure prediction accuracy?

Key findings

The baseline model using advanced convolutional techniques achieves 70.0% Q8 accuracy on CB513 without ensembling or multitask learning, setting a new single-model state of the art.
The next-step conditioned model achieves 70.3% Q8 accuracy on CB513 with a single model, a 0.3 percentage-point improvement over the baseline.
An ensemble of next-step conditioned models reaches 71.4% Q8 accuracy on CB513, reported as a 1.7 percentage-point improvement over the previous overall state of the art.
Without scheduled sampling, the next-step conditioned model overfits severely, dropping from 82% validation accuracy to 67.1% on test set inference, highlighting the necessity of scheduled sampling.
The model shows a slight recall deficit for rare or short secondary structure classes, suggesting persistent overfitting to label repetition.
The proposed architecture is generalizable and could be applied to other protein sequence prediction tasks such as solvent accessibility or backbone angle prediction.
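
For readers unfamiliar with the metric, Q8 accuracy is the fraction of residues whose predicted eight-class label matches the reference label, and the ensemble result comes from averaging the per-residue class probabilities of several models before taking the most likely class. The sketch below shows both steps under stated assumptions: the helper names are hypothetical, the toy example uses three classes instead of eight to stay short, and the numbers are made up rather than taken from the paper.

```python
def average_ensemble(prob_sets):
    """Average class probabilities across models.

    prob_sets[m][t][c] is model m's probability of class c at residue t.
    """
    n_models = len(prob_sets)
    n_positions = len(prob_sets[0])
    return [
        [
            sum(model[t][c] for model in prob_sets) / n_models
            for c in range(len(prob_sets[0][t]))
        ]
        for t in range(n_positions)
    ]


def argmax_labels(probs, classes):
    """Map each row of probabilities to the label of its largest entry."""
    return [classes[max(range(len(row)), key=row.__getitem__)] for row in probs]


def per_residue_accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the reference."""
    if len(predicted) != len(actual):
        raise ValueError("label sequences must have equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    classes = "HEC"                                  # toy 3-class alphabet
    model_a = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]     # made-up probabilities
    model_b = [[0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
    averaged = average_ensemble([model_a, model_b])
    labels = argmax_labels(averaged, classes)
    print(labels, per_residue_accuracy(labels, ["E", "H"]))
```

In the toy data the two models disagree at both positions, and the averaged probabilities decide which label wins. With the full eight-class alphabet the same accuracy function gives the Q8 score.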

This review was created by AI and reviewed by human editors.