QUICK REVIEW

[Paper Review] Deep Learning for Financial Time Series: A Large-Scale Benchmark of Risk-Adjusted Performance

Adir Saly-Kaufmann, Kieran Wood | arXiv (Cornell University) | Mar 2, 2026

Stock Market Forecasting Methods | Citations: 0

One-Sentence Summary

A large-scale benchmark evaluates diverse deep learning architectures for financial time-series forecasting optimized for Sharpe ratio, highlighting that architectures with explicit temporal representations and adaptive memory (e.g., VLSTM, LPatchTST, TFT) outperform linear baselines and generic DL models under risk-adjusted criteria.

ABSTRACT

We present a large scale benchmark of modern deep learning architectures for a financial time series prediction and position sizing task, with a primary focus on Sharpe ratio optimization. Evaluating linear models, recurrent networks, transformer based architectures, state space models, and recent sequence representation approaches, we assess out of sample performance on a daily futures dataset spanning commodities, equity indices, bonds, and FX spanning 2010 to 2025. Our evaluation goes beyond average returns and includes statistical significance, downside and tail risk measures, breakeven transaction cost analysis, robustness to random seed selection, and computational efficiency. We find that models explicitly designed to learn rich temporal representations consistently outperform linear benchmarks and generic deep learning models, which often lead the ranking in standard time series benchmarks. Hybrid models such as VSN with LSTM, a combination of Variable Selection Networks (VSN) and LSTMs, achieves the highest overall Sharpe ratio, while VSN with xLSTM and LSTM with PatchTST exhibit superior downside adjusted characteristics. xLSTM demonstrates the largest breakeven transaction cost buffer, indicating improved robustness to trading frictions.

Research Motivation and Objectives

Assess out-of-sample risk-adjusted performance of modern deep learning architectures on a cross-asset futures dataset (2010–2025).
Investigate how architectural inductive biases (temporal representations, memory, feature selection) affect Sharpe-ratio optimization versus linear baselines.
Evaluate stability, downside risk, and robustness to seeds and market regimes.
Provide a transparent, practical benchmark to guide future research in deep learning for finance.

Proposed Method

The framework maps historical multivariate inputs to a trading signal via a two-stage pipeline: a sequence model g_phi processes a lookback window to produce a latent state h_t, and a linear projection with tanh activation then yields the signal y_hat_t.
Portfolio weights are obtained by scaling each signal with a volatility-targeting factor, computed from a per-asset EWMA volatility estimate, so that positions match the target risk (sigma_tgt = 10%).
End-to-end training optimizes the negative annualized Sharpe Ratio by evaluating portfolio returns R_port over training sequences and applying the Sharpe loss with a stability term epsilon.
Net returns incorporate transaction costs; primary evaluation uses gross returns (c_k = 0) to assess predictive efficacy, with a subsequent breakeven transaction-cost analysis per asset.
Model families include linear baselines (AR1x, AR_n x, DLinear, NLinear), transformer-based architectures (iTransformer, PatchTST), state-space models (Mamba, Mamba2), recurrent models (LSTM, xLSTM, PsLSTM, PatchPsLSTM), and hybrids (VLSTM, VSN+Mamba2, LPatchTST, TFT).
Note: The pipeline also includes ticker embeddings and a cross-asset normalization to ensure asset-specific learning.
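
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the EWMA span of 60, the 252-day annualization, the equal weighting across assets, the proportional cost on position changes, and the placement of epsilon in the denominator are all hypothetical choices made for the example. A real training setup would compute the same loss in an autodiff framework so that gradients flow back into g_phi.

```python
import math

def signal_from_latent(h, w, b):
    """Project latent state h linearly, then squash with tanh to a signal in [-1, 1]."""
    return math.tanh(sum(wi * hi for wi, hi in zip(w, h)) + b)

def ewma_daily_vol(returns, span=60):
    """EWMA estimate of one asset's daily volatility (span=60 is an assumed value)."""
    alpha = 2.0 / (span + 1.0)
    var = returns[0] ** 2  # seed the recursion with the first squared return
    vols = []
    for r in returns:
        var = alpha * r * r + (1.0 - alpha) * var
        vols.append(math.sqrt(var))
    return vols

def vol_scaled_position(signal, daily_vol, sigma_tgt=0.10, periods=252):
    """Scale a signal so that a full-size position targets sigma_tgt annualized volatility."""
    return signal * sigma_tgt / (daily_vol * math.sqrt(periods))

def portfolio_returns(positions, next_returns, costs=None):
    """Equal-weighted portfolio returns (assumed weighting).

    positions[t][k] is the position in asset k chosen at time t, and
    next_returns[t][k] is the return it earns over the following period.
    costs[k] is a proportional cost c_k charged on position changes;
    leaving costs as None gives gross returns (c_k = 0).
    """
    n_assets = len(positions[0])
    costs = costs or [0.0] * n_assets
    prev = [0.0] * n_assets
    out = []
    for pos_t, ret_t in zip(positions, next_returns):
        pnl = sum(p * r - c * abs(p - q)
                  for p, r, c, q in zip(pos_t, ret_t, costs, prev))
        out.append(pnl / n_assets)
        prev = pos_t
    return out

def negative_sharpe_loss(port_returns, periods=252, eps=1e-8):
    """Negative annualized Sharpe ratio; eps keeps the division stable."""
    n = len(port_returns)
    mean = sum(port_returns) / n
    var = sum((r - mean) ** 2 for r in port_returns) / n
    return -math.sqrt(periods) * mean / (math.sqrt(var) + eps)
```

Calling `portfolio_returns` without `costs` corresponds to the gross-return evaluation described above, and passing a per-asset `costs` list gives net returns.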

Experimental Results

Research Questions

RQ1: Which deep learning architectures yield the best risk-adjusted (Sharpe) performance in a long-horizon, cross-asset financial time-series setting?
RQ2: How do architectural inductive biases (temporal representations, memory mechanisms, feature selection) influence stability across regimes and robustness to trading frictions?
RQ3: Do hybrid and structured models (e.g., VLSTM, LPatchTST, TFT) outperform linear baselines and generic DL models in terms of downside risk and tail behavior?
RQ4: What is the trade-off between predictive performance and trading intensity (turnover) under volatility targeting?
RQ5: How do results generalize across market regimes from 2010–2025 and under different subperiods?

Key Findings

Model	CAGR	Ann. Ret.	SR	t (HAC)	Hit	Turnover	xGMV	Info. Ratio	t (HAC) v Passive	Corr. v Passive
Passive	0.0435	0.0476	0.48	1.65	0.531	-	-	-	-	-
AR1x	0.0813	0.0831	0.83	3.12	0.539	353.64	90.421	-0.0086	-0.0305	0.3533
AR nx x	0.0646	0.0677	0.68	2.52	0.538	280.66	69.525	-0.0829	-0.3011	0.4325
DLinear	0.0750	0.0773	0.77	2.87	0.539	278.41	75.282	0.0141	0.0501	0.2612
LSTM	0.1351	0.1318	1.32	4.56	0.554	948.08	225.769	-0.0637	-0.2303	0.2816
VLSTM	0.2632	0.2388	2.39	8.81	0.588	966.86	218.369	0.8539	3.3071	0.4042
Mamba2	0.0587	0.0620	0.62	2.31	0.546	233.00	58.164	-0.0901	-0.3246	0.2220
VSN+Mamba2	0.0967	0.0973	0.97	3.65	0.555	329.11	78.842	0.1091	0.3936	0.2821
PatchTST	0.0847	0.0864	0.86	3.29	0.541	623.88	198.021	-0.2149	-0.7848	0.5530
LPatchTST	0.2550	0.2323	2.32	8.81	0.577	959.89	211.514	0.7070	2.7470	0.3471
PsLSTM	0.1868	0.1763	1.76	6.83	0.563	823.07	185.496	0.3981	1.5410	0.4862
TFT	0.2398	0.2201	2.20	8.13	0.584	912.81	223.231	0.6665	2.5487	0.3888
VxLSTM	0.1937	0.1821	1.82	6.89	0.574	775.88	159.438	0.4666	1.6727	0.5069
xLSTM	0.1937	0.1796	1.80	6.85	0.568	482.62	91.924	0.7984	2.9042	0.6274
iTransformer	0.0308	0.0353	0.35	1.26	0.529	36.32	9.203	-0.1539	-0.5563	0.4855
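
A consistency check on the table, offered as an observation rather than something the review states: in every row, the SR column equals the Ann. Ret. column divided by 0.10, to two decimals. That is what one would expect if each strategy's realized annualized volatility sits at the 10% target. Taking the VLSTM row as the example, with \sigma_{\text{ann}} = 0.10 as the assumed volatility:

```latex
\mathrm{SR} = \frac{\text{Ann. Ret.}}{\sigma_{\text{ann}}}
\quad\Rightarrow\quad
\mathrm{SR}_{\text{VLSTM}} \approx \frac{0.2388}{0.10} \approx 2.39
```

The CAGR column is a separate figure (0.2632 for VLSTM), presumably because it compounds returns geometrically while Ann. Ret. is an arithmetic annualization. The review does not define either column, so this reading is an assumption.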

Nonlinear sequence models substantially outperform linear benchmarks on most horizons, with stronger and more stable risk-adjusted returns.
VLSTM delivers the strongest aggregate performance (Sharpe 2.39, CAGR 26.3%, annualized return 23.9%) with robust passive-relative diagnostics (Info. Ratio 0.854; HAC t-stat 3.31).
LPatchTST and TFT also show strong performance, illustrating the benefit of combining robust temporal state encoding with patch-based or attention mechanisms.
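xLSTM-based architectures provide a favorable balance of performance and trading efficiency: xLSTM reaches a Sharpe ratio of 1.80 with turnover of 482.62, roughly half that of VLSTM, LPatchTST, and TFT.
iTransformer has by far the lowest turnover (36.32) but also the weakest economic performance (Sharpe 0.35, below the passive benchmark's 0.48), suggesting that cutting turnover at the expense of signal responsiveness can erode value.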
Across methods, hybrid models that combine feature selection with adaptive memory (e.g., VLSTM, LPatchTST) achieve higher Sharpe ratios and better risk controls.
State-space models (Mamba, Mamba2) exhibit heterogeneous results and generally lower aggregate performance despite strong regime-specific episodes.
Under the volatility-targeting framework, VLSTM achieves the highest CAGR and Sharpe ratio, and its outperformance of the passive benchmark is statistically significant.
Downside risk metrics favor VLSTM and LPatchTST, which exhibit milder drawdowns and favorable Calmar ratios compared to many baselines.
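
For readers who want to reproduce the kind of downside diagnostics mentioned above from a series of per-period portfolio returns, the following is a minimal sketch. It assumes simple (not log) returns, 252 periods per year, and the common definition of the Calmar ratio as CAGR divided by maximum drawdown. The review does not state which definitions the paper uses, so the function names and conventions here are illustrative assumptions.

```python
def equity_curve(returns):
    """Compound simple per-period returns into a wealth curve starting from 1.0."""
    wealth, curve = 1.0, []
    for r in returns:
        wealth *= 1.0 + r
        curve.append(wealth)
    return curve

def cagr(returns, periods=252):
    """Compound annual growth rate implied by the final wealth."""
    final = equity_curve(returns)[-1]
    return final ** (periods / len(returns)) - 1.0

def max_drawdown(returns):
    """Largest peak-to-trough loss, as a positive fraction of the running peak."""
    peak, worst = 1.0, 0.0
    for w in equity_curve(returns):
        peak = max(peak, w)
        worst = max(worst, 1.0 - w / peak)
    return worst

def calmar_ratio(returns, periods=252):
    """CAGR divided by maximum drawdown (infinite if there is no drawdown)."""
    mdd = max_drawdown(returns)
    return cagr(returns, periods) / mdd if mdd > 0 else float("inf")
```

A higher Calmar ratio means more compounded growth per unit of worst-case drawdown, which is the sense in which a model can rank well on downside risk without having the top Sharpe ratio.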


This review was generated by AI and checked by a human editor.