Forecasting the equity premium: Do deep neural network models work?
1. Introduction
Equity premium forecasting is one of the core issues in financial research. It is closely related to many important financial topics, such as portfolio management, the cost of capital and market efficiency (Rapach & Zhou, 2013; Rapach et al., 2010). However, out-of-sample predictability remains controversial. For example, Welch and Goyal (2008) find that 14 popular predictive variables do not outperform the simple historical average (HA) of returns. In contrast, Campbell and Thompson (2008) show that the equity premium is predictable out-of-sample once parameter constraints based on financial theory are imposed. Neely et al. (2014) also show that combining information from both macroeconomic variables and technical indicators using principal components analysis (PCA) performs significantly better than the historical average forecast.
Among methods for stock return prediction, traditional linear regression methods have been widely adopted, e.g., OLS (Ordinary Least Squares), LASSO (Least Absolute Shrinkage and Selection Operator; see Tibshirani, 2011) and Ridge regression (Tikhonov, 1998). However, the literature applying nonlinear methods, especially deep learning, to extract information from stock return time series is still limited (Bekiros et al., 2016; Gupta et al., 2018). The ability to extract and transform features from data, and to identify hidden nonlinear relations without relying on econometric assumptions and human expertise, makes deep learning much more attractive than other machine learning methods. In addition, the number of conditioning variables believed to have forecasting power for returns is large and has continued to grow over the last five decades. Traditional methods are reaching their limits in handling so many conditioning variables, so more advanced statistical tools such as deep learning can be a solution (Gu et al., 2018). As one of the most popular deep learning methods, the Deep Neural Network (DNN) does not require manual indicator selection and allows us to use many more variables as inputs. In this paper, we apply the DNN method to forecast the U.S. equity premium directly and compare the results with those of the OLS regression method.
Specifically, following Neely et al. (2014), we compare the forecasting performance (measured by MSFE_{OS}, R^{2}_{OS}, and the MSFE-adjusted statistic) of Ordinary Least Squares models using 28 input variables (OLS+28) with Deep Neural Network models using the same 28 input variables (DNN+28) and Deep Neural Network models using the same 28 variables plus 14 additional variables (DNN+42). Next, following Kandel and Stambaugh (1996) and Welch and Goyal (2008), we use the out-of-sample forecasts to compute the Certainty Equivalent Return (CER) gain and Sharpe ratio for mean-variance investors who optimally allocate their wealth between equities and risk-free bills. Our results show that the OLS+28 model performs surprisingly poorly over the out-of-sample period 2011:01-2016:12, which Neely et al. (2014) did not test due to data availability. In contrast, both DNN models perform well. The R^{2}_{OS} values of the DNN models are near 3%, and the DNN models generate large and robust economic gains for investors, with an annualized CER gain of around 3%. The monthly Sharpe ratios of the DNN models also exceed those of the HA and OLS+28 models in this period.
Our study contributes to the existing literature in three ways. First, to the best of our knowledge, we are the first to apply deep learning, one of the most active areas of information technology, to equity premium forecasting in an academic finance paper. Unlike most studies, which focus on traditional econometric models, we introduce a nonlinear machine learning model to forecast the equity premium. Our results show that DNN models can outperform HA models and OLS models. In particular, we find that OLS models have poor predictive ability during the period 2011:01-2016:12, which lies beyond the period studied by Neely et al. (2014), whereas the DNN models still work well in this period. Second, we test whether DNN models can incorporate more predictive information from 14 additional variables selected from the existing finance literature. The results show that the forecasting performance of the DNN can be improved by adding more input variables. This, in turn, supports the existing finance literature. Last but not least, our asset allocation results indicate that DNN models can be applied to practical investment management and produce substantial economic value.
The rest of the paper is organized as follows. Section 2 presents the methodology and data. Section 3 discusses the empirical results. Section 4 concludes the paper.
2. Methodology and Data
2.1. HA model
Welch and Goyal (2008) argue that a simple historical average (HA) forecasts the equity premium better than regressions of the equity premium on predictors, including 14 popular macroeconomic variables. Our first benchmark is therefore the HA model, which can be expressed as follows:
R̂_{t+1} = (1/t) Σ_{j=1}^{t} R_{j}
where R_{t} is the equity premium at month t and R̂_{t+1} is the forecast for month t+1.
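The HA benchmark can be illustrated with a short Python sketch. It assumes the usual expanding-window form, in which the forecast for a month is the mean of all earlier observations; the function name, the `initial_window` argument and the toy returns are hypothetical and are not taken from the paper's code or data.

```python
import numpy as np

def historical_average_forecasts(returns, initial_window):
    """Expanding-window historical average (HA) forecasts.

    The forecast for position t (t >= initial_window >= 1) is the mean of
    returns[0], ..., returns[t-1], i.e. it uses only data available before t.
    """
    r = np.asarray(returns, dtype=float)
    cumulative = np.cumsum(r)
    t = np.arange(initial_window, len(r))
    return cumulative[t - 1] / t

# Toy monthly equity premia (illustrative numbers only)
toy_returns = [0.01, -0.02, 0.03, 0.00, 0.02]
ha = historical_average_forecasts(toy_returns, initial_window=2)
```

Each element of `ha` lines up with the realized return at the same out-of-sample position, which is the alignment needed later when forecast errors are compared across models.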
2.2. OLS model
Based on a PCA and OLS predictive regression framework, Neely et al. (2014) find that, compared with the HA model, combining information from 14 macroeconomic variables and 14 technical variables significantly improves equity premium forecasts. We replicate their study and define the OLS models as follows:
R_{t+1} = α_{i} + β_{i} x_{i,t} + ε_{i,t+1}
where R_{t+1} is the equity premium at month t+1, x_{i,t} is predictor i at month t, and ε_{i,t+1} is the error term. Using data through month t, we obtain the OLS estimates α̂_{i,t}, β̂_{i,t} of α_{i}, β_{i}. The out-of-sample forecast R̂_{t+1} is then
R̂_{t+1} = α̂_{i,t} + β̂_{i,t} x_{i,t}
In particular, we denote the OLS regression on principal components extracted from the 28 variables studied by Neely et al. (2014) as the “OLS+28” model.
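As a rough illustration of the recursive principal-components regression described above, the following Python sketch extracts components from standardized predictors and regresses next month's premium on them, using only data available at the forecast date. The function names and the fixed `n_components` argument are assumptions made for the example; how the number of components is chosen is not specified here and is left to the caller.

```python
import numpy as np

def pc_ols_forecast(X, r, n_components):
    """One-step-ahead forecast from OLS on principal components.

    X: (T, k) predictors observed in months 0..T-1
    r: (T,) equity premium in the same months
    Returns the forecast for month T using only data through month T-1.
    """
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    # Standardize with in-sample moments only, to avoid look-ahead bias
    mu, sd = X.mean(axis=0), X.std(axis=0)
    sd[sd == 0] = 1.0
    Z = (X - mu) / sd
    # Principal components from the SVD of the standardized predictors
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    F = Z @ vt[:n_components].T
    # Predictive regression: r_{t+1} on a constant and F_t
    A = np.column_stack([np.ones(len(F) - 1), F[:-1]])
    coef, *_ = np.linalg.lstsq(A, r[1:], rcond=None)
    # Apply the fitted coefficients to the latest components
    return float(np.concatenate(([1.0], F[-1])) @ coef)

def recursive_forecasts(X, r, initial_window, n_components):
    """Expanding-window forecasts for months initial_window..len(r)-1."""
    return np.array([pc_ols_forecast(X[:t], r[:t], n_components)
                     for t in range(initial_window, len(r))])
```

Re-estimating both the components and the regression at every date keeps each forecast free of future information, at the cost of repeating the decomposition for every month.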
2.3. DNN model
Our DNN models have the following general equations:
where N^{(l)} denotes the number of neurons in layer l ∈ {1,...,L} and L is the number of layers. We define the output of neuron n in layer l as x_n^{(l)} and the vector of outputs for this layer (augmented to include a constant, x_0^{(l)}) as x^{(l)} = (1, x_1^{(l)}, ..., x_{N^{(l)}}^{(l)})'. The number of units in the input layer equals the number of input variables, so x^{(0)} = (1, x_1, ..., x_m)', where x_{m} is the m-th input variable. θ_n^{(l-1)} denotes the weight and bias parameters feeding neuron n in layer l, and R̂_{t+1} is the forecast of the log equity premium at month t+1. The rectified linear unit (ReLU) is one of the most popular activation functions (Nair and Hinton, 2010) and we use it at all nodes. Batch normalization (BN) is a simple regularization technique for controlling the variability of variables across different regions of the network and across different datasets (Ioffe and Szegedy, 2015). The first equation relates the input variables in the input layer to the output vector of the first hidden layer. The second gives the recursive output formula for each neuron in layer l. The third gives the final forecast. To allow comparison with the HA and OLS+28 models, we first feed the DNN+28 model the same 28 variables used by the OLS+28 model. Then, to examine whether DNN models can extract information from additional predictors to improve forecast performance, we add 14 variables selected from the existing finance literature and obtain the DNN+42 model.
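The layer-by-layer recursion described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: each layer prepends the constant 1 to its input, applies its weight matrix and then ReLU, and the output layer is linear. Batch normalization and dropout are deliberately omitted, and the function names and toy weights are hypothetical.

```python
import numpy as np

def relu(z):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(z, 0.0)

def dnn_forward(x, thetas):
    """Feed-forward pass of a fully connected network.

    x: (m,) input vector without the constant term.
    thetas: list of weight matrices. thetas[l] has shape
            (1 + inputs to layer l+1, neurons in layer l+1); its first row
            holds the biases. The last matrix has a single column.
    """
    h = np.asarray(x, dtype=float)
    for theta in thetas[:-1]:
        # Augment with the constant, apply weights, then ReLU
        h = relu(np.concatenate(([1.0], h)) @ theta)
    # Linear output layer gives the scalar forecast
    return float((np.concatenate(([1.0], h)) @ thetas[-1])[0])

# Tiny example: 2 inputs -> 2 hidden neurons -> 1 output
theta0 = np.array([[0.0, 0.0],   # biases
                   [1.0, 0.0],   # weights on x_1
                   [0.0, 1.0]])  # weights on x_2
theta1 = np.array([[0.5], [2.0], [3.0]])
forecast = dnn_forward([1.0, -2.0], [theta0, theta1])
```

In the toy network the second hidden neuron receives a negative pre-activation and is switched off by ReLU, so only the first neuron contributes to the forecast.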
At present, there is no uniform approach to determining the best hyperparameters, such as the number of layers and neurons, for a DNN on a given problem. Since Gu et al. (2018) suggest that shallow learning outperforms relatively deeper learning, we start our search with three or four hidden layers. Because the training objective is nonlinear and nonconvex, we use the adaptive moment estimation method (Adam; Kingma and Ba, 2014) to train our DNN models and a grid search to select the best one. The selected DNN+28 model has 200, 200, 200, and 128 neurons in four hidden layers, with an Adam weight decay of 0, a dropout probability of 0.5 and 20 training epochs. The selected DNN+42 model has 600, 300, and 300 neurons in three hidden layers, with a weight decay of 0, a dropout probability of 0.5 and 10 epochs. As a robustness check, we discuss the effect of these key parameters on forecasting performance in Section 3.4.
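The grid search mentioned above amounts to scoring every combination of candidate hyperparameters and keeping the best one. The Python sketch below shows that loop in generic form. The candidate grid and the `toy_loss` function are purely hypothetical stand-ins: in practice the loss would come from training a network with each configuration and measuring its error on held-out data, and the grid actually searched is not reported here.

```python
from itertools import product

def grid_search(grid, loss_fn):
    """Return the configuration in `grid` with the lowest loss."""
    names = list(grid)
    best_cfg, best_loss = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        cfg = dict(zip(names, values))
        loss = loss_fn(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss

# Hypothetical candidate values, for illustration only
example_grid = {
    "weight_decay": [0.0, 1e-4, 1e-3],
    "dropout": [0.3, 0.5, 0.7],
    "epochs": [10, 20, 30],
}

def toy_loss(cfg):
    # Artificial loss standing in for "train the DNN and evaluate it";
    # it is constructed so that the minimum is easy to verify by hand.
    return ((cfg["dropout"] - 0.5) ** 2
            + (cfg["epochs"] - 20) ** 2 * 1e-4
            + cfg["weight_decay"])

best_cfg, best_loss = grid_search(example_grid, toy_loss)
```

Exhaustive search is feasible here because the grid is small; the cost grows multiplicatively with the number of candidate values per hyperparameter.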
DNN models tend to overfit when their parameters are tuned to achieve satisfactory results. We apply four methods to prevent overfitting. First, we shrink the weight parameters of the DNN model via L2-penalized estimation, which controls the weight of the regularization term in the loss function. Second, we apply dropout to prevent overfitting and co-adaptation of neurons, setting the output of any neuron to zero with probability p. Models with dropout can be interpreted as an ensemble of models with different numbers of neurons in each layer that share weights, which enhances generalization ability (Srivastava et al., 2014). Third, we adopt early stopping to determine the best training epoch: we stop training once model performance stops improving on the test datasets. Finally, we use the batch normalization algorithm, which normalizes the input of each layer so that it remains stable, thereby speeding up training and improving generalization ability.
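Two of the regularization devices above, dropout and early stopping, are simple enough to sketch directly in Python. The snippet assumes the common "inverted" dropout convention, in which surviving activations are rescaled by 1/(1-p) during training, and a patience-based stopping rule. Both conventions, and all names below, are assumptions made for the example rather than details given in the text.

```python
import numpy as np

def dropout(h, p, rng):
    """Zero each activation with probability p and rescale the survivors.

    The 1/(1-p) rescaling (inverted dropout) keeps the expected activation
    unchanged, so no adjustment is needed at prediction time.
    """
    keep = rng.random(h.shape) >= p
    return h * keep / (1.0 - p)

def early_stopping_epoch(held_out_losses, patience=1):
    """Return the 1-based epoch with the lowest held-out loss.

    The scan stops after `patience` consecutive epochs without improvement.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(held_out_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

rng = np.random.default_rng(0)
dropped = dropout(np.ones(10000), p=0.5, rng=rng)
stop_at = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.5], patience=1)
```

With `patience=1` the scan stops at the first deterioration and never sees the later improvement in the fifth epoch; a larger patience trades extra training time for robustness to such noise.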
2.4. Forecast Evaluation Measures
Following Neely et al. (2014) and Welch and Goyal (2008), we employ two kinds of forecast evaluation measures. The first kind comprises the R^{2}_{OS} and MSFE-adjusted statistics. R^{2}_{OS} measures forecasting accuracy relative to the benchmark HA model; a monthly R^{2}_{OS} of 0.5% is economically significant (Campbell and Thompson, 2008). The MSFE-adjusted statistic measures statistical significance (Clark and West, 2007). The second kind is asset allocation performance, captured by the following six measures: (1) certainty equivalent return gain [CER gain, △(ann%)], (2) CER gain in expansions [△(ann%), EXP], (3) CER gain in recessions [△(ann%), REC], (4) Sharpe ratio, (5) relative average turnover, and (6) CER gain with a transaction cost of 50 bps per transaction [△(ann%), cost = 50bps].
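For concreteness, the two statistical measures can be written as short Python functions. The sketch uses the standard definitions: R^{2}_{OS} is one minus the ratio of the model's squared forecast errors to those of the benchmark, and the MSFE-adjusted statistic is the t-statistic on the mean of the Clark and West (2007) adjusted loss differential. The function names and toy numbers are hypothetical, and the t-statistic here uses a plain standard error with no autocorrelation correction, which is a simplifying assumption.

```python
import numpy as np

def r2_os(actual, model_fc, bench_fc):
    """Out-of-sample R^2 of a model relative to a benchmark forecast."""
    actual, model_fc, bench_fc = (np.asarray(a, dtype=float)
                                  for a in (actual, model_fc, bench_fc))
    sse_model = np.sum((actual - model_fc) ** 2)
    sse_bench = np.sum((actual - bench_fc) ** 2)
    return 1.0 - sse_model / sse_bench

def msfe_adjusted(actual, model_fc, bench_fc):
    """Clark-West MSFE-adjusted t-statistic (one-sided test of equal MSFE)."""
    actual, model_fc, bench_fc = (np.asarray(a, dtype=float)
                                  for a in (actual, model_fc, bench_fc))
    # Benchmark loss minus model loss, with the model's loss adjusted for
    # the noise introduced by estimating extra parameters
    f = ((actual - bench_fc) ** 2
         - ((actual - model_fc) ** 2 - (bench_fc - model_fc) ** 2))
    return f.mean() / (f.std(ddof=1) / np.sqrt(len(f)))

# Toy data (illustrative only)
actual = [0.02, -0.01, 0.03, 0.00]
bench = [0.01, 0.01, 0.01, 0.01]
model = [0.015, -0.005, 0.02, 0.005]
toy_r2 = r2_os(actual, model, bench)
toy_cw = msfe_adjusted(actual, model, bench)
```

A positive R^{2}_{OS} means the model's MSFE is below the benchmark's; the MSFE-adjusted statistic can be significant even when R^{2}_{OS} is slightly negative, because of the adjustment term.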
2.5. Data
The dataset covers the monthly period from 1950:12 to 2016:12, based on data availability. The equity premium R_{t} is computed as the difference between the log return on the S&P 500 (including dividends) and the log return on a risk-free bill. As mentioned before, in order to compare the forecasting performance of the models considered, we select 42 predictors. These consist of three groups: 14 macroeconomic variables from Welch and Goyal (2008), 14 technical variables from Neely et al. (2014), and 14 additional variables from the existing finance literature, including investor sentiment changes (Baker and Wurgler, 2006), the financial stress index (Cardarelli et al., 2011), the ratio to the 52-week high (George & Hwang, 2004), etc.
Table 1 reports the summary statistics for the log equity premium (1950:12-2016:12), macroeconomic variables (1950:12-2016:12), technical variables (1950:12-2016:12), and additional variables (1965:08-2016:12). The average monthly equity premium (0.004) divided by its standard deviation (0.043) produces a monthly Sharpe ratio value of 0.088. Most of the macroeconomic variables and additional variables are strongly auto-correlated.
3. Empirical results
Similar to Neely et al. (2014), the models are estimated using recursively expanding windows with an initial length of 15 years. We divide the out-of-sample period into three panels: panel A (1966:01-2011:12), panel B (1980:09-2010:12), and panel C (2011:01-2016:12). We report results in each panel for the whole period and separately for NBER-dated business-cycle expansions and recessions.
Table 1. Summary statistics
Variable | Mean | Median | Std | Min | Max | Auto-cor | Skewness | Kurtosis |
Panel A: Log equity premium, December 1950 to December 2016 | ||||||||
R | 0.004 | 0.008 | 0.043 | -0.248 | 0.149 | 0.049 | -0.669 | 2.535 |
Panel B: Macroeconomic variables, December 1950 to December 2016 | ||||||||
DP | -3.602 | -3.531 | 0.412 | -4.524 | -2.753 | 0.994 | -0.134 | -0.872 |
DY | -3.597 | -3.525 | 0.412 | -4.531 | -2.751 | 0.994 | -0.139 | -0.848 |
EP | -2.831 | -2.860 | 0.449 | -4.836 | -1.899 | 0.989 | -0.723 | 2.648 |
DE | -0.771 | -0.815 | 0.320 | -1.244 | 1.379 | 0.986 | 2.961 | 15.854 |
RVOL | 0.145 | 0.135 | 0.051 | 0.055 | 0.316 | 0.963 | 0.799 | 0.549 |
BM | 0.498 | 0.414 | 0.270 | 0.121 | 1.207 | 0.994 | 0.761 | -0.465 |
NTIS | -0.010 | -0.013 | 0.020 | -0.051 | 0.058 | 0.979 | 0.650 | 0.265 |
TBL | -4.866 | -4.970 | 3.275 | -16.300 | -0.010 | 0.990 | -0.527 | 0.596 |
LTY | -6.772 | -6.460 | 2.683 | -14.820 | -1.750 | 0.993 | -0.589 | 0.132 |
LTR | 0.639 | 0.510 | 3.054 | -11.240 | 15.230 | 0.037 | 0.380 | 2.275 |
TMS | 1.905 | 2.060 | 1.507 | -3.650 | 4.550 | 0.955 | -0.464 | -0.172 |
DFY | 1.062 | 0.940 | 0.448 | 0.320 | 3.380 | 0.964 | 1.754 | 4.229 |
DFR | 0.011 | 0.060 | 1.498 | -9.750 | 7.370 | -0.064 | -0.348 | 6.146 |
INFL | -0.330 | -0.305 | 0.359 | -1.792 | 1.915 | 0.619 | 0.160 | 3.458 |
Panel C: Technical variables, December 1950 to December 2016 | ||||||||
MA(1,9) | 0.677 | 1 | 0.468 | 0 | 1 | 0.703 | -0.761 | -1.425 |
MA(1,12) | 0.708 | 1 | 0.455 | 0 | 1 | 0.780 | -0.919 | -1.160 |
MA(2,9) | 0.684 | 1 | 0.465 | 0 | 1 | 0.748 | -0.793 | -1.375 |
MA(2,12) | 0.705 | 1 | 0.456 | 0 | 1 | 0.821 | -0.901 | -1.191 |
MA(3,9) | 0.686 | 1 | 0.465 | 0 | 1 | 0.785 | -0.801 | -1.362 |
MA(3,12) | 0.703 | 1 | 0.457 | 0 | 1 | 0.817 | -0.893 | -1.207 |
MOM(9) | 0.703 | 1 | 0.457 | 0 | 1 | 0.767 | -0.893 | -1.207 |
MOM(12) | 0.728 | 1 | 0.445 | 0 | 1 | 0.804 | -1.026 | -0.951 |
VOL(1,9) | 0.666 | 1 | 0.472 | 0 | 1 | 0.609 | -0.706 | -1.506 |
VOL(1,12) | 0.687 | 1 | 0.464 | 0 | 1 | 0.709 | -0.809 | -1.349 |
VOL(2,9) | 0.660 | 1 | 0.474 | 0 | 1 | 0.761 | -0.675 | -1.549 |
VOL(2,12) | 0.690 | 1 | 0.463 | 0 | 1 | 0.825 | -0.826 | -1.322 |
VOL(3,9) | 0.676 | 1 | 0.468 | 0 | 1 | 0.770 | -0.753 | -1.437 |
VOL(3,12) | 0.682 | 1 | 0.466 | 0 | 1 | 0.835 | -0.785 | -1.388 |
Panel D: Additional variables, August 1965 to December 2016 | ||||||||
PDND | -4.658 | -6.194 | 13.58 | -50.23 | 31.632 | 0.970 | 0.147 | 0.147 |
RIPO | 16.808 | 12.700 | 19.44 | -28.80 | 119.10 | 0.648 | 2.112 | 6.403 |
NIPO | 25.916 | 19.000 | 23.23 | - | 122.00 | 0.862 | 1.203 | 1.079 |
CEFD | 8.674 | 9.220 | 7.343 | -10.91 | 25.28 | 0.962 | -0.124 | -0.327 |
S | 0.172 | 0.151 | 0.086 | 0.045 | 0.430 | 0.994 | 0.946 | 0.348 |
ΔSENT | 0.001 | 0.032 | 0.942 | -3.616 | 5.416 | 0.086 | 0.289 | 2.882 |
FS | 100.77 | 100.74 | 0.894 | 98.359 | 105.89 | 0.857 | 0.621 | 2.229 |
WH52_Ratio | 0.936 | 0.965 | 0.083 | 0.51 | 1.04 | 0.891 | -1.858 | 3.915 |
WH52_Abs | 0.154 | 0.000 | 0.361 | 0.00 | 1.00 | 0.079 | 1.922 | 1.700 |
DV | 0.010 | 0.009 | 0.003 | 0.01 | 0.02 | 0.997 | 0.649 | -0.287 |
WV | 0.009 | 0.009 | 0.002 | 0.00 | 0.01 | 0.998 | 0.128 | -0.778 |
AV | 0.009 | 0.009 | 0.003 | 0.01 | 0.02 | 0.992 | 0.592 | -0.595 |
VAR005 | 0.060 | 0.058 | 0.015 | 0.03 | 0.08 | 0.980 | 0.024 | -1.063 |
VAR001 | 0.078 | 0.080 | 0.020 | 0.04 | 0.11 | 0.981 | -0.191 | -1.054 |
3.1. In-sample test results
Table 2 reports the in-sample results of the HA, OLS+28, DNN+28 and DNN+42 models for the three panels. The results in Panel A show that the OLS+28 model beats the HA model in terms of MSFE and R^{2}, which is consistent with Neely et al. (2014). Overall, the DNN models outperform the HA and OLS+28 models in-sample in all three panels, and this holds in both business-cycle expansions and recessions.
Table 2. In-sample results
Model | MSFE | R^{2} (%) | R^{2} EXP (%) | R^{2} REC (%) |
Panel A: January 1966 to December 2011 | ||||
HA | 20.23 | |||
OLS+28 | 15.15 | 0.05 | 0.42 | 0.52 |
DNN+28 | 15.47 | 3.03 | 1.08 | 5.48 |
Panel B: September 1980 to December 2010 | ||||
HA | 20.54 | |||
OLS+28 | 16.24 | 0.04 | 0.29 | 0.41 |
DNN+28 | 15.47 | 3.03 | 1.08 | 5.48 |
DNN+42 | 18.56 | 3.72 | 0.50 | 6.96 |
Panel C: January 2011 to December 2016 | ||||
HA | 10.67 | |||
OLS+28 | 17.49 | 1.81 | 1.81 | - |
DNN+28 | 17.37 | 2.62 | 2.62 | - |
DNN+42 | 18.59 | 3.81 | 3.81 | - |
3.2. Out-of-Sample forecasting results
Table 3 provides the out-of-sample forecasting results. Panel A of Table 3 shows that, in terms of R^{2}_{OS} and MSFE_{OS}, the OLS+28 model outperforms the HA model from 1966:01 to 2011:12, a result almost identical to that of Neely et al. (2014). However, Panel B shows that the OLS+28 model performs worse than the HA model from 1980:09 onward. This suggests that the OLS+28 model's advantage over the HA model in Panel A comes mainly from the first 15 years or so of the out-of-sample period. Moreover, the OLS+28 model obtains large positive R^{2}_{OS} values during recessions (11.37% and 10.64% in panel A and panel B, respectively) but disappointingly negative R^{2}_{OS} values during expansions (-2.63% and -4.14% in panel A and panel B, respectively). The OLS+28 model's strong performance over the whole sample is therefore largely due to high R^{2}_{OS} values during recessions. Panel C further shows that, surprisingly, the OLS+28 model displays no out-of-sample predictive ability in terms of R^{2}_{OS} (-5.02%) from 2011:01 to 2016:12, a period not examined by Neely et al. (2014). Overall, the predictive performance of the OLS+28 model is not robust.
Turning to our proposed DNN models, the results in Table 3 show that both the DNN+28 and DNN+42 models strongly beat the simple HA benchmark and the OLS+28 model in terms of MSFE and R^{2}_{OS}. The out-of-sample MSFEs of the DNN models are significantly lower than those of the HA and OLS+28 models at conventional significance levels. Notably, the R^{2}_{OS} statistics of the DNN models are positive in every panel and far exceed those of the OLS+28 model in Panels B and C. These results indicate that DNN models outperform the HA model in both expansions and recessions, and that their performance is robust.
Moreover, the DNN+42 model performs better overall than the DNN+28 model. In particular, the DNN+42 model has an R^{2}_{OS} of 3.37% in Panel B of Table 3, well above the 1.49% of the DNN+28 model. The out-of-sample MSFE of the DNN+42 model is significantly lower than that of the HA model (at the 1% significance level in Panel B). Thus, the results suggest that the forecasting performance of DNN models is enhanced by incorporating the 14 additional variables.
Table 3. Out-of-sample forecasting results
Model | MSFE_{OS} | R^{2}_{OS} (%) | MSFE-adjusted | R^{2}_{OS} EXP (%) | R^{2}_{OS} REC (%) |
Panel A: January 1966 to December 2011 | |||||
HA | 20.23 | ||||
OLS+28 | 19.83 | 1.95 | 3.38*** | -2.63 | 11.37 |
DNN+28 | 19.67 | 2.75 | 3.60*** | 1.02 | 6.31 |
Panel B: September 1980 to December 2010 | |||||
HA | 20.54 | ||||
OLS+28 | 20.55 | -0.07 | 2.03** | -4.14 | 10.64 |
DNN+28 | 20.23 | 1.49 | 2.23** | 0.48 | 4.15 |
DNN+42 | 19.85 | 3.37 | 2.58*** | 1.06 | 9.41 |
Panel C: January 2011 to December 2016 | |||||
HA | 10.67 | ||||
OLS+28 | 11.20 | -5.02 | -0.53* | -5.02 | - |
DNN+28 | 10.31 | 3.35 | 2.06** | 3.35 | - |
DNN+42 | 10.30 | 3.42 | 1.85** | 3.42 | - |
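As a quick consistency check on Table 3, the reported R^{2}_{OS} values can be approximately recovered from the reported MSFEs using R^{2}_{OS} = 1 - MSFE_{OS}(model) / MSFE_{OS}(HA). The Python sketch below does this for every row; the numbers are copied from Table 3, while the variable and function names are hypothetical. Because the table reports MSFEs to two decimals, the implied values are expected to match only up to rounding.

```python
# MSFE_OS values copied from Table 3
table3_msfe = {
    "A": {"HA": 20.23, "OLS+28": 19.83, "DNN+28": 19.67},
    "B": {"HA": 20.54, "OLS+28": 20.55, "DNN+28": 20.23, "DNN+42": 19.85},
    "C": {"HA": 10.67, "OLS+28": 11.20, "DNN+28": 10.31, "DNN+42": 10.30},
}

# R^2_OS values (%) as reported in Table 3
table3_r2 = {
    "A": {"OLS+28": 1.95, "DNN+28": 2.75},
    "B": {"OLS+28": -0.07, "DNN+28": 1.49, "DNN+42": 3.37},
    "C": {"OLS+28": -5.02, "DNN+28": 3.35, "DNN+42": 3.42},
}

def implied_r2_os(panel):
    """R^2_OS (%) implied by the rounded MSFEs of one panel."""
    return {model: 100.0 * (1.0 - msfe / panel["HA"])
            for model, msfe in panel.items() if model != "HA"}

implied = {name: implied_r2_os(panel) for name, panel in table3_msfe.items()}

# Largest absolute gap between implied and reported values (percentage points)
max_gap = max(abs(implied[p][m] - table3_r2[p][m])
              for p in table3_r2 for m in table3_r2[p])
```

The implied and reported values agree to within a tenth of a percentage point in every row, which is consistent with the differences arising from rounding of the MSFEs.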
3.3. Asset allocation results
Table 4 reports the portfolio performance for asset allocation over 1966:01-2016:12. In line with the R^{2}_{OS} results in Table 3, the OLS+28 model does not perform robustly in terms of △(ann%), △(ann%), EXP, and △(ann%), REC in Table 4.
Turning to the DNN models, Table 4 shows that their CER gains are positive in both recessions and expansions. In addition, although their turnover is high relative to the HA and OLS+28 models, the CER gains net of a proportional transaction cost of 50 basis points per transaction remain positive. The DNN+28 model therefore also performs well from an asset allocation perspective. Table 4 consistently confirms that the DNN+42 model outperforms the DNN+28 model in terms of CER gain and Sharpe ratio. The DNN+42 model generates a monthly out-of-sample R^{2} of 3.42% and an annual utility gain of 2.99% for a mean-variance investor from 2011:01 to 2016:12. The asset allocation analysis demonstrates the substantial economic value of employing DNN models for equity premium forecasting.
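The asset allocation exercise can be sketched in Python as follows. The sketch assumes the standard mean-variance setup used in this literature: the equity weight is the forecast divided by risk aversion times a variance forecast, the CER is the portfolio mean minus half of risk aversion times portfolio variance, and the gain is annualized by multiplying the monthly difference by 1200. The weight bounds of 0 and 1.5, the function names and the toy numbers are assumptions made for the example and are not stated in this paper.

```python
import numpy as np

def mv_weights(forecasts, variance_forecasts, gamma, w_min=0.0, w_max=1.5):
    """Equity weights for a mean-variance investor with risk aversion gamma."""
    w = np.asarray(forecasts, dtype=float) / (
        gamma * np.asarray(variance_forecasts, dtype=float))
    # Bounds on leverage and short sales (assumed values)
    return np.clip(w, w_min, w_max)

def cer(portfolio_returns, gamma):
    """Certainty equivalent return: mean minus (gamma/2) times variance."""
    p = np.asarray(portfolio_returns, dtype=float)
    return p.mean() - 0.5 * gamma * p.var(ddof=1)

def annualized_cer_gain(excess, rf, w_model, w_bench, gamma):
    """CER gain of the model portfolio over the benchmark, in annual percent."""
    excess, rf = np.asarray(excess, dtype=float), np.asarray(rf, dtype=float)
    port_model = rf + np.asarray(w_model, dtype=float) * excess
    port_bench = rf + np.asarray(w_bench, dtype=float) * excess
    return 1200.0 * (cer(port_model, gamma) - cer(port_bench, gamma))

# Toy inputs (illustrative only)
toy_w = mv_weights([0.01, -0.005, 0.1], [0.002, 0.002, 0.002], gamma=5)
toy_cer = cer([0.01, 0.03], gamma=5)
```

In the toy example the second forecast is negative and the third is very large, so the weights are truncated at the lower and upper bounds respectively; this truncation is what keeps extreme forecasts from producing extreme positions.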
Table 4. Asset allocation results
Model | △(ann%) | △(ann%), EXP | △(ann%), REC | Sharpe ratio | Relative average turnover | △(ann%), cost = 50 bps |
Panel A: January 1966 to December 2011 | ||||||
HA(CER) | 4.87 | 9.33 | -17.52 | 0.06 | 2.66% | 4.70 |
OLS+28 | 5.07 | 0.05 | 30.33 | 0.16 | 6.43 | 4.20 |
DNN+28 | 4.40 | 1.46 | 18.99 | 0.14 | 13.64 | 2.37 |
Panel B: September 1980 to December 2010 | ||||||
HA(CER) | 7.12 | 11.54 | -17.61 | 0.10 | 2.63% | 6.95 |
OLS+28 | 2.77 | -1.57 | 26.96 | 0.16 | 5.18 | 2.09 |
DNN+28 | 2.49 | 1.13 | 9.90 | 0.15 | 14.36 | 0.37 |
DNN+42 | 4.48 | 0.32 | 27.65 | 0.20 | 19.85 | 1.48 |
Panel C: January 2011 to December 2016 | ||||||
HA(CER) | 8.35 | 8.35 | - | 0.26 | 2.31% | 8.21 |
OLS+28 | -4.56 | -4.56 | - | 0.16 | 12.50 | -6.19 |
DNN+28 | 2.88 | 2.88 | - | 0.31 | 7.78 | 1.95 |
DNN+42 | 2.99 | 2.99 | - | 0.33 | 16.52 | 0.84 |
3.4. Robustness checks
To further validate our results, we conduct the following robustness checks. First, Figure 1 displays the effects of the number of training epochs, the dropout probability and the weight decay of the DNN models on forecasting performance. It shows that performance remains good when these key parameters are near their selected values. Second, we report the out-of-sample forecasting results year by year for our models (Table A11 in the Online Appendix). Finally, we repeat the asset allocation exercise with risk aversion coefficients equal to 4, 5 and 6 (Tables A4-A10 in the Online Appendix). Overall, these robustness checks confirm that DNN models indeed work better than HA models and OLS models for forecasting the equity premium.
Figure 1. Effects of the Number of DNN Models’ Epochs, Dropout Probability and Weight Decay on R^{2}_{OS} and CER Gains in Panel C
4. Conclusion
This study compares the predictive ability of deep neural network models with that of ordinary least squares models and historical average models. We find that DNN models robustly perform best and significantly outperform both OLS and HA models in in-sample tests, out-of-sample tests and asset allocation exercises. Moreover, the forecasting performance of the DNN is enhanced by adding 14 additional variables selected from the finance literature, which indicates that the DNN incorporates the predictive information contained in these variables. One possible explanation for this performance is that the nonlinear DNN automatically extracts high-dimensional features from the data and discovers different forecasting patterns. Our study has important implications for portfolio construction and risk management by investors.
Supplementary Materials: Online Appendix is available from the authors.
Author Contributions: Conceptualization, Xianzheng Zhou and Huaigang Long; methodology, Xianzheng Zhou; software, Xianzheng Zhou; validation, Xianzheng Zhou, Huaigang Long, and Hui Zhou; formal analysis, Xianzheng Zhou; investigation, Xianzheng Zhou; resources, Hui Zhou; data curation, Hui Zhou; writing—original draft preparation, Xianzheng Zhou; writing—review and editing, Huaigang Long; visualization, Hui Zhou; supervision, Huaigang Long; project administration, Huaigang Long; funding acquisition, Huaigang Long. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The processed data from this study are available upon request.
Conflicts of Interest: The authors declare no conflict of interest.