Testing a Range of Stochastic Values.
At this point, it is impossible to avoid some amount of optimization. We are hopeful that the underlying premise of the method is good, and we would like to vary the calculation period as well as the entry and exit parameters to see if we can increase the number of trades, as well as the returns. The pattern of these relationships is predictable, so we might be on safe ground. For example, as we make the entry thresholds farther apart, we will get fewer trades, the profits should be bigger (if we keep the exit at the same level), and the risk should get larger because we're holding the trade longer.
If we hold the entry levels constant and make the exit points closer, then we should increase the number of trades, reduce the size of the profit, and decrease the overall risk because the trades will be held for a shorter time.
We will also vary the calculation period for the stochastic indicator. As that value gets larger, the indicator will reach its extremes less often, generating fewer trading signals. A shorter calculation period will produce more signals. We would like to have as many signals as possible but recognize that a faster frequency of signals will also have smaller price moves and consequently smaller potential profits. There are always trade-offs that must be balanced. In addition, we know from Chapter 2 that when we look at shorter time intervals, we see more price noise. We expect that for this mean-reverting method, shorter calculation periods should favor our trading strategy.
Because we are testing different combinations and must be concerned about overfitting, we will watch carefully to see that the pattern of results has the shape that we expect and that the numbers do not jump around.
Using In-Sample and Out-of-Sample Data.
If we were working in completely uncharted waters, that is, developing a strategy from a new concept, we would want to divide the data into in-sample and out-of-sample partitions. We would then test all of the new concepts on the in-sample data until we were satisfied with the rules and the test results. Finally, we would run our best rules and parameters through the unseen, out-of-sample data. We expect the results to be worse than the in-sample performance because there are many more patterns that we could not have anticipated. Still, if the ratio of return to risk of the in-sample test was 2.0 and the out-of-sample test was 1.2, we would consider that a success. It would be ideal if the out-of-sample data performed the same as the in-sample data, but that rarely happens because there is always some degree of overfitting, even if it is unconscious and unintended.
However, if the out-of-sample test is a complete failure, yielding a ratio near zero, then the method is also a failure. You cannot review the new results, find the problem area, and fix it, because that is feedback. You no longer have true out-of-sample data, and there is good reason to believe that your improvements are simply overfitting the data more and will result in trading losses.
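The partitioning described above can be sketched in a few lines. This is only an illustration: the function name split_in_out and the 75% in-sample fraction are assumptions, not values given in the text. The one point that matters is that the split is chronological, so the out-of-sample data stays unseen until the final run.

```python
def split_in_out(series, in_sample_fraction=0.75):
    """Split a chronologically ordered series into in-sample and
    out-of-sample partitions. The 0.75 default is an assumption."""
    if not 0.0 < in_sample_fraction < 1.0:
        raise ValueError("in_sample_fraction must be between 0 and 1")
    cut = int(len(series) * in_sample_fraction)
    # Earlier data is used for development, later data is held back.
    return series[:cut], series[cut:]


# Hypothetical daily closes, oldest first.
closes = [100, 101, 102, 103, 104, 105, 106, 107]
in_sample, out_of_sample = split_in_out(closes)
```

Once the out-of-sample partition has been used to evaluate the rules, it should not be fed back into further development, for the reason given above.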
Specifying the Tests.
Our primary measurement statistic is the information ratio (annualized return divided by annualized volatility), and the results are shown in Table 3.5. The overbought entry levels are varied from stochastic values of 50 to 30 and the corresponding exit levels from 10 points under the entry to zero. Rather than test only the pair AMR-CAL, we will show the average of all tests for the combinations of four airlines:
LCC-CAL.
LCC-AMR.
LCC-LUV.
AMR-CAL.
AMR-LUV.
CAL-LUV.
We take this approach because the choice of parameters should work for all the pairs, not just for AMR-CAL. When we look at average results of all tests, we won't be able to see how the individual pairs performed, but we will know if one set of thresholds is better than others when applied to all markets. This will prevent us from looking too closely at the detail. We also expect to get smoother results by averaging the ratios for each pair.
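As a rough sketch of this averaging, the six pairs can be generated from the four symbols and the information ratio computed for each pair's daily returns. The function names are hypothetical, and the simple arithmetic annualization (mean daily return times 252) is an assumption; a compounded annualized return would give somewhat different numbers.

```python
import math
import statistics
from itertools import combinations

TRADING_DAYS = 252  # trading days per year


def information_ratio(daily_returns):
    """Annualized return divided by annualized volatility.
    Arithmetic annualization is an assumption of this sketch."""
    ann_return = statistics.mean(daily_returns) * TRADING_DAYS
    ann_vol = statistics.stdev(daily_returns) * math.sqrt(TRADING_DAYS)
    return ann_return / ann_vol if ann_vol > 0 else 0.0


def average_ratio(returns_by_pair):
    """Average the information ratios over all pairs tested."""
    return statistics.mean(information_ratio(r) for r in returns_by_pair.values())


airlines = ["LCC", "AMR", "CAL", "LUV"]
pairs = list(combinations(airlines, 2))  # four stocks give six pairs
```

Averaging the ratio, rather than pooling the trades, gives each pair equal weight no matter how many trades it produced.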
Which Parameter First?
There are three parameters to test: the stochastic calculation period, the entry threshold, and the exit threshold. The general rule is that we test the parameter that has the most effect on performance. That seems to be the stochastic calculation period. We should expect that longer periods (larger values) will generate fewer trades. The entry threshold will also be a major factor in determining the number of trades: The greater the threshold, the fewer the trades. The exit threshold will have only a small effect on the frequency of trading. If we exit sooner, then there is a chance that prices will reverse and allow us to enter again, but that should happen much less often. Then the order of testing will be:
Stochastic calculation period.
Entry threshold (with exit set to zero).
Exit threshold.
To begin, we need to pick some reasonable values, which will be a calculation period of 20, an entry of 60, and an exit of zero. We expect that the best calculation period will be less than 20 and the entry may be less than 60. Exiting at zero is normal, but exiting a short at 10 might be safer. Table 3.5 shows the results of these tests beginning in January 2000; however, some of these stocks start later due to mergers.
TABLE 3.5 Initial tests of four airline stocks, six pairs.
The ratio will be the key statistic for determining success or failure. In Table 3.5, the ratio increases in a reasonably orderly way as the momentum calculation period declines from 20 to 6, as shown in Figure 3.6. The calculation periods are shown along the bottom and the ratios along the left scale. Periods of 10 and lower are clearly better, with 7 the best. The period 7 also had the highest profits per share, which is critical to success. We could have tested all calculation periods, but the difference between 20 and 19 days (a change of 5%) would not be as significant as the difference between 7 and 8 (a change of 14%), so we've skipped some values at the high end and included all of them at the lower end.
FIGURE 3.6 Information ratios for tests of momentum calculation periods.
We expect that everyone would choose the period 7, not just because it has the highest ratio but because it falls in the middle of the profitable set of tests. It is best to avoid the value 6 because faster trades are likely to have smaller profits per trade. The 10-day test may be better because of the profits per trade, but readers will need to perform these tests themselves to verify the results, and they can make other choices at that time. You can use these results to convince yourself that this is a viable approach to trading, but you can never simply accept someone else's work without verifying it yourself.
TABLE 3.6 Airline pairs with entry threshold of 50.
TABLE 3.7 Airline pairs with entry threshold of 40.
The greatest concern is the average number of trades. Because U.S. Airways (LCC) started trading in October 2005, all combinations using LCC will be more than 5 years shorter than the other pairs. If we consider all pairs trading for the full 10 years, an average of 36 trades is only 3.6 trades per year. That may not be enough to hold our attention. One way to increase the number of trades is to lower the entry threshold below 60; however, by lowering the threshold, we will also expose ourselves to greater risk because we will enter more trades before they reach their extremes. Tables 3.6 and 3.7 show the results of lowering the threshold to 50 and 40. The averages show that the number of trades increases along with all of the other statistics (the annualized rate of return, per share return, and information ratio), but the entry threshold of 50 is noticeably better than the threshold of 40. For the threshold of 50, more of the individual pairs were profitable (see the far right column) than with the entry threshold of 40. Calculation periods of 10 and lower are still best, and 7 is again the peak performer.
The number of trades has increased by making the threshold lower, but the profits per share are, on average, at only $0.082, which is below what we believe is a safe margin of error, given execution costs. We would consider testing the exit level of 10, compared to zero, to assure us that we exit the trade more often. But because the profits per share are marginally small, exiting sooner would reduce those profits, and it would be unlikely that these stocks would generate a net profit. One answer is to look at the volatility of the market for each individual stock and trade only when the volatility is relatively high. That will reduce the number of trades but should increase the profits per trade. It will probably add risk because there are fewer trades and less diversification, and risk is always associated with higher volatility. But low volatility isn't an option if it doesn't produce sufficient profits.
Before looking at volatility, let's inspect the returns of the individual pairs. Up to now, we have looked at the average of the tests, which is a good way to avoid overfitting. But we need to understand the profits per trade. Table 3.8 shows that results are significantly skewed, with the first two pairs, U.S. Airways (LCC)-Continental (CAL) and U.S. Airways-American (AMR), posting very large per share returns, and all other pairs posting returns that are below what we would consider sufficient for netting a profit. Still, all pairs are profitable, which can be seen as a good start.
TABLE 3.8 Results of individual pairs for airlines, momentum period 7, entry threshold 50, exit threshold 0.
If we go back to the original test that used a 60 entry threshold, we expect the profits per trade to increase, although there would be fewer trades. Table 3.9 shows that results are as expected. The per share results go up on average, and the LCC-LUV pair increases from 4.1 cents to 12.5 cents, enough to produce a real profit. There are some differences in the results of the first two pairs, LCC-CAL and LCC-AMR, due to better entries or few trades, but the gains of those two pairs hold up nicely. The number of trades drops predictably, as do the net profits. Three of the six pairs are tradable.
TABLE 3.9 Results of individual pairs for airlines, momentum period 7, entry threshold 60, exit threshold 0.
One last approach is to visually inspect the individual net asset value (NAV) streams. In Figure 3.7a, these results are messy. If we look closer at the more recent U.S. Airways pairs in Figure 3.7b, the returns are much more orderly.
FIGURE 3.7 NAVs for (a) all airline pairs and (b) airline pairs using U.S. Airways (LCC) as one leg.
Are These Results Robust?
Now we come to the difficult part, deciding whether these results are robust. If they are, then we can comfortably trade these pairs. The answer, if any, comes partly from a more philosophic view of this process.
On the positive side, the idea of trading distortions between two fundamentally related stocks is a basic and believable concept. We used a stochastic indicator to measure the relative momentum of the two stocks and then found those points where they diverged. This was simply the difference between the two stochastic indicator values. The larger the difference that determined the entry threshold, the fewer the trades and the larger the profits per trade. That is all according to expectations. We exit when the stochastic values come back together.
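The logic just summarized can be sketched as follows. The raw stochastic formula used here, 100 x (close - lowest low) / (highest high - lowest low) over the calculation period, is the standard definition and is assumed to match the one used earlier in the book. The function names, the handling of a flat range, and the convention that +1 means long the first stock and short the second are all assumptions of this sketch.

```python
def raw_stochastic(highs, lows, closes, period):
    """Raw stochastic (0 to 100) for each day; None until enough data."""
    values = []
    for i in range(len(closes)):
        if i + 1 < period:
            values.append(None)
            continue
        hi = max(highs[i - period + 1 : i + 1])
        lo = min(lows[i - period + 1 : i + 1])
        # A flat range has no defined stochastic; 50 is an arbitrary choice.
        values.append(100.0 * (closes[i] - lo) / (hi - lo) if hi > lo else 50.0)
    return values


def pair_positions(stoch_a, stoch_b, entry=60.0, exit_level=0.0):
    """Daily position in the pair: -1 = short A and long B,
    +1 = long A and short B, 0 = no position."""
    position = 0
    positions = []
    for a, b in zip(stoch_a, stoch_b):
        if a is None or b is None:
            positions.append(0)
            continue
        diff = a - b  # the stochastic difference that drives the trade
        if position == 0:
            if diff >= entry:          # A is overbought relative to B
                position = -1
            elif diff <= -entry:       # A is oversold relative to B
                position = 1
        elif position == -1 and diff <= exit_level:
            position = 0               # stochastics have come back together
        elif position == 1 and diff >= -exit_level:
            position = 0
        positions.append(position)
    return positions
```

Raising entry makes the two thresholds farther apart and should produce fewer trades, while raising exit_level above zero closes trades sooner, which matches the behavior expected in the text.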
When we run a set of tests, varying the calculation period of the stochastic, we get more trades for shorter holding periods. Again, this is very normal and conceptually correct if we are trying to emphasize the price noise. The results are continuous in terms of the number of trades, profits per trade, and information ratio.
On the negative side, we have clearly tested combinations of parameters. If we test enough, then some are very likely to be profitable, but statistics tell us that a small number of profitable results within a larger set of tests do not have predictive qualities. There are also not as many trades over the test period as we would like, but that may be the normal outcome of highly correlated stocks that don't diverge often, rather than just spurious price moves. And some of the results show very small net profits and even some losses.
One way to determine robustness is to consider the percentage of profitable results over all tests. In other words, if we used a reasonable range of calculation periods for the momentum indicator and reasonable entry and exit thresholds, and we found the percentage of profitable tests, then a large percentage would tell us that this method is sound, even though some returns were small and others large. It would remove the possibility that this method worked for only a narrow set of conditions. We find this a strong measurement of robustness.
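This measurement is easy to compute once the test results are collected. In the sketch below, the dictionary keys are (calculation period, entry threshold, exit threshold) and the values are information ratios; the numbers are made-up placeholders for illustration only and are not taken from any table in this chapter. The function name is also hypothetical.

```python
def percent_profitable(ratios):
    """Percentage of tests whose information ratio is above zero."""
    ratios = list(ratios)
    if not ratios:
        raise ValueError("no test results supplied")
    profitable = sum(1 for r in ratios if r > 0)
    return 100.0 * profitable / len(ratios)


# Placeholder results keyed by (period, entry, exit); NOT real test output.
results = {
    (7, 60, 0): 0.8,
    (7, 50, 0): 0.5,
    (10, 60, 0): 0.3,
    (20, 60, 0): -0.1,
}
robustness = percent_profitable(results.values())
```

The range of parameters in the grid should be chosen before looking at the results; trimming the losing combinations afterward would defeat the purpose of the measurement.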
Another confirmation of robustness is to apply this exact method on other sectors with similar fundamental relationships. If the results were similar, then we would be more confident and, at the same time, have additional pairs to trade that would provide valuable diversification.
For now, we can say that there is nothing wrong with the current results, but they are not sufficient to draw a conclusion. We would also prefer pairs that had more trades.
TARGET VOLATILITY.
Before moving on, notice that the standard deviation of returns in Table 3.7 is 12% for all pairs. That is called the target volatility. To compare the returns of different pairs, we need to make the risk equal for each of the pairs' NAV streams. We use 12% annualized volatility as the industry standard.
There are a number of steps needed to equalize the risk of all the pairs that will be traded. This has the consequence of maximizing diversification by avoiding the arbitrary allocation of more or less of the investment risk to any one pair. The first step was to volatility-adjust the two legs so that each stock in the pair had the same risk exposure. The exact way of doing that was given in the section "Different Position Sizes." The next step is to equalize the risk of each pair relative to each other. To do that, scale the number of shares traded in each pair to a level that represents a target volatility, in this case, 12%.
A 12% target volatility is where the annualized standard deviation of the daily returns is equal to 0.12. To get to that number after the fact, based on all data, follow these steps:
Record the daily net profits and losses of both legs of the pairs trade.
Find the standard deviation of the entire series of profits and losses.
Multiply that standard deviation by the square root of 252 in order to annualize.
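The steps so far can be sketched as below. Only the standard deviation and the square root of 252 come from the steps themselves. Expressing the result as a fraction of a hypothetical investment amount, and the scale factor of target volatility divided by realized volatility, are assumptions added to show how the 12% target might be reached; the use of the sample standard deviation is also an assumption.

```python
import math
import statistics


def annualized_volatility(daily_pnl, trading_days=252):
    """Standard deviation of the daily profits and losses,
    multiplied by the square root of 252 to annualize."""
    return statistics.stdev(daily_pnl) * math.sqrt(trading_days)


def scaling_factor(daily_pnl, investment, target=0.12):
    """Assumed final step: the multiplier on position size that would
    have produced the target volatility. `investment` is hypothetical."""
    realized = annualized_volatility(daily_pnl) / investment
    return target / realized


# Hypothetical combined daily net P&L of both legs, in dollars.
pnl = [100.0, -100.0, 100.0, -100.0]
vol_dollars = annualized_volatility(pnl)
factor = scaling_factor(pnl, investment=100_000.0)
```

A factor above 1 means the pair was run at less than the target risk and the share counts would be scaled up; a factor below 1 means they would be scaled down.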