
Trading System Development 101 (Part 7)

Today I’m going to start discussing a data-mining approach to trading system development.

With the walk-forward approach, I have to find strategies and program them. Strategies are available in many places: books on technical analysis and trading strategies, articles, blog posts, vendors, webinars, etc.

Coming up with the strategies can take some work, though. In my experience to date, I started with a general familiarity of basic indicators and some e-books. I tested many of those on 2-3 markets. I now need to do some digging in order to continue along this path.

Another approach to trading system development involves data mining. According to microstrategy.com:

     > Data mining is the exploration and analysis of large data to
     > discover meaningful patterns and rules. It’s considered a
     > discipline under the data science field of study… [that]
     > describes historical data… data mining techniques are used
     > to build machine learning models that power modern AI apps
     > such as search engine algorithms…

I started by purchasing point-and-click software that creates trading strategies without any required programming by me.

The software uses a genetic algorithm to search many possible combinations of entry signals, exit signals, and other exit criteria, forming the best strategies based on selected test criteria and fitness functions (e.g. Sharpe ratio, net profit, profit factor, etc.).

The software will then create tens to hundreds of strategies that meet my criteria. I can view fitness functions, equity curves, different kinds of Monte Carlo analyses, etc.

The software compares trading signals/strategies against random signals/strategies. This allows me to assess the probability that a strategy has edge with predictive value rather than results that could have occurred by chance. While a genetic algorithm curve-fits by design, I don’t want an overfit strategy. A randomly mined baseline (along with buy-and-hold) can serve as a minimum threshold to beat.
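The vendor’s tests are proprietary, but the general idea of a random baseline is easy to sketch. Below is my own minimal illustration (not the software’s method; `daily_returns`, `avg` hold length, and all numbers are hypothetical): compare a candidate strategy’s backtested profit against the distribution of profits from strategies that enter on random days and hold for a comparable duration.

```python
import numpy as np

def random_entry_profits(daily_returns, n_trades, hold_days, n_sims=1000, seed=0):
    """Simulate net profit of strategies that enter on random days and hold
    for a fixed number of days -- a crude random baseline."""
    rng = np.random.default_rng(seed)
    n = len(daily_returns)
    results = []
    for _ in range(n_sims):
        entries = rng.integers(0, n - hold_days, size=n_trades)
        profit = sum(daily_returns[e:e + hold_days].sum() for e in entries)
        results.append(profit)
    return np.array(results)

# Hypothetical usage: placeholder market returns and a placeholder strategy result.
daily_returns = np.random.default_rng(1).normal(0.0003, 0.01, 2500)
baseline = random_entry_profits(daily_returns, n_trades=60, hold_days=10)
strategy_profit = 0.9  # candidate strategy's backtested result (placeholder)
print(f"Strategy beats {(baseline < strategy_profit).mean():.0%} of random-entry baselines")
```

If the candidate barely beats the random distribution, its apparent edge is more likely noise.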

Aside from comparing against random, the software comes pre-packaged with a number of other stress tests that also help to assess whether strategies are homing in on bona fide signal or overfitting to noise. The array of stress tests is impressive. The question is how well they predict which strategies will be profitable going forward. I won’t know that until I find some.

Depending on what particular application is purchased, these packages can do even more. The one I have can build strategy portfolios, track correlations among strategies, and generate full strategy code for different brokerage platforms.

I will continue next time.

What’s the Problem with Walk-Forward Optimization?

I discussed Walk-Forward Optimization (WFO) with regard to trading system development in the fifth paragraph here. My testing thus far has left me somewhat skeptical about the whole WF concept.

I wrote a mini-series about WFO many years ago and explained how it fits into the whole system development paradigm (see here). WFO has many supporters and has been called “the gold standard of trading system validation.”

I have found WFO to be a very high hurdle to clear. I was especially frustrated because, multiple times, an expanded feasibility test (i.e. second example here as opposed to seventh paragraph here) passed whereas WFO generated poor results. WFO basically stitches together trades taken at different times from different standard optimizations that, as a whole, did pretty well (thereby passing expanded feasibility). How could the entire sequence end up losing money, then?

The easy explanation is different pass criteria for feasibility and WFO. In the feasibility phase, I merely require profitability. The TradeStation criteria for passing the WFO phase are a set of configurable numeric thresholds.

Although the particular numbers may be changed, the defaults should give a good idea of what a viable strategy might look like: consistently profitable, no huge drawdowns, and relatively short periods of time between new equity highs.

These criteria are much more stringent than feasibility’s “X% iterations profitable.” This explanation should have satisfied me.

Due to my mounting frustration, however, I couldn’t help but start to rationalize why WFO might be unnecessary for a viable trading system. Here are my thoughts from a few months ago:

     > …aside from generating OS data, which I agree is essential, I think WF
     > screens for an additional characteristic that may not be necessary for
     > real-time profitability. People talk about how managers and asset classes
     > that are the best (worst) during one period end up worse (better) in
     > subsequent periods. WF would reject such mean-reverting strategies due
     > to poor OS performance. Each manager or asset class may be okay to trade,
     > though, as one component of a diversified, noncorrelated portfolio despite
     > the phenomenon of mean reversion… this trainability, for which WFO
     > screens, being altogether unnecessary.

I think it’s an interesting argument: one that can only be settled by sufficient testing.

What’s the alternative without WFO? Probably an expanded feasibility test followed by Monte Carlo simulation.

At this point, I have no practical reason to reject the notion of WFO, especially keeping in mind that I may have been conducting the WFO altogether wrong with the coarse grid (see last paragraph here).

Trading System Development 101 (Part 6)

Today I want to tie up some remaining loose ends.

Performance report details need to be carefully considered because subtle interactions may not give us what we want.

I’d kill (figuratively speaking) for a profit factor (PF) of 2.0, for example, but before confirmation bias sweeps me away I need to look closer. Both of these will get me PF = 2.0: $100K gross profit with $50K gross loss, and $200K gross profit with $100K gross loss. Assuming this is trading one contract with a $100K account, I now know the former (netting $50K), unlike the latter (netting $100K), will not be interesting to me. The latter has a good chance to meet my criteria and be viable.
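To make the arithmetic explicit, a quick sketch (gross losses entered as positive dollar amounts):

```python
def profit_factor(gross_profit, gross_loss):
    """Profit factor = gross profit / gross loss."""
    return gross_profit / gross_loss

for gp, gl in [(100_000, 50_000), (200_000, 100_000)]:
    print(f"PF = {profit_factor(gp, gl):.1f}, net profit = ${gp - gl:,}")
# Both cases print PF = 2.0, but net profit is $50,000 vs. $100,000.
```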

As another example, I need to look closer before getting overly excited about a strategy that generates an average trade of +$1,000. This is much more attractive for an average trade duration of five days than for 50-100 days. The latter will have far fewer trades and less overall profitability. This is worth noting even though most backtesting platforms I have seen do not display average trade per day (as mentioned in the third-to-last paragraph here).

Finally, the interaction between trade duration and sample size was discussed in the third-to-last paragraph here. In Part 4, I mentioned that some people would be happy with a longer-duration strategy. The important statistical point is that trade duration and sample size are inversely related: the longer the average hold, the fewer trades a fixed backtest period can contain.

One advantage of longer duration is lower transaction fees (slippage and commission). Transaction fees (TF) are an enormous enemy of net profits. TF is roughly constant per trade, while longer trades allow for more market movement and potentially larger profits. The adverse impact of TF is therefore inversely related to trade duration. I have to laugh when I think about all the intraday systems I have seen discussed online. I already know the difficulty of finding viable strategies on the daily time frame; viable intraday strategies are probably much harder to find! Combining this rationale with the frequent footnote that so many studies don’t include TF helps it all make sense.

Until your testing proves otherwise, let this be the one takeaway with regard to TF: many strategies that fail on a short time frame have a much better chance of working when trades are held much longer, because the average trade may then be large enough to more than offset transaction fees.
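A back-of-the-envelope sketch of the takeaway, with all numbers hypothetical: a fixed cost per trade consumes a far larger share of a small, short-duration average trade than of a large, long-duration one.

```python
# Hypothetical $15 round-turn transaction fee (slippage + commission).
fee_per_trade = 15

scenarios = {
    "1-day holds":  {"avg_gross_per_trade": 40,  "trades_per_year": 200},
    "20-day holds": {"avg_gross_per_trade": 400, "trades_per_year": 12},
}

for name, s in scenarios.items():
    gross = s["avg_gross_per_trade"] * s["trades_per_year"]
    net = gross - fee_per_trade * s["trades_per_year"]
    print(f"{name}: gross ${gross:,}, net ${net:,}, "
          f"fees eat {fee_per_trade / s['avg_gross_per_trade']:.0%} of each trade")
```

With these made-up numbers, fees consume roughly 38% of each 1-day trade but only about 4% of each 20-day trade.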

Is anyone still enamored with day trading? I hope not.

Next time, I will begin discussion of a different approach to system development.

Trading System Development 101 (Part 5)

In the first four parts of this mini-series (e.g. Part 4), I talked about my walk-forward (WF) approach to trading system development. Before moving forward with a secondary approach, I want to tie up some loose ends.

Finding a viable trading strategy with the WF approach is really difficult (as discussed in the fourth paragraph here). This was a shocking realization. The internet contains numerous blog posts, trade gurus, and education programs all claiming to teach trading. Books on technical analysis, webinars, chat rooms… yet none of the basic strategies that I tested work! I’m skeptical by nature (see second-to-last paragraph here) and now that skepticism has been legitimized.

Nobody should approach any of the above without being prepared to uncover whatever fiction, deception, or omission is being presented. The time to get excited about these things is when, despite concentrated effort, I am unable to find any flaws.

Also from this fourth paragraph, I want to clarify what I meant about “[deceptive]” claims. I included the word in brackets because I consider it a possibility rather than a certainty. The e-books mention different strategies that have “been successful in metals,” or currencies/softs/equities/metals, etc. When testing a few of these myself—even on the indicated markets—I failed to find viable strategies.

While frustrating, this does not necessarily mean the e-books are deceptive because success has not been objectively defined. One winning trade could be considered successful. Any short period of profitability could be considered successful. A strategy could even test profitably over a long period with a large sample size of OS trades. My discovery of said strategy as non-viable could simply represent the difference between when the claim was published and when I tested it.

The footnote included in Part 3 is one potential critique I have about WF optimization (WFO). I would feel more confident about a set of parameters if they were to score well given a surrounding parameter space that also scores well. I would sacrifice some absolute performance in order to get a better surrounding parameter space. WFO simply looks for the best and uses it for the following OS period.
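One way to express that preference concretely (my own sketch, not a feature of any particular platform): score each cell of the optimization grid by the average of its neighborhood rather than by its own value, so isolated spikes lose out to high plateaus. The grid values below are made up.

```python
import numpy as np

def neighborhood_score(results, radius=1):
    """results: 2-D array of objective values over a 2-parameter grid.
    Returns each cell's average over its (2*radius+1)^2 neighborhood,
    so isolated spikes score lower than broad plateaus."""
    padded = np.pad(results, radius, mode="edge")
    out = np.empty_like(results, dtype=float)
    rows, cols = results.shape
    for i in range(rows):
        for j in range(cols):
            out[i, j] = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1].mean()
    return out

# Hypothetical grid of net profits: a lone spike (10) vs. a modest plateau (5s).
grid = np.array([[1,  1, 1, 1],
                 [1, 10, 1, 1],
                 [5,  5, 5, 1],
                 [5,  5, 5, 1]])
print(neighborhood_score(grid).round(1))
```

After averaging, the plateau cells outrank the spike even though the spike has the best raw value, which is exactly the trade-off I describe above.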

I can’t help but wonder whether I need to stop using the coarse grid if I am to continue using the WF approach. I explained this in the third paragraph of Part 3. I aim for 70 or fewer iterations per WFO to minimize processing time, which leaves theoretical opportunity for signal to fall through the cracks. Only by testing the same strategies with a fine grid could I ever know whether I am being victimized by false negatives. Based on reports from others, I should be willing to increase the number of iterations at least 10-fold to do this testing. Such a comparison would be an interesting study.

Trading System Development 101 (Part 4)

Back in Part 3 of this mini-series, I drilled down into some details about feasibility testing. I left off with ways to increase number of trades in order to avoid small sample sizes.

Requiring a decent sample size in feasibility has the controversial consequence of eliminating strategies with long average hold times. Some people would be happy with a strategy that generates, for example, less than one trade every two months if their subjective function criteria are met. This is a matter of personal preference. I think if I am going to test strategies that generate few trades in feasibility periods, then I need to go straight to the full dataset and test. In doing this, I need to be careful to avoid curve fitting: seeing the results, tweaking the strategy, retesting, and repeating.

Another situation where few/zero trades may be generated in feasibility is when testing hedge strategies. Recall the VIX filter previously discussed (second-to-last paragraph here). How many instances do we have of VIX above thresholds from 20 to 50 (in increments of 3) in the last 12 years? Not many, and the ones we do have are largely clustered in time. Most 2-year feasibility periods have zero trades, which makes preliminary assessment difficult. In this case, I would probably scrap feasibility testing altogether and look for more than a small sample size when testing over the whole dataset to avoid curve fitting.

Going back to the Eurostat excerpt, empirical evidence based on OS forecast performance is generally considered more trustworthy than evidence based on IS performance. The latter is more sensitive to extremes and data mining. Because the strategy has not yet been tested on OS data, OS performance better reflects the information available to the forecaster in live trading (i.e. the strategy has not been tested on future data either).

For this reason, the next phase after I find a strategy with at least 70% of iterations profitable is walk-forward analysis (WFA). I described WFA here and included a pictorial representation here. WFO (optimization) is the same thing as WFA except it emphasizes that the specific parameters used for each OS period are determined by an exhaustive optimization over the preceding IS period.
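The mechanics of the splitting are easy to sketch. Below is a generic illustration of rolling walk-forward windows; it is not TradeStation’s implementation, and the 2-year IS / 6-month OS lengths and dates are placeholders.

```python
from datetime import date, timedelta

def walk_forward_windows(start, end, is_days, os_days):
    """Yield (is_start, is_end, os_start, os_end) for rolling walk-forward runs:
    optimize on the IS window, trade the chosen parameters on the OS window,
    then roll both windows forward by the OS length."""
    is_start = start
    while True:
        is_end = is_start + timedelta(days=is_days)
        os_end = is_end + timedelta(days=os_days)
        if os_end > end:
            break
        yield is_start, is_end, is_end, os_end
        is_start += timedelta(days=os_days)

# Placeholder dates: 10 years of data, ~2-year IS, ~6-month OS.
for w in walk_forward_windows(date(2014, 1, 1), date(2024, 1, 1), 730, 182):
    print(w)
```

Stitching together the OS segments from each run produces the walk-forward equity curve that gets judged against the pass criteria.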

If a strategy passes WFO, then the next phase of development is Monte Carlo (MC) simulation, which I discussed here and here. For each simulation, I will compute a ratio of average annualized return to maximum drawdown. I want to see a ratio above a pre-determined threshold to advance the strategy to the next phase.
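As a rough sketch of that MC step (my own simplified version with made-up trade results): resample the backtested trade P&L with replacement, rebuild equity curves, and examine the distribution of the annualized-return-to-maximum-drawdown ratio.

```python
import numpy as np

def max_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve (in dollars)."""
    peaks = np.maximum.accumulate(equity)
    return (peaks - equity).max()

def mc_return_dd_ratios(trade_pnl, start_equity, years, n_sims=2000, seed=0):
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(n_sims):
        sample = rng.choice(trade_pnl, size=len(trade_pnl), replace=True)
        equity = start_equity + np.cumsum(sample)
        ann_return = (equity[-1] - start_equity) / start_equity / years
        dd = max_drawdown(np.concatenate(([start_equity], equity)))
        ratios.append(ann_return / (dd / start_equity) if dd > 0 else np.inf)
    return np.array(ratios)

# Hypothetical trade list from a 10-year backtest on a $100K account.
trade_pnl = np.random.default_rng(1).normal(500, 2000, 150)
ratios = mc_return_dd_ratios(trade_pnl, start_equity=100_000, years=10)
print(f"Median return/DD ratio: {np.median(ratios):.2f}")
```

A pre-determined threshold on this distribution (median, or some lower percentile) is then the go/no-go test.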

The final phase of development is incubation. Here, I will paper trade the strategy (i.e. trade on “sim”). If performance looks to be “within normal limits” based on WFA and MC, then I can start trading it live.

Next time, I will make some comments about this walk-forward approach to trading system development.

Timing Luck

I need to interrupt my overview of trading system development in order to discuss a concept called timing luck.

The more traditional concept of timing luck is the subject of a post by Corey Hoffstein called “The Luck of the Rebalance Timing.” He addresses the difference in equity curves resulting from rebalancing on different days of the month. As discussed here, I think awareness of all possible curves is important just like awareness of all possible trade results based on the surrounding parameter space. In other contexts, timing luck can apply to taking trades on different days of the week, days of the month, or option trades on particular days to expiration.

Most backtesting software I have seen allows for no more than one open position at a time. For any given trading strategy, though, the number of trade triggers is greater than or equal to the total number of trades. If a trade is already open when a trigger occurs, nothing happens.

The particular sequence of trades may depend on the backtest start date. Imagine two triggers occurring one week apart with trades lasting 20 days. If the first trade is taken, then the second trigger will be skipped. If I start the backtest a few days later (after the first trigger but before the second), then the first trade is skipped because the backtest had not yet begun and the second trade is taken instead.

The essence of timing luck is that the exact sequence of trades determines the equity curve. With more trade triggers than total number of trades, multiple potential equity curves exist. Why should one equity curve blindly constitute the backtest when it may not be typical of the distribution of all equity curves? Better than average performance may be fortuitous and due to nothing repeatable going forward (i.e. not signal but noise, to which I do not want to fit).
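A toy illustration of the mechanism (entirely hypothetical trigger days): with one position at a time and 20-day holds, shifting the backtest start date changes which triggers get taken and therefore which equity curve I see.

```python
def trades_taken(trigger_days, start_day, hold_days=20):
    """One position at a time: a trigger is taken only if no trade is open
    and it occurs on or after the backtest start date."""
    taken, busy_until = [], start_day
    for day in trigger_days:
        if day >= busy_until and day >= start_day:
            taken.append(day)
            busy_until = day + hold_days
    return taken

triggers = [5, 12, 40, 47, 80]          # hypothetical trigger days
for start in (0, 7):                    # start the backtest on day 0 vs. day 7
    print(f"start day {start}: trades entered on days {trades_taken(triggers, start)}")
# Different start dates take different subsets of the same triggers,
# producing different trade sequences and different equity curves.
```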

With options, I developed a backtesting approach to solve the timing luck conundrum. Rather than backtesting one open trade at a time, I opened positions on every trading day and tracked them all in a spreadsheet. Unfortunately it took months for me to run one of these backtests, which is why I wrote about serial vs. multiple/overlapping trades here and here.

I think one could make a case for multiple/overlapping-trade backtesting being just as important as the more common serial approach. The former factors in all potential trade triggers, similar to a Monte Carlo simulation, taking into account many more potential equity curves than the single curve generated by one particular backtest.

Monte Carlo simulation is part of my trading system development process, which I will be writing more about in future posts.

Trading System Development 101 (Part 3)

Choice of subjective function is just the tip of the iceberg. As the last two posts have made clear, trading system development can be a very individual process.

Granularity of the variable grid is another important detail that has no right answer. A coarse (less granular) grid has fewer potential variable values whereas a fine (more granular) grid will have more. Testing from 10-30 by 10 gives me three values to test (10, 20, 30) whereas testing from 10-30 by 2 gives me 11 values to test. A more granular grid will result in more iterations (third-to-last paragraph here) and more processing time.
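A quick sketch of how granularity multiplies iteration count (the ranges mirror the example above; two optimized variables are an assumption of mine):

```python
import numpy as np

coarse = np.arange(10, 31, 10)   # 10, 20, 30       -> 3 values
fine   = np.arange(10, 31, 2)    # 10, 12, ..., 30  -> 11 values

# Assuming two optimized variables, iterations are the product of grid sizes:
print(len(coarse) ** 2, "coarse iterations vs.", len(fine) ** 2, "fine iterations")
# 9 vs. 121 -- more than a 13-fold jump in processing time for the same strategy.
```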

If I want to differentiate between spike and high plateau regions, then I really need increased granularity.* Recall the diagram in the last blog post linked. If one [variable] value tests well, then a high plateau region means an adjacent one should too. This makes more sense for 10 or 14 relative to 12 than it does for 10 or 30 relative to 20. Ten or 14 are only 17% away from 12 whereas 10 or 30 are 50% away from 20. Qualitatively, 17% away may be “similar” while 50% away is apples and oranges rendering exploration of the surrounding space impossible. Signal can fall right through the cracks with a coarse grid.

Feasibility testing is where I can tweak the strategy to get a good result. This is curve-fitting on a small percentage of the total data. If this results in an overfit model, then the strategy will be subsequently rejected when I test on a larger portion of data.

I found number of trades to be a recurring challenge with feasibility testing. Basic statistics (see third-to-last paragraph here) dictates a preference for larger sample sizes when evaluating test results. I don’t feel confident in the outcome when one or two routine trade results can drastically skew it. I’m not looking for an enormous number of trades since this is just a 2-year feasibility test, but if I get too few then I’ll be concerned. What minimum number to accept—all together now!—has no right answer.

I have a few ideas on how to increase the number of trades. A strategy that closes or reverses direction when the opposite trigger occurs is fine as long as the triggers occur with some regularity; if only 1-2 triggers occur per year, then over two years I may get a tiny number of trades. I’ve tried implementing profit targets, stop losses, and m-day exits to increase the number of trades. Feasibility testing is the time to do this; if I get a tiny sample size of trades over the full dataset, then I must junk the strategy and move on to avoid curve fitting.
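As an illustration of one of those exits, a time-based (m-day) exit is easy to bolt onto a trade list. This is a generic sketch with made-up day numbers, not code from my backtesting platform.

```python
def apply_max_hold(entry_days, exit_days, max_hold):
    """Pair each entry with the earlier of its signal-based exit or an
    m-day time exit; capping hold time frees the strategy to take more triggers."""
    trades = []
    for entry, signal_exit in zip(entry_days, exit_days):
        trades.append((entry, min(signal_exit, entry + max_hold)))
    return trades

# Hypothetical entries and (possibly distant) signal-based exits:
print(apply_max_hold([10, 60, 200], [55, 150, 400], max_hold=20))
# [(10, 30), (60, 80), (200, 220)] -- every exit capped at 20 days.
```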

* — Of course, if subsequent steps in the development process don’t do this then why be concerned at all?

Trading System Development 101 (Part 2)

Today I am discussing unknowns in the feasibility testing phase.

Variable range selection can significantly affect results and may have no correct answer. If I have a short-term signal and I test a strategy over a short-term range (e.g. 10-30), then I am more likely to hit the critical value of 70% profitable than if I test over a mixed range (e.g. 10-90) or a longer-term range (e.g. 30-90).

Number of iterations can also significantly affect results and may have no correct answer. In percentage terms, the success or failure of a single iteration contributes 2% of the total with 50 iterations vs. 0.2% of the total with 500 iterations (against the 70% target).

What segment of the data to use for feasibility is another important detail that may have no right answer. Doing feasibility testing on one 2-year period that represents a particular market environment may generate significantly different results than feasibility testing on another 2-year period. In trying to find a strategy for any given futures market, feasibility testing over multiple environments would be ideal.

Testing over multiple market environments first requires a listing of said environments. This is a subjective task (vulnerable to hindsight bias) that also has no right answer. Were I to pursue this, then I should also determine how often the different environments occur. This might feed back to help determine whether this is worth doing at all (e.g. is it worth testing on exceedingly rare market conditions?).

In my testing thus far, I have eschewed all this and simply chosen to rotate the 2-year feasibility period. How I do the rotation probably doesn’t matter as long as I do it to give strategies suited to different market environments a chance. If I end up testing 100 strategies on a futures market, then maybe I test 10 on each 2-year period within the full 10 years. I will have false negatives, as I discussed in Part 1, but such remains an inevitable reality of this approach.

I have to be a bit lucky to get a strategy to pass feasibility testing, which brings to mind two possibilities. First, it’s okay to rely on some luck and incur some false negatives since I have an infinite number of potential strategies to test. Because feasibility failure decreases the possibility that I’m dealing with a viable strategy (see last long paragraph in Part 1), I should feel good about moving on to the next candidate and minimizing wasted time.

Alternatively, perhaps some method exists that eliminates false negatives (and any reliance on luck) by testing everything. I think this would require an enormous amount of processing power (and programming), though. I already encounter processing delays with my mediocre hardware: a 10-year backtest with 70 iterations takes up to 25 minutes. Many people backtest with hundreds (or even thousands) of iterations, which would take my computer all night to run.

To feasibility test or not to feasibility test: the two paths differ in how time will be spent. With feasibility testing, extra time goes to testing additional strategies because some viable strategies are inadvertently dismissed. Without feasibility testing, extra time goes to testing strategies over the much longer full dataset that feasibility would otherwise have dismissed quickly.

I will continue next time.

Trading System Development 101 (Part 1)

Today is the day I start talking more specifically about the trading system development process.

Numerous articles on trading system development methodology can be found on the internet. For this reason, I won’t go into extreme detail. I do hope to delve into a few deeper theoretical issues.

As discussed here, I am trying to come up with a model that has a high signal-to-noise ratio (SNR). The Eurostat website says:

     > Statistical tests of a model’s forecast performance are commonly conducted
     > by splitting a given data set into an in-sample [IS] period, used for the initial
     > parameter estimation and model selection, and an out-of-sample [OS] period,
     > used to evaluate forecasting performance.

I previously showed how I can use IS data to make a backtested strategy look really good. This is what I want to avoid. Ultimately, the only thing I care about is that the strategy perform well on OS data, which has yet to be seen.

I expect a model that trains on some data (IS) to test well on that data (IS). I need to find out whether this correlates with how it will test on future data (OS). To do this, I train (develop) the model on IS data and test the model on OS data.

I start with a feasibility study, which will be limited to a small percentage of the total data available. Maybe I designate two out of 10 total years of data for feasibility testing. For each variable in the strategy, I then define ranges and test the strategy across those ranges. I discussed this in the second-to-last paragraph here. I will then deem a strategy worthy of further consideration if at least 70% of the iterations are profitable (using net income as my subjective function).
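The pass criterion itself is simple to express. A minimal sketch with a made-up list of per-iteration results:

```python
def passes_feasibility(iteration_net_profits, min_pct_profitable=0.70):
    """Pass if at least 70% of optimization iterations are net profitable."""
    profitable = sum(1 for p in iteration_net_profits if p > 0)
    return profitable / len(iteration_net_profits) >= min_pct_profitable

# Hypothetical net profits from a feasibility optimization run:
results = [1200, -300, 450, 800, -150, 2200, 90, 600, -75, 1300]
print(passes_feasibility(results))   # 7 of 10 profitable -> True
```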

Understand that a feasibility study includes arbitrary features that should be determined ahead of time by the individual developer. How many years should be used for feasibility testing? I said “small percentage;” there is no right answer. What percentage of iterations must test well in order to pass feasibility? I said 70%; there is no right answer. What fitness function to use? No right answer. Which years of that small percentage will be selected for feasibility testing? I will discuss this one in further detail later, but here’s a sneak peek: no right answer.

The whole notion of feasibility study is not without critique due to the possibility of false negatives. No strategy performs well at all times, and I will miss out on a viable strategy if the feasibility study happens to be a portion of time where the strategy lags. I’m playing the probabilities* here. A viable strategy is more likely to perform well during more periods. Given this, a viable strategy is more likely to pass feasibility. None of this is ever a guarantee.

I will continue discussion of other unknowns in the next post.

* — The best we can ever hope for when it comes to trading and investing.

Fire and Fortitude in Algorithmic Trading (Part 2)

In order to be an algorithmic trader, I think one must have a burning desire (fire) to beat the market as well as the fortitude to plod through significant failure with regard to strategy development.

I left off with mention of KD’s approach to algorithmic trading. He also says discovery of the first viable strategy may take an extra-long time, with subsequent viable strategies becoming easier to find.

I think one of the biggest challenges to trading system development is sticking with the process and maintaining the drive to continue despite serial failure. I’m here to tell you: getting knocked over the head so many times is absolutely brutal. As a pharmacist, most everything I did either accomplished something or got me one step closer. With trading system development, I need to reprogram myself to get positive reinforcement from failure. I need to regard every rejected strategy as being one step closer to viability.

In addition to serial failure, I think the specter of scam creates a harsh double whammy that few people can overcome. Until I find a viable strategy on my own, I have only others to trust. As failure mounts, the belief that the process is somehow flawed becomes harder to resist. Maybe the talk of viable strategies is all just a good marketing story used to pad the pockets of people like KD. How do we know he has actually found success himself? This is consistent with the comments from [11]. I have reached out to almost 15 algorithmic traders to share experiences and have gotten only two responses. I would not be surprised if most spin their wheels for a time and subsequently exit the arena.

I am not saying KD is a con artist, but like the broader financial industry, he does have a strong underlying motive to get us to believe. If nobody believes, then his business selling trading education will struggle. For the broader financial industry, when people stop believing and sell everything, investment advisors experience a severe decline in assets under management and generate less revenue. They therefore have an inherent conflict of interest when recommending clients “stay the course” and ride out market turbulence.