Option Fanatic
Options, stock, futures, and system trading, backtesting, money management, and much more!

Short Premium Research Dissection (Part 2)

I continue today with my critique of some recently-purchased research. This tells a story that should be interesting and compelling to option traders. For this reason, I don’t mind if this mini-series is lengthy.

Skipping ahead a couple sections, this table breaks down straddle data into four categories of VIX level with an equal number of occurrences in each:

Short Premium Table 3 (saved 11-16-18)

Inclusive dates and number of occurrences need to be given every time. The section mentioned in Part 1 said “since 2005” whereas text in this section says “since 2007.” Which is it?

Text in this section says “analyzing these four ranges across four different entry time frames would leave us with an overwhelming amount of data, so we’ll be using the 60-day straddles for this analysis.” I think we can therefore assume 60 DTE.* As a critique, though, I never want to see weak excuses like this when I’ve paid good money to get something. If you told me “data are condensed for simplicity” ahead of time, then I would never have trusted the research enough to purchase it. This screams “I’m too lazy.” Data should be condensed only when there is a valid reason to do so.

Row 4 is confusing because it contradicts the first table presented in Part 1 where the worst loss was stated to be -390%. These numbers aren’t anywhere close to that. Row 5, which gives the 10th %’ile to help define the distribution, is more consistent with -390%.

I miss the standard of practice for reliable methodology reporting seen in peer-reviewed scientific literature. The “methods” section of a peer-reviewed article lays out precisely-defined steps that could be followed to repeat the study. I want to see a recap of the methodology used to generate every table and graph shown in a backtesting report (see fifth paragraph here). If presented in the same section, then restatement may not be warranted. At the very least, the first table or graph in every section should be accompanied by well-defined methodology.

If I don’t even have mention of methodology to suggest the author’s head is in the game, then I have added reason to suspect sloppy and inconsistent work. Especially given the fact that people I meet in the trading community are complete strangers, I do not want to assume a thing. It’s my hard-earned money on the line and with all the stiff disclaimers up front, the author is not accountable. This is another reason why I would prefer to do my own research.

* I still want to see number of occurrences for confirmation. I want to see redundant statistics all over
   the place to cross-confirm things, really.

Short Premium Research Dissection (Part 1)

I recently purchased research from someone on short premium strategies. In this mini-series I will go through the research with a fine-tooth comb and critique it.

I am going to conduct this analysis in much the same way I have previously dissected investment presentations. Many of these offer mouthwatering conclusions. I think it’s important to smack ourselves whenever we get a whiff of something sounding like the Holy Grail. We’re all looking for the Grail and we hope to find it even though, in a separate breath, experienced traders would admit no such Grail exists. Being overtaken by confirmation bias (mentioned twice in the final four paragraphs here and in the fourth paragraph here) can amount to some expensive trading tuition, indeed.

The first thing she does is present basic data on short straddles:

Short Premium Table 1 (saved 11-16-18)

This is an interesting table. I always like to see some reference to “worst trade.” She includes 10th %’ile PnL, which is a great way of giving clues about the PnL distribution. Alone, the average (mean or median) tells very little if the distribution is not Normal. I would have liked to see more percentile data—maybe 5th %’ile (~ 2 SD) and 20-25th %’ile (~ 1st quartile)—to better define the lower tail.
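To illustrate why extra percentiles help when the distribution is skewed, here is a minimal sketch that summarizes the lower tail of a list of trade results. The function name `lower_tail_summary` and the sample PnL values are hypothetical, invented for illustration only; they are not data from the research.

```python
import statistics

def lower_tail_summary(pnl_pct, percentiles=(5, 10, 25)):
    """Return mean, median, worst trade, and selected lower percentiles of trade PnL."""
    # quantiles(n=100) returns the 99 cut points for the 1st through 99th percentiles
    cuts = statistics.quantiles(pnl_pct, n=100, method="inclusive")
    summary = {
        "mean": statistics.mean(pnl_pct),
        "median": statistics.median(pnl_pct),
        "worst": min(pnl_pct),
    }
    for p in percentiles:
        summary[f"p{p}"] = cuts[p - 1]
    return summary

# Hypothetical PnL values (% of credit received), invented for illustration only
sample_pnl = [35, 42, 18, -12, 55, 27, -95, 40, 31, -210,
              48, 22, 15, -30, 60, 38, 25, -5, 44, 29]
tail = lower_tail_summary(sample_pnl)
for name, value in tail.items():
    print(name, round(value, 2))
```

With a few large losers in the sample, the mean sits well below the median, which is exactly the situation where the average alone says little and the percentiles carry the information.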

Unfortunately, the methodology used to generate this table is not adequately explained. She writes “2005 to present.” When in 2005? The ending date is also unknown (the document has no publication date, either). The table implies trades were taken every month at the specified DTE but I need to see total number of trades for verification. Anything more than 30 DTE would result in overlapping trades and much of the document discusses nonoverlapping trades. This could result in a significant sample size difference (larger sample sizes are more meaningful).
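The sample size concern can be made concrete with some rough arithmetic. This is only a sketch under stated assumptions: one entry per month, 30-day months, a hypothetical backtest length of 156 months, and illustrative DTE values. None of these numbers come from the research, which is the point of the complaint.

```python
def count_trades(n_months, dte, days_per_month=30):
    """Compare trade counts when entering monthly vs. only after the prior trade expires."""
    # Overlapping: a new trade is opened every month regardless of open positions
    overlapping = n_months
    # Number of monthly entry slots each trade occupies (ceiling division)
    months_held = max(1, -(-dte // days_per_month))
    # Nonoverlapping: the next trade starts only once the previous one has expired
    nonoverlapping = -(-n_months // months_held)
    return overlapping, nonoverlapping

# Hypothetical 13-year backtest (156 months) at assumed DTE values
for dte in (30, 60, 90, 120):
    print(dte, count_trades(156, dte))
```

Under these assumptions, doubling the DTE halves the nonoverlapping sample, so the same table could rest on very different trade counts depending on which convention was used.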

Speaking of significant differences, no statistical analysis was applied in this document. This is very unfortunate for reasons discussed here and here. Without number of occurrences and p-values, we cannot tell whether differences between descriptive statistics are real, no matter how large they appear.
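As one example of the kind of analysis that could accompany such tables, here is a minimal sketch of a two-sided permutation test on the difference in mean PnL between two groups of trades. This is a generic technique chosen for illustration, not something the research used; the function name and the sample numbers are hypothetical.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean PnL between two groups."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Reassign trades to the two groups at random
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical trade results for two categories, invented for illustration only
low_vix = [12, 8, -30, 15, 10, -5, 9, 14]
high_vix = [25, -60, 30, 22, -10, 28, 35, 18]
print(permutation_p_value(low_vix, high_vix, n_resamples=2000))
```

The test needs the individual trade results (or at least the number of occurrences and a measure of spread), which is why those belong in the report alongside the averages.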

This table includes more interesting exploratory data:

Short Premium Table 2 (saved 11-16-18)

Row 1 suggests a profitable trade, but whether the average trade is profitable depends on the magnitude and distribution (think histogram) of losses. Rows 3 and 5 give insight into the distribution of maximum favorable excursion (see second paragraph here), which can be used to determine profit target. Why set a 50% profit target if only 10% of trades ever reached that level? Rows 4 and 6 give temporal context, which speaks to annualized returns.
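The profit target question can be checked directly once maximum favorable excursion (MFE) is recorded per trade. A minimal sketch follows; the function name and the MFE values are hypothetical and invented for illustration only.

```python
def fraction_reaching(mfe_pct, target_pct):
    """Share of trades whose maximum favorable excursion reached the profit target."""
    hits = sum(1 for m in mfe_pct if m >= target_pct)
    return hits / len(mfe_pct)

# Hypothetical MFE values (% of credit received), invented for illustration only
sample_mfe = [5, 12, 18, 25, 25, 30, 40, 55, 60, 80]
for target in (10, 25, 50):
    print(target, fraction_reaching(sample_mfe, target))
```

Scanning a range of targets this way shows how quickly the hit rate falls off, which is the information needed before settling on any one target.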

I will continue next time.

Mining for Trading Strategies (Part 3)

I continue today with two more random simulations run on short CL.

Mining 5 is a repeat of Mining 4 (results of which I covered last time) with the same minor change I made to exit criteria. Also as mentioned in Part 2, I am now running the Randomized OOS x3 for confirmation (perhaps two would be sufficient?).

Out of the top 32 (IS) strategies, four demonstrated a lower Monte Carlo (MC) analysis average drawdown (DD) than backtested DD (2007-2015). Three of the four passed Randomized OOS and none passed incubation (2015-2019).

Focusing primarily on Randomized OOS, 14 of the top 33 passed but none passed incubation. Three of the 14 are strategies that passed the MC DD criterion just mentioned.

Because this was not encouraging, I re-randomized the entry criteria, made one change to the exit criteria (see Mining 6), and ran another simulation.

Focusing primarily on the MC DD criterion, 13 out of the top 32 (IS) strategies had a lower average MC DD than backtested DD. Only four of those 13 passed Randomized OOS, and one of those four also passed incubation.

Focusing primarily on Randomized OOS, 12 of the top 32 strategies passed and four of those 12 also passed the MC DD. The only strategy in this simulation to pass incubation was one of those four.

To recap the last few posts, I have run six simulations thus far with my latest methodology. The first simulation was long. This generated six strategies that passed incubation and two that were close (PNLDD 1.70/PF 1.42 and PNLDD 1.98/PF 1.37). The last five simulations were short. Mining 3 produced two strategies that passed incubation. Mining 4 and Mining 6 produced one strategy each that passed incubation.

In other words, Mining 1 was prolific while Mining 2 through Mining 6 were relatively dry. Why might this be?

I am concerned that passing incubation is not so much a matter of whether the strategy is robust as it is a matter of whether the incubation period is favorable for the strategy. If this is true, then should incubation really be the final arbiter? I can imagine a situation where a strategy passes incubation but does not pass either of the other test periods (four years each of IS and OOS); am I to think this strategy is any better than those from my simulations that don’t pass incubation?*

Maybe number of periods passed (e.g. with PNLDD > 2.0 and PF > 1.3) is most important. With each period being slightly different in terms of market environment [whether quantifiable or not], strategies that pass more periods would seem to be most likely to do well in the future when market conditions are likely to repeat (in sufficiently general terms).
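A tally of periods passed could be sketched as follows. The definitions are assumptions on my part: PF is taken as gross profit divided by gross loss and PNLDD as net profit divided by maximum drawdown of the closed-trade equity curve, which may differ from how the software computes them. The sample trade lists are hypothetical.

```python
def profit_factor(trades):
    """Gross profit divided by gross loss (assumed definition of PF)."""
    gains = sum(t for t in trades if t > 0)
    losses = -sum(t for t in trades if t < 0)
    if losses == 0:
        return float("inf") if gains > 0 else 0.0
    return gains / losses

def max_drawdown(trades):
    """Largest peak-to-trough decline of the cumulative PnL curve."""
    equity = peak = worst = 0.0
    for t in trades:
        equity += t
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst

def pnl_to_dd(trades):
    """Net profit divided by maximum drawdown (assumed definition of PNLDD)."""
    dd = max_drawdown(trades)
    net = sum(trades)
    if dd == 0:
        return float("inf") if net > 0 else 0.0
    return net / dd

def periods_passed(periods, min_pnldd=2.0, min_pf=1.3):
    """Count the periods in which both thresholds are exceeded."""
    return sum(
        1 for trades in periods
        if pnl_to_dd(trades) > min_pnldd and profit_factor(trades) > min_pf
    )

# Hypothetical per-period trade results (e.g. IS, OOS, incubation), for illustration only
sample_periods = [
    [100, -50, 200, -50, 150],
    [100, -150, 100, -20],
    [50, -40, 60, -40, 70],
]
print(periods_passed(sample_periods))
```

Ranking strategies by this count, rather than by incubation alone, treats every period as one more vote on robustness.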

This relates back to walk-forward optimization (WFO), but remains slightly different. In WFO, I get multiple test runs by sliding the window forward the length of the OOS period. The big difference there is that the rules can change with each run. What I really want to study are rolling returns (overlapping or not is a consideration) of the same strategy and then select strategies that pass the most rolling periods. Is it possible this could happen by fluke? If so then this approach would be invalidated.
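The overlapping versus nonoverlapping choice comes down to the step size between windows. Here is a minimal sketch that slices one fixed strategy's results into rolling windows and counts the profitable ones. The function names, the window lengths, and the sample values are hypothetical, and positive PnL is used as a stand-in for whatever pass criterion is eventually chosen.

```python
def rolling_windows(values, window, step):
    """Yield consecutive slices of length `window`; step < window gives overlapping windows."""
    for start in range(0, len(values) - window + 1, step):
        yield values[start:start + window]

def profitable_windows(period_pnl, window, step):
    """Return (number of windows with positive total PnL, number of windows)."""
    windows = list(rolling_windows(period_pnl, window, step))
    wins = sum(1 for w in windows if sum(w) > 0)
    return wins, len(windows)

# Hypothetical per-period PnL for a single fixed strategy, invented for illustration only
sample_pnl = [5, -1, -6, -2, 3, 4, 2, -1]
print(profitable_windows(sample_pnl, window=4, step=4))  # nonoverlapping
print(profitable_windows(sample_pnl, window=4, step=2))  # overlapping
```

Overlapping windows give more observations but they share data, so their pass counts are not independent; that matters for the question of whether a high tally could happen by fluke.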

Another possibility is to seek out a fitness function that reflects equity curve consistency. I need to research whether I have this at my disposal and consider what similarities exist compared to a tally of rolling returns.

* — Realize that I don’t incubate unless the strategy does well OOS, passes Randomized
       OOS (and/or MC DD), and is reasonably good IS (else it would not have appeared in
       the first place and/or would not pass Randomized OOS per third paragraph here).

Mining for Trading Strategies (Part 2)

Today I am going to mine for more trading strategies using the same procedure presented in Part 1.

Today’s strategies will be short CL. See Mining 2 for specific simulation settings.

Seven out of the top 20 (IS) strategies passed the Randomized OOS test (see fourth paragraph here). None of the seven had a PNLDD > 1 in the 4-year incubation period, or a PF > 1.24.

Six of the top 30 had an average Monte Carlo drawdown (DD) less than backtested DD. Only two of the six passed Randomized OOS. I did not run the others through incubation.

Not happy with this, I ran another simulation (see Mining 3) for short CL strategies the very next day.

Nine out of the top 28 (IS) strategies passed Randomized OOS. With PNLDD > 2 and PF > 1.30, two of these nine passed incubation. Five of these nine had an average Monte Carlo DD < backtested DD, but none of those passed incubation. Thirteen of the top 28 strategies had an average Monte Carlo DD < backtested DD, but none of these passed incubation.

Still not impressed, I changed the exit criteria slightly for my next short CL simulation (see Mining 4).

Twenty-two of the top 31 (IS) strategies passed Randomized OOS. One of these 22 passed incubation (PNLDD 2.44, PF 1.57). Five of these 22 had an average Monte Carlo DD < backtested DD, but none of the five passed incubation.

Running this simulation generated 22 strategies that passed Randomized OOS but only one that passed incubation (and none that passed those + Monte Carlo DD). I questioned the utility of looking at so many Randomized OOS graphs since such a small percentage pass but on second thought, viewing more is necessary for exactly that reason.

I have seen strategies pass Randomized OOS once but fail to repeat. With Monte Carlo DD, I run the simulation three times for confirmation. I will do the same for Randomized OOS going forward. I also thought about requiring both OOS and IS equity curves to be all above zero for Randomized OOS. The hurdle is high enough already, though, so I will hold off on the latter and just focus on requiring multiple passing Randomized OOS results to confirm.

Mining for Trading Strategies (Part 1)

On the heels of my validation work with the Noise Test and Randomized OOS, I am going to proceed with a new methodology to develop trading systems.

I built today’s strategies in the following manner:

The incubation criteria are nothing magical. I found a couple handfuls of decent-looking strategies and settled upon these numbers after seeing the first few (the numbers were actually lowered somewhat by the end). Also, more than anything else at this point I am trying to gauge whether Randomized OOS is at all helpful to screen for new strategies; a specific critical value will hopefully be determined in the future.

Of the top 28 strategies (all had PNLDD > 3.3 OOS and PF > 1.48 OOS), 21 passed Randomized OOS. Many of these satisfied DV #2 for the IS portion as well, but I did not require that in order to pass.

Nine of 21 strategies met the lowered incubation criteria with PNLDD > 1.68 and PF > 1.28.

On the Monte Carlo Analysis, I look for average drawdown (DD) to be less than the backtested strategy DD. This is mentioned by the software developers as a metric to provide confidence that performance statistics are not artificially inflated due to luck. I have not yet tested this, but I am monitoring it.
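The comparison can be sketched outside the software. The software's actual Monte Carlo procedure is not described here, so the sketch below substitutes a plain reshuffle of trade order, which is one common approach and only an assumption about what the tool does. The function names and the sample trades are hypothetical.

```python
import random

def max_drawdown(trades):
    """Largest peak-to-trough decline of the cumulative PnL curve."""
    equity = peak = worst = 0.0
    for t in trades:
        equity += t
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst

def average_mc_drawdown(trades, n_sims=1000, seed=0):
    """Average max drawdown over random reorderings of the same trades."""
    rng = random.Random(seed)
    shuffled = list(trades)
    total = 0.0
    for _ in range(n_sims):
        rng.shuffle(shuffled)
        total += max_drawdown(shuffled)
    return total / n_sims

# Hypothetical backtest where the three losses happen to cluster, for illustration only
sample_trades = [100, 100, 100, -50, -50, -50, 100, 100]
backtested_dd = max_drawdown(sample_trades)
mc_dd = average_mc_drawdown(sample_trades, n_sims=500)
print(backtested_dd, mc_dd, mc_dd < backtested_dd)
```

In this sample the backtest already contains the worst possible ordering, so the average reshuffled drawdown comes out lower. A backtest whose losses happened to be spread out favorably would show the opposite.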

In the current simulation, zero of 10 strategies that met the lowered incubation criteria had Monte Carlo DDs less than backtested. None of the other 11 strategies that passed Randomized OOS did either.

I have no major takeaways right now since I am early in the data collection stage. What percentage of strategies pass Randomized OOS? What percentage of strategies have MC DDs less than backtested? What percentage of strategies go on to pass incubation? What kind of performance deterioration can I expect going from IS to OOS to incubation? How often will I find a strategy that does not follow this pattern?

The software is advertised to come standard with an arsenal of tools capable of stress testing strategies. If passing those stress tests is not correlated with profitable strategies, then we will have an ugly disconnect.

For now, though, all I need are more simulations, more samples, and more data.

Testing the Randomized OOS (Part 4)

I ran into a snafu last time in trying to think through validation of Randomized OOS. Today, let’s try to get back to basics.

The argument for Randomized OOS seems strong as a test of OOS robustness for different market environments. By analyzing where the original backtest fits within the simulated distribution (DV #1), I should be able to get a sense of how fair the OOS period is and whether it contributes positively or negatively to OOS results. Also, if all simulations are above zero (DV #2) then I feel more confident this strategy is likely to be profitable during the time period studied.
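The two DVs could be scored numerically rather than by eye. This is a minimal sketch under simplifying assumptions: each simulation is reduced to its final equity, DV #1 is expressed as the fraction of simulations ending below the actual backtest, and DV #2 checks final equity only rather than the whole curve. The function name and sample values are hypothetical.

```python
def score_randomized_oos(actual_final, simulated_finals):
    """Return (dv1, dv2) for one strategy.

    dv1: fraction of simulated final equities below the actual OOS result
    dv2: True if every simulated final equity is above zero
    """
    below = sum(1 for s in simulated_finals if s < actual_final)
    dv1 = below / len(simulated_finals)
    dv2 = all(s > 0 for s in simulated_finals)
    return dv1, dv2

# Hypothetical final equities from six simulations, invented for illustration only
sample_sims = [10, 20, 30, 40, 50, 60]
print(score_randomized_oos(35, sample_sims))
print(score_randomized_oos(65, sample_sims))
print(score_randomized_oos(35, sample_sims + [-5]))
```

A low dv1 means the actual OOS period was relatively unkind to the strategy compared with the simulations, while a dv1 near 1.0 suggests the OOS period flattered it.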

In the same breath, Randomized OOS is a reflection of IS results. The better IS performance, the greater the chance for better scores on DV #1 and DV #2. I could look at the total equity curve and separately evaluate IS vs. OOS, but I think the stress test may portray this more clearly.

To make for a viable strategy, I want the actual backtested OOS equity to be in the lower 2/3 of the simulated distribution and a Yes on DV #2. I also want to see decent IS performance, but the latter is probably redundant if I am looking at Randomized OOS. My study, then, is to determine whether strategies that pass Randomized OOS are more likely to go on to produce profitable results in the future (similar to this third paragraph).

Perhaps the highest-level study I can do with the software is to build the best strategies and see what percentage proceed to do well* afterward. Since the software builds strategies based on IS results, I could save time by testing on IS and looking to see what percentage of best strategies do well OOS. This could serve as a benchmark for what percentage of best strategies that also meet stress testing criteria go on to do well. The big challenge is to find strategies that pass the stress tests. This is also the most time-consuming activity.

The latter process, though, is probably already shortened now that I have rejected the Noise Test. Related future studies include exploration of the merits of MC simulation and MC drawdown.

* — Operational definition required. “Do well” could mean positive PNL or
       some minimal score on other fitness functions.

Testing the Randomized OOS (Part 3)

Today I continue discussion of my attempt to validate the Randomized OOS stress test.

As I started scoring the OOS graphs, I quickly noticed the best (IS) strategies were associated with all simulated OOS equity curves above zero (DV #2 from Part 1). This seemed much different than my experience validating the Noise Test. I realize comparing the two is not apples-to-apples (i.e. different methodology of stress test, 100 vs. 1,000 simulations, etc.). Nevertheless, this caught my attention since only ~50% (85/167) of the Noise Tests analyzed showed the same thing.

I then realized IS performance directly affects the OOS graph in this test! The simulated OOS equity curves are a random mashup of IS equity and OOS equity. If the IS equity (or any fitness function) is really good then the simulated OOS is going to have a positive bias. If the IS equity is marginal, then it’s going to have a much weaker [but still] positive bias.
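The bias is easy to reproduce with made-up numbers. The sketch below assumes the simulated OOS is built by drawing, without replacement, as many trades as the real OOS contains from the pooled IS and OOS trades. That is my reading of "random mashup" and may not match the software's exact procedure; the function name and trade values are hypothetical.

```python
import random
import statistics

def simulated_oos_finals(is_trades, oos_trades, n_sims=1000, seed=0):
    """Final PnL of simulated OOS periods drawn from the pooled IS + OOS trades."""
    rng = random.Random(seed)
    pooled = list(is_trades) + list(oos_trades)
    k = len(oos_trades)
    return [sum(rng.sample(pooled, k)) for _ in range(n_sims)]

# Hypothetical trades: the same losing OOS paired with a strong IS and a marginal IS
losing_oos = [-10] * 20
strong_is = [30] * 40
marginal_is = [6] * 40

strong_mean = statistics.mean(simulated_oos_finals(strong_is, losing_oos))
marginal_mean = statistics.mean(simulated_oos_finals(marginal_is, losing_oos))
print(sum(losing_oos), marginal_mean, strong_mean)
```

The actual OOS result is identical in both cases, yet the simulated distribution sits far higher when the IS trades are strong, and still somewhat higher when they are marginal.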

I figured my mistake was that I needed to be scoring the IS, not OOS, graphs. I would then be seeing if the best versus not-so-great strategies are associated with any significant difference in DVs #1 and #2. I realized, too, that not all IS strategies had associated OOS data that met my minimum trade number criterion (60). When the criterion was not met, attempting to run the Randomized OOS test produced an error message, forcing me to find another strategy instead. This took more time, but I was able to get through it.

For the same reasoning described two paragraphs above, I now believe this approach to be flawed as well. My best and worst strategies are associated with an unknown variance in OOS performance. This OOS variability prevents me from establishing a direct link between any observed differences and quality of (IS) strategy. All observed differences are due to some unknown combination of IS and OOS variability.

Doing the study this way would require collection of additional data on OOS performance to compare consistency between the groups. A brief review shows 61 out of 167 (36.5%) strategies with profitable OOS periods (and I should probably go through and estimate the exact PnL to get more than nominal data). The higher the OOS PnL, the more upward bias I would expect on the IS distribution. If I have three variables—good/bad IS, not/profitable OOS, and positioning within OOS simulation—then maybe I could run a 3-way ANOVA. Three-way Chi Square? I know correlation cannot be calculated with nominal data.
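For two of the nominal variables at a time, a chi-square test of independence is the standard tool. Here is a minimal sketch for a 2x2 table with no continuity correction. It does not address the three-variable case, and the counts are hypothetical round numbers rather than the actual split of my 167 strategies.

```python
import math

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (chi2, p). With 1 degree of freedom the p-value is erfc(sqrt(chi2 / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts: rows = good/bad IS, columns = profitable/unprofitable OOS
sample_table = [[30, 20], [20, 30]]
chi2, p = chi_square_2x2(sample_table)
print(round(chi2, 3), round(p, 4))
```

Adding the third variable would call for something like a log-linear model, which is beyond this sketch.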

Honestly, I don’t have the statistical expertise to proceed with an analysis this complex.

At this point, I’m not sure it makes sense to do the study the original way, either. If I scored the OOS graph and tried to look for some relationship with future performance, then I would need to look at IS in order to determine whether something particular about IS leaked into the OOS metrics (DV #1 and DV #2) or whether OOS metrics are the way they are due to actual strategy effectiveness. Some interaction effect seems to need identifying and/or eliminating.

Well isn’t this just a clusterf—

I will conclude next time.