Option Fanatic
Options, stock, futures, and system trading, backtesting, money management, and much more!

Testing the Randomized OOS (Part 2)

I described the Randomized OOS in intricate detail last time. Today I want to proceed with a method to validate Randomized OOS as a stress test.

I had some confusion in determining how to do this study. To validate the Noise Test, I preselected winning versus losing strategies as my independent variable. My dependent variables (DVs) were features the software developers suggested as [predictive] test metrics (DV #1 and DV #2 from Part 1). I ran statistical analyses on nominal data (winning or losing OOS performance, all above zero or not, Top or Mid) to identify significant relationships.

I thought the clearest way to do a similar validation of Randomized OOS would be to study a large sample size of strategies that score in various categories on DV #1 and DV #2. Statistical analysis could then be done to determine potential correlation with future performance (perhaps as defined by nominal profitability: yes or no).
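A minimal sketch of how such a nominal-data analysis could look, assuming a 2x2 table (DV category by later profitability) and Fisher's exact test. The choice of test, the function name, and the counts are illustrative assumptions, not results from any study.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: rows = scored "Mid" / "Top" on a DV,
# columns = profitable / not profitable in the later period.
print(fisher_exact_two_sided(30, 20, 18, 32))
```

A small p-value would suggest the DV category and later profitability are related; with only a handful of strategies per cell, the test has little power.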

This would be a more complicated study than my Noise Test validation. I would need to do subsequent testing one at a time, which would be very time consuming for 150+ strategies. I would also need to shorten the IS + OOS backtesting period (e.g. from 12 years to 8-9?) to preserve ample data for getting a reliable read on subsequent performance. I don’t believe 5-10 trades are sufficient for the latter.*

Because Randomized OOS provides similar data for IS/OOS periods, I thought an available shortcut might be to study IS and look for correlation to OOS. My first attempt involved selection of best and worst strategies and scoring the OOS graphs.

In contrast to the Noise Test validation study, two things must be understood here about “best” and “worst.” First, the software is obviously designed to build profitable strategies and it does so based on IS performance. Second, a corollary to this is that even those strategies at the bottom of the results list are still going to be winners (see fifth paragraph here to see that the worst Noise Test validation strategies were OOS losers). I still thought the absolute performance difference from top to bottom would be large enough to see significant difference in the metrics.

I will continue next time.

* — I could also vary the time periods to get a larger sample size. For example, I can backtest from
      2007-2016 and analyze 2017-2019 for performance. I can also backtest from 2010-2019 and
      analyze 2007-2009 for performance. The only stipulation is that the backtesting period be
      continuous because I cannot enter a split time interval into the software. If I shorten the
      backtesting period even further, then I would have more permutations available within the 12
      years of total data as rolling periods become available.
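
The permutations described in this footnote can be enumerated with a short sketch. The function name is hypothetical, and treating a holdout that is split on both sides of the backtest window as acceptable is an assumption.

```python
def rolling_splits(first_year, last_year, backtest_years):
    """List (backtest_start, backtest_end, holdout_years) for each contiguous window."""
    all_years = list(range(first_year, last_year + 1))
    splits = []
    for start in range(first_year, last_year - backtest_years + 2):
        end = start + backtest_years - 1
        # Everything outside the contiguous backtest window is left for analysis.
        holdout = [y for y in all_years if y < start or y > end]
        splits.append((start, end, holdout))
    return splits

for start, end, holdout in rolling_splits(2007, 2019, 10):
    print(start, end, holdout)
```

The first and last windows reproduce the two examples above (2007-2016 with 2017-2019 held out, and 2010-2019 with 2007-2009 held out). Shortening `backtest_years` yields more windows.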

Testing the Randomized OOS (Part 1)

I previously blogged about validating the Noise Test on my current trading system development platform. Another such stress test is called Randomized OOS (out of sample) and today I begin discussion of a study to validate that.

While many logical ideas in Finance are marketable, I have found most to be unactionable. The process of determining whether a test has predictive value or whether a trading strategy is viable is what I call validation. If I cannot validate Randomized OOS then I don’t want to waste my time using it as part of my screening process.

The software developers have taught us a bit about the Randomized OOS in a training video. Here’s what they have to say:

I want to further explain the second bullet point. Suppose I want 40% of the data to be reserved for OOS testing. The 40% can come at the beginning or at the end. The 40% can be in the middle. I can have 20% at the beginning and 20% at the end. I can space out 10% four times intermittently throughout. I can theoretically permute the data an infinite number of times to come up with different sequences, which is how I get distributions of simulated IS and OOS equity curves.
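
A rough sketch of the idea, not the software's actual algorithm (which is not documented here). It assumes the reshuffling is done per trade rather than per date block, and the function name and sample PnL values are made up.

```python
import random

def randomized_oos(trade_pnls, oos_fraction=0.4, n_sims=1000, seed=0):
    """Return a list of (is_total, oos_total) from random IS/OOS reassignments."""
    rng = random.Random(seed)
    n = len(trade_pnls)
    n_oos = round(n * oos_fraction)
    grand_total = sum(trade_pnls)
    results = []
    for _ in range(n_sims):
        # Pick which trades count as OOS in this simulated arrangement.
        oos_idx = set(rng.sample(range(n), n_oos))
        oos_total = sum(p for i, p in enumerate(trade_pnls) if i in oos_idx)
        results.append((grand_total - oos_total, oos_total))
    return results

pnls = [10, -5, 20, -10, 15, 5, -20, 30, 0, 25]   # hypothetical trade PnLs
sims = randomized_oos(pnls, oos_fraction=0.4, n_sims=200)
print(len(sims))
```

Each simulated pair is one point in the IS and OOS distributions; plotting cumulative sums instead of totals would give the families of equity curves.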

Here are a couple examples from the software developers with the left and right graphs being IS and OOS, respectively, and the bold, blue line as the original backtested equity curve:

Randomized OOS Ex 1 (6-13-20)

This suggests the backtested OOS performance is about as good as it could possibly be. Were the data ordered any other way, performance of the strategy would likely be worse: an ominous implication.

Contrast that with this example:

Randomized OOS Ex 2 (6-13-20)

Here, backtested OOS performance falls in the middle (rather than top) of the simulated distribution. This suggests backtested OOS performance is “fair” because ~50% of the data permutations have it better while ~50% have it worse. This is considered more repeatable or robust. In the previous example, ~100% of the data permutations gave rise to worse performance.

The third possibility locates the original equity curve near the bottom of the simulated distribution.* This would occur in a case where the original OOS period is extremely unfavorable for the strategy—perhaps due to improbable bad luck. Discarding the strategy for performance reasons alone may not be the best choice in this instance.
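
The three cases (top, middle, bottom) can be expressed as a percentile rank of the original result within the simulated distribution. This is a sketch only: the 90/10 cutoffs and the function names are arbitrary assumptions, not thresholds from the software.

```python
def percentile_rank(original, simulated):
    """Percent of simulated values that fall below the original value."""
    below = sum(1 for s in simulated if s < original)
    return 100.0 * below / len(simulated)

def classify(rank, top_cutoff=90.0, bottom_cutoff=10.0):
    """Label the original curve's position; the cutoffs are arbitrary assumptions."""
    if rank >= top_cutoff:
        return "Top"
    if rank <= bottom_cutoff:
        return "Bottom"
    return "Mid"

rank = percentile_rank(5, [1, 2, 3, 4, 6, 7, 8, 9, 10, 11])
print(rank, classify(rank))
```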

I will continue next time.

* — I am less likely to see this because the first thing I typically do is look at
       IS + OOS equity curves and readily discard those that don’t look good.

Automated Backtester Research Plan (Part 9)

With digressions on position sizing for spreads and deceptive butterfly trading plans complete, I will now resume with the automated backtester research plan.

We can study [iron, perhaps, for better execution] butterfly trades entered daily from 10-90 days to expiration (DTE). We can center the trade 0% (ATM) to 5% OTM (bullish or bearish) by increments of 1% [perhaps using caution to stick to the most liquid (10- or 25-point) strikes especially when open interest is low*]. We can vary wing width from 1-5% of the underlying price by increments of 1%. We can vary contract size to keep notional risk as consistent as possible (given granularity constraints of the most liquid strikes).
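
To get a feel for the size of this parameter space, here is a minimal sketch of the grid. The DTE step of 10 is an assumption made to keep the grid small (the plan enters trades daily), and the sign convention for bullish versus bearish centering is also assumed.

```python
from itertools import product

dte_values = range(10, 91, 10)   # assumed step of 10 between 10 and 90 DTE
center_pct = range(-5, 6)        # -5% to +5% from ATM by 1% (0 = ATM)
wing_width_pct = range(1, 6)     # 1% to 5% of the underlying price

grid = list(product(dte_values, center_pct, wing_width_pct))
print(len(grid))
```

Every added dimension (exit rules, delta-based wings) multiplies this count, which is worth keeping in mind when judging sample size per cell.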

An alternative approach to wing selection would be to buy longs at particular delta values (e.g. 2-4 potential delta values for each such as 16-delta put and 25-delta call). This could be especially useful to backtest asymmetrical structures, which are a combination of symmetrical butterflies and vertical spreads (as mentioned in the second-to-last paragraph here).

With trades held to expiration, I’d like to track and plot maximum adverse (favorable) excursion for the winners (losers) along with final PnL and total number of trades to determine whether a logical stop-loss (profit target) may exist. We can also analyze differences between holding to expiration, managing winners at 5-25% profit by increments of 5%, or exiting at 1-3x profit target by increments of 0.25x. We can also study exiting at 7-28 (proportionally less on the upper end for short-term trades) DTE by increments of seven.
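
A minimal sketch of the excursion tracking, assuming each trade is stored as a sequence of daily PnL marks (the data layout and function name are hypothetical).

```python
def excursions(pnl_path):
    """Return (mae, mfe, final_pnl) for one trade's sequence of daily PnL marks."""
    mae = min(0.0, min(pnl_path))   # worst open loss, 0 if never underwater
    mfe = max(0.0, max(pnl_path))   # best open profit, 0 if never in profit
    return mae, mfe, pnl_path[-1]

print(excursions([-50, -120, 30, 80]))
```

Plotting MAE against final PnL for winners (and MFE for losers) is what would reveal whether a stop-loss or profit target separates the two groups.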

As an alternative not previously mentioned, we can use DIT as an exit criterion. This could be 20-40 days by increments of five. Longer-dated trades have greater profit (and loss) potential than shorter-dated trades given a fixed DIT, though. To keep things proportional, we could instead backtest exiting at 20-80% of the original DTE by increments of 15%.
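
The proportional version works out as follows. This is only a sketch: the function name is hypothetical, and rounding to whole days is an assumption.

```python
def proportional_exit_dit(entry_dte, fractions=(0.20, 0.35, 0.50, 0.65, 0.80)):
    """Days in trade at which to exit, as a fraction of the DTE at entry."""
    return [round(entry_dte * f) for f in fractions]

print(proportional_exit_dit(60))
print(proportional_exit_dit(100))
```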

Trade statistics to track include winning percentage, average (avg.) win, avg. loss, largest loss, largest win, profit factor, avg. trade (avg. PnL), PnL per day, standard deviation of winning trades, standard deviation of losing trades, avg. days in trade (DIT), avg. DIT for winning trades, and avg. DIT for losing trades. Reg T margin should be calculated and will remain constant throughout the trade. Initial PMR should be calculated along with the maximum value of the subsequent/initial PMR ratio.
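
A minimal sketch of several of these statistics, computed from a list of per-trade PnL values. Counting breakeven trades as losses is an assumption, and the sample values are made up.

```python
from statistics import mean

def trade_stats(pnls):
    """Summary statistics for a list of per-trade PnL values."""
    wins = [p for p in pnls if p > 0]
    losses = [p for p in pnls if p <= 0]   # breakeven counted as a loss (assumption)
    gross_win = sum(wins)
    gross_loss = -sum(losses)
    return {
        "win_pct": 100.0 * len(wins) / len(pnls),
        "avg_win": mean(wins) if wins else 0.0,
        "avg_loss": mean(losses) if losses else 0.0,
        "largest_win": max(wins) if wins else 0.0,
        "largest_loss": min(losses) if losses else 0.0,
        "profit_factor": gross_win / gross_loss if gross_loss else float("inf"),
        "avg_trade": mean(pnls),
    }

print(trade_stats([100, -50, 200, -150, 50]))
```

The DIT-based statistics and standard deviations would follow the same pattern given days-in-trade per trade.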

We can later consider relatively simple adjustment criteria. I may spend some time later brainstorming some ideas on this, but I am most interested at this point in seeing raw statistics for the butterfly trades.

I will continue next time.

* This would be a liquidity filter coded into the backtester. A separate study to see how open interest for
   different strikes varies across a range of DTE might also be useful.

Butterfly Skepticism (Part 4)

Today I want to complete discussion of the protective put (PP) butterfly adjustment.

I might be able to come up with some workaround (as done in this second paragraph) for PP backtesting. I could look at EOD [OHLC] data and determine when the low was more than 1.6 SD below the previous day’s close. In this case, I could purchase the put at the close. This would bias the backtest against (not a bad thing) the adjustment in cases where the close was more than 1.6 SD below because the put would be more expensive.

Unfortunately, I am not sure this particular workaround would work. If the close is less than 1.6 SD below then the backtested PP would be less expensive than actual. Furthermore, if I waited until EOD then the NPD and corresponding PP(s) to purchase would be different. This would distort the study in an unknown direction. I could track error (difference) between -1.6 SD and closing market price. Positive and negative error might cancel out over time. If I had a large sample size then this might or might not be meaningful.
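
To make the workaround and its error tracking concrete, here is a sketch under stated assumptions: the one-day SD is supplied as a constant fraction of price (the source does not say how SD is computed), bars are simple tuples, and the function name is hypothetical.

```python
def pp_triggers(bars, daily_sd_pct, sd_mult=1.6):
    """Find days whose low breached the trigger and report close-vs-trigger error.

    bars: list of (date, low, close) tuples in chronological order.
    daily_sd_pct: one-day standard deviation as a fraction of price (assumed input).
    """
    out = []
    for prev, cur in zip(bars, bars[1:]):
        prev_close = prev[2]
        date, low, close = cur
        trigger = prev_close * (1 - sd_mult * daily_sd_pct)
        if low <= trigger:
            # Positive error: close finished above the trigger (EOD put is cheaper).
            # Negative error: close finished below the trigger (EOD put costs more).
            out.append((date, trigger, close - trigger))
    return out

bars = [("d1", 99.0, 100.0), ("d2", 97.0, 99.0), ("d3", 98.5, 99.5)]
print(pp_triggers(bars, daily_sd_pct=0.01))
```

Summing the error column over a large sample would show whether the positive and negative errors actually tend to cancel.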

At best, this workaround seems like a questionable approximation of an adjustment strategy that is precisely defined.

Before dismissing the PP out of frustration, let’s step back for a moment and piece together some assumptions.

First, I believe the butterfly can be a trade with somewhat consistent profits and occasionally larger losses. Overall, I’m uncertain whether this has a positive or negative expectancy (hopefully to be determined as I begin to describe here).

Second, as butterflies are held longer, I believe profitability will be decreased. I have seen some anecdotal (methodology incompletely defined) research to suggest butterflies are more profitable when avoiding periods of greatest negative gamma.

Third, I have seen anecdotal (methodology incompletely defined) research to suggest PPs as unprofitable whether:


Fourth, this adjustment will require any butterfly to be held longer on average. The additional time will be needed to recoup the PP loss. The result will be, as described in the second assumption above, decreased average profitability.

In my mind, combining the first and fourth assumptions does not bode well.

The big unknown involves the magnitude of the largest losses and in what percentage of trades those largest losses occur.

Interestingly, the trader who explained this to me said PP will lose money in most cases. What it can prevent is a massive windfall loss. Being forthright [about the obvious?] may give the teacher more credibility. Without backtesting, though, I think it leaves us with more than reasonable doubt over whether this approach tends toward profit or loss.

Butterfly Skepticism (Part 3)

On my mind this morning is skepticism regarding the protective put (PP).

I have seen the PP lauded by many traders as a lifesaving arrow to have in the quiver.

One trader described this to me with regard to a butterfly trading plan. Part of the plan provides for the following adjustment:

  1. If NPD is at least 10 with market at least -1.6 SD intraday then record NPD.
  2. Buy PP(s) to cut NPD by 75%.
  3. On a subsequent big move, if NPD again reaches value recorded in Step 1 then repeat Step 2.
  4. If market reverses to the high of purchase day, then close PP(s) from Step 2.
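
Step 2 can be sketched as simple arithmetic. The plan as relayed does not say which put to buy or what units NPD is in, so this assumes NPD is in share-equivalent deltas and a 100-share multiplier, with a hypothetical function name.

```python
def puts_to_buy(npd, put_delta, cut=0.75, multiplier=100):
    """Number of long puts needed to reduce a positive NPD by roughly `cut`.

    npd: net position delta in share-equivalents (assumed units).
    put_delta: per-share delta of the put, e.g. -0.25.
    """
    target_reduction = cut * npd
    per_contract = abs(put_delta) * multiplier
    # Whole contracts only, so the achieved cut is approximate.
    return max(1, round(target_reduction / per_contract))

print(puts_to_buy(100, -0.25))
```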


Upon further questioning, I got some additional information. He learned it from a guy who claims to have “mentored” many traders. The mentor (teacher) claims to have seen many lose significant money in big moves and therefore recommends this to avoid windfall losses. The teacher has shown numerous historical examples where this adjustment would keep people in the trade (not stopped out at maximum loss) and often wind up profitable. Further prodding revealed uncertainty over whether these “numerous” examples amount to more than a handful of instances. Of the several times the adjustment has been presented, he acknowledged the possibility that many could have been repetition of the same [handful of] instance[s]. He is uncertain whether anyone has presented big losing trades more than once.

Much of this casts doubt over the sample size behind this adjustment. We certainly wouldn’t want to fall prey to that described in this second-to-last paragraph.

As described in this third paragraph, the PP is simply an overlay added later in the trade. This strategy also has its own catchy name and is marketed. In order to backtest, we could study the profitability of long puts purchased on days the market is down at least 1.6 SD (also explore the surrounding parameter space as discussed in the fourth complete paragraph here).

From a backtesting perspective, intraday is a huge wrinkle. Technically, I’d need intraday data to identify exactly when the market was down 1.6 SD in order to purchase the PP at the correct time. As mentioned in the second-to-last paragraph here, this is arguably another reason why certain trading plans cannot be backtested: data not available.

I will continue next time.

Butterfly Skepticism (Part 2)

The presentations I see on butterfly strategies often get my skeptical juices flowing.

I will not accept edge that occurs at one particular DTE and not another because I think this is one of the great fallacies in all of option trading (see here and here). I find many butterfly strategies to be guilty of this.

From across the mountaintop, the plethora of butterfly trading plans look like an attempt to place Band-Aids in all combinations over any potential risk including high/low underlying price, volatility, or days to expiration. I think any advanced trader would recognize this as the Holy Grail or a flat position. The former is a mirage that does not exist. The latter poses no risk with no opportunity for reward. Any statistically-minded trader might recognize this as curve-fitting: trying different things until you find one that works perfectly. Despite that perfect match to past data, curve-fit systems are unlikely to work in the future. Also, by [statistical] chance alone, about one in every 20 attempts is expected to produce a success [alpha = 0.05].
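
The arithmetic behind the alpha = 0.05 remark, as a short sketch that assumes the 20 attempts are independent:

```python
alpha = 0.05
attempts = 20

# Expected number of spurious "successes" across all attempts.
expected_false_positives = alpha * attempts
# Chance that at least one attempt looks successful by luck alone.
p_at_least_one = 1 - (1 - alpha) ** attempts

print(expected_false_positives)   # 1.0
print(round(p_at_least_one, 3))   # 0.642
```

So one false success is the expected count, and there is roughly a 64% chance of seeing at least one.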

Studying butterflies was one of my early reasons for wanting an automated backtester. Of all the butterfly trading plans I have seen pitched, many were accompanied with no backtesting at all (optionScam.com). Some of them have limited backtesting, which I have often found to be vulnerable to criticism (e.g. transaction fees not included, insufficient sample sizes, limited subset of market environments studied, survivorship bias, no OOS validation). I want to be able to take a defined trading plan and backtest it comprehensively.

For different reasons, many butterfly trading plans cannot be backtested. Some are complex enough to fall prey to the curse of dimensionality (discussed here and here). Some trading plans emphasize multiple positions in the whole portfolio placed over long periods of time to properly hedge each other. Chewing up such a large time interval for a single instance will only allow for a small number of total instances that cannot be divided up in sufficient sample size for the different categories of market environments. Some trading plans are specific enough to be curve-fit (worthless). Other trading plans incorporate discretion. For all practical purposes, discretion can never be backtested. Any discretionary plan thereby becomes a story, which falls into the domain of sales, rapport building, advertising, and marketing (optionScam.com?). I alluded to this in the second-to-last paragraph here.

I have yet to trade butterflies with any consistency because I do not even know if they are better than naked puts, which I consider the most basic of option trades. At the very least, the automated backtester research plan should be able to address this.

Butterfly Skepticism (Part 1)

Before I continue the research plan, I want to express some skepticism I have toward butterfly trades based on what I have seen in recent years.

Butterflies have been a “secret sauce” of the options trading community for some time. They come in all different shapes and sizes with diverse trading plans incorporating varying degrees of discretion. I have seen many traders and vendors describe more/less elaborate adjustment plans that look, at cursory glance, very impressive in their presentation.

Most of these adjustment plans are simply position overlays added later in the trade. Adding [a] management criteria [criterion], giving the trade a fancy name, and selling it as a new idea for $$$ seems alarmingly deceptive to me especially in the absence of valid backtesting to support it.

Most of the complexity amounts to adding subsequent positions within specified time constraints. For example, adding a put credit spread (PCS) later in the trade to raise the upper expiration line may seem appealing. Consider these:


If they happen at all then either is going to be triggered much later than butterfly initiation since decay occurs over time. I strongly believe nothing to be magical about any particular DTE for trade inception or adjustment. Failure to explore any other time a trade could be placed or adjustment made is short-sighted as I have written about here, here, and here.

If I would consider doing the adjustment later, then I should backtest the adjustment earlier as well. With regard to the above example, I already plan to study shorter- and longer-dated PCS as part of the automated backtester research plan. Whether the trading plan works as a whole is simply a question of whether the sum of its parts (e.g. symmetrical butterfly plus shorter-dated PCS) is profitable.

I will continue next time.

Constant Position Sizing of Spreads Revisited (Part 4)

Today I will conclude this blog digression by deciding how to define constant position size, which I believe is important for a homogeneous backtest.

The leading candidates—all mentioned in Part 3—are notional risk, leverage ratio, and contract size.

Possible means to achieve—both mentioned in Part 2—are fixed credit and fixed delta.

I thought it might be the case that fixed delta results in a fixed leverage ratio. I suggested this in the last paragraph of Part 1 where I asked whether fixed delta would lead to a constant SWUP percentage. For naked puts under Reg T margining, gross requirement is notional risk. For spreads under Reg T margining, notional risk is spread width x # contracts and while notional risk may be fixed, the SWUP percentage varies.

Speaking of margining, we also have Reg T versus portfolio margining (PM) to complicate things. Both focus on a fixed percentage down (e.g. -100% for Reg T vs. -12% for PM) on the underlying. However, PnL at -12% can vary significantly with underlying price movement. PnL for spreads at -100% will not change as the underlying moves around because the long strike (at which point the expiration risk curve goes horizontal to the downside) is so far above.

Implied volatility (IV) also needs to be teased out since it will affect some of these parameters but not others. Given fixed strike price, IV is directly (inversely) proportional to delta (relative moneyness). For naked puts assuming constant contracts and fixed delta, IV is inversely proportional to notional risk and to leverage ratio. IV does not relate to leverage ratio for spreads, which is net liquidation value (NLV) divided by notional risk as defined two paragraphs above in the last sentence.

After spending extensive time immersed in all this wildly theoretical stuff, I seem to keep coming back to notional risk, leverage ratio, and fixed delta. The first two vary with NLV* and with # contracts due to proportional slope of the risk graph. Number of contracts can vary to keep notional risk relatively constant as strike price changes but this applies more to naked puts and less to spreads where spread width is of equal importance.

I want to say that for naked puts, the answer is fixed notional risk (strike price x # contracts), but we also need to keep delta fixed to maintain moneyness. With fixed credit, changing the latter would affect slope and leverage ratio. This is how I described the research plan originally and we will see whether an optimal delta exists or whether results are similar across the range. In the midst of all this mental wheel spinning, I seem to have gotten this right for naked puts without realizing it.

I guess I have also lost sight of the fact that this post is not even supposed to be about nakeds (see title)!

Getting back to constant position sizing of spreads, I think we can focus on notional risk and moneyness but we should also factor in SWUP. As the underlying price increases (decreases), spread width can increase (decrease) and we will normalize notional risk by varying contract size. Short strikes at fixed delta will be implemented and compared across a delta range.
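
A minimal sketch of normalizing notional risk by varying contract size, using spread width x # contracts as the definition of notional risk. The 100 multiplier, the target value, and the function name are assumptions for illustration.

```python
def contracts_for_notional(target_notional, spread_width, multiplier=100):
    """Contracts that bring width x contracts x multiplier closest to the target."""
    per_contract = spread_width * multiplier
    return max(1, round(target_notional / per_contract))

# Hypothetical target of 50,000 in notional risk.
for width in (25, 50, 100):
    n = contracts_for_notional(50_000, width)
    print(width, n, width * n * 100)
```

Because contracts come in whole numbers, notional risk is only approximately constant when the target is not a multiple of the per-contract risk.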

Which is what I had settled on before (for spreads)…

[To reaffirm] Which is what I had settled on before (for naked puts)…

As I unleash a gigantic SIGH, I question whether any of this extensive deliberation was ever necessary in the first place.

I think at some level, this mental wheel spinning is what I missed as a pharmacist. The complexity fires my intellectual juices and is great enough to require peer review/collaboration to sort through. Once that is done, selling the strategy is an entirely separate domain suited to different talents, perhaps.

I left a job of the people (co-workers/customers) for a job that begs for people, which I have really yet to find. Oh the irony!

* By association, this is why I stressed magnitude of drawdown as a % of initial account value (NLV) in previous posts.

Constant Position Sizing of Spreads Revisited (Part 3)

Happy New Year, everyone!

The current blog mini-series has been a tangent from the automated backtester research plan. Today I will discuss whether fixed notional risk—with regard to naked puts and spreads—is even important.

This issue is significant because it seems like fixed notional risk is the “last man standing” since I initially mentioned it in Part 1. I have reassessed the importance of so many concepts and parameters in this research plan. The fact that they get misunderstood and reinterpreted is testament to how theoretical and highly complex they are. Especially from the perspective of avoiding confirmation bias, I believe this is all debate that must be had, and a main reason why system development is best done in groups as a means to check each other.

The reason fixed notional risk may not matter is because leverage ratio can vary. I also mentioned this in the third-to-last paragraph here. Leverage ratio is notional risk divided by portfolio margin requirement (PMR). Keeping PMR under net liquidation value and meeting the concentration criterion are essential to satisfy the brokerage. Leverage ratio can be lowered by selling the same total premium NTM. This will affect the expiration curve by decreasing margin of safety as it lifts T+x lines. Analyzing this, somehow, might be worth doing if backtesting over a delta range does not provide sufficient comparison.

Whether “homogeneous backtest” should mean constant leverage ratio throughout is another highly theoretical question that is subject to debate. Keeping allocation constant, which I aim to do in the serial, non-overlapping backtests, is one thing, but leverage can vary in the face of fixed allocation. I discussed this here in the final four paragraphs. In that example, buying the long option for cheap halves Reg T risk but dramatically increases the chance of blowing up (complete loss) since the market only needs to drop to 500 rather than zero. While the chance of a drop even to 500 is infinitesimal, theoretically it could happen and on a percent of percentage basis, the chance of that happening is much greater than a drop to zero.

Portfolio margin (PM) provides leverage because the requirement is capped at T+0 loss seen 12% down on the underlying. In the previous example, 500 represents a 50% drop. Even under PM, though, leverage ratio can vary because of what I said in the second-to-last sentence of paragraph #4 (above).

When talking just about naked puts, much of this question about leverage seems to relate to how far down the expiration curve extends at a market drop of 12%, 25%, or 100%. This brings contract size back into the picture because contract size is proportional to downside slope of that curve.

With verticals, though, number of contracts is less meaningful because width of the spread is also important. The downside slope will be proportional to number of contracts. The max potential loss of the vertical depends not only on the downside slope, but for how long that slope persists because the graph only slopes down between the short and long strikes.
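
The expiration payoff of a put credit spread shows both effects: slope scales with contracts, and the loss stops growing below the long strike. The strikes, credit, and function name are hypothetical, and a 100 multiplier is assumed.

```python
def put_credit_spread_pnl(underlying, short_strike, long_strike, credit,
                          contracts=1, multiplier=100):
    """Expiration PnL of a put credit spread (short the higher strike)."""
    short_value = max(short_strike - underlying, 0.0)
    long_value = max(long_strike - underlying, 0.0)
    per_share = credit - short_value + long_value
    return per_share * contracts * multiplier

# Short 1000 put, long 950 put, 5.00 credit.
for price in (1100, 975, 900, 500):
    print(price, put_credit_spread_pnl(price, 1000, 950, 5.0))
```

Doubling `contracts` doubles the loss at 975, while widening the spread extends how far the loss keeps growing before it flattens out.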

Either way, you can see how number of contracts gets brought back into the discussion and could, itself, be mistaken as being sufficient for “constant position size.”

I certainly was not wrong with my prediction from the second paragraph of Part 1.