Option Fanatic: Options, stock, futures, and system trading, backtesting, money management, and much more!

Automated Backtester Research Plan (Part 9)

With digressions on position sizing for spreads and deceptive butterfly trading plans complete, I will now resume with the automated backtester research plan.

We can study [iron, perhaps, for better execution] butterfly trades entered daily from 10-90 days to expiration (DTE). We can center the trade 0% (ATM) to 5% OTM (bullish or bearish) by increments of 1% [perhaps using caution to stick to the most liquid (10- or 25-point) strikes especially when open interest is low*]. We can vary wing width from 1-5% of the underlying price by increments of 1%. We can vary contract size to keep notional risk as consistent as possible (given granularity constraints of the most liquid strikes).
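As a rough sketch of the entry parameter space just described (with strike rounding and the liquidity filter omitted for brevity), the sweep might be enumerated like this in Python:

```python
from itertools import product

# Hypothetical enumeration of the butterfly entry parameters above.
dte_range = range(10, 91)                        # 10-90 days to expiration
center_offsets = [i / 100 for i in range(0, 6)]  # 0% (ATM) to 5% OTM by 1%
wing_widths = [i / 100 for i in range(1, 6)]     # 1-5% of underlying by 1%

def entry_grid():
    """Yield one (dte, center_offset, wing_width) tuple per combination."""
    yield from product(dte_range, center_offsets, wing_widths)

# 81 DTE values x 6 centers x 5 wing widths = 2,430 combinations per side
print(sum(1 for _ in entry_grid()))  # 2430
```

Each tuple would then drive one daily-entry backtest, with contract size chosen separately to normalize notional risk.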

An alternative approach to wing selection would be to buy longs at particular delta values (e.g. 2-4 potential delta values for each such as 16-delta put and 25-delta call). This could be especially useful to backtest asymmetrical structures, which are a combination of symmetrical butterflies and vertical spreads (as mentioned in the second-to-last paragraph here).

With trades held to expiration, I’d like to track and plot maximum adverse (favorable) excursion for the winners (losers) along with final PnL and total number of trades to determine whether a logical stop-loss (profit target) may exist. We can also analyze differences between holding to expiration, managing winners at 5-25% profit by increments of 5%, or exiting at 1-3x profit target by increments of 0.25x. We can also study exiting at 7-28 (proportionally less on the upper end for short-term trades) DTE by increments of seven.
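Tracking maximum adverse (favorable) excursion from daily PnL marks is straightforward to express; a minimal sketch with hypothetical inputs:

```python
def excursions(daily_pnl):
    """Return (MAE, MFE): the worst and best cumulative open PnL marks
    over the life of a trade (daily_pnl in dollars)."""
    mae = min(0.0, min(daily_pnl))   # maximum adverse excursion
    mfe = max(0.0, max(daily_pnl))   # maximum favorable excursion
    return mae, mfe

# A winner that was down $400 at its worst before finishing +$250:
print(excursions([-150.0, -400.0, -100.0, 250.0]))  # (-400.0, 250.0)
```

Plotting MAE for winners against final PnL is what would reveal whether a stop-loss level exists that rarely clips eventual winners.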

As an alternative not previously mentioned, we can use DIT as an exit criterion. This could be 20-40 days by increments of five. Longer-dated trades have greater profit (and loss) potential than shorter-dated trades given a fixed DIT, though. To keep things proportional, we could instead backtest exiting at 20-80% of the original DTE by increments of 15%.

Trade statistics to track include winning percentage, average (avg.) win, avg. loss, largest loss, largest win, profit factor, avg. trade (avg. PnL), PnL per day, standard deviation of winning trades, standard deviation of losing trades, avg. days in trade (DIT), avg. DIT for winning trades, and avg. DIT for losing trades. Reg T margin should be calculated and will remain constant throughout the trade. Initial PMR should be calculated along with the maximum value of the subsequent/initial PMR ratio.
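Most of these statistics fall out of per-trade PnL and DIT lists; here is a partial sketch (illustrative names only, margin calculations omitted):

```python
from statistics import mean, stdev

def trade_stats(pnls, dits):
    """Compute a subset of the statistics listed above from per-trade
    PnL and days-in-trade (DIT) lists."""
    wins = [p for p in pnls if p > 0]
    losses = [p for p in pnls if p <= 0]
    return {
        "win_pct": len(wins) / len(pnls),
        "avg_win": mean(wins) if wins else 0.0,
        "avg_loss": mean(losses) if losses else 0.0,
        "largest_win": max(wins, default=0.0),
        "largest_loss": min(losses, default=0.0),
        "profit_factor": sum(wins) / abs(sum(losses)) if losses else float("inf"),
        "avg_trade": mean(pnls),
        "std_win": stdev(wins) if len(wins) > 1 else 0.0,
        "avg_dit": mean(dits),
        "avg_dit_win": mean([d for p, d in zip(pnls, dits) if p > 0]) if wins else 0.0,
    }

stats = trade_stats([100, -50, 200, -150], [20, 35, 15, 40])
print(stats["win_pct"], stats["profit_factor"])  # 0.5 1.5
```

PnL per day and the loss-side DIT averages follow the same pattern; Reg T margin and the PMR ratio would come from the position legs rather than the PnL series.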

We can later consider relatively simple adjustment criteria. I may spend some time later brainstorming some ideas on this, but I am most interested at this point in seeing raw statistics for the butterfly trades.

I will continue next time.

* This would be a liquidity filter coded into the backtester. A separate study to see how open interest for different strikes varies across a range of DTE might also be useful.

Butterfly Skepticism (Part 4)

Today I want to complete discussion of the protective put (PP) butterfly adjustment.

I might be able to come up with some workaround (as done in this second paragraph) for PP backtesting. I could look at EOD [OHLC] data and determine when the low was more than 1.6 SD below the previous day’s close. In this case, I could purchase the put at the close. This would bias the backtest against (not a bad thing) the adjustment in cases where the close was more than 1.6 SD below because the put would be more expensive.
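The EOD workaround might look like the following, where the 1-day SD convention (previous close x IV / sqrt(252)) is my assumption rather than anything specified in the trading plan:

```python
import math

def hit_minus_1_6_sd(prev_close, iv_annual, day_low, day_close):
    """Flag the EOD workaround described above: did the intraday low
    (or the close) breach -1.6 SD from the previous close?
    The 1-day SD convention -- prev_close * IV / sqrt(252) -- is an
    assumption for illustration."""
    one_day_sd = prev_close * iv_annual / math.sqrt(252)
    trigger = prev_close - 1.6 * one_day_sd
    return day_low <= trigger, day_close <= trigger

# Prev close 2700 at 16% IV: 1-day SD ~27.2, trigger ~2656.5.
# Low breached the trigger but the close recovered above it:
print(hit_minus_1_6_sd(2700, 0.16, 2650, 2665))  # (True, False)
```

The (True, False) case is exactly the problematic one from the next paragraph: the backtested put purchased at the close would be cheaper than the actual intraday fill.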

Unfortunately, I am not sure this particular workaround would work. If the close is less than 1.6 SD below then the backtested PP would be less expensive than actual. Furthermore, if I waited until EOD then the NPD and corresponding PP(s) to purchase would be different. This would distort the study in an unknown direction. I could track error (difference) between -1.6 SD and closing market price. Positive and negative error might cancel out over time. If I had a large sample size then this might or might not be meaningful.

At best, this workaround seems like a questionable approximation of an adjustment strategy that is precisely defined.

Before dismissing the PP out of frustration, let’s step back for a moment and piece together some assumptions.

First, I believe the butterfly can be a trade with somewhat consistent profits and occasionally larger losses. Overall, I’m uncertain whether this has a positive or negative expectancy (hopefully to be determined as I begin to describe here).

Second, as butterflies are held longer, I believe profitability will be decreased. I have seen some anecdotal (methodology incompletely defined) research to suggest butterflies are more profitable when avoiding periods of greatest negative gamma.

Third, I have seen anecdotal (methodology incompletely defined) research suggesting PPs are unprofitable whether:


Fourth, this adjustment will require any butterfly to be held longer on average. The additional time will be needed to recoup the PP loss. The result will be, per the second assumption above, decreased average profitability.

In my mind, combining the first and fourth assumptions does not bode well.

The big unknown involves the magnitude of the largest losses and in what percentage of trades those largest losses occur.

Interestingly, the trader who explained this to me said PP will lose money in most cases. What it can prevent is a massive windfall loss. Being forthright [about the obvious?] may give the teacher more credibility. Without backtesting, though, I think it leaves us with more than reasonable doubt over whether this approach tends toward profit or loss.

Butterfly Skepticism (Part 3)

On my mind this morning is skepticism regarding the protective put (PP).

I have seen the PP lauded by many traders as a lifesaving arrow to have in the quiver.

One trader described this to me with regard to a butterfly trading plan. Part of the plan provides for the following adjustment:

  1. If NPD is at least 10 with market at least -1.6 SD intraday then record NPD.
  2. Buy PP(s) to cut NPD by 75%.
  3. On a subsequent big move, if NPD again reaches value recorded in Step 1 then repeat Step 2.
  4. If market reverses to the high of purchase day, then close PP(s) from Step 2.
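The four steps above can be sketched as day-by-day state logic; the NPD computation and PP sizing bookkeeping are assumed away for illustration:

```python
def pp_adjust(npd, sd_move, day_high, state):
    """Sketch of the four-step protective-put (PP) adjustment above.
    npd: net position delta; sd_move: intraday move in SD units.
    state carries values between days; put pricing and delta
    bookkeeping are assumed away."""
    if npd >= 10 and sd_move <= -1.6:
        state["recorded_npd"] = npd              # Step 1: record NPD
        state["open_pp"] = True                  # Step 2: buy PP(s)...
        state["target_npd"] = npd * 0.25         # ...to cut NPD by 75%
        state["purchase_day_high"] = day_high
    elif state.get("open_pp") and npd >= state.get("recorded_npd", float("inf")):
        state["target_npd"] = npd * 0.25         # Step 3: repeat Step 2
    elif state.get("open_pp") and day_high >= state["purchase_day_high"]:
        state["open_pp"] = False                 # Step 4: reversal; close PP(s)
    return state

state = pp_adjust(12, -1.7, 2650.0, {})          # trigger day
state = pp_adjust(5, 0.5, 2700.0, state)         # reversal above that high
print(state["open_pp"])  # False
```

Even in this stripped-down form, Steps 1 and 4 depend on intraday extremes, which is the data problem discussed below.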


Upon further questioning, I got some additional information. He learned it from a guy who claims to have “mentored” many traders. The mentor (teacher) claims to have seen many lose significant money in big moves and therefore recommends this to avoid windfall losses. The teacher has shown numerous historical examples where this adjustment would keep people in the trade (not stopped out at maximum loss) and often wind up profitable. Further prodding revealed uncertainty over whether these “numerous” examples amount to more than a handful of instances. Of the several times the adjustment has been presented, he acknowledged the possibility that many could have been repetition of the same [handful of] instance[s]. He is uncertain whether anyone has presented big losing trades more than once.

Much of this casts doubt over the sample size behind this adjustment. We certainly wouldn’t want to fall prey to that described in this second-to-last paragraph.

As described in this third paragraph, the PP is simply an overlay added later in the trade. This strategy also has its own catchy name and is marketed. In order to backtest, we could study the profitability of long puts purchased on days the market is down at least 1.6 SD (also explore the surrounding parameter space as discussed in the fourth complete paragraph here).

From a backtesting perspective, intraday is a huge wrinkle. Technically, I’d need intraday data to identify exactly when the market was down 1.6 SD in order to purchase the PP at the correct time. As mentioned in the second-to-last paragraph here, this is arguably another reason why certain trading plans cannot be backtested: data not available.

I will continue next time.

Butterfly Skepticism (Part 2)

The presentations I see on butterfly strategies often get my skeptical juices flowing.

I will not accept edge that occurs at one particular DTE and not another because I think this is one of the great fallacies in all of option trading (see here and here). I find many butterfly strategies to be guilty of this.

From across the mountaintop, the plethora of butterfly trading plans looks like an attempt to place Band-Aids in all combinations over any potential risk including high/low underlying price, volatility, or days to expiration. I think any advanced trader would recognize this as the Holy Grail or a flat position. The former is a mirage that does not exist. The latter poses no risk with no opportunity for reward. Any statistically-minded trader might recognize this as curve-fitting: trying different things until you find one that works perfectly. Despite that perfect match to past data, curve-fit systems are unlikely to work in the future. Also, by [statistical] chance alone, roughly one in every 20 attempts is likely to produce a spurious success [alpha = 0.05].
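The chance-alone point can be made concrete: with 20 independent attempts at alpha = 0.05, the probability of at least one spurious "success" is substantial.

```python
# Each independent backtest has a 5% chance of a spurious "success"
# at alpha = 0.05; across 20 attempts, the chance of at least one is:
p_at_least_one = 1 - (1 - 0.05) ** 20
print(round(p_at_least_one, 3))  # 0.642
```

In other words, trying 20 butterfly variants and presenting the one that "worked" tells us almost nothing without out-of-sample validation.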

Studying butterflies was one of my early reasons for wanting an automated backtester. Of all the butterfly trading plans I have seen pitched, many were accompanied with no backtesting at all (optionScam.com). Some of them have limited backtesting, which I have often found to be vulnerable to criticism (e.g. transaction fees not included, insufficient sample sizes, limited subset of market environments studied, survivorship bias, no OOS validation). I want to be able to take a defined trading plan and backtest it comprehensively.

For different reasons, many butterfly trading plans cannot be backtested. Some are complex enough to fall prey to the curse of dimensionality (discussed here and here). Some trading plans emphasize multiple positions in the whole portfolio placed over long periods of time to properly hedge each other. Chewing up such a large time interval for a single instance will only allow for a small number of total instances, which cannot be divided into sufficient sample sizes for the different categories of market environments. Some trading plans are specific enough to be curve-fit (worthless). Other trading plans incorporate discretion. For all practical purposes, discretion can never be backtested. Any discretionary plan thereby becomes a story, which falls into the domain of sales, rapport building, advertising, and marketing (optionScam.com?). I alluded to this in the second-to-last paragraph here.

I have yet to trade butterflies with any consistency because I do not even know if they are better than naked puts, which I consider the most basic of option trades. At the very least, the automated backtester research plan should be able to address this.

Butterfly Skepticism (Part 1)

Before I continue the research plan, I want to express some skepticism I have toward butterfly trades based on what I have seen in recent years.

Butterflies have been a “secret sauce” of the options trading community for some time. They come in all different shapes and sizes with diverse trading plans incorporating varying degrees of discretion. I have seen many traders and vendors describe more/less elaborate adjustment plans that look, at cursory glance, very impressive in their presentation.

Most of these adjustment plans are simply position overlays added later in the trade. Adding a management criterion, giving the trade a fancy name, and selling it as a new idea for $$$ seems alarmingly deceptive to me, especially in the absence of valid backtesting to support it.

Most of the complexity amounts to adding subsequent positions within specified time constraints. For example, adding a put credit spread (PCS) later in the trade to raise the upper expiration line may seem appealing. Consider these:


If they happen at all then either is going to be triggered much later than butterfly initiation since decay occurs over time. I strongly believe nothing to be magical about any particular DTE for trade inception or adjustment. Failure to explore any other time a trade could be placed or adjustment made is short-sighted as I have written about here, here, and here.

If I would consider doing the adjustment later, then I should backtest the adjustment earlier as well. With regard to the above example, I already plan to study shorter- and longer-dated PCS as part of the automated backtester research plan. Whether the trading plan works as a whole is simply a question of whether the sum of its parts (e.g. symmetrical butterfly plus shorter-dated PCS) is profitable.

I will continue next time.

Constant Position Sizing of Spreads Revisited (Part 4)

Today I will conclude this blog digression by deciding how to define constant position size, which I believe is important for a homogeneous backtest.

The leading candidates—all mentioned in Part 3—are notional risk, leverage ratio, and contract size.

Possible means to achieve—both mentioned in Part 2—are fixed credit and fixed delta.

I thought it might be the case that fixed delta results in a fixed leverage ratio. I suggested this in the last paragraph of Part 1, where I asked whether fixed delta would lead to a constant SWUP percentage. For naked puts under Reg T margining, the gross requirement is notional risk. For spreads under Reg T margining, notional risk is spread width x # contracts; while notional risk may be fixed, the SWUP percentage varies.

Speaking of margining, we also have Reg T versus portfolio margining (PM) to complicate things. Both focus on a fixed percentage down (e.g. -100% for Reg T vs. -12% for PM) on the underlying. However, PnL at -12% can vary significantly with underlying price movement. PnL for spreads at -100% will not change as the underlying moves around because the long strike—at which point the expiration risk curve goes horizontal to the downside—is so far above.

Implied volatility (IV) also needs to be teased out since it will affect some of these parameters but not others. Given fixed strike price, IV is directly (inversely) proportional to delta (relative moneyness). For naked puts assuming constant contracts and fixed delta, IV is inversely proportional to notional risk and to leverage ratio. IV does not relate to leverage ratio for spreads, which is net liquidation value (NLV) divided by notional risk as defined two paragraphs above in the last sentence.

After spending extensive time immersed in all this wildly theoretical stuff, I seem to keep coming back to notional risk, leverage ratio, and fixed delta. The first two vary with NLV* and with # contracts due to proportional slope of the risk graph. Number of contracts can vary to keep notional risk relatively constant as strike price changes but this applies more to naked puts and less to spreads where spread width is of equal importance.

I want to say that for naked puts, the answer is fixed notional risk (strike price x # contracts), but we also need to keep delta fixed to maintain moneyness. With fixed credit, changing the latter would affect slope and leverage ratio. This is how I described the research plan originally and we will see whether an optimal delta exists or whether results are similar across the range. In the midst of all this mental wheel spinning, I seem to have gotten this right for naked puts without realizing it.

I guess I have also lost sight of the fact that this post is not even supposed to be about nakeds (see title)!

Getting back to constant position sizing of spreads, I think we can focus on notional risk and moneyness but we should also factor in SWUP. As the underlying price increases (decreases), spread width can increase (decrease) and we will normalize notional risk by varying contract size. Short strikes at fixed delta will be implemented and compared across a delta range.
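Normalizing notional risk by varying contract size might look like this (the dollar target is hypothetical, and granularity still forces some drift):

```python
def contracts_for_notional(target_notional, spread_width, multiplier=100):
    """Vary contract count to hold notional risk roughly constant as
    spread width changes with the underlying. The $ target below is
    illustrative only."""
    per_contract = spread_width * multiplier   # notional risk per spread
    return max(1, round(target_notional / per_contract))

# $150K target notional: a 25-wide takes 60 contracts, a 50-wide takes 30
print(contracts_for_notional(150_000, 25), contracts_for_notional(150_000, 50))  # 60 30
```

With spread width scaled to the underlying and contracts scaled to notional risk, only moneyness (fixed delta) remains to be held constant.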

Which is what I had settled on before (for spreads)…

[To reaffirm] Which is what I had settled on before (for naked puts)…

As I unleash a gigantic SIGH, I question whether any of this extensive deliberation was ever necessary in the first place.

I think at some level, this mental wheel spinning is what I missed as a pharmacist. The complexity fires my intellectual juices and is great enough to require peer review/collaboration to sort through. Once that is done, selling the strategy is an entirely separate domain suited to different talents, perhaps.

I left a job of the people (co-workers/customers) for a job that begs for people, which I have really yet to find. Oh the irony!

* By association, this is why I stressed magnitude of drawdown as a % of initial account value (NLV) in previous posts.

Constant Position Sizing of Spreads Revisited (Part 3)

Happy New Year, everyone!

The current blog mini-series has been a tangent from the automated backtester research plan. Today I will discuss whether fixed notional risk—with regard to naked puts and spreads—is even important.

This issue is significant because it seems like fixed notional risk is the “last man standing” since I initially mentioned it in Part 1. I have reassessed the importance of so many concepts and parameters in this research plan. The fact that they get misunderstood and reinterpreted is testament to how theoretical and highly complex they are. Especially from the perspective of avoiding confirmation bias, I believe this is all debate that must be had, and a main reason why system development is best done in groups as a means to check each other.

The reason fixed notional risk may not matter is because leverage ratio can vary. I also mentioned this in the third-to-last paragraph here. Leverage ratio is notional risk divided by portfolio margin requirement (PMR). Keeping PMR under net liquidation value and meeting the concentration criterion are essential to satisfy the brokerage. Leverage ratio can be lowered by selling the same total premium NTM. This will affect the expiration curve by decreasing margin of safety as it lifts T+x lines. Analyzing this, somehow, might be worth doing if backtesting over a delta range does not provide sufficient comparison.
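The definitions in this paragraph reduce to simple arithmetic; note the 30% concentration threshold below is a placeholder for illustration, not an actual brokerage rule:

```python
def leverage_ratio(notional_risk, pmr):
    """Leverage ratio as defined above: notional risk divided by
    portfolio margin requirement (PMR)."""
    return notional_risk / pmr

def brokerage_ok(pmr, nlv, concentration, max_concentration=0.30):
    """PMR must stay under net liquidation value (NLV) and the
    concentration criterion must be met (the 30% cap is assumed)."""
    return pmr < nlv and concentration <= max_concentration

print(leverage_ratio(1_350_000, 150_000))    # 9.0
print(brokerage_ok(150_000, 500_000, 0.25))  # True
```

Selling the same total premium NTM lowers the numerator relative to the requirement, which is how leverage ratio can be dialed down at the cost of margin of safety.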

Whether “homogeneous backtest” should mean constant leverage ratio throughout is another highly theoretical question that is subject to debate. Keeping allocation constant, which I aim to do in the serial, non-overlapping backtests, is one thing, but leverage can vary in the face of fixed allocation. I discussed this here in the final four paragraphs. In that example, buying the long option for cheap halves Reg T risk but dramatically increases the chance of blowing up (complete loss) since the market only needs to drop to 500 rather than zero. While the chance of a drop even to 500 is infinitesimal, theoretically it could happen, and in relative terms the chance of that happening is much greater than a drop to zero.

Portfolio margin (PM) provides leverage because the requirement is capped at T+0 loss seen 12% down on the underlying. In the previous example, 500 represents a 50% drop. Even under PM, though, leverage ratio can vary because of what I said in second-to-last sentence of paragraph #4 (above).

When talking just about naked puts, much of this question about leverage seems to relate to how far down the expiration curve extends at a market drop of 12%, 25%, or 100%. This brings contract size back into the picture because contract size is proportional to downside slope of that curve.

With verticals, though, number of contracts is less meaningful because width of the spread is also important. The downside slope will be proportional to number of contracts. The max potential loss of the vertical depends not only on the downside slope, but for how long that slope persists because the graph only slopes down between the short and long strikes.

Either way, you can see how number of contracts gets brought back into the discussion and could, itself, be mistaken as being sufficient for “constant position size.”

I certainly was not wrong with my prediction from the second paragraph of Part 1.

Constant Position Sizing of Spreads Revisited (Part 2)

I’m doing a Part 2 because early this morning, I had another flash of confusion about the meaning of “homogeneous backtest.”

The confusion originated from my current trading approach. Despite my backtesting, I still trade with a fixed credit. If I used a fixed delta then 2x-5x initial credit (stop loss) would be larger at higher underlying prices. Gross drawdown as a percentage of the initial account value would consequently be higher. This means drawdown percentage could not be compared on an apples-to-apples basis across the entire backtesting interval.

Read the “with regard to backtesting” paragraph under the graph shown here. Constant position size (e.g. number of contracts or notional value?), apples-to-apples comparison of PnL changes (e.g. gross or percentage of initial/current account value?) throughout, and evaluating any drawdown (e.g. gross or as a percentage of initial/current account value?) as if it happened from Day 1 are all nebulous and potentially contradictory references (as described).

In this post, I argue:

     > Sticking with the conservative theme, I should also calculate
     > DD as a percentage of initial equity because this will give a
     > larger DD value and a smaller position size. For a backtest
     > from 2001-2015, 2008 was horrific but as a percentage of
     > total equity it might not look so bad if the system had
     > doubled initial equity up to that point.

If I trade fixed credit then I am less likely to incur drawdown altogether at higher underlying price, which makes for a heterogeneous backtest when looking at the entire sample of daily trades. If I trade fixed delta then see the last sentence of (above) paragraph #2.

I focused the discussion on position size in this 2016 post where I stressed constant number of contracts. Recent discussion has neither focused on fixed contracts nor fixed credit.

“Things” seem to “get screwed up” (intentionally nebulous) if I attempt to normalize to allow for an apples-to-apples comparison of any drawdown as if it occurred from Day 1.

If I allow spread width [if backtesting a spread] to vary with underlying price and I sell a fixed delta—as discussed in Part 1—then a better solution may be to calculate gross drawdowns as a percentage of the highwater account value to date. I will leave this to simmer until my next blogging session for review.
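Calculating maximum drawdown as a percentage of the high-water account value to date is simple to express; a minimal sketch:

```python
def max_dd_pct_of_highwater(equity):
    """Maximum drawdown as a percentage of the high-water account
    value to date, per the alternative proposed above."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

# Equity doubles to 200, then gives back 25% of the new peak:
print(max_dd_pct_of_highwater([100, 150, 200, 150, 180]))  # 0.25
```

Contrast this with drawdown as a percentage of initial equity, which (per the blockquote above) is the more conservative number for position sizing.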

I was going to end with one further point but I think this post has been sufficiently thick to leave it here. I will conclude with Part 3 of this blogging detour next year!

Constant Position Sizing of Spreads Revisited (Part 1)

In Part 7, I said constant position size is easy to do with vertical spreads by maintaining a fixed spread width. I now question whether a fixed spread width is sufficient to achieve my goal of a homogeneous backtest throughout.

I enter this deliberation with reason to believe it will be a real mess. I have addressed this point before without a successful resolution. This post provides additional background.

The most recent episode of my thinking on the matter began with the next part of the research plan on butterflies. I want to backtest ATM structures and perhaps one strike OTM/ITM, two strikes OTM/ITM, etc. Rather than number of strikes, which would not be consistent by percentage of underlying price, a better approach may be to specify % OTM/ITM.

I then started thinking about my previous backtesting along with reports of backtests from others suggesting spread width to be inversely proportional to ROI (%). It makes sense to think the wider the spread, the more moderate the losses, because it’s more likely for a 30-point (for example) spread to go fully ITM than a 50-point spread, since the underlying has to move an additional 20 points in the latter case. This raises the question of whether an optimal spread width exists: while wider spreads will incur fewer max losses, they also carry proportionally higher margin requirements.

Also realize that a 30-point spread at a low underlying value is relatively wide compared to a 30-point spread at a high underlying price. I mentioned graphing this spread-width-to-underlying-price (SWUP) percentage in Part 7. We could look to maintain a constant SWUP percentage if granularity is sufficient; with the 10- and 25-point strikes most liquid, having to round to the nearest liquid strike could force SWUP percentage to vary significantly (especially at lower underlying prices).
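A quick sketch shows how rounding a target SWUP percentage to the nearest liquid strike increment forces the realized percentage to vary (the 25-point chain is assumed here; 10-point strikes would round less coarsely):

```python
def nearest_liquid_strike(width, increment=25):
    """Round a target width to the nearest liquid strike increment."""
    return round(width / increment) * increment

def swup_pct(spread_width, underlying):
    """Spread-width-to-underlying-price (SWUP) percentage."""
    return spread_width / underlying

# Targeting 2% SWUP: rounding forces the realized percentage to vary,
# especially at lower underlying prices.
for underlying in (1200, 2700):
    width = nearest_liquid_strike(0.02 * underlying)
    print(underlying, width, round(swup_pct(width, underlying), 4))
```

At 1200 the 2% target rounds up to a 25-wide (2.08%), while at 2700 it rounds down to a 50-wide (1.85%), illustrating the granularity problem.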

All of this is to suggest that spread width should be left to fluctuate with underlying price, which contradicts what I said about fixed spread width and constant capital. We can attempt to normalize total capital by varying the number of contracts as discussed earlier with regard to naked puts. From the archives, similar considerations about normalizing capital and granularity were discussed here and here.

Aside from notional value, I think the other essential factor to hold constant for a homogeneous backtest is moneyness. As mentioned above, spreads should probably not be sold X strikes OTM/ITM. We should look to sell spreads at fixed delta values (e.g. “short strike nearest to Y delta”) since delta takes into account days to expiration, implied volatility, and underlying price.

An interesting empirical question is how well “long strike nearest to Z delta” does to maintain a constant SWUP percentage.

Automated Backtester Research Plan (Part 8)

After studying put credit spreads (PCS) as daily trades, the next step is to study them as non-overlapping trades.

As discussed earlier, I would like to tabulate several statistics for the serial approach. These include number (and temporal distribution) of trades, winning percentage, compound annualized growth rate (CAGR), maximum drawdown, average days in trade, PnL per day, risk-adjusted return (RAR), and profit factor (PF). Equity curves will represent just one potential sequence of trades and some consideration could be given to Monte Carlo simulation. We can plot equity curves for different account allocations such as 10% to 70% initial account value by increments of 5% or 10% for a $50M account. A 30% allocation (for example) would then be $15M per trade. By holding spread width constant, drawdowns throughout the backtesting interval may be considered normalized.
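The Monte Carlo idea above might be sketched by resampling per-trade returns with replacement to generate alternative equity curves (all numbers hypothetical):

```python
import random

def monte_carlo_final_equity(trade_returns, n_paths=1000, seed=42):
    """Resample per-trade returns with replacement to generate many
    possible equity curves instead of the single realized sequence.
    Returns the (min, max) final equity multiple across paths."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_paths):
        equity = 1.0
        for r in rng.choices(trade_returns, k=len(trade_returns)):
            equity *= 1 + r
        finals.append(equity)
    return min(finals), max(finals)

lo, hi = monte_carlo_final_equity([0.02, -0.05, 0.03, 0.01, 0.04])
print(lo <= hi)  # True
```

The spread between the worst and best resampled paths gives a sense of how much the single realized equity curve owes to trade ordering.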

As an example of the serial approach, I would like to backtest “The Bull” with the following guidelines:


I will not detail a research plan for call credit spreads. If we see encouraging results from looking at naked calls then this can be done as described for PCS.

I also am not interested in backtesting rolling adjustments for spreads due to potential execution difficulty.

Thus far, the automated backtester research plan has two major components: study of daily trades to maximize sample size and study of non-overlapping trades. I alluded to a third possibility when discussing filters and the concentration criterion: multiple open trades not to exceed or match one per day.

This is suggestive of traditional backtesting I have seen over the years where trades are opened at a specified DTE. For trades lasting longer than 28 (or 35 every three months) days, overlapping positions will result. As discussed here, I am not a proponent of this approach. Nevertheless, for completeness I think it would be interesting to do this analysis from 30-64 DTE and compare results between groups, which I hypothesize would be similar. To avoid future leaks, position sizing should be done assuming two overlapping trades at all times. ROI should also be calculated based on double the capital.

Another aspect of traditional backtesting I have eschewed in this trading plan is the use of standard deviation (SD) units. I have discussed backtesting many trades from (for example) 0.10-0.40 delta by units of 0.10. More commonly used are 1 SD (0.16 delta corresponding to 68%), 1.5 SD (0.07 delta corresponding to 86.6%), and 2 SD (0.025 delta corresponding to 95%). Although not necessary, we could run additional backtests based on these unit measures for completeness.
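The SD-to-delta correspondences quoted above can be checked with the standard normal CDF; this is a rough mapping that ignores drift and lognormality:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function (no SciPy needed)."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def sd_to_delta(sd):
    """Approximate option delta for a strike sd standard deviations
    OTM -- a crude mapping for checking the round numbers above."""
    return 1 - norm_cdf(sd)

for sd in (1.0, 1.5, 2.0):
    print(sd, round(sd_to_delta(sd), 3))  # ~0.159, ~0.067, ~0.023
```

These land close to the commonly quoted 0.16, 0.07, and 0.025 deltas, so the extra SD-based backtests would mostly overlap the 0.10-0.40 delta sweep already planned.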