Quantutorial – Garbage In Garbage Out

by The Mole, April 26, 2017

One of my readers, let’s call him Francis, sent me an interesting email yesterday. Apparently he had been inspired by Scott’s original post on the use of scatter charts for what I call raw edge discovery (RED), for lack of a sexier term. So he proceeded to spend a significant amount of time slinging spreadsheets in Excel, which can get quite involved and in my opinion is rather error-prone, as each extra condition requires the addition of at least one more column. His primary focus thus far had been mean reversion and he is now attempting to apply a similar approach to trending or momentum systems.

His main concern apparently is choosing the right comparison periods. Should he compare post-condition results with a longer preceding timeframe, in an attempt to isolate the intermediate-term market drift, and hope for a somewhat positive correlation? Bear in mind that trending systems have a pretty low win rate (i.e. 50% or less). So should winners in momentum or trending systems have a slope greater than 1 with an intercept near zero (i.e. a steeper line) to offset a low win rate, in other words an acceleration above market drift?

Now before you read on please make sure that you’re caught up with my recent post on linear regression first. It’s rather short and to the point (especially given the topic) and I promise you that you will be able to fully grasp the concept of linear regression after reading it.

Back To The Basics

Anyway, these are great questions but pondering on all this actually made me circle back to some of the basic assumptions we seem to be making during system development as well as for RED. So let’s take a step back and ask ourselves WHAT we are actually trying to figure out and why. In essence we are attempting to find a best fit curve for a set of features. Each feature consists of an x and a y value, e.g. 25, 40 as shown below.


If you remember my previous post on linear regression you understand that all we are doing here are three things in order to solve for y = mx + b as well as r squared:

  1. We figure out m (the best fit slope) and b (the intercept).
  2. Then we produce ys of the best fit line by plugging m and b into each x value (y = mx + b). We now have a regression line.
  3. We then calculate r squared, our coefficient of determination. Basically this involves comparing the squared error of our regression line against the squared error of the mean of all the original ys: r squared is 1 minus the ratio of the former to the latter. That value then tells us how good our fit is.
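The three steps above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function names `best_fit` and `r_squared` and the sample points are made up for this example, and in practice numpy or scikit-learn would do the same job in fewer lines.

```python
from statistics import mean

def best_fit(xs, ys):
    """Step 1: slope m and intercept b of the least-squares line y = m*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope = co-variation of x and y divided by the variation of x.
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    b = y_bar - m * x_bar
    return m, b

def r_squared(xs, ys, m, b):
    """Step 3: 1 - (squared error of regression line / squared error of mean line)."""
    y_bar = mean(ys)
    se_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    se_mean = sum((y - y_bar) ** 2 for y in ys)
    return 1 - se_line / se_mean

# Hypothetical sample points, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, b = best_fit(xs, ys)
# Step 2: the ys of the regression line.
line_ys = [m * x + b for x in xs]
r2 = r_squared(xs, ys, m, b)
```

An r squared near 1 means the line explains most of the variation in the ys, while a value near 0 means it does hardly better than simply drawing a horizontal line through the mean.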

But wait a minute. What exactly are we plotting here? Well, Scott’s post talked about mean reversion which involves defining some sort of market condition and then comparing the price delta prior and after said market condition, which effectively gives us xs and ys to be plotted in our scatter chart. But aren’t we making some very crucial assumptions? After all this is supposed to be ‘raw edge discovery’ but how raw can it really get?


Here’s a chart of which I know nothing except that it’s a list of OHLC vectors which we visualize via candle bars. We want to perform RED for mean reversion, so what do we do? Francis’ approach thus far has been to somehow pick an entry condition (we’ll get to that further below) and then count back a certain number of candles and measure the price delta. That gives him X. Now he counts forward a certain number of candles and also records the price delta, which yields him Y.
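To make the procedure concrete, here is a rough sketch of how such X/Y pairs could be extracted. It assumes closing prices in a plain list and a list of candle indices at which the entry condition fired. The function name `xy_pairs`, the parameters `lookback` and `lookahead`, and the prices are all made up for illustration and are not taken from Francis’ spreadsheets.

```python
def xy_pairs(closes, trigger_indices, lookback=3, lookahead=3):
    """Collect (x, y) price deltas around each trigger candle.

    x = price change over the `lookback` candles before the trigger,
    y = price change over the `lookahead` candles after the trigger.
    """
    pairs = []
    for i in trigger_indices:
        # Skip triggers too close to either end of the series.
        if i - lookback < 0 or i + lookahead >= len(closes):
            continue
        x = closes[i] - closes[i - lookback]
        y = closes[i + lookahead] - closes[i]
        pairs.append((x, y))
    return pairs

# Hypothetical closing prices and trigger candles.
closes = [100, 101, 103, 102, 99, 98, 100, 103, 104, 102]
pairs = xy_pairs(closes, trigger_indices=[0, 4, 5, 8])
# Same triggers, but counting back four candles and forward two.
pairs_4_2 = xy_pairs(closes, trigger_indices=[4], lookback=4, lookahead=2)
```

Note that `lookback` and `lookahead` are free parameters here: changing either one produces a different scatter chart from the very same price series.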

So far so good. But being of discerning minds there are several crucial questions that should immediately occur to us. Let’s say we counted back three candles and counted forward three candles in order to arrive at X and Y:

  1. Why are we counting back three candles and then forward three? Why not back three and forward four or five? Or the opposite – count back four and forward two perhaps?
  2. Are we really measuring mean reversion here? Or is it simply the ability for price to revert back to its origin within x amount of candles?
  3. Why are we using candle intervals in the first place? In mean reversion are we given a timer for it to occur? (actually yes – which we can mathematically define by its half-life but you won’t like what you get)

I’m sure you can think of several more questions but it all boils down to the fact that we are artificially defining an arbitrary range in hopes of discovering a market condition we can exploit. But if you look at the chart above then it becomes clear that simply moving the measured window forward or backward will not produce the expected results. For one we would most likely wind up drowning in an ocean of noise which at best would obfuscate the much smaller number of positives that we are looking for.

We Need More Context


And there it is right there – we need more context. Because the vast majority of price series are not strictly mean reverting but follow a geometric random walk. It is the returns, not the prices, that are usually randomly distributed around a mean of zero, but we can’t trade returns. Of course as traders we do not require a price series to be purely mean reverting in the mathematical sense, i.e. that the change in price is proportional to the difference between the mean price and the current price. That just doesn’t happen in financial markets. It suffices if price is ‘somewhat’ mean reverting at times. Which is the very reason why we resort to using indicators, oscillators, or various statistical measures to hopefully increase our odds.
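The mathematical definition above, together with the half-life mentioned earlier, can be sketched as follows. The idea is to regress each price change on the previous price: if the resulting slope (called `lam` here) is negative, price changes pull back toward the mean, and the half-life is commonly approximated as -ln(2) / lam. Treat this as a sketch under assumptions: the function name `half_life` is made up, the formula is the usual continuous-time approximation, and the price series is an idealized one constructed to decay 10% toward its mean per candle.

```python
import math
from statistics import mean

def half_life(prices):
    """Estimate the mean-reversion half-life (in candles) of a price series.

    Regresses the price change on the previous price. Returns None when the
    slope is not negative, i.e. when there is no measurable mean reversion.
    """
    lagged = prices[:-1]
    deltas = [b - a for a, b in zip(prices[:-1], prices[1:])]
    x_bar, d_bar = mean(lagged), mean(deltas)
    lam = (sum((x - x_bar) * (d - d_bar) for x, d in zip(lagged, deltas))
           / sum((x - x_bar) ** 2 for x in lagged))
    if lam >= 0:
        return None
    return -math.log(2) / lam

# Idealized series: the distance from a mean of 100 shrinks by 10% each candle.
prices = [100 + 10 * 0.9 ** t for t in range(50)]
hl = half_life(prices)

# A steadily rising series shows no mean reversion at all.
trending = [100 + t ** 2 for t in range(20)]
hl_trending = half_life(trending)
```

On real price data the slope is usually tiny and noisy, so the resulting half-life tends to be either very long or undefined.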

Is More More Or Less?

Interestingly the SMA on the chart above was added after I had picked the trigger and the x,y pair. So it seems my own personal perception of price series is subconsciously looking for price patterns I have observed in the past. Now that trigger candle just so happens to be a) a hammer and b) sits on top of that SMA. So we could conceivably introduce this as an additional condition in order to extract our x,y vectors/pairs/tuples. It would make a lot of sense but that in itself brings about a series of new questions:

  1. Why did I pick an SMA(14) [it was the default] and not an SMA(21) or SMA(50)?
  2. Why use an SMA in the first place and not an EMA or something completely different?
  3. For mean reversion, don’t we want to revert to the mean? So shouldn’t x and y be near that SMA and the trigger away from it?
  4. Do we really need to have a static count of candles for defining x and y or should we parse for a certain condition within a price window?
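To see how much the first question matters, here is a small sketch that flags candles whose low sits close to the moving average, once for a short and once for a longer SMA. The helper names `sma` and `near_sma`, the `tolerance` parameter and the prices are assumptions made for this illustration only.

```python
def sma(values, length):
    """Simple moving average; None until `length` values are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < length:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - length:i + 1]) / length)
    return out

def near_sma(lows, closes, length, tolerance):
    """Indices of candles whose low is within `tolerance` of the SMA of closes."""
    averages = sma(closes, length)
    return [i for i, (low, avg) in enumerate(zip(lows, averages))
            if avg is not None and abs(low - avg) <= tolerance]

# Hypothetical closes and lows.
closes = [10, 11, 12, 13, 12, 11, 12, 13, 14, 15]
lows = [9.5, 10.5, 11.0, 12.8, 11.0, 10.0, 11.5, 12.1, 13.5, 14.8]

hits_short = near_sma(lows, closes, length=3, tolerance=0.2)
hits_long = near_sma(lows, closes, length=5, tolerance=0.2)
```

The same candles qualify as triggers under one SMA length and not under the other, which is exactly why the choice of length needs to be justified rather than left at a default.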

In particular 3. is a very interesting question as based on what I’m seeing on the chart I would probably be tempted to switch the trigger to the X mark and Y to where the trigger is. But why? Just because I’m looking at an SMA now? This goes to show how very random and subjective our own perception is. Right before I added that moving average the current arrangement seemed like a pretty good example of mean reversion to me. Now that I added more context I’m suddenly starting to see things differently.

Also 4. is well worth considering. In that particular case a supposed long campaign would have worked out fine but we would have been more profitable four candles later. We could for example define a window of let’s say eight candles and then record whether or not our target price was hit during that period. In addition let’s not forget that price continues to move and thus may change our target. What do you really define as mean reversion? The purest form would be a reversion to the mean. But remember that financial price series follow a random walk, so that mean will continue moving as well. Which in turn can be approximated by a moving average.
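The window idea could look something like the sketch below for a long campaign. It is deliberately simplified: the target is a fixed price rather than a moving mean, and the function name `target_hit`, its parameters and the highs are invented for this example.

```python
def target_hit(highs, entry_index, target, window=8):
    """Return how many candles after entry the target was first reached.

    Only the `window` candles following the entry are checked.
    Returns None if the target was not reached within that window.
    """
    last = min(entry_index + window, len(highs) - 1)
    for i in range(entry_index + 1, last + 1):
        if highs[i] >= target:
            return i - entry_index
    return None

# Hypothetical candle highs following an entry at index 0.
highs = [10.0, 10.5, 10.2, 10.8, 11.3, 11.0, 12.1, 11.5, 11.9, 12.5]

hit_in_8 = target_hit(highs, entry_index=0, target=12.0)            # hit on candle 6
hit_in_4 = target_hit(highs, entry_index=0, target=12.0, window=4)  # not hit
```

Recording the number of candles instead of a plain yes/no also gives us the distribution of reversion times, which is what a static candle count hides.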

Live With It

And all this and more are exactly the issues we continue to face even during the ‘purest’ RED process I can envision. At some point we need to draw the line between RED and system development. On one hand we seek pure evidence of a market inefficiency we may be able to exploit. On the other hand we are dealing with highly noisy data which requires at minimum filtering and additional price derivatives or correlations in order to separate the wheat from the chaff. Every single thing we do needs to be questioned and considered in the context of purity. In essence what we are looking for is a linear trading strategy which is truly ‘parameterless’. But such a thing cannot exist without compromise.

Garbage In Garbage Out (GIGA)

Now given all the above let’s one more time consider Francis’ questions about how to adapt RED to break out or trending systems. Well, it all depends on his xs and ys now, doesn’t it? When he was testing for mean reversion, was he really testing mean reversion or something else? Actually having seen his spreadsheets I know that he was using additional measures but I still felt that focusing on static candle ranges for x and in particular for y was somewhat unconvincing. Because at least in my mind (without yet having proven this however) I suspect that there is probably a high standard deviation within MR time windows. In other words it may take one or two candles to revert (if and when it does) or it may take seven or more. That in part also depends on the instrument traded as many futures contracts for example exhibit clear (realized) volatility patterns. Then there are roll overs and seasonality, etc. It just may be better to give yourself a window instead. Or not – we don’t know until we test for it.

Choose Your Input Carefully

The GIGA problem by the way is not limited to mean reversion and scatter plots, which primarily deal with x/y values (you can have multiple dimensions but it gets ugly). It’s a significant and much underreported problem I see all across machine learning these days. Some of the smartest people you will ever run into seem to think for some reason that you can simply scratch together a set of arbitrary features (e.g. price, moving average, P/E ratio, volume, etc.) and throw those at a Neural Net, Support Vector Machine, Bayesian Network, etc. Which I can assure you from very personal experience will wind up failing quite spectacularly.

It doesn’t really matter what exactly you put in or what your specific belief system or market lens is, be it purely technical, fundamental, statistical, or mathematical. What does matter is that you take extra care in defining your input and think very carefully about why exactly you believe it offers value to your analysis. And then go about proving it.


You may recall that Francis asked about the steepness of the curve and the intercept in the context of RED for trending or break out systems. I would expect those outliers to drive the best fit curve, which produces a steeper curve. But only if one manages to reduce the noise factor, which is an inherent problem. Remember that linear regression minimizes the squared errors across all points to produce its best fit line. A few large outliers do pull on that line, but when they are vastly outnumbered by noisy observations the noise dominates the fit and the outliers you are seeking to identify are effectively drowned out. Linear regression can definitely work as RED for trend systems but it’s more complex as you will need to be extra careful about vector selection.
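A small simulation illustrates why vector selection matters so much here. It mixes 1,000 pure-noise points with 50 points that follow a steep line (slope 2) and fits a regression to both the full set and the selected subset. The data is synthetic and seeded, and the names `slope`, `noise` and `trend` are made up for this sketch. It says nothing about real markets, only about how the fit behaves.

```python
import random
from statistics import mean

def slope(points):
    """Least-squares slope of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_bar, y_bar = mean(xs), mean(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in points)
            / sum((x - x_bar) ** 2 for x in xs))

rng = random.Random(0)  # fixed seed so the run is repeatable

# 1,000 observations where y has nothing to do with x.
noise = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]

# 50 observations that follow y = 2x exactly (the 'winners' we hope to find).
trend = []
for _ in range(50):
    x = rng.gauss(0, 1)
    trend.append((x, 2 * x))

slope_all = slope(noise + trend)   # fit on everything
slope_trend = slope(trend)         # fit on the selected vectors only
```

Fitted on everything, the slope stays close to zero even though a steep relationship is hiding in the data. Fitted on the selected subset only, it comes out at 2.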

About The Author
The Mole
Mole created Evil Speculator amidst the chaos of the financial crisis in early August of 2008. His vision for Evil Speculator is a refuge of reason, hands-on trading knowledge, and inspiration for traders of all ages and stripes. You can follow him and his nefarious schemes at various social media waterholes below.
  • Mark Shinnick

    Someone correct me but I believe by today most of the margin calls against last weeks sure-thing volatility bear pigfest have been filled.

  • OJuice

    I believe that’s correct for futures. On the options side I believe there could be some rolling over to future weeks/months that will spread the pain, depending on the strategies.

  • Mark Shinnick

    Ok. This would be happening at the former highs resistance of a number of indexes, so a natural spot for some new bear justification.

  • Mary

    Came back quickly to Mole’s inflection point ….

  • Mary

    NQ the weak sister …

  • randomuser6789

    Sell in MAY and go away.
    It is still April. Where is everyone? Mole puts up an excellent post like this, market is near all time highs, and nobody is here.

  • Sir Mole III

    Frankly I wonder…

  • OJuice

    I think it is more active when there is conflicting views or evidence. And the discussion and comments help people reconcile their plans/strategies. Right now equities appear to be a one way trade so there isn’t much debate…

  • TradingGangster

    Great post, along with the original RED post. To whoever is trying to do this in Excel, that’s an act of masochism for this type of research. Learning python along with some easy to use libraries like pandas, matplotlib, scikitlearn, etc. will save you hours/years/months of time and spare a possible trip to rehab.

  • Sir Mole III

    Exactly – frankly it’s impossible to do this consistently and sanely in Excel. It’s about 20 to 20 lines of code in python via pandas, numpy, scikitlearn, etc. If you don’t want to set all that up you can just head over to quantopian where you will find a full fledged python console and access to a myriad of tools.

  • Mark Shinnick

    Yes…seems that way. Yet, for myself, with various indexes smack at chart resistance, and PM’s backed-off from it, the uncertainty appears much higher.

  • Gold_Gerb

    The VIX is on the floor. I’m not surprised of the silence.
    these quiet periods have happened before.
    and sometimes that’s when Mole does his best work.

  • Sir Mole III

    I always do my best work! 😉

    And jeezes that’s low… insane.

  • Sir Mole III

    I’m just a bit confused as to why there was so much excitement and a flurry of activity when Scott posted about system development. However when I do it there’s almost complete silence. So I guess it must be me or my content. A bit confusing to be honest as I’m not sure if I should continue with pertinent posts. FYI – this post took me five hours to write and researching all this is a huge amount of effort. Does it perhaps go above everyone’s head?

  • TradingGangster

    Indeed. I learned python last year specifically for a workshop at Quantcon, It took me about 6 months to appreciate how awesome their research environment is with all of the (free) high quality data they provide.

  • BTrader

    I personally love your post, just been too busy managing portfolios….thx for the great post Mole.

  • Sir Mole III

    I looked at quantopian first and quickly decided that I really needed to learn python from the ground up to truly understand what these guys are doing. Old Java/C# coder here so it was quite a change for me – but one for the better. Python is so f…ing awesome for dataframe/list parsing – stuff that would literally take me days to write can be done in a few hours or less in python. And the available statistics/scientific tools are just super juicy. I’m currently squeezing myself through a pretty extended python based machine learning course which will probably take me all the way through summer to complete.

    I haven’t had this much fun learning a new language in a very very long time to be sure. Would love to introduce some of that here but knowing my audience it would most likely be a waste of time. The learning curves are simply too massive for a general audience.

  • Sir Mole III

    Appreciated but I wasn’t fishing for compliments. Rather I am trying to analyze why there is such a lack of response so that I can do better in the future.

    (maybe I should just let Scott write them – LOL)

  • Mark Shinnick

    It may have nothing to do with yourself Mole.

  • BTrader

    It is because most people are looking for setups and not for knowledge on how to trade properly.

  • Sir Mole III

    That’s actually what I have been delivering and continue to do. I did however get the impression that people wanted to develop their own systems when Scott was posting his series.

    Well, if you’re right then life is much easier for me as posting setups is what I’ve done well and consistently for the past eight years.

  • TradingGangster

    Which ML course are you going through? I’m a big fan of Jason Brownlee’s books – no affiliation aside from being a big fan of his books. They are aimed at practitioners, not academics and don’t bog you down with needless theory. Within a weekend, I was fairly proficient with using/applying a lot of the ML libraries in Python. He has a book that breaks open most of the ML algos into raw python without pandas or scikitlearn. It’s great and some of his python code really helped take my python skills up a notch.

  • Sir Mole III

    I’m doing sentdex – Machine Learning With Python. It’s exquisite if you can handle fast talkers :-)

  • Sir Mole III

    Re. Jason’s books – never heard of him but will check him out when I’m done with my current course. I did enjoy Earnest Chan’s books but he’s using matlab (which I don’t own and have very little interest in) and his stuff seems a bit too theoretic IMO. But maybe that’s because I’m not that strong in math to be honest – perhaps comparatively gifted vs. the mean but the advanced stuff escapes me.

    We should stay in touch as I don’t know many people who really enjoy this domain. Email me at admin@ privately and perhaps we can exchange notes here and there.

  • Mark Shinnick

    Wow, nearing vix lows.

  • Mary

    another visit to Mole’s inflection point …

  • Mark Shinnick

    Ok, so a bit short at the really obvious triple-touched place…just probably too obvious but there’s a chance here I don’t believe for serious downside, but instead to re-encourage the bears, at which point I’m a buyer again.

  • Gold_Gerb

    United kills Easter Bunny. news at 11:00.

  • captainboom

    This is great stuff Mole, and exactly the kind of thing I’m interested in to further expand on Scott’s work. That said, I’m ass deep in alligators with work, so this will get web archived locally along with the other system development materials I have archived. I’ll revisit after July 4.

  • sutluc

    Very close to over my head.
    It’s fairly complicated stuff for uncomplicated me. I have no math or stats training beyond 30 year old high school stuff. Not a programmer. Don’t work in related fields.

    I think I have a pretty good grasp of what you wrote, but it took me about as long in study as it took you to write it.

    I find Scott’s writing style a little easier to work with than yours, and he seems to not dig as deep as you. (and I like white charts)

    I even spent some time wondering why GIGA instead of GIGO. Decided it was not significant.

  • kim

    Like button is almost invisible, really hard to find :)

  • captainboom

    I think he meant GIGO. Keep in mind that Mole is not a native English speaker, so he can code switch into German. I’m guessing that the German word for ‘out’ starts with an A.