Backtesting Data Considerations

Now that we have a high-level overview of backtesting, I'll discuss data considerations. We'll likely encounter data quality issues along the way and will need to identify and clean them before the data are used in the system. Below I discuss some of these issues and how I will solve them. In later posts I'll discuss data sources and data storage technologies. Finally, I'll explore a potential data pipeline that automates acquisition, cleaning, and storage.


Data is the most critical aspect of a backtesting model. I spend a lot of time acquiring and manipulating data to feed into all sorts of analyses, and I can attest that bad input data builds poor models and leads to bad decision making. Part of the challenge is to automate the data acquisition, checking, and cleaning process. Fortunately, in the age of big data, there are many tools and techniques to help.

Quality issues

Regardless of the source of your data, various quality checks need to be performed. If you purchase data from a vendor, you'll likely start with a clean set, but it is still important to check. In my experience, financial time series data issues can be summarized in three categories.

Missing values

Market data will be missing for weekends, holidays, and other non-trading days. These data are expected to be missing but still have to be handled within the system; fortunately, that is straightforward in code. Additionally, options expire, so we have to keep the expiration date in mind throughout our cleaning methods and strategy development.
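One way to handle expected gaps is to index the data on a business-day calendar, so weekends and other non-trading days never appear as missing values in the first place. A minimal sketch with hypothetical prices (the dates and values are made up for illustration):

```python
import pandas as pd

# Hypothetical daily closes spanning a weekend; Sat/Sun rows are absent.
prices = pd.Series(
    [100.0, 101.5, 102.0],
    index=pd.to_datetime(["2011-06-03", "2011-06-06", "2011-06-07"]),  # Fri, Mon, Tue
)

# Reindex onto a business-day calendar; the weekend simply never
# appears, so nothing is spuriously flagged or filled.
bdays = pd.bdate_range(prices.index.min(), prices.index.max())
prices = prices.reindex(bdays)
print(prices.isna().sum())  # no missing values on the business-day calendar
```

Exchange holidays would still show up as gaps under this scheme; a full solution would use an exchange-specific holiday calendar.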

Missing data without explanation is trickier. If a contract suddenly disappears on a Tuesday, our results can be incorrect. Fortunately, there are ways to handle these situations. The excellent Python package pandas has a superb interface for cleaning and filling missing data. I'll be using these tools extensively throughout the system.
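A quick sketch of three common fills, using a hypothetical series with one unexplained gap (the numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical series with one unexplained gap -- the contract
# that "disappears on a Tuesday".
s = pd.Series(
    [10.0, 10.2, np.nan, 10.6, 10.8],
    index=pd.bdate_range("2011-06-06", periods=5),
)

# Option 1: fill with a summary statistic.
filled_mean = s.fillna(s.mean())

# Option 2: carry the last observation forward.
filled_ffill = s.ffill()

# Option 3: interpolate (linear by default).
filled_interp = s.interpolate()
print(filled_interp.iloc[2])  # roughly 10.4, halfway between its neighbors
```

Which fill is appropriate depends on the strategy: forward-filling is the most conservative for prices, since it never invents a value that could not have been observed.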

Small plug: this barely scratches the surface of what pandas can do. For example, fillna will accept a value computed by any function (even one you write), and interpolate supports quadratic, cubic, and spline methods (among many others) in addition to the linear default. I highly encourage you to take a look at pandas.

One might consider time series or regression models to predict the missing values. Of course, fitting a model introduces the issues generally found in time series modeling: parameter estimation, correct model specification, choice of training period, and others. Unless absolutely necessary, I will avoid fitting time series models simply to fill missing data. Given I'm building a system with daily data, it seems like overkill (correct me if I'm wrong).

Impossible or outlier values

Financial time series data are notorious for having bad "prints": incorrect prices that make it into the published data feed. Consider the following chart, which shows a bad print in the South African rand spot currency price on 7 June 2011. Note this is from the Bloomberg Professional Service, a very expensive product.

Outlier data.

Handling outlier data is an entire discipline in itself. The usual methods are similar to those for missing data but also include winsorizing.

Let's take a look at a simple example of winsorization in pandas.

Before winsorizing, the bad print distorts the entire time series.

Before winsorizing.

After winsorizing the distortion is largely gone.

After winsorizing.

Here is another example, using more sophisticated data-cleaning methodologies, described on the excellent R-Bloggers site. Note this is not financial time series data, but it is a time series nonetheless.

Cleaning time-series and other data streams.

I've used winsorizing in most applications of outlier identification and cleaning I've come across. Because I won't be working with high frequency data, I'll start here.

Inconsistent or unlikely values

Is the reported low price of the day really the lowest price of the day? How about the high price? Is the bid always lower than the ask? These are questions that require domain knowledge. With options data there are additional elements to consider. Is the expiration date in the future? Do open interest and volume roughly reflect the time to maturity? I'll watch for these throughout the data pipeline.

These values may be easy to test for but difficult to clean on an automated basis. For example, if the ask is less than the bid, which price is correct? Hopefully these data issues are exceedingly rare; if so, it will not be expensive to identify them and research and clean them manually. Within the acquisition and cleaning pipeline, I'll run the data through a series of filters to check for some of these cases. If an issue is identified, it will be logged for manual research. I'll check for the following against the common fields I'll likely have in the data set:

  • The fields are the correct type (e.g. dates, strings, floats)
  • The expiration date is greater than today's date
  • The low is lower than the high and less than or equal to the close
  • The high is higher than the low and greater than or equal to the close
  • The close is not outside the low or high
  • The high, low, close, strike price, open interest and volume are positive
  • The strike price meets strike increment specifications

These rules will be fairly easy to code. Alerting on violations will be as simple as printing to a log file.
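A sketch of what such a filter might look like. The field names (expiration, low, high, close, strike, open_interest, volume) are assumptions about the eventual schema, and the rules cover only a subset of the list above:

```python
import datetime as dt
import logging

import pandas as pd

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("data_checks")

def check_option_row(row: pd.Series) -> list:
    """Return a list of rule violations for one end-of-day option record."""
    problems = []
    if row["expiration"] <= dt.date.today():
        problems.append("expiration not in the future")
    if not (row["low"] <= row["close"] <= row["high"]):
        problems.append("close outside low/high range")
    if row["low"] > row["high"]:
        problems.append("low exceeds high")
    for field in ("high", "low", "close", "strike", "open_interest", "volume"):
        if row[field] < 0:
            problems.append(field + " is negative")
    return problems

# Hypothetical record with one deliberate error: close above high.
row = pd.Series({
    "expiration": dt.date(2030, 1, 1),
    "low": 1.0, "high": 2.0, "close": 2.5,
    "strike": 100.0, "open_interest": 10, "volume": 5,
})
for p in check_option_row(row):
    log.warning("%s", p)  # each violation goes to the log for manual research
```

Type checks and strike-increment rules would slot into the same function once the data source and contract specifications are pinned down.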

Matches strategy

It's important to choose data that matches the resolution of the backtesting system and trading strategies. I am not building an ultra-low-latency, high-frequency statistical arbitrage system, so I don't need millisecond-resolution data from Nanex. I need to pick the appropriate level of granularity for the strategies I'm interested in. I will be backtesting strategies that can live for days and even months, so end-of-day data will suffice.

Market regimes change. It is foolish to think one strategy will outperform forever. It's important then to test a strategy over bull markets, bear markets, trending markets, and mean reverting markets. This is especially true if your strategies attempt to exploit a certain market regime. Volatility is usually the most important consideration when trading options. Therefore I'll need to capture markets of high and low volatility. If the goal of a strategy is to sell volatility when it is relatively high, then I will need to test the strategy when volatility is relatively high.
