Now that we have a high-level overview of backtesting, I'll discuss backtesting data considerations. We'll likely encounter data quality issues along the way and will need to identify and clean them before the data enters the system. Below I discuss some of these issues and how I'll solve them. In later posts I'll discuss data sources and data storage technologies. Finally, I'll explore a potential data pipeline that automates acquisition, cleaning, and storage.
Backtesting Data Considerations
Data is the most critical aspect of a backtesting model. I spend a lot of time acquiring and manipulating data to feed into all sorts of analyses, and I can attest that bad input data builds poor models and leads to bad decision making. Part of the challenge is to automate the data acquisition, checking, and cleaning process. Fortunately, in the age of big data there are many tools and techniques to help.
Regardless of the source of your data, various quality checks need to be performed. If you purchase data from a vendor, you'll likely start with a clean set, but it is still important to check. In my experience, financial time series data issues fall into three categories.
Missing data
Market data will be missing for weekends, holidays, and other non-trading days. These data should clearly be missing but still have to be handled within the system. Luckily for us, code exists for this. Additionally, options expire, so we have to keep the expiration date in mind throughout our cleaning methods and strategy development.
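Since expected gaps (weekends, holidays) and unexplained gaps look identical in a raw series, one simple approach is to compare the data against a business-day calendar. A minimal sketch with pandas follows; the dates and prices are made up, and note that bdate_range only knows about weekends, so a real pipeline would also need an exchange holiday calendar:

```python
import pandas as pd

# a made-up series of daily closes; 2015-05-06 (a Wednesday) is missing
prices = pd.Series(
    [100.0, 101.5, 99.8, 100.2],
    index=pd.to_datetime(['2015-05-01', '2015-05-04',
                          '2015-05-05', '2015-05-07']),
)

# the business days we expect to see in the period (weekends excluded)
expected = pd.bdate_range(start='2015-05-01', end='2015-05-07')

# reindexing exposes days that should have data but do not
aligned = prices.reindex(expected)
unexplained = aligned[aligned.isna()].index
```

Any date left in `unexplained` after accounting for holidays and expirations is a genuine data problem worth logging.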
Missing data without explanation is trickier. If a contract suddenly disappears on a Tuesday, our results can be incorrect. Fortunately, there are ways to handle these situations. The excellent Python package pandas has a superb interface for cleaning and filling missing data, and I'll be using these tools extensively throughout the system.
# import pandas and numpy
import pandas as pd
import numpy as np

# build a dataframe and add some missing values
# (.ix is long deprecated; .loc is the modern equivalent)
df = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df.loc[4:6, 'A'] = np.nan
df.loc[2:5, 'B'] = np.nan
df.loc[6:7, 'C'] = np.nan
df.loc[4:6, 'C'] = np.nan

# fill in missing data with a scalar
# (pass inplace=True to fill the original dataframe instead of returning a copy)
scalar_filled = df.fillna(0)

# use the value before the missing value to fill the missing value,
# also known as filling forward or padding
padded = df.ffill()

# use the mean of the column with the missing data
mean_filled = df.fillna(df.mean())

# interpolate to fill the missing values
interpolated = df.interpolate()
Small plug: The code above shows only a small amount of what pandas can do. For example, instead of taking the mean(), you can apply any other function to a pandas DataFrame (even ones you write yourself). Instead of linear interpolation (the default), you can use quadratic, cubic, or spline interpolation (among many others). I highly encourage you to take a look at pandas.
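To illustrate that flexibility, here is a quick sketch that fills with the column median instead of the mean, and then with a function written from scratch (the filling rule is invented purely to show the mechanics):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan, 5.0]})

# swap mean() for any other aggregation, e.g. the median
med_filled = df.fillna(df.median())

# or fill with logic you write yourself: here, the previous
# valid value plus one (a made-up rule, just for illustration)
def previous_plus_one(s):
    return s.fillna(s.ffill() + 1.0)

custom_filled = df.apply(previous_plus_one)
```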
One might consider time series or regression models to predict the missing values. Of course, fitting a model introduces the issues generally found in time series modeling: parameter estimation, correct model specification, choice of training period, and others. Unless absolutely necessary, I will avoid fitting time series models simply to fill missing data. Given that I'm building a system with daily data, it seems like overkill (correct me if I'm wrong).
Impossible or outlier values
Financial time series data are notorious for bad "prints": incorrect prices that make it into the published data feed. Consider the following, which demonstrates a bad print in the South African rand spot currency price on 7 June 2011. Note this is from the Bloomberg Professional Service, a very expensive product.
Handling outlier data is an entire discipline in itself. The usual methods are similar to those for missing data, but also include winsorizing.
Let's take a look at a simple example of winsorization in pandas.
# import datetime and the pandas-datareader package
# (pandas.io.data has since moved into the separate pandas_datareader package)
import datetime
import pandas_datareader.data as web

# set a date range
start = datetime.datetime(2015, 5, 1)
end = datetime.datetime(2015, 5, 14)

# get IBM's equity data from yahoo
ibm = web.DataReader('IBM', 'yahoo', start, end)

# set a dummy value on the last day to represent our outlier
ibm.iloc[-1, ibm.columns.get_loc('Close')] = 1172.6799

# get the standard deviation of the data before the bad print
st_dev = ibm['Close'][:5].std()

# get the mean of the data before the bad print
mu = ibm['Close'][:5].mean()

# set a cap two standard deviations above the mean
winz = mu + (2 * st_dev)

# clamp any data that exceeds the cap to that point
ibm.loc[ibm['Close'] > winz, 'Close'] = winz
Before winsorizing, the bad print distorts the entire time series.
After winsorizing the distortion is largely gone.
Here is another example, with more sophisticated data cleaning methodologies, described on the excellent R-Bloggers site. Note this is not financial time series data, but it is a time series nonetheless.
I've used winsorizing in most applications of outlier identification and cleaning I've come across. Because I won't be working with high frequency data, I'll start here.
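For completeness, winsorizing is also commonly done with percentile cutoffs rather than standard deviations. A small sketch in pandas using clip and quantile (the return series is made up):

```python
import pandas as pd

# a made-up daily return series with one obvious bad print (2.5)
returns = pd.Series([0.01, -0.02, 0.015, -0.01, 0.005, 2.5, 0.012, -0.008])

# clip everything outside the 5th and 95th percentiles to those cutoffs
lower, upper = returns.quantile(0.05), returns.quantile(0.95)
winsorized = returns.clip(lower=lower, upper=upper)
```

The percentile version has the advantage of not letting the outlier itself inflate the cutoff, which the standard deviation approach above avoids only by computing its statistics on the data before the bad print.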
Inconsistent or unlikely values
Is the reported low price of the day really the lowest price of the day? How about the high price? Is the bid always lower than the ask? These are questions that require domain knowledge. With options data there are additional elements to consider. Is the expiration date in the future? Do open interest and volume roughly reflect the time to maturity? I'll watch for these throughout the data pipeline.
These values may be easy to test for but difficult to clean on an automated basis. For example, if the ask is less than the bid, which price is correct? Hopefully these data issues are exceedingly rare. If they are, then it will not be expensive to identify and manually research and clean them. Within the acquisition and cleaning pipeline, I'll run the data through a series of filters to check for some of these cases. If an issue is identified, it will be logged for manual research. I'll check for the following against the common fields I'll likely have in the data set:
- The fields are the correct type (e.g. dates, strings, floats)
- The expiration date is greater than today's date
- The low is lower than the high and less than or equal to the close
- The high is higher than the low and greater than or equal to the close
- The close is not outside the low or high
- The high, low, close, strike price, open interest and volume are positive
- The strike price meets strike increment specifications
These rules will be fairly easy to code. Alerting on violations will be as simple as printing to a log file.
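As a sketch of what those filters might look like, here is a minimal version in Python; the record fields and the single sample record are assumptions about the eventual schema, not a real feed's:

```python
import datetime
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger('data_checks')

def check_record(rec, today):
    """Run the sanity filters above; return a list of violation messages."""
    violations = []
    if rec['expiration'] <= today:
        violations.append('expiration is not in the future')
    if rec['low'] > rec['high']:
        violations.append('low exceeds high')
    if not (rec['low'] <= rec['close'] <= rec['high']):
        violations.append('close is outside the low/high range')
    for field in ('high', 'low', 'close', 'strike', 'open_interest', 'volume'):
        if rec[field] <= 0:
            violations.append(field + ' is not positive')
    return violations

# a hypothetical option record with a deliberately bad close
record = {
    'expiration': datetime.date(2015, 6, 19),
    'high': 105.0, 'low': 99.5, 'close': 110.0,
    'strike': 100.0, 'open_interest': 1500, 'volume': 320,
}

for msg in check_record(record, today=datetime.date(2015, 5, 14)):
    log.warning(msg)  # in the real pipeline this would go to a log file
```

Each violation is logged rather than auto-corrected, matching the manual-research approach described above.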
It's important to choose the data that matches the resolution of the backtesting system and trading strategies. I am not building an ultra-low latency, high frequency statistical arbitrage system. Therefore I don't need millisecond resolution data from Nanex. I need to pick the appropriate level of granularity for the strategies I'm interested in. I will be backtesting strategies that can live for days and even months so end of day data will suffice.
Market regimes change. It is foolish to think one strategy will outperform forever. It's important then to test a strategy over bull markets, bear markets, trending markets, and mean reverting markets. This is especially true if your strategies attempt to exploit a certain market regime. Volatility is usually the most important consideration when trading options. Therefore I'll need to capture markets of high and low volatility. If the goal of a strategy is to sell volatility when it is relatively high, then I will need to test the strategy when volatility is relatively high.