The Task: Size-Weighted Average Price - Easy or Hard?

by Jonathan Strong 2020-04-03T04:06:08.711700525Z

For the hourly size-weighted price of btc/usd grouped by exchange, return the ratio of bmex/gdax

The task involves two parts, easy and hard. Both involve computing the ratio of the average btc/usd price at two exchanges, Bitmex (bmex) and Coinbase Pro (gdax).

Bitmex is the leading cryptocurrency derivitives exchange, while GDAX (the old name that Coinbase Pro once went by) is the leading "spot" exchange (in my mind, at least).

The "average price" should be calculated as the weighted average of the trades for each exchange, with the weights being the amounts of each trade.

Both queries involve time-based group by aggregations.

For time slots where there is missing data (i.e. no rows) for either or both exchanges, NaN, NULL or other "missing" indicator should be returned instead. The final output should not have gaps in the time column.

The Easy Part

The easy query is the ratio of the hourly size-weighted average btc/usd price between bmex and gdax.

Here is the calculation for a single hour in Python/Pandas code:


df: pd.DataFrame = # .. <- all trades across all markets/dates, df has DatetimeIndex

start = "2018-10-03T04:00:00Z" # <- randomly selected start/end 
end   = "2018-10-03T05:00:00Z" #    important point is, 1h span

target_hr = (df.index >= start) & (df.index < end)

of_btcusd = df['ticker'] == 'btc_usd'

of_bmex = df['exch'] == 'bmex'
of_gdax = df['exch'] == 'gdax'

bmex_hr = df.loc[target_hr & of_bmex & of_btcusd, ['price', 'amount']]
gdax_hr = df.loc[target_hr & of_gdax & of_btcusd, ['price', 'amount']]

bmex_size_wt_price = (
    (bmex_hr['price'] * bmex_hr['amount']).sum()
    / bmex_hr['amount'].sum()
)

gdax_size_wt_price = (
    (gdax_hr['price'] * gdax_hr['amount']).sum()
    / gdax_hr['amount'].sum()
)

ratio = bmex_size_wt_price / gdax_size_wt_price # <- final answer (for start,end hr at least)

If Python isn't your cup of tea, don't worry, we've got a Rust reference implementation ahead, as well as SQL (in several flavors).

What I like about this query

there's significant filtering involved - it'd be possible to avoid a lot of work with intelligent indexing or data layout
there's real CPU work to do to compute the weighted average. this is not just 'how fast can you read from a disk'
it's big enough data that I expect IO and memory bandwidth will be significant considerations
the output is only one number per hour, which makes it simpler to validate results against one another
it's realistic: this is something you might actually find yourself computing. I'm certainly interested to see how it pans out over the whole data set

What it doesn't show:

it doesn't require sequential processing. Many types of data (order book updates, for example) are a series of state updates that make no sense when processed out of order. In contrast, each hour in our query is an entirely independent computation, and thus can be parallelized. It doesn't even matter what order the rows are processed, inside of each hour grouping. For sequential data, we'd have to be a lot more clever about query parallelism

The Hard Part

The hard query is, every 10 seconds between the minimum and maximum time in the trades data set, calculate the ratios of the size-weighted average btc/usd price between bmex and gdax over the previous 5 minute, 15 minute and 60 minute lookback horizons (relative to that row's time).

This query, like the event study problem in the preface, requires sliding window aggregations with bounds relative to each row.

Here is some Python/Pandas code to help explain through examples:

# imagine `df` is a DataFrame of our trades csv file, if we had a computer with
# sufficient RAM to actually do that

df.head()

#                   time  amount  exch     price  server_time side   ticker
# 0  1531094401700852527  0.0801  bits 6706.6001            0   na  btc_usd
# 1  1531094401780298519  0.0284  bits 6706.6099            0   na  btc_usd
# 2  1531094402305708472  0.0050  btfx 6707.0000            0   na  btc_usd
# 3  1531094403455657797  0.0050  btfx 6706.7002            0   na  btc_usd
# 4  1531094403592663872  0.0658  btfx 6705.8999            0   na  btc_usd

# add a DatetimeIndex

df.index = pd.to_datetime(df['time'], utc=True)
df.head()

#                                                     time  amount  exch     price  server_time side   ticker
# 2018-07-09 00:00:01.700852527+00:00  1531094401700852527  0.0801  bits 6706.6001            0   na  btc_usd
# 2018-07-09 00:00:01.780298519+00:00  1531094401780298519  0.0284  bits 6706.6099            0   na  btc_usd
# 2018-07-09 00:00:02.305708472+00:00  1531094402305708472  0.0050  btfx 6707.0000            0   na  btc_usd
# 2018-07-09 00:00:03.455657797+00:00  1531094403455657797  0.0050  btfx 6706.7002            0   na  btc_usd
# 2018-07-09 00:00:03.592663872+00:00  1531094403592663872  0.0658  btfx 6705.8999            0   na  btc_usd

# "every 10 seconds between the minimum and maximum time in the trades data set"

points = pd.date_range(df.index.min().floor('1s'), df.index.max().floor('1s'), freq='10sec')

# DatetimeIndex(['2018-07-09 00:00:01+00:00', '2018-07-09 00:00:11+00:00',
#                '2018-07-09 00:00:21+00:00', '2018-07-09 00:00:31+00:00',
#                '2018-07-09 00:00:41+00:00', '2018-07-09 00:00:51+00:00',
#                '2018-07-09 00:01:01+00:00', '2018-07-09 00:01:11+00:00',
#                '2018-07-09 00:01:21+00:00', '2018-07-09 00:01:31+00:00',
#                ...
#                '2020-03-25 04:59:41+00:00', '2020-03-25 04:59:51+00:00',
#                '2020-03-25 05:00:01+00:00', '2020-03-25 05:00:11+00:00',
#                '2020-03-25 05:00:21+00:00', '2020-03-25 05:00:31+00:00',
#                '2020-03-25 05:00:41+00:00', '2020-03-25 05:00:51+00:00',
#                '2020-03-25 05:01:01+00:00', '2020-03-25 05:01:11+00:00'],
#               dtype='datetime64[ns, UTC]', length=5401808, freq='10S')

# calculating the size-weighted average price ratio for the 5min lookback horizon,
# relative to example_time row

example_time = 1531180801700852527 # some arbitrary reference point in time

example_time = pd.to_datetime(example_time)

# create boolean masks to select row subsets we need to aggregate 
last_5min   = (df.index > example_time - pd.Timedelta('5min')) & (df.index <= example_time)
of_btc_usd  = df['ticker'] == 'btc_usd'
of_gdax     = df['exch'] == 'gdax'
of_bmex     = df['exch'] == 'bmex'
g5          = last_5min & of_btc_usd & of_gdax
b5          = last_5min & of_btc_usd & of_bmex

# weighted average price of gdax trades between example_time and 5min prior to example_time
g5_wt_price = (df.loc[g5, 'price'] * df.loc[g5, 'amount']).sum() / df.loc[g5, 'amount'].sum()

# weighted average price of bmex trades between example_time and 5min prior to example_time
b5_wt_price = (df.loc[b5, 'price'] * df.loc[b5, 'amount']).sum() / df.loc[b5, 'amount'].sum()

# ratio between weighted average price at each exchange
ratio_5min = g5_wt_price / b5_wt_price

# cool. now do 15min, 60min too, thanks. oh - and can you do that calculation at
#  every 10s interval across # entire range of the dataset? awesome, appreciate it.

As you can see, the 10 seconds interval results in 5.4 million points to compute the aggregation at.

Extra credit: do it for every 1 second instead of every 10 seconds (sequence length: 54,018,080).

The thing about this query is, it drives databases CRAZY.

Postgresql, for instance, morphed from astonishingly good performance to complete meltdown when trying to graduate from the easy query to this one.

Estimated Execution Time in Python: 30 years

In Python, computing the example code above for all three lookback horizons of 5, 15, and 60 minutes on a sample of 25 rows took 487 seconds. At that rate, computing the hard query in Python would take over three years:

487.8 / 25 = 19.5 seconds per row
19.5 * 5,401,808 output points = 105,410,800.2 seconds
105,410,800.2 / 60 / 60 / 24 / 365 = 3.34255 years (!)

Further, the trades data was from sample CSV with "only" 92 million rows, which is only 10% the size of the actual data set. This makes the Python execution time unrealisticly optimistic, approximately by a factor of ten, suggesting the Python implementation could actually take over 30 years to finish.