The task involves two parts, easy and hard. Both involve computing the ratio of the average btc/usd price at two exchanges, Bitmex (bmex) and Coinbase Pro (gdax).
Bitmex is the leading cryptocurrency derivatives exchange, while GDAX (the name Coinbase Pro formerly went by) is the leading "spot" exchange (in my mind, at least).
The "average price" should be calculated as the weighted average of the trades for each exchange, with the weights being the amounts of each trade.
Both queries involve time-based group by aggregations.
For time slots where there is missing data (i.e. no rows) for either or both exchanges,
a NULL or other "missing" indicator should be returned instead. The final output should not have gaps in the time column.
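To make the gap-filling requirement concrete, here is a minimal sketch (not from the reference implementations; toy hourly values are hypothetical) showing how reindexing against a complete hourly range leaves NaN where an exchange had no trades, rather than dropping the time slot:

```python
import pandas as pd

# hypothetical hourly size-weighted prices; note 05:00 is missing entirely
idx = pd.to_datetime(['2018-10-03 04:00', '2018-10-03 06:00'], utc=True)
bmex = pd.Series([6500.0, 6520.0], index=idx)
gdax = pd.Series([6510.0, 6530.0], index=idx)

# a complete, gapless hourly time column spanning the data
full = pd.date_range('2018-10-03 04:00', '2018-10-03 06:00', freq='1h', tz='UTC')

# reindexing fills the empty 05:00 slot with NaN instead of omitting the row
ratio = (bmex / gdax).reindex(full)
```

The same idea applies to the 10-second grid in the hard query: build the full grid first, then align the aggregates to it.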
The Easy Part
The easy query is the ratio of the hourly size-weighted average btc/usd price between bmex and gdax.
Here is the calculation for a single hour in Python/Pandas code:
```python
df: pd.DataFrame = ...  # <- all trades across all markets/dates, df has DatetimeIndex

start = "2018-10-03T04:00:00Z"  # <- randomly selected start/end
end = "2018-10-03T05:00:00Z"    # important point is, 1h span

target_hr = (df.index >= start) & (df.index < end)
of_btcusd = df['ticker'] == 'btc_usd'
of_bmex = df['exch'] == 'bmex'
of_gdax = df['exch'] == 'gdax'

bmex_hr = df.loc[target_hr & of_bmex & of_btcusd, ['price', 'amount']]
gdax_hr = df.loc[target_hr & of_gdax & of_btcusd, ['price', 'amount']]

bmex_size_wt_price = (
    (bmex_hr['price'] * bmex_hr['amount']).sum()
    / bmex_hr['amount'].sum()
)
gdax_size_wt_price = (
    (gdax_hr['price'] * gdax_hr['amount']).sum()
    / gdax_hr['amount'].sum()
)

ratio = bmex_size_wt_price / gdax_size_wt_price  # <- final answer (for start,end hr at least)
```
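Scaling that single hour to every hour at once is a straightforward groupby. A minimal sketch below (not the reference implementation; the five-trade DataFrame is hypothetical, but it uses the same column names as the real data): sum `price * amount` and `amount` per (hour, exchange), then divide.

```python
import pandas as pd

# hypothetical toy trades; the real df has the same columns at ~92M rows
df = pd.DataFrame({
    'exch':   ['bmex', 'bmex', 'gdax', 'bmex', 'gdax'],
    'ticker': ['btc_usd'] * 5,
    'price':  [6500.0, 6600.0, 6510.0, 6520.0, 6530.0],
    'amount': [1.0, 3.0, 2.0, 3.0, 4.0],
}, index=pd.to_datetime([
    '2018-10-03 04:10', '2018-10-03 04:40', '2018-10-03 04:20',
    '2018-10-03 05:10', '2018-10-03 05:20',
], utc=True))

btc = df[df['ticker'] == 'btc_usd'].copy()
btc['notional'] = btc['price'] * btc['amount']

# sum notional and amount per (hour, exchange), then divide for the weighted average
hourly = btc.groupby([pd.Grouper(freq='1h'), 'exch'])[['notional', 'amount']].sum()
wavg = hourly['notional'] / hourly['amount']
ratio = wavg.xs('bmex', level='exch') / wavg.xs('gdax', level='exch')
```

The division of the two `xs` slices aligns on the hour index, so hours where either exchange is absent come out as NaN for free.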
If Python isn't your cup of tea, don't worry, we've got a Rust reference implementation ahead, as well as SQL (in several flavors).
What I like about this query
- there's significant filtering involved - it'd be possible to avoid a lot of work with intelligent indexing or data layout
- there's real CPU work to do to compute the weighted average. This is not just 'how fast can you read from a disk'
- it's big enough data that I expect IO and memory bandwidth will be significant considerations
- the output is only one number per hour, which makes it simpler to validate results against one another
- it's realistic: this is something you might actually find yourself computing. I'm certainly interested to see how it pans out over the whole data set
What it doesn't show:
- it doesn't require sequential processing. Many types of data (order book updates, for example) are a series of state updates that make no sense when processed out of order. In contrast, each hour in our query is an entirely independent computation, and thus can be parallelized. It doesn't even matter what order the rows are processed within each hour grouping. For sequential data, we'd have to be a lot more clever about query parallelism
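To illustrate that independence, here's a minimal sketch (toy data, not the reference implementation) that farms each hour's computation out to a pool; because the hours share no state, scheduling order is irrelevant:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def size_weighted_ratio(hour: pd.DataFrame) -> float:
    """bmex/gdax size-weighted average price ratio for one hour of trades."""
    wavg = {}
    for exch in ('bmex', 'gdax'):
        ex = hour[hour['exch'] == exch]
        wavg[exch] = (ex['price'] * ex['amount']).sum() / ex['amount'].sum()
    return wavg['bmex'] / wavg['gdax']

# hypothetical toy trades spanning two hours
df = pd.DataFrame({
    'exch':   ['bmex', 'gdax', 'bmex', 'gdax'],
    'price':  [6500.0, 6510.0, 6520.0, 6530.0],
    'amount': [1.0, 2.0, 3.0, 4.0],
}, index=pd.to_datetime([
    '2018-10-03 04:10', '2018-10-03 04:20',
    '2018-10-03 05:10', '2018-10-03 05:20',
], utc=True))

# each hour is an independent unit of work
hours = [grp for _, grp in df.groupby(pd.Grouper(freq='1h'))]
with ThreadPoolExecutor() as pool:
    ratios = list(pool.map(size_weighted_ratio, hours))
```

(A thread pool keeps the sketch simple; for real CPU-bound parallelism in Python you'd want a process pool, and a compiled implementation gets true threads.)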
The Hard Part
The hard query is: every 10 seconds between the minimum and maximum time in the trades data set, calculate the ratios of the size-weighted average btc/usd price between bmex and gdax over the previous 5-minute, 15-minute, and 60-minute lookback horizons (relative to that row's time).
This query, like the event study problem in the preface, requires sliding window aggregations with bounds relative to each row.
Here is some Python/Pandas code to help explain through examples:
```python
# imagine `df` is a DataFrame of our trades csv file, if we had a computer with
# sufficient RAM to actually do that
df.head()
#                   time  amount  exch      price  server_time  side   ticker
# 0  1531094401700852527  0.0801  bits  6706.6001            0    na  btc_usd
# 1  1531094401780298519  0.0284  bits  6706.6099            0    na  btc_usd
# 2  1531094402305708472  0.0050  btfx  6707.0000            0    na  btc_usd
# 3  1531094403455657797  0.0050  btfx  6706.7002            0    na  btc_usd
# 4  1531094403592663872  0.0658  btfx  6705.8999            0    na  btc_usd

# add a DatetimeIndex
df.index = pd.to_datetime(df['time'], utc=True)
df.head()
#                                                     time  amount  exch      price  server_time  side   ticker
# 2018-07-09 00:00:01.700852527+00:00  1531094401700852527  0.0801  bits  6706.6001            0    na  btc_usd
# 2018-07-09 00:00:01.780298519+00:00  1531094401780298519  0.0284  bits  6706.6099            0    na  btc_usd
# 2018-07-09 00:00:02.305708472+00:00  1531094402305708472  0.0050  btfx  6707.0000            0    na  btc_usd
# 2018-07-09 00:00:03.455657797+00:00  1531094403455657797  0.0050  btfx  6706.7002            0    na  btc_usd
# 2018-07-09 00:00:03.592663872+00:00  1531094403592663872  0.0658  btfx  6705.8999            0    na  btc_usd

# "every 10 seconds between the minimum and maximum time in the trades data set"
points = pd.date_range(df.index.min().floor('1s'),
                       df.index.max().floor('1s'),
                       freq='10sec')
# DatetimeIndex(['2018-07-09 00:00:01+00:00', '2018-07-09 00:00:11+00:00',
#                '2018-07-09 00:00:21+00:00', '2018-07-09 00:00:31+00:00',
#                '2018-07-09 00:00:41+00:00', '2018-07-09 00:00:51+00:00',
#                '2018-07-09 00:01:01+00:00', '2018-07-09 00:01:11+00:00',
#                '2018-07-09 00:01:21+00:00', '2018-07-09 00:01:31+00:00',
#                ...
#                '2020-03-25 04:59:41+00:00', '2020-03-25 04:59:51+00:00',
#                '2020-03-25 05:00:01+00:00', '2020-03-25 05:00:11+00:00',
#                '2020-03-25 05:00:21+00:00', '2020-03-25 05:00:31+00:00',
#                '2020-03-25 05:00:41+00:00', '2020-03-25 05:00:51+00:00',
#                '2020-03-25 05:01:01+00:00', '2020-03-25 05:01:11+00:00'],
#               dtype='datetime64[ns, UTC]', length=5401808, freq='10S')

example_time = points[42 * 42 * 42]
# -> Timestamp('2018-07-21 20:42:01+0000', tz='UTC', freq='15S')

# calculating the size-weighted average price ratio for the 5min lookback horizon,
# relative to example_time row
last_5min = (df.index > example_time - pd.Timedelta('5min')) & (df.index <= example_time)
of_btc_usd = df['ticker'] == 'btc_usd'
of_gdax = df['exch'] == 'gdax'
of_bmex = df['exch'] == 'bmex'
g5 = last_5min & of_btc_usd & of_gdax
b5 = last_5min & of_btc_usd & of_bmex
ratio_5min = (
    (df.loc[b5, 'price'] * df.loc[b5, 'amount']).sum()
    / df.loc[b5, 'amount'].sum()
) / (
    (df.loc[g5, 'price'] * df.loc[g5, 'amount']).sum()
    / df.loc[g5, 'amount'].sum()
)

# cool. now do 15min, 60min too, thanks.
```
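Note that the pandas approach rescans the whole frame for every output point. One way out, sketched below with toy numpy arrays (this is an illustration of the prefix-sum idea, not the reference implementation): with trades sorted by time and pre-filtered to one exchange's btc_usd rows, cumulative sums of `price * amount` and `amount` turn any (start, end] window into two binary searches and a subtraction.

```python
import numpy as np

# toy stand-in for one exchange's btc_usd trades, sorted by nanosecond timestamp
times  = np.array([10, 20, 30, 40], dtype=np.int64)
price  = np.array([6500.0, 6510.0, 6520.0, 6530.0])
amount = np.array([1.0, 2.0, 3.0, 4.0])

# prefix sums with a leading zero, so any window is a single subtraction
cum_notional = np.concatenate(([0.0], np.cumsum(price * amount)))
cum_volume   = np.concatenate(([0.0], np.cumsum(amount)))

def wavg(t0: int, t1: int) -> float:
    """Size-weighted average price over trades with t0 < time <= t1."""
    lo = np.searchsorted(times, t0, side='right')
    hi = np.searchsorted(times, t1, side='right')
    vol = cum_volume[hi] - cum_volume[lo]
    return (cum_notional[hi] - cum_notional[lo]) / vol if vol > 0 else float('nan')
```

Each of the 5.4 million output points then costs O(log n) per horizon per exchange instead of a full scan, and an empty window naturally yields NaN.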
As you can see, the 10-second interval results in 5.4 million points at which to compute the aggregation.
Extra credit: do it for every 1 second instead of every 10 seconds (sequence length: 54,018,080).
The thing about this query is, it drives databases CRAZY.
PostgreSQL, for instance, went from astonishingly good performance to complete meltdown when trying to graduate from the easy query to this one.
Estimated Execution Time in Python: 30 years
In Python, computing the example code above for all three lookback horizons of 5, 15, and 60 minutes on a sample of 25 rows took 487.8 seconds. At that rate, computing the hard query in Python would take over three years:
```
487.8 / 25 ≈ 19.5 seconds per row
19.5 * 5,401,808 output points ≈ 105,335,256 seconds
105,335,256 / 60 / 60 / 24 / 365 ≈ 3.34 years (!)
```
Further, the trades data came from a sample CSV with "only" 92 million rows, roughly 10% of the size of the actual data set. That makes the Python estimate optimistic by approximately a factor of ten, suggesting the Python implementation could actually take over 30 years to finish.
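The extrapolation itself is short enough to sanity-check directly (all numbers are taken from the text above):

```python
seconds_per_row = 487.8 / 25                   # measured: 25-row sample took 487.8s
total_seconds = seconds_per_row * 5_401_808    # one computation per 10-second point
years = total_seconds / (60 * 60 * 24 * 365)   # ~3.34 years on the 10% sample
full_years = years * 10                        # sample is ~10% of the real data
```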