The Data: 900 Million Crypto Trades

The data set contains just over 900 million crypto trades executed between July 9, 2018 and March 25, 2020, stored in a single 62 GiB CSV file.

CSV layout:

head /xfs/trades/trades.csv | xsv table

# time                 amount                 exch  price             server_time  side  ticker
# 1531094401700852527  0.08008278161287308    bits  6706.60009765625  0            na    btc_usd
# 1531094401780298519  0.02837366983294487    bits  6706.60986328125  0            na    btc_usd
# 1531094402305708472  0.004999999888241291   btfx  6707.0            0            na    btc_usd
# 1531094403455657797  0.004999999888241291   btfx  6706.7001953125   0            na    btc_usd
# 1531094403592663872  0.06581910699605943    btfx  6705.89990234375  0            na    btc_usd
# 1531094403735847326  0.10860306024551393    btfx  6705.89990234375  0            na    btc_usd
# 1531094404081074798  0.0074614998884499064  bmex  6701.0            0            na    btc_usd
# 1531094404316100958  0.0071630398742854595  bmex  6701.0            0            na    btc_usd
# 1531094404331681370  0.012983010150492193   bmex  6701.0            0            na    btc_usd

Counting the rows with xsv takes about three minutes:

time xsv count /xfs/trades.csv 
# 908204336
# real	2m55.124s
# user	2m27.866s
# sys	0m27.052s

Column specifications:

time         Unix timestamp with nanosecond precision
amount       size of the trade, in base currency
exch         exchange where the trade executed. one of "gdax", "bmex", "btfx", "bnce", "okex", "bits", "plnx", or "krkn"
price        price the trade executed at, in quote currency
server_time  timestamp of the trade according to the exchange, if available. 0 is equivalent to missing or NA
side         "taker" side of the trade. one of "na" (missing), "bid", or "ask"
ticker       currency symbols in base_quote format. one of "btc_usd", "eth_usd", "ltc_usd", "etc_usd", "bch_usd", "xmr_usd"
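
To make the column specifications concrete, here is a minimal sketch of parsing rows into typed values using only the Python standard library. The parse_trade helper and the output field names (time_ns, time_utc, nanos) are illustrative and not part of the original tooling. Mapping "na" and 0 to None is just one possible convention. The two rows are copied from the sample above and re-joined with commas.

```python
import csv
import io
from datetime import datetime, timezone

# Two rows from the sample output above, written back out as CSV
SAMPLE = """time,amount,exch,price,server_time,side,ticker
1531094401700852527,0.08008278161287308,bits,6706.60009765625,0,na,btc_usd
1531094402305708472,0.004999999888241291,btfx,6707.0,0,na,btc_usd
"""

def parse_trade(row):
    """Convert one CSV row (a dict of strings) into typed values (hypothetical helper)."""
    time_ns = int(row["time"])
    # Split with integer arithmetic so the nanoseconds never pass through a float
    secs, nanos = divmod(time_ns, 1_000_000_000)
    server_time = int(row["server_time"])
    return {
        "time_ns": time_ns,
        "time_utc": datetime.fromtimestamp(secs, tz=timezone.utc),
        "nanos": nanos,
        "amount": float(row["amount"]),
        "exch": row["exch"],
        "price": float(row["price"]),
        # 0 means "missing" in the file; None is an assumed in-memory convention
        "server_time": server_time if server_time != 0 else None,
        "side": None if row["side"] == "na" else row["side"],
        "ticker": row["ticker"],
    }

trades = [parse_trade(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for t in trades:
    print(t["time_utc"].isoformat(), t["nanos"], t["exch"], t["price"], t["amount"])
```

Keeping time_ns as a plain integer alongside the datetime preserves the full nanosecond value, since datetime only stores microseconds.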

Notes and Further Reading

  • server_time is stored as 0 when missing so that pd.read_csv does not cast the column to float64. As of Pandas v0.25, a column with the default int64 dtype could not represent a missing value, so blank entries would cause the column to be read as float, with NaN standing in for the missing values. However, I had read warnings that converting back and forth between int64 and float64 can lose precision in the underlying data due to rounding. So I stored missing as 0.
  • The "gdax" exchange symbol in the file refers to the old name for what is now called "Coinbase Pro". Personally, I think "Coinbase Pro" sounds kind of douchey so I refuse to change any of my code that refers to the previous name.
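
The precision concern in the server_time note can be illustrated with a short sketch. It uses Python's built-in float (an IEEE 754 double, the same representation as float64) rather than pandas itself, and borrows a nanosecond value from the time column of the sample rows as a stand-in for a timestamp of similar magnitude. Whether server_time values are actually that large is an assumption.

```python
# A nanosecond timestamp taken from the time column of the sample rows
t = 1531094401700852527

# Python's float is an IEEE 754 double, i.e. the same format as float64
as_float = float(t)
round_trip = int(as_float)

# float64 has a 53-bit significand, so not every integer above 2**53 is representable
print("beyond exact integer range:", t > 2**53)
print("round trip preserved value:", round_trip == t)
print("rounding error in nanoseconds:", abs(round_trip - t))
```

At this magnitude adjacent float64 values are 256 apart, so an int64 to float64 to int64 round trip can shift a timestamp by up to 128 nanoseconds.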