The data set is just over 900 million crypto trades between July 9, 2018 and March 25, 2020, housed in a 62 GiB CSV file.
CSV layout:
head /xfs/trades/trades.csv | xsv table
# time                 amount                 exch  price             server_time  side  ticker
# 1531094401700852527  0.08008278161287308    bits  6706.60009765625  0            na    btc_usd
# 1531094401780298519  0.02837366983294487    bits  6706.60986328125  0            na    btc_usd
# 1531094402305708472  0.004999999888241291   btfx  6707.0            0            na    btc_usd
# 1531094403455657797  0.004999999888241291   btfx  6706.7001953125   0            na    btc_usd
# 1531094403592663872  0.06581910699605943    btfx  6705.89990234375  0            na    btc_usd
# 1531094403735847326  0.10860306024551393    btfx  6705.89990234375  0            na    btc_usd
# 1531094404081074798  0.0074614998884499064  bmex  6701.0            0            na    btc_usd
# 1531094404316100958  0.0071630398742854595  bmex  6701.0            0            na    btc_usd
# 1531094404331681370  0.012983010150492193   bmex  6701.0            0            na    btc_usd
Counting the rows (just over 900 million) takes about 3 minutes with xsv:
time xsv count /xfs/trades.csv
# 908204336
#
# real    2m55.124s
# user    2m27.866s
# sys     0m27.052s
Column specifications:
time | unix timestamp with nanosecond precision |
amount | size of the trade, in base currency |
exch | exchange where trade executed. one of "gdax", "bmex", "btfx", "bnce", "okex", "bits", "plnx", or "krkn" |
price | price trade executed at, in quote currency |
server_time | timestamp of trade according to exchange, if available. 0 is equivalent to missing or NA |
side | "taker" side in trade. one of "na" (missing), "bid", or "ask" |
ticker | currency symbols in base_quote format. one of "btc_usd", "eth_usd", "ltc_usd", "etc_usd", "bch_usd", "xmr_usd" |
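To make the column specifications concrete, here is a minimal Python sketch that parses rows into a typed record. It assumes the file is comma-delimited with the header names shown in the sample output; the `Trade` class and `parse_trade` function are hypothetical names used only for illustration. The sentinel values (`0` for `server_time`, `"na"` for `side`) are mapped to `None`, and `time` is kept as an integer so the nanosecond value is not rounded.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional

# Two rows copied from the sample output, written out as comma-separated text
# (the comma delimiter is an assumption about the underlying file).
SAMPLE = """time,amount,exch,price,server_time,side,ticker
1531094401700852527,0.08008278161287308,bits,6706.60009765625,0,na,btc_usd
1531094404081074798,0.0074614998884499064,bmex,6701.0,0,na,btc_usd
"""


@dataclass
class Trade:  # hypothetical record type, not part of the original data set
    time: int                   # unix timestamp, nanoseconds
    amount: float               # trade size, in base currency
    exch: str                   # exchange code, e.g. "bits" or "bmex"
    price: float                # execution price, in quote currency
    server_time: Optional[int]  # exchange timestamp; None when stored as 0
    side: Optional[str]         # "bid" or "ask"; None when stored as "na"
    ticker: str                 # e.g. "btc_usd"


def parse_trade(row: dict) -> Trade:
    """Convert one csv.DictReader row into a Trade."""
    server_time = int(row["server_time"])
    return Trade(
        time=int(row["time"]),  # int, not float, to keep every nanosecond digit
        amount=float(row["amount"]),
        exch=row["exch"],
        price=float(row["price"]),
        server_time=server_time if server_time != 0 else None,
        side=row["side"] if row["side"] != "na" else None,
        ticker=row["ticker"],
    )


trades = [parse_trade(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for t in trades:
    base, quote = t.ticker.split("_")
    print(t.time, t.exch, t.amount, base, "@", t.price, quote)
```

For the full 62 GiB file, the same function could be applied while streaming from an open file handle rather than an in-memory string, since `csv.DictReader` reads lazily.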
Notes and Further Reading
- server_time was represented as 0 when missing to prevent the column from being cast to float64 when read with pd.read_csv. As of Pandas v0.25, columns with integer dtypes could not represent a missing value, so blank rows would result in the column being cast to float, allowing NaN to represent the missing values. However, I had read warnings that the resulting process of going back and forth between int64 and float64 could result in a loss of precision in the underlying data due to rounding issues. So I stored missing as 0.
- The "gdax" exchange symbol in the file refers to the old name for what is now called "Coinbase Pro". Personally, I think "Coinbase Pro" sounds kind of douchey so I refuse to change any of my code that refers to the previous name.
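The precision concern in the server_time note can be checked with plain Python, whose float is the same 64-bit IEEE 754 double as float64. This is a small sketch, not a reproduction of what pandas does internally: it takes the first timestamp from the sample rows and round-trips it through a float. A double has a 53-bit significand, so integers above 2**53 are not all exactly representable, and nanosecond timestamps from this period are far above that.

```python
# First `time` value from the sample rows (nanoseconds since the unix epoch).
ts = 1531094401700852527

# Integers up to 2**53 are exactly representable as doubles; beyond that,
# neighbouring integers start collapsing onto the same float.
print(ts > 2**53)
print(float(2**53 + 1) == float(2**53))

# Round-trip int -> float -> int, as happens when an integer column is cast
# to float64 and back.
round_trip = int(float(ts))
error_ns = ts - round_trip

print(ts)
print(round_trip)
print("error in nanoseconds:", error_ns)
```

At this magnitude adjacent doubles are 256 apart, so the round trip can shift a timestamp by up to 128 nanoseconds, which is why an integer sentinel such as 0 is the safer choice for a nanosecond column.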