Attempting to Open a 60G CSV File With Pandas

First off, I'm just curious whether I can even open this file in Python.

du -sh /xfs/trades.csv
# 62G /xfs/trades.csv

I'm not using the new Pandas v1.0 yet (although it looks cool):

pip freeze | rg pandas
# pandas 0.25.1

This is the Python code:

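import sys
import pandas as pd

csv_path = sys.argv[1]  # CSV path is the first command-line argument
df = pd.read_csv(csv_path)
df.info()  # prints its summary directly and returns None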

It did not end well:

time python pandas-naive.py /xfs/trades.csv
# Traceback (most recent call last):
# ..
#   File "/home/jstrong/src/envs/gnn3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1874, in _stack_arrays
#     stacked = np.empty(shape, dtype=dtype)
# numpy.core._exceptions.MemoryError: Unable to allocate array with shape (2, 892674545) and data type float64
# 
# real	13m18.111s
# user	11m7.292s
# sys	1m37.026s

Peak memory usage hit 95G of RAM.
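
For scale, the size of the single allocation that failed can be worked out from the shape in the traceback. Here is a minimal sketch of that arithmetic, assuming 8 bytes per float64 element; the helper name is made up for illustration:

```rust
/// Bytes needed for a dense 2-D array of 8-byte floats.
/// (Hypothetical helper, not part of pandas or numpy.)
fn f64_array_bytes(rows: u64, cols: u64) -> u64 {
    rows * cols * std::mem::size_of::<f64>() as u64
}

fn main() {
    // Shape (2, 892674545) taken from the MemoryError message.
    let bytes = f64_array_bytes(2, 892_674_545);
    let gib = bytes as f64 / (1u64 << 30) as f64;
    println!("{} bytes = {:.1} GiB", bytes, gib);
}
```

That comes to roughly 13.3 GiB for one block of two float64 columns, requested on top of whatever the process was already holding at that point.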

That's surprising: the CSV file is only 62G on disk, and each row should take up less space in parsed form than it does as CSV text.

For instance, the first row of the trades.csv file (after the headers row) is:

1531094401700852527,0.08008278161287308,bits,6706.60009765625,0,na,btc_usd

... which is 74 bytes of data (75 with a newline character). File size / number of rows indicates the average row size is 73.3 bytes.
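
The byte count is easy to check, since the row is plain ASCII and each character is one byte. A small sketch (the constant name is just for illustration):

```rust
// First data row of trades.csv, copied from above.
const FIRST_ROW: &str =
    "1531094401700852527,0.08008278161287308,bits,6706.60009765625,0,na,btc_usd";

fn main() {
    // `str::len` returns the length in bytes, not characters.
    let bytes = FIRST_ROW.len();
    let fields = FIRST_ROW.split(',').count();
    println!("{} fields, {} bytes ({} with newline)", fields, bytes, bytes + 1);
}
```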

This Trade struct, an idiomatic representation of the CSV data, is 48 bytes in memory:

use markets::crypto::{Exchange, Ticker, Side};

struct Trade {
    pub time: u64,
    pub price: f64,
    pub amount: f64,
    pub exch: Exchange,
    pub ticker: Ticker,
    pub server_time: Option<u64>,
    pub side: Option<Side>,
}

std::mem::size_of::<Trade>() // -> 48

To hold a Vec<Trade> in memory with 908,204,336 items would cost 40.6 GiB of memory (assuming no additional capacity in the Vec).
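
The arithmetic behind that figure, as a quick sketch. It only counts the element storage of the Vec and assumes capacity equals length; the helper name is hypothetical:

```rust
/// Bytes of element storage for a Vec of `len` items, each `item_size` bytes,
/// assuming no spare capacity.
fn vec_payload_bytes(len: u64, item_size: u64) -> u64 {
    len * item_size
}

fn main() {
    // 908,204,336 rows at 48 bytes per Trade.
    let bytes = vec_payload_bytes(908_204_336, 48);
    let gib = bytes as f64 / (1u64 << 30) as f64;
    println!("{} bytes = {:.1} GiB", bytes, gib);
}
```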

Exchange and Side are #[repr(u8)], and Ticker is two Currency fields (Currency is also #[repr(u8)]), but otherwise no memory-size optimization went into reaching 48 bytes. For example, an Option<u64> on its own is 16 bytes, but an Option<std::num::NonZeroU64>, which you might use instead for server_time, is only 8 bytes.
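
To see the effect of the Option<NonZeroU64> swap, here is a self-contained sketch. The enums and the Ticker fields below are hypothetical stand-ins for the types in markets::crypto (their real variants and field names are not shown here), and CompactTrade is an illustrative variant, not a type from that crate. Note that NonZeroU64 only works if a server_time of zero never needs to be stored.

```rust
use std::mem::size_of;
use std::num::NonZeroU64;

// Hypothetical stand-ins: one-byte enums, like the #[repr(u8)] originals.
#[allow(dead_code)]
#[repr(u8)]
enum Exchange { A, B }

#[allow(dead_code)]
#[repr(u8)]
enum Side { Bid, Ask }

#[allow(dead_code)]
#[repr(u8)]
enum Currency { Btc, Usd }

// Assumed layout: two Currency fields (field names are made up).
#[allow(dead_code)]
struct Ticker { base: Currency, quote: Currency }

// Same field list as the Trade struct above.
#[allow(dead_code)]
struct Trade {
    time: u64,
    price: f64,
    amount: f64,
    exch: Exchange,
    ticker: Ticker,
    server_time: Option<u64>,
    side: Option<Side>,
}

// Illustrative variant: server_time uses the zero niche instead of a separate tag.
#[allow(dead_code)]
struct CompactTrade {
    time: u64,
    price: f64,
    amount: f64,
    exch: Exchange,
    ticker: Ticker,
    server_time: Option<NonZeroU64>,
    side: Option<Side>,
}

fn main() {
    println!("Option<u64>:        {}", size_of::<Option<u64>>());
    println!("Option<NonZeroU64>: {}", size_of::<Option<NonZeroU64>>());
    println!("Trade:              {}", size_of::<Trade>());
    println!("CompactTrade:       {}", size_of::<CompactTrade>());
}
```

On a typical 64-bit target with the default struct layout, the stand-in Trade should come out at 48 bytes and the compact variant at 40, since the 8 bytes saved on server_time are exactly one alignment unit.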

Anyways... we'll put the Python baseline on hold for now. I might bring it back on a monster machine just to see how much RAM it takes.