Introduction

  • In exploratory data analysis, a lot of ephemeral, use-once code is produced
  • But it isn't always fine for that code to be slow as molasses. As the data grows in size, so does the pain of using less powerful tools.

In this series, we'll take a moderately large data file of executed trades (~900 million rows, around 60GB uncompressed) and explore techniques for using Rust to conduct high-performance analysis on its contents.
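
As a point of reference, 60GB spread over ~900 million rows works out to roughly 65-70 bytes of text per row. Below is a minimal sketch of what streaming such a file row by row in Rust might look like, using the `csv` and `serde` crates. The `Trade` struct, its field names and types, and the `trades.csv` path are all assumptions made for illustration; the file's actual schema isn't described in this introduction.

```rust
// Minimal sketch: stream a large CSV of executed trades and count rows.
// Requires the `csv` crate and `serde` with the "derive" feature.
// The Trade fields below are hypothetical; the real file's layout may differ.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Trade {
    time: u64,      // epoch timestamp (assumed)
    exch: String,   // exchange identifier (assumed)
    ticker: String, // instrument symbol (assumed)
    price: f64,
    amount: f64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "trades.csv" is a placeholder path for the ~60GB file.
    let mut rdr = csv::ReaderBuilder::new().from_path("trades.csv")?;
    let mut n_rows: u64 = 0;
    for row in rdr.deserialize::<Trade>() {
        let _trade: Trade = row?; // parse each row; propagate any error
        n_rows += 1;
    }
    println!("read {} rows", n_rows);
    Ok(())
}
```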

My motivation for the series is the positive experiences I've had with Rust as a tool for analytical heavy lifting.

The use case I have in mind is taking some big data and getting it to the place where it's small enough to use the more typical tools in the data science toolkit -- Julia, Python, R, Jupyter, etc.
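
As a rough sketch of that hand-off, the Rust side might boil the raw trades down to a small per-symbol summary file that pandas, R, or Julia can load comfortably. The column positions, field names, and file names here are assumptions for illustration, not the actual query this series will build.

```rust
// Hypothetical "shrink" step: aggregate a huge trades CSV into a tiny summary.
// Requires the `csv` crate. Column positions (2 = ticker, 4 = amount) are
// assumptions for illustration only.
use std::collections::HashMap;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = csv::ReaderBuilder::new().from_path("trades.csv")?;
    let mut volume_by_ticker: HashMap<String, f64> = HashMap::new();

    // Stream the big file once, keeping only a small aggregate in memory.
    for record in rdr.records() {
        let record = record?;
        let ticker = record.get(2).unwrap_or("").to_string();
        let amount: f64 = record.get(4).unwrap_or("0").parse().unwrap_or(0.0);
        *volume_by_ticker.entry(ticker).or_insert(0.0) += amount;
    }

    // Write a summary small enough to open with the usual data science tools.
    let mut wtr = csv::Writer::from_path("summary.csv")?;
    wtr.write_record(&["ticker", "total_volume"])?;
    for (ticker, volume) in &volume_by_ticker {
        wtr.write_record(&[ticker.clone(), volume.to_string()])?;
    }
    wtr.flush()?;
    Ok(())
}
```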

In "Learn Rust the Dangerous Way", Cliff L. Biffle explained that he was aiming for a task that was not so big as to be insurmountable ("fairly short", "self-contained"), yet not too small as to be trivial ("not a complete toy"). This is the same motivation I had in choosing the scope of this series.

First, executed trades data is kind of a classic data set, in my view (I may be biased).

Second, the data is big enough to make bad decisions obvious, but not so big that a mistake means the program will still be running a week from now.

Third, the query or "task" I've chosen involves a healthy balance of CPU and IO work, which makes it a good vehicle for exploring a variety of techniques. In other words, it won't be dominated by one or the other.