Why 'Data Pipelines'? A Munging Manifesto

  • Cleaning data, moving it around, and getting it into the right shape is a huge amount of work!
  • To me, it seems like a task we spend more and more time on while still treating it as secondary

People who use computers to analyze data for a living generally say they hate cleaning, organizing, munging and otherwise preparing the data for analysis.

A recent headline put it this way: "Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says". The poll results showed 76% of respondents selected data preparation as the least enjoyable part of their work.

Now, I'm not saying I love preparing data in and of itself, as an intrinsically enjoyable task. But I do love the excitement of knowing that I'm close to having a brand new set of information available to learn from.

I guess I just look at it this way: I'd rather get it done, fast and well.

My "least enjoyable part" of the work is waiting for slow programs to finish running. Everyone understands how a more productive tool can decrease "time to market," but usually that point is made to justify "fast to write, slow to execute" code. I've experienced another kind of productivity boost: the development time with Rust isn't actually longer, the execution time is astronomically faster, and suddenly there are categories of analysis available that simply weren't possible before. And that's something I can get excited about.