When the Data Get Big(ger)

This blog describes how we manage when data get too big for Excel-compatible formats such as CSV. Practically speaking, in exploratory data analysis there is no rush to jump to fancy data formats on the client-facing side of things. Even when people hire a consultant to dig for deeper insights, they typically want to open their data, look at it, throw together a pivot table or two, and generally stay in touch with the exploratory work, which may reach them as screenshots of graphics from Python scripts. In most shops, that means keeping Excel compatibility for some if not all aspects of a project.

When data exceed roughly one million rows, Excel cannot open them: it has a hard limit of 2^20 (1,048,576) rows, whether opening XLSX or CSV files. Beyond Excel, CSVs can still be used with more than one million rows; Pandas and other tools can read and write larger files, and Power BI can work with them too. However, CSVs are much less efficient in file size, as the benchmark comparisons below show. They are also relatively slow to read and write, and they retain no type or formatting information about the data, meaning that a perfectly nice serial number like '35322000744082' shows up as '3.5322E+13' when opened in Excel.
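To see the type loss concretely, here is a minimal sketch (the example data is made up) of a round trip through CSV with Pandas. The string of digits goes in as text and comes back as an integer, because the CSV file carries no type information:

```python
import pandas as pd
from io import StringIO

# A long ID stored as text, plus an ordinary numeric column.
df = pd.DataFrame({"serial": ["35322000744082"], "qty": [3]})
print(df["serial"].dtype)  # object (i.e., Python strings)

# Round-trip through CSV: read_csv has to guess types from the text,
# so the digit string is re-inferred as int64.
buf = StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)
print(df2["serial"].dtype)  # int64 -- the "string-ness" is gone
```

Excel does the same kind of re-inference on open, which is where the scientific-notation mangling comes from.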

On a recent project, we hit Excel’s limit and redesigned our raw data storage to use the Apache Arrow Feather format.  Does ‘Feather’ allude to ‘light as a…’?  Maybe so; I haven’t researched the origin.  I studied Feather and other CSV alternatives with benchmarking scripts I wrote, based on excellent online posts such as this one.  A bibliography of additional links is below.  Apache Arrow is a successful open-source project that defines a language-independent columnar memory format. Although projects like these are always a team effort, this one was spearheaded by Wes McKinney, the creator of Pandas, so it comes with a tremendous credibility stamp on its forehead.  Recent versions of Pandas can read and write the Feather format via the read_feather() and to_feather() commands.  Feather retains (in fact, it insists on) a data type for each column.

By adopting Feather for our raw data storage, we saw tremendous improvements in file sizes and import/export speeds.  The table shows our benchmarks for importing 300 raw data files, combining them into a 2MM-row by 7-column DataFrame, and writing the DataFrame to disk on a typical MacBook Pro laptop. The block diagram shows the reusable architecture we created. In a separate blog, I will share details of our speed optimization study and a couple of hard-won lessons about working with Feather.  The teaser: setting data types comes with advantages, but it means adhering to those data types or risking a pummeling by Python exceptions!


Some helpful links that served as background reading and/or helped get the kinks worked out of our feathers:



https://github.com/wesm/feather/issues/349 (troubleshooting data type errors with Feather…)

https://github.com/apache/arrow/blob/master/README.md (link to instructions for installing pyarrow Python library needed for working with Feather)