When The Data Get Big(ger) Part2

It is crucial to move beyond using *.CSV files when the scope of a project goes beyond 1MM rows of data. This is an add-on to my initial post about using Pyarrow Feather in Python for efficient data storage.  The first post makes a good case for this, but it leaves out some nitty gritty details that involve burrowing around in Github and Stackoverflow to master and get things working.

Pyarrow installation and Anaconda in Windows

While I had no problem with Pyarrow installation in Mac OSX (“pip install pyarrow” from the command line), I had to scratch and claw to get it working on Windows for colleagues at a client. Here is what straightened things out in that OS:

  • While only mentioned as a strong reco in Pyarrow install documentation, The latest Pyarrow simply requires 64-bit Python. I run Windows on a Parallels virtual machine (VM) and had mistakenly installed 32-Bit Anaconda even though my VM is 64 bit. Don’t be like me! After uninstalling Anaconda and reinstalling the 64-bit version, I was able to use Anaconda’s update process to get Pyarrow 2.0. Anaconda documentation contains instructions for the uninstall. 
  • To check whether Python is 32 or 64 bit in Windows, open a command line window and enter “python”. This launches the Python console but also prints version info as it does so. Look for something like “[MSC v.1916 64 bit (AMD64)]” which confirms 64-bit per both my experience and the advice in this thread. 
  • A watchout for novices to working with Python and the Windows command line: If running Python via Jupyter notebooks in Anaconda Navigator, the command line window for version-checking needs to be launched from the Windows app called “Anaconda Prompt”. The base Command Line app will not know anything about Python if Python got installed by Anaconda! Anaconda Prompt is designed for messing around with Anaconda specifically. This app in the Windows Start menu (search for “anaconda” there) and is different from Anaconda Navigator. 

  • Anaconda’s update process is often mysterious (aka frustrating) due to package interdependencies. While my Windows fresh install of Anaconda brought me right to Pyarrow latest version (Pyarrow 2.0.0 as of this writing), a colleague was only being offered a less than 1.0 version. A solution is to do a fresh install of Anaconda –essentially trashing whatever interdependencies were keeping it from offering the latest version. An alternative to this is to create a new Environment in Anaconda that focuses on having the latest Python version with minimal other packages installed.

Pyarrow and data types

As opposed to CSV’s, Pyarrow feather format comes with the advantage that it manages (aka enforces) columns’ data types. In Feather, all columns have a type –whether inferred or explicitly assigned –and the individual values must match this type. In case of questions, it should be ‘str’ that allows for data to be re-typed later. Because this is done column-wise and not element-wise (like Pandas), Feather is incompatible with mixing data types within a column. A couple of links at the bottom from the Pyarrow team state this clearly.  See quotes by Uwe Korn/xhochy specifically.  Although these are clear, I learned this organically (e.g. the hard way!) while trying to write data that had concatenated a mixed text/integer data column (Pandas typed it as ‘str’) with one that was exclusively integer (Pandas inferred this to be ‘int’). A better approach for such data is to explicitly type the column as ‘str’ upon import. I will share our CBA and standard code for this in a separate post. Here is an example of a code snippet that will cause an issue. Python .applymap(type) and .groupby() on .applymap() are very helpful for sorting out data types of individual values. This Jupyter notebook is named Feather Mixed Type Example.ipynb and is posted here.

 

Data type discussions from Pyarrow team (see user xhochy and wesm quotes):

https://github.com/pandas-dev/pandas/issues/21228

https://github.com/wesm/feather/issues/349