A “Data Principle” for Digital Transformation

Whether you are a highly technical computer modeler, a project manager, a leader, or an admin, you can be battered by corporate software and by the communication pattern author Cal Newport calls the “hyperactive hive mind” in his brilliant book A World Without Email. This best-selling business author and Georgetown computer science professor shares four “Principles” for moving beyond an inbox-driven corporate culture. These resonate and create a roadmap for unlocking productivity. As a “data guy,” though, I contend in this post that he stopped short of adding a critical, fifth “Data Principle” to his four. I am currently working with a consulting partner on an online training in this Data Principle arena; we have reached the “training is deliverable live” stage of the course production process. Reach out if you want more info.

Banishing the inbox-driven approach is about gaining personal and team productivity and about creating a delightful work culture. That drives business success. I cannot help wondering whether Cal Newport was surveilling my teams as he describes the negative patterns that persist in even highly successful companies. His point is that we have lived with this long enough to see the future: we are at a tipping point for migrating to a better place, and there is unbelievable upside to unlocking it.

What I call the “Data Principle” is democratic across levels and functions. It has at least two parts. Part One is being knowledgeable about how to “structure” and “curate” data and calculation models. My case study experience gives me a lot of confidence in speaking to this. Of course, the goal of arming yourself with knowledge is to take action to shape how you and your team actually work. Doing this makes data and models portable and collaborative across software platforms. It also armors you for whatever comes in the form of new software, databases, and open-source tools.

Data Principle Part Two is learning how to be a discerning user of the messy corporate software mayhem we all live in these days. The mayhem is not going away. The call to action is to take inventory of where your and your team’s critical data and models reside and to be intentional about making work-practice choices that optimize for productivity. Are your critical data and models stored in a single location accessible to those who need them? Can “data” only be obtained by pinging someone else by email or chat? Most importantly, does your software honor people’s time and productivity, or is usability driven by Legal, HR, and Security concerns? If the latter, are there ways (e.g., simple choices) to honor those important concerns while making you and your team more productive?

A specific example of Part Two tension these days is cloud storage of files. In any company, your HR, Legal, and Info Security functions (the ones that must sign off on the licensing contract with Microsoft, Google, et al.) do not, in their innermost corporate souls, want thousands of highly sensitive files on everyone’s laptops. They just don’t. There are obvious situations where that can be problematic. However, you want to have your slides, spreadsheet, or document for that meeting in 5 minutes. Also, you sometimes stay in hotels with crappy internet. In an upcoming post, I will explore that tension further with a case study and discuss how to optimize.


Moving from CA to CBA: A Worthwhile Migration

For leaders of and participants in digital transformation, the acronym “CBA,” or Current Best Approach, is a perfect term. Ironically, that is because it does not connote perfection. Do you have personal CBAs for digital topics? More importantly, does your organization? Are they written down and discussed, or do you live in the land of CAs without being conscious of the “B”?

This is not about creating a big book of standard procedures that nobody opens. Rather, it is the “this is how we do things around here” tenets for the team’s digital practices. Done well, these are a scaffolding on which to hang grass-roots digital transformation using the team’s “spare cycles” as people work. It is empowering because it takes people from frustration with tools to satisfaction in new and productive capability.

Do you feel the tension of many tech tools? They are rapidly evolving. Tools often overlap. The pace of change can be confusing. There is great opportunity in becoming intentional about your CBAs with the software toolset in your business. This is one thing we do when helping a client with digital transformation: understanding where the leverage points are, making training aids to onboard and bring everyone up to CBA, and coming up with a step-by-step migration that makes continual progress in the flow of people driving the business.

This grass-roots digital transformation lets an organization meet corporate efforts half-way. It puts the organization in control of making progress and creates knowledgeable partners with IT. The idea of an advancing front of “best but not perfect” digital CBAs is an empowering concept for any business culture.

Here is a quick, non-fiction case study showing the opportunity. I was recently giving a quick training to a working-level R&D person at a client. Note: consultant time is not free, but employee time is even more expensive. We were stopped in our tracks because he could not sync a folder in Microsoft Teams using OneDrive. That was far from the point of the training, which involved equipping him to run visualizations of “big data.” He works on a business with annual sales greater than $1B, with a B. His organization is well resourced with Microsoft’s latest enterprise offerings, intended to be the hot-and-cold running water utility for digital collaboration. For him to do visualizations and use his data to understand and solve a product defect, he needed OneDrive to sync an online folder to his laptop so that he could run a Jupyter notebook. First problem: my client was not familiar with O365/Teams syncing, or even with where synced folders go when the software works. In the hot water metaphor, he had the plumbing installed but could only take cold showers. Second, OneDrive was not working on his computer. It was stuck and needed restarting, as OneDrive sometimes does. No visualizations. No insights from those visualizations. Learning and business progress stopped cold.

How could this be better? A productive CBA for this organization would draw a line in the sand that says, “As a foundation for all digital work, we will have O365/Teams/OneDrive working well for our employees. We will ensure everyone is trained on the basics and will develop appropriate quick training aids for onboarding and refreshing this knowledge.” This might require a one-time desk-to-desk help session to get things going. It might lead to productive conversations about how to work together collaboratively on documents and presentations (another CBA). The digital transformation ball is truly rolling downhill at that point.


When the Data Get Big(ger), Part 2

It is crucial to move beyond *.CSV files when the scope of a project goes beyond 1MM rows of data. This is an add-on to my initial post about using Pyarrow Feather in Python for efficient data storage. The first post makes a good case for the switch, but it leaves out some nitty-gritty details that required burrowing around in GitHub and Stack Overflow to master and get things working.

Pyarrow installation and Anaconda in Windows

While I had no problem with Pyarrow installation on Mac OS X (“pip install pyarrow” from the command line), I had to scratch and claw to get it working on Windows for colleagues at a client. Here is what straightened things out in that OS:

  • While only mentioned as a strong recommendation in the Pyarrow install documentation, the latest Pyarrow simply requires 64-bit Python. I run Windows on a Parallels virtual machine (VM) and had mistakenly installed 32-bit Anaconda even though my VM is 64-bit. Don’t be like me! After uninstalling Anaconda and reinstalling the 64-bit version, I was able to use Anaconda’s update process to get Pyarrow 2.0. Anaconda documentation contains instructions for the uninstall.
  • To check whether Python is 32- or 64-bit in Windows, open a command line window and enter “python”. This launches the Python console but also prints version info as it does so. Look for something like “[MSC v.1916 64 bit (AMD64)]”, which confirms 64-bit per both my experience and the advice in this thread. (A quick in-Python check is sketched after this list.)
  • A watchout for novices working with Python and the Windows command line: if you run Python via Jupyter notebooks in Anaconda Navigator, the command line window for version-checking needs to be launched from the Windows app called "Anaconda Prompt". The base Command Prompt app will not know anything about Python if Python was installed by Anaconda! Anaconda Prompt is designed for working with Anaconda specifically. This app is in the Windows Start menu (search for “anaconda” there) and is different from Anaconda Navigator.

  • Anaconda’s update process is often mysterious (aka frustrating) due to package interdependencies. While my fresh Windows install of Anaconda brought me right to the latest Pyarrow version (Pyarrow 2.0.0 as of this writing), a colleague was only being offered a less-than-1.0 version. One solution is a fresh install of Anaconda, essentially trashing whatever interdependencies were keeping it from offering the latest version. An alternative is to create a new Environment in Anaconda that has the latest Python version with minimal other packages installed.
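
If you want to verify all of this from inside Python, here is a minimal sanity check you can run from Anaconda Prompt or a notebook cell. It is a sketch using only the standard library plus pyarrow; your version strings will of course differ:

    import struct
    import sys

    # Pointer size in bits: prints 64 on 64-bit Python, 32 on 32-bit
    print(struct.calcsize("P") * 8)

    # On Windows, this string ends with a tag like [MSC v.1916 64 bit (AMD64)]
    print(sys.version)

    # Confirm which Pyarrow version Anaconda actually installed (you want 1.0+)
    import pyarrow
    print(pyarrow.__version__)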

Pyarrow and data types

As opposed to CSVs, Pyarrow’s Feather format comes with the advantage that it manages (aka enforces) column data types. In Feather, every column has a type, whether inferred or explicitly assigned, and the individual values must match that type. When in doubt, type a column as ‘str’, which keeps the door open to re-type the data later. Because typing is done column-wise and not element-wise (as in Pandas), Feather is incompatible with mixed data types within a column. A couple of links at the bottom from the Pyarrow team state this clearly; see the quotes by Uwe Korn/xhochy specifically. Although those are clear, I learned the lesson organically (i.e., the hard way!) while trying to write data that concatenated a mixed text/integer column (Pandas typed it as ‘object’) with one that was exclusively integer (Pandas inferred ‘int64’). A better approach for such data is to explicitly type the column as ‘str’ upon import. I will share our CBA and standard code for this in a separate post. A sketch of a snippet that causes the issue appears below. Python’s .applymap(type), and .groupby() on the .applymap(type) output, are very helpful for sorting out the data types of individual values. The full Jupyter notebook is named Feather Mixed Type Example.ipynb and is posted here.
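
Here is a minimal sketch of that failure mode. The DataFrames, column names, and file name are hypothetical stand-ins, and the exact exception text varies by Pyarrow version, but the pattern is the one that bit us:

    import pandas as pd

    # A text/integer column: Pandas stores it as dtype 'object'
    batch_a = pd.DataFrame({"sku": ["A123", "B456"], "qty": [1, 2]})

    # The same column arriving as pure integers: Pandas infers 'int64'
    batch_b = pd.DataFrame({"sku": [789, 790], "qty": [3, 4]})

    # Concatenating leaves 'sku' as 'object' with str AND int values inside
    combined = pd.concat([batch_a, batch_b], ignore_index=True)

    # Diagnose: count how many values of each Python type live in the column
    print(combined["sku"].apply(type).value_counts())

    # This write raises a pyarrow error (ArrowInvalid/ArrowTypeError, depending
    # on version), because Feather needs ONE type for the whole column:
    # combined.to_feather("combined.feather")

    # The fix: force the column to a single type before writing
    combined["sku"] = combined["sku"].astype(str)
    combined.to_feather("combined.feather")  # now succeeds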


Data type discussions from Pyarrow team (see user xhochy and wesm quotes):

https://github.com/pandas-dev/pandas/issues/21228

https://github.com/wesm/feather/issues/349


When the Data Get Big(ger)

This post describes how we manage life when the data get too big for Excel-compatible formats such as CSV. Practically speaking, during exploratory data analysis there is no rush to jump to fancy data formats on the client-facing side of things. Even when people hire a consultant to help with getting deeper insights, they typically want to open their data, look at it, throw together a pivot table or two, and generally stay in touch with the exploratory work, which may reach them as screenshots of graphics from Python scripts. In most places, this means keeping Excel compatibility for some if not all aspects of a project.

When data exceed one million rows, Excel cannot be used. It has a hard limit of 2^20, or 1,048,576, rows for opening either XLSX or CSV files. Aside from Excel, CSVs can practically be used with more than one million rows: Pandas and other tools can read and write larger files, and PowerBI can work with them too. However, CSVs are much less efficient in file size, as shown in the benchmark comparisons below. Furthermore, CSVs are relatively slow to read and write, and they do not retain type and formatting information about the data, meaning that a perfectly nice serial number like ‘35322000744082’ shows up as ‘3.5322E+13’ when opened in Excel.

On a recent project, we hit Excel’s limit and redesigned our raw data storage to use the Apache Arrow Feather format. Does ‘feather’ allude to ‘light as a...’? Maybe so; I haven’t researched the origin. I studied Feather and other CSV alternatives with benchmarking scripts I wrote based on excellent online posts such as this one. A bibliography of some additional links is below. Apache Arrow is a successful open-source project that defines a language-independent columnar memory format. Although projects like these are always a team effort, this one was spearheaded by Wes McKinney, Pandas’ creator, so it comes with a tremendous credibility stamp on its forehead. Pandas can read and write the Feather format with the recently added read_feather() and to_feather() commands, which wrap Pyarrow’s own read_feather()/write_feather(). Feather retains, in fact it insists on, a data type for each column.
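
For flavor, here is the shape of the timing harness I used, stripped to a sketch. The file names are placeholders, and your numbers will depend on data shape and hardware:

    import time
    import pandas as pd

    def timed(label, fn):
        # Run fn once and report wall-clock time
        t0 = time.perf_counter()
        result = fn()
        print(f"{label}: {time.perf_counter() - t0:.2f} s")
        return result

    df = timed("csv read", lambda: pd.read_csv("raw_store.csv"))
    timed("feather write", lambda: df.to_feather("raw_store.feather"))
    timed("feather read", lambda: pd.read_feather("raw_store.feather"))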

By implementing Feather for our raw data storage, we saw tremendous improvements in file sizes and import/export speeds. The table shows our benchmarks importing 300 raw data files, combining them into a 2MM-row by 7-column DataFrame, and writing the DataFrame to disk on a typical MacBook Pro laptop. The block diagram shows the reusable architecture we created. In a separate post, I will share details of our speed optimization study and a couple of hard-won learnings about working with Feather. The teaser: setting data types comes with advantages, but it means adhering to those data types or risking getting pummeled with Python Exceptions!
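
For reference, a minimal sketch of that pipeline’s shape follows. The raw/ directory and file names are hypothetical stand-ins for our client specifics; dtype=str keeps every column as text so the Feather write cannot hit mixed-type surprises:

    from pathlib import Path
    import pandas as pd

    # Stack ~300 raw CSV exports (hypothetical raw/ directory) into one DataFrame
    frames = [pd.read_csv(p, dtype=str) for p in sorted(Path("raw").glob("*.csv"))]
    combined = pd.concat(frames, ignore_index=True)  # ~2MM rows x 7 columns in our case

    # One Feather file becomes the fast-loading raw data store
    combined.to_feather("raw_store.feather")

    # Downstream analysis scripts reload it in a fraction of the CSV read time
    df = pd.read_feather("raw_store.feather")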


Some helpful links that were background reading and/or helpful in getting the kinks worked out of our feathers:

https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d

https://robotwealth.com/feather-r-python-data-frame/

https://github.com/wesm/feather/issues/349 (troubleshooting data type errors with Feather...)

https://github.com/apache/arrow/blob/master/README.md (link to instructions for installing pyarrow Python library needed for working with Feather)



Blog Beginnings

Welcome to the voice of Data Delve. This is a venture I started a year ago as I transitioned from a 30+ year career as an R&D technologist at P&G. It was a highly intentional plan built on a passion for helping people and teams work efficiently with their data, helping them extract insights for decision making and business growth. It stands on the (unusual for a data geek) foundation of decades of R&D experience creating and launching products in the Consumer Goods space. Data Delve’s approach is therefore sharpened by the experience of literally thousands of "problem definitions" and models built to answer the questions standing in the way of launching new products, filing patents, and explaining complex topics in ways that owners can understand and act on. It leverages decades of experience working as an entrepreneur with suppliers, consumers, and trade customers. Finally, we have a legacy of teaching and tool-building across businesses in all stages of formation and launch.

After a year of serving clients large and small, this website puts a public face on what we bring as a trainer, advisor and problem-solver. The three “products” listed on the home page are the heart of what we do.  We are here to help you grow your business by gaining insights from your data, transforming digitally, and leading data-grounded troubleshooting of your crises and problems.

This blog will focus on business applications of data and on how data and digital transformation create culture in small and large companies. The second blog, Tech Talk, lets us indulge our passion for creating and using great code and technical solutions for working with data and building models. Tech Talk plants our flag as a citizen of the open-source software world. As such, it celebrates helping, and being helped by, those around the globe who write the code, build the models, and create the in-the-weeds solutions, whether they are at companies, nonprofits, government agencies, or on university research teams. We recognize open-source data science as revolutionary for businesses of all sizes, and we are committed to it as a key strategy and a key community we play in.

So, how can we help?  Feel free to reach out through the site’s contact buttons or through Twitter to hit us with a message or to “subscribe” to future posts.

J.D. Landgrebe