What Has Data Delve Been Up to in 2022?

With consulting work, it is easy to just keep your head down and work on client projects. This post is a lifting of my head for a 2022 review. At the top level, I help people build models and gain data insights to grow their business. This means defining problems in a rich and actionable way and then turning them into coded models and visualizations to uncover hidden gems for decisions and growth. Sometimes these solutions are prototypes that lead to more permanent solutions. Sometimes they are themselves the answer.

Here is a bullet point listing of what we were up to in 2022. A paragraph on each below gives more details:

  • Business and manufacturing simulation for a chemical industry client
  • Python-coded data solutions for an e-Commerce appliance product that is online-connected through AWS
  • Troubleshooting the interplay of a client's raw materials and manufacturing process to resolve quality problems
  • A foray into online training with launch of our Digital Transformation with Excel course on mastering Microsoft Excel in organizations
  • Building a “control panel” front-end user interface for a complex, process and formula simulation model for a CPG product
  • Notes on the Global SciPy conference I attended this summer in Austin
  • Overview of our software toolbox

A current project is helping a client’s US$150+ MM B2B chemical business. I work with their business and operations leadership to build a business forecasting/simulation model. They know there are nuggets there that can help but want to translate their intuition into simulations guiding what to emphasize in the business and operation. The business simulator allows them to vary product mix and manufacturing capacity and look at how to grow sales and profit. That links to a capacity and cost model for their two plants, which predicts production volume and cost for their batch-making operation.

Work with a long-term client uses Python libraries I wrote and continue developing to connect their AWS, eCommerce and fulfillment vendor data from an internet-connected appliance they sell online. The data foundation is then a basis to explore and answer questions with creative visualizations and models. This helps the team understand the in-market dynamics and consumer experiences with the product. From “wish for” learning questions they provide, we define requirements and then develop code for both exploring those questions and keeping ongoing track of consumer usage of the fleet of in-market devices.

For another current client, I am working with their engineering and manufacturing team to troubleshoot their combined raw material supply and chemically-based manufacturing process. We seek to eliminate intermittent off-quality product that causes expensive customer issues. We use my 5-step, complex troubleshooting methodology as a roadmap to collaboratively define the problem, engage a broad range of expertise, brainstorm a learning plan and execute experiments to create robust solutions.

On the online training front, I’m partnered with Toronto-based Predictum Consulting LLC and its CEO, Wayne Levin. In October 2022, we launched an online course called Digital Transformation with Excel or DTE. It brings together training and mentoring insights from P&G days and recent consulting projects. DTE recognizes the pervasiveness and unabashed usefulness of Microsoft Excel in companies. It teaches mastery in context of how to structure and curate data and models. This includes how to build great user interfaces and how to make data portable and extensible for collaboration.

A recently-completed, two-year project involved building an extensive Excel/VBA “control panel” to feed a simulation model for a CPG brand’s batch-made product. This application lets users import or enter the product formulation and process setup as inputs. It then generates JSON input decks to feed chemical process simulation software. The VBA work involved envisioning a user interface and behind-the-scenes tracking/wayfinding for this work process –making it possible for users to experiment with varying the formula and process in ways that honor plant and ingredient constraints. Because of the complexity of the model, I wrote and utilized an open-source VBA validation suite to continuously run dozens of validation cases. This helps ensure that VBA code changes do not break previous work.
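The flow from formula and process inputs to a JSON input deck can be sketched in a few lines. This is a hypothetical Python illustration of the pattern only, not the client’s VBA code; the ingredient names, constraint values and deck layout are all invented here:

```python
import json

# Invented formula and process inputs standing in for what users would
# enter or import through the control panel
formula = {"water": 62.5, "surfactant": 30.0, "fragrance": 7.5}   # weight %
process = {"batch_size_kg": 5000, "mix_rpm": 120, "temp_C": 45}

# Honor basic plant and ingredient constraints before emitting the deck
# (these particular limits are illustrative assumptions)
assert abs(sum(formula.values()) - 100.0) < 1e-9, "formula must total 100%"
assert process["temp_C"] <= 80, "hypothetical plant constraint: max 80 C"

# The "input deck" the simulation software would consume
deck = {"version": 1, "formula": formula, "process": process}
print(json.dumps(deck, indent=2))
```

The value of the constraint checks is that users can experiment freely while the interface quietly refuses setups the plant could never run.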

While not a project, I attended the Global SciPy conference this summer in Austin. I learned a lot and made numerous contacts for the future –despite coming home with a mild case of Covid. This community ranges from astronomers to nuclear physics experts, so it is an extreme intellectual challenge for this chemical engineer. I took tutorials to build my expertise in working with geospatial and map data in Python, networked with a lot of interesting folks and formed some lasting collaborations. In addition to the scientific and upstream R&D industry worlds, SciPy has significant attendance from the financial analysis community. However, it almost entirely lacks downstream industrial viewpoints such as mine. This is perhaps understandable because that is not SciPy's focus, but it also speaks to the state of Python tools for experimental design and quality control. In my opinion, it causes SciPy to have a siloed focus on Python tools wielded by upstream corporate people and by academics --even when Python is not the optimal tool for the job.

Finally, my software toolbox continues to expand and is pragmatic. It uses a “right tool for the job" mentality. To start, I develop coded solutions in Python and its Pandas, Numpy and visualization libraries. This Github repository (see tutorial.md there) is a code sample and discusses the under-publicized synergy between two common Python practices, TDD and OOP (aka Test Driven code Design and Object Oriented Programming or “Tuhdoop” as I call the combo). I use the Python Scikit-learn library for advanced modeling. In addition to Python, I use JMP Statistical Discovery software (by SAS Institute) for exploratory data analysis, modeling, multivariate analysis, quality analysis, stats and visualization. I use Microsoft Excel/VBA to build simulation models that can be used by non-coding clients. As discussed above, Excel is also useful for building user interfaces for models that link to other applications. I use PowerBI to create and publish dashboards for clients.
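As a minimal sketch of the TDD-plus-OOP combination (invented here for illustration; see the linked repository’s tutorial.md for the real discussion), a small class and the tests that drive its design might look like:

```python
import unittest

class RunningMean:
    """The OOP side: a small class with a focused public surface."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):
        self.count += 1
        self.total += value

    @property
    def mean(self):
        # Guard against division by zero for the empty accumulator
        return self.total / self.count if self.count else 0.0

class TestRunningMean(unittest.TestCase):
    """The TDD side: tests written against that surface, run continuously."""
    def test_empty_mean_is_zero(self):
        self.assertEqual(RunningMean().mean, 0.0)

    def test_mean_of_two_values(self):
        rm = RunningMean()
        rm.add(2.0)
        rm.add(4.0)
        self.assertEqual(rm.mean, 3.0)

# Run the suite programmatically so this works in scripts and notebooks alike
result = unittest.TextTestRunner().run(
    unittest.defaultTestLoader.loadTestsFromTestCase(TestRunningMean))
```

The synergy is that the OOP class gives the tests a small, stable surface to lock onto, and the tests in turn keep refactors of the class honest.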

Here's to an interesting 2023 of helping clients and continuing to grow in bringing solutions!

A “Data Principle” for Digital Transformation

Whether you are a highly technical computer modeler, project manager, leader, or admin, you can be battered by corporate software and by the communication pattern author Cal Newport calls the “hyperactive hive mind” in his brilliant book A World Without Email. This best-selling business author and Georgetown Computer Science professor shares four “Principles” for moving beyond an inbox-driven corporate culture. These resonate and create a roadmap for unlocking productivity. As a “data guy,” though, this blog explores my contention that he stopped short of adding a critical, fifth “Data Principle” to his four. I am currently working with a consulting partner on an online training in this Data Principle arena; we have reached the stage where the training is deliverable live. Reach out if you want more info.

Banishing the inbox-driven approach is about gaining personal and team productivity and about creating a delightful work culture. That drives business success. I cannot help wondering whether Cal Newport was surveilling my teams as he describes negative aspects in even highly successful companies. His point is that we have lived with this long enough to see the future: we are at a tipping point for migrating to a better place, and there is unbelievable upside to unlocking it.

What I call the “Data Principle” is democratic across levels and functions. It has at least two parts. Part One is being knowledgeable about how to “structure” and “curate” data and calculation models. My case study experience gives me a lot of confidence in speaking to this. Of course, the goal of arming yourself with knowledge is to take action to shape how you and your team actually work. Doing this makes data and models portable and collaborative across software platforms. It also armors you for whatever comes in the form of new software, databases and open-source tools.

Data Principle Part Two is learning how to be a discerning user of the messy corporate software mayhem we all live in these days. The mayhem is not going away. The call to action is to take inventory of where your and your team’s critical data and models reside and to be intentional about making work practice choices that optimize for productivity. Are your critical data and models stored in a single location accessible to those who need them? Can “data” only be gotten by pinging someone else by email or chat? Most importantly, does software honor people’s time and productivity, or is usability driven by Legal, HR and Security concerns? If the latter, are there ways (e.g. simple choices) to honor those important concerns while making you and your team more productive?

A specific example of Part Two tension these days is cloud storage of files. In any company, in their innermost corporate souls, your HR, Legal, and Info Security functions (the ones that need to sign off on the licensing contract with Microsoft, Google et al.) do not want thousands of highly sensitive files on everyone’s laptops. They just don’t. There are obvious situations where that can be problematic. However, you want to have your slides, spreadsheet, or document for that meeting in 5 minutes. Also, you sometimes stay in hotels with crappy internet. In an upcoming post, I will explore that tension further with a case study and discuss how to optimize.

Moving from CA to CBA: A Worthwhile Migration

For leaders of and participants in digital transformation, the acronym “CBA,” or Current Best Approach, is a perfect term. Ironically, that is because it does not connote perfection.  Do you have personal CBAs for digital topics?  More importantly, does your organization?  Are they written down and discussed, or do you live in the land of CAs without being conscious of the “B”?

This is not about creating a big book of standard procedures that nobody opens.  Rather, it is about the “This is how we do things around here” tenets for the team’s digital practices.  Done well, these are a scaffolding on which to hang grass-roots digital transformation using the team’s “spare cycles” as people work.  It is empowering because it takes people from frustration with tools to satisfaction in new and productive capability.

Do you feel the tension of many tech tools?  They are rapidly evolving. Tools often overlap. The pace of change can be confusing. There is great opportunity in becoming intentional about your CBA’s with the software toolset in your business. This is one thing we do when helping a client with digital transformation –understanding where the leverage points are, making training aids to onboard and bring everyone up to CBA and coming up with a step-by-step migration to make continual progress in the flow of people driving the business.  

This grass-roots digital transformation lets an organization meet corporate efforts half-way. It puts the organization in control of making progress and creates knowledgeable partners with IT. The idea of an advancing front of “best but not perfect” digital CBAs is an empowering concept for any business culture.

Here is a quick, non-fiction case study showing the opportunity.  I was recently giving a quick training to a working level R&D person at a client.  Note:  consultant time is not free, but employee time is even more expensive. We were stopped in our tracks because he could not sync a folder in Microsoft Teams using OneDrive. That was far from the point of the training, which involved equipping him to run visualizations of “big data.”  He works on a business with annual sales greater than $1B with a B. His organization is well-resourced with Microsoft’s latest enterprise stuff --intended to be the hot and cold running water utility for digital collaboration.  For him to do visualizations and use his data to understand and solve a product defect, he needed OneDrive to sync an online folder to his laptop so that he could run a Jupyter notebook.  First problem: my client was not familiar with O365/Teams syncing and even where synced folders go when the software works.  In the hot water metaphor, he had the plumbing installed but could only take cold showers.  Second, OneDrive was not working on his computer.  It was stuck and needed restarting as OneDrive sometimes does.  No visualizations.  No insights from those visualizations.  Learning and business progress stopped cold.

How could this be better?  A productive CBA for this organization is to draw a line in the sand that says, “As a foundation for all digital work, we will have O365/Teams/OneDrive working well for our employees. We will ensure everyone is trained on the basics and will develop appropriate quick training aids for onboarding and refreshing this knowledge.”  This might require a one-time desk-to-desk help session to get things going.  It might lead to productive conversations about how to work together collaboratively on documents and presentations (another CBA).  The digital transformation ball is truly rolling downhill at that point.

When the Data Get Big(ger), Part 2

It is crucial to move beyond *.CSV files when the scope of a project goes past 1MM rows of data. This is an add-on to my initial post about using Pyarrow Feather in Python for efficient data storage.  The first post makes a good case for the switch, but it leaves out some nitty-gritty details that required burrowing around in GitHub and Stack Overflow to get things working.

Pyarrow installation and Anaconda in Windows

While I had no problem with Pyarrow installation in Mac OSX (“pip install pyarrow” from the command line), I had to scratch and claw to get it working on Windows for colleagues at a client. Here is what straightened things out in that OS:

  • While only mentioned as a strong reco in the Pyarrow install documentation, the latest Pyarrow simply requires 64-bit Python. I run Windows on a Parallels virtual machine (VM) and had mistakenly installed 32-bit Anaconda even though my VM is 64-bit. Don’t be like me! After uninstalling Anaconda and reinstalling the 64-bit version, I was able to use Anaconda’s update process to get Pyarrow 2.0. Anaconda documentation contains instructions for the uninstall. 
  • To check whether Python is 32 or 64 bit in Windows, open a command line window and enter “python”. This launches the Python console but also prints version info as it does so. Look for something like “[MSC v.1916 64 bit (AMD64)]” which confirms 64-bit per both my experience and the advice in this thread. 
  • A watchout for novices working with Python and the Windows command line: if running Python via Jupyter notebooks in Anaconda Navigator, the command line window for version-checking needs to be launched from the Windows app called "Anaconda Prompt". The base Command Prompt app will not know anything about Python if Python was installed by Anaconda! Anaconda Prompt is designed for working with Anaconda specifically. This app is in the Windows Start menu (search for “anaconda” there) and is different from Anaconda Navigator. 
  • Anaconda’s update process is often mysterious (aka frustrating) due to package interdependencies. While my fresh Windows install of Anaconda brought me right to the latest Pyarrow (2.0.0 as of this writing), a colleague was only being offered a pre-1.0 version. One solution is a fresh install of Anaconda –essentially trashing whatever interdependencies were keeping it from offering the latest version. An alternative is to create a new Environment in Anaconda that has the latest Python version with minimal other packages installed.
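The 32- vs 64-bit check can also be done from inside Python itself, which avoids squinting at the console banner. This is a small snippet of my own, not from the Pyarrow docs:

```python
import struct
import sys

# The size of a C pointer ("P") is 4 bytes on 32-bit Python and 8 on
# 64-bit, regardless of OS, so this check works everywhere
bits = struct.calcsize("P") * 8
print(f"Python {sys.version.split()[0]} is running as {bits}-bit")
```

Run it in any Jupyter cell or from Anaconda Prompt with `python -c "import struct; print(struct.calcsize('P') * 8)"`.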

Pyarrow and data types

As opposed to CSV’s, Pyarrow’s Feather format comes with the advantage that it manages (aka enforces) columns’ data types. In Feather, every column has a type –whether inferred or explicitly assigned –and the individual values must match that type. Because typing is column-wise rather than element-wise (which Pandas ‘object’ columns allow), Feather is incompatible with mixing data types within a column. When in doubt, type a column as ‘str’, which allows the data to be re-typed later. A couple of links at the bottom from the Pyarrow team state this clearly; see the quotes by Uwe Korn/xhochy specifically. Although these are clear, I learned this organically (e.g. the hard way!) while trying to write data that concatenated a mixed text/integer column (Pandas typed it as ‘object’) with one that was exclusively integer (Pandas inferred ‘int’). A better approach for such data is to explicitly type the column as ‘str’ upon import. I will share our CBA and standard code for this in a separate post. Pandas .applymap(type) and .groupby() on .applymap() are very helpful for sorting out the data types of individual values. The Jupyter notebook is named Feather Mixed Type Example.ipynb and is posted here.


Data type discussions from Pyarrow team (see user xhochy and wesm quotes):



When the Data Get Big(ger)

This blog describes how we manage life when the data get too big for Excel-compatible formats such as CSV. Practically, when working with exploratory data analysis, there is no rush to jump to fancy data formats on the client-facing side of things.  Even when people hire a consultant to help with getting deeper insights, they typically want to open their data, look at it, throw together a pivot table or two and basically be in touch with the consulting exploratory stuff that may reach them as screen pictures of graphics from Python scripts.  In most places, this means keeping Excel compatibility for certain if not all aspects of a project.

When data exceed one million rows, Excel cannot be used.  It has a hard limit of 2^20, or 1,048,576, rows for opening either XLSX or CSV files.  Aside from Excel, CSV’s can practically be used with more than one million rows: Pandas and other tools can read and write larger files, and PowerBI can work with them.  However, CSV’s are much less efficient in file size, as shown in the benchmark comparisons below.  Furthermore, CSV’s are relatively slow to read and write, and they do not retain type and formatting information about the data –meaning that a perfectly nice serial number like ‘35322000744082’ shows up as ‘3.5322E+13’ when opened in Excel.

On a recent project, we hit Excel’s limit and redesigned raw data storage to use the Apache Arrow Feather format.  Does 'feather' allude to 'light as a...'?  Maybe so; I haven't researched the origin.  I studied Feather and other CSV alternatives with benchmarking scripts I wrote based on excellent online posts such as this one.  A bibliography of some additional links is below.  Apache Arrow is a successful open-source project that defines a language-independent columnar memory format. Although projects like these are always a team effort, this one was spearheaded by Wes McKinney, Pandas' creator, so it comes with a tremendous credibility stamp on its forehead.  Pandas can read and write Feather format with its recently-added read_feather() and to_feather() commands.  Feather retains –in fact insists on –a data type for each column.

By implementing Feather for our raw data storage, we saw tremendous improvements in file sizes and import/export speeds.  The table shows our benchmarks importing 300 raw data files, combining them into a 2MM row by 7 column DataFrame and writing the DataFrame to disk on a typical MacBook Pro laptop. The block diagram shows the reusable architecture we created. In a separate blog, I will share details of our speed optimization study and a couple of hard-won learnings about working with Feather.  The teaser on that is that setting data types comes with advantages, but it means adhering to those data types or risking getting pummeled with Python Exceptions!


Some helpful links that were background reading and/or helpful in getting the kinks worked out of our feathers:



https://github.com/wesm/feather/issues/349 (troubleshooting data type errors with Feather...)

https://github.com/apache/arrow/blob/master/README.md (link to instructions for installing pyarrow Python library needed for working with Feather)


Blog Beginnings

Welcome to the voice of Data Delve. This is a venture I started a year ago as I transitioned from a 30+ year career as an R&D technologist at P&G. It was a highly intentional plan built on a passion for helping people and teams work efficiently with their data --helping them extract insights for decision making and business growth.  It stands on the (unusual for a data geek) foundation of decades of R&D experience creating and launching products in the Consumer Goods space. Data Delve's approach is therefore sharpened by the experience of literally thousands of "problem definitions" and models built to get answers to problems standing in the way of launching new products, filing patents and explaining complex topics in a way that owners can understand and act on.  It leverages decades of experience working as an entrepreneur with suppliers, consumers and trade customers.  Finally, we have a legacy of teaching and tool-building across businesses in all stages of formation and launch.

After a year of serving clients large and small, this website puts a public face on what we bring as a trainer, advisor and problem-solver. The three “products” listed on the home page are the heart of what we do.  We are here to help you grow your business by gaining insights from your data, transforming digitally, and leading data-grounded troubleshooting of your crises and problems.

This blog will focus on business applications of data and on how data and digital transformation create culture in small and large companies.  The second blog, Tech Talk, lets us indulge our passion for creating and using great code and technical solutions for working with data and building models. Tech Talk plants our flag as a citizen of the open-source software world –helping and being helped by those globally who write the code, build the models and create the in-the-weeds solutions, whether they are at companies, nonprofits, government agencies or on university research teams.  We recognize open-source data science as revolutionary for businesses of all sizes, and we are committed to this being a key strategy and a key community we play in.

So, how can we help?  Feel free to reach out through the site’s contact buttons or through Twitter to hit us with a message or to “subscribe” to future posts.

J.D. Landgrebe