The Right Modeling And Analysis Tools For the Job
Software tools in the data and modeling arena often lead individuals and teams into counterproductive patterns. By being informed and intentional, you can choose the best tool for a particular job. It helps to recognize that software providers, whether companies or open-source communities, keep adding features and advocate for using their software as broadly as possible. Good for them, but modern tools let you pick the best one for each part of a bigger job and port data among tools for efficiency and robustness.
As an engineer, data scientist and business generalist, I advocate (and here freely share my opinions about) a software ecosystem that works well in many parts of corporate and university research worlds. It has three primary tools for shaping and sharing data, for creating models and for exploring and visualizing data for decision-making.
- Microsoft Excel® for making data democratically viewable regardless of source, for generating “end of the pipeline” reports and for creating spreadsheet models with calculations for use by non-coders
- Python scripts for data reshaping and for developing coded models and data pipelines (possibly mixing in a little SQL)
- JMP® software for exploratory analysis and visualization with accompanying statistical analysis of data
This is a solid ecosystem, but, as a disclaimer, it is possible to be successful with different choices. Statisticians, for example, will learn and be fluent in R, but R requires a lot of expertise and is limited for collaboration inside a company. Nor does this core ecosystem preclude using additional tools, which I do in consulting practice. Still, considering it can help you think critically about your situation and what's best for you and your team.
From experience with various organizations, here are two suggestions for fighting the “wrong tool for the job” pressure in corporate and academic cultures, and the discussion below goes into more detail on these:
- Don’t overuse Microsoft Excel where it is not the best tool.
- Don't overuse Python for exploratory data analysis, visualization, and statistics. Instead, use best-in-class JMP® for this.
Don't Overuse Microsoft Excel
Not overusing Microsoft Excel takes some discipline. Our online Digital Transformation with Excel (DTE) course teaches responsible use for situations where collaboration and accuracy matter. We teach a surprising combination of long-time, core features that we call a "hidden layer," which leads to well-curated spreadsheets that are transparent to understand and easy to validate. Excel is wonderful for making data viewable and formatted in intuitive ways, and well-curated spreadsheet models bring calculations and simulations to a broad range of users. In consulting practice, I use a home-grown VBA validation suite to ensure that spreadsheet models are correct and stay correct over time. I also use DTE course user-interface design principles and the openpyxl Python library to format Python model outputs into well-formatted spreadsheets that clients can review and understand.
Excel at its Best – Portable-Across-Software Model with UI Features for Ease of Reviewing and to Highlight Calculations
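As a concrete illustration of the openpyxl approach mentioned above, here is a minimal sketch of writing a model-output DataFrame to a spreadsheet and then applying review-friendly formatting. The column names, values, and file name are made up for illustration, not taken from client work.

```python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill

# Hypothetical model output (names and numbers are illustrative only).
results = pd.DataFrame(
    {"scenario": ["base", "upside"], "volume_kg": [1200, 1450], "margin_pct": [0.18, 0.22]}
)

# Write the DataFrame, then reopen it with openpyxl to apply formatting for reviewers.
results.to_excel("model_output.xlsx", sheet_name="Results", index=False)
wb = load_workbook("model_output.xlsx")
ws = wb["Results"]

# Bold, shaded header row so reviewers see at a glance what each column contains.
header_fill = PatternFill(start_color="FFD9E1F2", end_color="FFD9E1F2", fill_type="solid")
for cell in ws[1]:
    cell.font = Font(bold=True)
    cell.fill = header_fill

# Widen columns and format the margin column as a percentage.
for col_letter, width in {"A": 14, "B": 12, "C": 12}.items():
    ws.column_dimensions[col_letter].width = width
for row in ws.iter_rows(min_row=2, min_col=3, max_col=3):
    for cell in row:
        cell.number_format = "0.0%"

wb.save("model_output.xlsx")
```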
Although you can buy books on how to do it, Excel is non-ideal for exploratory data analysis or for graphing data, except for hard-coded, dashboard-like graphs. Additionally, it is wise for Excel experts to be sparing in incorporating advanced "new Excel" features into models (PowerQuery, PowerPivot with the DAX language, Tables, matrix functions, etc.). For most models and for 90+% of users, I observe that these features are collaboration and robustness killers compared with simply using "old Excel" to create broadly understandable spreadsheets.
Though controversial with some, avoiding "new Excel" as much as possible also steers experts toward a modular, pipeline architecture where advanced work gets done by coded scripts that can be verified with a validation suite. Yes, this means that I am lukewarm or even cold on Microsoft's recent (August 2023) launch of "Excel can do Python." I gently note that there were pre-existing, robust ways to script and even test Python prior to Microsoft's "discovery" of Python for Excel. This is a good example of a software company getting marketing mileage out of continually adding features. Be wise to this practice!
Advanced-Feature Hype for Python-In-Excel
Don't Overuse Python
The second, flip-side recommendation is to avoid overuse of Python (or R) for exploratory data analysis, visualization, and statistics. Overuse unfortunately seems to be a hallmark of the Python community, and the cost shows up even among data science and analytics experts. For general purposes, commercial JMP® software, produced by the venerable and respected SAS Institute, is superior in this space. The practical reality is that, within companies, there are wide swaths even of technical organizations where nobody is going to write a script to make a graph, let alone fit a three-term regression model, to make sense of business or industrial data. Without a tool like JMP, they will either go without analysis or fall back on sub-optimal Excel.
Example JMP Graphical Output With Relevant Statistics For Decision-Making
Graphical Elements and Stats show that Product "b" is Superior and "c" is inferior
Python graphing and stats tools are great in specialized situations, but JMP is useful even for data popping out of Python scripts. JMP is currently on version 17, and its team has long put exceptional quality of thought into how to visualize statistics for decision-making. They have great self-training and webinar resources openly available. JMP is not free. However, it is widely available in companies through negotiated licenses, and it is available under educational discounts in most universities.
Interestingly, since I have long used JMP and started my Python journey more recently, I conducted informal, tabletop "consumer research" at global Scientific Python (SciPy) conferences. Many in the Python community are literally unaware of JMP's existence. When people saw a demo, they were generally amazed by its capabilities and ease of use relative to what they were doing with Python tools, usually visualization tools that lacked direct statistical support. The demos brought immediate dismay to more than one graduate student who realized how much easier things could be. While not a good solution for everything, JMP analysis can be a workhorse for data-driven decision-making in many situations.
Making intentional software choices helps you avoid getting sucked into doing inefficient things, and it can break a culture of poor habits. Hopefully this discussion gives you food for thought about your personal or organizational software ecosystem for analysis and modeling.
What Data Delve LLC Does for Clients
My Data Delve LLC business comes down to a simple benefit for people I work with. Namely, I help you and your team be more effective at getting insights from your data to grow your business.
Here are six strategies I use for this. It is a toolbox from which we mutually pick the right tools.
- I help you define what you really want. This is sometimes called "problem definition," but, to be productive, it also needs to include "deliverables definition." It is catalyzed by discussing and listening to what you innately know would help your business development, technology development, or personal and team efficiency.
- I make your existing data work more efficient. Smart people have ways of getting work done (no successful business lacks them), but we have tools and methods for removing friction and migrating to better places. This can start with advice on and help with "spreadsheet straightening," or it can involve more complex help with migrating to efficient "pipelines" that bring together data from multiple sources.
- I build "flight simulators" to let you try out different approaches for growth. These can be business simulators, ways of testing technical product formulations and processes, or ways of simulating consumer responses. You already have the intelligence and wisdom to master complex situations; having low-cost ways to simulate and explore new things can be game-changing. The "exploring" may consist of us building tools you can run yourself, or of you telling us what to try while we run coded tools and report back, acting as a bridge between your ideas and potential solutions.
- I help you creatively visualize your data to bring out what it can tell you and how it can guide you. This may involve cross-connections such as reapplying techniques for visualizing complex manufacturing data to shed light on your in-market results with your business or with in-market products.
- I can educate you and your team on best approaches for working with data and the coded solutions for that. We have a suite of online and in-person trainings on software tools from basic to advanced. We are connected to and participants in world-class training venues at companies, technical conferences, universities, and online schools. We know how to personalize the training message to you and your company, large or small.
- I modularize data and modeling solutions to marry your confidential work with non-confidential open-source software. This avoids needing to start from scratch, and it makes us light on our feet and cost-efficient at getting answers for you. I am active in posting open-source building blocks I create in the Excel/VBA macro and Python spaces, and we also use free, open-source packages that are reliable and whose usage runs in the tens of millions. We bring these building blocks to bear on your needs.
Good, Better, Best For Python Projects
I blogged about my observation that Python-coded models and analyses can be grouped into three robustness levels based on architecture and macroscopic coding arrangement. Here is a more detailed discussion of the three levels, which are also taught in the Practical Python for Modelers short course at this training link.
The 3-level scale is based on how a project does in three areas: First, is the model or analysis “curated” for the developer and others to understand, inspect and review? Second, are the data transformations and calculations validated with re-runnable, coded tests to build confidence in correctness over the useful life of the model? Finally, are solutions maintainable and extensible by the original developer and colleagues or successors?
Level 1 is what I call the "ad hoc Jupyter notebook approach." It will garner the highest grade in Data Science coursework and is the "I claim Python expertise on LinkedIn" threshold. Level 1 notebooks will not be questioned for models built for personal usage at a company. It is the go-to for short, linear, one-time analyses and demos. It leaves a lot to be desired beyond that, though, and one-time analyses have a funny habit of morphing into permanence.
Level 1 code is rarely well curated and is difficult for anyone else to re-use. From personal experience, its utility breaks down as data sources and analyses grow in complexity. Level 1 analyses cannot practically be validated with a re-runnable test suite. Usually, Level 1 "validation" consists of printing out and visually inspecting a result or performing quick hand calculations: "Yep. That looks plausible. Move on." That makes things high-risk for errors, both initially and, especially, if the work is re-used later.
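For contrast with the levels that follow, here is a hypothetical Level 1 cell. The file and column names are made up for illustration; the point is the linear, load-transform-eyeball pattern.

```python
# Hypothetical Level 1 notebook cell: load, transform and plot in one linear pass,
# with "validation" done by eyeballing the printed result.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")              # illustrative file name
df["revenue"] = df["units"] * df["price"]  # illustrative column names
monthly = df.groupby("month")["revenue"].sum()
print(monthly.head())                      # "Yep. That looks plausible. Move on."
monthly.plot(kind="bar")
plt.show()
```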
What I call Level 2 is my go-to approach for client-requested, one-time analyses and for exploratory work. Level 2 work is still (usually) based in a Jupyter notebook for convenience, but use cases get organized into what I call procedures: groupings of single-action functions that, when executed in order, perform some task or data transformation.
I use the diagram to illustrate a Level 2 and 3 "architecture." The left block represents a model's user entering some inputs and executing a use case. The right-most functions are single-action: easy to debug and, at least at Level 3, validated with re-runnable Pytest testing. The "procedures" in the middle are procedural functions that call the elemental ones sequentially to do the work, as shown in the example code below. What defines a "procedure" is arbitrary, but each should generally be single-topic and independent (e.g., "transform the ingredients DataFrame into MKS units and add calculated columns" or "create a rows/columns summary of business-unit sales and revenue by calendar month"). Delivering the user-facing use case consists of performing one or more of these more abstract "procedures." While there are other architectural frameworks out there, I find that this one works well for 90+% of user-facing models and analyses.
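Here is a minimal sketch of that arrangement. The function and column names are illustrative only; the structure is what matters: elemental single-action functions on the right, a procedure function that calls them in order, and the user-facing call on the left.

```python
import pandas as pd

# --- Elemental, single-action functions (right-hand blocks in the diagram) ---
def convert_to_mks(ingredients: pd.DataFrame) -> pd.DataFrame:
    """Convert the mass column from grams to kilograms (MKS units)."""
    out = ingredients.copy()
    out["mass_kg"] = out["mass_g"] / 1000.0
    return out

def add_cost_column(ingredients: pd.DataFrame) -> pd.DataFrame:
    """Add a calculated cost column from mass and unit cost."""
    out = ingredients.copy()
    out["cost"] = out["mass_kg"] * out["unit_cost_per_kg"]
    return out

# --- Procedure function (middle block): calls the elemental functions sequentially ---
def proc_transform_ingredients(ingredients: pd.DataFrame) -> pd.DataFrame:
    """Transform the ingredients DataFrame into MKS units and add calculated columns."""
    df = convert_to_mks(ingredients)
    df = add_cost_column(df)
    return df

# --- Use case (left block): the user-facing call ---
raw = pd.DataFrame({"ingredient": ["resin", "pigment"],
                    "mass_g": [2500, 150],
                    "unit_cost_per_kg": [4.2, 11.0]})
print(proc_transform_ingredients(raw))
```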
The structure makes troubleshooting easy, and it makes it possible to maintain and extend the model by refactoring individual functions, splicing new functions into a procedure or by adding new procedure functions and their right-side collections of elemental ones. At Level 3, all such upgrades can come with re-running the test suite to ensure nothing got broken.
Level 3 is what I call Tuhdoop, referring to a combination of Test-Driven Design (TDD) and Object-Oriented Programming (OOP). I coined the term and wrote about Tuhdoop and its synergistic benefits in this Github tutorial. I notice that I can create Level 3 code as efficiently as Level 2, and as efficiently as Level 1 if I factor in rework and debugging time. I am gradually trending toward a hybrid work process of developing at Level 2 and then porting the code to Level 3 as appropriate. A distinction between Levels 2 and 3 is that Level 3 is object-oriented, often consisting of a single Class per use case, or sometimes a Class for the use case that relies on other helper Classes.
To go with the architecture diagram, this screen picture shows typical Level 3 code where the application code (left side) and test code (right side) mirror each other and are organized by procedure. Well-organized code is self-documenting. The testing also forms part of the documentation by making it easy to probe and learn about what individual functions are doing.
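In that spirit, here is a small, hypothetical sketch of the Level 3 style: one class per use case, with a Pytest test built on mocked-up data that mirrors the procedure. In practice the class and the test would live in separate, mirrored files; the names here are illustrative, not actual project code.

```python
import pandas as pd

# Application code (e.g. ingredient_model.py): one class per use case.
class IngredientModel:
    def __init__(self, ingredients: pd.DataFrame):
        self.ingredients = ingredients

    # Elemental, single-action methods.
    def convert_to_mks(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["mass_kg"] = out["mass_g"] / 1000.0
        return out

    def add_cost_column(self, df: pd.DataFrame) -> pd.DataFrame:
        out = df.copy()
        out["cost"] = out["mass_kg"] * out["unit_cost_per_kg"]
        return out

    # Procedure method: calls the elemental methods in order.
    def proc_transform_ingredients(self) -> pd.DataFrame:
        df = self.convert_to_mks(self.ingredients)
        return self.add_cost_column(df)

# Test code (e.g. test_ingredient_model.py), mirroring the class procedure by procedure.
def test_proc_transform_ingredients():
    mocked = pd.DataFrame({"ingredient": ["resin"],
                           "mass_g": [2500],
                           "unit_cost_per_kg": [4.0]})
    result = IngredientModel(mocked).proc_transform_ingredients()
    assert result.loc[0, "mass_kg"] == 2.5
    assert result.loc[0, "cost"] == 10.0
```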
As a closing aside, the TDD theme in Level 3 and using ChatGPT to spit out function skeletons have pushed me to get better at planning the procedures and elemental functions and at prewriting the tests. I am currently experimenting with ChatGPT prompts like, "For the above rows-and-columns table listing Python functions in Column A, write the skeleton of the Python function with its arguments (Column B) and write a Pytest test function to verify that it generates the expected output." I am still undecided whether this will ultimately be advantageous versus writing my own Python helper script to do the same thing systematically, which might be easier than trying to cajole ChatGPT into doing it repeatably (a rough sketch of such a helper follows below). Regardless of how that turns out, credit to GPT-4 for the inspiration.
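For what it's worth, here is a rough sketch of what such a helper script could look like. The planned function names, argument lists, and output file names are placeholders, not part of any existing workflow.

```python
# Hypothetical helper: given a mapping of planned function names to argument lists
# (e.g. pasted from a planning table), emit function skeletons and matching Pytest stubs.
planned_functions = {
    "convert_to_mks": ["ingredients"],
    "add_cost_column": ["ingredients"],
}

def write_skeletons(functions: dict, app_path: str, test_path: str) -> None:
    """Write NotImplementedError skeletons and failing test stubs to the two files."""
    with open(app_path, "w") as app, open(test_path, "w") as tests:
        for name, args in functions.items():
            arg_list = ", ".join(args)
            app.write(f"def {name}({arg_list}):\n    raise NotImplementedError\n\n")
            tests.write(
                f"def test_{name}():\n"
                f"    # TODO: build mocked input and expected output for {name}\n"
                f"    assert False, 'write this test before implementing {name}'\n\n"
            )

write_skeletons(planned_functions, "model_skeleton.py", "test_model_skeleton.py")
```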
Robust Models and Analyses – Defining Good (or Meh!), Better, Best
Python is everywhere these days. Universities crank out Data Science majors and minors. Middle school kids are learning the language. There is a groundswell of coding as a generic skill for non-software professionals --engineering, scientific, business, journalism and even liberal arts disciplines. That’s great, but, when it comes to building models and doing important analyses with the language, I observe three distinct levels of robustness. I group projects into these levels based on how they are organized (aka “architected”) and whether they come with re-runnable testing. That is a standard in the software industry for good reason. If you are learning Python, you can use my distinctions as a roadmap and ensure you have the top two levels in your repertoire. If you are a manager of people who create data products and models for decision making (or if you hire consultants!), it should be a priority for these analyses to be curated, validated and maintainable/extensible. The quality of these three items varies wildly and honestly tends towards the low end of the scale.
I recently created a workshop training called Practical Python for Modelers for an audience of R&D engineers and scientists ranging from Python beginners to computer modeling experts. I am reapplying it in collaboration with Dr. Will Hartt and colleagues at University of Delaware Chemical Engineering. We seek to educate highly technical undergrad and graduate students for whom data science is outside their immediate domain. The slides and training data are available at the link.
A follow-up Tech Notes blog goes into more detail about what distinguishes the three levels of Python styles/architectures. Here are definitions of the three robustness objectives:
- Robust models and analyses are curated. Another knowledgeable human can look at the files and not only repeat the analysis but extend it without the author needing to be present.
- Robust models and analyses are validated. They come with re-runnable test cases consisting of mocked up data proving correctness of calculations, data cleaning steps and output generation.
- Robust models are maintainable and extensible. They are written in a modular way, so that they are easy to build on and easy to reapply.
Look for these things in what you generate. If the work is done externally for you, ask consultants to write them into proposal deliverables along with the direct, decision-making models and analyses! You will not be sorry.
Our New Online Course: Digital Transformation with Excel
This announces a foray into online training with the launch of our Digital Transformation with Excel (DTE) course. I partnered with Toronto-based Predictum Consulting LLC and CEO Wayne Levin on this. It brings together my data and modeling mentoring insights from decades of P&G R&D work as well as from recent consulting work. DTE recognizes the pervasiveness and unabashed usefulness of Microsoft Excel in organizations. The result is something I'm proud of and something that is uniquely beneficial among the soup of trainings out there.
Please reach out on LinkedIn or Twitter (@data_delve or @Predictum_Inc) if you want to discuss setting up a preview call, live training options or group discounts. DTE's target audience is broad. I hope a wide range of folks will experience it:
- Modeling experts, Finance experts and Analysts can expect to learn a rather surprising "hidden layer" of features for authoring models in a way that is relatable to less technical customers. This audience bullet point is my tribe too. Elegant use of Excel connects models to the business in a relevant way since the downstream world runs on it. Learning how to structure and curate data and models is often new even to experts, whose standard practice is to construct unstructured, spreadsheet-model "dashboards."
- Multi-functional leaders can expect to gain personal, working-with-data skills but also to get cultural insights on what to emphasize and encourage as "this is how we do things around here" for digital transformation.
- Administrators, project managers and anyone else who works collaboratively on teams can expect to learn skills for combining “business process” data in new ways that calm the constant barrage of chat messages and emails on topics ranging from budget questions to project lists, product designs, project status and so on.
DTE is more than just a collection of skill tidbits such as you can find for free on YouTube. It is built on four higher-level themes that, when combined, create data and modeling magic:
- How to be fluent to work well with large and small data sets and models (Example skill: 50+% speedup via practical use of built-in shortcuts for navigation and for issuing commands)
- How to structure data for extensibility and portability (Example skills: How to include “keys” in data for filtering and sorting; building multi-table models with the XLOOKUP function)
- How to curate data to maximize personal and collaborator useful life of data (Example skill: Cell and Range naming to curate formulas)
- Designing excellent spreadsheet user interfaces to encourage use of data and calculation models for collaboration (Example skills: Column Outlines and unique Conditional Formatting techniques for working comfortably with large models)
Besides Excel's pervasiveness, I will add a word on "why Excel?" as a training topic in 2022. We know that company software ecosystems are complex, and people work with multiple tools, as I do in my consulting practice. However, Excel's wide-open grid structure makes it ideal for teaching how to make data and models collaborative and portable across software. It is also true that Excel can be used either for the good of constructing great models or for the evil of making hard-to-decipher spreadsheet scraps that die on individuals' hard drives. To bring about true digital transformation, we teach how to bend toward the good and how to steer clear of spreadsheet evil.
DTE practical details:
- It consists of 3 hours of short videos combined with case study examples. When taught live, it is a “one-day” course.
- DTE is cross-platform for Mac and Windows. We discuss relative strengths and weaknesses of the Excel variants but make the teaching fluid and relevant regardless of preferred operating system.
- DTE is hosted on Predictum’s Thinkific “academy." Here is a link to Predictum’s LinkedIn announcement of DTE and intro of me as lead trainer.
What Has Data Delve Been Up to in 2022?
With consulting work, it is easy to just keep your head down and work on client projects. This post is a lifting of my head for a 2022 review. At top level, I help people build models and gain data insights to grow their business. This means defining problems in a rich and actionable way and then turning them into coded models and visualizations to uncover hidden gems for decisions and growth. Sometimes these solutions are prototypes that lead to more permanent solutions. Sometimes they are themselves the answer.
Here is a bullet point listing of what we were up to in 2022. A paragraph on each below gives more details:
- Business and manufacturing simulation for a chemical industry client
- Python-coded data solutions for an e-Commerce appliance product that is online-connected through AWS
- Troubleshooting the interplay of a client's raw materials and manufacturing process to resolve quality problems
- A foray into online training with launch of our Digital Transformation with Excel course on mastering Microsoft Excel in organizations
- Building a “control panel” front-end user interface for a complex, process and formula simulation model for a CPG product
- Notes on the Global SciPy conference I attended this summer in Austin
- Overview of our software toolbox
A current project supports a client's B2B chemical business with US $150+ MM in sales. I work with their business and operations leadership to build a business forecasting/simulation model. They know there are nuggets that can help but want to translate their intuition into simulations guiding what to emphasize in the business and the operation. The business simulator lets them vary product mix and manufacturing capacity and look at how to grow sales and profit. It links to a capacity and cost model for their two plants, which predicts production volume and cost for their batch-making operation.
Work with a long-term client uses Python libraries I wrote and continue developing to connect their AWS, eCommerce and fulfillment vendor data from an internet-connected appliance they sell online. The data foundation is then a basis to explore and answer questions with creative visualizations and models. This helps the team understand the in-market dynamics and consumer experiences with the product. From “wish for” learning questions they provide, we define requirements and then develop code for both exploring those questions and keeping ongoing track of consumer usage of the fleet of in-market devices.
For another current client, I am working with their engineering and manufacturing team to troubleshoot their combined raw material supply and chemically-based manufacturing process. We seek to eliminate intermittent off-quality product that causes expensive customer issues. We use my 5-step, complex troubleshooting methodology as a roadmap to collaboratively define the problem, engage a broad range of expertise, brainstorm a learning plan and execute experiments to create robust solutions.
On the online training front, I’m partnered with Toronto-based Predictum Consulting LLC and its CEO, Wayne Levin. In October 2022, we launched an online course called Digital Transformation with Excel or DTE. It brings together training and mentoring insights from P&G days and recent consulting projects. DTE recognizes the pervasiveness and unabashed usefulness of Microsoft Excel in companies. It teaches mastery in context of how to structure and curate data and models. This includes how to build great user interfaces and how to make data portable and extensible for collaboration.
A recently-completed, two-year project involved building an extensive Excel/VBA “control panel” to feed a simulation model for a CPG brand’s batch-made product. This application lets users import or enter the product formulation and process setup as inputs. It then generates JSON input decks to feed chemical process simulation software. The VBA work involved envisioning a user interface and behind-the-scenes tracking/wayfinding for this work process –making it possible for users to experiment with varying the formula and process in ways that honor plant and ingredient constraints. Because of the complexity of the model, I wrote and utilized an open-source VBA validation suite to continuously run dozens of validation cases. This helps ensure that VBA code changes do not break previous work.
While not a project, I attended the Global SciPy conference this summer in Austin. I learned a lot and made numerous contacts for the future, despite coming home with a mild case of Covid. This community ranges from astronomers to nuclear physics experts, so it is an extreme intellectual challenge for this chemical engineer. I took tutorials to build my expertise in working with geospatial and map data in Python. I networked with a lot of interesting folks and formed some lasting collaborations. In addition to the scientific and upstream R&D industry worlds, SciPy has significant attendance from the financial analysis community. However, downstream industrial viewpoints such as mine are almost entirely absent. This is perhaps understandable because it is not SciPy's focus, but it also speaks to the state of Python tools for experimental design and quality control. In my opinion, it also causes SciPy to have a siloed focus on Python tools wielded by upstream corporate people and by academics, even when Python is not the optimal tool for the job.
Finally, my software toolbox continues to expand and is pragmatic. It uses a “right tool for the job" mentality. To start, I develop coded solutions in Python and its Pandas, Numpy and visualization libraries. This Github repository (see tutorial.md there) is a code sample and discusses the under-publicized synergy between two common Python practices, TDD and OOP (aka Test Driven code Design and Object Oriented Programming or “Tuhdoop” as I call the combo). I use the Python Scikit-learn library for advanced modeling. In addition to Python, I use JMP Statistical Discovery software (by SAS Institute) for exploratory data analysis, modeling, multivariate analysis, quality analysis, stats and visualization. I use Microsoft Excel/VBA to build simulation models that can be used by non-coding clients. As discussed above, Excel is also useful for building user interfaces for models that link to other applications. I use PowerBI to create and publish dashboards for clients.
Here's to an interesting 2023 of helping clients and continuing to grow in bringing solutions!
A “Data Principle” for Digital Transformation
Whether you are a highly technical computer modeler, project manager, leader, or admin, you can be battered by corporate software and by the communication pattern author Cal Newport calls the "hyperactive hive mind" in his brilliant book A World Without Email. This best-selling business author and Georgetown Computer Science professor shares four "Principles" for moving beyond an inbox-driven corporate culture. These resonate and create a roadmap for unlocking productivity. As a "data guy," though, I contend in this blog that he stopped short of adding a critical fifth "Data Principle" to his four. I am currently working with a consulting partner on an online training in this Data Principle arena. Reach out if you want more info. We have reached the "training is deliverable live" stage of the course production process.
Banishing the inbox-driven approach is about gaining personal and team productivity and about creating a delightful work culture, which drives business success. I cannot help wondering whether Cal Newport was surveilling my teams as he describes negative aspects in even highly successful companies. His point is that we have lived with this long enough to see the future: we are at a tipping point to migrate to a better place, and there is unbelievable upside to unlocking it.
What I call the "Data Principle" is democratic across levels and functions. It has at least two parts. Part One is being knowledgeable about how to "structure" and "curate" data and calculation models. My case-study experience gives me a lot of confidence in speaking to this. Of course, the goal of arming yourself with knowledge is to take action and shape how you and your team actually work. Doing this makes data and models portable and collaborative across software platforms. It also armors you for whatever comes in the form of new software, databases and open-source tools.
Data Principle Part Two is learning how to be a discerning user of the messy corporate software mayhem we all live in these days. The mayhem is not going away. The call to action is to take inventory of where your and your team's critical data and models reside and to be intentional about making work-practice choices that optimize for productivity. Are your critical data and models stored in a single location accessible to those with a need? Can "data" only be gotten by pinging someone else by email or chat? Most importantly, does software honor people's time and productivity, or is usability driven by Legal, HR and Security concerns? If the latter, are there ways (e.g. simple choices) to honor those important concerns while making you and your team more productive?
A specific example of Part Two tension these days is cloud storage of files. In their innermost corporate souls, your HR, Legal, and Info Security functions (which need to sign off on the licensing contract with Microsoft, Google et al.) do not want thousands of highly sensitive files on everyone's laptops. They just don't. There are obvious situations where it can be problematic. However, you want to have your slides, spreadsheet, or document for that meeting in 5 minutes. Also, you sometimes stay in hotels with crappy internet. In an upcoming post, I will explore that tension further with a case study and discuss how to optimize.
Moving from CA to CBA: A Worthwhile Migration
For leaders of and participants in digital transformation, the acronym "CBA," or Current Best Approach, is a perfect term. Ironically, that is because it does not connote perfection. Do you have personal CBAs for digital topics? More importantly, does your organization? Are they written down and discussed, or do you live in the land of CAs without being conscious of the "B"?
This is not talking about creating a big book of standard procedures that nobody opens. Rather, it is the “This is how we do things around here” tenets for the team’s digital practices. Done well, these are a scaffolding on which to hang grass-roots digital transformation using the team’s “spare cycles” as people work. It is empowering because it takes people from frustration with tools to taking satisfaction in new and productive capability.
Do you feel the tension of many tech tools? They are rapidly evolving. Tools often overlap. The pace of change can be confusing. There is great opportunity in becoming intentional about your CBA’s with the software toolset in your business. This is one thing we do when helping a client with digital transformation –understanding where the leverage points are, making training aids to onboard and bring everyone up to CBA and coming up with a step-by-step migration to make continual progress in the flow of people driving the business.
This grass-roots digital transformation lets an organization meet corporate efforts half-way. It puts the organization in control of making progress and creates knowledgeable partners with IT. The idea of an advancing front of “best but not perfect” digital CBAs is an empowering concept for any business culture.
Here is a quick, non-fiction case study showing the opportunity. I was recently giving a quick training to a working-level R&D person at a client. Note: consultant time is not free, but employee time is even more expensive. We were stopped in our tracks because he could not sync a folder in Microsoft Teams using OneDrive. That was far from the point of the training, which involved equipping him to run visualizations of "big data." He works on a business with annual sales greater than $1B with a B. His organization is well-resourced with Microsoft's latest enterprise stuff, intended to be the hot-and-cold running water utility for digital collaboration. For him to do visualizations and use his data to understand and solve a product defect, he needed OneDrive to sync an online folder to his laptop so that he could run a Jupyter notebook. First problem: my client was not familiar with O365/Teams syncing, or even with where synced folders go when the software works. In the hot-water metaphor, he had the plumbing installed but could only take cold showers. Second, OneDrive was not working on his computer; it was stuck and needed restarting, as OneDrive sometimes does. No visualizations. No insights from those visualizations. Learning and business progress stopped cold.
How could this be better? A productive CBA for this organization would draw a line in the sand that says, "As a foundation for all digital work, we will have O365/Teams/OneDrive working well for our employees. We will ensure everyone is trained on the basics and will develop quick training aids for onboarding and refreshing this knowledge." This might require a one-time, desk-to-desk help session to get things going. It might lead to productive conversations about how to work together collaboratively on documents and presentations (another CBA). The digital transformation ball is truly rolling downhill at that point.
When the Data Get Big(ger), Part 2
It is crucial to move beyond *.CSV files when the scope of a project goes beyond 1MM rows of data. This is an add-on to my initial post about using Pyarrow Feather in Python for efficient data storage. The first post makes a good case for this, but it leaves out some nitty-gritty details that required burrowing around in GitHub and Stack Overflow to get things working.
Pyarrow installation and Anaconda in Windows
While I had no problem with Pyarrow installation in Mac OSX (“pip install pyarrow” from the command line), I had to scratch and claw to get it working on Windows for colleagues at a client. Here is what straightened things out in that OS:
- While the Pyarrow install documentation only mentions it as a strong recommendation, the latest Pyarrow simply requires 64-bit Python. I run Windows on a Parallels virtual machine (VM) and had mistakenly installed 32-bit Anaconda even though my VM is 64-bit. Don't be like me! After uninstalling Anaconda and reinstalling the 64-bit version, I was able to use Anaconda's update process to get Pyarrow 2.0. Anaconda documentation contains instructions for the uninstall.
- To check whether Python is 32- or 64-bit in Windows, open a command line window and enter "python". This launches the Python console and prints version info as it does so. Look for something like "[MSC v.1916 64 bit (AMD64)]", which confirms 64-bit per both my experience and the advice in this thread. (A short in-Python check is also shown after this list.)
- A watchout for novices to working with Python and the Windows command line: if running Python via Jupyter notebooks in Anaconda Navigator, the command line window for version-checking needs to be launched from the Windows app called "Anaconda Prompt". The base Command Line app will not know anything about Python if Python got installed by Anaconda! Anaconda Prompt is designed for messing around with Anaconda specifically. This app is in the Windows Start menu (search for "anaconda" there) and is different from Anaconda Navigator.
- Anaconda's update process is often mysterious (aka frustrating) due to package interdependencies. While my fresh Windows install of Anaconda brought me right to the latest Pyarrow version (2.0.0 as of this writing), a colleague was only being offered a pre-1.0 version. One solution is a fresh install of Anaconda, essentially trashing whatever interdependencies were keeping it from offering the latest version. An alternative is to create a new environment in Anaconda with the latest Python version and minimal other packages installed.
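As mentioned in the list above, a quick way to confirm 64-bit Python from within Python itself is a two-line check. This is a general Python technique (standard library only), not anything specific to Anaconda or Pyarrow.

```python
# Quick checks of whether the running Python interpreter is 64-bit,
# as an alternative to reading the banner text in the console.
import platform
import struct

print(platform.architecture())   # e.g. ('64bit', 'WindowsPE') on 64-bit Windows
print(struct.calcsize("P") * 8)  # pointer size in bits: 64 means 64-bit Python
```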
Pyarrow and data types
As opposed to CSVs, the Pyarrow Feather format has the advantage that it manages (aka enforces) columns' data types. In Feather, every column has a type, whether inferred or explicitly assigned, and the individual values must match that type. When in doubt, use 'str', which allows the data to be re-typed later. Because typing is enforced column-wise rather than element-wise (as Pandas allows), Feather is incompatible with mixing data types within a column. A couple of links at the bottom from the Pyarrow team state this clearly; see the quotes by Uwe Korn (xhochy) specifically. Although these are clear, I learned it organically (aka the hard way!) while trying to write data that concatenated a mixed text/integer column (which Pandas typed as 'str') with one that was exclusively integer (which Pandas inferred to be 'int'). A better approach for such data is to explicitly type the column as 'str' upon import. I will share our CBA and standard code for this in a separate post. A sketch of the kind of snippet that causes the issue is shown below; Python's .applymap(type), and .groupby() on the .applymap() result, are very helpful for sorting out the data types of individual values. The full Jupyter notebook is named Feather Mixed Type Example.ipynb and is posted here.
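Here is a small sketch of that failure mode, with made-up column names standing in for the real data; the exact exception text depends on the Pyarrow version.

```python
import pandas as pd

# Hypothetical example: one file has the serial column read as strings,
# another has the same column read as integers.
df_text = pd.DataFrame({"serial": ["A123", "B456"], "qty": [1, 2]})
df_int = pd.DataFrame({"serial": [35322000744082, 35322000744083], "qty": [3, 4]})

combined = pd.concat([df_text, df_int], ignore_index=True)
# The 'serial' column now mixes str and int values inside one object-dtype column.
print(combined["serial"].map(type).value_counts())

# Writing this to Feather fails because Arrow cannot assign one consistent column type.
try:
    combined.to_feather("combined.feather")
except Exception as err:
    print("Feather write failed:", err)

# Fix: explicitly type the column as str before writing
# (or at import time with dtype={"serial": str}).
combined["serial"] = combined["serial"].astype(str)
combined.to_feather("combined.feather")
```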
Data type discussions from Pyarrow team (see user xhochy and wesm quotes):
https://github.com/pandas-dev/pandas/issues/21228
https://github.com/wesm/feather/issues/349
When the Data Get Big(ger)
This blog describes how we manage life when the data get too big for Excel-compatible formats such as CSV. Practically, when doing exploratory data analysis, there is no rush to jump to fancy data formats on the client-facing side of things. Even when people hire a consultant to help get deeper insights, they typically want to open their data, look at it, throw together a pivot table or two, and generally stay in touch with the consulting exploratory work that may reach them as screen pictures of graphics from Python scripts. In most places, this means keeping Excel compatibility for some if not all aspects of a project.
When data exceed one million rows, Excel cannot be used; it has a hard limit of 2^20 (1,048,576) rows when opening either XLSX or CSV files. Aside from Excel, CSVs can practically be used with more than one million rows: Pandas and other tools can read and write larger files, and PowerBI can work with them. However, CSVs are much less efficient in file size, as shown in the benchmark comparisons below. They are also relatively slow to read and write, and they do not retain type and formatting information about the data, meaning that a very nice serial number like '35322000744082' shows up as '3.5322E+13' when opened in Excel.
On a recent project, we hit Excel's limit and redesigned raw data storage to use the Apache Arrow Feather format. Does 'feather' allude to 'light as a...'? Maybe so; I haven't researched the origin. I studied Feather and other CSV alternatives with benchmarking scripts I wrote based on excellent online posts such as this one. A bibliography of some additional links is below. Apache Arrow is a successful open-source project that defines a language-independent columnar memory format. Although projects like these are always a team effort, this one was spearheaded by Wes McKinney, Pandas' creator, so it comes with a tremendous credibility stamp on its forehead. Pandas itself can read and write the Feather format with its read_feather() and to_feather() commands. Feather retains, and in fact insists on, a data type for each column.
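A minimal sketch of the type-retention point, using an illustrative serial-number column and file names (not the project's actual data):

```python
import pandas as pd

# Illustrative data: a serial-number column that must stay a string.
df = pd.DataFrame({"serial": ["35322000744082", "35322000744083"], "reading": [1.2, 3.4]})

# CSV round trip: the dtype is re-inferred on read, so 'serial' comes back as int64
# unless you remember to pass dtype={"serial": str}.
df.to_csv("readings.csv", index=False)
print(pd.read_csv("readings.csv").dtypes)

# Feather round trip: the column types are stored with the data and come back unchanged.
df.to_feather("readings.feather")
print(pd.read_feather("readings.feather").dtypes)
```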
By implementing Feather for our raw data storage, we saw tremendous improvements in file sizes and import/export speeds. The table shows our benchmarks importing 300 raw data files, combining them into a 2MM-row by 7-column DataFrame, and writing the DataFrame to disk on a typical MacBook Pro laptop. The block diagram shows the reusable architecture we created. In a separate blog, I will share details of our speed optimization study and a couple of hard-won learnings about working with Feather. The teaser on that is that setting data types comes with advantages, but it means adhering to those data types or risking getting pummeled with Python Exceptions!
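For readers who want to reproduce the flavor of that benchmark, here is a rough sketch of the combine-and-write timing comparison. The folder path, column name, and dtype override are placeholders rather than the actual project code, and the folder is assumed to contain the raw CSV files.

```python
import glob
import time
import pandas as pd

# Read a folder of raw CSV files and combine them into one DataFrame.
# The explicit dtype keeps the serial-number column as a string (see the Part 2 post).
frames = [pd.read_csv(path, dtype={"serial": str}) for path in glob.glob("raw_data/*.csv")]
combined = pd.concat(frames, ignore_index=True)

# Compare write times for CSV vs Feather; file sizes can then be compared on disk.
t0 = time.perf_counter()
combined.to_csv("combined.csv", index=False)
print("CSV write:", round(time.perf_counter() - t0, 2), "s")

t0 = time.perf_counter()
combined.to_feather("combined.feather")
print("Feather write:", round(time.perf_counter() - t0, 2), "s")
```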
Some helpful links that were background reading and/or helpful in getting the kinks worked out of our feathers:
https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
https://robotwealth.com/feather-r-python-data-frame/
https://github.com/wesm/feather/issues/349 (troubleshooting data type errors with Feather...)
https://github.com/apache/arrow/blob/master/README.md (link to instructions for installing pyarrow Python library needed for working with Feather)