The Right Modeling And Analysis Tools For the Job

Software tools in the data and modeling arena often lead individuals and teams into counterproductive patterns. By being informed and intentional, you can choose the best tool for a particular job. It is worth recognizing that software providers, whether companies or open-source communities, keep adding features and advocate for using their software as broadly as possible. That is fine for them, but modern tools make it easy to pick the best tool for each part of a bigger job and to port data among tools for efficiency and robustness.

As an engineer, data scientist, and business generalist, I advocate (and here freely share my opinions about) a software ecosystem that works well across much of the corporate and university research worlds. It has three primary tools: for shaping and sharing data, for creating models, and for exploring and visualizing data for decision-making.

  • Microsoft Excel® for making data democratically viewable regardless of source, for generating “end of the pipeline” reports and for creating spreadsheet models with calculations for use by non-coders
  • Python scripts for data reshaping and for developing coded models and data pipelines (possibly mixing in a little SQL)
  • JMP® software for design of experiments (DOE), exploratory analysis and data visualization with accompanying statistical analysis of data
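As a small, hypothetical illustration of the Python role above (the column names and values are invented for the example), a short pandas script can reshape wide-format data into a tidy long format before handing it off to Excel or JMP:

```python
import pandas as pd

# Hypothetical wide-format instrument data: one row per batch,
# one column per measurement temperature.
wide = pd.DataFrame({
    "batch": ["A", "B"],
    "yield_25C": [0.91, 0.88],
    "yield_35C": [0.86, 0.84],
})

# Reshape to tidy long format: one row per (batch, temperature) pair.
long = wide.melt(id_vars="batch", var_name="condition", value_name="yield_frac")

# Split the condition label into a numeric temperature column.
long["temp_C"] = long["condition"].str.extract(r"(\d+)C", expand=False).astype(int)
long = long.drop(columns="condition")

print(long)
```

The long format is what graphing and statistics tools generally expect, so a few lines like these often replace a great deal of error-prone manual rearranging in a spreadsheet.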

This ecosystem is solid, but, as a disclaimer, it is possible to be successful with different choices. Statisticians, for example, will learn and be fluent in R, but R demands substantial expertise and is limited as a collaboration medium inside a company. Nor does this core preclude using additional tools, which I do in consulting practice. Still, considering this core ecosystem can help you think critically about your situation and what is best for you and your team.

From experience with various organizations, here are two suggestions for resisting the “wrong tool for the job” pressure in corporate and academic cultures; the discussion below goes into more detail on each:

  1. Don’t overuse Microsoft Excel where it is not the best tool.
  2. Don’t overuse Python for exploratory data analysis, visualization, and statistics. Instead, use best-in-class JMP® for this.


Don’t Overuse Microsoft Excel

Not overusing Microsoft Excel takes some discipline. Our online Digital Transformation with Excel (DTE) course teaches responsible use for situations where collaboration and accuracy matter. We teach a surprising combination of long-standing core features, a “hidden layer” that leads to well-curated spreadsheets that are transparent and easy to validate. Excel is wonderful for making data viewable and formatted in intuitive ways, and well-curated spreadsheet models bring calculations and simulations to a broad range of users. In consulting practice, I use a home-grown VBA validation suite to ensure that spreadsheet models are correct and stay correct over time. I use DTE course user-interface design principles together with the openpyxl Python library to format Python model outputs, putting well-formatted spreadsheets into clients’ hands for their review and understanding.
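As a minimal sketch of that last step (the quantities, cell addresses, and styling choices here are illustrative, not the DTE course conventions themselves), openpyxl can write model output into a formatted workbook for review:

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

# Hypothetical model output to hand to a reviewer.
results = {"Throughput (units/hr)": 412.5, "Yield (%)": 96.2}

wb = Workbook()
ws = wb.active
ws.title = "Model Output"

# Header row: bold text on a light fill so labels stand out.
ws["A1"] = "Quantity"
ws["B1"] = "Value"
for cell in (ws["A1"], ws["B1"]):
    cell.font = Font(bold=True)
    cell.fill = PatternFill("solid", fgColor="DDEBF7")

# Write each result on its own row, formatted to one decimal place.
for row, (name, value) in enumerate(results.items(), start=2):
    ws.cell(row=row, column=1, value=name)
    c = ws.cell(row=row, column=2, value=value)
    c.number_format = "0.0"

ws.column_dimensions["A"].width = 24  # widen label column for readability
wb.save("model_output.xlsx")
```

The point of scripting the formatting is repeatability: when the model reruns, the client-facing workbook is regenerated identically rather than re-styled by hand.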

Excel at its Best – Portable-Across-Software Model with UI Features for Ease of Reviewing and to Highlight Calculations

Although you can buy books on how to do it, Excel is non-ideal for exploratory data analysis or for graphing data, apart from hard-coded dashboard-like graphs. Additionally, it is wise for Excel experts to be sparing in incorporating advanced “new Excel” features into models (Power Query, Power Pivot with the DAX language, Tables, matrix functions, etc.). For most models and more than 90% of users, I observe that these features kill collaboration and robustness, compared with simply using “old Excel” to create broadly understandable spreadsheets.

Though controversial with some, avoiding “new Excel” as much as possible also steers experts toward a modular pipeline architecture in which advanced work is done by coded scripts that a validation suite can verify. Yes, this means I am lukewarm, or even cold, on Microsoft’s recent (August 2023) launch of “Excel can do Python.” I gently note that robust ways to script Python, and even to test those scripts, existed well before Microsoft’s “discovery” of Python for Excel. This is a good example of a software company getting marketing mileage out of continually adding features. Be wise to this practice!
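As one small illustration of those pre-existing options (the function and numbers are invented purely for the example), a model calculation kept in a plain Python script can be verified by an ordinary assertion-based test before its output ever lands in a spreadsheet:

```python
# A model calculation kept in plain Python, outside any spreadsheet,
# so it can be verified by a standard test before results reach Excel.
# (The cost formula here is hypothetical, for illustration only.)

def unit_cost(material: float, labor: float, units: int) -> float:
    """Total cost per unit for a batch (hypothetical model logic)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return (material + labor) / units

# A simple check, runnable with pytest or plain python.
def test_unit_cost():
    assert unit_cost(material=900.0, labor=100.0, units=50) == 20.0

test_unit_cost()
```

Tests like this are exactly what a validation suite automates, and they require no special Excel integration at all.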

Advanced-Feature Hype for Python-In-Excel

Don’t Overuse Python

The second, flip-side recommendation is to avoid overusing Python (or R) for exploratory data analysis, visualization, and statistics. Overuse unfortunately seems to be a hallmark of the Python community, and the cost shows up even among data science and analytics experts. For general purposes, commercial JMP® software, produced by the venerable and respected SAS Institute, is superior in this space. The practical reality is that, within companies, there are wide swaths even of technical organizations where nobody is going to write a script to make a graph, let alone fit a three-term regression model, to make sense of business or industrial data. Without a tool like JMP, they will either go without analysis or use sub-optimal Excel.
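For perspective, here is roughly what such a “three-term regression” script looks like in Python, using invented data; the barrier is not that it is impossible, only that most non-coders will never write it:

```python
import numpy as np

# Invented industrial data: two process inputs and one measured response.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g. pressure
x2 = np.array([0.5, 0.4, 0.6, 0.5, 0.7])   # e.g. catalyst fraction
y  = np.array([2.1, 3.9, 6.2, 7.8, 10.3])  # response

# Three-term model: y ~ b0 + b1*x1 + b2*x2 (intercept plus two slopes).
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

b0, b1, b2 = coef
print(f"y ~ {b0:.2f} + {b1:.2f}*x1 + {b2:.2f}*x2")
```

In JMP the equivalent fit, with diagnostics and graphics attached, is a few clicks in a point-and-click platform, which is precisely why it reaches users who will never open an editor.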

Example JMP Graphical Output with Relevant Statistics for Decision-Making – Graphical Elements and Stats Show that Product “b” is Superior and “c” is Inferior


Python graphing and stats tools are great in specialized situations, but JMP is useful even for data popping out of Python scripts. JMP is currently on version 17, and its team has long put exceptional thought into how to visualize statistics for decision-making. They make great self-training and webinar resources openly available. JMP is not free; however, it is widely available in companies through negotiated licenses and under educational discounts at most universities.

Interestingly, since I have long used JMP and started my Python journey more recently, I conducted informal, tabletop “consumer research” at global scientific Python (SciPy) conferences. Many in the Python community are simply unaware of JMP’s existence. When people saw a demo, they were generally amazed by its capabilities and ease of use relative to what they were doing with Python tools, usually visualization tools that lacked direct statistical support; typical graduate students reacted with immediate dismay at how easy it was. While not a good solution for everything, JMP analysis can be a workhorse for data-driven decision-making in many situations.


Making intentional software choices helps you avoid getting sucked into doing inefficient things, and it can break a culture of poor habits. Hopefully this discussion gives you food for thought about your personal or organizational software ecosystem for analysis and modeling.