Considerations for Choosing Stat Packages in CER
by Jordan Harshman
*A list of stat program acronyms that I use throughout this blog is at the end.
Chances are, sometime in your CER-related career you’ll end up with the need (possibly even the desire) to analyze quantitative data. In the analysis and visualization of quantitative data, you have a growing list of statistical programs to choose from. The impetus for writing this blog post is something that I hope I can convince you of: which program(s) you choose to analyze your data directly influences the quality of your analysis, and therefore your research.
First, let’s divide the available packages into two main categories: those that are usually operated via a graphical user interface (GUI) and those that are usually operated via a command line interface (CLI). The two categories essentially boil down to this question: Will you write code (CLI) or will you point-and-click (GUI) in order to make discoveries and communicate them to the world? Some examples of GUIs would be Excel and SPSS; CLIs include R and SAS.
If you think the answer to this question is solely about preference, please reconsider as I walk you through thoughts about time requirements, reproducibility, and quality of analyses in different programs. First, to establish some credibility, I would humbly deem myself “endurably proficient” in operating SAS, R, SPSS, and Excel, so these programs are what I will talk about here. Secondly, in the interest of disclosing my bias, R is my program of choice – it has become an inseparable part of the way I conduct research, but I will attempt to be objective in order to give you helpful information.
For each topic (time, reproducibility, and capabilities), you’ll read a conversation-style paragraph or two from me (pro-CLI) and two from my alter-ego (pro-GUI) regarding that particular criterion. It’s my hope that this discussion format might put these considerations in a light that you can relate to as a researcher in CER.
Unless you’ve had previous coding experiences, point-and-click programs such as SPSS and Excel will undoubtedly take less time to pick up the basics and have you generating your first analysis or plot in a very short amount of time. However, considering the bigger picture, it may actually be less time consuming to master the relatively steep learning curve of a CLI program.
Consider a simple scenario: You’ve got a 10-question multiple choice survey with 4 response choices each to be administered 3 times a semester for 3 different courses. If you want to produce a bar plot broken down by question, administration, and class, that’s 90 graphs to make. With a GUI, you’d have to change the title, maybe the axis label, and the data for each graph, preferably making them look the same (color, axis range, etc.).
Generally, this is a very time-consuming and monotonous task in GUIs, requiring many repetitive mouse clicks to get everything the way you want it. Alternatively, CLIs will generally have programming loops that will do this after an initial investment of coding and approximately 0.47 seconds to actually run the command and produce the plots. To be fair, I recognize that there are ways to make this go faster in GUIs, as you can usually integrate macros and set default formatting, but on general principle, I’ve found that repetitive tasks that need to be done in a consistent manner go by much faster in CLIs than in GUIs.
For many people without any coding experience, the “learning curve” to mastering CLIs bears resemblance to the T1 relaxation state in NMR – you will expend a great deal of time and effort and even after years will fall short of mastery. Coding headaches brought on by CLI programs take little chunks of time away, whereas GUIs are much easier to pick up and get you to the analysis sooner. Altering aesthetic characteristics doesn’t require you to memorize functions and is generally intuitive.
As far as reproducing analyses and graphs many times, programming may be able to help this process go faster, but sometimes you will spend more time writing code to do something than it would take to just do it yourself in another program. All in all, it could be said that the amount of time it takes to do equivalent tasks in GUIs vs. CLIs boils down to the user’s aptitude with the stat package. Even if a program can produce 90 bar plots in less than a second, it’s on the user to check enough of them to make sure the program did it correctly, which can be just as time consuming as making them yourself.
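To make the loop idea concrete, here is a minimal sketch in Python (standard library only) of the 90-plot scenario. Everything in it is an assumption for illustration: the course names, the randomly generated placeholder responses, and the helper names make_placeholder_responses and text_bar_plot are all hypothetical, and the "plots" are plain-text bars rather than real graphics. The point is only that one loop produces all 90 plots with identical formatting.

```python
from collections import Counter
from itertools import product
import random

QUESTIONS = range(1, 11)        # 10 survey questions
ADMINISTRATIONS = range(1, 4)   # 3 administrations per semester
COURSES = ["Course A", "Course B", "Course C"]  # hypothetical course names
CHOICES = "ABCD"                # 4 response choices per question


def make_placeholder_responses(seed=0, students=30):
    """Generate random placeholder answers (NOT real data) for every combination."""
    rng = random.Random(seed)
    return {
        key: [rng.choice(CHOICES) for _ in range(students)]
        for key in product(COURSES, ADMINISTRATIONS, QUESTIONS)
    }


def text_bar_plot(title, answers):
    """Return a text bar plot: one title line, then one bar per response choice."""
    counts = Counter(answers)
    lines = [title]
    for choice in CHOICES:
        lines.append(f"{choice} | {'#' * counts[choice]} ({counts[choice]})")
    return "\n".join(lines)


responses = make_placeholder_responses()

# One loop builds every plot with the same title format and layout.
plots = {
    (course, admin, question): text_bar_plot(
        f"{course}, administration {admin}, question {question}", answers
    )
    for (course, admin, question), answers in responses.items()
}

print(len(plots))                   # number of plots produced
print(plots[("Course A", 1, 1)])    # spot-check one of them
```

In a real analysis you would swap the text bars for your plotting tool of choice; the loop structure stays the same.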
Let’s say you’re on a GUI and you’re making your 65th bar plot. Are you, the tired human behind the controls, more reliable than a computer program in this repetitive task? Assuming that the program is coded correctly to do what you want it to do (and you should of course check this), computers are more consistent than humans in this context.
Now assume you want to run an exploratory factor analysis (EFA) on your survey just like you did in a previous test administration. It’s been two months since you did the last one and now you need to remember every click that will produce the exact same output for a different set of data. If you haven’t been diligently documenting these decisions, you may end up applying different assumptions that affect the results, especially if you do not have strong expertise in the nuanced options for the analysis you’re doing. In a CLI, you would generally only have to change the name of the data set you’re running the analysis on and away the program will go, with every option you had previously.
In order to address reproducibility, we don’t need to talk about “which is better” between CLIs and GUIs because it all boils down to the research skills of the person behind the computer. Stated otherwise, CLIs are only as good as the person writing the code. If the code is constructed to produce output according to a pattern that closely, but not quite, matches the data, the code will not run or, worse, it will run and produce inaccurate results. These types of mistakes can be very difficult to spot and can be avoided altogether in GUIs because the user is always in control of producing the results.
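As a rough illustration of "only change the name of the data set," here is a small Python sketch. It is not an EFA; a simple item summary stands in for the analysis so the example stays self-contained. The option names (missing_code, min_responses, decimals), the function item_summary, and the two tiny score lists are all made up for illustration.

```python
import statistics

# Every analysis decision is written down once (hypothetical options).
OPTIONS = {
    "missing_code": -99,   # value used to mark a skipped item
    "min_responses": 3,    # refuse to summarize with fewer valid responses
    "decimals": 2,         # rounding applied to reported values
}


def item_summary(scores, options=OPTIONS):
    """Summarize one item using exactly the decisions stored in `options`."""
    valid = [s for s in scores if s != options["missing_code"]]
    if len(valid) < options["min_responses"]:
        raise ValueError("too few valid responses to summarize")
    return {
        "n": len(valid),
        "mean": round(statistics.mean(valid), options["decimals"]),
        "sd": round(statistics.stdev(valid), options["decimals"]),  # sample SD
    }


# Made-up scores for two administrations (illustration only).
fall_scores = [3, 4, -99, 5, 4]
spring_scores = [2, 5, 4, -99, -99, 3]

# Two months later, only the data set name changes; the options do not.
fall_result = item_summary(fall_scores)
spring_result = item_summary(spring_scores)
print(fall_result)
print(spring_result)
```

The saved script doubles as documentation of the decisions you made the first time around.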
As a rule, CLIs can perform statistical techniques that GUIs cannot. However, we generally don’t see those techniques being performed in our field and thus GUIs are great for the things we do see commonly in the field (a comparison of what these programs can do may be found here: http://stanfordphd.com/Statistical_Software.html). However, certain techniques might not be widely used precisely because people cannot perform them on the software they use.
I would additionally argue that producing a large quantity of analyses/visualizations is not always feasible with GUI programs. This matters because, for me anyway, I will often make valuable findings just by looking at a graph. You might find a completely unplanned yet extremely valuable line of inquiry just by looking at the item response curve of Item #7 and a histogram of Item #2, for example. The point here is that you would have had to generate these two plots, which you might not do on a GUI because it takes too much time and you never had a strong reason to in the first place, making CLIs more apt for data exploration.
While it’s true that GUIs cannot always perform some of the more advanced statistics, they can perform a vast majority of what CER needs to get the job done correctly. Additionally, for those specialized techniques that only other programs can do, it is really not necessary to learn the entire program in order to run an analysis. Many times on the web, it’s easy to find an example similar to what you want to do complete with instructions for entering your data, running the analysis, and interpreting the output with no programming knowledge required. If the data needs to be in a different format, you can use your GUIs to get it there and export it when you’re done.
I could argue as well that data exploration is just as alive and well in GUIs as it is in CLIs. There is no reason that a researcher can’t produce a couple hundred plots in Excel/SPSS and look at them to discover trends. It is, again, simply an argument of time commitment as opposed to being an advantage/disadvantage from one type of stat package to another.
Just to put a note on a few additional considerations...
The costs of stat programs vary. SAS and SPSS are both generally pricey, but they may or may not be subsidized by university-wide licenses. Excel is reasonably priced for an individual license and will no doubt be accessible wherever you are now and wherever you will end up (this is an important consideration). Lastly, R is open-source, so it’s free, which is a pretty good price. Some of these programs also offer student pricing.
Regardless of the program, the internet has opened up an incredibly large number of resources. Simply typing the name of the statistic you want to do followed by the program’s name will most likely solve most troubleshooting woes, and books are available for many of the programs as well.
Nuances in programs
If you run a stat in one program, you may not get the same result as in another. Differences in defined algorithms, rounding defaults, available adjustments, and memory storage can have an impact on the results. Generally, these differences are small, but they can be problematic. Also, you don’t get the same output (same information) from every program, so it’s important to dig behind the scenes to find out what’s happening when you run any analysis.
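You can see this kind of nuance even inside a single language. The sketch below uses only Python’s standard library and a small made-up data set; it shows how the "same" statistic changes with the definition or default chosen. It is not a comparison of the actual defaults in SPSS, SAS, R, or Excel, which you would need to look up in each program’s documentation.

```python
import statistics
from decimal import Decimal, ROUND_HALF_UP

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores for illustration

# Standard deviation: dividing by n - 1 (sample) vs. n (population).
sd_sample = statistics.stdev(data)
sd_population = statistics.pstdev(data)

# Quartiles: two interpolation methods offered by the same function.
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")

# Rounding: Python's built-in round() sends ties to the nearest even number,
# while ROUND_HALF_UP sends ties away from zero.
half_even = round(2.5)
half_up = int(Decimal("2.5").quantize(Decimal("1"), rounding=ROUND_HALF_UP))

print("sample SD:", sd_sample, "population SD:", sd_population)
print("quartiles (exclusive):", q_exclusive)
print("quartiles (inclusive):", q_inclusive)
print("round half to even:", half_even, "round half up:", half_up)
```

None of these answers is wrong; they simply answer slightly different questions, which is why it pays to know which definition your program uses.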
Caution about using what you know
In my experience, if you learned something on SPSS, you’re likely to continue using SPSS. If you learned on R, you’re likely to continue with R and so on. My suggestion would be to experiment with the different programs while you’re “young” (professionally young). You should decide which program(s) to use for a reason beyond “this is what I know” and actually consider the ramifications of using different stat programs.
The choice, of course, is yours to make. Also, you will probably never exclusively use one program. Therefore, having skills in several programs is a great benefit to conducting research. And I hope this point goes without saying, but whether or not your stat program knows how to do a certain analysis matters less than whether you know how that stat is calculated and what it means. No program interprets your data for you.
I hope this article has given you some things to think about that you might not have otherwise. I reiterate the importance of making a conscious decision about what programs you want to investigate and use in the future and invite you to contact me if you have any questions!
*Some readers may be unfamiliar with the software I referenced throughout this blog. For those readers, here is a list of the acronyms for stat programs:
SPSS – IBM Statistical Package for the Social Sciences
SAS – Statistical Analysis System
R – Named partly after the first names of its original creators, Ross Ihaka and Robert Gentleman; this is a programming language
Excel – Microsoft’s spreadsheet program
Not talked about in this post, but used in CER:
STATA – Name based on word combination of “statistics and data”
MATLAB – MATrix LABoratory
Mathematica – Wolfram Research’s technical computing program
Minitab – I haven’t been able to find out why this is called as it is. Sorry.