I just read two articles that claim that Python is overtaking R for data science and machine learning. From user comments, I learned that R is still strong in certain tasks. I will survey what these tasks are.
The first article, by Vincent Granville of Data Science Central (DSC), uses proxy metrics rather than asking users directly. He uses statistics from Google Trends, Indeed job search terms, and Analytic Talent (the DSC job database) to conclude that Python has overtaken R.
This raises the question of whether one group of users (say, Python’s) simply googles more. Suppose that group is mainly made up of programmers who already know a scripting language like Python and are now moving into data science. Indeed, the search term analyzed is “Python Data Science,” which is the kind of thing you would search for when you start exploring. In that case, the Google Trends analysis is biased. I won’t critique Granville’s finding any further, since I want to survey the areas where R still shines (no pun intended).
The other article is from KDnuggets. Unlike the DSC finding, this one is based on a poll run by KDnuggets. From this poll, they found that “in 2017 Python ecosystem overtook R as the leading platform for Analytics, Data Science, Machine Learning.”
There is a possible bias in this conclusion too. First of all, the KDnuggets community is not representative of the whole data science community. In fact, the surge of Python users might be the result of skewed growth in their readership, where the past experience of the respondents is at play; say, if, again, programmers-turned-data-scientists were the ones frequenting KDnuggets.
To be fair, they did ask whether respondents had switched languages. It would have been better if they had also asked about the respondents’ past coding experience and reported the number of respondents in 2016 vs. 2017. On the other hand, it may well be that Python really is growing across the whole data science community.
Burtch Works published the results of a wider survey in June. For this poll, data scientists and predictive analytics professionals were surveyed. Burtch Works found that Python’s share among both groups grew from 20% in 2016 to 26% in 2017. In fact, among data scientists, it grew from 53% in 2016 to 69% in 2017, while R usership shrank by about one-third. So, maybe Python is overtaking R.
Courtesy of Burtch Works
Despite this, I learned from reading the comments that R is still preferred for tasks like survival analysis, time series forecasting, glmnet, Bayesian model averaging, and hierarchical modeling, thanks to its well-developed statistical packages. Now, I am not familiar with most of these tasks (and I can’t verify whether R is indeed better), so I will just try to summarize them here for future reference [1].
Survival Analysis
Survival analysis is a set of statistical methods for analyzing the time until an event occurs: time to death in biological systems, failure time in mechanical systems, etc.
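To make this concrete for myself, here is a minimal sketch of the Kaplan-Meier estimator, a standard way to estimate a survival curve when some subjects are censored (still alive, or still working, when observation stopped). The function name and the toy data are my own assumptions for illustration, not something taken from the articles or from any particular R or Python package.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time until the event, or until observation stopped.
    observed:  True if the event happened, False if the subject was censored.
    """
    # Count how many events (not censorings) happened at each time.
    events = Counter(t for t, e in zip(durations, observed) if e)
    survival = 1.0
    curve = []
    for t in sorted(events):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Multiply by the fraction of at-risk subjects that survive time t.
        survival *= 1.0 - events[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical data: five machines, two of them still running (censored).
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]

for time, prob in kaplan_meier(durations, observed):
    print(f"t={time}: estimated survival {prob:.2f}")
```

Censored subjects never trigger a drop in the curve, but they do count in the at-risk group until they leave, which is the whole point of the method.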
Time Series Forecasting
A time series is a sequence of measurements taken over time, usually at evenly spaced intervals.
Time series forecasting is the use of a model to predict future values based on previously observed values. (Wikipedia: Time series)
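As a small illustration, here is a sketch of simple exponential smoothing, one of the most basic forecasting models: the forecast is a weighted average of past observations in which recent values count more. The smoothing factor `alpha` and the sample series are assumptions I picked for the example.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """One-step-ahead forecast using simple exponential smoothing.

    alpha close to 1 trusts the latest observation; alpha close to 0
    trusts the accumulated history.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    level = series[0]  # start from the first observation
    for value in series[1:]:
        # Blend the new observation with the running level.
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical evenly spaced measurements.
series = [10, 12, 14]
print(exponential_smoothing_forecast(series, alpha=0.5))
```

This model has no trend or seasonality, so it is only a starting point; the dedicated packages people praise in R handle those cases.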
Bayesian Model Averaging
Bayesian is a term I see thrown around. But what is a Bayesian Model?
“A Bayesian model is a statistical model where you use probability to represent all uncertainty within the model, both the uncertainty regarding the output but also the uncertainty regarding the input (aka parameters) to the model.” (from stats.SE)
Bayesian Model Averaging, in turn, is described in the 1999 paper of the same title by Jennifer A. Hoeting et al.:
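Here is a minimal sketch of what “uncertainty regarding the parameters” can look like in practice, using the textbook Beta-Binomial model for a coin with unknown bias. The function names and the numbers are my own illustrative assumptions.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data.

    With a Beta(prior_a, prior_b) prior on the success probability, the
    posterior is Beta(prior_a + successes, prior_b + failures).
    """
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical example: uniform prior Beta(1, 1), then 7 heads and 3 tails.
post_a, post_b = update_beta(1, 1, successes=7, failures=3)
print(f"posterior: Beta({post_a}, {post_b}), mean {beta_mean(post_a, post_b):.3f}")
```

The parameter is never a single number here; it is a whole distribution that gets narrower as more data arrives.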
“Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences and decisions that are more risky than one thinks they are. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty.”
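To see the idea on a toy problem, here is a sketch that averages two candidate models for a coin: one says the coin is fair, the other gives the bias a uniform prior. Each model's prediction is weighted by its posterior probability instead of picking a single winner. The two models, the equal prior weights, and the data are assumptions I chose for illustration; this is not the procedure from the Hoeting et al. paper, just the same principle in miniature.

```python
from math import factorial


def bma_next_head_probability(heads, flips):
    """Model-averaged probability that the next flip is a head.

    Model "fair":    bias fixed at 0.5.
    Model "unknown": bias has a uniform prior on [0, 1].
    Both models get prior probability 1/2 (an assumption).
    """
    tails = flips - heads
    # Marginal likelihood of the observed sequence under each model.
    evidence_fair = 0.5 ** flips
    # Integral of p^heads * (1 - p)^tails over [0, 1].
    evidence_unknown = factorial(heads) * factorial(tails) / factorial(flips + 1)

    # Posterior model probabilities (equal priors cancel out).
    total = evidence_fair + evidence_unknown
    weight_fair = evidence_fair / total
    weight_unknown = evidence_unknown / total

    # Each model's own prediction for the next flip.
    predict_fair = 0.5
    predict_unknown = (heads + 1) / (flips + 2)  # posterior mean of Beta(heads+1, tails+1)

    averaged = weight_fair * predict_fair + weight_unknown * predict_unknown
    return averaged, (weight_fair, weight_unknown)


# Hypothetical data: 7 heads in 10 flips.
probability, weights = bma_next_head_probability(heads=7, flips=10)
print(f"weights (fair, unknown): {weights[0]:.3f}, {weights[1]:.3f}")
print(f"averaged P(next flip is head): {probability:.3f}")
```

The averaged prediction lands between the two models' answers, closer to whichever model the data supports more.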
Hierarchical Modeling
Hierarchical modeling is a generalization of linear and generalized linear regression in which the regression coefficients are themselves given a model, estimated from the data. More in the 2006 paper by Andrew Gelman.
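A rough sketch of what “modeling the coefficients” means for hierarchical models: each group gets its own mean, but the group means are assumed to come from a shared normal distribution, so each estimate is pulled toward the overall mean (partial pooling). I am assuming the within-group variance `sigma2` and the between-group variance `tau2` are known, and I plug in the mean of the group means for the shared mean; a real hierarchical model would estimate all of these from the data.

```python
from statistics import mean


def partial_pooling(groups, sigma2=1.0, tau2=1.0):
    """Shrink each group's mean toward the grand mean.

    groups: dict mapping group name -> list of observations.
    sigma2: assumed within-group variance.
    tau2:   assumed between-group variance of the true group means.
    """
    grand_mean = mean([mean(values) for values in groups.values()])
    estimates = {}
    for name, values in groups.items():
        data_precision = len(values) / sigma2  # more data, more trust in the group
        prior_precision = 1.0 / tau2           # trust in the shared distribution
        estimates[name] = (
            data_precision * mean(values) + prior_precision * grand_mean
        ) / (data_precision + prior_precision)
    return estimates


# Hypothetical groups: "a" has two observations, "b" has only one.
groups = {"a": [1, 3], "b": [10]}
print(partial_pooling(groups))
```

Groups with few observations get shrunk the most, which is why this kind of model is popular when some groups are small.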
Personally, I don’t see the point of discussions about whether Python or R is better. True, it’s nice to look at the userbase and to visualize it for curiosity’s sake, but in the end, whatever suits your task is the way to go (without boasting that one or the other is superior).
[1] Maybe there will come a time when I will do these tasks; when I have to make the choice, I will start with R.