R is objectively worse than Python for almost all data science tasks, and R is a huge PITA to productionise compared to Python.
I've yet to see any argument for R that doesn't boil down to 'well, I know it better' or 'well, I prefer the syntax'.
R is to data science as Matlab is to engineering. It's a stopgap 'non-programmer' language that thrived at a time when most academics didn't know any programming. Now schoolchildren learn programming. There is no use case for these languages anymore.
> R is objectively worse than Python for almost all data science tasks
If you meant to type "machine learning" I'd probably agree, but R is much, much better for small-scale data exploration, visualization and modeling (i.e. 95% of DS) than Python. Pandas is an absolute horror show of an API compared to dplyr, and the best plotting libraries for Python are just copying features from R. Lack of a magrittr style infix operator, though seemingly minor, actually emerges as a real pain point once you become accustomed to using it. R is inferior to Python as a programming language, no doubt about it -- but most data scientists are not programmers. Which is the point of TFA.
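For what it's worth, pandas' nearest analogue to a dplyr pipeline is method chaining; a toy sketch (the frame and column names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"species": ["a", "a", "b"], "mass": [1.0, 3.0, 2.0]})

# dplyr: df %>% filter(mass > 1) %>% group_by(species) %>% summarise(mass = mean(mass))
out = (
    df[df["mass"] > 1]                          # filter
    .groupby("species", as_index=False)["mass"] # group_by
    .mean()                                     # summarise
)
```

Whether that chaining style reads better or worse than `%>%` is, of course, exactly the syntax-preference argument being had in this thread.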
This is the crux of the problem with R and why R is increasingly blacklisted at large orgs. It attracts non-programmers which may have been okay 5 years ago but is no longer acceptable.
With the exception of some engineering powerhouses hiring pure-research PhDs to write R code, the trend established over the last 2 years is that fewer and fewer employers are hiring data scientists who aren't programmers. There are too many candidates who know data science and can also do data engineering and even generalist SE tasks. Non-programmer data scientists are no longer competitive in the industry, except in that small top-end research niche that doesn't exist in most orgs.
Which brings us back to the fact that R was a successful niche language that allowed non-programmers to write models, but that's simply not enough anymore. Businesses want models that can be plugged into production pipelines, models that can scale without needing a dedicated team to re-implement them, and they want staff who do engineering in addition to whatever it is they specialise in.
Virtually all data scientists graduating today are programmers, and pretty good ones. Candidates who only know R can't compete against them.
> Lack of a magrittr style infix operator, though seemingly minor, actually emerges as a real pain point once you become accustomed to using it
So you'd agree that you fall into the 'I prefer the syntax' bucket then? I don't really see any arguments against Python in your comment. Funnily enough, it's trivial to implement a pipe-style operator in Python, and there are at least two popular libraries for that.
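A minimal sketch of such a pipe operator, using operator overloading on `|` (the `pipe` class here is made up for illustration, not from any particular library):

```python
class pipe:
    """Wrap a function so that `value | pipe(f, *args)` calls f(value, *args)."""
    def __init__(self, fn, *args, **kwargs):
        self.fn, self.args, self.kwargs = fn, args, kwargs

    def __ror__(self, value):  # invoked for `value | pipe(...)`
        return self.fn(value, *self.args, **self.kwargs)

# [3, 1, 2] -> sorted descending -> take the first element
result = [3, 1, 2] | pipe(sorted, reverse=True) | pipe(lambda xs: xs[0])
```

The real libraries add niceties (partial application, currying), but the core mechanism is just `__ror__` as above.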
Eh, I call BS. Names and sources please. I know for a fact that R is used at all of FAANG and about a bazillion other "large orgs" too. I'm sure it's true that R is not used for customer-facing "web scale" products, but then again neither is any other language except for like two.
Being good at programming is a useful skill, but so is being good at statistics, and they are not interchangeable. "Productionizing a model" is not the only show in town when it comes to data analysis. Many programmers know shockingly little statistics. An equally large number of really strong statisticians prefer R, for good reasons. Orgs who simply refuse to hire those people do so at their peril.
I actually use R mostly because of its data.table package. It is much faster and more concise than pandas, which is a nightmare to work with. Sure, you can get the job done in pandas, but you often have to wait ~10x longer for your commands to run and sometimes, I simply cannot use pandas at all because I run out of memory.
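On the memory point: one common pandas workaround (a mitigation, not a fix) is choosing narrower dtypes up front; the frame below is invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": np.arange(1_000, dtype=np.int64),
    "group": ["a", "b"] * 500,   # strings stored as Python objects by default
})

# int32 halves the id column; 'category' stores each distinct string once
small = df.astype({"id": "int32", "group": "category"})

before = df.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
```

data.table sidesteps much of this by design (and by modifying in place), which is a large part of its speed and memory advantage.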
People are usually pretty surprised when I take the stance that R is faster than Python for the things most people actually care about, which is data manipulation and model building.
Stats and programming aren't mutually exclusive. The current generation of DS graduates are strong in both.
Job applicants who only know R and have no grasp of SE are less and less competitive. I don't expect there'll be any market for them in another 5 years.
In statistics programs they teach programming. For example, we studied one semester of C, one semester of OOP (C++), and one semester of SQL and relational database design; all of these were required courses. Beyond those, we also covered R, Matlab, Minitab, SPSS and SAS. All of my classmates knew programming. It is stupid to think that in this age a statistician can't write a program. How are you going to analyse a country's population census as a statistician? Statistical packages do not always provide everything you need. Sometimes you need to transform the data. Sometimes you need to check/validate the data. Sometimes you need to query from an X location. Sometimes you need to pipe through some process. A few people from our department wrote R packages that didn't exist before (new statistical analyses).
And yet all these businesses run Excel in production. I'd rather implement R code (with localised variables and dumb algorithms, so there's something funny in it) than an Excel spreadsheet. But somehow there's a difference...
They really don't. I've been at exactly the kind of org OP describes. I owned a production R/Excel system, and it was migrated to cloud + Python + ETL over 4 years.
The places where Excel are used are fairly appropriate. Way downstream, for simple tasks.
A lot of hard science is still done on the back of Excel, only begrudgingly adopting a data science mindset as the instruments produce more and more data. Data science is more than just streaming data, data lakes and machine learning.
Visual programming platforms like Knime are the next step for these teams, and then on to something like RStudio as they complete the transition towards employing data science in their pipelines.
>Lack of a magrittr style infix operator, though seemingly minor, actually emerges as a real pain point once you become accustomed to using it.
That's an interesting take, given that to me the magrittr operator seems to have been added to mimic the object-oriented 'attribute' operator.
Of course, the object-oriented variant makes it harder to extend the behaviour of a class after it has been defined (although strictly speaking that isn't impossible in Python): you'd need to add your methods up front, or extend the class.
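To make the "not impossible" part concrete: Python does let you bolt a method onto a class after the fact ("monkey-patching"), it's just not idiomatic. `Frame` here is a made-up stand-in:

```python
class Frame:
    def __init__(self, rows):
        self.rows = rows

# Extending behaviour after the class is already defined:
def head(self, n=2):
    return Frame(self.rows[:n])

Frame.head = head  # attach the method post hoc

f = Frame([1, 2, 3, 4])
```

Pipes avoid the question entirely, since any free function can participate without touching the class.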
> There is no use case for these languages anymore.
There are entire ecosystems of academic libraries built around Matlab that can't all just be picked up and moved to Python. This argument probably underestimates just how ingrained Matlab is in non-CS STEM academic departments.
Example: my girlfriend's department writes a world-leading MRI analysis library in Matlab. They offer training courses on it (so departments around the world now know it) and it's frequently used within academic papers (so there are now resources available on it). Why would they move to Python?
> There are entire ecosystems of academic libraries built around Matlab that can’t all just be picked up and moved to Python.
They can and they are. Python is increasingly displacing everything in the data industry and especially proprietary legacy platforms like Matlab. The number of things you can do in Matlab but not Python is converging on 0, while the inverse is not even worth trying to count.
Major universities are abandoning Matlab, Labview, SPSS, Minitab etc for Python, which is basically the end for them all. The next wave of CS/SE/DS/ML graduates had no exposure to Matlab. It'll linger in electrical engineering for a few more years but will suffer the same fate. In the end, proprietary platforms have no chance against FOSS.
> Example: my girlfriend's department writes a world-leading MRI analysis library in Matlab
Siemens leads the MRI industry, and the only place they're still using Matlab is in legacy platforms that aren't yet scheduled for updates or aren't worth updating.
The actual leading work is done with the same ML tools as the rest of the industry, mostly TensorFlow. Siemens and GE both also have programs to engage with and eventually acquire third-party ML platforms, not a single one of which has anything to do with Labview or Matlab outside of occasionally interfacing with legacy components.
> Major universities are abandoning Matlab, Labview, SPSS, Minitab etc for Python
Just to add another point of anecdata.
I helm a large data science effort in the defense industry. We are actively moving away from MATLAB and to Python. It's easier for us to find Python coders, easier to train people to use Python, more maintainable for the restrictions we have on our networks, and cheaper.
Yep, NASA used it alongside Matlab for Orion's guidance and navigation control systems. I've never had the chance to use it myself, but it looks pretty interesting.
In Python, your MRI analysis library could be hooked up to other cloud data pipelines far more easily, and companies would require fewer training courses on average.
Python is the most popular programming language in the world, and it keeps getting better.
Because more and more people realize that using closed-source, proprietary programming languages and libraries is not compatible with open, reproducible science.
Sure, but in reality most if not all educational institutes that I'm aware of have Matlab licences; it's what everyone in that particular field uses, and it's better to publish something with Matlab code than nothing at all, which I guess is the alternative (it's a means to an end, after all).
I can imagine this will change in the long run, but right now there are many valid reasons why people use these tools.
Then I guess Julia will be an even more powerful language to learn, as it combines the flexibility of Python with the use cases of all the major technical/scientific languages, while being efficient and fast.
The interoperability of Julia and Python is really good (Julia calling Python and vice versa). There is also the possibility of calling R functions (but I do not use it personally). To some degree, one can leverage Python's ecosystem from Julia. Our group switched from Matlab to Julia and we are quite happy with the move.
For teaching, I think that Julia's indexing (for example: vector[2:end-1]) is easier to explain than numpy's (vector[1:-1]). On the other hand, I like Python's plain-English operators and/or versus && / || in Julia.
Also, loops tend to be more readable than vectorised code in some circumstances (e.g. computing the Laplacian by finite differences). In Julia, loops and vectorised code are both quite efficient, while in Python and R, one has to vectorise the code.
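As an illustration of that point, the 1-D finite-difference Laplacian in numpy reads less naturally than the equivalent loop, but the vectorised form is the fast way in Python:

```python
import numpy as np

def laplacian(u):
    # u[i-1] - 2*u[i] + u[i+1] for interior points, all at once via slicing
    return u[:-2] - 2 * u[1:-1] + u[2:]

u = np.arange(5.0) ** 2   # u = x^2 on an integer grid
lap = laplacian(u)        # second difference of x^2 is constant 2
```

In Julia the plain `for i in 2:length(u)-1` loop would perform just as well, which is the comment's point.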
Julia's ecosystem has been progressing since 1.0. The GLM.jl library has become much better, as has the data frame package. It's more consistent than Python's data-science ecosystem, since it isn't tying disparate C code together. Having strong types (actual, not mypy) also helps make code more consistent. Still, Julia's ecosystem seems to be building more on R's more solid academic approach.
The community is great, but small. For a lot of situations, I'd be hesitant to invest in Julia, because I don't know if the community will stay that way or fade away.
Out of curiosity, how would you know that the community is large enough, or committed enough? For example, while Julia has been in development for almost 10 years, a lot of the community has now been around for 5 years. There's about 2,500 Julia packages, with the ability to call C, Fortran, R, Python, Java, etc. All the key community stats based on downloads, website views, videos show a healthy growth every year.
While in absolute numbers, we may be at 20% of R or Python communities, I am always curious to understand what people mean when they say the community is too small. What would be a signal that a particular community is big enough?
For me, as long as a core group appears to be active, I'm fine with a community's survivability. Julia's data-science and plotting libraries have continued to improve in terms of documentation and feature parity; both are critical in an immature ecosystem, as they indicate an active core group of developers. Also, many libraries appear to be driven by academics creating cutting-edge libraries or developing "workhorse" libraries. One good example is Steven G. Johnson's involvement in Julia [1,2]; since he created the FFTW library and NLopt, I'd put him in the category of 'prolific data science contributor'. Or take the Julia GaussianProcesses.jl [3] library, which has a surprisingly thorough implementation along with academic research (and it's citable!) for speeding up GP fitting. Pretty cool! Plus it's pretty performant to, say, use Optim.jl to optimize the hyperparameters for a set of GPs. That enables a lot more iterations of data exploration.
Essentially, the base ecosystem of a language is driven by a core group of contributors, and the dedication and ability of that group matters more than most other factors. When doing scientific or data science work, I personally care more about the core quality and what the platform enables. Lately I've considered learning R, as it has a lot of well-done stats packages which simply aren't available in Python, and aren't ready yet in Julia. Last time I tried to calculate a confidence interval in Python for an obscure probability function, I ended up wanting to pull out my hair in frustration. There are libraries that kind of handle it in Python, but they are (were?) nigh impossible to modify or re-use for a generalized case. Much less getting a proper covariance matrix with enough documentation to know what to do with it. I used R examples to figure out the correct maths. R's NSE seems appealing in allowing generalized re-use. I've had similar ability to re-use library features in Julia for solving problems outside a library's initial scope.
Julia uses types to drive compilation. If you call f(1.0), in most cases it will lazily compile a float-optimized version of f. When you then call f(1), it will compile a separate integer-optimized specialization.
This also lets libraries be selected by type: a standard float array will use BLAS for matrix ops, while a GPU float type will use CUDA.
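Python's nearest analogue is `functools.singledispatch`, which picks an implementation by argument type at runtime but does no compile-time specialisation; a minimal sketch:

```python
from functools import singledispatch

@singledispatch
def double(x):
    raise TypeError(f"no implementation for {type(x).__name__}")

@double.register
def _(x: int) -> int:
    return x * 2      # integer path

@double.register
def _(x: float) -> float:
    return x * 2.0    # float path
```

Unlike Julia, this dispatches on the first argument only and carries normal interpreter overhead on every call.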
>But it lacks R's ecosystem, so Julia is a tougher sell.
In the areas I work in (scientific computing and scientific machine learning), you can really only find the packages in Julia while R and Python's ecosystems are quite lacking. R has stats and Python has ML, but the rest of the scientific ecosystems there just aren't as complete.
Statisticians, economists, biologists and social scientists all learn and work with R. They publish new packages in R, not in Python. There is no trend at all of this moving towards Python. Python is far, very far behind when it comes to state-of-the-art research in anything stats-related (besides machine learning, I guess, but R is pushing hard to close that gap).
> Statisticians, economists, biologists, social scientists all learn and work with R
They used to work with R. And the old generation of engineers used to work with Matlab. The old generation still does.
The new generation has been using actual programming languages, typically Python, since high school. They were the first wave of graduates in 2019 that specialised in a discipline and were also competent in software engineering.
The old generation is going to be driven out of the job market by the new in the span of 5 years as they saturate the senior tier of their respective fields. How do you compete for a job when all you know is R and your discipline, against someone who's a full fledged software engineer who knows your discipline and can put models directly into production use?
This simply isn't true. There are more of them that know Python these days in addition to R, but as a recent graduate of a respected statistics graduate program, I can assure you that R is still the overwhelmingly preferred choice in the field, and also is in economics.
So assuming this is true, what plan of action do you recommend for an “old gen” data scientist who is strong in math, stats, ML theory, R, dataviz, ETL, research, etc., but who is not by a long shot a “full fledged software engineer”? I will soon be competing against this new crop of statistician/engineer superhybrids you speak of.
I know a fair bit of Python (mostly for ML/DL applications), bash, and just a smidge of HTML/CSS/JS (just enough to tweak a front-end demo via R Shiny). I'm OCD enough that I make every effort to write clean, reproducible code and unit test it as I go (is this TDD?). I can implement some stats algorithms (e.g. the EM algorithm, MCMC) from scratch with a pseudocode reference, but I rarely if ever have occasion to do that, for obvious reasons. I understand the concept of computational complexity, though I don't have any figures memorized.
But I’ve never taken any CS course beyond Programming 101. I wouldn’t know how to navigate a complex production codebase. Embarrassingly, I know almost nothing about git. I’m 100% sure I’d get slaughtered in a whiteboard interview or similar. For that matter, I could easily get nailed on some holes in my core data science knowledge (cough SQL cough).
So, do I rebuild my foundation around software engineering, or just patch up the obvious holes? Grind towards a management position and let my inferior skills rot away?
Learning git is never a bad thing. But if you encounter a company expecting you to be a software engineer, run away. You're not that, you're a data scientist. You wouldn't expect a software engineer to be able to recreate some statistical proof from scratch, as you're testing the wrong set of skills.
I would say that for data manipulation and data visualization, R is objectively superior to Python, and for most statistical methods too. It's pretty even for many machine learning algorithms. Python only really outclasses R in deep learning, in my opinion.