Building on top of Pandas feels like you're only escaping part of the problems. In addition to the API, the datatypes in Pandas are a mess, with multiple confusing (and none of them good) options for e.g. dates/datetimes. Does redframes do anything there?
Ironically, sometimes calculating cost-per-use takes more brainpower than it's worth.
Sometimes I get a lot of enjoyment out of buying a thing that I know I will love and out of considering all of the alternatives. Other times, I just defer to what worked in the past.
The lack of an AD primitive is something I've discussed with the creator of BQN, coming from a JAX world I really miss it and feel that it's such an obvious feature, especially in a language which has a way to turn a tacit function into its AST[1], which has been used for symbolic differentiation[2]. Going from symbolic to reverse-mode AD is not much of a leap and users can define their own primitives with ReBQN[3].
I see what you mean by obfuscation, but I think it's one of those things that feels really hard and stupid until you start being able to do it really quickly. When you learn a foreign language, you first read letters, then words, then sentences: as you become accustomed to larger pieces of the language, you can predict what's coming next without reading it. A similar sort of thing happens with APL/BQN: you read letters (primitives), then you begin to recognise words (small, commonly used groups of primitives), then you see larger patterns which look like magical incantations to an inexperienced user.
These "words" are (typically) tacit phrases, many of them only existing thanks to specific primitives like swap. Once I began golfing in BQN, I started wishing Julia had a swap for operators, i.e.
-(3, 5) = -2
swap(-)(3, 5) = 2
I won't defend these languages to the death, but they are fun to puzzle your brain with in codegolf. Maybe Dex[4] will go somewhere too.
> I started wishing Julia had a swap for operators
You can define one pretty easily:
julia> swap2(f::F) where F<:Function = (a, b) -> f(b, a)
swap2 (generic function with 1 method)
julia> swap2(-)(3, 5)
2
(Perhaps you meant you wished it was pre-defined with the language, I understand the slight friction of having to define things for every project; this is just making sure you know it can be done pretty easily.)
I switched permanently from Plots.jl to Makie.jl in order to have backend-agnostic fine-grained control. My publication plots look fantastic and the power given to users is really something. It also has a nicer API than Plots.jl once you get a hang of the figure, axis, plot distinction (plots live inside axes live inside figures) and what goes where.
Unfortunately, as with Plots, the documentation is lacking. The basic tutorial does a good job of introducing the package at a high level, but some parts of the documentation use functions/structs in examples that don't themselves have docstrings, which makes it very hard to build on those examples.
I get it, I can do anything with Makie, and most things that I want to do work amazingly. But my code for a single figure can get huge because it's all so low level. See, for example, the Legend documentation[1].
Improving the docs was one of my key takeaways from MakieCon. It's pretty time-intensive to work on them, as you can imagine, but I hope we'll be able to make the structure clearer and more efficient in the future. There should at the very least be docstrings for every exported struct and so on. But I also want newcomers to get started with less friction, so the explanations/tutorials/how-tos must improve.
This is an easy way for newcomers to help out, by the way: just give feedback on how starting out with the library went and what the main roadblocks were. The better we understand those roadblocks, the better we can address them.
I agree with your conclusion but want to add that switching from Julia may not make sense either.
According to these benchmarks: https://h2oai.github.io/db-benchmark/, DF.jl is the fastest library for some things, data.table for others, polars for others. Which is fastest depends on the query and whether it takes advantage of the features/properties of each.
For what it's worth, data.table is my favourite to use and I believe it has the nicest ergonomics of the three I spoke about.
Indeed DataFrames.jl isn't and won't be the fastest way to do many things. It trades off performance for flexibility. The columns of the dataframe can be any indexable array, so while most examples use 64-bit floating point numbers, strings, and categorical arrays, the nice thing about DataFrames.jl is that arbitrary-precision floats, pointers to binaries, etc. are all fine inside of a DataFrame without any modification. Compare this to the restricted set of allowed Pandas datatypes (https://pbpython.com/pandas_dtypes.html). I'm quite impressed by the DataFrames.jl developers, given how they've kept it dynamic yet achieved pretty good performance. Most of that comes from smart use of function barriers to keep the dynamism out of the core algorithms. But from that, it's clear that systems which outperform it even with the same algorithms should be able to exist, in some cases by just tens of nanoseconds, but in theory that bump is always there.
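The function-barrier pattern mentioned above can be sketched as follows. This is a minimal illustration with made-up names (`sum_column`, `_sum_kernel`), not DataFrames.jl internals:

```julia
# Columns live in an untyped container, much like a DataFrame's columns
# can be any indexable array.
cols = Dict{Symbol,Any}(:a => [1.0, 2.0, 3.0], :b => [1, 2, 3])

# Outer function: the type of cols[name] is unknown at compile time.
sum_column(cols, name) = _sum_kernel(cols[name])  # the "barrier" call

# Inner kernel: compiled separately for each concrete array type it sees,
# so the hot loop runs fully typed despite the dynamic container.
function _sum_kernel(v::AbstractVector{T}) where {T}
    s = zero(T)
    for x in v
        s += x
    end
    return s
end

sum_column(cols, :a)  # 6.0
```

The dynamic dispatch happens once per call at the barrier; everything inside the kernel is specialized code.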
In the Julia world, the one which optimizes to be fully non-dynamic is TypedTables (https://github.com/JuliaData/TypedTables.jl), where all column types are known at compile time, removing the dynamic-dispatch overhead. But in Julia the minor performance gain of TypedTables versus the major flexibility loss is why you pretty much never hear about it. Probably not even worth mentioning, but it's a fun tidbit.
> For what it's worth, data.table is my favourite to use and I believe it has the nicest ergonomics of the three I spoke about.
I would be interested to hear what about the ergonomics of data.table you find useful. If there are some ideas that would be helpful for DataFrames.jl to learn from data.table directly, I'd be happy to share them with the devs. Generally when I hear about R, people talk about the tidyverse. Tidier (https://github.com/TidierOrg/Tidier.jl) is making some big strides in bringing a tidy syntax to Julia, and I hear that it has had some rapid adoption and happy users, so there are ongoing efforts to use the learnings of R APIs, but I'm not sure if someone is looking directly at the data.table parts.
> Indeed DataFrames.jl isn't and won't be the fastest way to do many things
Agreed, and the DF.jl developers are aware and very open about this fact - the core design trades off flexibility and user friendliness over speed (while of course trying to be as performant as possible within those constraints).
One thing that hasn't been mentioned so far is InMemoryDatasets.jl, which as far as I know is the closest to polars in Julia-land in that it chooses a different point on the flexibility-performance curve more towards the performance end. It's not very widely used as far as I can tell but could be interesting for users who need more performance than DF.jl can deliver - some benchmarks from early versions suggested performance is on par with polars: https://discourse.julialang.org/t/ann-a-new-lightning-fast-p...
I have not tried it. I like that the project makes broadcasting invisible, I dislike that it tries to completely replicate R's semantics and Tidyverse's syntax. Two examples: firstly, the tuples vs scalars thing doesn't seem very Julia to me. Secondly, I love that DF.jl has :column_name and variable_name as separate syntax. Tidier.jl drops this convention (from what I see in the readme).
> I'm not sure if someone is looking directly at the data.table parts
I believe there was some effort to make an i-j-by syntax in Julia but it fell through or stopped getting worked on. By this syntax I mean something like:
# An example of using i, j, and by
@dt flights [
carrier == "AA",
(mean(:arr_delay), mean(:dep_delay)),
by = (:origin, :dest, :month)]
# An example of expressions in by
@dt flights [_, nrows, by = (:dep_delay > 0, :arr_delay > 0)]
The idea of ijby (as I understand it) is that it has a consistent structure: row selection/filtering comes before column selection/filtering, and is optionally followed by "by" and then other keyword arguments which augment the data that the core "ij" operations act upon.
data.table also has some nifty syntax like
data[, x := x + 1] # update in place
data[, x := x/nrows(.SD), by = y] # .SD = references data subset currently being worked on
which make it more concise than dplyr.
The conciseness and structure of data.table, and its tendency to require much less code than comparable tidyverse transformations thanks to some well-informed choices and reservations of syntax, make it nicer for me to use.
> I would be interested to hear what about the ergonomics of data.table you find useful. If there are some ideas that would be helpful for DataFrames.jl to learn from data.table directly, I'd be happy to share them with the devs.
Personally, my main usability gripe is that it's difficult to do row-wise transformations that try to combine multiple columns by name. I know one can do
```
transform(df, AsTable() => foo ∘ Tables.NamedTupleIterator)
```
But this is 1) kind of wordy and 2) can come with enormous compile times (making it unusable) for wide tables
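For the common case of combining a few named columns row-wise, one less wordy option uses DataFrames.jl's `AsTable` and `ByRow` wrappers (the column names here are made up for illustration):

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

# AsTable(cols) passes the selected columns to the function as a NamedTuple;
# ByRow applies the function to one row's NamedTuple at a time.
transform!(df, AsTable([:a, :b]) => ByRow(nt -> nt.a + nt.b) => :total)

df.total  # [11, 22, 33]
```

This doesn't escape the compile-time scaling problem for very wide tables, but it's shorter when only a handful of columns are involved.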
I really hope people don't come from R to Julia. People who use R are not good programmers, and will degrade the core of the language and its principles.
It would be a shame to see the equivalent of tacking on 6 different object oriented systems to a base language and fragmenting the community completely.
I'm not sure I'd have the same take. Yes, R as a language is kind of wonky and people who use R tend to not be good programmers. However, the APIs of some packages are designed well enough that even with all of those barriers it can still be easy to use for many scientists. I wouldn't copy the language, 6 different object systems and non-standard evaluation is weird. But there is a lot to learn from the APIs of the tidyverse and how it has somehow been able to cover for all of those shortcomings. It would be great to see those aspects with the data science libraries of the Julia language.
It might surprise you to learn that Julia actively relies on code written in/for R to perform computations. You might be surprised to find out that people who can write R can also write C++, C, and other languages of their choosing. You also might be surprised to learn that some of the most vetted statistical code exists in the R ecosystem. If I were recruiting for a niche language with a weak ecosystem, personally I'd take all the help I could get. You can learn Julia with a background in any other programming language in a few weeks... The same can't be said about martingales... But you get to choose your strategy here...
And thus we who transitioned to Julia from R and know a bit about martingales and less about programming have long been trying to degrade the core of the language and its principles by making `mean` a Base function.
R users in the form of statisticians should definitely come around to Julia. More high quality packages never hurt. But I agree with fragmentation and 'object systems', yet I don't think this is a huge danger for Julia.
BQN[1] has higher order functions. Of the array languages I've used, it's by far my favourite. That said, I mostly solve small problems for fun in them.
Yeah BQN sort of has higher order functions but it still distinguishes between functions and data, so I don't think it would be that possible/easy to use combinatory logic style combinators. I haven't used BQN much though, so I could be wrong.
Context: Coming from a statistics background, I learned a bit of R, then a bit of Python for data analysis/science, then found Julia as the language I invested my time in. Over time I keep up with R and Python enough to know what's different since I learned them, but don't use them daily.
What I always tell people is the following:
If you are writing code using existing libraries, then use whichever language has those libraries. The NN stack(s) in Python are great; the statistical ML stack(s) in R are simple and include SOTA techniques.
If you are writing a package yourself, then I assume you know the core of the idea well enough to be able to write your code from the "top down" i.e. you're not experimenting with how to solve the problem at hand, you're implementing something concretely defined.
In this case, and tailored to your use, I would argue that Julia has more advantages than disadvantages, especially compared to R or Python. Here are a few comments:
1. Environments, dependencies, and distribution can all be handled by Pkg.jl, the built in package manager. There is no 3rd party tool involved, there is no disagreement in the community on which is better. This is my biggest pain point with Python.
2. Julia's type system both exists and is more powerful than that of Python (types or classes) and R (even Hadley's new S7(?) system). By powerful I mean generics/parametric types and overloading/dispatch built in. You can code without them, but certain problems are solved elegantly by them. Since working heavily with types in recent years, I find this to be my biggest pain point in R and I wouldn't want to write a package in R, although I like to use it as an end user.
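As a toy sketch of what parametric types plus dispatch buy you (all names here are made up for illustration):

```julia
# A parametric type: T is part of the type itself.
struct Point{T<:Real}
    x::T
    y::T
end

# A generic method, specialized per concrete T at compile time.
norm2(p::Point) = p.x^2 + p.y^2

# Multiple dispatch on the type parameter.
describe(::Point{Float64}) = "64-bit float point"
describe(::Point{Int})     = "integer point"

norm2(Point(3, 4))         # 25
describe(Point(1.0, 2.0))  # "64-bit float point"
```

The same generic `norm2` works for any `Real` element type without code duplication, while `describe` shows how behaviour can still be tailored to specific instantiations.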
3. New developments in scientific programming, programming ergonomics, hardware generic code (as in this post), and other cool features happen in Julia. New developments in statistics happen in R (and increasingly Julia), new developments funded by big companies happen in Python.
4. The Python and R interpreters start up faster than Julia's. The biggest problem here is redefining types, which is the only thing in Julia that can't currently be "hot reloaded", i.e. you need to restart Julia to redefine a type.
5. Working with tabular data is (currently) far more ergonomic and effortless in R than Python and Julia.
6. Plotting is not a solved problem in Julia. Plots.jl is pretty easy and pretty powerful, Makie.jl is powerful but very manual. Time to first plot is longer than R or Python.
7. Julia has almost zero technical debt; R and Python have a lot. Backwards compatibility is guaranteed for Julia code written for v1.0 onwards, and Pkg.jl handles package compatibility. If I send you code I wrote 4 years ago along with a Project.toml containing [compat] information, then you can run it with zero effort. (This is the theory; in practice Julia programmers are typically scientists first and coders second, so ymmv.)
8. You can choose how low level you want your code to be. Prototyping can be done in Julia, rewriting to be faster can be done in Julia, production code can be done in Julia. Translating Python to C++ production might mean thinking about types for the first time in the dev process. In Julia, going to production just means making sure your code is type stable.
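To make "type stable" concrete, here is a toy sketch (hypothetical functions; `@code_warntype` is the usual tool for checking this):

```julia
# Type-unstable: the return type depends on a runtime value,
# so the compiler infers Union{Int, String} and callers can't specialize.
unstable(x::Int) = x > 0 ? x : "non-positive"

# Type-stable: the return type follows from the argument types alone.
stable(x::Int) = x > 0 ? Float64(x) : 0.0

stable(3)   # 3.0
stable(-2)  # 0.0
```

Making `stable`-style guarantees hold throughout your hot paths is essentially what "production-ready" Julia means.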
[1]: https://github.com/maxhumber/redframes