
But readr gives you a tibble instead of a plain data.frame and that adds a bunch of other headaches.


How so? I can't think of any drawbacks of tibbles vs data.frames?


Using tibbles outside of tidyverse can be dangerous.

    library(tibble)

    df_iris <- iris
    tb_iris <- tibble(iris)

    # Count the distinct values in one column, selected with `[`
    nunique <- function(x, colname) length(unique(x[,colname]))

    nunique(df_iris, "Species")
    > 3

    nunique(tb_iris, "Species")
    > 1

Imagine now using some complex function from a repository (e.g. Bioconductor) that works on data.frames and passing a tibble to it.


Post edit: I'm preserving the post below, which I think highlights some issues with the example in the parent comment. It turns out the parent is correct on another point, though: narrowly, in his example, the `[` subset operator dispatches differently on tibbles than on data.frames, so it can produce weird behaviour. So to anyone reading, please consider upvoting the parent and reading the rest of the thread.

Original post follows:

Right off the bat, the problem is not "using" tibbles, it's that you've incorrectly constructed one by passing the data through the tibble() constructor rather than using as_tibble(). The tibble constructor -- for pretty good reasons in other circumstances that seem crazy to you here because of your intent -- infers that you want the entire data frame to be a single column inside the tibble, called "iris". It does this because it evaluates the variable name passed to the tibble constructor as both the intended column name and the data to be placed inside the column. This demonstrates nesting, which is one of the great features of tibbles and otherwise used for a bunch of stuff.

If you had done `tb_iris <- as_tibble(iris)`, it would have worked fine. `as_tibble()` is the function to convert an existing data structure to a tibble. R is obviously not "type safe" in any way, but you can engage in defensive programming, and one way you can do that is being hyper-aware of the steps you take during type conversions. If you check the documentation for `tibble()`, it tells you explicitly to "Use as_tibble() to turn an existing object into a tibble." Is there a reason you didn't? Imagine this related example:

  my_string <- "10"
  numeric(my_string)
  as.numeric(my_string)
Would we conclude that "using the numeric type can be dangerous" because the constructor interpreted the argument differently than the conversion helper?
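
The constructor-versus-coercion distinction can be sketched in base R. This is only an illustration that calls `numeric()` with a valid length argument; it does not reproduce the exact calls above.

```r
# Constructor: the argument is a length, and the result is a zero-filled vector.
zeros <- numeric(3)
print(zeros)                 # 0 0 0

# Coercion helper: the argument is data to convert.
converted <- as.numeric("3")
print(converted)             # 3

# The same "3" leads to two very different objects.
same <- identical(zeros, converted)
print(same)                  # FALSE
```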

Second, I suspect you must be using extremely old versions of things, because on more recent versions, your nunique function would fail, not produce 1. I correctly get "Error: Can't find column `Species` in `.data`." This error message is maybe a little confusing if you don't check the structure `str(tb_iris)` of tb_iris to see what I mentioned above, but is the correct error to output in light of it. You'd also be able to flag this by just checking `colnames(tb_iris)` or `View(tb_iris)` if you're working in RStudio or using the embedded environment pane or really any other way of looking at the data.

But your broader point is also false. Once a tibble has been formed, it should work EXACTLY the way a data.frame works because R objects can have multiple classes. The only thing that makes a tibble different from a data.frame is that it has an additional class label. All dispatches that work on data.frame objects work on tibbles because of how S3 dispatch handles objects with several class labels. This has been a goal since the beginning of tibble. The one exception I'm aware of is external functions that incorrectly check `if(class(obj) == "data.frame")` instead of using `is.data.frame()` or `if("data.frame" %in% class(obj))`. The former is and always has been incorrect, because `class()` can return more than one label: on a multi-classed object the comparison evaluates to a vector of logicals instead of a single logical, which should generate an error in the if statement.
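
A minimal sketch of the class-check point, using base R only. The class name `my_tbl` is made up and stands in for any extra label placed in front of `data.frame`; no tibble is involved.

```r
# Stand-in for a multi-classed object such as a tibble.
# "my_tbl" is a hypothetical class name, so the example needs no packages.
obj <- iris
class(obj) <- c("my_tbl", "data.frame")

# Fragile check: compares every class label, so it yields a logical vector.
fragile <- class(obj) == "data.frame"
print(fragile)                       # FALSE TRUE

# Robust checks: each returns a single TRUE.
print(is.data.frame(obj))            # TRUE
print(inherits(obj, "data.frame"))   # TRUE
print("data.frame" %in% class(obj))  # TRUE
```

Since R 4.2.0, passing a condition of length two to `if` is an error; earlier versions only warned and used the first element.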

One way you can tell that tibbles and data frames are identical, save the above caveat, is to run the following code:

  df_iris <- iris
  tb_iris <- as_tibble(iris)
  identical(df_iris, tb_iris)
  class(tb_iris) <- "data.frame"
  identical(df_iris, tb_iris) 
Note that you are not "downconverting" a tibble into a data.frame in this code (but that would work too) -- you are taking the tibble exactly as is and hacking its class label to look like a data frame. It's identical because a tibble was always a data frame.


I think everything you wrote here is false, so I am not sure how to reply. Will try to keep it respectful and short:

First, about the as_tibble - it returns the same thing as tibble:

    tb_iris <- as_tibble(iris)
    length(unique(tb_iris[,"Species"]))
    > 1
Second, about the incorrect version:

    > packageVersion("tibble")
    [1] ‘3.0.1’
Which is also the current version on CRAN.

Third, about the classes:

You say:

> Once a tibble has been formed, it should work EXACTLY the way a data.frame works because R objects can have multiple classes.

This is not the case. You can add any class to any object in R's S3 system. So the people behind tibble can call their tibble a data.frame, but that gives no guarantee that it will behave like one.

More about this problem here (and you can also find replies from tidyverse authors) https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...


Actually your reply was very helpful because it surfaced ways in which you were partially right and I was partially wrong.

I highlighted the nesting issue in constructing versus coercing (which is correct and does have implications for what you're trying to do), but actually in your example the distinction is broken because of a different edge case.

Which is to say the following:

  ncol(iris) # 5
  ncol(as_tibble(iris)) # 5
  ncol(tibble(iris)) # 1

  iris$Species # Works
  as_tibble(iris)$Species # Works
  tibble(iris)$Species # Errors because of nesting

  iris[, "Species"] # Works
  tibble(iris)[, "Species"] # Doesn't work
  as_tibble(iris)[, "Species"] # Works
 
However, you're correct that because the subset operator for tibble doesn't drop dimensions, length gets you the number of columns rather than the number of observations. This does speak to the fact that length is a pretty shitty function to begin with, but I concede you're partially correct there.

You are also correct that because class labels are not contractual, there is no guarantee that having the data.frame fallback label means stuff behaves identically (for instance, you could add the data.frame label to any data structure and the data.frame dispatch stuff would not work properly). My point was that in the case of a tibble, a tibble is literally a data frame with an additional class label. If you remove that class label, it's exactly identical.

But your example and linked discussion does highlight a way in which I'm wrong; the subset function is overridden for something with a tibble class label. That's true and could produce edge cases I hadn't considered.
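
A toy illustration of why the `data.frame` label is not a contract, in base R only. The class `keep_df` and its `[` method are invented for this sketch and are not how tibble is implemented; the method only mimics the idea of a subset override that changes the `drop` default.

```r
# Toy subclass of data.frame. "keep_df" is a hypothetical class name.
obj <- iris
class(obj) <- c("keep_df", "data.frame")

# Override `[` so that drop defaults to FALSE. Simplified on purpose:
# it only supports the two-index forms x[, j] and x[i, j].
`[.keep_df` <- function(x, i, j, drop = FALSE) {
  plain <- x
  class(plain) <- "data.frame"   # strip the extra label before delegating
  if (missing(i)) plain[, j, drop = drop] else plain[i, j, drop = drop]
}

still_df <- is.data.frame(obj)                 # TRUE: the label is still there
n_plain  <- length(unique(iris[, "Species"]))  # 3: a factor comes back
n_sub    <- length(unique(obj[, "Species"]))   # 1: a one-column data frame comes back
print(c(still_df = still_df, n_plain = n_plain, n_sub = n_sub))
```

The object passes `is.data.frame()`, yet the same expression gives a different answer, which is the edge case discussed above.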

Apologies for any hostility in my original reply.


I'm sorry to report that this analysis is completely wrong, and demonstrates a lack of understanding of the R object model. The class that is provided by tibble does not implement all of data.frame, and the OP is correct.


(S3 -- see footnote) Classes don't "implement" anything in R the way they would in other languages. They are labels that tell dispatch functions how to deal with an object. A tibble is internally a data frame. The last example in my post makes this exactly clear.

The other OO systems in R do act closer to traditional classes, but all the tidyverse stuff is S3.

(But the OP was correct in another sense related to the example narrowly!)


So you're ignoring that the [-function by design works differently for tibbles than for data frames. This isn't really a problem with tibble but with sloppiness in programming allowed by dynamic languages.

I personally think it's a good thing that the drop-argument defaults to FALSE for tibbles, since data frame's default drop = TRUE is a source of frequent bugs. The change of the default for this parameter is the source of your observation.
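
The effect of the `drop` argument can be reproduced in base R without tibble. In this sketch, `drop = FALSE` on a plain data frame stands in for the tibble default, and the `[[`-based `nunique` is one possible way to write the helper so that it does not depend on `drop`.

```r
# Base R default (drop = TRUE): a single column collapses to a vector.
as_vector <- iris[, "Species"]
# drop = FALSE keeps a one-column data frame, matching the tibble default.
as_frame <- iris[, "Species", drop = FALSE]

n_vector <- length(unique(as_vector))  # 3 distinct species
n_frame  <- length(unique(as_frame))   # 1, because length() of a data frame counts columns

# `[[` extracts the column itself, so this version does not depend on drop.
nunique <- function(x, colname) length(unique(x[[colname]]))
n_df    <- nunique(iris, "Species")      # 3
n_keep  <- nunique(as_frame, "Species")  # 3
print(c(n_vector, n_frame, n_df, n_keep))
```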


I am not ignoring it, I am _highlighting_ it. The question of the comment above was "why would one prefer data.frame over tibble". I merely answered that question.


Yes, but the problem isn't tibble since what you're highlighting is a design choice and an argument in favor of tibble. The problem only arises when you're not aware of this design choice, which is facilitated by sloppiness and dynamically typed languages.

One might ask whether it was a good idea that tibble enlists data.frame as an inherited class. Since a tibble obviously doesn't behave like a data frame, one could also argue that this is a mistake on the part of the tibble developers, but this is a different discussion.


All I am saying is that there are perfectly good reasons for not using tibbles if you do any kind of work outside of tidyverse. And you seem to agree?

As for whether or not tibbles should be data.frames - I posted a link to this exact discussion on R-dev mailing list within this thread, as an answer to a different poster. Here it is: https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...


Ok, now I understand where you're coming from.


Just put an as.data.frame() around it. That's what I do with readxl::read_excel :-)



