This still leaves a sample-size problem: since today was the 29th day, that logo only had from the start of the day through the time the article was written/submitted, while the logo from day 1 had a much larger sample size. There is some evidence of this in the results. Secondly, was it fully randomized, or were things like a multi-armed bandit or other A/B testing methods used to provide enough variation? The chart is biased towards the earlier numbers, with the outliers (the really bad logos) falling towards the end.
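To put a rough number on the sample-size concern, here is a minimal sketch that compares the uncertainty around the same observed rate at two sample sizes. The counts (50 out of 1000 for a full day, 5 out of 100 for a partial day) are made up purely for illustration and are not taken from the article, and the Wilson score interval is just one common choice for a binomial proportion, not necessarily what the author used.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical counts: same observed 5% rate, very different sample sizes.
full_day = wilson_interval(50, 1000)    # logo that ran a whole day
partial_day = wilson_interval(5, 100)   # logo that only ran part of a day

full_width = full_day[1] - full_day[0]
partial_width = partial_day[1] - partial_day[0]

print("full day interval:   ", full_day)
print("partial day interval:", partial_day)
print("partial day interval is wider:", partial_width > full_width)
```

With the same observed rate, the smaller sample gives a much wider interval, so a logo that only ran for part of a day can look like an outlier in either direction just from noise.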