This was an extremely interesting paper to me, about a topic that I see as economically and sociologically fundamental.
I was actually impressed by the methods they used. I found myself thinking "this is what I'd really like to see," and then they'd report it. Validating their method on the MusicLab data seemed critical to me, as did examining reddit resubmissions versus YouTube views.
Although I thought that, methodologically, it was almost as well done as it could have been outside of an experiment, I disagreed with the authors' conclusions. They acknowledge some of the problems, such as the huge number of forgotten posts they didn't model at all, but they don't address others.
For example, it seems the question of most interest is: given an observed post score, what's the actual "quality"? If you look at, say, Figure 3, it's apparent that there's huge variability in quality conditional on score, especially as observed score increases.
I think the correlational-style relationship they focus on obscures things like this that are critical to interpreting the findings. Yes, there's a strong estimated relationship between quality and score, but only if you ignore the missing data that constitutes the bulk of submissions, the fact that the relationship is driven very strongly by the contrast between a large mass of very low-"quality" posts and everything else, and the variability everywhere else. It's an odd, heteroscedastic, nonlinear relationship that isn't well captured by a correlation, even a nonparametric one.
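As a toy illustration of what I mean (a simulation I'm making up here, not the paper's data): a single rank correlation can come out looking strong largely because the mass of near-zero posts is well separated from everything else, even while the conditional spread of quality at a given score stays wide.

    # Illustrative simulation (not the paper's data): a rank correlation propped up
    # by a mass of low-"quality", low-score posts, with big conditional spread elsewhere.
    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)

    q_low = rng.uniform(0.0, 0.1, 9000)                   # the forgotten bulk
    s_low = rng.exponential(2.0, 9000)                    # near-zero scores
    q_hi  = rng.uniform(0.1, 1.0, 1000)                   # posts that got traction
    s_hi  = 50 * np.exp(3 * q_hi) * rng.lognormal(0.0, 1.5, 1000)  # noisy, nonlinear

    quality = np.concatenate([q_low, q_hi])
    score   = np.concatenate([s_low, s_hi])

    rho_all, _ = spearmanr(quality, score)
    rho_hi, _  = spearmanr(q_hi, s_hi)
    top = quality[score > np.quantile(score, 0.9)]
    print(f"overall Spearman rho: {rho_all:.2f}")
    print(f"rho among posts with any traction: {rho_hi:.2f}")
    print(f"quality range in the top score decile: {top.min():.2f} to {top.max():.2f}")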
I also would have liked to see examination of variability in links across sites. How much variability is there in rank of an initial link, to the same material, across reddit, HN, Twitter, etc.? Maybe tellingly, the authors report the relationship between YouTube views and number of reddit submissions, but not the relationship (if I'm reading correctly) between YouTube views and rank of initial reddit submissions, which is kind of the key relationship.
So, I liked the paper, but if anything it just reconfirms for me the conclusions of earlier studies: social network dynamics have a big influence on apparent popularity.
There are at least two issues here. One is the continuous-versus-discrete issue and the other is the moment issue.
As for the moment issue, the short story is that once you get to three or four moments, there isn't a general maximum entropy distribution anymore, except for some special, idiosyncratic cases with three moments, I think. So the normal is, in some ways, the most conservative distribution you can use in a general, unspecified scenario. You can constrain more moments, but then there isn't a single maxent distribution that applies across all third- and fourth-moment scenarios the way there is for the first two.
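To make that concrete, the standard two-moment result (textbook material, not anything specific to this thread): maximizing differential entropy subject to a fixed mean and variance forces the density to be log-quadratic, which is exactly the normal,

    \max_p \; -\int p(x)\log p(x)\,dx
    \quad\text{s.t.}\quad \int p(x)\,dx = 1,\;\; \int x\,p(x)\,dx = \mu,\;\; \int (x-\mu)^2 p(x)\,dx = \sigma^2
    \;\Longrightarrow\;
    p(x) = \exp(\lambda_0 + \lambda_1 x + \lambda_2 x^2)
         = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big).

Add only a third-moment constraint and the same argument gives p(x) \propto \exp(\lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3), which isn't normalizable for \lambda_3 \neq 0, so there's no general maxent solution beyond special cases.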
As for the continuous versus discrete thing, there's some caution that's warranted, but a lot of the maxent principles apply, and there are similar, closely related principles (minimum description length, which has been shown to be equivalent to maximum entropy inferentially in a sense) that generalize in the continuous case. If you think of everything as discretized (as is the case with machine representation), there's some work showing that the discretized and continuous cases are sort of related up to a constant (doi: 10.1109/TIT.2004.836702).
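The "related up to a constant" part is essentially the standard quantization identity (this is the textbook version, not necessarily the exact statement of the paper above): bin a continuous X into cells of width \Delta and the discrete entropy of the binned variable is approximately the differential entropy minus \log\Delta,

    H(X^{\Delta}) = -\sum_i p_i \log p_i \;\approx\; h(X) - \log\Delta,
    \qquad p_i = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx \;\approx\; f(x_i)\,\Delta,

so the constant offset drops out of anything built from entropy differences (mutual information, relative entropy), which is why the maxent-style reasoning carries over.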
I realize this is a bit hand-wavy, but it is an HN post.
Thank you, I really appreciate the response. This was useful.
I do see the reasoning for choosing the normal due to it being the only distribution with finite non-zero moments, and thus, as you nicely pointed out, constraints on a finite number of higher order moments will not give a unique distribution.
But, due to the issues we've now mentioned, I find myself a bit uneasy about maxent as a derivation of, or an explanation for, the ubiquity of the normal distribution. So I'm more comfortable with some of the other derivations demonstrated by Jaynes.
And thank you for the paper reference; will have a proper look at it sometime. It might be related to
This is such an unrecognized part of the problem: overcredentialing. It's rampant: it's explicit in areas like healthcare and law, and implicit in the HR practices of many corporations.
We bitch about people going into debt, but turn around and are fine with companies being picky as hell about having a specific degree, as if that's everything about a person's ability or background. We also bitch about healthcare costs, but then act like the sky will fall if we start discussing the possibility of pharmacists, optometrists, or psychologists prescribing or offering more services. Your observation about law is equally astute.
I'm going to beat a dead horse until it rises from the grave, but this is the situation with liberal arts degrees as well: they're from a time when it was assumed that you could major in, say, philosophy, and take comp sci classes, do work in that area, and build up a career in comp sci without anyone questioning it. Now your local HR department uses that comp sci degree to screen you, as if you are your degree.
Everyone knows that these degrees are helpful but imperfect indicators, yet we treat them as perfect indicators because it's easier to maintain the myth, and the myth serves those who profit from rent-seeking and overregulation.
There are some recent studies suggesting that globally, on average, the most environmentally friendly diets have some animal product component, because animals can make use of land that we couldn't use directly. E.g., cattle can eat plants that grow in areas where we can't grow human-edible crops.
I can't find citations to these studies offhand unfortunately. (this is an example but not what I had in mind: www.ncbi.nlm.nih.gov/pmc/articles/PMC5522483/) But what I remember is that globally, the most environmentally friendly diets had some small animal product component.
This probably varies a lot by location too, so it may well be that for some people the best diet is vegan, while for others it involves more animal products.
Just pointing this out, because there are reasons beyond the political, psychological, and sociological ones for a "soft" approach to plant-based diets.
I think a key component here is just minimizing the meat-component of the diet.
Meat would not at all be a problem if produced in smaller amounts. The current level is just absurd waste (I would guess mostly to support fast-food production?), and we absolutely do not need this amount of meat in our diet.
It is, however, extremely nutritious, so a small amount of meat consumption would be both sufficient and much more sustainable. Even a tenth of the current scale would likely reduce the problems to insignificance.
This is true, but from what I've read it is only true for places that are small scale and local. Anything large scale, like feeding a city, would benefit greatly from a reduction, or elimination, of animal products.
I agree with a tiny caveat, in that I'd change Jeffreys prior to reference prior.
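(For anyone following along: the Jeffreys prior here is the usual Fisher-information one, and in regular one-parameter problems the reference prior coincides with it; they differ in multiparameter problems, which is where the caveat bites.)

    \pi_J(\theta) \;\propto\; \sqrt{\det I(\theta)},
    \qquad
    I(\theta)_{jk} = \mathbb{E}_\theta\!\left[
        \frac{\partial \log p(x \mid \theta)}{\partial \theta_j}\,
        \frac{\partial \log p(x \mid \theta)}{\partial \theta_k}
    \right],

and under a reparameterization \phi = g(\theta) the Jacobian factors cancel, so applying the same rule to \phi gives the same prior; that's the invariance property at stake.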
On the other hand, these priors can be difficult to construct in some (many?) situations, and it's often more tractable to do maximum likelihood (ML).
Bayesian inference seems more principled to me in general if you allow for and use reference priors, but outside of that I think there are still reasons to prefer ML. There are two areas where I still have problems with priors.
The first is that the sequential testing paradigm (that is, prior -> posterior -> prior, with one study's posterior becoming the next study's prior) doesn't always work in reality, because you often have multiple experimenters operating simultaneously and independently with different priors. In one sense this is a trivial problem, but in another sense it is not. E.g., if you are a meta-analyst faced with integrating such results, is prior variation akin to publication bias? What implications does that have?
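A minimal sketch of the multiple-experimenter version, with made-up numbers and standard Beta-Binomial conjugacy: the labs see the same data but walk away with different posteriors, and the spread is entirely prior-driven, which is exactly what the meta-analyst has to grapple with.

    # Toy Beta-Binomial example (hypothetical numbers): same data, different priors.
    from scipy.stats import beta

    successes, trials = 12, 40  # the shared experimental result

    priors = {
        "lab A, uniform Beta(1, 1)":    (1.0, 1.0),
        "lab B, optimistic Beta(8, 2)": (8.0, 2.0),
        "lab C, Jeffreys Beta(.5, .5)": (0.5, 0.5),
    }

    for name, (a0, b0) in priors.items():
        post = beta(a0 + successes, b0 + trials - successes)  # conjugate update
        print(f"{name}: posterior mean {post.mean():.3f}, "
              f"95% interval ({post.ppf(0.025):.3f}, {post.ppf(0.975):.3f})")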
The second is that there are situations in which using a prior might actually lead to unfair inequities. For example, say you're trying to make some inference about an individual, and you know that ethnicity provides information, in a statistical sense, about the parameter you're inferring. Is it prejudicial or not to use an informative prior? I think using a reference prior would address this situation, but depending on the scenario you could argue it's unfair either way (e.g., if the informative prior would suggest a positive outcome, not using it might be seen as prejudicial, but if it would suggest a negative outcome, using it might be seen as unfair). In this case, not using a prior at all might actually make sense. You could make a similar argument about non-Bayesian inference as about Bayesian reference inference, but non-prior-based inference does sidestep the issue in a sense, in that there is no longer a prior to decide about. This might be especially important when, e.g., you have a series of individuals and the act of choosing a prior might itself be seen as prejudicial.
I generally consider myself an "objective Bayesian" in the Jaynesian / reference prior sense, but there are practical and theoretical scenarios where I think people are likely to run into problems.
Jeffreys / reference priors also have some weird behaviour in high dimensions. You may enjoy this attempt to do better, without giving up reparameterization invariance:
With the birth of our daughter, we had a different, and more complicated, experience.
Pre-delivery there was no pressure to do a c-section. None at all. My wife definitely did not want one.
At the time of delivery, though, there was very much pressure to do a c-section. Although the admitting resident didn't seem to pressure my wife, care was quickly transferred (because of shift reasons) to other physicians (resident and attending) who did.
The way this manifested, though, was sort of subtle. For example, my wife had a procedure done to speed up the delivery; however, as we found out later, we were definitely not sufficiently informed of the consequences of the procedure, one of which was increased likelihood of a c-section. They tried to talk my wife into a c-section, and then when she declined, they tried to talk her into other procedures that would speed up the delivery, and would omit mention of the fact that they were associated with increased likelihood of c-section. Overall, even if c-sections weren't being explicitly mentioned, they were kind of relied on or assumed, for time and convenience reasons. The discussion was sort of like "Oh you don't want a c-section? Ok, then how about X to speed things up? Oh--I forgot to mention that now we probably have to do a c-section? Oops!" It came across as manipulative to me.
My wife did not have a c-section, but this was probably only because the nurses there (who were phenomenal) were actively arguing with the physicians to not do one, and to wait. We weren't really in the hospital that long either.
Am I missing something or are parts of this article really distorted?
For example, this seems to set up most of the article:
"Economics involves a lot of math and statistics. The most commonly used tools to crunch numbers are the spreadsheet software Microsoft Excel and programming languages Stata and Mathematica."
Is this really true? Mathematica and Stata seem like established but niche products to me at this point. I wouldn't say either of them are "the most commonly used tools to crunch numbers."
If you asked me to predict what a quantitative economist would be using, it would be Python, followed by R, and maybe followed by Java or C, or something like that.
This was an interesting article in the sense that I like learning these sorts of things about people, but the premise seemed off to me.
But I'm not an economist so maybe this is something about economics per se.
Most quantitative economists do not use Python. They use Stata, SAS, EViews, etc. For methods not yet implemented there, the go-to application is Matlab (matrix-oriented). Python has been gaining traction in the past five years, however.
Nate Silver shared that the vast majority of the latest FiveThirtyEight model for 2018 is constructed in Stata. Just an anecdote, but a fairly practical example of something recently written that's using that technology.
“If you asked me to predict what a quantitative economist would be using, it would be Python, followed by R, and maybe followed by Java or C, or something like that.”
I too was looking for good data. It’s hard to measure this as practical application doesn’t line up with publication.
I was surprised when I started working in health 10 years ago that the predominant tool was Excel, and then SAS. Even 5 years ago in health grad school, they only taught SAS.
This is slowly changing to R and Python, but general data analysis skills are still behind where basic engineering was 20 years ago.
No, by the text of that same link you provided: "... origin of this fallacy is probably related to the fact that data must be submitted in the XPT "transport format" (which was originally created by SAS)." That post does go on to say, "This [XPT] data format is now an open standard," but that is somewhat disingenuous. The XPT format requires IBM mainframe floats and other weirdness. It's not always that easy to write XPT.
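To make the "weirdness" concrete, here's a rough sketch (mine, untested against a real SAS reader, so treat it as illustrative only) of encoding a single numeric value the way XPT expects, i.e. as an 8-byte IBM System/360 hex float rather than an IEEE 754 double:

    import math

    def ieee_to_ibm_bytes(x: float) -> bytes:
        """Sketch: encode a Python float as an 8-byte IBM hex float (big-endian),
        which is what XPT numeric fields use. Ignores missing-value conventions
        and exponent overflow."""
        if x == 0.0:
            return b"\x00" * 8
        sign = 0x80 if x < 0 else 0x00
        m = abs(x)
        # Normalize to m = mant * 16**e with 1/16 <= mant < 1 (base-16 exponent, bias 64).
        e = math.floor(math.log(m, 16)) + 1
        mant = m / 16.0 ** e
        while mant >= 1.0:          # guard against rounding at power-of-16 boundaries
            mant /= 16.0
            e += 1
        while mant < 1.0 / 16.0:
            mant *= 16.0
            e -= 1
        frac = round(mant * (1 << 56))   # 56-bit fraction
        if frac >= (1 << 56):            # rounding pushed the fraction out of range
            frac >>= 4
            e += 1
        return bytes([sign | (e + 64)]) + frac.to_bytes(7, "big")

    assert ieee_to_ibm_bytes(1.0).hex() == "4110000000000000"

Mostly it shows why "just write the bytes" isn't trivial: you have to re-normalize to a base-16 exponent with a bias of 64 rather than reusing the IEEE bit layout.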
Depends on the field. In the social sciences, Mathematica and Stata are probably the most widely used tools. Might be that way for economics too, but I'm not sure.
I believe they prefer windowed (GUI) interfaces instead of scripting or programming their models.
Stata used to be only command-line driven and probably still has that functionality (I used it from 1997 to 2007); then again, as a mathematician, I prefer R, or any computer algebra system.
> I'm not an economist so maybe this is something about economics per se.
No, this is just a case of historical contingency, or the founder effect, depending on your preferred metaphor. Stata and Mathematica are suitable tools (and were suitable early on) that happened to be adopted by a few economists, whose choice was then spread and perpetuated via various organizations and institutions, word of mouth and curricula.
> If you asked me to predict what a quantitative economist would be using, it would be Python, followed by R, and maybe followed by Java or C, or something like that.
I've never heard of an economist using Java, and very few using C. Fortran is still quite popular in some areas.
Stata is extremely popular in some areas. GAUSS and Matlab in others (though GAUSS is declining for sure). R is quite popular, particularly since RStudio came along.
I'm not an economist so maybe this is something about economics.
Stata seems to be really popular in economics. Certainly when I was getting my degree, all the statistics, modelling, and econometrics courses at the economics department used either Excel or Stata. A few 'weird' kids used Matlab, but the words "Python" or "R" were never mentioned.
When I was finishing up grad school in 2013, I talked to a few econ grad students about what tools they were using for statistics and computing. It sounded like the people just starting out were generally familiar with and in favor of R, but among the older students it wasn't quite as popular. I don't remember anyone mentioning Python at the time.
I've spent most of my career working with economists, and they've got a strong preference for Stata. It's what most of them learned in university. I'm hopeful that it's a preference that will slowly be replaced with Python--we're already making that change at the think tank where I work.
Not sure if I'm left-leaning or not, but the problem is that we need increased competition.
The reason this is a problem is that increasing competition sometimes means decreased regulation, and sometimes it means providing more government services and more regulation. That doesn't fit neatly into either of the two major political parties.
For example, I'd probably be seen as even more radical about healthcare than what you're writing: I'd advocate eliminating licensure laws and radically reorienting the mission of the FDA, in part by removing its regulatory authority over a lot of things it currently oversees.
But I also advocate sharply reducing patents and copyright terms, and rolling out federal and municipal broadband. I also think public education needs a lot more funding.
It seems like political discussions in the US end up oriented around either protecting entrenched business interests or protecting citizens through increased regulation.
As someone who does research in this area, broadly defined, I think you're on to something, but I also think there are some misleading things about this article (which I nevertheless think is interesting) and caveats to what you're saying.
Lots of thoughts:
1. Intelligence is a broad construct. It is broad by definition, and it is not the only cognitive construct. It does have a lot of utility for certain purposes, though, such as identifying pervasive neurological disease.
As others are noting, this is relevant to the article in that we tend to focus on extremes when making these kinds of comparisons, when the full spectrum is really what's important sometimes. We tend to fixate on whether someone went to some prestigious university or less prestigious university, or whether our incomes are in the upper middle class or upper class, but in the sense of outcomes, compared to all outcomes, these can be relatively minor distinctions and hard to predict.
2. There are other variables that are relevant, like conscientiousness, ruthlessness, and so forth. This is certainly true.
3. There are still other variables that have nothing to do with the individuals involved, though. The elephant in the room is the set of societal and other random factors that prevent any individual attribute from mattering as much as it could. The article starts out by dismissing prediction among females out of hand because of societal limitations, which is reasonable. But there are lots of other variables involved, random and nonrandom societal and environmental forces at play. The hidden story is that there are limits to predicting outcomes at all from the individual at hand, meaning that other variables in the environment are at work.
4. Measurement of intelligence is fuzzy and imperfect, as you're alluding to. It's stochastically imprecise, in the sense that giving the same test twice, or two different tests, will give you somewhat different answers (see the toy simulation after this list). But it's also imperfect in that the thing it's measuring isn't really what we probably want to measure in an ideal case. Even if the tests were giving the same answer all the time, it wouldn't really be intelligence in the way we want to talk about intelligence.
5. I'm not sure that we really want cognitive functioning measures to be perfectly stable, because I don't think cognitive functioning is actually perfectly stable. It probably varies across the day, for example.
6. Physical measurements are certainly more precise. But the objects being measured are much less complex as systems. It's easier to talk about measuring the mass of a cubic meter of oxygen than it is to talk about measuring climatological variables; something analogous is at play with things like intelligence.
Also, even physical measurements at a certain level become fuzzy and highly interdependent. Measuring mass "precisely" depends on your scale and other variables.
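Here's the toy simulation I mentioned under point 4, with purely illustrative numbers (an IQ-like scale and a made-up error SD), just to show how much individual scores can move between administrations even when the test-retest correlation looks respectable:

    # Toy classical-test-theory simulation (illustrative numbers, not real IQ data).
    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    true_score = rng.normal(100, 15, n)      # latent trait on an IQ-like scale
    noise_sd = 7                             # assumed measurement error per administration
    test1 = true_score + rng.normal(0, noise_sd, n)
    test2 = true_score + rng.normal(0, noise_sd, n)

    reliability = np.corrcoef(test1, test2)[0, 1]
    print(f"test-retest correlation ~ {reliability:.2f}")
    print(f"median |change| on retest ~ {np.median(np.abs(test1 - test2)):.1f} points")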
I have the same concern about the ethics of this. I might have felt a little differently if this AI startup had paid for all of the data collection proactively (although I would still have had concerns about the exclusivity of any such agreements and what that means for patient access), but as it is this seems unethical.
The biomedical-industrial complex in the US makes my stomach churn. So many conflicts of interest, rent-seeking, monopolies, and nepotism.