Popular Blogs

Bayesian model-building by pure thought: Some principles and examples

Gelman's Blog - 10 hours 8 min ago

This is one of my favorite papers:

In applications, statistical models are often restricted to what produces reasonable estimates based on the data at hand. In many cases, however, the principles that allow a model to be restricted can be derived theoretically, in the absence of any data and with minimal applied context. We illustrate this point with three well-known theoretical examples from spatial statistics and time series. First, we show that an autoregressive model for local averages violates a principle of invariance under scaling. Second, we show how the Bayesian estimate of a strictly-increasing time series, using a uniform prior distribution, depends on the scale of estimation. Third, we interpret local smoothing of spatial lattice data as Bayesian estimation and show why uniform local smoothing does not make sense. In various forms, the results presented here have been derived in previous work; our contribution is to draw out some principles that can be derived theoretically, even though in the past they may have been presented in detail in the context of specific examples.

I just love this paper. But it’s only been cited 17 times (and four of those were by me), so I must have done something wrong. In retrospect I think it would’ve made more sense to write it as three separate papers; then each might have had its own impact. In any case, I hope the article provides some enjoyment and insight to those of you who click through.

Categories: Popular Blogs

What is a prior distribution?

Gelman's Blog - February 5, 2012

Some recent blog discussion revealed some confusion that I’ll try to resolve here.

I wrote that I’m not a big fan of subjective priors. Various commenters had difficulty with this point, and I think the issue was most clearly stated by Bill Jeffreys, who wrote:

It seems to me that your prior has to reflect your subjective information before you look at the data. How can it not?

But this does not mean that the (subjective) prior that you choose is irrefutable. Surely a prior that reflects prior information just does not have to be inconsistent with that information. But that still leaves a range of priors that are consistent with it, the sort of priors that one would use in a sensitivity analysis, for example.

I think I see what Bill is getting at. A prior represents your subjective belief, or some approximation to your subjective belief, even if it’s not perfect. That sounds reasonable but I don’t think it works. Or, at least, it often doesn’t work.

Let’s start with a simple example. You hop on a scale that gives unbiased measurements with errors that have a standard deviation of 0.1 kg. To do Bayesian analysis, you assign a N(0,10000^2) prior on your true weight. That doesn’t represent your subjective belief! It’s not even an approximation. No problem—it works fine for most purposes—but it’s not subjective.
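To see why such a prior "works fine," here is a minimal sketch of the standard normal-normal conjugate update. The scale reading of 70 kg and the function name `normal_posterior` are made up for illustration; only the 0.1 kg measurement error and the N(0,10000^2) prior come from the example above.

```python
def normal_posterior(prior_mean, prior_sd, y, data_sd):
    """Posterior mean and sd for a normal mean with known data sd."""
    prior_prec = 1.0 / prior_sd ** 2   # precision = 1 / variance
    data_prec = 1.0 / data_sd ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * y)
    return post_mean, post_var ** 0.5


# Hypothetical reading of 70 kg on the scale described above.
post_mean, post_sd = normal_posterior(prior_mean=0.0, prior_sd=10000.0,
                                      y=70.0, data_sd=0.1)
print(post_mean, post_sd)
```

With these inputs the posterior mean is essentially the reading itself and the posterior sd is essentially 0.1 kg: the prior contributes almost nothing, which is why it is harmless here even though nobody actually believes it.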

More generally, think of all the linear and logistic regressions we use. Instead of thinking of these as subjective beliefs, I prefer to think of the joint probability distribution as a model, reflecting a set of assumptions. In some settings these assumptions represent subjective beliefs, in other settings they don’t.

This article from 2002 might help. If I could go back and alter it, I’d add something on weakly informative priors, but I still agree with the general approach discussed there.

P.S. Just to give an example of what I mean by prior information: The analyses in Red State Blue State all use noninformative prior distributions. But a lot of prior information comes in, in the selection of what questions to study, what models to consider, and what variables to include in the model. For example, as state-level predictors we include region of the country, Republican vote in the previous presidential election, and average state income. Prior information goes into the choice and construction of all these predictors. But the prior distribution is a particular probability distribution that in this case is flat and does not reflect prior knowledge.

One way to think about informative prior distributions is as a form of smoothing: when setting the parameters of a probability distribution based on prior knowledge, we are imposing some time smoothness on the parameters. I think that’s probably a good idea and that the Red State Blue State analyses (among others) would be better for it. I didn’t set up this prior structure because I wasn’t easily equipped to do so and it seemed like too much effort, but perhaps at some future time this sort of structuring will be as commonplace as hierarchical modeling is today.

Categories: Popular Blogs

“Turn a Boring Bar Graph into a 3D Masterpiece”

Gelman's Blog - February 4, 2012

Jimmy sends in this.

Steps include “Make whimsical sparkles by drawing an ellipse using the Ellipse Tool,” “Rotate the sparkles . . . Give some sparkles less Opacity by using the Transparency Palette,” and “Add a haze around each sparkle by drawing a white ellipse using the Ellipse Tool.”

The punchline:

Now, the next time you need to include a boring graph in one of your designs you’ll be able to add some extra emphasis and get people to really pay attention to those numbers!

P.S. to all the commenters: Yeah, yeah, do your contrarian best and tell me why chartjunk is actually a good thing, how I’m just a snob, etc etc.

Categories: Popular Blogs

More on the economic benefits of universities

Gelman's Blog - February 4, 2012

Last year my commenters and I discussed Ed Glaeser’s claim that the way to create a great city is to “create a great university and wait 200 years.”

I passed this on to urbanist Richard Florida and received the following response:

This is a tough one with lots of causality issues. Generally speaking universities make places stronger. But this is mainly the case for smaller, college towns. Boulder, Ann Arbor and so on, which also have very high human capital levels and high levels of creative, knowledge and professional workers.

For big cities the issue is mixed. Take Pittsburgh with CMU and Pitt or Baltimore with Hopkins, or St Louis. The list goes on and on.

Kevin Stolarick and I framed this very crudely as a transmitter-receiver issue. The university in a city like this can generate a lot of signal, in terms of innovation or even human capital, and the city may not receive it or may push it away. A long-ago paper by Mike Fogarty showed how innovations in Pittsburgh and Cleveland, by universities in these communities, tended to be picked up in Silicon Valley or even Tokyo.

I responded: Another factor in the interaction is: how good does the university have to be? Glaeser cited UW and Seattle, but that’s kind of a funny example, because I don’t think UW was such a great university 30 years ago. On the other hand, given the existence of Boeing and Microsoft, UW is good enough to do the job of providing a center for the creative class. Perhaps Ohio State (another good but not great university) has played a similar role in Columbus.

Florida replied:

Better is better. I think both are over threshold, but having taught at OSU at the very beginning of my career, it brings both plusses and minuses. It was an open admission school. The faculty was very, very mixed. And a huge football factory. Gates and Allen among others have pumped big wads of cash into UW, and it is good in computers and biosciences.

Both strike me as regional talent hubs, which probably trumps university quality.

Portland is another outlier with lots of talent/ human capital attraction and pretty crappy universities.

Florida also sent along this article and this blog.

Also, Hal Varian wrote:

There is a literature that attempts to assess the impact of university research on the local economy. One person I know well who works in this area is Marie Thursby. Click on her vitae to see the kind of work that has been done. This is pretty careful research, though of course it is hard to pin down causality…

P.S. Originally I wrote that UW is not such a great university, which may or may not be true but is sort of beside the point since the real issue is whether UW’s past greatness contributed to Seattle’s current prosperity. So I clarified that I’m really talking about UW thirty or so years ago.

Categories: Popular Blogs

Web equation

Gelman's Blog - February 3, 2012

Aleks sends along this app which, while cute, is not quite “killer” for me. I find it more difficult to write the equation using the trackpad than to simply type it in using LaTeX! But I suppose it could be useful to beginners who want their papers to look more like science.

Categories: Popular Blogs

Philosophy of Bayesian statistics: my reactions to Senn

Gelman's Blog - February 3, 2012

Continuing with my discussion of the articles in the special issue of the journal Rationality, Markets and Morals on the philosophy of Bayesian statistics:

Stephen Senn, “You May Believe You Are a Bayesian But You Are Probably Wrong”:

I agree with Senn’s comments on the impossibility of the de Finetti subjective Bayesian approach. As I wrote in 2008, if you could really construct a subjective prior you believe in, why not just look at the data and write down your subjective posterior? The immense practical difficulties with any serious system of inference render it absurd to think that it would be possible to just write down a probability distribution to represent uncertainty. I wish, however, that Senn would recognize my Bayesian approach (which is also that of John Carlin, Hal Stern, Don Rubin, and, I believe, others). De Finetti is no longer around, but we are!

I have to admit that my own Bayesian views and practices have changed. In particular, I resonate with Senn’s point that conventional flat priors miss a lot and that Bayesian inference can work better when real prior information is used. Here I’m not talking about a subjective prior that is meant to express a personal belief but rather a distribution that represents a summary of prior scientific knowledge. Such an expression can only be approximate (as, indeed, assumptions such as logistic regressions, additive treatment effects, and all the rest, are only approximations too), and I agree with Senn that it would be rash to let philosophical foundations be a justification for using Bayesian methods. Rather, my work on the philosophy of statistics is intended to demonstrate how Bayesian inference can fit into a falsificationist philosophy that I am comfortable with on general grounds.

Categories: Popular Blogs

The inevitable problems with statistical significance and 95% intervals

Gelman's Blog - February 2, 2012

I’m thinking more and more that we have to get rid of statistical significance, 95% intervals, and all the rest, and just come to a more fundamental acceptance of uncertainty.

In practice, I think we use confidence intervals and hypothesis tests as a way to avoid acknowledging uncertainty. We set up some rules and then act as if we know what is real and what is not. Even in my own applied work, I’ve often enough presented 95% intervals and gone on from there. But maybe that’s just not right.

I was thinking about this after receiving the following email from a psychology student:

I [the student] am trying to conceptualize the lessons in your paper with Stern with comparing treatment effects across studies. When trying to understand if a certain intervention works, we must look at what the literature says. However this can be complicated if the literature has divergent results. There are four situations I am thinking of. For each of these situations, assume the studies are randomized control designs with the same treatment and outcome measures, and each situation refers to a different treatment. It is easiest for me to put it into a table. In each of these situations only 1 of 2 published studies is found to be statistically significant.

Situation    Study    Effect   se     Sig   Sig in diff   Result
1            A        .5       .05    Y     X             Treatment is effective
             B        .4       .2
2            C        .5       .1     Y     Y             Unclear, needs more replications
             D        .1       .1
3            E        .41      .2     Y     X             Unclear, needs more replications
             F        .14      .2
4            G        .7       .3     Y     X             Null/needs more replications
             H        .19      .1

Here, Situation 1 refers to 2 studies that have similar effects in magnitude, though the larger of the 2 studies (smaller se) is the only sig one. Since the difference between the two effects is itself not statistically significant, we should conclude treatment in situation 1 is effective (this seems to be in line with your paper).
In situation 2 there are 2 equally sized experiments that differ in treatment effect and significance. Since the difference between the estimates is statistically significant, one concludes the paradigm needs more replications.
In situation 3 the 2 studies have 2 effects, one is statistically significant while the other is not. However in this situation study F is neither statistically nor substantively significant. Unlike situation 1 it would seem unwise to conclude Treatment in situation 3 is effective and we need more replications.
Situation 4 is just some result I came across in a research synthesis, where a smaller study (larger se) had a statistically sig effect, but a larger one did not. It would seem in this situation the true effect is null and the stat sig effect is a type 1 error. However the difference between studies is not stat sig, would this matter?
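For reference, the arithmetic behind the student's table can be sketched in a few lines. This assumes the two studies in each situation are independent, so the standard error of the difference is the square root of the sum of the squared standard errors, and it uses the conventional two-sided 1.96 cutoff; the variable and function names are illustrative only.

```python
import math

# (effect, se) for the two studies in each of the student's four situations.
situations = {
    1: ((0.5, 0.05), (0.4, 0.2)),    # studies A, B
    2: ((0.5, 0.1), (0.1, 0.1)),     # studies C, D
    3: ((0.41, 0.2), (0.14, 0.2)),   # studies E, F
    4: ((0.7, 0.3), (0.19, 0.1)),    # studies G, H
}


def z_score(effect, se):
    """Estimate divided by its standard error."""
    return effect / se


def z_difference(first, second):
    """z-score of the difference between two independent estimates."""
    (e1, s1), (e2, s2) = first, second
    return (e1 - e2) / math.sqrt(s1 ** 2 + s2 ** 2)


results = {}
for label, (first, second) in situations.items():
    results[label] = (z_score(*first), z_score(*second),
                      z_difference(first, second))

for label, (z1, z2, zd) in results.items():
    print(f"Situation {label}: z1={z1:.2f}, z2={z2:.2f}, z_diff={zd:.2f}")
```

Only in situation 2 does the difference clear the 1.96 cutoff, matching the "Sig in diff" column. Note that Study B has z = .4/.2 = 2.0, which sits right at the conventional threshold, so whether it counts as "not significant" depends on rounding; that fragility is itself an instance of the problem under discussion.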

I replied that my quick reaction is that it would be better if there were data from more studies. With only two studies, your inference will necessarily depend on your prior information about effectiveness and variation of the treatments.

The student then wrote:

That is my reaction as well. Unfortunately sometimes the only data we have is from a small number of studies, and not enough to necessarily run a meta-analysis on. In addition, the hypothetical situations I sent you are sometimes all we know about the effectiveness and variation in treatments, because it is all the evidence we have. What I am trying to better understand is if your paper is addressing situation 1 ONLY, or if it is making inferences or statements about the evidence in the other situations I presented.

To which I replied that I don’t know that our paper gives any real recommendations. In a decision problem, I think ultimately it’s necessary to bite the bullet and decide what prior information you have on effectiveness rather than relying on statistical significance.

This is a problem under classical or Bayesian methods. Either way, it’s standard practice to summarize uncertainty in a way that encourages deterministic thinking.

Categories: Popular Blogs

Philosophy of Bayesian statistics: my reactions to Cox and Mayo

Gelman's Blog - February 1, 2012

The journal Rationality, Markets and Morals has finally posted all the articles in their special issue on the philosophy of Bayesian statistics.

My contribution is called Induction and Deduction in Bayesian Data Analysis. I’ll also post my reactions to the other articles. I wrote these notes a few weeks ago and could post them all at once, but I think it will be easier if I post my reactions to each article separately.

To start with my best material, here’s my reaction to David Cox and Deborah Mayo, “A Statistical Scientist Meets a Philosopher of Science.” I recommend you read all the way through my long note below; there’s good stuff throughout:

1. Cox: “[Philosophy] forces us to say what it is that we really want to know when we analyze a situation statistically.”

This reminds me of a standard question that Don Rubin (who, unlike me, has little use for philosophy in his research) asks in virtually any situation: “What would you do if you had all the data?” For me, that “what would you do” question is one of the universal solvents of statistics.

2. Mayo defines scientific objectivity as concerning “the goal of using data to distinguish correct from incorrect claims about the world” and contrasts this with so-called objective Bayesian statistics. All I can say here is that the terms “subjective” and “objective” seem way overloaded at this point. To me, science is objective in that it aims for reproducible findings that exist independent of the observer, and it’s subjective in that the process of science involves many individual choices. And I think the statistics I do (mostly, but not always, using Bayesian methods) is both objective and subjective in that way.

3. Cox discusses Fisher’s rule that it’s ok to use prior information in design of data collection but not in data analysis. Like a lot of hundred-year-old ideas, this rule makes sense in some contexts but not in others. Consider the notorious study in which a random sample of a few thousand people was analyzed, and it was found that the most beautiful parents were 8 percentage points more likely to have girls, compared to less attractive parents. The result was statistically significant (p<.05) and published in a reputable journal. But in this case we have good prior information suggesting that the difference in sex ratios in the population, comparing beautiful to less-beautiful parents, is less than 1 percentage point. A classical design analysis reveals that, with this level of true difference, any statistically-significant observed difference in the sample is likely to be noise. (Even conditional on statistical significance, the observed difference has an over 40% chance of being in the wrong direction and will overestimate the population difference by an order of magnitude.) At this point, you might well say that the original analysis should never have been done at all; but, given that it has been done, it is essential to use prior information to interpret the data and generalize from sample to population.

Where did Fisher’s principle go wrong here? The answer is simple—and I think Cox would agree with me here. We’re in a setting where the prior information is much stronger than the data. If one’s only goal is to summarize the data, then taking the difference of 8% (along with a confidence interval and even a p-value) is fine. But if you want to generalize to the population—which was indeed the goal of the researcher in this example—then it makes no sense to stop there.
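The design analysis mentioned above for the sex-ratio study can be sketched as follows. The inputs are assumptions, not numbers from this post: a true difference of 0.3 percentage points (the post says only "less than 1") and a standard error of 3.3 percentage points. The function name `design_analysis` is likewise just for illustration. Given a normally distributed estimate, it computes the power, the probability that a statistically significant estimate has the wrong sign, and the expected factor by which a significant estimate overstates the true effect.

```python
import math


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def design_analysis(true_effect, se, z_crit=1.96):
    """Power, wrong-sign probability, and exaggeration ratio.

    Assumes estimate ~ Normal(true_effect, se^2) and true_effect > 0.
    """
    hi = z_crit - true_effect / se     # standardized upper cutoff
    lo = -z_crit - true_effect / se    # standardized lower cutoff
    p_pos = 1.0 - normal_cdf(hi)       # significant and positive
    p_neg = normal_cdf(lo)             # significant and negative
    power = p_pos + p_neg
    wrong_sign = p_neg / power
    # E[|estimate|; significant], from truncated-normal means.
    abs_mean = (true_effect * p_pos + se * normal_pdf(hi)
                - true_effect * p_neg + se * normal_pdf(lo))
    exaggeration = abs_mean / power / true_effect
    return power, wrong_sign, exaggeration


# Assumed inputs: true difference 0.3 points, standard error 3.3 points.
power, wrong_sign, exaggeration = design_analysis(true_effect=0.3, se=3.3)
print(f"power={power:.3f}, wrong sign={wrong_sign:.2f}, "
      f"exaggeration={exaggeration:.0f}x")
```

Under these assumed inputs the power is close to the 5% false-positive rate, the wrong-sign probability is around 40%, and a significant estimate overstates the true difference by more than an order of magnitude. Smaller assumed true differences push both numbers higher.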

Cox illustrates the difficulty in a later quote: “[Bayesians'] conceptual theories are trying to do two entirely different things. One is trying to extract information from the data, while the other, personalistic theory, is trying to indicate what you should believe, with regard to information from the data and other, prior, information treated equally seriously. These are two very different things.”

Yes, but Cox is missing something important! He defines two goals:
(a) Extracting information from the data.
(b) A “personalistic theory” of “what you should believe.”
I’m talking about something in between, which is inference for the population. I think Laplace would understand what I’m talking about here. The sample is (typically) of no interest in itself, it’s just a means to learning about the population. But my inferences about the population aren’t “personalistic”—at least, no more than the dudes at CERN are personalistic when they’re trying to learn about particle theory from cyclotron experiments, and no more than the Census and the Bureau of Labor Statistics are personalistic when they’re trying to learn about the U.S. economy from sample data.

4. Cox: “There are situations where it is very clear that whatever a scientist or statistician might do privately in looking at data, when they present their information to the public or government department or whatever, they should absolutely not use prior information, because the prior opinions on some of these prickly issues of public policy can often be highly contentious with different people with strong and very conflicting views.”

Maybe. But I don’t think Cox even believes this statement himself if it were taken literally. For example, right now I’m working on the politically controversial problem of reconstructing historical climate from tree rings. We have a lot of prior information on the processes under which tree rings grow and how they are measured. I don’t think anyone would want to just take raw numbers from core samples as a climate estimate! All the tools from Statistical Methods for Research Workers won’t take you from tree rings to temperature estimates. You need some scientific knowledge and prior information on where these measurements came from.

So let me interpret what I think Cox was saying. I take him to be dividing any scientific inference into two parts, inside and outside. Priors are allowed in the inside work of scientific modeling, which uses lots of external information, from the basic assumptions that the data correspond to your scientific goals, through the mathematical form of the transfer function, down to details such as an assumption of normally-distributed measurement errors, which might be supported based on prior experimental evidence. But Cox would prefer to avoid priors in the outside problem. In my example, I assume he’d allow prior information on the tree-ring measurement process—I don’t see how you can get anywhere otherwise—but he’d rather not combine with external estimates of the temperature series. That’s a tenable position. It doesn’t avoid all the controversy—manipulations of the data model can map in predictable ways to changes in the final inferences—but it could make sense.

I’ve followed this approach in much of my own applied work, using noninformative priors and carefully avoiding the use of prior information in the final stages of a statistical analysis. But that can’t always be the right choice. Sometimes (as in the sex ratio example above), the data are just too weak, and a classical textbook data analysis can be misleading. Imagine a Venn diagram, where one circle is “Topics that are so controversial that we want to avoid using prior information in the statistical analysis” and the other circle is “Problems where the data are weak compared to prior information.” If you’re in the intersection of these circles, you have to make some tough choices!

More generally, there is a Bayesian solution to the problem of sensitivity to prior assumptions. That solution is sensitivity analysis: perform several analyses using different reasonable priors. Make more explicit the mapping from prior and data to conclusions. Be open about sensitivity, don’t try to sweep the problem under the rug, etc etc. And, if you’re going that route, I’d also like to see some analysis of sensitivity to assumptions that are not conventionally classified as “prior.” You know, those assumptions that get thrown in because they’re what everybody does. For example, Cox regression is great, but additivity is a prior assumption too! (One might argue that assumptions such as additivity, logistic links, etc., are exempt from Fisher’s strictures by virtue of being default assumptions rather than being based on prior information—but I certainly don’t think Mayo would take that position, given her strong feelings on Bayesian default priors.)
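As a rough sketch of what such a sensitivity analysis might look like, take the sex-ratio example with a normal prior centered at zero and vary the prior standard deviation. The estimate of 8 percentage points is from the example above; the standard error of 3.3 points and the particular prior scales are assumptions chosen only for illustration.

```python
def posterior_mean_sd(prior_sd, estimate, se, prior_mean=0.0):
    """Normal-normal posterior for one estimate with known standard error."""
    weight = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)  # weight on the data
    mean = prior_mean + weight * (estimate - prior_mean)
    sd = (weight * se ** 2) ** 0.5
    return mean, sd


# Assumed prior scales, from tight to essentially flat (percentage points).
prior_sds = [0.5, 1.0, 3.0, 100.0]
sensitivity = {sd: posterior_mean_sd(sd, estimate=8.0, se=3.3)
               for sd in prior_sds}

for sd, (post_mean, post_sd) in sensitivity.items():
    print(f"prior sd {sd:6.1f}: posterior mean {post_mean:5.2f}, "
          f"posterior sd {post_sd:4.2f}")
```

Laying the results side by side makes the mapping from prior and data to conclusions explicit: with a prior scale of a point or less the posterior mean stays under one percentage point, while the nearly flat prior simply returns the raw estimate.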

My point here is that all statistical methods require choices—assumptions, if you will. Not all your choices can be determined or even validated from the data at hand. If you don’t want your choices to be based on prior information, what other options do you have? You can rely on convention—using methods that appear in major textbooks and have stood the test of time—or maybe on theory. Both these meta-foundational approaches have their virtues but neither is perfect: Conventional methods are not necessarily good (as can be seen by noting that for many problems there are multiple conventional methods that give different results), and theory often doesn’t help (for example classical confidence intervals and hypothesis tests are insufficient in the simple sex-ratio problem noted above).

Categories: Popular Blogs

“the forces of native stupidity reinforced by that blind hostility to criticism, reform, new ideas and superior ability which is human as well as academic nature”

Gelman's Blog - January 31, 2012

Q. D. Leavis wrote:

The answer does seem to be that the academic world, like other worlds, is run by the politicians, and sensitively scrupulous people tend to leave politics to other people, while people with genuine work to do certainly have no time as well as no taste for committee-rigging and the associated techniques. And then of course there are the forces of native stupidity reinforced by that blind hostility to criticism, reform, new ideas and superior ability which is human as well as academic nature.

Not that I’ve ever read anything by Mrs. Leavis (or, as the Brits used to write, Mrs Leavis). The above quote is one of the epigraphs to a book by Richard Kostelanetz. Whom I’ve never heard of, except in a footnote in John Rodden’s classic Orwell study, The Politics of Literary Reputation.

I’ll have more to say about Orwell in another post, but for now let me return to the above Leavis quote, to which I have three reactions:

1. On a personal level, I’m on Leavis’s side. I’d much rather work (or blog, which I feel is related to my work and is also a public service) than spend time on academic politics: forming coalitions, doing the pre-meeting meetings, trading favors, kissing up and kicking down, and all the rest.

To put it another way, I don’t like political games because (a) I’m not good at manipulation and deception, and (b) Much of politics is zero-sum, and I prefer to collaborate in positive-sum activities such as writing Stan.

2. But on a more practical level, somebody needs to do the dirty work. Every once in a while. I’ve encountered some administrators who are good at “committee-rigging,” etc., and others who show less political ability. I’ve seen people use political processes in a pointless destructive way (power for the sake of power), but others can use their political skills to foster smooth cooperation.

To put it another way, I require the political efforts of others to create the safe space I need to do my work. And it’s a special bonus when these political efforts are not “reinforced by that blind hostility to criticism, reform, new ideas and superior ability.”

3. As a political scientist, I recognize that politics is necessary. There’s no such thing as a non-political process. Politics is how we fight against entropy. Whatever non-politicized zones we have in life are often the result of continued political effort. As the saying goes, the price of liberty is eternal vigilance.

Ultimately I’ll have to go with #3.

Categories: Popular Blogs

Statistical Murder

Gelman's Blog - January 30, 2012

Robert Zubrin writes in “How Much Is an Astronaut’s Life Worth?” (Reason, Feb 2012):

…policy analyst John D. Graham and his colleagues at the Harvard Center for Risk Analysis found in 1997 that the median cost for lifesaving expenditures and regulations by the U.S. government in the health care, residential, transportation, and occupational areas ranges from about $1 million to $3 million spent per life saved in today’s dollars. The only marked exception to this pattern occurs in the area of environmental health protection (such as the Superfund program) which costs about $200 million per life saved.

Graham and his colleagues call the latter kind of inefficiency “statistical murder,” since thousands of additional lives could be saved each year if the money were used more cost-effectively. To avoid such deadly waste, the Department of Transportation has a policy of rejecting any proposed safety expenditure that costs more than $3 million per life saved. That ceiling therefore may be taken as a high-end estimate for the value of an American’s life as defined by the U.S. government.

This reminds me of my old article on Value of Life – where the hidden cost of the Iraq war for the US comes to 720,000 lives lost (based on the huge cost).

Categories: Popular Blogs

A tax on inequality, or a tax to keep inequality at the current level?

Gelman's Blog - January 30, 2012

My sometime coauthor Aaron Edlin cowrote (with Ian Ayres) an op-ed recommending a clever approach to taxing the rich.

In their article they employ a charming bit of economics jargon, using the word “earn” to mean “how much money you make.” They “propose an automatic extra tax on the income of the top 1 percent of earners.” I assume their tax would apply to unearned income as well, but they (or their editor at the Times) are just so used to describing income as “earnings” that they just threw that in. Funny.

Also, there’s a part of the article that doesn’t make sense to me.

Ayres and Edlin first describe the level of inequality:

In 1980 the average 1-percenter made 12.5 times the median income, but in 2006 (the latest year for which data is available) the average income of our richest 1 percent was a whopping 36 times greater than that of the median household.

Then they lay out their solution:

Enough is enough. . . . we propose an automatic extra tax on the income of the top 1 percent of earners — a tax that would limit the after-tax incomes of this club to 36 times the median household income.

This seems fair enough to me, but one thing puzzles me: my impression is that Ayres and Edlin feel that the rich have too much as it is already. So why freeze inequality at the current rate? (Yes, inequality could decline, but if it’s on an inexorable upward trend, my quick guess would be that maxing this ratio at 36 would be nearly equivalent to setting the ratio to 36.) Given the U.S. budget crisis, why 36? Why not 30, or 20, or 15?

P.S. When we last heard from Ayres he was supplying advice for young people who were rich or expecting to be rich. So I think it’s fair to say he’s no class warrior, that he’d like to keep income inequality at the current level but no lower.

And please note that I’m neither endorsing the Ayres/Edlin plan nor criticizing it. (Given my lack of expertise in macroeconomics, I’m certainly not the one you’d go running to, asking for an informed opinion on a proposed tax plan.) I’m just asking a question.

Categories: Popular Blogs

Convenient page of data sources from the Washington Post

Gelman's Blog - January 30, 2012

Wayne Folta points us to this list.

Categories: Popular Blogs

G+ > Skype

Gelman's Blog - January 29, 2012

I spoke at the University of Kansas the other day. Kansas is far away so I gave the talk by video. We did it using a G+ hangout, and it worked really well, much much better than when I gave a talk via Skype. With G+, I could see and hear the audience clearly, and they could hear me just fine while seeing my slides (or my face, I went back and forth). Not as good as a live presentation but pretty good, considering.

P.S. And here’s how to do it!

Conflict of interest disclaimer: I was paid by Google last year to give a short course.

Categories: Popular Blogs

How many parameters are in a multilevel model?

Gelman's Blog - January 29, 2012

Stephen Collins writes:

I’m reading your Multilevel modeling book and am trying to apply it to my work. I’m concerned with how to estimate a random intercept model if there are hundreds/thousands of levels. In the Gibbs sampling, am I sampling a parameter for each level? Or, just the hyper-parameters? In other words, say I had 500 zipcode intercepts modeled as ~ N(m,s). Would my posterior be two dimensional, sampling for “m” and “s,” or would it have 502 dimensions?

My reply: Indeed you will have hundreds or thousands of parameters—or, in classical terms, hundreds or thousands of predictive quantities. But that’s ok. Even if none of those predictions is precise, you’re learning about the model.

See page 526 of the book for more discussion of the number of parameters in a multilevel model.
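To make the dimension count concrete, here is a minimal sketch of a Gibbs sampler for a random-intercept model, using only the Python standard library. It is not the implementation from the book. It assumes a known data-level standard deviation `sigma_y` and flat priors on `m` and `s`, and it runs on simulated data; the function name and all numerical settings are made up for illustration.

```python
import random


def gibbs_random_intercepts(y_by_group, sigma_y, n_iter, seed=0):
    """Gibbs sampler for y_ij ~ N(a_j, sigma_y^2), a_j ~ N(m, s^2).

    Assumptions of this sketch: sigma_y is known, and m and s have flat priors.
    Every saved draw contains all J intercepts followed by m and s.
    """
    rng = random.Random(seed)
    J = len(y_by_group)
    n = [len(y) for y in y_by_group]
    sums = [sum(y) for y in y_by_group]
    a = [tot / k for tot, k in zip(sums, n)]  # start at the group means
    m = sum(a) / J
    s2 = 1.0
    draws = []
    for _ in range(n_iter):
        # 1. Update each intercept given the hyperparameters (normal conditional).
        for j in range(J):
            prec = n[j] / sigma_y ** 2 + 1.0 / s2
            mean = (sums[j] / sigma_y ** 2 + m / s2) / prec
            a[j] = rng.gauss(mean, prec ** -0.5)
        # 2. Update m given the intercepts and s.
        m = rng.gauss(sum(a) / J, (s2 / J) ** 0.5)
        # 3. Update s^2 given the intercepts and m: with a flat prior on s this
        #    is inverse-gamma with shape (J - 1) / 2 and scale ss / 2.
        ss = sum((a_j - m) ** 2 for a_j in a)
        s2 = (ss / 2.0) / rng.gammavariate((J - 1) / 2.0, 1.0)
        draws.append(a[:] + [m, s2 ** 0.5])
    return draws


# Simulated stand-in for 500 zipcode intercepts (made-up values, not real data).
sim = random.Random(1)
J = 500
true_a = [sim.gauss(2.0, 1.0) for _ in range(J)]
y_by_group = [[sim.gauss(a_j, 1.0) for _ in range(5)] for a_j in true_a]

draws = gibbs_random_intercepts(y_by_group, sigma_y=1.0, n_iter=200)
print(len(draws[0]))  # 502: the 500 intercepts plus m and s
```

Each iteration updates every intercept as well as the two hyperparameters, so each posterior draw is a point in 502 dimensions, matching the count in the question.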

Categories: Popular Blogs

Using predator-prey models on the Canadian lynx series

Gelman's Blog - January 28, 2012

The “Canadian lynx data” is one of the famous examples used in time series analysis. And the usual models that are fit to these data in the statistics time-series literature don’t work well. Cavan Reilly and Angelique Zeringue write:

Reilly and Zeringue then present their analysis. Their simple little predator-prey model with a weakly informative prior way outperforms the standard big-ass autoregression models. Check this out:

Or, to put it into numbers, when they fit their model to the first 80 years and predict to the next 34, their root mean square out-of-sample error is 1480 (see scale of data above). In contrast, the standard model fit to these data (the SETAR model of Tong, 1990) has more than twice as many parameters but gets a worse-performing root mean square error of 1600, even when that model is fit to the entire dataset. (If you fit the SETAR or any similar autoregressive model to the first 80 years and use it to predict the next 34, the predictions are a disaster—the predicted values quickly go toward the mean and can’t even attempt to track the curve.)

As Reilly and Zeringue note, the above graph shows potential room for improvement in the model, but even as is, it shows the huge benefits that can be obtained by attempting to model the underlying process rather than simply fitting the data using a conventional family of models.

(It’s funny for me to emphasize this point, given how often I use conventional models such as linear and logistic regression.)
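The remark above that autoregressive predictions "quickly go toward the mean" can be illustrated with the simplest possible case. The sketch below uses a stationary AR(1) model as a stand-in; it is not the SETAR model, and the mean, coefficient, and starting value are made-up numbers, not estimates from the lynx data. For an AR(1) process the k-step-ahead point forecast is mean + phi^k * (last value - mean), so the forecast decays geometrically and cannot follow a cycle.

```python
def ar1_forecast(last_value, mean, phi, horizon):
    """k-step-ahead point forecasts from a stationary AR(1) model, k = 1..horizon."""
    return [mean + phi ** k * (last_value - mean) for k in range(1, horizon + 1)]


# Hypothetical values chosen only for illustration.
forecasts = ar1_forecast(last_value=3000.0, mean=1500.0, phi=0.8, horizon=34)

for k in (1, 5, 10, 34):
    print(k, round(forecasts[k - 1], 1))
```

With these settings the forecast starts 1200 above the mean after one step and is within 1 of the mean by step 34. A predator-prey model, by contrast, can keep oscillating over the whole forecast window because the cycle is built into its structure.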

P.S. The title and text above have been modified to reflect comments below with reference to models fit to the lynx data in the ecology literature. There appears to be not enough communication between ecologists and statisticians. The statistical point above still holds—a simple model with some reasonable structure can outperform a generic data-fitting model such as an autoregression—but you should probably check out some of the references given in the comments if you’re interested in the lynx example or ecology models more generally.

Categories: Popular Blogs

Educational monoculture

Gelman's Blog - January 27, 2012

John Cook writes that he’d like to hear more people talk about “educational monoculture.” I don’t actually know John Cook but I enjoy reading his blog, so I feel like the least I can do is to honor his request.

I have to admit that I have a bit of a monocultural temperament myself. I have strong feelings about the right and wrong way to do things, and I don’t have much patience for what seems to me to be the wrong way. As a result, I’ve often disparaged or ignored important statistical developments because some small aspect of the new idea didn’t fit with my thinking. (On the plus side, I think I’ve disparaged or ignored lots more bad ideas that deserve oblivion.)

I’ve always been suspicious of the hedgehog/fox distinction because my impression is that just about everybody likes to think of him or herself as a fox. Being a hedgehog is like being “ideological”; most of us like to think of ourselves as pragmatic foxes. And in any case I think most statisticians are foxes.

One of the many positive outcomes of my mugging at Berkeley was a commitment to pluralism (for example, see here).

Beyond this, I move away from my natural monocultural instincts by teaching classes that include material I wouldn’t otherwise cover, by listening carefully to people I respect who do things in a different way than I do, and by thinking hard about why certain methods or attitudes that seem silly to me still remain popular.

Finally, my approach as a political scientist and public opinion researcher is to understand the views of others. I think I have a pretty good grip on why it can make sense for people to vote for Gingrich or Romney or Obama or Santorum or whatever, and I’m interested in understanding political ideologies as they manifest themselves in different areas (even in statistics, where political views range from Dennis Lindley to Jacob Wolfowitz).

“Moving beyond monoculture” doesn’t mean that I abandon my skepticism but it means that I should at least try to understand other approaches to looking at the world.

P.S. I thought the above discussion would be more useful than yet another argument about the extent to which modern education is such a scam etc.

Categories: Popular Blogs

Suggested resolution of the Bem paradox

Gelman's Blog - January 26, 2012

There has been an increasing discussion about the proliferation of flawed research in psychology and medicine, with some landmark events being John Ioannidis’s article, “Why most published research findings are false” (according to Google Scholar, cited 973 times since its appearance in 2005), the scandals of Marc Hauser and Diederik Stapel, two leading psychology professors who resigned after disclosures of scientific misconduct, and Daryl Bem’s dubious recent paper on ESP, published to much fanfare in Journal of Personality and Social Psychology, one of the top journals in the field.

Alongside all this are the plagiarism scandals, which are uninteresting from a scientific context but are relevant in that, in many cases, neither the institutions housing the plagiarists nor the editors and publishers of the plagiarized material seem to care. Perhaps these universities and publishers are more worried about bad publicity (and maybe lawsuits, given that many of the plagiarism cases involve law professors) than they are about scholarly misconduct.

Before going on, perhaps it’s worth briefly reviewing who is hurt by the publication of flawed research. It’s not a victimless crime. Here are some of the malign consequences:

- Wasted time and resources spent by researchers trying to replicate non-findings and chasing down dead ends.

- Fake science news bumping real science news off the front page.

- When the errors and scandals come to light, a decline in the prestige of higher-quality scientific work.

- Slower progress of science, delaying deeper understanding of psychology, medicine, and other topics that we deem important enough to deserve large public research efforts.

This is a hard problem!

There’s a general sense that the system is broken with no obvious remedies. I’m most interested in presumably sincere and honest scientific efforts that are misunderstood and misrepresented into more than they really are (the breakthrough-of-the-week mentality criticized by Ioannidis and exemplified by Bem). As noted above, the cases of outright fraud have little scientific interest but I brought them up to indicate that, even in extreme cases, the groups whose reputations seem at risk from the unethical behavior often seem more inclined to bury the evidence than to stop the madness.

If universities, publishers, and editors are inclined to look away when confronted with out-and-out fraud and plagiarism, we can hardly be surprised if they’re not aggressive against merely dubious research claims.

In the last section of this post, I briefly discuss several examples of dubious research that I’ve encountered, just to give a sense of the difficulties that can arise in evaluating such reports.

What to do (statistics)?

My generic solution to the statistics problems involved in estimating small effects is to replace multiple comparisons by multilevel modeling, that is, to estimate configurations rather than single effects or coefficients. This tactic won’t solve every problem but it’s my overarching conceptual framework. There’s lots of room for research on how to do better in particular problem settings.
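As a rough illustration of what a multilevel model does to a set of noisy effect estimates, here is a minimal sketch of partial pooling in the normal-normal case. It assumes the standard error `se`, the population mean `mu`, and the population standard deviation `tau` are all known, which a real analysis would not assume (they would be estimated along with the effects). The function name and the numbers are hypothetical.

```python
def partial_pool(estimates, se, mu, tau):
    """Posterior means for effects theta_j ~ N(mu, tau^2) given y_j ~ N(theta_j, se^2)."""
    # Weight on the raw estimate: data precision over total precision.
    weight = (1.0 / se ** 2) / (1.0 / se ** 2 + 1.0 / tau ** 2)
    return [mu + weight * (y - mu) for y in estimates]


# Hypothetical raw estimates, each with standard error 2, drawn from a
# population of true effects centered at 0 with standard deviation 1.
raw = [5.0, -4.0, 0.5]
pooled = partial_pool(raw, se=2.0, mu=0.0, tau=1.0)
print(pooled)
```

When true effects are small relative to the noise, as here, every estimate is pulled strongly toward the population mean, so the most extreme raw estimates no longer stand out the way they would in a series of separate significance tests.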

What to do (scientific publishing)?

I have clearer ideas of resolutions (at least in the short term) of the Bem paradox; in short, what to do with dubious but potentially interesting findings.

So far there seem to be two suggestions out there: Either publish such claims in top journals (as for example Bem’s in JPSP, or the contagion-of-obesity paper in NEJM), or the journals should reject them (perhaps from some combination of more careful review of methodology, higher standards than classical 5% significance, and Bayesian skepticism).

The problem with the publish-in-top-journals strategy is that it ensures publicity for some mistakes and it creates incentives for researchers to stretch their statistics to get a prestigious publication.

The problem with the reject-’em-all-and-let-the-Arxiv-sort-’em-out strategy is that it’s perhaps too rigorous. So many papers have potential methodological flaws. Recall that the Bem paper was published, which means in some sense that its reviewers thought the paper’s flaws were no worse than what usually gets published in JPSP. Long-term, sure, we’d like to improve methodological rigor, but in the meantime a key problem with Bem’s paper was not just its methodological flaws, it was also the implausibility of the claimed results.

So here’s my proposed solution. Instead of publishing speculative results in top journals such as JPSP, Science, Nature, etc., publish them in lower-ranked venues. For example, Bem could publish his experiments in some specialized journal of psychological measurement. If the work appears to be solid (as judged by the usual corps of referees), then publish it, get it out there. I’m not saying to send the paper to a trash journal; if it’s good stuff it can go in a good journal, the sort where peer review really means something. (I assume there’s also a journal of parapsychology but that’s probably just for true believers; it’s fair enough that Bem etc would like to publish somewhere that outsiders would respect.)

Under this system, JPSP could feel free to reject the Bem paper on the grounds that it’s too speculative to get the journal’s implicit endorsement. This is not suppression or censorship or anything like it, it’s just a recommendation that the paper be sent to a more specialized journal where there will be a chance for criticism and replication. At some point, if the findings are tested and replicated and seem to hold up, then it could be time for a publication in JPSP, Science, or Nature.

From the other side, this should be acceptable to the Bems and Fowlers who like to work on the edge. You still get your ideas out there in a respectable publication (and you still might even get a bit of publicity), and then you, the skeptics, and the rest of the scientific community can go at it in public.

There have also been proposals for more interactive publications of individual articles, with bloglike opportunities for discussion and replies. That’s fine too, but I think the only way to make real progress here is to accept that no individual article will tell the whole story, especially if the article is a report of new research. If the Bem finding is real, this can be demonstrated in a series of papers in some specialized journal.

Appendix: Individual cases can be tough!

I’ve encountered a lot of these borderline research findings over the past several years, and my own reaction is typically formed by some mix of my personal scientific knowledge, the statistical work involved, and my general impressions. Here are a few examples:

“Beautiful women have more daughters”: I was pretty sure this one was empty just based on my background knowledge (the claim was a difference of 8 percentage points, which is much more than I could possibly expect based on the literature). Careful review of the articles led me to find problems with the statistics.

Dennis the dentist, Laura the lawyer, and the proclivity of Dave Kingman and Vince Koleman to strike out a lot: I was ready to believe the Dennis/Laura effect on occupations and only slightly skeptical of the K effect on strikeouts, but then the work was later strongly criticized on methodological grounds. Still, my back-of-the-envelope calculation led me to believe that the hypothesized effects could be there.

Warming increases the risk of civil war in Africa: This one certainly could be true but something about it rang some bells in my head and I’m skeptical. The statistical evidence here is vague enough that I could well take the opposite tack, believing the claim and being skeptical about skepticism of it. To be honest, if I knew these researchers personally I might very well be more inclined to trust the result. (And that’s not so silly: if I knew them personally I could ask them a bunch of questions and get a sense of where their belief in this finding is coming from.)

“45% hitting, 25% fielding, and 25% pitching”: I was skeptical here because it was presented as a press release with no link to the paper but with enough details to make me suspect that the statistical analysis was pretty bad.

“Minority rules: scientists discover tipping point for the spread of ideas”: I don’t know if this should be called “junk science” or just a silly generalization from a mathematical model. Here I was suspicious because the claim was logically inconsistent and the study as a whole fit the pattern of physicists dabbling in social science. (As I wrote at the time, I’ll mock what’s mockable. If you don’t want to be mocked, don’t make mockable claims.)

“Discovered: the genetic secret of a happy life”: There’s potentially something here but the differences are much smaller than implied by the headlines, the news articles, or even the abstract of the published article.

Whatever medical breakthrough happens to have been reported in the New York Times this week: I believe all of these. Even though I know that these findings don’t always persist, when I see it in the newspaper and I know nothing about the topic, I’m inclined to just believe.

That’s one reason the issue of flawed research is important! I’m as well prepared as anyone to evaluate research claims, but as a consumer I can be pretty credulous when the research is not close to my expertise.

If there is any coherent message from the above examples, it is that my own rules for how to evaluate research claims are not clear, even to me.

Categories: Popular Blogs

Chris Schmid on Evidence Based Medicine

Gelman's Blog - January 25, 2012

Chris Schmid is a statistician at New England Medical Center who is an expert on evidence-based medicine. I invited him to present an introductory overview lecture on the topic at last year’s Joint Statistical Meetings, and here are his slides. All 123 of them. I don’t know how he expected to go through all of these in an hour. You could teach a semester-long course based on this material.

Good stuff, I recommend you all read it.

Categories: Popular Blogs

Difficulties in publishing non-replications of implausible findings

Gelman's Blog - January 24, 2012

Eric Tassone points me to this news article by Christopher Shea on the challenges of debunking ESP. Shea writes:

Earlier this year, a major psychology journal published a paper suggesting that there was some evidence for “pre-cognition,” a form of ESP. Stuart Ritchie, a doctoral student at the University of Edinburgh, is part of a team that tried, but failed, to replicate those results. Here, he tells the Chronicle of Higher Education’s Tom Bartlett about the difficulties he’s had getting the results published.

Several journals told the team they wouldn’t publish a study that did no more than disprove a previous study. . . . An editor at another journal said he’d “only accept our paper if we ran a fourth experiment where we got a believer [in ESP] to run all the participants, to control for . . . experimenter effects.”

My reaction is, this isn’t as easy a question as it might seem. At first, one’s reaction might share Ritchie’s frustration that a shoddy paper by Bem got published while Ritchie’s careful replication got dinged. But, as I wrote when the issue came up on the sister blog:

Setting aside the whole “psychic powers” thing, it makes sense to me not to run the new experiment. After all, it’s hardly news that ESP doesn’t work. If “ESP doesn’t work” were publishable, you could fill up a journal many times over with such findings. And what would be the point of that? Better to start a new journal with some catchy title such as Replications of Well-Known Findings. In the physics division, you could have articles demonstrating that objects fall down, not up. In the chemistry division, you could publish demonstrations that H2 + O2 yields H2O plus energy. The biology section could have a paper demonstrating that cats and dogs can’t produce offspring. And so on.

So I don’t know the answer here. On one hand, we can hardly require or even expect that journals fill their pages with dog-bites-man nonreplications. (And, even in a computerized era where there are no page limits, there are still constraints on the time of editors and reviewers.) On the other hand, this leads to an asymmetry where crap gets on the front page and the refutation doesn’t even get published on page B16.

Categories: Popular Blogs

Fight! (also a bit of reminiscence at the end)

Gelman's Blog - January 23, 2012

Martin Lindquist and Michael Sobel published a fun little article in Neuroimage on models and assumptions for causal inference with intermediate outcomes. As their subtitle indicates (“A response to the comments on our comment”), this is a topic of some controversy. Lindquist and Sobel write:

Our original comment (Lindquist and Sobel, 2011) made explicit the types of assumptions neuroimaging researchers are making when directed graphical models (DGMs), which include certain types of structural equation models (SEMs), are used to estimate causal effects. When these assumptions, which many researchers are not aware of, are not met, parameters of these models should not be interpreted as effects. . . . [Judea] Pearl does not disagree with anything we stated. However, he takes exception to our use of potential outcomes notation, which is the standard notation used in the statistical literature on causal inference, and his comment is devoted to promoting his alternative conventions. [Clark] Glymour’s comment is based on three claims that he inappropriately attributes to us. Glymour is also more optimistic than us about the potential of using directed graphical models (DGMs) to discover causal relations in neuroimaging research . . .

Lindquist and Sobel’s arguments make sense to me, except on one point. They consider a causal setting z -> x -> y, where z is the treatment variable, x is the intermediate outcome, and y is the ultimate outcome, and much of their discussion centers on estimating the causal effect of x on y. I have two difficulties with their perspective:

1. If x is an observed variable that is not directly manipulated, I don’t know if it makes sense to talk about the effect of x on y, unconditional on the intervention that was used to change x. In their example, I’d talk about “the effect of x on y, if x is changed through z.” Different z’s can induce different effects of x on y.

2. Lindquist and Sobel talk about the effect of z on x. If z=0 or 1, they write x(z), so that the causal effect of z on x is x(1) – x(0) (or, more generally, x(1) compared to x(0), but we lose nothing by considering simple differences here). So far, so good.

But I get stuck at the next step, where they define the effect of x on y. If x can equal 0 or 1, they write y(z,x), so that the causal effect of x on y, conditional on z, is y(z,1) – y(z,0). At least, I think that’s what they’re saying.

The trouble is, I don’t see how the two parts of this model fit together. For any given item in the experiment, I think they’re following the rule that x(z) has a particular (although maybe unknown) value. But then I don’t see what it means to look at y(z,1) – y(z,0). For any particular value of z, it seems to me that only one of these two terms is possible. (For example, if x(z)=1, then y(z,1) is defined but y(z,0) seems meaningless.)

I’m not saying that this framework is wrong, just that I don’t understand it.

That said, Lindquist and Sobel’s criticisms of Pearl and Glymour seem sound to me.
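For reference, the potential-outcomes notation discussed in points 1 and 2 above can be collected in one place. This is a restatement of the notation as described in this post, not a display taken from Lindquist and Sobel's article. The last line is an added assumption for illustration: it is the usual convention in the mediation literature for linking the two parts of the model.

```latex
\begin{align*}
\text{effect of } z \text{ on } x: &\quad x(1) - x(0) \\
\text{effect of } x \text{ on } y \text{, conditional on } z: &\quad y(z,1) - y(z,0) \\
\text{outcome realized under treatment } z: &\quad y\bigl(z, x(z)\bigr)
\end{align*}
```

Under that convention, for a unit with x(z) = 1 only y(z,1) is ever realized when treatment z is applied. The term y(z,0) refers to a setting in which x is held at 0 by some means other than z, which is exactly the quantity whose meaning is questioned above.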

P.S. I wrote this last month and put it in the queue. Since then I’ve noticed that Pearl has responded to Lindquist and Sobel; see here. I don’t find Pearl’s response to be so convincing; I agree with Lindquist and Sobel’s statement that the graphical or structural equation modeling expression looks simple and appealing but the underlying assumptions in those expressions are not so clear. But you can judge for yourself; as I wrote in my discussion of the book by Morgan and Winship, it’s good to have multiple expressions for a model, as different users are looking for different things.

To be specific, Pearl contrasts three expressions of a single model, the causal chain Z—>X—>Y. Here’s Pearl:

Pearl characterizes the third expression as a more meaningful and clear display.

In contrast, Lindquist and Sobel argue that the above graphical expression appears clear only because it sweeps the model’s assumptions under the rug. Lindquist and Sobel write:

None of this seems clear and simple to me! Speaking of clear and simple, I’m reminded of a scene, several decades ago, when a bunch of us on the county math team won some competition, and the prize was that we each got to choose one of several math books. One of the books was called Elementary Linear Algebra, and I remember making a disdainful remark to my friend that I didn’t want something elementary. My friend replied, “Linear algebra is not elementary.” Good point.

Which brings back another memory: our coach for the Mathematical Olympiad program was an unbelievably grumpy old man. At one point he interrupted one of his lectures to rant about how all the calculus books now are wasting their space with applications. At some point, he said, they’re gonna come up with a book called Applied Calculus with Applications. That all seemed natural to me at the time but in retrospect I’m amazed by how brainwashed we all were. There was one kid there who I recall was interested in engineering problems rather than number theory etc., but that was an unusual preference. (I just looked him up and, amazingly, he grew up to be an engineering researcher!) The other thing I remember about the grumpy coach dude, besides his personality (which, in retrospect, was perhaps necessary to keep a bunch of 15-year-old boys in line; even nerds can make trouble), was that he thought it was cheating to use calculus or analytic geometry. His favorite sorts of problems used elaborate arguments from classical geometry and he always felt we should be able to solve these without resorting to technical means.

As I’ve remarked more than once in this space, I feel lucky in retrospect to have been pretty unprepared for the Olympiad program, with the result that I didn’t do very well there, gradually lost interest in this sort of competitive event, and decided I didn’t want to be a pure mathematician. I think it must’ve been really hard on the kids who were top performers but didn’t happen to be Noam. It was easier for those of us in the bottom half of the group.

Categories: Popular Blogs