If this is normal science I want my old job back
This is also “not part of the proper Steven Levitt series”, although it anticipates a couple of things from Part 3. It’s an expanded version of this comment on the Brad DeLong site. Basically, I do not at all like what Levitt is doing on the issue of the challenge to the “More Abortions, Less Crime” thesis in chapter 4 of Freakonomics, and the controversy which has erupted over it since the Foote and Goetz working paper discussed below.
Levitt has a response up now. He goes hands-up on the programming error, but fights back by making a number of corrections to the underlying abortion data series and says that the key result is still there if you make these corrections. Brad DeLong describes this as “normal science”, presumably in the Kuhnian sense, but I think it’s something a bit worse. Yes, thanks, I am aware that by entering the lists in this way I have become an ally of Steven Sailer, the noted film critic and race nut, but there you go. Here are my specific comments in ascending order of seriousness:
1. I think that the decision to use an instrumental variables approach to allow for measurement error in the Alan Guttmacher Institute abortion survey data is possibly wrong and underjustified (I had to look this up because my recollection of the way you use IV to deal with measurement error is just pathetic; thanks to Ragout in Tim Lambert’s comments for helping me out. God this is turning into an incestuous blog project, it reminds me of the Lancet study community). The issue is that if there are measurement errors in the AG data, then the residuals in the regression will be correlated with one of the regressors (because the left hand side is determined by the true relationship, so big residuals will be correlated with big measurement errors on the right-hand side variable). This tends to bias down your estimates of the regression coefficients, so making you more likely to find things not to be significant when they are.
On the other hand, if you’re using IV estimation (whereby you replace the series with measurement error by a proxy constructed from other series; this tends to inflate the residuals but takes away the correlation with the right hand side), then the series you use to construct your instrument mustn’t be themselves correlated with the measurement error on the original series. If they are, then you’re going to introduce a negative correlation and you will tend to bias your estimates in the opposite direction, making you more likely to find things to be significant when they are not. It might be the case that there is good reason to believe that Levitt’s proxy for the AG data from a similar series compiled by the CDC has measurement errors which are not correlated with the measurement errors on the AG data, but this ought to have been discussed. I don’t like it when people bring in IV estimation with no discussion of why they’re sure that their instrument is valid. This is a venial rather than a mortal sin, but Levitt has a chronic case of it; a lot of his work seems to rely on a kind of “gee whiz what an original idea” when coming up with off-the-wall ideas to find measurable proxies for things, rather than explaining in detail why the proxy is valid.
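For anyone who wants to poke at the mechanics, here is a toy simulation of the measurement error story. It is a sketch under made-up assumptions, not a reconstruction of anything Levitt or Foote and Goetz actually ran: one true regressor, two noisy measures of it (standing in loosely for the AG and CDC series), a true coefficient of 1, and a knob `err_corr` for how correlated the two measurement errors are. All the names and numbers are hypothetical.

```python
import numpy as np

def simulate(err_corr, n=200_000, beta=1.0, seed=0):
    """Return (OLS, IV) slope estimates for a toy errors-in-variables model.

    Hypothetical setup: x_a and x_b are two noisy measures of the same
    true series; err_corr is the correlation between their errors.
    """
    rng = np.random.default_rng(seed)
    x_true = rng.normal(size=n)                       # the series we wish we had
    err_cov = [[1.0, err_corr], [err_corr, 1.0]]
    u = rng.multivariate_normal([0.0, 0.0], err_cov, size=n)
    x_a = x_true + u[:, 0]                            # noisy regressor
    x_b = x_true + u[:, 1]                            # second noisy measure, used as instrument
    y = beta * x_true + rng.normal(size=n)            # outcome depends on the TRUE series

    # OLS slope of y on the noisy regressor
    ols = np.cov(x_a, y)[0, 1] / np.var(x_a, ddof=1)
    # Simple IV (Wald) slope: cov(instrument, y) / cov(instrument, regressor)
    iv = np.cov(x_b, y)[0, 1] / np.cov(x_b, x_a)[0, 1]
    return ols, iv

if __name__ == "__main__":
    for rho in (0.0, 0.5, -0.5):
        ols, iv = simulate(rho)
        print(f"err_corr={rho:+.1f}  OLS={ols:.3f}  IV={iv:.3f}")
```

With these particular variances the OLS slope converges to 0.5 whatever you do, and the IV slope converges to 1/(1 + `err_corr`). So the instrument only recovers the true coefficient when the two sets of errors are uncorrelated. In this toy version at least, positively correlated errors leave the IV estimate still biased towards zero and negatively correlated ones push it away from zero, so the direction of the damage depends on the sign of the correlation, which is one more reason to want the validity of the instrument discussed rather than assumed.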
2. On a simple point of fact, the fourth column of row three of the table displaying Levitt’s revised results does not show a significant effect. This is the column using the correctly programmed interaction effects and IV estimation (I think it’s also using the processed data series but it might not be), so in a sense it’s the end of the “improvement” process that’s been carried out. This isn’t mentioned in the text summarizing the table; again a presentational matter rather than anything else, but irksome.
3. Finally and most importantly, this is about as far from a double blind trial as you can get. I’ve written in the past about the perils of data mining in econometrics, and to be honest, all that is lacking in the series of changes to the data and the model that the Freakonomics blog presents is a phalanx of dwarves singing “Hi Ho, Hi Ho, It’s Off To Data-Mine We Go”. What has happened here is that Levitt and his research assistant have sat down in the knowledge that a perturbation to their model doesn’t deliver their result, and decided to have a think about what kinds of alterations to the data ought to be made.
You don’t need to suggest any intentional dishonesty to say that it is somewhat unsurprising that the outcome of the brainstorming session on “What sort of changes ought one to make to this data, in an ideal world?” was a dataset and model in which the result that Levitt is famous for was present. Even if Levitt and Ethan Lieber had sat down at a table with no computer on it, starting with a blank sheet to discuss the changes to make and not touching the model until they had finished, I would still guess that it would be the easiest thing in the world for someone who was intimately familiar with the dataset to subconsciously put his thumb on the scales. And I don’t think this is what they did; colour me cynical but I would bet quids that lots and lots of iterations of different possible changes to the data were tried. I note once more that there is no accusation of intentionally cooking the books here; medical science certainly doesn’t insist on double blind trials to protect them from unscrupulous doctors.
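A minimal sketch of why the iterations matter, again under assumptions I have invented for the purpose. The outcome below is pure noise, and each "variant" is an unrelated noise series standing in for one candidate revision of the data. We keep whichever variant gives the biggest t-statistic and count how often it clears the conventional 5% bar. Treating the variants as independent overstates the problem for real revisions, which will be highly correlated with one another, so read the numbers as an upper-end illustration only.

```python
import numpy as np

def false_positive_rate(n_variants, n_obs=100, n_trials=2000, seed=0):
    """Share of trials in which the best of n_variants pure-noise regressors
    looks 'significant' at roughly the two-sided 5% level."""
    rng = np.random.default_rng(seed)
    t_crit = 1.98  # approximate two-sided 5% critical value for ~98 degrees of freedom
    hits = 0
    for _ in range(n_trials):
        y = rng.normal(size=n_obs)                   # outcome unrelated to anything
        x = rng.normal(size=(n_variants, n_obs))     # hypothetical data revisions, all noise
        xc = x - x.mean(axis=1, keepdims=True)
        yc = y - y.mean()
        # sample correlation of y with each variant
        r = (xc @ yc) / np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum())
        # t-statistic for the slope in a one-regressor OLS
        t = r * np.sqrt((n_obs - 2) / (1.0 - r ** 2))
        if np.max(np.abs(t)) > t_crit:               # report only the best-looking variant
            hits += 1
    return hits / n_trials

if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, false_positive_rate(k))
```

With one variant the rate should sit near the nominal 5%; with twenty independent tries the theoretical figure is 1 - 0.95^20, a bit under two thirds. Nobody in this simulation is cooking the books. They are just trying things until something works.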
I think that there’s a general issue here which is endemic to the territory that Levitt chooses to operate in. By their nature, political debates are debates. One side produces arguments, the other side produces counterarguments and so on, so iteratively. This is an environment which is absolutely poisonous to datasets. By the time you’ve been through two or three iterations of a “controversy” like this it’s more or less impossible to pick a model without failing even the most homeopathically weak version imaginable of a double blind criterion. This is why I now say that we’re simply never going to know the truth (by which I mean, even the simple statistical truth about the existence of a comovement, much less the truth about the underlying causal hypothesis) about abortion and crime in the period 1976-2000. Stick a fork in this dataset, it’s done.
I think it’s bad for economics and statistics as a science to start acquiring the habits of thought that are prevalent in these debates (more, much more, on this in the long awaited Part Three). I also think it’s bad for politics to have one side of any debate trying to give their case the imprimatur of objective science in exactly the way that Freakonomics does all the time with its “morality is concerned with what should be the case; economics is concerned with what actually is the case” schtick. When your response to a measured, polite working paper is to nip off to the data mines with your research assistant and write a blog post entitled “Back to the drawing board for our latest critics…and also the Wall Street Journal and (Oops!) the Economist.”, then what you’re doing isn’t “normal science”. It’s normal politics.
 By the way, there is nothing wrong with the English phrases “making you more likely to find things not to be significant when they are” or “making you more likely to find things to be significant when they are not”, so can we just give up on trying to remember which is a Type I and which is a Type II error, pretty please?
 As I say, my recollection of this stuff is terrible. When you do IV estimation in econometrics you are usually doing so because of a different problem (endogeneity) and I have never in my life taken seriously the possibility of measurement error in the series I was using (professional deformation). What I am saying here is that this bit might be wrong, I am staking about a farthing’s worth of credibility on it being right and if you want to correct me go for it in the comments.
 By this I mean something quite different from claiming that political debates shouldn’t be informed by scientific facts, or that social scientists shouldn’t get involved in doing work on important topics that they care about; my defence of the Lancet Report on Iraq would look pretty odd if this was what I believed. It’s something close to the Humean point about reason being the slave of the passions; scientists ought to be (when they are acting as scientists that is; they can do what they like when they take part in the debate as citizens) simply bringing their best estimate of the facts of the case before the people who decide what to do about those facts. And scientists shouldn’t data-dredge, either, of course.