Log Books – an Impudent Suggestion
Important caveat: It’s just struck me that anything you put up on the internet has a chance of being used for mischievous purposes. So I’d just like to clarify that I make repeated references to John Lott in this essay for two reasons a) his paper happened to be the subject of the debate in the comments section that set me off on this line of thought and b) I think my lay readers will be more familiar with “More Guns, Less Crime” than most of the rest of the econometrics literature. I have no special knowledge of how John Lott goes about his work, and no evidence of any kind which might support claims about whether he is more or less scrupulous in his use of iterated stepwise regressions than any other econometrician (I’d note that the problem is endemic, so it would hardly be a criticism of Lott if he had all the faults described in this article; I certainly do). I happen to dislike the guy’s work, and I personally don’t trust him because he’s attached to the American Enterprise Institute, but this is far more likely to reflect my personal political prejudices than anything else. As far as I can tell, in the current edited version of this essay, there is only one reference to Lott’s work in a context where it might be reasonably suspected that I’m referring to him pejoratively (a joke about the Stanford Law Review that I’m leaving in because it’s funny), and this reference would apply equally to his critics. Everyone should make up their own mind about Lott’s work based on the evidence and on detailed analysis of his work, and this essay is neither. If John Lott or his representatives want to complain about particular references, I’d be grateful if they used the comments link below for first point of contact, as I have a curious phobia about posting my email address on this site.
Anyway, on with the show.
What with things like this, plus the whole John Lott fiasco, econometrics is having a pretty hard time of things at the moment … and probably, deservedly so. Basically, read on for a story about one of the dirty little secrets of econometric analysis and my suggestions for something that could be done about it. I call this an “impudent suggestion” because, naturally, my personal contribution to the published econometrics literature is pretty precisely fuck-all. But anyway …
A statistician whose name I forget once said of the Ordinary Least Squares regression model that it was the internal combustion engine of econometrics; something so convenient that everyone tended to use it for trivial tasks where it wasn’t really needed, without regard to the pollution it caused. This was a bit precious, perhaps, but nowadays, computer power has reached the point at which everybody can have the SPSS regression package (I believe that’s the standard in universities; I’m a TSP man myself) on their desktop, along with a CD containing 5,000 different economic data series for whatever they want, plus the Internet to download most any specific data series collected for whatever question they might be interested in. And this is the point at which the internal combustion analogy really does become valid; it is not at all healthy for the general advancement of economic knowledge to have the power to do OLS regressions so democratically distributed.
To pursue the analogy with cars a bit further, we all know that it’s simply not possible for everyone in China and India to drive a Ford Galaxy. The world just doesn’t have enough oil. And OLS regressions are also pretty intensive in their consumption of a scarce resource – a scarce resource called “uncontaminated data”. This is the point, by the way, at which matters start to get both a bit technical (an inevitable result of the territory, which I am trying to mitigate) and more than a bit inaccurate (both a result of my trying to simplify things for the layman, and my characteristic error-laden style). I happen to believe that it’s massively, monumentally important that the educated plain man should be able to understand this sort of thing, and that if he can’t it’s the fault of professionals for not explaining it better. I also believe that anything that can be understood, can be explained to someone with no specialist background, without making material oversimplifications. If I don’t achieve either aim, I hope someone will pick me up in the comments.
Anyway, the issue of data contamination is this. When you’re building a model of something, here’s one of the ways you might choose to go about it. (NB that first, this is a very general and oversimplified characterisation, which will be sharpened up considerably below, and second that there are a number of other ways you might set about this task, all of which share this problem, but not always in as transparent or obvious a way).
- Decide on the structure of your model (take an easy example; we’re going to assume that consumption equals X times income plus Y times wealth).
- Estimate your model (in this case we take the US consumption, income and wealth numbers from, say, 1945 to 1971 and use SPSS to pick the values of X and Y which minimise the “in sample prediction error”).
- Now, you might be a bit worried that your model had fitted itself to random features of your data rather than to the underlying structural relationship you assumed was there.
- So, you can now use the data for income and wealth for 1972-2002 to generate thirty years’ worth of “predicted” consumption data and compare it to the actual. If your “out of sample prediction error” isn’t too bad, then you’ve got a model that might be valid, and you’ve learned something about the relationship between consumption, income and wealth.
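For anyone who would rather see the four steps above as code than as bullet points, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the data are synthetic (not US national accounts), the "true" values 0.6 and 0.05 are invented, and the helper names `fit_two_regressors` and `rmse` are hypothetical. It fits X and Y on 1945-1971 and then scores the fitted model on 1972-2002.

```python
import random

def fit_two_regressors(c, inc, w):
    """OLS without intercept for c = X*inc + Y*w, solved via the 2x2 normal equations."""
    s_ii = sum(i * i for i in inc)
    s_ww = sum(v * v for v in w)
    s_iw = sum(i * v for i, v in zip(inc, w))
    s_ic = sum(i * k for i, k in zip(inc, c))
    s_wc = sum(v * k for v, k in zip(w, c))
    det = s_ii * s_ww - s_iw ** 2
    x = (s_ic * s_ww - s_wc * s_iw) / det
    y = (s_wc * s_ii - s_ic * s_iw) / det
    return x, y

def rmse(c, inc, w, x, y):
    """Root mean squared prediction error of the fitted model on the given data."""
    errs = [k - (x * i + y * v) for k, i, v in zip(c, inc, w)]
    return (sum(e * e for e in errs) / len(errs)) ** 0.5

# Synthetic series, invented purely for illustration.
rng = random.Random(0)
years = list(range(1945, 2003))
income = [100 + 2 * (yr - 1945) + rng.gauss(0, 10) for yr in years]
wealth = [500 + rng.gauss(0, 50) for _ in years]
consumption = [0.6 * i + 0.05 * v + rng.gauss(0, 2) for i, v in zip(income, wealth)]

# Step 2: estimate on 1945-1971 only.
split = years.index(1972)
x_hat, y_hat = fit_two_regressors(consumption[:split], income[:split], wealth[:split])

# Step 4: compare in-sample fit with performance on the untouched 1972-2002 data.
rmse_in = rmse(consumption[:split], income[:split], wealth[:split], x_hat, y_hat)
rmse_out = rmse(consumption[split:], income[split:], wealth[split:], x_hat, y_hat)
print(f"X={x_hat:.3f} Y={y_hat:.3f} in-sample RMSE={rmse_in:.2f} out-of-sample RMSE={rmse_out:.2f}")
```

Because the synthetic data really are generated by the assumed structure, the out-of-sample error here comes out similar to the in-sample one. Real data are rarely so obliging.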
Fine and fun. But what happens if your out of sample prediction error is laughably awful, as it most likely would be in such a model? Well, the obvious thing to do is to go back to step 1 and alter your model a bit; say, try to model the logarithm of consumption, or changes in consumption, or add lagged values of consumption as an explanatory factor, or something. But now you run up against a really serious problem, which I shall illustrate with a Christmas cracker joke.
Q: How many hamburgers can William “The Refrigerator” Perry eat on an empty stomach?
A: Only one. After that, his stomach isn’t empty any more.
Similarly, how many times can you test a model against out-of-sample data? Only once; after you’ve done that, the data isn’t out-of-sample any more.
Unscrupulous readers will have noticed immediately that if the referee didn’t see it, it didn’t happen (a fortiori, if you’re publishing in a non-refereed journal like the Stanford Law Review, it certainly didn’t happen), and that there is nothing to stop you from doing the statistical equivalent of cheating at patience and “resetting the clock”, trying out lots of models on the 1945-71 data until you get one that has an acceptable fit with the 1972-02 data as well. You can even, I am ashamed to say, buy computer programs which carry out this process (“step-wise regression”) for you; you just feed in the data series that you want to find a relationship for and they try out all sorts of models until they come up with one that has decent performance. Or, if you don’t own one of these programs, you can do what economics graduate students get up to on those rare evenings when they lack hot dates, and sit in a computer lab, mindlessly pressing the button on SPSS over and over again, looking at the goodness-of-fit statistics and trying to ignore your undeniable and growing resemblance to the rodent half of one of BF Skinner’s operant conditioning experiments.
Obviously, it wasn’t possible to do this back in the days when linear regressions had to be carried out by armies of men with slide rules, or even in the days when you had to book time on the faculty mainframe, but technology has advanced to the point at which SPSS can be an acceptable substitute for Minesweeper for a certain kind of econometrician. As a social pastime, obviously, endlessly re-estimating permutations of regression models has all the disadvantages of masturbating into a sock with few of the advantages, but is it really all that harmful? Well yes. I’m not just being a luddite here.
The problem is that at the end of an evening spent in this fashion, our economics graduate student will have finally arrived at a model which fits the data very well, but of which it is almost impossible to judge the significance. Why?
Well (digressing for the layman here), let’s take a fairly standard significance test. Specifically, let’s take the absolutely standard t-test for significance in a multivariate linear regression (a test of this sort lies at the heart of John Lott’s “More Guns, Less Crime” paper, for example). The idea behind this test is pretty simple. Along with the regression coefficient (in the model above, X and Y were the regression coefficients), the software calculates something called the “standard error”. The standard error in this context is a measure of the variability of the coefficient; the extent to which that particular coefficient would have needed to be bigger or smaller to explain the various data points in the sample. And, under the usual assumptions, the estimated value of the coefficient can be treated as a draw from a probability distribution with a mean at the “true” value of the coefficient and a dispersion around that mean defined by the standard error. So what you do to “test significance” of a coefficient is to take your estimated coefficient, divide it by its standard error to “standardise” it, and then look up in a book the critical values of the relevant distribution (it’s called the “t-distribution”, or for those who are showing off, “Student’s1 t-distribution”). Oversimplifying mightily, the book tells you, for a given “t-ratio” (the ratio of the coefficient to its standard error), what the probability is that a random draw from a t-distribution with mean zero would be at least as great in magnitude as the t-ratio you’re looking up. So if I get a coefficient of 6 (an estimate of 6 for X in my equation above) and a standard error of 2, then my t-ratio is 3. I look this up in the book and (say) find that a random draw from a t-distribution with mean zero will be greater than 3 or less than -3 no more than 2% of the time.
This is great news for me, as it means that I can say that (speaking loosely, or as a Bayesian) I can be 98% confident that my coefficient is not zero, or that (speaking rigorously, or as a frequentist), in an arbitrarily large number of similar studies, I would only expect to see results like mine 2% of the time if the true value of the coefficient was zero. In other words, it’s highly unlikely (speaking loosely again) that there is no relationship between the variable associated with my coefficient X (income) and the dependent variable (consumption). Or in other words, this variable is “significant”2.
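The "look it up in a book" step can be reproduced in a few lines. This is a sketch, not how any regression package does it: the two-sided tail probability is computed by crude numerical integration of the t density, and the 7 degrees of freedom are an assumption picked only because they make the answer land near the "(say) 2%" used above. The function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))

def two_sided_p(t_ratio, df, steps=2000):
    """P(|T| >= |t_ratio|) for T ~ t(df), using Simpson's rule on [0, |t_ratio|]."""
    a = abs(t_ratio)
    if a == 0:
        return 1.0
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_mass = 2 * total * h / 3  # probability of landing between -a and +a
    return 1 - central_mass

coefficient, std_error = 6.0, 2.0
t_ratio = coefficient / std_error           # 6 / 2 = 3
p_value = two_sided_p(t_ratio, df=7)        # df=7 is an illustrative assumption
print(f"t-ratio = {t_ratio:.1f}, two-sided p-value = {p_value:.3f}")
```

With more degrees of freedom the same t-ratio of 3 gives a smaller tail probability, which is one reason the "(say)" above is doing honest work.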
Or is it? As you notice above, I’ve stressed on a number of occasions that the book of t-distributions tells you what to expect if you make a random draw from a t-distribution. But, if you’ve been up all night pressing the fire button on your regression program over and over again until you get a sufficiently high t-ratio on your coefficient of interest, then that t-ratio isn’t a random draw from a t-distribution. It’s a number that could never have come out much below about 1.9, because you kept iterating through regressions until you found one that high. There is a very big problem if you pretend that a number created in such a way is a t-ratio randomly drawn from a t-distribution and start pretending to do statistical tests on it as if it were. If you publish an article based on this sort of methodology in a journal (or worse, submit it as a policy paper), you’re actually polluting the information space, because your results are indistinguishable from genuine statistical analysis, but they are damn near to being statistically meaningless. Not only that, but anyone who reads your article is contaminated; if they ever do any work of their own on the same dataset, they start from a basis of knowing how your model fit the data. Laymen reading this, if you take nothing else away from this article (and if you take nothing else away, sorry for wasting 1904 words of your time), remember every time you see a piece of statistical analysis which looks definitive, you normally have no guarantee that it wasn’t produced in this way.
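A small simulation makes the point concrete. It is a sketch under loud assumptions: the dependent variable and all twenty candidate regressors are pure noise, so there is nothing to find; the critical value 2.05 is roughly the 5% two-sided cut-off for 28 degrees of freedom; and `slope_t_ratio` and `simulate` are hypothetical names. The "honest" researcher tests one pre-chosen regressor, while the "dredger" reports whichever of the twenty has the biggest t-ratio.

```python
import random

def slope_t_ratio(y, x):
    """t-ratio of the slope in a simple regression of y on x (with intercept)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    rss = syy - slope * sxy                 # residual sum of squares
    std_error = (rss / (n - 2) / sxx) ** 0.5
    return slope / std_error

def simulate(trials=300, n_obs=30, n_candidates=20, critical=2.05, seed=0):
    """Share of trials declared 'significant' by one test vs. by the best of many."""
    rng = random.Random(seed)
    honest_hits = dredged_hits = 0
    for _ in range(trials):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        t_ratios = []
        for _ in range(n_candidates):
            x = [rng.gauss(0, 1) for _ in range(n_obs)]  # unrelated to y by construction
            t_ratios.append(abs(slope_t_ratio(y, x)))
        honest_hits += t_ratios[0] > critical      # one pre-specified regressor
        dredged_hits += max(t_ratios) > critical   # keep pressing the button
    return honest_hits / trials, dredged_hits / trials

honest_rate, dredged_rate = simulate()
print(f"single pre-specified test: {honest_rate:.0%} 'significant'")
print(f"best of 20 attempts:       {dredged_rate:.0%} 'significant'")
```

If the twenty attempts were independent you would expect the dredger to strike "significance" in something like 1 - 0.95^20, or about 64%, of datasets that contain no relationship at all, against about 5% for the single test.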
This is the issue of “data mining”, and we had a bit of a go at it in the comments section a couple of weeks ago. In actual fact, there are all sorts of ways in which one might go data mining, and some of them are far less pernicious than others. In particular, I have to confess a soft spot for the “LSE Econometrics” approach of David Hendry3 and his mates. Under this approach, you don’t start by choosing a model at all; you use a highly general model to begin with (Hendry refers to his method as “general to specific”), one which is structured so that it has a lot of redundancy and explains the data (another slogan of the LSE school is that one should “let the data determine the model”, and I’ll explain below why this is a more controversial slogan than you might think). You then start placing restrictions on the general model (for example, you might delete a coefficient by restricting it to be zero) and re-estimate, then perform tests to see whether the restriction in question is one that impairs the model’s ability to explain the data. If the restriction doesn’t reduce the explanatory power, then the restricted model is said to “encompass” the more general model; it does the same amount of work with fewer assumptions. Using a very well-thought-out algorithm for searching through the potentially massive space of possible restricted models (an algorithm implemented in the program PCGets, whose authors it is fair to say are a bit sensitive about criticisms of data mining), you finally arrive at a model which, like the Reality of Divinity, “encompasseth all created things”. This is your “local data generating process” (LDGP), your best estimate of the structural model which generated the dataset that you started estimating your general model on. Now your job is to interpret that LDGP; to see what theoretical model corresponds the best to the data.
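To give a flavour of the "restrict, re-estimate, check" loop, here is a toy backward-elimination sketch. It is emphatically not the PCGets algorithm and not a proper encompassing test: it uses the crudest possible stand-in (drop the regressor with the smallest t-ratio while that ratio is below 2), the data are synthetic with only `x1` actually mattering, and every function name is hypothetical.

```python
import random

def invert(m):
    """Gauss-Jordan inverse of a small square matrix (list of lists)."""
    n = len(m)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols_t_ratios(y, columns):
    """Coefficients and t-ratios for y regressed on the given columns (no intercept)."""
    n, k = len(y), len(columns)
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in columns]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [y[t] - sum(beta[j] * columns[j][t] for j in range(k)) for t in range(n)]
    s2 = sum(e * e for e in resid) / (n - k)
    return beta, [beta[i] / (s2 * inv[i][i]) ** 0.5 for i in range(k)]

def general_to_specific(y, data, threshold=2.0):
    """Start from all regressors; repeatedly restrict the weakest coefficient to zero."""
    kept, dropped = list(data), []
    while kept:
        _, t = ols_t_ratios(y, [data[name] for name in kept])
        weakest = min(range(len(kept)), key=lambda i: abs(t[i]))
        if abs(t[weakest]) >= threshold:
            break                      # every remaining restriction would cost fit
        dropped.append(kept.pop(weakest))
    return kept, dropped

rng = random.Random(1)
data = {f"x{j}": [rng.gauss(0, 1) for _ in range(60)] for j in range(1, 6)}
y = [2.0 * v + rng.gauss(0, 1) for v in data["x1"]]   # only x1 matters, by construction
kept, dropped = general_to_specific(y, data)
print("kept:", kept, "dropped (in order):", dropped)
```

Note that a pure-noise regressor can still survive by luck, which is a small-scale version of the interpretive worry raised about this family of methods.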
Contrast this with what one might call a Popperian approach to econometrics – the more usual method of setting up a model which corresponds quite closely to your a priori theoretical view of how you think the world works, and then setting it up against a dataset to see whether the data falsify your theory or not. In general, people who are sympathetic to the LSE approach tend to be a bit more interested in philosophy of science (and thus less inclined to be knee-jerk Popperians) than econometricians of the more orthodox tendency.
On the other hand, the LSE approach isn’t a panacea. It has a number of problems:
- You can’t be sure that you’re going to come up with a theoretical interpretation of the LDGP that makes sense. Unless your interest was in forecasting a process rather than understanding it to begin with, that leaves you no better off than when you started.
- Although encompassing and general-to-specific methodology is a way of data-mining which doesn’t give flat out deceitful results like the caricature of stepwise regression based on t-ratios I described above, it’s still pretty difficult to interpret the results of PCGets’ output. You don’t really have a feel for the sensitivity of your estimates. It’s much better for macroeconometric work (where you’re usually either forecasting, or attempting to test the validity of an entire theoretical approach such as rational expectations) than for microeconometrics (where you’re often more interested in the specific coefficients).
- Related to the above, there is a considerable danger of developing a “theory of a single dataset”. Although the problems of statistical validity of data-mining are addressed by the LSE approach, the issues of overfitting and poor out-of-sample performance persist. If you develop entirely different LDGPs for, say, the UK, the USA and Japan, then no matter what your forecasting performance, how credible are your results? Not very, I’d say, although there are people who do a lot of semi-interesting work about “model uncertainty”.
So there is still, in my opinion, a significant role for non-LSE-style, Popperian econometrics, to answer questions like “If we pass concealed-carry laws, will there be less crime?”. And there is no real equivalent of PCGets for this kind of econometrics; I disagree with the commenter a couple of weeks ago who thought that data-mining in this area wasn’t always and invariably pernicious (the comments are attached to the story “Readers’ Digest” and very good they are too). If you aren’t absolutely 100% confident of the good faith of the people involved (and on a politically sensitive issue, how could you be?), then there’s a very good chance you’re being fed sawdust rather than steak.
Which is where my proposal comes in. In an ideal world where I was the wealthiest man alive, I would run the world’s most prestigious econometric journal. It would be called the Journal of Definitive Results, and it would have a submission procedure as follows:
- The editorial board would decide on a problem to be tackled.
- Having decided on this, researchers would get grants to assemble datasets. These researchers would be barred from taking any further part in the project.
- Thirty different teams from different universities around the world would be invited to contribute to the Spring issue of the JDR. This would be entirely theoretical, as everyone contributed a paper on how they felt the problem should be modelled. Nobody would be allowed to see the data at all until the Spring issue was published, at which point, half the dataset would be released to the econometricians from each team.
- The Summer issue would have two sections. In the first section, the theorists would respond to each other’s papers from the Spring issue, while in the second section, the econometricians would write methodology papers on the subject of the correct approach to be taken to modelling the data.
- The Autumn issue would have the econometricians’ responses to each other’s papers. On publication of the Autumn issue, the second half of the dataset would be released.
- Each team would contribute one econometric study on the first half of the dataset, and a second study using the same methodology on the second half, to the Winter issue. The Winter issue would be introduced with an editorial summary of the theoretical debate, and would conclude with a metastudy of the thirty econometric papers by an eminent econometrician not connected to any of the teams.
Obviously, this isn’t going to happen any time soon, and like all pipe-dreams, it would presumably be impossible to make it work at an interpersonal level. But a weaker proposal based on the same ideas might have a chance.
One of the things that they do in some natural sciences is to keep lab books. This is a book in which you write down every experiment you perform. Partly to keep track of what you’re doing, partly to help in future patent lawsuits and partly to avoid duplication of effort. It strikes me, though, that the introduction of this practice to the social sciences could help to stamp out a lot of the more mindless “data dredging” (like data mining, with the difference that a miner occasionally strikes gold but a dredger just stirs up shit) that takes place. If you had to write down what you were doing every time you pressed “estimate”, and if you knew that you would have to submit your log book along with your results to whatever journal you intended to publish in, and that people would tend to look askance at your “significant at the 99% level” results if they found out you’d estimated 500 versions of the same model before you got them, then you might be a little less trigger happy with SPSS. And this is the real source of most of the data dredging that goes on; not calculated deception but just the fact that a lot of economists have experienced something close to operant conditioning at the hands of their regression software. There will always be people who intentionally iterate through models in order to reach a politically convenient conclusion, but if we can find them guilty of the heinous crime of “forging laboratory notebooks”, it will be a lot easier to chuck them out of the profession. Hopefully.
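In software terms the log book could be almost embarrassingly simple. The sketch below is one hypothetical way of doing it, not a feature of SPSS, TSP or any real package: a `LogBook` wrapper (my invented name) through which every estimation has to pass, so that each press of “estimate” leaves a numbered, timestamped entry. The toy "model" and its three data points are made up.

```python
import json
import time

class LogBook:
    """Records every estimation that is run through it, in order."""

    def __init__(self):
        self.entries = []

    def estimate(self, description, fit, *args):
        """Run fit(*args) and log what was tried, whether or not it 'worked'."""
        result = fit(*args)
        self.entries.append({
            "run": len(self.entries) + 1,
            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
            "description": description,
            "result": result,
        })
        return result

    def to_json_lines(self):
        """One JSON object per line: the thing you would hand to the journal."""
        return "\n".join(json.dumps(entry) for entry in self.entries)

def consumption_share(consumption, income):
    """A deliberately trivial stand-in for a real estimation routine."""
    return sum(consumption) / sum(income)

# Invented numbers, purely for illustration.
consumption = [60.0, 66.0, 71.0]
income = [100.0, 110.0, 120.0]

book = LogBook()
book.estimate("consumption / income, all years", consumption_share, consumption, income)
book.estimate("same, dropping the first year", consumption_share, consumption[1:], income[1:])
submitted = book.to_json_lines()
print(submitted)
```

The design choice that matters is that logging happens inside the only route to a result, so the 499 discarded specifications end up in the book alongside the one that gets published.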
1“Student” was the pen name of the bloke (William Gosset) at the Guinness factory who invented this incredibly useful distribution in 1908. I have no idea why the Guinness factory was employing statisticians but they were, and they also had a policy of not allowing them to publish journal articles under their own names (presumably Guinness Breweries was not subject to Research Assessment Exercises). So “Student” it was.
2Statistically significant that is, not necessarily practically significant. In all honesty, “statistical significance” is really just a measure of how many data points you have, which is something you already know or ought to. But everyone treats the two meanings as equivalent and I am not inclined to kick against the pricks.
3“LSE Econometrics” is a bit of a misnomer; Hendry is actually at Oxford and so are most of his school. But the tradition of this approach apparently goes back to LSE.