Some of my best friends are fixed-effects Nazis, but I have long been worried about the potential loss of power in situations where there are relatively few observations per cluster and fixed effects drop out all of the concordant strata. You might gain some validity, but at the price of precision, and therefore do worse in terms of MSE. Now I see a new paper that finally addresses this trade-off in greater detail. The Hausman test is treated with great reverence in some quarters, but it is a significance test, after all, and so has all of the usual weaknesses of significance testing, such as the potential for low power to correctly reject the null in some settings, or excessive power to reject a trivially false null in other situations.
Often when faced with an unsatisfactory binary choice, the best solution may be to go a third way. In this context, it could be the so-called "hybrid model", which delivers the validity advantages of the fixed-effects model with the parsimony (and thus precision) advantages of the random-effects model. There are ways that this model can go wrong, too, but in many situations you can have your cake and eat it too. Presenting this result to a Francophone audience last year, I learned that the way to express that latter sentiment is that you can have both the butter and the butter money.
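For readers who want to see what the hybrid (within-between) specification looks like in practice, here is a minimal Stata sketch, assuming a generic clustered data set with cluster identifier id, outcome y, and a single exposure x (all names hypothetical, not from the paper):

* decompose x into its between-cluster and within-cluster components
bysort id: egen xbar = mean(x)    // cluster mean (between component)
gen xdev = x - xbar               // deviation from cluster mean (within component)
xtset id
xtreg y xdev xbar, re             // coefficient on xdev reproduces the fixed-effects (within) estimate,
                                  // while xbar absorbs the between-cluster contrasts
test xdev = xbar                  // the hybrid-model analogue of the Hausman comparison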
We all know that the change-in-estimate approach to finding confounders can't work for the OR, because the OR can change its value when conditioned on any determinant of the outcome, even if the covariate is independent of exposure. Therefore, the OR will "cry wolf" and claim confounders that aren't confounders (when they are not associated with exposure). It can also fail to detect a true confounder, although a recent paper by Mansournia and Greenland notes that this is decidedly less likely to occur, because it requires an incidental cancelling out. On the other hand, the false accusation of confounding by a change-in-estimate approach applied to an OR is quite easy to encounter: it will happen whenever the frequency of the outcome is high in the data set. Here is a quick Stata example that generates a data set of binary variables:

clear all
set obs 10000
set seed 12345
gen c = rnormal()
gen x = rnormal()
gen y = rnormal() + x + c
replace c = (c > 0)
replace x = (x > 0)
replace y = (y > 0)

This code generates a binary Y that is a function of both exposure X and covariate C. X and C are completely independent (OR_XC = 1), so C by definition cannot be a confounder. The prevalence of Y=1 is 0.5.

logistic y x
logistic y x c

Crude OR_YX = 5. When I adjust this for C, I get OR_YX|C = 7. Therefore, by change-in-estimate criteria, one might think one has to adjust for C. But this is wrong. The OR is "crying wolf". So far so good. We also know that the RD is collapsible, and sure enough:

binreg y x, rd
binreg y x c, rd

Crude RD_YX = 0.38. When I adjust this for C, I get RD_YX|C = 0.38. Therefore, by change-in-estimate criteria, one would not adjust for C, which is the right answer. Good. But here is the surprise:

binreg y x, rr
binreg y x c, rr

Crude RR_YX = 2.30. When I adjust this for C, I get RR_YX|C = 2.08. These are significantly different, and therefore, by change-in-estimate criteria, one might adjust for C, which would be a mistake. The RR is supposed to be collapsible, so this baffled me until I looked at the stratum-specific estimates:

cs y x, by(c)
cs y x, by(c) istandard
cs y x, by(c) estandard

               c |    RR      [95% CI]
-----------------+----------------------
               0 |   4.54    4.04, 5.11
               1 |   1.75    1.68, 1.83
-----------------+----------------------
           Crude |   2.30    2.19, 2.40
    M-H combined |   2.26    2.16, 2.36
 I. Standardized |   2.24    2.15, 2.34
 E. Standardized |   2.27    2.18, 2.37

For those of you who don't know Stata, "internally standardized" is standardization using the X=1 group as the target population, and "externally standardized" is standardization using the X=0 group as the target population (old occupational health terminology). X and C are uncorrelated in expectation, so these target-population-specific adjusted values differ only due to the chance imbalance in C across levels of X. So there is tremendous effect modification by C here, even though C is not a confounder. This doesn't seem to matter for the RD, but for the RR, getting the right answer seems to depend on exactly HOW I adjust for C. The standardized estimates are slightly different from one another, but they are all statistically homogeneous with the crude. The binomial regression adjusted estimate, however, is much lower than the standardized estimates (RR = 2.09). If I use Poisson regression, I get RR = 2.26, so again it is collapsible; it seems that only the binomial regression estimate is screwy here. Should one avoid binomial regression when the outcome is common or the effect of the exposure is heterogeneous?
I was sincerely confused, so I sent e-mail around to some friends, and after some discussion, it all seemed quite obvious. My Stata code generates the following (true) proportions of Y=1 in each stratum of X and C:

. table x c, c(mean y)

------------------------------
          |         c
        x |         0         1
----------+-------------------
        0 | 0.1084666 0.5012356
        1 | 0.4927478 0.8789078
------------------------------

The corresponding fitted cell proportions are as follows. Logistic regression:

------------------------------
          |         c
        x |         0         1
----------+-------------------
        0 |  0.112023 0.4974815
        1 | 0.4890754 0.8825148
------------------------------

Binomial regression:

------------------------------
          |         c
        x |         0         1
----------+-------------------
        0 | 0.2031814 0.4296737
        1 | 0.4217731 0.8919361
------------------------------

Poisson regression:

------------------------------
          |         c
        x |         0         1
----------+-------------------
        0 | 0.1835052 0.4220249
        1 | 0.4152605 0.9550153
------------------------------

Logistic is not perfect here, but it is much better than the other two. Still, it is not that logistic is somehow superior in any general sense for this task; it just got lucky in this instance. The two stratum-specific ORs are 8.0 and 7.2, which happen to be much closer to homogeneous than the two true RRs (4.5 and 1.8). The fact that the two true RDs are roughly homogeneous (0.38 and 0.38) explains why the additive model does so well. As Rich MacLehose kindly pointed out, this has nothing to do with collapsibility per se; it has to do with model fit. If the model is misspecified, you can easily get a change in estimate that suggests confounding, regardless of the choice of effect measure. In *ALL* of these models, when I add an interaction term, I get exactly the correct predicted proportions in every cell, and the correct RRs or ORs when I divide these predictions. So the moral of the story is that change in estimate is not a sufficient criterion, even with a collapsible measure: one also has to get the model right. Seems obvious in retrospect, but in practice, how often do we worry about this? The well-known Greenland and Morgenstern example shows RR collapsibility even when the stratum-specific RRs are very heterogeneous, but they use tabular (i.e., non-parametric) adjustments. In their example, if one used a regression model, one would not generally reach the same conclusion (without adding interaction terms).

But Rich also asked another question: how can Poisson regression and binomial regression produce different point estimates when they have the same link function and differ only in the error distribution? I also hadn't previously thought about this in detail, and if students had asked I probably would have answered the same way: that changing the distribution in the GLM would affect the CIs but not the point estimates. But I think that my (bad) intuition stems from equating rates and risks. If the outcome were rare, I think we would indeed see that these give approximately the same value. But the problem is that the outcome in my example is common, and in some cells VERY common. That apparently messes things up. Specifically, the logistic model searches for the best (ML) solution under the constraint that all the log-odds fall on a straight line, binomial regression searches for the best (ML) solution under the constraint that all the log-risks fall on a straight line, and Poisson searches for the best (ML) solution under the constraint that all the log-rates fall on a straight line. So I guess what we are seeing here is the distinction between risks and rates.
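As an aside, for anyone who wants to check the saturated-model point in the simulated data above, here is a minimal sketch; the factor-variable and -margins- syntax is standard Stata, and the data come from the code earlier in this post:

* saturated versions of the three models; -margins- then recovers the true cell proportions
logistic y i.x##i.c
margins x#c
binreg y i.x##i.c, rr
margins x#c
poisson y i.x##i.c, irr
margins x#c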
The surprising thing about the risk/rate distinction is that I have no person-time here (no offset in the Poisson model), so everyone contributes 1 time unit, and I can't see how the rate and risk could be different. So that remains a bit of a mystery for me. If you look at a plot of the fitted values for the three models (truth versus logistic, binomial, and Poisson), you see that logistic comes closest in this instance because the true proportions are closer to being linear in the log odds than in any of the other scales. But even though binomial and Poisson match almost exactly at the low end, they diverge markedly at the high end. This makes me think that we could see the distinction much more clearly if we don't convert everything to binary variables, so I ran this again with continuous x and c. I had a hard time getting binomial regression to converge, and ended up plotting the binomial fit with a smoother because the convergence problems made the predicted line jagged. But in any case, one can now see that the shape of the underlying function is rather different between Poisson and binomial here. If you don't smooth, Poisson shoots up much more dramatically at the upper limit, as you get very close to 1 in the true probability. Even at the low end they are not that close, but I think that this is because the model is trying to find the ML solution that also works at the top end. And bear in mind that the logistic fit looks straight here only because the data in my particular example happened to be approximately homogeneous in the ORs across strata of C; one could engineer an example where logistic would do very poorly, too. Moral of the story: use interaction terms to make a saturated model, and then use -margins- to take the differences or ratios of the correctly specified risks (which is what Sam Harper recommended to me from the start). The common wisdom that Poisson and binomial regression estimate the same parameter comes from the rare-outcome setting. When the outcome is not rare, they are not going to agree, and there is no way to predict which will be closer to the truth in any given instance. Again, all this seems obvious in retrospect, but then again, most things do.

In this new paper, Chris Murray's team at the Institute for Health Metrics and Evaluation reports on mortality, incidence, years lived with disability (YLDs), years of life lost (YLLs), and disability-adjusted life-years (DALYs) for 28 cancers in 188 countries by sex from 1990 to 2013. The only problem is that perhaps three-quarters of those 188 countries produce no meaningful data whatsoever on these quantities. But the figures and graphs are stunningly beautiful. But with even mortality data so unreliable in most of the world, why go the extra step and try to report cancer incidence? Clearly Murray's professional strategy for getting money and attention is claiming to know everything, or at least being able to provide a number for everything. Thus, it would not serve him to publish something more restrained or conservative. Others could do that, but his unique forte is the chutzpah to assign a number EVERYWHERE for EVERYTHING. And he also knows that he will never be held accountable for these numbers, so what does it matter to him if he is off by a factor of 2 or even a factor of 10? His strategy is wildly successful, especially with powerful benefactors such as Bill Gates and Richard Horton. In short, his fantastic success is due to publishing papers that overreach, just like this one. But surely the issue of cancer incidence is more complicated, because any country that screens more will find more cancers, and will therefore also lengthen the time that people spend with cancer and increase the number of people who ultimately die with cancer (even if not from cancer). So you can get data like these: How does the US, which does not have the highest life expectancy by any stretch, have the highest 5-year survival for every cancer on this table? By aggressive screening of old people for a profit, which I assume is not a characteristic of the health system of any other country shown. For example, Switzerland had a 2012 age-standardized cancer mortality rate for women of 83.9/100,000, whereas for the US it was 104.2/100,000. How does the US have such a high percentage of women surviving 5 years with breast cancer while at the same time 20% more eventually die of breast cancer? It has to be via finding smaller tumors sooner. In poor countries, you can have lower incidence (because of less screening) and thus lower mortality from cancer, even in the context of higher mortality overall.
Therefore, lowering one's national cancer incidence and mortality rates can hardly be taken to be a sign of success. I can't see any value in this surveillance activity for incidence, driven as it is by arbitrary policies on case finding that don't necessarily benefit patients in terms of costs or longevity. I think we only want to know overall life expectancy, or perhaps quality-adjusted or disability-adjusted survival. But not cause-specific incidence. Maybe not even cause-specific mortality. Certainly not for a disease like cancer that is subject to so much arbitrariness in the timing of case ascertainment.

Arnaud Chiolero adds: I agree that 5-year survival cannot be used for the surveillance of cancer (see here). Incidence is also misleading for many cancers (e.g., breast cancer in the US), but not for all (incidence of lung cancer is coherent with smoking trends, at least for the moment; it will change when screening becomes more frequent). However, mortality rates are much less biased, and trying to measure and reduce cancer mortality rates is a reasonable goal, I think (see here).

An author with the delightfully sonorous surname "Schmiegelow" has published a paper in The Journal of the American Heart Association on the relation between obesity, metabolic abnormalities and increased CVD risk across various ethnic/racial groups of post-menopausal women in the WHI data. The authors use Cox models to estimate associations between risk factors and first CVD events over 13 years of follow-up. The results suggest some heterogeneity in the impact of BMI by race among those with no metabolic syndrome: among black women without metabolic syndrome, overweight was still associated with outcomes (HR=1.49), but not among white women without metabolic syndrome (HR=0.92, heterogeneity p-value 0.05). Likewise for obesity in the absence of metabolic syndrome, black women had a stronger relationship than white women (HR=1.95 versus HR=1.07, heterogeneity p-value = 0.02). Finally, overweight and obese women with ≤1 metabolic abnormality did not have increased CVD risk in any racial category, supporting the "fat but fit" hypothesis. This is all fine, but my question is: what is up with Figure 5? The legend says: "The shaded regions around the lines gives [sic] the 95% confidence band, the standard error, for the mean." I don't really understand that sentence, grammatically. The authors then write: "Please note that the scale of the y-axis differs for white (0 to 3), black (0 to 10), and Hispanic (0 to 80) women, so the widths of the CIs are not comparable." First of all, how can the Hispanic women with BMI=40 have an HR=80? That is a bit suspicious, to say the least. Then, about the confidence interval widths, I get that the vertical scale is different, but nonetheless:
For black women at BMI = 40 and >3 abnormalities, HR = 5, 95% CI 1 to 10 (10-fold wide).
For Hispanic women at BMI = 40 and >3 abnormalities, HR = 75, 95% CI 70 to 80 (1.5-fold wide).

So the estimate for Hispanic women is about 10 times more precise, despite their being the smallest group in the study (6700 whites, 5100 blacks, and 2500 Hispanics). Could this be the first hint that the WHI data are all completely fabricated?

The history of genetics is the history of promises. This statement is attributed to historian Nathaniel Comfort in a recent diatribe by David Dobbs, which was posted at Buzzfeed. The delightfully pessimistic Dobbs essay is especially hard on genetic testing, which took a further beating this week in the New England Journal of Medicine when it was reported that different laboratories offer wildly different interpretations of the same genetic tests. There seems to be a consensus around the notion that most genetic testing for predisposition to common diseases is "not ready for prime time", and yet the tsunami of false positive claims continues in the literature unabated, each time with recommendations for more population screening.

This press release reports on a 2013 study by Ziki et al in the American Journal of Cardiology, which describes a genetic variation that is linked to increased levels of triglycerides, and which the authors assert is more common than previously believed and disproportionately affects people of African ancestry. "The finding offers a clue as to why Africans and people of African descent have an increased risk of cardiovascular disease and type 2 diabetes compared to many other populations, says the study's senior author, Dr. Ronald Crystal, chairman of genetic medicine at Weill Cornell. African Americans with the variant had, on average, 52 percent higher triglyceride levels compared with blacks in the study who did not have the variant." But have a look at Figure 2 from Ziki et al: you can immediately see that the whiz-bang result here is actually due to one or two outliers in Panel A. In the Qatar sample in Panel A, there is one individual with the mutation who has high triglycerides. In the NY sample in Panel A, the distribution in the affected individuals is shifted up a bit, but is completely overlapping, except for 2 individuals. When your major new result rests on 1 or 2 individuals in each location, there is probably some basis to worry about over-interpretation. Especially because there are also outliers in the control group. And yet the authors imply (last paragraph of the discussion section) that we should now do population screening for this variant.
OK, this is a rather juvenile sentiment to express, but I have to confess that I love Wikipedia. The other day I was looking up information on pre-Copernican systems for explaining observed planetary motion, and came across the remarkable entry for Tycho Brahe. Brahe was a noseless 16th-century Danish nobleman known for important contributions to alchemy and the mechanics of an earth-centered solar system, none of which turned out to be even vaguely correct, of course.

So first, there is the nose issue. Wikipedia explains that while studying in the German city of Rostock in 1566, Tycho lost his nose in a duel, the dispute arising with his cousin over the veracity of a mathematical formula. As neither gentleman was sufficiently mathematically equipped to prove his opinion on the matter, they decided to have a duel in order to settle the argument. In the dark. If more math controversies were resolved in this way, think of the epidemic of noselessness that might befall university math departments! "For the rest of his life, he was said to have worn a replacement made of silver and gold, using a paste or glue to keep it attached", explains Wikipedia, along with a photo of a nose similar to that which Tycho wore (but not Tycho's actual nose):

The other issue of interest is Tycho's Elk. Wikipedia cites someone named Pierre Gassendi to the effect that Tycho kept a domesticated elk in his castle. The text then explains:
"[H]is mentor the Landgrave Wilhelm of Hesse-Kassel (Hesse-Cassel) asked whether there was an animal faster than a deer. Tycho replied, writing that there was none, but he could send his tame elk. When Wilhelm replied he would accept one in exchange for a horse, Tycho replied with the sad news that the elk had just died on a visit to entertain a nobleman at Landskrona. Apparently during dinner the elk had drunk a lot of beer, fallen down the stairs, and died." There is no accompanying photo of an elk, similar to the kind owned by Tycho, but not his actual elk. If one's research question is: "Does X have a causal effect on Y?", there are some obvious questions that one would ask before interpreting an observed association causally. For example, "Is there something unmeasured that affects both X and Y?". Among the most obvious of these preliminary questions is: "Does Y in fact cause X?" This would be especially salient in a cross-sectional study, which is why there is so much emphasis put on the term "prospective" in reporting designs that are NOT cross-sectional.
So if that is so painfully obvious that it would be embarrassing to even mention it again, why does one read reports like this one on a daily basis? The headline is: "Obese teens' brains unusually susceptible to food commercials, study finds", and the summary text below the headline is: "TV food commercials disproportionately stimulate the brains of overweight teenagers, including the regions that control pleasure, taste and -- most surprisingly -- the mouth, suggesting they mentally simulate unhealthy eating habits that make it difficult to lose weight later in life." The study conducted fMRI exams on 40 right-handed adolescents, 20 normal weight and 20 obese, so everything is cross-sectional. The discussion reveals an interpretation clearly in the causal direction that being obese makes your brain more susceptible to food commercials: "Collectively, these findings suggest that higher-adiposity adolescents more strongly recruit oral somatomotor and gustatory regions pertinent to eating behaviors while viewing food commercials, in comparison with their lower-adiposity counterparts." How is it that none of these study authors with doctoral degrees seems to have ever wondered how these children got to be obese in the first place?

Writing in the New York Review of Books, Jeremy Waldron has a nice take-down of "nudging", the new buzzword for paternalistically tilting behavior for the good of the individual and/or society.
Waldron has some great paragraphs on knowledge translation in public health, which warrant wider reading and some thinking about why we disseminate information in the way that we do: "[B]etween 15 and 20 percent of regular smokers (let’s say men sixty years old, who have smoked a pack a day for forty years) will die of lung cancer. But regulators don’t publicize that number, even though it ought to frighten people away from smoking, because they figure that some smokers may irrationally take shelter in the complementary statistic of the 80–85 percent of smokers who will not die of lung cancer. So instead they say that smoking raises the chances of getting lung cancer. That will nudge many people toward the right behavior, even though it doesn’t in itself provide an assessment of how dangerous smoking actually is (at least not without a baseline percentage of nonsmokers who get cancer). Or consider the way lawmakers nudge people away from drunk driving. There are about 112 million self-reported episodes of alcohol-impaired driving among adults in the US each year. Yet in 2010, the number of people who were killed in alcohol-impaired driving crashes (10,228) was an order of magnitude lower than that, i.e., almost one ten thousandth of the number of incidents of DWI. The lawmakers don’t say that 0.009 percent of drunk drivers cause fatal accidents (implying, correctly, that 99.991 percent of drunk drivers do not). They say instead that alcohol is responsible for nearly one third (31 percent) of all traffic-related deaths in the United States—which nudges people in the right direction, even though in itself it tells us next to nothing about how dangerous drunk driving is. There’s a sense underlying such thinking that my capacities for thought and for figuring things out are not really being taken seriously for what they are: a part of my self. What matters above all for the use of these nudges is appropriate behavior, and the authorities should try to elicit it by whatever informational nudge is effective. We manipulate things so that we get what would be the rational response to true information by presenting information that strictly speaking is not relevant to the decision.... ...[I]t helps us see that any nudging can have a slightly demeaning or manipulative character. Would the concern be mitigated if we insisted that nudgees must always be told what’s going on? Perhaps. As long as all the facts are in principle available, as long as it is possible to find out what the nudger’s strategies are, maybe there is less of an affront to self-respect. Sunstein says he is committed to transparency, but he does acknowledge that some nudges have to operate “behind the back” of the chooser." Our paper (first-authored by Britt McKinnon) sat at Maternal Child Health Journal for almost 6 months, and came back today as "revise and resubmit" with 6 reviews, but each one was only a few lines long.
For example, Reviewer 3 had only 3 lines of comments, which were numbered 1), 2) and 2). The second comment number 2 was: "The use of word WE is repeated so many times. Please change it at least in a few places to enhance readability." to which we intend to respond: "We agree." But the best comment of all time goes to reviewer 5, who states (with no apparent irony or embarrassment): "The paper is very statistically oriented and would do well with a more ideological approach."

In this new article, Dowd and Zajacova consider the question of cumulative dose of obesity over the lifecourse, and how this might affect later-life functional disability. They analysed data from over 7000 adults ages 60–79 from the 1999–2010 NHANES survey, with the outcome being a self-report of difficulty or severe difficulty with any of eight functional tasks. The exposure was self-reported BMI at age 25 (does anyone really know that?). Predictably, there was an association between obesity at age 25 (BMI > 30 kg/m2) and functional limitations, and this was greatly diminished in magnitude after controlling for current BMI. For severe limitations, for example, the covariate-adjusted OR for age-25 obesity was 2.72 (95% CI: 2.13–3.46), and this diminished to 1.32 (95% CI: 1.00–1.75) after adjustment for current BMI. The authors interpret the persistent age-25 obesity effect as an indication of a role for lifecourse burden of obesity, which certainly makes some sense. There are quite a few potential concerns here, but I think that many of them become readily apparent as soon as you try to draw a DAG and ask what kind of effect we are trying to estimate. With respect to the age-25 obesity OR, the late-life BMI measure is not a confounder, since it is downstream of the exposure, and so must be an intermediate. Therefore, we are essentially asking if early adult obesity has a direct causal effect on functional impairment that is not relayed through late-life BMI. Setting aside questions about the causal interpretation of BMI as an exposure, one has to think about the fact that the normative tendency is to increase weight into the 50s and 60s, at which point a substantial number of people start to lose weight again. There are many factors associated with loss of weight among the elderly, none of which bode well for functional health. The relation between the covariate and the outcome is therefore quite problematic. They are likely to share an unmeasured common ancestor from the table above, such as illness. But as shown in the same table, another common cause of weight loss is functional impairment, the outcome variable! Moreover, although a single effect estimate is reported, I am sceptical that the process is comparable for a thin person at age 25 who is SET to be thin or fat in late life, versus a fat person at age 25 who is SET to be thin or fat in late life. This implies a likely interaction between the two BMI measures, which is not considered by the authors.
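To make that last point concrete, here is a minimal sketch of the kind of model that would let the age-25 effect vary by current BMI; the variable names are hypothetical placeholders, not the authors' (the NHANES variables would have to be constructed accordingly):

* hypothetical names: limit (any functional limitation), obese25 (BMI>30 at age 25),
* bmi_now (current measured BMI), age, female
logistic limit i.obese25##c.bmi_now c.age i.female
* predicted risk of limitation by age-25 obesity, across levels of current BMI;
* a gap that changes across bmi_now is the interaction a single adjusted OR ignores
margins obese25, at(bmi_now=(20 25 30 35 40))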
If the follow-up were at age 50, weight loss would be considerably rarer, raising questions of positivity. By ages 70+, weight loss is common, but also highly informative. And don't even get me started on attrition bias here.

Jih and colleagues report that the WHO advises lower BMI cut-points for overweight/obesity in Asians. Using these cut-points, the authors analyse the 2009 adult California Health Interview Survey (n = 45,946) of non-Hispanic Whites, African Americans, Hispanics and Asians. They report that, using the ethnic-specific criteria, Filipinos have the highest prevalence of overweight/obesity, and that most Asian subgroups have higher diabetes prevalence at lower BMI cut-points.
My question is: if BMI already accounts for stature, then why would there be different cut-points for Asian sub-populations? One might answer: "Because at a given BMI, Asians have a higher diabetes risk than non-Hispanic Whites." But if that is the justification, then shouldn't we also have a lower cut-point for poor people? Moreover, the WHO cut-points were recommended for Asians, not for Asian-Americans. We know from decades of research that immigrants take on the risk profile of their adopted countries. So how many generations would it take for Asian-Americans to get the American cut-points instead of the Asian cut-points? Obviously the application of the Asian threshold to people born in the US turns a regional norm into a racial trait. The authors write: "In particular, we focused on prevalence of diabetes in the category 23–24.9 kg/m2, in which Asians are considered overweight by the WHO guidelines while non-Asians are not, and in the category 27.5–29.9 kg/m2, in which Asians are considered obese but the other groups are classified as overweight." Do these folks know what a BMI of 23 kg/m2 looks like? If someone thinks that women 3 and 4 below are "overweight", I think they may have a perceptual disorder.

The authors of this paper calculated age-standardized colorectal cancer death rates for three education categories by race/ethnicity and state among individuals age 25 to 64 years from 2008 through 2010.
They then calculated the proportion of premature deaths resulting from colorectal cancer that could potentially be averted in each state by applying the average death rate for the five states with the lowest rates among the most educated whites (Connecticut, North Dakota, Utah, Vermont, and Wisconsin) to all populations. Education had a big effect for blacks and whites in almost all states, and the authors concluded that half of the premature deaths resulting from colorectal cancer that occurred nationwide from 2008 through 2010, or 7,690 deaths annually, would have been avoided if everyone had experienced the lowest death rates of the most educated whites. More premature deaths could have been avoided in southern states (60-70%) than in northern and western states (30-40%).

This article by David Tuller in the New York Times (February 27, 2015) reports on the discovery of a biomarker for chronic fatigue syndrome, which apparently is no longer called chronic fatigue syndrome. The story reports that the Institute of Medicine has changed the name of this condition to "systemic exertion intolerance disease", which I interpret as an attempt to make it sound more like a real disease that could be treated pharmaceutically, and less like a label for lazy people. Anyway, the scientists speculate that this could form the basis for the first clinical diagnosis for the illness. The research article was published in the new open-access journal Science Advances, which critics have assailed not only because of unprecedentedly high publication fees (more than $5000 per article) but also because it turns out to not really be open access after all.
But the real point of this post is to note the shockingly frank description of p-hacking in the news report. These authors seem to have no shame about this, which I suspect means that they don't really know that they did anything wrong. Neither, apparently, do the reviewers of Science Advances. Here is how the New York Times describes the analysis in the paper: "For the study, the research team — which included scientists from Columbia, Stanford and Harvard — tested the blood of 298 patients with the syndrome, and 348 healthy people who served as a control group, for 51 cytokines, substances that function as messengers for the immune system. When the team compared all the patients with all the healthy controls, they found no significant differences between the two groups. But after dividing the patients into two cohorts — those who had been sick for less than three years and those who had been sick longer — they found sharp differences. And both sets of patients were different from healthy controls." They don't mention how many cut-points they tried before they came up with 3 years. Any bets on whether this one holds up or not? While we're looking at this one, here's a line from the paper that impressed me on page 3: "Two proinflammatory cytokines had a prominent association with short-duration ME/CFS (Table 2). This association was markedly elevated for interferon-γ (IFN-γ), with an OR of 104.77 (95% CI, 6.975 to 1574.021; P = 0.001)." That's an impressively wide interval. And I confess that I was a bit baffled by this bit of text on page 8: "GLM analyses were applied to examine both the main effect of diagnosis and the main and interaction effects of different fixed factors including diagnosis (two-level case versus control comparisons and three-level short duration versus long duration versus control analyses) and sex, with age adjusted as a continuous covariate. Because GLM uses the family-wise error rate, no additional adjustments for multiple comparisons were applied." So you can't be accused of p-hacking if you fit a GLM because it fits a "family-wise error rate"? I have no idea what they're talking about there, but maybe that's my problem, not theirs.

In this study, the authors recruited 19 white volunteers (mean age 31.9 ± 1.8) and 28 black volunteers (mean age 40.4 ± 2.4) and, with no adjustments for age or any other relevant factors (e.g. diet, obesity, co-morbidity, physical activity, prescriptions, socioeconomic status, etc.), concluded that
"[o]ur study is the first to demonstrate that the Gq pathway is differentially regulated by race." (abstract). Despite the small numbers, the authors also further restricted by factors that could similarly be affected by age, diet, comorbidity or other unmeasured characteristics distributed differentially by race: "To determine whether potential differences exist in the kinetics of platelet activation, the time to 50% aggregation was measured, which encompasses both the lag time (the time from agonist addition until the start of aggregation) as well as the rate of aggregation. Because not all donors aggregated in response to threshold doses of PAR4-AP, platelets from subjects who failed to reach 50% aggregation were eliminated from the analysis. For the donors who did not reach 50% aggregation, 6 of 13 white donors at 35 μmol/L of PAR4-AP and 3 of 12 white donors at 50 μmol/L of PAR4-AP were excluded from the analysis. Even with the absence of several white subjects because of lack of platelet aggregation, platelets from white donors were slower to 50% aggregation compared with platelets from black donors at low concentrations of PAR4-AP (35–50 μmol/L; Figure 1H). These data suggest that the racial difference in PAR4-mediated platelet activation is intrinsic to the platelet." (p. 2646) The journal Arteriosclerosis, Thrombosis, and Vascular Biology, is an American Heart Association journal that is considered pretty reputable, or so I thought. This news story on inhabitat.com reports on Bill H.R. 1422 (the Science Advisory Board Reform Act):
House Passes Bill That Prohibits Expert Scientific Advice to the EPA
By Beverley Mitchell, Inhabitat, 25 February 15

While everyone's attention was focused on the Senate and the Keystone XL decision on Tuesday, some pretty shocking stuff was quietly going on in the House of Representatives. The GOP-dominated House passed a bill that effectively prevents scientists who are peer-reviewed experts in their field from providing advice — directly or indirectly — to the EPA, while at the same time allowing industry representatives with financial interests in fossil fuels to have their say. Perversely, all this is being done in the name of "transparency." Bill H.R. 1422, also known as the Science Advisory Board Reform Act, passed 229-191. It was sponsored by Representative Chris Stewart (R-UT). The bill changes the rules for appointing members to the Science Advisory Board (SAB), which provides scientific advice to the EPA Administrator. Among many other things, it states: "Board members may not participate in advisory activities that directly or indirectly involve review or evaluation of their own work." This means that a scientist who had published a peer-reviewed paper on a particular topic would not be able to advise the EPA on the findings contained within that paper. That is, the very scientists who know the subject matter best would not be able to discuss it. On Monday, the White House issued a statement indicating it would veto the bill if it passed, noting: "H.R. 1422 would negatively affect the appointment of experts and would weaken the scientific independence and integrity of the SAB." Representative Jim McGovern (D-MA) was more blunt, telling House Republicans on Tuesday: "I get it, you don't like science. And you don't like science that interferes with the interests of your corporate clients. But we need science to protect public health and the environment." Director of the Union of Concerned Scientists Andrew A. Rosenberg wrote a letter to House Representatives stating: "This [bill] effectively turns the idea of conflict of interest on its head, with the bizarre presumption that corporate experts with direct financial interests are not conflicted while academics who work on these issues are. Of course, a scientist with expertise on topics the Science Advisory Board addresses likely will have done peer-reviewed studies on that topic. That makes the scientist's evaluation more valuable, not less." Two more bills relating to the EPA are set to go to the vote this week, bills that opponents argue are part of an "unrelenting partisan attack" on the EPA and that demonstrate more support for industrial polluters than for the public health concerns of the American people.

This press release, reporting on this article, reveals that "Increasing diet soda intake is directly linked to greater abdominal obesity in adults 65 years of age and older." I think that the adjective "directly" is meant to specify that the correlation is positive, but it could be misread as implying the absence of mediators.
In any case, the importance of this correlation is described as follows: "Findings raise concerns about the safety of chronic diet soda consumption, which may increase belly fat and contribute to greater risk of metabolic syndrome and cardiovascular diseases." Moreover, the authors recommend that "older individuals who drink diet soda daily, particularly those at high cardiometabolic risk, should try to curb their consumption of artificially sweetened drinks." The authors are proposing that diet soda makes people fatter. The obvious alternative hypothesis, which seems substantially more plausible to me, is that people who are gaining weight drink more diet soda as an attempt to mitigate this trend. It could even be that this strategy is successful, and that without diet soda they would have gained even more weight. The authors measured reported diet soda consumption and anthropometrics simultaneously at each follow-up wave of the study, and no lags were employed to account for this possibility of reverse causation. To their credit, in the "limitations" section of the paper, the authors do question this association: "Whether [diet soda index] exacerbated the [waist circumference] gains observed in SALSA participants is unclear; the analyses include adjustment for anthropometric measures and other characteristics at the outset of each follow-up interval, but other factors—including family history and perceived personal weight-gain and health-risk trajectories—that increased [change in waist circumference] but were not captured in the analyses may have influenced participant decisions to use [diet soda]." But the adjustment for baseline doesn't solve the potential reverse causation here, because baseline anthropometrics are measured at the same time as soda consumption. And in any case, this doubt about the interpretation of the effect is completely lost by the time the authors reach the conclusion section, and no caveats whatsoever are expressed in the press release, which ends with the recommendation to curb consumption of artificially sweetened drinks. They don't specify what people should be drinking instead.

In the April 11, 2015 issue of The Lancet, Richard Horton asks what we can do about the fact that "A lot of what is published is incorrect." As a solution, Horton suggests a p-value criterion for "significance" of 10 to the -7. But in genetic epidemiology they already use something around 10 to the -7, and the results in that field are even less reproducible than the rest of epidemiology! No, clearly we need to go in the other direction and get rid of null hypothesis significance testing altogether. Rather, as alluded to earlier in the essay, we need to change the incentive structure so that one doesn't get rewarded for being published, but instead gets rewarded for being right. Suppose that if your result were shown to be wrong, you had to give back the grant money. Then you'd see people checking for errors! In engineering, if your bridge falls down or your airplane crashes, you are gonna get fired. But do you see Walt Willett getting fired for the epidemiologic equivalent? Changing the p-value criterion before you can declare something to be "significant" is not going to help this problem.
This press release about this article considers gene by environment (childhood abuse) interactions contributing to the risk of depression. The authors analysed a sample of 2679 Spanish primary care patients ages 18 to 75. The paper reports that those who have low functioning alleles in functional genes for neurotrophism (BDNF) and serotonin transmission are particularly vulnerable to the effects of childhood abuse (psychological, physical or sexual) as a risk factor for clinical depression. They also included an additional level of interaction with treatment, taking into account the response to antidepressants.
I am sceptical about the statistical power necessary for testing 3-way interactions among 2679 subjects, not to mention the use of odds ratios for common outcomes. But my biggest gripe is that they tested for synergism by looking at the p-value for the product interaction term in a logistic regression model, which detects deviation from multiplicativity, when deviation from additivity is really what they want here. In fact, their estimates are mostly greater than multiplicative, and therefore there is no doubt about them being synergistic; but the paper still presents an incorrect method for assessing this kind of G x E interaction. Not to mention that they exponentiate the interaction-term coefficient and call that an odds ratio. It would of course be better to use a measure like RERI to detect greater-than-additive joint effects between genotype and abuse history, and since the outcome is not rare, it is also better to rely on the RR rather than the OR in making that calculation, as noted by Kalilani and Atashili 2006.
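For what it's worth, here is a minimal sketch of that RERI calculation in Stata, with hypothetical placeholder names (depress for the depression outcome, lowfunc for carrying the low-functioning allele, abuse for childhood abuse), not the authors' actual variables:

* modified Poisson (log link, robust SEs) so the coefficients are log RRs, since the outcome is common
glm depress i.lowfunc##i.abuse, family(poisson) link(log) vce(robust) eform
* RERI on the RR scale: RR11 - RR10 - RR01 + 1; values above 0 indicate greater-than-additive joint effects
nlcom exp(_b[1.lowfunc] + _b[1.abuse] + _b[1.lowfunc#1.abuse]) - exp(_b[1.lowfunc]) - exp(_b[1.abuse]) + 1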
This news story reports on this new article about smoking, obesity and genotype. The variant of interest, which causes smokers to smoke more heavily, was associated with increased BMI, but only in those who have never smoked. This seemed at first a surprising and intriguing new result, and the press release makes it sound really exciting as a paradigm for GxE interactions. All good.

Except that when I went to read the article, I found that the estimated effect of this SNP in CHRNA5-A3-B4 is to reduce BMI by a whopping 0.74%. I'll bet that less than 1% of your BMI is about the amount of weight you gain or lose over the course of a meal (for concreteness, for someone with, say, a BMI of 27 kg/m2 and a height of 1.70 m, 0.74% of BMI is about 0.2 kg/m2, or roughly 0.6 kg of body weight). The same SNP caused 0.35% higher BMI in never smokers, an effect half as large, and I seriously doubt that a third of a percent of a BMI point is even within the measurement error of the scales they used in the study (and it is certainly less than the normal weight variation across a typical day). The p-values here are small because the study is so enormous, but the effect sizes are so absurdly trivial that this could never have any observable impact at the individual level. To be fair, if the estimated effect were huge, it would be implausible, so this is a finding that could really be true. But aside from what it might reveal about the underlying biology, it seems irrelevant. Moreover, when effects are this small, the tiniest violation (e.g., of the exclusion restriction assumption for their Mendelian randomization) could easily dwarf the claimed result. Not something that Sandro Galea could call "consequential epidemiology", I don't think. I like the idea of GxE in principle, since the world must work this way. But if effects are in this range, which seems realistic, then it doesn't seem that we can have a lot of confidence in the results (i.e., it wouldn't take very much bias to knock you off center by a third of a percent), and even if true, the results wouldn't have any direct implications for either population or individual health. I guess if we accumulated enough such examples, it could start to make a meaningful difference in phenotype. Maybe we're just at the beginning of a long process of knowledge accumulation.