Zero-inflated and hurdle models of count data with extra zeros: Examples from an HIV-risk reduction intervention trial.
In clinical trials of behavioral health interventions, outcome variables often take the form of counts, such as days using substances or episodes of unprotected sex. Classically, count data are assumed to follow a Poisson distribution; in practice, however, such data often display greater heterogeneity in the form of excess zeros (zero-inflation), greater spread in the values (overdispersion), or both. Greater sample heterogeneity may be especially common in community-based effectiveness trials, where broad eligibility criteria are implemented to achieve a generalizable sample. This article reviews the characteristics of the Poisson model and the related models that have been developed to handle overdispersion (negative binomial (NB) model), zero-inflation (zero-inflated Poisson (ZIP) and Poisson hurdle (PH) models), or both (zero-inflated negative binomial (ZINB) and negative binomial hurdle (NBH) models). All six models were used to model the effect of an HIV-risk reduction intervention on the count of unprotected sexual occasions (USOs), using data from a previously completed clinical trial among female patients (N = 515) participating in community-based substance abuse treatment (National Drug Abuse Treatment Clinical Trials Network protocol CTN-0015). Goodness of fit and the estimates of treatment effect derived from each model were compared. The ZINB model provided the best fit and yielded a medium-sized effect of the intervention.
Conclusions: This article illustrates the consequences of applying models with different distributional assumptions to the same data. Taken together, the results underscore the importance of finding, for any given data set, the most appropriate model for the outcome data in order to arrive at the most accurate estimate of the effect of a treatment intervention. If the model used does not closely fit the shape of the data distribution, the estimate of the intervention effect may be biased, either over- or underestimating it. Investigators designing clinical trials should be encouraged to hypothesize in advance the distribution of the outcome counts, based on their knowledge of the population and the intervention being tested as well as on prior data.
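As an illustration of how the zero-inflated and hurdle formulations differ from the plain Poisson model, the following is a minimal sketch of the three probability mass functions using only the Python standard library. The function names, the parameter names (`lam`, `pi`, `p0`), and the numeric values are assumptions chosen for illustration; they are not taken from the trial data or from the article's analysis.

```python
import math

def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson count with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k, lam, pi):
    """Zero-inflated Poisson: with probability pi the count is a
    structural zero; otherwise it is drawn from Poisson(lam), which
    can itself produce zeros."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base

def hurdle_pmf(k, lam, p0):
    """Poisson hurdle: zeros occur with probability p0; positive counts
    follow a zero-truncated Poisson(lam), so all zeros come from the
    hurdle part alone."""
    if k == 0:
        return p0
    return (1.0 - p0) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))

# Hypothetical parameter values, for illustration only.
lam, pi, p0 = 3.0, 0.4, 0.45

for name, pmf in [
    ("Poisson", lambda k: poisson_pmf(k, lam)),
    ("ZIP", lambda k: zip_pmf(k, lam, pi)),
    ("Hurdle", lambda k: hurdle_pmf(k, lam, p0)),
]:
    total = sum(pmf(k) for k in range(60))  # should be close to 1
    print(f"{name}: P(Y=0) = {pmf(0):.3f}, total mass = {total:.6f}")
```

Under these assumed values, both the ZIP and hurdle distributions place far more mass at zero than the Poisson distribution with the same `lam`, which is the feature that makes them candidates for outcomes such as counts of unprotected sexual occasions. The negative binomial variants replace the Poisson component with a negative binomial one to accommodate overdispersion as well.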
Related protocols: CTN-0015