Caries Res 2012;46:413–423

Review and Recommendations for Zero-Inflated Count Regression Modeling of Dental Caries Indices in Epidemiological Studies

Preisser J.S.a · Stamm J.W.b · Long D.L.a · Kincade M.E.a
aDepartment of Biostatistics, Gillings School of Global Public Health, University of North Carolina, and bDepartment of Dental Ecology, School of Dentistry, University of North Carolina, Chapel Hill, N.C., USA
email Corresponding Author


 goto top of outline Key Words

  • Dental caries
  • Excess zeros
  • Incidence
  • Increment
  • Overdispersion
  • Prevalence

 goto top of outline Abstract

Over the past 5–10 years, zero-inflated (ZI) count regression models have been increasingly applied to the analysis of dental caries indices (e.g. DMFT, dfms). The main reason for that is linked to the broad decline in children’s caries experience, such that dmf and DMF indices more frequently generate low or even zero counts. This article specifically reviews the application of ZI Poisson and ZI negative binomial regression models to dental caries, with emphasis on the description of the models and the interpretation of fitted model results given the study goals. The review finds that interpretations provided in the published caries research are often imprecise or inadvertently misleading, particularly with respect to failing to discriminate between inference for the class of susceptible persons defined by such models and inference for the sampled population in terms of overall exposure effects. Recommendations are provided to enhance the use as well as the interpretation and reporting of results of count regression models when applied to epidemiological studies of dental caries.

Copyright © 2012 S. Karger AG, Basel


Dental caries, the most common disease of childhood, can be associated with severe health, social and economic consequences, which can persist over a lifetime. Statistical modeling plays an important role in understanding caries risk factors and combating their development. More than 50 years ago, Grainger and Reid [1954] observed that caries counts are not generally approximated by a normal distribution [see also Lewsey et al., 2000]. They recommended the negative binomial distribution for describing dental caries indices in populations recognizing, as did Böhning et al. [1999] decades later, that caries counts tend to exhibit overdispersion, i.e. excess variation in them relative to the Poisson distribution. Subsequently, researchers [Syrjälä et al., 2003; Broffit et al., 2007; Ismail et al., 2008; Maserejian et al., 2008b; Thitasomakul et al., 2009; Wong et al., 2011] have often analyzed the effects of risk factors on dental caries indices using negative binomial regression [Hilbe, 2008].

As oral health has improved in populations over time [Campus et al., 2009, and references therein], epidemiological investigations often find that the traditional count data models provide poor fits to caries data. Distributions of caries counts are increasingly characterized by a large number of zero counts, with proportions in excess of what is expected under the Poisson and negative binomial distributions. To handle such ‘excess zeros’, Böhning et al. [1999], in a paper published in the statistics literature [see Simonoff, 2003, for comment], proposed zero-inflated Poisson (ZIP) regression for modeling the decayed, missing and filled teeth index (DMFT). Yet while ZIP models account for large counts of zeros, they do not adequately account for data that have sizeable numbers of large caries counts. To address both excess zeros and overdispersion, Lewsey and Thomson [2004] used zero-inflated negative binomial (ZINB) regression models in examining the effect of economic status on DMF data. In the past 5 years there have appeared over a dozen publications with applications of both types of these zero-inflated (ZI) models to dental caries indices. Their emergence warrants a review.

To help explain the recent trend in applications of ZI models to caries, consider the following example that illustrates the potential inadequacy of traditional models for suitably describing distributions of caries counts. Figure 1a shows a representative distribution of caries counts with a moderately large number of zeros, as is commonly encountered in surveys and population-based studies of caries. For this distribution, the mean and variance of the counts denoted Y are 1.2 and 1.68, respectively [calculated as E(Y) = ΣyP(y) and Var(Y) = Σ [y – E(Y)] 2P(y), respectively, where P(y) is the relative frequency of count y]. Furthermore, the frequency of zero counts is 40% while the cumulative frequency of counts of size 4 or greater is 5%. A Poisson distribution cannot adequately describe the distribution in figure 1a because all Poisson distributions have a single parameter (i.e. the mean) to describe the distribution of counts, where their variance equals their mean. Thus, not only does a Poisson distribution with a mean of 1.2 have a variance of 1.2, but it additionally specifies a relative frequency of zero counts of 30% and a relative frequency for counts of 4 or greater of 3.4% (as determined from its probability function), both of which are too low to adequately describe the distribution in figure 1a. Additionally, even the negative binomial distribution, which has a second parameter allowing for extra variation (overdispersion) in Y relative to the Poisson, often fails to account for large fractions of zeros commonly observed in studies of dental caries.

Fig. 1. Four representative distributions of caries counts are shown in this panel plot. a A ZIP distribution (ψ = 0.25, µ = 1.6) as a single population. It has π = 0.6, mean ν = 1.2, variance 1.68 and a relative frequency of zero counts of 40%. b The same distribution as in a but as a mixture of two subpopulations [or nonsusceptible (grey bar) and susceptible (black bars) latent classes]. ψ = 0.25, mean for at-risk group = 1.6. c A ZIP distribution (ψ = 0.10, µ = 2.0) that is defined relative to the distribution of counts in b through a single dichotomous covariate in equations 1 and 2 having ‘consistent’ trends in the two ZIP model parts. It has overall mean ν = 1.8, variance 2.16 and a relative frequency of all zero counts of 22%. d A ZIP distribution (ψ = 0.40, µ = 2.0) that is defined relative to b through a dichotomous covariate having ‘inconsistent’ trends in the two ZIP model parts. It has overall mean ν = 1.2, variance 2.16 and a relative frequency of all zero counts of 48%.

To overcome these limitations, the Poisson and negative binomial models have been extended to better incorporate the excess zeros, giving rise to ZIP and ZINB models. The expanded capacity for describing caries count distributions is illustrated by the ZIP distribution (defined in the Appendix) with parameters ψ = 0.25 and µ = 1.60 for caries counts Y, which perfectly describes the frequency of counts in figure 1a. The fact that this example is constructed to give a perfect fit does not diminish the fact that ZIP models provide expanded families of count distributions that often give much better fits than the Poisson distribution to counts of caries indices, especially when large numbers of zeros are present. Analogous arguments exist for the utility of the ZINB model relative to the negative binomial model in accounting for both extra zeros and extra-Poisson variation.

Notwithstanding their increased usage due to providing improved model fits for counts of caries indices, analysis results based on ZIP and ZINB models may be difficult to interpret [Mwalili et al., 2008; Solinas et al., 2009]. For a fixed set of covariate values, ZI models constitute a mixture of a standard probability distribution for count data, typically Poisson or negative binomial, representing a ‘susceptible’ subpopulation of children said to be at risk for a disease or condition (e.g. dental caries), and a subpopulation of ‘nonsusceptible’ children with only zero counts who are considered to be not at risk. For a single population (i.e. a model with no observed covariates), figure 1b gives an alternative representation of the relative frequencies of counts for a ZIP model with parameters ψ = 0.25 and µ = 1.60. It illustrates that a randomly selected child from the overall population is not at risk for caries (has excess zero) with probability 0.25; otherwise, with probability 1 – ψ = 0.75 the child is susceptible to caries and is assumed to have a caries count, a zero or otherwise, from a Poisson distribution with a mean µ of 1.60. Note that the probability of an excess zero is given by the length of the grey bar, and the mean caries count for the susceptible subgroup is the mean of the distribution represented by the black bars. The challenges that dental researchers face in understanding ZI models are related to the fact that the composition of the two respective subpopulations or groups in figure 1b is a theoretical and mathematical construct such that the specific group membership of any given subject in a study with a zero count is unknown; accordingly, these groups are referred to in the literature as latent classes.

In fact, figure 1a and b displays identical overall distributions for Y. Specifically, figure 1a depicts the overall frequency distribution resulting from the mixture of the two subgroups of figure 1b, grey and black, without distinguishing between them. The only difference in the figures is that figure 1a, by depicting a single overall distribution for Y, reflects the view of Mwalili et al. [2008] that the mixture distribution model representation (fig. 1b) is only a convenient explanation for a distribution of counts with excess zeros.

Noting that oral health research employing ZI models often limits consideration to the model-based latent class parameters ψ and µ via interpretation of regression coefficients that describe their variation, Albert et al. [2011] argue that insufficient emphasis has been given to the effects of caries risk factors on the overall population from which the study sample was drawn. From this perspective, figure 1a displays the distribution for Y that has overall mean caries count, say ν = E(Y), and the probability of a positive caries count, denoted π = Pr(Y > 0), represented by the fraction of all subjects with counts greater than zero. In a cross-sectional study, for example, ν is caries severity or extent and π is caries prevalence in the sampled population. Accordingly, Albert et al. [2011] define ‘overall effects’ as the contrasts (i.e. differences or ratios) of values taken by ν (or π) as they vary across subgroups defined by caries risk factors.

Although epidemiological investigations of risk factors on caries often report on the ZI model parameters ψ and µ and the corresponding subpopulations in figure 1b that they characterize [Gilthorpe et al., 2009], ZI models can be used for investigating overall effects on the caries count Y because ν and π, which we refer to as the population oral health parameters, have known relationships to ψ and µ [Lambert, 1992; Böhning et al., 1999; Albert et al., 2011]. Specifically, caries prevalence π and caries severity ν are related to the ZIP (or ZINB) model parameters as follows:

π = (1 – ψ) [1 – exp(–µ)],

and ν = µ(1 – ψ). Thus, the ZI model parameters ψ and µ provide only indirect information on the population oral health parameters π and ν. As long as ψ > 0, the prevalence π is always less than the probability of not being an excess zero, 1 – ψ. Further, ν ≤ µ so that caries severity in the overall population cannot be greater than the mean count of the susceptible population. Applying these mathematical relationships to the example in figure 1b where ψ = 0.25 and µ = 1.60, prevalence is calculated as π = 0.60 and severity is ν = 1.20 (as noted above) for the overall population represented by figure 1a. Analogous arguments can be made when π denotes caries incidence and ν is the mean increment in a longitudinal study.

The motivation for reviewing the usage and reporting of ZI models in the dental caries literature is the belief that drawing well-articulated and valid conclusions from ZI models relies on an understanding of the differences between the ZI model parameters and the population oral health parameters, a distinction made two decades ago with an illustration from manufacturing by Lambert [1992] and later for dental caries by Böhning et al. [1999] and Albert et al. [2011]. This article reviews the caries literature for details of applications of ZI models to dental caries counts and assesses, with respect to stated study goals, the quality of interpretations given to the numerical results of these analyses. Finally, recommendations for improved usage and reporting of ZI models are provided.

goto top of outline Materials and Methods

goto top of outline Overall Effects in ZI Models

The aims of the literature review require consideration of how the presence of excess zeros in caries counts should be taken into account in statistical analysis, interpretation and reporting when interest is in the overall effects of risk factors on caries prevalence (or incidence) π and severity (or mean increment) ν. ‘Overall effects’ refer to the effects of risk factors on caries indices in the overall population represented by the study participants, and not in the effects within a subset of the overall population defined by an unobserved variable assumed to define subgroups (latent classes) that partition that population [Albert et al., 2011]. For simplicity, consider a single dichotomous covariate, xi = 0 or 1, appearing in each ZI model component for the i-th child. The probability of an excess zero is typically modeled by a logistic regression, which is expressed in its probability form by


and the mean caries counts for at-risk children are modeled via a log-linear model (equivalently, a generalized linear model with log link function) by

µi(xi) = exp(β0 + β1xi). (2)

The regression coefficient γ1 in equation 1 represents the log odds ratio of having an excess zero or being in the not-at-risk group for the effect of xi = 1 relative to xi = 0. The coefficient β1 represents the log of the incidence rate ratio (IRR) for the effect of xi = 1 relative to xi = 0 in the at-risk group, i.e. ln [µi(xi = 1)/µi(xi = 0)]. Often γ1 and β1 are not of primary interest [Albert et al., 2011]. Rather, their importance lies in their relationship to prevalence and severity in the overall population. Substitution of equations 1 and 2 into νi(xi) = µi(xi) [1 – ψi (xi)] gives the overall mean severity


Then, the ratio of means, νi(xi = 1)/νi(xi = 0), or IRR for the overall effect of xi on caries severity is:


Thus, a ratio factor on the right-hand-side of equation 3, which depends on the excess-zero model parameters from equation 1, multiplies the IRR for the at-risk latent class [exp(β1)] to produce the overall IRR giving the effect of xi on caries severity in the overall population. Equation 3 generalizes for a continuous covariate (see Appendix).

The signs of β1 and γ1 impact the direction of the bias when the at-risk latent class IRR is used to estimate the IRR for caries severity in the overall population. First, if γ1 < 0 (i.e. negative sign), then the ratio factor in equation 3 will be greater than 1.0 and, thus, exp(β1) will underestimate the IRR for caries severity in the overall population. On the other hand, if γ1 > 0 (i.e. positive sign), then the ratio factor in equation 3 will be less than 1.0 and exp(β1) will overestimate the IRR for caries severity in the overall population.

Second, whether β1 and γ1 have consistent trends (i.e. opposite signs, one positive and the other negative) or inconsistent trends (same signs, both positive or both negative) usually determines the direction of the bias in relation to the null value of no covariate effect. The scenario of consistent trends is where a covariate decreases (increases) the probability of an excess zero and increases (decreases) the at-risk class mean. The less common scenario of inconsistent trends is where a covariate decreases (increases) the probability of an excess zero and decreases (increases) the at-risk class mean. Considering equation 3, and that β1 < 0 implies exp(β1) < 1 while β1 > 0 implies exp(β1) > 1, the impact of consistent trends and inconsistent trends in samples sufficiently large for estimates to reflect the relationship of parameters is as follows:

(1) When a covariate has consistent trends in the two ZI model parts (opposite signs), the at-risk latent class IRR will in most cases be biased towards the null hypothesis of no effect in the sense that the IRR (latent) is closer to 1.0 than IRR (severity).

(2) When a covariate has inconsistent trends in the two ZI model parts (same sign), the at-risk latent class IRR will in most cases be biased away from the null hypothesis of no effect in the sense that the IRR (latent) is farther from 1.0 (in either direction) than IRR (severity).

Exceptions to these rules sometimes occur when the IRR (latent) and IRR (severity) have different directions (one has a value <1, while the other has a value >1), but violations to these laws appear to be rare, and when they occur they are often inconsequential with both IRRs being close to 1.0; see the online supplementary appendix (for all online suppl. material, see www. for further discussion and real-life examples where exception to these rules occurred less than 2% of the time.

To illustrate the first scenario (consistent trends), suppose γ0 = –1.099, β0 = 0.470, γ1 = –1.099 and β1 = 0.223, corresponding to {ψ1 = 0.25, µ1 = 1.60} for the group with xi = 0 (fig. 1b), and {ψ2 = 0.10, µ2 = 2.00} for the group with xi = 1 (fig. 1c). Then the IRR for the at-risk latent class is exp(0.223) = 1.25 (which also equals µ21), while the IRR for the overall population calculated from equation 3 or by [µ2(1 – ψ2)] / [µ1(1 – ψ1)] equals 1.50. In this case, the IRR (latent) underestimates the IRR (severity) and is biased towards the null.

To illustrate the second scenario (inconsistent trends), suppose γ0 = –1.099, β0 = 0.470, γ1 = 0.693, and β1 = 0.223, corresponding to {ψ1 = 0.25, µ1 = 1.60} for the group with xi = 0 (fig. 1b), and {ψ2 = 0.40, µ2 = 2.00} for the group with xi = 1 (fig. 1d). Then the IRR for the at-risk latent class shown in figure 1b is µ31 = 1.25 while the IRR for the overall population is [µ3(1 – ψ3)] / [µ1 (1 – ψ1)] = 1.0. In this case, the IRR (latent) overestimates the IRR (severity) and is biased away from the null. The general point is that some bias generally occurs when exponentiated β-coefficients, which are IRRs for the at-risk latent class, are falsely interpreted as IRRs for the overall population.

The at-risk latent class IRR is equivalent to the IRR (severity) for the overall population when γ1 = 0 in equation 1 in which case the ratio term on the right-hand-side of equation 3 cancels out. Thus, when the probability of an ‘excess zero’ does not depend on xi, exp(β1) is the IRR for the overall population and its interpretation is the same as in Poisson regression and negative binomial regression. Generally, however, estimates of IRRs based on equation 3 that appropriately adjust for the zero inflation parameters (e.g. γ0 and γ1) provide valid inference for the overall effect of the risk factor on the population oral health parameters.

The general results for the relationship between trends and bias for ZI models (points 1 and 2 above) also apply to ZI models with multiple covariates. However, we wish to caution the reader that for models having multiple covariates appearing in both model parts (for µi and ψi) the IRR (severity) for a covariate will depend upon the values of other covariates. Specifically, in a ZI model with 3 covariates, the IRR (severity) for a dichotomous factor xi3 is:


If xi1 and xi2 are also dichotomous (in this example), there will be 4 different values for IRR (severity), one for each combination of xi1 and xi2. A single covariate-adjusted IRR (severity) for the effect of xi3 may be obtained by inserting mean values for xi1 and xi2 (whether they are dichotomous or continuous) into equation 4. Simplification of equation 4 occurs only if some of the covariates are omitted from the excess-zero part of the model, or otherwise have their γ-coefficients equal to zero. For example, if xi3 does not appear in the excess-zero model (equivalently, γ3 = 0), then exp(β3) from a ZI model with 3 covariates is the IRR for both the at-risk latent class and the overall population relating the risk factor to caries severity, all other covariates being held fixed. See the online supplementary material which contains a detailed illustration with real-life data involving 2 dichotomous covariates and 1 categorical factor.

In addition to the bias arising from misinterpreting an exponentiated β-term as a population IRR (severity) for caries increment, a second concern is that a variance estimate for an at-risk latent class IRR is likely to underestimate the corresponding variance estimate for the IRR in the overall population since an estimate of the variance for the latter should additionally account for uncertainty associated with estimating γ0 and γ1 in equation 3. The delta method for a scalar function of a random vector may be used to compute the large sample variances of the IRR estimates corresponding to equation 3 or equation 4 conditioning on means or specific covariate values [Albert et al., 2011].

goto top of outline Hurdle Models

Hurdle models [Mullahy, 1986; Cameron and Trivedi, 1998] are briefly mentioned, as they are occasionally used or cited in epidemiological studies of dental caries. The hurdle model approach, like the ZI model approach, is a 2-part count regression method that deals with the phenomenon of excess zeros in the data. However, hurdle models are distinct from ZI models. The first component of a hurdle model, typically logistic regression, addresses the probability of a zero count (as opposed to an ‘excess zero’) so that it pertains to prevalence (or incidence) in the overall population, as it targets all zero counts. The second part of a hurdle model is for the mean count among subjects with any caries, i.e. E(YiYi > 0). It exceeds the unconditional mean E(Yi) that is the increment for the overall population, and it is distinct from the mean µi for the at-risk latent class in a ZI model. As shown in figure 1b–d, zeros can occur in either part of a ZI model, whereas in hurdle models they are only modeled in the first part. Thus, interpretations for ZI model results are incorrect when they are based on language pertaining to hurdle models.

goto top of outline Methodology for Review of Published Articles

The authors sought to identify and review all published research articles in the dental literature that used ZIP or ZINB models to analyze caries experience, using ISI Web of Knowledge V5.4 and Pubmed as search engines. The authors were the reviewers – a biostatistician (J.P.), an oral health researcher (J.S.) and two biostatistics students (D.L. and M.K. working jointly) both holding graduate research assistantships in oral health. Each reviewer evaluated all the identified papers according to 5 criteria labeled ‘A’–‘E’ in table 1, each of which involved categorical classifications. First (‘A’), did the article present caries applications using ZIP models, ZINB models, or both? Second (‘B’), did the article assess model goodness of fit or otherwise provide some rationale for model choice? Assigned ratings were (i) ‘test’ if a statistical test of goodness of fit, likelihood ratio statistics or information criterion statistics (e.g. AIC, BIC) were presented, (ii) ‘graph’ if a graph displaying model fitted values with observed frequencies was provided, (iii) ‘not shown’ if authors claimed to have examined goodness of fit but did not report results of their evaluation and (iv) ‘none’ if the article did not mention an assessment of goodness of fit for the study data. Third (‘C’), did the reported ZI model(s) include covariates in both excess zeros and at-risk model parts: yes, no or indeterminable?

Table 1. Articles employing ZIP or ZINB models for childhood dental caries and assessment of reporting of model selection (goodness-of-fit) and whether the interpretations made in the article match the reported analysis results

The fourth and fifth assessments aimed to determine whether the interpretations of results from ZI models provided in the article were appropriate for the reported numerical results. The fourth criterion ‘D’ was whether the article presented numerical results for overall exposure effects (‘overall’) pertaining to the overall population from which the data were selected, e.g. equation 3, or whether results presented were based on estimated regression coefficients γ1 and β1 (or their exponentiated forms) corresponding to the two latent classes (‘LC’) of the ZI model. The fifth and final criterion ‘E’ assessed the quality of interpretations of reported numerical results from fitting ZI models; in particular, was language used to describe overall exposure effects or latent class-specific effects? Additionally, we noted where articles inappropriately used language pertaining to hurdle models [Mullahy, 1986]. The reviewers discussed their initial evaluations as a group to reconcile differences and reach consensus.

Finally, examples of problematic interpretations in the reviewed articles were listed. Specifically, for articles with ZI models that included covariates in the excess-zero part, we assigned to interpretations the following classifications: (i) incorrect – when regression coefficients or effects for the probability of an excess zero were falsely attributed to be effects on prevalence; (ii) misleading – when risk factor effects on mean caries counts in the at-risk group were misattributed (or could easily be misinterpreted) as overall effects on severity; for example, when ‘severity’ is used without an appropriate qualifying phrase alluding to the ‘susceptible’ or at-risk subgroup for which the inference actually applies; (iii) imprecise – when results of estimated latent class effects were used directly as the basis for making unsubstantiated claims, or statements of a speculative nature, regarding overall effects on prevalence π or severity ν.


goto top of outline Results

We identified 15 refereed papers published through 2011 that used ZIP or ZINB models to analyze dental caries in children or adults, either in cross-sectional or longitudinal observational studies (table 1). As in the cross-sectional studies that analyzed caries indices, statistical models for independent observations were applied in the longitudinal studies for caries increments. One exception is the article by Broadbent et al. [2008] that employs longitudinal data models for repeated measures of DMFS counts from a life trajectory perspective. Some of the 14 articles that used statistical methods for independent observations adjusted variance estimation for clustering in the study design.

Among the 15 papers, 6 applied ZINB (but not ZIP) models to caries outcomes, 3 applied ZIP (but not ZINB) models, and 6 applied both ZIP and ZINB models. Most articles analyzed multiple outcomes. Seven articles assessed goodness of fit of the chosen model(s) by reporting results from either statistical tests, information criteria or graphs; 4 papers made the claim to have inspected their data for model selection without showing the results of their assessment, and the remaining 4 did not report any assessment of fit for ZI models applied to their caries outcomes. However, in this latter group, each article made a general statement that the ZI model was chosen to address excess zeros in caries data.

Eight articles reported analysis results from ZIP or ZINB models that included covariates in the zero inflation part of the model in addition to the model for the mean count for the at-risk population. Each of these papers summarized latent class-specific covariate effects with estimated regression coefficients or exponentiated regression coefficients corresponding to a model specification for the probability of an excess zero, ψ, and the mean µ caries index for the susceptible class. Three of the 8 articles gave appropriate interpretations for these latent class effects in most instances. For example, in their abstract, Lewsey and Thomson [2004] state: ‘Being in the high-SES group during childhood was associated with a greater probability of being caries-free by age 18 years, over and above that which would be expected from the negative binomial process. Low childhood SES also had the largest coefficient in the modelling of the negative binomial process ...’. Next, Solinas et al. [2009] give interpretations that are appropriate for the results of the two ZI model parts presented from an analysis of Italian 4-year-olds in the National Pathfinder Survey. Consider the sentence in the abstract: ‘The father’s educational level was significant in both parts of the ZINB regression model (p < 0.05), implying that the degree of caries experience increases in children whose fathers have a low level of education, while the excess of caries-free children decreases.’ Similarly, Campus et al. [2009] who analyze data from the same study as Solinas et al. [2009] use appropriate language in their abstract such as ‘the probability of being an extra zero’ and ‘caries experience’ when describing results from the zero inflation and negative binomial parts of the ZINB model, respectively. Note that ‘caries experience’, as used in these quotations, has a general meaning that must be understood in the specific modeling context as applying to the at-risk population and not to the overall population.

The difficulty of interpreting results from ZI models often resulted in imprecise, misleading or incorrect inferences. The difficulty arises because interpretations for ZI models involving ‘excess zeros’ and ‘caries experience’ may be cumbersome and quite often at odds with interpretations dental researchers wish to make regarding overall effects relating to prevalence (or incidence) and severity (or increment). The last column of table 1 shows the types of interpretations made in the reviewed articles, which can be contrasted with the column next to it that shows the types of numerical results presented. Discrepancies, which are italicized in the last column, indicate errors of interpretation, of which a selection is listed in table 2. For example, Lewsey and Thomson [2004] make a statement where latent class effects are incorrectly interpreted as overall effects for caries severity and prevalence (table 2). To investigate the resulting bias of their estimates, we examined detailed comparisons of IRR estimates for effects in the at-risk latent class versus the IRR estimates for effects in the overall population for the covariates reported in table 1 of Lewsey and Thomson [2004]. As anticipated, the latent class estimates tended to underestimate the ‘overall’ IRR estimates in the sense of having values closer to 1.0. Therefore it is suggested that the latent class estimates not be considered as substitutes or proxies for properly computed ‘overall’ estimates of severity. Additional computations and accompanying text are provided in the online supplementary material to this article.

Table 2. Selected examples of incorrect use of the language of prevalence (p) when interpreting results from the ZI part of a ZI model and misleading use of the language of severity (n) when interpreting results from the Poisson or negative binomial process of a ZI model

Furthermore, while Solinas et al. [2009] do not report overall exposure effects, a sentence in the article’s abstract could be interpreted by readers that inferences are made for prevalence and severity (table 2). Similarly, a statement in Campus et al. [2009] linking reported results of a ZINB model for dmfs to caries severity (i.e. ν) is not substantiated by the authors, nor is it qualified by restricting interpretation to the at-risk population. The 5 other articles that include covariates in both parts of the ZI model have multiple instances of interpreting results for the latent classes of the two model components as overall exposure effects using language of ‘prevalence’ or ‘severity’ that is imprecise, misleading or incorrect (table 2).

The remaining 7 articles do not report model results for the excess zero part of the ZI model (table 1). In the first of 3 papers to analyze dental caries in children participating in the Detroit Dental Health Project, Lim et al. [2008] modeled caries count indices using ZINB models where the model for the excess zeros, as indicated in a footnote to their table 6, contained only an intercept term. The interpretations applied to the results are appropriate since, as discussed in the methods section, the exclusion of covariates from the excess-zero part of the model permits direct inference for the prevalence and caries increment in the sampled overall population. Ismail et al. [2009] and Sanders et al. [2008] interpret IRRs as overall effects for the population of children in the Detroit study, which would be appropriate assuming that covariates were not included in the excess-zero part of the model; however, like Javali and Pandit [2010], Nelson et al. [2010] and Broadbent et al. [2008, 2011], they do not explicitly state that the intercept-only model was used for excess zeros. In particular, Javali and Pandit [2010] only present estimated regression coefficients for the mean caries process of the at-risk latent class. Nelson et al. [2010] do not report any regression coefficient estimates from their ZIP and ZINB models. Rather, they report the ‘percent increase in mean’ for the outcomes when comparing two groups, using language applicable to overall exposure effects. They do not fully describe the maximum likelihood estimation procedures they employed, and it is not clear whether statistical methodology for estimating overall effects when both model parts contain covariates [Albert et al., 2011] was used.


goto top of outline Conclusions and Recommendations

With the emergence of ZI count regression models in caries research, authors have made imprecise, misleading and incorrect interpretations of results based on them. Eight of 15 (53%) caries articles reviewed reported ZI models that included covariates in the excess-zero model part, which led to problems in interpretations. In 5 of these 8 (63%) articles, authors gave multiple misleading or incorrect interpretations for regression coefficients corresponding to the ZI model’s latent class parameters ψ and µ by interpreting them as overall effects for caries prevalence π and severity ν. The other 3 articles in this group did not consistently use the terminology of ‘susceptible’ and ‘nonsusceptible’ for the at-risk and not-at-risk subgroups, and contained instances of imprecise or misleading interpretations for overall effects given to analytic results for them. While the remaining 7 articles did not have any similar concerns with interpretations, only 1 of them [Lim et al., 2008] clearly specified the ZI model being used by stating that it included only an intercept term for the excess-zero part. In total, these results support the premise underlying this review that the effects corresponding to the regression coefficients in the two model parts are not typically the parameters of interest, but rather caries researchers usually aim, sometimes unsuccessfully, to study overall effects [Albert et al., 2011].

A research goal of studying overall effects of risk factors on caries indices leads to several recommendations regarding model choice for caries counts exhibiting extra-Poisson dispersion and/or excess zeros: (1) for some populations with high caries rates (e.g. children in developing countries), a negative binomial regression model [Hilbe, 2008] may provide a reasonable approach; (2) otherwise, in the presence of excess zeros, one could consider use of a ZI model with only an intercept in the zero inflation part of the model; (3) else one could omit the subset of covariates that are of primary interest from the zero inflation part of the model as this will simplify calculations and interpretations of their effects. However, the researcher should be aware that omitting covariates from the excess-zero part of the model without proper justification could result in bias. When model reduction in the excess-zero part is not warranted such that the primary exposure variables of interest are in both model parts, a fourth approach is to estimate prevalence (or incidence) and severity (or increment) for the overall population as discussed in this article and elsewhere [Lambert, 1992; Albert et al., 2011]. Model choice should always be justified with a statistical assessment or, at least, with a graphical display of differences between observed and fitted counts.

For completeness we point out that where count distributions present with small maximum counts, alternatives to employing ZIP or ZINB models may be more appropriate. These alternative procedures are based on mixtures of distributions involving the binomial distribution, including the beta-binomial model (for overdispersion), ZI binomial model (for excess zeros) and the ZI beta-binomial model (for both overdispersion and excess zeros) [Cheung, 2006; Albert et al., 2011]. Gilthorpe et al. [2009] discuss that a model based on the Poisson or negative binomial distributions could inappropriately use long tails to describe the bounded distribution of counts, resulting in a poor fit for counts at the upper limit. A poor fit of the ZIP or ZINB model is less likely when the mean count is small relative to the maximum count. We did not find any applications of ZI binomial or ZI beta-binomial models in the caries literature, nor did we identify any caries indices analyzed in the papers reviewed in table 1 where the maximum count was sufficiently small to question the use of ZIP or ZINB models.

The results of this review lead to several recommendations regarding the implementation, interpretation and reporting of results based on ZIP and ZINB models in caries research: (1) select a model, such as one described in the previous paragraph, based on model fit and in consideration of whether interpretation of its parameters facilitates addressing the research questions of interest (e.g. inferences for overall effects vs. latent classes vs. hurdle model interpretations); (2) clearly and completely describe the statistical models used, following Lim et al. [2008] for example, by stating for ZI models whether and which covariates were included in the zero inflation part of the model; (3) report parameter estimates more completely such as intercept terms when regression coefficients are reported and overdispersion parameter estimates in the case of negative binomial and ZINB models, stating the software and version used for model estimation, and (4) use precise, consistent, and clear language for interpreting results.

In particular, the importance of discriminating between inference for the overall population versus that for the latent class of susceptible children has been emphasized. As in the case of a single dichotomous covariate appearing to both ZI model parts, it was shown that the estimated latent class effect contrasting the two groups µ 21, or equivalently exp(β 1), is a biased estimator of the overall effect ν21, which in terms of the ZI model parameters was given in equation 3. It is also the case that the large sample variance estimator of exp(β 1) is not equivalent to the variance estimator corresponding to equation 3. The implication is that caries researchers may have underestimated the variance of the overall effect by essentially removing contributions of the excess zeros from its variability in making exp(β1), and not equation 3, the basis of inference.

Our review of caries research articles using ZI models had limited scope. We were not able to determine whether the substantive conclusions reached in the dental caries articles were valid because standard errors of IRR (severity) estimates cannot be determined from published data. In order to compute these, estimates of variances and covariances of regression coefficients are needed. However, misinterpreting effects for the at-risk latent class as overall effects could lead to erroneous conclusions, and, as shown in the example in the online supplementary appendix, will usually result in bias towards the null hypothesis of no covariate effects. A second limitation is that the review did not address all statistical and methodological issues in the caries articles, but only those directly relating to the use and reporting of results from ZI models. Finally, the evaluation of assessment criteria involved subjectivity as well as objectivity. Nonetheless, this article concludes that the increasing use of ZI count regression models in dental caries research, along with frequent misinterpretations of their results as documented in our review, calls for greater collaboration among statistical scientists and oral health researchers to advance the quality of caries research utilizing these highly versatile and useful methods.


goto top of outline Appendix

The appendix provides further statistical detail. ZI models define a Bernoulli process where s = 1 selects a class of subjects to be considered not at risk for caries (i.e. conditional on being a member in this class, they have an observed zero with probability 1), with probability ψ = p(s = 1); this is the probability of an ‘excess zero’. Otherwise s = 0 indicates the child is susceptible for having caries with probability 1 – ψ. Additionally, the child’s caries count is generated from a Poisson (or negative binomial) distribution with mean µ. The overall (or marginal) distribution of a child’s caries counts, Y, is:


The expression for P(Y = 0) shows that a zero count can be generated in either the excess-zero part or the at-risk part of a ZI model. The mixture distribution in equation 5 has mean E(Y) = µ(1 – ψ), for ZIP or ZINB. The population prevalence is π = 1 – P(Y = 0), where for the Poisson component of the ZIP distribution g(0∣µ) = exp(–µ), and for the negative binomial part (i.e. equation 5.9 of Hilbe [2008]) of the ZINB distribution g(0∣µ, α) = (1 + αµ)α–1, where α is its overdispersion parameter. For the ZIP distribution, the variance is var(Y) = E(Y)(1 + ψµ), from which it is clear that the variance exceeds the mean (when ψ > 0) as in all the figures. For ZINB, var(Y) = E(Y) [1 + µ(ψ + α)].

Equations 1 and 2 specified the ZIP (or ZINB) model for a single covariate, and equation 3 gave its IRR for caries severity in the overall population. The IRR in the overall population for a dichotomous covariate in a model with multiple covariates but with no covariates besides itself in the excess zero part of the model is also given by equation 3. In the specific case of a model with a single continuous covariate xi, the ratio of means νi(xi + 1)/vi(xi), i.e. the IRR for a 1-unit increase in xi on mean caries increment, is:


Note that if xi = 0, giving a dichotomous covariate, equation 6 reduces to equation 3.


goto top of outline Acknowledgements

J.P. wrote the first draft of the paper. J.S. provided edits and oversight, particularly with respect to dental caries. D.L. provided edits, giving particular input on statistical content. J.P., J.S. and M.K. identified caries articles for review. J.P., J.S., D.L. and M.K. evaluated them according to the criteria. This work was partially funded by grant NIEHS T32ES007018.

 goto top of outline References
  1. Albert JM, Wang W, Nelson S: Estimating overall exposure effects for zero-inflated regression models with application to dental caries. Stat Methods Med Res 2011, E-pub ahead of print.
  2. Arora M, Weuve J, Schwartz J, Wright RO: Association of environmental cadmium exposure with pediatric dental caries. Environ Health Perspect 2008;116:821–825.
  3. Böhning D, Dietz E, Schlattmann P, Mendonca L, Kirchner U: The zero-inflated Poisson model and the decayed, missing and filled teeth index in dental epidemiology. J R Statist Soc A 1999;162:195–209.

    External Resources

  4. Broadbent JM, Thomson WM, Boyens JV, Poulton R: Dental plaque and oral health during the first 32 years of life. J Am Dent Assoc 2011;142:415–426.

    External Resources

  5. Broadbent JM, Thomson WM, Poulton R: Trajectory patterns of dental caries experience in the permanent dentition to the fourth decade of life. J Dent Res 2008;87:69–72.
  6. Broffit B, Levy SM, Warren JJ, Cavanaugh JE: An investigation of bottled water use and caries in the mixed dentition. J Public Health Dent 2007;67:151–158.

    External Resources

  7. Cameron AC, Trivedi PK: Regression Analysis of Count Data. New York, Cambridge University Press, 1998.
  8. Campus G, Cagetti MG, Senna A, Blasi G, Mascolo A, Demarchi P, Strohmenger L: Does smoking increase risk for caries? A cross-sectional study in an Italian military academy. Caries Res 2011;45:40–46.
  9. Campus G, Solinas G, Strohmenger L, Cagetti MG, Senna A, Minelli L, Majori S, Montagna MT, Reali D, Castiglia P: National pathfinder survey on children’s oral health in Italy: pattern and severity of caries disease in 4-year-olds. Caries Res 2009;43:155–162.
  10. Cheung YB: Growth and cognition function of Indonesian children: zero-inflated proportion models. Stat Med 2006;25:3011–3022.

    External Resources

  11. Gilthorpe MS, Frydenberg M, Cheng Y, Baelum V: Modelling count data with excessive zeros: the need for class prediction in zero-inflated models and the issue of data generation in choosing between zero-inflated and generic mixture models for dental caries data. Stat Med 2009;28:3539–3553.

    External Resources

  12. Grainger RM, Reid DBW: Distribution of dental caries in children. J Dent Res 1954;33:613–623.
  13. Hashim R, Thomson WM, Ayers KMS, Lewsey JD, Awad M: Dental caries experience and use of dental services among preschool children in Ajman, UAE. Int J Paediatr Dent 2006;16:257–262.
  14. Hilbe JM: Negative Binomial Regression. New York, Cambridge University Press, 2008, p 251.
  15. Ismail AI, Sohn W, Lim S, Willem JM: Predictors of dental caries progression in primary teeth. J Dent Res 2009;88:270–275.
  16. Ismail AI, Sohn W, Tellez M, Willem JM, Betz J, Lepkowski J: Risk factors for dental caries using the International Caries Detection and Assessment System (ICDAS) Community Dent Oral Epidemiol 2008;36:55–68.
  17. Javali SB, Pandit PV: Using zero inflated models to analyze dental caries with many zeroes. Ind J Dent Res 2010;21:480–485.

    External Resources

  18. Lambert D: Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 1992;34:114.

    External Resources

  19. Lewsey JD, Gilthorpe MS, Bulman JS, Bedi R: Is modelling dental caries a ‘normal’ thing to do? Community Dent Oral Epidemiol 2000;17:21217.
  20. Lewsey J, Thomson W: The utility of the zero-inflated Poisson and zero-inflated negative binomial models: a case study of cross-sectional and longitudinal DMF data examining the effect of socio-economic status. Community Dent Oral Epidemiol 2004;32:18389.

    External Resources

  21. Lim S, Sohn W, Burt BA, Sandretto AM, Kolker JL, Marshall TA, Ismail A: Cario-genicity of soft drinks, milk and fruit juice in low-income African-American children: a longitudinal study. J Am Dent Assoc 2008;139:959–967.

    External Resources

  22. Maserejian NN, Tavares MA, Hayes C, Soncini JA, Trachtenberg FL: Rural and urban disparities in caries prevalence in children with unmet dental needs: the New England Children’s Amalgam Trial. J Public Health Dent 2008a;68:7–13.

    External Resources

  23. Maserejian NN, Trachtenberg F, Hayes C, Tavares M: Oral health disparities in children of immigrants: dental caries experience at enrollment and during follow-up in the New England Children’s Amalgam Trial. J Public Health Dent 2008b;68:14–21.

    External Resources

  24. Mullahy J: Specification and testing of some modified count data models. J Econometr 1986;3:341–365.

    External Resources

  25. Mwalili SM, Lesaffre E, Declerck D: The zero-inflated negative binomial regression model with correction for misclassification: an example in caries research. Stat Methods Med Res 2008;17:123–139.
  26. Nelson S, Albert JM, Lombardi G, Wishnek S, Asaad G, Kirchner HL, Singer LT: Dental caries and enamel defects in very low birth weight adolescents. Caries Res 2010;44:509–518.
  27. Sanders AE, Lim S, Sohn W: Resilience to urban poverty: theoretical and empirical considerations for population health. Am J Public Health 2008;98:1101–1106.

    External Resources

  28. Simonoff JS: Analyzing Categorical Data. New York, Springer, 2003, p 496.
  29. Solinas G, Campus G, Maida C, Sotgiu G, Cagetti MG, Lesaffre E, Castiglia P: What statistical method should be used to evaluate risk factors associated with dmfs index? Evidence from the National Pathfinder Survey of 4-year-old Italian children. Community Dent Oral Epidemiol 2009;37:539–546.

    External Resources

  30. Syrjälä AM, Niskanen MC, Ylöstalo P, Knuuttila ML: Metabolic control as a modifier of the association between salivary factors and dental caries among diabetic patients. Caries Res 2003;37:142–147.
  31. Thitasomakul S, Piwat S, Thearmontree A, Chankanka O, Pithpornchaiyakul W, Madyusoh S: Risks for early childhood caries analyzed by negative binomial models. J Dent Res 2009;88:137–141.
  32. Tramini P, Molinari N, Tentscher M, Demattei C, Schulte AG: Association between caries experience and body mass index in 12-year-old French children. Caries Res 2009;43:468–473.
  33. Wong MC, Lu HX, Lo EC: Caries increment over 2 years in preschool children: a life course approach. Int J Paediatr Dent 2012;22:77–84.

 goto top of outline Author Contacts

John S. Preisser
Department of Biostatistics, Gillings School of Global Public Health
University of North Carolina
Chapel Hill, NC 27599-7420 (USA)
Tel. +1 919 966 7265, E-Mail

 goto top of outline Article Information

Received: January 12, 2012
Accepted after revision: April 17, 2012
Published online: June 15, 2012
Number of Print Pages : 11
Number of Figures : 1, Number of Tables : 2, Number of References : 33
Additional supplementary material is available online - Number of Parts : 1

 goto top of outline Publication Details

Caries Research

Vol. 46, No. 4, Year 2012 (Cover Date: July 2012)

Journal Editor: Beighton D. (London)
ISSN: 0008-6568 (Print), eISSN: 1421-976X (Online)

For additional information:

Open Access License / Drug Dosage / Disclaimer

Open Access License: This is an Open Access article licensed under the terms of the Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC) (, applicable to the online version of the article only. Distribution permitted for non-commercial purposes only.
Drug Dosage: The authors and the publisher have exerted every effort to ensure that drug selection and dosage set forth in this text are in accord with current recommendations and practice at the time of publication. However, in view of ongoing research, changes in goverment regulations, and the constant flow of information relating to drug therapy and drug reactions, the reader is urged to check the package insert for each drug for any changes in indications and dosage and for added warnings and precautions. This is particularly important when the recommended agent is a new and/or infrequently employed drug.
Disclaimer: The statements, opinions and data contained in this publication are solely those of the individual authors and contributors and not of the publishers and the editor(s). The appearance of advertisements or/and product references in the publication is not a warranty, endorsement, or approval of the products or services advertised or of their effectiveness, quality or safety. The publisher and the editor(s) disclaim responsibility for any injury to persons or property resulting from any ideas, methods, instructions or products referred to in the content or advertisements.