Casualty Estimation

Estimating the number of casualties in conflicts, particularly civilian casualties, is a challenging statistical problem. In most cases, aid organizations, governments, and quasi-governmental entities gather partial lists of victims. Because these lists are inevitably incomplete, and are typically not simple random samples from the population of victims, simply counting the unique individuals who appear in the union of the lists gives a biased estimate of the total number of victims, one that generally understates the true toll.

Capture-recapture methods have the potential to produce more reliable estimates of the number of victims. The use of capture-recapture in casualty estimation was pioneered by my collaborators at the Human Rights Data Analysis Group (HRDAG) over 25 years ago. Since then, they have produced estimates of the number of casualties in a number of conflicts, including the 1999 conflict in the former Yugoslavia, the long-running civil conflict between the government of Colombia and the FARC guerrillas, and the civil conflict in Peru between government loyalists and the Maoist insurgent organization El Sendero Luminoso.

These estimates are typically produced through a two-stage procedure. First, all of the unique individuals in the data are identified through a record linkage procedure. Then, capture-recapture methods are applied to the linked records to estimate the number of victims not recorded on any list.

My work in this area focuses on (1) identifying and characterizing the limitations of inference using existing methods and (2) developing statistical methods to circumvent or alleviate some of these limitations. My previous work has included a study of the fundamental limitations of record linkage [1]. In this work, we show that in casualty estimation applications, some non-trivial proportion of mistakes in record linkage is probably inevitable, but that capture-recapture estimates appear to be somewhat robust to these mistakes. In another recent article [2], we propose an approach to capture-recapture estimation in the presence of capture heterogeneity, the situation where some individuals are more likely than others to be observed on any list. We show that in this case, estimators of the size of the population that is “minimally visible” to the sampling design have much lower risk than estimators of the total population size. We advocate for the use of such estimators, at the very least as a diagnostic for problematic levels of capture heterogeneity.
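The two-stage procedure described above can be illustrated with a minimal sketch in Python. It is not the method used by HRDAG or in the cited articles: it assumes only two lists, treats exact agreement on a hypothetical record key as a stand-in for record linkage, and uses Chapman's bias-corrected two-list estimator, which assumes the lists are independent and that every victim has the same chance of appearing on a given list. The function names and the toy identifiers are made up for illustration.

```python
def chapman_estimate(n1, n2, m):
    """Chapman's bias-corrected two-list estimate of the total population size.

    n1, n2: number of unique individuals on each list
    m:      number of individuals appearing on both lists
    """
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


def estimate_total(list_a, list_b):
    """Return (observed, estimated total, estimated unrecorded) for two lists."""
    # Stage 1 (stand-in for record linkage): records with identical keys
    # are assumed to refer to the same person. Real linkage is far harder.
    a, b = set(list_a), set(list_b)

    # Stage 2: capture-recapture on the linked records.
    observed = len(a | b)  # unique individuals on at least one list
    total = chapman_estimate(len(a), len(b), len(a & b))
    return observed, total, total - observed


# Toy data (hypothetical): 60 records on one list, 80 on the other, 20 shared.
list_a = [f"victim-{i}" for i in range(0, 60)]
list_b = [f"victim-{i}" for i in range(40, 120)]

observed, total, unrecorded = estimate_total(list_a, list_b)
print(f"observed on some list: {observed}")
print(f"estimated total:       {total:.1f}")
print(f"estimated unrecorded:  {unrecorded:.1f}")
```

In this toy example the union of the lists contains 120 individuals, while the estimated total is considerably larger, which is the gap that a simple tabulation of the lists would miss.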

Another approach to dealing with capture heterogeneity is to stratify the data by available covariates (usually time of death and place of death) in order to obtain strata within which the capture probabilities are likely to vary less across individuals. Currently, this is done by testing for heterogeneity, then splitting the data into multiple strata, and testing again. This process is repeated until a stratification is found in which the test no longer detects heterogeneity in any stratum. I am currently working with colleagues at Stanford to develop a method to appropriately adjust inferences from the resulting stratified model for the selection process that has occurred. This is an application of inference after selection, and can also be seen as a method for formally accounting for researcher degrees of freedom.
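To see why stratification can help, consider the following sketch in Python. It covers only the final step, estimating within strata and summing. It does not implement the heterogeneity test, the test-and-split loop, or the adjustment for selection described above. The stratum names and counts are hypothetical, chosen so that each stratum holds 1,000 victims, with a capture probability of 0.5 per list in one stratum and 0.1 in the other. Chapman's two-list estimator is again used as a simple stand-in.

```python
def chapman_estimate(n1, n2, m):
    """Chapman's bias-corrected two-list estimate of population size."""
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1


# Hypothetical counts per stratum: (on list 1, on list 2, on both lists).
# Each stratum is constructed to contain 1,000 victims in truth.
strata = {
    "region_a": (500, 500, 250),  # capture probability 0.5 on each list
    "region_b": (100, 100, 10),   # capture probability 0.1 on each list
}
true_total = 2000


def pooled_estimate(strata):
    """Ignore the strata: add up the counts, then estimate once."""
    n1 = sum(s[0] for s in strata.values())
    n2 = sum(s[1] for s in strata.values())
    m = sum(s[2] for s in strata.values())
    return chapman_estimate(n1, n2, m)


def stratified_estimate(strata):
    """Estimate within each stratum, then add up the estimates."""
    return sum(chapman_estimate(*s) for s in strata.values())


pooled = pooled_estimate(strata)
stratified = stratified_estimate(strata)
print(f"pooled estimate:     {pooled:.0f}")
print(f"stratified estimate: {stratified:.0f}")
print(f"true total:          {true_total}")
```

With these made-up counts, the pooled estimate falls well short of the true total because the two regions differ sharply in visibility, while the stratified estimate is much closer. In practice the stratification is not known in advance and is chosen by looking at the data, which is what motivates the adjustment for selection.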

[1] Johndrow, J.E., Lum, K. and Dunson, D.B. (2018). Theoretical limits of microclustering for record linkage. Biometrika, 105(2): 431–446. arXiv preprint.

[2] Johndrow, J.E., Lum, K. and Manrique-Vallier, D. (2018). Low risk population size estimates in the presence of capture heterogeneity. Biometrika (to appear). arXiv preprint.