This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Public Health and Surveillance, is properly cited. The complete bibliographic information, a link to the original publication on https://publichealth.jmir.org, as well as this copyright and license information must be included.

Population size estimates (PSE) provide critical information for determining resource allocation for HIV services geared toward those at high risk of HIV, including female sex workers, men who have sex with men, and people who inject drugs. Capture-recapture (CRC) is often used to estimate the size of these often-hidden populations. Compared with the commonly used 2-source CRC, CRC relying on 3 (or more) samples (3S-CRC) can provide more robust PSE but involves far more complex statistical analysis.
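
As a numerical illustration of the basic CRC idea, the classic 2-source (Lincoln-Petersen) estimator can be computed in a few lines. This is a Python sketch with made-up counts, not part of the application:

```python
# Two-source (Lincoln-Petersen) capture-recapture estimate.
# Counts are illustrative; the method assumes a closed population
# and two independent samples.
n1 = 400   # individuals encountered in sample 1
n2 = 300   # individuals encountered in sample 2
m = 60     # individuals encountered in both samples

n_hat = n1 * n2 / m   # estimated population size
print(n_hat)          # → 2000.0
```

With 3 or more samples, estimators of this kind can additionally be checked and corrected for dependence and heterogeneity, which is what motivates the more complex 3S-CRC models.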

This study aims to design and describe the Shiny application (shinyrecap), a user-friendly interface that can be used by field epidemiologists to produce PSE.

shinyrecap is built on the Shiny web application framework for R. This allows it to seamlessly integrate with the sophisticated CRC statistical packages (eg, Rcapture, dga, LCMCR). Additionally, the application may be accessed online or run locally on the user’s machine.

The application enables users to engage in sample size calculation based on a simulation framework. It assists in the proper formatting of collected data by providing a tool to convert commonly used formats to that used by the analysis software. A wide variety of methodologies are supported by the analysis tool, including log-linear, Bayesian model averaging, and Bayesian latent class models. For each methodology, diagnostics and model checking interfaces are provided.

Through a use case, we demonstrated the broad utility of this powerful tool with 3S-CRC data to produce PSE for female sex workers in a subnational unit of a country in sub-Saharan Africa.

Accurate knowledge of population size is critical in many areas of science but a challenge whenever complete counts are too difficult or expensive to obtain. One such area is the HIV pandemic, which increasingly is driven by high-risk behaviors that define “key populations” (KP), among them female sex workers (FSW), men who have sex with men (MSM), and people who inject drugs (PWID).

Capture-recapture (CRC) has seen wide global use for PSE, including for the HIV pandemic.

Several statistical models satisfy the requirements to perform the aforementioned sophisticated analysis of 3S-CRC data: log-linear models, Bayesian model averaging, and Bayesian nonparametric latent-class models. Log-linear models are a classic methodology for the analysis of multiple-source CRC data. Variants are implemented that allow for varying capture probabilities across events and heterogeneous capture probabilities among members of the population. Bayesian model averaging allows the analyst to flexibly account for list dependency by creating models for all possible dependencies and averaging over them in a way that is proportional to the probability that the dependence is correct. The Bayesian latent class model deals with heterogeneity in a novel way. It posits that there are unobserved subgroups in the data with different capture probabilities for each capture event. The number of these groups and their probabilities are unknown. The algorithm uses a Bayesian framework to estimate these, along with the population size. Application of these 3 types of statistical models requires computational expertise. This is a barrier to the use of CRC involving 3 or more sources, as it typically requires knowledge of specialized software.

The objectives of this paper were to describe the design of the shinyrecap application and to demonstrate its use through an applied example.

The application of ratio estimation for PSE from multiple encounters dates to at least 1787.

The next major innovation was the extension of estimation to data collected from 3 or more samples, which makes it possible to assess heterogeneity of capture probabilities. If every member of the population has the same capture probability (p1), then the probability of being captured in all 3 samples would simply be p1^3. On the other hand, if half the population has a capture probability of p1 and the other half has probability p2, then the probability of a random person being captured in all 3 would be 0.5(p1^3 + p2^3). By comparing the counts of individuals captured in all 3 samples with what would be expected under homogeneity, we can measure and model heterogeneity. Log-linear models, Bayesian model averaging, and Bayesian Dirichlet process mixture models (nonparametric latent-class models) each model heterogeneity in different ways, allowing for the production of more accurate estimates in the presence of inhomogeneity.
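
The effect described above can be checked numerically. In this Python sketch, p1 and p2 are arbitrary illustrative values chosen so that both scenarios share the same average capture probability:

```python
# Probability of being captured in all 3 samples, holding the *average*
# capture probability fixed at 0.2 in both scenarios.
p1, p2 = 0.3, 0.1
p_mean = 0.5 * (p1 + p2)                  # 0.2

prob_homog = p_mean ** 3                  # everyone shares the mean probability
prob_heterog = 0.5 * (p1 ** 3 + p2 ** 3)  # two equally sized strata

# Heterogeneity inflates the expected number of triple captures;
# this excess is the signal that 3-source models exploit.
print(prob_homog, prob_heterog)
```

Even though the average capture probability is identical, the heterogeneous population produces almost twice as many expected triple captures (0.014 vs 0.008), which is why 3 or more samples allow heterogeneity to be detected at all.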

Models for capture probabilities originated in the discipline of animal ecology. Four classic model families are distinguished: captures have the same probability, and individuals are uniform (M0); captures might have different probabilities, and individuals are uniform (Mt); captures have the same probability, and individuals may be heterogeneous (Mh); and captures may have different probabilities, and individuals may be heterogeneous (Mth). Selection of a single “best” model is typically done using either the Akaike or Bayesian information criterion (AIC and BIC, respectively).

For heterogeneous models, log-linear models require the specification of a parametric distribution for the population’s log odds of being captured. These are typically set to Normal, Poisson, Gamma, or Darroch. Additionally, the Chao (lower bound) correction can be used to obtain a lower bound on the population size rather than an estimate of it.

The “Normal” model incorporates heterogeneity as a Gaussian mixing distribution.

Bayesian model averaging is geared to be robust to list dependence. Ideally, one would like to have all capture events be independent draws from the population. In many cases, however, some capture events may be related. For example, in a citywide survey of PWID, it might happen that the first 2 capture events were more heavily concentrated in one area of the city than the third event, introducing potential dependence. When list dependence is present, the interactions between events should be considered.

The natural logs of the expected frequencies of observable encounter combinations can be modeled as linear combinations of main and interaction effects of the sampling events.

The first step in the analysis is to specify a prior distribution for population size. This represents the analyst’s prior knowledge about population size along with uncertainty. By default, a “noninformative” improper prior is used, which is proportional to 1 divided by the population size. Typically, analysts will have access to at least a rough idea of the range of possible population sizes from previous PSE reports or literature reviews. This information can be incorporated into the prior, parameterized as a log-normal distribution with a truncation at a specified maximum population size. The “delta” parameter controls the prior’s preference for simpler models in the model averaging. This parameter is more difficult to interpret; by default, it is set to 2^(-k), where k is the number of capture events.

Instead of assuming a parametric probability function for capture probability, as is done by traditional log-linear models, this approach posits that the population is divided into a number of groups, with members in each group having the same homogeneous capture probability. The number of homogeneous strata in a population is uncertain, and covariates that identify those classes may not be available. Thus, the strata are said to be latent, and strata identities are treated as missing data. Estimation is naturally accomplished using mixtures of distributions. A clever implementation of Bayesian nonparametric latent-class modeling can then be used to estimate population size.

The degree to which fewer strata are preferred is controlled by a prior on the stick-breaking process parameterized as a Gamma distribution with shape and scale values. The relationship between the Gamma distribution and the number of latent groups is complex and mediated by a stick-breaking process. In general, the default values of 0.25 for both the shape and scale parameters result in a reasonably diffuse prior.

Estimation is based on the posterior distribution of population size, of which a sample is constructed using Markov chain Monte Carlo (MCMC) simulation. MCMC algorithms start from initial values and produce serially correlated “chains” of samples from some distribution. That distribution converges to the joint posterior distribution only after some potentially large number of “burn-in” iterations. Therefore, valid inferences can be made only after discarding the burn-in iterations.
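
The role of burn-in can be seen with a toy random-walk Metropolis sampler targeting a standard normal distribution. This is a generic illustration of MCMC behavior, not the sampler used by the application:

```python
import math
import random

random.seed(1)

def metropolis_chain(n_iter, start=5.0, step=1.0):
    """Random-walk Metropolis sampler targeting a standard normal density."""
    x, chain = start, []
    for _ in range(n_iter):
        prop = x + random.uniform(-step, step)
        # Accept with probability min(1, target(prop) / target(x)),
        # where target(x) is proportional to exp(-x^2 / 2).
        if random.random() < min(1.0, math.exp((x * x - prop * prop) / 2)):
            x = prop
        chain.append(x)
    return chain

chain = metropolis_chain(5000)      # deliberately started far from the mean
kept = chain[1000:]                 # discard burn-in iterations
mean_kept = sum(kept) / len(kept)   # should sit near the target mean of 0
```

The chain starts at 5.0, far from the target mean of 0; the early, pre-equilibrium draws are exactly the ones that the burn-in discards before inference.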

Alternatively, R users can launch the interface locally from any computer by entering the following into the R console:

When designing a CRC study, it is important to collect enough data to achieve sufficient precision for PSE. The application includes a sample size calculator that repeatedly simulates CRC data under user-specified settings, estimating population size with the Mt model if homogeneity is assumed. If heterogeneity is allowed, simulation and estimation are performed using the Mth model with normally distributed capture probabilities.

Given the input parameters, the interface provides the user with the distribution of a log-linear population size estimator across the simulations. A table is also provided that summarizes the percentage of simulated estimates that fell within different ranges of accuracy. A user might find it acceptable to have their estimate within 10% of the true value 90% of the time, whereas they might choose to collect more samples if the calculator indicates that their estimate will be within 10% of the true value only 50% of the time.
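
The simulation logic can be sketched in a few lines. For brevity, this Python sketch uses the simple 2-sample Chapman estimator rather than the app's 3-sample log-linear estimator, and the population size and capture probabilities are hypothetical:

```python
import random

random.seed(42)

def coverage(N, p, n_sim=200, tol=0.10):
    """Share of simulated studies whose estimate lands within +/- tol of N."""
    hits = 0
    for _ in range(n_sim):
        s1 = [random.random() < p for _ in range(N)]  # event 1 captures
        s2 = [random.random() < p for _ in range(N)]  # event 2 captures
        n1, n2 = sum(s1), sum(s2)
        m = sum(a and b for a, b in zip(s1, s2))      # recaptures
        est = (n1 + 1) * (n2 + 1) / (m + 1) - 1       # Chapman estimator
        if abs(est - N) <= tol * N:
            hits += 1
    return hits / n_sim

# Larger capture probabilities yield more recaptures and tighter estimates.
low, high = coverage(2000, 0.10), coverage(2000, 0.40)
```

Comparing `low` and `high` shows the core trade-off the calculator quantifies: with small capture probabilities, recaptures are rare and the estimate is rarely within 10% of the truth, so more sampling effort is needed.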

The first barrier encountered by a practitioner is putting their CRC data into the right format for analysis.

When the aggregate data type is specified, the last column represents the total number of individuals with that capture history. A properly structured 3S-CRC data set would look something like the example capture-history table below.

From the first row, we see that there were 30 individuals who were observed at event 1 but not at the 2nd and 3rd events (capture history 1 0 0, count 30).

There were 10 individuals captured in all 3 events, recorded in the row with capture history 1 1 1 and count 10.

Note that there is no row for the all-zero capture history (0 0 0) because it pertains to the unknown number of population members who were not observed at any event.

For k capture events, there are 2^k possible capture histories; because the all-zero history is never observed, the formatted data set contains at most 2^k − 1 rows.

The capture-history data format is easily recorded from individually identifiable population members. However, in many epidemiological studies, unique individuals are not identified; rather, data are aggregated. These accumulated data files consist of counts of individuals who were encountered at each sampling event and the subsets of those who were also encountered at any previous sampling event.

It takes some thought to figure out how to convert the data to the required format, and the process becomes much more difficult if there are more than 3 events. The data formatter tool in the application performs this conversion automatically.
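
For 3 events, the conversion is a small inclusion-exclusion exercise, sketched below in Python. The input names (n1, n21, n3_12, and so on) are hypothetical labels for the aggregate counts, not the application's actual field names, and the numbers are illustrative:

```python
# Aggregate counts for a 3-event study (illustrative numbers):
n1 = 100           # total encountered at event 1
n2, n21 = 90, 40   # event 2 total; of those, 40 had been seen at event 1
# Event 3 total broken down by prior history:
n3_12, n3_1, n3_2, n3_0 = 10, 15, 20, 35  # seen at 1&2, 1 only, 2 only, new

# Capture-history cells (event1, event2, event3) -> count.
histories = {
    (1, 1, 1): n3_12,
    (1, 0, 1): n3_1,
    (0, 1, 1): n3_2,
    (0, 0, 1): n3_0,
    (1, 1, 0): n21 - n3_12,          # seen at 1 and 2, missed at 3
    (0, 1, 0): (n2 - n21) - n3_2,    # new at 2, missed at 3
    (1, 0, 0): n1 - n21 - n3_1,      # seen at 1 only
}

# Sanity check: the histories must be internally consistent.
assert all(c >= 0 for c in histories.values())
```

The total of all seven cells equals the number of distinct individuals ever encountered; the eighth, all-zero cell is what the CRC models estimate.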

Example capture-history data format for 3 encounter events (3S-CRC). Absence or presence is denoted by 0 or 1, respectively.

Aggregated capture histories in event-count format for 3-source capture-recapture (3S-CRC).

The log-linear part of the application has 3 sections. The first section, “Model Comparison,” displays population size, standard error, AIC, and BIC for each potential model formulation. The “Model Selection” section allows the user to select one of these models and calculate a confidence interval. The “Descriptives” section provides output to help the user understand the model and diagnose potential problems. Two diagnostic plots help explore the heterogeneity structure. The first diagnostic plot displays a function of the number of units captured exactly i times against i; departures from a straight line suggest heterogeneous capture probabilities and favor an Mh or Mth model. The second diagnostic plot shows the number of individuals captured for the first time at each sampling event. It should be linear in the case of the M0 model and concave down in the case of an Mh model. Any other form may indicate an Mt or Mth model.

The first step in the Bayesian model averaging interface is to set a prior distribution for the population size. This is set to a noninformative 1/N distribution, but it is recommended to change this to something relevant to the population under study. To do this, the user can specify their beliefs for the median population size (ie, they believe that there is a 50% chance the population size is above it and 50% chance it is below) and the 90th percentile (ie, there is only a 10% chance the true population size is above this value). The application then parameterizes these beliefs as a log-normal distribution. The user may also specify a maximum population size to put an upper bound on the prior.
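
The elicitation above amounts to solving for log-normal parameters from two stated quantiles. A Python sketch, using the median of 20,000 and 90th percentile of 80,000 that appear in the use case later in the paper (any positive pair with the 90th percentile above the median works):

```python
import math

def lognormal_from_quantiles(median, p90):
    """Log-normal (mu, sigma) matching a stated median and 90th percentile."""
    z90 = 1.2815515655446004   # 90th percentile of the standard normal
    mu = math.log(median)                 # median of a log-normal is exp(mu)
    sigma = (math.log(p90) - mu) / z90    # solve exp(mu + z90*sigma) = p90
    return mu, sigma

mu, sigma = lognormal_from_quantiles(20000, 80000)
# exp(mu) recovers the median; exp(mu + z90 * sigma) recovers the 90th percentile.
```

By construction, the resulting distribution reproduces exactly the two beliefs the user stated, which is why these two inputs suffice to pin down the prior.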

Once the prior is set, the user can go to the “Posterior Population Size” tab to obtain posterior estimates and credible intervals. The “Posterior Model Probabilities” tab allows the user to explore the different individual models that are averaged together and see their influence on the posterior.

The Bayesian nonparametric latent-class model is the most computationally intense analysis method. The user may control the Gamma distribution stick-breaking prior as well as set the maximum number of latent groups. Increasing the number of latent groups increases the computation time but, since the number of groups is determined by the algorithm, does not affect the results so long as it is set sufficiently large. Although 10 is a good default value, the user can increase that to ensure that this limit does not affect the results.

A number of MCMC sampling options are available to the user. There are 2 primary considerations that the user should be aware of. First, the MCMC process must be at equilibrium. To ensure this, the first samples generated by the algorithm should be thrown out. The number of samples thrown out is controlled by the “burn-in” option. If there are any trends in the trace plot (available in the Diagnostics tab), the burn-in period may need to be increased. Second, the sample size must be large enough that the posterior is not dominated by sampling noise. With MCMC sampling, each sample is correlated with the last sample, so the effective sample size (also in the Diagnostics tab) is often much lower than the raw number of samples generated by the process. Typically, the user should aim for an effective sample size greater than 1000. The effective sample size can be increased by increasing the number of samples generated or the number of MCMC steps taken between samples, which reduces correlation.
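
The gap between raw and effective sample size can be seen with a short computation on a synthetic autocorrelated chain. This Python sketch uses a simple truncated-autocorrelation estimator, a simplified version of what MCMC diagnostics packages compute:

```python
import random

random.seed(7)

def effective_sample_size(chain, max_lag=200):
    """Crude ESS: n / (1 + 2 * sum of lagged autocorrelations), truncating
    the sum once the autocorrelation becomes negligible."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    rho_sum = 0.0
    for lag in range(1, max_lag):
        cov = sum((chain[i] - mean) * (chain[i + lag] - mean)
                  for i in range(n - lag)) / n
        rho = cov / var
        if rho < 0.05:   # correlation has died out; stop summing
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

# A synthetic AR(1) chain mimics MCMC output: consecutive draws are correlated.
chain = [0.0]
for _ in range(4999):
    chain.append(0.9 * chain[-1] + random.gauss(0, 1))

ess = effective_sample_size(chain)   # far smaller than the 5000 raw draws
```

For this strongly correlated chain, the effective sample size is on the order of a few hundred despite 5000 raw draws, which is why thinning (taking more MCMC steps between saved samples) or generating more samples is needed to reach the recommended effective sample size of 1000.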

After specifying the prior on the number of strata and the MCMC sampling parameters, a sample from the posterior distribution is produced by pressing the “Run” button. A progress bar displays the progress of each computational operation. A posterior summary will be displayed.

The pairwise analysis table displays PSE, standard error, and 95% confidence limits for each possible pair of sampling events and is used as a diagnostic step to examine sampling events for homogeneity. Similar PSE across pairwise results indicate that the independence assumption may have been met, whereas differences across results suggest violations of the assumption. Any of the available log-linear models may be applied to each pair.

Estimates of key population size are critical for HIV program planning. For this reason, a large 3S-CRC activity was implemented in a subnational unit (SNU) of a sub-Saharan African country with high HIV burden and unmet need for HIV/AIDS treatment services. Using 3S-CRC data collected from FSW, we demonstrated the utility of the shinyrecap application.

Between October 2018 and December 2018, 544 FSW hotspots in the SNU were sampled, representing 13,344 encounters with FSW over 3 sampling rounds. During encounters with FSW in hotspots, FSW distribution teams offered inexpensive and memorable objects that were unique to each of the 3 capture rounds. Eligible FSW who consented were considered enrolled in this PSE activity. In subsequent rounds, 1 to 2 weeks apart, FSW were asked to show or describe objects they had received during previous rounds, and affirmative responses were tallied upon correct identification of the objects. Distributors recorded information on tablets and uploaded it to a secure central server after each encounter. Data were aggregated into a table similar to the aggregated event-count format described above.

In the following sections, we work through how the application was used to produce PSE from these data.

Before any study is conducted, it is wise to determine what level of precision one is likely to get out of a potential sampling plan.

The table in the upper right of the sample size tool output summarizes the proportion of simulated estimates that fell within each accuracy range.

The sample size estimation tool applied to the example female sex worker study.

After the data were collected, we translated them from event format to capture-history format.

The data formatter tool.

The first class of models we apply is log-linear. The diagnostic plots suggest that the Mt and Mth models are likely more appropriate than the simpler alternatives. This is consistent with the result that the AIC and BIC values are considerably lower for these compared with the M0 and Mh models. The set of Mth models has the lowest AIC and BIC, indicating that there may be heterogeneity in the population.

Poisson2 induces a reasonable amount of heterogeneity and is generally a good default choice. In this case, it yields a PSE of 18,317, which, as we will see in the following sections, is consistent with other analyses.

Log-linear models applied to the example female sex worker study.

Applying Bayesian model averaging yields a posterior estimate of 18,624, very similar to the Poisson2 log-linear estimate (see the corresponding figure).

Use of the default “Noninformative” prior, which is an improper prior with mass equal to the inverse of population size, is a useful robustness check to assess the influence of our choice of prior. The posterior estimate using the noninformative prior was 18,608, which is very similar to our original result. Note that the log-normal prior median input was increased from the default of 7000 to 20,000 and the 90% upper bound was adjusted from the default of 10,000 to 80,000.

Bayesian model averaging applied to the example female sex worker study.

Applying the latent-class model, as shown in the corresponding figure, yields a posterior estimate of 16,266 (10,612-23,512).

Note that the MCMC number of samples was increased to 100,000 from the default of 10,000, thinning was increased from the default of 10 to 100, and the burn-in was increased from the default of 10,000 to 100,000. These inputs were adjusted to increase effective sample size.

Bayesian latent-class modeling applied to the example female sex worker study.

The table in the pairwise analysis output displays the PSE, standard errors, and confidence limits for each pair of sampling events in the example data.

Pairwise analysis of example female sex worker study results.

The example using 3S-CRC data from FSW in an SNU of a sub-Saharan African country demonstrates how computationally intensive statistical methods are made more accessible to epidemiologists and others with limited statistical programming experience.

Our model results using shinyrecap were broadly consistent across methods: the Mth Chao lower bound was the smallest of all log-linear results, at 14,990 (14,620-15,378); the Bayesian model averaging estimate was 18,624 (17,625-19,649); and the Bayesian latent-class estimate was 16,266 (10,612-23,512). The ability to produce confidence bounds is another benefit of shinyrecap.

Shiny apps offer key population program and policy developers a solution to the problem of poor-quality estimates and elevate the level of sophistication of analysis while building in-country capacity to implement critical surveillance activities. Recently, several Shiny apps were introduced that enhance HIV surveillance efforts to estimate awareness of HIV status over time.

Our work was motivated by the needs of epidemiologists and others who require reliable tools for PSE but may not have the coding experience or advanced statistical skills needed to analyze CRC data involving 3 or more samples. The application facilitates the estimation of sample sizes for captures, proper formatting of individual-level and aggregate-level data in preparation for analysis, and various options for analysis of CRC data from 3 or more sources. In addition, all output can be saved in HTML, Word, or PDF formats, with an option to include the R code used by the Shiny app to produce the output. Public health teams now have a powerful tool in shinyrecap.

2S-CRC: 2-sample capture-recapture

3S-CRC: 3 (or more)-sample capture-recapture

AIC: Akaike information criterion

BIC: Bayesian information criterion

CRC: capture-recapture

FSW: female sex worker

MCMC: Markov chain Monte Carlo

MSM: men who have sex with men

PSE: population size estimate

PWID: people who inject drugs

SNU: subnational unit

SSL: secure-sockets layer

None declared.