Estimating population prevalence by nonadaptive pooling

Sometimes all we want is an estimate of the prevalence of a disease, not the individual test results. For example, what is the COVID-19 prevalence in a particular community? For privacy or cost reasons, we may not want to drill down to the individual level. In that case a single nonadaptive pooling round provides both anonymity and efficiency.

This same question of population prevalence appears often in the ecology and quality control literature [REFS]. In both cases the quantity of interest is the overall rate, not the infection or defect status of any individual.

Two questions that arise in pooling for prevalence estimation:
  1. How big should the pool be?
  2. How many samples are needed?
Both of these questions are answered by simulation below.

Pool size

Choosing the right pool size helps to get the most information from each experiment. If the pool is too small, then most of the assays will report negative results, while if the pool is too large then most of the assays will report positive results. In either case the outcomes carry little information, so we want a pool size somewhere in the middle.

We note that in practice, the pool size may be dictated by experimental requirements such as limits on the number of samples that can be physically pooled. The following discussion gives guidance on the ideal pool size when there are no such experimental restrictions.

The key determinant of the optimal pool size is an estimate of the underlying prevalence. This may seem circular in that we are using the pooled results to estimate the prevalence, while at the same time using a prevalence estimate to decide the pool size. Fortunately, in most cases we already have a rough idea of the prevalence, and the results are fairly tolerant of different pool sizes.

The table below shows the simulation results for prevalence studies using a variety of underlying prevalence values and pool sizes. Each entry in the table is the median confidence interval width, a measure of the uncertainty in the estimate. Each simulated study consists of 10 tests, and each combination of prevalence and pool size is simulated 1,000 times to gather statistics.
Pool size    0.001          0.01           0.02           0.05           0.1            0.2
100          5.0 x 10^-3*   1.6 x 10^-2*   9.2 x 10^-1    9.2 x 10^-1    9.2 x 10^-1    9.2 x 10^-1
50           6.0 x 10^-3    2.1 x 10^-2    3.2 x 10^-2*   9.0 x 10^-1    9.0 x 10^-1    9.0 x 10^-1
10           2.9 x 10^-2    4.8 x 10^-2    6.5 x 10^-2    1.2 x 10^-1    1.7 x 10^-1    5.3 x 10^-1
7            4.1 x 10^-2    4.1 x 10^-2    6.8 x 10^-2    1.1 x 10^-1*   1.6 x 10^-1*   2.9 x 10^-1*
5            5.7 x 10^-2    5.7 x 10^-2    9.3 x 10^-2    1.2 x 10^-1    1.8 x 10^-1    2.9 x 10^-1*
3            9.2 x 10^-2    9.2 x 10^-2    9.2 x 10^-2    1.5 x 10^-1    2.4 x 10^-1    3.1 x 10^-1
1            2.4 x 10^-1    2.4 x 10^-1    2.4 x 10^-1    2.4 x 10^-1    3.6 x 10^-1    4.4 x 10^-1

Table 1: Median widths of the 95% confidence interval for the prevalence of a disease after 10 tests at a given pool size. The prevalence values in the column headers are the ground truth values used in the simulation. Entries marked with * are the narrowest confidence intervals for each prevalence value.
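
The simulation behind a single table entry can be sketched as follows. This is a minimal sketch, not the original simulation code: it assumes an error-free assay, a flat prior over prevalence, and the narrowest contiguous 95% interval on a fine grid as the interval definition. The function names (`grid_posterior`, `narrowest_interval_width`, `median_ci_width`) and the grid resolution are illustrative choices.

```python
import numpy as np

def grid_posterior(n_pos, n_neg, pool_size, grid):
    """Posterior over prevalence on `grid`, assuming a flat prior and no assay error."""
    p_neg = (1.0 - grid) ** pool_size   # a pool is negative only if every sample is negative
    p_pos = 1.0 - p_neg
    like = p_neg ** n_neg * p_pos ** n_pos
    return like / like.sum()

def narrowest_interval_width(posterior, grid, mass=0.95):
    """Width of the narrowest contiguous run of grid points holding `mass` probability."""
    cdf = np.concatenate(([0.0], np.cumsum(posterior)))
    # For each start index i, find the first end index j with cdf[j] - cdf[i] >= mass.
    ends = np.searchsorted(cdf, cdf[:-1] + mass, side="left")
    valid = ends < len(cdf)
    starts = np.nonzero(valid)[0]
    widths = grid[ends[valid] - 1] - grid[starts]
    return float(widths.min())

def median_ci_width(prevalence, pool_size, n_tests=10, n_reps=1000, seed=0):
    """Median interval width over `n_reps` simulated studies of `n_tests` pooled assays."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 10001)
    p_pos = 1.0 - (1.0 - prevalence) ** pool_size
    positives = rng.binomial(n_tests, p_pos, size=n_reps)
    # Only n_tests + 1 outcomes are possible, so compute each width once.
    width_by_count = {
        int(k): narrowest_interval_width(
            grid_posterior(int(k), n_tests - int(k), pool_size, grid), grid
        )
        for k in np.unique(positives)
    }
    return float(np.median([width_by_count[int(k)] for k in positives]))

if __name__ == "__main__":
    for size in (1, 7, 100):
        print(size, median_ci_width(prevalence=0.1, pool_size=size))
```

Under these assumptions, a pool size of 1 at a prevalence of 0.001 gives a median width of about 0.24, in line with the corresponding table entry. Other entries may differ somewhat depending on how the original intervals were computed.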

Based on these simulations, the best pool size (s) can be estimated from the prevalence (pr) as:
\[ s=\frac{1}{pr}\]
This relationship roughly holds across most prevalence values when there is little assay error. As assay error increases, the optimal pool size tends to decrease.
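
To see what this rule of thumb implies, the short sketch below computes the suggested pool size and the fraction of pools expected to test positive at that size, assuming an error-free assay. The helper names are illustrative, not part of any library.

```python
def pool_positive_probability(prevalence, pool_size):
    """Probability that a pool of `pool_size` samples contains at least one positive."""
    return 1.0 - (1.0 - prevalence) ** pool_size

def rule_of_thumb_pool_size(prevalence):
    """Pool size suggested by s = 1 / pr, rounded to a whole number of samples."""
    return max(1, round(1.0 / prevalence))

if __name__ == "__main__":
    for pr in (0.01, 0.05, 0.1, 0.2):
        s = rule_of_thumb_pool_size(pr)
        print(pr, s, round(pool_positive_probability(pr, s), 3))
```

At s = 1/pr roughly 63% to 67% of pools are expected to be positive for these prevalence values, approaching 1 − e^(−1) ≈ 0.632 as the prevalence becomes small, so neither outcome dominates.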

Number of samples

Once the pool size is established, we turn to the number of samples (n) required. In general, more samples yield more precise estimates, so the number of samples required depends on how precisely we need to know the prevalence. As with many statistical estimates, the width of the confidence interval (CI) scales as:
\[ CI=\frac{c}{\sqrt{n}}\]
Here c is a constant determined empirically from the confidence intervals listed in the table above. For example, a table entry of width w obtained with 10 tests corresponds to c = w x sqrt(10).

As an example, consider an estimated prevalence of 10% (pr = 0.10), where we want to know the actual prevalence to within 1%. From the table above, the optimal pool size is 7, which with 10 assays yields a median confidence interval width of 0.16 (16%); a pool size of 10 or 5 would give similar results. To shrink this width by a factor of 16 and reach our 1% target, we need 16^2 = 256 times as many assays. In total we will need about 2,560 assays of pool size 7 to achieve this precision (on average).

Next, let's compare the pooling case to the single-assay case, using the same estimated prevalence of 10% (pr = 0.10) and the same 1% target. With a pool size of 1, the median confidence interval width is 0.36 (36%) for 10 tests. To reduce this to 1%, we need 36^2 = 1,296 times as many assays, for a total of 12,960 assays.

This example demonstrates that we can reduce the expected number of assays by a factor of approximately 5 (12,960 versus 2,560) simply by choosing an appropriate pool size.
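
The arithmetic in the two examples above follows directly from the 1/sqrt(n) scaling and can be sketched as a small helper. The function name `assays_needed` is illustrative, and the reference widths are the table entries for 10 tests.

```python
import math

def assays_needed(reference_width, reference_assays, target_width):
    """Scale a reference assay count so the CI width shrinks to `target_width`.

    Assumes the width scales as c / sqrt(n), so n grows with the square of
    the required reduction in width.
    """
    scale = (reference_width / target_width) ** 2
    # Small tolerance so floating-point noise does not push an exact result up by one.
    return math.ceil(reference_assays * scale - 1e-9)

if __name__ == "__main__":
    pooled = assays_needed(0.16, 10, 0.01)   # pool size 7, prevalence 0.10
    single = assays_needed(0.36, 10, 0.01)   # pool size 1, prevalence 0.10
    print(pooled, single, single / pooled)
```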

Prevalence estimation

Once pooled samples are tested, we can estimate the prevalence from the pool sizes and the number of positive and negative results. A sample calculator is below:


This calculator works by using a Bayesian approach: for each candidate prevalence value, it computes the probability of the observed data given that prevalence. Using a fine grid of prevalence values, we can then numerically construct a posterior probability density for all prevalence values between 0 and 1.0.

Posterior Estimation:

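We assume each pool is independent, so the probability of any data configuration given a prevalence (pr) is the product of the probabilities of the individual pool results. If we divide the space of prevalence values into m even segments and assume a uniform prior, we can calculate the posterior probability of any one segment i as:

\[p(pr_i|data)=\frac{ \prod_{s}{p(s| pr_i)} }{ \sum_{j=1}^m \prod_{s}{p(s| pr_j)}}\]

where p(s|pr_i) is the probability of the results observed at pool size s, and the product runs over all pool sizes used.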

For each pool size (s) there will be a count of positive pools (n_{s+}) and negative pools (n_{s-}). A pool is negative only if all s samples in it are negative, so for a given pool size:

\[p_-=(1.0-pr)^{s}\]
\[p_+=1.0-p_-\]
\[p(s|pr)=p_-^{n_{s-}} \, p_+^{n_{s+}}\]
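
The grid calculation described in this section can be sketched as below. This is a minimal illustration rather than the calculator's actual code: it assumes a flat prior, an error-free assay, and segment midpoints as the grid, and it works in log space to avoid underflow with large pools. The function name `prevalence_posterior` and the `counts` input format are hypothetical.

```python
import numpy as np

def prevalence_posterior(counts, m=1000):
    """Grid posterior over prevalence.

    counts: dict mapping pool size s -> (n_pos, n_neg), the number of
            positive and negative pools observed at that size.
    m:      number of even segments dividing the interval from 0 to 1.
    Returns (grid, posterior), where the posterior sums to 1 over the grid.
    """
    grid = (np.arange(m) + 0.5) / m             # midpoint of each segment
    log_like = np.zeros(m)
    for s, (n_pos, n_neg) in counts.items():
        log_p_neg = s * np.log1p(-grid)         # log of (1 - pr)^s
        log_p_pos = np.log(-np.expm1(log_p_neg))  # log of 1 - (1 - pr)^s
        log_like += n_neg * log_p_neg + n_pos * log_p_pos
    posterior = np.exp(log_like - log_like.max())
    return grid, posterior / posterior.sum()

if __name__ == "__main__":
    # Illustrative data: 10 pools of size 5 (2 positive) and 10 pools of size 10 (4 positive).
    grid, post = prevalence_posterior({5: (2, 8), 10: (4, 6)})
    print("most probable prevalence:", grid[np.argmax(post)])
    print("posterior mean:", float(np.sum(grid * post)))
```

With a pool size of 1 this reduces to the familiar binomial case, which gives a quick sanity check on the output.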


See page 3, section under main results, for discussion of sampling as ν = ln(2), with ν = L*K/T (L = pool size, K = number of positives, T = number of tests, ν = prob). It is also interesting to note that while ν = ln 2 (which is 'maximally informative' in the sense of maximising the entropy of the test outcome) optimises the rate of COMP (as well as DD below) for the near-constant column weight design, COMP [23] and DD [21] with Bernoulli designs are optimised with a fraction 1 − e^(−1) ≈ 0.632 of positive tests.

---

Pooling in field biology is done all the time. Pooling flies to detect pathogens: essentially using Dorfman designs. Pools of 9 flies per test were used at a prevalence of 0.33% (while I estimate they could use more like 300 to get their data), but sample collection is hard.

---

Pooling and PCR as a method to combat low frequency gene targeting in mouse embryonic stem cells. Screened 2,300 colonies using only 123 PCR reactions (20x compression). Gene targeting rates in mouse ES cells are 1%-10%.

---

Old pooling paper, first suggestion? Recombinant fragment assay for gene targetting based on the polymerase chain reaction.

---

Beef testing. Determining an optimal pool size for testing beef herds for Johne's disease in Australia. Pool size of 10 was optimal due to experimental constraints.

---

Pooling of urine specimens for PCR testing: a cost saving strategy for Chlamydia trachomatis control programmes. Suggests that pools of 5 samples work and pools of 10 work most of the time, but does see some errors. Given these, the optimal pool size is determined more by the experimental conditions than by the math.

---

Utility of pooled urine specimens for detection of Chlamydia trachomatis and Neisseria gonorrhoeae in men attending public sexually transmitted infection clinics in Mumbai, India, by PCR. Screened 690 men, tested in pools of 5 (138 pools in total), then retested (Dorfman). Prevalence was 15/690 and 37/690 by individual test.