Monte Carlo Simulation of Bernoulli Trials in R
Background
A user on Stackoverflow recently asked a question about a program to generate Monte Carlo simulations on Bernoulli trials to calculate coverage percentages using Wald confidence intervals. One of the problems in the code is that probability value calculations are executed on individual observations rather than sums of successes and failures. R is primarily a vector processor and the code does not aggregate the individual observations to counts of successes and failures in order to calculate observed probability values for each sample.
This causes the code to generate 0 for coverage percentage for p values above 0.01 for sample sizes tested in the original post. We use code from the original post to isolate where error is introduced into the algorithm.
We set a seed, and assign values to m
, n
, and p
, and attempt to generate 10,000 Bernoulli trials of size n
.
set.seed(95014)
m<-10000
n<-5
p<-0.01
x <- rbinom(m,size=1,prob = p)
At this point x
is a vector containing 10,000 true = 1, false = 0 values.
> table(x)
x
0 1
9913 87
However, x
is NOT 10,000 runs of samples of 5 Bernoulli trials. Given this fact, all subsequent processing by the algorithm in the original code will be incorrect.
The next line of code calculates a value for p.hat
. This should be a single value, not a vector of 10,000 elements.
p.hat <- x/n
table(p.hat)
> table(p.hat)
p.hat
0 0.2
9913 87
An accurate calculation for p.hat
would be the following:
> p.hat <- sum(x)/length(x)
> p.hat
[1] 0.0087
…which is very close to the population p-value of 0.01 that we assigned earlier in the code, but still does not represent 10,000 trials of sample size 5. Instead, p.hat
as defined above represents one Bernoulli trial with sample size 10,000.
Two minor changes to fix the code
After independently developing a Monte Carlo simulator for Bernoulli trials (see below for details), it becomes clear that with a couple of tweaks we can remediate the code from the original post to make it produce valid results.
First, we multiply m
by n
in the first argument to rbinom()
, so the number of trials produced is 10,000 times sample size. We also cast the result as a matrix with 10,000 rows and n
columns.
Second, we use rowSums()
to sum the trials to counts of successes, and divide the resulting vector of 10,000 elements by n
, producing correct values for p.hat
, given sample size. Once p.hat
is corrected, the rest of the code works as originally intended.
f3 <- function(n,probs) {
res1 <- lapply(n, function(i) {
setNames(lapply(probs, function(p) {
m<-10000
n<-i
p<-p
# make number of trials m*n, and store
# as a matrix of 10,000 rows * n columns
x <- matrix(rbinom(m*n,size=1,prob = p),nrow=10000,ncol=i)
# p.hat is simply rowSums(x) divided by n
p.hat <- rowSums(x)/n
lower.Wald <- p.hat - 1.96 * sqrt(p.hat*(1-p.hat)/n)
upper.Wald <- p.hat + 1.96 * sqrt(p.hat*(1-p.hat)/n)
p.in.CI <- (lower.Wald <p) & ( p < upper.Wald )
covprob1<- mean(p.in.CI)
covprob1
}),paste0("p=",probs))
})
names(res1) <- paste0("n=",n)
res1
}
f3(n=c(10,15,20,25,30,50,100,150,200),probs = c(0.01,0.4, 0.8))
…and the output:
> f3(n=c(10,15,20,25,30,50,100,150,200),probs = c(0.01,0.4, 0.8))
$`n=10`
$`n=10`$`p=0.01`
[1] 0.0983
$`n=10`$`p=0.4`
[1] 0.9016
$`n=10`$`p=0.8`
[1] 0.8881
$`n=15`
$`n=15`$`p=0.01`
[1] 0.1387
$`n=15`$`p=0.4`
[1] 0.9325
$`n=15`$`p=0.8`
[1] 0.8137
$`n=20`
$`n=20`$`p=0.01`
[1] 0.1836
$`n=20`$`p=0.4`
[1] 0.9303
$`n=20`$`p=0.8`
[1] 0.9163
$`n=25`
$`n=25`$`p=0.01`
[1] 0.2276
$`n=25`$`p=0.4`
[1] 0.94
$`n=25`$`p=0.8`
[1] 0.8852
$`n=30`
$`n=30`$`p=0.01`
[1] 0.2644
$`n=30`$`p=0.4`
[1] 0.9335
$`n=30`$`p=0.8`
[1] 0.9474
$`n=50`
$`n=50`$`p=0.01`
[1] 0.3926
$`n=50`$`p=0.4`
[1] 0.9421
$`n=50`$`p=0.8`
[1] 0.9371
$`n=100`
$`n=100`$`p=0.01`
[1] 0.6313
$`n=100`$`p=0.4`
[1] 0.9495
$`n=100`$`p=0.8`
[1] 0.9311
These results look more like what we expect from the simulation: poor coverage at low values of p / small sample sizes, and coverage improves as sample size increases for a given p value.
Starting from scratch: a basic simulator for one p-value / sample size
Here we develop a solution that iteratively builds on a set of basic building blocks: one p-value, one sample size, and a 95% confidence interval. The simulator also tracks parameters so we can combine results from multiple simulations into data frames that are easy to read and interpret.
First, we create a simulator that tests 10,000 samples of
We set parameters to represent a true population p-value of 0.5, a sample size of 5, and z-value of 1.96 representing a 95% confidence interval. We created function arguments for these constants so we can vary them in subsequent code. We also use set.seed()
to make the results reproducible.
set.seed(90125)
simulationList <- lapply(1:10000,function(x,p_value,sample_size,z_val){
trial <- x
successes <- sum(rbinom(sample_size,size=1,prob = p_value))
observed_p <- successes / sample_size
z_value <- z_val
lower.Wald <- observed_p - z_value * sqrt(observed_p*(1-observed_p)/sample_size)
upper.Wald <- observed_p + z_value * sqrt(observed_p*(1-observed_p)/sample_size)
data.frame(trial,p_value,observed_p,z_value,lower.Wald,upper.Wald)
},0.5,5,1.96)
A key difference between this code and the code from the original question is that we take samples of 5 from rbinom()
and immediately sum the number of true values to calculate the number of successes. This allows us to calculate observed_p
as successes / sample_size
. Now we have an empirically generated version of what was called p.hat
in the original question.
The resulting list includes a data frame summarizing the results of each trial.
We combine the list of data frames into a single data frame with do.call()
simulation_df <- do.call(rbind,simulationList)
At this point simulation_df
is a data frame containing 10,000 rows and 6 columns. Each row represents the results from one simulation of sample_size
Bernoulli trials. We’ll print the first few rows to illustrate the contents of the data frame.
> dim(simulation_df)
[1] 10000 6
> head(simulation_df)
trial p_value observed_p z_value lower.Wald upper.Wald
1 1 0.5 0.6 1.96 0.17058551 1.0294145
2 2 0.5 0.2 1.96 -0.15061546 0.5506155
3 3 0.5 0.6 1.96 0.17058551 1.0294145
4 4 0.5 0.2 1.96 -0.15061546 0.5506155
5 5 0.5 0.2 1.96 -0.15061546 0.5506155
6 6 0.5 0.4 1.96 -0.02941449 0.8294145
>
Notice how the observed_p
values are distinct values in increments of 0.2. This is because when sample size is 5, the number of TRUE values in each sample can vary between 0 and 5. A histogram of observed_p
makes this clear.
Even with a sample size of 5, we can see the shape of a binomial distribution emerging in the histogram.
Next, we calculate the coverage percentage by summing the rows where the population p-value (represented as p_value
) is within the Wald confidence interval.
# calculate coverage: % of simulations where population p-value is
# within Wald confidence limits generated via simulation
sum(simulation_df$p_value > simulation_df$lower.Wald &
simulation_df$p_value < simulation_df$upper.Wald) / 10000 * 100
> sum(simulation_df$p_value > simulation_df$lower.Wald &
+ simulation_df$p_value < simulation_df$upper.Wald) / 10000 * 100
[1] 93.54
Coverage of 93.54% is a reasonable result for the simulation, given that we calculated a 95% confidence interval. We interpret the result as 93.5% of the samples generated Wald confidence intervals that included the population p-value of 0.5.
Therefore, we conclude that our simulator appears to be generating valid results. We will build on this basic design to execute simulations with multiple p-values and sample sizes.
Simulating multiple p-values for a given sample size
Next, we’ll vary the probability values to see the percentage coverage for 10,000 samples of 5 observations. Since the statistics literature such as Sauro and Lewis, 2005 tells us that Wald confidence intervals have poor coverage for very low and very high p-values, we’ve added an argument to calculate Adjusted Wald scores. We’ll set this argument to FALSE
for the time being.
p_val_simulations <- lapply(c(0.01,0.1,0.4,.5,.8),function(p_val){
aSim <- lapply(1:10000,function(x,p_value,sample_size,z_val,adjWald){
trial <- x
successes <- sum(rbinom(sample_size,size=1,prob = p_value))
if(adjWald){
successes <- successes + 2
sample_size <- sample_size + 4
}
observed_p <- sum(successes) / (sample_size)
z_value <- z_val
lower.Wald <- observed_p - z_value * sqrt(observed_p*(1-observed_p)/sample_size)
upper.Wald <- observed_p + z_value * sqrt(observed_p*(1-observed_p)/sample_size)
data.frame(trial,p_value,sample_size,observed_p,z_value,adjWald,lower.Wald,upper.Wald)
},p_val,5,1.96,FALSE)
# bind results to 1 data frame & return
do.call(rbind,aSim)
})
The resulting list, p_val_simulations
contains one data frame for each p-value run through the simulation.
We combine these data frames and calculate coverage percentages as follows.
do.call(rbind,lapply(p_val_simulations,function(x){
p_value <- min(x$p_value)
adjWald <- as.logical(min(x$adjWald))
sample_size <- min(x$sample_size) - (as.integer(adjWald) * 4)
coverage_pct <- (sum(x$p_value > x$lower.Wald &
x$p_value < x$upper.Wald) / 10000)*100
data.frame(p_value,sample_size,adjWald,coverage_pct)
}))
As expected, the coverage is very poor the further we are away from a p-value of 0.5.
p_value sample_size adjWald coverage_pct
1 0.01 5 FALSE 4.53
2 0.10 5 FALSE 40.23
3 0.40 5 FALSE 83.49
4 0.50 5 FALSE 94.19
5 0.80 5 FALSE 66.35
However, when we rerun the simulation with adjWald = TRUE
, we get the following results.
p_value sample_size adjWald coverage_pct
1 0.01 5 TRUE 95.47
2 0.10 5 TRUE 91.65
3 0.40 5 TRUE 98.95
4 0.50 5 TRUE 94.19
5 0.80 5 TRUE 94.31
These are much better coverage values, particularly for p-values close the the ends of the distribution.
The final task remaining is to modify the code so it executes Monte Carlo simulations at varying levels of sample size. Before proceeding further, we calculate the runtime for the code we’ve developed thus far.
system.time()
tells us that the code to run 5 different Monte Carlo simulations of 10,000 Bernoulli trials with sample size of 5 takes about 38 seconds to run on a MacBook Pro 15 with a 2.5 Ghz Intel i-7 processor. Therefore, we expect that the next simulation will take multiple minutes to run.
Varying p-value and sample size
We add another level of lapply()
to account for varying the sample size. We have also set the adjWald
parameter to FALSE
so we can see how the base Wald confidence interval behaves at p = 0.01 and 0.10.
set.seed(95014)
system.time(sample_simulations <- lapply(c(10, 15, 20, 25, 30, 50,100, 150, 200),function(s_size){
lapply(c(0.01,0.1,0.8),function(p_val){
aSim <- lapply(1:10000,function(x,p_value,sample_size,z_val,adjWald){
trial <- x
successes <- sum(rbinom(sample_size,size=1,prob = p_value))
if(adjWald){
successes <- successes + 2
sample_size <- sample_size + 4
}
observed_p <- sum(successes) / (sample_size)
z_value <- z_val
lower.Wald <- observed_p - z_value * sqrt(observed_p*(1-observed_p)/sample_size)
upper.Wald <- observed_p + z_value * sqrt(observed_p*(1-observed_p)/sample_size)
data.frame(trial,p_value,sample_size,observed_p,z_value,adjWald,lower.Wald,upper.Wald)
},p_val,s_size,1.96,FALSE)
# bind results to 1 data frame & return
do.call(rbind,aSim)
})
}))
Elapsed time on the MacBook Pro was 217.47 seconds, or about 3.6 minutes. Given that we ran 27 different Monte Carlo simulations, the code completed one simulation each 8.05 seconds.
The final step is to process the list of lists to create an output data frame that summarizes the analysis. We aggregate the content, combine rows into data frames, then bind the resulting list of data frames.
summarizedSimulations <- lapply(sample_simulations,function(y){
do.call(rbind,lapply(y,function(x){
p_value <- min(x$p_value)
adjWald <- as.logical(min(x$adjWald))
sample_size <- min(x$sample_size) - (as.integer(adjWald) * 4)
coverage_pct <- (sum(x$p_value > x$lower.Wald &
x$p_value < x$upper.Wald) / 10000)*100
data.frame(p_value,sample_size,adjWald,coverage_pct)
}))
})
results <- do.call(rbind,summarizedSimulations)
One last step, we sort the data by p-value to see how coverage improves as sample size increases.
results[order(results$p_value,results$sample_size),]
…and the output:
> results[order(results$p_value,results$sample_size),]
p_value sample_size adjWald coverage_pct
1 0.01 10 FALSE 9.40
4 0.01 15 FALSE 14.31
7 0.01 20 FALSE 17.78
10 0.01 25 FALSE 21.40
13 0.01 30 FALSE 25.62
16 0.01 50 FALSE 39.65
19 0.01 100 FALSE 63.67
22 0.01 150 FALSE 77.94
25 0.01 200 FALSE 86.47
2 0.10 10 FALSE 64.25
5 0.10 15 FALSE 78.89
8 0.10 20 FALSE 87.26
11 0.10 25 FALSE 92.10
14 0.10 30 FALSE 81.34
17 0.10 50 FALSE 88.14
20 0.10 100 FALSE 93.28
23 0.10 150 FALSE 92.79
26 0.10 200 FALSE 92.69
3 0.80 10 FALSE 88.26
6 0.80 15 FALSE 81.33
9 0.80 20 FALSE 91.88
12 0.80 25 FALSE 88.38
15 0.80 30 FALSE 94.67
18 0.80 50 FALSE 93.44
21 0.80 100 FALSE 92.96
24 0.80 150 FALSE 94.48
27 0.80 200 FALSE 93.98
>
Interpreting the results
The Monte Carlo simulations illustrate that Wald confidence intervals provide poor coverage at a p-value of 0.01, even with a sample size of 200. Coverage improves at p-value of 0.10, where all but one of the simulations at sample sizes 25 and above exceeded 90%. Coverage is even better for the p-value of 0.80, where all but one of the sample sizes above 15 exceeded 91% coverage.
Coverage improves further when we calculate Adjusted Wald confidence intervals, especially at lower p-values.
results[order(results$p_value,results$sample_size),]
p_value sample_size adjWald coverage_pct
1 0.01 10 TRUE 99.75
4 0.01 15 TRUE 98.82
7 0.01 20 TRUE 98.30
10 0.01 25 TRUE 97.72
13 0.01 30 TRUE 99.71
16 0.01 50 TRUE 98.48
19 0.01 100 TRUE 98.25
22 0.01 150 TRUE 98.05
25 0.01 200 TRUE 98.34
2 0.10 10 TRUE 93.33
5 0.10 15 TRUE 94.53
8 0.10 20 TRUE 95.61
11 0.10 25 TRUE 96.72
14 0.10 30 TRUE 96.96
17 0.10 50 TRUE 97.28
20 0.10 100 TRUE 95.06
23 0.10 150 TRUE 96.15
26 0.10 200 TRUE 95.44
3 0.80 10 TRUE 97.06
6 0.80 15 TRUE 98.10
9 0.80 20 TRUE 95.57
12 0.80 25 TRUE 94.88
15 0.80 30 TRUE 96.31
18 0.80 50 TRUE 95.05
21 0.80 100 TRUE 95.37
24 0.80 150 TRUE 94.62
27 0.80 200 TRUE 95.96
The Adjusted Wald confidence intervals provide consistently better coverage across the range of p-values and sample sizes, with an average coverage of 96.72% across the 27 simulations. This is consistent with the literature that indicates Adjusted Wald confidence intervals are more conservative than unadjusted Wald confidence intervals.
At this point we have a working Monte Carlo simulator that produces valid results for multiple p-values and sample sizes. We can now review the code to find opportunities to optimize its performance.
Optimizing the solution
Following the old programming aphorism of Make it work, make it right, make it fast, working the solution out in an iterative manner helped enabled me to develop a solution that produces valid results.
Understanding of how to make it right enabled me not only to see the flaw in the code posted in the question, but it also enabled me to envision a solution. That solution, using rbinom()
once with an argument of m * n
, casting the result as a matrix()
, and then using rowSums()
to calculate p-values, led me to see how I could optimize my own solution by eliminating thousands of rbinom()
calls from each simulation.
Refactoring for performance
We create a function, binomialSimulation()
, that generates Bernoulli trials and Wald confidence intervals with a single call to rbinom()
, regardless of the number of trials in a single simulation. We also aggregate results so each simulation generates a data frame containing one row describing the results of the test.
set.seed(90125)
binomialSimulation <- function(trial_size,p_value,sample_size,z_value){
trials <- matrix(rbinom(trial_size * sample_size,size=1,prob = p_value),
nrow = trial_size,ncol = sample_size)
observed_p <- rowSums(trials) / sample_size
lower.Wald <- observed_p - z_value * sqrt(observed_p*(1-observed_p)/sample_size)
upper.Wald <- observed_p + z_value * sqrt(observed_p*(1-observed_p)/sample_size)
coverage_pct <- sum(p_value > lower.Wald &
p_value < upper.Wald) / 10000 * 100
data.frame(sample_size,p_value,avg_observed_p=mean(observed_p),coverage_pct)
}
We run the function with a population p-value of 0.5, a sample size of 5, and 10,000 trials and a confidence interval of 95%, and track the execution time with system.time()
. The optimized function is 99.8% faster than the original implementation described earlier in the article, which runs in about 6.09 seconds.
system.time(binomialSimulation(10000,0.5,5,1.96))
> system.time(binomialSimulation(10000,0.5,5,1.96))
user system elapsed
0.015 0.000 0.015
We skip the intermediate steps from the solution above, instead presenting the optimized version of the iteratively developed solution.
system.time(results <- do.call(rbind,lapply(c(5,10,15,20,25,50,100,250),
function(aSample_size,p_values) {
do.call(rbind,lapply(p_values,function(a,b,c,d){
binomialSimulation(p_value = a,
trial_size = b,
sample_size = aSample_size,
z_value = d)
},10000,5,1.96))
},c(0.1,0.4,0.8))))
As expected, elimination of the thousands of unnecessary calls to rbinom()
radically improves performance of the solution.
user system elapsed
0.777 0.053 0.830
Given that our prior solution ran in 217 seconds, performance of the optimized version is really impressive. Now we have a solution that not only generates accurate Monte Carlo simulations of Bernoulli trials, but it’s also very fast.