Random thoughts of an economist

Understanding the meaning of confidence intervals using R

Posted in R, Statistics/Econometrics by kafuwong on July 4, 2012

# To illustrate
#  (1) the meaning of confidence interval (i.e., CI)
#  (2) the similarity of confidence intervals computed using different approaches
#  (3) the relationship between confidence interval and type I error of a hypothesis test
#  -------

mu_pop=2  # population mean
sigma_pop=2  # population standard deviation
n=100   # sample size
asim=1000  # number of repetitions
CIlimits=matrix(999,asim,12)  # create a matrix to store the resulting CI limits
colnames(CIlimits)=c("l.95(1)","r.95(1)","l.99(1)","r.99(1)",
                     "l.95(2a)","r.95(2a)","l.99(2a)","r.99(2a)",
                     "l.95(2b)","r.95(2b)","l.99(2b)","r.99(2b)")
                     # name the columns for easy reference later

Indicators=matrix(999,asim,6) # create a matrix to store the resulting indicators
colnames(Indicators)=c("0.05(1)","0.01(1)",
                       "0.05(2a)","0.01(2a)",
                       "0.05(2b)","0.01(2b)")
                       # name the columns for easy reference later

set.seed(20120704)  # fix a random number seed. 

for(i in 1:asim){  # start the 1000 repetitions
  print(i)
  x<-rnorm(n,mean=mu_pop,sd=sigma_pop)  # generate random sample
  mx=mean(x) # sample mean
  sdx=sd(x)  # sample standard deviation

  # Approach (1)
  # Central Limit Theorem says (mx-mu_pop)/sqrt(sigma_pop^2/n) is asymptotically N(0,1)
  # -----------------------------
  # upper limit of the 95% CI for the population mean
    CIlimits[i,"r.95(1)"]=qnorm(0.975,mean=mx,sd=sqrt(sigma_pop^2/n))
  # lower limit of the 95% CI for the population mean
    CIlimits[i,"l.95(1)"]=qnorm(0.025,mean=mx,sd=sqrt(sigma_pop^2/n))
  # upper limit of the 99% CI for the population mean
    CIlimits[i,"r.99(1)"]=qnorm(0.995,mean=mx,sd=sqrt(sigma_pop^2/n))
  # lower limit of the 99% CI for the population mean
    CIlimits[i,"l.99(1)"]=qnorm(0.005,mean=mx,sd=sqrt(sigma_pop^2/n))
  # Indicator (0 or 1): cannot reject at the 0.05 level of significance, i.e., mu_pop is inside the 95% CI.
    Indicators[i,"0.05(1)"]=ifelse((mu_pop>=CIlimits[i,"l.95(1)"])&(mu_pop<=CIlimits[i,"r.95(1)"]),1,0)
  # Indicator (0 or 1): cannot reject at the 0.01 level of significance, i.e., mu_pop is inside the 99% CI.
    Indicators[i,"0.01(1)"]=ifelse((mu_pop>=CIlimits[i,"l.99(1)"])&(mu_pop<=CIlimits[i,"r.99(1)"]),1,0)

  # Approach (2a)
  # Central Limit Theorem says (mx-mu_pop)/sqrt(sdx^2/n) is asymptotically N(0,1)
  # -----------------------------
  # upper limit of the 95% CI for the population mean
    CIlimits[i,"r.95(2a)"]=qnorm(0.975,mean=mx,sd=sqrt(sdx^2/n))
  # lower limit of the 95% CI for the population mean
    CIlimits[i,"l.95(2a)"]=qnorm(0.025,mean=mx,sd=sqrt(sdx^2/n))
  # upper limit of the 99% CI for the population mean
    CIlimits[i,"r.99(2a)"]=qnorm(0.995,mean=mx,sd=sqrt(sdx^2/n))
  # lower limit of the 99% CI for the population mean
    CIlimits[i,"l.99(2a)"]=qnorm(0.005,mean=mx,sd=sqrt(sdx^2/n))
  # Indicator (0 or 1): cannot reject at the 0.05 level of significance, i.e., mu_pop is inside the 95% CI.
    Indicators[i,"0.05(2a)"]=ifelse((mu_pop>=CIlimits[i,"l.95(2a)"])&(mu_pop<=CIlimits[i,"r.95(2a)"]),1,0)
  # Indicator (0 or 1): cannot reject at the 0.01 level of significance, i.e., mu_pop is inside the 99% CI.
    Indicators[i,"0.01(2a)"]=ifelse((mu_pop>=CIlimits[i,"l.99(2a)"])&(mu_pop<=CIlimits[i,"r.99(2a)"]),1,0)
  # Approach (2b)
  # Central Limit Theorem says (mx-mu_pop)/sqrt(sdx^2/n) is asymptotically N(0,1);
  # here we let t.test() construct the CI, which is based on the t distribution with n-1 degrees of freedom
  # -----------------------------
  # lower limit of the 95% CI for the population mean
    CIlimits[i,"l.95(2b)"]<-t.test(x,alternative="two.sided",mu=0,conf.level=0.95)$conf.int[1]
  # upper limit of the 95% CI for the population mean
    CIlimits[i,"r.95(2b)"]<-t.test(x,alternative="two.sided",mu=0,conf.level=0.95)$conf.int[2]
  # lower limit of the 99% CI for the population mean
    CIlimits[i,"l.99(2b)"]<-t.test(x,alternative="two.sided",mu=0,conf.level=0.99)$conf.int[1]
  # upper limit of the 99% CI for the population mean
    CIlimits[i,"r.99(2b)"]<-t.test(x,alternative="two.sided",mu=0,conf.level=0.99)$conf.int[2]
  # Indicator (0 or 1): cannot reject at the 0.05 level of significance, i.e., mu_pop is inside the 95% CI.
    Indicators[i,"0.05(2b)"]=ifelse((mu_pop>=CIlimits[i,"l.95(2b)"])&(mu_pop<=CIlimits[i,"r.95(2b)"]),1,0)
  # Indicator (0 or 1): cannot reject at the 0.01 level of significance, i.e., mu_pop is inside the 99% CI.
    Indicators[i,"0.01(2b)"]=ifelse((mu_pop>=CIlimits[i,"l.99(2b)"])&(mu_pop<=CIlimits[i,"r.99(2b)"]),1,0)

}
## We expect an x% CI to contain the population mean x% of the time.
# Approach (1)
  # Percentage of 95% CIs that contain the population mean
    mean((CIlimits[,"l.95(1)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.95(1)"]))
  # Percentage of 99% CIs that contain the population mean
    mean((CIlimits[,"l.99(1)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.99(1)"]))
# Approach (2a)
  # Percentage of 95% CIs that contain the population mean
    mean((CIlimits[,"l.95(2a)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.95(2a)"]))
  # Percentage of 99% CIs that contain the population mean
    mean((CIlimits[,"l.99(2a)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.99(2a)"]))
# Approach (2b)
  # Percentage of 95% CIs that contain the population mean
    mean((CIlimits[,"l.95(2b)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.95(2b)"]))
  # Percentage of 99% CIs that contain the population mean
    mean((CIlimits[,"l.99(2b)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.99(2b)"]))

# Type I errors
# For a test at the y% level of significance, we expect to reject the true null hypothesis y% of the time.
# Test at 5% level of significance
  mean(1-Indicators[,"0.05(1)"])
  mean(1-Indicators[,"0.05(2a)"])
  mean(1-Indicators[,"0.05(2b)"])

# Test at 1% level of significance
  mean(1-Indicators[,"0.01(1)"])
  mean(1-Indicators[,"0.01(2a)"])
  mean(1-Indicators[,"0.01(2b)"])
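
# As a quick cross-check (a sketch using the Indicators matrix built above),
# the same coverage and rejection rates can be read off in one step:
  colMeans(Indicators)    # empirical coverage rates of the six confidence intervals
  1-colMeans(Indicators)  # empirical type I error rates of the six tests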

Understanding QQ Plot

Posted in R, Statistics/Econometrics by kafuwong on June 22, 2012

A QQ plot is a graphical tool for checking whether a dataset x has a distribution similar to that of another dataset y.
 
Imagine x has 100 observations.  We can compute the percentiles.  What are they?  The percentiles will simply be the sorted values of x, from the smallest to the largest.  Call the variable containing the percentiles sx (sorted x). 
 
(1) Now imagine y is the same as x, but we pretend that y is different from x and compute the percentiles accordingly.   Call the variable containing the percentiles sy (sorted y). 
 
Let's plot the percentiles of x against the percentiles of y, i.e., sx against sy.  Because sx=sy, all the percentiles line up on a straight line.  That is, when y and x are identical (and thus have the same distribution), the plot looks like a straight line and the scales of the x- and y-axes are the same.
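A minimal R sketch of step (1); the data are drawn from a normal distribution purely for illustration, and any 100 observations would do:
x <- rnorm(100)   # an arbitrary dataset of 100 observations
y <- x            # step (1): y is identical to x
sx <- sort(x)     # the "percentiles" of x are just its sorted values
sy <- sort(y)
plot(sx, sy)      # all points fall exactly on a 45-degree line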
 
(2) Now let y=x+2.  Repeat the above step and plot the percentiles of x against the percentiles of y.  Essentially, y has the same distribution as x, with its mean shifted by a constant.  We should again see all the percentiles line up on a straight line (a different straight line, though).  Note how the scale of the y-axis changes.  (A short R sketch covering steps (2) and (3) follows step (3).)
 
(3) Now let y=2*x.  Repeat the above step and plot the percentiles of x against the percentiles of y.  Essentially, y has the same shape of distribution as x, with its mean and standard deviation both multiplied by the constant.  We should again see all the percentiles line up on a straight line (yet another straight line).  Note how the scale of the y-axis changes.
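A sketch of steps (2) and (3), continuing with the x and sx from the step (1) sketch above:
sy <- sort(x + 2)   # step (2): mean shifted by a constant
plot(sx, sy)        # a straight line with slope 1 and intercept 2
sy <- sort(2*x)     # step (3): mean and standard deviation scaled by a constant
plot(sx, sy)        # a straight line through the origin with slope 2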
 
(4) Suppose x is a vector of standard normal variables and y is another vector of standard normal variables.  Essentially, they have the same "normal" distribution.  Let's do the plot again.  We see a plot that lies broadly on a straight line.  The end points may stray from the straight line because the extreme quantiles are based on rare observations and are therefore estimated less reliably.  (See the R sketch after step (5).)
 
(5) Suppose x is the same as in (4) but z=2y+1, where y is the same as in (4).  Both x and z are normally distributed.  Let's plot the quantiles of x against the quantiles of z again.  We see a plot that again lies broadly on a straight line, just as before.
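A sketch of steps (4) and (5); the sample size of 100 is arbitrary:
x <- rnorm(100)          # step (4): two independent standard normal samples
y <- rnorm(100)
plot(sort(x), sort(y))   # broadly a straight line, noisier at the extreme ends
z <- 2*y + 1             # step (5): z is normal with mean 1 and standard deviation 2
plot(sort(x), sort(z))   # again broadly a straight line, with slope about 2 and intercept about 1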
 
In the above calculations, we were talking about percentiles.  What if one data set has fewer than 100 observations?  We can still compute the percentiles by interpolation.
 
Do we need to insist on percentiles?  We do not.  All we need is the sorted values of the two data sets, provided they have the same number of observations.  If the two data sets have different numbers of observations, we can sort both and interpolate the larger data set down to the same number of points as the smaller one, which is exactly what R's qqplot does, as the code below shows.
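A sketch of the unequal-sample-size case, using approx() in the same way as R's qqplot shown below; the sample sizes 200 and 50 are arbitrary:
x <- rnorm(200)                          # the larger data set
y <- rnorm(50)                           # the smaller data set
sx <- approx(1:200, sort(x), n = 50)$y   # interpolate the sorted larger set down to 50 points
sy <- sort(y)
plot(sx, sy)                             # broadly a straight line, as before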
 
If you use R (http://www.r-project.org/), type qqplot at the R prompt (without quotation marks or parentheses) and you will see the R code used to generate the plot:
function (x, y, plot.it = TRUE, xlab = deparse(substitute(x)),
    ylab = deparse(substitute(y)), ...)
{
    sx <- sort(x)
    sy <- sort(y)
    lenx <- length(sx)
    leny <- length(sy)
    if (leny < lenx)
        sx <- approx(1L:lenx, sx, n = leny)$y
    if (leny > lenx)
        sy <- approx(1L:leny, sy, n = lenx)$y
    if (plot.it)
        plot(sx, sy, xlab = xlab, ylab = ylab, ...)
    invisible(list(x = sx, y = sy))
}
By reading the code, you will learn how qqplot works, and you will also pick up other R syntax along the way (such as approx).
 
How about qqnorm?  qqnorm plots the quantiles of a given dataset against the theoretical quantiles of a standard normal distribution (in fact, whether the data are "standard" normal or not does not matter, since a normal sample will still line up on a straight line).
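
A minimal illustration (the mean and standard deviation are arbitrary):
x <- rnorm(100, mean = 5, sd = 3)   # a normal sample, but not standard normal
qqnorm(x)                           # quantiles of x vs. theoretical N(0,1) quantiles
qqline(x)                           # reference line; the points lie close to it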