Random thoughts of an economist

A thought on how to evaluate whether education expansion helps improve mobility.

Posted in Econometrics, Economics, Hong Kong, Statistics/Econometrics, teaching by kafuwong on September 11, 2015

We often see reports comparing the median income of university graduates over time. The sad news is that the median income of university graduates is often found to be declining. One common conclusion is that the expansion of university education has not helped improve social mobility, and that university graduates seem to be doing worse than before.

While median income is easy to compute, I do not think it is the right measure for addressing social mobility, or for judging how university graduates nowadays fare compared to previous cohorts. A correct measure would be some form of median income that controls for the expansion of university education.

Imagine the following hypothetical situation. Suppose we have a stable population structure, and that ten years ago 20 percent of high school graduates could attend university. Out of 25 people, we end up with 20 achieving high school level and 5 achieving university level. The median income of high school graduates was X1 and that of university graduates was Y1. Y1 is usually higher than X1, reflecting both the difference in ability between the two groups and the added value of education.

For the sake of illustration, let’s assume that the 5 university graduates have incomes of 12100, 12200, 12300, 12400 and 12500, so the median income is 12300. That is, Y1=12300. Let’s further assume that the top 5 earners among high school graduates earn 8100, 8200, 8300, 8400 and 8500 respectively.

Today, due to the expansion of higher education, 40 percent of high school graduates can attend university. Continuing the example above, we end up with 15 persons achieving high school level and 10 achieving university level. Suppose the 10 university graduates now have incomes of 11100, 11200, 11300, 11400, 11500, 12100, 12200, 12300, 12400, 12500. Let’s denote the median income of high school graduates as X2 and that of university graduates as Y2. Note that X2 is now based on a smaller group while Y2 is based on a larger one. We can expect X2 to be lower than X1, because the top five earners (“more able”?) were removed from the original high school group and put into the university group. And Y2 will be lower than Y1 because the university group now includes the “less able” ones.

Thus, if we compare the change in median income by education group, we are bound to see a deterioration in BOTH groups. Some would conclude that education expansion is bad.

Wait a minute. The five persons who reached university level only because of the expansion clearly earn higher incomes: (11100, 11200, 11300, 11400, 11500) versus (8100, 8200, 8300, 8400, 8500). That is a substantial improvement in social mobility (as measured by income) due to the education expansion, isn’t it?

That is, when evaluating whether education expansion is useful, we should focus on these 5 persons who previously had no chance to study at university but now do.

If we still insist on using something like the median income of university graduates across time to judge whether university graduates are doing worse or better, we need to make an adjustment. In the example above, since the university share doubled from 20 to 40 percent, we should probably compare the median income of the top half of today’s university graduates (roughly their 75th percentile, about 12300 in our numbers) to the median income of university graduates 10 years ago (Y1 = 12300) — no deterioration at all!
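The composition effect described above can be reproduced in a few lines. A minimal sketch (in Python, using the post’s numbers; the median helper is written out so the snippet is self-contained):

```python
# Reproducing the post's example: the university group's median falls after
# the expansion, yet every one of the five "movers" earns more than before.
hs_old_top5 = [8100, 8200, 8300, 8400, 8500]     # top 5 earners, high school group, 10 years ago
uni_old = [12100, 12200, 12300, 12400, 12500]    # the 5 university graduates, 10 years ago

# Today those same 5 top earners attend university and earn 11100..11500
movers = [11100, 11200, 11300, 11400, 11500]
uni_new = movers + uni_old                       # the 10 university graduates today

def median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

print(median(uni_old))   # 12300 (Y1)
print(median(uni_new))   # 11800 (Y2): lower, despite every mover gaining
print(all(new > old for new, old in zip(movers, hs_old_top5)))   # True
```

The median of each group falls purely because the group compositions changed, even though no individual is worse off.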

Generalization based on the sample size of one

Posted in Hong Kong, Statistics/Econometrics, teaching, Water by kafuwong on July 21, 2015

Recently, lead in water has occupied the headlines of major newspapers in Hong Kong. Experts have been consulted, and some appeared to mis-speak carelessly. The most laughable statement was made by a medical doctor, a consultant to the “Hong Kong Poison Control Network”, who remarked that lead poisoning can be caused by chewing on a pencil, among many other causes. The statement was so laughable that it was widely circulated on the internet.

Admit it: most of us do not know why this statement is laughable. OK. It is laughable because the writing core of pencils (called “lead” pens in Chinese) does not contain any lead nowadays. Although the writing core of some early pencils was made of lead, it has since been replaced by non-toxic graphite. (http://pencils.com/pencil-history/) And the statement came from a consultant to the “Hong Kong Poison Control Network”.

In fact, at least supposedly, nowadays even the paint coating of a pencil should not contain lead, so most pencils are safe to chew (though chewing is not encouraged).

—–

The internet is a powerful tool. A friend was so excited about the lead-poisoning-from-pencil-chewing story that he added a catchy title when sharing the news report about the doctor’s statement: “If you cannot trust doctors, who can you trust?”

It is this catchy and exaggerated statement that caught my attention. I have to admit, I like it.

Nevertheless, the statement is clearly too broad a generalization, based on a sample size of one. That specific doctor may not be trustworthy on this specific issue, but that does not mean he is untrustworthy on other issues. And it certainly does not mean that other doctors are untrustworthy.

Beware of similar generalizations from a small sample of observations.

True or False: The increase in tourists has cost Hong Kong 3.5% of GDP.

Posted in Econometrics, Economics, Hong Kong, Population, Research, Statistics/Econometrics, teaching by kafuwong on March 4, 2014

First, there is an i-Cable story that uses the statistical analysis of a colleague. Second, there is a column written by a friend. Both are about the extra waiting time due to the influx of tourists.

In the i-Cable story, the reporter took MTR trains from Tai Wai Station to Wan Chai Station, showing the wait to get on a train at every interchange. The reporter then interviewed a colleague of mine, who showed that the number of MTR passengers is highly positively related to the number of tourists. Therefore, an increase in the number of tourists would increase the number of MTR passengers, and consequently the wait to get on trains and the time one has to spend commuting. My colleague’s analysis was not about how the number of tourists would affect commuting time, but the audience would get that impression.

In her article, my friend estimated the loss of GDP due to waiting. She took the extra 10 minutes of commuting time found in a LegCo member’s rush-hour experiment and deduced a loss of 3.5% of GDP: suppose all employees work 47 hours per week on average and each wastes an extra 10 minutes per trip; the extra 20 minutes per round trip (10 minutes x 2) is then equivalent to a loss of 3.5% of GDP. Striking! The story certainly caught the eyes of a lot of people, including me. Unfortunately, striking stories are often wrong, if you are willing to check the calculation or deduction behind them.
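Her arithmetic can be checked directly. A sketch (the 5-day work week is my assumption, not stated in the column):

```python
# Back-of-the-envelope check of the 3.5% figure.
# Assumes a 5-day work week (my reading, not stated in the column).
hours_per_week = 47
workdays_per_week = 5
extra_minutes_per_day = 10 * 2          # 10 minutes each way, round trip

minutes_per_workday = hours_per_week * 60 / workdays_per_week   # 564 minutes
loss_share = extra_minutes_per_day / minutes_per_workday
print(round(100 * loss_share, 1))       # 3.5 (percent)
```

So the figure follows mechanically from imposing the 20 extra minutes on every employee every working day — which is exactly the assumption questioned below.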

I would like to raise two questions:
(1) Is the “extra” waiting time of 10 minutes an upper bound, a lower bound, or a median? I took the MTR today and did not have to wait for the next train. Imposing the upper bound on all employees yields an unreasonably exaggerated number. I suspect the true loss is much less than 3.5% of Hong Kong’s GDP.
(2) Even if the extra 10-minute wait is accurate, how much of it is due to tourists, and how much to the growth of our own population and the government policy of diverting traffic from buses to the MTR (for cleaner air, perhaps)? I do not think most tourists take the MTR during rush hour, though of course there are exceptions.
I am waiting for some serious researchers to provide good answers to my questions. Yes, data could be a big problem.

If you are interested in seeing the i-Cable story, here is the link to the video:
http://cablenews.i-cable.com/webapps/program/newslancet/videoPlay.php?video_id=12178031
If you are interested in the column, here is the link to the article:
http://news.mingpao.com/20140304/fad1.htm

All Americans are nice people and most of them live in either Wisconsin or Minnesota.

Posted in Information, Life, Parenting, Statistics/Econometrics, teaching by kafuwong on November 26, 2013

During my high school years, I thought all Americans were nice people and that most of them lived in either Wisconsin or Minnesota. Because… all the Americans I met during my high school years were from either Wisconsin or Minnesota, and they were all very nice people.

Now I know this view was biased, owing to the small sample I had (a sample of only nice Americans from Wisconsin and Minnesota). I understand my view was wrong because I have since looked at demographic data and found that there are far more Americans from the other states than from Wisconsin and Minnesota, and because I have met not-so-nice Americans since my high school years.

Now I understand that my view is necessarily biased: no matter how hard I try, I can only obtain an incomplete picture of the world (limited time, limited brain power). There are many people out there who know what I do not know and can therefore teach me good lessons.

Should I bring canned food on my next trip to Shanghai?

Posted in Environment, Statistics/Econometrics, teaching, Uncategorized by kafuwong on May 8, 2013

Friends have cautioned me about food safety for my next trip to Shanghai.  Indeed, there have been several reports of unsafe food in Shanghai lately: dead pigs floating in the river; rat meat sold as fake mutton.  The stories were even widely reported by US media, e.g., NPR.

I know some of us do not like talk of probability.  However, if we are talking about getting contaminated food or fake meat, it makes sense to talk about conditional probability: conditional on where you obtain the food.  The probability of getting contaminated food from a 5-star restaurant is much lower than from a street vendor in Shanghai.  I am not saying that eating at 5-star restaurants is 100 percent safe, but it is definitely safer.
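To make the point concrete, here is a toy calculation with invented contamination rates (both numbers are purely illustrative, not real estimates):

```python
# Conditional probability of a bad meal, conditional on the food source.
# Both rates below are invented for illustration only.
p_bad = {
    "5-star restaurant": 0.001,   # say, 0.1% of meals
    "street vendor": 0.05,        # say, 5% of meals
}

# Over 10 independent meals from one source, the chance of at least one
# contaminated meal differs dramatically between the two sources.
for source, p in p_bad.items():
    p_at_least_one = 1 - (1 - p) ** 10
    print(source, round(p_at_least_one, 3))
# 5-star restaurant 0.01
# street vendor 0.401
```

The unconditional "food in Shanghai is risky" headline hides exactly this difference.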

Understanding the meaning of confidence intervals using R

Posted in R, Statistics/Econometrics by kafuwong on July 4, 2012

# To illustrate
#  (1) the meaning of confidence interval (i.e., CI)
#  (2) the similarity of confidence intervals computed using different approaches
#  (3) the relationship between confidence interval and type I error of a hypothesis test
#  -------

mu_pop=2  # population mean
sigma_pop=2  # population standard deviation
n=100   # sample size
asim=1000  # number of repetitions
CIlimits=matrix(999,asim,12)  # create a matrix to store the resulting CI limits
colnames(CIlimits)=c("l.95(1)","r.95(1)","l.99(1)","r.99(1)",
                     "l.95(2a)","r.95(2a)","l.99(2a)","r.99(2a)",
                     "l.95(2b)","r.95(2b)","l.99(2b)","r.99(2b)")
                     # name the columns for easy reference later

Indicators=matrix(999,asim,6) # create a matrix to store the resulting indicators
colnames(Indicators)=c("0.05(1)","0.01(1)",
                       "0.05(2a)","0.01(2a)",
                       "0.05(2b)","0.01(2b)")
                       # name the columns for easy reference later

set.seed(20120704)  # fix a random number seed. 

for(i in 1:asim){  # start the 1000 repetitions
  print(i)
  x<-rnorm(n,mean=mu_pop,sd=sigma_pop)  # generate a random sample
  mx=mean(x) # sample mean
  sdx=sd(x)  # sample standard deviation

  # Approach (1)
  # Central Limit Theorem: (mx-mu_pop)/sqrt(sigma_pop^2/n) is asymptotically N(0,1)
  # -----------------------------
  # upper limit of the 95% CI for the population mean
    CIlimits[i,"r.95(1)"]=qnorm(0.975,mean=mx,sd=sqrt(sigma_pop^2/n))
  # lower limit of the 95% CI for the population mean
    CIlimits[i,"l.95(1)"]=qnorm(0.025,mean=mx,sd=sqrt(sigma_pop^2/n))
  # upper limit of the 99% CI for the population mean
    CIlimits[i,"r.99(1)"]=qnorm(0.995,mean=mx,sd=sqrt(sigma_pop^2/n))
  # lower limit of the 99% CI for the population mean
    CIlimits[i,"l.99(1)"]=qnorm(0.005,mean=mx,sd=sqrt(sigma_pop^2/n))
  # Indicator (0 or 1): cannot reject at the 0.05 significance level, i.e., mu_pop is inside the 95% CI
    Indicators[i,"0.05(1)"]=ifelse((mu_pop>=CIlimits[i,"l.95(1)"])&(mu_pop<=CIlimits[i,"r.95(1)"]),1,0)
  # Indicator (0 or 1): cannot reject at the 0.01 significance level, i.e., mu_pop is inside the 99% CI
    Indicators[i,"0.01(1)"]=ifelse((mu_pop>=CIlimits[i,"l.99(1)"])&(mu_pop<=CIlimits[i,"r.99(1)"]),1,0)

  # Approach (2a)
  # Central Limit Theorem: (mx-mu_pop)/sqrt(sdx^2/n) is asymptotically N(0,1)
  # -----------------------------
  # upper limit of the 95% CI for the population mean
    CIlimits[i,"r.95(2a)"]=qnorm(0.975,mean=mx,sd=sqrt(sdx^2/n))
  # lower limit of the 95% CI for the population mean
    CIlimits[i,"l.95(2a)"]=qnorm(0.025,mean=mx,sd=sqrt(sdx^2/n))
  # upper limit of the 99% CI for the population mean
    CIlimits[i,"r.99(2a)"]=qnorm(0.995,mean=mx,sd=sqrt(sdx^2/n))
  # lower limit of the 99% CI for the population mean
    CIlimits[i,"l.99(2a)"]=qnorm(0.005,mean=mx,sd=sqrt(sdx^2/n))
  # Indicator (0 or 1): cannot reject at the 0.05 significance level, i.e., mu_pop is inside the 95% CI
    Indicators[i,"0.05(2a)"]=ifelse((mu_pop>=CIlimits[i,"l.95(2a)"])&(mu_pop<=CIlimits[i,"r.95(2a)"]),1,0)
  # Indicator (0 or 1): cannot reject at the 0.01 significance level, i.e., mu_pop is inside the 99% CI
    Indicators[i,"0.01(2a)"]=ifelse((mu_pop>=CIlimits[i,"l.99(2a)"])&(mu_pop<=CIlimits[i,"r.99(2a)"]),1,0)
  # Approach (2b)
  # Use t.test, which bases the CI on the t distribution
  # (close to the normal approximation of (2a) when n is as large as 100)
  # -----------------------------
  # lower limit of the 95% CI for the population mean
    CIlimits[i,"l.95(2b)"]<-t.test(x,alternative="two.sided",mu=0,conf.level=0.95)$conf.int[1]
  # upper limit of the 95% CI for the population mean
    CIlimits[i,"r.95(2b)"]<-t.test(x,alternative="two.sided",mu=0,conf.level=0.95)$conf.int[2]
  # lower limit of the 99% CI for the population mean
    CIlimits[i,"l.99(2b)"]<-t.test(x,alternative="two.sided",mu=0,conf.level=0.99)$conf.int[1]
  # upper limit of the 99% CI for the population mean
    CIlimits[i,"r.99(2b)"]<-t.test(x,alternative="two.sided",mu=0,conf.level=0.99)$conf.int[2]
  # Indicator (0 or 1): cannot reject at the 0.05 significance level, i.e., mu_pop is inside the 95% CI
    Indicators[i,"0.05(2b)"]=ifelse((mu_pop>=CIlimits[i,"l.95(2b)"])&(mu_pop<=CIlimits[i,"r.95(2b)"]),1,0)
  # Indicator (0 or 1): cannot reject at the 0.01 significance level, i.e., mu_pop is inside the 99% CI
    Indicators[i,"0.01(2b)"]=ifelse((mu_pop>=CIlimits[i,"l.99(2b)"])&(mu_pop<=CIlimits[i,"r.99(2b)"]),1,0)

}
## We expect an x% CI to contain the population mean about x% of the time.
# Approach (1)
  # Percentage of 95% CIs that contain the population mean
    mean((CIlimits[,"l.95(1)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.95(1)"]))
  # Percentage of 99% CIs that contain the population mean
    mean((CIlimits[,"l.99(1)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.99(1)"]))
# Approach (2a)
  # Percentage of 95% CIs that contain the population mean
    mean((CIlimits[,"l.95(2a)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.95(2a)"]))
  # Percentage of 99% CIs that contain the population mean
    mean((CIlimits[,"l.99(2a)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.99(2a)"]))
# Approach (2b)
  # Percentage of 95% CIs that contain the population mean
    mean((CIlimits[,"l.95(2b)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.95(2b)"]))
  # Percentage of 99% CIs that contain the population mean
    mean((CIlimits[,"l.99(2b)"]<=mu_pop)&(mu_pop<=CIlimits[,"r.99(2b)"]))

# Type I errors
# For a test at the y% significance level, we expect to reject the true null about y% of the time.
# Test at the 5% level of significance
  mean(1-Indicators[,"0.05(1)"])
  mean(1-Indicators[,"0.05(2a)"])
  mean(1-Indicators[,"0.05(2b)"])

# Test at the 1% level of significance
  mean(1-Indicators[,"0.01(1)"])
  mean(1-Indicators[,"0.01(2a)"])
  mean(1-Indicators[,"0.01(2b)"])

Understanding QQ Plot

Posted in R, Statistics/Econometrics by kafuwong on June 22, 2012

A QQ (quantile-quantile) plot is a graphical tool for checking whether a dataset x has a distribution similar to that of another dataset y. 
 
Imagine x has 100 observations.  We can compute its percentiles.  What are they?  The percentiles are simply the sorted values of x, from the smallest to the largest.  Call the variable containing these percentiles sx (sorted x). 
 
(1) Now imagine y is the same as x, but pretend that y is different from x and compute its percentiles accordingly.  Call the variable containing the percentiles sy (sorted y). 
 
Let’s plot the percentiles of x against the percentiles of y, i.e., sx against sy.  Because sx=sy, all the points line up on a straight line.  That is, when y and x are identical (and thus have the same distribution), the plot looks like a straight line and the scales on the x- and y-axes are the same. 
 
(2) Now let y=x+2.  Repeat the above step and plot the percentiles of x against the percentiles of y.  Essentially, y has the same distribution as x, with its mean shifted by a constant.  All the points again line up on a straight line (a different straight line, though).  Note the slight change in the scales on the x- and y-axes. 
 
(3) Now let y=2*x.  Repeat the above step and plot the percentiles of x against the percentiles of y.  Essentially, y has the same distribution as x, with its mean and standard deviation multiplied by a constant.  All the points again line up on a straight line (a different straight line, though).  Note the change in the scales on the x- and y-axes. 
 
(4) Suppose x is a vector of standard normal variables and y is another vector of standard normal variables.  Essentially, they have the same “normal” distribution.  Let’s do the plot again.  We see points that lie broadly on a straight line.  The end points may stray from the line because the extreme quantiles correspond to rare events and are estimated less reliably. 
 
(5) Suppose x is the same as in (4) but z=2y+1, where y is the same as in (4).  Both x and z are normally distributed.  Plotting the quantiles of x against the quantiles of z, we again see points that lie broadly on a straight line, just as before. 
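Cases (2) and (3) can be verified numerically: whenever y is an increasing linear transform of x, sorting preserves the pairing, so every quantile-quantile point lies exactly on a straight line. A minimal sketch (in Python, with arbitrary numbers):

```python
# If y = 2*x + 1, then sorted(y) is the same transform of sorted(x),
# so every QQ point (sx[i], sy[i]) lies on the line v = 2*u + 1.
x = [3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0, 6.0]
y = [2 * v + 1 for v in x]

sx = sorted(x)   # "percentiles" of x
sy = sorted(y)   # "percentiles" of y

print(all(b == 2 * a + 1 for a, b in zip(sx, sy)))   # True
```

Note this relies on the slope being positive; a negative slope would reverse the ordering of y relative to x.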
 
In the above discussion, we were talking about percentiles.  What if one dataset has fewer than 100 observations?  We can still compute the percentiles by interpolation. 
 
Do we need to insist on percentiles?  We do not.  All we need are the sorted values of the two datasets, if they have the same number of observations.  If the two datasets have different numbers of observations, we can sort both and interpolate the larger one down to the same number of points as the smaller one. 
 
If you use R (http://www.r-project.org/), type qqplot (without parentheses) at the R prompt, and you will see the R commands used to generate the plot: 
function (x, y, plot.it = TRUE, xlab = deparse(substitute(x)),
    ylab = deparse(substitute(y)), ...)
{
    sx <- sort(x)
    sy <- sort(y)
    lenx <- length(sx)
    leny <- length(sy)
    if (leny < lenx)
        sx <- approx(1L:lenx, sx, n = leny)$y
    if (leny > lenx)
        sy <- approx(1L:leny, sy, n = lenx)$y
    if (plot.it)
        plot(sx, sy, xlab = xlab, ylab = ylab, ...)
    invisible(list(x = sx, y = sy))
}
By reading the code, you will learn how qqplot works and also pick up other R syntax (such as approx). 
 
How about qqnorm?  qqnorm plots the quantiles of a given dataset against the theoretical quantiles of a standard normal distribution (in fact, whether the reference is “standard” or not does not matter: a different normal distribution only changes the slope and intercept of the line).
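That last remark can be illustrated: plot a N(mu, sigma) sample against standard-normal quantiles, and the points fall near a line with slope sigma and intercept mu. A sketch (in Python; the plotting positions (i + 0.5)/n are one common convention, not necessarily the exact one qqnorm uses):

```python
from statistics import NormalDist
import random

random.seed(20120704)
mu, sigma = 5.0, 2.0
n = 1000

# sorted sample quantiles of a N(mu, sigma) sample
y = sorted(random.gauss(mu, sigma) for _ in range(n))
# theoretical standard-normal quantiles at plotting positions (i + 0.5)/n
x = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# crude slope estimate from the interquartile points; should be close to sigma
slope = (y[3 * n // 4] - y[n // 4]) / (x[3 * n // 4] - x[n // 4])
print(round(slope, 1))
```

Because a non-standard reference normal only rescales and shifts x, the plot stays a straight line; only its slope and intercept change.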