## A thought on how to evaluate whether education expansion helps improve mobility.

We often see reports comparing the median income of university graduates over time. The sad news is that the median income of university graduates is often found to be declining. One common conclusion is that the expansion of university education has not helped improve social mobility, and that university graduates are doing worse than before.

While median income is easy to compute, I do not think it is the right measure for addressing the question of social mobility, or of how today's university graduates fare compared with previous cohorts. A better measure is some form of median income that controls for the expansion of university education.

Imagine the following hypothetical situation. Suppose we have a stable population structure, and that ten years ago 20 percent of high school graduates could attend university. Out of a cohort of 25, we end up with 20 persons achieving high school level and 5 persons achieving university level. The median income of high school graduates was X1 and that of university graduates was Y1. Y1 is usually higher than X1, reflecting the difference in ability between the two groups and the added value of education.

For the sake of illustration, let's assume that the 5 university graduates have incomes of 12100, 12200, 12300, 12400, and 12500. Obviously the median income is 12300, i.e., Y1 = 12300. Let's further assume that the top 5 earners among high school graduates earn 8100, 8200, 8300, 8400, and 8500 respectively.

Today, due to the expansion of higher education, 40 percent of high school graduates can attend university. Following the example above, we end up with 15 persons achieving high school level and 10 persons achieving university level. Suppose the 10 university graduates have incomes of 11100, 11200, 11300, 11400, 11500, 12100, 12200, 12300, 12400, and 12500. Let's denote the median income of high school graduates as X2 and that of university graduates as Y2. Note that X2 is now based on a smaller group while Y2 is based on a larger one. We can easily see that X2 will be lower than X1, because the top five earners (the "more able"?) were removed from the original high school group and put into the university group. And Y2 will be lower than Y1 because the university group now includes the "less able" ones.

Thus, if we compare the change in median income by education group, we are bound to see a deterioration in income in BOTH groups. Some would conclude that education expansion is bad.

Wait a minute. Obviously, the five persons who reached university level because of the education expansion achieved higher incomes: (11100, 11200, 11300, 11400, 11500) versus (8100, 8200, 8300, 8400, 8500). A substantial improvement in social mobility (as measured by income) due to the education expansion, isn't it?

That is, when we evaluate whether education expansion is useful, we should focus on these 5 persons who previously had no chance to study at university but now do.

If we still insist on using measures similar to the median income of university graduates across time to conclude whether university graduates are doing worse or better, we need to make an adjustment. From the example above, we should probably compare the top 25th percentile income among today's university graduates to the median income 10 years ago!
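The whole example can be checked in a few lines of R. The bottom 15 high school incomes (5000 to 6400) are made up for illustration, since only the top 5 are specified above:

```r
# Hypothetical incomes from the example; the bottom 15 high school
# incomes are invented for illustration
hs_bottom <- seq(5000, 6400, by = 100)           # 15 persons (made up)
hs_top    <- c(8100, 8200, 8300, 8400, 8500)     # top 5 high school earners
hs_old    <- c(hs_bottom, hs_top)                # 20 high school graduates, ten years ago
uni_old   <- c(12100, 12200, 12300, 12400, 12500) # 5 university graduates, ten years ago

# Today: the top 5 high school earners move into the university group
hs_new  <- hs_bottom
uni_new <- c(11100, 11200, 11300, 11400, 11500, uni_old)

median(uni_old)  # Y1 = 12300
median(uni_new)  # Y2 = 11800: lower than Y1
median(hs_old)   # X1 = 5950
median(hs_new)   # X2 = 5700: lower than X1

# Yet each of the five marginal students gains 3000:
c(11100, 11200, 11300, 11400, 11500) - hs_top

# The adjusted comparison: 75th percentile (top 25%) among today's
# university graduates versus the old median
quantile(uni_new, 0.75, type = 1)  # 12300, the same as Y1
```

Both group medians fall, yet no individual is worse off and five are substantially better off; the adjusted percentile comparison recovers the old median exactly.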

## Generalization based on the sample size of one

Recently, lead in water has occupied the headlines of major newspapers in Hong Kong. Experts are consulted. Some experts appeared to misspeak carelessly. The most laughable statement was made by a medical doctor, a consultant of the "Hong Kong Poison Control Network". He remarked that lead poisoning can be caused by chewing on a pencil, among many other causes. The statement was so laughable that it was widely circulated on the internet.

Admit it: most of us do not know why this statement is laughable. OK. It is laughable because pencils (called "lead" pens in Chinese) do not contain any lead in the writing core nowadays. Although the writing core of some early pencils was made of lead, it has since been replaced by non-toxic graphite (http://pencils.com/pencil-history/). And, again, the statement came from a consultant of the "Hong Kong Poison Control Network".

In fact, at least supposedly so, nowadays even the paint coating of pencils should not contain lead, so most pencils are safe to chew (though chewing is not encouraged).

---

The internet is a powerful tool. A friend was so excited about the lead-poisoning-from-pencil-chewing story that he added a catchy title to his sharing of the news report about the doctor's statement: "If you cannot trust doctors, who can you trust?"

It is this catchy and exaggerated statement that caught my attention. I have to admit, I like it.

Nevertheless, the statement is clearly too much of a generalization based on a sample size of one. That specific doctor may not be trustworthy on this specific issue, but it does not mean he is not trustworthy on other issues. And it certainly does not mean that other doctors are not trustworthy.

Beware of similar generalizations from a small sample of observations.

## True or False: The increase in tourists has cost Hong Kong 3.5% of GDP.

First, there is an i-Cable story which uses the statistical analysis of a colleague. Second, there is a column written by a friend. Both are about the extra waiting time due to the influx of tourists.

In the i-Cable story, the reporter took MTR trains from Tai Wai Station to Wan Chai Station, showing the waiting time to get on a train at every interchange. Then the reporter interviewed a colleague of mine, who showed that the number of MTR passengers is highly positively related to the number of tourists. Therefore, an increase in the number of tourists would cause an increase in the number of MTR passengers, and consequently in the waiting time to get on trains and the time one has to spend commuting. My colleague's analysis was not about how the number of tourists impacts commuting time, but the audience will get that impression.

In her article, my friend estimated the loss of GDP due to waiting. She took the extra 10-minute commuting time found in a LegCo member's rush-hour experiment and deduced a loss of 3.5% of GDP: suppose all employees work 47 hours per week on average and each of them wastes an extra 10 minutes per trip commuting. The extra 20 minutes per day (10 minutes x 2) then adds up to about 3.5% of weekly working hours. Striking! The story certainly caught the eyes of a lot of people, including me. Unfortunately, striking stories are often wrong once you check their calculation or deduction.
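The arithmetic behind the 3.5% figure can be reproduced in one line of R, assuming a 5-day work week (the 47 hours and the 10 minutes per trip are the column's own assumptions):

```r
# Back-of-the-envelope reproduction of the column's calculation
work_minutes_per_week  <- 47 * 60      # 2820 minutes of work per week
extra_minutes_per_week <- 10 * 2 * 5   # 10 min x 2 trips x 5 days = 100 minutes
extra_minutes_per_week / work_minutes_per_week  # about 0.0355, i.e., roughly 3.5%
```

So the 3.5% only follows if every employee loses the full 20 minutes every working day, which is exactly what my two questions below challenge.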

I would like to raise two questions:

(1) Is the "extra" waiting time of 10 minutes an upper bound, a lower bound, or a median? I took the MTR today and did not have to wait for the next train to get in. Imposing the upper bound on all employees will yield an unreasonably exaggerated number. I think the true loss is actually much less than 3.5% of Hong Kong's GDP.

(2) Even if there is indeed an extra 10 minutes of waiting, how much of it is due to tourists, and how much to our population growth and the government policy of diverting traffic from buses to the MTR (for cleaner air, perhaps)? I do not think most tourists take the MTR during rush hour. Of course, there are exceptions.

I am waiting for some serious researchers to provide good answers to my questions. Yes, data could be a big problem.

If you are interested in seeing the i-Cable story, here is the link to the video:

http://cablenews.i-cable.com/webapps/program/newslancet/videoPlay.php?video_id=12178031

If you are interested in the column, here is the link to the article:

http://news.mingpao.com/20140304/fad1.htm

## All Americans are nice people and most of them live in either Wisconsin or Minnesota.

During my high school years, I thought all Americans were nice people and most of them lived in either Wisconsin or Minnesota. Why? Because all the Americans I met during my high school years were from either Wisconsin or Minnesota, and they were all very nice people.

Now I know this view is biased. The bias is due to the small sample I had (a sample of only nice Americans from Wisconsin and Minnesota). I understand that my view was wrong because I have since looked at demographic data and found that there are a lot more Americans from the other states than from Wisconsin and Minnesota, and because I have met not-so-nice Americans since my high school years.

Now I understand that my view is necessarily biased because no matter how hard I try, I can only obtain an incomplete picture of the world (limited time, limited brain power). There are many people out there who know what I do not know and therefore can teach me good lessons.

## Should I bring canned food on my next trip to Shanghai?

Friends have cautioned me about food safety on my next trip to Shanghai. Indeed, there have been several reports of unsafe food in Shanghai lately: dead pigs floated down the river, and rat meat was passed off as mutton. These stories were even widely reported by US media, e.g., NPR.

I know some of us do not like to talk about probability. However, if we are talking about getting contaminated food or fake meat, it makes sense to talk about conditional probability, conditional on where you obtain the food. We need to recognize that the probability of getting contaminated food from a 5-star restaurant is much lower than from a street vendor in Shanghai. I am not saying that eating at 5-star restaurants is 100 percent safe, but it is definitely safer.
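To make the point concrete, here is a tiny R sketch with purely made-up probabilities (none of these numbers come from any data source):

```r
# Hypothetical conditional probabilities of getting contaminated food
p_bad_given_5star  <- 0.001  # made up
p_bad_given_vendor <- 0.05   # made up
# The overall risk depends on where you actually eat (made-up mix of meals):
p_5star  <- 0.8
p_vendor <- 0.2
p_bad <- p_5star * p_bad_given_5star + p_vendor * p_bad_given_vendor
p_bad  # 0.0108: driven almost entirely by the street-vendor meals
```

The lesson is that the unconditional risk is not one number for "Shanghai"; it depends on your choices, which is exactly why the conditional probabilities are the useful ones.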

## Understanding the meaning of confidence intervals using R

```r
# To illustrate
# (1) the meaning of a confidence interval (i.e., CI)
# (2) the similarity of confidence intervals computed using different approaches
# (3) the relationship between a confidence interval and the type I error of a hypothesis test
# -------

mu_pop <- 2     # population mean
sigma_pop <- 2  # population standard deviation
n <- 100        # sample size
asim <- 1000    # number of repetitions

CIlimits <- matrix(999, asim, 12)  # create a matrix to store the resulting CI limits
colnames(CIlimits) <- c("l.95(1)", "r.95(1)", "l.99(1)", "r.99(1)",
                        "l.95(2a)", "r.95(2a)", "l.99(2a)", "r.99(2a)",
                        "l.95(2b)", "r.95(2b)", "l.99(2b)", "r.99(2b)")
# name the columns for easy reference later

Indicators <- matrix(999, asim, 6)  # create a matrix to store the resulting indicators
colnames(Indicators) <- c("0.05(1)", "0.01(1)",
                          "0.05(2a)", "0.01(2a)",
                          "0.05(2b)", "0.01(2b)")
# name the columns for easy reference later

set.seed(20120704)  # fix a random number seed

for (i in 1:asim) {  # start the 1000 repetitions
  print(i)
  x <- rnorm(n, mean = mu_pop, sd = sigma_pop)  # generate a random sample
  mx <- mean(x)   # sample mean
  sdx <- sd(x)    # sample standard deviation

  # Approach (1): use the known population standard deviation.
  # The Central Limit Theorem says (mx - mu_pop)/sqrt(sigma_pop^2/n) is asymptotically N(0,1).
  # -----------------------------
  # upper limit of the 95% CI for the population mean
  CIlimits[i, "r.95(1)"] <- qnorm(0.975, mean = mx, sd = sqrt(sigma_pop^2/n))
  # lower limit of the 95% CI for the population mean
  CIlimits[i, "l.95(1)"] <- qnorm(0.025, mean = mx, sd = sqrt(sigma_pop^2/n))
  # upper limit of the 99% CI for the population mean
  CIlimits[i, "r.99(1)"] <- qnorm(0.995, mean = mx, sd = sqrt(sigma_pop^2/n))
  # lower limit of the 99% CI for the population mean
  CIlimits[i, "l.99(1)"] <- qnorm(0.005, mean = mx, sd = sqrt(sigma_pop^2/n))
  # indicator (0 or 1): cannot reject at the 0.05 level, i.e., mu_pop is inside the CI
  Indicators[i, "0.05(1)"] <- ifelse((mu_pop >= CIlimits[i, "l.95(1)"]) & (mu_pop <= CIlimits[i, "r.95(1)"]), 1, 0)
  # indicator (0 or 1): cannot reject at the 0.01 level, i.e., mu_pop is inside the CI
  Indicators[i, "0.01(1)"] <- ifelse((mu_pop >= CIlimits[i, "l.99(1)"]) & (mu_pop <= CIlimits[i, "r.99(1)"]), 1, 0)

  # Approach (2a): replace the population standard deviation with the sample estimate.
  # The Central Limit Theorem says (mx - mu_pop)/sqrt(sdx^2/n) is asymptotically N(0,1).
  # -----------------------------
  # upper limit of the 95% CI for the population mean
  CIlimits[i, "r.95(2a)"] <- qnorm(0.975, mean = mx, sd = sqrt(sdx^2/n))
  # lower limit of the 95% CI for the population mean
  CIlimits[i, "l.95(2a)"] <- qnorm(0.025, mean = mx, sd = sqrt(sdx^2/n))
  # upper limit of the 99% CI for the population mean
  CIlimits[i, "r.99(2a)"] <- qnorm(0.995, mean = mx, sd = sqrt(sdx^2/n))
  # lower limit of the 99% CI for the population mean
  CIlimits[i, "l.99(2a)"] <- qnorm(0.005, mean = mx, sd = sqrt(sdx^2/n))
  # indicator (0 or 1): cannot reject at the 0.05 level, i.e., mu_pop is inside the CI
  Indicators[i, "0.05(2a)"] <- ifelse((mu_pop >= CIlimits[i, "l.95(2a)"]) & (mu_pop <= CIlimits[i, "r.95(2a)"]), 1, 0)
  # indicator (0 or 1): cannot reject at the 0.01 level, i.e., mu_pop is inside the CI
  Indicators[i, "0.01(2a)"] <- ifelse((mu_pop >= CIlimits[i, "l.99(2a)"]) & (mu_pop <= CIlimits[i, "r.99(2a)"]), 1, 0)

  # Approach (2b): let t.test() compute the CI.
  # t.test() uses the exact t distribution with n-1 degrees of freedom, which is
  # asymptotically equivalent to N(0,1). The mu argument does not affect the CI.
  # -----------------------------
  # lower limit of the 95% CI for the population mean
  CIlimits[i, "l.95(2b)"] <- t.test(x, alternative = "two.sided", mu = 0, conf.level = 0.95)$conf.int[1]
  # upper limit of the 95% CI for the population mean
  CIlimits[i, "r.95(2b)"] <- t.test(x, alternative = "two.sided", mu = 0, conf.level = 0.95)$conf.int[2]
  # lower limit of the 99% CI for the population mean
  CIlimits[i, "l.99(2b)"] <- t.test(x, alternative = "two.sided", mu = 0, conf.level = 0.99)$conf.int[1]
  # upper limit of the 99% CI for the population mean
  CIlimits[i, "r.99(2b)"] <- t.test(x, alternative = "two.sided", mu = 0, conf.level = 0.99)$conf.int[2]
  # indicator (0 or 1): cannot reject at the 0.05 level, i.e., mu_pop is inside the CI
  Indicators[i, "0.05(2b)"] <- ifelse((mu_pop >= CIlimits[i, "l.95(2b)"]) & (mu_pop <= CIlimits[i, "r.95(2b)"]), 1, 0)
  # indicator (0 or 1): cannot reject at the 0.01 level, i.e., mu_pop is inside the CI
  Indicators[i, "0.01(2b)"] <- ifelse((mu_pop >= CIlimits[i, "l.99(2b)"]) & (mu_pop <= CIlimits[i, "r.99(2b)"]), 1, 0)
}

## We expect an x% CI to contain the population mean x% of the time.

# Approach (1)
# percentage of 95% CIs that contain the population mean
mean((CIlimits[, "l.95(1)"] <= mu_pop) & (mu_pop <= CIlimits[, "r.95(1)"]))
# percentage of 99% CIs that contain the population mean
mean((CIlimits[, "l.99(1)"] <= mu_pop) & (mu_pop <= CIlimits[, "r.99(1)"]))

# Approach (2a)
# percentage of 95% CIs that contain the population mean
mean((CIlimits[, "l.95(2a)"] <= mu_pop) & (mu_pop <= CIlimits[, "r.95(2a)"]))
# percentage of 99% CIs that contain the population mean
mean((CIlimits[, "l.99(2a)"] <= mu_pop) & (mu_pop <= CIlimits[, "r.99(2a)"]))

# Approach (2b)
# percentage of 95% CIs that contain the population mean
mean((CIlimits[, "l.95(2b)"] <= mu_pop) & (mu_pop <= CIlimits[, "r.95(2b)"]))
# percentage of 99% CIs that contain the population mean
mean((CIlimits[, "l.99(2b)"] <= mu_pop) & (mu_pop <= CIlimits[, "r.99(2b)"]))

## Type I errors: we expect a test at the y% level of significance
## to reject the true null y% of the time.

# test at the 5% level of significance
mean(1 - Indicators[, "0.05(1)"])
mean(1 - Indicators[, "0.05(2a)"])
mean(1 - Indicators[, "0.05(2b)"])

# test at the 1% level of significance
mean(1 - Indicators[, "0.01(1)"])
mean(1 - Indicators[, "0.01(2a)"])
mean(1 - Indicators[, "0.01(2b)"])
```

## Understanding QQ Plot

In **R** (http://www.r-project.org/), at the R prompt, type `qqplot` (without quotation marks) and you will see the R code used to generate the plot:

```r
function (x, y, plot.it = TRUE, xlab = deparse(substitute(x)),
    ylab = deparse(substitute(y)), ...)
{
    sx <- sort(x)
    sy <- sort(y)
    lenx <- length(sx)
    leny <- length(sy)
    if (leny < lenx)
        sx <- approx(1L:lenx, sx, n = leny)$y
    if (leny > lenx)
        sy <- approx(1L:leny, sy, n = lenx)$y
    if (plot.it)
        plot(sx, sy, xlab = xlab, ylab = ylab, ...)
    invisible(list(x = sx, y = sy))
}
```
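To see what the function actually does, here is a minimal sketch with two simulated samples of different sizes (the samples are made up for illustration):

```r
set.seed(1)
x <- rnorm(100)           # standard normal sample
y <- rnorm(80, mean = 2)  # shifted normal sample of a different size
# qqplot() sorts both samples and interpolates the longer one down to the
# length of the shorter one before plotting sorted values against each other
qq <- qqplot(x, y, plot.it = FALSE)
length(qq$x)  # 80, the length of the shorter sample
# With plot.it = TRUE, the points would fall roughly on a straight line
# with slope 1 and intercept 2, since y is just x shifted by 2
```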
