Analysis of Discrete Data #2: One-Way Tables and Goodness-of-Fit Test

Posted on September 12, 2014



Following Analysis of Discrete Data #1 – Overview

One-Way Tables and Goodness of Fit

Key concepts:

  • One-way Frequency Table
  • Sampling Schemes (Models)
  • Pearson goodness-of-fit statistic
  • Deviance statistic
  • Pearson residuals

Objectives:

  • Learn how to compute the goodness-of-fit statistics
  • Understand how well an observed table of counts corresponds to the multinomial model Mult(n, π) for some vector π.
  • Introduce the FREQ procedure in SAS and the prop.test and chisq.test functions in R

Useful links:

Links to background information on the topics in this lesson and some programming help and examples.

Readings:

  • Ch.1 Agresti (2007)
  • If you are using other editions: Ch.1 Agresti (2013, 2002, 1996)

2.1 – Introduction and Examples

Lesson 1 was a brief introduction to categorical data and discrete distributions. We looked at different types of categorical data as well as counts and proportions. In this lesson we will see the simplest way to represent counts and proportions and then do the simplest type of analysis. Let's begin with the definition of a one-way frequency table.

A frequency table arises when sample units are classified into mutually exclusive categories; the number of units falling into each category is recorded. “One way” means that units are classified according to a single categorical variable, and that is the table’s dimension. The cells of the table contain frequency counts and can be arranged in a single row or column.  The size of the table refers to the number of cells.

Let us look at some examples and see how they relate to the simplest concept of contingency table.

Example 1 – High-Risk Drinking at Penn State

In 2006, Penn State's Student Affairs Office of Research and Assessment (SAORA) conducted a survey on alcohol use among undergraduate students at the University Park campus (Pulse Survey: Student Drinking, March 2006). A sample of 1315 undergraduate students was classified as to whether or not they are high-risk drinkers.

High-Risk    Count
Yes            630
No             685
Total         1315

Since the sample size, n = 1315, is fixed a priori and there are only two outcomes (you either are a high-risk drinker or you are not), this is an example of binomial sampling.

The questions of interest are

  1. What is the true population proportion of students who are high-risk drinkers at Penn State?
  2. Is this proportion significantly different from 50%? In other words, is a randomly chosen student equally likely to be a high risk or non-high-risk drinker?

Example 2 – World Cup Soccer

How many teams scored a certain number of goals during the first round of the 2002 FIFA World Cup soccer championship ( http://www.worldcup-soccer.info/ )? For example, 23 teams scored no goals during their first-round matches. The table below records the frequency of the number of goals scored.

Number of goals    Count
0                     23
1                     37
2                     20
3                     11
4                      2
5                      1
6                      0
7                      0
8                      1
Total                 95

This is a one-way table of size 9. Note here that the total count (i.e., sample size n = 95 ) was not fixed a priori and this could be considered Poisson sampling since we fixed the time period during which we recorded the outcomes of interest. We may ask questions such as

  1. What is the mean number of goals scored?; or
  2. What is the probability that a team will score four goals?

Example 3 – Dice Rolls

Start a timer and roll a die for two minutes. Count the number of times each face comes up on top during these two minutes. This is an example of Poisson sampling. You may get results similar to those in the following table. Note that we could not fix the sample size n = 30 a priori!

Face     Count
1            3
2            7
3            5
4           10
5            2
6            3
Total       30

You might also obtain the table above, or a similar one, by deciding instead to roll a die n = 30 times and recording which face comes up on each of those 30 rolls. This is an example of multinomial sampling. Now we may ask the question, "Is this die fair?", and use our knowledge of the sampling scheme to answer it.

Example 4 – Eye Colors

A sample of n = 96 persons is obtained, and the eye color of each person is recorded.

Eye color    Count
Brown           46
Blue            22
Green           26
Other            2
Total           96

Notice that brown, blue, green, and other have no intrinsic ordering. The response variable, eye color, is therefore an unordered categorical or nominal variable.

Example 5  – Attitudes Towards War – 1

Hypothetical attitudes of n = 116 people towards war. They were asked to state their opinion on a 5 point scale regarding the statement: “This is a necessary war”.

Attitude             Count
Strongly disagree       35
Disagree                27
Agree                   23
Strongly agree          31
Total                  116

The response categories in this example are clearly ordered, but no objectively defined numerical scores can be attached to the categories. The response variable, attitude, is therefore said to be an ordered categorical or ordinal variable.

Example 6 – Attitudes Towards a War – 2

Hypothetical attitudes of n = 130 people towards the war, but with additional categories for “not sure” and “refused to answer.”

Attitude             Count
Strongly disagree       35
Disagree                27
Agree                   23
Strongly agree          31
Not sure                 6
Refusal                  8
Total                  130

The first four categories are clearly ordered, but the placement of “not sure” and “refusal” in the ordering is questionable. We would have to say that this response is partially ordered.

Example 7 – Number of Children in Families

Number of children in n = 100 randomly selected families.

Number of children    Count
0                        19
1                        26
2                        29
3                        13
4-5                      11
6+                        2
Total                   100

The original data, the raw numbers of children, have been coarsened into six categories (0, 1, 2, 3, 4–5, 6+). These categories are clearly ordered but, unlike the previous example, they have objectively defined numeric values attached to them. We can say that this table represents coarsened numeric data, or an interval variable.

Example 8 – Household Incomes

Total gross income of n = 100 households.

Income               Count
below $10,000           11
$10,000–$24,999         23
$25,000–$39,999         30
$40,000–$59,999         24
$60,000 and above       12
Total                  100

The original data (raw incomes) were essentially continuous. Any type of data, continuous or discrete, can be grouped or coarsened into categories.

Grouping data will typically result in some loss of information. How much information is lost depends on (a) the number of categories and (b) the question being addressed. In this example, grouping has somewhat diminished our ability to estimate the mean or median household income. Our ability to estimate the proportion of households with incomes below $10,000 has not been affected, but estimating the proportion of households with incomes above $75,000 is now virtually impossible. (Why?)

Remember the measurement hierarchy? A statistical method developed for the analysis of ordinal categorical data will typically not be appropriate for the analysis of, say, nominal categorical data. This is why we distinguish between nominal, ordinal, and partially ordered variables. Let's get a little more technical and introduce the notation for one-way frequency tables before we discuss sampling schemes.

2.2 – Notation & Structure

Suppose data are collected on a discrete variable, X, with k categories. A one-way frequency table with k cells will be denoted by the vector:

X=(X_1,X_2,\ldots,X_k)

where Xj is the count or frequency in cell j and xj or nj is the observed count in this cell j. In some textbooks the author will use xj and other authors will use nj; the latter is more consistent with your textbook (Agresti’s) notation.

Note that X is a summary of the original raw dataset consisting of:

n=\sum\limits_{j=1}^k x_j=\sum\limits_{j=1}^k n_j

observations.

For instance, the Eye Colors example had four types of eye colors, thus k = 4.

X1 is the random count of people in the sample with "brown" eyes; the observed number of people with brown eyes is x1 = n1 = 46. The total sample size is n = n1 + n2 + n3 + n4 = 46 + 22 + 26 + 2 = 96.

Joint Distribution

If the observed units are randomly sampled from a large population, then x = (n1, n2 , n3 , … , nk) will have a joint discrete probability distribution such as Poisson, Binomial, or Multinomial; a technical review of these common discrete distributions is given in Lesson 1.

Let πj be the population probability (or proportion) that a randomly selected individual, (i.e., an observational unit), falls into the jth cell of the table. Then the joint probability distribution will be denoted as

\pi=(\pi_1,\pi_2,\pi_3,\ldots,\pi_k)=\{\pi_j\}

For this to be a proper joint probability distribution the probabilities must sum to 1:

\sum\limits_{j=1}^k \pi_j=1

In the social sciences we typically think of πj as a proportion of individuals in the population with a particular characteristic. For example, π1 = proportion of people with brown eyes in the population of interest. In statistical inference, our goal is to estimate unknown parameters πj, and/or their joint distribution.

For observed data we typically use p instead of π, to denote observed joint distribution { pj } where pj = nj/n is the observed proportion of people with a certain characteristic, e.g. the proportion of people with brown eyes in our example is p1 = 46 / 96 = 0.479. p1 is an example of a sample statistic, and a point estimate of unknown population parameter, π1.

For the Eye Color example:

X = Color    Count      Population proportion π    Observed proportion p
Brown        n1 = 46    π1                         p1 = 0.479
Blue         n2 = 22    π2                         p2 = 0.229
Green        n3 = 26    π3                         p3 = 0.271
Other        n4 = 2     π4                         p4 = 0.021
Total        n = 96     1                          1
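
As a quick check, these observed proportions can be reproduced in R; the vector of counts below is simply the eye color table retyped, so nothing here is new data:

    # Eye color counts from the table above
    counts <- c(Brown = 46, Blue = 22, Green = 26, Other = 2)
    n <- sum(counts)      # total sample size, n = 96
    p <- counts / n       # observed proportions p_j = n_j / n
    round(p, 3)
    #  Brown   Blue  Green  Other
    #  0.479  0.229  0.271  0.021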

What are some ways of generating these one-way tables of counts?

Discuss      Why do you think we care about the random mechanism that generated the data?

Any data analysis requires some assumptions about the data-generation process. For continuous data and linear regression, for example, we assume that the response variable has been randomly generated from a normal distribution. For categorical data we will often assume that the data have been generated from a Poisson, binomial, or multinomial distribution. The statistical analysis depends on the data-generation mechanism, although in some situations, depending on the objective, we may ignore that mechanism and simplify the analysis.

The following sampling methods correspond to the distributions considered:

  • Unrestricted sampling (corresponds to Poisson distribution)
  • Sampling with fixed total sample size (corresponds to Binomial or Multinomial distributions)

2.3.1 – Poisson Sampling

Poisson sampling assumes that the random mechanism that generated the data can be described by a Poisson distribution. It is useful for modeling counts of events that occur randomly over a fixed period of time or in a fixed space. It is often appropriate when the probability of any particular event is very small while the number of trials is very large. (This is very much like a binomial distribution where the success probability π of a trial is very small but the number of trials n is very large; this is known as the limiting condition.)

For example, consider the World Cup soccer example, where we collect data on the number of goals scored by teams during the first-round matches of the 2002 World Cup. Another example is rolling a die during a fixed two-minute time period. Similarly, count the number of emails you received between 4pm and 5pm on a Friday, or the number of students accessing the STAT 504 course website on a Saturday, etc.

Let X be the number of goals scored by a team in a first-round match of the World Cup. X ∼ Poisson(λ):

P(X=x)=\dfrac{\lambda^x e^{-\lambda}}{x!}\qquad x=0,1,2,\ldots

where λ is the rate parameter, that is, the mean of the distribution, e.g., the average number of goals scored during the first-round matches. Once you know λ, you know everything there is to know about this distribution. x! stands for x factorial, i.e., x! = 1·2·3·…·x. P(X = x), or P(x), is the probability that a randomly chosen team scored x goals in a game, e.g.:

P(X=0)=\dfrac{\lambda^0 e^{-\lambda}}{0!}=\dfrac{1\cdot e^{-\lambda} }{1}=e^{-\lambda}

How the average rate λ = 1.38 is obtained is shown below. Then

P(X=0)=e^{-1.38}=\dfrac{1}{e^{1.38}}=0.252

is the probability that a randomly chosen team will score 0 goals in a first-round match of the World Cup. For the remaining probabilities, see the table at the end of this page.

The Poisson Model (distribution) Assumptions

  1. Independence: Events must be independent (e.g. the number of goals scored by a team should not make the number of goals scored by another team more or less likely.)
  2. Homogeneity: The mean number of goals scored is assumed to be the same for all teams.
  3. Fixed time period (or space): the time period (or region of space) over which events are counted must be fixed.

In this sampling scheme, the overall n is not fixed. We also assume that X1, X2, …, Xk are independent Poisson variables, either each with the same rate λ or each with a different rate λ1, λ2, …, λk such that λ1 + λ2 + … + λk = λ.

Recall that the mean and variance of the Poisson distribution are equal, i.e., E(X) = Var(X) = λ. In practice, however, the observed variance is often larger than the theoretical variance, and in the Poisson case larger than the mean. This is known as overdispersion, an important concept that occurs with discrete data. We assumed that each team has the same probability of scoring goals in each match of the first round, but it is more realistic to assume that these probabilities vary with the teams' skills, the day the matches were played (because of the weather), maybe even the order of the matches, etc. Then we may observe more variation in the scoring than the Poisson model predicts. Analyses assuming binomial, Poisson, or multinomial distributions are sometimes invalid because of overdispersion. We will see more on this later when we study logistic regression and Poisson regression models.

Let us see how we can do some basic calculations with the World Cup Soccer example under the Poisson model.

QUESTION: What is the most likely mean number of goals scored; that is, what is the most likely value of the unknown parameter λ given the data x?

We can answer this question by relying on the basic principle of statistical inference, e.g., point estimation, confidence intervals and/or hypothesis testing.

Recall from Lesson 1 on Likelihood and MLE: The most common point estimate is the “maximum likelihood estimate” (MLE) which has nice statistical properties, and it is the most likely value of the parameter given the data; it is the value that maximizes the likelihood function.

The MLE of λ for the Poisson distribution is the sample mean, which estimates the expectation of the distribution; from the computation below, for our example this is approximately:

\begin{align} \bar{x} &= \dfrac{1}{95}\sum\limits_{i=1}^{95} x_i\\ &= \dfrac{1}{95} (0\times 23+1\times 37+2\times 20+3\times 11+4\times 2+5\times 1+6\times 0+ 7\times 0+ 8 \times 1)\\ &= \dfrac{131}{95}\\ &= 1.38\\ \end{align}


Thus, \hat{\lambda}=1.38 goals per first-round match.

An asymptotic or approximate 95% confidence interval is then:

[1.38-1.96\sqrt{1.38/95},1.38+1.96\sqrt{1.38/95}]=[1.14,1.62]

and we are 95% confident that the mean number of goals scored by a team during the first round match-ups will be somewhere between 1.14 and 1.62. Now that we have some estimate of the mean number of goals we can calculate the expected probabilities of a randomly chosen team scoring 0, 1, 2, 3, etc… number of goals, as well as the expected frequencies (or counts). For example, under this Poisson model with λ^=1.38, the expected probability of scoring 2 goals is \hat{\pi}_2=p_2=P(X=2)=\frac{{1.38}^{2}e^{-1.38}}{2!}=0.239 and the expected frequency is np_2=95*0.239=22.75.

Example – World Cup Soccer

Here is a link to the World Cup Soccer data (text file).


You can easily do these calculations by hand, in Excel, or in any other software package you are using. Here they are in R; the course provides the code in the file soccer.R, which contains line comments explaining each step.
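
If you cannot access soccer.R, the following minimal sketch (written for this note, not taken from the course file) reproduces the main Poisson calculations described above:

    # Observed goal counts from the 2002 World Cup first-round table
    goals  <- 0:8
    counts <- c(23, 37, 20, 11, 2, 1, 0, 0, 1)
    n      <- sum(counts)                      # 95 first-round team results

    # MLE of lambda: the sample mean number of goals
    lambda_hat <- sum(goals * counts) / n      # 131/95 = 1.38

    # Approximate (Wald) 95% confidence interval for lambda
    se <- sqrt(lambda_hat / n)
    lambda_hat + c(-1, 1) * 1.96 * se          # about (1.14, 1.62)

    # Fitted probabilities and expected frequencies under Poisson(lambda_hat)
    probs    <- dpois(goals, lambda_hat)
    expected <- n * probs
    round(cbind(goals, observed = counts, probs, expected), 3)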

2.3.2 – Binomial Sampling

When data are collected on a pre-determined number of units and are then classified according to two levels of a categorical variable, binomial sampling emerges. Consider the High-Risk Drinking example, where we have high-risk drinkers versus non-high-risk drinkers. In this study there was a fixed number of trials (e.g., a fixed number of students surveyed, n = 1315), and the researcher counted the number of "successes" and "failures" that occurred. We can let X be the number of "successes", that is, the number of students who are high-risk drinkers. We can use the binomial probability distribution (i.e., the binomial model) to describe this variable.

Binomial distributions are characterized by two parameters: n, which is fixed (the number of trials, or the total sample size if we think in terms of sampling), and π, which usually denotes the probability of "success". In our example this is the probability that someone is a high-risk drinker in the population of Penn State students. Please note that some textbooks use π to denote the population parameter and p to denote the sample estimate, whereas others may use p for the population parameter as well. We may do both; don't be confused by this, just make sure to read the specification carefully. Once you know n and π, the probability of success, you know the mean and variance of the binomial distribution, and you know everything about that binomial distribution.

Below are the probability density function, mean and variance of the binomial variable.

f(x)=\dfrac{n!}{x!(n-x)!}\pi^x(1-\pi)^{n-x}\qquad \text{for }x=0,1,2,\ldots,n

Mean E (X) = nπ
Variance Var (X) = nπ (1 – π)

Binomial Model (distribution) Assumptions

  • Fixed n: the total number of trials/events, (or total sample size) is fixed.
  • Each event has two possible outcomes, referred to as “successes” or “failures”, (e.g., each student can be either a heavy drinker or a non-heavy drinker; heavy drinker being a success here).
  • Independent and Identical Events/Trials:
    • Identical trials means that probability of success is the same for each trial.
    • Independent means that the outcome of one trial does not affect the outcome of the other, (e.g. one student being a heavy drinker or not does not affect the status of the next student, and each student has the same probability, π, of being a heavy drinker.)

Example – Heavy Drinking Students

QUESTION: What is the probability that no students are heavy drinkers, i.e., P(X = 0)?

Let’s assume that π = 0.5.

P(X=0)=\binom{1315}{0}(0.5)^0(1-0.5)^{1315}=(0.5)^{1315}\approx 0
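
A quick numerical check in R, assuming π = 0.5 and n = 1315 as above (dbinom() evaluates binomial probabilities; the answer is so small that it underflows to zero on the ordinary scale, so the log scale is more informative):

    # P(X = 0): probability that none of the 1315 sampled students is a heavy drinker
    dbinom(0, size = 1315, prob = 0.5)               # underflows to 0 (about 10^-396)
    dbinom(0, size = 1315, prob = 0.5, log = TRUE)   # log-probability, about -911.5

    # The Discuss question, P(X = 1000), can be handled the same way
    dbinom(1000, size = 1315, prob = 0.5, log = TRUE)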

Discuss    What’s the probability that there are X = 1000 heavy drinkers in this example?

QUESTION: What is the true population proportion of students who are high-risk drinkers at Penn State?

This is a statistical inference question that can be answered with a point estimate, confidence intervals, and hypothesis tests about proportions. The likelihood function for the binomial, L(π ; x), is a measure of how close the population proportion π is to the data x; the maximum likelihood estimate (MLE) is the most likely value of π given the observed data, and for the binomial distribution this is the sample mean,

\hat{\pi}=\dfrac{\sum x_i}{n}=\dfrac{x}{n}

and for the expected counts,

\hat{\mu}=\hat{\pi}\cdot n

Thus for our example, assuming the Binomial distribution, our “best” guess estimate of the true proportion of students who are high-risk drinkers is

p=\hat{\pi}=\dfrac{630}{1315}=0.48

Here are also the likelihood and loglikelihood graphs for our example. We can see that the peak of the likelihood is at the proportion value equal to 0.48. The loglikelihood looks quadratic which means that the large-sample normal theory should work fine, and we can use the approximate 95% confidence intervals.


The MLE is used for statistical inference, such as testing for a specific value of π, or giving a range of likely values of π, via the 95% confidence interval. A key result here comes from understanding the properties of the sampling distribution of the sample proportion p.  

The Rule for Sample Proportions: If numerous samples of size n are taken, with n large enough, the frequency curve of the sample proportions from the various samples will be approximately normal, with mean E(p) = π and variance Var(p) = π(1 − π)/n. "Large enough" usually means that the numbers of successes and failures are not small, i.e., nπ ≥ 5 and n(1 − π) ≥ 5. The larger the sample size n, the better the normal approximation to the sampling distribution of p. Note that the sampling distribution of p is really a discrete rather than a continuous distribution, but we rely on the approximation above for statistical inference unless we deal with small samples; for the latter case see Agresti (2007), Sec. 1.4.3-1.4.5.

QUESTION: Is the population proportion of heavy-drinkers significantly different from 50%?

Large-sample hypothesis test about π

H0: π = π0 vs. HA: π ≠ π0

The z test statistic:

z=\dfrac{p-\pi_0}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}

where p is the sample proportion, in our case 0.48, and π0 is the value we are testing for, 0.5.

H0: π = 0.5 vs. HA: π ≠ 0.5

The test statistic is z = −1.52 with a two-sided p-value of 0.13. Thus we do not have strong evidence that the proportion of high-risk drinkers is different from 50%; i.e., we do not reject the null hypothesis.
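
A minimal sketch of this test in R (the exact values depend on whether the rounded proportion 0.48 or 630/1315 is used, and on whether a continuity correction is applied):

    # Large-sample z test of H0: pi = 0.5
    x <- 630; n <- 1315; pi0 <- 0.5
    p <- x / n
    z <- (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
    z                       # about -1.52
    2 * pnorm(-abs(z))      # two-sided p-value, about 0.13

    # prop.test() carries out the equivalent chi-squared version of this test
    prop.test(x, n, p = 0.5, correct = FALSE)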

Confidence interval for π

The usual (1 − α) × 100% CI is:

p\pm z_{\alpha/2}\sqrt{\dfrac{p(1-p)}{n}}

This interval is known as the Wald confidence interval.

For our example, the 95% CI is 0.48 ± 1.96 × 0.014 = (0.453, 0.507). We can be 95% confident that the true population proportion of students who are high-risk drinkers is between 0.453 and 0.507.

However, when the proportion is extremely small or large, π < 0.20 or π > 0.80, this CI does not work very well. It is better to consider the likelihood-ratio-based CI, as discussed in Lesson 1. This interval is computationally more involved but conceptually simple: we evaluate the likelihood function and keep all values of π0 for which the null hypothesis would not be rejected. In our case the 95% likelihood-ratio-based CI is (0.453, 0.506).

To do the above calculations in R and SAS, see the drinking.R and drinking.sas files below.

Here is the R program drinking.R.

If you encounter an error with the sink() function, please see the course's support materials for R.
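
If you cannot run drinking.R, here is a minimal sketch (written for this page, not the course file) of the Wald and likelihood-ratio-based intervals just described:

    x <- 630; n <- 1315
    p <- x / n

    # Wald 95% CI: p +/- 1.96 * sqrt(p(1-p)/n)
    p + c(-1, 1) * 1.96 * sqrt(p * (1 - p) / n)    # roughly (0.45, 0.51)

    # Likelihood-ratio-based 95% CI: all pi0 not rejected by the LR test,
    # i.e. values with 2 * (max log-likelihood - log-likelihood at pi0) <= 3.84
    pi0  <- seq(0.40, 0.56, by = 0.0001)
    logl <- dbinom(x, n, pi0, log = TRUE)
    keep <- 2 * (max(logl) - logl) <= qchisq(0.95, df = 1)
    range(pi0[keep])                               # close to (0.453, 0.506)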

The third alternative, also a likelihood-based confidence interval, is known as the score confidence interval; in essence it looks for ALL values of π0 that yield the desired test statistic, e.g., for a 95% CI, zα/2 = ±1.96. To simplify, we need to solve the following equation:

\dfrac{|p-\pi_0|}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}=z_{\alpha/2}

which is the same as:

\dfrac{p-\pi_0}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}=z_{\alpha/2} \text{ and } \dfrac{p-\pi_0}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}}=-z_{\alpha/2}

To do this, we can solve the following quadratic equation:

\left(1+\dfrac{z^2_{\alpha/2}}{n}\right)\pi^2_0+\left(-2p-\dfrac{z^2_{\alpha/2}}{n}\right)\pi_0+p^2=0

a\pi^2_0+b\pi_0+c=0

\pi_0=\dfrac{-b\pm \sqrt{b^2-4ac}}{2a}

In this example, the 95% score CI is the same as the Wald CI, (0.453, 0.507). Notice that the only difference between the Wald CI and the score CI is the standard error: in the former it is calculated using the sample proportion p, and in the latter using the null value π0.

Traditionally, most software packages will give you the Wald CI, but nowadays we are starting to see the score and the likelihood-ratio ones too.
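
The quadratic above can be solved directly in R; a small sketch, assuming the 95% level so that zα/2 = 1.96:

    x <- 630; n <- 1315
    p <- x / n
    z <- 1.96

    # Coefficients of the quadratic in pi0 given above
    a  <- 1 + z^2 / n
    b  <- -(2 * p + z^2 / n)
    cc <- p^2
    (-b + c(-1, 1) * sqrt(b^2 - 4 * a * cc)) / (2 * a)   # score CI, roughly (0.45, 0.51)

    # Without continuity correction, prop.test() reports this same (Wilson) score interval
    prop.test(x, n, correct = FALSE)$conf.int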

Margin of error (MOE)

The margin of error (MOE) is defined as half the width of the 95% CI for a single proportion, so that the 95% CI is p ± MOE; see the CNN example for a review. The confidence interval is widest when π = 0.5. This knowledge is useful in determining the sample size needed for given conditions.

Sample Size

Often we are interested in knowing what sample size is needed to achieve a specific margin of error for a population proportion. For example, how large a sample would we need so that a 99% confidence interval has margin of error m? Solve the following for n:

m=2.575\sqrt{\pi(1-\pi)/n}

n=(2.575)^2 \pi(1-\pi)/m^2

Since π is unknown, take π = 0.5 to get the widest possible MOE. For example, if the required MOE is 3%

n=(2.575)^2(0.5)(0.5)/(0.03)^2=1841.84

which is rounded up to 1842.
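
A one-line check of this arithmetic in R (assumptions: 99% level so z = 2.575, π = 0.5, and m = 0.03):

    # Sample size needed for margin of error m, rounded up
    sample_size <- function(m, z = 2.575, p = 0.5) ceiling(z^2 * p * (1 - p) / m^2)
    sample_size(0.03)    # 1842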

For additional simple binomial distribution calculations, see Lesson 0 and the SimpleBinomialCal.R code.

2.3.3 – Multinomial Sampling

Multinomial sampling may be considered a generalization of binomial sampling. Data are collected on a pre-determined number of individuals, that is, units, and classified according to the levels of a categorical variable of interest (e.g., see Examples 4 through 8 in the Introduction of this lesson).

X ∼ Mult (n, π), with the probability density function

f(x)=\dfrac{n!}{x_1!x_2!\cdots x_k!} \pi_1^{x_1}\pi_2^{x_2}\cdots \pi_k^{x_k}

where n is the fixed and known sample size, and π = (π1, π2, …, πk) is the vector of population proportions.

Each Xj represents the random count in cell j, xj the observed sample count, and each πj the population proportion in cell j.

The multinomial distribution has a mean vector with

E(Xj) = nπj

The variance of Xj is

Var(Xj) = nπj(1 − πj)

and the covariance between Xj and Xk is

Cov(Xj, Xk) = −nπjπk

Each cell thus has a binomial marginal distribution. Binomial is a special case of multinomial for k = 2.

f(x)=\dfrac{n!}{x_1!x_2!}\pi_1^{x_1}\pi_2^{x_2}

let,

\begin{array}{lcr} x &= & x_1\\ x_2 &=& n-x\\ \pi &=& \pi_1\\ \pi_2 &=& 1-\pi \end{array}

then,

f(x)=\binom{n}{x}\pi^x(1-\pi)^{n-x}=\dfrac{n!}{x!(n-x)!}\pi^x(1-\pi)^{n-x},\qquad x=0,1,2,\ldots,n

To be consistent with Agresti(2007), pg.4 or Agresti(2013), pg.6, where the random variable is denoted as Y, and probability density function as P(y), we can write the above as,

p(y)=\dfrac{n!}{y!(n-y)!}\pi^y(1-\pi)^{n-y},\qquad y=0,1,2,\ldots,n
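
A quick numerical sanity check of this reduction, reusing the counts from the drinking example (dmultinom() and dbinom() are base R functions):

    # With k = 2 cells, the multinomial and binomial probabilities coincide
    dmultinom(c(630, 685), prob = c(0.5, 0.5))    # two-cell multinomial
    dbinom(630, size = 1315, prob = 0.5)          # binomial, same value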

If n is random (not fixed), as was the case with the total number of goals scored in the World Cup soccer example, but we have no interest in making inferences about the sample size n, only about the underlying cell probabilities, we can apply the multinomial model instead of the Poisson model and the inferences will be the same. But we do have to worry about other kinds of model failure, e.g., are the assumptions of the multinomial model satisfied?

Multinomial Model (distribution) Assumptions:

  1. the n trials are independent, and
  2. the parameter vector π remains constant from trial to trial.

The most common violation of these assumptions occurs when clustering is present in the data. Clustering means that some of the trials occur in groups or clusters, and that trials within a cluster tend to have outcomes that are more similar than trials from different clusters. Clustering can be thought of as a violation of either assumption (1) or (2).

Example – Eye Color

In this example, eye color was recorded for n = 96 persons.

Eye color    Count
Brown           46
Blue            22
Green           26
Other            2
Total           96

Suppose that the sample included members of the same family as well as unrelated individuals. Persons from the same family are more likely to have similar eye color than unrelated persons, so the assumptions of the multinomial model would be violated. If both parents have brown eyes, it is very likely that their offspring will also have brown eyes. Whereas the eye colors of family members related by marriage will not violate the multinomial assumption, the eye colors of blood relatives will.

Now suppose that the sample consisted of “unrelated” persons randomly selected within Pennsylvania. In other words, persons are randomly selected from a list of Pennsylvania residents. If two members of the same family happen to be selected into the sample purely by chance, that’s okay; the important thing is that each person on the list has an equal chance of being selected, regardless of who else is selected.

Could this be considered a multinomial situation? For all practical purposes, yes. The sampled individuals are not independent according to the common English definition of the word, because they all live in Pennsylvania. But we can suppose that they are independent from a statistical viewpoint, because the individuals are exchangeable: before the sample is drawn, no two are a priori any more likely to have the same eye color than any other two.

Discuss     Based on the readings on the multinomial distribution in Lessons 1 and 2, can you answer the following questions?

  1. What is the MLE vector for the eye color example?
  2. If we fuse "other" eye color with "brown", how does the distribution change; i.e., what is the multinomial distribution now?
  3. It turns out that we can partition "brown" eyes into 20 with "hazel" color and 26 with "dark brown". How would you characterize these distributions now?

2.4 – Goodness-of-Fit Test

A goodness-of-fit test, in general, measures how well the observed data correspond to the fitted (assumed) model. We will use this concept throughout the course as a way of checking model fit. As in linear regression, in essence, the goodness-of-fit test compares the observed values to the expected (fitted or predicted) values.

A goodness-of-fit statistic tests the following hypothesis:

H0: the model M0 fits
vs.
HA: the model M0 does not fit (or, some other model MA fits)

Most often the observed data represent the fit of the saturated model, the most complex model possible with the given data. Thus, most often the alternative hypothesis (HA) will represent the saturated model MA which fits perfectly because each observation has a separate parameter.  Later in the course we will see that MA could be a model other than the saturated one. Let us now consider the simplest example of the goodness-of-fit test with categorical data.

In the setting for one-way tables, we measure how well an observed variable X corresponds to a Mult (n, π) model for some vector of cell probabilities, π. We will consider two cases:

  1. when vector π is known, and
  2. when vector π is unknown.

In other words, we assume that under the null hypothesis data come from a Mult (n, π) distribution, and we test whether that model fits against the fit of the saturated model. The rationale behind any model fitting is the assumption that a complex mechanism of data generation may be represented by a simpler model. The goodness-of-fit test is applied to corroborate our assumption.

Consider our Dice Rolls example from the Introduction. We want to test the hypothesis that the six faces are equally probable; that is, we compare the observed frequencies to the assumed model: X ∼ Mult(n = 30, π0 = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)). You can think of this as simultaneously testing whether the probability in each cell equals a specified value, e.g.

H0: ( π1, π2, π3, π4, π5, π6) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)

vs. 

HA: ( π1, π2, π3, π4, π5, π6) ≠ (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

Most software packages will already have built-in functions that will do this for you; see the next section for examples in SAS and R. Here is a step-by step procedure to help you conceptually understand this test better and what is going on behind these functions.

Step 1: If the vector π is unknown, estimate the unknown parameters and proceed to Step 2; if π is known, proceed directly to Step 2.

Step 2: Calculate the estimated (fitted) cell probabilities, \hat{\pi}_j's, and the expected cell frequencies, Ej's, under H0.

Step 3: Calculate the Pearson goodness-of-fit statistic, X 2 and/or the deviance statistic, G2 and compare them to appropriate chi-squared distributions to make a decision.

Step 4: If the decision is borderline or if the null hypothesis is rejected, further investigate which observations may be influential by looking, for example, at residuals.

Pearson and deviance test statistics

The Pearson goodness-of-fit statistic is

X^2=\sum\limits_{j=1}^k \dfrac{(X_j-n\hat{\pi}_j)^2}{n\hat{\pi}_j}

An easy way to remember it is

X^2=\sum\limits_j \dfrac{(O_j-E_j)^2}{E_j}

where Oj = Xj is the observed count in cell j, and E_j=E(X_j)=n\hat{\pi}_j is the expected count in cell j under the assumption that the null hypothesis is true, i.e., that the assumed model is a good one. Notice that \hat{\pi}_j is the estimated (fitted) cell proportion πj under H0.

The deviance statistic is

G^2=2\sum\limits_{j=1}^k X_j \text{log}\left(\dfrac{X_j}{n\hat{\pi}_j}\right)

where “log” means natural logarithm. An easy way to remember it is

G^2=2\sum\limits_j O_j \text{log}\left(\dfrac{O_j}{E_j}\right)

In some texts, G2 is also called the likelihood-ratio test statistic, for comparing the likelihoods (l0 and l1 ) of two models, that is comparing the loglikelihoods under H0 (i.e., loglikelihood of the fitted model, L0) and loglikelihood under HA (i.e., loglikelihood of the larger, less restricted, or saturated model L1): G2 = -2log(l0/l1) = -2(L0 – L1). A common mistake in calculating G2 is to leave out the factor of 2 at the front.

Note that X2 and G2 are both functions of the observed data X and a vector of probabilities π. For this reason, we will sometimes write them as X2(x, π) and G2(x, π), respectively; when there is no ambiguity, however, we will simply use X2 and G2. We will be dealing with these statistics throughout the course; in the analysis of 2-way and k-way tables, and when assessing the fit of log-linear and logistic regression models.
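
A small helper function makes the two statistics concrete; this is a sketch written for this note (the function and argument names are arbitrary), not code from the course files:

    # Pearson X^2 and deviance G^2 for a one-way table,
    # given observed counts and hypothesized cell probabilities
    gof <- function(observed, prob) {
      expected <- sum(observed) * prob
      X2 <- sum((observed - expected)^2 / expected)
      # a cell with observed count 0 contributes 0 to G^2
      terms <- ifelse(observed == 0, 0, observed * log(observed / expected))
      G2 <- 2 * sum(terms)
      df <- length(observed) - 1     # assumes prob is fully specified under H0
      c(X2 = X2, G2 = G2, df = df,
        p.X2 = 1 - pchisq(X2, df), p.G2 = 1 - pchisq(G2, df))
    }

For example, gof(c(3, 7, 5, 10, 2, 3), rep(1/6, 6)) reproduces the dice-roll statistics computed in Section 2.5.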

Testing the Goodness-of-Fit

X2 and G2 both measure how closely the model, in this case Mult (n, π) “fits” the observed data.

  • If the sample proportions pj = Xj /n (i.e., saturated model) are exactly equal to the model’s πj for cells j = 1, 2, . . . , k, then Oj = Ej for all j, and both X2 and G2 will be zero. That is, the model fits perfectly.
  • If the sample proportions pj deviate from the π^‘s computed under H0, then X2 and G2 are both positive. Large values of X2 and G2 mean that the data do not agree well with the assumed/proposed model M0.

How can we judge the sizes of X2 and G2?

The answer is provided by this result:

If x is a realization of X ∼ Mult(n, π), then as n becomes large, the sampling distributions of both X2(x, π) and G2(x, π) approach a chi-squared distribution with df = k − 1, denoted χ2k−1, where k is the number of cells.

This means that we can easily test a null hypothesis H0: π = π0 against the alternative H1: π ≠ π0 for some pre-specified vector π0. An approximate α-level test of H0 versus H1 is:

Reject H0 if computed X2(x, π0) or G2(x, π0) exceeds the theoretical value χ2 k−1(1 − α).

Here, χ2k−1(1 − α) denotes the (1 − α)th quantile of the χ2k−1 distribution, the value for which the probability that a χ2k−1 random variable is less than or equal to it is 1 − α. The p-value for this test is the area to the right of the computed X2 or G2 under the χ2k−1 density curve. Below is a simple visual example. Consider a chi-squared distribution with df=10. Let’s assume that a computed test statistic is X2=21. For α=0.05, the theoretical value is 18.31.

[Figure: the χ2 density with df = 10, with the critical value 18.31 and the observed statistic X2 = 21 marked; the p-value is the area under the curve to the right of 21.]

Useful functions in SAS and R to remember for computing the p-values from the chi-square distribution are:

  • In R, p-value=1-pchisq(test statistic, df), e.g., 1-pchisq(21,10)=0.021
  • In SAS, p-value=1-probchi(test statistic,df), e.g.,1-probchi(21,10)=0.021

You can quickly review the chi-squared distribution in Lesson 0, or check out http://www.statsoft.com/textbook/stathome.html and http://www.ruf.rice.edu/~lane/stat_sim/chisq_theor/index.html. The STATSOFT link also has brief reviews of many other statistical concepts and methods.

Here are a few more comments on this test.

  • When n is large and the model is true, X2 and G2 tend to be approximately equal. For large samples, the results of the X2 and G2 tests will be essentially the same.
  • An old-fashioned rule of thumb is that the χ2 approximation for X2 and G2 works well provided that n is large enough to have Ej = nπj ≥ 5 for every j. Nowadays, most agree that we can have Ej< 5 for some of the cells (say, 20% of them). Some of the Ej‘s can be as small as 2, but none of them should fall below 1. If this happens, then the χ2 approximation isn’t appropriate, and the test results are not reliable.
  • In practice, it’s a good idea to compute both X2 and G2 to see if they lead to similar results. If the resulting p-values are close, then we can be fairly confident that the large-sample approximation is working well.
  • If it is apparent that one or more of the Ej‘s are too small, we can sometimes get around the problem by collapsing or combining cells until all the Ej‘s are large enough. But we can also perform a small-sample inference or exact inference. We will see more on this in Lesson 3. Please note that the small-sample inference can be conservative for discrete distributions, that is may give a larger p-value than it really is (e.g., for more details see Agresti (2007), Sec. 1.4.3-1.4.5, and 2.6; Agresti (2013), Sec. 3.5, and for Bayesian inference Sec 3.6.)
  • In most applications, we will reject the null hypothesis X ∼ Mult(n, π) for large values of X2 or G2. On rare occasions, however, we may want to reject the null hypothesis for unusually small values of X2 or G2. That is, we may want to define the p-value as P(χ2k−1 ≤ X2) or P(χ2k−1 ≤ G2). Very small values of X2 or G2 suggest that the model fits the data too well, i.e., the data may have been fabricated or altered in some way to fit the model closely. This is how R.A. Fisher figured out that some of Mendel's experimental data must have been fraudulent (e.g., see Agresti (2007), page 327; Agresti (2013), page 19).

2.5 – Examples in SAS/R: Dice Rolls & Tomato

Example – Dice Rolls

Suppose that we roll a die 30 times and observe the following table showing the number of times each face ends up on top.

Face     Count
1            3
2            7
3            5
4           10
5            2
6            3
Total       30

We want to test the null hypothesis that the die is fair. Under this hypothesis, X ∼ Mult(n = 30, π0), where π0 = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

This is our assumed model, and under this H0 the expected counts are Ej = 30 × 1/6 = 5 for each cell j. One way to quantify the difference between what we are assuming and what we are observing is to evaluate the goodness-of-fit statistics discussed in the previous section.

X^2=\dfrac{(3-5)^2}{5}+\dfrac{(7-5)^2}{5}+\dfrac{(5-5)^2}{5}+\dfrac{(10-5)^2}{5}+\dfrac{(2-5)^2}{5}+\dfrac{(3-5)^2}{5}=\dfrac{46}{5}=9.2

G^2=2\left(3\log\dfrac{3}{5}+7\log\dfrac{7}{5}+5\log\dfrac{5}{5}+10\log\dfrac{10}{5}+2\log\dfrac{2}{5}+3\log\dfrac{3}{5}\right)=8.78

Note that the numerical values of X2 and G2 can differ. We can calculate the p-values for these statistics to help determine how well the model fits. The p-values are P(χ2 with 5 df ≥ 9.2) = 0.10 and P(χ2 with 5 df ≥ 8.8) = 0.12. Given these p-values, with a Type I error rate of α = 0.05, we fail to reject the null hypothesis. But rather than relying only on the p-values, we should also think about the actual values of the X2 and G2 statistics. Small values imply a good fit, i.e., the observed values are close to what was assumed; large values imply the opposite. In this case, the fair-die model doesn't fit the data very well, but the fit isn't bad enough to conclude beyond a reasonable doubt that the die is unfair.

Next, we show how to do this in SAS and R; you can run either.

The following R code, dice_rolls.R, will perform the same analysis as what we just did in SAS. You can copy the entire code into the R window, and your output will be saved into two files, dice_rolls.out and dice_rolls_Results. If you are new to R, however, I suggest you run this line by line to get more familiar with the commands; you can simply copy and paste into the R terminal. For other options, see a very basic intro to R.

dice_rolls.out (part of the output)

Here are a few additional comments regarding the above R code and its output. In the viewlet version of the code the data are labeled "ex7", whereas in the updated code the label is "dice"; it's the same data. The last part of the R code does the same analysis, but gives an output that looks more like what SAS produces; we did this to show a bit more programming in R and how you can play with creating different outputs.

IMPORTANT! The Pearson chi-squared statistic is X2 = 9.2 (p-value = 0.101), the deviance statistic (from the R output) is G2 = 8.78 (p-value = 0.118), and they both follow the chi-squared sampling distribution with df = 5. It is not uncommon to observe a discrepancy between the values of these two statistics, especially for smaller sample sizes. Notice, however, that in this case they lead to the same conclusion: the data do not provide strong enough evidence to reject the hypothesis that the die is fair.
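
If you cannot access dice_rolls.R, the following minimal sketch (written for this page, not the course file) reproduces both statistics with base R:

    # Observed dice counts and the fair-die hypothesis
    observed <- c(3, 7, 5, 10, 2, 3)
    prob0    <- rep(1/6, 6)

    # Pearson X^2 via the built-in test
    chisq.test(observed, p = prob0)     # X-squared = 9.2, df = 5, p-value about 0.10

    # Deviance G^2 computed directly
    expected <- sum(observed) * prob0
    G2 <- 2 * sum(observed * log(observed / expected))
    G2                                  # about 8.78
    1 - pchisq(G2, df = 5)              # about 0.12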

Discuss      Can you identify the relevant statistics and the p-value in the output? What does the column labeled “Percentage” in dice_rolls.out represent?

Example – Tomato Phenotypes

Tall cut-leaf tomatoes were crossed with dwarf potato-leaf tomatoes, and n = 1611 offspring were classified by their phenotypes.

Phenotypes           Count
tall cut-leaf           926
tall potato-leaf        288
dwarf cut-leaf          293
dwarf potato-leaf       104

Genetic theory says that the four phenotypes should occur with relative frequencies 9 : 3 : 3 : 1, and thus are not all equally likely to be observed. The dwarf potato-leaf phenotype is less likely to be observed than the others. Do the observed data support this theory?

Under the null hypothesis, the probabilities are

π1 = 9/16 , π2 = π3 = 3/16 , π4 = 1/16 ,

and the expected frequencies are

E1 = 1611(9/16) = 906.2,
E2 = E3 = 1611(3/16) = 302.1, and
E4 = 1611(1/16) = 100.7.

We calculate the fit statistics and find that X2 = 1.47 and G2 = 1.48, which are nearly identical. The p-values based on the χ2 distribution with 3 d.f. are approximately equal to 0.69. Therefore, we fail to reject the null hypothesis and the data are consistent with the genetic theory. Again, the small values of these statistics imply that the fit between the data and the proposed model is good!

Here is how to do the computations in R.


The code has step-by-step calculations and uses the built-in chisq.test() function, producing output that includes Pearson and deviance residuals.
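
As a stand-in for that file, here is a minimal sketch of those computations; the deviance-residual formula used at the end is one common convention, labeled here as an assumption rather than the course's own definition:

    # Tomato phenotype counts and the 9:3:3:1 hypothesis
    observed <- c(926, 288, 293, 104)
    prob0    <- c(9, 3, 3, 1) / 16
    expected <- sum(observed) * prob0

    # Pearson X^2, its p-value, and the Pearson residuals (O - E)/sqrt(E)
    fit <- chisq.test(observed, p = prob0)
    fit                 # X-squared = 1.47, df = 3, p-value about 0.69
    fit$residuals

    # Deviance G^2
    G2 <- 2 * sum(observed * log(observed / expected))
    G2                  # about 1.48
    1 - pchisq(G2, df = 3)

    # One common form of deviance residual: sign(O - E) * sqrt(|2 O log(O/E)|)
    sign(observed - expected) * sqrt(abs(2 * observed * log(observed / expected)))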

Discuss     Do you recall what the residuals are from linear regression? How would you define them in this context? What do they tell you about the tomato example?

https://onlinecourses.science.psu.edu/stat504/lesson2

Posted in: Food for thought