One-Way Tables and Goodness of Fit
Useful links:
Links to background information on the topics in this lesson and some programming help and examples.
 Understand the chi-square distribution:
 SAS source on PROC FREQ:
 SAS source on one-way frequency tables:
 R: Test of Equal or Given Proportions and chi-square goodness-of-fit test.
Readings:
 Ch.1 Agresti (2007, 1996)
 If you are using other textbooks or editions: Ch.1 Agresti (2013, 2002, 1996)
2.1 – Introduction and Examples
Lesson 1 was a brief introduction to categorical data and discrete distributions. We looked at different types of categorical data as well as counts and proportions. In this lesson we will see the simplest way to represent counts and proportions and then carry out the simplest type of analysis. Let’s begin with the definition of a one-way frequency table.
A frequency table arises when sample units are classified into mutually exclusive categories; the number of units falling into each category is recorded. “One way” means that units are classified according to a single categorical variable, and that is the table’s dimension. The cells of the table contain frequency counts and can be arranged in a single row or column. The size of the table refers to the number of cells.
Let us look at some examples and see how they relate to the simplest concept of contingency table.
Example 1 – High-Risk Drinking at Penn State
In 2006, Penn State’s Student Affairs Office of Research and Assessment (SAORA) conducted a survey on alcohol use by undergraduate students at the Penn State University Park campus (Pulse Survey: Student Drinking, March 2006). A sample of 1315 undergraduate students was classified as to whether they are high-risk drinkers or not.
High-Risk    Count
Yes            630
No             685
Total         1315
Since the sample size, n = 1315, is fixed a priori and there are only two outcomes (you are a high-risk drinker or you are not), this is an example of binomial sampling.
The questions of interest are
 What is the true population proportion of students who are high-risk drinkers at Penn State?
 Is this proportion significantly different from 50%? In other words, is a randomly chosen student equally likely to be a high-risk or non-high-risk drinker?
Example 2 – World Cup Soccer
How many teams scored a certain number of goals during the first round of the 2002 FIFA World Cup soccer championship ( http://www.worldcupsoccer.info/ )? For example, 23 teams scored no goals during their first-round matches. The table below records the frequency of the number of goals scored.
Number of goals    Count
0                     23
1                     37
2                     20
3                     11
4                      2
5                      1
6                      0
7                      0
8                      1
Total                 95
This is a one-way table of size 9. Note here that the total count (i.e., the sample size n = 95) was not fixed a priori, so this could be considered Poisson sampling, since we fixed the time period during which we recorded the outcomes of interest. We may ask questions such as
 What is the mean number of goals scored? or
 What is the probability that a team will score four goals?
Example 3 – Dice Rolls
Start a timer and roll a die for two minutes. Count the number of times each die face comes up on top during these two minutes. This is an example of Poisson sampling. You may get results similar to those in the following table. Note that we could not fix the sample size n = 30 a priori!
Face     Count
1            3
2            7
3            5
4           10
5            2
6            3
Total       30
You may also get the table above, or a similar one, by deciding instead to roll a die n = 30 times and recording which face comes up on each of the 30 rolls. This is an example of multinomial sampling. Now we may ask the question, “Is this die fair?”, and use our knowledge of the sampling scheme to answer it.
Example 4 – Eye Colors
A sample of n = 96 persons is obtained, and the eye color of each person is recorded.
Eye color    Count
Brown           46
Blue            22
Green           26
Other            2
Total           96
Notice that brown, blue, green, and other have no intrinsic ordering. The response variable, eye color, is therefore an unordered categorical or nominal variable.
Example 5 – Attitudes Towards War – 1
Hypothetical attitudes of n = 116 people towards war. They were asked to state their opinion on a four-point scale regarding the statement: “This is a necessary war”.
Attitude             Count
Strongly disagree       35
Disagree                27
Agree                   23
Strongly agree          31
Total                  116
The response categories in this example are clearly ordered, but no objectively defined numerical scores can be attached to the categories. The response variable, attitude, is therefore said to be an ordered categorical or ordinal variable.
Example 6 – Attitudes Towards War – 2
Hypothetical attitudes of n = 130 people towards the war, but with additional categories for “not sure” and “refused to answer.”
Attitude             Count
Strongly disagree       35
Disagree                27
Agree                   23
Strongly agree          31
Not sure                 6
Refusal                  8
Total                  130
The first four categories are clearly ordered, but the placement of “not sure” and “refusal” in the ordering is questionable. We would have to say that this response is partially ordered.
Example 7 – Number of Children in Families
Number of children in n = 100 randomly selected families.
Number of children    Count
0                        19
1                        26
2                        29
3                        13
4–5                      11
6+                        2
Total                   100
The original data, the raw numbers of children, have been coarsened into six categories (0, 1, 2, 3, 4–5, 6+). These categories are clearly ordered, but, unlike the previous example, they have objectively defined numeric values attached to them. We can say that this table represents coarsened numeric data, or an interval variable.
Example 8 – Household Incomes
Total gross income of n = 100 households.
Income               Count
below $10,000           11
$10,000–$24,999         23
$25,000–$39,999         30
$40,000–$59,999         24
$60,000 and above       12
Total                  100
The original data (raw incomes) were essentially continuous. Any type of data, continuous or discrete, can be grouped or coarsened into categories.
Grouping data will typically result in some loss of information. How much information is lost depends on (a) the number of categories and (b) the question being addressed. In this example, grouping has somewhat diminished our ability to estimate the mean or median household income. Our ability to estimate the proportion of households with incomes below $10,000 has not been affected, but estimating the proportion of households with incomes above $75,000 is now virtually impossible. (Why?)
Remember the measurement hierarchy? A statistical method developed for the analysis of ordinal categorical data will typically not be appropriate for the analysis of, say, nominal categorical data. This is why we make a distinction between nominal, ordinal, and partially ordered variables. Let’s get a bit more technical and see the notation relevant for one-way frequency tables before we go into a discussion of sampling schemes.
2.2 – Notation & Structure
Suppose data are collected on a discrete variable, X, with k categories. A one-way frequency table with k cells will be denoted by the vector
X = (X_{1}, X_{2}, … , X_{k})
where X_{j} is the count or frequency in cell j, and x_{j} or n_{j} is the observed count in cell j. Some textbooks use x_{j} and others use n_{j}; the latter is more consistent with your textbook’s (Agresti’s) notation.
Note that X is a summary of the original raw dataset consisting of
n = n_{1} + n_{2} + … + n_{k}
observations.
For instance, the Eye Colors example had four types of eye colors, thus k = 4.
X_{1} is the random number of sampled people with “brown” eyes. The observed number of people with brown eyes is x_{1} = n_{1} = 46. The total sample size is n = n_{1} + n_{2} + n_{3} + n_{4} = 46 + 22 + 26 + 2 = 96.
Joint Distribution
If the observed units are randomly sampled from a large population, then x = (n_{1}, n_{2}, … , n_{k}) will have a joint discrete probability distribution such as the Poisson, binomial, or multinomial; a technical review of these common discrete distributions is given in Lesson 1.
Let π_{j} be the population probability (or proportion) that a randomly selected individual (i.e., an observational unit) falls into the jth cell of the table. Then the joint probability distribution will be described by the vector
π = (π_{1}, π_{2}, … , π_{k})
For this to be a proper joint probability distribution, the probabilities must sum to 1:
π_{1} + π_{2} + … + π_{k} = 1
In the social sciences we typically think of π_{j} as a proportion of individuals in the population with a particular characteristic. For example, π_{1} = proportion of people with brown eyes in the population of interest. In statistical inference, our goal is to estimate unknown parameters π_{j}, and/or their joint distribution.
For observed data we typically use p instead of π to denote the observed joint distribution {p_{j}}, where p_{j} = n_{j}/n is the observed proportion of people with a certain characteristic; e.g., the proportion of people with brown eyes in our example is p_{1} = 46/96 ≈ 0.479. p_{1} is an example of a sample statistic and a point estimate of the unknown population parameter π_{1}.
For the Eye Color example:
p = (p_{1}, p_{2}, p_{3}, p_{4}) = (46/96, 22/96, 26/96, 2/96) ≈ (0.479, 0.229, 0.271, 0.021)
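These observed proportions are easy to verify in R; a minimal sketch (the vector names here are ours, not part of the course files):

```r
# Eye-color counts from the table above
counts <- c(brown = 46, blue = 22, green = 26, other = 2)
n <- sum(counts)      # total sample size, 96
p <- counts / n       # observed proportions p_j = n_j / n
round(p, 3)           # 0.479, 0.229, 0.271, 0.021; they sum to 1
```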
2.3 – Sampling Schemes
What are some ways of generating these one-way tables of counts? Why do you think we care about the random mechanism that generated the data?
Any data analysis requires some assumptions about the data generation process. For continuous data and linear regression, for example, we assume that the response variable has been randomly generated from a normal distribution. For categorical data we will often assume that the data have been generated from a Poisson, binomial, or multinomial distribution. Statistical analysis depends on the data generation mechanism, although in some situations, depending on the objective, we may ignore that mechanism and simplify our analysis.
The following sampling methods correspond to these distributions:
 Unrestricted sampling (corresponds to Poisson distribution)
 Sampling with fixed total sample size (corresponds to Binomial or Multinomial distributions)
2.3.1 – Poisson Sampling
Poisson sampling assumes that the random mechanism generating the data can be described by a Poisson distribution. It is useful for modeling counts of events that occur randomly over a fixed period of time or in a fixed space. It is often useful when the probability of any particular incidence happening is very small while the number of incidences is very large. (This is very much like a binomial distribution where the success probability π of a trial is very small while the number of trials n is very large; this is known as the limiting condition.)
For example, consider the World Cup Soccer data example, where we collect data on the frequency of the number of goals scored by teams during the first-round matches of the 2002 World Cup. Another example is rolling a die during a fixed two-minute period. Similarly, count the number of emails you receive between 4pm and 5pm on a Friday, or the number of students accessing the STAT 504 course website on a Saturday, etc.
Let X be the number of goals scored in a match of the first round of the World Cup, and assume X ∼ Poisson(λ), with distribution
P(X = x) = e^{−λ}λ^{x}/x!,  x = 0, 1, 2, …
where λ is the parameter describing the rate, that is, the mean of the distribution, e.g., the average number of goals scored during the first-round matches. Once you know λ, you know everything there is to know about this distribution. x! stands for x factorial, i.e., x! = 1·2·3·…·x, and P(X = x), or P(x), is the probability that a randomly chosen team scored x goals in a game. How the average rate λ = 1.38 is obtained is shown below. Then
P(X = 0) = e^{−1.38}(1.38)^{0}/0! ≈ 0.25
is the probability that a randomly chosen team scores 0 goals in its first-round match. For the remaining probabilities, see the table at the end of this page.
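This single probability can be checked in R with the built-in Poisson density function; a quick sketch:

```r
# Poisson probability of scoring 0 goals when the rate is 1.38
lambda <- 1.38
dpois(0, lambda)   # same as exp(-1.38) * 1.38^0 / factorial(0), about 0.25
```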
The Poisson Model (distribution) Assumptions
 Independence: Events must be independent (e.g. the number of goals scored by a team should not make the number of goals scored by another team more or less likely.)
 Homogeneity: The mean number of goals scored is assumed to be the same for all teams.
 Time period (or space) must be fixed
In this sampling scheme, the overall n is not fixed. We also assume that X_{1}, X_{2}, … X_{k} are independent Poisson variables, either each with the same rate λ or each with different rate λ_{1}, λ_{2}, …., λ_{k} such that λ_{1} + λ_{2} + … + λ_{k} = λ.
Recall that the mean and variance of a Poisson distribution are the same, i.e., E(X) = Var(X) = λ. In practice, however, the observed variance is usually larger than the theoretical variance, and in the case of the Poisson, larger than its mean. This is known as overdispersion, an important concept that occurs with discrete data. We assumed that each team has the same probability of scoring goals in each match of the first round, but it is more realistic to assume that these probabilities vary with the teams’ skills, the day the matches were played (e.g., because of the weather), maybe even the order of the matches, etc. Then we may observe more variation in the scoring than the Poisson model predicts. Analyses assuming binomial, Poisson, or multinomial distributions are sometimes invalid because of overdispersion. We will see more on this later when we study logistic regression and Poisson regression models.
Let us see how we can do some basic calculations with the World Cup Soccer example under the Poisson model.
QUESTION: What is the most likely mean number of goals scored; that is, what is the most likely value of the unknown parameter λ given the data x?
We can answer this question by relying on the basic principles of statistical inference: point estimation, confidence intervals, and/or hypothesis testing.
Recall from Lesson 1 on Likelihood and MLE: The most common point estimate is the “maximum likelihood estimate” (MLE) which has nice statistical properties, and it is the most likely value of the parameter given the data; it is the value that maximizes the likelihood function.
The MLE of λ for the Poisson distribution is the sample mean; for our example,
λ̂ = (0×23 + 1×37 + 2×20 + 3×11 + 4×2 + 5×1 + 8×1)/95 = 131/95 ≈ 1.38
Thus λ̂ = 1.38 goals per team per first-round match.
An asymptotic, or approximate, 95% confidence interval is then
λ̂ ± 1.96 √(λ̂/n) = 1.38 ± 1.96 √(1.38/95) = 1.38 ± 0.24 = (1.14, 1.62)
and we are 95% confident that the mean number of goals scored by a team during the first-round matchups is somewhere between 1.14 and 1.62. Now that we have an estimate of the mean number of goals, we can calculate the expected probability of a randomly chosen team scoring 0, 1, 2, 3, … goals, as well as the expected frequencies (counts). For example, under this Poisson model with λ̂ = 1.38, the expected probability of scoring 2 goals is
P(X = 2) = e^{−1.38}(1.38)^{2}/2! ≈ 0.24
and the expected frequency is 95 × 0.24 ≈ 22.8 teams.
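The calculations above can be reproduced in a few lines of R. This is a minimal sketch, not the course’s soccer.R file; the variable names are ours:

```r
# World Cup first-round goal counts (from the table above)
goals  <- 0:8
counts <- c(23, 37, 20, 11, 2, 1, 0, 0, 1)
n <- sum(counts)                         # 95 teams

# MLE of the Poisson rate: the sample mean number of goals
lambda.hat <- sum(goals * counts) / n    # 131/95, about 1.38

# Approximate 95% confidence interval for lambda
se <- sqrt(lambda.hat / n)
ci <- lambda.hat + c(-1, 1) * 1.96 * se  # about (1.14, 1.62)

# Expected probabilities and frequencies under the fitted Poisson model
probs    <- dpois(goals, lambda.hat)
expected <- n * probs                    # about 22.7 teams expected to score 2 goals
```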
Example – World Cup Soccer
Here is a link to the World Cup Soccer data (text file).
You can easily do these calculations by hand, in Excel, or in any other software package you are using. Here they are in R; see the code file soccer.R.
2.3.2 – Binomial Sampling
When data are collected on a predetermined number of units and are then classified according to two levels of a categorical variable, binomial sampling emerges. Consider the High-Risk Drinking example, where we have high-risk drinkers versus non-high-risk drinkers. In this study there was a fixed number of trials (e.g., a fixed number of students surveyed, n = 1315), and the researcher counted the number of “successes” and the number of “failures” that occurred. We can let X be the number of “successes”, that is, the number of students who are high-risk drinkers, and use the binomial probability distribution (i.e., the binomial model) to describe this variable.
Binomial distributions are characterized by two parameters: n, which is fixed (this could be the number of trials, or the total sample size if we think in terms of sampling), and π, which usually denotes the probability of “success”. In our example this is the probability that someone is a high-risk drinker in the population of Penn State students. Please note that some textbooks use π to denote the population parameter and p to denote the sample estimate, whereas others use p for the population parameter as well. We may do both; don’t be confused by this, just make sure to read the specification carefully. Once you know n and π, you know the mean and variance of the binomial distribution, and you know everything about that distribution.
Below are the probability distribution, mean, and variance of a binomial variable:
P(X = x) = n!/(x!(n − x)!) π^{x}(1 − π)^{n−x},  x = 0, 1, … , n
Mean: E(X) = nπ
Variance: Var(X) = nπ(1 − π)
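The mean and variance formulas can be verified numerically by summing over the probability distribution; a small sketch with illustrative values n = 10 and π = 0.3 (our choice, not from the text):

```r
# Check E(X) = n*pi and Var(X) = n*pi*(1 - pi) for a small binomial
n    <- 10
prob <- 0.3
x    <- 0:n
pmf  <- dbinom(x, n, prob)

m <- sum(x * pmf)           # mean: 10 * 0.3 = 3
v <- sum((x - m)^2 * pmf)   # variance: 10 * 0.3 * 0.7 = 2.1
```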
Binomial Model (distribution) Assumptions
 Fixed n: the total number of trials/events, (or total sample size) is fixed.
 Each event has two possible outcomes, referred to as “success” or “failure” (e.g., each student can be either a heavy drinker or a non-heavy drinker, heavy drinker being a “success” here).
 Independent and Identical Events/Trials:
 Identical trials means that probability of success is the same for each trial.
 Independent means that the outcome of one trial does not affect the outcome of the other, (e.g. one student being a heavy drinker or not does not affect the status of the next student, and each student has the same probability, π, of being a heavy drinker.)
Example – Heavy Drinking Students
QUESTION: What is the probability that no students are heavy drinkers, i.e., P(X = 0)? Let’s assume that π = 0.5. Then
P(X = 0) = (1 − 0.5)^{1315} = (0.5)^{1315} ≈ 0,
a vanishingly small number. What is the probability that there are X = 1000 heavy drinkers in this example? Also essentially zero: with n = 1315 and π = 0.5, the distribution is centered at nπ ≈ 658, and a count of 1000 is far out in the tail.
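A quick sketch in R shows just how small P(X = 0) is; (0.5)^1315 is around 10^−396, below the smallest representable double, so the density function underflows to 0 and the log scale is needed to see the magnitude:

```r
# Probability of observing 0 heavy drinkers out of 1315 when pi = 0.5
dbinom(0, size = 1315, prob = 0.5)              # underflows to 0 in double precision
dbinom(0, size = 1315, prob = 0.5, log = TRUE)  # log-probability: 1315 * log(0.5), about -911.5
```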
QUESTION: What is the true population proportion of students who are highrisk drinkers at Penn State?
This is a statistical inference question that can be answered with a point estimate, confidence intervals, and hypothesis tests about proportions. The likelihood function for the binomial, L(π; x), is a measure of how close the population proportion π is to the data x. The maximum likelihood estimate (MLE) is the most likely value of π given the observed data; for the binomial distribution this is the sample proportion
π̂ = p = x/n
and the expected count is then nπ̂. Thus for our example, assuming the binomial distribution, our “best” guess estimate of the true proportion of students who are high-risk drinkers is
p = 630/1315 ≈ 0.48
Here are also the likelihood and log-likelihood graphs for our example. We can see that the peak of the likelihood is at the proportion value 0.48. The log-likelihood looks quadratic, which means that large-sample normal theory should work fine, and we can use the approximate 95% confidence intervals.
The MLE is used for statistical inference, such as testing for a specific value of π, or giving a range of likely values of π, via the 95% confidence interval. A key result here comes from understanding the properties of the sampling distribution of the sample proportion p.
The Rule for Sample Proportion: If numerous samples of size n are taken, with n large enough, the frequency curve of the sample proportions from the various samples will be approximately normal, with mean E(p) = π and variance Var(p) = π(1 − π)/n. “Large enough” usually means that the numbers of successes and failures are not small, i.e., nπ ≥ 5 and n(1 − π) ≥ 5. The larger the sample size n, the better the sampling distribution of p is approximated by a normal distribution. Note that the sampling distribution of p is really a discrete rather than a continuous distribution, but we rely on the above approximation for statistical inference unless we deal with small samples; for the latter case see Agresti (2007), Sec. 1.4.3–1.4.5.
QUESTION: Is the population proportion of heavy drinkers significantly different from 50%?
Large-sample hypothesis test about π
H_{0}: π = π_{0} vs. H_{A}: π ≠ π_{0}

The z test statistic is
z = (p − π_{0}) / √(π_{0}(1 − π_{0})/n)
where p is the sample proportion, in our case 0.48, and π_{0} is the value we are testing, 0.5.
H_{0}: π = 0.5 vs. H_{A}: π ≠ 0.5

Plugging in, z = (0.479 − 0.5)/√(0.25/1315) ≈ −1.52, with a two-sided p-value of about 0.13. Thus we do not have strong evidence that the proportion of high-risk drinkers is different from 50%; i.e., we fail to reject the null hypothesis.
Confidence interval for π
The usual (1 − α) × 100% CI is
p ± z_{α/2} √(p(1 − p)/n)
This interval is known as the Wald confidence interval.
For our example, the 95% CI is 0.48 ± 1.96 × 0.014 = (0.453, 0.507). We can be 95% confident that the true population proportion of students who are high-risk drinkers is between 0.453 and 0.507.
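The test and the Wald interval can be computed directly in R; a minimal sketch (variable names are ours). The interval from the unrounded sample proportion is about (0.452, 0.506), close to the (0.453, 0.507) above, which rounds p to 0.48:

```r
# Large-sample test of H0: pi = 0.5 and Wald CI for the drinking data
x <- 630; n <- 1315
p <- x / n                                   # sample proportion, about 0.479

z    <- (p - 0.5) / sqrt(0.5 * 0.5 / n)      # about -1.52
pval <- 2 * pnorm(-abs(z))                   # about 0.13

wald <- p + c(-1, 1) * 1.96 * sqrt(p * (1 - p) / n)  # about (0.452, 0.506)
```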
However, when the proportions are extremely small or large (π < 0.20 or π > 0.80), this CI does not work very well. It is better to consider the likelihood-ratio-based CI, as discussed in Lesson 1. This interval is computationally more complex but conceptually simple: we evaluate the likelihood function and look for all possible values of π_{0} for which the null hypothesis would not be rejected. In our case this gives a 95% CI of (0.453, 0.506).
To do the above calculations in R and SAS, see the drinking.R and drinking.sas files below. Also watch the viewlets that walk you through how these programs work.
Here is the R program drinking.R.
And, here is a walkthrough of this program.
The third alternative, also a likelihood-based confidence interval, known as the score confidence interval, in essence looks for ALL π_{0} values that yield the desired test statistic; e.g., for a 95% CI, z_{α/2} = ±1.96. That is, we need to solve the following equation:
(p − π_{0}) / √(π_{0}(1 − π_{0})/n) = ±z_{α/2}
which is the same as:
(p − π_{0})^{2} = z^{2}_{α/2} π_{0}(1 − π_{0})/n
To do this, we can solve the resulting quadratic equation in π_{0}:
(1 + z^{2}_{α/2}/n) π_{0}^{2} − (2p + z^{2}_{α/2}/n) π_{0} + p^{2} = 0
In this example, the 95% score CI is the same as the Wald CI, (0.453, 0.507), to the reported precision. Notice that the only difference between the Wald CI and the score CI is the standard error: in the former it is calculated using the sample proportion p, and in the latter the null value π_{0}.
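In R, the score test and score (Wilson) interval are what prop.test() reports; a quick sketch, with the continuity correction turned off so the output matches the formulas above:

```r
# Score test and score (Wilson) confidence interval via prop.test
out <- prop.test(x = 630, n = 1315, p = 0.5, correct = FALSE)
out$statistic   # chi-squared statistic, the square of z: about 2.30
out$conf.int    # score CI, about (0.452, 0.506)
```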
Traditionally, most software packages will give you the Wald CI, but nowadays we are starting to see the score and likelihood-ratio intervals too.
Margin of error (MOE)
The margin of error (MOE) is half the width of the 95% CI for a single proportion, so that the 95% CI is p ± MOE; see the CNN example for a review. The confidence interval is widest when π = 0.5. This knowledge is useful in determining the sample size needed for given conditions.
Sample Size
Often we are interested in knowing what sample size is needed for a specific margin of error for a population proportion; for example, how large a sample we would need for a 99% confidence interval of a given width. Since MOE = z_{α/2} √(π(1 − π)/n), solving for n gives
n = z^{2}_{α/2} π(1 − π) / MOE^{2}
Since π is unknown, take π = 0.5 to get the widest possible MOE. For example, if the required MOE is 3% for a 99% CI,
n = (2.575)^{2}(0.5)(0.5)/(0.03)^{2} ≈ 1841.8
which is rounded up to 1842.
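The same arithmetic in R, using the rounded z-value 2.575 as in the calculation above:

```r
# Sample size for a 3% margin of error with a 99% CI,
# using pi = 0.5 (the worst case) and the rounded z-value 2.575
z   <- 2.575
moe <- 0.03
n   <- ceiling(z^2 * 0.5 * 0.5 / moe^2)   # 1842
```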
For additional simple Binomial distribution calculations see Lesson 0 and SimpleBinomialCal.R code with its viewlet that walks you through the code.
2.3.3 – Multinomial Sampling
Multinomial sampling may be considered a generalization of binomial sampling. Data are collected on a predetermined number of individuals (i.e., units) and classified according to the levels of a categorical variable of interest (e.g., see Examples 4 through 8 in the Introduction of this lesson).
X ∼ Mult(n, π), with the probability distribution
P(X_{1} = x_{1}, … , X_{k} = x_{k}) = n!/(x_{1}! x_{2}! ⋯ x_{k}!) π_{1}^{x_{1}} π_{2}^{x_{2}} ⋯ π_{k}^{x_{k}}
where n is the fixed, known sample size, π = (π_{1}, π_{2}, … , π_{k}) is the vector of population proportions, and x_{1} + x_{2} + … + x_{k} = n.
Each X_{j} represents the random count in cell j, x_{j} the observed sample count, and each π_{j} the population proportion in cell j.
The multinomial distribution has mean vector with components
E(X_{j}) = nπ_{j}
The variance of X_{j} is
Var(X_{j}) = nπ_{j}(1 − π_{j})
and the covariance between X_{j} and X_{k} (j ≠ k) is
cov(X_{j}, X_{k}) = −nπ_{j}π_{k}
Each cell thus has a binomial marginal distribution, and the binomial is a special case of the multinomial for k = 2. Let
X_{1} = X, X_{2} = n − X, π_{1} = π, π_{2} = 1 − π
then
P(X = x) = n!/(x!(n − x)!) π^{x}(1 − π)^{n−x}
To be consistent with Agresti (2007), pg. 4, or Agresti (2013), pg. 6, where the random variable is denoted Y and the probability distribution P(y), we can write the above as
P(y) = n!/(y!(n − y)!) π^{y}(1 − π)^{n−y}
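The reduction of the multinomial to the binomial for k = 2 is easy to check numerically in R; a sketch with illustrative values x = 3, n = 30, π = 0.1 (our choice):

```r
# With k = 2 cells, the multinomial probability reduces to the binomial
x <- 3; n <- 30; prob <- 0.1
dmultinom(c(x, n - x), prob = c(prob, 1 - prob))  # two-cell multinomial
dbinom(x, n, prob)                                # binomial; the two agree
```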
If n is random (not fixed a priori), as was the case with the total number of goals scored in the World Cup Soccer example, but we have no interest in making inferences about the sample size n, only about the underlying cell probabilities, we can apply the multinomial model instead of the Poisson model and the inferences will be the same. But we do have to worry about other kinds of model failure, e.g., are the assumptions of the multinomial model satisfied?
Multinomial Model (distribution) Assumptions:
 the n trials are independent, and
 the parameter vector π remains constant from trial to trial.
The most common violation of these assumptions occurs when clustering is present in the data. Clustering means that some of the trials occur in groups or clusters, and that trials within a cluster tend to have outcomes more similar than trials from different clusters. Clustering can be thought of as a violation of either the independence assumption or the constant-π assumption.
Example – Eye Color
In this example, eye color was recorded for n = 96 persons.
Eye color    Count
Brown           46
Blue            22
Green           26
Other            2
Total           96
Suppose that the sample included members from the same family as well as unrelated individuals. Persons from the same family are more likely to have similar eye color than unrelated persons, so the assumptions of the multinomial model would be violated. If both parents have brown eyes, it is very likely that their offspring will also have brown eyes. Eye colors of family members related by marriage will not violate the multinomial assumption, but eye colors of blood relatives will.
Now suppose that the sample consisted of “unrelated” persons randomly selected within Pennsylvania. In other words, persons are randomly selected from a list of Pennsylvania residents. If two members of the same family happen to be selected into the sample purely by chance, that’s okay; the important thing is that each person on the list has an equal chance of being selected, regardless of who else is selected.
Could this be considered a multinomial situation? For all practical purposes, yes. The sampled individuals are not independent according to the common English definition of the word, because they all live in Pennsylvania. But we can suppose that they are independent from a statistical viewpoint, because the individuals are exchangeable; before the sample is drawn, no two are a priori any more likely to share eye color than any other two.
2.4 – Goodness-of-Fit Test
A goodness-of-fit test, in general, measures how well the observed data correspond to the fitted (assumed) model. We will use this concept throughout the course as a way of checking model fit. As in linear regression, in essence, the goodness-of-fit test compares observed values to expected (fitted or predicted) values.
A goodnessoffit statistic tests the following hypothesis:
H_{0}: the model M_{0} fits
vs.
H_{A}: the model M_{0} does not fit (or, some other model M_{A} fits)
Most often the observed data represent the fit of the saturated model, the most complex model possible with the given data. Thus, most often the alternative hypothesis (H_{A}) will represent the saturated model M_{A} which fits perfectly because each observation has a separate parameter. Later in the course we will see that M_{A} could be a model other than the saturated one. Let us now consider the simplest example of the goodnessoffit test with categorical data.
In the setting for oneway tables, we measure how well an observed variable X corresponds to a Mult (n, π) model for some vector of cell probabilities, π. We will consider two cases:
 when vector π is known, and
 when vector π is unknown.
In other words, we assume that under the null hypothesis data come from a Mult (n, π) distribution, and we test whether that model fits against the fit of the saturated model. The rationale behind any model fitting is the assumption that a complex mechanism of data generation may be represented by a simpler model. The goodnessoffit test is applied to corroborate our assumption.
Consider our Dice Example from the Introduction. We want to test the hypothesis that each of the six faces is equally likely; that is, we compare the observed frequencies to the assumed model X ∼ Mult(n = 30, π_{0} = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)). You can think of this as simultaneously testing whether each cell probability equals a specified value, e.g.
H_{0}: ( π_{1}, π_{2}, π_{3}, π_{4}, π_{5}, π_{6}) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6)
vs.
H_{A}: ( π_{1}, π_{2}, π_{3}, π_{4}, π_{5}, π_{6}) ≠ (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
Most software packages have built-in functions that will do this for you; see the next section for examples in SAS and R. Here is a step-by-step procedure to help you understand conceptually what is going on behind these functions.
Step 1: If the vector π is unknown, estimate the unknown parameters; then proceed to Step 2. If π is known, proceed directly to Step 2.
Step 2: Calculate the estimated (fitted) cell probabilities π̂_{j} and the expected cell frequencies E_{j} under H_{0}.
Step 3: Calculate the Pearson goodness-of-fit statistic X^{2} and/or the deviance statistic G^{2}, and compare them to the appropriate chi-squared distribution to make a decision.
Step 4: If the decision is borderline, or if the null hypothesis is rejected, investigate further which observations may be influential by looking, for example, at residuals.
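For the dice data, the whole procedure is one call to R’s built-in chisq.test(); a minimal sketch using the counts from the table in the Introduction:

```r
# Dice example: test H0 that all six faces are equally likely
observed <- c(3, 7, 5, 10, 2, 3)
out <- chisq.test(observed, p = rep(1/6, 6))
out$statistic   # Pearson X^2 = 9.2
out$parameter   # df = 5
out$p.value     # about 0.10, so no strong evidence against fairness
out$expected    # expected counts under H0: 5 in each cell
```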
Pearson and deviance test statistics
The Pearson goodness-of-fit statistic is
X^{2} = Σ_{j} (x_{j} − nπ̂_{j})^{2} / (nπ̂_{j})
An easy way to remember it is
X^{2} = Σ_{j} (O_{j} − E_{j})^{2} / E_{j}
where O_{j} = x_{j} is the observed count in cell j, and E_{j} = nπ̂_{j} is the expected count in cell j under the assumption that the null hypothesis is true, i.e., that the assumed model is a good one. Here π̂_{j} is the estimated (fitted) cell proportion under H_{0}.
The deviance statistic is
G^{2} = 2 Σ_{j} x_{j} log(x_{j} / (nπ̂_{j}))
where “log” means natural logarithm. An easy way to remember it is
G^{2} = 2 Σ_{j} O_{j} log(O_{j} / E_{j})
In some texts, G^{2} is also called the likelihood-ratio test statistic, comparing the likelihoods (l_{0} and l_{1}) of two models, that is, comparing the log-likelihood under H_{0} (the log-likelihood of the fitted model, L_{0}) with the log-likelihood under H_{A} (the log-likelihood of the larger, less restricted, or saturated model, L_{1}):
G^{2} = −2 log(l_{0}/l_{1}) = 2(L_{1} − L_{0})
A common mistake in calculating G^{2} is to leave out the factor of 2 at the front.
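Both statistics can also be computed “by hand” directly from the formulas above; a sketch for the dice data (note that if any observed count were 0, the term O·log(O/E) would need to be treated as 0):

```r
# Pearson X^2 and deviance G^2 computed by hand for the dice data
O <- c(3, 7, 5, 10, 2, 3)        # observed counts
E <- sum(O) * rep(1/6, 6)        # expected counts under H0: all equal to 5

X2 <- sum((O - E)^2 / E)         # Pearson statistic: 9.2
G2 <- 2 * sum(O * log(O / E))    # deviance statistic: about 8.78

# Both are referred to a chi-squared distribution with k - 1 = 5 df
pval.X2 <- 1 - pchisq(X2, df = 5)   # about 0.10
pval.G2 <- 1 - pchisq(G2, df = 5)   # about 0.12
```

The two statistics are close here, as expected when the model fits reasonably well.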
Note that X^{2} and G^{2} are both functions of the observed data X and a vector of probabilities π. For this reason, we will sometimes write them as X^{2}(x, π) and G^{2}(x, π), respectively; when there is no ambiguity, however, we will simply use X^{2} and G^{2}. We will be dealing with these statistics throughout the course: in the analysis of two-way and k-way tables, and when assessing the fit of log-linear and logistic regression models.
Testing the Goodness-of-Fit
X^{2} and G^{2} both measure how closely the model, in this case Mult(n, π), “fits” the observed data.
 If the sample proportions p_{j} = X_{j }/n (i.e., saturated model) are exactly equal to the model’s π_{j} for cells j = 1, 2, . . . , k, then O_{j} = E_{j} for all j, and both X^{2} and G^{2} will be zero. That is, the model fits perfectly.
 If the sample proportions p_{j} deviate from the π̂_{j}’s computed under H_{0}, then X^{2} and G^{2} are both positive. Large values of X^{2} and G^{2} mean that the data do not agree well with the assumed/proposed model M_{0}.
How can we judge the sizes of X^{2} and G^{2}?
The answer is provided by this result:
If x is a realization of X ∼ Mult(n, π), then as n becomes large, the sampling distributions of both X^{2}(x, π) and G^{2}(x, π) approach a chi-squared distribution with df = k − 1, where k is the number of cells: χ^{2}_{k−1}.
This means that we can easily test a null hypothesis H_{0} : π = π_{0} against the alternative H_{1} : π ≠ π_{0} for some prespecified vector π_{0}. An approximate αlevel test of H_{0} versus H_{1} is:
Reject H_{0} if computed X^{2}(x, π_{0}) or G^{2}(x, π_{0}) exceeds the theoretical value χ^{2} _{k−1}(1 − α).
Here, χ^{2}_{k−1}(1 − α) denotes the (1 − α)th quantile of the χ^{2}_{k−1} distribution, i.e., the value for which the probability that a χ^{2}_{k−1} random variable is less than or equal to it is 1 − α. The p-value for this test is the area to the right of the computed X^{2} or G^{2} under the χ^{2}_{k−1} density curve. Below is a simple visual example. Consider a chi-squared distribution with df = 10, and assume that the computed test statistic is X^{2} = 21. For α = 0.05, the theoretical value is 18.31.
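The large-sample result can also be checked empirically. The sketch below is a hypothetical Monte Carlo setup (the sample size, number of cells, and number of simulations are arbitrary choices, not from the lesson): it draws repeated multinomial samples under a null of equally likely cells and checks how often X^{2} exceeds the χ^{2}_{k−1} critical value.

```python
import random

random.seed(2)

# Simulate the sampling distribution of Pearson's X^2 under a Mult(n, pi) null
# with equally likely cells, and compare its upper tail with the chi-squared
# critical value. All parameters here are illustrative.
k, n, sims = 6, 120, 2000
pi0 = [1 / k] * k
E = [n * p for p in pi0]
crit = 11.07                      # chi-squared 0.95 quantile with df = k - 1 = 5

exceed = 0
for _ in range(sims):
    counts = [0] * k
    for _ in range(n):
        counts[random.randrange(k)] += 1   # one multinomial draw, unit by unit
    x2 = sum((o - e) ** 2 / e for o, e in zip(counts, E))
    if x2 > crit:
        exceed += 1

print(exceed / sims)   # should land close to the nominal 0.05
```

With n this large (every E_{j} = 20), the simulated rejection rate stays near α = 0.05, which is exactly what the asymptotic result promises.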
Useful functions in SAS and R to remember for computing p-values from the chi-squared distribution are:
- In R, p-value = 1 - pchisq(test statistic, df), e.g., 1 - pchisq(21, 10) = 0.021.
- In SAS, p-value = 1 - probchi(test statistic, df), e.g., 1 - probchi(21, 10) = 0.021.
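If neither R nor SAS is at hand, the same tail probability can be computed from first principles. This sketch implements the chi-squared survival function through the standard series expansion of the regularized lower incomplete gamma function; the function name and the number of series terms are arbitrary choices of this example:

```python
import math

def chisq_sf(x, df, terms=500):
    """P(X > x) for X ~ chi-squared(df), via the series for the regularized
    lower incomplete gamma function P(s, t) with s = df/2 and t = x/2."""
    s, t = df / 2.0, x / 2.0
    if t <= 0:
        return 1.0
    term = 1.0 / s          # m = 0 term of t^m / (s * (s+1) * ... * (s+m))
    total = term
    for m in range(1, terms):
        term *= t / (s + m)
        total += term
    lower = total * math.exp(s * math.log(t) - t - math.lgamma(s))
    return 1.0 - lower

print(round(chisq_sf(21, 10), 3))   # matches 1 - pchisq(21, 10) = 0.021 above
```

As a cross-check against the text, chisq_sf(18.31, 10) comes out essentially equal to 0.05, the tail area that defines the critical value quoted above.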
You can quickly review the chi-squared distribution in Lesson 0, or check out http://www.statsoft.com/textbook/stathome.html and http://www.ruf.rice.edu/~lane/stat_sim/chisq_theor/index.html. The STATSOFT link also has brief reviews of many other statistical concepts and methods.
Here are a few more comments on this test.
- When n is large and the model is true, X^{2} and G^{2} tend to be approximately equal. For large samples, the results of the X^{2} and G^{2} tests will be essentially the same.
- An old-fashioned rule of thumb is that the χ^{2} approximation for X^{2} and G^{2} works well provided that n is large enough to have E_{j} = nπ_{j} ≥ 5 for every j. Nowadays, most agree that we can have E_{j} < 5 for some of the cells (say, 20% of them). Some of the E_{j}'s can be as small as 2, but none of them should fall below 1. If this happens, then the χ^{2} approximation isn't appropriate, and the test results are not reliable.
- In practice, it's a good idea to compute both X^{2} and G^{2} to see if they lead to similar results. If the resulting p-values are close, then we can be fairly confident that the large-sample approximation is working well.
- If it is apparent that one or more of the E_{j}'s are too small, we can sometimes get around the problem by collapsing or combining cells until all the E_{j}'s are large enough. But we can also perform small-sample or exact inference. We will see more on this in Lesson 3. Please note that small-sample inference can be conservative for discrete distributions, that is, it may give a larger p-value than the true one (for more details, see Agresti (2007), Sec. 1.4.3-1.4.5 and 2.6; Agresti (2013), Sec. 3.5, and for Bayesian inference, Sec. 3.6).
- In most applications, we will reject the null hypothesis X ∼ Mult(n, π) for large values of X^{2} or G^{2}. On rare occasions, however, we may want to reject the null hypothesis for unusually small values of X^{2} or G^{2}. That is, we may want to define the p-value as P(χ^{2}_{k−1} ≤ X^{2}) or P(χ^{2}_{k−1} ≤ G^{2}). Very small values of X^{2} or G^{2} suggest that the model fits the data too well, i.e., the data may have been fabricated or altered in some way to fit the model closely. This is how R. A. Fisher figured out that some of Mendel's experimental data must have been fraudulent (e.g., see Agresti (2007), page 327; Agresti (2013), page 19).
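The exact (small-sample) inference mentioned above can be sketched by brute force: enumerate every possible table with the same total, and sum the null probabilities of those at least as extreme as the observed one, with "extreme" measured here by Pearson's X^{2}. The data in this sketch are hypothetical, and the enumeration is only feasible for small n and k:

```python
import itertools
import math

def exact_gof_pvalue(obs, pi0):
    """Exact multinomial goodness-of-fit p-value: the total null probability of
    all count vectors whose X^2 is at least the observed X^2."""
    n, k = sum(obs), len(obs)
    E = [n * p for p in pi0]

    def x2(counts):
        return sum((o - e) ** 2 / e for o, e in zip(counts, E))

    def prob(counts):  # multinomial probability of one table under H_0
        logp = math.lgamma(n + 1)
        for c, p in zip(counts, pi0):
            logp += c * math.log(p) - math.lgamma(c + 1)
        return math.exp(logp)

    x2_obs = x2(obs)
    return sum(prob(c) for c in itertools.product(range(n + 1), repeat=k)
               if sum(c) == n and x2(c) >= x2_obs - 1e-9)

# Hypothetical data: 10 observations spread over 3 equally likely cells.
print(round(exact_gof_pvalue([8, 1, 1], [1/3, 1/3, 1/3]), 4))
```

Because the reference distribution here is discrete, the attainable p-values jump in steps, which is precisely why such exact tests tend to be conservative.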
2.5 – Examples in SAS/R: Dice Rolls & Tomato
Example – Dice Rolls
Suppose that we roll a die 30 times and observe the following table showing the number of times each face ends up on top.
Face    Count
1       3
2       7
3       5
4       10
5       2
6       3
Total   30
We want to test the null hypothesis that the die is fair. Under this hypothesis, X ∼ Mult(n = 30, π_{0}) where π_{0} = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
This is our assumed model, and under this H_{0}, the expected counts are E_{j} = 30(1/6) = 5 for each cell j. One way to quantify the difference between what we are assuming and what we are observing is to evaluate the goodness-of-fit statistics discussed in the previous section.
Note that the numerical values of X^{2} and G^{2} can differ. We can calculate the p-values for these statistics in order to help determine how well this model fits. The p-values are P(χ^{2}_{5} ≥ 9.2) = 0.10 and P(χ^{2}_{5} ≥ 8.8) = 0.12. Given these p-values, with a Type I error rate of α = 0.05, we fail to reject the null hypothesis. But rather than relying on p-values alone, we should also think about the actual values of the X^{2} and G^{2} statistics. Small values imply a good fit here, i.e., the observed values are close to what was assumed. Large values imply the opposite. In this case, the fair-die model doesn't fit the data very well, but the fit isn't bad enough to conclude beyond a reasonable doubt that the die is unfair.
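The arithmetic above is easy to verify directly. This sketch (in Python, as an independent check on the numbers; the lesson's own code is in SAS and R) simply applies the two formulas from the previous section to the dice counts:

```python
import math

obs = [3, 7, 5, 10, 2, 3]     # observed counts for faces 1 through 6
n = sum(obs)                  # 30 rolls in total
E = [n / 6] * 6               # expected counts under the fair-die hypothesis

x2 = sum((o - e) ** 2 / e for o, e in zip(obs, E))                   # Pearson
g2 = 2 * sum(o * math.log(o / e) for o, e in zip(obs, E) if o > 0)   # deviance

print(round(x2, 2), round(g2, 2))   # X^2 = 9.2 and G^2 = 8.78
```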
Next, we show how to do this in SAS and R; you can run either.
The following R code, dice_rolls.R, will perform the same analysis as what we just did in SAS. You can copy the entire code into the R window, and your output will be saved into two files, dice_rolls.out and dice_rolls_Results. If you are new to R, however, I suggest you run it line by line to get more familiar with the commands; you can simply copy and paste into the R terminal. For other options, see a very basic intro to R.
dice_rolls.out (part of the output)

Here are a few additional comments regarding the above R code and its output. If you run Inspect, the data there will be labeled “ex7”, whereas in the updated code the label is “dice”; it's the same data. The last part of the R code does the same analysis but produces output that looks more like what SAS produces; we did this to show a bit more programming in R and how you can play with creating different outputs.
IMPORTANT! The Pearson chi-squared statistic is X^{2} = 9.2 (p-value = 0.101), the deviance statistic (from the R output) is G^{2} = 8.78 (p-value = 0.118), and they both follow the chi-squared sampling distribution with df = 5. It is not uncommon to observe a discrepancy between the values of these two statistics, especially for smaller sample sizes. Notice, however, that in this case they lead to the same conclusion: the data do not provide strong evidence against the hypothesis that the die is fair.
Can you identify the relevant statistics and the p-value in the output? What does the column labeled “Percentage” in dice_rolls.out represent?
Example – Tomato Phenotypes
Tall cut-leaf tomatoes were crossed with dwarf potato-leaf tomatoes, and n = 1611 offspring were classified by their phenotypes.
Phenotypes          Count
tall cut-leaf       926
tall potato-leaf    288
dwarf cut-leaf      293
dwarf potato-leaf   104
Genetic theory says that the four phenotypes should occur with relative frequencies 9 : 3 : 3 : 1, and thus are not all equally likely to be observed. The dwarf potato-leaf phenotype is less likely to be observed than the others. Do the observed data support this theory?
Under the null hypothesis, the probabilities are
π_{1} = 9/16 , π_{2} = π_{3} = 3/16 , π_{4} = 1/16 ,
and the expected frequencies are
E_{1} = 1611(9/16) = 906.2,
E_{2} = E_{3} = 1611(3/16) = 302.1, and
E_{4} = 1611(1/16) = 100.7.
We calculate the fit statistics and find that X^{2} = 1.47 and G^{2} = 1.48, which are nearly identical. The pvalues based on the χ^{2} distribution with 3 d.f. are approximately equal to 0.69. Therefore, we fail to reject the null hypothesis and the data are consistent with the genetic theory. Again, the small values of these statistics imply that the fit between the data and the proposed model is good!
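These numbers can be reproduced in a few lines. The sketch below is a Python version of the arithmetic, offered only as an independent check (the lesson's own computations for this example use R):

```python
import math

obs = [926, 288, 293, 104]        # tall cut-leaf, tall potato-leaf,
                                  # dwarf cut-leaf, dwarf potato-leaf
pi0 = [9/16, 3/16, 3/16, 1/16]    # the 9 : 3 : 3 : 1 genetic theory
n = sum(obs)                      # 1611 offspring

E = [n * p for p in pi0]          # expected counts 906.2, 302.1, 302.1, 100.7
x2 = sum((o - e) ** 2 / e for o, e in zip(obs, E))
g2 = 2 * sum(o * math.log(o / e) for o, e in zip(obs, E))

print(round(x2, 2), round(g2, 2))   # 1.47 and 1.48, as reported above
```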
Here is how to do the computations in R using the following code:
It includes step-by-step calculations as well as the built-in chisq.test() function, and produces some nice output, including Pearson and deviance residuals.
Do you recall what the residuals are from linear regression? How would you define them in this context? What do they tell you about the tomato example? 
—
https://onlinecourses.science.psu.edu/stat504/lesson2
Posted on September 12, 2014 by Già Bản