This lesson is an overview of the course content as well as a review of some advanced statistical concepts involving discrete random variables and distributions, relevant for STAT 504 — Analysis of Discrete Data. This Lesson assumes that you have glanced through the Review Materials included in the Start Here! block.
Key concepts:
Objectives:

Useful links:
 Lesson 0 reviews the basic statistical concepts.
 Statistical Applets Introduction to the Practice of Statistics – W.H. Freeman & Co.
 Java Applets for Visualization of Statistical Concepts VESTAC: Visualization of and Experimentation with STAtistical Concepts – Universitair Centrum voor Statistiek – Katholieke Universiteit Leuven
Readings:
 Ch.1 Agresti (2007)
 If you are using other textbooks or editions: Ch.1 Agresti (2013, 2002, 1996)
____________________________________________________________________________________________________________________________
The outline below can be viewed as a general template for approaching data analysis regardless of the type of statistical problem you are dealing with. For example, you can model a continuous response variable such as income, or a discrete response such as the true proportion of U.S. individuals who support new health reform. This approach has five main steps. Each step typically requires an understanding of a number of elementary statistical concepts, e.g., the difference between a parameter to be estimated and the corresponding statistic (estimator).
The General Data Analysis Approach
1.1 – Focus of this Course
The focus of this class is multivariate analysis of discrete data. Modern statistical inference offers many approaches and models for discrete data. We will learn the basic principles of statistical methods and discuss issues relevant to the analysis of counts following a Poisson or other discrete distribution, cross-classified tables of counts (i.e., contingency tables), binary responses such as success/failure records, questionnaire items, judges’ ratings, etc. Our goal is to build a sound foundation that will then allow you to more easily explore and learn many other relevant methods that are used to analyze real-life data. This will be done roughly at the introductory level of the required textbook by A. Agresti (2007).
Basic data are discretely measured responses such as counts, proportions, nominal variables, ordinal variables, continuous variables grouped into a small number of categories, etc. Data examples will be used to help illustrate concepts. The “canned” statistical routines and packages in R and SAS will be introduced for analysis of data sets, but the emphasis will be on understanding the underlying concepts of those procedures. For more detailed theoretical underpinnings you can read A. Agresti (2012).
We will focus on two kinds of problems.
1) The first broad problem deals with describing and understanding the structure of a (discrete) multivariate distribution, that is, the joint and marginal distributions of multivariate categorical variables. Such tasks may focus on displaying and describing associations between categorical variables by using contingency tables, chi-squared tests of independence, and other similar methods. Or, we may explore finding underlying structures, possibly via latent variable models.
2) The second problem is a sort of “generalization” of regression, with a distinction between response and explanatory variables, where the response is discrete. Predictors can be all discrete, in which case we may use log-linear models to describe the relationships. Predictors can also be a mixture of discrete and continuous variables, and we may use something like logistic regression to model the relationship between the response and the predictors. We will explore certain types of Generalized Linear Models, such as logistic and Poisson regressions.
The analysis grid below highlights the focus of this class with respect to the models that you should already be familiar with.
Analysis Grid
1.2 – Discrete Data Types and Examples
Categorical/Discrete/Qualitative data
Measures on categorical or discrete variables consist of assigning observations to one of a number of categories in terms of counts or proportions. The categories can be unordered or ordered (see below).
Counts and Proportions
Counts are variables representing frequency of occurrence of an event:
 Number of students taking this class.
 Number of people who vote for a particular candidate in an election.
Proportions or “bounded counts” are ratios of counts:
 Number of students taking this class divided by the total number of graduate students.
 Number of people who vote for a particular candidate divided by the total number of people who voted.
Discretely measured responses can be:
 Nominal (unordered) variables, e.g., gender, ethnic background, religious or political affiliation
 Ordinal (ordered) variables, e.g., grade levels, income levels, school grades
 Discrete interval variables with only a few values, e.g., number of times married
 Continuous variables grouped into a small number of categories, e.g., income grouped into subsets, blood pressure levels (normal, high-normal, etc.)
We will learn and evaluate mostly parametric models for these responses.
Measurement Scale and Context
Interval variables have a numerical distance between two values (e.g. income)
Measurement hierarchy:
 nominal < ordinal < interval
 Methods applicable for one type of variable can be used for the variables at higher levels too (but not at lower levels). For example, methods specifically designed for ordinal data should NOT be used for nominal variables, but methods designed for nominal can be used for ordinal. However, it is good to keep in mind that such analysis method will be less than optimum as it will not be using the fullest amount of information available in the data.
Example: Grades
 Nominal: pass/fail
 Ordinal: A,B,C,D,F
 Interval: 4,3,2.5,2,1
Note that many variables can be considered as either nominal or ordinal depending on the purpose of the analysis. Consider majors in English, Psychology and Computer Science. This classification may be considered nominal or ordinal depending on whether there is an intrinsic belief that it is ‘better’ to have a major in Computer Science than in Psychology or in English. Generally speaking, for a binary variable like pass/fail, it does not matter whether the variable is treated as ordinal or nominal.
Context is important! The context of the study and the relevant questions of interest are important in specifying what kind of variable we will analyze. For example,
 Did you get the flu? (Yes or No) is a binary nominal categorical variable
 What was the severity of your flu? (Low, Medium, or High) is an ordinal categorical variable
Based on the context we also decide whether a variable is a response (dependent) variable or an explanatory (independent) variable.
Discuss the following question on the ANGEL Discussion Board:
Why do you think the measurement hierarchy matters, and how does it influence analysis? That is, why do we recommend that statistical methods/models designed for variables at a higher level not be used to analyze variables at lower levels of the hierarchy?
Contingency Tables
 A statistical tool for summarizing and displaying results for categorical variables
 Must have at least two categorical variables, each with at least two levels (2 × 2 table)
 May have several categorical variables, each at several levels (I_{1} × I_{2} × I_{3} × … × I_{k} tables)
 Place counts of each combination of the variables in the appropriate cells of the table.
Here are a few simple examples of contingency tables.
Example: Admissions Data
A university offers only two degree programs: English and Computer Science. Admission is competitive and there is a suspicion of discrimination against women in the admission process. Here is a two-way table of all applicants by sex and admission status. These data show an association between the sex of the applicants and their success in obtaining admission.
          Male   Female   Total
Admit       35       20      55
Deny        45       40      85
Total       80       60     140
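As a rough sketch of how the association in the admissions table might be quantified, the Python snippet below computes the admission rate within each sex, the sample odds ratio, and the Pearson chi-squared statistic for independence from the four cell counts. The variable names are illustrative and not part of the course materials.

```python
# Observed counts from the admissions table
# (rows: Admit, Deny; columns: Male, Female).
observed = [[35, 20],
            [45, 40]]

row_totals = [sum(row) for row in observed]          # Admit and Deny totals
col_totals = [sum(col) for col in zip(*observed)]    # Male and Female totals
n = sum(row_totals)                                  # grand total

# Admission rate within each sex (column proportions).
admit_rate_male = observed[0][0] / col_totals[0]
admit_rate_female = observed[0][1] / col_totals[1]

# Sample odds ratio of admission, males versus females.
odds_ratio = (observed[0][0] * observed[1][1]) / (observed[0][1] * observed[1][0])

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected,
# where expected = row total * column total / n under independence.
chi_sq = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (observed[i][j] - expected) ** 2 / expected

print(f"admit rate (male)   = {admit_rate_male:.4f}")
print(f"admit rate (female) = {admit_rate_female:.4f}")
print(f"odds ratio          = {odds_ratio:.4f}")
print(f"chi-squared         = {chi_sq:.4f}")
```

An odds ratio above 1 indicates higher odds of admission for males in this sample; whether the difference is larger than chance alone would produce is a separate inferential question.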

Example: Number of Delinquent Children by the County and the Head of Household Education Level
Source: OMB Statistical Policy Working Paper 22
This is another example of a two-way table, in this case a 4×4 table. The variable County could be treated as nominal, whereas the Education Level of Head of Household can be treated as an ordinal variable. Questions to ask, for example: (1) What is the distribution of the number of delinquent children per county given the education level of the head of the household? (2) Is there a trend in where the delinquent children reside given the education levels?
County    Low   Medium   High   Very High   Total
Alpha      15        1      3           1      20
Beta       20       10     10          15      55
Gamma       3       10     10           2      25
Delta      12       14      7           2      35
Total      50       35     30          20     135
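Question (1) asks for the distribution of delinquent children across counties given the education level. A minimal sketch of that calculation in Python follows: each column of counts is divided by its column total. The list names are illustrative.

```python
# Counts of delinquent children: rows are counties, columns are education
# levels of the head of household (Low, Medium, High, Very High).
counties = ["Alpha", "Beta", "Gamma", "Delta"]
levels = ["Low", "Medium", "High", "Very High"]
counts = [
    [15, 1, 3, 1],
    [20, 10, 10, 15],
    [3, 10, 10, 2],
    [12, 14, 7, 2],
]

col_totals = [sum(col) for col in zip(*counts)]  # one total per education level

# Conditional distribution of county given education level:
# each column of counts divided by its column total.
cond = [[counts[i][j] / col_totals[j] for j in range(len(levels))]
        for i in range(len(counties))]

print("County   " + "".join(f"{lev:>11}" for lev in levels))
for name, row in zip(counties, cond):
    print(f"{name:<9}" + "".join(f"{p:>11.3f}" for p in row))
```

Each printed column sums to 1, so the columns can be compared directly even though the education groups differ in size.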

 Ordinal and nominal variables
 Fixed total
Example: Census Data
Source: American Fact Finder website (U.S. Census Bureau: Block level data)
This is an example of a 2×2×4 three-way table that cross-classifies a population from a PA census block by Sex, Age and Race, where all three variables are nominal.
Example: Clinical Trial of Effectiveness of an Analgesic Drug
Source: Koch et al. (1982)
 This is a four-way table (2×2×2×3 table) because it cross-classifies observations by four categorical variables: Center, Status, Treatment and Response
 Fixed number of patients in two Treatment groups
 Small counts
We will see throughout this course that there are many different methods to analyze data that can be represented in contingency tables.
Example of proportions in the news
You should already be familiar with the simple analysis of estimating a population proportion of interest, computing a 95% confidence interval, and interpreting the margin of error (MOE).
Notation:
 Population proportion = p (sometimes we use π)
 Population size = N
 Sample proportion = $\hat{p}$ = X/n = (# with a trait) / (total #)
 Sample size = n
 X is the number of units with a particular trait, or number of success.
The Rule for Sample Proportions
 If numerous samples of size n are taken, the frequency curve of the sample proportions from the various samples will be approximately normal with mean p and standard deviation $\sqrt{p(1-p)/n}$.
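The rule above leads to the usual large-sample 95% confidence interval for a proportion, with the standard deviation sqrt(p(1−p)/n) estimated by plugging in the sample proportion. The sketch below uses a hypothetical poll (560 of 1000 respondents); these numbers are made up for illustration and do not come from the lesson.

```python
import math

# Hypothetical poll: x respondents out of n have the trait of interest.
x, n = 560, 1000

p_hat = x / n                              # sample proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard deviation of p_hat
z = 1.96                                   # approximate 97.5th percentile of N(0, 1)
moe = z * se                               # margin of error for a 95% interval
ci = (p_hat - moe, p_hat + moe)

print(f"p_hat = {p_hat:.3f}, MOE = {moe:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The margin of error shrinks at the rate 1/sqrt(n), so quadrupling the sample size roughly halves it.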
1.3 – Discrete Distributions
Statistical inference requires assumptions about the probability distribution (i.e., random mechanism, sampling model) that generated the data. For example, for a t-test we assume that the random variable follows a normal distribution. For discrete data, the key distributions are Bernoulli, Binomial, Poisson and Multinomial. A more or less thorough treatment is given here. The mathematics is for those who are interested, but the results and their applications are important.
Recall, a random variable is the outcome of an experiment (i.e. a random process) expressed as a number. We use capital letters near the end of the alphabet (X, Y, Z, etc.) to denote random variables. Random variables are of two types: discrete and continuous. Here we are interested in distributions of discrete random variables.
A discrete random variable X is described by a probability mass function (PMF), which we will also call a “distribution,” f(x) = P(X = x). The set of x-values for which f(x) > 0 is called the support. The support can be finite, e.g., X can take values in {0,1,2,…,n}, or countably infinite if X takes values in {0,1,…}. Note that if the distribution depends on unknown parameter(s) θ, we can write it as f(x; θ) (preferred by frequentists) or f(x | θ) (preferred by Bayesians).
Here are some distributions that you may encounter when analyzing discrete data.
Bernoulli distribution
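The most basic of all discrete random variables is the Bernoulli. X is said to have a Bernoulli distribution if X = 1 occurs with probability π and X = 0 occurs with probability 1 − π,
$$f(x) = \begin{cases} \pi & x = 1 \\ 1-\pi & x = 0 \\ 0 & \text{otherwise} \end{cases}$$
Another common way to write it is:
$$f(x) = \pi^{x}(1-\pi)^{1-x}, \quad x = 0, 1$$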
Suppose an experiment has only two possible outcomes, “success” and “failure,” and let π be the probability of a success. If we let X denote the number of successes (either zero or one), then X will be Bernoulli. The mean of a Bernoulli is
$$E(X) = 1(\pi) + 0(1-\pi) = \pi$$
and the variance of a Bernoulli is
$$V(X) = E(X^2) - [E(X)]^2 = 1^2\pi + 0^2(1-\pi) - \pi^2 = \pi(1-\pi)$$
Binomial distribution
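Suppose that X_{1}, X_{2}, …, X_{n} are independent and identically distributed (iid) Bernoulli random variables, each having the distribution
$$f(x_i) = \pi^{x_i}(1-\pi)^{1-x_i}, \quad x_i = 0, 1, \quad 0 \le \pi \le 1$$
Let X = X_{1} + X_{2} + … + X_{n}. Then X is said to have a binomial distribution with parameters n and π,
$$X \sim \text{Bin}(n, \pi)$$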
Suppose that an experiment consists of n repeated Bernoulli-type trials, each trial resulting in a “success” with probability π and a “failure” with probability 1 − π. For example, toss a coin 100 times (n = 100) and count the number of times you observe heads, X = # of heads. If all the trials are independent (that is, if the probability of success on any trial is unaffected by the outcome of any other trial, as with coin tosses, which do not affect each other), then the total number of successes in the experiment will have a binomial distribution. The binomial distribution can be written as
$$f(x) = \frac{n!}{x!(n-x)!}\pi^{x}(1-\pi)^{n-x}, \quad x = 0, 1, 2, \ldots, n$$
The Bernoulli distribution is a special case of the binomial with n = 1. That is, X∼Bin(1,π) means that X has a Bernoulli distribution with success probability π.
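One can show algebraically that if X∼Bin(n,π) then E(X)=nπ and V(X)=nπ(1−π). An easier way to arrive at these results is to note that X = X_{1} + X_{2} + … + X_{n}, where X_{1}, X_{2}, …, X_{n} are iid Bernoulli random variables. Then, by the additive properties of mean and variance,
$$E(X) = E(X_1) + E(X_2) + \cdots + E(X_n) = n\pi$$
and
$$V(X) = V(X_1) + V(X_2) + \cdots + V(X_n) = n\pi(1-\pi)$$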
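The binomial mean nπ and variance nπ(1−π) can also be checked numerically by summing over the support of the PMF. The following is a small sketch using only the Python standard library; the function name `binom_pmf` is an illustrative choice.

```python
from math import comb

def binom_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

n, pi = 100, 0.5   # e.g., 100 tosses of a fair coin
support = range(n + 1)

total = sum(binom_pmf(x, n, pi) for x in support)
mean = sum(x * binom_pmf(x, n, pi) for x in support)
var = sum((x - mean) ** 2 * binom_pmf(x, n, pi) for x in support)

print(f"sum of PMF = {total:.6f}")
print(f"mean       = {mean:.6f}  (n*pi        = {n * pi})")
print(f"variance   = {var:.6f}  (n*pi*(1-pi) = {n * pi * (1 - pi)})")
```

Setting n = 1 in `binom_pmf` gives the Bernoulli PMF as a special case.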
Note that X will not have a binomial distribution if the probability of success π is not constant from trial to trial, or if the trials are not entirely independent (i.e., a success or failure on one trial alters the probability of success on another trial). For example, if we sample without replacement from a finite population, then the hypergeometric distribution is appropriate.
As n increases, for fixed π, the binomial distribution approaches the normal distribution N(nπ, nπ(1−π)).
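The normal approximation to the binomial noted above can be illustrated by comparing an exact binomial probability with its normal counterpart. This is only a sketch: the choice of n = 100, π = 0.5, and the cutoff 55 is arbitrary, and the +0.5 continuity correction is a common refinement that the text above does not mention.

```python
from math import comb, erf, sqrt

def binom_cdf(x, n, pi):
    """P(X <= x) for X ~ Bin(n, pi), by direct summation of the PMF."""
    return sum(comb(n, k) * pi**k * (1 - pi)**(n - k) for k in range(x + 1))

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, pi = 100, 0.5                      # e.g., 100 tosses of a fair coin
mu, sigma = n * pi, sqrt(n * pi * (1 - pi))

x = 55
exact = binom_cdf(x, n, pi)
# The +0.5 continuity correction is an added assumption, not part of the lesson text.
approx = normal_cdf((x + 0.5 - mu) / sigma)

print(f"P(X <= {x}): exact = {exact:.4f}, normal approximation = {approx:.4f}")
```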
Hypergeometric distribution
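Suppose there are n objects, of which n_{1} are of type 1 and n_{2} = n − n_{1} are of type 2. Suppose we draw m (less than n) objects at random and without replacement from this population. A classic example is a box with n balls, of which n_{1} are red and n_{2} are blue. What is the probability of having t red balls in a draw of m balls? Let N_{1} denote the number of type 1 objects (red balls) in the draw. Then the PMF of N_{1} is
$$P(N_1 = t) = \frac{\binom{n_1}{t}\binom{n_2}{m-t}}{\binom{n}{m}}, \quad \max(0, m-n_2) \le t \le \min(n_1, m)$$
The expectation and variance of N_{1} are given by
$$E(N_1) = \frac{n_1 m}{n} \quad \text{and} \quad V(N_1) = \frac{n_1 n_2 m (n-m)}{n^2 (n-1)}$$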
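As a small worked example of the hypergeometric distribution, the sketch below evaluates the PMF for a hypothetical box of 6 red and 4 blue balls with 5 drawn without replacement, then compares the mean and variance computed from the PMF with the closed forms n_1 m / n and n_1 n_2 m (n − m) / (n^2 (n − 1)). The box contents are invented for illustration.

```python
from math import comb

def hypergeom_pmf(t, n1, n2, m):
    """P(N1 = t): t type-1 objects in m draws without replacement
    from n1 type-1 and n2 type-2 objects."""
    if t < max(0, m - n2) or t > min(n1, m):
        return 0.0
    return comb(n1, t) * comb(n2, m - t) / comb(n1 + n2, m)

# Hypothetical box: 6 red and 4 blue balls, draw 5 without replacement.
n1, n2, m = 6, 4, 5
n = n1 + n2

pmf = {t: hypergeom_pmf(t, n1, n2, m) for t in range(m + 1)}
mean = sum(t * p for t, p in pmf.items())
var = sum((t - mean) ** 2 * p for t, p in pmf.items())

for t, p in pmf.items():
    print(f"P(N1 = {t}) = {p:.4f}")
print(f"mean = {mean:.4f}, formula = {n1 * m / n:.4f}")
print(f"var  = {var:.4f}, formula = {n1 * n2 * m * (n - m) / (n**2 * (n - 1)):.4f}")
```

Note that P(N1 = 0) is zero here: with only 4 blue balls, a draw of 5 must contain at least one red ball.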
Poisson distribution
Let X∼Poisson(λ) (this notation means “X has a Poisson distribution with parameter λ”); then the probability distribution is
$$f(x) = P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$$
Note that E(X)=V(X)=λ, and the parameter λ must always be positive; negative values are not allowed.
The Poisson distribution is an important probability model. It is often used to model discrete events occurring in time or in space.
The Poisson is also a limiting case of the binomial. Suppose that X∼Bin(n,π) and let n→∞ and π→0 in such a way that nπ→λ, where λ is a constant. Then, in the limit, X∼Poisson(λ). Because the Poisson is a limit of the Bin(n,π), it is useful as an approximation to the binomial when n is large and π is small. That is, if n is large and π is small, then
$$\frac{n!}{x!(n-x)!}\pi^{x}(1-\pi)^{n-x} \approx \frac{\lambda^{x} e^{-\lambda}}{x!} \qquad (1)$$
where λ=nπ. The right-hand side of (1) is typically less tedious and easier to calculate than the left-hand side.
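To see how close the Poisson approximation to the binomial can be, the sketch below tabulates both PMFs side by side for n = 1000 and π = 0.002, so that λ = nπ = 2. These values are chosen only for illustration.

```python
from math import comb, exp, factorial

def binom_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi)."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam**x * exp(-lam) / factorial(x)

n, pi = 1000, 0.002      # large n, small pi (illustrative values)
lam = n * pi             # lambda = n * pi

max_diff = 0.0
for x in range(11):
    b, p = binom_pmf(x, n, pi), poisson_pmf(x, lam)
    max_diff = max(max_diff, abs(b - p))
    print(f"x={x:2d}  binomial={b:.6f}  poisson={p:.6f}")

print(f"largest absolute difference for x = 0..10: {max_diff:.2e}")
```

Rerunning with a larger π (say n = 20, π = 0.1, which gives the same λ) shows the approximation degrading, in line with the "n large, π small" condition.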
For example, let X be the number of emails arriving at a server in one hour. Suppose that in the long run, the average number of emails arriving per hour is λ. Then it may be reasonable to assume X∼Poisson(λ). For the Poisson model to hold, however, the average arrival rate λ must be fairly constant over time; i.e., there should be no systematic or predictable changes in the arrival rate. Moreover, the arrivals should be independent of one another; i.e., the arrival of one email should not make the arrival of another email more or less likely.
When some of these assumptions are violated, in particular when there is overdispersion (i.e., the observed variance is greater than what the model assumes), the Negative Binomial distribution can be used instead of the Poisson.
Overdispersion
Count data often exhibit variability exceeding that predicted by the binomial or Poisson. This phenomenon is known as overdispersion.
Consider, for example, the number of fatalities from auto accidents that occur next week in Centre County, PA. The Poisson distribution assumes that each person has the same probability of dying in an accident. However, it is more realistic to assume that these probabilities vary due to
 whether the person was wearing a seat belt
 time spent driving
 where they drive (urban or rural driving)
Person-to-person variability in causal covariates such as these causes more variability than predicted by the Poisson distribution.
Let X be a random variable with conditional variance V(X|λ). Suppose λ is also a random variable with θ=E(λ). Then E(X)=E[E(X|λ)] and V(X)=E[V(X|λ)]+V[E(X|λ)].
For example, when X|λ has a Poisson distribution, then E(X)=E[λ]=θ (so the mean stays the same) but V(X)=E[λ]+V(λ)=θ+V(λ)>θ (the variance is no longer θ but larger).
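When X|π is a binomial random variable and π∼Beta(α,β), then $E(\pi) = \frac{\alpha}{\alpha+\beta} = \lambda$ and $V(\pi) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$.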
Thus, E(X)=nλ (as expected the same) but the variance is larger V(X)=nλ(1−λ)+n(n−1)V(π)>nλ(1−λ).
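The variance inflation in the Poisson case above, V(X) = θ + V(λ), can be checked exactly with a very simple mixing distribution. The sketch below assumes a hypothetical population in which half the units have rate 2 and half have rate 8; this two-point distribution for λ is an illustration only and not a model used in the lesson.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam**x * exp(-lam) / factorial(x)

# Hypothetical two-group population: half the units have rate 2, half have rate 8.
rates = [2.0, 8.0]
weights = [0.5, 0.5]

theta = sum(w * lam for w, lam in zip(weights, rates))                   # E(lambda)
var_lam = sum(w * (lam - theta) ** 2 for w, lam in zip(weights, rates))  # V(lambda)

# Marginal PMF of X: weighted average of the two Poisson PMFs.
# The support is truncated at 100, where the remaining tail mass is negligible.
xs = range(101)
pmf = [sum(w * poisson_pmf(x, lam) for w, lam in zip(weights, rates)) for x in xs]

mean_x = sum(x * p for x, p in zip(xs, pmf))
var_x = sum((x - mean_x) ** 2 * p for x, p in zip(xs, pmf))

print(f"E(X) = {mean_x:.4f}   theta             = {theta:.4f}")
print(f"V(X) = {var_x:.4f}   theta + V(lambda) = {theta + var_lam:.4f}")
```

The mean matches a single Poisson with rate θ, but the variance is larger by exactly V(λ), which is the overdispersion described above.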
Negative-Binomial distribution
When the data display overdispersion, the analyst is more likely to use the negative-binomial distribution instead of the Poisson to model the data.
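Suppose a random variable X|λ∼Poisson(λ) and λ∼Gamma(α,β), where β is taken here to be the scale parameter of the Gamma distribution. Then the joint distribution of X and λ is:
$$p(X = k, \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!} \cdot \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \lambda^{\alpha-1} e^{-\lambda/\beta}$$
Thus the marginal distribution of X is negative-binomial (i.e., a Poisson-Gamma mixture):
$$p(X = k) = \int_0^\infty p(X = k, \lambda)\, d\lambda = \frac{\Gamma(k+\alpha)}{\Gamma(k+1)\Gamma(\alpha)} \left(\frac{\beta}{1+\beta}\right)^{k} \left(\frac{1}{1+\beta}\right)^{\alpha}$$
with
$$E(X) = E[E(X|\lambda)] = E[\lambda] = \alpha\beta$$
$$V(X) = E[V(X|\lambda)] + V[E(X|\lambda)] = E[\lambda] + V(\lambda) = \alpha\beta + \alpha\beta^2 = \alpha\beta(1+\beta)$$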
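A quick numerical check of the Poisson-Gamma mixture is sketched below. It assumes that β is the scale parameter of the Gamma distribution (so that E(λ) = αβ and V(λ) = αβ^2); under a rate parameterization the formulas would change accordingly. With that assumption the marginal PMF is Γ(k+α) / (Γ(k+1)Γ(α)) · (β/(1+β))^k · (1/(1+β))^α, and the code confirms that it sums to 1 and has mean αβ and variance αβ(1+β). The parameter values are illustrative.

```python
from math import exp, lgamma, log

def negbin_pmf(k, alpha, beta):
    """P(X = k) for the Poisson-Gamma mixture with Gamma shape alpha and
    SCALE beta (an assumed parameterization)."""
    log_p = (lgamma(k + alpha) - lgamma(k + 1) - lgamma(alpha)
             + k * log(beta / (1 + beta)) - alpha * log(1 + beta))
    return exp(log_p)

alpha, beta = 2.5, 1.5     # illustrative values
ks = range(400)            # truncation point; the tail beyond it is negligible here

pmf = [negbin_pmf(k, alpha, beta) for k in ks]
total = sum(pmf)
mean = sum(k * p for k, p in zip(ks, pmf))
var = sum((k - mean) ** 2 * p for k, p in zip(ks, pmf))

print(f"sum of PMF = {total:.6f}")
print(f"mean = {mean:.4f}   alpha*beta          = {alpha * beta:.4f}")
print(f"var  = {var:.4f}   alpha*beta*(1+beta) = {alpha * beta * (1 + beta):.4f}")
```

The variance exceeds the mean by the factor (1 + β), which is what makes this distribution a candidate for overdispersed counts.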
Beta-Binomial distribution
The beta-binomial is a family of discrete probability distributions on a finite support, arising when the probability of success in each of a fixed or known number of Bernoulli trials is either unknown or random. For example, the researcher may believe that the unknown probability π of having the flu is not fixed and not the same for the entire population, but is itself a random variable with its own distribution. In Bayesian analysis, for instance, that distribution describes a prior belief or knowledge about the probability of having the flu based on prior studies. Below, X is what we observe, such as the number of flu cases.
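Suppose X|π∼Bin(n,π) and π∼Beta(α,β). Then the marginal distribution of X is that of a beta-binomial random variable
$$P(X = k) = \binom{n}{k}\frac{B(\alpha+k, \beta+n-k)}{B(\alpha, \beta)}, \quad k = 0, 1, \ldots, n, \quad \text{where } B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$
with
$$E(X) = \frac{n\alpha}{\alpha+\beta}, \qquad V(X) = \frac{n\alpha\beta(\alpha+\beta+n)}{(\alpha+\beta)^2(\alpha+\beta+1)}$$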
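The sketch below evaluates the beta-binomial PMF, C(n,k) · B(α+k, β+n−k) / B(α,β), for illustrative parameter values and compares its variance with that of a binomial whose success probability is fixed at E(π) = α/(α+β). The function names and parameter values are hypothetical choices made for this example.

```python
from math import comb, exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_pmf(k, n, alpha, beta):
    """P(X = k) when X | pi ~ Bin(n, pi) and pi ~ Beta(alpha, beta)."""
    return comb(n, k) * exp(log_beta(alpha + k, beta + n - k) - log_beta(alpha, beta))

n, alpha, beta = 20, 2.0, 6.0      # illustrative values
pmf = [betabinom_pmf(k, n, alpha, beta) for k in range(n + 1)]

mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

lam = alpha / (alpha + beta)       # E(pi)
binom_var = n * lam * (1 - lam)    # variance if pi were fixed at E(pi)
bb_var = n * alpha * beta * (alpha + beta + n) / ((alpha + beta) ** 2 * (alpha + beta + 1))

print(f"mean = {mean:.4f}  (n*alpha/(alpha+beta) = {n * lam:.4f})")
print(f"var  = {var:.4f}  (formula = {bb_var:.4f}, binomial = {binom_var:.4f})")
```

The mean agrees with the binomial mean, while the variance is noticeably larger, which is the overdispersion relative to the binomial.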
—
https://onlinecourses.science.psu.edu/stat504/node/4
Posted on September 11, 2014 by Già Bản