Analysis of Discrete Data #1 – Overview

Posted on September 11, 2014 by


This lesson is an overview of the course content as well as a review of some advanced statistical concepts involving discrete random variables and distributions, relevant for STAT 504 — Analysis of Discrete Data. This Lesson assumes that you have glanced through the Review Materials included in the Start Here! block.

Key concepts:

  • Discrete data types
  • Discrete distributions: Binomial, Poisson, Multinomial
  • Likelihood & Loglikelihood
  • Observed & Expected Information
  • Likelihood-based Confidence Intervals & Tests: Wald Test, Likelihood-ratio test, Score test


Objectives:

  • Learn what discrete data are and their taxonomy
  • Learn the properties of Binomial, Poisson and Multinomial distributions
  • Understand the basic principles of likelihood-based inference and how to apply it to tests and intervals regarding population proportions
  • Introduce the basic SAS and R code

Useful links:


  • Ch.1 Agresti (2007)
  • If you are using other textbooks or editions: Ch.1 Agresti (2013, 2002, 1996)


The outline below can be viewed as a general template for approaching data analysis, regardless of the type of statistical problem you are dealing with. For example, you can model a continuous response variable such as income, or a discrete response such as the true proportion of U.S. individuals who support new health reform. This approach has five main steps. Each step typically requires an understanding of a number of elementary statistical concepts, e.g., the difference between a parameter to be estimated and the corresponding statistic (estimator).

The General Data Analysis Approach


1.1 – Focus of this Course

The focus of this class is multivariate analysis of discrete data. Modern statistical inference offers many approaches and models for discrete data. We will learn the basic principles of statistical methods and discuss issues relevant for the analysis of counts from a discrete distribution such as the Poisson, cross-classified tables of counts (i.e., contingency tables), binary responses such as success/failure records, questionnaire items, judges' ratings, etc. Our goal is to build a sound foundation that will then allow you to more easily explore and learn many other relevant methods that are used to analyze real-life data. This will be done roughly at the introductory level of the required textbook by A. Agresti (2007).

Basic data are discretely measured responses such as counts, proportions, nominal variables, ordinal variables, continuous variables grouped into a small number of categories, etc. Data examples will be used to help illustrate concepts. The “canned” statistical routines and packages in R and SAS will be introduced for analysis of data sets, but the emphasis will be on understanding the underlying concepts of those procedures. For more detailed theoretical underpinnings you can read A. Agresti (2012).

We will focus on two kinds of problems.

1) The first broad problem deals with describing and understanding the structure of a (discrete) multivariate distribution, that is, the joint and marginal distributions of several categorical variables. Such tasks may focus on displaying and describing associations between categorical variables by using contingency tables, chi-squared tests of independence, and other similar methods. Or, we may explore finding underlying structures, possibly via latent variable models.

2) The second problem is a sort of “generalization” of regression, with a distinction between response and explanatory variables, where the response is discrete. Predictors can be all discrete, in which case we may use log-linear models to describe the relationships. Predictors can also be a mixture of discrete and continuous variables, and we may use something like logistic regression to model the relationship between the response and the predictors. We will explore certain types of Generalized Linear Models, such as logistic and Poisson regressions.

The analysis grid below highlights the focus of this class with respect to the models that you should already be familiar with.

Analysis Grid


1.2 – Discrete Data Types and Examples

Categorical/Discrete/Qualitative data

Measures on categorical or discrete variables consist of assigning observations to one of a number of categories in terms of counts or proportions. The categories can be unordered or ordered (see below).

Counts and Proportions

Counts are variables representing frequency of occurrence of an event:

  • Number of students taking this class.
  • Number of people who vote for a particular candidate in an election.

Proportions or “bounded counts” are ratios of counts:

  • Number of students taking this class divided by the total number of graduate students.
  • Number of people who vote for a particular candidate divided by the total number of people who voted.

Discretely measured responses can be:

  • Nominal (unordered) variables, e.g., gender, ethnic background, religious or political affiliation
  • Ordinal (ordered) variables, e.g., grade levels, income levels, school grades
  • Discrete interval variables with only a few values, e.g., number of times married
  • Continuous variables grouped into a small number of categories, e.g., income grouped into subsets, blood pressure levels (normal, high-normal, etc.)

We will learn and evaluate mostly parametric models for these responses.

Measurement Scale and Context

Interval variables have a numerical distance between two values (e.g. income)

Measurement hierarchy:

  • nominal < ordinal < interval
  • Methods applicable for one type of variable can be used for variables at higher levels too (but not at lower levels). For example, methods specifically designed for ordinal data should NOT be used for nominal variables, but methods designed for nominal data can be used for ordinal variables. However, keep in mind that such an analysis will be less than optimal because it does not use all the information available in the data.

Example: Grades

  • Nominal: pass/fail
  • Ordinal: A,B,C,D,F
  • Interval: 4,3,2.5,2,1

Note that many variables can be considered as either nominal or ordinal depending on the purpose of the analysis. Consider majors in English, Psychology and Computer Science. This classification may be considered nominal or ordinal depending on whether there is an intrinsic belief that it is ‘better’ to have a major in Computer Science than in Psychology or in English. Generally speaking, for a binary variable like pass/fail, it does not matter whether it is treated as ordinal or nominal.

Context is important! The context of the study and the relevant questions of interest are important in specifying what kind of variable we will analyze. For example,

  • Did you get the flu? (Yes or No): a binary nominal categorical variable
  • What was the severity of your flu? (Low, Medium, or High): an ordinal categorical variable

Based on the context we also decide whether a variable is a response (dependent) variable or an explanatory (independent) variable.

Discuss the following question on the ANGEL Discussion Board:

Why do you think the measurement hierarchy matters and how does it influence analysis? That is, why do we recommend that statistical methods/models designed for variables at a higher level not be used for the analysis of variables at lower levels of the hierarchy?

Contingency Tables

  • A statistical tool for summarizing and displaying results for categorical variables
  • Must have at least two categorical variables, each with at least two levels (2 × 2 table)
  • May have several categorical variables, each at several levels (I1 × I2 × I3 × … × Ik tables)
  • Place counts of each combination of the variables in the appropriate cells of the table
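As a rough illustration of placing counts in cells, the sketch below cross-classifies a handful of records into a two-way table using only the Python standard library. The records, the function name `two_way_table` and the printed layout are hypothetical, invented for illustration; they are not data from this lesson.

```python
from collections import Counter

# Hypothetical (sex, admission status) records, invented for illustration only.
records = [
    ("Female", "Admitted"), ("Female", "Rejected"), ("Female", "Rejected"),
    ("Male", "Admitted"), ("Male", "Admitted"), ("Male", "Rejected"),
]

def two_way_table(rows):
    """Return the cell counts plus the sorted row and column levels."""
    cells = Counter(rows)                        # one count per (row, column) pair
    row_levels = sorted({r for r, _ in rows})
    col_levels = sorted({c for _, c in rows})
    return cells, row_levels, col_levels

cells, row_levels, col_levels = two_way_table(records)

# Print the table one row at a time, with the row total in the last column.
print("Sex".ljust(8) + "".join(c.ljust(10) for c in col_levels) + "Total")
for r in row_levels:
    counts = [cells[(r, c)] for c in col_levels]
    print(r.ljust(8) + "".join(str(k).ljust(10) for k in counts) + str(sum(counts)))
```

The same idea extends to three-way and higher tables by using longer tuples as keys.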

Here are a few simple examples of contingency tables.

Example: Admissions Data

A university offers only two degree programs: English and Computer Science. Admission is competitive and there is a suspicion of discrimination against women in the admission process. Here is a two-way table of all applicants by sex and admission status. These data show an association between the sex of the applicants and their success in obtaining admission.


Example: Number of Delinquent Children by the County and the Head of Household Education Level

Source: OMB Statistical Policy Working Paper 22

This is another example of a two-way table, in this case a 4×4 table. The variable County could be treated as nominal, whereas the Education Level of the Head of Household can be treated as an ordinal variable. Questions to ask include, for example: (1) What is the distribution of the number of delinquent children per county given the education level of the head of the household? (2) Is there a trend in where the delinquent children reside given the education levels?

  • Ordinal and nominal variables
  • Fixed total

Example: Census Data

Source: American Fact Finder website (U.S. Census Bureau: Block level data)


This is an example of a 2×2×4 three-way table that cross-classifies a population from a PA census block by Sex, Age and Race where all three variables are nominal.

Example: Clinical Trial of Effectiveness of an Analgesic Drug

Source: Koch et al. (1982)


  • This is a four-way table (2×2×2×3 table) because it cross-classifies observations by four categorical variables: Center, Status, Treatment and Response
  • Fixed number of patients in two Treatment groups
  • Small counts

We will see throughout this course that there are many different methods to analyze data that can be represented in contingency tables.

Example of proportions in the news

You should already be familiar with the simple analysis of estimating a population proportion of interest and computing a 95% confidence interval, and with the meaning of the margin of error (MOE).


  • Population proportion = p (sometimes we use π)
  • Population size = N
  • Sample proportion = \hat{p}=X/n = (number with the trait) / (total number sampled)
  • Sample size = n
  • X is the number of units with a particular trait, or the number of successes.

The Rule for Sample Proportions

  • If numerous samples of size n are taken, the frequency curve of the sample proportions (the \hat{p}'s) from the various samples will be approximately normal with mean p and standard deviation \sqrt{p(1-p)/n}
  • In symbols, approximately, \hat{p}\sim N(p,p(1-p)/n), where the second argument is the variance
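The rule above leads to the familiar large-sample interval for a proportion. The sketch below is one minimal way to compute it, assuming the usual plug-in of \hat{p} for p in the standard deviation and the multiplier 1.96 for 95% confidence. The function name `proportion_ci` and the poll numbers are hypothetical.

```python
import math

def proportion_ci(x, n, z=1.96):
    """Large-sample interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = x / n                                # sample proportion X/n
    se = math.sqrt(p_hat * (1 - p_hat) / n)      # estimated standard deviation of p_hat
    moe = z * se                                 # margin of error
    return p_hat, moe, (p_hat - moe, p_hat + moe)

# Hypothetical poll: 520 of 1000 respondents have the trait of interest.
p_hat, moe, (lower, upper) = proportion_ci(520, 1000)
print(f"p_hat = {p_hat:.3f}, MOE = {moe:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

This interval relies on the normal approximation, so it is most trustworthy when n is large and \hat{p} is not too close to 0 or 1.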

1.3 – Discrete Distributions

Statistical inference requires assumptions about the probability distribution (i.e., random mechanism, sampling model) that generated the data. For example, for a t-test we assume that a random variable follows a normal distribution. For discrete data, the key distributions are Bernoulli, Binomial, Poisson and Multinomial. A fairly thorough treatment is given here; the mathematical details are for those who are interested, but the results and their applications are important.

Recall, a random variable is the outcome of an experiment (i.e. a random process) expressed as a number. We use capital letters near the end of the alphabet (X, Y, Z, etc.) to denote random variables. Random variables are of two types: discrete and continuous. Here we are interested in distributions of discrete random variables.

A discrete random variable X is described by a probability mass function (PMF), which we will also call its “distribution,” f(x)=P(X=x). The set of x-values for which f(x) > 0 is called the support. The support can be finite, e.g., X can take the values in {0,1,2,…,n}, or countably infinite, e.g., if X takes values in {0,1,…}. Note, if the distribution depends on unknown parameter(s) θ we can write it as f(x; θ) (preferred by frequentists) or f(x|θ) (preferred by Bayesians).

Here are some distributions that you may encounter when analyzing discrete data.

Bernoulli distribution

The most basic of all discrete random variables is the Bernoulli. X is said to have a Bernoulli distribution if X = 1 occurs with probability π and X = 0 occurs with probability 1 − π:

f(x)=\left\{\begin{array} {cl} \pi & x=1 \\ 1-\pi & x=0 \\ 0 & \text{otherwise} \end{array} \right.

Another common way to write it is: f(x)=\pi^x (1-\pi)^{1-x}\text{ for }x=0,1

Suppose an experiment has only two possible outcomes, “success” and “failure,” and let π be the probability of a success. If we let X denote the number of successes (either zero or one), then X will be Bernoulli. The mean of a Bernoulli is

E(X)=1(\pi)+0(1-\pi)=\pi

and the variance of a Bernoulli is

V(X)=E(X^2)-[E(X)]^2=1^2\pi+0^2(1-\pi)-\pi^2=\pi(1-\pi)

Binomial distribution

Suppose that X_1, X_2,\dots,X_n are independent and identically distributed (iid) Bernoulli random variables, each having the distribution

f(x_i|\pi)=\pi^{x_i}(1-\pi)^{1-x_i}\text{ for }x_i=0,1\; \text{and }\; 0\leq\pi\leq 1

Let X=X_1+X_2+\dots+X_n. Then X is said to have a binomial distribution with parameters n and π,

X\sim \text{Bin}(n,\pi)

Suppose that an experiment consists of n repeated Bernoulli-type trials, each trial resulting in a “success” with probability π and a “failure” with probability 1 − π. For example, toss a coin n = 100 times and let X be the number of heads observed. If all the trials are independent (that is, if the probability of success on any trial is unaffected by the outcome of any other trial, as with coin tosses that do not affect each other), then the total number of successes in the experiment will have a binomial distribution. The binomial distribution can be written as

f(x)=\dfrac{n!}{x!(n-x)!} \pi^x (1-\pi)^{n-x} \text{ for }x=0,1,2,\ldots,n,\; \text{and }\; 0\leq\pi\leq 1.

The Bernoulli distribution is a special case of the binomial with n = 1. That is, X∼Bin(1,π) means that X has a Bernoulli distribution with success probability π.

One can show algebraically that if X∼Bin(n,π) then E(X)=nπ and V(X)=nπ(1−π). An easier way to arrive at these results is to note that X=X_1+X_2+\dots+X_n where X_1,X_2,\dots,X_n are (iid) Bernoulli random variables. Then, by the additive properties of mean and variance (the variance step uses independence),

E(X)=E(X_1)+E(X_2)+\dots+E(X_n)=n\pi

V(X)=V(X_1)+V(X_2)+\dots+V(X_n)=n\pi(1-\pi)


Note that X will not have a binomial distribution if the probability of success π is not constant from trial to trial, or if the trials are not entirely independent (i.e. a success or failure on one trial alters the probability of success on another trial).

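Binomial counts that share a common success probability are additive: \text{if }X_1\sim \text{Bin}(n_1,\pi) \text{ and }X_2\sim \text{Bin}(n_2,\pi) \text{ are independent, then }X_1+X_2 \sim \text{Bin}(n_1+n_2,\pi)

As n increases, for fixed π, the binomial distribution is increasingly well approximated by the normal distribution N(nπ,nπ(1−π)).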
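To make these properties concrete, the sketch below evaluates the binomial PMF directly, checks the mean nπ and variance nπ(1−π) numerically, and compares an exact binomial probability with its normal approximation. The helper names are hypothetical, the values n = 100 and π = 0.5 follow the coin-tossing example, and the continuity correction (using 55.5 rather than 55) is a standard refinement that is assumed here rather than taken from the text.

```python
import math

def binom_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi)."""
    return math.comb(n, x) * pi**x * (1 - pi) ** (n - x)

def normal_cdf(x, mean, var):
    """CDF of N(mean, var), written with the error function."""
    return 0.5 * (1 + math.erf((x - mean) / math.sqrt(2 * var)))

n, pi = 100, 0.5                       # 100 tosses of a fair coin
pmf = [binom_pmf(x, n, pi) for x in range(n + 1)]

# Mean and variance computed directly from the PMF.
mean = sum(x * p for x, p in enumerate(pmf))
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))
print(mean, n * pi)                    # numerical mean next to n*pi
print(var, n * pi * (1 - pi))          # numerical variance next to n*pi*(1-pi)

# P(X <= 55): exact binomial sum vs. normal approximation (continuity-corrected).
exact = sum(pmf[:56])
approx = normal_cdf(55.5, n * pi, n * pi * (1 - pi))
print(exact, approx)
```

Trying a smaller n, or a π close to 0 or 1, shows how the approximation degrades when the binomial is strongly skewed.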

When the trials are not independent, for example when we sample without replacement from a finite population, the binomial model no longer applies and the hypergeometric distribution is appropriate instead.

Hypergeometric distribution

Suppose there are n objects, of which n1 are of type 1 and n2 = n − n1 are of type 2. Suppose we draw m (less than n) objects at random and without replacement from this population. A classic example is a box with n balls, of which n1 are red and n2 are blue. What is the probability of having t red balls in a draw of m balls? Let N1 denote the number of type 1 objects in the draw. Then the PMF of N1 is

p(t) = Pr(N_1 = t) =\frac{\binom{n_1}{ t}\binom{n_2}{m-t}}{\binom{n}{m}},\;\;\;\; t \in [\max(0, m-n_2); \min(n_1, m)]

The expectation and variance of N_1 are given by: E(N_1) =\frac{n_1 m}{n} and V(N_1)=\frac{n_1n_2m(n-m)}{n^2(n-1)}
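The PMF and the moment formulas above can be checked numerically. The following is a minimal sketch that assumes a hypothetical box with 7 red and 13 blue balls and a draw of 5; the function name `hypergeom_pmf` is illustrative.

```python
import math

def hypergeom_pmf(t, n1, n2, m):
    """P(N1 = t): t type 1 objects among m drawn without replacement."""
    if t < max(0, m - n2) or t > min(n1, m):
        return 0.0                     # outside the support
    return math.comb(n1, t) * math.comb(n2, m - t) / math.comb(n1 + n2, m)

# Hypothetical box: 7 red (type 1) and 13 blue (type 2) balls, draw 5.
n1, n2, m = 7, 13, 5
n = n1 + n2
support = range(max(0, m - n2), min(n1, m) + 1)
pmf = {t: hypergeom_pmf(t, n1, n2, m) for t in support}

mean = sum(t * p for t, p in pmf.items())
var = sum((t - mean) ** 2 * p for t, p in pmf.items())
print(sum(pmf.values()))                                  # total probability
print(mean, n1 * m / n)                                   # numerical mean next to n1*m/n
print(var, n1 * n2 * m * (n - m) / (n**2 * (n - 1)))      # variance next to the formula
```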

Poisson distribution

Let X∼Poisson(λ) (this notation means “X has a Poisson distribution with parameter λ”), then the probability distribution is

f(x|\lambda)= Pr(X=x)= \frac{\lambda^x e^{-\lambda}}{x!}, \quad x=0,1,2,\ldots, \text{ and } \lambda>0.

Note that E(X)=V(X)=λ, and the parameter λ must always be positive; negative values are not allowed.

The Poisson distribution is an important probability model. It is often used to model discrete events occurring in time or in space.

The Poisson is also a limiting case of the binomial. Suppose that X∼Bin(n,π) and let n→∞ and π→0 in such a way that nπ→λ, where λ is a constant. Then, in the limit, X∼Poisson(λ). Because the Poisson is the limit of the Bin(n,π), it is useful as an approximation to the binomial when n is large and π is small. That is, if n is large and π is small, then

\dfrac{n!}{x!(n-x)!}\pi^x(1-\pi)^{n-x} \approx \dfrac{\lambda^x e^{-\lambda}}{x!}

where λ=nπ. The right-hand side of this approximation is typically less tedious and easier to calculate than the left-hand side.
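A quick numerical comparison shows how close the two sides are. The sketch below assumes hypothetical values n = 1000 and π = 0.003, so that λ = 3; the helper names are illustrative.

```python
import math

def binom_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi)."""
    return math.comb(n, x) * pi**x * (1 - pi) ** (n - x)

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam**x * math.exp(-lam) / math.factorial(x)

n, pi = 1000, 0.003        # hypothetical: n large, pi small
lam = n * pi

# Side-by-side probabilities for the first few counts.
for x in range(7):
    print(x, round(binom_pmf(x, n, pi), 5), round(poisson_pmf(x, lam), 5))

# Largest pointwise gap over x = 0..50 (both PMFs are negligible beyond that here).
max_gap = max(abs(binom_pmf(x, n, pi) - poisson_pmf(x, lam)) for x in range(51))
print(max_gap)
```

Increasing π while holding n fixed makes the gap grow, which is why the approximation is recommended only for small π.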

For example, let X be the number of emails arriving at a server in one hour. Suppose that in the long run, the average number of emails arriving per hour is λ. Then it may be reasonable to assume X∼Poisson(λ). For the Poisson model to hold, however, the average arrival rate λ must be fairly constant over time; i.e., there should be no systematic or predictable changes in the arrival rate. Moreover, the arrivals should be independent of one another; i.e., the arrival of one email should not make the arrival of another email more or less likely.

When some of these assumptions are violated, in particular when there is overdispersion (i.e., the observed variance is greater than the model assumes), the negative binomial distribution can be used instead of the Poisson.

Count data often exhibit variability exceeding that predicted by the binomial or Poisson. This phenomenon is known as overdispersion.

Consider, for example, the number of fatalities from auto accidents that occur next week in Centre County, PA. The Poisson distribution assumes that each person has the same probability of dying in an accident. However, it is more realistic to assume that these probabilities vary due to

  • whether the person was wearing a seat belt
  • time spent driving
  • where they drive (urban or rural driving)

Person-to-person variability in causal covariates such as these causes more variability than predicted by the Poisson distribution.

Let X be a random variable with conditional variance V(X|λ). Suppose λ is also a random variable with θ=E(λ). Then E(X)=E[E(X|λ)] and V(X)=E[V(X|λ)]+V[E(X|λ)].

For example, when X|λ has a Poisson distribution, then E(X)=E[λ]=θ (so the mean stays the same) but V(X)=E[λ]+V(λ)=θ+V(λ)>θ whenever V(λ)>0 (the variance is no longer θ but larger).
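The inflation of the variance can be verified exactly for a simple mixing distribution. The sketch below assumes, purely for illustration, that λ takes the value 2 or 6 with equal probability, so θ = 4 and V(λ) = 4; it then computes the mean and variance of X from the mixed PMF.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam**x * math.exp(-lam) / math.factorial(x)

# Hypothetical two-point distribution for lambda: value -> probability.
rates = {2.0: 0.5, 6.0: 0.5}
theta = sum(lam * w for lam, w in rates.items())                       # E(lambda)
var_lambda = sum((lam - theta) ** 2 * w for lam, w in rates.items())   # V(lambda)

# Marginal PMF of X, truncated at 100 (the remaining tail mass is negligible here).
pmf = [sum(w * poisson_pmf(x, lam) for lam, w in rates.items()) for x in range(101)]
mean = sum(x * p for x, p in enumerate(pmf))
var = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))

print(mean, theta)               # the mean is unchanged
print(var, theta + var_lambda)   # the variance is inflated by V(lambda)
```

A single Poisson with the same mean would have variance 4, so the mixture here is overdispersed relative to the Poisson.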

Similarly, suppose X|π is a binomial random variable and π∼Beta(α,β). Then E(\pi)=\frac{\alpha}{\alpha+\beta}=\lambda and V(\pi)=\frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.

Thus, E(X)=nλ (the same as before, as expected) but the variance is larger for n>1: V(X)=nλ(1−λ)+n(n−1)V(π)>nλ(1−λ).

Negative-Binomial distribution

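When the data display overdispersion, the analyst is more likely to use the negative-binomial distribution instead of the Poisson to model the data.

Suppose a random variable X|λ∼Poisson(λ) and λ∼Gamma(α,β), where α is the shape parameter and β is the rate parameter. Then the joint distribution of X and λ is:

p(X=k,\lambda)=\frac{\lambda^k e^{-\lambda}}{k!}\cdot\frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}=\frac{\beta^\alpha}{\Gamma(\alpha)k!}\lambda^{k+\alpha-1}e^{-(\beta+1)\lambda}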

Thus the marginal distribution of X is negative-binomial (i.e., Poisson-Gamma mixture):

\begin{eqnarray} p(X=k)&=&\frac{\beta^\alpha}{\Gamma(\alpha)k!}\int^{\infty}_0\lambda^{k+\alpha-1}e^{-(\beta+1)\lambda} d\lambda\\  & = & \frac{\beta^\alpha}{\Gamma(\alpha)k!} \frac{\Gamma(k+\alpha)}{(\beta+1)^{k+\alpha}} \\ & = & \frac{\Gamma(k+\alpha)}{\Gamma(\alpha)\Gamma(k+1)}\left(\frac{\beta}{\beta+1}\right)^\alpha\left(\frac{1}{\beta+1}\right)^k  \end{eqnarray}


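with E(X)=E[E(X|\lambda)]=E[\lambda]=\frac{\alpha}{\beta} and V(X)=E[V(X|\lambda)]+V[E(X|\lambda)]=E[\lambda]+V(\lambda)=\frac{\alpha}{\beta}+\frac{\alpha}{\beta^2}=\frac{\alpha(\beta+1)}{\beta^2}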
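The closed form above is straightforward to evaluate on the log scale with the log-gamma function. The sketch below assumes hypothetical values α = 3 and β = 0.5, and checks that the probabilities sum to one, that the mean equals α/β, and that the variance equals θ + V(λ) = α/β + α/β², using the Gamma variance α/β². The function name `neg_binom_pmf` is illustrative.

```python
import math

def neg_binom_pmf(k, alpha, beta):
    """P(X = k) for the Poisson-Gamma mixture (Gamma shape alpha, rate beta)."""
    log_p = (math.lgamma(k + alpha) - math.lgamma(alpha) - math.lgamma(k + 1)
             + alpha * math.log(beta / (beta + 1))
             - k * math.log(beta + 1))
    return math.exp(log_p)

alpha, beta = 3.0, 0.5     # hypothetical values, so E(lambda) = alpha / beta = 6

# Truncate the infinite support at 500; the omitted tail mass is negligible here.
pmf = [neg_binom_pmf(k, alpha, beta) for k in range(500)]
total = sum(pmf)
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

print(total)                                   # total probability
print(mean, alpha / beta)                      # numerical mean next to alpha/beta
print(var, alpha / beta + alpha / beta**2)     # variance exceeds the mean
```

Working with log-gamma values avoids overflow in Γ(k+α) and k! for large k.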


Beta-Binomial distribution

The beta-binomial is a family of discrete probability distributions on a finite support that arises when the probability of success in each of a fixed or known number of Bernoulli trials is either unknown or random. For example, a researcher may believe that the unknown probability of having the flu, π, is not fixed and not the same for the entire population, but is itself a random variable with its own distribution. In a Bayesian analysis, for instance, that distribution describes prior belief or knowledge about the probability of having the flu based on prior studies. Below, X is what we observe, such as the number of flu cases.

Suppose X|π∼Bin(n,π) and π∼Beta(α,β). Then the marginal distribution of X is that of a beta-binomial random variable

\begin{eqnarray} P(X=k)& = & \binom{n}{k}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int^{1}_0\pi^{k+\alpha-1}(1-\pi)^{n+\beta-k-1} d\pi\\ & = & \frac{\Gamma(n+1)}{\Gamma(k+1)\Gamma(n-k+1)}\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\frac{\Gamma(\alpha+k)\Gamma(n+\beta-k)}{\Gamma(n+\alpha+\beta)} \end{eqnarray}
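As a numerical check on this formula, the sketch below evaluates the beta-binomial PMF with log-gamma functions and compares its mean and variance with nλ and nλ(1−λ)+n(n−1)V(π), where λ=α/(α+β). The values n = 20, α = 2 and β = 6 are hypothetical, as is the function name `beta_binom_pmf`.

```python
import math

def beta_binom_pmf(k, n, alpha, beta):
    """P(X = k) when X | pi ~ Bin(n, pi) and pi ~ Beta(alpha, beta)."""
    log_p = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
             + math.lgamma(alpha + k) + math.lgamma(n + beta - k)
             - math.lgamma(n + alpha + beta))
    return math.exp(log_p)

n, alpha, beta = 20, 2.0, 6.0          # hypothetical values
pmf = [beta_binom_pmf(k, n, alpha, beta) for k in range(n + 1)]

lam = alpha / (alpha + beta)                                      # E(pi)
var_pi = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))  # V(pi)

total = sum(pmf)
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

print(total)                                               # total probability
print(mean, n * lam)                                       # same mean as Bin(n, lam)
print(var, n * lam * (1 - lam) + n * (n - 1) * var_pi)     # larger than n*lam*(1-lam)
```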





Posted in: Food for thought