Statistical Inference

Statistical inference, or “learning” as it is called in CS, is the process of using data to infer the distribution that generated the data. A typical statistical inference question is:

Given a sample $X_1, X_2,\ldots,X_n \sim F$, how do we infer $F$?

## Parametric vs. Nonparametric models

A statistical model is a set of distributions that is assumed to contain the distribution that generated our data. Statistical models are of two kinds: parametric and nonparametric.

### Parametric Models

In this case the statistical model is parametrized by a finite set of parameters.

**Example:** When we assume that the data is generated by a Gaussian, our statistical model is the set of all Gaussian distributions. This statistical model is parametrized by two numbers: the mean and the variance. Our task then becomes estimating the mean and variance of the distribution that generated our data.

**Example:** Bernoulli trials

**Example:** Exponential distribution.

**Example:** The total life of 10,000 bulbs in a one-month period. We do not know the pdf of the survival time of an individual light bulb (it is usually assumed to be exponential, but suppose that we have a new generation of bulbs whose survival pdf has two "humps").

There is an important notational convenience for parametric models: if $\mathbb{F}=\{p(x;\theta) \mid \theta\in\Theta \}$, then

\begin{eqnarray}

P_{\theta}(X\in A) = \int_A p(x;\theta)dx\\

E_{\theta}(r(X)) = \int r(x)p(x;\theta)dx

\end{eqnarray}

and similarly for the variance, $V_{\theta}$. Note that in the usual notation for expectation, $E_X(x)$, the subscript names the random variable we are averaging over. In the new notation the subscript plays a different role: we still average over $X$, not over $\theta$, and $\theta$ only indicates which member of the family the average is taken under. This notation becomes very confusing if not understood properly, especially when investigating the EM algorithm.
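As a minimal sketch of this notation, assume the exponential model $p(x;\theta)=\theta e^{-\theta x}$ for $x \ge 0$. The function names and the midpoint-rule integrator below are illustrative choices, not part of these notes. For a fixed $\theta$, both $P_{\theta}$ and $E_{\theta}$ are plain integrals over $x$; the parameter only selects which density is integrated.

```python
import math

def exp_pdf(x, theta):
    """Density p(x; theta) of the (assumed) exponential model, for x >= 0."""
    return theta * math.exp(-theta * x)

def integrate(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def prob(theta, a, b):
    """P_theta(X in A) for A = [a, b]: integrate the density over A."""
    return integrate(lambda x: exp_pdf(x, theta), a, b)

def expectation(r, theta, upper=60.0):
    """E_theta(r(X)): integrate r(x) p(x; theta) over x.

    The infinite range is truncated at `upper`, which is an assumption
    that the tail beyond it is negligible for the chosen theta.
    """
    return integrate(lambda x: r(x) * exp_pdf(x, theta), 0.0, upper)
```

Calling `prob(2.0, 0.0, 1.0)` approximates $1-e^{-2}$, and `expectation(lambda x: x, 2.0)` approximates the mean $1/\theta = 0.5$. Changing `theta` changes the distribution, but the integration variable is always `x`.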

### Nonparametric Models

Here we generally estimate the whole cdf/pdf rather than a finite set of parameters. A good example is the two-hump light bulb survival pdf discussed above: we can either fit a Gaussian mixture or use nonparametric inference.
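One common nonparametric estimate of a whole cdf is the empirical distribution function, $\hat{F}_n(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le x\}$. The sketch below is only an illustration: the two-hump lifetimes are simulated from a made-up mixture of two Gaussians, and all names and numbers are assumptions rather than anything taken from real bulb data.

```python
import bisect
import random

def make_ecdf(sample):
    """Return the empirical cdf F_hat(x) = (number of X_i <= x) / n."""
    data = sorted(sample)
    n = len(data)

    def ecdf(x):
        # bisect_right counts how many sorted data points are <= x
        return bisect.bisect_right(data, x) / n

    return ecdf

def two_hump_sample(n, rng):
    """Hypothetical lifetimes in hours: half near 500, half near 1500."""
    return [rng.gauss(500, 50) if rng.random() < 0.5 else rng.gauss(1500, 100)
            for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
F_hat = make_ecdf(two_hump_sample(10_000, rng))
```

No parametric form is assumed by `make_ecdf`; the estimate `F_hat` is flat between the two humps simply because almost no data falls there.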

## Fundamental Concepts in Statistical Inference

Many inferential problems can be identified as being one of three types:

- Estimation
- Confidence sets
- Hypothesis Testing

Below, we will give an introduction to each of these.

## Point Estimation

Point estimation provides a single "best guess" for some quantity of interest. This can be a parameter in parametric estimation or a whole cdf in nonparametric estimation.

**Notation:** We denote an estimate of $\theta$ by $\hat{\theta}_n$. If $X_1,\ldots,X_n$ are iid data points from some distribution $F$, a point estimator is a function of the data:

\begin{eqnarray}

\hat{\theta}_n = g(X_1,\ldots,X_n)

\end{eqnarray}

Here $g(\cdot)$ denotes the functional form of the estimator. Note that $\theta$ is a fixed (usually unknown) constant, whereas $\hat{\theta}_n$ is a random variable because it depends on the data. Its distribution is known as the *sampling distribution*.

**Notation:** Before proceeding, it is best to define the notation that we will use in the coming pages: The moments of the random variable $X$ are defined as usual:

\begin{eqnarray}

E(X)=\mu, \qquad V(X)=\sigma^2, \qquad se(X)=\sigma

\end{eqnarray}

Most of the time, the parameters we try to estimate are just these, and the corresponding estimates are the random variables $\hat{\mu}_n, \hat{\sigma}^2_n, \hat{\sigma}_n$. But let us keep things general and assume that we try to estimate a parameter $\theta$, denoting the random variable corresponding to its estimate by $\hat{\theta}_n$. The expectation, variance and standard deviation (the standard error, $se$) of this random variable are denoted by:

\begin{eqnarray}

E(\hat{\theta}_n)=\bar{\theta}_n, \qquad V(\hat{\theta}_n), \qquad se(\hat{\theta}_n)=\sqrt{V(\hat{\theta}_n)}

\end{eqnarray}

**Example:** The sample mean is:

\begin{eqnarray}

\hat{\mu}_n= \frac{1}{n}\sum_{i=1}^n X_i

\end{eqnarray}

If the $X_i$ are iid Gaussian with mean $\mu$ and variance $\sigma^2$, then the sampling distribution is $\hat{\mu}_n \sim N(\mu,\sigma^2/n)$, i.e., $E(\hat{\mu}_n)=\mu$ and $V(\hat{\mu}_n)=\sigma^2/n$. This was already proven using moment generating functions.

If we do not know the distribution of the $X_i$, but know that they are iid with finite variance and that $n$ is large, then by the central limit theorem the sampling distribution of $\hat{\mu}_n$ is approximately Gaussian, $N(\mu,\sigma^2/n)$. The moments $E(\hat{\mu}_n)=\mu$ and $V(\hat{\mu}_n)=\sigma^2/n$ hold exactly for every $n$; only the Gaussian shape is approximate.

When not even the iid property is satisfied, a suitable version of the law of large numbers may still apply (under additional conditions, for example on the dependence between the $X_i$), in which case the sample mean still converges to a mean.
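The moments of the sampling distribution can be checked by simulation. The following is a rough sketch with arbitrarily chosen values ($\mu=3$, $\sigma=2$, $n=25$); the function names are illustrative. It draws many independent samples, computes the sample mean of each, and then looks at the mean and variance of those sample means.

```python
import random
import statistics

def sample_mean(xs):
    """The estimator mu_hat_n = (1/n) * sum of the X_i."""
    return sum(xs) / len(xs)

def simulate_sample_means(n, reps, mu, sigma, rng):
    """Draw `reps` independent Gaussian samples of size n; return their means."""
    return [sample_mean([rng.gauss(mu, sigma) for _ in range(n)])
            for _ in range(reps)]

rng = random.Random(1)  # fixed seed for reproducibility
means = simulate_sample_means(n=25, reps=20_000, mu=3.0, sigma=2.0, rng=rng)

print(statistics.fmean(means))      # should be close to mu = 3.0
print(statistics.pvariance(means))  # should be close to sigma^2 / n = 0.16
```

Each entry of `means` is one realization of the random variable $\hat{\mu}_n$, so the list as a whole is a sample from its sampling distribution.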

**Example:** The sample variance is:

\begin{eqnarray}

\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X})^2

\end{eqnarray}

Note that the sample variance is also a random variable. We will explain very shortly why the denominator is $n-1$ and not $n$.

### Properties of Estimators: Bias

Estimator bias is defined as follows (bias can also be measured with respect to the median rather than the mean):

\begin{eqnarray}

\mathrm{Bias}(\hat{\theta}_n)=E[\hat{\theta}_n]-\theta

\end{eqnarray}

Many estimators we will use will be biased, i.e., they will have nonzero bias.

**Example:** The uncorrected sample variance

\begin{eqnarray}

\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(X_i-\overline{X})^2

\end{eqnarray}

is biased.

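\begin{align}
\operatorname{E}[\hat{\sigma}^2]
&= \operatorname{E}\left[ \frac{1}{n}\sum_{i=1}^n \left(X_i-\hat{\mu}_n\right)^2 \right]
= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n \big((X_i-\mu)-(\hat{\mu}_n-\mu)\big)^2 \bigg] \\[8pt]
&= \operatorname{E}\bigg( \frac{1}{n}\sum_{i=1}^n \bigg[ (X_i-\mu)^2 -
2(\hat{\mu}_n-\mu)(X_i-\mu) +
(\hat{\mu}_n-\mu)^2 \bigg] \bigg) \\[8pt]
&= \operatorname{E}\bigg[ \frac{1}{n}\sum_{i=1}^n \big( (X_i-\mu)^2 - (\hat{\mu}_n-\mu)^2 \big)\bigg]
= \sigma^2 - \operatorname{E}\left[ (\hat{\mu}_n-\mu)^2 \right]
= \sigma^2 - \frac{\sigma^2}{n}
= \frac{n-1}{n}\,\sigma^2 < \sigma^2.
\end{align}

Here $\overline{X}=\hat{\mu}_n$ is the sample mean. The third equality uses $\frac{1}{n}\sum_{i=1}^n (X_i-\mu)=\hat{\mu}_n-\mu$, so the cross term equals $-2(\hat{\mu}_n-\mu)^2$. The last steps use $\operatorname{E}\left[(\hat{\mu}_n-\mu)^2\right]=V(\hat{\mu}_n)=\sigma^2/n$.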

**Example:** The corrected sample variance

\begin{eqnarray}

\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\overline{X})^2

\end{eqnarray}

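is unbiased: rescaling the uncorrected estimator by $n/(n-1)$ exactly cancels its downward bias, since $\operatorname{E}\left[(\hat{\mu}_n-\mu)^2\right]=V(\hat{\mu}_n)=\sigma^2/n$.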
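The difference between the two denominators is easy to see numerically. The sketch below uses arbitrarily chosen values ($\sigma^2=4$, $n=5$) and illustrative function names. It averages both estimators over many simulated Gaussian samples; the $1/n$ version should settle near $\frac{n-1}{n}\sigma^2 = 3.2$ and the $1/(n-1)$ version near $\sigma^2 = 4$.

```python
import random

def variance_estimates(xs):
    """Return (uncorrected, corrected) sample variance of one sample."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    return ss / n, ss / (n - 1)

def average_variance_estimates(n, reps, sigma, rng):
    """Average both estimators over `reps` independent samples of size n."""
    total_uncorrected = 0.0
    total_corrected = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        uncorrected, corrected = variance_estimates(xs)
        total_uncorrected += uncorrected
        total_corrected += corrected
    return total_uncorrected / reps, total_corrected / reps

rng = random.Random(2)  # fixed seed for reproducibility
avg_uncorrected, avg_corrected = average_variance_estimates(
    n=5, reps=50_000, sigma=2.0, rng=rng)
```

A small $n$ is used on purpose: the bias factor $(n-1)/n$ is then far from 1 and clearly visible above the simulation noise.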

**Example:** The number of incoming calls per minute at a customer service location is generally modelled by a Poisson distribution:

\begin{eqnarray}

P(X=x) = \frac{{\lambda}^x e^{-\lambda}}{x!}

\end{eqnarray}

where $\lambda$ is the expectation. Consider the event that two consecutive minutes receive no phone calls. Assuming the two minutes are independent, the probability of this event,

\begin{eqnarray}

P(X=0)^2=e^{-2\lambda}

\end{eqnarray}

can be regarded as a parameter of the distribution, since it is a constant from which $\lambda$, and hence the whole distribution, can be recovered.

Assume that we have an unbiased estimator $\delta(x)$ for $P(X=0)^2=e^{-2\lambda}$ that uses a single observation $x$. Then

\begin{eqnarray}

E[\delta(x)] = \sum_{x=0}^{\infty} \delta(x) \frac{{\lambda}^x e^{-\lambda}}{x!}

=e^{-\lambda}\sum_{x=0}^{\infty} \delta(x) \frac{{\lambda}^x }{x!} =e^{-2\lambda}

\end{eqnarray}

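Dividing by $e^{-\lambda}$ gives $\sum_{x=0}^{\infty} \delta(x)\lambda^x/x! = e^{-\lambda} = \sum_{x=0}^{\infty} (-\lambda)^x/x!$. Since this must hold for every $\lambda$, matching the coefficients of the two power series shows that the only way to satisfy this equality is to define $\delta(x)=(-1)^x$, i.e., a totally absurd estimator: it only ever returns $+1$ or $-1$ as an estimate of a probability.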
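The unbiasedness of $\delta(x)=(-1)^x$ can be confirmed numerically by summing the series for $E[\delta(X)]$ directly. This is a small sketch with illustrative function names; the series is truncated at 200 terms, which is an assumption that holds comfortably for moderate $\lambda$.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def expected_value(delta, lam, terms=200):
    """E[delta(X)] as a truncated series over x = 0, 1, ..., terms - 1."""
    return sum(delta(x) * poisson_pmf(x, lam) for x in range(terms))

def delta_unbiased(x):
    """The estimator (-1)^x forced by the unbiasedness condition."""
    return 1.0 if x % 2 == 0 else -1.0

for lam in (0.5, 1.0, 3.0):
    # The two printed numbers should agree up to floating-point error.
    print(lam, expected_value(delta_unbiased, lam), math.exp(-2.0 * lam))
```

The expectation matches $e^{-2\lambda}$ even though no single value of the estimator is ever close to it, which is what makes the estimator absurd in practice.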

The usual way to design parametric estimators is the maximum likelihood (ML) method. ML estimators are often biased, but they frequently perform better than unbiased alternatives. The ML estimator for the problem above is $e^{-2x}$, which will be calculated in the next chapter.

### Properties of Estimators: Mean square error

Mean-squared error is defined by

\begin{eqnarray}

\mathrm{MSE}=E_{\theta}(\hat{\theta}_n-\theta)^2

\end{eqnarray}

As usual, we assume that the data are generated iid from the distribution.

**Example:** The MSE of the biased ML Poisson estimator given above is

\begin{eqnarray}

e^{-4\lambda}-2e^{\lambda(1/e^2-3)}+e^{\lambda(1/e^4-1)}

\end{eqnarray}

while the MSE of the unbiased estimator is

\begin{eqnarray}

1-e^{-4\lambda}.

\end{eqnarray}

Here we see that the biased estimator is actually better than the unbiased one, as its MSE is smaller.
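Both closed forms can be checked against a direct evaluation of $E_{\lambda}\big[(\delta(X)-e^{-2\lambda})^2\big]$ as a series over the Poisson pmf. The sketch below uses illustrative function names and truncates the series at 200 terms, which assumes a moderate $\lambda$. It compares the ML estimator $e^{-2x}$ with the unbiased estimator $(-1)^x$ for a few values of $\lambda$.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def mse(delta, lam, terms=200):
    """MSE of delta(X) as an estimator of exp(-2*lam), by truncated series."""
    target = math.exp(-2.0 * lam)
    return sum((delta(x) - target) ** 2 * poisson_pmf(x, lam)
               for x in range(terms))

def delta_ml(x):
    """The (biased) ML estimator e^{-2x}."""
    return math.exp(-2.0 * x)

def delta_unbiased(x):
    """The unbiased estimator (-1)^x."""
    return 1.0 if x % 2 == 0 else -1.0

def mse_ml_closed_form(lam):
    return (math.exp(-4.0 * lam)
            - 2.0 * math.exp(lam * (math.exp(-2.0) - 3.0))
            + math.exp(lam * (math.exp(-4.0) - 1.0)))

def mse_unbiased_closed_form(lam):
    return 1.0 - math.exp(-4.0 * lam)

for lam in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(lam, mse(delta_ml, lam), mse(delta_unbiased, lam))
```

For each of the $\lambda$ values tried here, the first printed MSE is the smaller one; the loop only samples a handful of values and is not a proof for all $\lambda$.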

### Properties of Estimators: Bias Variance Dilemma

**Theorem:** The MSE can be written as:

\begin{eqnarray}

\mathrm{MSE}=\mathrm{bias}^2(\hat{\theta}_n)+V_{\theta}(\hat{\theta}_n)

\end{eqnarray}

**Proof:** Let $\bar{\theta}_n=E_{\theta}(\hat{\theta}_n)$ (note that for an unbiased estimator $\bar{\theta}_n=\theta$). Then

\begin{eqnarray}

E_{\theta}(\hat{\theta}_n-\theta)^2 &=& E_{\theta}(\hat{\theta}_n-\bar{\theta}_n+\bar{\theta}_n-\theta)^2 \nonumber\\

&=&E_{\theta}(\hat{\theta}_n-\bar{\theta}_n)^2+2(\bar{\theta}_n-\theta) E_{\theta}(\hat{\theta}_n-\bar{\theta}_n)

+E_{\theta}(\bar{\theta}_n-\theta)^2 \nonumber\\

&=&E_{\theta}(\hat{\theta}_n-\bar{\theta}_n)^2+E_{\theta}(\bar{\theta}_n-\theta)^2 \nonumber\\

&=&V_{\theta}(\hat{\theta}_n)+\mathrm{bias}^2(\hat{\theta}_n)

\end{eqnarray}
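The decomposition can also be observed on simulated data. The sketch below is only an illustration with assumed values: the estimator is the uncorrected sample variance, the data are Gaussian with $\sigma^2=4$, and $n=10$. When the MSE, bias and variance are all computed from the same set of simulated estimates, the identity holds up to floating-point error, because the algebra of the proof applies equally to empirical averages.

```python
import random

def uncorrected_variance(xs):
    """The 1/n sample variance, used here as a deliberately biased estimator."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def decompose_mse(estimates, theta):
    """Return (mse, bias, variance) computed from simulated estimates."""
    reps = len(estimates)
    mean_est = sum(estimates) / reps  # plays the role of theta_bar_n
    mse = sum((e - theta) ** 2 for e in estimates) / reps
    bias = mean_est - theta
    variance = sum((e - mean_est) ** 2 for e in estimates) / reps
    return mse, bias, variance

rng = random.Random(3)  # fixed seed for reproducibility
sigma2 = 4.0
estimates = [uncorrected_variance([rng.gauss(0.0, 2.0) for _ in range(10)])
             for _ in range(20_000)]
mse, bias, variance = decompose_mse(estimates, sigma2)

print(mse, bias ** 2 + variance)  # the two numbers should agree
```

The simulated bias should come out near $-\sigma^2/n = -0.4$, so both terms of the decomposition contribute to the MSE in this case.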

### Properties of Estimators: Consistency

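**Definition:** An estimator $\hat{\theta}_n$ of a parameter $\theta$ is consistent if $\hat{\theta}_n\rightarrow_P \theta$, i.e., if it converges to $\theta$ in probability as $n \rightarrow \infty$.

As a property of estimators, consistency is much easier to deal with than unbiasedness.

**Theorem:** If $\mathrm{bias}\rightarrow 0$ and $\mathrm{se} \rightarrow 0$ as $n \rightarrow \infty$, then $\hat{\theta}_n$ is consistent.

**Proof:** From the bias-variance dilemma theorem, $\mathrm{MSE}=\mathrm{bias}^2+\mathrm{se}^2\rightarrow 0$. By Markov's inequality applied to $(\hat{\theta}_n-\theta)^2$, we have $P(|\hat{\theta}_n-\theta|>\varepsilon)\le \mathrm{MSE}/\varepsilon^2 \rightarrow 0$ for every $\varepsilon>0$, which is convergence in probability.

Note that unbiasedness does not imply consistency: the variance must also go to zero.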
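Consistency can be illustrated by estimating $P(|\hat{\theta}_n-\theta|>\varepsilon)$ for growing $n$. The following sketch makes several assumptions for the sake of a concrete example: the estimator is the sample mean, the data are Uniform(0,1) so that $\theta=0.5$, and $\varepsilon=0.05$. The function name is illustrative.

```python
import random

def deviation_probability(n, eps, reps, rng, mu=0.5):
    """Estimate P(|mean of n Uniform(0,1) draws - mu| > eps) by simulation."""
    count = 0
    for _ in range(reps):
        m = sum(rng.random() for _ in range(n)) / n
        if abs(m - mu) > eps:
            count += 1
    return count / reps

rng = random.Random(4)  # fixed seed for reproducibility
probs = {n: deviation_probability(n, eps=0.05, reps=2_000, rng=rng)
         for n in (10, 100, 1000)}

for n, p in probs.items():
    print(n, p)  # the estimated probability should shrink as n grows
```

The sample mean is unbiased for every $n$, so the decrease in these probabilities comes entirely from its standard error going to zero.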

### Asymptotic Normality

An estimator is asymptotically normal if

\begin{eqnarray}

\frac{\hat{\theta}_n-\theta}{se(\hat{\theta}_n)} \rightsquigarrow \mathrm{N}(0,1)

\end{eqnarray}
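As a rough numerical illustration, take the sample mean of Exponential(1) data, for which $\theta=1$ and the standard error of the sample mean is $1/\sqrt{n}$. These choices, and the function name, are assumptions made for the example. If the standardized estimator is close to $N(0,1)$, about 95 percent of its simulated values should fall within $\pm 1.96$, even though the underlying data are strongly skewed.

```python
import math
import random

def standardized_means(n, reps, rng):
    """Z = (mu_hat_n - mu) / se for Exponential(1) data (mu = 1, se = 1/sqrt(n))."""
    se = 1.0 / math.sqrt(n)
    zs = []
    for _ in range(reps):
        mu_hat = sum(rng.expovariate(1.0) for _ in range(n)) / n
        zs.append((mu_hat - 1.0) / se)
    return zs

rng = random.Random(5)  # fixed seed for reproducibility
zs = standardized_means(n=200, reps=5_000, rng=rng)

# Fraction of standardized values inside the central 95% region of N(0, 1).
inside = sum(abs(z) <= 1.96 for z in zs) / len(zs)
print(inside)  # should be close to 0.95
```

This only probes one feature of the limiting distribution; a histogram or a quantile comparison of `zs` against the standard normal would give a fuller picture.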

## Confidence Sets

A $1-\alpha$ confidence interval for a parameter $\theta$ is an interval $C_n=(a,b)$, where $a=a(X_1,\ldots,X_n)$ and $b=b(X_1,\ldots,X_n)$ are random variables such that $P_{\theta}(\theta \in C_n) \geq 1-\alpha$ for all $\theta\in\Theta$. Note that here the interval $C_n$ is random and $\theta$ is a constant.

When we discuss Bayesian methods, we will treat $\theta$ as a random variable and make probability statements about $\theta$ itself. In particular, we will make statements like: "Given the data, the probability that $\theta$ is in $C_n$ is 95 percent." However, we are not there right now. For the purpose of confidence intervals, $\theta$ is just a constant.

**Example:** In the coin flipping setting, let $C_n = (\hat{p}_n - \varepsilon_n, \hat{p}_n + \varepsilon_n)$, where

\begin{eqnarray}

\varepsilon_n^2=\frac{1}{2n}\ln\frac{2}{\alpha}

\end{eqnarray}
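This choice of $\varepsilon_n$ matches Hoeffding's inequality, $P(|\hat{p}_n-p|>\varepsilon)\le 2e^{-2n\varepsilon^2}$, whose right-hand side equals $\alpha$ at $\varepsilon=\varepsilon_n$, so the interval should cover $p$ with probability at least $1-\alpha$. The sketch below checks the coverage by simulation; the values $p=0.4$, $n=100$, $\alpha=0.05$ and the function names are assumptions made for illustration.

```python
import math
import random

def hoeffding_interval(xs, alpha):
    """Interval (p_hat - eps, p_hat + eps) with eps^2 = ln(2/alpha) / (2n)."""
    n = len(xs)
    p_hat = sum(xs) / n
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    return p_hat - eps, p_hat + eps

def coverage(p, n, alpha, reps, rng):
    """Fraction of simulated intervals that contain the true p."""
    hits = 0
    for _ in range(reps):
        flips = [1 if rng.random() < p else 0 for _ in range(n)]
        lo, hi = hoeffding_interval(flips, alpha)
        if lo < p < hi:
            hits += 1
    return hits / reps

rng = random.Random(6)  # fixed seed for reproducibility
cov = coverage(p=0.4, n=100, alpha=0.05, reps=5_000, rng=rng)
print(cov)  # should be at least 1 - alpha = 0.95
```

In each repetition the true `p` stays fixed while the interval endpoints change with the data, which is exactly the frequentist reading of $P_{\theta}(\theta \in C_n)$.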