Hypothesis Testing – Marmara Lectures

The standard scenario of hypothesis testing is this: We have a pdf at hand representing the “normal” operation of a system, and we know
all the numerical parameters associated with the pdf.

If we count how many cars pass along a certain street per minute, our pdf is Poisson and we know its rate $\lambda$.
If we measure the heights of military recruits in a country, our pdf is Gaussian and we know $\mu$ and $\sigma$.

Hypothesis testing is mostly used to detect if that pdf has changed or remained the same in a given system. These two alternatives are known as the null and alternative hypotheses:

\begin{itemize}
\item Null hypothesis: The pdf remains the same. The system operates as usual.
\item Alternative hypothesis: The pdf is no longer valid. The system is different now.
\end{itemize}

The method to test the null hypothesis is to draw a sample from the system and see how probable the result is, assuming the null hypothesis is true.

It is best to proceed with the help of some examples:

Example 1: Thermal Runaway in a Nuclear Reactor

Nuclear reactors are kept within certain temperature limits by automatic control systems. If these control systems stop
functioning for some reason (malfunction, accident, etc.), the reactor gets hotter and hotter, resulting in a meltdown and an environmental
catastrophe. So it is important to keep the reactor temperature under close observation and to detect when the reactor has departed from its normal
operation, so that malfunctions in the control system are found quickly and the necessary interventions are made.\\
Assume that during normal operation the core temperature of a certain reactor fluctuates around a mean of $1000^{\circ}$C
with a standard deviation of $50^{\circ}$C, and that the
fluctuations conform to a Gaussian distribution. There are two engineers (Turkish and Japanese) responsible for
the safe operation of this reactor. The Japanese engineer claims that something is broken in the reactor and that it has started getting hotter.
The Turkish engineer claims that everything operates normally and there is no need to panic. \\
We have the following elements at hand:
\begin{itemize}
\item The reactor’s temperature pdf in normal operation (Gaussian), with all its parameters (mean $\mu=1000$ and
standard deviation $\sigma=50$).
\item Null hypothesis: The reactor operates normally. No need to panic.
\item Alternative hypothesis: The reactor does not operate normally. Its behaviour is no longer described by the pdf given above. Probably something has broken down and it is getting hotter. Urgent intervention is required.
\end{itemize}

The Japanese engineer insists on measuring the core temperature. It turns out to be $1050^{\circ}$C. This temperature is $50^{\circ}$C above the normal
operation mean. The problem is to choose between
\begin{itemize}
\item Null hypothesis: This extra $50^{\circ}$C is nothing but the normal fluctuation of the reactor temperature around its mean.
Everything is as usual. There is no reason to panic.
\item Alternative hypothesis: This extra $50^{\circ}$C indicates that the reactor is getting hotter. The reactor is not operating normally (i.e.,
its temperature no longer follows the pdf given above) and urgent reaction is required.
\end{itemize}

How to decide? In its normal operation (i.e., when the reactor temperature is normally distributed as $N(1000, 50)$), the probability
of a temperature reading of $1050^{\circ}$C or higher is 0.1587. In other words, roughly one in every six readings will be
$1050^{\circ}$C or higher. So we can simply say that there is not enough evidence of a thermal runaway. Caution: we do not claim that
there is no thermal runaway. The extra $50^{\circ}$C we have observed may be just the first sign of a beginning thermal runaway. But, at this stage, we simply
do not have enough evidence to prove this conclusively.

One month later, the Japanese engineer again suspects a thermal runaway. This time, the temperature measurement yields $1150^{\circ}$C. The probability
that the reactor generates this temperature or higher during its normal operation is 0.0013, i.e., about one in 740. In other words, it is very
improbable for the reactor to generate such a high temperature if it is operating normally, i.e., generating temperature measurements with the
distribution $N(1000, 50)$. This is strong evidence that the reactor is broken and that this distribution is no longer valid.
The Japanese engineer notifies the repair team. Caution: the Japanese engineer does not have conclusive proof that there is a thermal runaway.
The reactor may still be operating under its normal pdf, and what he measured may be a “black swan”.
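
The two tail probabilities quoted in this example can be checked numerically. The following is a minimal sketch, assuming only the $N(1000, 50)$ model stated above; the helper name \texttt{upper\_tail} is ours and is not part of the lecture.

```python
from statistics import NormalDist

# Temperature model under the null hypothesis: mean 1000, standard deviation 50
# (degrees Celsius), as stated in the example.
normal_operation = NormalDist(mu=1000, sigma=50)


def upper_tail(reading):
    """Probability of a reading at least this high during normal operation."""
    return 1.0 - normal_operation.cdf(reading)


p_1050 = upper_tail(1050)  # one standard deviation above the mean
p_1150 = upper_tail(1150)  # three standard deviations above the mean

print(f"P(T >= 1050) = {p_1050:.4f}")
print(f"P(T >= 1150) = {p_1150:.4f}")
```

The first value is large enough to be an everyday fluctuation, while the second is small enough to justify calling the repair team.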

Example 2: Clairvoyance in Guessing Coin Flips

{\bf Example:} Suppose that one of our friends claims that he is a clairvoyant and that he can guess the result of a coin flip correctly more
than $\frac{1}{2}$ of the time on average. In order to test his claim, we make him guess 100 coin flips. He guesses correctly
56 times. He points out that he performed better than average and claims that this proves him right.
Can we agree with him and conclude that he really has some supernatural abilities?\\
The probability of guessing 56 or more coin flips correctly in 100 coin tosses without any “supernatural ability” is
\begin{eqnarray}
\sum_{k=56}^{100} \left( \begin{array}{c}
100\\k
\end{array} \right) 2^{-100} = 0.14
\end{eqnarray}
So it is quite possible that the extra 6 correct guesses are due to chance, and no “supernatural ability” is required. We say that “there is
no strong evidence of clairvoyance”.\\

{\bf Example:} Suppose that we repeat the experiment with another clairvoyant friend and this time he guesses correctly 67 times
out of 100 trials. The probability of 67 or more correct guesses by pure chance is
\begin{eqnarray}
\sum_{k=67}^{100} \left( \begin{array}{c}
100\\k
\end{array} \right) 2^{-100} = 0.0004
\end{eqnarray}
This is an exceedingly small probability. 67 correct guesses are very hard to explain by pure chance alone.
If the coin has not been tampered with, we have strong evidence that our friend is a real clairvoyant.
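
Both binomial tail sums can be evaluated exactly with integer arithmetic. This is a small sketch, assuming a fair coin and independent guesses as in the null hypothesis; the function name \texttt{tail\_probability} is chosen here for illustration.

```python
from math import comb


def tail_probability(successes, trials=100):
    """P(at least `successes` correct guesses) when each guess is a fair 50/50."""
    favourable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favourable / 2**trials


p_56 = tail_probability(56)  # first friend: 56 correct out of 100
p_67 = tail_probability(67)  # second friend: 67 correct out of 100

print(f"P(X >= 56) = {p_56:.4f}")
print(f"P(X >= 67) = {p_67:.6f}")
```

The sum starts at the observed count because the question is how likely a result at least this good is by chance alone.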

Hypothesis testing for population mean $\mu$ for normally distributed populations

Assume we have a sample of size $n$, $X_1,\ldots,X_n$ drawn from a normal distribution. Our null and alternate hypotheses are:
\begin{itemize}
\item Null Hypothesis: The mean of this distribution is $\mu=\mu_0$.
\item Alternate hypothesis: The alternative hypothesis can take one of the three different forms:
\subitem 1) The mean is not $\mu_0$, $\mu \neq \mu_0$.
\subitem 2) The mean is smaller than $\mu_0$, $\mu < \mu_0$.
\subitem 3) The mean is larger than $\mu_0$, $\mu > \mu_0$.
\end{itemize}
2 and 3 are known as “one-sided alternatives” or “single-tail tests”. 1 is known as “two-sided alternative” or “two-tail test”.
Which alternative hypothesis to use depends very much on the problem at hand.

We are required to devise a test to accept or reject the null hypothesis. Depending on the information at hand, we can either
do a z-test, or a t-test.
\begin{itemize}
\item If we know the standard deviation $\sigma=\sigma_0$ (very rare), we do a z-test.
\item If we do not have a clue about the standard deviation (usual case), we do a t-test.
\end{itemize}

Note that, in accordance with the central limit theorem, as our sample size gets larger and larger, the normality assumption
becomes less and less important.
\section{z-test}
Note that since we use the z-test, we know the standard deviation $\sigma_0$. Hence the test statistic is
\begin{eqnarray}
Z = \frac{\bar{X}-\mu_0}{\frac{\sigma_0}{\sqrt{n}}}
\end{eqnarray}
If the null hypothesis is true, $Z$ will have a standard normal distribution. Hence,
\begin{itemize}
\item If the probability of observing a value at least as extreme as $Z$ is very small under the null distribution, we say “there is strong evidence
against the null hypothesis”.
\item If this probability is not small under the null distribution, we say “there is no strong evidence
against the null hypothesis”.
\end{itemize}
But how to decide whether a given $Z$ score is small or large under the null distribution? There are two approaches:
\begin{itemize}
\item Rejection-region approach.
\item p-value approach.
\end{itemize}
\subsection{Rejection Region Approach}
The rejection-region approach requires specifying the probability of rejecting the null hypothesis when the null hypothesis is true.
This probability is usually called the “significance level”, and it is denoted by the Greek letter $\alpha$. $\alpha$ is generally
chosen to be 0.05, i.e. $\alpha=0.05$. The significance level is also the probability of a “type I error”, for reasons explained below.
\begin{itemize}
\item If the test is double-sided, we need to find a $Z_0$ such that $N(|Z|>Z_0)=\alpha$, i.e. $N(Z>Z_0)=\alpha/2$, where $N(\cdot)$ denotes probability under the standard normal distribution. Note that we split the $\alpha$ value
evenly between the two tails. Assuming $\alpha=0.05$ and using a standard
normal table, we find that $Z_0=1.96$. Hence, we do not reject the null hypothesis if $|Z|<1.96$ and reject it
otherwise.
\item If the test is single-sided, we need to find a $Z_0$ such that $N(Z<-Z_0)=\alpha$ (left-sided) or $N(Z>Z_0)=\alpha$ (right-sided), depending on the “side” of
the test. Note that in this case we keep all the $\alpha$ value in a single tail. For $\alpha=0.05$ this gives $Z_0=1.645$: if the test is left-sided, we reject the null if $Z<-1.645$;
if the test is right-sided, we reject the null if $Z>1.645$.
\end{itemize}
$\alpha=0.001$ is also used as a significance level.
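
Instead of a printed table, the critical values can be obtained from the inverse cdf of the standard normal distribution. The sketch below assumes Python's standard \texttt{statistics.NormalDist}; the function names and the side labels \texttt{"two"}, \texttt{"left"} and \texttt{"right"} are illustrative choices, not notation from the lecture. At $\alpha=0.05$ it gives approximately 1.96 (two-sided) and 1.645 (one-sided).

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def critical_value(alpha, two_sided):
    """Z_0 that leaves alpha (one-sided) or alpha/2 (two-sided) in the upper tail."""
    tail = alpha / 2 if two_sided else alpha
    return std_normal.inv_cdf(1.0 - tail)


def reject_null(z, alpha=0.05, side="two"):
    """Rejection-region decision for an observed z-score."""
    if side == "two":
        return abs(z) > critical_value(alpha, two_sided=True)
    if side == "left":
        return z < -critical_value(alpha, two_sided=False)
    if side == "right":
        return z > critical_value(alpha, two_sided=False)
    raise ValueError("side must be 'two', 'left' or 'right'")


print(round(critical_value(0.05, two_sided=True), 3))
print(round(critical_value(0.05, two_sided=False), 3))
```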
\subsection{p-value approach}
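The p-value is the probability of getting the observed test statistic or a more extreme one (“worse”), assuming the null hypothesis is correct. In other words, the p-value
is a measure of the strength of the evidence against the null hypothesis given by our sample data: the smaller the $p$ value, the stronger the evidence against the null hypothesis. A large $p$ value means that the data are compatible with the null hypothesis; it does not prove it.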

Assuming that we obtain $Z_0$ as our Z-score from our sample,
\begin{itemize}
\item If the test is double-sided, $p=N(Z<-|Z_0|)+N(Z>|Z_0|)=2 N(Z>|Z_0|)= 2\Phi(-|Z_0|)$, where $\Phi(\cdot)$ is the cdf of the standard normal distribution.
\item If the test is right-sided, $p=N(Z>Z_0)=1-\Phi(Z_0)$.
\item If the test is left-sided, $p=N(Z<Z_0)=\Phi(Z_0)$.
\end{itemize}
If we have a significance level $\alpha$, we reject the null hypothesis if $p<\alpha$.
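
The three p-value cases translate directly into code. This is a minimal sketch, assuming the standard normal cdf from Python's \texttt{statistics.NormalDist}; the function name \texttt{p\_value} and its \texttt{side} argument are our own labels.

```python
from statistics import NormalDist

phi = NormalDist().cdf  # cdf of the standard normal distribution


def p_value(z0, side="two"):
    """p-value of an observed z-score z0 for a two-, right- or left-sided test."""
    if side == "two":
        return 2.0 * phi(-abs(z0))  # both tails beyond |z0|
    if side == "right":
        return 1.0 - phi(z0)        # upper tail only
    if side == "left":
        return phi(z0)              # lower tail only
    raise ValueError("side must be 'two', 'right' or 'left'")


print(f"{p_value(-2.12):.3f}")
print(f"{p_value(2.12):.3f}")
```

Because the two-sided case uses $|Z_0|$, the sign of the observed z-score does not affect the two-sided p-value.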
\subsection{Examples}
1) If the Z-score is -2.12 and the test is two sided, what is the p-value? Ans: 0.034\\\\
2) If the Z-score is 2.12 and the test is two sided, what is the p-value? Ans: 0.034\\\\
3) A producer of canned food states that each can he sells contains 780 g of food. The standard deviation of the can weights is known to be 16 g. As
part of the quality control process, the producer periodically draws a sample of 25 cans, measures the weights, and tests the null hypothesis
that the mean amount of food in the cans is 780 g. The alternative hypothesis is two-sided. The significance level is taken to be $\alpha=0.05$.
The sample mean is found to be $\bar{X}=776$ g. What is the result of the test?\\
Ans: $Z$-score is
\begin{eqnarray}
Z=\frac{776-780}{\frac{16}{\sqrt{25}}}=-1.25
\end{eqnarray}
From a normal distribution table, we get $\Phi(-1.25)=0.1056$. Hence the p-value is $2\times 0.1056=0.211$. As this is larger than
the significance level $\alpha=0.05$, we do not reject the null hypothesis.\\\\
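As a cross-check that is not part of the original solution, the same decision can be reached with the rejection-region approach, assuming the usual two-sided critical value $Z_0=1.96$ for $\alpha=0.05$:
\begin{eqnarray}
|Z| = 1.25 < 1.96 = Z_0
\end{eqnarray}
The observed $Z$-score lies outside the rejection region, so the null hypothesis is not rejected, in agreement with the p-value.\\\\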
4) The municipality claims that there is less than 2 mg of mercury in the air at the city center. We will test whether there is strong evidence
contrary to their claim. The null and alternate hypotheses are:
\begin{itemize}
\item Null Hypothesis: $\mu=2$
\item Alternate Hypothesis: $\mu>2$
\end{itemize}
This is a right-sided test, because the alternate hypothesis is $\mu>2$. \\\\
5) The steel used in manufacturing cars must be 5 mm thick. Thinner steel will give rise to structural weakness; thicker steel will
result in an unnecessarily heavy car. The null and alternate hypotheses are:
\begin{itemize}
\item Null Hypothesis: $\mu=5$
\item Alternate Hypothesis: $\mu \neq 5$
\end{itemize}
This is a two-sided test. \\\\
Note: The sidedness of a test must be decided by the nature of the problem and never by looking at the collected data! In other words, one should
not use the data to determine what the alternative hypothesis is.\\\\
6) A newly developed drug is claimed to lower blood pressure. In this case we need a left-sided test, since the alternate hypothesis is that the mean blood pressure is lower than before.

\section{Type I and Type II errors, Statistical significance and Power of a Test}
In hypothesis testing it is possible to err in two different ways:
\begin{itemize}
\item A type I error is rejecting the null hypothesis when, in fact, it is true.
\item A type II error is failing to reject the null hypothesis when, in fact, the alternate hypothesis is true.
\end{itemize}
In practice, we never know whether we decided correctly or committed one of these two errors. We only know the probabilities of these events.
\subsection{Significance level of a Test}
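The probability of a type I error is equal to $\alpha$, the significance level.
A test with a small significance level will rarely reject the null hypothesis when it is true. Recall that
\begin{eqnarray}
\alpha = \frac{FN}{FN+TP}
\end{eqnarray}
where “positive” stands for retaining the null hypothesis: $TP$ counts the cases in which a true null hypothesis is retained, and $FN$ counts the cases in which a true null hypothesis is rejected.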
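
The meaning of $\alpha$ as a long-run error rate can be illustrated by simulation. The sketch below is an illustration under stated assumptions, not part of the lecture: it borrows the canned-food numbers ($\mu_0=780$, $\sigma_0=16$, $n=25$), draws many samples with the null hypothesis true, and counts how often a two-sided z-test at $\alpha=0.05$ rejects. The function name, the number of trials and the random seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_one_error_rate(trials=20000, n=25, mu0=780.0, sigma0=16.0,
                        alpha=0.05, seed=1):
    """Fraction of samples drawn under the null that a two-sided z-test rejects."""
    rng = random.Random(seed)                       # fixed seed for repeatability
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(trials):
        # The null hypothesis is true here: data really come from N(mu0, sigma0).
        sample = [rng.gauss(mu0, sigma0) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma0 / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1                         # a type I error
    return rejections / trials


rate = type_one_error_rate()
print(f"observed type I error rate: {rate:.4f}")
```

The observed rate should be close to the chosen $\alpha$, up to simulation noise.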

\subsection{Power of a Test}
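The probability of a type II error is usually denoted by $\beta$. \\\\
The power of a test is the probability of rejecting the null hypothesis, given that it is false. A powerful test will quickly detect that
the conditions have changed and that the null hypothesis is no longer valid. Note that the power equals $1-\beta$:
\begin{eqnarray}
P=1-\beta = \frac{TN}{TN+FP}
\end{eqnarray}
Here $TN$ counts the cases in which a false null hypothesis is rejected and $FP$ the cases in which a false null hypothesis is retained, i.e., “positive” stands for retaining the null hypothesis.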
While it is easy to calculate the significance level of a test, it is not so easy to calculate its power. For power calculations we need to know
the new probability distribution that holds when the alternative hypothesis is true. This knowledge may not be readily available.
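
When a specific alternative is assumed, the power of a z-test follows from the fact that $Z$ is then normal with unit variance and mean $(\mu-\mu_0)/(\sigma_0/\sqrt{n})$. The sketch below uses the canned-food setting with a hypothetical true mean of 770 g; that value and the function name are assumptions made only for illustration.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()


def power_two_sided(mu_true, mu0, sigma0, n, alpha=0.05):
    """Probability that a two-sided z-test rejects when the true mean is mu_true."""
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2)
    # Under the alternative, Z is normal with this mean and standard deviation 1.
    shift = (mu_true - mu0) / (sigma0 / sqrt(n))
    lower_tail = std_normal.cdf(-z_crit - shift)       # P(Z < -z_crit)
    upper_tail = 1.0 - std_normal.cdf(z_crit - shift)  # P(Z > z_crit)
    return lower_tail + upper_tail


# Hypothetical alternative: the filling machine has drifted to a mean of 770 g.
power = power_two_sided(mu_true=770.0, mu0=780.0, sigma0=16.0, n=25)
print(f"power = {power:.3f}, beta = {1.0 - power:.3f}")
```

Setting \texttt{mu\_true} equal to \texttt{mu0} returns $\alpha$, since rejecting is then exactly a type I error.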
\subsection{Tradeoff between significance level and power: Receiver operating characteristic (ROC) curve}
Type I and type II errors pull in opposite directions: if we want to commit very few type I errors, we reduce $\alpha$. But in this case we will
commit a lot of type II errors, and the power of our test will be very low. In contrast, if we pick a large $\alpha$, we will commit a lot
of type I errors and very few type II errors. In this case the power is high, but the test often rejects a null hypothesis that is true.
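
The tradeoff can be made concrete by tabulating the power for several values of $\alpha$; plotting these pairs gives points on a ROC curve. The sketch below reuses the reactor setting ($\mu_0=1000$, $\sigma_0=50$, a single reading) with a right-sided test and a hypothetical broken-reactor mean of $1100^{\circ}$C. The alternative mean, the list of $\alpha$ values and the function name are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()


def power_right_sided(alpha, mu_true, mu0=1000.0, sigma0=50.0, n=1):
    """Power of a right-sided z-test at level alpha when the true mean is mu_true."""
    z_crit = std_normal.inv_cdf(1.0 - alpha)
    shift = (mu_true - mu0) / (sigma0 / sqrt(n))  # mean of Z under the alternative
    return 1.0 - std_normal.cdf(z_crit - shift)


# Hypothetical alternative: the broken reactor runs at a mean of 1100 degrees Celsius.
alphas = [0.001, 0.01, 0.05, 0.10, 0.20]
roc_points = [(a, power_right_sided(a, mu_true=1100.0)) for a in alphas]

for a, power in roc_points:
    print(f"alpha = {a:.3f}  power = {power:.3f}  beta = {1.0 - power:.3f}")
```

Each row is one operating point: moving down the table buys power at the price of more type I errors.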