
Statistical Inference

1. Probability Theory

Set Theory

Basics of Probability Theory

Theorem 1.2.9 and Bonferroni's Inequality

If \(P\) is a probability function and \(A\) and \(B\) are any sets in \(\mathcal{B}\), then
- a. \(P(B\cap A^c)=P(B)-P(A\cap B)\);
- b. \(P(A\cup B)=P(A)+P(B)-P(A\cap B)\);
- c. if \(A\subset B\), then \(P(A)\le P(B)\).

Proof: To establish (a), note that for any sets \(A\) and \(B\) we have \[B=\{B\cap A\}\cup\{B\cap A^c\}\] and therefore \[P(B)=P(\{B\cap A\}\cup\{B\cap A^c\})=P(B\cap A)+P(B\cap A^c),\] since \(B\cap A\) and \(B\cap A^c\) are disjoint.

To establish (b), we use the identity \[A\cup B=A\cup \{B\cap A^c\}\] since \(A\) and \(B\cap A^c\) are disjoint, we have \[P(A\cup B)=P(A)+P(B\cap A^c)=P(A)+P(B)-P(A\cap B)\] from (a).

If \(A\subset B\), then \(A\cap B=A\). Therefore, using (a) we have \[0\le P(B\cap A^c)=P(B)-P(A)\] establishing (c).

Formula (b) of Theorem 1.2.9 gives a useful inequality for the probability of an intersection. Since \(P(A\cup B)\le 1\), (b) gives \[P(A\cup B)=P(A)+P(B)-P(A\cap B)\le 1,\] which, after some rearranging, yields \[P(A\cap B)\ge P(A)+P(B)-1.\] This inequality is a special case of what is known as Bonferroni's Inequality.

Boole’s Inequality

If \(P\) is a probability function, then

  • a. \(P(A)=\sum_{i=1}^{\infty}P(A\cap C_i)\) for any partition \(C_1,C_2,\cdots\);
  • b. \(P(\bigcup_{i=1}^{\infty}A_i)\le\sum_{i=1}^{\infty}P(A_i)\) for any sets \(A_1,A_2,\cdots\).

Proof: Since \(C_1,C_2,\cdots\) form a partition, we have that \(C_i\cap C_j=\emptyset\) for all \(i \ne j\), and \(S=\bigcup_{i=1}^{\infty}C_i\). Hence, \[A=A\cap S=A\cap\left(\bigcup_{i=1}^{\infty}C_i\right)=\bigcup_{i=1}^{\infty}(A\cap C_i)\] where the last equality follows from the Distributive Law. We therefore have \[P(A)=P\left(\bigcup_{i=1}^{\infty}(A\cap C_i)\right)\] Now, since the \(C_i\) are disjoint, the sets \(A\cap C_i\) are also disjoint, and from the properties of a probability function we have \[P\left(\bigcup_{i=1}^{\infty}(A\cap C_i)\right)=\sum_{i=1}^{\infty}P(A\cap C_i),\] establishing (a).

To establish (b), construct a disjoint collection \(A_1^*,A_2^*,\cdots\) with the property that \(\cup_{i=1}^{\infty}A_i^*=\cup_{i=1}^{\infty}A_i\). We define \(A_i^*\) by \[A_1^*=A_1, \quad A_i^*=A_i\setminus\left(\bigcup_{j=1}^{i-1}A_j\right),\quad i=2,3,\cdots,\] where \(A\setminus B=A\cap B^c\). It should be easy to see that \(\cup_{i=1}^{\infty}A_i^*=\cup_{i=1}^{\infty}A_i\), and we have \[P\left(\bigcup_{i=1}^{\infty}A_i\right)=P\left(\bigcup_{i=1}^{\infty}A_i^*\right)=\sum_{i=1}^{\infty}P(A_i^*)\le\sum_{i=1}^{\infty} P(A_i),\] since, by construction, \(A_i^*\subset A_i\). This establishes (b).

Bonferroni's Inequality

If we apply Boole's Inequality to the complements \(A_i^c\), we have \[P\left(\bigcup_{i=1}^{n}A_i^c\right)\le\sum_{i=1}^{n}P(A_i^c).\] Using the facts that \(\bigcup_{i=1}^{n}A_i^c=\left(\bigcap_{i=1}^{n} A_i\right)^c\) and \(P(A_i^c)=1-P(A_i)\), we obtain \[1-P\left(\bigcap_{i=1}^{n} A_i\right)=P\left(\bigcup_{i=1}^{n}A_i^c\right)\le\sum_{i=1}^{n}P(A_i^c)= n-\sum_{i=1}^{n}P(A_i).\] This becomes \[P\left(\bigcap_{i=1}^{n} A_i\right)\ge\sum_{i=1}^{n}P(A_i)-(n-1),\] which is a more general version of Bonferroni's Inequality.
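
As a quick numerical sanity check (not from the text), the sketch below simulates three dependent events built from the same normal draws (an arbitrary choice) and compares the estimated \(P(A_1\cap A_2\cap A_3)\) with the Bonferroni lower bound \(\sum_i P(A_i)-(n-1)\).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

# Three dependent events defined from the same draws (arbitrary choices).
events = [x > -1.0, x < 1.5, np.abs(x) < 2.0]

p_intersection = np.mean(events[0] & events[1] & events[2])
bonferroni_bound = sum(np.mean(a) for a in events) - (len(events) - 1)

print(p_intersection)      # estimated P(A1 ∩ A2 ∩ A3)
print(bonferroni_bound)    # lower bound; should not exceed the estimate above
```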

Conditional Probability and Independence

Random Variables

Distribution Functions

Density and Mass Functions

2. Transformations and Expectations

Distributions of Functions of a Random Variable

Probability integral transformation

Let \(X\) have continuous cdf \(F_{X}(x)\) and define the random variable \(Y\) as \(Y = F_{X}(X)\). Then \(Y\) is uniformly distributed on \((0, 1)\), that is, \(P(Y \le y) = y, 0 < y < 1\).

Let \(F_{X}^{-1}\) be the inverse of the cdf \(F_{X}\). If \(F_{X}\) is strictly increasing, then \(F_{X}^{-1}\) is well defined by \[F_{X}^{-1}(y)=x\iff F_{X}(x)=y.\] However, if \(F_{X}\) is constant on some interval \(x_1\le x \le x_2\), then \(F_{X}^{-1}\) is not well defined by \(F_{X}^{-1}(y)=x\iff F_{X}(x)=y.\) Any \(x\) satisfying \(x_1\le x \le x_2\) satisfies \(F_{X}(x) = y\). This problem is avoided by defining \(F_{X}^{-1}(y)\) for \(0 < y < 1\) by \[F_{X}^{-1}(y) = \inf\; \{x: F_{X}(x) \ge y\}.\] Using this definition, in the constant-interval case above we have \(F_{X}^{-1}(y)=x_1\). At the endpoints of the range of \(y\), \(F_{X}^{-1}(y)\) can also be defined: \(F_{X}^{-1}(1)=\infty\) if \(F_X(x) < 1\) for all \(x\) and, for any \(F_X\), \(F_{X}^{-1}(0) = -\infty.\)

Proof: For \(Y = F_X (X)\) we have, for \(0 < y < 1\), \[\begin{align} P(Y\le y)&=P(F_X (X)\le y)\\ &=P(F_X^{-1}[F_X (X)]\le F_X^{-1}(y))\quad\quad(F_X^{-1} \text{ is increasing})\\ &=P(X\le F_X^{-1}(y))\\ &=F_X(F_X^{-1}(y))\\ &=y\\ \end{align}\] At the endpoints we have \(P(Y\le y)=1\) for \(y \ge 1\) and \(P (Y\le y) = 0\) for \(y \le 0\), showing that \(Y\) has a uniform distribution.
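
A minimal simulation of the theorem (not from the text), assuming for concreteness an exponential(1) population so that \(F_X(x)=1-e^{-x}\); the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=100_000)   # X ~ exponential(1)
y = 1.0 - np.exp(-x)                           # Y = F_X(X) with F_X(x) = 1 - exp(-x)

# If Y ~ uniform(0, 1), then P(Y <= q) should be close to q.
for q in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(q, np.mean(y <= q))
```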

3. Common Families of Distributions

Continuous Distributions

Gamma Distribution

\[f(x|\alpha,\beta)=\frac{1}{\Gamma(\alpha)\beta^{\alpha}}x^{\alpha-1}e^{-x/\beta}\quad 0<x<\infty;\quad\alpha,\beta>0\]

Gamma-Poisson relationship

If \(X\) is a \(gamma(\alpha,\beta)\) random variable, where \(\alpha\) is an integer, then for any \(x\), \[P(X\le x)=P(Y\ge\alpha),\] where \(Y\sim Poisson(x/\beta).\) \[\begin{align} P_{Gamma}(X\le x|\alpha,\beta)&=\frac{1}{\Gamma(\alpha)\beta^{\alpha}}\int_{0}^{x}t^{\alpha-1}e^{-t/\beta}dt\\ &=\frac{1}{(\alpha-1)!\beta^{\alpha}}\left[-t^{\alpha-1}\beta e^{-t/\beta}\bigg|_{0}^{x}+\int_{0}^{x}(\alpha-1)t^{\alpha-2}\beta e^{-t/\beta}dt\right]\\ &=\frac{-1}{(\alpha-1)!\beta^{\alpha-1}}x^{\alpha-1}e^{-x/\beta}+\frac{1}{(\alpha-2)!\beta^{\alpha-1}}\int_{0}^{x}t^{\alpha-2} e^{-t/\beta}dt\\ &=\frac{1}{(\alpha-2)!\beta^{\alpha-1}}\int_{0}^{x}t^{\alpha-2} e^{-t/\beta}dt-\frac{1}{(\alpha-1)!}\left(\frac{x}{\beta}\right)^{\alpha-1}e^{-x/\beta}\\ &=\frac{1}{(\alpha-2)!\beta^{\alpha-1}}\int_{0}^{x}t^{\alpha-2} e^{-t/\beta}dt-Poisson_{x/\beta}(Y=\alpha-1)\\ &=P_{Poisson}(Y\ge\alpha|\frac{x}{\beta})\\ \end{align}\] where \(Y\sim Poisson(x/\beta)\).
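
A quick numerical check of the relationship using SciPy's gamma and Poisson cdfs; the values of \(\alpha\), \(\beta\), and \(x\) below are arbitrary illustrative choices.

```python
from scipy import stats

alpha, beta, x = 4, 2.0, 5.0                     # integer alpha; illustrative values

lhs = stats.gamma.cdf(x, a=alpha, scale=beta)    # P(X <= x), X ~ gamma(alpha, beta)
rhs = stats.poisson.sf(alpha - 1, mu=x / beta)   # P(Y >= alpha), Y ~ Poisson(x / beta)

print(lhs, rhs)                                  # the two values should coincide
```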

Normal Distribution

If \(\mathbf{X} \sim \text{n}(\mathbf{\mu}, \boldsymbol{\Sigma})\) then \(\mathbf{X}\) has pdf \[f_{\mathbf{X}}(x_1, \dots, x_k \vert{} \mathbf{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^k \det\boldsymbol{\Sigma}}} \exp\left(-\frac12 (\mathbf{ x}-\mathbf{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu})\right).\]

Chi-Squared Distribution

The chi-squared distribution with \(p\) degrees of freedom has pdf \[f(x\vert p) = \frac{1}{\Gamma(p/2)2^{p/2}} x^{(p/2)-1} e^{-x/2}, \quad 0< x< \infty.\]

[Some facts]

  1. If \(Z \sim \text{n}(0, 1)\) then \(Z^2 \sim \chi_1^2\)

  2. If \(X_1, \dots, X_n\) are independent \(X_i \sim \chi_{p_i}^2\) then \(X_1 + \cdots + X_n \sim \chi_{p_1 + \cdots + p_n}^2\).
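
A rough Monte Carlo check of the two facts above, using that a \(\chi^2_p\) variable has mean \(p\) and variance \(2p\); the sample size and degrees of freedom are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

z = rng.normal(size=n)
print(np.mean(z**2), np.var(z**2))    # Z^2: mean ~ 1, variance ~ 2 (chi^2_1)

x = rng.chisquare(3, size=n) + rng.chisquare(5, size=n)
print(np.mean(x), np.var(x))          # chi^2_8: mean ~ 8, variance ~ 16
```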

Student’s \(t\)-Distribution

A random variable \(T \sim t_p\) has Student's \(t\)-distribution with \(p\) degrees of freedom if it has pdf \[f_T(t) = \frac{\Gamma\left(\frac{p+1}{2}\right)}{\Gamma\left(\frac{p}{2}\right)} \frac{1}{\sqrt{p\pi}} \frac{1}{(1 + t^2/p)^{(p+1)/2}}, \quad t \in \mathbb R{}\]

If \(p = 1\) then this is the Cauchy distribution.

If \(X_1, \dots, X_n\) are a random sample from \(\text{n}(\mu, \sigma^2)\) then \[\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}.\] This is often taken as the definition. Note that the denominator is independent of the numerator.

[Moments and mgf of \(t\)-distribution] Student’s \(t\) has no mgf because it does not have moments of all orders: \(t_p\) has only \(p-1\) moments. If \(T_p \sim t_p\) then

\[\begin{aligned} \mathbb E{}[T_p] &= 0 \quad p > 1 \\ \text{Var}{}[T_p] &= \frac{p}{p-2} \quad p > 2 \end{aligned}\]

Snedecor's \(F\)-Distribution

Snedecor's \(F\)-distribution A random variable \(X \sim F_{p,q}\) has an \(F\)-distribution with \(p\) and \(q\) degrees of freedom if its pdf is \[f_X(x) = \frac{\Gamma\left(\frac{p+q}{2}\right)}{\Gamma\left(\frac p2 \right)\Gamma\left(\frac q2 \right)} (p/q)^{p/2} \frac{x^{(p/2) - 1}}{(1 + px/q)^{(p+q)/2}}, \quad 0 < x < \infty.\]

If \(X_1, \dots, X_n\) is a random sample from \(\text{n}(\mu_X, \sigma_X^2)\) and \(Y_1, \dots, Y_m\) is an independent random sample from \(\text{n}(\mu_Y, \sigma_Y^2)\), then \[\frac{S^2_X/\sigma^2_X}{S_Y^2/\sigma_Y^2} \sim F_{n-1, m-1}.\] This is often taken as the definition.

[Some facts]

  1. \(X \sim F_{p,q} \,\, \implies \,\, 1/X \sim F_{q,p}\)

  2. \(X \sim t_q \,\, \implies \,\, X^2 \sim F_{1,q}\)

  3. \(X \sim F_{p,q} \,\, \implies \,\, \frac{(p/q)X}{1 + (p/q)X} \sim \text{beta}(p/2, q/2)\)

Multinomial Distribution

Multinomial Distribution Let \(m\) and \(n\) be positive integers and let \(p_1, \dots, p_n \in [0, 1]\) satisfy \(\sum_{i=1}^n p_i = 1\). Then the random vector \((X_1, \dots, X_n)\) has multinomial distribution with m trials and cell probabilities \(p_1, \dots, p_n\) if the joint pmf of \((X_1, \dots, X_n)\) is \[f(x_1, \dots , x_n) = \frac{m!}{x_1! \cdots x_n!} p_1^{x_1} \cdots p_n^{x_n} = m! \prod_{i=1}^n \frac{p_i^{x_i}}{x_i!}\] on the set of \((x_1, \dots, x_n)\) such that each \(x_i\) is a nonnegative integer and \(\sum_{i=1}^n x_i = m\).

The marginal distributions have \(X_i \sim \text{binomial}(m, p_i)\).

[Multinomial Theorem] Let \(m\) and \(n\) be positive integers and let \(\mathcal{A}\) be the set of vectors \(\mathbf{x} = (x_1, \dots, x_n)\) such that each \(x_i\) is a nonnegative integer and \(\sum_{i=1}^n x_i = m\). Then for any real numbers \(p_1, \dots, p_n\), \[(p_1 + \cdots + p_n)^m = \sum_{\mathbf{x} \in \mathcal{A}} \frac{m!}{x_1! \cdots x_n!} p_1^{x_1}\cdots p_n^{x_n}.\]

Exponential Families

[Exponential family 1] A family of pmfs/pdfs is called an exponential family if it can be expressed as \[f(x|\mathbf{\theta}) = h(x)c(\mathbf{\theta})\exp\left( \sum_{i=1}^k w_i(\mathbf{\theta}) t_i(x) \right)\] where \(h(x) \geq 0\) and \(c(\mathbf{\theta}) \geq 0\), the \(t_i(x)\) are real-valued functions of the observation \(x\) that do not depend on \(\mathbf{\theta}\), and the \(w_i(\mathbf{\theta})\) are real-valued functions of \(\mathbf{\theta}\) that do not depend on \(x\).

If \(X\) is a random variable from an exponential family distribution then \[\mathbb E{}\left[ \sum_{i=1}^k \frac{\partial w_i(\mathbf{\theta})}{\partial \theta_j} t_i(X) \right] = - \frac{\partial}{\partial \theta_j} \log c(\mathbf{\theta})\] and \[\text{Var}\left[ \sum_{i=1}^k \frac{\partial w_i(\mathbf{\theta})}{\partial \theta_j} t_i(X) \right] = - \frac{\partial^2}{\partial \theta_j^2} \log c(\mathbf{\theta}) - \mathbb E{}\left[ \sum_{i=1}^{k} \frac{\partial^2 w_i(\mathbf{\theta})}{\partial \theta_j^2} t_i(X) \right]\]
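
As an illustration (my choice of example, not from the text), write the Poisson(\(\lambda\)) family in exponential-family form with \(h(x)=1/x!\), \(c(\lambda)=e^{-\lambda}\), \(w(\lambda)=\log\lambda\), and \(t(x)=x\). The two identities then reduce to \(\mathbb E[X/\lambda]=1\) and \(\text{Var}[X/\lambda]=1/\lambda\), which the sketch below checks by simulation.

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 3.0
x = rng.poisson(lam, size=500_000)

# First identity: E[w'(lam) * t(X)] = -(d/d lam) log c(lam)  =>  E[X / lam] = 1
print(np.mean(x / lam))

# Second identity reduces to Var[X / lam] = 1 / lam for the Poisson family
print(np.var(x / lam), 1.0 / lam)
```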

[Exponential family 2] We can write another parameterisation of the exponential family \[f(x | \mathbf{\eta}) = h(x) c^{*}(\mathbf{\eta}) \exp(\mathbf{\eta} \cdot \mathbf{t}(x))\] where \(\mathbf{\eta}\) is called the natural parameter and the set \(\mathcal{H} = \{\mathbf{\eta} : \int_\mathbb R{} f(x|\eta) \text{d}x < \infty \}\) is called the natural parameter space and is convex.

\(\{\mathbf{\eta}: \mathbf{\eta} = \mathbf{w}(\mathbf{\theta}), \,\, \mathbf{\theta} \in \Theta\} \subseteq \mathcal{H}\). So there may be more parameterisations here than previously.

The natural parameter provides a convenient mathematical formulation, but sometimes lacks simple interpretation.

[Curved exponential family] A curved exponential family distribution is one for which the dimension of \(\mathbf{\theta}\) is \(d < k\). If \(d = k\) then we have a full exponential family.

Location and Scale Families

[Location family] Let \(f(x)\) be any pdf. The family of pdfs \(f(x - \mu)\) for \(\mu \in \mathbb R{}\) is called the location family with standard pdf \(f(x)\) and \(\mu\) is the location parameter of the family.

[Scale family] Let \(f(x)\) be any pdf. For any \(\sigma > 0\) the family of pdfs \(\frac{1}{\sigma} f(x/\sigma)\) is called the scale family with standard pdf \(f(x)\) and \(\sigma\) is the scale parameter of the family.

[Location-Scale family] Let \(f(x)\) be any pdf. For \(\mu \in \mathbb R{}\) and \(\sigma > 0\) the family of pdfs \(\frac{1}{\sigma} f(\frac{x - \mu}{\sigma})\) is called the location-scale family with standard pdf \(f(x)\); \(\mu\) is the location parameter and \(\sigma\) is the scale parameter.

[Standardisation] Let \(f\) be any pdf, \(\mu \in \mathbb R{}\) and \(\sigma > 0\). Then \(X\) is a random variable with pdf \(\frac{1}{\sigma}f(\frac{x - \mu}{\sigma})\) if and only if there exists a random variable \(Z\) with pdf \(f(z)\) and \(X = \sigma Z + \mu\).

Probabilities of location-scale families can be computed in terms of their standard variables \(Z\) \[\text{P}(X \leq x) = \text{P}\left(Z \leq \frac{x - \mu}{\sigma} \right)\]

Inequalities and Identities

Chebyshev’s inequality

Let \(X\) be a random variable and let \(g(x)\) be a nonnegative function. Then, for any \(r > 0\), \[\text{P}(g(X) \geq r) \leq \frac{\mathbb E{}[g(X)]}{r}.\]

Proof: \[\begin{align} \mathbb E g(X)&=\int_{-\infty}^{\infty}g(x)f_{X}(x)dx\\ &\ge\int_{x:g(x)\ge r}g(x)f_{X}(x)dx\\ &\ge r\int_{x:g(x)\ge r}f_{X}(x)dx\\ &=r P(g(X)\ge r) \end{align}\]

This bound is conservative and almost never attained.

Chebyshev’s inequality (variance version)

The variance version is the special case obtained by taking \(g(x) = \frac{(x-\mu)^2}{\sigma^2}\), where \(\mu=\mathbb E X\) and \(\sigma^2=\text{Var} X\). For convenience write \(r=t^2\). Then

\[\text{P}\left(\frac{(X-\mu)^2}{\sigma^2} \geq t^2\right) \leq \frac{1}{t^2}.\]
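
A small simulation illustrating how conservative the bound is, using an exponential(1) population (an arbitrary choice, with \(\mu=\sigma=1\)):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=500_000)   # exponential(1): mu = sigma = 1
mu, sigma = 1.0, 1.0

for t in (1.5, 2.0, 3.0):
    p = np.mean(np.abs(x - mu) >= t * sigma)
    print(t, p, 1.0 / t**2)   # empirical probability vs. Chebyshev bound 1/t^2
```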

Let \(X_{\alpha, \beta}\) denote a gamma\((\alpha, \beta)\) random variable with pdf \(f(x \vert{} \alpha, \beta)\), where \(\alpha > 1\). Then for any constants \(a\) and \(b\): \[\text{P}(a < X_{\alpha, \beta} < b) = \beta (f(a \vert{} \alpha, \beta) - f(b \vert{} \alpha, \beta)) + \text{P}(a < X_{\alpha - 1, \beta} < b)\]

[Stein's Lemma] Let \(X \sim \text{n}(\theta, \sigma^2)\) and let \(g\) be a differentiable function with \(\mathbb E{}[|g'(X)|] < \infty\). Then \[\mathbb E{}[g(X)(X - \theta)] = \sigma^2 \mathbb E{}[g'(X)]\]

The proof is just integration by parts.

Stein’s lemma is useful for moment calculations

Let \(\chi^2_p\) denote a chi squared distribution with \(p\) degrees of freedom. For any function \(h(x)\), \[\mathbb E{}[h(\chi^2_p)] = p \mathbb E{}\left[\frac{h(\chi_{p+2}^2)}{\chi_{p+2}^2}\right]\] provided the expressions exist.

Let \(g(x)\) be a function with finite \(g(-1)\) and finite expectation \(\mathbb E\, g(X)\). Then (a numerical check of the first identity follows this list):

  1. If \(X \sim \text{Poisson}(\lambda)\), \[\mathbb E{}[\lambda g(X)] = \mathbb E{}[Xg(X-1)].\]

  2. If \(X\sim \text{negative-binomial}(r, p)\), \[\mathbb E{}[(1-p)g(X)] = \mathbb E{}\left[ \frac{X}{r + X - 1}g(X-1) \right].\]
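
A Monte Carlo check of the Poisson identity with \(g(x)=(x+1)^2\), an arbitrary choice with finite \(g(-1)\); \(\lambda\) and the sample size are also arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
lam = 2.5
x = rng.poisson(lam, size=500_000)

g = lambda t: (t + 1.0) ** 2    # any g with finite g(-1) and finite expectation

lhs = lam * np.mean(g(x))       # E[lambda * g(X)]
rhs = np.mean(x * g(x - 1))     # E[X * g(X - 1)]
print(lhs, rhs)                 # should agree up to Monte Carlo error
```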

4. Multiple Random Variables

Facts

  • RVs are independent if and only if their pdfs factorise

  • Functions of independent RVs are independent

  • Expectations of products (and hence mgfs, etc. of sums) of independent RVs factor

  • Variance of sum of independent RVs is sum of variances.

  • Independent RVs have vanishing covariance/correlation, but the converse is not true in general.

Bivariate Relations

[Conditional Expectation] \[\mathbb E{}[\mathbb E{}[X|Y]] = \mathbb E{}[X]\] provided the expectations exist.

Proof: Let \(f(x, y)\) denote the joint pdf of \(X\) and \(Y\). By definition, we have \[\begin{align} \mathbb E X&=\int\int xf(x,y)\,dx\,dy=\int\left[\int xf(x|y)\,dx\right]f_Y(y)\,dy\\ &=\int\mathbb E(X|y)f_Y(y)\,dy\\ &=\mathbb E(\mathbb E(X|Y)) \end{align}\]

[Conditional variance] \[\text{Var}[X] = \mathbb E{}[\text{Var}[X \vert{} Y]] + \text{Var}[\mathbb E{}[X \vert{} Y]]\] provided the expectations exist.

Proof: By definition, we have \[\begin{align} \text{Var} X&=\mathbb E\left((X-\mathbb E X)^2\right)\\ &=\mathbb E\left((X-\mathbb E(X|Y)+\mathbb E(X|Y)-\mathbb E X)^2\right)\\ &=\mathbb E\left[\left(X-\mathbb E(X|Y)\right)^2\right]+\mathbb E\left[\left(\mathbb E(X|Y)-\mathbb E X\right)^2\right]+2\mathbb E\bigg[\left(X-\mathbb E(X|Y)\right)\left(\mathbb E(X|Y)-\mathbb E X\right)\bigg]\\ &=\mathbb E\left[\left(X-\mathbb E(X|Y)\right)^2\right]+\mathbb E\left[\left(\mathbb E(X|Y)-\mathbb E X\right)^2\right]+2\mathbb E\bigg[\mathbb E\bigg(\left(X-\mathbb E(X|Y)\right)\left(\mathbb E(X|Y)-\mathbb E X\right)|Y\bigg)\bigg]\\ &=\mathbb E\left[\left(X-\mathbb E(X|Y)\right)^2\right]+\mathbb E\left[\left(\mathbb E(X|Y)-\mathbb E X\right)^2\right]+2\mathbb E\bigg[\left(\mathbb E(X|Y)-\mathbb E X\right)\mathbb E\bigg(\left(X-\mathbb E(X|Y)\right)|Y\bigg)\bigg]\\ &=\mathbb E\left[\left(X-\mathbb E(X|Y)\right)^2\right]+\mathbb E\left[\left(\mathbb E(X|Y)-\mathbb E X\right)^2\right]+2\mathbb E\bigg[\left(\mathbb E(X|Y)-\mathbb E X\right)\cdot 0 \bigg]\\ &=\mathbb E\left[\left(X-\mathbb E(X|Y)\right)^2\right]+\mathbb E\left[\left(\mathbb E(X|Y)-\mathbb E X\right)^2\right]\\ &=\mathbb E\left[\mathbb E\{\left(X-\mathbb E(X|Y)\right)^2\}|Y\right]+\mathbb E\left[\left(\mathbb E(X|Y)-\mathbb E X\right)^2\right]\\ &=\mathbb E\left[\text{Var}(X|Y)\right]+\text{Var}[\mathbb E(X|Y)]\\ \end{align}\]

[Covariance] \[\text{ Cov }[X, Y] = \mathbb E{}[(X - \mu_X)(Y - \mu_Y)]\]

\[\text{ Cov }[X, Y] = \mathbb E{}[XY] - \mu_X\mu_Y\]

\[\text{Var}[aX + bY] = a^2\text{Var}[X] + b^2\text{Var}[Y] + 2ab\text{Cov }[X, Y]\]

[Correlation] \[\rho_{XY} = \frac{\text{Cov }[X, Y]}{\sigma_X \sigma_Y}\]

The correlation measures the strength of linear relation between two RVs. It is possible to have strong non-linear relationships but with \(\rho = 0\).

We can use an argument similar to the standard proof of Cauchy-Schwarz to show the following

Let \(X\) and \(Y\) be any RVs, then

  1. \(-1 \leq \rho_{XY} \leq 1\),

  2. \(\lvert{\rho_{XY}}\lvert = 1\) if and only if there are constants \(a \neq 0, b\) such that \(\text{P}{}(Y = aX + b) = 1\). If \(\lvert{\rho_{XY}}\lvert = 1\) then \(\text{sign}(\rho) = \text{sign}(a)\).

Let \(X_1, \cdots , X_n\) be independent random vectors. Let \(g_i(x_i)\) be a function only of \(x_i, i = 1 , \cdots, n\). Then the random variables \(U_i = g_i(X_i), i = 1,\cdots, n\), are mutually independent.

Hierarchical Models and Mixture Distributions

Beta-binomial hierarchy

One generalization of the binomial distribution is to allow the success probability to vary according to a distribution. A standard model for this situation is \[X|P\sim\text{binomial}(n,P),\] \[P\sim\text{beta}(\alpha,\beta).\] By iterating the expectation, we calculate the mean of \(X\) as \[\mathbb E X=\mathbb E [\mathbb E (X|P)]=\mathbb E [nP]=n\frac{\alpha}{\alpha+\beta}\]
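
A short simulation of this hierarchy (the values of \(n\), \(\alpha\), \(\beta\) are arbitrary) confirming \(\mathbb E X = n\alpha/(\alpha+\beta)\):

```python
import numpy as np

rng = np.random.default_rng(6)
n, alpha, beta = 10, 2.0, 3.0
reps = 200_000

p = rng.beta(alpha, beta, size=reps)    # P ~ beta(alpha, beta)
x = rng.binomial(n, p)                  # X | P ~ binomial(n, P)

print(np.mean(x), n * alpha / (alpha + beta))   # both should be close to 4
```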

Conditional variance identity

For any two random variables \(X\) and \(Y\), \[\text{Var} X=\mathbb E (\text{Var} (X|Y))+\text{Var} (\mathbb E (X|Y))\] provided that the expectations exist.

Proof: By definition, we have \[\begin{align} \text{Var} X&=\mathbb E([X-\mathbb E X]^2)\\ &=\mathbb E([X-\mathbb E(X|Y)+\mathbb E(X|Y)-\mathbb E X]^2)\\ &=\mathbb E([X-\mathbb E(X|Y)]^2)+\mathbb E([\mathbb E(X|Y)-\mathbb E X]^2)+2\mathbb E([X-\mathbb E(X|Y)][\mathbb E(X|Y)-\mathbb E X])\\ &=\mathbb E([X-\mathbb E(X|Y)]^2)+\mathbb E([\mathbb E(X|Y)-\mathbb E X]^2)\\ &=\mathbb E (\mathbb E\{[X-\mathbb E(X|Y)]^2|Y\})+\text{Var} (\mathbb E (X|Y))\\ &=\mathbb E (\text{Var} (X|Y))+\text{Var} (\mathbb E (X|Y)) \end{align}\] The cross term drops out because \(\mathbb E\big[(X-\mathbb E(X|Y))\,\big|\,Y\big]=0\), exactly as in the longer derivation above.

Inequalities

Young's Inequality

Let \(a\) and \(b\) be any positive numbers and let \(p, q > 1\) satisfy \(1/p + 1/q = 1\) (equivalently, \(p+q=pq\), \(q(p-1)=p\), and \(1-1/q=1/p\)). Then \[\frac1p a^p + \frac1q b^q \geq ab\] with equality if and only if \(a^p = b^q\).

Proof

Put \(t=1/p\) and \(1-t=1/q\). Because the logarithm function is concave,

\[\begin{align} \ln \left(ta^{p}+(1-t)b^{q}\right)&\geq t\ln \left(a^{p}\right)+(1-t)\ln \left(b^{q}\right)\\ &=tp\ln(a)+(1-t)q\ln(b)\\ &=\ln(a)+\ln(b)=\ln(ab), \end{align}\] with the equality holding if and only if \(a^{p}=b^{q}\). Young's inequality follows by exponentiating.

Hölder's Inequality Let \(X\) and \(Y\) be any random variables and let \(p, q > 1\) satisfy \(1/p + 1/q = 1\). Then \[\lvert{\mathbb E{}[XY]}\lvert \leq \mathbb E{}[\lvert{XY}\lvert] \le \mathbb E{}[\lvert{X}\lvert^p]^{1/p} \mathbb E{}[\lvert{Y}\lvert^q]^{1/q}\] Proof: The first inequality follows from \(- |XY| \leq XY \leq |XY|\). To prove the second inequality, define \[a=\frac{|X|}{(\mathbb E|X|^p)^{1/p}}\quad\text{and}\quad b=\frac{|Y|}{(\mathbb E|Y|^q)^{1/q}}\] Then, by Young's inequality, \[\frac1p a^p+ \frac1q b^q=\frac1p\frac{|X|^p}{(\mathbb E|X|^p)}+\frac1q\frac{|Y|^q}{(\mathbb E|Y|^q)}\ge ab=\frac{|XY|}{(\mathbb E|X|^p)^{1/p}(\mathbb E|Y|^q)^{1/q}}\] Now take expectations of both sides. The expectation of the left-hand side is \(1\), so \[1\ge \frac{\mathbb E|XY|}{(\mathbb E|X|^p)^{1/p}(\mathbb E|Y|^q)^{1/q}}.\]

  • Cauchy-Schwarz Inequality is the special case \(p = q = 2\) \[\mathbb E|XY|\le (\mathbb E|X|^2)^{1/2}(\mathbb E|Y|^2)^{1/2}\]
    Also let \[u(t) = E[(tX - Y)^2]\] Then:\[t^2E[X^2] - 2tE[XY] + E[Y^2] \ge 0\]

This is a quadratic in \(t\). Thus the discriminant must be non-positive. Therefore: \[(\mathbb E[XY])^2 - \mathbb E[X^2]\mathbb E[Y^2] \le 0\] or \[(\mathbb E[XY])^2 \le \mathbb E[X^2]\mathbb E[Y^2]\] Then \[|(\mathbb E[XY])|\le(\mathbb E|X|^2)^{1/2}(\mathbb E|Y|^2)^{1/2}\]

  • Covariance inequality \[(\text{Cov}[X, Y])^2 \leq \sigma_X^2 \sigma_Y^2=\text{Var}(X)\text{Var}(Y)\] If \(X\) and \(Y\) have means \(\mu_X\) and \(\mu_Y\) and variances \(\sigma_X^2\) and \(\sigma_Y^2\), respectively, we can apply the Cauchy-Schwarz Inequality to get \[\mathbb E\lvert(X-\mu_X)(Y-\mu_Y)\rvert\le \{\mathbb E(X-\mu_X)^2\}^{1/2}\{\mathbb E(Y-\mu_Y)^2\}^{1/2}\] Squaring both sides and using statistical notation, we have \[(\text{Cov}(X, Y))^2 \leq \sigma_X^2\sigma_Y^2.\]

  • \[\mathbb E{}[\lvert{X}\lvert] \leq \mathbb E{}[\lvert{X}\lvert^p]^{1/p}\] for \(p \geq 1\).

  • Liapounov’s Inequality \[\mathbb E{}[\lvert{X}\lvert^r]^{1/r} \leq \mathbb E{}[\lvert{X}\lvert^s]^{1/s}\] where \(1 < r < s < \infty\).

  • For numbers \(a_i,b_i,i=1,\cdots,n\), \[\sum_{i=1}^{n}|a_ib_i|\leq\left(\sum_{i=1}^{n}|a_i|^p\right)^{1/p}\left(\sum_{i=1}^{n}|b_i|^q\right)^{1/q}\] where \(\frac{1}{p}+\frac{1}{q}=1\)

  • \[\frac{1}{n}\left(\sum_{i=1}^{n}|a_i|\right)^2\leq\sum_{i=1}^{n}a_i^2\]

Minkowski’s Inequality
Let \(X\) and \(Y\) be any two random variables. Then for \(1\le p<\infty\) \[(\mathbb E|X+Y|^p)^{1/p}\le (\mathbb E|X|^p)^{1/p}+(\mathbb E|Y|^p)^{1/p}\] Proof: Write \[\mathbb E(|X + Y|^p) = \mathbb E \left(|X + Y||X + Y|^{p-1}\right)\le \mathbb E\left(|X| |X + Y|^{p-1}\right) + \mathbb E\left(|Y||X + Y|^{p-1}\right),\] where we have used the fact that \(|X + Y| \le |X| + |Y|\) (the triangle inequality). Now apply Hölder's Inequality to each expectation on the right-hand side to get \[\begin{align} \mathbb E(|X + Y|^p) &\leq \mathbb E\left(|X| |X + Y|^{p-1}\right) + \mathbb E\left(|Y||X + Y|^{p-1}\right)\\ &=\left(\mathbb E|X|^p\right)^{1/p} \left(\mathbb E|X + Y|^{q(p-1)}\right)^{1/q} + \left(\mathbb E|Y|^p\right)^{1/p}\left(\mathbb E|X + Y|^{q(p-1)}\right)^{1/q}\\ \end{align}\] where \(q\) satisfies \(1/p + 1/q = 1\). Now divide through by \(\left(\mathbb E|X + Y|^{q(p-1)}\right)^{1/q}\) and note that \(q(p-1)=p\). Then \[\frac{\mathbb E(|X + Y|^p)}{\left(\mathbb E|X + Y|^{q(p-1)}\right)^{1/q}}=\left(\mathbb E|X + Y|^p\right)^{1/p}\le \left(\mathbb E|X|^p\right)^{1/p} + \left(\mathbb E|Y|^p\right)^{1/p}\]

Functional Inequalities

Convex Function A function \(g(x)\) is convex on a set \(S\) if for all \(x, y \in S\) and \(0< \lambda < 1\) \[g(\lambda x + (1 - \lambda)y) \leq \lambda g(x) + (1 - \lambda)g(y).\] Strictly convex is when the inequality is strict. \(g\) is concave if \(-g\) is convex.

If \(g\) is twice differentiable, then \(g(x)\) is convex on \(S\) if and only if \(g''(x) \geq 0\) \(\forall x \in S\).

Jensen's Inequality If \(g(x)\) is convex, then for any random variable \(X\) \[\mathbb E{}[g(X)] \geq g(\mathbb E{}[X]).\] Equality holds if and only if, for every line \(a + bx\) that is tangent to \(g(x)\) at \(x = \mathbb E{}[X]\), \(\text{P}{}\{g(X) = a + bX\} = 1\). (So equality holds if and only if \(g\) is affine with probability 1.)

  • \(x^2\) is convex, \(\mathbb E[X^2] \geq \mathbb E{}[X]^2\)

  • \(1/x\) is convex, \(\mathbb E[1/X] \geq 1/ \mathbb E{}[X]\)

  • \(\log x\) is concave, \(\mathbb E(\log X) \leq \log (\mathbb E X)\)

  • Arithmetic mean: \[a_A=\frac{1}{n}(a_1+a_2+\cdots+a_n)\] Geometric mean: \[a_G=(a_1a_2\cdots a_n)^{1/n}\] Harmonic mean: \[a_H=\frac{1}{\frac{1}{n}\left(\frac{1}{a_1}+\frac{1}{a_2}+\cdots+\frac{1}{a_n}\right)}\] \[a_H\leq a_G\leq a_A\]

Covariance Inequality Let \(X\) be any random variable and \(g(x)\) and \(h(x)\) any functions such that \(\mathbb E g(X), \mathbb E h(X)\) and \(\mathbb E (g(X)h(X))\) exist.
  • If \(g(x)\) is a nondecreasing function and \(h(x)\) is a nonincreasing function, then \[\mathbb E \bigg(g(X)h(X)\bigg)\leq\bigg(\mathbb E g(X)\bigg)\bigg(\mathbb E h(X)\bigg)\]

  • If \(g(x)\) and \(h(x)\) are both nondecreasing or nonincreasing functions, then \[\mathbb E \bigg(g(X)h(X)\bigg)\geq\bigg(\mathbb E g(X)\bigg)\bigg(\mathbb E h(X)\bigg)\]

5. Properties of a Random Sample

Basic Concepts of Random Sample

Definition 1 A collection of random variables \(X_1, \dots, X_n\) is a random sample of size n from population \(f(x)\) if they are iid with pdf/pmf \(f(x)\).

Sums of Random Variables from a Random Sample

Definition 2 \(Y = T(X_1, \dots, X_n)\) is a statistic if the domain of \(T\) contains the sample space of \((X_1, \dots, X_n)\). The distribution of \(Y\) is the sampling distribution of \(Y\).

Remark. A statistic is any function of the data. The only restriction is that a statistic cannot be a function of any unknown parameter.

Definition 3 The sample mean \(\bar{X}\) and sample variance \(S^2\) of a random sample \(X_1, \dots, X_n\) are, respectively,

  • \(\bar{X} = \frac1n \sum_{i=1}^n X_i\)

  • \(S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\).

The sample standard deviation is \(S = \sqrt{S^2}\).

Theorem 1 Let \(X_1, \dots, X_n\) be a random sample from a population with mean \(\mu\) and finite variance \(\sigma^2\), then

  1. \(\mathbb E[\bar{X}] = \mu\)

  2. \(\text{Var}[\bar{X}] = \frac{\sigma^2}{n}\)

  3. \(\mathbb E[S^2] = \sigma ^2\).
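
A quick simulation check of the three statements, assuming a normal population for concreteness (the theorem only needs a finite variance); the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)          # ddof=1 gives the (n - 1) divisor

print(xbar.mean(), mu)                    # E[Xbar] = mu
print(xbar.var(), sigma**2 / n)           # Var[Xbar] = sigma^2 / n
print(s2.mean(), sigma**2)                # E[S^2] = sigma^2
```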

Theorem 2 Take \(x_1, \dots, x_n \in \mathbb R\) and let \(\bar{x}\) be their mean. Then,

  1. \(\min_a \sum_{i=1}^n (x_i - a)^2 = \sum_{i=1}^n (x_i - \bar{x})^2\)

  2. \((n-1)s^2 = \sum_{i=1}^n(x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - n\bar{x}^2\).

  3. The mgf \(M_{\bar{X}}(t)=\mathbb E e^{t\bar{X}}=\mathbb E e^{t(X_1+\cdots+X_n)/n}=\mathbb E e^{(t/n)(X_1+\cdots+X_n)}=M_{(X_1+\cdots+X_n)}(t/n)=[M_{X}(t/n)]^n\). The last equality holds because \(X_1,\cdots,X_n\) are independent and identically distributed, so the mgf of the sum factors and \(M_{X_i}(t)\) is the same function for each \(i\).

  4. Let \(X_1, \dots, X_n\) be a random sample from a \(\text{n}(\mu, \sigma^2)\) distribution. Then the mgf of the sample mean is \[\begin{align} M_{\bar{X}}(t)&=\left[\exp\left(\mu\frac{t}{n}+\frac{\sigma^2(t/n)^2}{2}\right)\right]^n\\ &=\exp\left(n\left(\mu\frac{t}{n}+\frac{\sigma^2(t/n)^2}{2}\right)\right)\\ &=\exp\left(\mu t+\frac{(\sigma^2/n)t^2}{2}\right) \end{align}\] Thus \(\bar{X}\) has a \(\text{n}(\mu, \sigma^2/n)\) distribution.

Theorem 3 Let \(X_1, \dots, X_n\) be a random sample from population \(f(x\vert{}\mathbf{\theta})\) belonging to an exponential family \[f(x \vert{} \mathbf{\theta}) = h(x)c(\mathbf{\theta}) \exp\left(\sum_{i=1}^k w_i(\mathbf{\theta})t_i(x)\right).\] Define the statistics \[T_i(X_1, \dots, X_n) = \sum_{j=1}^n t_i(X_j), \quad i=1, \dots, k.\] If the set \(\{(w_1(\mathbf{\theta}), \dots, w_k(\mathbf{\theta})): \theta \in \Theta\}\) contains an open subset of \(\mathbb R^k\), then the distribution of \((T_1, \dots, T_k)\) is an exponential family of the form \[f(u_1, \dots, u_k \vert{} \mathbf{\theta}) = H(u_1, \dots, u_k) c(\mathbf{\theta})^n \exp\left(\sum_{i=1}^kw_i(\mathbf{\theta})u_i\right).\]

Remark. The open condition eliminates curved exponential families from this result.

Sampling from the Normal Distribution

Properties of the Sample Mean and Variance

Theorem 4 Let \(X_1, \dots, X_n\) be a random sample from a \(\text{n}(\mu, \sigma^2)\) distribution. Then,

  • \(\bar{X}\) and \(S^2\) are independent

  • \(\bar{X} \sim \text{n}(\mu, \sigma^2/n)\)

  • \((n-1)S^2/\sigma^2 \sim \chi^2_{n-1}\).
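
A sketch that checks the three claims by simulation (parameter values arbitrary); independence is only probed through the sample correlation of \(\bar X\) and \(S^2\), which should be near \(0\).

```python
import numpy as np

rng = np.random.default_rng(8)
mu, sigma, n, reps = 0.0, 3.0, 8, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)
q = (n - 1) * s2 / sigma**2

print(xbar.var(), sigma**2 / n)       # Xbar ~ n(mu, sigma^2 / n)
print(q.mean(), q.var())              # chi^2_{n-1}: mean 7, variance 14
print(np.corrcoef(xbar, s2)[0, 1])    # ~ 0, consistent with independence
```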

Lemma: the independence of linear combinations of normal random variables

Let \(X_j \sim \text{n}( \mu_j, \sigma^2_j),\quad j=1,\cdots,n\), be independent. For constants \(a_{ij}\) and \(b_{rj}\) \((j=1,\cdots,n;\;i=1,\cdots,k;\;r=1,\cdots,m)\), where \(k + m\le n\), define \[U_i=\sum_{j=1}^{n}a_{ij}X_j\quad i=1,\cdots,k\] \[V_r=\sum_{j=1}^{n}b_{rj}X_j\quad r=1,\cdots,m\]

  • (a). The random variables \(U_i\) and \(V_r\) are independent if and only if \(\text{Cov}(U_i,V_r)=0\). Furthermore, \(\text{Cov}(U_i,V_r)=\sum_{j=1}^{n}a_{ij}b_{rj}\sigma^2_j\).

  • (b). The random vectors \((U_1,\cdots, U_k)\) and \((V_1,\cdots, V_m)\) are independent if and only if \(U_i\) is independent of \(V_r\) for all pairs \(i,r(i=1,\cdots,k;\;r=1,\cdots,m)\)

Proof: Take \(\mu_j=0\) and \(\sigma_j^2=1\). To prove part (a), start with the joint pdf of \(X_1\) and \(X_2\), \[f_{X_1,X_2}(x_1,x_2)=\frac{1}{2\pi}e^{-(1/2)(x_1^2+x_2^2)}\] Let \[u=a_1x_1+a_2x_2,\quad v=b_1x_1+b_2x_2\] then \[\begin{bmatrix} a_1&a_2\\ b_1&b_2\\ \end{bmatrix}\begin{bmatrix} x_1\\ x_2\\ \end{bmatrix}=\begin{bmatrix} u\\ v\\ \end{bmatrix}\] then \[\begin{bmatrix} x_1\\ x_2\\ \end{bmatrix}=\begin{bmatrix} a_1&a_2\\ b_1&b_2\\ \end{bmatrix}^{-1}\begin{bmatrix} u\\ v\\ \end{bmatrix}=\frac{1}{a_1b_2-a_2b_1}\begin{bmatrix} b_2&-a_2\\ -b_1&a_1\\ \end{bmatrix}\begin{bmatrix} u\\ v\\ \end{bmatrix}=\begin{bmatrix} \frac{b_2u-a_2v}{a_1b_2-b_1a_2}\\ \frac{-b_1u+a_1v}{a_1b_2-b_1a_2}\\ \end{bmatrix}\] with \[J=\frac{\partial\mathbf x}{\partial[u,v]}=\begin{vmatrix}\frac{\partial x_1}{\partial u}&\frac{\partial x_1}{\partial v}\\ \frac{\partial x_2}{\partial u}&\frac{\partial x_2}{\partial v}\end{vmatrix}=\frac{1}{(a_1b_2-a_2b_1)^2}\begin{vmatrix}b_2&-a_2\\ -b_1&a_1\end{vmatrix}=\frac{1}{a_1b_2-a_2b_1}\]

Then

\[\begin{align} f_{U,V}(u,v)&=f_{X_1,X_2}(x_1,x_2)|J|\\ &=f_{X_1,X_2}\left(\frac{b_2u-a_2v}{a_1b_2-b_1a_2}, \frac{-b_1u+a_1v}{a_1b_2-b_1a_2}\right)|J|\\ &=\frac{1}{2\pi}\exp\left\{-\frac{1}{2(a_1b_2-b_1a_2)^2}\left((b_2u-a_2v)^2+(-b_1u+a_1v)^2\right)\right\}|J|\\ &=\frac{1}{2\pi}\exp\left\{-\frac{1}{2(a_1b_2-b_1a_2)^2}\left((b_1^2+b_2^2)u^2+(a_1^2+a_2^2)v^2-2(a_1b_1+a_2b_2)uv\right)\right\}|J|\\ \end{align}\]

If \(\text{Cov}(U,V)=a_1b_1+a_2b_2=0\), the cross term in the exponent vanishes, the joint pdf factors into a function of \(u\) times a function of \(v\), and hence \(U\) and \(V\) are independent.

In the case of samples from a multivariate normal

  • Independence \(\iff\) vanishing covariance

  • Pairwise independence \(\iff\) independence

The Derived Distributions: Student’s \(t\) and Snedecor’s \(F\)

Convergence Concepts

Convergence in Probability

Definition 4 A sequence of random variables \(\{X_i: i \in \mathbb N \}\) converges in probability to a random variable \(X\) if \(\forall \epsilon > 0\) \[\lim_{n\to\infty}\text{P }(|{X_n - X}| \geq \epsilon) = 0\] (or equivalently if \(\lim_{n\to\infty}\text{P }(|{X_n - X}| < \epsilon) = 1\)).

Theorem 5 Weak Law of Large Numbers, WLLN Let \(X_1, X_2, \dots\) be iid RVs with mean \(\mu\) and \(\text{Var} X_i = \sigma^2 < \infty\). Define \[\bar{X}_n = \frac1n \sum_{i=1}^n X_i\] Then the sequence \(\bar{X}_n\) converges in probability to \(\mu\). That is, \(\forall \epsilon > 0\) \[\lim_{n\to\infty} \text{P }(|{\bar X_n - \mu}| < \epsilon) = 1\]

The proof uses Chebyshev's inequality.

We have for every \(\epsilon>0\), \[P(|\bar X_n - \mu|\ge\epsilon)=P((\bar X_n - \mu)^2\ge\epsilon^2)\leq\frac{\mathbb E(\bar X_n - \mu)^2}{\epsilon^2}=\frac{\text{Var}\bar X_n}{\epsilon^2}=\frac{\sigma^2}{n\epsilon^2}\] Hence, \[P(|\bar X_n - \mu|<\epsilon)=1-P(|\bar X_n - \mu|\ge\epsilon)\ge 1-\frac{\sigma^2}{n\epsilon^2}\to 1 \text{ as } n\to \infty\]

The property summarized by the WLLN, that a sequence of the “same” sample quantity approaches a constant as \(n \to \infty\), is known as consistency.

Theorem 6 If \(X_1, X_2, \dots\) converges in probability to \(X\) and \(h\) is a continuous function, then \(h(X_1), h(X_2), \dots\) converges in probability to \(h(X)\).

Almost Sure Convergence

Definition 5 A sequence of random variables \(X_1, X_2, \dots\) converges almost surely to a random variable \(X\) if \(\forall \epsilon > 0\) \[\text{P }\left(\lim_{n\to\infty} |{X_n - X}| < \epsilon\right) = 1.\]

  • Almost sure convergence is much stronger than convergence in probability. Convergence in probability says that the probability of the set on which \(X_n\) differs from \(X\) by at least \(\epsilon\) goes to \(0\) as \(n\to\infty\); almost sure convergence says that the set of sample points on which \(X_n\) fails to converge to \(X\) has probability \(0\). It's like a sequence of integrals converging vs. whether the integrands converge.

  • Almost sure convergence implies convergence in probability but not the other way around.

Theorem 7 If a sequence converges in probability then it is possible to find a subsequence that converges almost surely.

Theorem 8 Strong Law of Large Numbers (SLLN) Let \(X_1, X_2, \dots\) be iid RVs with mean \(\mathbb E {X_i} = \mu\) and \(\text{Var}{X_i} = \sigma^2 < \infty\). Define \[\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i\]. Then the sequence \(\bar{X}_n\) converges almost surely to \(\mu\). That is, \(\forall \epsilon > 0\) \[\text{P }\left(\lim_{n\to\infty}|{\bar{X}_n - \mu}| < \epsilon\right) = 1.\]

Convergence in Distribution

Definition 6 A sequence of random variables \(X_1, X_2, \dots\) converges in distribution to a random variable \(X\) if \[\lim_{n \to \infty} F_{X_n}(x) = F_X(x)\] at all points where \(F_X\) is continuous.

Remark. Here it is really the cdfs that converge, rather than the random variables. In this way convergence in distribution differs from the previous two concepts.

Convergence in probability implies convergence in distribution

The sequence of random variables, \(X_1, X_2, \cdots\), converges in probability to a constant \(\mu\) if and only if the sequence also converges in distribution to \(\mu\). That is, the statement \[P(|X_n-\mu|>\epsilon)\to 0\quad \forall\epsilon>0\] is equivalent to \[P(X_n\le x)\to\begin{cases}0&\text{if }x<\mu\\1&\text{if }x>\mu.\end{cases}\]

Theorem 9 Central Limit Theorem Let \(X_1, X_2, \dots\) be a sequence of iid random variables with \(\mathbb E[X_i] = \mu\) and finite variance \(\text{Var}[X_i] = \sigma^2 < \infty\). Define \(\bar{X}_n = \frac1n \sum_{i=1}^n X_i\). Let \(G_n(x)\) denote the cdf of \(\sqrt{n}(\bar{X}_n - \mu)/\sigma\). Then \(\forall x \in \mathbb R\), \[\lim_{n\to\infty} G_n(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2}dy.\] That is, \(\sqrt{n}(\bar{X}_n - \mu)/\sigma\) converges in distribution to the standard normal.
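
A minimal illustration of the CLT (not from the text) with a skewed exponential(1) population (\(\mu=\sigma=1\), an arbitrary choice), comparing empirical probabilities of the standardized sample mean with standard normal values:

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps = 50, 100_000

x = rng.exponential(scale=1.0, size=(reps, n))   # skewed population, mu = sigma = 1
z = np.sqrt(n) * (x.mean(axis=1) - 1.0) / 1.0    # standardized sample means

# Empirical P(Z <= c) vs. the standard normal cdf at c.
for c, phi in [(-1.645, 0.05), (0.0, 0.5), (1.645, 0.95)]:
    print(c, np.mean(z <= c), phi)
```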

Slutsky’s Theorem

Theorem 10 If \(X_n \to X\) in distribution and \(Y_n \to a\) in probability with \(a\) constant, then

  1. \(Y_nX_n \to aX\) in distribution

  2. \(X_n + Y_n \to X + a\) in distribution

Remark. This tells us, for instance, that \[\frac{\sqrt{n}(\bar{X}_n - \mu)}{S_n} \to \text{n}(0, 1)\] in distribution, since we know that \(S_n \to \sigma\) in probability.

The Delta Method

If we are interested in the convergence of some function of a sequence of RVs, rather than the RVs themselves, then we can use the Delta Method (follows from an application of Taylor’s theorem and Slutsky’s theorem).

Theorem 11 Let \(Y_n\) be a sequence of random variables that satisfies \(\sqrt{n}(Y_n - \theta) \to \text{n}(0, \sigma^2)\) in distribution. For a given function \(g\) and a specific value of \(\theta\), suppose that \(g'(\theta)\) exists and is not \(0\). Then \[\sqrt{n}[g(Y_n) - g(\theta)] \to \text{n}(0, \sigma^2[g'(\theta)]^2)\] in distribution.

Furthermore, since \(\sigma^2\) and \(g'(\theta)\) are typically unknown, we can replace them with consistent estimators (such as \(S^2\) and \(g'(\bar Y)\)) and apply Slutsky's Theorem to conclude that \[\frac{\sqrt{n}[g(Y_n) - g(\theta)]}{\sqrt{S^2[g'(\bar Y)]^2}} \to \text{n}(0, 1)\] in distribution.

If \(g'(\theta) = 0\) then we take the next term in the Taylor series.

Proof The Taylor expansion of \(g(Y_n)\) around \(Y_n = \theta\) is \[g(Y_n) = g(\theta) + g'(\theta) (Y_n - \theta) + \text{Remainder},\] where the remainder \(\to 0\) as \(Y_n \to \theta\). Since \(Y_n \to \theta\) in probability it follows that the remainder \(\to 0\) in probability. By applying Slutsky's Theorem to \[\sqrt{n}[g(Y_n) - g(\theta)] = g'(\theta)\sqrt{n}(Y_n-\theta),\] the result now follows.
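
As a numerical illustration (my own example, not from the text), take \(Y_n=\bar X_n\) for exponential observations with mean \(\mu=2\) (so \(\sigma^2=4\)) and \(g(y)=1/y\); the Delta Method predicts \(\sqrt n(1/\bar X_n - 1/\mu)\to \text{n}(0,\sigma^2[g'(\mu)]^2)=\text{n}(0,0.25)\).

```python
import numpy as np

rng = np.random.default_rng(10)
n, reps, mu = 200, 100_000, 2.0                  # exponential mean 2, so sigma^2 = 4

x = rng.exponential(scale=mu, size=(reps, n))
ybar = x.mean(axis=1)

w = np.sqrt(n) * (1.0 / ybar - 1.0 / mu)         # g(y) = 1/y, g'(mu) = -1/mu^2

print(w.mean())                                  # ~ 0
print(w.var())                                   # ~ sigma^2 * g'(mu)^2 = 4 / 16 = 0.25
```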

The second-order Delta Method

If \(g'(\theta) = 0\), we take one more term in the Taylor expansion to get \[g(Y_n) = g(\theta) + g'(\theta)(Y_n -\theta) +\frac{g''(\theta)}{2} (Y_n -\theta)^2 + \text{Remainder}.\] If we do some rearranging (setting \(g' = 0\)), we have \[ g(Y_n) -g(\theta) = \frac{g''(\theta)}{2} (Y_n -\theta)^2 + \text{Remainder}.\] Now recall that the square of a \(n(0, 1)\) is a \(\chi_1^2\) , which implies that \[\frac{n(Y_n-\theta)^2}{\sigma^2}\to\chi_1^2\] in distribution. Therefore,

(Second-order Delta Method) Let \(Y_n\) be a sequence of random variables that satisfies \(\sqrt{n}(Y_n -\theta) \to n(0, \sigma^2)\) in distribution. For a given function \(g\) and a specific value of \(\theta\), suppose that \(g'(\theta)=0\) and \(g''(\theta)\) exists and is not \(0\). Then \[n[g(Y_n) -g(\theta)]\approx n\frac{g''(\theta)}{2} (Y_n -\theta)^2\to \sigma^2\frac{g''(\theta)}{2}\chi_1^2\] in distribution.

Multivariate Delta Method

Note that we must deal with multiple random variables although the ultimate CLT is a univariate one. Suppose the vector-valued random variable \(\mathbf X=(X_1,\cdots, X_p)\) has mean \(\boldsymbol\mu = (\mu_1, \cdots , \mu_p)\) and covariances \(Cov(X_i, X_j) = \sigma_{ij}\), and we observe an independent random sample \(\mathbf X_1, \cdots, \mathbf X_n\) and calculate the means \[\bar X_i = \frac{1}{n}\sum_{k=1}^{n}X_{ik},\quad i=1,\cdots,p\] For a function \(g(\mathbf x) = g(x_1, \cdots, x_p)\) we can write the first-order Taylor approximation \[g(\bar x_1, \cdots, \bar x_p) \approx g(\mu_1, \cdots, \mu_p) + \sum_{k=1}^{p}g'_k(\boldsymbol\mu)(\bar x_k - \mu_k),\] where \(g'_k(\boldsymbol\mu)=\partial g(\boldsymbol\mu)/\partial\mu_k\), and we then have the following theorem:

Multivariate Delta Method Let \(\mathbf X_1, \cdots, \mathbf X_n\) be a random sample with \(E(X_{ij})=\mu_i\) and \(Cov(X_{ik}, X_{jk}) = \sigma_{ij}\) . For a given function \(g\) with continuous first partial derivatives and a specific value of \(\boldsymbol\mu = (\mu_1, \cdots , \mu_p)\) for which \[\tau^2=\sum\sum\sigma_{ij}\frac{\partial g(\boldsymbol\mu)}{\partial{\mu_i}}\frac{\partial g(\boldsymbol\mu)}{\partial{\mu_j}}>0,\] \[\sqrt{n}\big[g(\bar x_1,\cdots,\bar x_p)-g(\mu_1,\cdots,\mu_p)\big]\to n(0,\tau^2)\;\;\text {in distribution}\]

Generating A Random Sample

Direct Method

Probability integral transformation Let \(X\) have continuous cdf \(F_X(x)\) and define the random variable \(Y\) as \(Y = F_X (X)\). Then \(Y\) is uniformly distributed on \((0, 1)\), that is, \(P(Y \leq y) = y, 0 < y < 1\).

Proof For \(Y = F_X (X)\) we have, for \(0 < y < 1\), \[\begin{align} P(Y\leq y)&=P(F_X (X)\leq y)\\ &=P(F_X^{-1} [F_X (X)]\leq F_X^{-1}(y))\quad (F_X^{-1} \text{ is increasing})\\ &=P(X\leq F_X^{-1}(y))\\ &=F_X(F_X^{-1}(y))\\ &=y \end{align}\] At the endpoints we have \(P (Y \leq y) = 1\) for \(y \ge 1\) and \(P (Y \leq y)=0\) for \(y \leq 0\), showing that \(Y\) has a uniform distribution.

Let \(X\) have cdf \(F_X(x)\), let \(Y=g(X)\), and let \(\mathcal X\) and \(\mathcal Y\) be defined as \(\mathcal X = \{x: f_X(x) > 0\}\) and \(\mathcal Y = \{y: y = g(x)\text{ for some }x \in \mathcal X\}\).

    1. If \(g\) is an increasing function on \(\mathcal X\), \[F_Y (y) =\int_{x\in \mathcal X:x\leq g^{-1}(y)}f_X(x)dx=\int_{-\infty}^{g^{-1}(y)}f_X(x)dx= F_X (g^{-1}(y)) \text{ for }y \in \mathcal Y\]
    2. If \(g\) is a decreasing function on \(\mathcal X\) and \(X\) is a continuous random variable, \[F_Y (y)=\int_{g^{-1}(y)}^{\infty}f_X(x)dx=1 - F_X(g^{-1} (y)) \text{ for } y \in \mathcal Y\]

Definition 7 A Direct Method of generating a random sample uses the probability integral transform to map draws from a \(\text{uniform}(0, 1)\) random variable to draws from the distribution of interest.
The Probability Integral Transform states that if \(X\) has continuous cdf \(F_X(x)\) then \[F_X(X) \sim \text{uniform}(0, 1).\]
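
A minimal sketch of the direct method, assuming an exponential(\(\beta\)) target so that \(F_X^{-1}(u)=-\beta\log(1-u)\); the value of \(\beta\) and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(11)
beta = 2.0
u = rng.uniform(size=500_000)          # U ~ uniform(0, 1)

x = -beta * np.log(1.0 - u)            # F^{-1}(u) for F(x) = 1 - exp(-x / beta)

print(x.mean(), beta)                  # exponential(beta) has mean beta
print(x.var(), beta**2)                # and variance beta^2
```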

Indirect Methods

The Accept/Reject Algorithm

Definition 8 Let \(Y \sim f_Y(y)\) and \(V \sim f_V(v)\) where \(f_Y\) and \(f_V\) have common support with \[M = \sup_y f_Y(y) / f_V(y) < \infty.\] To generate a random variable \(Y \sim f_Y\):

  1. Generate \(U \sim \text{uniform}(0, 1)\), \(V \sim f_V\) independent.

  2. If \(U < \frac 1M f_Y(V)/f_V(V)\), return \(V\) as a sample of \(Y\); otherwise return to step 1.

  • It is typical to call \(f_V\) the candidate density and \(f_Y\) the target density.

  • One would normally try to choose a candidate density with heavier tails than the target density (e.g. Cauchy and normal) to ensure that the tails of the target are well represented. If the target has heavy tails, however, it can be hard to find a candidate that results in finite \(M\). In this case people turn to MCMC methods.

  • Note that \(\text{P }(\texttt{terminate}) = 1/M\). The number of trials to generate one sample of \(Y\) is therefore \(\text{geometric}(1/M)\), with \(M\) the expected number of trials.

  • The intuition behind this algorithm is that if we consider placing the density of a random variable \(Y\) in a box (2d for simplicity) with coordinates \((v,u)\), we express the cdf of \(Y\) using \(V, U \sim \text{uniform}(0, 1)\) \[\text{P }(Y \leq y) = \text{P }(V \leq y \vert{} U \leq \frac1c f_Y(V))\] where \(c = \sup_y f_Y(y)\). In the actual algorithm we take \(U \sim \text{uniform}(0,1)\) and \(V\) to be an RV that has common support with \(Y\).
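
A short sketch of the accept/reject algorithm, using a beta(2, 2) target and a uniform(0, 1) candidate (so \(M=\sup_y 6y(1-y)=1.5\)); these choices are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(12)

def f_target(y):                  # beta(2, 2) density on (0, 1)
    return 6.0 * y * (1.0 - y)

M = 1.5                           # sup_y f_target(y) / f_candidate(y); candidate = uniform(0, 1)

def accept_reject(size):
    out = []
    while len(out) < size:
        v = rng.uniform()         # candidate draw V ~ f_V
        u = rng.uniform()         # U ~ uniform(0, 1)
        if u < f_target(v) / (M * 1.0):   # f_V(v) = 1 for the uniform candidate
            out.append(v)         # accept V as a draw from the target
    return np.array(out)

y = accept_reject(100_000)
print(y.mean(), y.var())          # beta(2, 2): mean 0.5, variance 0.05
```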

Order Statistics

Definition: sample percentile

The notation \(\{b\}\), when appearing in a subscript, is defined to be the number \(b\) rounded to the nearest integer in the usual way. More precisely, if \(i\) is an integer and \(i - .5 \le b < i + .5\), then \(\{b\} = i\).

The \((100p)\)th sample percentile is \(X_{(\{np\})}\) if \(\frac{1}{2n}<p<.5\), and \(X_{(n+1-\{n(1-p)\})}\) if \(.5<p<1-\frac{1}{2n}\).

6. Principles of Data Reduction

In this chapter, we explore how we can use functions of a sample \(\mathbf{X}\) to make inferences about an unknown parameter (of the distribution of the sample) \(\theta\).

[Statistic] A statistic is any function of the data.

A statistic \(T(X)\) forms a partition of the sample space \(\mathcal X\) according to its image, \(\mathcal{T} = \{t: \exists \mathbf{x} \in \mathcal X{} \text{ so that } t = T(\mathbf{x})\}\). In this way a statistic provides a method of data reduction. An experimenter who observes only \(T\) will treat as equal two samples \(\mathbf{x}, \mathbf{y}\) for which \(T(\mathbf{x}) = T(\mathbf{y})\).

The Sufficiency Principle

Sufficient Statistic A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) if the conditional distribution of the sample \(\mathbf{X}\) given \(T(\mathbf{X})\) does not depend on \(\theta\).

We ignore the fact that all points have \(0\) probability for continuous distributions.

The Sufficiency Principle If \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\), then any inference about \(\theta\) should depend on the sample \(\mathbf{X}\) only through \(T(\mathbf{X})\).

If \(p(\mathbf{x}\vert{}\theta)\) is the pmf/pdf of the sample \(\mathbf{X}\) and \(q(t\vert{} \theta)\) is the pmf/pdf of \(T(\mathbf{X})\), then \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) if, \(\forall \mathbf{x} \in \mathcal X{}\), the ratio \[p(\mathbf{x}\vert{}\theta) / q(T(\mathbf{x})\lvert\theta)\] is constant as a function of \(\theta\).

Niceness of the exponential family It turns out that outside of the exponential family, it is rare to have a sufficient statistic that is of smaller dimension than the size of the sample.

Factorization Theorem

Let \(f(\mathbf{x} \vert{} \theta)\) denote the pmf/pdf of a sample \(\mathbf{X}\). A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) if and only if there exist functions \(g(T(\mathbf{X})\vert{} \theta)\) and \(h(\mathbf{x})\) such that \(\forall \mathbf{x} \in \mathcal X{}\), \(\forall \theta \in \Theta\) \[\begin{equation} f(\mathbf{x} \vert{} \theta) = g(T(\mathbf{x}) \vert{} \theta)h(\mathbf{x}). \end{equation}\]

This theorem shows that the identity is a sufficient statistic. It is straightforward to show from this that any bijection of a sufficient statistic is a sufficient statistic.

Proof Suppose \(T(\mathbf X)\) is a sufficient statistic. Choose \(g(t\lvert \theta) = P_{\theta}(T(\mathbf X) = t)\) and \(h(\mathbf x) =P\big(\mathbf X = \mathbf x\lvert T(\mathbf X) = T(\mathbf x)\big)\). Because \(T(\mathbf X)\) is sufficient, the conditional probability defining \(h(\mathbf x)\) does not depend on \(\theta\). Thus this choice of \(h(\mathbf x)\) and \(g(t\lvert \theta)\) is legitimate, and for this choice we have \[\begin{align} f(\mathbf x\lvert \theta) &= P_{\theta}(\mathbf X = \mathbf x)\\ &= P_{\theta}\bigg(\mathbf X = \mathbf x \text{ and } T(\mathbf X) = T(\mathbf x)\bigg)\\ &= P_{\theta}(T(\mathbf X) = T(\mathbf x)) P(\mathbf X=\mathbf x\lvert T(\mathbf X) = T(\mathbf x))\quad \text{(sufficiency)}\\ &= g(T(\mathbf x)\lvert \theta)h(\mathbf x). \end{align}\]

So factorization has been exhibited. We also see from the last two lines above that \[P_{\theta}\big(T(\mathbf X) = T(\mathbf x)\big) =g(T(\mathbf x)\lvert \theta),\] so \(g(T(\mathbf x)\lvert \theta)\) is the pmf of \(T(\mathbf X).\)

Now assume the factorization exists. Let \(q(t\lvert \theta)\) be the pmf of \(T(\mathbf X)\). To show that \(T(\mathbf X)\) is sufficient we examine the ratio \(f(\mathbf x\lvert \theta)/q(T(\mathbf x)\lvert \theta)\). Define \(A_{T(\mathbf x)} =\{\mathbf y: T(\mathbf y) = T(\mathbf x)\}\). Then \[\begin{align} \frac{f(\mathbf x\lvert \theta)}{q(T(\mathbf x)\lvert \theta)}&=\frac{g(T(\mathbf x)\lvert \theta)h(\mathbf x)}{q(T(\mathbf x)\lvert \theta)}\\ &=\frac{g(T(\mathbf x)\lvert \theta)h(\mathbf x)}{\sum_{A_{T(\mathbf x)}}g(T(\mathbf y)\lvert \theta)h(\mathbf y)}\quad \text{(definition of the pmf of T)}\\ &=\frac{g(T(\mathbf x)\lvert \theta)h(\mathbf x)}{g(T(\mathbf y)\lvert \theta)\sum_{A_{T(\mathbf x)}}h(\mathbf y)}\quad \text{(since T is constant on} \;A_{T(\mathbf x)})\\ &=\frac{g(T(\mathbf x)\lvert \theta)h(\mathbf x)}{g(T(\mathbf x)\lvert \theta)\sum_{A_{T(\mathbf x)}}h(\mathbf y)}\\ &=\frac{h(\mathbf x)}{\sum_{A_{T(\mathbf x)}}h(\mathbf y)}\\ \end{align}\]

Since the ratio does not depend on \(\theta\), \(T(\mathbf X)\) is a sufficient statistic for \(\theta\).

Let \(X_1, \dots, X_n\) be iid observations from a pmf/pdf \(f(x \vert{} \boldsymbol\theta)\) from an exponential family \[f(x|\boldsymbol{\theta}) = h(x)c(\boldsymbol{\theta})\exp\left( \sum_{i=1}^k w_i(\boldsymbol{\theta}) t_i(x) \right),\] where \(\boldsymbol{\theta} = (\theta_1, \dots, \theta_d)\), \(d \leq k\). Then \(T(\mathbf{X})\) defined by \[T(\mathbf{X}) = \left(\sum_{j=1}^n t_1(\mathbf{X}_j),\cdots,\sum_{j=1}^n t_k(\mathbf{X}_j)\right)\] is a sufficient statistic for \(\boldsymbol{\theta}\).

Minimal sufficient statistic A sufficient statistic \(T(\mathbf{X})\) is called a minimal sufficient statistic if, for any other sufficient statistic \(T'(\mathbf{X})\), \(T(\mathbf{x})\) is a function of \(T'(\mathbf{x})\).

By 'function of' we mean that if \(T'(\mathbf{x}) = T'(\mathbf{y})\) then \(T(\mathbf{x}) = T(\mathbf{y})\): \(T\) varies with the sample only through \(T'\). Equivalently, each tile of the partition \(\{B_{t'}:t'\in \mathcal T'\}\) of the sample space induced by \(T'\) is a subset of some tile in the partition \(\{A_{t}:t\in \mathcal T\}\) induced by \(T\). Thus a minimal sufficient statistic provides the coarsest possible partition of the sample space and hence achieves the greatest data reduction among sufficient statistics.

Theorem Let \(f(\mathbf{x} \vert{} \theta)\) be the pmf/pdf of a sample \(\mathbf{X}\). Suppose there exists a function \(T(\mathbf{X})\) such that \(\forall \mathbf{x}, \mathbf{y} \in \mathcal X{}\) the ratio \(f(\mathbf{x}\vert{}\theta)/f(\mathbf{y}\vert{}\theta)\) is constant as a function of \(\theta\) if and only if \(T(\mathbf{x}) = T(\mathbf{y})\). Then \(T(\mathbf{X})\) is a minimal sufficient statistic for \(\theta\).

Proof To simplify the proof, we assume \(f(\mathbf x\lvert \theta) > 0\) for all \(\mathbf x \in \mathcal X\) and \(\theta\).

First we show that \(T(\mathbf X)\) is a sufficient statistic. Let \(\mathcal T = \{t : t=T(\mathbf x)\; \text{for some }\; \mathbf x \in \mathcal X\}\) be the image of \(\mathcal X\) under \(T(\mathbf x)\). Define the partition sets induced by \(T(\mathbf x)\) as \(A_t = \{\mathbf x: T(\mathbf x) = t\}\). For each \(A_t\), choose and fix one element \(\mathbf x_t \in A_t\). For any \(\mathbf x \in \mathcal X\), \(\mathbf x_{T(\mathbf x)}\) is the fixed element that is in the same set, \(A_t\) , as \(\mathbf x\). Since \(\mathbf x\) and \(\mathbf x_{T(\mathbf x)}\) are in the same set \(A_t\) , \(T(\mathbf x) = T(\mathbf x_{T(\mathbf x)})\) and, hence, \(f(\mathbf x\lvert\theta)/ f(\mathbf x_{T(\mathbf x)}\lvert\theta)\) is constant as a function of \(\theta\). Thus, we can define a function on \(\mathcal X\) by \(h(\mathbf x) = f(\mathbf x\lvert\theta)/ f(\mathbf x_{T(\mathbf x)}\lvert\theta)\) and \(h\) does not depend on \(\theta\). Define a function on \(\mathcal T\) by \(g(t\lvert\theta) = f(\mathbf x_t\lvert\theta)\). Then it can be seen that \[f(\mathbf x\lvert\theta)=\frac{f(\mathbf x_{T(\mathbf x)}\lvert\theta)f(\mathbf x\lvert\theta)}{f(\mathbf x_{T(\mathbf x)}\lvert\theta)} = g(T(\mathbf x)\lvert\theta)h(\mathbf x)\]
and, by the Factorization Theorem, \(T(\mathbf X)\) is a sufficient statistic for \(\theta\).

Now to show that \(T(\mathbf X)\) is minimal, let \(T'(\mathbf X)\) be any other sufficient statistic. By the Factorization Theorem, there exist functions \(g'\) and \(h'\) such that \(f(\mathbf x\lvert\theta) =g'(T'(\mathbf x)\lvert\theta)h'(\mathbf x)\). Let \(\mathbf x\) and \(\mathbf y\) be any two sample points with \(T'(\mathbf x) = T'(\mathbf y)\). Then \[\frac{f(\mathbf x\lvert\theta)}{f(\mathbf y\lvert\theta)}=\frac{g'(T'(\mathbf x)\lvert\theta)h'(\mathbf x)}{g'(T'(\mathbf y)\lvert\theta)h'(\mathbf y)}=\frac{h'(\mathbf x)}{h'(\mathbf y)}\] Since this ratio does not depend on \(\theta\), the assumptions of the theorem imply that \(T(\mathbf x) = T(\mathbf y)\). Thus, \(T(\mathbf x)\) is a function of \(T'(\mathbf x)\) and \(T(\mathbf x)\) is minimal.

Necessary Statistic A statistic is necessary if it can be written as a function of every sufficient statistic.

A statistic is minimal sufficient if and only if it is a necessary and sufficient statistic.

Ancillary Statistic

Ancillary Statistic A statistic \(S(\mathbf{X})\) whose distribution does not depend on the parameter \(\theta\) is called an ancillary statistic.

First Order Ancillary A statistic \(V(\mathbf{X})\) is first order ancillary if \(\mathbb E[V(\mathbf{X})]\) is independent of \(\theta\).

Complete Statistic

Complete Statistic Let \(f(t \vert{} \theta)\) be a family of pdfs or pmfs for a statistic \(T(\mathbf{X})\). The family of probability distributions is called complete if for every (measurable) function \(g\) \[\mathbb E_{\theta}[g(T)] = 0 \,\, \forall \theta \implies P_{\theta}(g(T) = 0) = 1 \,\, \forall \theta.\] Equivalently, \(T(\mathbf{X})\) is called a complete statistic.

(It is left unsaid that the function \(g\) must be independent of \(\theta\).)

Basu’s Theorem If \(T(\mathbf{X})\) is a complete and minimal sufficient statistic, then \(T(\mathbf{X})\) is independent of every ancillary statistic.

Proof

  • For discrete distributions:

Let \(S(\mathbf X)\) be any ancillary statistic. Then \(P(S(\mathbf X) = s)\) does not depend on \(\theta\) since \(S(\mathbf X)\) is ancillary. Also the conditional probability, \[P\big(S(\mathbf X) = s\lvert T(\mathbf X)=t\big) = P\big(\mathbf X \in \{\mathbf x: S(\mathbf x) = s\}\lvert T(\mathbf X)=t\big),\] does not depend on \(\theta\) because \(T(\mathbf X)\) is a sufficient statistic (the definition!). Thus, to show that \(S(\mathbf X)\) and \(T(\mathbf X)\) are independent, it suffices to show that \[P(S(\mathbf X) = s\lvert T(\mathbf X) = t) = P(S(\mathbf X) = s)\] for all possible values \(t \in \mathcal T\). Now, \[P(S(\mathbf X)=s)= \sum_{t\in\mathcal T}^{} P\big(S(\mathbf X) =s\lvert T(\mathbf X) = t\big)P_{\theta}\big(T(\mathbf X)= t\big).\] Furthermore, since \(\sum_{t\in\mathcal T}P_{\theta}(T(\mathbf X) = t) = 1\), we can write \[P(S(\mathbf X) = s) = \sum_{t \in \mathcal T}^{} P(S(\mathbf X)=s)P_{\theta}(T(\mathbf X)=t).\]

Therefore, if we define the statistic \[g(t) = P\big(S(\mathbf X) = s\lvert T(\mathbf X) = t\big) - P\big(S(\mathbf X)=s\big),\] the above two equations show that \[\mathbb E_{\theta}g(T) = \sum_{t \in \mathcal T} g(t)P_{\theta}(T(\mathbf X) = t) = 0\quad \forall \theta.\] Since \(T(\mathbf X)\) is a complete statistic, this implies that \(g(t)=0\) for all possible values \(t\in\mathcal T\). Hence \[P(S(\mathbf X) = s\lvert T(\mathbf X) = t) = P(S(\mathbf X) = s)\] is verified.

  • For continuous distributions:

Let \(T(\mathbf X)\) be a complete sufficient statistic. Let \(S(\mathbf X)\) be an ancillary statistic. Let \(A\) be an event in the sample space.

Basu's theorem states that \(S\) and \(T\) are independent, so we need to show that \(P(S \in A \mid T = t) = P(S \in A)\) for every event \(A\). Note that

\[\begin{align} P(S \in A)&=\mathbb{E}[I(S \in A)]\\ &=\mathbb{E}_{\theta}\big[\mathbb{E}[I(S \in A)|T]\big].\\ \end{align}\]

Define the function \(g(t)\) by \[g(t)=\mathbb E[I(S\in A)|T=t].\] Then we conclude \[\mathbb{E}_{\theta}[g(T)-P(S \in A)]=\mathbb{E}_{\theta}\big[\mathbb E[I(S\in A)|T]\big]-P(S \in A)=0\] for all \(\theta\) in the parameter space.

This is using the definition of a complete statistic:

If you have a function \(h(T)\) that

  • does not depend on the parameter \(\theta\) directly, but only on \(T\) (as it is written, "\(h(T)\)"), and
  • for which \(\mathbb E_{\theta}(h(T)) = 0\) for whatever \(\theta\) you pick,

then \(h(t)\) is itself zero almost everywhere (or: \(P_{\theta}(h(T) = 0) = 1\)), again for any value of \(\theta\).

\(T\) is complete, and the function \(h\) of \(T\) is \[h(T) = g(T) - P(S \in A) = g(T) - c.\] We need \(S\) to be ancillary because otherwise \(P(S \in A)\) would depend on \(\theta\) and \(h\) would not be purely a function of \(T\), but some \(h(\theta, T)\).

Since \[\mathbb E_{\theta}[I(S \in A)] = \mathbb E_{\theta}[g(T)],\] it follows that \(\mathbb E_{\theta}[h(T)]\) is zero for every \(\theta\), so the second point is fulfilled too, and the conclusion follows.

Thus \[\mathbb{E}[I(S \in A)|T=t]=P(S \in A\mid T=t)=P(S \in A)\]

Complete statistics in the exponential family Let \(X_1, \dots, X_n\) be iid observations from a pmf/pdf \(f(x \vert{} \theta)\) from an exponential family \[f(x|\mathbf{\theta}) = h(x)c(\mathbf{\theta})\exp\left( \sum_{i=1}^k w_i(\mathbf{\theta}) t_i(x) \right),\] where \(\mathbf{\theta} = (\theta_1, \dots, \theta_d)\), \(d \leq k\). Then \(\mathbf{T}(\mathbf{X})\) defined by \[T_i(\mathbf{X}) = \sum_{j=1}^n t_i(X_j),\quad i=1,\dots,k,\] is complete if \(\{(w_1(\mathbf{\theta}), \dots, w_k(\mathbf{\theta})): \mathbf{\theta}\in\Theta\}\) contains an open set in \(\mathbb R^k\).

The open set criterion excludes curved exponential families.

If a minimal sufficient statistic exists, then every complete statistic is minimal sufficient.

The Likelihood Principle

Likelihood Let \(f(\mathbf{x}\vert{} \theta)\) denote the pmf/pdf of the sample \(\mathbf{X}\). Then the likelihood function, given an observation \(\mathbf{X} = \mathbf{x}\), is \[L(\theta \vert{} \mathbf{x}) = f(\mathbf{x} \vert{} \theta)\] as a function of \(\theta\).

Likelihood Principle If \(\mathbf{x}\) and \(\mathbf{y}\) are two sample points such that \(L(\theta \vert{} \mathbf{x})\) is proportional to \(L(\theta \vert{} \mathbf{y})\), that is, there exists a constant \(C(\mathbf{x}, \mathbf{y})\), which does not depend on \(\theta\), such that \[L(\theta \vert{} \mathbf{x}) = C(\mathbf{x}, \mathbf{y})L(\theta \vert{} \mathbf{y}) \quad \forall \theta,\] then the conclusions drawn from \(\mathbf{x}\) and \(\mathbf{y}\) should be identical.

A Slightly More Formal Construction

We define an experiment \(E\) to be a triple \((\mathbf{X}, \theta, f(\mathbf{x}\vert{}\theta))\), where \(\mathbf{X}\) is a random vector with pmf/pdf \(f(\mathbf{x}\vert{}\theta)\) for some \(\theta\) in the parameter space \(\Theta\). An experimenter, knowing what experiment \(E\) was performed and having observed a particular sample \(\mathbf{X}=\mathbf{x}\), will make some inference or draw some conclusion about \(\theta\). This conclusion we denote by \(\text{Ev}(E,\mathbf{x})\), which stands for the evidence about \(\theta\) arising from \(E\) and \(\mathbf{x}\).

Formal Sufficiency Principle Consider an experiment \(E = (\mathbf{X}, \theta, f(\mathbf{x}\vert{}\theta))\) and suppose that \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\). If \(\mathbf{x}\) and \(\mathbf{y}\) are sample points satisfying \(T(\mathbf{x}) = T(\mathbf{y})\), then \(\text{Ev}(E, \mathbf{x}) = \text{Ev}(E, \mathbf{y})\).

Conditionality Principle Suppose that \(E_1 = \{X_1, \theta, f_1(x_1\vert{} \theta)\}\) and \(E_2 = \{X_2, \theta, f_2(x_2\vert{} \theta)\}\) are two experiments, where only the unknown parameter \(\theta\) need be common between the two experiments. Consider the mixed experiment in which the random variable \(J\) is observed, where \(P(J = 1) = P(J = 2) = \frac12\) (independent of \(\mathbf{X}_1, \mathbf{X}_2, \theta\)), and then the experiment \(E_J\) is performed. Formally, the experiment performed is \(E^* = (\mathbf{X}^*, \theta, f^*(\mathbf{x}^* \vert{} \theta))\), where \(\mathbf{X}^* = (J, \mathbf{X}_J)\) and \[f^*(\mathbf{x}^* \vert{} \theta) = f^*((j, \mathbf{x}_j)\vert{} \theta) = \frac12 f_j(\mathbf{x}_j \vert{}\theta).\] Then \[\text{Ev}(E^*, (j, \mathbf{x}_j)) = \text{Ev}(E_j, \mathbf{x}_j).\] That is, the evidence about \(\theta\) depends only on the experiment actually performed, not on the fact that this experiment was selected by the randomization.

Formal Likelihood Principle Suppose that we have two experiments, \(E_1 = (\mathbf{X}_1, \theta, f_1(\mathbf{x}_1 \vert{} \theta))\) and \(E_2 = (\mathbf{X}_2, \theta, f_2(\mathbf{x}_2 \vert{} \theta))\) where the unknown parameter \(\theta\) is the same in both experiments. Suppose that \(\mathbf{x}_1^*\) and \(\mathbf{x}_2^*\) are sample points from \(E_1\) and \(E_2\) respectively, such that \[L(\theta \vert{} \mathbf{x}_2^*) = CL(\theta \vert{} \mathbf{x}_1^*)\] for all \(\theta\) and for some constant \(C(\mathbf{x}_1^*, \mathbf{x}_2^*)\) that is independent of \(\theta\). Then \[\text{Ev}(E_1, \mathbf{x}_1^*) = \text{Ev}(E_2, \mathbf{x}_2^*).\]

Note that this is more general than the Likelihood Principle stated earlier, since it concerns two experiments (which we can of course take to be the same).

Likelihood Principle Corollary If \(E = (\mathbf{X}, \theta, f(\mathbf{x} \vert{} \theta))\) is an experiment then \(\text{Ev}(E, \mathbf{x})\) should depend on \(E\) and \(\mathbf{x}\) only through \(L(\theta \vert{} \mathbf{x})\).

Birnbaum’s Theorem The Formal Likelihood Principle follows from the Formal Sufficiency Principle and the Conditionality Principle. The converse is also true.

Many common statistical procedures violate the Formal Likelihood Principle – the topic of the applicability of these principles is not settled. For instance, checking the residuals of a model (to grade the model) violates the Sufficiency Principle. These notions are model dependent, so may not be applicable until after we have decided on a model.

The Equivariance Principle

Measurement Equivariance Inferences should not depend on the measurement scale used.

Formal Invariance If two inference problems have the same formal structure, in terms of the mathematical model used, then the same inference procedure should be used, regardless of the physical realisation.

Equivariance Principle If \(\mathbf{Y} = g(\mathbf{X})\) is a change of measurement scale such that the model for \(\mathbf{Y}\) has the same formal structure as the model for \(\mathbf{X}\), then an inference procedure should be both measurement equivariant and formally invariant.

A set of functions \(\{g(\mathbf{x}) : g \in \mathcal G\}\) from the sample space \(\mathcal X\) onto \(\mathcal X\) is called a group of transformations of \(\mathcal X\) if:

  • (i). (Inverse) For every \(g \in \mathcal G\) there is a \(g' \in \mathcal G\) such that \(g'(g(\mathbf{x})) = \mathbf{x}\) for all \(\mathbf{x} \in \mathcal X\).

  • (ii). (Composition) For every \(g \in \mathcal G\) and \(g' \in \mathcal G\) there exists \(g'' \in \mathcal G\) such that \(g'(g(\mathbf{x})) = g''(\mathbf{x})\) for all \(\mathbf x \in \mathcal X\).

  • (iii). (Identity) The identity, \(e(\mathbf x)\), defined by \(e(\mathbf x) =\mathbf x\) is an element of \(\mathcal G\), which is a consequence of (i) and (ii) and need not be verified separately.

Let \(\mathcal{F} = \{f(\mathbf{x} \vert{} \theta): \theta \in \Theta\}\) be a set of pdfs or pmfs for \(\mathbf{X}\), and let \(\mathcal{G}\) be a group of transformations of the sample space \(\mathcal X{}\). Then \(\mathcal{F}\) is invariant under the group \(\mathcal{G}\) if for every \(\theta \in \Theta\) and \(g \in \mathcal{G}\) there exists a unique \(\theta' \in \Theta\) such that \(\mathbf{Y} = g(\mathbf{X})\) has the distribution \(f(\mathbf{y} \vert{} \theta')\) whenever \(\mathbf{X}\) has the distribution \(f(\mathbf{x} \vert{} \theta)\).

7. Point Estimation

Point Estimator A point estimator is any function \(W(X_1, \dots, X_n)\) of a sample; that is, any statistic is a point estimator. An estimate is a realised value \(w(x_1, \dots, x_n)\).

Note that there is the implicit restriction that the estimator is not a function of the parameter you are trying to estimate.

Methods of Finding Estimators

Method of Moments

The method of moments is performed using a random sample \(X_1, \dots, X_n\) from a population with unknown parameters \(\theta_1, \dots, \theta_k\) by computing the first \(k\) sample moments \(m_j = \frac{1}{n}\sum_{i=1}^nX_i^j\), \(j=1,\dots,k\), and matching them with the corresponding population moments \(\mu_j' = \mathbb E[X^j]\).

\[\begin{aligned} m_1 &= \mu_1'(\theta_1, \dots, \theta_k)\\ m_2 &= \mu_2'(\theta_1, \dots, \theta_k)\\ &\vdots \\ m_k &= \mu_k' (\theta_1, \dots, \theta_k)\\ \end{aligned}\]

From this you get \(k\) simultaneous equations that you can use to solve for the parameters of the population.
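As a concrete sketch (assuming a \(gamma(\alpha,\beta)\) population in the shape-scale parametrisation, so that \(\mathbb E X=\alpha\beta\) and \(\text{Var}\,X=\alpha\beta^2\)): matching the first two moments gives \(\hat\beta = (m_2 - m_1^2)/m_1\) and \(\hat\alpha = m_1/\hat\beta\). The sample size and parameter values below are arbitrary.

```python
import numpy as np

# Method-of-moments sketch for a gamma(alpha, beta) population
# (shape alpha, scale beta): E[X] = alpha*beta, Var[X] = alpha*beta^2.
rng = np.random.default_rng(1)
alpha_true, beta_true, n = 2.0, 3.0, 100_000
x = rng.gamma(shape=alpha_true, scale=beta_true, size=n)

m1 = x.mean()                    # first sample moment
m2 = (x ** 2).mean()             # second sample moment

beta_hat = (m2 - m1 ** 2) / m1   # from m1 = alpha*beta and m2 - m1^2 = alpha*beta^2
alpha_hat = m1 / beta_hat
print("alpha_hat =", alpha_hat, " beta_hat =", beta_hat)
```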

Maximum Likelihood Estimators

Maximum Likelihood Estimator For each sample point \(\mathbf x\), let \(\hat{\boldsymbol\theta}(\mathbf x)\) be a parameter value at which the likelihood function \[L(\boldsymbol\theta \vert{} \mathbf x)=\prod_{i=1}^{n}f(x_i\lvert \theta_1,\cdots,\theta_k)\] attains its maximum as a function of \(\boldsymbol\theta\), with \(\mathbf x\) held fixed. A maximum likelihood estimator (MLE) of the parameter \(\boldsymbol\theta\) based on a sample \(\mathbf X\) is \(\hat{\boldsymbol\theta}(\mathbf X)\).

Suppose we want to find the MLE for some function of the parameter \(\tau(\theta)\).

Induced Likelihood Given some function of the parameter \(\tau(\theta)\), we define the induced likelihood function \(L^*\) by \[L^*(\eta \vert{} \mathbf x) = \sup_{\theta : \tau(\theta) = \eta} L(\theta \vert{} \mathbf x).\] The value \(\hat{\eta}\) that maximises \(L^*(\eta \vert{} \mathbf x)\) will be called the MLE of \(\eta = \tau(\theta)\). It can be seen that the maxima of \(L^*\) and \(L\) coincide.

Invariance Property of MLEs If \(\hat{\theta}\) is the MLE of \(\theta\), then for any function \(\tau(\theta)\), the MLE of \(\tau(\theta)\) is \(\tau(\hat{\theta})\).
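A minimal numerical sketch of the MLE and its invariance property (exponential data with an assumed rate and sample size): for iid exponential observations with rate \(\lambda\), the MLE is \(\hat\lambda = 1/\bar X\), and by invariance the MLE of the mean \(\tau(\lambda)=1/\lambda\) is simply \(\bar X\); a grid search over the log-likelihood agrees with the closed form.

```python
import numpy as np

# MLE and invariance sketch for iid Exponential(rate = lambda) data.
rng = np.random.default_rng(2)
lam_true, n = 0.5, 5_000
x = rng.exponential(scale=1.0 / lam_true, size=n)

lam_hat = 1.0 / x.mean()                     # closed-form MLE of the rate

# Grid search over the log-likelihood as a cross-check.
grid = np.linspace(0.01, 2.0, 2_000)
loglik = n * np.log(grid) - grid * x.sum()   # log L(lambda | x)
lam_hat_grid = grid[np.argmax(loglik)]

print("MLE of lambda:", lam_hat, " grid search:", lam_hat_grid)
# Invariance: the MLE of tau(lambda) = 1/lambda (the mean) is 1/lam_hat = xbar.
print("MLE of the mean:", 1.0 / lam_hat, "=", x.mean())
```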

  1. The MLE can be an unstable function of the data.

  2. When verifying maxima in multi-dimensional problems, try to avoid going down the Hessian route, which can be tedious.

To use bivariate calculus to verify that a function \(H(\theta_1,\theta_2)\) has a local maximum at \((\hat\theta_1,\hat\theta_2)\), it must be shown that the following three conditions hold.

    1. The first-order partial derivatives are \(0\), \[\frac{\partial}{\partial\theta_1}H(\theta_1,\theta_2)|_{\theta_1=\hat\theta_1,\theta_2=\hat\theta_2}=0\] and \[\frac{\partial}{\partial\theta_2}H(\theta_1,\theta_2)|_{\theta_1=\hat\theta_1,\theta_2=\hat\theta_2}=0\]
    1. At least one second-order partial derivative is negative, \[\frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2)|_{\theta_1=\hat\theta_1,\theta_2=\hat\theta_2}<0\] or \[\frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2)|_{\theta_1=\hat\theta_1,\theta_2=\hat\theta_2}<0\]
    1. The determinant of the Hessian (the matrix of second-order partial derivatives) is positive, \[\begin{vmatrix} \frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2)&\frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2)\\ \frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2)&\frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2) \end{vmatrix}_{\theta_1=\hat\theta_1,\theta_2=\hat\theta_2}\\ =\frac{\partial^2}{\partial\theta_1^2}H(\theta_1,\theta_2)\frac{\partial^2}{\partial\theta_2^2}H(\theta_1,\theta_2)-\left(\frac{\partial^2}{\partial\theta_1\partial\theta_2}H(\theta_1,\theta_2)\right)^2 >0\]
      This follows from the second-order Taylor expansion \[f(\theta) \approx f(\theta_0) + \nabla f(\theta_0)^T (\theta - \theta_0) + \frac12 (\theta - \theta_0)^T \nabla^2 f(\theta_0) (\theta - \theta_0).\] If \(\theta_0\) is a stationary point of \(f\), then \(\nabla f(\theta_0) = 0\), so if \(\nabla^2 f(\theta_0)\) is negative definite, then \((\theta - \theta_0)^T \nabla^2 f(\theta_0) (\theta - \theta_0) < 0\) for all \(\theta \neq \theta_0\), and \(\theta_0\) is a local maximum. For a \(2\times2\) Hessian, both eigenvalues are negative exactly when the determinant is positive and at least one diagonal entry is negative, which is what the conditions above check; a numerical sketch of this check follows the list.
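Here is the numerical sketch referred to above (a hypothetical concave function with its maximum placed at \((1, 2)\)); the Hessian is approximated by central finite differences and its determinant and eigenvalues are checked.

```python
import numpy as np

# Hypothetical concave function with a maximum at (theta1, theta2) = (1, 2).
def H(t1, t2):
    return -(t1 - 1.0) ** 2 - 2.0 * (t2 - 2.0) ** 2 - 0.5 * (t1 - 1.0) * (t2 - 2.0)

def hessian(f, t1, t2, h=1e-4):
    """Central finite-difference approximation to the 2x2 Hessian of f."""
    d11 = (f(t1 + h, t2) - 2 * f(t1, t2) + f(t1 - h, t2)) / h ** 2
    d22 = (f(t1, t2 + h) - 2 * f(t1, t2) + f(t1, t2 - h)) / h ** 2
    d12 = (f(t1 + h, t2 + h) - f(t1 + h, t2 - h)
           - f(t1 - h, t2 + h) + f(t1 - h, t2 - h)) / (4 * h ** 2)
    return np.array([[d11, d12], [d12, d22]])

Hmat = hessian(H, 1.0, 2.0)
# Local maximum: determinant positive and both eigenvalues negative.
print("determinant:", np.linalg.det(Hmat))
print("eigenvalues:", np.linalg.eigvalsh(Hmat))
```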

Bayes Estimators

If we denote the prior distribution by \(\pi(\theta)\) and the sampling distribution by \(f(\mathbf x \vert{} \theta)\), then the posterior distribution of \(\theta\) given the sample \(\mathbf x\) is \[\pi(\theta \vert{} \mathbf x) = \frac{f(\mathbf x\vert{}\theta)\pi(\theta)}{m(\mathbf x)},\] where \(m(\mathbf x)\) is the marginal distribution of the sample \(\mathbf x\) \[m(\mathbf x) = \int f(\mathbf x\vert{}\theta) \pi(\theta) d\theta.\]

Conjugate Priors Let \(\mathcal{F}\) denote the class of pdfs/pmfs \(f(\mathbf x \vert{} \theta)\) (indexed by \(\theta\)). A class \(\Pi\) of prior distributions is a conjugate family for \(\mathcal{F}\) if, for all \(f \in \mathcal{F}\), all priors in \(\Pi\), and all \(\mathbf x \in \mathcal X\), the posterior distribution is in \(\Pi\).

Note that this relation is not said to be symmetric.

[Some Examples]

  • Normal is self-conjugate as a family.

  • Beta distribution is conjugate to binomial.

  • Gamma distribution is conjugate to Poisson.
    Let \(X_1,\cdots, X_n\) be iid \(Poisson(\lambda)\), and let \(\lambda\) have a \(gamma(\alpha,\beta)\) distribution, the conjugate family for the Poisson.

For \(n\) observations, \(Y =\sum_i X_i\sim Poisson(n\lambda)\). The marginal pmf of \(Y\) is \[\begin{align} m(y)&=\int_{0}^{\infty}f(y|\lambda)\pi(\lambda)d\lambda\\ &=\int_{0}^{\infty}\frac{(n\lambda)^ye^{-n\lambda}}{y!}\frac{1}{\Gamma(\alpha)\beta^{\alpha}}\lambda^{\alpha-1}e^{-\lambda/\beta}d\lambda\\ &=\frac{n^y}{y!\Gamma(\alpha)\beta^{\alpha}}\int_{0}^{\infty}\lambda^{(y+\alpha)-1}e^{-n\lambda-\lambda/\beta}d\lambda\\ &=\frac{n^y}{y!\Gamma(\alpha)\beta^{\alpha}}\int_{0}^{\infty}\lambda^{(y+\alpha)-1}e^{-\frac{\lambda}{\beta/(n\beta+1)}}d\lambda\\ &=\frac{n^y}{y!\Gamma(\alpha)\beta^{\alpha}}\Gamma(y+\alpha)\left(\frac{\beta}{n\beta+1}\right)^{y+\alpha}\\ \end{align}\] Thus \[\pi(\lambda|y)=\frac{f(y|\lambda)\pi(\lambda)}{m(y)}=\frac{\lambda^{(y+\alpha)-1}e^{-\frac{\lambda}{\beta/(n\beta+1)}}}{\Gamma(y+\alpha)\left(\frac{\beta}{n\beta+1}\right)^{y+\alpha}},\] which is the \(gamma\left(y+\alpha, \frac{\beta}{n\beta+1}\right)\) pdf.
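A minimal numerical sketch of this conjugate update (prior parameters, true rate, and sample size are assumed for illustration): observing \(y=\sum_i x_i\) from \(n\) Poisson counts moves a \(gamma(\alpha,\beta)\) prior to a \(gamma\big(y+\alpha,\ \beta/(n\beta+1)\big)\) posterior.

```python
import numpy as np

# Gamma-Poisson conjugate update (shape-scale parametrisation, as above).
rng = np.random.default_rng(3)
alpha, beta = 2.0, 1.5            # assumed gamma(alpha, beta) prior
lam_true, n = 4.0, 25             # assumed true rate and sample size

x = rng.poisson(lam_true, size=n)
y = x.sum()

alpha_post = y + alpha
beta_post = beta / (n * beta + 1.0)

print("posterior: gamma(", alpha_post, ",", beta_post, ")")
# The posterior mean alpha_post * beta_post sits between the prior mean
# alpha * beta and the sample mean.
print("prior mean:", alpha * beta,
      " posterior mean:", alpha_post * beta_post,
      " sample mean:", x.mean())
```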

The EM (Expectation-Maximization) Algorithm

In general, we can move in either direction, from the complete-data problem to the incomplete-data problem or the reverse. If \(\mathbf Y = (Y_1, \cdots , Y_n)\) are the incomplete data, and \(\mathbf X = (X_1, \cdots, X_m)\) are the augmented data, making \((\mathbf Y, \mathbf X)\) the complete data, the densities \(g(\cdot\lvert \theta)\) of \(\mathbf Y\) and \(f(\cdot\lvert\theta)\) of \((\mathbf Y, \mathbf X)\) have the relationship \[g(\mathbf y\lvert\theta) = \int f(\mathbf y, \mathbf x\lvert\theta) d\mathbf x\] with sums replacing integrals in the discrete case.
If we turn these into likelihoods, \[L(\theta\lvert\mathbf y) = g(\mathbf y\lvert\theta)\] is the incomplete-data likelihood and \[L(\theta\lvert\mathbf y, \mathbf x)=f(\mathbf y,\mathbf x\lvert\theta)\] is the complete-data likelihood.

Methods of Evaluating Estimators

Mean Squared Error

The mean squared error (MSE) of an estimator \(W\) of a parameter \(\theta\) is defined by \(\mathbb E_{\theta}[(W - \theta)^2]\).

Bias-Variance Decomposition \[\mathbb E_{\theta}[(W - \theta)^2] = \text{Var}_{\theta} [W] + (\mathbb E_{\theta} [W] - \theta)^2 = \text{Var}_{\theta} [W] + (\text{Bias}_{\theta} [W])^2\]

Bias The bias of a point estimator \(W\) of a parameter \(\theta\) is given by \[\text{Bias}_{\theta} [W] = \mathbb E_{\theta} [W] - \theta.\] An estimator whose bias is \(0\) is called an unbiased estimator and has \(\mathbb E_{\theta}[W] = \theta,\quad\forall \theta\).

  • Clearly, if an estimator is unbiased, then its MSE is equal to its variance.

  • The MSE makes sense for location parameters but not so much for scale parameters, since it is symmetric and scale parameters have a natural floor at \(0\).

  • The MSE may be a function of the thing you’re trying to estimate. So which estimator you choose as being the ‘best’ may depend on the range you expect the parameter to lie within.

The MLE (biased) estimator of the variance of a normal distribution is \[\hat \sigma^2 = \frac{1}{N} \sum_{i=1}^{N}(X_i - \bar X)^2,\] where \(\bar X\) is the sample mean and \(X_i \sim^{iid} \mathcal{N}(\mu,\sigma^2)\). Since \[\frac{1}{\sigma ^2} \sum_{i=1}^{N}(X_i-\bar X)^2 \sim \chi^2_{N-1},\] we have that \[\begin{align} \text{Var}\big(\frac{N\hat \sigma^2}{\sigma^2}\big) = \text{Var}(\chi^2_{N-1}) = 2(N-1) \\ \implies \text{Var}(\hat \sigma ^2) = \frac{2(N-1) \sigma^4}{N^2} \end{align} \] The variance of this estimator is therefore \(\frac{2(N-1) \sigma^4}{N^2}\). The MSE is \[\text{MSE}=\text{Var}(\hat \sigma ^2)+\text{Bias}^2 =\text{Var}(\hat \sigma ^2)+(\mathbb E(\hat \sigma ^2)-\sigma ^2)^2=\frac{2(N-1) \sigma^4}{N^2}+ \left(\frac{N-1}{N}\sigma ^2 - \sigma ^2\right)^2 =\frac{2N-1}{N^2}\sigma ^4,\] where \[\begin{align} \mathbb E(\hat \sigma ^2)&=\mathbb E\left(\frac{1}{N}\sum_{i=1}^{N}(X_i-\bar{X})^2\right)\\ &=\frac{1}{N}\mathbb E\left(\sum_{i=1}^{N}X_i^2-2\sum_{i=1}^{N}X_i\bar{X}+\sum_{i=1}^{N}\bar{X}^2\right)\\ &=\frac{1}{N}\mathbb E\left(\sum_{i=1}^{N}X_i^2-N\bar{X}^2\right)\\ &=\frac{1}{N}\mathbb E\left(\sum_{i=1}^{N}X_i^2\right)-\mathbb E\left(\bar{X}^2\right)\\ &=\mathbb E\left(X^2\right)-\mathbb E\left(\bar{X}^2\right)\\ &=(\sigma_X^2+\mu^2)-(\sigma_{\bar X}^2+\mu^2)\\ &=\sigma_X^2-\sigma_{\bar X}^2\\ &=\sigma_X^2-\text{Var}(\bar X)\\ &=\sigma_X^2-\text{Var}\left(\frac{1}{N}\sum_{i=1}^{N}X_i\right)\\ &=\sigma_X^2-\frac{1}{N^2}\text{Var}\left(\sum_{i=1}^{N}X_i\right)\\ &=\sigma_X^2-\frac{1}{N^2}\sum_{i=1}^{N}\text{Var}\left(X_i\right)\\ &=\sigma_X^2-\frac{1}{N}\text{Var}\left(X\right)\\ &=\sigma_X^2-\frac{1}{N}\sigma_X^2\\ &=\frac{N-1}{N}\sigma_X^2=\frac{N-1}{N}\sigma^2.\\ \end{align}\]

An unbiased estimator of the variance of a normal distribution is \[S^2 = \frac{1}{N-1} \sum_{i=1}^{N}(X_i - \bar X)^2,\] since \(\mathbb E S^2=\sigma^2\). And since \[\frac{(N-1) S^2}{\sigma^2} \sim \chi_{N-1}^2,\] \[\begin{align} \text{Var}\big(\frac{(N-1)S^2}{\sigma^2}\big) = \text{Var}(\chi^2_{N-1}) = 2(N-1) \\ \implies \text{Var}(S^2) = \frac{2(N-1) \sigma^4}{(N-1)^2}=\frac{2\sigma^4}{N-1} \end{align} \] The variance of this estimator is equal to \(\frac{2\sigma^4}{N-1}\), which is also its MSE.

The variance (equivalently, the MSE) of the unbiased estimator, \(\frac{2\sigma^4}{N-1}\), is larger than both the variance of the biased estimator, \(\frac{2(N-1) \sigma^4}{N^2}\), and its MSE, \(\frac{2N-1}{N^2}\sigma ^4\); in this problem the biased MLE has uniformly smaller MSE.
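A Monte Carlo sketch of this comparison (with assumed \(\sigma^2\), \(N\), and number of replications): simulate both estimators repeatedly and compare their empirical MSEs with the formulas \(\frac{2N-1}{N^2}\sigma^4\) and \(\frac{2\sigma^4}{N-1}\).

```python
import numpy as np

# Monte Carlo comparison of the biased (1/N) and unbiased (1/(N-1))
# variance estimators for N(mu, sigma^2) data.
rng = np.random.default_rng(4)
mu, sigma2, N, reps = 0.0, 2.0, 10, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, N))
sig2_mle = x.var(axis=1, ddof=0)   # divides by N   (biased MLE)
s2 = x.var(axis=1, ddof=1)         # divides by N-1 (unbiased)

def mse(est):
    return np.mean((est - sigma2) ** 2)

print("MSE of MLE      :", mse(sig2_mle), " theory:", (2 * N - 1) / N ** 2 * sigma2 ** 2)
print("MSE of unbiased :", mse(s2),       " theory:", 2 * sigma2 ** 2 / (N - 1))
```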

\[ \begin{array}{llll} \text{Estimator} & \text{MLE} & \text{best-unbiased} & \text{Bayes} \\ \hline \text{Normal}(\mu) & \bar X & \bar X & \frac{\tau^2}{\tau^2+\sigma^2}\bar X+\frac{\sigma^2}{\tau^2+\sigma^2}\mu \\ \text{Normal}(\sigma^2) & \frac{1}{n}\sum(X_i-\bar X)^2=\frac{n-1}{n}S^2 & S^2=\frac{1}{n-1}\sum(X_i-\bar X)^2 & \frac{\sigma^2\tau^2}{\sigma^2+\tau^2} \\ \text{Bernoulli}(p) & \bar X & & \\ \text{Bernoulli}(p(1-p)) & & & \\ \text{Binomial}(np) & \frac{\sum_{i=1}^n X_i}{n} & & \frac{(\sum_{i=1}^n X_i) +\alpha}{\alpha+\beta+n} \\ \text{Binomial}(np(1-p)) & & & \\ \end{array} \]

Best Unbiased Estimators

[Best Unbiased Estimator] An estimator \(W^*\) is a best unbiased estimator of \(\tau(\theta)\) if it satisfies \(\mathbb E[W^*] = \tau(\theta) \,\, \forall \theta\), and for any other estimator with \(\mathbb E[W] = \tau(\theta) \,\, \forall \theta\) we have \(\text{Var}[W^*] \leq \text{Var}[W] \,\, \forall \theta\). We also call \(W^*\) the uniform minimum variance unbiased estimator (UMVUE) of \(\tau(\theta)\).

Suppose that we are trying to estimate \(\theta\) and consider the class of estimators \[\mathcal{C}_\tau = \{W: \mathbb E[W] = \tau(\theta)\}.\] All estimators in this class have the same bias, so we can compare their MSEs by comparing their variances alone. (So the best estimator in this class is just the minimum variance one.) This means that the considerations of this chapter can be applied to classes like \(\mathcal{C}_\tau\), even if \(\tau(\theta) \neq \theta\).

The best unbiased estimator, if it exists, could be hard to find. The following lower bound at least gives a stopping criterion to our search.

Cramér-Rao Inequality

Let \(X_1, \dots, X_n\) be a sample with pdf \(f(\mathbf x\vert{}\theta)\) and let \(W(\mathbf X) = W(X_1, \dots, X_n)\) be any estimator satisfying \[\frac{d}{d \theta} \mathbb E_{\theta}[W(\mathbf X)] = \int_{\mathcal X{}} \frac{\partial}{\partial \theta} \left(W(\mathbf x) f(\mathbf x\vert{}\theta) \right)d\mathbf x\] and \[\text{Var}_{\theta}[W(\mathbf X)]<\infty.\] Then \[\text{Var}_{\theta}[W(\mathbf X)] \geq \frac{\left(\frac{d}{d \theta} \mathbb E_{\theta}[W(\mathbf X)] \right)^2}{\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(\mathbf X\vert{} \theta) \right)^2 \right]}.\]

The proof of this considers the correlation between the gradient of the log likelihood and the statistic. Note that the sample in the above theorem is not necessarily iid.

proof

Covariance inequality For any two random variables \(X\) and \(Y\), \[(\text{Cov}[X, Y])^2 \leq \sigma_X^2 \sigma_Y^2=\text{Var}(X)\text{Var}(Y)\] and \[\text{Var}(X)\ge\frac{(\text{Cov}[X, Y])^2}{\text{Var}(Y)}\] The cleverness in this theorem follows from choosing \(X\) to be the estimator \(W(\mathbf X)\) and \(Y\) to be the quantity \(\frac{\partial}{\partial\theta} \log f(\mathbf X|\theta)\) and applying the Cauchy-Schwarz Inequality. First note that \[\begin{align} \frac{d}{d\theta}\mathbb E_{\theta}W(\mathbf X)&=\int_{\mathcal X}W(\mathbf X)\left[\frac{\partial}{\partial\theta}f(\mathbf X|\theta)\right]dx\\ &=\int_{\mathcal X}W(\mathbf X)\frac{\frac{\partial}{\partial\theta}f(\mathbf X|\theta)}{f(\mathbf X|\theta)}f(\mathbf X|\theta)dx\\ &=\mathbb E_{\theta}\left[W(\mathbf X)\frac{\frac{\partial}{\partial\theta}f(\mathbf X|\theta)}{f(\mathbf X|\theta)}\right]\\ &=\mathbb E_{\theta}\left[W(\mathbf X)\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\right]\\ \end{align}\] which suggests a covariance between \(W(\mathbf X)\) and \(\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\). For it to be a covariance, we need to subtract the product of the expected values, so we calculate \(\mathbb E_{\theta}\left[\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\right]\). If \(W(\mathbf X) = 1\), we have \[\frac{d}{d\theta}\mathbb E_{\theta}W(\mathbf X)=\frac{d}{d\theta}\mathbb E_{\theta}[1]=0=\mathbb E_{\theta}\left[\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\right]\]

Therefore \[\begin{align} \text{Cov}_{\theta}\left(W(\mathbf X), \frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\right)&=\mathbb E_{\theta}\left[W(\mathbf X)\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\right]-\mathbb E_{\theta}W(\mathbf X)\mathbb E_{\theta}\left[\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\right]\\ &=\mathbb E_{\theta}\left[W(\mathbf X)\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\right]\\ &=\frac{d}{d\theta}\mathbb E_{\theta}W(\mathbf X) \end{align}\] Also since \[\mathbb E_{\theta}\left[\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\right]=0\] we have \[\text{Var}_{\theta}\left(\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\right)=\mathbb E_{\theta}\left(\left(\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\right)^2\right)\] Using the Cauchy-Schwarz Inequality we obtain \[\text{Var}_{\theta}(W(\mathbf X))\ge \frac{\left(\frac{d}{d\theta}\mathbb E_{\theta}W(\mathbf X)\right)^2}{\mathbb E_{\theta}\left(\left(\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)\right)^2\right)}=\frac{\left(\frac{d}{d\theta}\mathbb E_{\theta}W(\mathbf X)\right)^2}{\text{(information number)}}\]

Cramér-Rao Inequality, iid case If the assumptions of the last theorem are satisfied and, additionally, \(X_1,\cdots ,X_n\) are iid with pdf \(f(x|\theta)\), then \[\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(\mathbf X \vert{} \theta) \right)^2 \right]=n\,\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(X \vert{} \theta) \right)^2 \right],\] so the bound becomes \[\text{Var}_{\theta}[W(\mathbf X)] \geq \frac{\left(\frac{d}{d \theta} \mathbb E_{\theta}[W(\mathbf X)] \right)^2}{n\,\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(X\vert{} \theta) \right)^2 \right]}.\]

proof Since \(X_1, \dots, X_n\) are independent, \[\begin{align} \mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(\mathbf X \vert{} \theta) \right)^2 \right]&=\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log \prod_{i=1}^{n}f(X_i \vert{} \theta) \right)^2 \right]\\ &=\mathbb E_{\theta}\left[ \left(\sum_{i=1}^{n}\frac{\partial}{\partial \theta} \log f(X_i \vert{} \theta) \right)^2 \right]\\ &=\sum_{i=1}^{n}\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(X_i \vert{} \theta) \right)^2 \right]\\ &+\sum_{i\ne j}^{}\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(X_i \vert{} \theta) \right) \left(\frac{\partial}{\partial \theta} \log f(X_j \vert{} \theta) \right)\right]\\ \end{align}\] since \(X_1, \dots, X_n\) are independent, for \(i\ne j\) we have \[\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(X_i \vert{} \theta) \right) \left(\frac{\partial}{\partial \theta} \log f(X_j \vert{} \theta) \right)\right] =\mathbb E_{\theta}\left(\frac{\partial}{\partial \theta} \log f(X_i \vert{} \theta) \right) \mathbb E_{\theta}\left(\frac{\partial}{\partial \theta} \log f(X_j \vert{} \theta) \right)=0\] So \[\begin{align} \mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(\mathbf X \vert{} \theta) \right)^2 \right]=\sum_{i=1}^{n}\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(X_i \vert{} \theta) \right)^2 \right]=n\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(X \vert{} \theta) \right)^2 \right]\\ \end{align}\]

Fisher Information

Fisher Information The quantity \[\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(\mathbf X \vert{} \theta) \right)^2 \right]\] is called the Fisher Information or Information number of the sample \(\mathbf X\). This terminology reflects the fact that the information number gives a bound on the variance of the best unbiased estimator of \(\theta\). As the information number gets bigger and we have more information about \(\theta\), we have a smaller bound on the variance of the best unbiased estimator.

If \(f(x \vert{} \theta)\) satisfies \[\frac{d}{d \theta} \mathbb E_{\theta}\left[ \frac{\partial}{\partial \theta} \log f(X \vert{} \theta) \right] = \int \frac{\partial}{\partial \theta} \left[ \left( \frac{\partial}{\partial \theta} \log f(x \vert{} \theta) \right) f(x \vert{} \theta) \right] dx\] (as is true for an exponential family), the partial derivative with respect to \(\theta\) of the natural logarithm of the likelihood function is called the score. Under certain regularity conditions, if \(\theta\) is the true parameter (i.e. \(X\) is actually distributed as \(f(x \vert{} \theta)\)), the expected value (the first moment) of the score, evaluated at the true parameter value \(\theta\), is \(0\):

\[\begin{align} \mathbb E_{\theta}\left[\frac{\partial}{\partial \theta} \log f(X \vert{} \theta) \right]&=\int_{\mathbb R}^{}\left[\frac{\partial}{\partial \theta} \log f(X \vert{} \theta) \right]f(X \vert{} \theta) dx\\ &=\int_{\mathbb R}^{}\frac{\frac{\partial}{\partial \theta} f(X \vert{} \theta)}{f(X \vert{} \theta)} f(X \vert{} \theta) dx\\ &=\frac{\partial}{\partial \theta}\int_{\mathbb R}^{} f(X \vert{} \theta) dx\\ &=\frac{\partial}{\partial \theta}\cdot 1\\ &=0\\ \end{align}\] The variance of the score is defined to be the Fisher information: \[\mathcal I_{\theta}=\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(X \vert{} \theta) \right)^2 \right]=\int_{\mathbb R}\left(\frac{\partial}{\partial \theta} \log f(X \vert{} \theta) \right)^2f(X \vert{} \theta)dx\] Since \[\begin{align} \frac{\partial^2}{\partial \theta^2} \log f(X \vert{} \theta)&=\frac{\partial}{\partial \theta}\frac{\frac{\partial}{\partial \theta} f(X \vert{} \theta)}{f(X \vert{} \theta)}\\ &=\frac{\left[\frac{\partial^2}{\partial \theta^2} f(X \vert{} \theta)\right]f(X \vert{} \theta)-\left[\frac{\partial}{\partial \theta} f(X \vert{} \theta)\right]^2}{(f(X \vert{} \theta))^2}\\ &=\frac{\frac{\partial^2}{\partial \theta^2} f(X \vert{} \theta)}{f(X \vert{} \theta)}-\frac{\left[\frac{\partial}{\partial \theta} f(X \vert{} \theta)\right]^2}{(f(X \vert{} \theta))^2}\\ &=\frac{\frac{\partial^2}{\partial \theta^2} f(X \vert{} \theta)}{f(X \vert{} \theta)}-\left(\frac{\partial}{\partial \theta} \log f(X \vert{} \theta)\right)^2\\ \end{align}\] and \[\mathbb E_{\theta}\left[\frac{\frac{\partial^2}{\partial \theta^2} f(X \vert{} \theta)}{f(X \vert{} \theta)}\right]=\frac{\partial^2}{\partial \theta^2}\int_{\mathbb R}f(X \vert{} \theta)dx=0\] then the Fisher information can be written \[\begin{equation} \mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log f(X \vert{} \theta) \right)^2 \right] =\mathbb E_{\theta}\left[ \left(\frac{\frac{\partial}{\partial \theta} f(X \vert{} \theta)}{f(X \vert{} \theta)}\right)^2 \right]= - \mathbb E_{\theta}\left[ \frac{\partial^2}{\partial \theta^2} \log f(X \vert{} \theta) \right] \tag{1} \end{equation}\]

Even if the Cramér-Rao bound is applicable, it may not be sharp – there may not be an estimator that attains this bound.

  • For Bernoulli RVs, the Fisher information is \[\begin{align} I_X(p)&=E_p\left[\left(\frac{d}{dp}\log\left(p^X(1-p)^{1-X}\right)\right)^2\right]\\ &=E_p\left[\left(\frac{d}{dp}\left[X\log p+(1-X)\log (1-p)\right]\right)^2\right]\\ &=E_p\left[\left(\frac Xp-\frac{1-X}{1-p}\right)^2\right]\\ &=E_p\left[\left(\frac Xp\right)^2-2\left(\frac Xp\frac{1-X}{1-p}\right)+\left(\frac{1-X}{1-p}\right)^2\right]\\ &=E_p\left(\frac{X^2}{p^2}\right)-2E_p\left(\frac{X(1-X)}{p(1-p)}\right)+E_p\left(\frac{(1-X)^2}{(1-p)^2}\right)\\ &=\frac{1}{p}-0+\frac{1}{(1-p)}\\ &=\frac{1}{p(1-p)}\\ \end{align}\]

Since \[E_p(X^2)=0^2\cdot p_X(0)+1^2\cdot p_X(1)=0^2(1-p)+1^2p=p,\] \[E_p((1-X)^2)=1-p,\]

  • For Exponential RVs \(X \sim \exp(\beta)\), If the density is \[f(X|\beta) = \frac1\beta e^{-\frac1\beta X}\] The Fisher information is \[\begin{align*} \mathcal I_{\beta} & = n\mathbb E_{\beta}\left( \left(\frac{\partial \log f(X| \beta)}{\partial \beta}\right)^2\right) \\ & = n\int_0^\infty \left(\frac{\partial \log f(X|\beta)}{\partial \beta}\right)^2 \, f(X|\beta) \, dx \\ &=n\int_0^\infty \left(\frac{\partial}{\partial \beta}\left(-\frac{1}{\beta}x+\log\frac{1}{\beta}\right)\right)^2 \, \frac{1}{\beta}e^{-\frac{1}{\beta}x} \, dx \\ & = n\int_0^\infty \left(\frac{x}{\beta^2} -\frac{1}{\beta}\right)^2 \, \frac{1}{\beta}e^{-\frac{1}{\beta}x} \, dx \\ & = n\int_0^\infty \left(\frac{x^2}{\beta^4} -2\frac{x}{\beta^3}+\frac{1}{\beta^2}\right) \, \frac{1}{\beta}e^{-\frac{1}{\beta}x} \, dx \\ &=n\left(\frac{2}{\beta^2}-\frac{2}{\beta^2}+\frac{1}{\beta^2}\right)\\ &=\frac{n}{\beta^2}\\ \end{align*}\]

  • For Exponential RVs \(X \sim \exp(\lambda)\), If the density is \[f(X|\lambda) = \lambda e^{-\lambda X}\] The Fisher information is \[\begin{align*} \mathcal I_{\lambda} & = n\mathbb E_{\lambda}\left( \left(\frac{\partial \log f(X| \lambda)}{\partial \lambda}\right)^2\right) \\ & = n\int_0^\infty \left(\frac{\partial \log f(X|\lambda)}{\partial \lambda}\right)^2 \, f(X|\lambda) \, dx \\ &=n\int_0^\infty \left(\frac{\partial}{\partial \lambda}\left(-\lambda x+\log\lambda\right)\right)^2 \, \lambda e^{-\lambda X}\, dx \\ & = n\int_0^\infty \left(-x +\frac{1}{\lambda}\right)^2 \, \lambda e^{-\lambda X} \, dx \\ & = n\int_0^\infty \left(x^2 -2\frac{x}{\lambda}+\frac{1}{\lambda^2}\right) \, \lambda e^{-\lambda X} \, dx \\ &=n\left(\frac{2}{\lambda^2}-\frac{2}{\lambda^2}+\frac{1}{\lambda^2}\right)\\ &=\frac{n}{\lambda^2}\\ \end{align*}\]

\[ \begin{array}{llll} & \text{Fisher Information} & \text{Cramér-Rao lower bound} \\ \hline \text{Exponential}(\beta) & \frac{n}{\beta^2} & \frac{\beta^2}{n} \\ \text{Bernoulli}(p) & \frac{1}{p(1-p)} & p(1-p) \\ \text{Normal}(\sigma^2) & \frac{n}{2\sigma^4} & \frac{2\sigma^4}{n} \\ \end{array} \]
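The entries of this table can be checked numerically (a Monte Carlo sketch with an assumed parameter value), using the per-observation identity \(\mathcal I(\theta)=\mathbb E_\theta\big[\big(\frac{\partial}{\partial\theta}\log f(X|\theta)\big)^2\big]\); here the Bernoulli row is verified.

```python
import numpy as np

# Monte Carlo check of the per-observation Fisher information for Bernoulli(p):
# E[(d/dp log f(X|p))^2] should be close to 1 / (p (1 - p)).
rng = np.random.default_rng(5)
p, reps = 0.3, 1_000_000

x = rng.binomial(1, p, size=reps)
score = x / p - (1 - x) / (1 - p)            # d/dp log f(X|p)

print("E[score]   ~", score.mean())          # should be near 0
print("E[score^2] ~", (score ** 2).mean())   # Monte Carlo Fisher information
print("1/(p(1-p)) =", 1.0 / (p * (1 - p)))
```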

The Matrix form

When there are \(N\) parameters, so that \(\theta\) is an \(N \times 1\) vector \[\displaystyle\boldsymbol\theta =\begin{bmatrix} \theta _{1}&\theta _{2}&\dots &\theta _{N} \end{bmatrix}^{T},\] the Fisher information takes the form of an \(N \times N\) matrix. This matrix is called the Fisher information matrix (FIM) and has typical element \[[\mathcal I(\theta)]_{i,j}=\mathbb E_{\theta}\left[\left(\frac{\partial}{\partial\theta_i}\log f(X|\theta)\right)\left(\frac{\partial}{\partial\theta_j}\log f(X|\theta)\right)\right]\] The FIM is an \(N \times N\) positive semidefinite matrix. If it is positive definite, then it defines a Riemannian metric on the \(N\)-dimensional parameter space.

Under certain regularity conditions, the Fisher information matrix may also be written as \[[\mathcal I(\theta)]_{i,j}=-\mathbb E_{\theta}\left[\frac{\partial}{\partial\theta_i}\frac{\partial}{\partial\theta_j}\log f(X|\theta)\right]\]

[Attainment] Let \(X_1, \dots, X_n\) be iid \(X \sim f(x \vert{} \theta)\), where \(f(x \vert{} \theta)\) satisfies the conditions of the Cramér-Rao Theorem. Let \(L(\theta \vert{} \mathbf x) = \prod_{i=1}^n f(x_i \vert{} \theta)\) denote the likelihood function. If \(W(\mathbf X)=W(X_1,\cdots,X_n)\) is any unbiased estimator of \(\tau(\theta)\), then \(W(\mathbf X)\) attains the Cramér-Rao lower bound if and only if there exists a function \(a(\theta)\) such that \[a(\theta) [W(\mathbf x) - \tau(\theta)] = \frac{\partial}{\partial \theta} \log L(\theta \vert{} \mathbf x).\]

Proof: Recalling that \(\mathbb E_{\theta}W=\tau(\theta)\) and \(\mathbb E_{\theta}\left(\frac{\partial}{\partial\theta}\log\prod_{i=1}^{n}f(X_i|\theta)\right)=0\), the Cramér-Rao Inequality can be written as \[\begin{align} \left[\text{Cov}_{\theta}\left(W(\mathbf X),\frac{\partial}{\partial\theta}\log\prod_{i=1}^{n}f(X_i|\theta)\right)\right]^2&=\left[\mathbb E_{\theta}\bigg(W(\mathbf X)-\tau(\theta)\bigg)\left(\frac{\partial}{\partial\theta}\log\prod_{i=1}^{n}f(X_i|\theta)-0\right)\right]^2\\ &=\left[\mathbb E_{\theta}\left(W(\mathbf X)\frac{\partial}{\partial\theta}\log\prod_{i=1}^{n}f(X_i|\theta)\right)\right]^2\\ &\le\text{Var}_{\theta}\left[W(\mathbf X)\right]\text{Var}_{\theta}\left[\frac{\partial}{\partial\theta}\log\prod_{i=1}^{n}f(X_i|\theta)\right]\end{align}\] We have equality (attainment of the bound) if and only if \(W(\mathbf X)-\tau(\theta)\) is proportional to \(\frac{\partial}{\partial\theta}\log\prod_{i=1}^{n}f(X_i|\theta)\), with a proportionality factor that may depend on \(\theta\).

Sufficiency and Unbiasedness

Rao-Blackwell Let \(W\) be any unbiased estimator of \(\tau(\theta)\) and let \(T\) be a sufficient statistic for \(\theta\). Define \(\phi(T) = \mathbb E[W \vert{} T]\). Then \(\mathbb E_{\theta}[\phi(T)] = \tau(\theta)\) and \(\text{Var}_{\theta}[\phi(T)] \leq \text{Var}_{\theta}[W] \,\, \forall \theta\). That is, \(\phi(T)\) is a uniformly better unbiased estimator of \(\tau(\theta)\).

Proof: Since \[\mathbb E X=\mathbb E[\mathbb E(X|Y)]\] \[\text{Var} X=\text{Var}[\mathbb E(X|Y)]+\mathbb E[\text{Var} (X|Y)]\] then we have \[\tau(\theta)=\mathbb E_{\theta}W=\mathbb E_{\theta}[\mathbb E(W|T)]=\mathbb E_{\theta}\phi(T)\] so \(\phi(T)\) is unbiased for \(\tau(\theta)\). Also \[\begin{align} \text{Var}_{\theta}W&=\text{Var}_{\theta}[\mathbb E(W|T)]+\mathbb E_{\theta}[\text{Var} (W|T)]\\ &=\text{Var}_{\theta}[\phi(T)]+\mathbb E_{\theta}[\text{Var} (W|T)]\\ &\ge\text{Var}_{\theta}[\phi(T)] \end{align}\] Hence \(\phi(T)\) is uniformly better than \(W\), and it only remains to show that \(\phi(T)\) is indeed an estimator. That is, we must show that \(\phi(T) = \mathbb E[W \vert{} T]\) is a function of only the sample and, in particular, is independent of \(\theta\). But it follows from the definition of sufficiency, and the fact that \(W\) is a function only of the sample, that the distribution of \(W|T\) is independent of \(\theta\). Hence \(\phi(T)\) is a uniformly better unbiased estimator of \(\tau(\theta)\).

  • Conditioning any unbiased estimator on a sufficient statistic will result in a uniform improvement, so we need consider only statistics that are functions of a sufficient statistic when looking for best unbiased estimators.

  • The proof doesn’t require that the statistic we condition on is sufficient, but if it isn’t then the resulting quantity will probably depend on the parameter we are trying to estimate and so will not be an estimator.
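A classical sketch of Rao-Blackwellisation (Poisson example with an assumed rate and sample size): to estimate \(\tau(\lambda)=e^{-\lambda}=P_\lambda(X=0)\), start with the crude unbiased estimator \(W=I(X_1=0)\) and condition on the sufficient statistic \(T=\sum_i X_i\), which gives \(\phi(T)=\mathbb E[W\mid T]=\big((n-1)/n\big)^T\); both are unbiased, but the conditioned estimator has visibly smaller variance.

```python
import numpy as np

# Rao-Blackwell sketch: estimate tau(lambda) = P(X = 0) = exp(-lambda)
# from iid Poisson(lambda) data.
rng = np.random.default_rng(6)
lam, n, reps = 2.0, 10, 100_000

x = rng.poisson(lam, size=(reps, n))
T = x.sum(axis=1)

W = (x[:, 0] == 0).astype(float)      # crude unbiased estimator I(X1 = 0)
phi = ((n - 1) / n) ** T              # E[W | T], also unbiased

print("target exp(-lambda):", np.exp(-lam))
print("means    :", W.mean(), phi.mean())   # both approximately unbiased
print("variances:", W.var(), phi.var())     # phi(T) has much smaller variance
```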

Uniqueness of best unbiased estimator

If \(W\) is a best unbiased estimator of \(\tau(\theta)\) then \(W\) is unique.

Proof: Suppose \(W'\) is another best unbiased estimator, and consider the estimator \(W^*=\frac{1}{2}(W + W')\). Note that \(\mathbb E_{\theta}W^* = \tau(\theta)\) and
\[\begin{align} \text{Var}_{\theta}W^*&=\text{Var}_{\theta}\left(\frac{1}{2}W+\frac{1}{2}W'\right)\\ &=\frac{1}{4}\text{Var}_{\theta}W+\frac{1}{4}\text{Var}_{\theta}W'+\frac{1}{2}\text{Cov}_{\theta}(W,W')\\ &\le\frac{1}{4}\text{Var}_{\theta}W+\frac{1}{4}\text{Var}_{\theta}W'+\frac{1}{2}\left[\text{Var}_{\theta}(W)\text{Var}_{\theta}(W')\right]^{1/2}\\ &=\text{Var}_{\theta}W \quad\quad(\text{Var}_{\theta}W=\text{Var}_{\theta}W')\\ \end{align}\] But if the above inequality is strict, then the best unbiasedness of \(W\) is contradicted, so we must have equality for all \(\theta\). Since the inequality is an application of Cauchy-Schwarz, we can have equality only if \(W' = a(\theta) W + b(\theta)\). Now using properties of covariance, we have \[\begin{align} \text{Cov}_{\theta}(W, W')&=\text{Cov}_{\theta}(W, a(\theta) W + b(\theta))\\ &=\text{Cov}_{\theta}(W, a(\theta) W)\\ &=a(\theta)\text{Var}_{\theta}(W)\\ \end{align}\] But \[\text{Cov}_{\theta}(W, W')=\text{Var}_{\theta}(W)\] since we had equality in the Cauchy-Schwarz step above; hence \(a(\theta)=1\), and since \(\mathbb E_{\theta}W' = \tau(\theta)\), we must have \(b(\theta)=0\) and \(W=W'\).

The following theorem is mostly useful to show that a given estimator isn’t best unbiased.

If \(\mathbb E_{\theta}[W] = \tau(\theta)\), \(W\) is the best unbiased estimator of \(\tau(\theta)\) if and only if \(W\) is uncorrelated with all unbiased estimators of \(0\).

Proof: Suppose \(W\) satisfies \(\mathbb E_{\theta}[W] = \tau(\theta)\), and we have another statistic, \(U\), that satisfies \(\mathbb E_{\theta}U=0\) for all \(\theta\); that is, \(U\) is an unbiased estimator of \(0\). The estimator \[\phi_a=W+aU,\] where \(a\) is a constant, satisfies \(\mathbb E_{\theta}\phi_a = \tau(\theta)\) and hence is also an unbiased estimator of \(\tau(\theta)\). The variance of \(\phi_a\) is \[\text{Var}_{\theta}\phi_a=\text{Var}_{\theta}(W+aU)\\ =\text{Var}_{\theta}W+a^2\text{Var}_{\theta}U+2a\text{Cov}_{\theta}(W,U)\] Now, if for some \(\theta=\theta_0\), \(\text{Cov}_{\theta_0}(W,U)<0\), then we can make \(2a\text{Cov}_{\theta_0}(W,U)+a^2\text{Var}_{\theta_0}U<0\) by choosing \(a\in(0, -2\text{Cov}_{\theta_0}(W,U)/\text{Var}_{\theta_0}U)\). Hence, \(\phi_a\) will be better than \(W\) at \(\theta=\theta_0\) and \(W\) can’t be best unbiased. A similar argument shows that if \(\text{Cov}_{\theta_0}(W,U)>0\) for any \(\theta_0\), \(W\) also cannot be best unbiased.

If \(W\) is best unbiased, the above argument shows that \(W\) must satisfy \(\text{Cov}_{\theta}(W,U)=0\) for all \(\theta\), for any \(U\) satisfying \(\mathbb E_{\theta}U=0\). Hence the necessity is established.

Suppose now that we have an unbiased estimator \(W\) that is uncorrelated with all unbiased estimators of \(0\). Let \(W'\) be any other estimator satisfying \(\mathbb E_{\theta}W'=\mathbb E_{\theta}W=\tau(\theta)\); we will show that \(W\) is better than \(W'\). Write \[W'=W+(W'-W),\] and calculate \[\begin{align} \text{Var}_{\theta}W'&=\text{Var}_{\theta}W+\text{Var}_{\theta}(W'-W)+2\text{Cov}_{\theta}(W,W'-W)\\ &=\text{Var}_{\theta}W+\text{Var}_{\theta}(W'-W) \end{align}\] where the last equality is true because \(W'-W\) is an unbiased estimator of \(0\) and is uncorrelated with \(W\) by assumption. Since \(\text{Var}_{\theta}(W'-W)\ge 0\), we have \(\text{Var}_{\theta}W'\ge \text{Var}_{\theta}W\). Since \(W'\) is arbitrary, it follows that \(W\) is the best unbiased estimator of \(\tau(\theta)\).

The idea comes from considering \(\phi_a = W + aU\) where \(\mathbb E[U] = 0\), then considering the variance.

Unbiased estimator of \(0\) Note that an unbiased estimator of \(0\) is simply random noise (there is no information in an estimator of \(0\)). If an estimator can be improved by adding noise, then it is probably defective.

We are now in a position such that, if we can characterise all of the unbiased estimators of \(0\) then we can check if a given estimator is best unbiased. In general this is not easy and requires conditions on the distribution. However, if a distribution is complete then it admits no unbiased estimators of \(0\) other than \(0\) itself (by definition), so we will be done.

Note that due to the Rao-Blackwell theorem, only the distribution of the sufficient statistic needs to be complete (not the underlying population distribution).

Completeness and Best Unbiasedness

Let \(T\) be a complete sufficient statistic for a parameter \(\theta\) and let \(\phi(T)\) be any estimator based only on \(T\). Then \(\phi(T)\) is the unique best unbiased estimator of its expected value.

Lehmann-Scheffé Unbiased estimators based on complete sufficient statistics are unique.

Loss Function Optimality

8. Hypothesis Testing

Introduction

  • Test statistic: \(W(X_1,..., X_n) = W(\mathbf X)\)
  • \(H_0: \theta \in \Theta_0\)
  • \(H_1: \theta \in \Theta^c_0\)

Methods of Finding Tests

Likelihood Ratio Tests

Likelihood ratio test (LRT) statistic for testing \(H_0: \theta \in \Theta_0\) versus \(H_1: \theta \in \Theta^c_0\) is:

\[\lambda(\mathbf x) = \frac{\underset{\Theta_0}{\sup}L(\theta|\mathbf x)}{\underset{\Theta}{\sup}L(\theta|\mathbf x)}\] A likelihood ratio test (LRT) is any test that has a rejection region of the form \(\{\mathbf x : \lambda(\mathbf x)\le c\}\), where \(c\) is any number satisfying \(0\le c \le 1\). With \[L(\theta|x_1,..., x_n) = f(\mathbf x|\theta) = \prod^n_{i=1}f(x_i|\theta),\] the numerator is the maximum probability of the observed sample, the maximum being computed over parameters in the null hypothesis, while the denominator is the maximum probability of the observed sample over all possible parameters. A small likelihood ratio suggests rejecting \(H_0\).

Suppose \(\hat\theta\), an MLE of \(\theta\), exists; \(\hat\theta\) is obtained by doing an unrestricted maximization of \(L(\theta|\mathbf x)\). We can also consider the MLE of \(\theta\), call it \(\hat\theta_0\), obtained by doing a restricted maximization, assuming \(\Theta_0\) is the parameter space. That is, \(\hat\theta_0=\hat\theta_0(\mathbf x)\) is the value of \(\theta\in\Theta_0\) that maximizes \(L(\theta|\mathbf x)\). Then, the LRT statistic is \[\lambda(\mathbf x)=\frac{L(\hat\theta_0|\mathbf x)}{L(\hat\theta|\mathbf x)}\]
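For instance (a sketch in an assumed simple setting), for an iid \(n(\theta,1)\) sample and \(H_0:\theta=\theta_0\) versus \(H_1:\theta\ne\theta_0\), the restricted MLE is \(\hat\theta_0=\theta_0\) and the unrestricted MLE is \(\bar x\), so \(\lambda(\mathbf x)=\exp\{-n(\bar x-\theta_0)^2/2\}\); rejecting for small \(\lambda\) is the same as rejecting when \(\bar x\) is far from \(\theta_0\).

```python
import numpy as np

# LRT sketch: iid n(theta, 1) sample, H0: theta = theta0 vs H1: theta != theta0.
# Here lambda(x) = exp(-n (xbar - theta0)^2 / 2), so rejecting for small lambda
# is equivalent to rejecting for large |xbar - theta0|.
rng = np.random.default_rng(7)
theta_true, theta0, n = 0.4, 0.0, 30

x = rng.normal(theta_true, 1.0, size=n)
xbar = x.mean()

lam = np.exp(-n * (xbar - theta0) ** 2 / 2.0)
c = np.exp(-1.96 ** 2 / 2.0)     # lambda <= c  <=>  sqrt(n)|xbar - theta0| >= 1.96

print("lambda(x) =", lam, " reject H0:", lam <= c)
```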

Likelihood Ratio Tests using a sufficient statistic

If \(T(\mathbf X)\) is a sufficient statistic for \(\theta\) and \(\lambda^*(t)\) and \(\lambda(\mathbf x)\) are the LRT statistics based on \(T\) and \(\mathbf X\), respectively, then \(\lambda^*(T(\mathbf x))=\lambda(\mathbf x)\) for every \(\mathbf x\) in the sample space.

Proof: From the Factorization Theorem, the pdf or pmf of \(\mathbf X\) can be written as \(f(\mathbf x|\theta) = g(T(\mathbf x)|\theta)h(\mathbf x)\), where \(g(t|\theta)\) is the pdf or pmf of \(T\) and \(h(\mathbf x)\) does not depend on \(\theta\). Thus \[\begin{align} \lambda(\mathbf x) &= \frac{\underset{\Theta_0}{\sup}L(\theta|\mathbf x)}{\underset{\Theta}{\sup}L(\theta|\mathbf x)}\\ &=\frac{\underset{\Theta_0}{\sup}f(\mathbf x|\theta)}{\underset{\Theta}{\sup}f(\mathbf x|\theta)}\\ &=\frac{\underset{\Theta_0}{\sup}g(T(\mathbf x)|\theta)h(\mathbf x)}{\underset{\Theta}{\sup}g(T(\mathbf x)|\theta)h(\mathbf x)}\\ &=\frac{\underset{\Theta_0}{\sup}g(T(\mathbf x)|\theta)}{\underset{\Theta}{\sup}g(T(\mathbf x)|\theta)}\\ &= \frac{\underset{\Theta_0}{\sup}L^*(\theta|T(\mathbf x))}{\underset{\Theta}{\sup}L^*(\theta|T(\mathbf x))}\\ &=\lambda^*(T(\mathbf x))\\ \end{align}\]

Bayesian tests

Normal Bayesian test: for \(H_0:\theta\le\theta_0\), accept \(H_0\) if and only if \[\frac{1}{2} \leq P(\theta \in \Theta_0|\mathbf X) = P(\theta \leq \theta_0|\mathbf X)\]

Union-intersection and intersection-union tests

Union-intersection test: \[H_0: \theta \in \underset{\gamma \in \Gamma}\bigcap \Theta_{\gamma}\] intersection-union test: \[H_0: \theta \in \underset{\gamma \in \Gamma}\bigcup \Theta_{\gamma}\]

Methods of Evaluating Tests

Error Probabilities and the Power Function

  • Type I Error (FP): \(\theta \in \Theta_0\) but the hypothesis test incorrectly decides to reject \(H_0\).
  • Type II Error (FN): \(\theta \in \Theta^c_0\) but the test decides to accept \(H_0\).

Power function: The power function of a hypothesis test with rejection region \(R\) is the function of \(\theta\) defined by \[\beta(\theta) = P_{\theta}(\mathbf X \in R)\] Ideally, \(\beta(\theta) = 0\) for all \(\theta \in \Theta_0\) and \(\beta(\theta) = 1\) for all \(\theta \in \Theta^c_0\).

Let \(X_1,\cdots,X_n\) be a random sample from a \(n(\theta, \sigma^2)\) population, \(\sigma^2\) known. An LRT of \(H_0 : \theta\le\theta_0\) versus \(H_1 : \theta > \theta_0\) is a test that rejects \(H_0\) if \[\frac{\bar{X} - \theta_0}{\sigma/ \sqrt{n}} > c\] The constant \(c\) can be any positive number. The power function of this test is: \[\begin{align} \beta(\theta) &= P_{\theta}\left(\frac{\bar{X} - \theta_0}{\sigma/ \sqrt{n}} > c\right)\\ &= P_{\theta}\left(\frac{\bar{X} - \theta}{\sigma/ \sqrt{n}} > c+\frac{\theta_0 - \theta}{\sigma/ \sqrt{n}}\right)\\ &=P\left(Z > c + \frac{\theta_0 - \theta}{\sigma/ \sqrt{n}}\right) \end{align}\] where \(Z\) is a standard normal random variable, since \[\frac{\bar{X} - \theta}{\sigma/ \sqrt{n}}\sim n(0,1)\] As \(\theta\) increases from \(-\infty\to \infty\), it is easy to see that this normal probability increases from \(0\) to \(1\). Therefore, it follows that \(\beta(\theta)\) is an increasing function of \(\theta\), with \[\underset{\theta\to-\infty}{\lim}\beta(\theta)=0, \;\;\underset{\theta\to+\infty}{\lim}\beta(\theta)=1,\] and \[\beta(\theta_0)=\alpha\quad\text{if}\quad P(Z>c)=\alpha\] \(\alpha\) is the Type I Error probability.
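This power curve is easy to evaluate numerically (a sketch with assumed values of \(\theta_0\), \(\sigma\), \(n\), and \(c = z_{\alpha}\) for \(\alpha = 0.05\)):

```python
import math

# Power function beta(theta) = P(Z > c + (theta0 - theta) / (sigma / sqrt(n)))
# for the one-sided normal-mean test above.
def normal_sf(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

theta0, sigma, n = 0.0, 1.0, 25
c = 1.645                        # z_alpha for alpha = 0.05

for theta in [-0.5, 0.0, 0.2, 0.5, 1.0]:
    beta = normal_sf(c + (theta0 - theta) / (sigma / math.sqrt(n)))
    print(f"theta = {theta:+.1f}  power = {beta:.3f}")
# At theta = theta0 the power equals alpha = 0.05; it increases towards 1.
```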

  • For \(0\le\alpha\le 1\), a test with power function \(\beta(\theta)\) is a size \(\alpha\) test if \(\sup_{\theta\in\Theta_0}\beta(\theta) = \alpha\).

  • For \(0\le\alpha\le 1\), a test with power function \(\beta(\theta)\) is a level \(\alpha\) test if \(\sup_{\theta\in\Theta_0}\beta(\theta) \le \alpha\).

  • Size of LRT: a size \(\alpha\) LRT is constructed by choosing \(c\) such that \(\sup_{\theta\in\Theta_0}P_{\theta}(\lambda(\mathbf X)\le c) = \alpha\). How that \(c\) is determined depends on the particular problem. For example, suppose \(\Theta_0\) consists of the single point \(\theta=\theta_0\) and \(\sqrt{n}(\bar{X} - \theta_0) \sim n(0,1)\) if \(\theta=\theta_0\) (a normal sample with \(\sigma=1\)). Then the test \[\text{reject }H_0\text{ if }|\bar{X}-\theta_0|\ge z_{\alpha/2}/\sqrt{n},\] where \(z_{\alpha/2}\) satisfies \(P(Z\ge z_{\alpha/2})=\alpha/2\) with \(Z\sim n(0,1)\), is a size \(\alpha\) LRT.

  • Size of union-intersection test: find constants \(t_L\) and \(t_U\) such that \[\underset{\theta\in\Theta_0}{\sup}P_{\theta}\left(\frac{\bar{X}-\mu_0}{\sqrt{S^2/n}}\ge t_L\quad\text{ or }\quad\frac{\bar{X}-\mu_0}{\sqrt{S^2/n}}\le t_U\right)=\alpha\] But for any \((\mu,\sigma^2)=\theta\in\Theta_0\), \(\mu=\mu_0\) and thus \[\frac{\bar{X}-\mu_0}{\sqrt{S^2/n}}=\frac{\bar{X}-\mu}{\sqrt{S^2/n}}\] has a Student’s \(t\) distribution with \(n-1\) degrees of freedom. So any choice of \(t_U=t_{n-1,1-\alpha_1}\) and \(t_L=t_{n-1,\alpha_2}\) with \(\alpha_1+\alpha_2=\alpha\) will yield a test with Type I Error probability of exactly \(\alpha\) for all \(\theta\in\Theta_0\).

  • Unbiased power function A test with power function \(\beta(\theta)\) is unbiased if \(\beta(\theta')\ge\beta(\theta'')\) for every \(\theta'\in\Theta_0^c\) and \(\theta''\in\Theta_0\)

Uniformly Most Powerful (UMP) Tests

Let \(\mathcal C\) be a class of tests for testing \(H_0:\theta\in\Theta_0\) versus \(H_1:\theta\in\Theta_0^c\). A test in class \(\mathcal C\), with power function \(\beta(\theta)\), is a uniformly most powerful (UMP) class \(\mathcal C\) test if \(\beta(\theta)\ge \beta'(\theta)\) for every \(\theta\in\Theta_0^c\) and every \(\beta'(\theta)\) that is a power function of a test in class \(\mathcal C\).

Neyman-Pearson lemma

Consider testing \(H_0:\theta=\theta_0\) versus \(H_1:\theta=\theta_1\), where the pdf or pmf corresponding to \(\theta_i\) is \(f(\mathbf x|\theta_i), i = 0, 1\), using a test with rejection region \(R\) that satisfies \[\mathbf x\in R\quad\text{if}\quad f(\mathbf x|\theta_1)>kf(\mathbf x|\theta_0)\\ \text{and}\\ \mathbf x\in R^c\quad\text{if}\quad f(\mathbf x|\theta_1)<kf(\mathbf x|\theta_0)\tag1\] for some \(k\ge 0\) and \[\alpha=P_{\theta_0}(\mathbf X\in R)\tag2\] Then

    1. (a) Sufficiency: Any test that satisfies (1) and (2) above is a UMP level \(\alpha\) test.
    1. (b) Necessity: If there exists a test satisfying (1) and (2) above with \(k > 0\), then every UMP level \(\alpha\) test is a size \(\alpha\) test (satisfies (2)) and every UMP level \(\alpha\) test satisfies (1) except perhaps on a set \(A\) satisfying \[P_{\theta_0}(\mathbf X\in A)=P_{\theta_1}(\mathbf X\in A)=0\]

Proof: Note first that any test satisfying (2) is a size \(\alpha\) test and, hence, a level \(\alpha\) test because \[\sup_{\theta\in\Theta_0}P_{\theta}(\mathbf X\in R)=P_{\theta_0}(\mathbf X\in R)=\alpha,\] since \(\Theta_0\) has only one point.

To ease notation, we define a test function, a function on the sample space that is \(1\) if \(\mathbf x \in R\) and \(0\) if \(\mathbf x \in R^c\). That is, it is the indicator function of the rejection region. Let \(\phi(\mathbf x)\) be the test function of a test satisfying (1) and (2). Let \(\phi'(\mathbf x)\) be the test function of any other level \(\alpha\) test, and let \(\beta(\theta)\) and \(\beta'(\theta)\) be the power functions corresponding to the tests \(\phi\) and \(\phi'\), respectively. Because \(0\le\phi'(\mathbf x)\le 1\), (1) implies that \[(\phi(\mathbf x)-\phi'(\mathbf x))(f(\mathbf x|\theta_1)-kf(\mathbf x|\theta_0))\ge 0\] for every \(\mathbf x\) (since \(\phi=1\) if \(f(\mathbf x|\theta_1)>kf(\mathbf x|\theta_0)\) and \(\phi = 0\) if \(f(\mathbf x|\theta_1) < kf(\mathbf x|\theta_0)\)). Thus \[0\le\int\left[\phi(\mathbf x)-\phi'(\mathbf x)\right]\left[f(\mathbf x|\theta_1)-kf(\mathbf x|\theta_0)\right]d\mathbf x\\ =\beta(\theta_1)-\beta'(\theta_1)-k(\beta(\theta_0)-\beta'(\theta_0))\tag3\] Statement (a) is proved by noting that, since \(\phi'\) is a level \(\alpha\) test and \(\phi\) is a size \(\alpha\) test, \(\beta(\theta_0)-\beta'(\theta_0)=\alpha-\beta'(\theta_0)\ge 0\). Thus (3) and \(k\ge 0\) imply that \[0\le\beta(\theta_1)-\beta'(\theta_1)-k(\beta(\theta_0)-\beta'(\theta_0))\le\beta(\theta_1)-\beta'(\theta_1)\] showing that \(\beta(\theta_1)\ge \beta'(\theta_1)\), and hence \(\phi\) is at least as powerful as \(\phi'\). Since \(\phi'\) was an arbitrary level \(\alpha\) test and \(\theta_1\) is the only point in \(\Theta_0^c\), \(\phi\) is a UMP level \(\alpha\) test.

To prove statement (b), let \(\phi'\) now be the test function for any UMP level \(\alpha\) test. By part (a), \(\phi\), the test satisfying (1) and (2), is also a UMP level \(\alpha\) test; thus \(\beta(\theta_1)=\beta'(\theta_1)\). This fact, (3), and \(k > 0\) imply \[\alpha-\beta'(\theta_0)=\beta(\theta_0)-\beta'(\theta_0)\leq0\]

Now, since \(\phi'\) is a level \(\alpha\) test, \(\beta'(\theta_0)\le\alpha\). Thus \(\beta'(\theta_0)=\alpha\), that is, \(\phi'\) is a size \(\alpha\) test, and this also implies that (3) is an equality in this case. But the nonnegative integrand \((\phi(\mathbf x)-\phi'(\mathbf x))(f(\mathbf x|\theta_1)-kf(\mathbf x|\theta_0))\) will have a zero integral only if \(\phi'\) satisfies (1), except perhaps on a set \(A\) with \(\int_Af(\mathbf x|\theta_i)d\mathbf x=0\). This implies that the last assertion in statement (b) is true.

Corollary: Neyman-Pearson lemma and sufficiency

Consider the hypothesis problem posed in Theorem Neyman-Pearson lemma. Suppose \(T(\mathbf X)\) is a sufficient statistic for \(\theta\) and \(g(t|\theta_i)\) is the pdf or pmf of \(T\) corresponding to \(\theta_i,\;i=0,1\). Then any test based on \(T\) with rejection region \(S\) (a subset of the sample space of \(T\)) is a UMP level \(\alpha\) test if it satisfies \[t\in S\quad\text{if}\quad g(t|\theta_1)>kg(t|\theta_0)\\ \text{and}\\ t\in S^c\quad\text{if}\quad g(t|\theta_1)<kg(t|\theta_0),\tag4\] for some \(k\ge0\) where \[\alpha=P_{\theta_0}(T\in S)\tag5\]

Proof: In terms of the original sample \(\mathbf X\), the test based on \(T\) has the rejection region \(R = \{\mathbf x: T(\mathbf x) \in S\}\). By the Factorization Theorem, the pdf or pmf of \(\mathbf X\) can be written as \(f(\mathbf x|\theta_i)=g(T(\mathbf x)|\theta_i)h(\mathbf x),\quad i=0,1,\) for some nonnegative function \(h(\mathbf x)\). Multiplying the inequalities in (4) by this nonnegative function, we see that \(R\) satisfies \[\mathbf x\in R\quad\text{if}\quad f(\mathbf x|\theta_1)=g(T(\mathbf x)|\theta_1)h(\mathbf x)>kg(T(\mathbf x)|\theta_0)h(\mathbf x)=kf(\mathbf x|\theta_0)\] and \[\mathbf x\in R^c\quad\text{if}\quad f(\mathbf x|\theta_1)=g(T(\mathbf x)|\theta_1)h(\mathbf x)<kg(T(\mathbf x)|\theta_0)h(\mathbf x)=kf(\mathbf x|\theta_0)\] Also by (5), \[P_{\theta_0}(\mathbf X\in R)=P_{\theta_0}(T(\mathbf X)\in S)=\alpha.\] So, by the sufficiency part of the Neyman-Pearson Lemma, the test based on \(T\) is a UMP level \(\alpha\) test.

Monotone Likelihood Ratio (MLR)

A family of pdfs or pmfs \(\{g(t|\theta):\theta\in\Theta\}\) for a univariate random variable \(T\) with real-valued parameter \(\theta\) has a monotone likelihood ratio (MLR) if, for every \(\theta_2 > \theta_1\), \(g(t|\theta_2)/g(t|\theta_1)\) is a monotone (nonincreasing or nondecreasing) function of \(t\) on \(\{t:g(t|\theta_1)>0\text{ or }g(t|\theta_2)>0\}\). Note that \(c/0\) is defined as \(\infty\) if \(0 < c\).

Many common families of distributions have an MLR. For example, the normal (known variance, unknown mean), Poisson, and binomial all have an MLR. Indeed, any regular exponential family with \(g(t|\theta)=h(t)c(\theta)e^{w(\theta)t}\) has an MLR if \(w(\theta)\) is a nondecreasing function.

Karlin-Rubin Theorem

Consider testing \(H_0:\theta\le\theta_0\) versus \(H_1:\theta>\theta_0\). Suppose that \(T\) is a sufficient statistic for \(\theta\) and the family of pdfs or pmfs \(\{g(t|\theta):\theta\in\Theta\}\) of \(T\) has an MLR. Then for any \(t_0\), the test that rejects \(H_0\) if and only if \(T > t_0\) is a UMP level \(\alpha\) test, where \(\alpha = P_{\theta_0}(T > t_0)\).

Proof: Let \(\beta(\theta)=P_{\theta}(T>t_0)\) be the power function of the test. Fix \(\theta'>\theta_0\) and consider testing \(H_0':\theta=\theta_0\) versus \(H_1':\theta=\theta'\). Since the family of pdfs or pmfs of \(T\) has an MLR, \(\beta(\theta)\) is nondecreasing, so

    1. (i) \(\sup_{\theta\le\theta_0}\beta(\theta)=\beta(\theta_0)=\alpha\), so this is a level \(\alpha\) test, and
    1. (ii) if we define \[k'=\inf_{t\in\mathcal T}\frac{g(t|\theta')}{g(t|\theta_0)},\] where \(\mathcal T=\{t:t>t_0\text{ and either } g(t|\theta')>0\text{ or }g(t|\theta_0)>0\}\), it follows that \[T>t_0\iff\frac{g(t|\theta')}{g(t|\theta_0)}>k'.\]

Together with Corollary: Neyman-Pearson lemma and sufficiency, (i) and (ii) imply that \(\beta(\theta')\ge\beta^*(\theta')\), where \(\beta^*(\theta)\) is the power function for any other level \(\alpha\) test of \(H_0'\), that is, any test satisfying \(\beta(\theta_0)\le\alpha\). However, any level \(\alpha\) test of \(H_0\) satisfies \(\beta^*(\theta_0)\le\sup_{\theta\in\Theta_0}\beta^*(\theta)\le\alpha.\) Thus, \(\beta(\theta')\ge\beta^*(\theta')\) for any level \(\alpha\) test of \(H_0\). Since \(\theta'\) was arbitrary, the test is a UMP level \(\alpha\) test.

Sizes of Union-Intersection and Intersection-Union Tests

The sizes of Union-Intersection tests (UIT) and Intersection-Union tests (IUT) can often be bounded above by the sizes of some other tests. First consider UITs. In this situation, we are testing a null hypothesis of the form \(H_0:\theta\in\Theta_0\), where \(\Theta_0=\underset{\gamma\in\Gamma}{\bigcap}\Theta_{\gamma}\). To be specific, let \(\lambda_{\gamma}(\mathbf x)\) be the LRT statistic for testing \[H_{0\gamma}:\theta\in\Theta_{\gamma}\] versus \[H_{1\gamma}:\theta\in\Theta_{\gamma}^c\] and let \(\lambda(\mathbf x)\) be the LRT statistic for testing \(H_{0}:\theta\in\Theta_{0}\) versus \(H_{1}:\theta\in\Theta_{0}^c\). Then we have the following relationships between the overall LRT and the UIT based on \(\lambda_{\gamma}(\mathbf x)\).

Theorem: LRT is uniformly more powerful than the Union-Intersection Tests (UITs)

Consider testing \(H_{0}:\theta\in\Theta_{0}\) versus \(H_{1}:\theta\in\Theta_{0}^c\), where \(\Theta_0=\underset{\gamma\in\Gamma}{\bigcap}\Theta_{\gamma}\) and \(\lambda_{\gamma}(\mathbf x)\) is defined in the previous paragraph. Define \(T(\mathbf x)=\inf_{\gamma\in\Gamma}\lambda_{\gamma}(\mathbf x)\), and form the UIT with rejection region \[\{\mathbf x:\lambda_{\gamma}(\mathbf x)<c\quad\text{for some}\quad\gamma\in\Gamma\}=\{\mathbf x:T(\mathbf x)<c\}.\] Also consider the usual LRT with rejection region \[\{\mathbf x:\lambda(\mathbf x)<c\}.\] Then

    1. (a) \(T(\mathbf x)\ge\lambda(\mathbf x)\quad\text{for every}\quad \mathbf x\);
    1. (b) If \(\beta_T(\theta)\) and \(\beta_{\lambda}(\theta)\) are the power functions for the tests based on \(T\) and \(\lambda\), respectively, then \(\beta_T(\theta)\le\beta_{\lambda}(\theta)\quad\text{for every}\quad\theta\in\Theta\);
    1. (c) If the LRT is a level \(\alpha\) test, then the UIT is a level \(\alpha\) test.

Proof: Since \(\Theta_0=\underset{\gamma\in\Gamma}{\bigcap}\Theta_{\gamma}\subset\Theta_{\gamma}\) for any \(\gamma\), we see from the definition of the LRT statistic that for any \(\mathbf x\), \[\lambda_{\gamma}(\mathbf x)\ge\lambda(\mathbf x)\quad \text{for each }\quad \gamma\in\Gamma,\] because the numerator of \(\lambda_{\gamma}\) is maximized over the larger set \(\Theta_{\gamma}\) while the denominators are the same. Thus \(T(\mathbf x)=\inf_{\gamma\in\Gamma}\lambda_{\gamma}(\mathbf x)\ge\lambda(\mathbf x)\), proving (a).

By (a), \(\{\mathbf x:T(\mathbf x)<c\}\subset\{\mathbf x:\lambda(\mathbf x)<c\}\), so \[\beta_T(\theta)=P_{\theta}(T(\mathbf X)<c)\le P_{\theta}(\lambda(\mathbf X)<c)=\beta_{\lambda}(\theta),\] proving (b). Since (b) holds for every \(\theta, \;\;\sup_{\theta\in\Theta_0}\beta_T(\theta)\le\sup_{\theta\in\Theta_0}\beta_{\lambda}(\theta)\le\alpha\), proving (c).

Theorem: Level \(\alpha\) test IUT

A simple bound for the size of an IUT is related to the sizes of the individual tests that are used to define the IUT. Recall that in this situation the null hypothesis is expressible as a union; that is, we are testing \[H_0:\theta\in\Theta_0\quad\text{versus}\quad H_1:\theta\in\Theta_0^c,\quad\text{where}\quad\Theta_0=\bigcup_{\gamma\in\Gamma}\Theta_{\gamma}.\] An IUT has a rejection region of the form \(R=\bigcap_{\gamma\in\Gamma}R_{\gamma}\), where \(R_{\gamma}\) is the rejection region for a test of \(H_{0\gamma}:\theta\in\Theta_{\gamma}\).

Let \(\alpha_{\gamma}\) be the size of the test of \(H_{0\gamma}\) with rejection region \(R_{\gamma}\). Then the IUT with rejection region \(R=\bigcap_{\gamma\in\Gamma}R_{\gamma}\) is a level \(\alpha=\sup_{\gamma\in\Gamma}\alpha_{\gamma}\) test.

Proof: Let \(\theta\in\Theta_0\). Then \(\theta\in\Theta_{\gamma}\) for some \(\gamma\) and \[P_{\theta}(\mathbf X\in R)\le P_{\theta}(\mathbf X\in R_{\gamma})\le\alpha_{\gamma}\le\alpha.\] Since \(\theta\in\Theta_0\) was arbitrary, the IUT is a level \(\alpha\) test.

Consider testing \(H_0:\theta\in\bigcup_{j=1}^{k}\Theta_j\), where \(k\) is a finite positive integer. For each \(j = 1 ,\cdots, k\), let \(R_j\) be the rejection region of a level \(\alpha\) test of \(H_{0j}\). Suppose that for some \(i= 1,\cdots, k\), there exists a sequence of parameter points, \(\theta_l\in\Theta_i,\quad l=1,2,\cdots\) such that

    1. \(\lim_{l\to\infty}P_{\theta_l}(\mathbf X\in R_i)=\alpha\),
    1. for each \(j=1,\cdots,k,j\ne i,\lim_{l\to\infty}P_{\theta_l}(\mathbf X\in R_j)=1.\)

Then, the IUT with rejection region \(R=\bigcap_{j=1}^{k}R_j\) is a size \(\alpha\) test.

Proof: By the Theorem: Level \(\alpha\) test IUT, \(R\) is a level \(\alpha\) test, that is, \[\sup_{\theta\in\Theta_0}P_{\theta}(\mathbf X\in R)\le\alpha.\] But, because all the parameter points \(\theta_l\) satisfy \(\theta_l\in\Theta_i\subset\Theta_0,\) \[\begin{align}\sup_{\theta\in\Theta_0}P_{\theta}(\mathbf X\in R)&\ge\lim_{l\to\infty}P_{\theta_l}(\mathbf X\in R)\\ &=\lim_{l\to\infty}P_{\theta_l}\left(\mathbf X\in \bigcap_{j=1}^{k}R_j\right)\\ &\ge\lim_{l\to\infty}\sum_{j=1}^{k}P_{\theta_l}(\mathbf X\in R_j)-(k-1)\quad\quad\quad(\text{Bonferroni's Inequality})\\ &=(k-1)+\alpha-(k-1)\\ &=\alpha. \end{align}\] This, together with \(\sup_{\theta\in\Theta_0}P_{\theta}(\mathbf X\in R)\le\alpha\), implies that the test has size exactly equal to \(\alpha\).
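As an illustration of the IUT idea, the following Python sketch implements an equivalence (TOST-style) test, where the null \(H_0:\mu\le-\delta\text{ or }\mu\ge\delta\) is a union and each component is tested with a level \(\alpha\) one-sided \(t\) test; rejecting only when both components reject gives a level \(\alpha\) IUT by the theorem above. The data, \(\delta\), and seed are illustrative assumptions, not from the text.

```python
import numpy as np
from scipy import stats

def iut_equivalence(x, delta, alpha=0.05):
    """Reject H0 (conclude |mu| < delta) only if BOTH one-sided t tests reject."""
    x = np.asarray(x, dtype=float)
    n, xbar, s = len(x), x.mean(), x.std(ddof=1)
    se = s / np.sqrt(n)
    t_lower = (xbar + delta) / se            # tests H0: mu <= -delta
    t_upper = (xbar - delta) / se            # tests H0: mu >= +delta
    tcrit = stats.t.ppf(1 - alpha, df=n - 1)
    return (t_lower > tcrit) and (t_upper < -tcrit)

x = np.random.default_rng(7).normal(0.05, 0.4, size=60)
print("conclude equivalence:", iut_equivalence(x, delta=0.3))
```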

\(p\)-Values

Definition A \(p\)-value \(p(\mathbf X)\) is a test statistic satisfying \(0\le p(\mathbf x)\le 1\) for every sample point \(\mathbf x\). Small values of \(p(\mathbf X)\) give evidence that \(H_1\) is true. A \(p\)-value is valid if, for every \(\theta\in\Theta_0\) and every \(0\le\alpha\le1\), \[P_{\theta}(p(\mathbf X)\le\alpha)\le\alpha\]

Theorem: valid \(p\)-value

Let \(W(\mathbf X)\) be a test statistic such that large values of \(W\) give evidence that \(H_1\) is true. For each sample point \(\mathbf x\), define \[p(\mathbf x)=\sup_{\theta\in\Theta_0}P_{\theta}(W(\mathbf X)\ge W(\mathbf x)).\] Then \(p(\mathbf X)\) is a valid \(p\)-value.

Proof: Fix \(\theta\in\Theta_0\). Let \(F_{\theta}(w)\) denote the cdf of \(-W(\mathbf X)\). Define \[p_{\theta}(\mathbf x)=P_{\theta}(W(\mathbf X)\ge W(\mathbf x))=P_{\theta}(-W(\mathbf X)\le -W(\mathbf x))=F_{\theta}(-W(\mathbf x))\] Then the random variable \(p_{\theta}(\mathbf X)\) is equal to \(F_{\theta}(-W(\mathbf X))\). Hence, by the Probability integral transformation, the distribution of \(p_{\theta}(\mathbf X)\) is stochastically greater than or equal to a \(uniform(0,1)\) distribution. That is, for every \(0\le\alpha\le1\), \(P_{\theta}(p_{\theta}(\mathbf X)\le\alpha)\le\alpha\). Because \(p(\mathbf x)=\sup_{\theta'\in\Theta_0}p_{\theta'}(\mathbf x)\ge p_{\theta}(\mathbf x)\) for every \(\mathbf x\), \[P_{\theta}(p(\mathbf X)\le\alpha)\le P_{\theta}(p_{\theta}(\mathbf X)\le\alpha)\le\alpha\] Since this is true for every \(\theta\in\Theta_0\) and every \(0\le\alpha\le1\), \(p(\mathbf X)\) is a valid \(p\)-value.
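A concrete sketch of this construction (not from the text): for \(H_0:p\le p_0\) versus \(H_1:p>p_0\) with \(X\sim\text{binomial}(n,p)\) and \(W(X)=X\), the supremum over \(\Theta_0\) is attained at \(p=p_0\) because \(P_p(X\ge x)\) is increasing in \(p\).

```python
from scipy import stats

def binom_p_value(x, n, p0):
    """Valid p-value p(x) = sup_{p <= p0} P_p(X >= x) = P_{p0}(X >= x)."""
    return stats.binom.sf(x - 1, n, p0)

print(binom_p_value(x=16, n=20, p0=0.5))   # evidence against H0: p <= 0.5
```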

Loss Function Optimality

9. Interval Estimation

Introduction

Definition: coverage probability For an interval estimator \([L(\mathbf X), U(\mathbf X)]\) of a parameter \(\theta\), the coverage probability of \([L(\mathbf X), U(\mathbf X)]\) is the probability that the random interval \([L(\mathbf X), U(\mathbf X)]\) covers the true parameter, \(\theta\). In symbols, it is denoted by either \(P_{\theta}(\theta\in[L(\mathbf X), U(\mathbf X)])\) or \(P(\theta\in[L(\mathbf X), U(\mathbf X)]|\theta)\).

Definition: confidence coefficient
(Confidence coefficient is a measure of confidence.) For an interval estimator \([L(\mathbf X), U(\mathbf X)]\) of a parameter \(\theta\), the confidence coefficient of \([L(\mathbf X), U(\mathbf X)]\) is the infimum of the coverage probabilities, \(\inf_{\theta}P_{\theta}(\theta\in[L(\mathbf X), U(\mathbf X)])\).

It is important to keep in mind that the interval is the random quantity, not the parameter. Therefore, when we write probability statements such as \(P_{\theta}(\theta\in[L(\mathbf X), U(\mathbf X)])\), these probability statements refer to \(\mathbf X\), not \(\theta\).

Interval estimators, together with a measure of confidence (usually a confidence coefficient), are sometimes known as confidence intervals.

Methods of Finding Interval Estimators

Inverting a Test Statistic

Inverting a normal test

Let \(X_1,\cdots, X_n\) be iid \(n(\mu, \sigma^2)\). The acceptance region of the hypothesis test of \(H_0:\mu=\mu_0\), the set of sample points for which \(H_0\) is accepted, is given by \[A(\mu_0)=\bigg\{(x_1,\cdots,x_n):\mu_0-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\le\bar x\le \mu_0+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\bigg\}\] and the confidence interval, the set in the parameter space of plausible values of \(\mu\), is given by \[C(x_1,\cdots,x_n)=\bigg\{\mu:\bar x-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\le\mu\le \bar x+z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\bigg\}\] These sets are connected to each other by the tautology \[(x_1,\cdots,x_n)\in A(\mu_0)\iff\mu_0\in C(x_1,\cdots,x_n).\]
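A minimal Python sketch of this inversion, assuming \(\sigma\) is known and using simulated data; the function name and seed are illustrative.

```python
import numpy as np
from scipy import stats

def invert_z_test(x, sigma, alpha=0.05):
    """Confidence set C(x) = {mu0 : x falls in the acceptance region A(mu0)}."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    half_width = stats.norm.ppf(1 - alpha / 2) * sigma / np.sqrt(len(x))
    return xbar - half_width, xbar + half_width

sample = np.random.default_rng(0).normal(loc=2.0, scale=1.0, size=50)
print(invert_z_test(sample, sigma=1.0))
```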

Theorem: acceptance regions of tests and confidence sets

For each \(\theta_0\in\Theta\), let \(A(\theta_0)\) be the acceptance region of a level \(\alpha\) test of \(H_0:\theta=\theta_0\). For each \(\mathbf x \in\mathcal X\), define a set \(C(\mathbf x)\) in the parameter space by \[C(\mathbf x)=\{\theta_0:\mathbf x\in A(\theta_0)\}\] Then the random set \(C(\mathbf X)\) is a \(1 - \alpha\) confidence set. Conversely, let \(C(\mathbf X)\) be a \(1 - \alpha\) confidence set. For any \(\theta_0\in\Theta\), define \[A(\theta_0)=\{\mathbf x:\theta_0\in C(\mathbf x)\}\] Then \(A(\theta_0)\) is the acceptance region of a level \(\alpha\) test of \(H_0:\theta=\theta_0\).

Proof: For the first part, since \(A(\theta_0)\) is the acceptance region of a level \(\alpha\) test, \[P_{\theta_0}(\mathbf X\notin A(\theta_0))\le\alpha\quad\text{and hence}\quad P_{\theta_0}(\mathbf X\in A(\theta_0))\ge 1-\alpha\] Since \(\theta_0\) is arbitrary, write \(\theta\) instead of \(\theta_0\). The above inequality, together with \[C(\mathbf x)=\{\theta_0:\mathbf x\in A(\theta_0)\},\] shows that the coverage probability of the set \(C(\mathbf X)\) is given by \[P_{\theta}(\theta\in C(\mathbf X))=P_{\theta}(\mathbf X\in A(\theta))\ge 1-\alpha,\] showing that \(C(\mathbf X)\) is a \(1-\alpha\) confidence set.

For the second part, the Type I Error probability for the test of \(H_0:\theta=\theta_0\) with acceptance region \(A(\theta_0)\) is \[P_{\theta_0}(\mathbf X\notin A(\theta_0))=P_{\theta_0}(\theta_0\notin C(\mathbf X))\le\alpha.\] So this is a level \(\alpha\) test.

Pivotal Quantities

Definition: Pivotal Quantities A random variable \(Q(\mathbf X, \theta) = Q(X_1, \cdots, X_n, \theta)\) is a pivotal quantity (or pivot) if the distribution of \(Q(\mathbf X, \theta)\) is independent of all parameters. That is, if \(\mathbf X \sim F(\mathbf x|\theta)\), then \(Q(\mathbf X, \theta)\) has the same distribution for all values of \(\theta\).
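A quick simulation sketch (an illustration, with assumed sample size and seeds): for normal data the quantity \(\sqrt{n}(\bar X-\mu)/S\) is a pivot, so its quantiles should match those of a \(t_{n-1}\) distribution regardless of \(\mu\) and \(\sigma\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n = 15
for mu, sigma in [(0.0, 1.0), (10.0, 5.0)]:
    x = rng.normal(mu, sigma, size=(20000, n))
    q = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)
    # empirical quantiles of the pivot vs. t_{n-1} quantiles
    print(mu, sigma, np.quantile(q, [0.05, 0.5, 0.95]).round(3),
          stats.t.ppf([0.05, 0.5, 0.95], df=n - 1).round(3))
```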

Pivoting the CDF

Pivoting a continuous cdf

Let \(T\) be a statistic with continuous cdf \(F_{T}(t|\theta)\). Let \(\alpha_1 + \alpha_2 = \alpha\) with \(0 < \alpha < 1\) be fixed values. Suppose that for each \(t \in \mathcal T\), the functions \(\theta_L(t)\) and \(\theta_U(t)\) can be defined as follows.

  • i. If \(F_{T}(t|\theta)\) is a decreasing function of \(\theta\) for each \(t\), define \(\theta_L(t)\) and \(\theta_U(t)\) by \[F_{T}(t|\theta_U(t))=\alpha_1;\quad F_{T}(t|\theta_L(t))=1-\alpha_2\]

  • ii. If \(F_{T}(t|\theta)\) is an increasing function of \(\theta\) for each \(t\), define \(\theta_L(t)\) and \(\theta_U(t)\) by \[F_{T}(t|\theta_L(t))=\alpha_1;\quad F_{T}(t|\theta_U(t))=1-\alpha_2\] Then the random interval \([\theta_L(T), \theta_U(T)]\) is a \(1-\alpha\) confidence interval for \(\theta\).

Proof: Assume that we have constructed the \(1-\alpha\) acceptance region \[\big\{t:\alpha_1\le F_{T}(t|\theta_0)\le1-\alpha_2\big\}\] Since \(F_{T}(t|\theta)\) is a decreasing function of \(\theta\) for each \(t\) and \(1-\alpha_2>\alpha_1,\quad \theta_L(t) < \theta_U(t)\), and the values \(\theta_L(t)\) and \(\theta_U(t)\) are unique. Also, \[F_{T}(t|\theta)<\alpha_1\iff \theta>\theta_U(t),\] \[F_{T}(t|\theta)>1-\alpha_2\iff \theta<\theta_L(t),\] and hence \[\big\{\theta:\alpha_1\le F_{T}(t|\theta)\le1-\alpha_2\big\}=\big\{\theta:\theta_L(T) \le\theta\le \theta_U(T)\big\}.\]

Pivoting a discrete cdf

Let \(T\) be a discrete statistic with cdf \(F_T(t|\theta)=P(T\le t|\theta)\). Let \(\alpha_1 + \alpha_2 = \alpha\) with \(0 < \alpha < 1\) be fixed values. Suppose that for each \(t \in \mathcal T\), the functions \(\theta_L(t)\) and \(\theta_U(t)\) can be defined as follows.

  • i. If \(F_{T}(t|\theta)\) is a decreasing function of \(\theta\) for each \(t\), define \(\theta_L(t)\) and \(\theta_U(t)\) by \[P(T\le t|\theta_U(t))=\alpha_1;\quad P(T\ge t|\theta_L(t))=\alpha_2\]

  • ii. If \(F_{T}(t|\theta)\) is an increasing function of \(\theta\) for each \(t\), define \(\theta_L(t)\) and \(\theta_U(t)\) by \[P(T\ge t|\theta_U(t))=\alpha_1;\quad P(T\le t|\theta_L(t))=\alpha_2\] Then the random interval \([\theta_L(T), \theta_U(T)]\) is a \(1-\alpha\) confidence interval for \(\theta\).

Proof: Since \(F_T(T|\theta)\) is stochastically greater than a uniform random variable, \(P_{\theta}(F_T(T|\theta)\le x)\le x\) for all \(x\); this property is shared by \(\bar F_T(T|\theta)\), where \(\bar F_T(t|\theta)=P(T\ge t|\theta)\). Hence \(P_{\theta}\big(F_T(T|\theta)\le\alpha_1\ \text{or}\ \bar F_T(T|\theta)\le\alpha_2\big)\le\alpha_1+\alpha_2=\alpha\), which implies that the set \[\big\{\theta: F_{T}(T|\theta)>\alpha_1\quad\text{and}\quad\bar F_T(T|\theta)>\alpha_2\big\}\] is a \(1-\alpha\) confidence set. The fact that \(F_T(t|\theta)\) is a decreasing function of \(\theta\) for each \(t\) implies that \(\bar F_T(t|\theta)\) is a nondecreasing function of \(\theta\) for each \(t\). It therefore follows that \[\theta>\theta_U(t)\Rightarrow F_{T}(t|\theta)<\alpha_1,\] \[\theta<\theta_L(t)\Rightarrow \bar F_{T}(t|\theta)<\alpha_2,\] and hence \[\big\{\theta: F_{T}(T|\theta)>\alpha_1\quad\text{and}\quad\bar F_T(T|\theta)>\alpha_2\big\}\subset\big\{\theta:\theta_L(T)\le\theta\le\theta_U(T)\big\},\] so the interval \([\theta_L(T),\theta_U(T)]\) has coverage probability at least \(1-\alpha\).
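For \(T\sim\text{binomial}(n,p)\) this construction gives the Clopper-Pearson interval; the following sketch uses the standard beta-quantile form of the two defining equations (the function and example values are assumptions for illustration).

```python
from scipy import stats

def clopper_pearson(t, n, alpha=0.05):
    """Solve P(T <= t | theta_U) = alpha/2 and P(T >= t | theta_L) = alpha/2."""
    lower = 0.0 if t == 0 else stats.beta.ppf(alpha / 2, t, n - t + 1)
    upper = 1.0 if t == n else stats.beta.ppf(1 - alpha / 2, t + 1, n - t)
    return lower, upper

print(clopper_pearson(t=7, n=20))
```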

Methods of Evaluating Interval Estimators

Size and Coverage Probability

unimodal:

A pdf \(f (x)\) is unimodal if there exists \(x^*\) such that \(f(x)\) is nondecreasing for \(x \le x^*\) and \(f (x)\) is nonincreasing for \(x \ge x^*\).

shortest interval Theorem

Let \(f(x)\) be a unimodal pdf. If the interval \([a, b]\) satisfies

  • i. \(\int_{a}^{b} f(x)\, dx = 1-\alpha,\)
  • ii. \(f(a) = f(b) > 0,\)
  • iii. \(a \le x^* \le b, \text{ where } x^* \text{ is a mode of } f(x),\) then \([a, b]\) is the shortest among all intervals that satisfy (i).

Proof: Let \([a', b']\) be any interval with \(b' - a' < b - a\). We will show that this implies \(\int_{a'}^{b'} f(x) dx < 1 - \alpha\). The result will be proved only for \(a' \le a\), the proof being similar if \(a < a'\). If \(b' \le a\), then \(a' \le b' \le a \le x^*\le b\) and \[\begin{align} \int_{a'}^{b'} f(x) dx &\le f(b')(b'-a')\\ &\le f(a)(b'-a')\\ &<f(a)(b-a)\\ &\le\int_{a}^{b}f(x)dx\\ &=1-\alpha \end{align}\]

If \(b'>a\), then \(a'\le a<b'<b\). In this case, we can write \[\begin{align} \int_{a'}^{b'} f(x) dx &= \int_{a}^{b} f(x) dx+\int_{a'}^{a} f(x)dx-\int_{b'}^{b} f(x) dx\\ &= 1-\alpha+\int_{a'}^{a} f(x)dx-\int_{b'}^{b} f(x) dx\\ &<1-\alpha \end{align}\] Since \[\begin{align} \int_{a'}^{a} f(x) dx &\le f(a)(a-a')\\ \end{align}\] and \[\begin{align} \int_{b'}^{b} f(x) dx &\ge f(b)(b-b')\\ \end{align}\] Thus,

\[\begin{align} \int_{a'}^{a} f(x) dx -\int_{b'}^{b} f(x) dx &\le f(a)(a-a')-f(b)(b-b')\\ &=f(a)[(a-a')-(b-b')]\quad (f(a)=f(b))\\ &=f(a)[(b'-a')-(b-a)] \end{align}\] which is negative if \((b'-a')<(b-a)\) and \(f(a)>0\).
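A numerical sketch of the theorem, assuming a Gamma(3, 1) density purely for illustration: impose coverage \(1-\alpha\) and solve \(f(a)=f(b)\), then compare with the equal-tailed interval.

```python
import numpy as np
from scipy import stats, optimize

alpha = 0.05
dist = stats.gamma(3)

def b_of_a(a):
    # right endpoint that keeps the coverage of [a, b] equal to 1 - alpha
    return dist.ppf(dist.cdf(a) + 1 - alpha)

# solve f(a) = f(b(a)); the root lies in (0, F^{-1}(alpha))
a = optimize.brentq(lambda a: dist.pdf(a) - dist.pdf(b_of_a(a)),
                    1e-6, dist.ppf(alpha) - 1e-9)
b = b_of_a(a)
eq = dist.interval(1 - alpha)   # equal-tailed interval for comparison
print("shortest  :", (round(a, 4), round(b, 4)), "length", round(b - a, 4))
print("equal-tail:", eq, "length", round(eq[1] - eq[0], 4))
```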

10. Asymptotic Evaluations

Point Estimation

Consistency

Definition: consistent sequence of estimators

A sequence of estimators \(W_n = W_n(X_1, \cdots, X_n)\) is a consistent sequence of estimators of the parameter \(\theta\) if, for every \(\epsilon > 0\) and every \(\theta\in\Theta\), \[\lim_{n\to\infty}P_{\theta}(|W_n-\theta|<\epsilon)=1.\] An equivalent statement is this: For every \(\epsilon > 0\) and every \(\theta\in\Theta\), a consistent sequence \(W_n\) will satisfy \[\lim_{n\to\infty}P_{\theta}(|W_n-\theta|\ge\epsilon)=0.\] Definition says that a consistent sequence of estimators converges in probability to the parameter \(\theta\) it is estimating.

This definition should be compared with Convergence in Probability. The definition of Convergence in Probability dealt with one sequence of random variables with one probability structure, while the definition of a consistent sequence of estimators deals with an entire family of probability structures, indexed by \(\theta\). For each different value of \(\theta\), the probability structure associated with the sequence \(W_n\) is different. And the definition says that for each value of \(\theta\), the probability structure is such that the sequence converges in probability to the true \(\theta\). This is the usual difference between a probability definition and a statistics definition. The probability definition deals with one probability structure, but the statistics definition deals with an entire family.

Recall that, for an estimator \(W_n\), Chebychev’s Inequality states \[P_{\theta}(|W_{n}-\theta|\ge \epsilon)\le \frac{\mathbb E_{\theta}[(W_n-\theta)^2]}{\epsilon^2},\] so if, for every \(\theta\in\Theta\), \[\lim_{n\to\infty}\mathbb E_{\theta}[(W_n-\theta)^2]=0,\] then the sequence of estimators is consistent.

Since \(\mathbb E_{\theta}[(W_n-\theta)^2]=\text{Var}_{\theta}W_n+[\text{Bias}_{\theta}W_n]^2\), we have the following theorem:

If \(W_n\) is a sequence of estimators of a parameter \(\theta\) satisfying

  • \(\lim_{n\to\infty}\text{Var}_{\theta}W_n=0\),
  • \(\lim_{n\to\infty}\text{Bias}_{\theta}W_n=0\),

for every \(\theta\in\Theta\), then \(W_n\) is a consistent sequence of estimators of \(\theta\).
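A small simulation sketch of the definition (illustrative distribution and seed): the sample mean of Exponential data has variance and bias tending to \(0\), so \(P_\theta(|W_n-\theta|<\epsilon)\) should approach \(1\).

```python
import numpy as np

rng = np.random.default_rng(1)
theta, eps = 2.0, 0.1
for n in (10, 100, 1000, 10000):
    # 5000 replications of the sample mean W_n based on n observations
    means = rng.exponential(scale=theta, size=(5000, n)).mean(axis=1)
    print(n, np.mean(np.abs(means - theta) < eps))   # estimate of P(|W_n - theta| < eps)
```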

Suitable Regularity Conditions

These conditions mainly relate to differentiability of the density and the ability to interchange differentiation and integration.

The following four assumptions are sufficient to prove Theorem: Consistency of MLEs

  • (A1) We observe \(X_1, X_2, \cdots,X_n\), where \(X_i \sim f(x|\theta)\) are iid.

  • (A2) The parameter is identifiable; that is, if \(\theta\ne \theta'\), then \(f(x|\theta) \ne f(x|\theta').\)

  • (A3) The densities \(f(x|\theta)\) have common support, and \(f(x|\theta)\) is differentiable in \(\theta\).

  • (A4) The parameter space \(\Omega\) contains an open set \(\omega\) of which the true parameter value \(\theta_0\) is an interior point.

The next two assumptions, together with (Al)-(A4) are sufficient to prove Theorem: Asymptotic efficiency of MLEs, asymptotic normality and efficiency of MLEs.

  • (A5) For every \(x \in \mathcal X\), the density \(f(x|\theta)\) is three times differentiable with respect to \(\theta\), the third derivative is continuous in \(\theta\), and \(\int f(x|\theta)dx\) can be differentiated three times under the integral sign.

  • (A6) For any \(\theta_0 \in \Omega\), there exist a positive number \(c\) and a function \(M(x)\) (both of which may depend on \(\theta_0\)) such that \[\left|\frac{\partial^3}{\partial\theta^3}\log f(x|\theta)\right|\le M(x)\quad \forall x\in \mathcal X,\ \theta_0-c<\theta<\theta_0+c,\] with \(\mathbb E_{\theta_0}[M(X)]<\infty.\)

Theorem: Consistency of MLEs

Let \(X_1, X_2, \cdots\) be iid \(f(x|\theta)\), and let \(L(\theta|\mathbf x)=\prod_{i=1}^{n}f(x_i|\theta)\) be the likelihood function. Let \(\hat\theta\) denote the MLE of \(\theta\). Let \(\tau(\theta)\) be a continuous function of \(\theta\). Under the regularity conditions on \(f(x|\theta)\) and, hence, \(L(\theta|\mathbf x)\), for every \(\epsilon > 0\) and every \(\theta\in\Theta\), \[\lim_{n\to\infty}P_{\theta}(|\tau(\hat\theta)-\tau(\theta)|\ge\epsilon)=0.\] That is, \(\tau(\hat\theta)\) is a consistent estimator of \(\tau(\theta)\).

Proof: The proof proceeds by showing that \(\frac{1}{n}\log L(\hat\theta|\mathbf x)\) converges almost surely to \(\mathbb E_{\theta}(\log f(X|\theta))\) for every \(\theta\in\Theta\). Under some conditions on \(f(x|\theta)\), this implies that \(\hat\theta\) converges to \(\theta\) in probability and, hence, \(\tau(\hat\theta)\) converges to \(\tau(\theta)\) in probability.

Efficiency

Definition: limiting variance

For an estimator \(T_n\), if \(\lim_{n\to\infty}k_n\text{Var} T_n=\tau^2<\infty\), where \(\{k_n\}\) is a sequence of constants, then \(\tau^2\) is called the limiting variance or limit of the variances.

Definition: asymptotic variance

For an estimator \(T_n\), suppose that \(k_n(T_n - \tau(\theta)) \to n(0, \sigma^2)\) in distribution. The parameter \(\sigma^2\) is called the asymptotic variance or variance of the limit distribution of \(T_n\).

Definition: asymptotically efficient

A sequence of estimators \(W_n\) is asymptotically efficient for a parameter \(\tau(\theta)\) if \(\sqrt{n}[W_n-\tau(\theta)]\to n[0,v(\theta)]\) in distribution and \[v(\theta)=\frac{[\tau'(\theta)]^2}{\mathbb E_{\theta}\left(\left(\frac{\partial}{\partial\theta}\log f(X|\theta)\right)^2\right)}\] that is, the asymptotic variance of \(W_n\) achieves the Cramer-Rao Lower Bound.

Theorem: Asymptotic efficiency of MLEs

Let \(X_1, X_2, \cdots\) be iid \(f(x|\theta)\), let \(\hat\theta\) denote the MLE of \(\theta\), and let \(\tau(\theta)\) be a continuous function of \(\theta\). Under the regularity conditions on \(f(x|\theta)\) and, hence, \(L(\theta|\mathbf x)\), \[\sqrt{n}[\tau(\hat\theta)-\tau(\theta)]\to n[0,v(\theta)],\] where \(v(\theta)\) is the Cramer-Rao Lower Bound. That is, \(\tau(\hat\theta)\) is a consistent and asymptotically efficient estimator of \(\tau(\theta)\).

Proof: Recall that \(\log L(\theta|\mathbf x)=\sum\log f(x_i|\theta)\) is the log likelihood function; in this proof \(L'\), \(L''\), and \(L'''\) denote derivatives of the log likelihood with respect to \(\theta\). Now expand the first derivative of the log likelihood around the true value \(\theta_0\), \[L'(\theta|\mathbf x)=L'(\theta_0|\mathbf x)+(\theta-\theta_0)L''(\theta_0|\mathbf x)+\frac{1}{2!}(\theta-\theta_0)^2L'''(\theta_0|\mathbf x)+\cdots,\] where we are going to ignore the higher-order terms (a justifiable maneuver under the regularity conditions).

Now substitute the MLE \(\hat\theta\) for \(\theta\), and realize that the left-hand side of last equation is \(0\). Rearranging and multiplying through by \(\sqrt{n}\) gives us

\[\sqrt{n}(\hat\theta-\theta_0)=\sqrt{n}\frac{-L'(\theta_0|\mathbf x)}{L''(\theta_0|\mathbf x)}=\frac{-\frac{1}{\sqrt{n}}L'(\theta_0|\mathbf x)}{\frac{1}{n}L''(\theta_0|\mathbf x)}\] If we let \(I(\theta_0)=\mathbb E_{\theta_0}\left[\left(\frac{\partial}{\partial\theta}\log f(X|\theta_0)\right)^2\right]=1/v(\theta_0)\) (for \(\tau(\theta)=\theta\)) denote the information number for one observation, application of the Central Limit Theorem and the Weak Law of Large Numbers will show \[-\frac{1}{\sqrt{n}}L'(\theta_0|\mathbf X)\to n[0,I(\theta_0)]\quad\text{in distribution},\] \[-\frac{1}{n}L''(\theta_0|\mathbf X)\to I(\theta_0)\quad\text{in probability}.\] Thus if we let \(W\sim n[0,I(\theta_0)]\), then \(\sqrt{n}(\hat\theta-\theta_0)\) converges in distribution to \(-W/I(\theta_0)\sim n[0, 1/I(\theta_0)]\). The result for a general continuous \(\tau\) then follows from the Delta Method.

Asymptotic normality and consistency

Suppose that \[\sqrt{n}\frac{W_n-\mu}{\sigma}\to Z\text{ in distribution,}\] where \(Z\sim n(0,1)\). By applying Slutsky’s Theorem we conclude \[W_n-\mu=\left(\frac{\sigma}{\sqrt{n}}\right)\left(\sqrt{n}\frac{W_n-\mu}{\sigma}\right)\to\lim_{n\to\infty}\left(\frac{\sigma}{\sqrt{n}}\right)Z=0,\] so \(W_n-\mu \to 0\) in distribution. Since convergence in distribution to a point is equivalent to convergence in probability, \(W_n\) is a consistent estimator of \(\mu\).

Calculations and Comparisons

If an MLE is asymptotically efficient, the asymptotic variance in the Asymptotic efficiency of MLEs Theorem is the Delta Method variance (without the \(1/n\) term). Thus, we can use the Cramer-Rao Lower Bound as an approximation to the true variance of the MLE. Suppose that \(X_1,\cdots, X_n\) are iid \(f(x|\theta)\), \(\hat\theta\) is the MLE of \(\theta\), and \(I_n(\theta)=\mathbb E_{\theta}\left(\frac{\partial}{\partial\theta}\log L(\theta|\mathbf X)\right)^2\) is the information number of the sample. From the Delta Method and asymptotic efficiency of MLEs, the variance of \(h(\hat\theta)\) can be approximated by
\[\begin{align} \text{Var} (h(\hat\theta)|\theta)&\approx\frac{[h'(\hat\theta)]^2}{I_n(\theta)}\\ &=\frac{[h'(\hat\theta)]^2}{\mathbb E_{\theta}\left(\frac{\partial}{\partial\theta}\log L(\theta|\mathbf X)\right)^2}\\ &=\frac{[h'(\hat\theta)]^2}{-\mathbb E_{\theta}\left(\frac{\partial^2}{\partial\theta^2}\log L(\theta|\mathbf X)\right)}\\ &\approx\frac{[h'(\hat\theta)]^2}{-\frac{\partial^2}{\partial\theta^2}\log L(\theta|\mathbf X)|_{\theta=\hat{\theta}}}\\ \end{align}\] See Fisher information (1).

The denominator is \(\hat I_n(\hat\theta)\), the observed information number. To estimate \(\text{Var}_{\theta}h(\hat\theta)\), first we approximate \(\text{Var}_{\theta}h(\hat\theta)\); then we estimate the resulting approximation, usually by substituting \(\hat\theta\) for \(\theta\). The resulting estimate can be denoted by \(\text{Var}_{\hat\theta}h(\hat\theta)\) or \(\widehat{\text{Var}}_{\theta}h(\hat\theta).\)

It follows from Theorem of Consistency of MLEs that \(-\frac{1}{n}\frac{\partial^2}{\partial\theta^2}\log L(\theta|\mathbf X)|_{\theta=\hat\theta}\) is a consistent estimator of \(I(\theta)\), so it follows that \(\text{Var}_{\hat\theta}h(\hat\theta)\) is a consistent estimator of \(\text{Var}_{\theta}h(\hat\theta)\).
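A short sketch of this approximation, assuming iid Bernoulli(\(p\)) data and the odds \(h(p)=p/(1-p)\) as the function of interest; the observed information is computed analytically for this model.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=200)
n, p_hat = len(x), x.mean()                        # MLE of p

# observed information: -d^2/dp^2 log L(p|x) evaluated at p = p_hat
obs_info = x.sum() / p_hat**2 + (n - x.sum()) / (1 - p_hat)**2

h_prime = 1.0 / (1 - p_hat)**2                     # h'(p) for h(p) = p/(1-p)
var_hat = h_prime**2 / obs_info                    # approximate Var of h(p_hat)
print(p_hat, var_hat, np.sqrt(var_hat))
```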

Asymptotic Relative Efficiency (ARE)

If two estimators \(W_n\) and \(V_n\) satisfy \[\sqrt{n}[W_n-\tau(\theta)]\to n[0,\sigma^2_W]\] \[\sqrt{n}[V_n-\tau(\theta)]\to n[0,\sigma^2_V]\] in distribution, the Asymptotic Relative Efficiency (ARE) of \(V_n\) with respect to \(W_n\) is \[\text{ARE}(V_n, W_n)=\frac{\sigma^2_W}{\sigma^2_V}\] Here \(\sigma^2_V\) and \(\sigma^2_W\) are the asymptotic variances, so for large \(n\), \(n\,\text{Var}(V_n)\approx\sigma^2_V\) and \(n\,\text{Var}(W_n)\approx\sigma^2_W\).

Bootstrap Standard Errors

We can estimate the variance of the sample mean \(\bar{X}\) by \[\text{Var}^*(\bar{X})=\frac{1}{n^n-1}\sum_{i=1}^{n^n}(\bar{x}^*_i-\bar{\bar{x}^*})^2,\] where the sum runs over all \(n^n\) possible resamples (samples of size \(n\) drawn with replacement from the data), \(\bar x^*_i\) is the mean of the \(i\)th resample, and \[\bar{\bar{x}^*}=\frac{1}{n^n}\sum_{i=1}^{n^n}\bar{x}^*_i\] is the mean of the resamples.

The variance formula is applicable to virtually any estimator. Thus, for any estimator \(\hat\theta(\mathbf x)=\hat\theta\), we can write \[\text{Var}^*(\hat\theta)=\frac{1}{n^n-1}\sum_{i=1}^{n^n}(\hat\theta^*_i-\bar{\hat\theta^*})^2\] where \(\hat\theta^*_i\) is the estimator calculated from the \(i\)th resample and \[\bar{\hat\theta^*}=\frac{1}{n^n}\sum_{i=1}^{n^n}\hat\theta^*_i\] the mean of the resampled values.
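In practice the \(n^n\) resamples are replaced by \(B\) Monte Carlo resamples drawn with replacement; the sketch below applies the same variance formula to the \(B\) resampled values (the value of \(B\), the data, and the seed are assumptions).

```python
import numpy as np

def boot_var(x, stat, B=2000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    stats_star = np.array([stat(rng.choice(x, size=len(x), replace=True))
                           for _ in range(B)])
    return stats_star.var(ddof=1)      # (1/(B-1)) * sum (theta*_i - mean)^2

x = np.random.default_rng(3).normal(size=30)
print("bootstrap Var of the mean  :", boot_var(x, np.mean))
print("theoretical Var of the mean:", x.var(ddof=1) / len(x))
```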

Robustness

The Mean and the Median

Definition: breakdown value

Let \(X_{(1)} < \cdots < X_{(n)}\) be an ordered sample of size \(n\), and let \(T_n\) be a statistic based on this sample. \(T_n\) has breakdown value \(b, 0\le b\le 1\), if, for every \(\epsilon>0\), \[\lim_{X_{(\{(1-b)n\})}\to\infty} T_n<\infty\quad\text{and}\quad\lim_{X_{(\{(1-(b+\epsilon))n\})}\to\infty} T_n=\infty\] Here \(\{(1-b)n\}\) denotes \((1-b)n\) rounded to the nearest integer (the sample percentile notation).

M-Estimators

Huber estimator

The estimator defined as the minimizer of \(\sum_{i=1}^{n}\rho(x_i-a)\), where \[\rho(x)= \begin{cases} \frac{1}{2}x^2 & \text{if }|x|\le k\\ k|x|-\frac{1}{2}k^2& \text{if }|x|> k,\\ \end{cases}\] is called the Huber estimator; more generally, a minimizer of a sum of \(\rho\)-functions of this kind is called an M-estimator.

Since minimization of a function is typically done by solving for the zeros of the derivative (when we can take a derivative), defining \(\psi=\rho'\), we see that an M-estimator is the solution to \[\sum_{i=1}^{n}\psi(x_i-\theta)=0\] We write a Taylor expansion for \(\psi\) as \[\sum_{i=1}^{n}\psi(x_i-\theta)\approx\sum_{i=1}^{n}\psi(x_i-\theta_0)+(\theta-\theta_0)\sum_{i=1}^{n}\psi'(x_i-\theta_0)=0\] where \(\theta_0\) is the true value. Let \(\hat\theta_M\) be the solution to the last equation and substitute this for \(\theta\) to obtain \[\sum_{i=1}^{n}\psi(x_i-\theta_0)+(\hat\theta_M-\theta_0)\sum_{i=1}^{n}\psi'(x_i-\theta_0)=0\] Then \[\sqrt{n}(\hat\theta_M-\theta_0)=\frac{-\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi(x_i-\theta_0)}{\frac{1}{n}\sum_{i=1}^{n}\psi'(x_i-\theta_0)}\] Now we assume that \(\theta_0\) satisfies \(\mathbb E_{\theta_0}\psi(X-\theta_0)=0\) (which is usually taken as the definition of \(\theta_0\)). It follows that \[-\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\psi(X_i-\theta_0)=\sqrt{n}\left[-\frac{1}{n}\sum_{i=1}^{n}\psi(X_i-\theta_0)\right]\to n\left(0,\mathbb E_{\theta_0}\psi(X-\theta_0)^2\right)\] in distribution, and the Law of Large Numbers yields \[\frac{1}{n}\sum_{i=1}^{n}\psi'(x_i-\theta_0)\to \mathbb E_{\theta_0}\psi'(X-\theta_0)\] in probability. Putting this all together we have \[\sqrt{n}(\hat\theta_M-\theta_0)\to n\left(0, \frac{\mathbb E_{\theta_0}\psi(X-\theta_0)^2}{[\mathbb E_{\theta_0}\psi'(X-\theta_0)]^2}\right)\]
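A small sketch of computing the Huber estimate by solving \(\sum_i\psi(x_i-\theta)=0\) numerically; the tuning constant \(k=1.5\), the contaminated data, and the root-bracketing are illustrative assumptions.

```python
import numpy as np
from scipy import optimize

def huber_psi(u, k=1.5):
    # psi(u) = u for |u| <= k, and k*sign(u) otherwise
    return np.clip(u, -k, k)

def huber_estimate(x, k=1.5):
    x = np.asarray(x, dtype=float)
    g = lambda theta: huber_psi(x - theta, k).sum()   # decreasing in theta
    return optimize.brentq(g, x.min(), x.max())

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(20, 1, 5)])  # 5% outliers
print("mean:", x.mean(), "median:", np.median(x), "Huber:", huber_estimate(x))
```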

Hypothesis Testing

Asymptotic Distribution of LRTs

Recall the likelihood ratio method of test construction for \(H_0:\theta\in \Theta_0\): \[\lambda(\mathbf x)=\frac{\underset{\Theta_0}{\sup}L(\theta|\mathbf x)}{\underset{\Theta}{\sup}L(\theta|\mathbf x)}\] with the explicit form for the rejection region, \(\{\mathbf x:\lambda(\mathbf x)\le c\}\). To define a level \(\alpha\) test, the constant \(c\) must be chosen so that \[\underset{\theta\in\Theta_0}{\sup}P_{\theta}(\lambda(\mathbf X)\le c)\le\alpha.\]

Wilks’ theorem

For testing \(H_0:\theta=\theta_0\) versus \(H_1:\theta\ne\theta_0\), suppose \(X_1,\cdots, X_n\) are iid \(f(x|\theta)\), \(\hat\theta\) is the MLE of \(\theta\), and \(f(x|\theta)\) satisfies the regularity conditions. Then under \(H_0\), as \(n\to\infty\), \[-2\log\lambda(\mathbf X)\to\chi^2_1\quad\text{in distribution}\] where \(\chi^2_1\) is a \(\chi^2\) random variable with \(1\) degree of freedom.

Proof: First expand \(\log L(\theta|\mathbf x)=l(\theta|\mathbf x)\) in a Taylor series around \(\hat\theta\), giving \[l(\theta|\mathbf x)\approx l(\hat\theta|\mathbf x)+l'(\hat\theta|\mathbf x)(\theta-\hat\theta)+l''(\hat\theta|\mathbf x)\frac{(\theta-\hat\theta)^2}{2!}.\] Now substitute the expansion for \(l(\theta_0|\mathbf x)\) in \(-2\log\lambda(\mathbf x)=-2l(\theta_0|\mathbf x)+2l(\hat\theta|\mathbf x)\), and use \(l'(\hat\theta|\mathbf x)=0\) to get \[\begin{align} -2\log\lambda(\mathbf x)&\approx -l''(\hat\theta|\mathbf x)(\theta_0-\hat\theta)^2\\ &\approx nI(\hat\theta)(\theta_0-\hat\theta)^2\quad\quad\left(-\tfrac{1}{n}l''(\hat\theta|\mathbf x)\text{ is a consistent estimate of }I(\theta_0)\right)\\ &=\left(\sqrt{nI(\hat\theta)}(\hat\theta-\theta_0)\right)^2. \end{align}\] By the Asymptotic efficiency of MLEs Theorem, \(\sqrt{nI(\hat\theta)}(\hat\theta-\theta_0)\to n(0,1)\) in distribution under \(H_0\), and therefore \[-2\log\lambda(\mathbf X)\to\chi^2_1\quad\text{in distribution}\]

Wilks’ theorem (generalized)

Let \(X_1,\cdots,X_n\) be a random sample from a pdf or pmf \(f(x|\theta)\). Under the regularity conditions, if \(\theta\in\Theta_0\), then the distribution of the statistic \(-2\log\lambda(\mathbf X)\) converges to a chi squared distribution as the sample size \(n \to\infty\). The degrees of freedom of the limiting distribution is the difference between the number of free parameters specified by \(\theta\in\Theta_0\) and the number of free parameters specified by \(\theta\in\Theta\).

Proof: Rejection of \(H_0:\theta\in\Theta_0\) for small values of \(\lambda(\mathbf X)\) is equivalent to rejection for large values of \(-2\log\lambda(\mathbf X)\). Thus, \[H_0\text{ is rejected if and only if } -2\log\lambda(\mathbf X)\ge\chi^2_{\nu,\alpha}.\] The Type I Error probability will be approximately \(\alpha\) if \(\theta\in\Theta_0\) and the sample size is large; in this way an asymptotic size \(\alpha\) test has been defined. Note that the theorem actually implies only that \[\lim_{n\to\infty}P_{\theta}(\text{reject }H_0)=\alpha\quad\text{for each }\theta\in\Theta_0,\] not that \(\sup_{\theta\in\Theta_0}P_{\theta}(\text{reject }H_0)\) converges to \(\alpha\). This is usually the case for asymptotic size \(\alpha\) tests.

\(\Theta\) can be represented as a subset of \(q\)-dimensional Euclidean space that contains an open subset in \(\mathfrak R^q\), and \(\Theta_0\) can be represented as a subset of \(p\)-dimensional Euclidean space that contains an open subset in \(\mathfrak R^p\), where \(p<q\). Then \(q-p=\nu\) is the degrees of freedom for the test statistic.
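A sketch of the asymptotic LRT in a one-parameter model, assuming iid Poisson data and \(H_0:\lambda=\lambda_0\); \(-2\log\lambda(\mathbf x)\) is compared to a \(\chi^2_1\) distribution.

```python
import numpy as np
from scipy import stats

def poisson_lrt(x, lam0):
    """-2 log(LR) for H0: lambda = lam0 with iid Poisson data, and its p-value."""
    x = np.asarray(x)
    n, lam_hat = len(x), x.mean()                   # unrestricted MLE
    stat = 2 * (x.sum() * np.log(lam_hat / lam0) - n * (lam_hat - lam0))
    return stat, stats.chi2.sf(stat, df=1)

x = np.random.default_rng(5).poisson(lam=3.2, size=100)
print(poisson_lrt(x, lam0=3.0))
```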

Wald test

A Wald test is a test based on a statistic of the form \[Z_n=\frac{W_n-\theta_0}{S_n}=\frac{W_n-\theta}{S_n}+\frac{\theta-\theta_0}{S_n}\] where \(\theta_0\) is a hypothesized value of the parameter \(\theta\), \(W_n\) is an estimator of \(\theta\), and \(S_n\) is a standard error for \(W_n\), an estimate of the standard deviation of \(W_n\). If \(W_n\) is the MLE of \(\theta\), then, as discussed in the proof of Wilks’ theorem, \(1/\sqrt{I_n(W_n)}\) is a reasonable standard error for \(W_n\). Alternatively, \(1/\sqrt{\hat I_n(W_n)}\), where \[\hat I_n(W_n)=-\frac{\partial^2}{\partial\theta^2}\log L(\theta|\mathbf X)\bigg|_{\theta=W_n}\] is the observed information number, is often used.
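A minimal sketch of a Wald test for \(H_0:p=p_0\) with \(X\sim\text{binomial}(n,p)\), using \(\sqrt{\hat p(1-\hat p)/n}=1/\sqrt{I_n(\hat p)}\) as the standard error (the numbers are illustrative).

```python
import numpy as np
from scipy import stats

def wald_binomial(successes, n, p0):
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)          # 1/sqrt(I_n(p_hat))
    z = (p_hat - p0) / se
    return z, 2 * stats.norm.sf(abs(z))            # statistic, two-sided p-value

print(wald_binomial(successes=34, n=100, p0=0.25))
```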

Score test

The score statistic is defined to be \[S(\theta)=\frac{\partial}{\partial\theta}\log f(\mathbf X|\theta)=\frac{\partial}{\partial\theta}\log L(\theta|\mathbf X)\] From the Cramér-Rao Inequality we know that for all \(\theta\), \(\mathbb E_{\theta}S(\theta)=0\). In particular, if we are testing \(H_0:\theta=\theta_0\) and if \(H_0\) is true, then \(S(\theta_0)\) has mean \(0\). Furthermore, from Fisher Information \[\text{Var}_{\theta}S(\theta)=\mathbb E_{\theta}\left[ \left(\frac{\partial}{\partial \theta} \log L(\theta\vert{} \mathbf X ) \right)^2 \right]=I_n(\theta);\] the information number is the variance of the score statistic. The test statistic for the score test is \[Z_S=S(\theta_0)/\sqrt{I_n(\theta_0)}.\] If \(H_0\) is true, \(Z_S\) has mean \(0\) and variance \(1\). From Theorem: Asymptotic efficiency of MLEs it follows that \(Z_S\) converges to a standard normal random variable if \(H_0\) is true. Thus, the approximate level \(\alpha\) score test rejects \(H_0\) if \(|Z_S|>z_{\alpha/2}.\)
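For comparison with the Wald sketch above, here is a score test for the same binomial hypothesis; note that the information is evaluated at \(p_0\) rather than at the MLE (the numbers remain illustrative).

```python
import numpy as np
from scipy import stats

def score_binomial(successes, n, p0):
    score = successes / p0 - (n - successes) / (1 - p0)   # S(p0)
    info = n / (p0 * (1 - p0))                            # I_n(p0)
    z = score / np.sqrt(info)
    return z, 2 * stats.norm.sf(abs(z))

print(score_binomial(successes=34, n=100, p0=0.25))
```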

Interval Estimation

11. Analysis of Variance and Regression

Oneway Analysis of Variance

Model and Distribution Assumptions

Definition: identifiable

A parameter \(\theta\) for a family of distributions \(\{f(x|\theta):\theta\in\Theta\}\) is identifiable if distinct values of \(\theta\) correspond to distinct pdfs or pmfs. That is, if \(\theta\ne\theta'\), then \(f(x|\theta)\) is not the same function of \(x\) as \(f(x|\theta')\).

  • One way ANOVA equation: \[Y_{ij} = \theta_{i} + \epsilon_{ij}, \text{ with: } i = 1,..., k \text{ and } j = 1,..., n_{i}\]
    • \(i\) = treatment
    • \(j\) = observation
  • Assumptions:
    1. \(E[\epsilon_{ij}] = 0,\ \mathbb E Y_{ij}=\theta_{i},\ \text{Var}\,\epsilon_{ij} = \sigma^2_{i} < \infty\). The \(\theta_{i}\)s are usually referred to as treatment means, since the index often corresponds to different treatments or to levels of a particular treatment, such as dosage levels of a particular drug.

    2. \(\text{Cov}(\epsilon_{ij},\epsilon_{i'j'}) = 0, \text{ unless } i=i' \text{ and } j=j'\)

    3. \(\epsilon_{ij}\) are independently and normally distributed

    4. \(\sigma^2_i=\sigma^2\;\;\forall i\) (equality of variance, also known as homoscedasticity).

  • If the data do not meet the assumptions, one can transform the data or rely on the CLT
  • Requires identifiability: each \(\theta\) has a unique distribution
    • Otherwise it is impossible to label future observations

The Classic ANOVA Hypothesis

  • Hypotheses:
    • \(H_0: \theta_1 = \theta_2 = ... = \theta_k\)
    • \(H_1: \theta_i \neq \theta_j\) for some \(i, j\)

Definition: linear combination and contrast.

Let \(t = (t_1, \cdots , t_k)\) be a set of variables, either parameters or statistics, and let \(\mathbf a=(a_1, \cdots, a_k)\) be known constants. The function \[\sum_{i=1}^{k}a_it_i\] is called a linear combination of the \(t_i\)s. If, furthermore, \(\sum a_i = 0\), it is called a contrast.

Theorem of contrast

Let \(\boldsymbol\theta=(\theta_1, \cdots, \theta_k)\) be arbitrary parameters. Then \[\theta_1=\theta_2=\cdots=\theta_k\iff\sum_{i=1}^{k}a_i\theta_i=0;\;\;\forall\mathbf a\in\mathcal A,\] where \(\mathcal A\) is the set of constants satisfying \(\mathcal A=\{\mathbf a=(a_1,\cdots,a_k):\sum a_i=0\}\); that is, all contrasts must satisfy \(\sum_{i=1}^{k}a_i\theta_i=0.\)

  • Hypotheses:
    • \(H_0: \sum_{i=1}^{k}a_i\theta_i=0\) for all contrasts \(\mathbf a\in\mathcal A\)
    • \(H_1: \sum_{i=1}^{k}a_i\theta_i\ne0\) for some contrast \(\mathbf a\in\mathcal A\)

Inferences Regarding Linear Combinations of Means

Working under the oneway ANOVA assumptions, we have that \[Y_{ij}\sim n(\theta_i,\sigma^2), i=1,\cdots,k;\; j=1,\cdots,n_i.\] Therefore, \[\bar{Y}_{i\cdot}=\frac{1}{n_i}\sum_{j=1}^{n_i}Y_{ij}\sim n(\theta_i,\sigma^2/n_i),\;\; i=1,\cdots,k,\] and the grand mean is \[\bar{\bar{Y}}=\frac{1}{N}\sum_{i=1}^{k}\sum_{j=1}^{n_i}Y_{ij},\quad N=\sum_{i=1}^{k}n_i.\] For any constants \(\mathbf a=(a_1,\cdots,a_k),\;\; \sum_{i=1}^{k}a_i\bar{Y}_{i\cdot}\) is also normal with \[\mathbb E\left(\sum_{i=1}^{k}a_i\bar{Y}_{i\cdot}\right)=\sum_{i=1}^{k}a_i\theta_{i}\quad\text{and}\quad\text{Var}\left(\sum_{i=1}^{k}a_i\bar{Y}_{i\cdot}\right)=\sigma^2\sum_{i=1}^{k}\frac{a_i^2}{n_i},\] and therefore

\[\frac{\sum_{i=1}^{k}a_i\bar{Y}_{i\cdot}-\sum_{i=1}^{k}a_i\theta_{i}}{\sqrt{\sigma^2\sum_{i=1}^{k}\frac{a_i^2}{n_i}}}\sim n(0,1).\]

In each population, if we denote the sample variance by \(S_i^2\), that is, \[S_i^2=\frac{1}{n_i-1}\sum_{j=1}^{n_i}(Y_{ij}-\bar{Y}_{i\cdot})^2,\;\; i=1,\cdots,k,\] then \(S_i^2\) is an estimate of \(\sigma^2\) and \((n_i-1)S_i^2/\sigma^2\sim \chi^2_{n_i-1}\).

Furthermore, under the ANOVA assumptions, since each \(S_i^2\) estimates the same \(\sigma^2\), we can improve the estimators by combining them. We thus use the pooled estimator of \(\sigma^2\), \(S_p^2\)

  • Pooled variance: \[S^{2}_{p} = \frac{1}{N-k}\sum^{k}_{i=1}(n_{i}-1)S^{2}_{i} = \frac{1}{N-k}\sum^{k}_{i=1}\sum^{n_{i}}_{j=1}(Y_{ij}-\bar{Y_{i\cdot}})^{2}\] \[(N-k)S^{2}_{p}/\sigma^2\sim\chi^2_{N-k}\] Also, \(S^{2}_{p}\) is independent of each \(\bar{Y_{i\cdot}}\) and thus \[\frac{\sum_{i=1}^{k}a_i\bar{Y_{i\cdot}}-\sum_{i=1}^{k}a_i\theta_i}{\sqrt{S^2_p\sum_{i=1}^{k}a_i^2/n_i}}\sim t_{N-k}\] Student’s \(t\) with \(N - k\) degrees of freedom.

  • t statistic:

\[H_0:\sum_{i=1}^{k}a_i\theta_i=0\quad\text{versus}\quad H_1:\sum_{i=1}^{k}a_i\theta_i\ne0\] at level \(\alpha\), we would reject \(H_0\) if \[\left|\frac{\sum_{i=1}^{k}a_i\bar{Y_{i\cdot}}}{\sqrt{S^2_p\sum_{i=1}^{k}a_i^2/n_i}}\right|> t_{N-k,\alpha/2}\]

Furthermore, this \(t\) quantity is a pivot that can be inverted to give an interval estimator of \(\sum a_i\theta_i\) with probability \(1-\alpha\): \[\sum_{i=1}^{k}a_i\bar{Y_{i\cdot}}-t_{N-k,\alpha/2}\sqrt{S^2_p\sum_{i=1}^{k}a_i^2/n_i}\le\sum_{i=1}^{k}a_i\theta_i\le\sum_{i=1}^{k}a_i\bar{Y_{i\cdot}}+t_{N-k,\alpha/2}\sqrt{S^2_p\sum_{i=1}^{k}a_i^2/n_i}\]

To compare treatments \(1\) and \(2\), take \(\mathbf a =(1,-1,0,\cdots,0)\). Then to test \(H_0:\theta_1=\theta_2\quad\text{versus}\quad H_1:\theta_1\ne\theta_2\), we would reject \(H_0\) if \[\left|\frac{\bar{Y}_{1\cdot}-\bar{Y}_{2\cdot}}{\sqrt{S^{2}_{p}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}\right| > t_{N-k, \alpha/2}\]
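A sketch of this contrast \(t\) test, assuming the data are stored as a list of arrays and the contrast is \(\mathbf a=(1,-1,0)\); the simulated treatment means are illustrative.

```python
import numpy as np
from scipy import stats

def contrast_t_test(groups, a):
    """Two-sided test of H0: sum a_i * theta_i = 0 using the pooled variance."""
    k = len(groups)
    n = np.array([len(g) for g in groups])
    ybar = np.array([np.mean(g) for g in groups])
    N = n.sum()
    s2p = sum((len(g) - 1) * np.var(g, ddof=1) for g in groups) / (N - k)
    a = np.asarray(a, dtype=float)
    t = (a @ ybar) / np.sqrt(s2p * np.sum(a**2 / n))
    return t, 2 * stats.t.sf(abs(t), df=N - k)

rng = np.random.default_rng(6)
groups = [rng.normal(m, 1.0, size=12) for m in (0.0, 0.8, 0.4)]
print(contrast_t_test(groups, a=[1, -1, 0]))
```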

The ANOVA F Test

From Theorem of contrast, the ANOVA hypothesis test can be written \[H_0:\sum_{i=1}^{k}a_i\theta_i=0\;\;\forall\mathbf a\in\mathcal A\quad\text{versus}\quad H_1:\sum_{i=1}^{k}a_i\theta_i\ne0\;\;\text{for some }\mathbf a\in\mathcal A.\] where \[\mathcal A=\left\{\mathbf a=(a_1,\cdots,a_k):\sum_{i=1}^{k}a_i=0\right\}.\] To see this more clearly as a union-intersection test, define, for each \(\mathbf a\), the set \[\Theta_{\mathbf a}=\left\{\boldsymbol\theta=(\theta_1,\cdots,\theta_k):\sum_{i=1}^{k}a_i\theta_i=0\right\}\] Then we have \[\theta\in\left\{\theta:\theta_1=\theta_2= \cdots= \theta_k\right\}\iff\theta\in\Theta_{\mathbf a}\;\;\forall\mathbf a\in\mathcal A\iff\theta\in\underset{\mathbf a\in\mathcal A}{\bigcap}\Theta_{\mathbf a},\] showing that the ANOVA null can be written as an intersection.

By the union-intersection methodology, we would reject \(H_0:\theta\in\underset{\mathbf a\in\mathcal A}{\bigcap}\Theta_{\mathbf a}\) if we can reject \[H_{0_{\mathbf a}}:\theta\in\Theta_{\mathbf a}\quad\text{versus}\quad H_{1_{\mathbf a}}:\theta\notin\Theta_{\mathbf a}\quad\text{for some } \mathbf a\in\mathcal A.\] We test \(H_{0_{\mathbf a}}:\theta\in\Theta_{\mathbf a}\) with the \(t\) statistic \[T_{\mathbf a}=\left|\frac{\sum_{i=1}^{k}a_i\bar{Y_{i\cdot}}-\sum_{i=1}^{k}a_i\theta_i}{\sqrt{S^2_p\sum_{i=1}^{k}a_i^2/n_i}}\right|\] We then reject \(H_{0_{\mathbf a}}\) if \(T_{\mathbf a} > c\) for some constant \(c\).

From the union-intersection methodology, it follows that if we could reject for any \(\mathbf a\), we could reject for the \(\mathbf a\) that maximizes \(T_{\mathbf a}\). Thus, the union-intersection test of the ANOVA null is to reject \(H_{0}\) if \(\sup_{\mathbf a} T_{\mathbf a} > c\), where \(c\) is chosen so that \(P_{H_0}(\sup_{\mathbf a} T_{\mathbf a} > c) =\alpha\).

Lemma of maximum of multiplied constants

Let \((v_1 ,\cdots, v_k)\) be constants and let \((c_1, \cdots , c_k)\) be positive constants. Then, for \(\mathcal A = \left\{\mathbf a = (a_1,\cdots,a_k): \sum a_i = 0\right\}\), \[\underset{\mathbf a\in\mathcal A}{\max}\left\{\frac{\left(\sum_{i=1}^{k}a_iv_i\right)^2}{\sum_{i=1}^{k}a_i^2/c_i}\right\}=\sum_{i=1}^{k}c_i(v_i-\bar{v}_c)^2,\] where \(\bar{v}_c=\sum c_iv_i/\sum c_i\). The maximum is attained at any \(\mathbf a\) of the form \(a_i = K c_i (v_i-\bar{v}_c)\), where \(K\) is a nonzero constant.

For \[T_{\mathbf a}=\left|\frac{\sum_{i=1}^{k}a_i\bar{Y_{i\cdot}}-\sum_{i=1}^{k}a_i\theta_i}{\sqrt{S^2_p\sum_{i=1}^{k}a_i^2/n_i}}\right|\] we see that maximizing \(T_{\mathbf a}\) is equivalent to maximizing \(T_{\mathbf a}^2\). We have \[T_{\mathbf a}^2=\frac{\left(\sum_{i=1}^{k}a_i\bar{Y_{i\cdot}}-\sum_{i=1}^{k}a_i\theta_i\right)^2}{S^2_p\sum_{i=1}^{k}a_i^2/n_i}=\frac{1}{S_p^2}\frac{\left(\sum_{i=1}^{k}a_i(\bar{Y_{i\cdot}}-\theta_i)\right)^2}{\sum_{i=1}^{k}a_i^2/n_i}\]

Theorem: \(\underset{\mathbf a:\sum a_i=0}{\sup}T_{\mathbf a}^2/(k-1)\sim F_{k-1,N-k}\)

For \[T_{\mathbf a}=\left|\frac{\sum_{i=1}^{k}a_i\bar{Y_{i\cdot}}-\sum_{i=1}^{k}a_i\theta_i}{\sqrt{S^2_p\sum_{i=1}^{k}a_i^2/n_i}}\right|\]

\[\underset{\mathbf a:\sum a_i=0}{\sup}T_{\mathbf a}^2=\frac{1}{S_p^2}\sum_{i=1}^{k}n_i\left((\bar{Y_{i\cdot}}-\bar{\bar{Y}})-(\theta_i-\bar\theta)\right)^2\] where \(\bar{\bar{Y}}=\sum n_i\bar{Y_{i\cdot}}/\sum n_i\) and \(\bar\theta=\sum n_i\theta_i/\sum n_i\). Furthermore, under the ANOVA assumptions, \[\underset{\mathbf a:\sum a_i=0}{\sup}T_{\mathbf a}^2\sim(k-1)F_{k-1,N-k},\] that is, \(\underset{\mathbf a:\sum a_i=0}{\sup}T_{\mathbf a}^2/(k-1)\sim F_{k-1,N-k}\).

Proof:

Using the Lemma of maximum of multiplied constants, \[\underset{\mathbf a\in\mathcal A}{\max}\left\{\frac{\left(\sum_{i=1}^{k}a_iv_i\right)^2}{\sum_{i=1}^{k}a_i^2/c_i}\right\}=\sum_{i=1}^{k}c_i(v_i-\bar{v}_c)^2,\]

\[\begin{align} \underset{\mathbf a:\sum a_i=0}{\sup}T_{\mathbf a}^2&=\underset{\mathbf a:\sum a_i=0}{\sup}\frac{1}{S_p^2}\frac{\left(\sum_{i=1}^{k}a_i(\bar{Y_{i\cdot}}-\theta_i)\right)^2}{\sum_{i=1}^{k}a_i^2/n_i}\\ &=\frac{1}{S_p^2}\sum_{i=1}^{k}n_i\left((\bar{Y_{i\cdot}}-\bar{\bar{Y}})-(\theta_i-\bar\theta)\right)^2 \end{align}\]

To prove \[\underset{\mathbf a:\sum a_i=0}{\sup}T_{\mathbf a}^2\sim(k-1)F_{k-1,N-k}\] we must show that the numerator and denominator of \[\frac{1}{S_p^2}\sum_{i=1}^{k}n_i\left((\bar{Y_{i\cdot}}-\bar{\bar{Y}})-(\theta_i-\bar\theta)\right)^2\] are independent chi squared random variables, each divided by its degrees of freedom.

From the ANOVA assumptions two things follow. The numerator and denominator are independent and \[S_p^2/\sigma^2\sim\chi^2_{N-k}/(N-k)\]. We must show that \[\frac{1}{\sigma^2}\sum_{i=1}^{k}n_i\left((\bar{Y_{i\cdot}}-\bar{\bar{Y}})-(\theta_i-\bar\theta)\right)^2\sim\chi^2_{k-1}\]

If \(H_0 : \theta_1 = \theta_2 = \cdots = \theta_k\) is true, then \(\theta_i = \bar\theta\;\;\forall i=1,2,\cdots,k\) and the \(\theta_i - \bar\theta\) terms drop out. Thus, for an \(\alpha\) level test of the ANOVA hypotheses \[H_0:\theta_1 = \theta_2 = \cdots = \theta_k\quad\text{versus}\quad H_1:\theta_i\ne\theta_j\;\text{for some }i,j\]

  • ANOVA F test: \[\text{reject } H_0 \text{ if } F= \frac{\sum^{k}_{i=1}n_{i}(\bar{Y_{i\cdot}} - \bar{\bar{Y}})^{2}/(k-1)}{S^{2}_{p}} > F_{k-1, N-k, \alpha}\] and the test statistic \(F\) is called the ANOVA \(F\) statistic.
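A sketch that builds the ANOVA \(F\) statistic directly from the sums of squares and checks it against scipy.stats.f_oneway; the simulated groups are an assumption for illustration.

```python
import numpy as np
from scipy import stats

def anova_f(groups):
    k = len(groups)
    n = np.array([len(g) for g in groups])
    ybar = np.array([np.mean(g) for g in groups])
    N = n.sum()
    grand = np.concatenate(groups).mean()
    ssb = np.sum(n * (ybar - grand)**2)                      # between-groups SS
    ssw = sum(((g - g.mean())**2).sum() for g in groups)     # within-groups SS
    F = (ssb / (k - 1)) / (ssw / (N - k))
    return F, stats.f.sf(F, k - 1, N - k)

rng = np.random.default_rng(6)
groups = [rng.normal(m, 1.0, size=12) for m in (0.0, 0.8, 0.4)]
print(anova_f(groups))
print(stats.f_oneway(*groups))   # should agree with the hand computation
```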

Simultaneous Estimation of Contrasts

Theorem: Scheffe procedure (S method)

Under the ANOVA assumptions, if \(M=\sqrt{(k-1)F_{k-1,N-k,\alpha}}\) then the probability is \(1-\alpha\) that \[\sum_{i=1}^{k}a_i\bar{Y}_{i\cdot}-M\sqrt{S_p^2\sum_{i=1}^{k}\frac{a_i^2}{n_i}}\le\sum_{i=1}^{k}a_i\theta_{i}\le\sum_{i=1}^{k}a_i\bar{Y}_{i\cdot}+M\sqrt{S_p^2\sum_{i=1}^{k}\frac{a_i^2}{n_i}}\] simultaneously for all \(\mathbf a \in \mathcal A=\{\mathbf a=(a_1,\cdots,a_k):\sum a_i=0\}\).

Proof:

The simultaneous probability statement requires \(M\) to satisfy \[P\left(\left|\sum_{i=1}^{k}a_i\bar{Y}_{i\cdot}-\sum_{i=1}^{k}a_i\theta_{i}\right|\le M\sqrt{S_p^2\sum_{i=1}^{k}\frac{a_i^2}{n_i}}\;\;\forall\mathbf a\in\mathcal A\right)=1-\alpha\] or, equivalently, \[P\left(T_{\mathbf a}^2\le M^2\;\;\forall\mathbf a\in\mathcal A\right)=1-\alpha.\] Now \[P\left(T_{\mathbf a}^2\le M^2\;\;\forall\mathbf a\in\mathcal A\right)=P\left(\underset{\mathbf a:\sum a_i=0}{\sup}T_{\mathbf a}^2\le M^2\right),\] and since \(\underset{\mathbf a:\sum a_i=0}{\sup}T_{\mathbf a}^2/(k-1)\sim F_{k-1,N-k}\), choosing \(M^2=(k-1)F_{k-1,N-k,\alpha}\) makes this probability exactly \(1-\alpha\).

Partitioning Sums of Squares

Theorem: allocating variation

For any numbers \(y_{ij}, i = 1, \cdots, k\), and \(j=1 ,\cdots , n_i\), \[\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{\bar{y}})^2=\sum_{i=1}^{k}n_i(\bar{y}_{i\cdot}-\bar{\bar{y}})^2+\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i\cdot})^2\] where \(\bar{y}_{i\cdot}=\frac{1}{n_i}\sum_j y_{ij}\) and \(\bar{\bar{y}}=\sum_in_i\bar{y}_{i\cdot}/\sum_in_i\)

Proof:

\[\begin{align} \sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{\bar{y}})^2&=\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left((y_{ij}-\bar{y}_{i\cdot})+(\bar{y}_{i\cdot}-\bar{\bar{y}})\right)^2\\ &=\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i\cdot})^2+\sum_{i=1}^{k}n_i(\bar{y}_{i\cdot}-\bar{\bar{y}})^2+2\sum_{i=1}^{k}(\bar{y}_{i\cdot}-\bar{\bar{y}})\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i\cdot})\\ &=\sum_{i=1}^{k}n_i(\bar{y}_{i\cdot}-\bar{\bar{y}})^2+\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i\cdot})^2, \end{align}\] since the cross term vanishes: \(\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i\cdot})=0\) for each \(i\).

Under the ANOVA assumptions, in particular if \(Y_{ij}\sim n(\theta_i,\sigma^2)\), it is easy to show that \[\frac{1}{\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij}-\bar{Y}_{i\cdot})^2\sim\chi^2_{N-k},\] because for each \(i = 1, \cdots, k,\) \[\frac{1}{\sigma^2}\sum_{j=1}^{n_i}(Y_{ij}-\bar{Y}_{i\cdot})^2\sim\chi^2_{n_i-1},\] these \(k\) quantities are independent, and a sum of independent chi squared random variables is chi squared with the degrees of freedom added, here \(\sum_{i=1}^{k}(n_i-1)=N-k\).

Furthermore, if \(\theta_i=\theta_j\) for every \(i,j\), then \[\frac{1}{\sigma^2}\sum_{i=1}^{k}n_i(\bar{Y}_{i\cdot}-\bar{\bar{Y}})^2\sim\chi^2_{k-1}\] and \[\frac{1}{\sigma^2}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(Y_{ij}-\bar{\bar{Y}})^2\sim\chi^2_{N-1}\]

Thus, under \(H_0 : \theta_1 = \cdots = \theta_k,\) the sum of squares partitioning of \[\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{\bar{y}})^2=\sum_{i=1}^{k}n_i(\bar{y}_{i\cdot}-\bar{\bar{y}})^2+\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i\cdot})^2\sim\sigma^2(\chi^2_{k-1}+\chi^2_{N-k})\sim\sigma^2\chi^2_{N-1}\] is a partitioning of chi squared random variables. Note that the \(\chi^2\) partitioning is true only if the terms on the right-hand side are independent, which follows in this case from the normality in the ANOVA assumptions.

ANOVA table for oneway classification

\[ \begin{array}{l|llll} \hline \text{Source of}\\ \text{variation} & \text{df.} & \text{Sum of squares} & \text{Mean square} & F\; \text{ statistic} \\ \hline \text{Between}\\ \text{treatment}\\\text{groups} & k-1 & \mathrm{SSB}=\sum n_i(\bar{y}_{i\cdot}-\bar{\bar{y}})^2 & \mathrm{MSB}=\mathrm{SSB}/(k-1) & F=\frac{\mathrm{MSB}}{\mathrm{MSW}} \\\hline \text{Within}\\ \text{treatment}\\\text{groups} & N-k & \mathrm{SSW}=\sum\sum (y_{ij}-\bar{y}_{i\cdot})^2 & \mathrm{MSW}=\mathrm{SSW}/(N-k) & \\ \hline \text{Total} & N-1 & \mathrm{SST}=\sum\sum (y_{ij}-\bar{\bar{y}})^2 & & \end{array} \]

  • Sum of squares: (SS total) = (SS between) + (SS within)
    • This decomposition is easier to read than the full equation
    • The scaled sums of squares are chi squared random variables
    • If \(H_0\) is true:
      • (SS between)\(/\sigma^2\sim\chi^2_{k-1}\)
      • (SS total)\(/\sigma^2\sim\chi^2_{N-1}\)

Simple Linear Regression

population regression function

\[Y_i = \beta_0 + \beta_1 x_i+\epsilon_i\]

\[E[Y_i] = \beta_0 + \beta_1 x_i\]

Assumes: \(\mathbb E[\epsilon_i] = 0\)

  • Another form: \(\mathbb E[Y_i|x_i] \approx \beta_0 + \beta_1 x_i\)

Least Squares

  • least squares estimates:
    • \(S_{xx}=\sum(x_i-\bar{x})^2\quad S_{xy}=\sum(x_i-\bar{x})(y_i-\bar{y})\)
    • \(\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}\)
    • \(\hat\beta_1 = \frac{S_{xy}}{S_{xx}}\)

Best Linear Unbiased Estimators

  • Assume the form: \[Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \text{ for } i = 1,\cdots, n\]

  • \(\epsilon_1,\cdots, \epsilon_n\) are uncorrelated random variables with: \[E[\epsilon_i] = 0 \text{ and } \text{Var}[\epsilon_i] = \sigma^2\]

    • \(\epsilon_i\) are random errors, therefore \(Y_i\) are uncorrelated
    • An estimator is a linear estimator if it is of the form \(\sum d_iY_i\), where \(d_1,\cdots,d_n\) are known, fixed constants.
    • An unbiased estimator of the slope \(\beta_1\) must satisfy \(\mathbb E\sum d_iY_i=\beta_1\)
    • This equality is true for all \(\beta_0\) and \(\beta_1\) if and only if \(\sum_{i=1}^{n} d_i=0\) and \(\sum_{i=1}^{n}d_ix_i=1\).
    • An estimator is the best linear unbiased estimator (BLUE) if it is the linear unbiased estimator with the smallest variance, which must satisfy \(\beta_1=\mathbb E\sum_{i=1}^{n}d_iY_i\).

Because \(Y_1,\cdots, Y_n\) are uncorrelated with equal variance \(\sigma^2\), the variance of any linear estimator is given by \[\text{Var}\sum_{i=1}^{n}d_iY_i=\sum_{i=1}^{n}d_i^2\sigma^2=\sigma^2\sum_{i=1}^{n}d_i^2\] The best linear unbiased estimator of \(\beta_1\) is, therefore, defined by constants \(d_1,\cdots,d_n\) that satisfy \(\sum_{i=1}^{n} d_i=0\) and \(\sum_{i=1}^{n}d_ix_i=1\) and have the minimum value of \(\sum_{i=1}^{n}d_i^2\).

The minimizing values of the constants \(d_1,\cdots,d_n\) can now be found by using Lemma of maximum of multiplied constants: For \(\mathcal A = \left\{\mathbf a = (a_1,\cdots,a_k): \sum a_i = 0\right\}\), \[\underset{\mathbf a\in\mathcal A}{\max}\left\{\frac{\left(\sum_{i=1}^{k}a_iv_i\right)^2}{\sum_{i=1}^{k}a_i^2/c_i}\right\}=\sum_{i=1}^{k}c_i(v_i-\bar{v}_c)^2,\] where \(\bar{v}_c=\sum c_iv_i/\sum c_i\) The maximum is attained at any \(\mathbf a\) of the form \(a_i = K c_i (v_i-\bar{v}_c)\), where \(K\) is a nonzero constant.

To apply the lemma to our minimization problem, take \(a_i=d_i\), \(v_i=x_i\), and \(c_i=1\); then \[\underset{\mathbf d:\sum d_i=0}{\max}\frac{\left(\sum_{i=1}^{n}d_ix_i\right)^2}{\sum_{i=1}^{n}d_i^2}=\sum_{i=1}^{n}(x_i-\bar{x})^2=S_{xx},\] where \(\bar{x}=\sum x_i/n\), and the maximum is attained at \(d_i=K(x_i-\bar{x}), \;i=1,\cdots,n\).

Then we have \[\sum_{i=1}^{n}d_ix_i=\sum_{i=1}^{n}K(x_i-\bar{x})x_i=KS_{xx}\] The constraint \(\sum_{i=1}^{n}d_ix_i=1\) is satisfied if \(K=\frac{1}{S_{xx}}\). Therefore \[d_i=\frac{(x_i-\bar{x})}{S_{xx}}\quad i=1,\cdots,n\] Finally, since \(\sum_{i=1}^{n}d_ix_i=1\), \[\frac{\left(\sum_{i=1}^{n}d_ix_i\right)^2}{\sum_{i=1}^{n}d_i^2}=\frac{1}{\sum_{i=1}^{n}d_i^2},\] so maximizing this ratio under the constraints \(\sum_{i=1}^{n} d_i=0\) and \(\sum_{i=1}^{n}d_ix_i=1\) is the same as minimizing \(\sum_{i=1}^{n}d_i^2\) under those constraints. With \(d_i=(x_i-\bar x)/S_{xx}\), the linear unbiased estimator defined by these \(d_i\)s is \[\hat\beta_1=\sum_{i=1}^{n}d_iY_i=\sum_{i=1}^{n}\frac{(x_i-\bar{x})}{S_{xx}}Y_i=\frac{S_{xy}}{S_{xx}},\quad\text{with}\quad\mathbb E\,\hat\beta_1=\beta_1.\]

The variance of \(\hat\beta_1\) is \[\text{Var}\,\hat\beta_1=\sigma^2\sum_{i=1}^{n}d_i^2=\sigma^2\sum_{i=1}^{n}(x_i-\bar x)^2/S_{xx}^2=\frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}\]

Since \(x_1 ,\cdots, x_n\) are values chosen by the experimenter, they can be chosen to make \(S_{xx}\) large and the variance of the estimator small; that is, the experimenter can design the experiment to make the estimator more precise.

Models and Distribution Assumptions

Conditional normal model, \(X\) is fixed \(Y\) is random

\[Y_i\sim\mathrm{n}(\beta_0+\beta_1x_i,\sigma^2),\quad i=1,\cdots,n,\] and \[\mathbb E(Y|x)=\beta_0+\beta_1x\] Then \[Y_i=\beta_0+\beta_1x_i+\epsilon_i,\quad i=1,\cdots,n,\] where \(\epsilon_i\) are iid \(\mathrm{n}(0,\sigma^2)\) random variables.

Bivariate normal model, \(X\) and \(Y\) are both random

\[(X_i,Y_i)\sim\text{bivariate normal}(\mu_X,\mu_Y,\sigma^2_X,\sigma^2_Y,\rho)\] For a bivariate normal model, the conditional distribution of \(Y\) given \(X = x\) is normal. The population regression function is now a true conditional expectation, as the notation suggests, and is \[\mathbb E(Y|x)=\mu_Y+\rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X)\] The bivariate normal model implies that the population regression is a linear function of \(x\). Here \(\mathbb E(Y|x)=\beta_0+\beta_1x\), where \(\beta_1=\rho\frac{\sigma_Y}{\sigma_X}\) and \(\beta_0=\mu_Y-\rho\frac{\sigma_Y}{\sigma_X}\mu_X\), and the conditional variance of the response variable \(Y\) does not depend on \(x\), \[\text{Var}(Y|x)=\sigma^2_Y(1-\rho^2)\]

Estimation and Testing with Normal Errors

\[Y_i\sim\mathrm{n}(\beta_0+\beta_1x_i,\sigma^2),\quad i=1,\cdots,n,\] \[Y_i=\beta_0+\beta_1x_i+\epsilon_i,\quad i=1,\cdots,n,\]

The maximum likelihood estimates of the three parameters are: \(\hat\beta_1=\frac{S_{xy}}{S_{xx}}\), which is a linear unbiased estimator of \(\beta_1\);

\(\hat\beta_0=\bar{y}-\hat\beta_1\bar{x}\), which is a linear unbiased estimator of \(\beta_0\);

\[\hat\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat\beta_0-\hat\beta_1x_i)^2,\] which is the RSS, evaluated at the least squares line, divided by the sample size; it is not an unbiased estimator of \(\sigma^2\).

Lemma: the covariance of linear combinations of uncorrelated random variables

Let \(Y_1,\cdots, Y_n\) be uncorrelated random variables with \(\mathrm{Var}Y_i=\sigma^2\) for all \(i = 1,\cdots, n\). Let \(c_1,\cdots,c_n\) and \(d_1, \cdots, d_n\) be two sets of constants. Then \[\mathrm{Cov}\left(\sum_{i=1}^{n}c_iY_i, \sum_{i=1}^{n}d_iY_i\right)=\left(\sum_{i=1}^{n}c_id_i\right)\sigma^2\]

In particular, if \(\bar Y_{1\cdot},\cdots,\bar Y_{k\cdot}\) are independent sample means based on \(n_1,\cdots,n_k\) observations, each observation with variance \(\sigma^2\), then \[\mathrm{Cov}\left(\sum_{i=1}^{k}c_i\bar Y_{i\cdot}, \sum_{i=1}^{k}d_i\bar Y_{i\cdot}\right)=\left(\sum_{i=1}^{k}\frac{c_id_i}{n_i}\right)\sigma^2\]

Proof: See Lemma: the independence of linear combinations of normal random variables. Here we do not need either normality or independence of \(Y_1, \cdots , Y_n\).

The bias in \(\hat\sigma^2\)

For \[Y_i=\beta_0+\beta_1x_i+\epsilon_i,\quad i=1,\cdots,n\] we have \[\epsilon_i=Y_i-\beta_0-\beta_1x_i\] We define the residuals from the regression to be \[\hat\epsilon_i=Y_i-\hat\beta_0-\hat\beta_1x_i\] and thus \[\hat\sigma^2=\frac{1}{n}\sum_{i=1}^{n}\hat\epsilon_i^2=\frac{1}{n}\mathrm{RSS.}\]

\[\mathbb E\hat\epsilon_i=\mathbb E(Y_i-\hat\beta_0-\hat\beta_1x_i)=\beta_0+\beta_1x_i-\beta_0-\beta_1x_i=0\]

  • (1). \(\text{Var}(Y_i)=\text{Var}(\epsilon_i)=\sigma^2\)
  • (2). By results of Best Linear Unbiased Estimators, \[\hat\beta_1=\mathbb E\sum_{i=1}^{n}d_iY_i=\sum_{i=1}^{n}\frac{(x_i-\bar{x})}{\sum(x_i-\bar{x})^2}y_i=\frac{S_{xy}}{S_{xx}}\] then \[\begin{align} \hat\beta_0&=\bar{y}-\hat\beta_1\bar{x}\\ &=\sum\frac{1}{n}y_i-\frac{\bar{x}\sum(x_i-\bar{x})y_i}{\sum(x_i-\bar{x})^2}\\ &=\sum\left[\frac{1}{n}-\frac{\bar{x}(x_i-\bar{x})}{\sum(x_i-\bar{x})^2}\right]y_i \end{align}\] For \[c_i=\frac{1}{n}-\frac{\bar{x}(x_i-\bar{x})}{\sum(x_i-\bar{x})^2},\] \(\sum c_i=1,\sum c_ix_i=0\). Then \[\mathbb E(\hat\beta_0)=\mathbb E(\sum c_iY_i)=\sum c_i(\beta_0+\beta_1x_i)=\beta_0\]

\[\begin{align} \text{Var}(\hat\beta_0)&=\text{Var}(\sum c_iY_i)\\ &=\sum c_i^2\text{Var}(Y_i)\\ &=\sigma^2\sum c_i^2\\ &=\sigma^2\sum\left[\frac{1}{n}-\frac{\bar{x}(x_i-\bar{x})}{\sum(x_i-\bar{x})^2}\right]^2\\ &=\sigma^2\left[\sum\frac{1}{n^2}+\frac{\sum\bar{x}^2(x_i-\bar{x})^2}{\left(\sum(x_i-\bar{x})^2\right)^2}\right]\\ &=\sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum(x_i-\bar{x})^2}\right]\\ &=\sigma^2\left[\frac{\sum(x_i-\bar{x})^2+n\bar{x}^2}{n\sum(x_i-\bar{x})^2}\right]\\ &=\sigma^2\left[\frac{\sum x_i^2}{n\sum(x_i-\bar{x})^2}\right]\\ &=\sigma^2\left[\frac{\sum x_i^2}{nS_{xx}}\right]\\ \end{align}\]

  • (3). By results of Best Linear Unbiased Estimators, \[\text{Var}\hat\beta_1=\sigma^2\sum_{i=1}^{n}d_i^2=\sigma^2\sum_{i=1}^{n}(x_i-\bar x)^2/S_{xx}^2=\frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}\]

  • (4). Write \(\hat\beta_1=\sum d_iY_i\) where \(d_i=\frac{x_i-\bar{x}}{\sum(x_i-\bar{x})^2}=\frac{x_i-\bar{x}}{S_{xx}}\); by Lemma: the covariance of linear combinations of uncorrelated random variables, \[\begin{align} \text{Cov}(\hat\beta_0,\hat\beta_1)&=\text{Cov}\left(\sum c_iY_i,\sum d_iY_i\right)\\ &=\left(\sum_{i=1}^{n}c_id_i\right)\sigma^2\\ &=\sigma^2\sum_{i=1}^{n}\left(\left[\frac{1}{n}-\frac{\bar{x}(x_i-\bar{x})}{\sum(x_i-\bar{x})^2}\right]\frac{x_i-\bar{x}}{\sum(x_i-\bar{x})^2}\right)\\ &=\sigma^2\sum_{i=1}^{n}\left(\left[\frac{\sum(x_i-\bar{x})^2-n\bar{x}(x_i-\bar{x})}{n\sum(x_i-\bar{x})^2}\right]\frac{x_i-\bar{x}}{\sum(x_i-\bar{x})^2}\right)\\ &=\sigma^2\left(\left[\frac{\sum(x_i-\bar{x})\sum(x_i-\bar{x})^2-n\bar{x}\sum(x_i-\bar{x})^2}{n\sum(x_i-\bar{x})^2}\right]\frac{1}{\sum(x_i-\bar{x})^2}\right)\\ &=\sigma^2\left(\frac{\sum(x_i-\bar{x})-n\bar{x}}{n\sum(x_i-\bar{x})^2}\right)\\ &=\sigma^2\left(\frac{-\bar{x}}{\sum(x_i-\bar{x})^2}\right)\\ \end{align}\]

  • (5). \[\begin{align} \text{Cov}(Y_i, \hat\beta_0)&=\text{Cov}\left(Y_i, \sum c_iY_i\right)\\ &=\left[\frac{1}{n}-\frac{\bar{x}(x_i-\bar{x})}{\sum(x_i-\bar{x})^2}\right]\sigma^2\\ \end{align}\]

  • (6). \[\begin{align} \text{Cov}(Y_i, \hat\beta_1)&=\text{Cov}\left(Y_i, \sum d_iY_i\right)\\ &=\frac{x_i-\bar{x}}{\sum(x_i-\bar{x})^2}\sigma^2\\ \end{align}\]

\[\begin{align} \mathrm{Var}\hat\epsilon_i&=\mathbb E \hat\epsilon_i^2\\ &=\mathbb E(Y_i-\hat\beta_0-\hat\beta_1x_i)^2\\ &=\mathbb E\left[(Y_i-\beta_0-\beta_1x_i)-(\hat\beta_0-\beta_0)-x_i(\hat\beta_1-\beta_1)\right]^2\\ &=\mathrm{Var}Y_i+\mathrm{Var}\hat\beta_0+x_i^2\mathrm{Var}\hat\beta_1-2\mathrm{Cov}(Y_i,\hat\beta_0)-2x_i\mathrm{Cov}(Y_i,\hat\beta_1)+2x_i\mathrm{Cov}(\hat\beta_0,\hat\beta_1)\\ &=\left(1+\frac{\sum x_i^2}{n\sum(x_i-\bar{x})^2}+\frac{x_i^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}-\frac{2}{n}-\frac{2(x_i-\bar{x})^2}{\sum(x_i-\bar{x})^2}-\frac{2x_i\bar{x}}{\sum(x_i-\bar{x})^2}\right)\sigma^2\\ \end{align}\]

Thus \[\begin{align} \mathbb E\hat\sigma^2&=\frac{1}{n}\sum_{i=1}^{n}\mathbb E \hat\epsilon_i^2\\ &=\frac{1}{n}\sum_{i=1}^{n}\left[\left(1+\frac{\sum x_i^2}{n\sum(x_i-\bar{x})^2}+\frac{x_i^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}-\frac{2}{n}-\frac{2(x_i-\bar{x})^2}{\sum(x_i-\bar{x})^2}-\frac{2x_i\bar{x}}{\sum(x_i-\bar{x})^2}\right)\sigma^2\right]\\ &=\left(\frac{n-2}{n}+\frac{\sum x_i^2}{n\sum(x_i-\bar{x})^2}+\frac{\sum x_i^2}{n\sum_{i=1}^{n}(x_i-\bar x)^2}-\frac{2\sum(x_i-\bar{x})^2}{n\sum(x_i-\bar{x})^2}-\frac{2\sum x_i\bar{x}}{n\sum(x_i-\bar{x})^2}\right)\sigma^2\\ &=\left(\frac{n-2}{n}+0\right)\sigma^2\\ &=\frac{n-2}{n}\sigma^2\\ \end{align}\] The MLE \(\hat\sigma^2\) is a biased estimator of \(\sigma^2\). The more commonly used estimator of \(\sigma^2\), which is unbiased, is \[S^2=\frac{n}{n-2}\hat\sigma^2=\frac{1}{n-2}\sum_{i=1}^{n}\left(y_i-\hat\beta_0-\hat\beta_1 x_i\right)^2=\frac{1}{n-2}\sum_{i=1}^{n}\hat\epsilon_i^2\]

Theorem: \(\frac{(n-2)S^2}{\sigma^2}\sim\chi_{n-2}^2\)

Under the conditional normal regression model, the sampling distributions of the estimators \(\hat\beta_0\), \(\hat\beta_1\) and \(S^2\) are \[\hat\beta_0\sim\mathrm{n}\left(\beta_0, \sigma^2\left[\frac{\sum x_i^2}{nS_{xx}}\right]\right)\] \[\hat\beta_1\sim\mathrm{n}\left(\beta_1,\frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}\right)\]

\[\text{Cov}(\hat\beta_0,\hat\beta_1)=\sigma^2\left(\frac{-\bar{x}}{\sum(x_i-\bar{x})^2}\right)\] Furthermore, \((\hat\beta_0,\hat\beta_1)\) and \(S^2\) are independent and \[\frac{(n-2)S^2}{\sigma^2}\sim\chi_{n-2}^2\]

Proof:

We first show that \(\hat\beta_0\) and \(\hat\beta_1\) are independent of \(S^2\), a fact that follows from Lemma: the covariance of linear combinations of uncorrelated random variables and Lemma: the independence of linear combinations of normal random variables. Since \[\hat\epsilon_i=Y_i-\hat\beta_0-\hat\beta_1x_i,\] we can write \[\hat\epsilon_i=\sum_{j=1}^{n}\left[\delta_{ij}-(c_j+d_jx_i)\right]Y_j\] where \[\delta_{ij}=\begin{cases} 1&\text{if }i= j\\ 0&\text{if }i\ne j \end{cases},\quad c_j=\frac{1}{n}-\frac{\bar{x}(x_j-\bar{x})}{\sum(x_j-\bar{x})^2},\quad d_j=\frac{x_j-\bar{x}}{\sum(x_j-\bar{x})^2}\] Since \(\hat\beta_0=\sum c_iY_i\) and \(\hat\beta_1=\sum d_iY_i\), applying Lemma: the covariance of linear combinations of uncorrelated random variables, we have \[\begin{align} \text{Cov}(\hat\epsilon_i,\hat\beta_0)&=\text{Cov}\left((Y_i-\hat\beta_0-\hat\beta_1x_i),\hat\beta_0\right)\\ &=\sigma^2\left(\left[\frac{1}{n}-\frac{\bar{x}(x_i-\bar{x})}{S_{xx}}\right]-\frac{\sum x_i^2}{nS_{xx}}-\frac{-\bar{x}x_i}{S_{xx}}\right)\\ &=0 \end{align}\] and \[\begin{align} \text{Cov}(\hat\epsilon_i,\hat\beta_1)&=\text{Cov}\left((Y_i-\hat\beta_0-\hat\beta_1x_i),\hat\beta_1\right)\\ &=\sigma^2\left(\frac{x_i-\bar{x}}{\sum(x_i-\bar{x})^2}+\frac{\bar{x}}{\sum(x_i-\bar{x})^2}-\frac{x_i}{\sum(x_i-\bar x)^2}\right)\\ &=0 \end{align}\]

Thus by Lemma: the independence of linear combinations of normal random variables, under normal sampling, \(S^2=\sum\hat\epsilon_i^2/(n-2)\) is independent of \(\hat\beta_0\) and \(\hat\beta_1\).
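Before turning to the \(\chi^2\) part of the proof, here is a small simulation sketch (not from the text) that numerically checks the stated variances, the covariance \(\text{Cov}(\hat\beta_0,\hat\beta_1)=-\bar x\sigma^2/S_{xx}\), and the independence claim via a near-zero empirical correlation between \(\hat\beta_1\) and \(S^2\); all settings are arbitrary and `numpy` is assumed.

```python
import numpy as np

# Simulation sketch: check Var(b0_hat), Var(b1_hat), Cov(b0_hat, b1_hat), and the
# independence of (b0_hat, b1_hat) and S^2 under normal errors.  Settings are arbitrary.
rng = np.random.default_rng(3)
n, beta0, beta1, sigma2 = 12, 1.0, 2.0, 1.0
x = np.linspace(0, 3, n)
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)

b0s, b1s, S2s = [], [], []
for _ in range(20000):
    y = beta0 + beta1 * x + rng.normal(scale=np.sqrt(sigma2), size=n)
    b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * xbar
    b0s.append(b0)
    b1s.append(b1)
    S2s.append(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

b0s, b1s, S2s = np.array(b0s), np.array(b1s), np.array(S2s)
print(np.var(b0s), sigma2 * np.sum(x ** 2) / (n * Sxx))   # Var(b0_hat)
print(np.var(b1s), sigma2 / Sxx)                          # Var(b1_hat)
print(np.cov(b0s, b1s)[0, 1], -xbar * sigma2 / Sxx)       # Cov(b0_hat, b1_hat)
print(np.corrcoef(b1s, S2s)[0, 1])                        # ~0, consistent with independence
```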

To prove that \((n-2)S^2/\sigma^2=\sum\hat\epsilon_i^2/\sigma^2\sim\chi^2_{n-2}\), we write \((n-2)S^2\) as the sum of \(n-2\) independent random variables, each of which has a \(\chi^2_1\) distribution. That is, we find constants \(a_{ij}, i = 1,\cdots,n \quad\text{and}\quad j = 1,\cdots, n-2\), that satisfy \[\sum_{i=1}^{n}\hat\epsilon_i^2=\sum_{j=1}^{n-2}\left(\sum_{i=1}^{n}a_{ij}Y_i\right)^2,\] where \[\sum_{i=1}^{n}a_{ij}=0,\quad \sum_{i=1}^{n}a_{ij}x_i=0,\quad \sum_{i=1}^{n}a_{ij}^2=1,\quad j = 1,\cdots, n-2,\qquad\text{and}\qquad \sum_{i=1}^{n}a_{ij}a_{ij'}=0,\quad j \ne j'.\] Under these conditions each \(\sum_{i=1}^{n}a_{ij}Y_i\sim\mathrm{n}(0,\sigma^2)\), the \(n-2\) linear combinations are mutually independent, and therefore each \(\left(\sum_{i=1}^{n}a_{ij}Y_i\right)^2/\sigma^2\sim\chi^2_1\).
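One concrete way to obtain such constants numerically (a sketch, not necessarily the construction used in the text) is to take the last \(n-2\) columns of the orthogonal factor in a complete QR decomposition of the design matrix \([\mathbf 1\;\; x]\); these columns are orthonormal and orthogonal to both \(\mathbf 1\) and \(x\), and the residual sum of squares equals \(\sum_j\left(\sum_i a_{ij}Y_i\right)^2\). The data below are illustrative and `numpy` is assumed.

```python
import numpy as np

# Sketch: the last n-2 columns of a complete QR of [1, x] serve as the a_ij above:
# they sum to zero, are orthogonal to x, are orthonormal, and reproduce the RSS.
rng = np.random.default_rng(4)
n = 8
x = np.linspace(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(size=n)                # illustrative data

X = np.column_stack([np.ones(n), x])
Q, _ = np.linalg.qr(X, mode="complete")               # Q is an n x n orthogonal matrix
A = Q[:, 2:]                                          # columns orthogonal to 1 and x

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
rss = np.sum((y - b0 - b1 * x) ** 2)

print(np.allclose(A.sum(axis=0), 0))                  # sum_i a_ij = 0
print(np.allclose(A.T @ x, 0))                        # sum_i a_ij x_i = 0
print(np.allclose(A.T @ A, np.eye(n - 2)))            # orthonormal columns
print(np.isclose(rss, np.sum((A.T @ y) ** 2)))        # RSS = sum_j (sum_i a_ij y_i)^2
```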

Inferences regarding the two parameters \(\beta_0\) and \(\beta_1\) are usually based on the following two Student’s \(t\) distributions. Their derivations follow immediately from the normal and \(\chi^2\) distributions and the independence in Theorem: \(\frac{(n-2)S^2}{\sigma^2}\sim\chi_{n-2}^2\). We have \[\frac{\hat\beta_0-\beta_0}{\sqrt{\text{Var}(\hat\beta_0)}}=\frac{\hat\beta_0-\beta_0}{\sigma\sqrt{\frac{\sum x_i^2}{nS_{xx}}}}\sim\mathrm{n}(0,1)\] \[\frac{(n-2)S^2}{\sigma^2}\sim\chi_{n-2}^2\] Then \[\frac{Z}{\sqrt{\chi^2/df}}=\frac{\frac{\hat\beta_0-\beta_0}{\sigma\sqrt{\frac{\sum x_i^2}{nS_{xx}}}}}{\sqrt{\frac{S^2}{\sigma^2}}}=\frac{\hat\beta_0-\beta_0}{S\sqrt{\frac{\sum x_i^2}{nS_{xx}}}}\sim t_{n-2}\]

And \[\frac{\hat\beta_1-\beta_1}{\sqrt{\text{Var}(\hat\beta_1)}}=\frac{\hat\beta_1-\beta_1}{\sigma\sqrt{\frac{1}{\sum_{i=1}^{n}(x_i-\bar x)^2}}}\sim\mathrm{n}(0,1)\]

Then \[\frac{Z}{\sqrt{\chi^2/df}}=\frac{\frac{\hat\beta_1-\beta_1}{\sigma\sqrt{\frac{1}{\sum_{i=1}^{n}(x_i-\bar x)^2}}}}{\sqrt{\frac{S^2}{\sigma^2}}}=\frac{\hat\beta_1-\beta_1}{S/\sqrt{S_{xx}}}\sim t_{n-2}\]
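As a small numerical illustration of these pivots (a sketch with made-up data, assuming `numpy` and `scipy` are available), the following computes \(\hat\beta_0\), \(\hat\beta_1\), \(S^2\), and the corresponding 95% \(t\) intervals.

```python
import numpy as np
from scipy import stats

# Sketch: t-based pivots for beta0 and beta1 on a small synthetic data set
# (illustrative values only; x and y are not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.3, 2.8, 4.5, 5.1, 5.9])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
S2 = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)         # unbiased estimate of sigma^2
S = np.sqrt(S2)

se_b0 = S * np.sqrt(np.sum(x ** 2) / (n * Sxx))       # estimated sd of b0_hat
se_b1 = S / np.sqrt(Sxx)                              # estimated sd of b1_hat
tcrit = stats.t.ppf(0.975, df=n - 2)                  # 95% two-sided critical value

print("beta1_hat:", b1, "95% CI:", (b1 - tcrit * se_b1, b1 + tcrit * se_b1))
print("beta0_hat:", b0, "95% CI:", (b0 - tcrit * se_b0, b0 + tcrit * se_b0))
```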

To test \(H_0:\beta_1=0\quad\text{versus}\quad H_1:\beta_1\ne 0\) we reject \(H_0\) at level \(\alpha\) if \[\left|\frac{\hat\beta_1-0}{S/\sqrt{S_{xx}}}\right|\ge t_{n-2,\alpha/2}\] or equivalently, if \[\frac{\hat\beta_1^2}{S^2/S_{xx}}>F_{1,n-2,\alpha}.\] Note that, since \(\hat\beta_1=S_{xy}/S_{xx}\), \[\frac{\hat\beta_1^2}{S^2/S_{xx}}=\frac{S_{xy}^2/S_{xx}^2}{S^2/S_{xx}}=\frac{S_{xy}^2/S_{xx}}{S^2}=\frac{S_{xy}^2/S_{xx}}{\frac{1}{n-2}\sum\hat\epsilon_i^2}=\frac{S_{xy}^2/S_{xx}}{\text{RSS}/(n-2)}=\frac{\text{Regression sum of squares}}{\text{Residual sum of squares}/\text{df}},\] which is the \(F\) statistic in the ANOVA table below.

ANOVA table for simple linear regression

\[ \begin{array}{l|cccc} \hline \text{Source of variation} & \text{df} & \text{Sum of squares} & \text{Mean square} & F\ \text{statistic} \\ \hline \text{Regression (slope)} & 1 & \mathrm{Reg.SS}=S^2_{xy}/S_{xx} & \mathrm{MS(Reg)}=\mathrm{Reg.SS} & F=\frac{\mathrm{MS(Reg)}}{\mathrm{MS(Resid)}} \\ \hline \text{Residual} & n-2 & \mathrm{RSS}=\sum\hat\epsilon_i^2 & \mathrm{MS(Resid)}=\mathrm{RSS}/(n-2) & \\ \hline \text{Total} & n-1 & \mathrm{SST}=\sum (y_{i}-\bar{y})^2 & & \\ \hline \end{array} \]

The coefficient of determination \(r^2\):

\[r^2=\frac{\text{Regression sum of squares}}{\text{Total sum of squares}}=\frac{\sum(\hat y_i-\bar y)^2}{\sum(y_i-\bar y)^2}=\frac{S_{xy}^2}{S_{xx}S_{yy}}\]
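The sketch below (made-up data again, `numpy` assumed) computes the ANOVA quantities, checks the decomposition \(\mathrm{SST}=\mathrm{Reg.SS}+\mathrm{RSS}\), verifies \(F=t^2\), and reports \(r^2\).

```python
import numpy as np

# Sketch of the ANOVA decomposition for simple linear regression:
# Reg.SS = Sxy^2/Sxx, RSS = sum of squared residuals, SST = Syy,
# F = MS(Reg)/MS(Resid), and r^2 = Reg.SS/SST.  (Illustrative x, y only.)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.3, 2.8, 4.5, 5.1, 5.9])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()
resid = y - b0 - b1 * x

reg_ss = Sxy ** 2 / Sxx
rss = np.sum(resid ** 2)
F = reg_ss / (rss / (n - 2))
r2 = reg_ss / Syy

print("SST vs Reg.SS + RSS:", Syy, reg_ss + rss)                     # decomposition check
print("F:", F, " t^2:", (b1 / np.sqrt(rss / (n - 2) / Sxx)) ** 2)    # F = t^2
print("r^2:", r2)
```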

Estimation and Prediction at a Specified \(x = x_0\)

Simultaneous Estimation and Confidence Bands

12. Regression Models

Regression with Errors in Variables (EIV) (or measurement error model)

In the simple linear regression model \[Y_i=\beta_0+\beta_1x_i+\epsilon_i,\] the \(x_i\)s are known and fixed.

Regression with errors in variables (EIV), also known as the measurement error model, generalizes the simple linear regression model: instead of knowing the \(x_i\)s exactly, we measure a random variable whose mean is \(x_i\).

In the general EIV model we assume that we observe pairs \((x_i, y_i)\) sampled from random variables \((X_i, Y_i)\) whose means satisfy the linear relationship \[\mathbb EY_i=\beta_0+\beta_1(\mathbb E X_i)\]

Errors in variables model (or measurement error model)

Let \(\mathbb E X_i=\xi_i\). The errors in variables model is \[Y_i=\beta_0+\beta_1\xi_i+\epsilon_i,\quad \epsilon_i\sim\mathrm{n}(0,\sigma^2_{\epsilon}),\] \[X_i=\xi_i+\delta_i,\quad \delta_i\sim\mathrm{n}(0,\sigma^2_{\delta}).\] Note that the assumption of normality, although common, is not necessary; other distributions can be used.

Functional and Structural Relationships

Linear functional relationship model

\[Y_i=\beta_0+\beta_1\xi_i+\epsilon_i,\quad \epsilon_i\sim\mathrm{n}(0,\sigma^2_{\epsilon}),\] \[X_i=\xi_i+\delta_i,\quad \delta_i\sim\mathrm{n}(0,\sigma^2_{\delta}),\] where the \(\xi_i\)s are fixed, unknown parameters and the \(\epsilon_i\)s and \(\delta_i\)s are independent. The parameters of main interest are \(\beta_0\) and \(\beta_1\), and inference on these parameters is made using the joint distribution of \(((X_1, Y_1),\cdots,(X_n, Y_n))\), conditional on \(\xi_1,\cdots,\xi_n.\)

Linear structural relationship model

As in the functional relationship model, we have random variables \(X_i\) and \(Y_i\), with \(\mathbb E X_i=\xi_i,\quad \mathbb E Y_i=\eta_i\), and we assume the functional relationship \(\eta_i=\beta_0+\beta_1\xi_i\). But now we assume that the parameters \(\xi_1,\cdots,\xi_n\) are themselves a random sample from a common population. Thus, conditional on \(\xi_1,\cdots,\xi_n\), we observe pairs \((X_i, Y_i), i = 1 , \cdots , n\), according to \[Y_i=\beta_0+\beta_1\xi_i+\epsilon_i,\quad \epsilon_i\sim\mathrm{n}(0,\sigma^2_{\epsilon}),\] \[X_i=\xi_i+\delta_i,\quad \delta_i\sim\mathrm{n}(0,\sigma^2_{\delta}),\] and also \[\xi_i\overset{\text{iid}}{\sim}\mathrm{n}(\xi,\sigma^2_{\xi}).\] As before, the \(\epsilon_i\)s and \(\delta_i\)s are independent, and they are also independent of the \(\xi_i\)s. As in the functional relationship model, the parameters of main interest are \(\beta_0\) and \(\beta_1\). Here, however, the inference on these parameters is made using the joint distribution of \(((X_1, Y_1),\cdots,(X_n, Y_n))\), unconditional on \(\xi_1,\cdots,\xi_n.\)
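To see why the measurement error matters, here is a small simulation sketch (not from the text, `numpy` assumed) that generates data from the structural relationship model and fits ordinary least squares of \(Y\) on the observed \(X\); the naive slope is pulled toward zero by the measurement error in \(X\).

```python
import numpy as np

# Sketch: data generated from the structural relationship model, showing that
# naive least squares of Y on the observed X tends to pull the slope estimate
# toward zero when sigma_delta^2 > 0.
rng = np.random.default_rng(2)
n = 5000
beta0, beta1 = 1.0, 2.0
xi = rng.normal(0.0, 1.0, n)                          # latent xi_i ~ n(xi, sigma_xi^2)
X = xi + rng.normal(0.0, 1.0, n)                      # X_i = xi_i + delta_i
Y = beta0 + beta1 * xi + rng.normal(0.0, 0.5, n)      # Y_i = beta0 + beta1*xi_i + eps_i

b1_naive = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
# roughly beta1 * sigma_xi^2 / (sigma_xi^2 + sigma_delta^2) = 1.0 here, not 2.0
print("naive OLS slope:", b1_naive)
```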

A Least Squares Solution: orthogonal least squares

One way to take account of the fact that the \(x\)s also have error in their measurement is to perform orthogonal least squares, that is, to find the line that minimizes orthogonal (perpendicular to the line) distances rather than vertical distances.

For a particular data point \((x', y')\), the point \((\hat x', \hat y')\) on a line \(y = a + bx\) that is closest when we measure distance orthogonally is given by \[\hat x'=\frac{by'+x'-ab}{1+b^2},\quad\text{and}\quad \hat y'=a+\frac{b}{1+b^2}(by'+x'-ab),\] and the line through \((x', y')\) perpendicular (normal) to \(y=a+bx\) has slope \(-\frac{1}{b}\).
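A quick numerical check of these projection formulas (arbitrary point and line, not from the text; `numpy` assumed):

```python
import numpy as np

# Check: the projected point lies on the line, and the displacement from the
# data point is perpendicular to the line's direction vector (1, b).
a, b = 0.5, 2.0                                       # line y = a + b x
xp, yp = 3.0, 1.0                                     # data point (x', y')

xhat = (b * yp + xp - a * b) / (1 + b ** 2)
yhat = a + b / (1 + b ** 2) * (b * yp + xp - a * b)

print(np.isclose(yhat, a + b * xhat))                     # on the line
print(np.isclose((xhat - xp) * 1 + (yhat - yp) * b, 0))   # orthogonal to (1, b)
```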

The total least squares problem is to minimize, over all \(a\) and \(b\), the quantity \[\begin{align} &\sum_{i=1}^{n}\left((x_i-\hat x_i)^2+(y_i-\hat y_i)^2\right)\\ &=\frac{1}{1+b^2}\sum_{i=1}^{n}\left(y_i-(a+bx_i)\right)^2\\ \end{align}\]

If \(b\) is fixed, the minimizing choice of \(a\) in the sum is \(a=\bar{y} - b\bar {x}\). Then \[\begin{align} &\sum_{i=1}^{n}\left((x_i-\hat x_i)^2+(y_i-\hat y_i)^2\right)\\ &=\frac{1}{1+b^2}\sum_{i=1}^{n}\left(y_i-(a+bx_i)\right)^2\\ &=\frac{1}{1+b^2}\sum_{i=1}^{n}\left(y_i-(\bar{y} - b\bar {x}+bx_i)\right)^2\\ &=\frac{1}{1+b^2}\sum_{i=1}^{n}\left((y_i-\bar{y}) - b(x_i-\bar {x})\right)^2\\ &=\frac{1}{1+b^2}\left[S_{yy}-2bS_{xy}+b^2S_{xx}\right]. \end{align}\]

where \[S_{xx}=\sum(x_i-\bar{x})^2,\quad S_{yy}=\sum(y_i-\bar{y})^2,\quad S_{xy}=\sum(x_i-\bar{x})(y_i-\bar{y}).\]

The minimum is achieved at \(b=\frac{-(S_{xx}-S_{yy})+\sqrt{(S_{xx}-S_{yy})^2+4S_{xy}^2}}{2S_{xy}}\), provided \(S_{xy}\ne 0\).
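The following sketch (made-up data, `numpy` assumed) computes the orthogonal least squares slope from this closed form, compares it with the ordinary least squares slope \(S_{xy}/S_{xx}\), and confirms numerically that it minimizes the orthogonal objective.

```python
import numpy as np

# Sketch: orthogonal (total) least squares slope from the closed form above,
# compared with the ordinary least squares slope Sxy/Sxx (illustrative data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.6, 2.7, 4.8, 5.0, 6.2])

Sxx = np.sum((x - x.mean()) ** 2)
Syy = np.sum((y - y.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))

b_tls = (-(Sxx - Syy) + np.sqrt((Sxx - Syy) ** 2 + 4 * Sxy ** 2)) / (2 * Sxy)
b_ols = Sxy / Sxx

def orth_ss(a, b):
    # sum of squared orthogonal distances to the line y = a + b x
    return np.sum((y - (a + b * x)) ** 2) / (1 + b ** 2)

print("b_ols:", b_ols, " b_tls:", b_tls)
# the closed-form slope should (numerically) minimize the orthogonal objective
grid = np.linspace(b_tls - 1, b_tls + 1, 2001)
print("grid argmin:", grid[np.argmin([orth_ss(y.mean() - b * x.mean(), b) for b in grid])])
```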

Maximum Likelihood Estimation

Logistic Regression

The Model

\[\log\left(\frac{\pi_i}{1-\pi_i}\right)=\beta_0+\beta_1x_i\] with \[\pi_i=\frac{e^{\beta_0+\beta_1x_i}}{1+e^{\beta_0+\beta_1x_i}}\]
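The parameters \(\beta_0,\beta_1\) are typically estimated by maximum likelihood; below is a minimal Newton-Raphson sketch for the Bernoulli log-likelihood on simulated data (all settings arbitrary, `numpy` assumed), intended only as an illustration of the fitting step, not a prescribed implementation.

```python
import numpy as np

# Sketch: fit the logistic regression model above by Newton-Raphson on the
# Bernoulli log-likelihood (illustrative data; convergence settings arbitrary).
rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 100)
true_b0, true_b1 = -0.5, 1.5
p = 1 / (1 + np.exp(-(true_b0 + true_b1 * x)))
y = rng.binomial(1, p)                                # 0/1 responses

X = np.column_stack([np.ones_like(x), x])             # design matrix [1, x_i]
beta = np.zeros(2)
for _ in range(25):
    eta = X @ beta
    pi = 1 / (1 + np.exp(-eta))                       # fitted probabilities
    W = pi * (1 - pi)                                 # weights in the information matrix
    grad = X.T @ (y - pi)                             # score vector
    hess = X.T @ (X * W[:, None])                     # Fisher information
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print("beta0_hat, beta1_hat:", beta)                  # should be near (-0.5, 1.5)
```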

References

1. Casella G, Berger RL. Statistical inference. Cengage Learning; 2021.