
Introducing Monte Carlo Methods with R

1. Basic R Programming

library(MASS)
e <- c(1:5)
d <- c(6:10)
e*d
## [1]  6 14 24 36 50
t(e)*d
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    6   14   24   36   50
sum(t(e)*d)
## [1] 130
t(e)%*%d
##      [,1]
## [1,]  130
d%*%t(e)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    6   12   18   24   30
## [2,]    7   14   21   28   35
## [3,]    8   16   24   32   40
## [4,]    9   18   27   36   45
## [5,]   10   20   30   40   50
x1=matrix(1:20,nrow=5) #build the numeric matrix x1 of dimension
#5 x 4 with first row 1, 6, 11, 16
x2=matrix(1:20,nrow=5,byrow=T) #build the numeric matrix x2 of dimension
#5 x 4 with first row 1, 2, 3, 4
x3=t(x2) #transpose the matrix x2
b=x3%*%x2
b
##      [,1] [,2] [,3] [,4]
## [1,]  565  610  655  700
## [2,]  610  660  710  760
## [3,]  655  710  765  820
## [4,]  700  760  820  880
sum(b)
## [1] 11380
c=x2%*%x3
c
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   30   70  110  150  190
## [2,]   70  174  278  382  486
## [3,]  110  278  446  614  782
## [4,]  150  382  614  846 1078
## [5,]  190  486  782 1078 1374
sum(c)
## [1] 11150
m <- runif(16)
m2 <- matrix(m, nrow=4)
m2
##            [,1]       [,2]       [,3]      [,4]
## [1,] 0.10189894 0.90168646 0.32918666 0.5923876
## [2,] 0.39018669 0.88521994 0.94812635 0.8528523
## [3,] 0.90028130 0.49344937 0.05230949 0.0183888
## [4,] 0.03540736 0.07567063 0.44852255 0.6043442
solve(m2)
##            [,1]       [,2]       [,3]       [,4]
## [1,] -0.4549106 -0.3146329  1.2651309  0.8514263
## [2,]  0.9858854  0.3718322 -0.2143544 -1.4845886
## [3,] -1.9439557  2.5941484 -0.8362530 -1.7299290
## [4,]  1.3459425 -1.9534075  0.5733549  3.0745809
solve(m2) %*% m2
##               [,1]          [,2]          [,3]          [,4]
## [1,]  1.000000e+00 -1.942890e-16 -5.551115e-17  1.110223e-16
## [2,]  2.081668e-17  1.000000e+00  1.110223e-16  1.110223e-16
## [3,] -1.665335e-16 -8.326673e-17  1.000000e+00 -2.220446e-16
## [4,] -4.163336e-17 -1.387779e-16 -4.440892e-16  1.000000e+00
m2 %*% solve(m2)
##              [,1]          [,2]          [,3]          [,4]
## [1,] 1.000000e+00 -2.220446e-16  5.551115e-17 -2.220446e-16
## [2,] 2.220446e-16  1.000000e+00 -5.551115e-17 -4.440892e-16
## [3,] 3.122502e-17 -1.387779e-17  1.000000e+00 -1.387779e-17
## [4,] 0.000000e+00  0.000000e+00  5.551115e-17  1.000000e+00
x3%*%x2
##      [,1] [,2] [,3] [,4]
## [1,]  565  610  655  700
## [2,]  610  660  710  760
## [3,]  655  710  765  820
## [4,]  700  760  820  880
eigen(x3%*%x2)
## eigen() decomposition
## $values
## [1] 2.864414e+03 5.585784e+00 6.505686e-13 6.838974e-14
## 
## $vectors
##            [,1]       [,2]       [,3]       [,4]
## [1,] -0.4430188  0.7097424 -0.5477226  0.0000000
## [2,] -0.4798725  0.2640499  0.7302967  0.4082483
## [3,] -0.5167262 -0.1816426  0.1825742 -0.8164966
## [4,] -0.5535799 -0.6273351 -0.3651484  0.4082483
b=matrix(1:9,ncol=3)
b
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
var(b)
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    1    1    1
## [3,]    1    1    1
b[,1]
## [1] 1 2 3
b[,2]
## [1] 4 5 6
b[,3]
## [1] 7 8 9
var(b[,1], b[,1])
## [1] 1
var(b[,1], b[,2])
## [1] 1
var(b[,1], b[,3])
## [1] 1
var(b[,2], b[,3])
## [1] 1
x=rnorm(25) #produces a N(0,1) sample of size 25
t.test(x)
## 
## 	One Sample t-test
## 
## data:  x
## t = -0.71721, df = 24, p-value = 0.4802
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.5815596  0.2816076
## sample estimates:
## mean of x 
## -0.149976
out=t.test(x)
names(out)
##  [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
##  [6] "null.value"  "stderr"      "alternative" "method"      "data.name"

Bivariate Normal distributions

Just as Normal distributions are the most important univariate distributions, joint or multivariate Normal distributions are the most important joint distributions. We mostly focus on the case of two random variables.

Jointly continuous random variables \(X\) and \(Y\) have a Bivariate Normal distribution with parameters \(\mu_X\), \(\mu_Y\), \(\sigma_X>0\), \(\sigma_Y>0\), and \(-1<\rho<1\) if the joint pdf is \[\begin{align*} f_{X, Y}(x,y) & = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\exp\left(-\frac{1}{2(1-\rho^2)}\left[\left(\frac{x-\mu_X}{\sigma_X}\right)^2+\left(\frac{y-\mu_Y}{\sigma_Y}\right)^2-2\rho\left(\frac{x-\mu_X}{\sigma_X}\right)\left(\frac{y-\mu_Y}{\sigma_Y}\right)\right]\right), \\ \quad -\infty <x<\infty, -\infty<y<\infty \end{align*}\] It can be shown that if the pair \((X, Y)\) has a BivariateNormal(\(\mu_X\), \(\mu_Y\), \(\sigma_X\), \(\sigma_Y\), \(\rho\)) distribution \[\begin{align*} E(X) & =\mu_X\\ E(Y) & =\mu_Y\\ SD(X) & = \sigma_X\\ SD(Y) & = \sigma_Y\\ Corr(X, Y) & = \rho \end{align*}\] A Bivariate Normal pdf is a density on \((x, y)\) pairs. The density surface looks like a mound with a peak at the point of means \((\mu_X,\mu_Y)\).

A Bivariate Normal Density has elliptical contours. For each height \(c>0\) the set \(\{(x,y): f_{X, Y}(x,y)=c\}\) is an ellipse. The density decreases as \((x, y)\) moves away from \((\mu_X, \mu_Y)\), most steeply along the minor axis of the ellipse, and least steeply along the major axis of the ellipse.

library(GA)

#input
mu1<-0 #mean of X_1
mu2<-0 #mean of X_2
sigma11<-1 #variance of X_1
sigma22<-1 #variance of X_2
sigma12<-0.7 #covariance of X_1 and X_2 

# 3D plot
x1 <- seq(mu1-3, mu1+3, length= 500)
x2 <- seq(mu2-3, mu2+3, length= 500)
z <- function(x1,x2){ z <- exp(-(sigma22*(x1-mu1)^2+sigma11*(x2-mu2)^2-2*sigma12*(x1-mu1)*(x2-mu2))/(2*(sigma11*sigma22-sigma12^2)))/(2*pi*sqrt(sigma11*sigma22-sigma12^2)) }
f <- outer(x1,x2,z)
persp3D(x1, x2, f, theta = 30, phi = 30, expand = 0.5)

# contour plot
library(plotly)
z <- function(x1,x2){ z <- exp(-(sigma22*(x1-mu1)^2+sigma11*(x2-mu2)^2-2*sigma12*(x1-mu1)*(x2-mu2))/(2*(sigma11*sigma22-sigma12^2)))/(2*pi*sqrt(sigma11*sigma22-sigma12^2)) }
f <- t(outer(x1,x2,z))


plot_ly(x=x1,y=x2,z=f,type = "contour")%>%layout(xaxis=list(title="x1"),yaxis=list(title="x2"))

A scatterplot of \((x,y)\) pairs generated from a Bivariate Normal distribution will have a rough linear association and the cloud of points will resemble an ellipse.

from symbulate import *
import matplotlib
import matplotlib.pyplot as plt

X,Y = RV(BivariateNormal(mean1 = 0, mean2 = 0, sd1 = 1, sd2 = 1, corr = -0.5))

(X & Y).sim(1000).plot(['scatter', 'marginal'])
plt.show()

If \(X\) and \(Y\) have a Bivariate Normal distribution, then the marginal distributions are also Normal: \(X\) has a Normal\(\left(\mu_X,\sigma_X\right)\) distribution and \(Y\) has a Normal\(\left(\mu_Y,\sigma_Y\right)\) distribution.

The strength of the association in a Bivariate Normal distribution is completely determined by the correlation \(\rho\). Remember, in general it is possible to have situations where the correlation is 0 but the random variables are not independent. However, for Bivariate Normal distributions independence and uncorrelatedness are equivalent.

Theorem 1 If \(X\) and \(Y\) have a Bivariate Normal distribution and \(Corr(X, Y)=0\) then \(X\) and \(Y\) are independent.

The above is true because the joint pdf factors into the product of the marginal pdfs if and only if \(\rho=0\).

It can also be shown that if \(X\) and \(Y\) have a Bivariate Normal distribution then any conditional distribution is Normal. The conditional distribution of \(Y\) given \(X=x\) is \[ N\left(\mu_Y + \frac{\rho\sigma_Y}{\sigma_X}\left(x-\mu_X\right),\;\sigma_Y\sqrt{1-\rho^2}\right) \]

(Y | (abs(X - 1.5) < 0.1) ).sim(1000).plot()
(Y | (abs(X - (-0.5)) < 0.1) ).sim(1000).plot()
plt.show()

The conditional expected value of \(Y\) given \(X=x\) is a linear function of \(x\), called the regression line of \(Y\) on \(X\): \[ E(Y | X=x) = \mu_Y + \rho\sigma_Y\left(\frac{x-\mu_X}{\sigma_X}\right) \] The regression line passes through the point of means \((\mu_X, \mu_Y)\) and has slope \[ \frac{\rho \sigma_Y}{\sigma_X} \] The regression line estimates that if the given \(x\) value is \(z\) SDs above the mean of \(X\), then the corresponding \(Y\) values will be, on average, \(\rho z\) SDs away from the mean of \(Y\): \[ \frac{E(Y|X=x) - \mu_Y}{\sigma_Y} = \rho\left(\frac{x-\mu_X}{\sigma_X}\right) \] Since \(|\rho|\le 1\), for a given \(x\) value the corresponding \(Y\) values will be, on average, relatively closer to the mean of \(Y\) than the given \(x\) value is to the mean of \(X\). This is known as regression to the mean.

For Bivariate Normal distributions, the conditional standard deviation of \(Y\) given \(X=x\) does not depend on \(x\): \[ SD(Y |X = x) = \sigma_Y\sqrt{1-\rho^2} \] The \(Y\) values corresponding to a particular \(x\) value are less variable than the entire collection of \(Y\) values.
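
These conditional-distribution facts can be checked numerically. Here is a minimal simulation sketch, assuming standard Normal marginals and \(\rho=0.7\) (matching the covariance used in the contour-plot code above); the conditioning value \(x_0=1\) and the tolerance are arbitrary choices:

library(MASS) #for mvrnorm
set.seed(1)
rho <- 0.7
xy <- mvrnorm(10^5, mu=c(0,0), Sigma=matrix(c(1,rho,rho,1),2)) #bivariate Normal sample
x0 <- 1 #condition on X close to x0
ysub <- xy[abs(xy[,1]-x0)<0.05, 2] #Y values whose X is near x0
mean(ysub) #should be close to rho*x0 = 0.7
sd(ysub) #should be close to sqrt(1-rho^2), about 0.71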

Theorem 2 \(X\) and \(Y\) have a Bivariate Normal distribution if and only if every linear combination of \(X\) and \(Y\) has a Normal distribution. That is, \(X\) and \(Y\) have a Bivariate Normal distribution if and only if \(aX+bY+c\) has a Normal distribution for all \(a\), \(b\), \(c\). (At least one of \(a\), \(b\) must not be 0, unless degenerate Normal distributions are considered; a non-random constant \(c\) can be considered a degenerate Normal random variable with a mean of \(c\) and a SD of 0.)

(X + Y).sim(10000).plot()
plt.show()

In short, if \(X\) and \(Y\) have a Bivariate Normal distribution then “everything is Normal”: marginal distributions are Normal, conditional distributions are Normal, and distributions of linear combinations are Normal. Therefore, probabilities for Bivariate Normal distribution problems can be solved by identifying the proper Normal distribution — that is, its mean and variance — standardizing, and using the empirical rule.

M, R = RV(BivariateNormal(mean1 = 640, sd1 = 80, mean2 = 610, sd2 = 70, corr = 0.7))

(M & R).sim(1000).plot()
plt.show()
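
Since any linear combination is Normal, such probabilities reduce to a single Normal calculation. As a sketch with the parameters simulated above (means 640 and 610, SDs 80 and 70, correlation 0.7; the threshold 1300 is just an arbitrary value chosen for illustration):

mu <- 640+610 #mean of M+R
sig <- sqrt(80^2+70^2+2*0.7*80*70) #SD of M+R, since Var(M+R)=Var(M)+Var(R)+2Cov(M,R)
1-pnorm(1300, mean=mu, sd=sig) #P(M+R > 1300), roughly 0.36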

You might think that if both \(X\) and \(Y\) have Normal distributions then their joint distribution is Bivariate Normal. But this is not true in general, as the following example shows. This is a particular example of a general concept we have seen often: marginal distributions alone are not enough to specify the joint distribution (unless the random variables are independent).
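
One standard construction of such an example (sketched here with simulated data; the variable names are only illustrative): let \(Z\sim N(0,1)\) and let \(W=\pm 1\) with equal probability, independent of \(Z\). Then \(X=Z\) and \(Y=WZ\) each have a standard Normal marginal distribution, but the pair \((X,Y)\) is not Bivariate Normal; for instance, \(X+Y\) equals 0 whenever \(W=-1\), so \(X+Y\) is not Normal.

set.seed(2)
z <- rnorm(10^4)
w <- sample(c(-1,1), 10^4, replace=TRUE) #random signs
x <- z
y <- w*z #also marginally N(0,1), by symmetry
hist(x+y, breaks=50) #not Normal: half the values are exactly 0, the rest behave like 2*z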

If the pair \((X,Y)\) has a joint Normal distribution then each of \(X\) and \(Y\) has a Normal distribution. But the example shows that the converse is not true. That is, if each of \(X\) and \(Y\) has a Normal distribution, it is not necessarily true that the pair \((X, Y)\) has a joint Normal distribution.

However, if \(X\) and \(Y\) are independent and each of \(X\) and \(Y\) has a Normal distribution, then the pair \((X, Y)\) has a joint Normal distribution.

The following is a standardization result which can be used to simulate values from a Bivariate Normal distribution.

Let \(Z_1, Z_2\) be independent, each having a Normal(0, 1) distribution. For constants \(\mu_X\), \(\mu_Y\), \(\sigma_X\), \(\sigma_Y\), and \(-1<\rho<1\) define \[\begin{align*} X & = \mu_X + \sigma_X Z_1\\ Y & = \mu_Y + \rho\sigma_Y Z_1 + \sqrt{1-\rho^2}\, \sigma_Y Z_2. \end{align*}\] Then the pair \((X,Y)\) has a BivariateNormal(\(\mu_X,\mu_Y,\sigma_X,\sigma_Y,\rho)\) distribution.
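
A sketch of this construction, checking the sample moments against the target parameters (the particular values \(\mu_X=1\), \(\mu_Y=2\), \(\sigma_X=3\), \(\sigma_Y=4\), \(\rho=0.6\) are arbitrary choices for illustration):

set.seed(3)
n <- 10^5
muX <- 1; muY <- 2; sigX <- 3; sigY <- 4; rho <- 0.6
z1 <- rnorm(n); z2 <- rnorm(n) #independent N(0,1) variables
x <- muX + sigX*z1
y <- muY + rho*sigY*z1 + sqrt(1-rho^2)*sigY*z2
c(mean(x), mean(y), sd(x), sd(y), cor(x,y)) #should be close to 1, 2, 3, 4, 0.6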

Random variables \(X_1, \ldots, X_n\) have a Multivariate Normal (MVN) (a.k.a. joint Gaussian) distribution if and only if, for all non-random constants \(a_1,\ldots,a_n\), the random variable \(a_1X_1+\cdots+a_n X_n\) has a Normal distribution. That is, if random variables have a MVN distribution then any random variable formed by taking a linear combination of them has a Normal distribution. In particular, each \(X_i\) must have a marginal Normal distribution.

In fact, a more general result is true: any random vector whose components are formed by taking linear combinations of a MVN random vector is itself a MVN random vector. As a special case, if \(X_1,\ldots, X_n\) are independent and each \(X_i\) has a Normal distribution, then their joint distribution is MVN.

If random variables \(X_1,\ldots, X_n\) have a Multivariate Normal distribution, then their joint distribution is fully specified by

  • the mean vector, with entries \(E(X_i)\), \(i=1,\ldots,n\)
  • the covariance matrix, with entries \(Cov(X_i,X_j)\), \(i,j=1,\ldots, n\)

Random variables \(X\) and \(Y\) which have a MVN distribution are independent if and only if \(Cov(X,Y)=0\).
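
A sketch of simulating such a vector from its mean vector and covariance matrix with MASS::mvrnorm (the particular mean vector and covariance matrix below are arbitrary illustrative choices; the covariance matrix must be symmetric and positive definite):

library(MASS)
mu <- c(0, 1, 2) #mean vector
Sigma <- matrix(c(1.0, 0.5, 0.2,
                  0.5, 2.0, 0.3,
                  0.2, 0.3, 1.5), nrow=3) #covariance matrix
xx <- mvrnorm(10^4, mu=mu, Sigma=Sigma)
colMeans(xx) #close to mu
cov(xx) #close to Sigma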

The t value of the Pearson correlation

The statistic \[t=\frac{r}{\sqrt{(1-r^2)/(N-2)}}\] is essentially the test statistic of a significance-of-regression problem in the following setting:

Let \((x_i, y_i)\) be a bivariate normal random sample of size \(N\). Then we know that \[E(Y|X=x)=\left(\mu_Y-r \dfrac{S_Y}{S_X} \mu_X\right)+\left(r \dfrac{S_Y}{S_X}\right)x = \beta_0 + \beta_1x\]

Where \[\beta_0 = \left(\mu_Y-r \dfrac{S_Y}{S_X} \mu_X\right)\] and \[\beta_1 = \left(r \dfrac{S_Y}{S_X}\right)\]

By definition \[ MSE = \frac{1}{n-2}\sum_{i=1}^n (Y_i -\hat{Y}_i)^2. \]

Again, by definition

\[ \hat{Y}_i = r\frac{S_y}{S_x}(X_i - \bar{X}) + \bar{Y}, \]

so

\[SSE = \sum_{i=1}^n \left( (Y_i - \bar{Y}) - r\frac{S_y}{S_x}(X_i - \bar{X}) \right) ^2 = (n-1)S_y^2 + (n-1)r^2S_y^2 - 2(n-1)r^2S_y^2,\]

so that

\[ MSE = \frac{n-1}{n-2}S_y^2 \left(1-r^2\right).\]

MSE depends on \(r^2\): the lower the correlation, the higher the MSE. But it also depends on the variance of the response variable, \(S_y^2\). This is expected, because MSE has units (the square of the units of \(Y\)) while the correlation coefficient is dimensionless. For that reason you cannot usually compare the MSE of different regression models.

Also note that multiplying the values of \(Y\) by 10 would not change the coefficient of correlation between \(Y\) and \(X\), but it would multiply the MSE by 100.

In short, you can use MSE to compare different predictors used to model the same response variable, but you cannot use MSE to compare models of different response variables built with the same predictor.
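
A small sketch on simulated data (the regression coefficients and sample size are arbitrary) verifying the identity \(MSE=\frac{n-1}{n-2}S_y^2(1-r^2)\) and the scaling claim that multiplying \(Y\) by 10 leaves \(r\) unchanged but multiplies MSE by 100:

set.seed(4)
n <- 50
x <- rnorm(n)
y <- 2+3*x+rnorm(n) #simulated response
r <- cor(x,y)
fit <- lm(y~x)
sum(residuals(fit)^2)/(n-2) #MSE from the residuals
(n-1)/(n-2)*var(y)*(1-r^2) #same value from the identity above
cor(x, 10*y)-r #correlation unchanged by rescaling Y (difference is 0)
sum(residuals(lm(I(10*y)~x))^2)/sum(residuals(fit)^2) #ratio of SSE (and MSE) is 100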

Now, testing \(H_0 : \rho = 0\) (zero population correlation) is equivalent to testing \(H_0 : \beta_1 = 0\).

Let \(\hat{\beta}_1\) be the OLS estimator of \(\beta_1\), and consider its distribution under \(H_0 : \beta_1 = 0\).

When the residuals in a linear regression are normally distributed, the least squares parameter \(\hat{\beta}\) is normally distributed. Of course, when the variance of the residuals must be estimated from the sample, the exact distribution of \(\hat{\beta}\) under the null hypothesis is \(t\) with \(n-p\) degrees of freedom (\(p\) the dimension of the model, usually two for a slope and intercept).

The \(t\) statistic for the Pearson correlation coefficient can be shown to be mathematically equivalent to the \(t\) test statistic for the least squares slope as follows:

\[t = \frac{\hat{\beta}}{\sqrt{\frac{\text{MSE}}{\sum (X_i - \bar{X})^2}}}= \frac{r (S_y / S_x)}{\sqrt{\frac{(n-1)(1-r^2)S_y^2}{(n-2)(n-1)S_x^2}}}=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\]

\[t=\frac{\hat{\beta}_1}{\mathrm{s.e.}(\hat{\beta}_1)}=\frac{r}{\sqrt{(1-r^2)/(N-2)}}\] follows a t-distribution with \(N-2\) degrees of freedom under \(H_0\).

Hence the only thing you need to ensure is that the samples are independent and follow a Bivariate Normal distribution. Note that checking for individual normality will not suffice, since marginal normality does not imply joint normality.

attach(faithful) #resident dataset
cor.test(faithful[,1],faithful[,2])
## 
## 	Pearson's product-moment correlation
## 
## data:  faithful[, 1] and faithful[, 2]
## t = 34.089, df = 270, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8756964 0.9210652
## sample estimates:
##       cor 
## 0.9008112
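
As a quick check of the identity \(t=r\sqrt{n-2}/\sqrt{1-r^2}\), the reported t statistic can be recomputed directly from the sample correlation:

r <- cor(faithful[,1], faithful[,2])
n <- nrow(faithful)
r*sqrt(n-2)/sqrt(1-r^2) #reproduces t = 34.089 from cor.test above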

Kolmogorov-Smirnov Goodness-of-Fit Test

The Kolmogorov-Smirnov test assesses the hypothesis that a random sample (of numerical data) came from a continuous distribution that was completely specified without referring to the data.

Here is the graph of the cumulative distribution function (CDF) of such a distribution.

Figure 1: Graph of the standard normal CDF from -3 to 3

A sample can be fully described by its empirical (cumulative) distribution function, or ECDF. It plots the fraction of data less than or equal to the horizontal values. Thus, with a random sample of \(n\) values, when we scan from left to right it jumps upwards by \(1/n\) each time we cross a data value.

The next figure displays the ECDF for a sample of \(n=10\) values taken from this distribution. The dot symbols locate the data. The lines are drawn to provide a visual connection among the points similar to the graph of the continuous CDF.

Figure 2: Graph of an ECDF
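
A plot like Figure 2 can be sketched with the built-in ecdf() function (the seed and the sample size \(n=10\) are chosen to mimic the figure, not to reproduce it exactly):

set.seed(5)
x <- rnorm(10) #a sample of n = 10 values from the hypothesized distribution
plot(ecdf(x), verticals=TRUE, main="ECDF of a sample of n = 10")
curve(pnorm(x), from=-3, to=3, add=TRUE, lty=2) #the reference CDF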

The K-S test compares the CDF to the ECDF by means of the greatest vertical difference between their graphs. This amount (a positive number) is the Kolmogorov-Smirnov test statistic.

We may visualize the KS test statistic by locating the data point situated furthest above or below the CDF. Here it is highlighted in red. The test statistic is the vertical distance between the extreme point and the value of the reference CDF. Two limiting curves, located this distance above and below the CDF, are drawn for reference. Thus, the ECDF lies between these curves and just touches at least one of them.

Figure 3: CDF, ECDF, and limiting curves

To assess the significance of the KS test statistic, we compare it, as usual, to the KS test statistics that would tend to occur in perfectly random samples from the hypothesized distribution. One way to visualize them is to graph the ECDFs for many such (independent) samples in a way that indicates what their KS statistics are. This forms the “null distribution” of the KS statistic.

Figure 4: Many ECDFs, displaying a null distribution

The ECDF of each of \(200\) samples is shown along with a single red marker located where it departs the most from the hypothesized CDF. In this case it is evident that the original sample (in blue) departs less from the CDF than would most random samples. (73% of the random samples depart further from the CDF than does the blue sample. Visually, this means 73% of the red dots fall outside the region delimited by the two red curves.) Thus, we have (on this basis) no evidence to conclude our (blue) sample was not generated by this CDF. That is, the difference is “not statistically significant.”

More abstractly, we may plot the distribution of the KS statistics in this large set of random samples. This is called the null distribution of the test statistic. Here it is:

Figure 5: Histogram of 200 KS test statistics

The vertical blue line locates the KS test statistic for the original sample. 27% of the random KS test statistics were smaller and 73% of the random statistics were greater. Scanning across, it looks like the KS statistic for a dataset (of this size, for this hypothesized CDF) would have to exceed 0.4 or so before we would conclude it is extremely large (and therefore constitutes significant evidence that the hypothesized CDF is incorrect).
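
The null-distribution comparison described above can be sketched by simulating many KS statistics from the hypothesized distribution (the sample size 10 and the 200 replicates follow the figures; the seed and the particular observed sample are arbitrary, so the resulting fraction will differ from the 73% quoted above):

set.seed(6)
x.obs <- rnorm(10) #one "observed" sample
D.obs <- ks.test(x.obs, pnorm)$statistic #its KS statistic
D.null <- replicate(200, ks.test(rnorm(10), pnorm)$statistic) #null distribution of D
hist(D.null, breaks=20)
abline(v=D.obs, col="blue") #observed statistic against the null distribution
mean(D.null > D.obs) #fraction of null statistics exceeding the observed one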


Although much more can be said, in particular about why the KS test works the same way, and produces the same null distribution, for any continuous CDF, this is enough to understand the test and to use it together with probability plots to assess data distributions.

ecdf.ks <- function(x, f=pnorm, col2="#00000010", accent="#d02020", cex=0.6,
                    limits=FALSE, ...) {
  obj <- ecdf(x)
  x <- sort(x)
  n <- length(x)
  y <- f(x) - (0:(n - 1))/n
  p <- pmax(y, 1/n - y)
  dp <- max(p)
  i <- which(p >= dp)[1]
  q <- ifelse(f(x[i]) > (i-1)/n, (i-1)/n, i/n)

  # if (dp != ks.test(x, f)$statistic) stop("Incorrect.")

  plot(obj, col=col2, cex=cex, ...)
  points(x[i], q, col=accent, pch=19, cex=cex)
  if (limits) {
    curve(pmin(1, f(x)+dp), add=TRUE, col=accent)
    curve(pmax(0, f(x)-dp), add=TRUE, col=accent)
  }
  c(i, dp)
}
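
A possible call of this function (the sample below is illustrative; it is not the sample used in the figures above, and an opaque line colour is passed since the default is nearly transparent):

set.seed(7)
ecdf.ks(rnorm(10), f=pnorm, col2="black", limits=TRUE) #returns the index of the extreme point and the KS statistic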

ref: https://stats.stackexchange.com/questions/471732/intuitive-explanation-of-kolmogorov-smirnov-test/471775#471775

ks.test(jitter(faithful[,1]),pnorm)
## 
## 	Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  jitter(faithful[, 1])
## D = 0.94858, p-value < 2.2e-16
## alternative hypothesis: two-sided

Shapiro–Wilk normality test

The question normality tests answer: Is there convincing evidence of any deviation from the Gaussian ideal? With moderately large real data sets, the answer is almost always yes.

shapiro.test(faithful[,2])
## 
## 	Shapiro-Wilk normality test
## 
## data:  faithful[, 2]
## W = 0.92215, p-value = 1.015e-10

Wilcoxon signed rank test

summary(faithful[,1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.600   2.163   4.000   3.488   4.454   5.100
wilcox.test(faithful[,1])
## 
## 	Wilcoxon signed rank test with continuity correction
## 
## data:  faithful[, 1]
## V = 37128, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 0
sample <- rnorm(100)
summary(sample)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.2425 -0.5201  0.1796  0.1426  0.7391  2.5836
wilcox.test(rnorm(100))
## 
## 	Wilcoxon signed rank test with continuity correction
## 
## data:  rnorm(100)
## V = 2442, p-value = 0.7767
## alternative hypothesis: true location is not equal to 0
Nit = c(0,0,0,1,1,1,2,2,2,3,3,3,4,4,4,6,6,6)
AOB =c(4.26,4.15,4.68,6.08,5.87,6.92,6.87,6.25, 6.84,6.34,6.56,6.52,7.39,7.38,7.74,7.76,8.14,7.22)
AOBm=tapply(AOB,Nit,mean) #means of AOB
Nitm=tapply(Nit,Nit,mean) #means of Nit
plot(Nit,AOB,xlim=c(0,6),ylim=c(min(AOB),max(AOB)),pch=19)
fitAOB=lm(AOBm~splines::ns(Nitm,df=2)) #natural spline
xmin=min(Nit);xmax=max(Nit)
lines(seq(xmin,xmax,.5), predict(fitAOB,data.frame(Nitm=seq(xmin,xmax,.5))), col='blue')
fitAOB2=loess(AOBm~Nitm,span = 1.25) #loess
lines(seq(xmin,xmax,.5), predict(fitAOB2,data.frame(Nitm=seq(xmin,xmax,.5))), col='red')

x=seq(-3,3,le=5) # equidispersed regressor
x
## [1] -3.0 -1.5  0.0  1.5  3.0
y=2+4*x+rnorm(5) # simulated variable
lm(y~x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##       1.896        3.822
summary(lm(y~x))
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##       1       2       3       4       5 
## -1.1619  0.6028  1.5515 -0.2636 -0.7288 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.8958     0.5610   3.379 0.043116 *  
## x             3.8223     0.2645  14.453 0.000718 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.254 on 3 degrees of freedom
## Multiple R-squared:  0.9858,	Adjusted R-squared:  0.9811 
## F-statistic: 208.9 on 1 and 3 DF,  p-value: 0.000718
out=lm(y~x)
out$res
##          1          2          3          4          5 
## -1.1619222  0.6027505  1.5514994 -0.2635613 -0.7287664
out$df
## [1] 3
sqrt(sum(out$res^2)/out$df)
## [1] 1.25447
summary(lm(weight ~ feed, data = chickwts))
## 
## Call:
## lm(formula = weight ~ feed, data = chickwts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -123.909  -34.413    1.571   38.170  103.091 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    323.583     15.834  20.436  < 2e-16 ***
## feedhorsebean -163.383     23.485  -6.957 2.07e-09 ***
## feedlinseed   -104.833     22.393  -4.682 1.49e-05 ***
## feedmeatmeal   -46.674     22.896  -2.039 0.045567 *  
## feedsoybean    -77.155     21.578  -3.576 0.000665 ***
## feedsunflower    5.333     22.393   0.238 0.812495    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.85 on 65 degrees of freedom
## Multiple R-squared:  0.5417,	Adjusted R-squared:  0.5064 
## F-statistic: 15.36 on 5 and 65 DF,  p-value: 5.936e-10
anova(lm(weight ~ feed, data = chickwts))
## Analysis of Variance Table
## 
## Response: weight
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## feed       5 231129   46226  15.365 5.936e-10 ***
## Residuals 65 195556    3009                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glm(formula = type ~ bmi + age, family = "binomial", data = Pima.tr)
## 
## Call:  glm(formula = type ~ bmi + age, family = "binomial", data = Pima.tr)
## 
## Coefficients:
## (Intercept)          bmi          age  
##    -6.49870      0.10519      0.07104  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  197 Residual
## Null Deviance:	    256.4 
## Residual Deviance: 215.9 	AIC: 221.9
runif(2)
## [1] 0.9331583 0.5059668
#glm(y ~ x, family=quasi(var="mu^2", link="log"), start = runif(2))
arima(diff(EuStockMarkets[,1]),order=c(0,0,5))
## 
## Call:
## arima(x = diff(EuStockMarkets[, 1]), order = c(0, 0, 5))
## 
## Coefficients:
##          ma1      ma2      ma3      ma4      ma5  intercept
##       0.0054  -0.0130  -0.0110  -0.0041  -0.0486     2.0692
## s.e.  0.0234   0.0233   0.0221   0.0236   0.0235     0.6990
## 
## sigma^2 estimated as 1053:  log likelihood = -9106.23,  aic = 18226.45
acf(ldeaths)

acf(ldeaths,type="partial")

y = c(4.313, 4.513, 5.489, 4.265, 3.641, 5.106, 8.006, 5.087)
ystar=sample(y,replace=T)
ystar
## [1] 4.313 4.313 5.489 5.106 5.087 4.265 4.265 3.641
x=seq(-3,3,le=5) # equidispersed regressor
y=2+4*x+rnorm(5) # simulated dependent variable
lm(y~x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##       2.822        4.042
fit=lm(y~x) #fit the linear model
Rdata=fit$residuals #get the residuals
nBoot=2000 #number of bootstrap samples
B=array(0,dim=c(nBoot, 2)) #bootstrap array
for(i in 1:nBoot){
  ystar=y+sample(Rdata,replace=T)
  Bfit=lm(ystar~x)
  B[i,]=Bfit$coefficients
} #bootstrap loop
fit
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##       2.822        4.042
hist(B[,1], breaks = 20)

hist(B[,2], breaks = 20)

jpeg(file="faith.jpeg")
par(mfrow=c(1,2),mar=c(4,2,2,1))
hist(faithful[,1],nclass=21,col="grey",main="", xlab=names(faithful)[1])
hist(faithful[,2],nclass=21,col="wheat",main="", xlab=names(faithful)[2])
dev.off()
## svg 
##   2
as.vector(seq(1:10))
##  [1]  1  2  3  4  5  6  7  8  9 10
as.vector(seq(1:10))[-1]
## [1]  2  3  4  5  6  7  8  9 10
plot(as.vector(time(mdeaths)),as.vector(mdeaths),cex=.6, pch=19,xlab="",ylab="Monthly deaths from bronchitis")
lines(spline(mdeaths),lwd=2,col="chocolate",lty=3)
ar=arima(mdeaths,order=c(1,0,0))$coef
lines(as.vector(time(mdeaths))[-1], ar[2]+ar[1]*(mdeaths[-length(mdeaths)]-ar[2]),col="grey",lwd=2,lty=2, title("Splines versus AR(1) predictor"))
ari=arima(mdeaths,order=c(1,0,0),seasonal=list(order=c(1, 0,0),period=12))$coef

lines(as.vector(time(mdeaths))[-(1:13)],ari[3]+ari[1]*(mdeaths[-c(1:12,72)]-ari[3])+ari[2]*(mdeaths[-(60:72)]- ari[3]),lwd=2,col="steelblue",lty=2) 
title("\n\nand SAR(1,12) predictor")
legend(1974,2800,legend=c("spline","AR(1)","SAR(1,12)"), col=c("chocolate","grey","steelblue"), lty=c(3,2,2),lwd=rep(2,3),cex=.5)

x=rnorm(1)
for (t in 2:10^3) x=c(x,.09*x[t-1]+rnorm(1))
plot(x,type="l",xlab="time",ylab="x",lwd=2,lty=2, col="steelblue",ylim=range(cumsum(x)))
lines(cumsum(x),lwd=2,col="orange3")

par(mar=c(2,2,2,2))
x=matrix(0,ncol=100,nrow=10^4)
for (t in 2:10^4)
x[t,]=x[t-1,]+rnorm(100)*10^(-2)
plot(seq(0,1,le=10^4),x[,1],ty="n", ylim=range(x),xlab="",ylab="")
polygon(c(1:10^4,10^4:1)/10^4,c(apply(x,1,max), rev(apply(x,1,min))),col="gold",bor=F)
polygon(c(1:10^4,10^4:1)/10^4,c(apply(x,1,quantile,.95), rev(apply(x,1,quantile,.05))),col="brown",bor=F)

Newton’s method for calculating the square root

Recall that Newton’s method finds an approximate root of \(f(x) = 0\) from a guess \(x_n\) by approximating \(f(x)\) as its tangent line \(f(x_n) + f'(x_n) (x − x_n)\), leading to an improved guess \(x_{n+1}\) from the root of the tangent:

\[x_{n+1}=x_{n}-\frac{f(x_n)}{f'(x_n)}\] and for \(f(x) = x^2 - a\) this yields the Babylonian formula, i.e. “Newton’s” method to compute square roots \(x =\sqrt{a}\) for \(a > 0\), that is, to solve \(x^2 = a\). The algorithm starts with some guess \(x_1 > 0\) and computes the sequence of improved guesses \[x_{n+1} =x_n-\frac{x_n^2-a}{2x_n}=\frac{1}{2}\left(x_n+\frac{a}{x_n}\right)\] Let us analyze the convergence rate quantitatively: given a small error \(\delta_n\) on the \(n\)-th iteration, we will determine how much smaller the error \(\delta_{n+1}\) is on the next iteration. In particular, let us define \(x_n = x(1+\delta_{n})\), where \(x =\sqrt{a}\) is the exact solution. This corresponds to defining \(|\delta_{n}|\) as the relative error: \[|\delta_{n}|=\frac{|x_n-x|}{|x|}\] also called the fractional error (the error as a fraction of the exact value). Relative error is typically the most useful way to quantify the error because it is a dimensionless quantity (independent of the units or overall scaling of \(x\)). The negative logarithm \(-\log_{10}|\delta_{n}|\) of the relative error is roughly the number of accurate significant digits in the answer \(x_n\).
We can plug this definition of \(x_n\) (and \(x_{n+1}\)) in terms of \(\delta_{n}\) (and \(\delta_{n+1}\)) into our Newton iteration formula and solve for the iteration of \(\delta_{n}\), using the fact that \(a/x = x\) to divide both sides by \(x\): \[1+\delta_{n+1}=\frac{1}{2}\left(1+\delta_{n}+\frac{1}{1+\delta_{n}}\right)=\frac{1}{2}\left[1+\delta_{n}+1-\delta_{n}+\delta_{n}^2+O(\delta_{n}^3)\right]\] where we have Taylor-expanded \((1+\delta_{n})^{-1}\). The \(O(\delta_{n}^3)\) means roughly “terms of order \(\delta_{n}^3\) or smaller”; we will define it more precisely later on. Because the sequence converges, we are entitled to assume that \(|\delta_{n}|\ll 1\) for sufficiently large \(n\), and so the \(\delta_{n}^3\) and higher-order terms are eventually negligible compared to \(\delta_{n}^2\). We obtain: \[\delta_{n+1}=\frac{\delta_{n}^2}{2}+O(\delta_{n}^3)\] which means the error roughly squares (and halves) on each iteration once we are close to the solution.
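
A minimal R sketch of the Babylonian iteration that prints the relative error at each step, so that the roughly quadratic convergence derived above can be seen directly (the target \(a=2\), the starting guess, and the number of iterations are arbitrary choices):

newton.sqrt <- function(a, x=1, niter=6){
  for (n in 1:niter){
    x <- (x + a/x)/2 #Babylonian/Newton update
    cat(n, format(abs(x-sqrt(a))/sqrt(a), digits=3), "\n") #relative error delta_n
  }
  x
}
newton.sqrt(2) #the number of accurate digits roughly doubles at each iteration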