(Uploaded on August 13, 2022)

## Statistics (統計学)

[Banner photos: Mount Usu crater basin in 1986 and 2006 (from left); Sarobetsu post-mined peatland: cottongrass, daylily]

Too many people use statistics as a drunken man uses a lamppost, for support but not for illumination. (Finney 1997)
Facts are stubborn, but statistics are more pliable. (Mark Twain)

Statistics: from status (L.) = state (En.), 国家 (Jp.)

#### Statistical ecology (統計生態学)

= ecological statistics (生態統計学)
referring to the application of statistical methods to the description and monitoring of ecological phenomena
≈ overlapping mostly with quantitative ecology (定量生態学, s.l.)
gaining momentum for quantification after the 1960s, transcending descriptive ecology (記載生態学)

Ex. 1969 International Symposium on Statistical Ecology (New Haven, Conn.)

## Fundamentals of statistics (統計基礎)

Def. Trial (試行)

Experiment: conducting a trial or experiment to obtain some statistical information

Def. Event (事象): a set of outcomes of an experiment (a subset of the sample space) to which a probability is assigned

Expression: A, B, C, …

Def. Complementary event (余事象): Aᶜ, Bᶜ, Cᶜ, …

#### Combination and permutation (組み合わせと順列)

##### Combination (組み合わせ)
Def. A selection of items from a collection, such that (unlike permutations) the order of selection does not matter

Ex. "My fruit salad is a combination of apples, grapes and bananas." → combination

nCr = nPr/r! = n!/{r!(n − r)!}
nC0 = nCn = n!/(0!·n!) = 1
Eq. nCk = n!/{k!(n − k)!} = n!/[(n − k)!{n − (n − k)}!] = nCn−k
Th. nCk + nCk+1 = n+1Ck+1
Pr. nCk + nCk+1 = n!/{k!(n − k)!} + n!/[(k + 1)!{n − (k + 1)}!]

= n!/{k!(n − k − 1)!}·{1/(n − k) + 1/(k + 1)}
= n!/{k!(n − k − 1)!}·(k + 1 + n − k)/{(n − k)(k + 1)}
= n!/{k!(n − k − 1)!}·(n + 1)/{(n − k)(k + 1)}
= (n + 1)!/{(k + 1)!(n − k)!} = (n + 1)!/[(k + 1)!{(n + 1) − (k + 1)}!]
= n+1Ck+1___//

Th. (1) k·nCk = n·n−1Ck−1, (2) n−1Ck + n−1Ck−1 = nCk
Pr. (1) k·nCk = k·n!/{k!(n − k)!} = n·(n − 1)!/{(k − 1)!(n − k)!}

= n·(n − 1)!/[(k − 1)!{(n − 1) − (k − 1)}!] = n·n−1Ck−1

___(2) n−1Ck + n−1Ck−1 = (n − 1)!/{k!(n − 1 − k)!} + (n − 1)!/{(k − 1)!(n − k)!}

= (n − k)·(n − 1)!/{k!(n − k)!} + k·(n − 1)!/{k!(n − k)!}
= n·(n − 1)!/{k!(n − k)!} = n!/{k!(n − k)!} = nCk___//

##### Permutation (順列)
= an ordered combination
Def. The act of arranging the members of a set into a sequence or order, or, if the set is already ordered, rearranging (reordering) its elements - a process called permuting
1. Repeated permutation (重複順列): allowed repetition: Ex. "333" on a permutation lock

selecting r from n: nHr = n+r−1Cr

2. Non-repeated permutation: repetition not allowed: nPr = n!/(n − r)!; Ex. a winner cannot also be the loser
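Both kinds of permutation, and the repeated-combination formula nHr, can be sketched with the standard library (n = 4, r = 2 are illustrative values):

```python
# Counting permutations with and without repetition.
from math import comb, factorial, perm

n, r = 4, 2
# Non-repeated permutation: nPr = n!/(n - r)!
assert perm(n, r) == factorial(n) // factorial(n - r)   # 12 ordered pairs

# Repeated permutation (repetition allowed): n^r arrangements
assert n ** r == 16                                     # e.g. "33" is allowed

# Repeated combination: nHr = (n+r-1)Cr
nHr = comb(n + r - 1, r)
assert nHr == 10
```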

#### Scales and attributes

1. Qualitative attribute (Merkmal)
• categorical (nominal)
• ordinal = hybrid: a rank order exists between the values or classes
2. Quantitative attribute (Merkmal)
• interval (distance)
• ratio

### Data presentation (データ表現)

#### Graph

##### Histograms (ヒストグラム)
Bar graph (棒グラフ): bars with heights or lengths proportional to the values that they represent

Vertical bar graph (column chart)
Horizontal bar graph
Stereogram = 3-dimensional

Stacked bar graph (積み上げ棒グラフ)
Box-and-whisker plot (箱髭図): a standardized way of displaying the distribution of data based on a five-number summary:
1. median, Q2 or 50th percentile
2. first quartile, Q1 or 25th percentile: the middle number between the smallest value (not the "minimum") and the median
3. third quartile, Q3 or 75th percentile: the middle value between the median and the highest value (not the "maximum")
4. "maximum": Q3 + 1.5 × IQR
5. "minimum": Q1 − 1.5 × IQR
interquartile range (IQR): the range between the 25th and 75th percentiles; values beyond the "minimum" and "maximum" whiskers are plotted as outliers

Waterfall chart (滝グラフ, flying bricks chart or Mario chart)
Pie chart (円グラフ)
Multi-pie chart
Multi-level pie chart
Sunburst chart (サンバーストチャート)
Contour plot (等高線図)
Spherical contour graph
Venn diagram (ベン図)
Spider chart (クモの巣グラフ)
Mosaic or mekko chart
Line graph (線グラフ)
Multi-line graph
Scatter-line combo
Control chart (管理図)
Pareto chart (パレート図)
Scatter plot (scattergram, 散布図)
= scatterplot, scatter graph, scatter chart, scattergram and scatter diagram

using Cartesian coordinates to display values for typically two variables for a set of data

Area graph (area chart, 面積グラフ)
Stacked area chart
Trellis plot (Trellis chart and Trellis graph)
Trellis line graph
Trellis bar graph
Function plot
Binary decision diagram (BDD, cluster, 二分決定グラフ)
Hierarchy diagram (階層図)
Circuit diagram (回路図)
Flowchart (フローチャート, 流れ図): a type of diagram that represents a workflow or process

⇒ Ex. TWINSPAN

Pictograph (統計図表): using pictures instead of numbers
3D graph
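The five-number summary and 1.5 × IQR "fences" of the box-and-whisker plot described above can be computed with Python's standard library (the sample values are invented, and `statistics.quantiles`' default "exclusive" method may differ slightly from other software):

```python
# Five-number summary and Tukey fences for a box-and-whisker plot.
from statistics import quantiles

data = sorted([3, 4, 4, 4, 6, 6, 8, 13, 35])
q1, q2, q3 = quantiles(data, n=4)        # 25th, 50th, 75th percentiles
iqr = q3 - q1                            # interquartile range
lo_fence = q1 - 1.5 * iqr                # "minimum" whisker bound
hi_fence = q3 + 1.5 * iqr                # "maximum" whisker bound
outliers = [x for x in data if x < lo_fence or x > hi_fence]
print(q1, q2, q3, iqr, outliers)         # 35 falls outside the upper fence
```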

## Mean, average (平均)

##### Arithmetic mean (算術平均), m or x̄
= (x1 + x2 + x3 + … + xn)/n = (1/n)·Σk=1..n xk

affected by outliers; may lose representativeness when the data are censored
Ex. 3, 4, 4, 4, 6, 6, 8, 13, 15 (n = 9)
Mid-range = (3 + 15)/2 = 9, Mode = 4, Median = 6, Mean = 7

##### Geometric mean (幾何平均, xg)
= (x1·x2·…·xn)^(1/n)

∴ log xg = (1/n)·(log x1 + log x2 + … + log xn)

= (1/n)·Σk=1..n log xk (xi > 0)

used for rates of change
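The worked example above, and the use of the geometric mean for change rates, can be checked with Python's statistics module (the doubling/halving rates are an invented illustration):

```python
# Checking the arithmetic-mean example and the geometric mean.
from statistics import mean, median, mode, geometric_mean

data = [3, 4, 4, 4, 6, 6, 8, 13, 15]           # n = 9
assert mean(data) == 7
assert median(data) == 6
assert mode(data) == 4
assert (min(data) + max(data)) / 2 == 9        # mid-range

# Geometric mean for change rates: a quantity that doubles,
# then halves, has an average change rate of 1 (no net change).
assert abs(geometric_mean([2.0, 0.5]) - 1.0) < 1e-9
```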

#### Circular statistics (角度統計学)

≈ directional statistics and spherical statistics
##### Graph
Histogram
Circular raw data plot (円周プロット)
(Nightingale) rose diagram (kite diagram or circular graph, 鶏頭図)

Q. Mean of 1° and 359°
A. correct: 0°; incorrect: (1 + 359)/2 = 180°
Q. Obtain the mean of (80°, 170°, 175°, 200°, 265°, 345°)
A. × (80° + 170° + 175° + 200° + 265° + 345°)/6 = 206°; the mean direction is Θ = 191°
Def. mean direction Θ of vectors, with (R·cosΘ, R·sinΘ)

= (1/N)·(Σi cosθi, Σi sinθi),__Θ: angle, R: length of the mean vector
= (〈cosθi〉, 〈sinθi〉),__mean: 〈•〉

Def. (circular) variance (円周分散), V ≡ 1 - R (0 ≤ V ≤ 1)
Def. (circular) standard deviation (円周標準偏差), S ≡ √(-2·logR)__(0 ≤ S ≤ ∞)
Def. mean angular deviation, s = √(2V) = √{2(1 − R)}__(≈ S when V is sufficiently small)
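The mean direction, circular variance, and circular standard deviation defined above can be computed for the six-angle example using only the standard library:

```python
# Circular mean direction for (80°, 170°, 175°, 200°, 265°, 345°).
from math import radians, degrees, cos, sin, atan2, hypot, sqrt, log

angles = [80, 170, 175, 200, 265, 345]          # degrees
c = sum(cos(radians(a)) for a in angles) / len(angles)
s = sum(sin(radians(a)) for a in angles) / len(angles)
theta = degrees(atan2(s, c)) % 360              # mean direction Θ
R = hypot(c, s)                                 # mean resultant length
V = 1 - R                                       # circular variance
S = sqrt(-2 * log(R))                           # circular standard deviation
print(round(theta), round(V, 3), round(S, 3))   # Θ ≈ 191°
```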

Circular uniform distribution (円周一様分布)

p.d.f. f(θ) = 1/(2π)__(0 ≤ θ ≤ 2π)
c.d.f. F(θ) = θ/(2π)

von Mises distribution (フォン・ミーゼス分布)

P(θ) = exp{κ·cos(θ − μ)}/(2π·I0(κ)) ∝ exp{κ·cos(θ − μ)}

parameters = (μ, κ): μ = mean direction, κ = concentration; R = I1(κ)/I0(κ) (I0, I1: modified Bessel functions)

called the normal distribution on the circumference of a circle

##### Statistical tests of circular data
When the von Mises distribution is assumed, the test is parametric
Rayleigh test (Rayleigh z test)

a test of uniformity against a unimodal alternative; also used for periodicity in irregularly sampled data

Kuiper test, H0: ƒ(θ) ~ P(θ)

a test of whether the sample distribution follows the von Mises distribution

Mardia-Watson-Wheeler test, H0: Θ1 = Θ2

a test of whether the two samples are drawn from the same population

## Probability theory (確率論)

Def. statistical phenomenon (統計的現象) = probabilistic event/stochastic event (確率的現象): satisfying the two conditions below:
1) non-deterministic (非決定論的)
2) statistical regularity (collective regularity, 集団的規則性)

### Characteristic values on statistical variables (確率変数特性値)

Def. Mean (expectation, 平均/期待値): E(X) = mx = (1/n)·Σi=1..h xi·fi = Σi=1..h xi·(fi/n) (fi: frequency of class i)
i) discrete r.v. X: E(X) = Σi=1..n xi·P(X = xi)
ii) continuous r.v. X: E(X) = ∫−∞..∞ x·f(x)dx

Def. Variance (分散): V(X) = E{(X − E(X))²}, cp. s² = (1/n)·Σi=1..n (xi − mx)² [√V(X): standard deviation]
i) discrete r.v. X: V(X) = Σi=1..n (xi − μ)²·P(X = xi)
ii) continuous r.v. X: V(X) = ∫−∞..∞ (x − μ)²·f(x)dx, μ = E(X)

Th. Characteristics of the mean (expectation) (a, b: constants)
1. E(aX + b) = aE(X) + b
2. E(X1 + X2 + … + Xn) = E(X1) + E(X2) + … + E(Xn)
3. X ⊥ Y → E(X·Y) = E(X)·E(Y)
Pr. (discrete case)
1. E(aX + b) = Σk(a·xk + b)·P{X = xk} = a·Σk xk·P{X = xk} + b·Σk P{X = xk} = aE(X) + b
2. E(X + Y) = Σk xk·P(X = xk) + Σk yk·P(Y = yk) = E(X) + E(Y) → extends to n variables
3. E(X·Y) = Σk xk·P(X = xk)·Σk yk·P(Y = yk) = E(X)·E(Y) (by independence)

Th. Characteristics of variance (分散の性質)
1. V(aX + b) = a²·V(X); Ex. V(X + b) = V(X), V(2X) = 4V(X)
2. V(X) = E(X²) − {E(X)}²
3. V(X1 + X2 + … + Xn) = V(X1) + V(X2) + … + V(Xn) + 2Σi<j Cov(Xi, Xj) (the covariance terms vanish when the Xi are independent)
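The expectation and variance identities above can be verified exactly on a small discrete distribution (the values and probabilities here are invented for illustration):

```python
# Checking E(aX + b) = aE(X) + b and V(aX + b) = a^2 V(X)
# on a small discrete random variable.
vals  = [1, 2, 3, 4]
probs = [0.1, 0.2, 0.3, 0.4]

def E(xs, ps):                     # E(X) = sum of x_i * P(X = x_i)
    return sum(x * p for x, p in zip(xs, ps))

def V(xs, ps):                     # V(X) = E(X^2) - E(X)^2
    return E([x * x for x in xs], ps) - E(xs, ps) ** 2

a, b = 2, 5
EX, VX = E(vals, probs), V(vals, probs)
E2 = E([a * x + b for x in vals], probs)
V2 = V([a * x + b for x in vals], probs)
assert abs(E2 - (a * EX + b)) < 1e-12      # linearity of expectation
assert abs(V2 - a * a * VX) < 1e-12        # b drops out, a is squared
```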

### Law of large numbers and central limit theorem, CLT (大数の法則と中心極限定理)

Th. 0. Chebyshev's inequality (チェビシェフの不等式) = Bienaymé–Chebyshev inequality
For any probability distribution, no more than 1/λ² of the values can lie λ or more standard deviations (SDs) away from the mean (equivalently, at least 1 − 1/λ² of the values lie within λ SDs of the mean).
Form 1) #{i: |xi − m| ≥ λs} ≤ n/λ², or #{i: |xi − m| < λs} ≥ n(1 − 1/λ²)
Pr. ns² = Σi=1..n (xi − m)² = Σ|xi−m|≥λs (xi − m)² + Σ|xi−m|<λs (xi − m)² ≥ Σ|xi−m|≥λs (xi − m)²
|xi − m| ≥ λs ⇒ (xi − m)² ≥ λ²s², so ns² ≥ #{i: |xi − m| ≥ λs}·λ²s²
∴ #{i: |xi − m| ≥ λs} ≤ n/λ² //
Form 2) P{|X − E(X)| ≥ λ√V(X)} ≤ 1/λ², or P{|X − E(X)| < λ√V(X)} ≥ 1 − 1/λ²
Pr. substitute ε = λ√V(X) in P{|X − E(X)| ≥ ε} ≤ V(X)/ε² //

Th. Chebyshev's theorem (チェビシェフの定理)
{Xn}: Xi ⊥ Xj (i ≠ j), V(Xk) ≤ c (c: constant, k = 1, 2, …, n). For any ε > 0:
limn→∞ P{|(1/n)·Σk=1..n Xk − (1/n)·Σk=1..n E(Xk)| < ε} = 1
Pr. V((1/n)·Σk Xk) = (1/n²)·Σk V(Xk) ≤ c/n
Chebyshev's inequality → P{|(1/n)·Σk Xk − (1/n)·Σk E(Xk)| < ε} ≥ 1 − (1/ε²)·V((1/n)·Σk Xk) ≥ 1 − c/(nε²)
limn→∞ P ≥ 1 and P ≤ 1 → limn→∞ P{|(1/n)·Σk Xk − (1/n)·Σk E(Xk)| < ε} = 1 //

Th. Law of large numbers (大数の法則)
X1, X2, …, Xn: independent, E(Xi) = μ, V(Xi) = σ² →
limn→∞ P{|(X1 + X2 + … + Xn)/n − μ| ≥ ε} = 0
[m̄ = (X1 + X2 + … + Xn)/n converges to μ in probability]
Pr. E(m̄) = (1/n)·{E(X1) + E(X2) + … + E(Xn)} = (1/n)·nμ = μ
V(m̄) = (1/n²)·{V(X1) + V(X2) + … + V(Xn)} = (1/n²)·nσ² = σ²/n
Chebyshev: 0 ≤ P{|m̄ − μ| ≥ ε} ≤ σ²/(nε²) → 0 (n → ∞) //
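Chebyshev's bound and the law of large numbers can be checked empirically with simulated uniform variates (a sketch; the seed, λ, and sample size are arbitrary choices):

```python
# Empirical check of P{|X - E(X)| >= lam*sigma} <= 1/lam^2
# and of the law of large numbers, for uniform(0, 1) variates.
import random
random.seed(0)

mu, var = 0.5, 1 / 12                  # uniform(0, 1): E(X), V(X)
sigma = var ** 0.5
lam = 2.0

N = 100_000
xs = [random.random() for _ in range(N)]

# Chebyshev: the tail frequency must not exceed 1/lam^2
tail = sum(abs(x - mu) >= lam * sigma for x in xs) / N
assert tail <= 1 / lam ** 2

# Law of large numbers: the sample mean approaches mu as n grows
assert abs(sum(xs) / N - mu) < 0.01
```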

### Sampling theory (標本理論)

Def. population (母集団): the total set of observations that can be made; Ex. the set of GPAs of all the students at Harvard
Def. sample (標本): a set of individuals collected from a population by a defined procedure
Repeated sampling yields samples (x1, x2, … xn), (x'1, x'2, … x'n), (x''1, x''2, … x''n), …, which statistical theory links back to the population.

### Normal sample theory (正規標本論)

The most popular and central model in mathematical statistics is normal sample theory

#### Outlier (外れ値)

= abnormal, discordant or unusual value (異常値)
Def. a data point that differs significantly from other observations

greatly affects the mean ⇔ affects the median and mode little or not at all

##### Detection of outliers
Firstly, need to construct the assumption of distribution (that is usually normal distribution)
Tietjen-Moore Test (Tietjen-Moore 1972): detecting multiple outliers in a univariate data set that follows a normal distribution

= a generalized or extended Grubbs' test
The test can be used to answer whether the data set contains exactly k outliers

Generalized extreme studentized deviate test
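The tests above need tabulated critical values, so as a loose illustration of the shared underlying idea (how far a point sits from the mean in SD units), here is a minimal studentized-deviate screen; this is not the Tietjen-Moore or generalized ESD statistic itself, and the data and the threshold of 2 are invented:

```python
# Minimal studentized-deviate outlier screen (illustrative only).
from statistics import mean, stdev

data = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 9.7]
m, s = mean(data), stdev(data)
# Flag points more than 2 sample SDs from the mean; a real test
# would compare the maximum deviate against a critical value.
flagged = [x for x in data if abs(x - m) / s > 2]
print(flagged)
```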

## Multivariate correlation (多変量相関論)

#### Terminology

##### Matrix (行列)
Correlation matrix (相関行列): a table showing correlation coefficients between variables

Positive definite (正値/定符号): displaying the coefficients of a positive definite quadratic form

Covariate (共変量): a statistical variable that changes in a predictable way and can be used to predict the outcome of a study
Covariance matrix (共分散行列) = auto-covariance matrix, dispersion matrix, variance matrix or variance–covariance matrix (分散共分散行列)
a square matrix giving the covariance between each pair of elements of a given random vector

Unbiased estimator (不偏推定量): an estimator of a given parameter is said to be unbiased if its expected value is equal to the true value of the parameter - an estimator is unbiased if it produces parameter estimates that are on average correct

Variance inflation factor, VIF (分散インフレ係数): the quotient of the variance in a model with multiple terms by the variance of a model with one term alone

### Bivariate/two-dimensional (二変量)

Correlation (相関): f(x, y) → y

x: independent variable, explanatory variable, or predictor
y: dependent variable (in causal relations: response variable)

Link function (リンク関数), g(μi): provides the relationship between the linear predictor and the mean of the distribution function in GLM
The function relates the expected value of the response to the linear predictors in the model. A link function transforms the probabilities of the levels of a categorical response variable to a continuous scale that is unbounded. Once the transformation is complete, the relationship between the predictors and the response can be modeled with linear regression.

g(μi) = Xi'β

Table. Link functions. The exponential family functions available in R are:

binomial(link = "logit"): g(μ) = ln{μ/(1 − μ)} (logistic or logit)
Gamma(link = "inverse"): g(μ) = 1/μ
inverse.gaussian(link = "1/mu^2"): g(μ) = 1/μ² (inverse Gaussian, 逆ガウス)
poisson(link = "log"): g(μ) = ln(μ) (logarithmic)

Log-normal: log, g(μ) = ln(μ)
Exponential: inverse, g(μ) = 1/μ

Others

probit
cauchit: Cauchy
cloglog: complementary log-log
sqrt: square-root
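The common links from the table can be sketched together with their inverses, which map the linear predictor η back to the mean μ (a standard-library sketch; the function names are ours, not R internals):

```python
# Link functions g and their inverses g^{-1}.
from math import log, exp

def logit(mu):        return log(mu / (1 - mu))   # binomial
def inv_logit(eta):   return 1 / (1 + exp(-eta))
def log_link(mu):     return log(mu)              # Poisson
def inverse_link(mu): return 1 / mu               # Gamma

# g^{-1}(g(mu)) = mu for each pair
eta = logit(0.8)
assert abs(inv_logit(eta) - 0.8) < 1e-12
assert abs(exp(log_link(3.5)) - 3.5) < 1e-12
assert abs(1 / inverse_link(4.0) - 4.0) < 1e-12
```

The inverse link is what turns an unbounded linear predictor into a valid mean, e.g. inv_logit maps any η into (0, 1).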

#### Robust linear regression (ロバスト線形回帰)

Ordinary least-square estimators for a linear model are sensitive to outliers in the design space or outliers among y values

### Extended linear regression model (拡張線形回帰分析)

#### Generalized linear model, GLM (一般化線形モデル)

consisting of three components:
1. A random component, specifying the conditional distribution of the response variable, Yi (for the ith of n independently sampled observations), given the values of the explanatory variables in the model. In the initial formulation of GLMs, the distribution of Yi was a member of an exponential family, such as the Gaussian, binomial, Poisson, gamma, or inverse-Gaussian families of distributions
2. A linear predictor—that is a linear function of regressors,

ηi = α + β1Xi1 + β2Xi2 + … + βkXik

3. A smooth and invertible linearizing link function g(·), which transforms the expectation of the response variable, μi = E(Yi), to the linear predictor:

g(μi) = ηi = α + β1Xi1 + β2Xi2 + … +βkXik

##### manyglm
= in the R package mvabund, fitting generalized linear models to multivariate abundance data

### Exam (試験)

##### March 1999
Answer the following questions in English or Japanese on the answer sheet(s).
I. Indicate the correct answer by letter:
1. If a correlation coefficient is 0.80, then:

a. The explanatory variable is usually less than the response variable.
b. The explanatory variable is usually more than the response variable.
c. Below average values of the explanatory variable are more often associated with below average values of the response variable.
d. Below average values of the explanatory variable are more often associated with above average values of the response variable.
e. None of the above.

2. On observational and experimental studies,

a. An observational study can show a causal relationship.
b. An experimental study can show a causal relationship.
c. The closer the value of r2 is to 1, the more evidence there is of a causal relationship between the explanatory variable and the response variable.
d. Both a and b are true.
e. Both b and c are true.

3. On the correlation coefficient (r) and the coefficient of determination (r²),

a. The closer a correlation coefficient is to 1 or −1, the more evidence there is of a causal relationship between the explanatory variable and the response variable.
b. The closer a correlation coefficient is to 0, the more evidence there is of a causal relationship between the explanatory variable and the response variable.
c. The closer the value of r² is to 1 or -1, the more evidence there is of a causal relationship between the explanatory variable and the response variable.
d. The closer the value of r² is to 0, the more evidence there is of a causal relationship between the explanatory variable and the response variable.
e. None of the above.

4. The design of an experiment is biased if:

a. A sample has large variability.
b. The center of a sample is not close to the population center.
c. All samples have large variability.
d. The centers of all samples are on the same side of the population center.
e. Both c and d are true.

II. The average number of books in the homes of all Hokkaido University students is 1000. You have selected 25 homes, and the first two you look at have 900 books and 950 books respectively. What do you expect the mean number of books to be for the entire sample? (Give a numerical answer.)

[900 + 950 + (23 × 1000)]/25 = 994
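The computation above rests on each of the 23 unseen homes being expected at the population mean of 1000; checked directly:

```python
# Expected sample mean: two observed homes plus 23 homes
# expected at the population mean of 1000 books each.
expected_mean = (900 + 950 + 23 * 1000) / 25
assert expected_mean == 994
```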

III. One of the following statements is better than the others. Indicate that statement. VERY BRIEFLY explain why you did not choose each of the other statements:

When comparing the size the residuals from two different models for the same data:
a. Use the range of each set of residuals as a basis for comparison. → The range is only the maximum minus the minimum residual; it tells you nothing about what is in between.

b. Use the mean of each set of residuals as a basis for comparison. → The mean of the residuals is always zero, no matter the model.
c. Use the sum of each set of residuals as a basis for comparison. → The sum of the residuals is always zero.
d. Use the standard deviation of each set of residuals as a basis for comparison. → The standard deviation of the residuals measures the variability of the errors; lower variability is best. The others tell you nothing about variability.

IV. Bill Clinton, a statistician, said that the temperature was so cold yesterday at the North Pole that it was 3.5 standard deviations BELOW normal. He said that this was a statistically significant event. Clearly demonstrate your understanding of the term "statistically significant" and include numeric support to explain whether he was correct.

Bill was correct in saying the temperature was statistically significant, because it fits the definition of being "unlikely to occur by chance alone." The likelihood of getting a temperature 3.5 standard deviations or more below normal is normalcdf(−1000000, −3.5, 0, 1) = 0.000233, or about 0.0233%, which is very unlikely to occur by chance alone.
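The tail probability quoted above can be reproduced without a calculator's normalcdf, using the error function from the standard library:

```python
# P(Z <= -3.5) for a standard normal variate, via the error function.
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal c.d.f.: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1 + erf(z / sqrt(2)))

p = norm_cdf(-3.5)
print(round(p, 6))      # about 0.000233, i.e. roughly 0.0233%
```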

V. The figure (omitted here) is a plot of the 2001 profits versus sales (each in tens of thousands of dollars) of 12 large companies in country XXX, the results of a least-squares regression, and some other summary data. Note that some of the data with lower sales values overlap on the graph. ___y = ax + b
___a = 0.1238, b = 345.8827
___r² = 0.8732, r = +0.9344
1. Demonstrating your knowledge of the definition of r², explain what the value of r² means in the context of this problem.
2. The teacher who supplied this data set suggested that even though r² is close to one there is reason to doubt some of the interpolative predictive value of this model. He came to this conclusion with no further computation or residual analysis. Explain his reasoning.
VI. In assessing the weather before leaving our residences on a spring morning, we make an informal test of the hypothesis "The weather will be fair today." Using the best information available to us, we complete the test and dress accordingly. What would be the consequences of a Type I and a Type II error?
From the choices below select and clearly explain your choice of the correct answer.
1. Type I error: inconvenience in carrying needless rain equipment
Type II error: clothes get soaked

Type I error: rejecting H0 when H0 is true. The weather will be fair, but you "reject" that and bring an umbrella.
Type II error: failing to reject H0 when H0 is false. It will rain, but you "accept" fair weather and get soaked.

2. Type I error: clothes get soaked
Type II error: inconvenience in carrying needless rain equipment
3. Type I error: clothes get soaked
Type II error: no consequence since Type II error cannot be made
4. Type I error: no consequence since Type I error cannot be made
Type II error: inconvenience in carrying needless rain equipment