Statistics

(Upload on August 15 2025)

Statistics (統計学)


Too many people use statistics as a drunken man uses a lamppost, for support but not for illumination. (Finney 1997)
Facts are stubborn, but statistics are more pliable. (Mark Twain)

Statistics: status (L.) = state (en.), 国家 (jp.)

Raw data, ≈ primary data (生データ)

data collected from a source
the data have not been subject to any manipulation, such as outlier removals, corrections and calibrations

raw data are the evidence that demonstrates the information is correct →
file (archive) the data appropriately until they are no longer needed (ethics)

McNamara fallacy or quantitative fallacy (マクナマラの誤謬)
McNamara, Robert Strange (1916-2009), US government official

the logical error of excessively relying on quantitative data to make decisions while ignoring important qualitative factors that are hard to measure

Ex. Body counts in Vietnam War as a measure of success
this approach ignored critical qualitative factors:

guerrilla warfare - difficulties in body counts
public sentiment (morale and support) of the Vietnamese people
territorial control: body counts did not reflect territorial control or strategic progress

Statistical ecology (統計生態学)

= ecological statistics (生態統計学)
referring to the application of statistical methods to the description and monitoring of ecological phenomena
≈ overlapping mostly with quantitative ecology (定量生態学, s.l.)
quantification gained momentum after the 1960s, moving beyond descriptive ecology (記載生態学)

Ex. 1969 International Symposium on Statistical Ecology (New Haven, Conn.)


Fundamentals of statistics (統計基礎)

Def. Trial (試行)

Experiment: conducting a trial or experiment to obtain some statistical information

Def. Event (事象): a set of outcomes of an experiment (a subset of the sample space) to which a probability is assigned

Expression: A, B, C, …

Def. Complementary event (余事象): A^C, B^C, C^C, …

Combination and permutation (組み合わせと順列)

Combination (組み合わせ)

Def. A selection of items from a collection, such that (unlike permutations) the order of selection does not matter

Ex. "My fruit salad is a combination of apples, grapes and bananas." → combination

_nC_r = _nP_r/r! = n!/{r!(n – r)!}
_nC₀ = _nC_n = n!/(0!·n!) = 1
Eq. _nC_k = n!/{k!(n – k)!} = n!/[(n – k)!{n – (n – k)}!] = _nC_n–k
Th. _nC_k + _nC_k+1 = _n+1C_k+1
Pr. _nC_k + _nC_k+1 = n!/{k!(n – k)!} + n!/[(k + 1)!{n – (k + 1)}!]

= n!/{k!(n – k – 1)!}·{1/(n – k) + 1/(k + 1)}
= n!/{k!(n – k – 1)!}·(k + 1 + n – k)/{(n – k)(k + 1)}
= n!/{k!(n – k – 1)!}·(n + 1)/{(n – k)(k + 1)}
= (n + 1)!/{(k + 1)!(n – k)!} = (n + 1)!/[(k + 1)!{n + 1 – (k + 1)}!]
= _n+1C_k+1 //

Th. (1) k·_nC_k = n·_n-1C_k-1, (2) _n-1C_k + _n-1C_k-1 = _nC_k
Pr. (1) k·_nC_k = k·n!/{k!(n - k)!} = n·(n - 1)!/{(k - 1)!(n - k)!}

= n·(n - 1)!/[(k - 1)!{(n - 1) - (k - 1)}!] = n·_n-1C_k-1

(2) _n-1C_k + _n-1C_k-1 = (n - 1)!/{k!(n - 1 - k)!} + (n - 1)!/{(k - 1)!(n - k)!}

= (n - k)·(n - 1)!/{k!(n - k)!} + k·(n - 1)!/{k!(n - k)!}
= n·(n - 1)!/{k!(n - k)!} = n!/{k!(n - k)!} = _nC_k //
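The identities above can be spot-checked numerically. The following is a minimal sketch using Python's standard `math.comb`; the helper name `check_identities` and the tested range of n are illustrative choices, not part of the notes.

```python
from math import comb

def check_identities(max_n: int = 12) -> bool:
    """Return True if the identities hold for all 1 <= k < n <= max_n."""
    for n in range(2, max_n + 1):
        for k in range(1, n):
            # Symmetry: nCk = nC(n-k)
            if comb(n, k) != comb(n, n - k):
                return False
            # Pascal's rule: nCk + nC(k+1) = (n+1)C(k+1)
            if comb(n, k) + comb(n, k + 1) != comb(n + 1, k + 1):
                return False
            # k * nCk = n * (n-1)C(k-1)
            if k * comb(n, k) != n * comb(n - 1, k - 1):
                return False
    return True

print(check_identities())
```

A numerical check of this kind does not replace the proofs; it only guards against transcription slips in the formulas.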

Permutation (順列)

= an ordered combination
Def. The act of arranging the members of a set into a sequence or order, or, if the set is already ordered, rearranging (reordering) its elements - a process called permuting

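Repeated permutation (重複順列): repetition allowed. Ex. "333" on a permutation lock
select r from n ⇒ _nΠ_r = nʳ
cf. repeated combination (重複組合せ): repetition allowed, order ignored ⇒ _nH_r = _n+r-1C_r
Non-repeated permutation: repetition not allowed. Ex. a winner cannot also be the loser
select r from n ⇒ _nP_r = n!/(n - r)!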
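As a cross-check of the counting rules, the sketch below enumerates selections of r = 2 items from n = 4 with Python's `itertools` and compares the counts with the closed forms. Ordered selections with repetition number nʳ, while _n+r-1C_r counts unordered selections with repetition. The values of n and r and the dictionary labels are arbitrary illustrations.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, perm

n, r = 4, 2
items = range(n)

# Count each kind of selection by explicit enumeration.
counts = {
    "permutation, no repetition": len(list(permutations(items, r))),
    "permutation, with repetition": len(list(product(items, repeat=r))),
    "combination, no repetition": len(list(combinations(items, r))),
    "combination, with repetition": len(list(combinations_with_replacement(items, r))),
}

# Closed-form counterparts: nPr, n^r, nCr, (n+r-1)Cr.
formulas = {
    "permutation, no repetition": perm(n, r),
    "permutation, with repetition": n ** r,
    "combination, no repetition": comb(n, r),
    "combination, with repetition": comb(n + r - 1, r),
}

for name, count in counts.items():
    print(name, count, formulas[name])
```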

Scales and attributes

qualitative attribute (Merkmal)
- category (nominal)
- ordinal = hybrid: a rank order exists between the values or classes
quantitative attribute (Merkmal)
- interval (distance)
- ratio = quantitative

Data presentation (データ表現)

Graph

Histograms (ヒストグラム)

Bar graph (棒グラフ): bars with heights or lengths proportional to the values that they represent

Vertical bar graph (column chart)
Horizontal bar graph
Stereogram = 3-dimensional

Stacked bar graph (積み上げ棒グラフ)
Box-and-whisker plot (箱髭図): a standardized way of displaying the distribution of data based on a five number summary

five-number summary
1. median, Q₂ or 50th percentile
2. first quartile, Q₁ or 25th percentile: the middle number between the smallest number (not the "minimum") and the median
3. third quartile, Q₃ or 75th percentile: the middle value between the median and the highest value (not the "maximum")
4. "maximum": Q₃ + 1.5×IQR
5. "minimum": Q₁ - 1.5×IQR
interquartile range (IQR): Q₃ - Q₁, the range between the 25th and 75th percentiles
[Fig. Box-and-whisker plot. Whiskers: in blue. Outliers: green circles]
[Fig. Waterfall chart]
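The quantities behind a box-and-whisker plot can be computed directly. The sketch below assumes the quartile convention described above (Q₁ and Q₃ as medians of the lower and upper halves); statistical packages use several other quartile conventions, so their values may differ slightly. The function name and the sample data are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Quartiles (medians of the halves), IQR, 1.5*IQR fences and outliers."""
    xs = sorted(values)
    half = len(xs) // 2
    lower, upper = xs[:half], xs[-half:]   # middle value excluded when n is odd
    q1, q2, q3 = median(lower), median(xs), median(upper)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr             # the "minimum" of the plot
    high_fence = q3 + 1.5 * iqr            # the "maximum" of the plot
    outliers = [x for x in xs if x < low_fence or x > high_fence]
    return {"Q1": q1, "Q2": q2, "Q3": q3, "IQR": iqr,
            "low_fence": low_fence, "high_fence": high_fence,
            "outliers": outliers}

summary = five_number_summary([3, 4, 4, 4, 6, 6, 8, 13, 15, 30])
print(summary)
```

In this illustrative data set the value 30 lies above Q₃ + 1.5×IQR and would be drawn as an outlier.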

Waterfall chart (滝グラフ, flying bricks chart or Mario chart)

Pie chart (円グラフ)
Multi-pie chart
Multi-level pie chart
Sunburst chart (サンバーストチャート)
Radar chart (レーダーチャート)
Contour plot (等高線図)
Spherical contour graph
Venn diagram (ベン図)
Spider chart (クモの巣グラフ)
Mosaic or mekko chart
Line graph (線グラフ)
Multi-line graph
Scatter-line combo
Control chart (管理図)
Pareto chart (パレート図)
Scatter plot (scattergram, 散布図)
= scatterplot, scatter graph, scatter chart, scattergram and scatter diagram

using Cartesian coordinates to display values for typically two variables for a set of data

Area graph (area chart, 面積グラフ)
Stacked area chart
Trellis plot (Trellis chart and Trellis graph)
Trellis line graph
Trellis bar graph
Function plot
Binary decision diagram (BDD, cluster, 二分決定グラフ)
Hierarchy diagram (階層図)
Circuit diagram (回路図)
Flowchart (フローチャート, 流れ図): a type of diagram that represents a workflow or process

⇒ Ex. TWINSPAN

Pictograph (統計図表): using pictures instead of numbers
3D graph

Mean, average (平均)

Mean vs average: Mean, median and mode are types of averages

Arithmetic mean (算術平均), m or x̄

= (x₁ + x₂ + x₃ + … + x_n)/n = 1/n·Σ_k=1ⁿx_k

affected by outlier(s); may lose representativeness when the data are censored
Ex. 3, 4, 4, 4, 6, 6, 8, 13, 15 (n = 9)
⇔ Midrange = 9 = (3 + 15)/2, Mode = 4, Median = 6, Mean = 7
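The example values can be reproduced with Python's standard `statistics` module. The second data set, in which the largest value is replaced by 150, is a made-up illustration of how an outlier moves the mean but not the median or mode.

```python
from statistics import mean, median, mode

data = [3, 4, 4, 4, 6, 6, 8, 13, 15]

def summarize(values):
    """Midrange, mode, median and arithmetic mean of a list of numbers."""
    return {
        "midrange": (min(values) + max(values)) / 2,
        "mode": mode(values),
        "median": median(values),
        "mean": mean(values),
    }

summary = summarize(data)
# Hypothetical outlier: the largest value 15 is replaced by 150.
summary_outlier = summarize(data[:-1] + [150])
print(summary)
print(summary_outlier)
```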

Geometric mean (幾何平均, x_g)

= (x₁·x₂ … x_n)^(1/n)

∴ log x_g = 1/n·(log x₁ + log x₂ + … + log x_n)

= 1/n·Σ_k=1ⁿ log x_k (∀x_i > 0)

used for averaging rates of change (e.g., growth rates)

Harmonic mean (調和平均), m_h
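The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals, m_h = n/(1/x₁ + 1/x₂ + … + 1/x_n) for positive x_i. The sketch below compares the three means using Python's `statistics` module; the growth factors and speeds are invented for illustration.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical yearly growth factors (+10 %, +20 %, -10 %).
growth = [1.10, 1.20, 0.90]
g = geometric_mean(growth)      # average growth factor per year
a = mean(growth)
h = harmonic_mean(growth)

# Hypothetical speeds over two legs of equal distance (km/h).
speeds = [40, 60]
average_speed = harmonic_mean(speeds)   # n / sum(1/x_i)

print(h, g, a)
print(average_speed)
```

For positive data the three means are ordered as harmonic ≤ geometric ≤ arithmetic, with equality only when all values are equal.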

Def. Mean square (2乗平均), MS(x) = 1/n·Σ_i=1ⁿx_i²
Eq. σ² = MS(x) - m² (m: mean. σ²: variance)
Pr. σ² = 1/n·Σ_i=1ⁿ(x_i - m)² = 1/n·Σ_i=1ⁿx_i² - 1/n·Σ_i=1ⁿ2mx_i + 1/n·Σ_i=1ⁿm²

∴ σ² = MS(x) - 1/n·Σ_i=1ⁿ2mx_i + 1/n·Σ_i=1ⁿm²

- 1/n·Σ_i=1ⁿ2mx_i = - 2m·1/n·Σ_i=1ⁿx_i = -2m²

1/n·Σ_i=1ⁿm² = m²·1/n·Σ_i=1ⁿ1 = m², here, Σ_i=1ⁿ1 = n
∴ σ² = MS(x) - 2m² + m² = MS(x) - m² //

Def. Root mean square, RMS (2乗平均平方根) ≡ √(1/n·Σ_i=1ⁿx_i²) = √MS(x)
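The relation between the variance, the mean square and the mean can be verified on a small data set. This is a sketch only; the data are the nine values used earlier for the arithmetic mean, and the variable names are illustrative.

```python
from math import isclose, sqrt

data = [3, 4, 4, 4, 6, 6, 8, 13, 15]
n = len(data)

m = sum(data) / n                                  # arithmetic mean
ms = sum(x * x for x in data) / n                  # mean square, MS(x)
rms = sqrt(ms)                                     # root mean square
var_direct = sum((x - m) ** 2 for x in data) / n   # definition of the variance
var_identity = ms - m ** 2                         # mean square minus squared mean

print(m, ms, rms, var_direct, var_identity)
print(isclose(var_direct, var_identity))
```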

Circular statistics (角度統計学)

≈ directional statistics and spherical statistics

Graph

Histogram
Circular raw data plot (円周プロット)
(Nightingale) rose diagram (coxcomb chart, kite diagram, circular graph, polar area diagram 鶏頭図)

Each category or interval in the data is divided into equal segments on the radial chart. Each ring from the center is used as a scale to plot the segment size

Q. Mean of 1° and 359°
A. Correct: 0°, incorrect: (1 + 359)/2 = 180°
Q. Obtain mean of (80°, 170°, 175°, 200°, 265°, 345°)
A. × (80° + 170° + 175° + 200° + 265° + 345°)/6 = 206°
Def. mean direction (Θ), from the mean vector (RcosΘ, RsinΘ)

= 1/N·(Σ_i cosθ_i, Σ_i sinθ_i), R: mean resultant length. θ_i: angle
= (〈cosθ_i〉, 〈sinθ_i〉), mean: 〈•〉

○ Θ = 191°

Def. (circular) variance (円周分散), V ≡ 1 - R (0 ≤ V ≤ 1)
Def. (circular) standard deviation (円周標準偏差), S ≡ √(-2·log R) (log: natural logarithm, 0 ≤ S ≤ ∞)
Def. mean angular deviation, v ≡ √(2V) = √{2(1 - R)}

S ≈ v when V is sufficiently small
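The circular mean and dispersion of the two questions above can be computed as follows. This is a minimal sketch: the function name `circular_summary` is hypothetical, angles are taken in degrees, and the circular standard deviation is returned in radians.

```python
from math import atan2, cos, degrees, hypot, log, radians, sin, sqrt

def circular_summary(angles_deg):
    """Return (mean direction in degrees, R, circular variance, circular SD)."""
    n = len(angles_deg)
    c = sum(cos(radians(a)) for a in angles_deg) / n   # mean of cos(theta_i)
    s = sum(sin(radians(a)) for a in angles_deg) / n   # mean of sin(theta_i)
    r = min(1.0, hypot(c, s))                          # mean resultant length
    theta = degrees(atan2(s, c)) % 360.0               # mean direction
    v = 1.0 - r                                        # circular variance
    sd = sqrt(-2.0 * log(r)) if r > 0 else float("inf")
    return theta, r, v, sd

theta, r, v, sd = circular_summary([80, 170, 175, 200, 265, 345])
print(round(theta), r, v, sd)          # mean direction rounds to 191 degrees
print(circular_summary([1, 359])[0])   # at or next to 0 (equivalently 360) degrees
```

Using `atan2` rather than a plain arctangent places the mean direction in the correct quadrant.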

Circular uniform distribution (円周一様分布)

p.d.f. f(θ) = 1/(2π) (0 ≤ θ < 2π)
c.d.f. F(θ) = θ/(2π)

von Mises distribution (フォン・ミーゼス分布)

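P(θ) = exp(κcos(θ - μ))/(2πI₀(κ)) ∝ exp(κcos(θ - μ))

parameters = (μ, κ), μ = mean direction, κ = concentration, R = I₁(κ)/I₀(κ)
I₀, I₁: modified Bessel functions of the first kind (order 0 and 1)

called the normal distribution on the circumference of a circle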
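The von Mises density can be evaluated with the standard library alone if I₀(κ) is computed from its power series, I₀(κ) = Σ_m (κ/2)^(2m)/(m!)². The sketch below assumes this series truncated at 40 terms, which is adequate for moderate κ; the function names and parameter values are illustrative.

```python
from math import cos, exp, factorial, pi

def bessel_i0(kappa, terms=40):
    """Modified Bessel function of the first kind, order 0 (truncated series)."""
    return sum((kappa / 2.0) ** (2 * m) / factorial(m) ** 2 for m in range(terms))

def von_mises_pdf(theta, mu, kappa):
    """Density of the von Mises distribution at angle theta (radians)."""
    return exp(kappa * cos(theta - mu)) / (2.0 * pi * bessel_i0(kappa))

# Numerical check that the density integrates to 1 over one full circle.
steps = 1000
width = 2.0 * pi / steps
total = sum(von_mises_pdf(i * width, mu=pi, kappa=2.0) for i in range(steps)) * width
print(total)
```

With κ = 0 the density reduces to the circular uniform density 1/(2π).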

Statistical tests of circular data

When von Mises distribution is assumed, the test is parametric
Rayleigh test (Rayleigh z test)

a test for periodicity in irregularly sampled data

Kuiper test, H₀: ƒ(θ) ~ P(θ)

a test if the sample distribution follows von Mises distribution

Mardia-Watson-Wheeler test, H₀: Θ₁ = Θ₂

a test if the two samples are extracted from the same population

Probability theory (確率論)

Def. statistical phenomenon (統計的現象) = probabilistic event/stochastic event (確率的現象): a phenomenon satisfying the two conditions shown below

1) non-deterministic (非決定論的)
2) statistical regularity (collective regularity, 集団的規則性)

Probability distribution (確率分布)

Poisson distribution

Law. Law of small numbers (ポアソンの小数の法則)
Bi(n, p), np = λ (= constant), n → ∞, p → 0 ⇒ lim_n→∞(_nC_k)p^k·q^(n–k) = e^–λ·(λ^k/k!)
Pr. np = λ = constant ∴ p = λ/n, q = 1 – λ/n

P(X = k) = _nC_k·p^k·q^(n–k)

= [{n(n – 1) … (n – k + 1)}/k!]·(λ/n)^k·(1 – λ/n)^(n–k)
= (λ^k/k!)(1 – 1/n)(1 – 2/n) … (1 – (k – 1)/n)(1 – λ/n)ⁿ(1 – λ/n)^–k

–λ/n = x → (n → ∞ ⇒ x → 0)
lim_x→0(1 + x)^(1/x) = e, (1 – λ/n)ⁿ = [(1 + x)^(1/x)]^–λ → e^–λ
1 – 1/n, 1 – 2/n, … , 1 – (k – 1)/n, (1 – λ/n)^–k, respectively → 1
∴ (_nC_k)p^k·q^(n–k) → (λ^k/k!)e^–λ ~ Poisson distribution //
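The limit can be illustrated numerically by holding np = λ fixed while n grows. In the sketch below, λ = 3 and k = 2 are arbitrary illustrative choices.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bi(n, p)."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

lam, k = 3.0, 2
sizes = (10, 100, 1000, 10000)
# Absolute difference between the binomial and Poisson probabilities, with p = lam/n.
errors = [abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam)) for n in sizes]

for n, err in zip(sizes, errors):
    print(n, err)
```

The difference shrinks roughly in proportion to 1/n for these values.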

Th. Reproducing property of Poisson distribution (ポアソン分布の再生性)

X₁⫫X₂, X₁~P(λ₁), X₂~P(λ₂) ⇒ Y = (X₁ + X₂)~P(λ = λ₁ + λ₂)

Pr. P(X₁ + X₂ = n) = Σ_k=0ⁿP(X₁ = n - k)P(X₂ = k)

= Σ_k=0ⁿe^-λ₁·λ₁^(n-k)/(n - k)!·e^-λ₂·λ₂^k/k!
= e^{-(λ₁+λ₂)}Σ_k=0ⁿλ₁^(n-k)/(n - k)!·λ₂^k/k!
= e^{-(λ₁+λ₂)}·1/n!·Σ_k=0ⁿ_nC_k·λ₁^(n-k)·λ₂^k ☛ binomial theorem
= e^{-(λ₁+λ₂)}·(λ₁ + λ₂)ⁿ/n! //
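The reproducing property can be checked by convolving two Poisson probability functions directly. The rates 1.5 and 2.5 below are arbitrary illustrative values.

```python
from math import exp, factorial, isclose

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** k / factorial(k)

def convolved_pmf(n, lam1, lam2):
    """P(X1 + X2 = n) for independent X1 ~ Poisson(lam1), X2 ~ Poisson(lam2)."""
    return sum(poisson_pmf(n - k, lam1) * poisson_pmf(k, lam2) for k in range(n + 1))

lam1, lam2 = 1.5, 2.5
# Compare the convolution with Poisson(lam1 + lam2) for n = 0, 1, ..., 19.
all_close = all(
    isclose(convolved_pmf(n, lam1, lam2), poisson_pmf(n, lam1 + lam2), rel_tol=1e-9)
    for n in range(20)
)
print(all_close)
```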

Characteristic values on statistical variables (確率変数特性値)

Def. Mean (expectation, 平均/期待値)

E(X) = m_x = 1/n·Σ_i=1^h x_i·f_i = Σ_i=1^h x_i×(f_i/n) (f_i: frequency of x_i, h: number of distinct values, n = Σf_i)
i) discrete rvX: E(X) = Σ_i=1ⁿx_i·P(X = x_i)
ii) continuous rvX: E(X) = ∫_-∞^∞x·f(x)dx

Def. variance (分散), V(X) = E{(X – E(X))²}

cp. s² = 1/n·Σ_i=1ⁿ(x_i – m_x)² [√V(X): standard deviation]
i) discrete rvX: V(X) = Σ_i=1ⁿ(x_i – μ)²·P(X = x_i)
ii) continuous rvX: V(X) = ∫_-∞^∞(x – μ)²·f(x)dx, μ = E(X)

Th. Characteristics of mean (expectation) (a, b: constants)
1. E(aX + b) = aE(X) + b
2. E(X₁ + X₂ + … + X_n) = E(X₁) + E(X₂) + … + E(X_n)
3. X⫫Y → E(X·Y) = E(X)·E(Y)
Pr. (Case, discrete)
1. E(aX + b) = Σ_k(ax_k + b)P{X = x_k} = aΣ_kx_kP{X = x_k} + bΣ_kP{X = x_k} = aE(X) + b

2. E(X + Y) = Σ_jΣ_k(x_j + y_k)P(X = x_j, Y = y_k)

= Σ_jx_jP(X = x_j) + Σ_ky_kP(Y = y_k) = E(X) + E(Y) → extension to n variables

3. E(X·Y) = Σ_jΣ_kx_jy_kP(X = x_j, Y = y_k) = Σ_jΣ_kx_jy_kP(X = x_j)P(Y = y_k) [by the independence condition]

= Σ_jx_jP(X = x_j)·Σ_ky_kP(Y = y_k) = E(X)E(Y)

Th. Characteristics of variance (分散の性質)
1. V(aX + b) = a²V(X) Ex. V(X + b) = V(X), V(2X) = 4V(X)
2. V(X) = E(X²) – E²(X)
3. V(X₁ + X₂ + … + X_n) = V(X₁) + V(X₂) + … + V(X_n) + 2Σ_i<jCov(X_i, X_j)
4. X_i⫫X_j (i ≠ j) → V(X₁ + X₂ + … + X_n) = V(X₁) + V(X₂) + … + V(X_n)
Pr.
1. V(aX + b) = E{aX + b - E(aX + b)}² = E{a(X - E(X))}² = a²V(X)
2. V(X) = E{X² - 2E(X)X + E²(X)} = E(X²) – 2E(X)E(X) + E²(X)

= E(X²) – E²(X)

3. V(X₁ + X₂ + … + X_n) = E{(X₁ + X₂ + … + X_n)²} – E²(X₁ + X₂ + … + X_n)

= E[{(X₁ - E(X₁)) + (X₂ - E(X₂)) + … + (X_n - E(X_n))}²]
= Σ_iE{(X_i - E(X_i))²} + 2Σ_i<jE{(X_i - E(X_i))(X_j - E(X_j))}
= Σ_iV(X_i) + 2Σ_i<jCov(X_i, X_j) //

4. X_i⫫X_j → Cov(X_i, X_j) = E(X_iX_j) – E(X_i)E(X_j) = 0, so that 3 reduces to 4 //
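The properties of expectation and variance can be confirmed exactly on a small discrete distribution. The sketch below uses a fair six-sided die and the sum of two independent dice with exact rational arithmetic; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

die = {x: Fraction(1, 6) for x in range(1, 7)}   # P(X = x) for a fair die

def expectation(dist, f=lambda x: x):
    """E(f(X)) for a discrete distribution given as {value: probability}."""
    return sum(f(x) * p for x, p in dist.items())

def variance(dist):
    """V(X) = E{(X - E(X))^2}."""
    mu = expectation(dist)
    return expectation(dist, lambda x: (x - mu) ** 2)

def transform(dist, f):
    """Distribution of f(X)."""
    out = {}
    for x, p in dist.items():
        out[f(x)] = out.get(f(x), 0) + p
    return out

# Distribution of X + Y for two independent dice: joint probability = product.
two_dice = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    two_dice[x + y] = two_dice.get(x + y, 0) + px * py

print(expectation(die), variance(die))
print(variance(transform(die, lambda x: 2 * x + 3)), 4 * variance(die))
print(variance(two_dice), 2 * variance(die))
```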

Law of large numbers and central limit theorem, CLT (大数の法則と中心極限定理)

Th. 0. Chebyshev's inequality (チェビシェフの不等式)
= Bienaymé–Chebyshev inequality
For any probability distribution, no more than 1/λ² of the distribution values can be λ or more standard deviations (SDs) away from the mean (or equivalently, at least 1 − 1/λ² of the distribution values are less than λ SDs away from the mean)
Form 1) Σ_{|x_i – m| ≥ λs}1 ≤ n/λ², or Σ_{|x_i – m| < λs}1 ≥ n(1 – 1/λ²) (m: mean, s: standard deviation, Σ1: number of values satisfying the condition)
Pr. ns² = Σ_i=1ⁿ(x_i – m)² = Σ_{|x_i – m| ≥ λs}(x_i – m)² + Σ_{|x_i – m| < λs}(x_i – m)²

≥ Σ_{|x_i – m| ≥ λs}(x_i – m)²
|x_i – m| ≥ λs ⇒ (x_i – m)² ≥ λ²s²,
ns² ≥ Σ_{|x_i – m| ≥ λs}λ²s², n ≥ Σ_{|x_i – m| ≥ λs}λ²
∴ n/λ² ≥ Σ_{|x_i – m| ≥ λs}1 //

Form 2) P{|X – E(X)| ≥ λ√V(X)} ≤ 1/λ², or P{|X – E(X)| < λ√V(X)} ≥ 1 - 1/λ²
Eq. λ = ε/√V(X) ∴ ε = λ√V(X) ⇒ P{|X – E(X)| ≥ ε} ≤ V(X)/ε²
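The data form of Chebyshev's inequality can be checked by counting the values that lie at least λ standard deviations from the mean and comparing the count with n/λ². A minimal sketch, with a hypothetical function name and the nine-value data set used earlier for the arithmetic mean:

```python
from statistics import mean, pstdev

def chebyshev_check(values, lam):
    """Return (number of values with |x - m| >= lam * s, the bound n / lam^2)."""
    m = mean(values)
    s = pstdev(values)                 # population SD (divisor n)
    far = sum(1 for x in values if abs(x - m) >= lam * s)
    return far, len(values) / lam ** 2

DATA = [3, 4, 4, 4, 6, 6, 8, 13, 15]
for lam in (1.0, 1.5, 2.0, 3.0):
    far, bound = chebyshev_check(DATA, lam)
    print(lam, far, bound, far <= bound)
```

The bound is usually far from tight, since it must hold for every distribution with a finite variance.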
Th. Chebyshev's theorem (チェビシェフの定理)
{X_n}, X_i⫫X_j (i ≠ j), V(X_k) ≤ c (c = constant, k = 1, 2, … , n), ∀ε > 0

→ lim_n→∞P{|1/n·Σ_k=1ⁿX_k – 1/n·Σ_k=1ⁿE(X_k)| < ε} = 1

Pr. V(1/n·Σ_k=1ⁿX_k) = 1/n²·Σ_k=1ⁿV(X_k) ≤ 1/n²·nc = c/n

Chebyshev's inequality →
P{|1/n·Σ_k=1ⁿX_k – 1/n·Σ_k=1ⁿE(X_k)| < ε}

≥ 1 – 1/ε²·V(1/n·Σ_k=1ⁿX_k) ≥ 1 – c/(nε²)

lim_n→∞P{|1/n·Σ_k=1ⁿX_k – 1/n·Σ_k=1ⁿE(X_k)| < ε} ≥ 1, P ≤ 1
→ lim_n→∞P{|1/n·Σ_k=1ⁿX_k – 1/n·Σ_k=1ⁿE(X_k)| < ε} = 1 //

Th. Law of large numbers (ベルヌーイの大数の法則)
X₁, X₂, … X_n: independent, E(X_i) = μ, V(X_i) = σ², ∀ε > 0

→ lim_n→∞P{|(X₁ + X₂ + … + X_n)/n – μ| ≥ ε} = 0
[m = (X₁ + X₂ + … + X_n)/n converges to μ in probability]

Pr. E(m) = E((X₁ + X₂ + … + X_n)/n) = 1/n·E(X₁ + X₂ + … + X_n)

= 1/n·{E(X₁) + E(X₂) + … + E(X_n)} = 1/n·nμ = μ

V(m) = V((X₁ + X₂ + … + X_n)/n) = 1/n²·V(X₁ + X₂ + … + X_n)

= 1/n²·{V(X₁) + V(X₂) + … + V(X_n)} = 1/n²·nσ² = σ²/n

P{|(X₁ + X₂ + … + X_n)/n - μ| > λ·σ/√n} ≤ 1/λ², λ·σ/√n ≡ ε, λ = √(n)/σ·ε
0 ≤ P{|(X₁ + X₂ + … + X_n)/n - μ| > ε} ≤ σ²/(nε²) → 0 (n → ∞) //
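The convergence of the sample mean can be illustrated by simulation. The sketch below rolls a fair die (μ = 3.5, σ² = 35/12) and prints the sample mean next to the Chebyshev bound σ²/(nε²); the seed, the sample sizes and ε = 0.1 are arbitrary choices, and a single simulated run illustrates the theorem rather than proving it.

```python
import random

def sample_mean(n, rng):
    """Mean of n simulated rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

def chebyshev_bound(sigma2, n, eps):
    """Upper bound sigma^2 / (n * eps^2) on P(|mean - mu| > eps), capped at 1."""
    return min(1.0, sigma2 / (n * eps ** 2))

MU, SIGMA2 = 3.5, 35 / 12        # mean and variance of a single roll
rng = random.Random(0)           # fixed seed so that the run is repeatable

for n in (100, 10_000, 100_000):
    m = sample_mean(n, rng)
    print(n, round(m, 4), round(chebyshev_bound(SIGMA2, n, 0.1), 4))
```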

Sampling theory (標本理論)

[ population ] → samples (x₁, x₂, … x_n), (x'₁, x'₂, … x'_n), (x''₁, x''₂, … x''_n)
statistical theory: links the samples back to the population
Def. population (母集団): the total set of observations that can be made

Ex. the set of GPAs of all the students at Harvard

Def. sample (標本): a set of individuals collected from a population by a defined procedure

Census and sampling

Census = complete census or survey (全数調査)

Ex. national population census (国勢調査)

Sampling survey = sample inquiry, sampling investigation (標本調査)

Normal sample theory (正規標本論)

Normal sample theory is the most widely used and central model in mathematical statistics

Outlier (外れ値)

= abnormal, discordant or unusual value (異常値)
Def. a data point that differs significantly from other observations

strongly affects the mean ⇔ has little or no effect on the median and mode

Detection of outliers

First, an assumption about the distribution must be made (usually the normal distribution)
Tietjen-Moore Test (Tietjen-Moore 1972): detecting multiple outliers in a univariate data set that follows a normal distribution

= a generalized or extended Grubbs' test
The test can be used to answer whether the data set contains k outliers (k is specified in advance)

Generalized extreme studentized deviate test

Standardization (標準化)

(Zar 1996)

Data transformation (データ変換)

Estimates (推定論)

Multivariate correlation (多変量相関論)

Regression analysis (回帰分析)

Def. correlation coefficient (相関係数), r
= (Pearson) product-moment correlation coefficient (積率相関係数)
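The product-moment correlation coefficient is r = Σ(x_i - x̄)(y_i - ȳ)/√{Σ(x_i - x̄)²·Σ(y_i - ȳ)²}. The following is a minimal sketch of that formula; the function name and the small data set are invented for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))   # co-deviation sum
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(r, r ** 2)   # r and the coefficient of determination r^2
```

r is undefined when either variable is constant, because the denominator is then zero.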

Terminology

Matrix (行列)

Correlation matrix (相関行列): a table showing correlation coefficients between variables

Positive definite (正値/定符号): a matrix A whose entries are the coefficients of a positive definite quadratic form, i.e. x'Ax > 0 for every non-zero vector x

Covariate (共変量): a statistical variable that changes in a predictable way and can be used to predict the outcome of a study

Covariance matrix (共分散行列) = auto-covariance matrix, dispersion matrix, variance matrix or variance–covariance matrix (分散共分散行列)
a square matrix giving the covariance between each pair of elements of a given random vector

Unbiased estimator (不偏推定量): an estimator of a given parameter is said to be unbiased if its expected value is equal to the true value of the parameter - an estimator is unbiased if it produces parameter estimates that are on average correct

Variance inflation factor, VIF (分散インフレ係数): the quotient of the variance in a model with multiple terms by the variance of a model with one term alone

Bivariate/two-dimensional (二変量)

Correlation (相関): f(x, y) → y

x: independent variable, explanatory variable, or predictor
y: dependent variable → Causal relation: response variable

Link function (リンク関数), g(μ_i): provides the relationship between the linear predictor and the mean of the distribution function in GLM
The function relates the expected value of the response to the linear predictors in the model. A link function transforms the probabilities of the levels of a categorical response variable to a continuous scale that is unbounded. Once the transformation is complete, the relationship between the predictors and the response can be modeled with linear regression.

g(μ_i) = X_i'β

Table. Link functions. The exponential family functions available in R are:

binomial(link = "logit"): g(μ) = ln(μ/(1 - μ)) (logistic or logit)
gaussian(link = "identity"): g(μ) = μ (identity: the mean equals the linear predictor, 線形予測子)
Gamma(link = "inverse"): g(μ) = 1/μ (inverse)
inverse.gaussian(link = "1/mu^2"): g(μ) = 1/μ² (inverse Gaussian, 逆ガウス)
poisson(link = "log"): g(μ) = ln(μ) (logarithmic)

Log-normal: log, g(μ) = ln(μ)
Exponential: inverse, g(μ) = 1/μ
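A link function g maps the mean μ to the scale of the linear predictor η, and its inverse maps η back to μ. The sketch below writes out several links and their inverses in plain Python; the dictionary keys follow R's link names, but the code is only an illustration and not R's implementation.

```python
from math import exp, isclose, log

# name: (link g(mu), inverse link g^-1(eta))
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (lambda mu: log(mu), lambda eta: exp(eta)),
    "logit": (lambda mu: log(mu / (1.0 - mu)), lambda eta: 1.0 / (1.0 + exp(-eta))),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "1/mu^2": (lambda mu: 1.0 / mu ** 2, lambda eta: eta ** -0.5),
    "cloglog": (lambda mu: log(-log(1.0 - mu)), lambda eta: 1.0 - exp(-exp(eta))),
}

mu = 0.3
for name, (g, g_inv) in LINKS.items():
    eta = g(mu)
    # Applying the inverse link to eta should recover mu.
    print(name, eta, isclose(g_inv(eta), mu))
```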

Others

probit
cauchit: Cauchy
cloglog: complementary log-log
sqrt: square-root

Robust linear regression (ロバスト線形回帰)

Ordinary least-squares estimators for a linear model are sensitive to outliers in the design space and to outliers among the y values

Smoothing (平滑化)

smoothing (to smooth a data set) is to create an approximating function that attempts to capture patterns in the data (related to curve fitting)
Ex.
Moving average (移動平均)
Smoothing spline
Cubic or Hermite spline

Catmull-Rom spline
Kochanek-Bartels spline

Additive smoothing
Kernel smoother
Local regression (loess or lowess)
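Of the smoothers listed above, the moving average is the simplest to write out. The sketch below computes a simple (unweighted) moving average over a sliding window; the function name, window length and series are illustrative.

```python
def moving_average(values, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

series = [1, 2, 3, 4, 5]
smoothed = moving_average(series, 3)
print(smoothed)
```

The output is shorter than the input by window - 1 values, because incomplete windows at the ends are dropped in this version.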

Local regression (局所回帰)

Extended linear regression model (拡張線形回帰分析)

Generalized linear model, GLM (一般化線形モデル)

consisting of three components:

1. A random component, specifying the conditional distribution of the response variable, Y_i (for the ith of n independently sampled observations), given the values of the explanatory variables in the model. In the initial formulation of GLMs, the distribution of Y_i was a member of an exponential family, such as the Gaussian, binomial, Poisson, gamma, or inverse-Gaussian families of distributions
2. A linear predictor, that is, a linear function of regressors:
η_i = α + β₁X_i1 + β₂X_i2 + … + β_kX_ik
3. A smooth and invertible linearizing link function g(·), which transforms the expectation of the response variable, μ_i = E(Y_i), to the linear predictor:
g(μ_i) = η_i = α + β₁X_i1 + β₂X_i2 + … + β_kX_ik
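To make the three components concrete, the sketch below fits a Poisson GLM with a log link and a single regressor, log μ_i = α + βx_i, by iteratively reweighted least squares (IRLS), a standard fitting method for GLMs. It is a bare-bones illustration with a hypothetical function name and a fixed iteration count, not a substitute for a statistical package.

```python
from math import exp, log

def fit_poisson_glm(xs, ys, iterations=25):
    """Fit log(mu_i) = alpha + beta * x_i by iteratively reweighted least squares."""
    alpha, beta = log(sum(ys) / len(ys)), 0.0    # start from the intercept-only fit
    for _ in range(iterations):
        etas = [alpha + beta * x for x in xs]    # linear predictor
        mus = [exp(eta) for eta in etas]         # inverse link; also the IRLS weights
        # Working response z_i = eta_i + (y_i - mu_i) / mu_i
        zs = [eta + (y - mu) / mu for eta, y, mu in zip(etas, ys, mus)]
        # Weighted least squares of z on x: solve the 2x2 normal equations.
        sw = sum(mus)
        swx = sum(w * x for w, x in zip(mus, xs))
        swxx = sum(w * x * x for w, x in zip(mus, xs))
        swz = sum(w * z for w, z in zip(mus, zs))
        swxz = sum(w * x * z for w, x, z in zip(mus, xs, zs))
        det = sw * swxx - swx ** 2
        alpha = (swxx * swz - swx * swxz) / det
        beta = (sw * swxz - swx * swz) / det
    return alpha, beta

xs = [0, 1, 2, 3, 4]
counts = [1, 3, 2, 6, 9]            # invented count data
print(fit_poisson_glm(xs, counts))
```

When the responses equal exp(α + βx_i) exactly, the fit recovers α and β, which gives a simple check of the implementation.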

manyglm

= a function in the R package mvabund that fits generalized linear models to multivariate abundance data

Exam (試験)

March 1999

Answer the following questions in English or Japanese on the answer sheet(s).
I. Indicate the correct answer by letter:
1. If a correlation coefficient is 0.80, then:

a. The explanatory variable is usually less than the response variable.
b. The explanatory variable is usually more than the response variable.
c. Below average values of the explanatory variable are more often associated with below average values of the response variable.
d. Below average values of the explanatory variable are more often associated with above average values of the response variable.
e. None of the above.

2. On observational and experimental studies,

a. An observational study can show a causal relationship.
b. An experimental study can show a causal relationship.
c. The closer the value of r² is to 1, the more evidence there is of a causal relationship between the explanatory variable and the response variable.
d. Both a and b are true.
e. Both b and c are true.

3. On the correlation coefficient (r) and the coefficient of determination (r²),

a. The closer a correlation coefficient is to 1 or -1, the more evidence there is of a causal relationship between the explanatory variable and the response variable.
b. The closer a correlation coefficient is to 0, the more evidence there is of a causal relationship between the explanatory variable and the response variable.
c. The closer the value of r² is to 1 or -1, the more evidence there is of a causal relationship between the explanatory variable and the response variable.
d. The closer the value of r² is to 0, the more evidence there is of a causal relationship between the explanatory variable and the response variable.
e. None of the above.

4. The design of an experiment is biased if:

a. A sample has large variability.
b. The center of a sample is not close to the population center.
c. All samples have large variability.
d. The centers of all samples are on the same side of the population center.
e. Both c and d are true.

II. The average number of books in the homes of all Hokkaido University students is 1000. You have selected 25 homes, and the first two you look at have 900 books and 950 books respectively. What do you expect the mean number of books to be for the entire sample (numerical answer)?

[900 + 950 + (23 × 1000)]/25 = 994

III. One of the following statements is better than the others. Indicate that statement. VERY BRIEFLY explain why you did not choose each of the other statements:

When comparing the size of the residuals from two different models for the same data:
a. Use the range of each set of residuals as a basis for comparison. → The range is only the maximum minus the minimum residual. It tells you nothing about what is in between.
b. Use the mean of each set of residuals as a basis for comparison. → The mean of the residuals is always zero, no matter the model.
c. Use the sum of each set of residuals as a basis for comparison. → The sum of the residuals is always zero.
d. Use the standard deviation of each set of residuals as a basis for comparison. → By using the standard deviation of the residuals you can examine the variability of the error; lower variability is better. The others do not tell you about the variability.

IV. Bill Clinton, a statistician, said that the temperature was so cold yesterday at the North Pole that it was 3.5 standard deviations BELOW normal. He said that this was a statistically significant event. Clearly demonstrate your understanding of the term "statistically significant" and include numeric support to explain whether he was correct.

Bill was correct in saying the temperature was statistically significant because it is included in the definition as being "unlikely to occur by chance alone." The likelihood [of] getting a temperature 3.5 standard deviations or more below normal is normalcdf(-1000000, -3.5, 0,1) = 0.000233 or about 0.023%, which is not likely to occur just by chance [very often].
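The tail probability quoted in the answer can be reproduced with Python's standard library, assuming (as the answer does) that the temperature is normally distributed. `NormalDist().cdf` plays the role of the calculator function normalcdf here.

```python
from statistics import NormalDist

# P(Z <= -3.5) for a standard normal variable Z
p_below = NormalDist(mu=0.0, sigma=1.0).cdf(-3.5)
print(p_below)          # about 0.000233
print(100 * p_below)    # the same probability as a percentage
```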

V. The figure is a plot of the 2001 profits versus sales (each in tens of thousands of dollars) of 12 large companies in country XXX, together with the results of a least-squares regression and some other summary data. Note that some of the data with lower Sales values overlap on the graph.

y = ax + b
a = 0.1238, b = 345.8827
r² = 0.8732, r = +0.9344
1. Demonstrating your knowledge of the definition of r², explain what the value of r² means in the context of this problem.
2. The teacher who supplied this data set suggested that even though r² is close to one there is reason to doubt some of the interpolative predictive value of this model. He came to this conclusion with no further computation or residual analysis. Explain his reasoning.
VI. In assessing the weather prior to leaving our residences on a spring morning, we make an informal test of the hypothesis "The weather will be fair today." Using the best information available to us, we complete the test and dress accordingly. What would be the consequences of a Type I and a Type II error?
From the choices below select and clearly explain your choice of the correct answer.

a. Type I error: inconvenience in carrying needless rain equipment
Type II error: clothes get soaked
→ Type I error: rejecting H₀ when H₀ is true. The weather will be fair, but you "reject" that and bring an umbrella.
→ Type II error: failing to reject H₀ when Hₐ is true. It will rain, but you "reject" that it will rain and get soaked.
b. Type I error: clothes get soaked
Type II error: inconvenience in carrying needless rain equipment
c. Type I error: clothes get soaked
Type II error: no consequence since Type II error cannot be made
d. Type I error: no consequence since Type I error cannot be made
Type II error: inconvenience in carrying needless rain equipment
