
COSI 123A Statistical Machine Learning

Introduction

  1. Introduction

Background Concepts

  1. Random variables

  2. Probability

    • Bayes' theorem
    • Chain rule
    • Independence
    • Conditional independence
  3. Probability distribution

    • Gaussian
    • Bernoulli
    • Binomial
  4. Descriptive statistics

    • Expectation
    • Variance
    • Covariance
  5. Entropy

    • Definition: \(H[X] = -\sum\limits_{x}p(x)log_{2}p(x)\)
    • Important in coding theory, statistical physics, machine learning
    • Conditional entropy:
      • Definition: \(H[Y|X] = -\sum\limits_{x}\sum\limits_{y}p(x,y)log_{2}(p(y|x))\), or \(H[Y|X] = -\int\limits_{x}\int\limits_{y} p(x,y)log_{2}(p(y|x))dxdy\)
      • Property: \(H[X, Y] = H[X] + H[Y|X] = H[Y] + H[X|Y]\)
    • Mutual information
      • Definition: \(I[X,Y] = -\sum\limits_{x}\sum\limits_{y}p(x,y)log_{2}(\frac{p(x)p(y)}{p(x,y)})\), or \(I[X,Y] = -\int\limits_{x}\int\limits_{y}p(x,y)log_{2}(\frac{p(x)p(y)}{p(x,y)})dxdy\)
      • Property: \(I[X,Y] = H[X] - H[X|Y] = H[Y] - H[Y|X]\)
  6. Kullback-Leibler Divergence

    • Definition: \(KL(p(x)||q(x)) = \int\limits_{x} p(x)log(\frac{p(x)}{q(x)})dx = -\int\limits_{x} p(x)log(\frac{q(x)}{p(x)})dx\)
    • Property: \(KL(p(x)||q(x)) \ne KL(q(x)||p(x))\)
    • Property: \(I[X,Y] = KL(p(x,y)||p(x)p(y))\) (this identity is checked numerically in a sketch after this list)
  7. Convex and concave functions

  8. Vector, vector operations, matrix, matrix multiplication

  9. Matrix calculus

    • \(\frac{\partial (a \vec{x} + b)}{\partial \vec{x}} = a\)
    • \(\frac{\partial (\vec{a}^{T} \vec{x})}{\partial \vec{x}} = \vec{a}^{T}\)
    • \(\frac{\partial (\vec{x}^{T} \vec{a} )}{\partial \vec{x}} = \vec{a}^{T}\)
    • \(\frac{\partial (\vec{x}^{T} \vec{x})}{\partial \vec{x}} = 2\vec{x}^{T}\)
    • \(\frac{\partial A\vec{x}}{\partial \vec{x}} = A\)
    • \(\frac{\partial \vec{x}^{T}A}{\partial \vec{x}} = A^{T}\) (a finite-difference check of two of these identities appears after this list)
  10. Function approximation

    • \(y = f(x;\vec{w}) = \sum\limits_{i=0}^{M}w_{i}x^{i} = \vec{w}^{T}\vec{x}\), where \(\vec{x} = (1, x, x^{2}, ..., x^{M})^{T}\)
    • Sum of squares error (SSE): \(E_{ss}(\vec{w}) = \frac{1}{2}\sum\limits_{n=1}^{N}[f(x_{n};\vec{w}) - y_{n}]^{2}\)
    • Root-mean-square error (RMSE): \(\sqrt{\frac{\sum\limits_{n=1}^{N}[f(x_{n};\vec{w}) - y_{n}]^2}{N}}\) (a polynomial-fitting sketch computing both error measures follows this list)
  11. Model selection

    • Select a model that fits the data well and is as simple as possible
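
The entropy, mutual information, and KL-divergence identities above are easy to check numerically. The sketch below uses a made-up 2×2 joint distribution p(x, y) (an illustrative assumption, with NumPy) to verify the chain rule \(H[X, Y] = H[X] + H[Y|X]\), the two expressions for \(I[X,Y]\), and \(I[X,Y] = KL(p(x,y)||p(x)p(y))\), all in base 2.

```python
import numpy as np

# Made-up joint distribution p(x, y) for x, y in {0, 1}; rows index x, columns index y.
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

def entropy(p):
    """H = -sum p log2 p over the nonzero entries of p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_x, H_y = entropy(p_x), entropy(p_y)
H_xy = entropy(p_xy.ravel())          # joint entropy H[X, Y]
H_y_given_x = H_xy - H_x              # chain rule: H[X, Y] = H[X] + H[Y|X]

# Mutual information, directly from the definition above.
I_xy = -np.sum(p_xy * np.log2(np.outer(p_x, p_y) / p_xy))

# KL divergence (base 2) between the joint and the product of the marginals.
kl = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

print(H_x, H_y, H_xy, H_y_given_x)
print(I_xy, H_x - (H_xy - H_y))       # I[X,Y] = H[X] - H[X|Y]
print(kl)                             # equals I_xy: I[X,Y] = KL(p(x,y) || p(x)p(y))
```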
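
As a sanity check on the matrix-calculus identities in item 9, the sketch below compares two of them, \(\frac{\partial (\vec{a}^{T} \vec{x})}{\partial \vec{x}} = \vec{a}^{T}\) and \(\frac{\partial (\vec{x}^{T} \vec{x})}{\partial \vec{x}} = 2\vec{x}^{T}\), against central finite differences at a random point; the vectors and step size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal(4)
x = rng.standard_normal(4)
eps = 1e-6

def numeric_grad(f, x):
    """Central finite-difference gradient of a scalar-valued f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# d(a^T x)/dx = a^T and d(x^T x)/dx = 2 x^T (components compared as arrays)
print(np.allclose(numeric_grad(lambda v: a @ v, x), a))
print(np.allclose(numeric_grad(lambda v: v @ v, x), 2 * x))
```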
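
And a small sketch of function approximation (item 10): fit the polynomial \(f(x;\vec{w}) = \sum_{i=0}^{M}w_{i}x^{i}\) to noisy samples of a sine curve by least squares and report the SSE and RMSE defined above. The data, the degree M = 3, and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)  # noisy targets

M = 3                                       # polynomial degree (assumed)
X = np.vander(x, M + 1, increasing=True)    # columns: x^0, x^1, ..., x^M
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares weights

pred = X @ w
sse = 0.5 * np.sum((pred - y) ** 2)         # E_ss(w) as defined above
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(w, sse, rmse)
```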

Linear Regression and Logistic Regression

  1. Regression models
    • Answer the question: what is the relationship between the dependent variable and the independent variables?
    • Given a dataset \(\{\langle\vec{x}_{n}, y_{n}\rangle\}_{n=1}^{N}\), where \(\vec{x}_{n} \in \mathbb{R}^{d}\) and \(y_{n} \in \mathbb{R}\), build a function \(f(\vec{x};\vec{w})\) to predict the output for a new sample \(\vec{x}\)
  2. Linear model
    • Estimate parameters
      • \(f(\vec{x};\vec{w}) = \vec{w}^{T}\vec{x}\)
      • Err = \(\sum\limits_{n=1}^{N}(y_{n}-f(\vec{x}_n;\vec{w}))^2 = (\vec{y}- X\vec{w})^{T}(\vec{y}- X\vec{w})\)
      • Solution: if X has full column rank, then \(\vec{w} = (X^{T}X)^{-1}X^{T}\vec{y}\) (a numerical sketch follows this list)
      • If \((X^{T}X)\) is singular, then some features are linear combinations of other features, so we need to reduce the number of features to make \((X^{T}X)\) full rank
  3. Linear basis function model
    • Replace \(f(\vec{x};\vec{w}) = w_0 + w_{1}x_{1} + ... + w_{d}x_{d}\) with \(f(\vec{x};\vec{w}) = w_0 + w_{1}h_{1}(\vec{x}) + ... + w_{m}h_{m}(\vec{x})\)
    • \(\{h_{i}(\vec{x})\}\) is a set of non-linear basis functions/kernel functions
  4. Logistic regression & classification
    • \(f(\vec{x};\vec{w}) = \frac{1}{1 + e^{-\vec{w}^{T}\vec{x}}}\)
    • Let \(P(1|\vec{x};\vec{w}) = f(\vec{x};\vec{w})\) and \(P(0|\vec{x};\vec{w}) = 1 - f(\vec{x};\vec{w})\)
    • Classification rule: assign \(\vec{x}\) to class 1 if \(P(1|\vec{x};\vec{w}) > P(0|\vec{x};\vec{w})\), otherwise assign \(\vec{x}\) to class 0
    • \(logit(\vec{x};\vec{w}) = log(\frac{P(1|\vec{x};\vec{w})}{P(0|\vec{x};\vec{w})}) = \vec{w}^{T}\vec{x}\) (a numerical sketch of the classification rule and this identity follows this list)
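
A minimal sketch of the linear-model estimate from item 2, assuming NumPy and a made-up synthetic dataset: build the design matrix X (with a bias column), solve the normal equations \(\vec{w} = (X^{T}X)^{-1}X^{T}\vec{y}\), and compare with a least-squares solver, which is the numerically safer choice when \(X^{T}X\) is close to singular. Replacing the columns of X with basis-function values \(h_{i}(\vec{x})\) gives the linear basis function model of item 3 with no other changes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])  # bias column + features
w_true = np.array([1.0, 2.0, -1.0, 0.5])                       # made-up true weights
y = X @ w_true + 0.05 * rng.standard_normal(N)                 # noisy targets

# Normal-equation solution (valid when X has full column rank).
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent but numerically safer alternative.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

err = (y - X @ w_hat) @ (y - X @ w_hat)   # (y - Xw)^T (y - Xw)
print(w_hat, w_lstsq, err)
```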
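
And a sketch of the logistic model from item 4: the sigmoid posterior \(P(1|\vec{x};\vec{w})\), the resulting classification rule, and a check that the logit equals \(\vec{w}^{T}\vec{x}\). The weight vector and input below are arbitrary illustrative values, with the bias folded in as the first component.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])   # assumed weights, w[0] acting as the bias
x = np.array([1.0, 0.3, 0.8])    # input with a leading 1 for the bias term

p1 = sigmoid(w @ x)              # P(1 | x; w)
p0 = 1.0 - p1                    # P(0 | x; w)
label = 1 if p1 > p0 else 0      # assign class 1 iff P(1|x;w) > P(0|x;w)

print(p1, p0, label)
print(np.log(p1 / p0), w @ x)    # logit(x; w) equals w^T x
```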

Principal Component Analysis

Linear Discriminant Analysis

Bayesian Decision Theory

  1. Formula
    • Given state \(S \in \{s_1, ..., s_M\}\).
    • Let decision \(A \in \{a_1, ..., a_N\}\).
    • Cost function \(C(a_i|s_k)\) is the loss incurred for taking \(A = a_i\) when \(S = s_k\).
    • \(P(s_k)\) is the prior probability of \(S = s_k\).
    • Let observation be represented as a feature vector \(\vec{x}\).
    • \(P(\vec{x}|s_k)\) is the probability of observing \(\vec{x}\) conditioned on \(s_k\) being the true state.
    • Expected loss before observing \(\vec{x}\): \(R(a_i) = \sum\limits_{k = 1}^{M}C(a_i|s_k)P(s_k)\).
    • Expected loss after observing \(\vec{x}\): \(R(a_i|\vec{x}) = \sum\limits_{k = 1}^{M}C(a_i|s_k)P(s_k|\vec{x})\).
  2. Bayesian risk
    • A decision function \(D(\vec{x})\) maps an observation \(\vec{x}\) to a decision/action.
    • The total risk of a decision function is given by
      • Discrete: \(E_{P(\vec{x})}[R(D(\vec{x})|\vec{x})] = \sum\limits_{\vec{x}}R(D(\vec{x})|\vec{x})P(\vec{x})\)
      • Continuous: \(E_{P(\vec{x})}[R(D(\vec{x})|\vec{x})] = \int_{\vec{x}}R(D(\vec{x})|\vec{x})P(\vec{x})d\vec{x}\)
    • A decision function is optimal if it minimizes the total risk. This optimal total risk is called Bayes risk.
  3. Two-class Bayes Classifier
    • Let \(M = 2\), and compare \(R(a_1|\vec{x})\) with \(R(a_2|\vec{x})\)
    • We can derive \(R(a_1|\vec{x}) < R(a_2|\vec{x}) \iff \frac{P(\vec{x}|s_1)}{P(\vec{x}|s_2)} > \frac{P(s_2)[C(a_1|s_2) - C(a_2|s_2)]}{P(s_1)[C(a_2|s_1) - C(a_1|s_1)]}\), i.e. the likelihood ratio exceeds a constant threshold (a numerical sketch of this rule follows this list)
    • We also notice that \(\frac{P(\vec{x}|s_1)}{P(\vec{x}|s_2)} \propto \frac{P(s_1|\vec{x})}{P(s_2|\vec{x})}\), i.e. the likelihood ratio is proportional to the posterior ratio
  4. Minimum-error-rate classification
    • Consider the zero-one cost function: \(C(a_i|s_j) = \begin{cases} 0, & i = j\\ 1, & i \ne j \end{cases}\)
    • To minimize the average probability of error, we should select \(s_i\) with the maximal posterior probability \(P(s_i|\vec{x})\)
    • Classify \(\vec{x}\) as \(s_i\) if \(P(s_i|\vec{x}) > P(s_j|\vec{x})\) for all \(j \ne i\)
  5. Classifiers and discriminant functions
    • A pattern classifier can be represented by a set of discriminant functions \(\{g_{i}(\vec{x})\}\)
    • The classifier assigns a sample \(\vec{x}\) to \(s_i\) if \(g_{i}(\vec{x}) > g_{j}(\vec{x})\) for all \(j \ne i\)
    • If we replace every \(g_{i}(\vec{x})\) by \(f(g_{i}(\vec{x}))\), where f is a monotonically increasing function, the resulting classification is unchanged
    • A classifier that uses linear discriminant functions is called a linear classifier
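
A minimal numerical sketch of the two-class Bayes classifier from items 1-3, assuming NumPy; the priors, likelihood values at the observed \(\vec{x}\), and cost matrix below are made up for illustration. It computes the conditional risks \(R(a_i|\vec{x})\), picks the minimum-risk action, and confirms that the same decision falls out of the likelihood-ratio test.

```python
import numpy as np

prior = np.array([0.6, 0.4])      # P(s_1), P(s_2)
lik = np.array([0.2, 0.5])        # P(x | s_1), P(x | s_2) at the observed x
C = np.array([[0.0, 2.0],         # row 1: C(a_1 | s_1), C(a_1 | s_2)
              [1.0, 0.0]])        # row 2: C(a_2 | s_1), C(a_2 | s_2)

post = prior * lik / np.sum(prior * lik)   # posteriors P(s_k | x) via Bayes' theorem
risk = C @ post                            # R(a_i | x) = sum_k C(a_i | s_k) P(s_k | x)
action = np.argmin(risk) + 1               # minimum-conditional-risk action (1-based)

# Equivalent likelihood-ratio test: choose a_1 iff
#   P(x|s_1)/P(x|s_2) > P(s_2)[C(a_1|s_2) - C(a_2|s_2)] / (P(s_1)[C(a_2|s_1) - C(a_1|s_1)])
ratio = lik[0] / lik[1]
threshold = prior[1] * (C[0, 1] - C[1, 1]) / (prior[0] * (C[1, 0] - C[0, 0]))
print(risk, action, ratio > threshold)     # a_1 is chosen iff ratio > threshold
```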

Maximum Likelihood Estimation (MLE) and the Expectation-Maximization (EM) Algorithm