Introduction
Background Concepts
Random variables
Probability
- Bayes' theorem
- Chain rule
- Independence
- Conditional independence
Probability distribution
- Gaussian
- Bernoulli
- Binomial
Descriptive statistics
- Expectation
- Variance
- Covariance
Entropy
- Definition: \(H[X] = -\sum\limits_{x}p(x)log_{2}p(x)\)
- Important in coding theory, statistical physics, machine learning
- Conditional entropy:
- Definition: \(H[Y|X] = -\sum\limits_{x}\sum\limits_{y}p(x,y)log_{2}(p(y|x))\), or \(H[Y|X] = -\int\limits_{x}\int\limits_{y} p(x,y)log_{2}(p(y|x))dxdy\)
- Property: \(H[X, Y] = H[X] + H[Y|X] = H[Y] + H[X|Y]\)
- Mutual information
- Definition: \(I[X,Y] = -\sum\limits_{x}\sum\limits_{y}p(x,y)log_{2}(\frac{p(x)p(y)}{p(x,y)})\), or \(I[X,Y] = -\int\limits_{x}\int\limits_{y}p(x,y)log_{2}(\frac{p(x)p(y)}{p(x,y)})dxdy\)
- Property: \(I[X,Y] = H[X] - H[X|Y] = H[Y] - H[Y|X]\)
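A minimal numpy sketch (the binary joint distribution below is made up purely for illustration) that computes these quantities and checks the chain-rule and mutual-information identities numerically:

```python
import numpy as np

# Made-up joint distribution p(x, y) over two binary variables (rows: x, columns: y).
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])
p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

def H(p):
    """Entropy in bits: H = -sum p log2 p (zero-probability entries contribute 0)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy.ravel())

# Conditional entropy H[Y|X] = -sum_{x,y} p(x,y) log2 p(y|x)
p_y_given_x = p_xy / p_x[:, None]
H_Y_given_X = -np.sum(p_xy * np.log2(p_y_given_x))

# Mutual information I[X,Y] = -sum_{x,y} p(x,y) log2( p(x)p(y) / p(x,y) )
I_XY = -np.sum(p_xy * np.log2(np.outer(p_x, p_y) / p_xy))

assert np.isclose(H_XY, H_X + H_Y_given_X)   # H[X,Y] = H[X] + H[Y|X]
assert np.isclose(I_XY, H_Y - H_Y_given_X)   # I[X,Y] = H[Y] - H[Y|X]
print(H_X, H_Y_given_X, I_XY)
```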
Kullback-Leibler Divergence
- Definition: \(KL(p(x)||q(x)) = \int\limits_{x} p(x)log(\frac{p(x)}{q(x)})dx = -\int\limits_{x} p(x)log(\frac{q(x)}{p(x)})dx\)
- Property: \(KL(p(x)||q(x)) \ne KL(q(x)||p(x))\)
- Property: \(I[X,Y] = KL(p(x,y)||p(x)p(y))\)
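A similar sketch for KL divergence, again with made-up distributions, checking the asymmetry and the identity \(I[X,Y] = KL(p(x,y)||p(x)p(y))\):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x)/q(x)), in nats; assumes all entries are positive."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])
print(kl(p, q), kl(q, p))   # generally different: KL is not symmetric

# I[X,Y] = KL(p(x,y) || p(x)p(y)); computed in bits to match the entropy formulas above.
p_xy = np.array([[0.30, 0.20],
                 [0.10, 0.40]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
print(np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y))))
```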
Convex and concave functions
Vector, vector operations, matrix, matrix multiplication
Matrix calculus
- \(\frac{\partial (a \vec{x} + b)}{\partial \vec{x}} = a\)
- \(\frac{\partial (\vec{a}^{T} \vec{x})}{\partial \vec{x}} = \vec{a}^{T}\)
- \(\frac{\partial (\vec{x}^{T} \vec{a} )}{\partial \vec{x}} = \vec{a}^{T}\)
- \(\frac{\partial (\vec{x}^{T} \vec{x})}{\partial \vec{x}} = 2\vec{x}^{T}\)
- \(\frac{\partial (A\vec{x})}{\partial \vec{x}} = A\)
- \(\frac{\partial (\vec{x}^{T}A)}{\partial \vec{x}} = A^{T}\)
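These identities (stated in the row-vector layout used above) can be sanity-checked with finite differences; a minimal numpy sketch with \(A\), \(\vec{a}\), \(\vec{x}\) drawn at random purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
a = rng.normal(size=n)
A = rng.normal(size=(n, n))
x = rng.normal(size=n)
eps = 1e-6

def num_grad(f, x):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

print(np.allclose(num_grad(lambda v: a @ v, x), a))       # d(a^T x)/dx = a^T
print(np.allclose(num_grad(lambda v: v @ v, x), 2 * x))   # d(x^T x)/dx = 2 x^T

# Jacobian of the vector-valued function A x: each column i is approximately A e_i.
J = np.column_stack([(A @ (x + eps * e) - A @ (x - eps * e)) / (2 * eps)
                     for e in np.eye(n)])
print(np.allclose(J, A))                                  # d(Ax)/dx = A
```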
Function approximation
- \(y = f(x;\vec{w}) = \sum\limits_{i=0}^{M}w_{i}x^{i} = \vec{w}^{T}\vec{x}\), where \(\vec{x} = (1, x, x^{2}, ..., x^{M})^{T}\)
- Sum-of-squares error (SSE): \(E_{ss}(\vec{w}) = \frac{1}{2}\sum\limits_{n=1}^{N}[f(x_{n};\vec{w}) - y_{n}]^{2}\)
- Root-mean-square error (RMSE): \(\sqrt{\frac{\sum\limits_{n=1}^{N}[f(x_{n};\vec{w}) - y_{n}]^2}{N}}\)
- Model selection
- Select a model that fits data well and is as simple as possible
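A minimal curve-fitting sketch tying these pieces together: it fits the polynomial \(f(x;\vec{w})\) by least squares (the minimizer of the SSE) for several orders \(M\) and compares training RMSE with RMSE on held-out points. The noisy \(\sin\) target, sample sizes, and noise level are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: noisy samples of sin(2*pi*x).
N = 10
x = np.linspace(0, 1, N)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

def fit_poly(x, y, M):
    """Least-squares fit of f(x; w) = sum_{i=0}^{M} w_i x^i, i.e. the minimizer of the SSE."""
    X = np.vander(x, M + 1, increasing=True)   # columns: 1, x, ..., x^M
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def rmse(x, y, w):
    X = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((X @ w - y) ** 2))

# Held-out points from the noise-free target, used here as a crude model-selection signal.
x_test = np.linspace(0, 1, 101)
y_test = np.sin(2 * np.pi * x_test)
for M in (1, 3, 9):
    w = fit_poly(x, y, M)
    print(M, rmse(x, y, w), rmse(x_test, y_test, w))
```

Training RMSE keeps shrinking as \(M\) grows, while the held-out RMSE typically gets worse once the model overfits, which is the trade-off behind model selection.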
Linear Regression and Logistic Regression
- Regression models
- Answer the question: what is the relationship between the dependent variable and the independent variables?
- Given a dataset \(\{\langle \vec{x}_{n}, y_n \rangle\}\), where \(\vec{x}_n \in \mathbb{R}^{d}\) and \(y_n \in \mathbb{R}\), build a function \(f(\vec{x};\vec{w})\) that predicts the output for a new sample \(\vec{x}\)
- Linear model
- Estimate parameters
- \(f(\vec{x},\vec{w}) = \vec{w}^{T}\vec{x}\)
- Err = \(\sum\limits_{n=1}^{N}(y_{n}-f(\vec{x}_n; \vec{w}))^2 = (\vec{y}- X\vec{w})^{T}(\vec{y}- X\vec{w})\)
- Solution: if \(X\) has full column rank, then \(\vec{w} = (X^{T}X)^{-1}X^{T}\vec{y}\)
- If \(X^{T}X\) is singular, then some features are linear combinations of other features, so we need to reduce the number of features to make \(X^{T}X\) full rank (see the normal-equation sketch after this list)
- Linear basis function model
- Replace \(f(\vec{x};\vec{w}) = w_0 + w_{1}x_{1} + ... +w_{d}x_{d}\) with \(f(\vec{x};\vec{w}) = w_0 + w_{1}h_{1}(\vec{x}) + ... +w_{m}h_{m}(\vec{x})\)
- \(\{h_{i}(\vec{x})\}\) is a set of non-linear basis functions/kernel functions
- Logistic regression & classification
- \(f(\vec{x};\vec{w}) = \frac{1}{1 + e^{-\vec{w}^{T}\vec{x}}}\) (a fitting sketch follows this list)
- Let \(P(1|\vec{x};\vec{w}) = f(\vec{x};\vec{w})\) and \(P(0|\vec{x};\vec{w}) = 1 - f(\vec{x};\vec{w})\)
- Classification rule: assign \(\vec{x}\) to class 1 if \(P(1|\vec{x};\vec{w}) > P(0|\vec{x};\vec{w})\), otherwise assign \(\vec{x}\) to class 0
- \(logit(\vec{x};\vec{w}) = log(\frac{P(1|\vec{x};\vec{w})}{P(0|\vec{x};\vec{w})}) = \vec{w}^{T}\vec{x}\)
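A minimal sketch of the normal-equation solution \(\vec{w} = (X^{T}X)^{-1}X^{T}\vec{y}\) on synthetic data (the true weights, noise level, and sizes are made up for illustration); it also shows the pseudo-inverse, one common alternative (besides dropping redundant features) when \(X^{T}X\) is singular.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: y = w_true^T x + noise, with made-up w_true and a bias column in X.
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=N)

# Normal equation: w = (X^T X)^{-1} X^T y, valid when X has full column rank.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)

# If X^T X is (near-)singular, e.g. because some features are linear combinations of
# others, the pseudo-inverse gives a minimum-norm least-squares solution instead.
w_pinv = np.linalg.pinv(X) @ y
print(np.allclose(w_hat, w_pinv))
```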
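A sketch of logistic regression as defined above. The synthetic two-class data and the fitting procedure (gradient ascent on the log-likelihood) are assumptions not spelled out in the notes; the last line checks that the logit is linear in \(\vec{x}\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)

# Made-up two-class data: class-1 points are shifted relative to class-0 points.
N = 200
X0 = rng.normal(loc=-1.0, size=(N // 2, 2))
X1 = rng.normal(loc=+1.0, size=(N // 2, 2))
X = np.hstack([np.ones((N, 1)), np.vstack([X0, X1])])   # bias column + 2 features
y = np.concatenate([np.zeros(N // 2), np.ones(N // 2)])

# Fit w by gradient ascent on the log-likelihood (an assumed, standard fitting choice).
w = np.zeros(3)
for _ in range(2000):
    p1 = sigmoid(X @ w)             # P(1 | x; w)
    w += 0.1 / N * X.T @ (y - p1)   # gradient of the log-likelihood w.r.t. w

p1 = sigmoid(X @ w)
pred = (p1 > 0.5).astype(int)       # assign class 1 iff P(1|x;w) > P(0|x;w)
print("training accuracy:", np.mean(pred == y))

# The log-odds (logit) is linear in x: log(P(1|x;w) / P(0|x;w)) = w^T x.
print(np.allclose(np.log(p1 / (1 - p1)), X @ w))
```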
Principal Component Analysis
Linear Discriminant Analysis
Bayesian Decision Theory
- Formula
- Let the state be \(S \in \{s_1, ..., s_M\}\).
- Let the decision/action be \(A \in \{a_1, ..., a_N\}\).
- Cost function \(C(a_i|s_k)\) is the loss incurred for taking \(A = a_i\) when \(S = s_k\).
- \(P(s_k)\) is the prior probability of \(S = s_k\).
- Let observation be represented as a feature vector \(\vec{x}\).
- \(P(\vec{x}|s_k)\) is the probability of observing \(\vec{x}\) given that \(s_k\) is the true state.
- Expected loss before observing \(\vec{x}\): \(R(a_i) = \sum\limits_{k = 1}^{M}C(a_i|s_k)P(s_k)\).
- Expected loss after observing \(\vec{x}\): \(R(a_i|\vec{x}) = \sum\limits_{k = 1}^{M}C(a_i|s_k)P(s_k|\vec{x})\).
- Bayesian risk
- A decision function \(D(x)\) maps from an observation \(\vec{x}\) to a decision/action.
- The total risk of a decision function is given by
- Discrete: \(E_{P(\vec{x})}[R(D(\vec{x})|\vec{x})] = \sum\limits_{\vec{x}}R(D(\vec{x})|\vec{x})P(\vec{x})\)
- Continuous: \(E_{P(\vec{x})}[R(D(\vec{x})|\vec{x})] = \int\limits_{\vec{x}}R(D(\vec{x})|\vec{x})P(\vec{x})d\vec{x}\)
- A decision function is optimal if it minimizes the total risk. This optimal total risk is called Bayes risk.
- Two-class Bayes Classifier
- Let \(M = 2\), and compare \(R(a_1|\vec{x})\) with \(R(a_2|\vec{x})\)
- We can derive \(R(a_1|\vec{x}) < R(a_2|\vec{x}) \iff \frac{P(\vec{x}|s_1)}{P(\vec{x}|s_2)} > \frac{P(s_2)[C(a_1|s_2) - C(a_2|s_2)]}{P(s_1)[C(a_2|s_1) - C(a_1|s_1)]}\), i.e., compare the likelihood ratio against a constant threshold
- We also note that \(\frac{P(\vec{x}|s_1)}{P(\vec{x}|s_2)} \propto \frac{P(s_1|\vec{x})}{P(s_2|\vec{x})}\), i.e., the likelihood ratio is proportional to the posterior ratio, since \(\frac{P(s_1|\vec{x})}{P(s_2|\vec{x})} = \frac{P(\vec{x}|s_1)P(s_1)}{P(\vec{x}|s_2)P(s_2)}\) and the prior ratio does not depend on \(\vec{x}\)
- Minimum-error-rate classification
- Consider the zero-one cost function: \(C(a_i|s_j) = \begin{cases} 0, & i = j\\ 1, & i \ne j \end{cases}\)
- To minimize the average probability of error, we should select \(s_i\) with the maximal posterior probability \(P(s_i|\vec{x})\)
- Classify \(\vec{x}\) as \(s_i\) if \(P(s_i|\vec{x}) > P(s_j|\vec{x})\) for all \(j \ne i\) (a numerical sketch follows at the end of this section)
- Classifiers and discriminant functions
- A pattern classifier can be represented by a set of discriminant functions \(\{g_{i}(\vec{x})\}\)
- The classifier assigns a sample \(\vec{x}\) to \(s_i\) if \(g_{i}(\vec{x}) > g_{j}(\vec{x})\) for all \(j \ne i\) (see the discriminant-function sketch below)
- If we replace every \(g_{i}(\vec{x})\) by \(f(g_{i}(\vec{x}))\), where f is a monotonically increasing function, the resulting classification is unchanged
- A classifier that uses linear discriminant functions is called a linear classifier
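A numerical sketch of the conditional risk and the minimum-error-rate rule, using a made-up cost matrix, prior, and class-conditional likelihoods for a single observation \(\vec{x}\):

```python
import numpy as np

# Made-up example with M = 3 states and N = 3 actions, where action a_i means "decide s_i".
# C[i, k] = C(a_i | s_k): loss for taking action a_i when the true state is s_k.
C = np.array([[0.0, 2.0, 3.0],
              [1.0, 0.0, 3.0],
              [1.0, 2.0, 0.0]])

prior = np.array([0.5, 0.3, 0.2])        # P(s_k), made up
likelihood = np.array([0.9, 0.1, 0.4])   # P(x | s_k) for one observed x, made up

# Bayes' theorem: P(s_k | x) = P(x | s_k) P(s_k) / P(x)
posterior = likelihood * prior
posterior /= posterior.sum()

R_prior = C @ prior       # R(a_i)   = sum_k C(a_i|s_k) P(s_k)
R_post = C @ posterior    # R(a_i|x) = sum_k C(a_i|s_k) P(s_k|x)
print("best action before observing x:", np.argmin(R_prior))
print("best action after observing x :", np.argmin(R_post))

# Zero-one cost: R(a_i|x) = 1 - P(s_i|x), so minimizing the conditional risk is the
# same as choosing the state with the largest posterior (minimum-error-rate rule).
zero_one = 1.0 - np.eye(3)
print(np.argmin(zero_one @ posterior) == np.argmax(posterior))
```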
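And a sketch of a linear classifier built from discriminant functions \(g_{i}(\vec{x}) = \vec{w}_{i}^{T}\vec{x} + b_{i}\) (the weights and samples are arbitrary, for illustration), including a check that a monotonically increasing transform of the \(g_{i}\) leaves the decisions unchanged:

```python
import numpy as np

# Made-up linear discriminant functions g_i(x) = w_i^T x + b_i for 3 classes, 2 features.
W = np.array([[ 1.0, -0.5],
              [-0.3,  0.8],
              [ 0.2,  0.2]])
b = np.array([0.1, -0.2, 0.0])

def classify(X, transform=lambda g: g):
    """Assign each row of X to the class with the largest (optionally transformed) g_i."""
    G = transform(X @ W.T + b)
    return np.argmax(G, axis=1)

X = np.random.default_rng(4).normal(size=(5, 2))
print(classify(X))
# A monotonically increasing f (here exp) applied to every g_i leaves decisions unchanged.
print(np.array_equal(classify(X), classify(X, transform=np.exp)))
```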