
COSI 123A Statistical Machine Learning

Introduction

  1. Introduction

Background Concepts

  1. Random variables

  2. Probability

    • Bayes' theorem
    • Chain rule
    • Independence
    • Conditional independence
  3. Probability distribution

    • Gaussian
    • Bernoulli
    • Binomial
  4. Descriptive statistics

    • Expectation
    • Variance
    • Covariance
  5. Entropy

    • Definition: \(H[X] = -\sum\limits_{x}p(x)log_{2}p(x)\)
    • Important in coding theory, statistical physics, machine learning
    • Conditional entropy:
      • Definition: \(H[Y|X] = -\sum\limits_{x}\sum\limits_{y}p(x,y)log_{2}(p(y|x))\), or \(H[Y|X] = -\int\limits_{x}\int\limits_{y} p(x,y)log_{2}(p(y|x))dxdy\)
      • Property: \(H[X, Y] = H[X] + H[Y|X] = H[Y] + H[X|Y]\)
    • Mutual information
      • Definition: \(I[X,Y] = -\sum\limits_{x}\sum\limits_{y}p(x,y)log_{2}(\frac{p(x)p(y)}{p(x,y)})\), or \(I[X,Y] = -\int\limits_{x}\int\limits_{y}p(x,y)log_{2}(\frac{p(x)p(y)}{p(x,y)})dxdy\)
      • Property: \(I[X,Y] = H[X] - H[X|Y] = H[Y] - H[Y|X]\)
  6. Kullback-Leibler Divergence

    • Definition: \(KL(p(x)||q(x)) = \int\limits_{x} p(x)log(\frac{p(x)}{q(x)})dx = -\int\limits_{x} p(x)log(\frac{q(x)}{p(x)})dx\)
    • Property: \(KL(p(x)||q(x)) \ne KL(q(x)||p(x))\)
    • Property: \(I[X,Y] = KL(p(x,y)||p(x)p(y))\) (this identity is checked numerically in a sketch after this list)
  7. Convex and concave functions

  8. Vector, vector operations, matrix, matrix multiplication

  9. Matrix calculus

    • \(\frac{\partial (a \vec{x} + b)}{\partial \vec{x}} = a\)
    • \(\frac{\partial (\vec{a}^{T} \vec{x})}{\partial \vec{x}} = \vec{a}^{T}\)
    • \(\frac{\partial (\vec{x}^{T} \vec{a} )}{\partial \vec{x}} = \vec{a}^{T}\)
    • \(\frac{\partial (\vec{x}^{T} \vec{x})}{\partial \vec{x}} = 2\vec{x}^{T}\)
    • \(\frac{\partial A\vec{x}}{\partial \vec{x}} = A\)
    • \(\frac{\partial \vec{x}^{T}A}{\partial \vec{x}} = A^{T}\) (a finite-difference check of two of these identities appears after this list)
  10. Function approximation

    • \(y = f(x;\vec{w}) = \sum\limits_{i=0}^{M}w_{i}x^{i} = \vec{w}^{T}\vec{x}\), where \(\vec{x} = (1, x, x^{2}, ..., x^{M})^{T}\)
    • Sum of squares error (SSE): \(E_{ss}(\vec{w}) = \frac{1}{2}\sum\limits_{n=1}^{N}[f(x_{n};\vec{w}) - y_{n}]^{2}\)
    • Root-mean-square error (RMSE): \(\sqrt{\frac{\sum\limits_{n=1}^{N}[f(x_{n};\vec{w}) - y_{n}]^2}{N}}\) (a polynomial-fitting sketch computing both error measures follows this list)
  11. Model selection

    • Select a model that fits the data well and is as simple as possible
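
The entropy, mutual information, and KL-divergence identities above are easy to check numerically. The sketch below uses a made-up 2×2 joint distribution p(x, y) (an illustrative assumption, with NumPy) to verify the chain rule \(H[X, Y] = H[X] + H[Y|X]\), the two expressions for \(I[X,Y]\), and \(I[X,Y] = KL(p(x,y)||p(x)p(y))\), all in base 2.

```python
import numpy as np

# Made-up joint distribution p(x, y) for x, y in {0, 1}; rows index x, columns index y.
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

def entropy(p):
    """H = -sum p log2 p over the nonzero entries of p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_x, H_y = entropy(p_x), entropy(p_y)
H_xy = entropy(p_xy.ravel())          # joint entropy H[X, Y]
H_y_given_x = H_xy - H_x              # chain rule: H[X, Y] = H[X] + H[Y|X]

# Mutual information, directly from the definition above.
I_xy = -np.sum(p_xy * np.log2(np.outer(p_x, p_y) / p_xy))

# KL divergence (base 2) between the joint and the product of the marginals.
kl = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

print(H_x, H_y, H_xy, H_y_given_x)
print(I_xy, H_x - (H_xy - H_y))       # I[X,Y] = H[X] - H[X|Y]
print(kl)                             # equals I_xy: I[X,Y] = KL(p(x,y) || p(x)p(y))
```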
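
As a sanity check on the matrix-calculus identities in item 9, the sketch below compares two of them, \(\frac{\partial (\vec{a}^{T} \vec{x})}{\partial \vec{x}} = \vec{a}^{T}\) and \(\frac{\partial (\vec{x}^{T} \vec{x})}{\partial \vec{x}} = 2\vec{x}^{T}\), against central finite differences at a random point; the vectors and step size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal(4)
x = rng.standard_normal(4)
eps = 1e-6

def numeric_grad(f, x):
    """Central finite-difference gradient of a scalar-valued f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# d(a^T x)/dx = a^T and d(x^T x)/dx = 2 x^T (components compared as arrays)
print(np.allclose(numeric_grad(lambda v: a @ v, x), a))
print(np.allclose(numeric_grad(lambda v: v @ v, x), 2 * x))
```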
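
And a small sketch of function approximation (item 10): fit the polynomial \(f(x;\vec{w}) = \sum_{i=0}^{M}w_{i}x^{i}\) to noisy samples of a sine curve by least squares and report the SSE and RMSE defined above. The data, the degree M = 3, and the noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)  # noisy targets

M = 3                                       # polynomial degree (assumed)
X = np.vander(x, M + 1, increasing=True)    # columns: x^0, x^1, ..., x^M
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares weights

pred = X @ w
sse = 0.5 * np.sum((pred - y) ** 2)         # E_ss(w) as defined above
rmse = np.sqrt(np.mean((pred - y) ** 2))
print(w, sse, rmse)
```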

Linear Regression and Logistic Regression

  1. Regression models
    • Answer the question: what is the relationship between the dependent variable and the independent variables?
    • Given a dataset \(\{\langle\vec{x}_{n}, y_{n}\rangle\}_{n=1}^{N}\), where \(\vec{x}_{n} \in \mathbb{R}^{d}\) and \(y_{n} \in \mathbb{R}\), build a function \(f(\vec{x};\vec{w})\) to predict the output for a new sample \(\vec{x}\)
  2. Linear model
    • Estimate parameters
      • \(f(\vec{x};\vec{w}) = \vec{w}^{T}\vec{x}\)
      • Err = \(\sum\limits_{n=1}^{N}(y_{n}-f(\vec{x}_n;\vec{w}))^2 = (\vec{y}- X\vec{w})^{T}(\vec{y}- X\vec{w})\)
      • Solution: if X has full column rank, then \(\vec{w} = (X^{T}X)^{-1}X^{T}\vec{y}\) (a numerical sketch follows this list)
      • If \((X^{T}X)\) is singular, then some features are linear combinations of other features, so we need to reduce the number of features to make \((X^{T}X)\) full rank
  3. Linear basis function model
    • Replace \(f(\vec{x};\vec{w}) = w_0 + w_{1}x_{1} + ... + w_{d}x_{d}\) with \(f(\vec{x};\vec{w}) = w_0 + w_{1}h_{1}(\vec{x}) + ... + w_{m}h_{m}(\vec{x})\)
    • \(\{h_{i}(\vec{x})\}\) is a set of non-linear basis functions/kernel functions
  4. Logistic regression & classification
    • \(f(\vec{x};\vec{w}) = \frac{1}{1 + e^{-\vec{w}^{T}\vec{x}}}\)
    • Let \(P(1|\vec{x};\vec{w}) = f(\vec{x};\vec{w})\) and \(P(0|\vec{x};\vec{w}) = 1 - f(\vec{x};\vec{w})\)
    • Classification rule: assign \(\vec{x}\) to class 1 if \(P(1|\vec{x};\vec{w}) > P(0|\vec{x};\vec{w})\), otherwise assign \(\vec{x}\) to class 0
    • \(logit(\vec{x};\vec{w}) = log(\frac{P(1|\vec{x};\vec{w})}{P(0|\vec{x};\vec{w})}) = \vec{w}^{T}\vec{x}\) (a numerical sketch of the classification rule and this identity follows this list)
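
A minimal sketch of the linear-model estimate from item 2, assuming NumPy and a made-up synthetic dataset: build the design matrix X (with a bias column), solve the normal equations \(\vec{w} = (X^{T}X)^{-1}X^{T}\vec{y}\), and compare with a least-squares solver, which is the numerically safer choice when \(X^{T}X\) is close to singular. Replacing the columns of X with basis-function values \(h_{i}(\vec{x})\) gives the linear basis function model of item 3 with no other changes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])  # bias column + features
w_true = np.array([1.0, 2.0, -1.0, 0.5])                       # made-up true weights
y = X @ w_true + 0.05 * rng.standard_normal(N)                 # noisy targets

# Normal-equation solution (valid when X has full column rank).
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent but numerically safer alternative.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

err = (y - X @ w_hat) @ (y - X @ w_hat)   # (y - Xw)^T (y - Xw)
print(w_hat, w_lstsq, err)
```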
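
And a sketch of the logistic model from item 4: the sigmoid posterior \(P(1|\vec{x};\vec{w})\), the resulting classification rule, and a check that the logit equals \(\vec{w}^{T}\vec{x}\). The weight vector and input below are arbitrary illustrative values, with the bias folded in as the first component.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 2.0])   # assumed weights, w[0] acting as the bias
x = np.array([1.0, 0.3, 0.8])    # input with a leading 1 for the bias term

p1 = sigmoid(w @ x)              # P(1 | x; w)
p0 = 1.0 - p1                    # P(0 | x; w)
label = 1 if p1 > p0 else 0      # assign class 1 iff P(1|x;w) > P(0|x;w)

print(p1, p0, label)
print(np.log(p1 / p0), w @ x)    # logit(x; w) equals w^T x
```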

Principal Component Analysis

Linear Discriminant Analysis

Bayesian Decision Theory

  1. Formula
    • Given state \(S \in \{s_1, ..., s_M\}\).
    • Let decision \(A \in \{a_1, ..., a_N\}\).
    • Cost function \(C(a_i|s_k)\) is the loss incurred for taking \(A = a_i\) when \(S = s_k\).
    • \(P(s_k)\) is the prior probability of \(S = s_k\).
    • Let observation be represented as a feature vector \(\vec{x}\).
    • \(P(\vec{x}|s_k)\) is the probability of observing \(\vec{x}\) conditioned on \(s_k\) being the true state.
    • Expected loss before observing \(\vec{x}\): \(R(a_i) = \sum\limits_{k = 1}^{M}C(a_i|s_k)P(s_k)\).
    • Expected loss after observing \(\vec{x}\): \(R(a_i|\vec{x}) = \sum\limits_{k = 1}^{M}C(a_i|s_k)P(s_k|\vec{x})\).
  2. Bayesian risk
    • A decision function \(D(\vec{x})\) maps an observation \(\vec{x}\) to a decision/action.
    • The total risk of a decision function is given by
      • Discrete: \(E_{P(\vec{x})}[R(D(\vec{x})|\vec{x})] = \sum\limits_{\vec{x}}R(D(\vec{x})|\vec{x})P(\vec{x})\)
      • Continuous: \(E_{P(\vec{x})}[R(D(\vec{x})|\vec{x})] = \int_{\vec{x}}R(D(\vec{x})|\vec{x})P(\vec{x})d\vec{x}\)
    • A decision function is optimal if it minimizes the total risk. This optimal total risk is called Bayes risk.
  3. Two-class Bayes Classifier
    • Let \(M = 2\), and compare \(R(a_1|\vec{x})\) with \(R(a_2|\vec{x})\)
    • We can derive \(R(a_1|\vec{x}) < R(a_2|\vec{x}) \iff \frac{P(\vec{x}|s_1)}{P(\vec{x}|s_2)} > \frac{P(s_2)[C(a_1|s_2) - C(a_2|s_2)]}{P(s_1)[C(a_2|s_1) - C(a_1|s_1)]}\), i.e. the likelihood ratio exceeds a constant threshold (a numerical sketch of this rule follows this list)
    • We also notice that \(\frac{P(\vec{x}|s_1)}{P(\vec{x}|s_2)} \propto \frac{P(s_1|\vec{x})}{P(s_2|\vec{x})}\), i.e. the likelihood ratio is proportional to the posterior ratio
  4. Minimum-error-rate classification
    • Consider the zero-one cost function: \(C(a_i|s_j) = \begin{cases} 0, & i = j\\ 1, & i \ne j \end{cases}\)
    • To minimize the average probability of error, we should select \(s_i\) with the maximal posterior probability \(P(s_i|\vec{x})\)
    • Classify \(\vec{x}\) as \(s_i\) if \(P(s_i|\vec{x}) > P(s_j|\vec{x})\) for all \(j \ne i\)
  5. Classifiers and discriminant functions
    • A pattern classifier can be represented by a set of discriminant functions \(\{g_{i}(\vec{x})\}\)
    • The classifier assigns a sample \(\vec{x}\) to \(s_i\) if \(g_{i}(\vec{x}) > g_{j}(\vec{x})\) for all \(j \ne i\)
    • If we replace every \(g_{i}(\vec{x})\) by \(f(g_{i}(\vec{x}))\), where f is a monotonically increasing function, the resulting classification is unchanged
    • A classifier that uses linear discriminant functions is called a linear classifier
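
A minimal numerical sketch of the two-class Bayes classifier from items 1-3, assuming NumPy; the priors, likelihood values at the observed \(\vec{x}\), and cost matrix below are made up for illustration. It computes the conditional risks \(R(a_i|\vec{x})\), picks the minimum-risk action, and confirms that the same decision falls out of the likelihood-ratio test.

```python
import numpy as np

prior = np.array([0.6, 0.4])      # P(s_1), P(s_2)
lik = np.array([0.2, 0.5])        # P(x | s_1), P(x | s_2) at the observed x
C = np.array([[0.0, 2.0],         # row 1: C(a_1 | s_1), C(a_1 | s_2)
              [1.0, 0.0]])        # row 2: C(a_2 | s_1), C(a_2 | s_2)

post = prior * lik / np.sum(prior * lik)   # posteriors P(s_k | x) via Bayes' theorem
risk = C @ post                            # R(a_i | x) = sum_k C(a_i | s_k) P(s_k | x)
action = np.argmin(risk) + 1               # minimum-conditional-risk action (1-based)

# Equivalent likelihood-ratio test: choose a_1 iff
#   P(x|s_1)/P(x|s_2) > P(s_2)[C(a_1|s_2) - C(a_2|s_2)] / (P(s_1)[C(a_2|s_1) - C(a_1|s_1)])
ratio = lik[0] / lik[1]
threshold = prior[1] * (C[0, 1] - C[1, 1]) / (prior[0] * (C[1, 0] - C[0, 0]))
print(risk, action, ratio > threshold)     # a_1 is chosen iff ratio > threshold
```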

Maximum Likelihood Estimation (MLE) and the Expectation-Maximization (EM) Algorithm