
COSI 123A Statistical Machine Learning

Introduction

  1. Introduction

Background Concepts

  1. Random variables

  2. Probability

    • Bayes' theorem
    • Chain rule
    • Independent
    • Conditional independent
  3. Probability distribution

    • Gaussian
    • Bernoulli
    • Binomial
  4. Descriptive statistics

    • Expectation
    • Variance
    • Covariance
  5. Entropy

    • Definition: $H[X] = -\sum_x p(x)\log p(x)$
    • Important in coding theory, statistical physics, and machine learning
    • Conditional entropy:
      • Definition: $H[Y|X] = -\sum_x \sum_y p(x,y)\log_2 p(y|x)$ (discrete), or $H[Y|X] = -\iint p(x,y)\log_2 p(y|x)\,dx\,dy$ (continuous)
      • Property: $H[X,Y] = H[X] + H[Y|X] = H[Y] + H[X|Y]$
    • Mutual information
      • Definition: $I[X,Y] = -\sum_x \sum_y p(x,y)\log_2 \frac{p(x)p(y)}{p(x,y)}$ (discrete), or $I[X,Y] = -\iint p(x,y)\log_2 \frac{p(x)p(y)}{p(x,y)}\,dx\,dy$ (continuous)
      • Property: $I[X,Y] = H[X] - H[X|Y] = H[Y] - H[Y|X]$
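
As a quick numerical check of the entropy and mutual-information identities in item 5, the sketch below (my own illustration with a made-up 2x2 joint distribution `p_xy`, not part of the course notes) computes $H[X]$, $H[Y|X]$ via the chain-rule property, and $I[X,Y]$.

```python
import numpy as np

# Made-up 2x2 joint distribution p(x, y); rows index x, columns index y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)                  # marginal p(x)
p_y = p_xy.sum(axis=0)                  # marginal p(y)

def entropy(p):
    """H = -sum p log2 p, skipping zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X = entropy(p_x)
H_XY = entropy(p_xy.ravel())            # joint entropy H[X, Y]
H_Y_given_X = H_XY - H_X                # property: H[X, Y] = H[X] + H[Y|X]
I_XY = entropy(p_y) - H_Y_given_X       # property: I[X, Y] = H[Y] - H[Y|X]
print(H_X, H_Y_given_X, I_XY)           # 1.0, ~0.72, ~0.28 bits for this table
```
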
  6. Kullback-Leibler Divergence

    • Definition: $\mathrm{KL}(p(x)\,\|\,q(x)) = \sum_x p(x)\log\frac{p(x)}{q(x)} = -\sum_x p(x)\log\frac{q(x)}{p(x)}$
    • Property: $\mathrm{KL}(p(x)\,\|\,q(x)) \neq \mathrm{KL}(q(x)\,\|\,p(x))$ in general (KL divergence is not symmetric)
    • Property: $I[X,Y] = \mathrm{KL}(p(x,y)\,\|\,p(x)p(y))$
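
Similarly, here is a minimal sketch of the KL divergence (again with made-up distributions, not from the notes) that shows the asymmetry and checks the identity $I[X,Y] = \mathrm{KL}(p(x,y)\,\|\,p(x)p(y))$ on a small joint table.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Two made-up distributions over the same three outcomes.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])
print(kl(p, q), kl(q, p))                   # different values: KL is not symmetric

# Mutual information as a KL divergence: I[X, Y] = KL(p(x, y) || p(x) p(y)).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
print(kl(p_xy.ravel(), (p_x * p_y).ravel()))  # ~0.28 bits, matching the entropy sketch
```
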
  7. Convex and concave functions

  8. Vector, vector operations, matrix, matrix multiplication

  9. Matrix calculus

    • $\frac{\partial (ax+b)}{\partial x} = a$
    • $\frac{\partial (a^T x)}{\partial x} = a^T$
    • $\frac{\partial (x^T a)}{\partial x} = a^T$
    • $\frac{\partial (x^T x)}{\partial x} = 2x^T$
    • $\frac{\partial (Ax)}{\partial x} = A$
    • $\frac{\partial (x^T A)}{\partial x} = A^T$
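
The matrix-calculus identities above can be sanity-checked with finite differences. The sketch below is my own illustration (not from the notes); it verifies $\partial(a^T x)/\partial x = a$ and $\partial(x^T x)/\partial x = 2x$ on random vectors. Gradients are represented as 1-D numpy arrays, so the row-versus-column (transpose) convention is not visible in the code.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference approximation of df/dx for a scalar-valued f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
a = rng.normal(size=4)
x = rng.normal(size=4)

print(np.allclose(numerical_grad(lambda v: a @ v, x), a))        # d(a^T x)/dx = a
print(np.allclose(numerical_grad(lambda v: v @ v, x), 2 * x))    # d(x^T x)/dx = 2x
```
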
  10. Function approximation

    • $y = f(x;w) = \sum_{i=0}^{M} w_i x^i = w^T x$
    • Sum of squared errors (SSE): $E_{ss}(w) = \frac{1}{2}\sum_{n=1}^{N}[f(x_n;w) - y_n]^2$
    • Root-mean-square error (RMSE): $\sqrt{\frac{1}{N}\sum_{n=1}^{N}[f(x_n;w) - y_n]^2}$
  11. Model selection

    • Select a model that fits the data well and is as simple as possible
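
To make function approximation and model selection concrete, here is a sketch with made-up data (my own example, not from the notes): it fits polynomials $f(x;w) = \sum_{i=0}^{M} w_i x^i$ of several orders by least squares and reports the training RMSE defined above. Training error keeps shrinking as $M$ grows, which is why model selection needs a simplicity preference or held-out data rather than training error alone.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy toy data

def fit_poly(x, y, M):
    """Least-squares fit of f(x; w) = sum_{i=0}^{M} w_i x^i."""
    X = np.vander(x, M + 1, increasing=True)   # columns 1, x, x^2, ..., x^M
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def rmse(x, y, w):
    X = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((X @ w - y) ** 2))

for M in (1, 3, 9):
    w = fit_poly(x, y, M)
    print(M, rmse(x, y, w))    # training RMSE keeps dropping as M grows
```
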

Linear Regression and Logistic Regression

  1. Regression models
    • Answer the question: what is the relationship between the dependent variable and the independent variables?
    • Given a dataset $\{\langle x_n, y_n \rangle\}$, where $x_n \in \mathbb{R}^d$ and $y_n \in \mathbb{R}$, build a function $f(x;w)$ to predict the output for a new sample $x$
  2. Linear model
    • Estimate parameters
      • $f(x;w) = w^T x$
      • $\mathrm{Err} = \sum_{n=1}^{N} (y_n - f(x_n;w))^2 = (y - Xw)^T (y - Xw)$
      • Solution: if $X$ has full column rank, then $w = (X^T X)^{-1} X^T y$
      • If $X^T X$ is singular, then some features are linear combinations of other features, so we need to remove features until $X^T X$ has full rank
  3. Linear basis function model
    • Replace $f(x;w) = w_0 + w_1 x_1 + \dots + w_d x_d$ with $f(x;w) = w_0 + w_1 h_1(x) + \dots + w_m h_m(x)$
    • $\{h_i(x)\}$ is a set of non-linear basis functions (kernel functions)
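
A minimal numpy sketch of the closed-form linear-regression solution from item 2, using made-up data; it solves the normal equations $X^T X w = X^T y$ rather than forming the inverse explicitly. A basis-function model would only change how the design matrix $X$ is built.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # prepend a bias column
w_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=N)

# w = (X^T X)^{-1} X^T y, computed by solving the normal equations.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # close to w_true when X^T X is well conditioned
```
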
  4. Logistic regression & classification
    • $f(x;w) = \frac{1}{1 + e^{-w^T x}}$
    • Let $P(1|x;w) = f(x;w)$ and $P(0|x;w) = 1 - f(x;w)$
    • Classification rule: assign $x$ to class 1 if $P(1|x;w) > P(0|x;w)$; otherwise assign $x$ to class 0
    • $\mathrm{logit}(x;w) = \log\frac{P(1|x;w)}{P(0|x;w)} = w^T x$
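
A small sketch of the logistic-regression prediction and classification rule above, assuming the weight vector $w$ has already been estimated; the weights and the sample here are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -2.0, 1.0])          # hypothetical learned weights (bias first)
x = np.array([1.0, 0.3, 1.2])           # sample with a leading 1 for the bias term

p1 = sigmoid(w @ x)                     # P(1 | x; w)
p0 = 1.0 - p1                           # P(0 | x; w)
label = 1 if p1 > p0 else 0             # equivalent to checking w^T x > 0
logit = np.log(p1 / p0)                 # equals w^T x up to rounding
print(p1, label, logit)
```
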

Principal Component Analysis

Linear Discriminant Analysis

Bayesian Decision Theory

  1. Formula
    • Given state $S \in \{s_1, \dots, s_M\}$.
    • Let decision $A \in \{a_1, \dots, a_N\}$.
    • Cost function $C(a_i|s_k)$ is the loss incurred for taking $A = a_i$ when $S = s_k$.
    • $P(s_k)$ is the prior probability of $S = s_k$.
    • Let the observation be represented as a feature vector $x$.
    • $P(x|s_k)$ is the probability of $x$ conditioned on $s_k$ being the true state.
    • Expected loss before observing $x$: $R(a_i) = \sum_{k=1}^{M} C(a_i|s_k)\,P(s_k)$.
    • Expected loss after observing $x$: $R(a_i|x) = \sum_{k=1}^{M} C(a_i|s_k)\,P(s_k|x)$.
  2. Bayesian risk
    • A decision function D(x) maps from an observation x to a decision/action.
    • The total risk of a decision function is given by
      • Continuous: $E_{P(x)}[R(D(x)|x)] = \int_x R(D(x)|x)\,P(x)\,dx$
      • Discrete: $E_{P(x)}[R(D(x)|x)] = \sum_x R(D(x)|x)\,P(x)$
    • A decision function is optimal if it minimizes the total risk. This optimal total risk is called Bayes risk.
  3. Two-class Bayes Classifier
    • Let M=2, and compare R(a1|x) with R(a2|x)
    • We can derive $R(a_1|x) < R(a_2|x) \iff \frac{P(x|s_1)}{P(x|s_2)} > \frac{P(s_2)\,[C(a_1|s_2) - C(a_2|s_2)]}{P(s_1)\,[C(a_2|s_1) - C(a_1|s_1)]}$, i.e., the likelihood ratio is compared against a constant threshold
    • We also notice that $\frac{P(x|s_1)}{P(x|s_2)} \propto \frac{P(s_1|x)}{P(s_2|x)}$, i.e., the likelihood ratio is proportional to the posterior ratio
  4. Minimum-error-rate classification
    • Consider the zero-one cost function: $C(a_i|s_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$
    • To minimize the average probability of error, we should select the $s_i$ with the maximal posterior probability $P(s_i|x)$
    • Classify $x$ as $s_i$ if $P(s_i|x) > P(s_j|x)$ for all $j \neq i$
  5. Classifiers and discriminant functions
    • A pattern classifier can be represented by a set of discriminant functions {gi(x)}
    • The classifier assigns a sample $x$ to $s_j$ if $g_j(x) > g_i(x)$ for all $i \neq j$
    • If we replace every gi(x) by f(gi(x)), where f is a monotonically increasing function, the resulting classification is unchanged
    • A classifier that uses linear discriminant functions is called a linear classifier
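
To tie the section together, here is an illustrative sketch of the two-class likelihood-ratio rule and the minimum-error-rate rule; all priors, costs, and class-conditional densities below are made-up choices, not from the notes. Under zero-one cost the two decisions coincide.

```python
import numpy as np

def gauss(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical two-state problem: class-conditional densities, priors, and costs.
prior = {1: 0.6, 2: 0.4}
like = {1: lambda x: gauss(x, mu=0.0, sigma=1.0),
        2: lambda x: gauss(x, mu=2.0, sigma=1.0)}
C = {(1, 1): 0.0, (1, 2): 1.0,   # C[(i, k)] = cost of action a_i when the state is s_k
     (2, 1): 1.0, (2, 2): 0.0}   # zero-one cost -> minimum-error-rate classification

def decide(x):
    # Likelihood-ratio test: choose a_1 iff the ratio exceeds the constant threshold.
    ratio = like[1](x) / like[2](x)
    threshold = (prior[2] * (C[(1, 2)] - C[(2, 2)])) / (prior[1] * (C[(2, 1)] - C[(1, 1)]))
    return 1 if ratio > threshold else 2

def decide_map(x):
    # With zero-one cost this reduces to picking the state with the larger posterior.
    post = {k: like[k](x) * prior[k] for k in (1, 2)}   # unnormalized P(s_k | x)
    return max(post, key=post.get)

for x in (-1.0, 1.0, 3.0):
    print(x, decide(x), decide_map(x))   # the two rules agree under zero-one cost
```
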

Maximum Likelihood Estimation (MLE) and the Expectation-Maximization (EM) Algorithm