Introduction
Background Concepts
Random variables
Probability
- Bayes' theorem
- Chain rule
- Independence
- Conditional independence
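The probability rules above can be checked numerically. A minimal sketch of Bayes' theorem and the chain rule, using hypothetical numbers for a disease-testing scenario (all values are illustrative, not from the notes):

```python
# Hypothetical numbers: P(D) prior, P(T|D) sensitivity, P(T|~D) false-positive rate.
p_d = 0.01           # P(disease)
p_t_given_d = 0.95   # P(positive test | disease)
p_t_given_nd = 0.05  # P(positive test | no disease)

# Law of total probability: P(T) = P(T|D)P(D) + P(T|~D)P(~D)
p_t = p_t_given_d * p_d + p_t_given_nd * (1 - p_d)

# Bayes' theorem: P(D|T) = P(T|D)P(D) / P(T)
p_d_given_t = p_t_given_d * p_d / p_t

# Chain rule: P(D, T) = P(T|D) P(D)
p_d_and_t = p_t_given_d * p_d
```

Despite the accurate test, the posterior `p_d_given_t` is only about 0.16, because the prior is small.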
Probability distribution
- Gaussian
- Bernoulli
- Binomial
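The three distributions can be sampled with numpy to check their means against theory; a small sketch with arbitrary illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian N(mu, sigma^2): mean mu, variance sigma^2
mu, sigma = 2.0, 1.5
g = rng.normal(mu, sigma, size=100_000)

# Bernoulli(p): a single trial with success probability p; mean p
p = 0.3
b = rng.binomial(1, p, size=100_000)

# Binomial(n, p): number of successes in n Bernoulli trials; mean n*p
n = 10
bn = rng.binomial(n, p, size=100_000)
```

With 100,000 samples, each empirical mean lands close to its theoretical value (mu, p, and n·p respectively).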
Descriptive statistics
- Expectation
- Variance
- Covariance
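The three descriptive statistics above can be computed from their definitions; a minimal sketch on a tiny hypothetical dataset:

```python
import numpy as np

# A tiny paired dataset (hypothetical values).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

e_x = x.mean()                                # E[X]
var_x = ((x - e_x) ** 2).mean()               # Var(X) = E[(X - E[X])^2]
cov_xy = ((x - e_x) * (y - y.mean())).mean()  # Cov(X, Y) = E[(X - E[X])(Y - E[Y])]

# Sanity checks against numpy's built-ins (population versions, ddof=0)
assert np.isclose(var_x, x.var())
assert np.isclose(cov_xy, np.cov(x, y, ddof=0)[0, 1])
```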
Entropy
- Definition: $H(X) = -\sum_x p(x)\log p(x)$
- Important in coding theory, statistical physics, machine learning
- Conditional entropy
  - Definition: $H(Y|X) = -\sum_{x,y} p(x,y)\log p(y|x)$, or $H(Y|X) = \sum_x p(x)\,H(Y|X=x)$
  - Property: $H(X,Y) = H(X) + H(Y|X)$
- Mutual information
  - Definition: $I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$, or $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
  - Property: $I(X;Y) = I(Y;X) \ge 0$
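The identities above can be verified on a small joint distribution. A sketch with hypothetical probabilities over two binary variables, using base-2 logs:

```python
import numpy as np

# Joint distribution p(x, y) over two binary variables (hypothetical numbers).
p_xy = np.array([[0.4, 0.1],
                 [0.2, 0.3]])
p_x = p_xy.sum(axis=1)  # marginal p(x)
p_y = p_xy.sum(axis=0)  # marginal p(y)

def H(p):
    """Entropy H(p) = -sum p log2 p, with the convention 0 log 0 = 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

h_x, h_y = H(p_x), H(p_y)
h_xy = H(p_xy.ravel())

# Conditional entropy via the chain rule: H(Y|X) = H(X,Y) - H(X)
h_y_given_x = h_xy - h_x

# Mutual information: I(X;Y) = H(Y) - H(Y|X)
i_xy = h_y - h_y_given_x
```

Here `p_x` is uniform, so `h_x` is exactly 1 bit, and `i_xy` comes out positive because the two variables are dependent.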
Kullback-Leibler Divergence
- Definition: $D_{KL}(p\,\|\,q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$
- Property: $D_{KL}(p\,\|\,q) \ge 0$, with equality iff $p = q$
- Property: $D_{KL}(p\,\|\,q) \ne D_{KL}(q\,\|\,p)$ in general (not symmetric)
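Both properties can be checked directly from the definition; a minimal sketch with two hypothetical distributions:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p), np.asarray(q)
    mask = p > 0  # terms with p(x) = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])

assert kl(p, q) >= 0                        # non-negativity (Gibbs' inequality)
assert kl(p, p) == 0                        # zero when the distributions match
assert not np.isclose(kl(p, q), kl(q, p))   # not symmetric
```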
Convex and concave functions
Vector, vector operations, matrix, matrix multiplication
Matrix calculus
Function approximation
- Sum of squared errors (SSE) $= \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
- Root-mean-square error (RMSE) $= \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$
- Model selection
  - Select a model that fits the data well and is as simple as possible
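The two error measures follow directly from their formulas; a minimal sketch with hypothetical targets and predictions:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # observed targets (hypothetical)
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions (hypothetical)

sse = np.sum((y_true - y_pred) ** 2)       # SSE = sum_i (y_i - yhat_i)^2
rmse = np.sqrt(sse / len(y_true))          # RMSE = sqrt(SSE / N)
```

RMSE is just SSE averaged over N and square-rooted, so it stays in the same units as the targets.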
Linear Regression and Logistic Regression
- Regression models
  - Answer the question: what is the relationship between the dependent variable and the independent variables?
  - Given a dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$, build a function $f$ to predict the output of a new sample
- Linear model: $f(\mathbf{x}) = \mathbf{w}^\top\mathbf{x}$ (a bias term can be absorbed into $\mathbf{w}$ by appending a constant 1 to $\mathbf{x}$)
- Estimate parameters
  - Err$(\mathbf{w}) = \sum_{i=1}^{N} (y_i - \mathbf{w}^\top\mathbf{x}_i)^2 = \|\mathbf{y} - X\mathbf{w}\|^2$
  - Solution: if $X$ is full rank, then $\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}$
  - If $X^\top X$ is singular, then some features are linear combinations of other features, so we need to reduce the number of features to make $X$ full rank
- Linear basis function model
  - Replace $\mathbf{x}$ with $\boldsymbol{\phi}(\mathbf{x})$, where $\{\phi_j(\mathbf{x})\}$ is a set of non-linear basis functions/kernel functions
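The least-squares solution above can be sketched in a few lines. This example generates synthetic data from an assumed linear model and recovers the parameters via the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known linear model y = 2x + 1 plus small noise (assumed setup).
x = rng.uniform(-1, 1, size=200)
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=200)

# Design matrix with a bias column; w_hat = (X^T X)^{-1} X^T y when X is full rank.
X = np.column_stack([x, np.ones_like(x)])
w_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solve, rather than explicitly invert
```

Using `np.linalg.solve` (or `lstsq`) instead of forming the inverse is the numerically preferred way to evaluate the closed-form solution; the recovered `w_hat` is close to the true `[2.0, 1.0]`.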
- Logistic regression & classification
- Let $P(y=1\mid\mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top\mathbf{x}}}$, and $P(y=0\mid\mathbf{x}) = 1 - \sigma(\mathbf{w}^\top\mathbf{x})$
- Classification rule: assign $\mathbf{x}$ to class 1 if $P(y=1\mid\mathbf{x}) > 0.5$, otherwise assign to class 0
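The classification rule can be sketched directly. The weights here are hypothetical placeholders (in practice they would be estimated, e.g. by maximum likelihood); the point is the sigmoid and the 0.5 threshold:

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Assume parameters have already been estimated; hypothetical values [w1, w2, b].
w = np.array([1.5, -2.0, 0.5])

def classify(x):
    """Assign class 1 if P(y=1|x) = sigmoid(w^T x + b) > 0.5, else class 0."""
    z = w[:2] @ x + w[2]
    return 1 if sigmoid(z) > 0.5 else 0
```

Since `sigmoid(z) > 0.5` exactly when `z > 0`, the decision boundary is the hyperplane $\mathbf{w}^\top\mathbf{x} + b = 0$, i.e. logistic regression is a linear classifier.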
Principal Component Analysis
Linear Discriminant Analysis
Bayesian Decision Theory
- Formula
  - Given state $\omega \in \{\omega_1, \ldots, \omega_c\}$.
  - Let decision $\alpha \in \{\alpha_1, \ldots, \alpha_a\}$.
  - Cost function $\lambda(\alpha_i \mid \omega_j)$ is the loss incurred for taking $\alpha_i$ when $\omega_j$ is the true state. $P(\omega_j)$ is the prior probability of $\omega_j$.
  - Let the observation be represented as a feature vector $\mathbf{x}$. $p(\mathbf{x} \mid \omega_j)$ is the probability of $\mathbf{x}$ conditioned on $\omega_j$ being the true state.
  - Expected loss before observing $\mathbf{x}$: $R(\alpha_i) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j)$.
  - Expected loss after observing $\mathbf{x}$: $R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\,P(\omega_j \mid \mathbf{x})$.
- Bayesian risk
  - A decision function $\alpha(\mathbf{x})$ maps from an observation to a decision/action.
  - The total risk of a decision function is given by $R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$
    - Discrete: $R = \sum_{\mathbf{x}} R(\alpha(\mathbf{x}) \mid \mathbf{x})\,P(\mathbf{x})$
  - A decision function is optimal if it minimizes the total risk. This optimal total risk is called the Bayes risk.
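The optimal decision function minimizes the conditional risk at every observation. A sketch with hypothetical priors, costs, and class-conditional probabilities for a binary observation:

```python
import numpy as np

# Two states, two actions (all numbers hypothetical).
prior = np.array([0.6, 0.4])         # P(w1), P(w2)
lam = np.array([[0.0, 2.0],          # lambda(a_i | w_j): loss of taking action i
                [1.0, 0.0]])         #   when w_j is the true state

# Class-conditional probabilities P(x | w_j) for a binary observation x in {0, 1}.
p_x_given_w = np.array([[0.8, 0.2],  # row j: state w_j; columns: x = 0, x = 1
                        [0.3, 0.7]])

def best_action(x):
    """Pick the action minimizing R(a_i | x) = sum_j lambda(a_i|w_j) P(w_j|x)."""
    joint = p_x_given_w[:, x] * prior
    post = joint / joint.sum()       # posterior P(w_j | x) by Bayes' theorem
    risk = lam @ post                # conditional risk of each action
    return int(risk.argmin())
```

Observing x shifts the posterior, which in turn can flip the risk-minimizing action; here the optimal action follows the more probable state under each observation.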
- Two-class Bayes Classifier
  - Let $R(\alpha_1 \mid \mathbf{x}) = \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})$, and compare with $R(\alpha_2 \mid \mathbf{x}) = \lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x})$
  - We can derive: decide $\omega_1$ if $\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}$, i.e. likelihood ratio > constant threshold
  - Also we notice that $\frac{P(\omega_1 \mid \mathbf{x})}{P(\omega_2 \mid \mathbf{x})} = \frac{p(\mathbf{x} \mid \omega_1)\,P(\omega_1)}{p(\mathbf{x} \mid \omega_2)\,P(\omega_2)}$, i.e. the likelihood ratio is proportional to the posterior ratio
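The likelihood-ratio test can be sketched with concrete (hypothetical) costs and priors; with symmetric 0-1 costs and equal priors the threshold reduces to 1:

```python
import numpy as np

# Hypothetical two-class setup.
prior = np.array([0.5, 0.5])
lam = np.array([[0.0, 1.0],   # [lambda_11, lambda_12]
                [1.0, 0.0]])  # [lambda_21, lambda_22]

# Constant threshold: (lambda_12 - lambda_22) P(w2) / ((lambda_21 - lambda_11) P(w1))
threshold = (lam[0, 1] - lam[1, 1]) * prior[1] / ((lam[1, 0] - lam[0, 0]) * prior[0])

def decide(p_x_w1, p_x_w2):
    """Decide w1 iff the likelihood ratio p(x|w1)/p(x|w2) exceeds the threshold."""
    return 1 if p_x_w1 / p_x_w2 > threshold else 2
```

Changing the costs or priors only moves the threshold; the statistic being compared is always the likelihood ratio.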
- Minimum-error-rate classification
  - Consider the zero-one cost function: $\lambda(\alpha_i \mid \omega_j) = 0$ if $i = j$, and $1$ otherwise
  - Then $R(\alpha_i \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x})$, so to minimize the average probability of error, we should select the $\omega_i$ with the maximal posterior probability
  - Classify $\mathbf{x}$ as $\omega_i$ if $P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x})$ for all $j \ne i$
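Under the zero-one cost, minimizing the conditional risk coincides with picking the maximal posterior. A short check with hypothetical posteriors over three classes:

```python
import numpy as np

# 0-1 cost over three classes: lambda(a_i|w_j) = 0 if i == j else 1.
lam = 1.0 - np.eye(3)

post = np.array([0.2, 0.5, 0.3])   # hypothetical posteriors P(w_i | x)

# R(a_i|x) = sum_j lambda(a_i|w_j) P(w_j|x) = 1 - P(w_i|x)
risk = lam @ post

# Minimizing risk is the same as maximizing the posterior (the MAP rule).
assert risk.argmin() == post.argmax()
```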
- Classifiers and discriminant functions
- A pattern classifier can be represented by a set of discriminant functions $g_i(\mathbf{x})$, $i = 1, \ldots, c$
  - The classifier assigns a sample $\mathbf{x}$ to $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \ne i$
  - If we replace every $g_i(\mathbf{x})$ by $f(g_i(\mathbf{x}))$, where $f$ is a monotonically increasing function, the resulting classification is unchanged
  - A classifier that uses linear discriminant functions is called a linear classifier
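The invariance under monotonically increasing transforms can be checked directly. A sketch with hypothetical linear discriminants $g_i(\mathbf{x}) = \mathbf{w}_i^\top\mathbf{x} + b_i$ for three classes:

```python
import numpy as np

# Hypothetical linear discriminant parameters: one row of W and one b per class.
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.1, 0.2])

def classify(x, f=lambda g: g):
    """Assign x to the class whose (transformed) discriminant is largest."""
    g = W @ x + b
    return int(np.argmax(f(g)))

x = np.array([0.4, 0.9])
# A monotonically increasing f preserves the argmax, hence the decision.
assert classify(x) == classify(x, f=np.exp)
```

Because only the ordering of the $g_i$ matters, discriminants can be rescaled, shifted, or passed through a log or exp without changing any decision boundary.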