COSI 127B Database Management Systems

Lecture 01 2022/01/18 Introduction

Introduction
- A database management system (DBMS) is a collection of interrelated data and a set of programs to access those data
- Database: a collection of data, contains information relevant to an enterprise
- The primary goal of a DBMS is to provide a way to store and retrieve database information that is both convenient and efficient
- DBMS
  - Manage data: define structures for storage, provide mechanism for information manipulation
  - Ensure data safety
Database System Applications
- Representative applications
  - Enterprise information: sales, accounting, human resources
  - Manufacturing
  - Banking and finance: banking, credit card transactions, finance
  - Universities
  - Airlines
  - Telecommunication
  - Web-based services: social media, online retailers, online advertisements
  - Document databases
  - Navigation systems
- Two models where database are used
  - Support online transaction processing
  - Support data analytics
Purpose of Database Systems
- Reduce data redundancy and inconsistency
- Reduce difficulty in accessing data
- Data isolation
- Reduce integrity problems
- Atomicity problems (failure consistency)
- Concurrent-access anomalies
- Reduce security problems
View of Data
- Data Model: a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints
- Data model categories
  - Relational Model: use a collection of tables/relations to represent both data and the relationships among those data, the relational model is an example of a record-based model (attribute and record)
  - Entity-Relational (ER) Model: use a collection of basic objects (called entities) and relationships among these objects
  - Semi-structured Data Model: permit the specification of data where individual data items of the same type may have different sets of attributes. Examples include JSON and XML
  - Object-based Data Model: databases allow procedures to be stored in the database and executed by the database system, this can be an extension to the relational model with notions of encapsulation, methods, and object identity
- Data abstraction
  - Hide concrete and complex details of databases from users
  - Level of abstraction
    - Physical level: lowest abstraction level describes how data are actually stored in the database
    - Logical level: describes what data are stored in the database and what relationships exist among those data. There are simple structures at this level, but they do not to involve the complexity of teh physical level, this feature is called the physical data independency
    - View level: highest level of abstraction, describes only part of the database, a database can have many view
- Instances and schemas
  - Instance: the collection of information stored in teh database at a particular moment is called an instance of the database
  - Schema: the overall design of the database is called the database schema. There can bse physical schema, logical schema, and sub-schemas (view level schema)
Database Language
- A database provides a data-definition language (DDL) to specify the database schema and a data-manipulation language (DML) to express database queries and updates
- Consistency constraints in DDL
  - Domain constraints
  - Referential integrity
  - Authorization: CRUD authorization
- DML
  - Types:
    - Procedural DML: what data are needed and how to get them
    - Declarative DML: what data are needed, do not need to specify how to get those data
  - Sometimes query language and DML are used synonymously
Database Design
- Mainly involves the design of database schema
- Conceptual design -> logical design -> physical design
Database Engine
- Functional components: storage manager, query processor, transaction management component
- Storage manager
  - Database typically requires a large amount of storage space, the data are usually stored on SSDs
  - Components
    - Authorization and integrity manager
    - Transaction manager
    - File manager
    - Buffer manager
  - Implement the following physical components
    - Data files: database itself
    - Data dictionary: metadata
    - Indices: enable fast data access
- Query processor
  - It helps database to simplify and facilitate access to data
  - Components
    - DDL interpreter
    - DML compiler: compiles DML to lower level instructions, also conducts query optimization
    - Query evaluation engine
- Transaction manager: allows application developer to treat database in a higher level of abstraction
  - Transaction ACID: atomicity, consistency, isolation, durability
  - The transaction manager consists of the concurrency-control manager and the recovery manager
Database and Application Architecture
- Architecture

Lecture 02 2022/01/21 Relational Model

Structure of Relational Databases
- Notation
  - Table/relation: a tiny difference is that there can be identical rows in table, but there cannot be duplicates in a relation
  - Attribute/column, each attribute has a domain, the domain should be atomic (indivisible)
  - Row/tuple/record
  - Relation instance: a specific instance of a relation, containing a specific set of rows
  - Null value: the value is unknown or does not exist, null is a member of every domain
    - null AND true = null
    - null OR true = true
    - null AND false = false
    - null OR false = null
Database schema
- Relational schema (logical thing): $R = (A_{1}, A_{2}, . . ., A_{n})$ , $A_{i}$ is a column/attribute
- Relational instance (snapshot of data at a time): $r (R) = (a_{1}, a_{2}, . . ., a_{n})$ , $a_{i}$ is a value of the attribute $A_{i}$
Keys
- A tuple should be uniquely identified by its attribute values, no two tuples in a relation are allowed to have exactly the same value for all attributes
- Superkey: a set of one ore more attributes that allow user to uniquely identify a tuple in the relation
- Candidate keys: the minimal subset of superkey that the attributes inside can still uniquely identify a tuple
- Primary key
  - A candidate key that is chose by the database designer as the primary key
  - Primary keys are also referred to as primary key constraints
  - The attribute values of a primary should never, or rarely, change
- Foreign key
  - A foreign key constraint states a one-to-one relationship between two relations, the attribute in foreign key constraint is called the foreign key, the two relations are called referencing relation and referenced relation
  - The referenced attribute should be the primary key of the referenced relation
Relational Query Languages
- Query language: a language in which a user requests information from the database
- Categories
  - Imperative query language: user instructs the system to perform a specific sequence of operations on the database to compute the desired result
  - Functional query language: the computation is expressed as the evaluation of functions that may operate on data in teh database or on the results of other functions
  - Declarative query language: the user describes the desired information without giving a specific sequence of steps or function calls for obtaining that information
- Pure query language categories
  - Relational algebra
  - Tuple relational calculus
  - Domain relational calculus
The Relational Algebra
- The relational algebra consists of a set of operations that take one or two relations as input and produce a new relation as their result
- Categories
  - Unary operations: select, project, rename, ...
  - Binary operations: union, product, difference, join, ...
- Select operation
  - Select tuples that satisfy a given predicate
  - Syntax: $σ_{p r e d i c a t e} (r e l a t i o n)$
  - Example: $σ_{s a l a r y > 1000 \land d e p t_n a m e =^{'} p h y s i c s^{'}} (s t a f f)$
    1
    2
    3
    SELECT *
    FROM staff
    WHERE salary > 1000 AND dept_name = "physics";
- Project operation
  - Return its argument relation, with certain attributes left out, duplicate rows are removed
  - Syntax: $Π_{a t t r i b u t e s} (r e l a t i o n)$
  - Example: $Π_{n a m e, s a l a r y / 12} (s t a f f)$
    1
    2
    3
    4
    -- distinct name, salary/12 means two tuples are identical and
    -- will be removed only when both name and monthly salary are the same
    SELECT distinct name, salary/12
    FROM staff;
  - A nested example: $Π_{n a m e, s a l a r y / 12} (σ_{s a l a r y > 1000 \land d e p t_n a m e =^{'} p h y s i c s^{'}} (s t a f f))$
    1
    2
    3
    SELECT distinct name, salary/12
    FROM staff
    WHERE salary > 1000 AND dept_name = "physics";
- The cartesian product
  - Combine information from two relations, the result is not the same as the cartesian product in mathematical definition
  - Syntax: $A \times B$
  - Example: $s t a f f \times c o u r s e$
    1
    2
    3
    4
    5
    SELECT *
    FROM staff, course;
    -- use cross join
    SELECT *
    FROM staff CROSS JOIN course;
- Join operation
  - Combine a select operation and a Cartesian product operation into a single operation
  - Syntax: $r ⋈_{θ} s = σ_{θ} (r \times s)$
- Set Operation
  - Union: $r \cup s$
  - Intersection: $r \cap s$
  - Difference: $r - s$
- Assignment Operation
  - Assign a relational-algebra expression to a temporary variable
  - $r \leftarrow E$
- Rename Operation
  - Give names to result of relational-algebra expressions
  - $ρ_{n e w} (o l d)$

Lecture 03 2022/01/25 Basic SQL

DDL

Domain types
- char(n): fixed length character string
- varchar(n): variable length character string n, maximum length is n
- int: integer
- smallint: small integer
- numeric(p, d): fixed point number, with precision of p digits and d digits to the right of decimal point
- real, double precision: floating point and double-precision floating point numbers with machine-dependent precision
- float(n): floating point number , with precision at least n digits

Create a table

Syntax

create table table_name (
  attribute_name data_type,
  constraint_name (attribute_name),
);

create table student (
  id int,
  name varchar(20) not null,
  department varchar(20),
  primary key (id),
  foreign key(department) references department
);

Delete a table

Syntax

1 2	drop table table_name; drop table student;

Add/remove attributes in a table

Syntax

/* Add one attribute into a table*/
  alter table table_name add attribute_name type;
  alter table student add gpa float(3);

  /* Delete one attribute from a table*/
  alter table table_name drop attribute_name;
  alter table table_name drop gpa;

DML

Queries on a single relation

Basic query

/* Select all tuples with only two attributes*/
select attribute1, attribute2 from table_name;
select id, name from student;
/* Select all tuples with all attributes*/
select * from table_name
select * from student;

Duplicates

/* Select unique tuples with respect to the attribute*/
select distinct attribute_name from table_name;
/* Select all tuples with respect to the attributes*/
select all attribute_name from table_name;

Query with predicates

/*Select tuples that satisfy a given predicate*/
select attribute_name from table_name where condition;
select id, name from student where department = 'computer science';

/* Use like to match pattern*/
where name like 'a%' /* Where name starts with a*/
/* Numerical range, both bound are inclusive*/
where gpa between 3.5 and 4.0; /*3.5 <= gpa <= 4.0*/

Alias

select a.id, a.name 
from student a, student b
where a.department = 'computer science' 
and b.department = 'computer science' 
and a.gpa > b.gpa;

Queries on multiple relations

select a.id, a.name 
from student a, student b
where a.department = 'computer science' 
and b.department = 'computer science' 
and a.gpa > b.gpa;

Additional basic operations
- String operations
  - %: matches any substring
  - _: matches any character
  - like: matches a pattern

Lecture 07 2022/02/08

Common mistakes in E-R model
- Use of the primary key of an entity set as an attribute of another entity set, instead of using a relationship (if there is a relationship, then remove the foreign key)
- Designate the primary key attributes of the related entity sets as attributes of the relationship set (they are already implicit in the relationship set)
- Use a relationship with a single-value attribute in a situation that requires a multi-valued attribute

Lecture 08 2022/02/11

Good relational designs Ch 7.1
- Good schemas should contain no redundancy and no missing information
- To remove the data repetition problem, we can decompose one schema into multiple smaller ones
- If information is missing after decomposition, we call it a lossy decomposition, otherwise it's a lossless decomposition
- Let R, $R_{1}$ , $R_{2}$ the set of attributes, if $R = R_{1} \cup R_{2}$ , then the decomposition is lossless. Equivalently, if $ r = {R_1} (r) {R_2}(r)$, then the decomposition is lossless, if $r \subset Φ_{R_{1}} (r) ⋈ Φ_{R_{2}} (r)$ , the decomposition is lossy
- Normalization
  - The method for designing a relational database is to use a process commonly known as normalization
  - The goal is to generate a set of relation schemas that store information without unnecessary redundancy, yet allows to retrieve information easily
  - Procedure
    - Decide whether a schema is a good one
    - If a schema is not in good form, decompose it into smaller ones, each of which is in an appropriate normal form. The decomposition should be lossless
Decomposition using functional dependencies Ch 7.2
- Legal instance
  - There are some real-world constraints that real-world data should follow, since database models real-world data, schemas should also follow these constraints
  - An instance of a relation that satisfies all such real-world constraints is called a legal instance of the relation; a legal instance of a database is one where all the relation instances are legal instances
- Notations: $α$ for attribute, r for relation, R for schema, K for superkey
- Functional dependency: denote relational schema r(R), $α \subseteq R$ , and $β \subseteq R$ ,
  - Given an instance of r(R), we say that the instance satisfies the functional dependency $α \to β$ if for all pairs of tuples $t_{1}$ and $t_{2}$ in the instance such that $t_{1} [α] = t_{2} [α]$ , we also have $t_{1} [β] = t_{2} [β]$
  - We say that the functional dependency $α \to β$ holds on schema r(R) if every legal instance of r(R) satisfies the functional dependency
  - K is a superkey for r(R) is $K \to R$ holds on r(R)
  - A functional dependency is trivial if it is satisfied by all relations
  - Closure: Given a set F set of functional dependencies, there are certain other functional dependencies that are logically implied by F. The set of all functional dependencies logically implied by F is the closure of F, denoted $F^{+}$
- If $R_{1}$ and $R_{2}$ is a lossless decomposition of R, then at least on of the following functional dependencies hold: $R_{1} \cap R_{2} \to R_{1}$ , $R_{1} \cap R_{2} \to R_{2}$ , which means $R_{1} \cap R_{2}$ is the superkey for either $R_{1}$ or $R_{2}$
Normal Form Ch 7.3
- Boyce-Codd normal form (BCNF)
  - The most desirable normal form, it eliminates all redundancy that can be discovered based on functional dependencies
  - Definition: a relational schema R is in BCNF with respect to a set F of functional dependencies if, for all functional dependencies in $F^{+}$ of the form $α \to β$ , at least one of the following holds:
    - $α \to β$ is trivial ( $β \subseteq α$ )
    - $α$ is a superkey for schema R
  - A database design is in BCNF is each member of the set of relation schemas that constitutes the design is in BCNF
  - Decomposition of non BCNF schema into BCNF schema: suppose for schema R, $α - \to β$ is not BCNF, then we can decompose R into $α \cup β$ and $R - (β - α)$
  - It's not always possible to achieve both BCNF and dependency preserving
- Third normal form (3NF)
  - Definition: A relation schema R is in third normal form with respect to a set F of functional dependencies if, for all functional dependencies in $F^{+}$ of the form $α \to β$ , at least one of the following hold:
    - $α \to β$ is trivial ( $β \subseteq α$ )
    - $α$ is a superkey for schema R
    - Each attribute A in $β - α$ is contained in a candidate key for R
  - Every schema that has 3NF is dependency preserving
  - Advantages: it's always possible to obtain a 3NF design without sacrificing loss or dependency preservation
  - Disadvantages: there is still problem of repetition of information, and we have to use null values to represent some of the possible meaningful relationships among data items

MySQL Crash Course

Understanding SQL
- A database is a container to store organized data
- Do not misuse database with DBMS
- A table is a structured list of data of a specific type
  - Tables in the same database cannot have the same name, tables in different databases can have the same name
- Schema: information about database and table layout and properties
- Column: a single field in a table, all tables are made up of one ore more columns
  - Each column has its datatype
- Row: a record in a table
- Primary key: a column (or set of columns) whose values uniquely identify every row in a table
  - Always define primary keys, even though primary keys are not required
  - Primary key best practices
    - Do not update values in primary key columns
    - Do not reuse values in primary key columns
    - Do not use values that might change in primary key columns
- SQL is a language designed specifically for communicating with databases
Introducing MySQL
- Some reasons
  - MySQL is open-source and free to use
  - MySQL is fast
  - MySQL is trusted
  - MySQL is easy to use
- DBMS categories
  - Shared file-based: Microsoft Access, FileMaker
  - Client-server based: MySQL, Oracle, Microsoft SQL Server
- MySQL tools
  - Command-line utility: mysql
  - MySQL administrator
  - MySQL query browser
Working with MySQL
- Select a database: use database_name;
- Display all databases: show databases;
- Display all tables in the current database: show tables;
- Display all columns in a table: show columns from table_name;, you can also use the short form: describe table_name;
- Display server status: show status;
- Display server error and warning: show errors;, show warnings;
- Display database information: show create database database_name;
- Display table information: show create table table_name;
- Display security rights granted to users: show grants;
Retrieving data
- Retrieve one column: select col_name from tbl_name;
- Retrieve multiple columns: select col1_name, col2_name from tbl_name;
- Retrieve all columns: select * from tbl_name;
- Retrieve distinct rows: select distinct col_name from tbl_name;
- Limiting results: select col_name from tbl_name limit n;
- Limiting results with a starting row: select col_name from tbl_name limit n, m;, this select m rows starting from row n + 1, there is an alternative approach in which you can explicitly specify the offset: select col_name from tbl_name limit m offset n;
- Use fully qualified table names: select tbl_name.col_name from db_name.tbl_name;
Storing retrieved data
- Sorting data: select col_name from tbl_name order by col_name;, it's not required that the sorting column is included in the result set
- Sorting by multiple columns: select col_name from tbl_name order by col1_name, col2_name;, this means we first sort by col1, if the values are equal, we then sort by col2
- Specify sorting direction: select col_name from tbl_name order by col_name desc, the default direction is ascending, otherwise you need to explicitly specify the direction is descending
- Specify sorting direction for each column: select col_name from tbl_name order by col1_name, col2_name desc;
- Top k question: select col_name from tbl_name order by col_name desc limit k;
Filtering data
- Use where condition to filter data: select col_name from tbl_name where condition;
- Some operators: =, <>, !=, <, >, <=, >, >=, between and
- Check for no value: where vol_name is null;
Advanced data filtering
- Combine conditions: and, or
- Match one of the condition: where col_name in (val1, ..., valn);
- Negate a condition: where col_name not in (val1, ..., valn);, you can use not together with in, between, and exists
Using wildcard