COSI 103A Software Engineering

The Linux Command Line

Chapter 1: What is the shell
- The shell is a program that takes keyboard commands and passes them to the operating system to carry out
- bash is a shell program from the GNU project; the name is an acronym for Bourne Again Shell
- We need another program called a terminal emulator to interact with the shell
- Some simple commands
  - date: display the current time and date
  - cal: display a calendar of the current month
  - df: display the current amount of free space on the disk
  - free: display the amount of free memory
  - exit: end a terminal session; you can also press Ctrl+D
Chapter 2: Navigation
- Some commands:
  - pwd: print name of current working directory
  - ls: list directory contents
  - cd: change directory
- A Unix-like OS organizes files in a hierarchical directory structure. The first directory is the root directory.
- Windows has a separate file system tree for each storage device, whereas Unix-like systems always have a single file system tree
- Absolute pathnames start with / (the root directory); relative pathnames start from the working directory. ./ refers to the current directory and ../ refers to its parent directory
- Filename:
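  - File names that begin with a period are hidden; use ls -a to display hidden files and directories
  - File names are case sensitive
  - Linux has no concept of file extensions: a file's contents, not its name, determine its type (although many applications still rely on extensions)
  - Avoid spaces in file names; use underscores instead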
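
The pathname rules above can be checked with a short Python sketch. It uses the standard `posixpath` module to resolve `./` and `../` the way a Unix-like system would; the working directory `/home/alice` and the file names are made up for illustration.

```python
import posixpath

cwd = "/home/alice"  # hypothetical working directory

# A relative pathname is interpreted starting from the working directory;
# normpath() collapses the ./ and ../ components.
relative = "docs/../music/./song.txt"
absolute = posixpath.normpath(posixpath.join(cwd, relative))
print(absolute)

# An absolute pathname starts with / and ignores the working directory.
from_root = posixpath.normpath(posixpath.join(cwd, "/etc/../usr/bin"))
print(from_root)

# Names that begin with a period are hidden from a plain ls,
# and names that differ only in case are different files.
names = [".bashrc", "notes.txt", ".config", "Notes.txt"]
visible = [n for n in names if not n.startswith(".")]
print(visible)
```

This only manipulates strings; it does not touch the real file system.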
Chapter 3: Exploring the System
- Some commands:
  - ls: list directory contents
  - file: determine file type
  - less: view file contents
- ls
  - ls -a: display all files
  - ls -A: display almost all files, do not list . and ..
  - ls -i: display the inode of the files
  - ls -l: display in long listing format
  - ls -r: reverse the order of the listing
  - ls -t: sort contents by modification time, newest first
  - ls -S: sort contents by file size, largest first
  - ls -R: recursive listing
  - ls -d: list the directory itself, not its contents
- Permissions
  - Example: -rwxr--r--
    - The first character is the file type: - for a regular file, d for a directory, l for a symbolic link
    - The next three characters are the permissions for the owner, the following three are for the group, and the last three are for everyone else
    - r: the file can be opened and read
    - w: the file can be written or truncated; whether it can be renamed or deleted is determined by the permissions of its directory
    - x: the file can be treated as a program and executed
- Some directories in Linux system
  - /: root directory
  - /bin: contains programs that must be present for the system to boot and run
  - /boot: contains the Linux kernel and the boot loader
  - /dev: contains a list of devices
  - /etc: contains all of the system-wide configuration files
  - /home: in general, each user is given a directory in /home
  - /lib: contains shared library files used by the core system programs
  - /mnt: contains mount points for removable devices
  - /opt: used to install optional software
  - /tmp: intended for the storage of temporary, transient files
  - /usr: contains all the programs and support files used by regular users
- Links
  - Soft link (symlink, symbolic link): a file that contains a reference to another file or directory in the form of a pathname
  - Hard link: an additional name (directory entry) for an existing file; it refers to the same inode and data, it is not a copy
Chapter 4 Manipulating Files and Directories
- Some commands:
  - cp: copy files and directories
  - mv: move/rename files and directories
  - mkdir: create directories
  - rm: remove files and directories
  - ln: create hard and symbolic links
- Wildcards
  - *: match any characters
  - ?: match any single character
  - [characters]: match any character that is a member of the set characters
  - [!characters]: match any character that is not a member of the set characters
  - [[:class:]]: match any character that is a member of the specified class
  - [:alnum:]: match any alphanumeric character
  - [:alpha:]: match any alphabetic character
  - [:digit:]: match any numeral
  - [:lower:]: match any lowercase letter
  - [:upper:]: match any uppercase letter
- mkdir dir1 ... dirn: make n directories
- cp item1 ... itemn dir: copy item1 to itemn to dir
  - cp -a: copy the files and directories and all of their attributes
  - cp -i: before overwriting an existing file, prompt the user for confirmation
  - cp -r: recursively copy directories and their contents
  - cp -u: only copy files that either don't exist or are newer than the existing corresponding files in the destination directory
  - cp x y:
    - If both x and y are files, overwrite y with x, if y does not exist, create y with x
    - If x is a file and y is a directory, copy x into y
    - If x and y are both directories, use cp -r x y, otherwise there will be an error
- mv item1 item2: move/rename item1 to item2, mv item0 ... itemn directory: move item0 to itemn into directory
  - mv -i: prompt the user for confirmation before overwriting an existing file
  - mv -u: only move files that either don't exist or are newer than the existing corresponding files in the destination directory
  - mv -v: display informative message as the move is performed
  - mv x y
    - If both x and y are files, overwrite y with x, if y does not exist, it is created with x
    - If x is a file and y is a directory, move x into y, y must exist
    - If x and y are both directories, if y does not exist, create y and move contents of x into y, then delete the empty x; if y does exist, move x and its contents into y
- rm item1 ... itemn
  - rm -i: prompt the user for confirmation before removing an existing file, otherwise the file will be deleted silently
  - rm -r: recursively delete directories and their contents; this option must be given whenever the target is a directory
  - rm -f: delete the targets and ignore nonexistent files and do not prompt
  - rm -v: display informative message as the deletion is performed
- ln file link: create a hard link to a file, ln -s item link: create a symbolic link to a file or directory
  - Hard links
    - A hard link cannot reference a file outside its own file system; this means a link cannot reference a file that is not on the same disk partition as the link itself
    - A hard link may not reference a directory
    - A hard link and the original file share the same inode, deleting the hard link does not affect the original file, the hard link still works if you delete the original file
  - Symbolic link
    - Symbolic links are created to overcome the limitations of hard links
    - It works by creating a special type of file that contains a text pointer to the referenced file or directory (similar to the shortcut in Windows)
    - A symbolic link and the original file have different inodes; the soft link will not work if the original file is moved or deleted, but deleting the soft link does not affect the original file
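
The difference between the two kinds of link can be observed with a small Python sketch. It assumes a Unix-like file system that supports both link types; the file names are hypothetical and everything happens in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")

    with open(original, "w") as f:
        f.write("hello\n")

    os.link(original, hard)     # like: ln original.txt hard.txt
    os.symlink(original, soft)  # like: ln -s original.txt soft.txt

    # A hard link shares the inode; the symlink itself has its own inode.
    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    link_inode_differs = os.lstat(soft).st_ino != os.stat(original).st_ino

    # Delete the original name and see which link still works.
    os.remove(original)
    with open(hard) as f:
        hard_still_readable = f.read() == "hello\n"
    soft_is_dangling = os.path.islink(soft) and not os.path.exists(soft)

print(same_inode, link_inode_differs, hard_still_readable, soft_is_dangling)
```

`os.lstat` inspects the link itself, while `os.stat` and `os.path.exists` follow it, which is why the dangling symlink reports as a link that does not exist.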
Chapter 5 Working with Commands
- Some commands:
  - type: indicate how a command name is interpreted
  - which: display which executable program will be executed
  - help: get help from shell built-ins
  - man: display a command's manual page
  - apropos: display a list of appropriate commands
  - info: display a command's info entry
  - whatis: display one-line manual page descriptions
  - alias: create an alias for a command
- Command types
  - An executable program: programs compiled into binaries
  - A command built into the shell itself: shell builtins
  - A shell function: shell scripts incorporated into the environment
  - An alias: user-defined commands, built from other commands
- type command: display the type of commands
- which command: only works for executable programs, not for builtins or aliases
- help command: help is available for each of the shell builtins, it shows the documentation of a command
- command --help: display usage information
- man command: display manual page
- apropos command: display appropriate commands
- whatis command: display one-line manual page descriptions
- info command: display a program's info entry
- alias command:
  - Use ; to separate commands in one line
  - Example: alias foo='cd /mnt; ls -lrt;'
  - Remove an alias: unalias foo
  - Show all aliases defined in the environment: alias
Chapter 6 Redirection
- Some commands:
  - cat: concatenate files
  - sort: sort lines of text
  - uniq: report or omit repeated lines
  - grep: print lines matching a pattern
  - wc: print newline, word and byte counts for each file
  - head: output the first part of a file
  - tail: output the last part of a file
  - tee: read from standard input and write to standard output and files
- Standard input (stdin, file descriptor 0), standard output (stdout, file descriptor 1), and standard error (stderr, file descriptor 2): by default, stdout and stderr are connected to the screen and stdin is attached to the keyboard
  - Redirecting output
    - Truncate or create a file: ls -l > file.txt
    - Append to an existing file: ls -l >> file.txt
    - Discard both standard error and standard output:
      - Older version: command > /dev/null 2>&1
      - Newer version: command &> /dev/null
- cat [..file]: read one or more files and copy them to the standard output
  - In most cases, cat can be thought of as analogous to the TYPE command in DOS (not the shell's type command)
  - If cat is given no arguments, it reads from standard input, which is usually the keyboard; type Ctrl+D to indicate EOF
  - Read input from the keyboard and save it to a file: cat > file.txt, then use Ctrl+D to indicate EOF
  - Read input from a file: cat < file.txt
- | pipelines
  - Pipeline feature of shell: with the pipe operator | (vertical bar), the standard output of one command can be piped into the standard input of another
  - Syntax: command1 | command2
  - Filters:
    - Pipelines are often used to perform complex operation on data, frequently, the commands used this way are referred to as filters
    - Filters take input, change it somehow, and then output it
    - sort: write sorted concatenation of all files to standard output
      - sort -f: ignore case, sort -r: reverse order, sort -R: random sort, sort -u: only output unique results
    - uniq: accepts a sorted list of data and removes any duplicates from the list
      - uniq is often used with sort
      - uniq -d: only print duplicate lines, one for each group, uniq -c: prefix lines by the number of occurrences
      - sort -u file gives the same result as sort file | uniq
    - wc: display the number of lines, words, and bytes contained in files
      - wc -l: display the number of lines, wc -w: display the number of words, wc -c: display the number of bytes, wc -m: display the number of characters
    - grep pattern [file...]: used to find text patterns
      - When grep encounters a pattern in the file, it prints out the lines containing it
      - Regular expressions are allowed in grep
      - grep -i: ignore case during search, grep -v: print only those lines that do not match the pattern
    - head/ tail
      - By default, print 10 lines
      - head -n m: print first m lines, tail -n m: print last m lines
      - Monitor file changes: tail -f file.txt
    - tee: read standard input and copies it to both standard output and to one or more files
      - This is often used as an intermediate step in a pipeline: it saves the intermediate output to a file while passing it on to the next command
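
Why uniq is usually paired with sort can be seen in a small Python emulation. This is only a sketch of the filters' behavior, not how the real tools are implemented; the helper names `uniq` and `uniq_c` and the sample data are made up.

```python
from itertools import groupby

lines = ["pear", "apple", "pear", "fig", "apple", "pear"]

def uniq(seq):
    """Collapse runs of adjacent equal lines, as uniq does."""
    return [key for key, _ in groupby(seq)]

def uniq_c(seq):
    """Like uniq -c: a (count, line) pair for each run of equal lines."""
    return [(len(list(group)), key) for key, group in groupby(seq)]

unsorted_result = uniq(lines)          # no adjacent duplicates, nothing removed
sorted_result = uniq(sorted(lines))    # like: sort | uniq
counted = uniq_c(sorted(lines))        # like: sort | uniq -c

print(unsorted_result)
print(sorted_result)
print(counted)
```

Because uniq only compares neighboring lines, the duplicates disappear only once sorting has placed them next to each other.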
Chapter 7 Seeing the World as the Shell Sees it
- Some commands:
  - echo: display a line of text
- Expansion
  - Pathname expansion: the mechanism by which wildcards work in the shell
  - Tilde expansion: echo ~: display the home directory
  - Arithmetic expansion: echo $((expression)): display the result of the arithmetic expression
    - Notice the division here is integer division
    - Expressions can be nested, but you need the $(()) for each arithmetic expression
  - Brace expansion
    - echo {a,b}-{1,2}: display a-1 a-2 b-1 b-2 (no spaces are allowed inside the braces)
    - echo {a..c}: display a b c
    - echo {001..6}: display 001 002 003 004 005 006
    - echo {{a,b},{1,2}}: display a b 1 2
  - Parameter expansion:
    - x=1; echo $x: display 1
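  - Quoting
    - ls -l "hello world": the double quotes suppress word splitting, so the name is passed as a single argument
    - echo this is \$100.0: the backslash escapes the $, so no parameter expansion takes place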
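
The brace and arithmetic expansions listed above can be mimicked in Python to show what the shell produces. This is an analogy under stated assumptions, not how bash implements expansion.

```python
from itertools import product

# echo {a,b}-{1,2}: every combination of the two brace lists, in order.
brace = " ".join(f"{x}-{y}" for x, y in product("ab", "12"))
print(brace)

# echo {001..6}: a zero-padded numeric range.
padded = " ".join(f"{n:03d}" for n in range(1, 7))
print(padded)

# echo $((7 / 2)): shell arithmetic is integer arithmetic.
quotient = 7 // 2
print(quotient)

# echo $(( $((5 ** 2)) * 3 )): a nested arithmetic expression.
nested = (5 ** 2) * 3
print(nested)
```

One caveat: Python's `//` rounds toward negative infinity while shell arithmetic truncates toward zero, so the two only agree for non-negative operands.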

Chapter 9 Permissions

Some commands:
- id: display user identity
- chmod: change a file's mode
- umask: set the default file permissions
- su: run a shell as another user
- sudo: execute a command as another user
- chown: change a file's owner
- chgrp: change a file's group ownership
- passwd: change a user's password
id
- In the Unix security model, a user has a user ID (uid) and a primary group ID (gid), and may belong to additional groups (groups)
- User accounts are defined in /etc/passwd and groups are defined in /etc/group; /etc/shadow holds information about users' passwords

File attributes

File types:
- -: a regular file
- d: a directory
- l: a symbolic link; for a symbolic link the remaining attributes are always rwxrwxrwx, but they are only dummy values, and the real file attributes are those of the file the symbolic link points to
- c: a character special file, this file type refers to a device that handles data as a stream of bytes
- b: a block special file, this file type refers to a device that handles data in blocks

Permission attributes

| Attribute | Files | Directories |
| --- | --- | --- |
| `r` | Allows a file to be opened and read | Allows a directory's contents to be listed if the execute attribute is also set |
| `w` | Allows a file to be written or truncated, but does not allow it to be renamed or deleted (that is determined by the directory's attributes) | Allows files within a directory to be created, deleted, and renamed if the execute attribute is also set |
| `x` | Allows a file to be treated as a program and executed | Allows a directory to be entered |

chmod
- Only the file's owner or superuser can change the mode of a file or directory
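- The mode can be written in two representations: the octal representation and the symbolic representation

  | Octal Representation | Binary | Symbolic Representation |
  | --- | --- | --- |
  | 0 | 000 | --- |
  | 1 | 001 | --x |
  | 2 | 010 | -w- |
  | 3 | 011 | -wx |
  | 4 | 100 | r-- |
  | 5 | 101 | r-x |
  | 6 | 110 | rw- |
  | 7 | 111 | rwx |
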
- Syntax
  - chmod 664 file: set file attribute to -rw-rw-r--
  - chmod u=rw,g=rw,o=r file: the same as the above command
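
The mapping between the octal and symbolic representations can be reproduced with Python's standard `stat` module. A minimal sketch follows; the helper name `symbolic` is made up for illustration.

```python
import stat

def symbolic(octal_mode):
    """Return the ls -l style mode string for an octal mode such as '664'."""
    bits = int(octal_mode, 8)
    # stat.filemode() also emits the file-type character;
    # S_IFREG marks a regular file, which is shown as a leading '-'.
    return stat.filemode(stat.S_IFREG | bits)

for mode in ("664", "755", "600"):
    print(mode, symbolic(mode))
```

Each octal digit is just the three r, w, x bits of one class (owner, group, others) read as a binary number.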
umask
- View current mask setting: umask
- Set mask value: umask 0022
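- Mask value interpretation: in 0xyz, the leading digit is for the special permissions (setuid, setgid, sticky bit) and is usually 0; x is for the user (owner), y is for the group, z is for others; a bit set in the mask removes the corresponding permission
- Octal mask value and the permission that remains

  | Value | Permission |
  | --- | --- |
  | 0 | rwx |
  | 1 | rw- |
  | 2 | r-x |
  | 3 | r-- |
  | 4 | -wx |
  | 5 | -w- |
  | 6 | --x |
  | 7 | --- |
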
- Common values:
  - 0022: 755 for directories, 644 for files
  - 0002: 775 for directories, 664 for files
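
The common values above follow from a simple bit operation: the mask is removed from 777 for new directories and from 666 for new files. A minimal Python sketch of that arithmetic (the helper name `apply_umask` is made up):

```python
def apply_umask(mask):
    """Return the default (directory, file) modes, as octal strings, for a mask."""
    directory = 0o777 & ~mask     # directories start from rwxrwxrwx
    regular_file = 0o666 & ~mask  # files start from rw-rw-rw-
    return f"{directory:03o}", f"{regular_file:03o}"

for mask in (0o022, 0o002, 0o077):
    print(f"{mask:04o}", apply_umask(mask))
```

The mask clears bits rather than being subtracted numerically, so a mask of 0077 gives 600 for files.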
Change identities
- Methods
  - Log out and log back in as the alternate user
  - Use the su command
  - Use the sudo command
- su: run a shell with substitute user and group ids
  - su -l user (abbreviated su - user): start a login shell as user; if user is not provided, the substitute user is the superuser (root)
  - Use exit to return to the original shell
  - Execute a single command as another user: su -c 'command'
- sudo: execute a command as another user
  - Allows an ordinary user to execute commands as a different user (usually the superuser) in a controlled way
  - This does not require the password of the superuser
  - List the commands the invoking user is allowed to run on the current host: sudo -l
chown
- chown user:group file: use user to change only the owner, user:group to change both the owner and the group, :group to change only the group, and user: to change the owner and set the group to that user's login group
- Superuser privilege is required for this command, so use sudo chown user:group file
chgrp
- In older versions of Unix, chown could not change group ownership, so chgrp was used for that purpose
passwd [user]
- Enter passwd to change the password of the current user

Chapter 10 Processes
- Some commands
  - ps: report a snapshot of current processes
  - top: display tasks
  - jobs: list active jobs
  - bg: place a job in the background
  - fg: place a job in the foreground
  - kill: send a signal to a process
  - killall: kill processes by name
  - shutdown: shutdown or reboot the system
- How processes work
  - When a system starts, the kernel launches a program called init; init then runs a series of shell scripts called init scripts, which start all the system services. Many services are implemented as daemon programs, so they run in the background
  - A program can launch other programs; in the process scheme this is expressed as a parent process producing a child process
  - The kernel maintains information about each process to help keep things organized, including the process ID (PID), the memory assigned to the process, and the process's readiness to resume execution
- View Processes: ps
  - Shows PID, TTY (teletype, the controlling terminal for the process), TIME (the amount of CPU time consumed by the process), CMD (the command executed by the process)
  - Options:
    - ps x: show all processes regardless of what terminal they are controlled by
    - STAT (state, reveals the current status of the process)
      - R: running or ready to run
      - S: sleeping, it is waiting for an event, such as keystroke or network packet
      - D: uninterruptible sleep, it is waiting for I/O such as a disk drive
      - T: stopped
      - Z: zombie, a child process that has terminated but has not been cleaned up by its parent
      - <: a high-priority process
      - N: a low-priority process
  - ps aux: gives more information
    - Information: USER, %CPU, %MEM, VSZ (virtual memory size), RSS (resident set size), START
- View Processes Dynamically with top
  - The result is continuously updating (by default, every 3 seconds)
- Control Processes
  - Interrupt a process: Ctrl+C
  - Put a process in the background: command &
  - Return a process to the foreground:
    - Find the job number of the process: jobs; say the job number is n
    - Bring the process to the foreground: fg %n
  - Stop a process: Ctrl+Z
    - Ctrl+Z sends TSTP, which suspends (stops) the process; unlike the STOP signal, a program can catch or ignore TSTP. Ctrl+C sends INT, which normally terminates the process, but a program can catch it to clean up before exiting, or ignore it and not exit at all
    - For a stopped process, you can bring it to foreground or send it to background
  - Signals
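    - For Ctrl+C, a signal called INT (interrupt) is sent
    - For Ctrl+Z, a signal called TSTP (terminal stop) is sent
    - Send a signal to a process: kill -signal PID (the signal can be given by number or by name; the default is TERM)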
    - Common signals
      - kill -1: HUP signal, send the process a hangup signal
      - kill -2: INT signal, send the process an interrupt signal
      - kill -9: KILL signal, send the process a kill signal
      - kill -15: TERM signal, send the process a terminate signal
      - kill -18: CONT signal, send the process a continue signal
      - kill -19: STOP signal, send the process a stop signal
      - kill -20: TSTP signal, send the process a terminal stop signal
      - kill -3: QUIT signal, send the process a quit signal
      - kill -11: SEGV signal, send the process a segmentation violation signal
      - kill -28: WINCH signal, send the process a window change signal
  - Shut down the system
    - Function: orderly terminate all the processes on the system, then power off the system
    - Commands: halt, poweroff, reboot, shutdown
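
The signal numbers used with kill can be looked up with Python's standard `signal` module. The sketch below assumes a Unix-like system and only covers the numbers that are the same on every POSIX platform; the numbers for CONT, STOP and TSTP listed above are the Linux x86 values and can differ elsewhere.

```python
import signal

# Signals whose numbers are fixed across POSIX systems.
names = ("SIGHUP", "SIGINT", "SIGQUIT", "SIGKILL", "SIGTERM")
numbers = {name: int(getattr(signal, name)) for name in names}

for name, number in numbers.items():
    # kill accepts either the number or the name without the SIG prefix.
    print(f"kill -{number} PID   is the same as   kill -{name[3:]} PID")

# Looking a number up the other way round.
print(signal.Signals(15).name)
```

On the command line, `kill -l` prints the same name-to-number table for the current system.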
Chapter 11 The Environment
- Some commands:
  - printenv: print part or all of the environment
  - set: set shell options
  - export: export environment to subsequently executed programs
  - alias: create an alias for a command
- What is stored in the environment
  - Two types of variables:
    - Shell variables: bits of data set by bash
    - Environment variables: other variables
  - Programmatic data: aliases and shell functions
- Examine the environment
  - printenv: the result is in key=value format
  - printenv key: print the value of the key
  - echo $key: print the value of the key
- How is the environment established
  - Bash program starts, and reads a series of configuration scripts called startup files, which define the default environment shared by all users, then followed by startup files related to personal environment
  - A login shell session reads
    - /etc/profile: global configuration
    - ~/.bash_profile: user startup file
    - ~/.bash_login: if the above one is not found
    - ~/.profile: if the above two are not found
  - A non-login shell session reads
    - /etc/bash.bashrc: global configuration
    - ~/.bashrc: user startup file
- Modify the environment
  - Which file to modify
    - To add directories to your PATH variable, put those changes in the .bash_profile (or .profile) file
    - For everything else, put the changes into the .bashrc file
  - To edit these files, we use a text editor
    - Graphical editors:
      - gedit from GNOME
      - kedit, kwrite, kate from KDE
    - Text-based editors: nano, vi (on most Linux systems it is replaced by vim, which is short for "vi improved"), emacs
    - Some vim commands
      - :set number: display line numbers
      - :wq, :x, ZZ: save and quit
      - :q!: quit without saving
      - gg: go to the first line
      - G: go to the last line
      - ngg, nG: go to the nth line
      - 0: jump to the start of the line
      - $: jump to the end of the line
      - ^: jump to the first non-blank character in the line
      - g_: jump to the last non-blank character in the line
      - i: insert before the cursor
      - a: insert after the cursor
      - I: insert at the beginning of the line
      - A: insert at the end of the line
      - o: append a new line below the current line
      - O: append a new line above the current line
      - r: replace the current character
      - R: replace characters until ESC is pressed
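
Returning to how the environment is passed on: a child process sees only what is in the environment it is given, which is what export arranges in the shell. The Python sketch below illustrates this by starting a child interpreter with and without a variable. The variable name `COSI_DEMO` and the helper `run_child` are hypothetical.

```python
import os
import subprocess
import sys

# The child prints the value of COSI_DEMO, or 'unset' if it is absent.
child_code = "import os; print(os.environ.get('COSI_DEMO', 'unset'))"

def run_child(env):
    """Run a child Python process with the given environment; return its output."""
    result = subprocess.run(
        [sys.executable, "-c", child_code],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Start from the current environment, making sure the variable is absent.
base = {k: v for k, v in os.environ.items() if k != "COSI_DEMO"}
without_var = run_child(base)

# Adding the variable to the child's environment is the analogue of export.
exported = dict(base, COSI_DEMO="hello")
with_var = run_child(exported)

print(without_var, with_var)
```

A plain Python variable in the parent plays the role of an unexported shell variable here: the child never sees it.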
Chapter 14 Package Management
- Introduction
  - The most important determinant of a Linux distribution's quality is its packaging system and the vitality of the distribution's support community
  - Package management is a method of installing and maintaining software on the system
    - Nowadays we can install packages from the Linux distributor
    - In the early days, people needed to download and compile source code to install software
- Packaging Systems
  - Different distributions use different packaging systems, and generally a package built for one distribution is not compatible with another distribution
  - Two main packaging technologies
    - .deb camp from Debian: Debian, Ubuntu, Linux Mint, Raspbian
    - .rpm camp from Red Hat: Fedora, CentOS, Red Hat Enterprise Linux, OpenSUSE
  - How a Package System Works
    - Virtually all software for a Linux system will be found on the Internet, most will be provided by the distribution vendor in the form of package files, and the rest will be available in source code form that can be installed manually
  - Package files
    - A package file is a compressed collection of files that comprise the software package
    - A package may contain programs and data files, metadata files, pre- and post-installation scripts that perform configuration tasks
    - Package files are created by the package maintainer
  - Repositories: packages are often hosted in a central repository
  - Dependencies: shared resources, such as shared libraries, that a package needs in order to run properly
  - Package management tools
    - Low-level tools: install and remove package files
    - High-level tools: search metadata and resolve dependencies

      | Distributions | Low-Level Tools | High-Level Tools |
      | --- | --- | --- |
      | Debian style | `dpkg` | `apt`, `apt-get`, `aptitude` |
      | Fedora, Red Hat Enterprise Linux | `rpm` | `yum`, `dnf` |

- Common Package Management Tasks
  - Find a package in a repository
    - Debian style: apt-get update; apt-cache search search_string
    - Red Hat style: yum search search_string
  - Install a package from a package file
    - Debian style: dpkg -i package_file
    - Red Hat style: rpm -i package_file
  - List installed packages
    - Debian style: dpkg -l
    - Red Hat style: rpm -qa
  - Determine whether a package is installed
    - Debian style: dpkg -s package_name
    - Red Hat style: rpm -q package_name
  - Display information about a package
    - Debian style: apt-cache show package_name
    - Red Hat style: yum info package_name
  - Find which package installed a file
    - Debian style: dpkg -S file_name
    - Red Hat style: rpm -qf file_name
Chapter 16 Networking
- Some commands
  - ping: send an ICMP ECHO_REQUEST to network hosts
  - traceroute: print the route packets trace to a network host
  - ip: show/manipulate routing, devices, policy routing and tunnels
  - netstat: print network connections, routing tables, interface statistics, masquerade connections, and multicast memberships
  - ftp: internet file transfer program
  - wget: non-interactive network downloader
  - ssh: OpenSSH SSH client (remote login program)
- Examine and monitor a network
  - ping
    - Sends a special network packet to a specified host, most devices receiving this packet will reply to it, allowing the network connection to be verified
    - Once started, ping continues to send packets at a specified interval (default is 1 second) until it is interrupted (say, with Ctrl+C)
    - After it is interrupted, ping prints performance statistics
  - traceroute
    - traceroute lists all the hops (routers) that network traffic takes to get from the local system to a specified host
  - ip
    - ip a: list all network interfaces and their addresses
  - netstat
    - netstat -i: display a table of all network interfaces
    - netstat -e: display additional information
    - netstat -r: display the kernel routing tables
    - netstat -n: show numerical addresses instead of trying to determine symbolic host, port or user names
- Transport files over a network
  - ftp
    - File Transfer Protocol (FTP) was once the most widely used method of downloading files over the Internet
    - ftp is used to communicate with FTP servers, machines that contain files that can be uploaded and downloaded over a network
    - FTP is not secure because it sends account names and passwords in cleartext, so almost all FTP done over the Internet uses anonymous FTP servers
    - ftp servername: log in to an FTP server
  - lftp
    - It works much like the traditional ftp program but has many additional convenience features including multiple-protocol support, automatic retry on failed downloads, background processes, tab completion of path names, and many more
  - wget
    - It is useful for downloading content from both web and FTP sites
    - Single files, multiple files, and even entire sites can be downloaded
    - wget allows recursive download, download files in the background, and complete the download of a partially downloaded file
- Secure communication with remote hosts
  - Before ssh, there were commands like rlogin and telnet, but they transmit all communication in cleartext, which is inappropriate in the Internet age
  - Advantages:
    - It authenticates that the remote host is who it says it is, thus preventing so-called man-in-the-middle attacks
    - It encrypts all of the communications between the local and remote hosts
  - SSH consists of two parts
    - An SSH server runs on the remote host, listening for incoming connections, by default on port 22
    - An SSH client is used on the local system to communicate with the remote server
  - Most Linux distributions ship an implementation of SSH called OpenSSH from the OpenBSD project
  - Syntax: ssh user@hostname
  - Use SSH-encrypted tunnel to copy files across the network
    - scp (secure copy): scp source destination
    - sftp (secure file transfer): sftp hostname
Chapter 19 Regular Expressions
- Introduction
  - Regular expressions are symbolic notations used to identify patterns in text
  - We only consider regular expressions described in the POSIX standard
- grep
  - grep is short for "global regular expression print"; it searches text files for text matching a specified regular expression and outputs any line containing a match to standard output
  - Syntax: grep [options] regex [file...]
  - grep options
    - grep -i: ignore case
    - grep -v: invert match, prints lines that do not match
    - grep -c: print the number of matches instead of the lines themselves
    - grep -l: print the name of each file that contains a match instead of the lines themselves
    - grep -L: print the name of each file that does not contain any matched lines
    - grep -n: prefix each matching line with the line number
    - grep -h: for multi-file searches, suppress the output of filenames
- Metacharacters and literals
  - We can use literals in regular expressions
  - We can also use metacharacters in regular expressions
    - Metacharacters: ^, $, ., [], {}, -, ?, *, +, (), |, \
- The any character
  - . matches any character in a character position
- Anchors
  - The caret ^ and the dollar sign $ are treated as anchors in regular expressions
  - ^ matches the beginning of a line
  - $ matches the end of the line
- Bracket expressions and character classes
  - Bracket expression matches a single character from a specified set of characters
  - Metacharacters lose their special meaning inside bracket expressions, except ^ and -
    - ^: if it appears at the beginning inside a bracket expression, then the following set of characters must not be present at the given character position
    - A-Z: matches all uppercase letters
  - POSIX character classes
    - [:alnum:]: [A-Za-z0-9]
    - [:word:]: [A-Za-z0-9_]
    - [:alpha:]: [A-Za-z]
    - [:blank:]: Space and tab
    - [:cntrl:]: ASCII control codes, include ASCII characters 0 through 31 and 127
    - [:digit:]: [0-9]
    - [:graph:]: The visible characters, include ASCII characters 33 through 126
    - [:lower:]: [a-z]
    - [:upper:]: [A-Z]
    - [:punct:]: The punctuation characters, in ASCII, equivalent to [-!"#$%&'()*+,./:;<=>?@[\\\]_{|}~]
    - [:print:]: The printable characters, all characters in [:graph:] plus the space character
    - [:space:]: The whitespace characters, in ASCII, equivalent to [ \t\r\n\v\f]
    - [:xdigit:]: Hexadecimal numbers, in ASCII, equivalent to [0-9A-Fa-f]
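  - POSIX basic vs. extended regular expressions
    - BRE (basic regular expressions): recognizes the metacharacters ^ $ . [ ] *; the characters ( ) { } act as metacharacters only when escaped with a backslash
    - ERE (extended regular expressions): adds ( ) { } ? + | as metacharacters; use grep -E to enable it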
  - Alternation
    - grep -E 'AAA|CCC|BBB' file: matches AAA, CCC, or BBB in the file
    - To combine alternation with other regular expression elements, put the alternation in parentheses:
      - grep -E '^(aa|bb|cc)' file: matches aa, bb, or cc at the beginning of a line in the file
  - Quantifiers
    - ?: matches an element zero or one time
    - *: matches an element zero or more times
    - +: matches an element one or more times
    - {n}: matches an element exactly n times
    - {n,m}: matches an element at least n times, and no more than m times
    - {n,}: matches an element at least n times
    - {,m}: matches an element no more than m times
  - Some applications
    - find: search for files in a directory hierarchy
      - grep prints a line when the line contains a match; find's -regex test requires the entire pathname to match the expression
    - locate: find files by name
      - locate --regexp pattern: use BRE
      - locate --regex pattern: use ERE
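
The anchors, bracket expressions, alternation and quantifiers from this chapter can be tried out with a tiny grep-like function in Python. Note the assumptions: Python's `re` module is not a POSIX engine (its syntax is close to ERE, but it does not support classes such as `[[:alpha:]]`), and the function name `grep` and the sample lines are made up for illustration.

```python
import re

lines = ["zip", "bzip2", "gunzip", "Zebra", "unzip.1", "az", "aaa"]

def grep(pattern, text_lines, ignore_case=False, invert=False):
    """Return the lines that contain a match (or that do not, if invert is set)."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [line for line in text_lines if bool(regex.search(line)) != invert]

print(grep(r"^zip", lines))                  # anchor at the start of the line
print(grep(r"zip$", lines))                  # anchor at the end of the line
print(grep(r"^[A-Z]", lines))                # bracket expression with a range
print(grep(r"^(bz|gun)", lines))             # alternation inside parentheses
print(grep(r"^a{2,3}$", lines))              # {n,m} quantifier
print(grep(r"zip", lines, invert=True))      # like grep -v
print(grep(r"^z", lines, ignore_case=True))  # like grep -i
```

Like grep, the function tests whether a line contains a match anywhere, so a pattern must be anchored with ^ and $ to force a whole-line match.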

Pro Git

Chapter 1
- Version control
  - Def: a system that records changes to a file or set of files over time so that you can recall specific versions later
  - Local version control system
  - Centralized version control system
  - Distributed version control system
- Git
  - Most other VCSs store information as a list of file-based changes
  - Git thinks of its data like a series of snapshots of a miniature filesystem
- Three states of files in git
  - Modified: the file is changed but not committed to the database yet
  - Staged: a modified file is marked in its current version to go into the next commit snapshot
  - Committed: the file has already been safely stored in the local database

Python for Data Analysis (this is not a required book for this course)

Chapter 4 NumPy Basics

Introduction
- NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python
- Data analysis applications using NumPy
  - Fast vectorized array operations
  - Common array algorithms like sorting, unique, and set operations
  - Efficient descriptive statistics and aggregating/summarizing data
  - Data alignment and relational data manipulations for merging and joining together heterogeneous datasets
  - Expressing conditional logic as array expressions instead of loops with if branches
  - Group-wise data manipulation

The NumPy ndarray Ch 4.1

Example 1

import numpy as np

data = np.random.randn(2,3) # generate a 2*3 array of random numbers
data = data * 3 # multiply each element by 3
data.shape # (2,3)
data.dtype # dtype('float64')

Example 2

import numpy as np

data = [[1,2], [3,4]]
arr = np.array(data)
arr.ndim # 2
arr.shape # (2, 2)
arr.dtype # dtype('int64') (int32 on some platforms)

Example 3

import numpy as np

arr1 = np.zeros((2,3)) # create a 2*3 array of 0
arr2 = np.ones((1,2)) # create a 1*2 array of 1

ndarray data types: every array has a dtype, and astype returns a converted copy

import numpy as np

a1 = np.zeros((2,2))
a1.dtype # dtype('float64')
a2 = a1.astype(np.int32)
a2.dtype # dtype('int32')
a3 = np.ones((2,2), dtype=np.int32)
a3.dtype # dtype('int32')

Any arithmetic operation between equal-size NumPy arrays applies the operation element-wise

Indexing and slicing

import numpy as np

x = np.ones(5)
x[2:] = 2  # x is now [1, 1, 2, 2, 2]
y = x[:2]  # y is a view on x, not a copy
y[0] = 3  # y is [3, 1], and x is [3, 1, 2, 2, 2]

z = np.array([[1,2],[3,4]])
z[:, :1]  # [[1], [3]]
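
As a small illustration of the two points above (element-wise arithmetic, and slices being views rather than copies), here is a sketch with made-up values. The variable names are only for illustration.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[10.0, 20.0], [30.0, 40.0]])

# Arithmetic between equal-size arrays is applied element by element
total = a + b      # [[11, 22], [33, 44]]
product = a * b    # [[10, 40], [90, 160]], not a matrix product

# A slice is a view: writing through it changes the original array
x = np.arange(5)
view = x[1:3]
view[:] = 0        # x is now [0, 0, 0, 3, 4]

# Use copy() when the original must stay untouched
y = np.arange(5)
independent = y[1:3].copy()
independent[:] = 0  # y is still [0, 1, 2, 3, 4]
```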

Boolean indexing

import numpy as np

index = np.arange(3)
data = np.arange(1, 10).reshape((3, 3))
data[index == 2] # [[7, 8, 9]], the row where the mask is True
cond = index != 2
data[~cond] # [[7, 8, 9]], ~ inverts the mask
data[:, index == 2] # [[3], [6], [9]], the mask selects columns here
data[data < 5] = 5 # data is now [[5, 5, 5], [5, 5, 6], [7, 8, 9]]

Transposing arrays and swapping axes

x.T: transpose of an array
np.dot(x, x.T): matrix multiplication

transpose function

import numpy as np

x = np.arange(1, 7).reshape((2, 3)) # 6 elements are needed for a 2*3 array
x.shape # (2, 3)
x = np.transpose(x, (1, 0)) # the second argument is a permutation of the axis numbers range(x.ndim)
x.shape # (3, 2)

Universal functions Ch 4.2
- A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays
- Unary ufuncs: np.abs(arr), np.sqrt(arr), np.square(arr), np.exp(arr), np.log(arr), np.log2(arr), np.sign(arr), np.ceil(arr), np.floor(arr), np.modf(arr) (returns two arrays: the fractional parts, then the integral parts), np.isnan(arr), np.isfinite(arr), np.isinf(arr)
- Binary ufuncs: np.add/subtract/multiply/divide(a1, a2), np.power(a1, a2), np.maximum/minimum(a1, a2), np.mod(a1, a2), np.greater/less/greater_equal/less_equal/equal/not_equal(a1, a2), np.logical_and/logical_or/logical_xor(a1, a2)
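
A few of the ufuncs listed above, applied to small hand-picked arrays. This is just a sketch to show the element-wise behavior and the return order of np.modf; the input values are arbitrary.

```python
import numpy as np

arr = np.array([-1.5, 0.25, 2.0])
other = np.array([1.0, -3.0, 2.5])

# Unary ufuncs transform every element
magnitudes = np.abs(arr)             # [1.5, 0.25, 2.0]
signs = np.sign(arr)                 # [-1., 1., 1.]

# np.modf returns the fractional parts first, then the integral parts
fractional, integral = np.modf(arr)  # [-0.5, 0.25, 0.0] and [-1., 0., 2.]

# Binary ufuncs combine two arrays element by element
larger = np.maximum(arr, other)      # [1.0, 0.25, 2.5]
mask = np.greater(arr, other)        # [False, True, False]
```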

Array-oriented programming Ch 4.3

Example 1

import numpy as np
import matplotlib.pyplot as plt

points = np.arange(-5, 5, 0.01) # 0.01 is the step size
xs, ys = np.meshgrid(points, points)
z = np.sqrt(xs ** 2 + ys ** 2)
plt.imshow(z, cmap=plt.cm.gray) # plot the computed distances z
plt.colorbar()

Convert conditional logic into array expressions

Example 1

import numpy as np

xs = np.arange(1, 6)
ys = np.arange(6, 11)
rands = np.random.randn(5)
cond = rands > 0
# The following are equivalent, but the second one is more efficient
res1 = [(x if c else y) for x, y, c in zip(xs, ys, cond)]
res2 = np.where(cond, xs, ys)

Example 2

import numpy as np

ma = np.arange(1, 17).reshape((4,4))
rands = np.random.randn(16).reshape((4, 4))
cond = rands > 0
res = np.where(cond, 2, ma) # use 2 where cond is True, otherwise keep the element from ma
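
Because the examples above use random data, their results change on every run. The sketch below repeats the idea with fixed values, and also shows that np.where calls can be nested to express an if / elif / else chain. The array contents are made up.

```python
import numpy as np

values = np.array([[-2, 1], [0, 3]])

# Scalars and arrays can be mixed: replace negatives with 0, keep the rest
non_negative = np.where(values < 0, 0, values)   # [[0, 1], [0, 3]]

# Nested np.where expresses an if / elif / else chain
labels = np.where(values < 0, "neg",
                  np.where(values == 0, "zero", "pos"))
# [['neg', 'pos'], ['zero', 'pos']]
```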

Mathematical and statistical operations

import numpy as np

arr = np.random.randn(16).reshape((4, 4))
arr.mean() # mean of all elements, equivalent to np.mean(arr)
arr.mean(axis=0) # mean of each column, equivalent to np.mean(arr, axis=0)
arr.mean(axis=1) # mean of each row, equivalent to np.mean(arr, axis=1)
arr.cumsum(axis=0) # cumulative sum of each column
arr.cumprod(axis=1) # cumulative product of each row

Other methods: sum, mean, std, var, min, max, argmin, argmax, cumsum, cumprod
Boolean array methods
- Count the positive values: (arr > 0).sum()
- Return True if at least one value is True: arr.any()
- Return True if all values are True: arr.all()
Sorting
- Sort a one-dimensional array in place: arr.sort()
- Sort each row of a 2D array (along axis 1): arr.sort(1)
- Sort each column of a 2D array (along axis 0): arr.sort(0)
Unique and set logic
- np.unique(array): return the sorted unique values
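
The boolean, sorting, and unique operations above can be checked on a small fixed array. This is a minimal sketch with arbitrary values; note that arr.sort() sorts in place while np.sort returns a copy.

```python
import numpy as np

arr = np.array([[3, -1, 2], [0, 5, -4]])

positives = (arr > 0).sum()      # 3, since True counts as 1
any_negative = (arr < 0).any()   # True
all_small = (arr < 10).all()     # True

by_row = arr.copy()
by_row.sort(axis=1)      # in place, each row sorted: [[-1, 2, 3], [-4, 0, 5]]

by_column = arr.copy()
by_column.sort(axis=0)   # in place, each column sorted: [[0, -1, -4], [3, 5, 2]]

sorted_copy = np.sort(arr, axis=1)  # returns a sorted copy, arr is unchanged
unique_values = np.unique(np.array([2, 1, 2, 3, 1]))  # [1, 2, 3]
```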

Linear algebra Ch 4.5

x.dot(y) is equivalent to np.dot(x, y), also equivalent to x @ y

Inverse, and QR decomposition

import numpy as np
from numpy.linalg import inv, qr

x = np.random.randn(5, 5)
mat = x.dot(x)
inv(mat) # inverse of mat
q, r = qr(mat) # QR decomposition, mat equals q.dot(r)

Pseudorandom number generation Ch 4.6
- Set seed: np.random.seed(num)
- np.random.randn(): standard normal distribution
- np.random.rand(): uniform distribution on [0, 1)
- np.random.randint(): discrete uniform distribution over a range of integers
- np.random.binomial(): binomial distribution
- np.random.beta(): beta distribution
- np.random.chisquare(): chi-square distribution
- np.random.uniform(): uniform distribution on [low, high)
- np.random.gamma(): gamma distribution

Chapter 5 Getting started with pandas

Introduction to pandas data structures Ch 5.1

Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated index

import pandas as pd

ls = list(range(10))
obj = pd.Series(ls, index=list(range(1, 11)))
obj.values # array of the values 0 to 9
obj.index # index labels 1 to 10
10 in obj # True, in checks the index labels
0 in obj # False, 0 is a value but not an index label

dt = {0: 1, 1: 2, 2: 3}
obj2 = pd.Series(dt)
obj2.index # index labels 0 to 2, taken from the dict keys
obj2.name = "my_series" # give a name to the series
obj2.index.name = "my_index" # give a name to the index

DataFrame

A dataframe represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type
A dataframe has both a row index and a column index

The most common way to construct a dataframe is from a dict of equal-length lists or NumPy arrays

import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df1 = pd.DataFrame(data)
index = [str(x) for x in range(6)]
df2 = pd.DataFrame(data, columns=['pop', 'year', 'state'], index=index) # alter the order of columns
# If a requested column is not in the original data, its values will be NaN

df = df2 # the examples below use the frame with string row labels
df.columns # Index of column names
df.index # Index of row names

df.head(n) # get the first n rows
df['pop'] # get the data in column whose name is pop
df.loc['1'] # get the data in row whose name is 1
df.loc[['1', '3']] # get the data in rows whose name is 1 or 3
df.loc['1', 'pop'] # get the data entry whose column is pop and row is 1

# add a column
df['eastern'] = df.state == 'Ohio'
# delete a column
del df['eastern']
# dataframe transpose
df.T
# Add names to the dataframe indices
df.index.name = 'my_index'
df.columns.name = 'my_columns'

Index objects
- Series and DataFrame labels are stored in Index objects, which are immutable
- You can also create an index manually
  import numpy as np
  import pandas as pd
  
  index = pd.Index(np.arange(3))
- A pandas index can contain duplicate labels
- Index methods and properties: append, difference, intersection, union, isin, delete, drop, insert, is_monotonic_increasing (is_monotonic in older pandas), is_unique, unique

Essential Functionality Ch 5.2

Reindexing
- The reindex method creates a new object with the data rearranged to conform to the new index
- If an index label did not exist before, its value is filled with NaN
  import pandas as pd
  
  obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
  obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e']) # NaN is associated with index 'e'
- Other filling-related arguments
  - method='ffill': fill the missing values with the previous valid observation
  - method='bfill' (or 'backfill'): fill the missing values with the next valid observation
  - method='nearest': fill the missing values with the nearest valid observation
  - fill_value=n: fill the missing values with n
- For a dataframe, reindex applies to the row index by default. To reindex the columns, use obj.reindex(columns=list)
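
The filling arguments above are easiest to see on a tiny example. The following sketch uses made-up data; method="ffill" needs a sorted (monotonic) index to work.

```python
import pandas as pd

obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Forward fill: each new label takes the last valid value before it
filled = obj.reindex(range(6), method="ffill")
# 0 blue, 1 blue, 2 purple, 3 purple, 4 yellow, 5 yellow

# fill_value replaces NaN for labels that did not exist before
counts = pd.Series([1, 2], index=["a", "b"]).reindex(["a", "b", "c"], fill_value=0)
# a 1, b 2, c 0

# For a DataFrame, columns= reindexes the column labels instead of the rows
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
reordered = df.reindex(columns=["y", "x", "z"])  # column z is all NaN
```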

Drop items

For a series, pass the index labels to drop

import pandas as pd
import numpy as np

obj = pd.Series(np.arange(6), index=np.arange(6))
obj.drop([1, 2]) # returns a new series without labels 1 and 2, obj itself is unchanged

For dataframe, you can drop either rows or columns

import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
        index=['Ohio', 'Colorado', 'Utah', 'New York'],
        columns=['one', 'two', 'three', 'four'])
rows = ['Ohio', 'Utah']
cols =['one', 'four']
data.drop(rows) # Equivalent to data.drop(rows, axis=0), or data.drop(rows, axis='index'), since the default value for axis is 0
data.drop(cols, axis=1) # Equivalent to data.drop(cols, axis='columns')

Indexing, selection, and filtering

For a series, indexing works with labels or integer positions

import pandas as pd
import numpy as np

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['a':'c'] # Label slices include both endpoints, result is [0.0, 1.0, 2.0]
obj[0:3] # [0.0, 1.0, 2.0], integer slices are positional and 3 is exclusive

For a dataframe, you can select by both row and column indices
- df[:2]: the first two rows
- df['x']: column x
- df[['x', 'y']]: column x and column y
- df[df['x'] > 2]: rows whose value in column x is greater than 2
loc and iloc
- loc uses axis labels, iloc uses integer positions
- df.loc['r1', 'c1':'cn']: row r1, columns from c1 to cn (both c1 and cn are included)
- df.iloc[:, :]: select all rows and all columns
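
The difference between label-based and position-based selection described above can be sketched as follows. The frame and its labels (r1..r3, c1..c4) are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=["r1", "r2", "r3"],
                  columns=["c1", "c2", "c3", "c4"])

by_label = df.loc["r1", "c1":"c3"]   # label slice includes c3: values 0, 1, 2
by_position = df.iloc[0, 0:3]        # integer slice excludes position 3: values 0, 1, 2
filtered = df[df["c1"] > 2]          # rows r2 and r3
first_two = df[:2]                   # rows r1 and r2
```
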
Arithmetic and data alignment
- If an index label appears in only one series or one dataframe, the result of the arithmetic operation is NaN for that label
- You can use fill_value parameter to provide a fill value for the missing ones
- Some operations
  - df1.add(df2) (df1 + df2), df1.radd(df2) (df2 + df1)
  - df1.sub(df2) (df1 - df2), df1.rsub(df2) (df2 - df1)
  - df1.mul(df2) (df1 * df2), df1.rmul(df2) (df2 * df1)
  - df1.div(df2) (df1 / df2), df1.rdiv(df2) (df2 / df1)
- Arithmetic operations between dataframe and series
  - By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows
  - If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods df1.add(ls, axis='index')
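
A short sketch of the alignment rules above, using small made-up objects: plain operators produce NaN for unmatched labels, fill_value avoids that, and the axis argument switches the direction of broadcasting.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

plain = s1 + s2                    # a NaN, b 12.0, c NaN
filled = s1.add(s2, fill_value=0)  # a 1.0, b 12.0, c 20.0

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["r1", "r2"])
row = df.loc["r1"]                 # Series indexed by column labels
col = df["x"]                      # Series indexed by row labels

by_columns = df - row                # matches on columns, broadcasts down the rows
by_rows = df.sub(col, axis="index")  # matches on rows, broadcasts across the columns
```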

Function application and mapping

NumPy ufuncs (element-wise array methods) also work with pandas objects

Apply functions on columns and rows

import numpy as np
import pandas as pd

f = lambda x: x.max() - x.min()
df.apply(f) # Calculate max - min for each column
df.apply(f, axis='columns') # Calculate max - min for each row

g = lambda x: pd.Series([x.min(), x.max()], index=['min', 'max'])
df.apply(g) # Calculate min and max for each column

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary

Apply functions for each item in the dataframe

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape(4,4))
f = lambda x: '%2.2f' % x
df.applymap(f) # element-wise, renamed to df.map(f) in pandas 2.1+
# You can use map on a series
ls = pd.Series(np.arange(16))
ls.map(f)

Sorting and ranking
- Use ascending=False to sort in descending order
- ls.sort_index(): sort by series index
- df.sort_index(): sort by dataframe row index
- df.sort_index(axis=1): sort by dataframe column index
- ls.sort_values(): sort series values
- df.sort_values(by=['a', 'b'], axis=0, ascending=True): sort the dataframe rows by the values in column a, then column b (rows are reordered, so axis=0)
- ls.rank(method='average'/'first'/'max'/'min'/'dense'): rank a series, the method decides how ties are handled
- df.rank(method='average', axis=0/1): rank a dataframe within each column (axis=0) or within each row (axis=1)
Index with duplicates
- ls.index.is_unique: True if the series index has no duplicates (a property, not a method)
- df.index.is_unique: True if the dataframe row index has no duplicates
- df.columns.is_unique: True if the dataframe column index has no duplicates
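
To make the sorting and ranking options above concrete, here is a sketch on small hand-made objects. The data is arbitrary; the point is how ties are ranked and that is_unique is accessed without parentheses.

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 1, 2], "b": [5, 9, 3]}, index=["z", "x", "y"])

by_index = df.sort_index()                 # rows ordered x, y, z
by_values = df.sort_values(by=["a", "b"])  # rows ordered x, y, z (a first, then b)
descending = df.sort_values(by="b", ascending=False)  # rows ordered x, z, y

scores = pd.Series([7, 3, 7, 1])
average_rank = scores.rank()               # ties share the mean rank: 3.5, 2.0, 3.5, 1.0
first_rank = scores.rank(method="first")   # ties broken by position: 3.0, 2.0, 4.0, 1.0

dup = pd.Series([1, 2, 3], index=["a", "a", "b"])
has_unique_index = dup.index.is_unique     # False
selected = dup["a"]                        # a duplicated label returns a Series
```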

Summarizing and computing descriptive statistics Ch 5.3

df.sum(), df.mean(), df.cumsum(), df.cumprod(), df.count(), df.min(), df.idxmin(), df.median(), df.prod(), df.var(), df.std(), df.skew(), df.kurt(), df.diff(), df.pct_change(), df.corr(), df.cov()
ls.argmin() is available on a series only, and df.mad() was removed in pandas 2.0
Common arguments: axis=0/1, skipna=True/False

Unique values, value counts, and membership

ls.unique(): return unique values
ls.value_counts(sort=False/True): return value frequencies
pd.value_counts(ls, sort=False/True): return value frequencies (deprecated since pandas 2.1, prefer ls.value_counts())
mask = ls1.isin(ls2): return a boolean series marking whether each item in ls1 is in ls2

df.apply(pd.value_counts).fillna(0): calculate value frequencies for each column of a dataframe

import numpy as np
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                    'Qu2': [2, 3, 1, 2, 3],
                    'Qu3': [1, 5, 2, 4, 4]})
result = data.apply(pd.value_counts).fillna(0)
"""
    Qu1 Qu2 Qu3
  1 1.0 1.0 1.0
  2 0.0 2.0 1.0
  3 2.0 2.0 0.0
  4 2.0 0.0 2.0
  5 0.0 0.0 1.0
"""

Chapter 6 Data loading, storage, and file formats

Reading and writing data in text format Ch 6.1

Parsing functions in pandas
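- read_csv, read_table, read_fwf, read_clipboard, read_excel, read_hdf, read_html, read_json, read_msgpack (removed in pandas 1.0), read_pickle, read_sas, read_sql, read_stata, read_feather
Parameters: sep=',', header=None/int, names=[c1, ..., cn], index_col=cn
Reading text files in pieces
- pd.read_csv(path, nrows=n): read only the first n rows
- pd.read_csv(path, chunksize=n): return an iterator that yields dataframes of n rows each
- pd.isnull(df): boolean mask of the missing values
- pd.options.display.max_rows = n, pd.options.display.max_columns = m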
Writing data to text format
- df.to_csv(path)

Working with delimited formats

import csv

with open("file_path", newline="") as csvfile:
  reader = csv.reader(csvfile)
  ls = [row for row in reader]
# the with block closes the file, no explicit close() is needed
header, contents = ls[0], ls[1:]

with open("output", "w", newline="") as output:
  writer = csv.writer(output)
  for row in contents:
    writer.writerow(row)

JSON data
- Convert a JSON string to a Python object: obj = json.loads(json_string)
- Convert a Python object to a JSON string: json_string = json.dumps(obj)
- Read json file into pandas dataframe: df = pd.read_json(file)
XML and HTML
- Read html into a list of dataframes: pd.read_html(file)
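
The csv and json calls above can be tried without touching the disk. In this sketch io.StringIO stands in for a real file, and the sample text is invented.

```python
import csv
import io
import json

json_string = '{"name": "Ohio", "years": [2000, 2001]}'
obj = json.loads(json_string)           # JSON text -> Python dict
round_trip = json.dumps(obj)            # Python dict -> JSON text

# io.StringIO stands in for a real file so the sketch needs no files on disk
csv_text = "state,year\nOhio,2000\nNevada,2001\n"
rows = list(csv.reader(io.StringIO(csv_text)))
header, contents = rows[0], rows[1:]

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(header)
writer.writerows(contents)
```

Note that csv.reader returns every field as a string, so the year values above are "2000" and "2001" until they are converted explicitly.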

Binary data formats Ch 6.2

One of the easiest ways to store data efficiently in binary format is using Python's built-in pickle serialization. Pandas objects all have a to_pickle method that writes data to disk in pickle format
Pickle is only recommended as a short-term storage format, because it's hard to guarantee that the format will be stable over time
Pandas supports two more binary data formats: HDF5 and MessagePack (MessagePack support was removed in pandas 1.0)
- Store in HDF5 format: store = pd.HDFStore("xxx.h5")
- HDFStore supports two storage schemas, format='fixed'/'table'

Reading excel file

Create an ExcelFile object and then read a sheet from it

import pandas as pd

xlsx = pd.ExcelFile("xxx.xlsx")
df = pd.read_excel(xlsx, "Sheet1")

Put it all together: df = pd.read_excel("xxx.xlsx", "Sheet1")

Interacting with web APIs Ch 6.3

requests package

import pandas as pd
import requests

url = "xxx"
response = requests.get(url)
data = response.json()  # the parsed JSON, usually a dict or a list of dicts
df = pd.DataFrame(data)  # optionally pass index= and columns=

Interacting with databases Ch 6.4

sqlite3

import sqlite3
import pandas as pd

query = """
  SELECT * FROM table
  WHERE c1 = 
  ORDER BY c2
  LIMIT 5;
"""
conn = sqlite3.connect("xxx.sqlite")
cursor = conn.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows, columns = [x[0] for x in cursor.description])
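
A self-contained variant of the pattern above, as a sketch: it uses an in-memory database and a hypothetical people table so that it runs without an existing .sqlite file, and it binds the filter value through a ? placeholder.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")   # in-memory database, nothing is written to disk
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)",
                 [("Ann", 31), ("Bob", 25), ("Cid", 42)])

# The ? placeholder lets sqlite3 bind the value safely
cursor = conn.execute(
    "SELECT name, age FROM people WHERE age > ? ORDER BY age LIMIT 5", (30,))
rows = cursor.fetchall()
columns = [desc[0] for desc in cursor.description]
conn.close()

df = pd.DataFrame(rows, columns=columns)
```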

Chapter 7 Data cleaning and preparation

Handling missing data Ch 7.1
- All of the descriptive statistics on pandas objects exclude missing data by default
- For numeric data, missing data is represented using the floating-point value NaN, use data.isnull() to check for missing data
- Filtering out missing data: use data.dropna(), this is equivalent to data[data.notnull()]. arguments: axis=0/1, how='all'/'any' (if all values are NaN, or if any value is NaN)
- Fill missing values
  - Fill all missing values with one value: data.fillna(val)
  - Fill missing values with different values for each column: data.fillna({"c1": v1, "cn": vn})
  - Fill missing values in place: data.fillna(val, inplace=True)
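
The missing-data calls above, applied to a tiny frame with made-up values. This sketch shows the difference between how='any' (the default) and how='all', and a per-column fill.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": [1.0, np.nan, 3.0],
                   "c2": [np.nan, np.nan, 6.0]})

column_means = df.mean()                 # NaN is skipped: c1 2.0, c2 6.0
any_dropped = df.dropna()                # keeps only row 2, the one with no NaN
all_dropped = df.dropna(how="all")       # drops only row 1, where every value is NaN
filled = df.fillna({"c1": 0.0, "c2": -1.0})  # a different fill value per column
```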

Data transformation Ch 7.2

Removing duplicates
- data.duplicated(): return a boolean series indicating whether each value/row is a duplicate of an earlier one
- data.drop_duplicates(): drop duplicated values/rows, use keep='first'/'last'/False to specify which duplicates to keep

Transforming data using a function or mapping

import numpy as np
import pandas as pd

mapping = {0: 'a', 1: 'b', 3: 'c' }
df['char_index'] = df['index'].map(mapping) # add a new column, with integer indices mapped to characters

Replacing values:
- data.replace(v1, v2): replace v1 with v2
- data.replace({v1: v2, v3: v4}): replace v1 with v2, v3 with v4
- data.replace([v1, v3], [v2, v4]): replace v1 with v2, v3 with v4

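Renaming axis indices

Rename row or column indices

import pandas as pd

trans1 = lambda x: x.upper() # map calls the function on each label, which is a plain string
trans2 = lambda x: x.title() # Capitalize the first letter of each word
df.index = df.index.map(trans1) # assign the result to modify the dataframe
df.columns = df.columns.map(trans2)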

Rename both index and column names: df.rename(index=trans1, columns=trans2)
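
The transformations in this section (duplicates, map, replace, rename) can be sketched together on one small frame. The data and labels are invented; str.title and str.upper are used directly as the renaming functions.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 1]},
                  index=["ohio", "new york", "utah"])

dup_mask = df.duplicated()            # False, False, True (row labels are ignored)
deduped = df.drop_duplicates()        # keeps the first of each duplicated row

mapping = {"a": "alpha", "b": "beta"}
mapped = df["key"].map(mapping)       # alpha, beta, alpha
replaced = df["value"].replace({1: 10, 2: 20})  # 10, 20, 10

# rename applies the functions to each label and returns a new DataFrame
renamed = df.rename(index=str.title, columns=str.upper)
# index: Ohio, New York, Utah; columns: KEY, VALUE
```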

Detecting and filtering outliers
- data.describe(): calculate count, mean, std, min, max, and the 25%/50%/75% quantiles of each numeric column

Permutation and random sampling

You can apply permutation on values in a series or rows of a dataframe with the iloc or the take function

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5,3))
perm = np.random.permutation(len(df))
df.iloc[perm] # permute the rows
df.take(perm) # permute the rows, equivalent to the above iloc call

Take a random sample of rows: df.sample(n=n, replace=True/False), default replace is False

Computing indicator/dummy variables

Convert a categorical variable into dummy/indicator matrix

import pandas as pd
import numpy as np

ls = ['a', 'a', 'b', 'b', 'c']
ser = pd.Series(ls, index=np.arange(1, len(ls) + 1))
pd.get_dummies(ser, dtype=int) # dtype=int gives 0/1 columns, pandas 2.0+ returns True/False by default
"""
     a  b  c
  1  1  0  0
  2  1  0  0
  3  0  1  0
  4  0  1  0
  5  0  0  1
"""

Merge a dataframe and a series/another dataframe

The series must be named

If the dataframes have columns with the same name, use lsuffix and rsuffix to specify the suffix for the left and right dataframe

import pandas as pd

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
       'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
          'B': ['B0', 'B1', 'B2']})
df.join(other, lsuffix='_caller', rsuffix='_other')

Use np.random.seed(n) to set seed for random sampling

String manipulation Ch 7.3
- String object methods
  - str.split(sep), str.strip(), str.rstrip(), str.lstrip(), sep.join(ls)
  - char in str, str.index(char) (ValueError if not found), str.find(char) (return -1 if not found), ser.str.contains(char) (pandas series only, plain Python strings use in)
  - str.count(char), str.replace(old, new)
  - str.endswith(pattern), str.startswith(pattern)
  - str.lower(), str.upper()
- Regular expression
  - re package
  - Use regex = re.compile(pattern, flags=re.IGNORECASE) to get a reusable regex object
  - regex.findall(str) (all matches), regex.search(str) (first match anywhere), regex.match(str) (match only at the start of the string)
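
A brief sketch of the string and regex calls above, using only the standard library. The sample strings and the simplified e-mail pattern are made up for illustration and are not a complete e-mail matcher.

```python
import re

text = " a, b,  guido "
pieces = [part.strip() for part in text.split(",")]   # ['a', 'b', 'guido']
joined = "::".join(pieces)                             # 'a::b::guido'

contains = "guido" in text     # True
position = text.find("z")      # -1 when the substring is missing
# text.index("z") would raise ValueError instead

regex = re.compile(r"[a-z]+@[a-z]+\.com", flags=re.IGNORECASE)
emails = "Dave dave@google.com, Rob ROB@gmail.com"

found = regex.findall(emails)      # every match: ['dave@google.com', 'ROB@gmail.com']
first = regex.search(emails)       # first match anywhere in the string
at_start = regex.match(emails)     # None, because match only looks at the start
```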

Chapter 8 Data wrangling: join, combine and reshape

Hierarchical indexing

import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(9),
                index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                      [1, 2, 3, 1, 3, 1, 2, 2, 3]])
df = data.unstack() # Convert a series to a dataframe, the inner index level becomes the columns
df.index # ['a', 'b', 'c', 'd']
df.columns # [1, 2, 3]

Python testing with pytest

Chapter 1 Getting started with pytest

pytest xxx_test.py: run all tests in a file
pytest -v: -v flag controls the verbosity of pytest output in various aspects: test session progress, assertion details when tests fail, fixtures details with --fixtures
pytest -r: show extra test summary info as specified by chars
Naming rules
- Test files should be named test_*.py or *_test.py
- Test methods and functions should be named test_*
- Test classes should be named Test*
Possible outcomes of a test function
- Passed(.): the test ran successfully
- Failed(F): the test did not run successfully
- Skipped(s): the test was skipped
- xfail(x): the test was not supposed to pass, ran, and failed
- XPASS(X): the test was not supposed to pass, ran, and passed
- ERROR(E): an exception happened outside of the test function
Running only one test
- Run test called test_inc in test_math.py: pytest -v test_math.py::test_inc

Using options

Check for pytest options: pytest --help

pytest -m: only run tests matching given mark expression

Example

import pytest

def inc_one(x):
  return x + 1

def dec_one(x):
  return x - 1
# give this test a mark called simple
@pytest.mark.simple
def test_inc():
  assert inc_one(1) == 2

def test_dec():
  assert dec_one(2) == 1

pytest -v # This runs both tests
pytest -v -m simple # This runs only the test marked simple

--collect-only: shows which tests will be run with the given options and configuration, without running them
-m markexpr: markers are one of the best ways to mark a subset of your test functions so that they can be run together. As an example, one way to run test_replace() and test_member_access(), even though they are in separate files, is to give them the same marker and select it with -m

Chapter 2 Writing test functions

Using assert statements: pytest uses the plain assert expr statement in place of unittest helpers such as assertTrue(expr), assertEqual(x, y), ...

Expecting exceptions:

import pytest

def my_add(x: int, y: int) -> int:
  return x + y

def test_add():
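  with pytest.raises(TypeError) as excinfo:
    my_add(1, '2')
  # inspect the exception after the with block, code placed after the raising call inside it never runs
  exception_msg = excinfo.value.args[0]
  assert exception_msg == "unsupported operand type(s) for +: 'int' and 'str'"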

Marking test functions

import pytest

def inc(x):
  return x + 1

def dec(x):
  return x - 1

@pytest.mark.inc
def test_inc():
  assert inc(1) == 2

@pytest.mark.dec
def test_dec():
  assert dec(2) == 1

pytest -v # This runs both test_inc and test_dec
pytest -v -m inc # This runs only test_inc
pytest -v -m dec # This runs only test_dec

Skipping tests

We can skip tests that we do not want to run now, either unconditionally or based on a condition

Skip a test

import pytest

@pytest.mark.skip(reason='not implemented yet')
def test_func():
  assert True

Skip a test with a boolean condition

import pytest

# condition is a placeholder for any boolean expression
@pytest.mark.skipif(condition, reason='not supported until condition is met')
def test_func():
  assert True

Marking tests as expected to fail

import pytest

@pytest.mark.xfail(condition, reason='not supported unless condition is met')
def test_func():
  assert True

Running a subset of tests

pytest -v .   # This runs all tests in the current directory
pytest -v test_raise.py # This runs all tests in a file
pytest -v test_raise.py::test_raise # This runs a single test function in a file
pytest -v test_raise.py::TestRaise # This runs all tests in a class inside a file
pytest -v -k '_raise and not delete' # This runs all tests in the current directory whose name contains `_raise` but not `delete`

Chapter 3 Pytest fixtures

Fixtures are functions that pytest runs before (and sometimes after) the test functions that request them
Fixture functions can do whatever you want: get data, set up a database connection, share data among multiple test functions, etc.

Example

import sqlite3

import pytest

@pytest.fixture
def db_connect():
  conn = sqlite3.connect('test.db')
  cursor = conn.cursor()
  cursor.execute("create table if not exists test_table(name text);")
  cursor.execute("insert into test_table(name) values('test_name');")
  result = cursor.execute("select * from test_table;")
  res = []
  for row in result:
      res.append(row)
  conn.close()
  return res

def test_connection(db_connect):
  assert len(db_connect) > 0

Sharing fixtures among multiple tests

Use a conftest.py file to define fixtures that need to be shared among multiple test files
You can also add a scope parameter to the fixture function, e.g. @pytest.fixture(scope='module'), to control how often the fixture is created. scope can be function, class, module, package, or session

Example

# content of conftest.py
import pytest
import smtplib


@pytest.fixture(scope="module")
def smtp_connection():
    return smtplib.SMTP("smtp.gmail.com", 587, timeout=5)

# content of test_module.py
def test_ehlo(smtp_connection):
    response, msg = smtp_connection.ehlo()
    assert response == 250
    assert b"smtp.gmail.com" in msg
    assert 0  # for demo purposes


def test_noop(smtp_connection):
    response, msg = smtp_connection.noop()
    assert response == 250
    assert 0  # for demo purposes
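
A module-scoped fixture can also clean up after itself by using yield instead of return. The sketch below is one possible way to do this with an in-memory SQLite database; open_test_db and the table name are hypothetical helpers introduced only for this example.

```python
import sqlite3

import pytest


def open_test_db():
    """Hypothetical helper: build a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table test_table(name text)")
    conn.execute("insert into test_table(name) values ('test_name')")
    return conn


@pytest.fixture(scope="module")
def db_conn():
    conn = open_test_db()   # setup: runs once per module because of the scope
    yield conn              # the tests that request db_conn receive this object
    conn.close()            # teardown: runs after the last test in the module


def test_has_rows(db_conn):
    rows = db_conn.execute("select name from test_table").fetchall()
    assert rows == [("test_name",)]


def test_row_count(db_conn):
    count = db_conn.execute("select count(*) from test_table").fetchone()[0]
    assert count == 1
```

Both tests share the same connection because of scope="module"; with the default function scope, the database would be rebuilt for each test.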

Lecture 01 2022/01/19

Introduction, based on the TLCL book

Lecture 02 2022/01/20

Navigation and Redirection, based on the TLCL book

Lecture 03 2022/01/24

Argument Expansion, based on the TLCL book

Lecture 04 2022/01/26

Permissions and Processes, based on the TLCL book

Lecture 05 2022/01/27

Grep and Regular Expressions, based on the TLCL book

Lecture 06 2022/01/31

Lecture 07 2022/02/02

Lecture 08 2022/02/03

Lecture 09 2022/02/07

Lecture 10 2022/02/08

Change the default browser by which jupyter notebook is opened
- Run jupyter notebook --generate-config in the command line, this command generates a configuration file at ~/.jupyter/jupyter_notebook_config.py
- Set this property in the above configuration file: c.NotebookApp.browser = u'C:/Home/AppData/Local/Google/Chrome/Application/chrome.exe %s'. If you use Windows, you should use the path separator \\, and also quote the path of your browser inside "", and then inside '', so your configuration should be something like this: c.NotebookApp.browser = '"C:\\Home\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe" %s'

Lecture 15 2022/02/28

Lecture 21 2022/03/14

pytest
- Install pytest:
  - Install: pip install pytest
  - Check pytest version: pytest --version
  - Simplest way to run tests: pytest, this command will run all files of the form test_*.py or *_test.py in the current directory and its subdirectories
- Simple test example
  # inside test_simple.py
  def inc_one(x):
    return x + 1
  
  def test_inc_one():
    assert inc_one(1) == 2

Lecture 22 2022/03/16