COSI 103A Software Engineering

The Linux Command Line

  1. Chapter 1: What is the shell
    • The shell is a program that takes keyboard commands and passes them to the operating system to carry out
    • bash is a shell program from the GNU project; the name is an acronym for Bourne Again Shell
    • We need another program called a terminal emulator to interact with the shell
    • Some simple commands
      • date: display the current time and date
      • cal: display a calendar of the current month
      • df: display the current amount of free space on the disk
      • free: display the amount of free memory
      • exit: exit a terminal session, you can also use Ctrl + d
  2. Chapter 2: Navigation
    • Some commands:
      • pwd: print name of current working directory
      • ls: list directory contents
      • cd: change directory
    • A Unix-like OS organizes files in a hierarchical directory structure. The first directory is the root directory.
    • Windows has a separate file system tree for each storage device, Unix-like systems always have a single file system tree
    • Absolute pathnames start with /, relative pathnames start from the working directory, ./ represents the current directory, ../ represents the parent directory of the current directory
    • Filename:
      • File names that begin with a period are hidden; use ls -a to display hidden files and directories
      • File names are case sensitive
      • Linux has no concept of a file extension; a file's type is determined by its contents, although some applications still rely on extensions
      • Avoid spaces in file names
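The difference between absolute and relative pathnames can be tried out safely in a scratch directory. The following is a minimal sketch, assuming bash and the mktemp utility are available; the directory names projects and notes are made up for the example.

```shell
#!/usr/bin/env bash
# Build a small throwaway tree so the example does not touch real files.
base=$(mktemp -d)              # hypothetical scratch directory
mkdir -p "$base/projects/notes"

cd "$base/projects/notes"      # absolute pathname: it starts with /
here=$(pwd -P)

cd ..                          # relative pathname: .. is the parent directory
parent=$(pwd -P)

cd ./notes                     # ./ is the working directory, same as "cd notes"
back=$(pwd -P)

echo "$here"
echo "$parent"
echo "$back"
```

The first and last lines printed should be identical, and the middle line should be their parent directory.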
  3. Chapter 3: Exploring the System
    • Some commands:
      • ls: list directory contents
      • file: determine file type
      • less: view file contents
    • ls
      • ls -a: display all files
      • ls -A: display almost all files, do not list . and ..
      • ls -i: display the inode of the files
      • ls -l: display in long listing format
      • ls -r: reverse the order of the listing
      • ls -t: sort contents by modification time, newest first
      • ls -S: sort contents by file size, largest first
      • ls -R: recursive listing
      • ls -d: list the directory itself, not its contents
    • Permissions
      • Example: -rwxr--r--
        • The first character is the file type: - for a regular file, d for a directory, l for a symbolic link
        • The next three characters are the permissions for the owner, the following three for the group, and the last three for everyone else
        • r: the file can be opened and read
        • w: the file can be written or truncated; renaming or deleting the file is controlled by the permissions of its directory
        • x: the file can be treated as a program and executed
    • Some directories in Linux system
      • /: root directory
      • /bin: contain programs that must be present for the system to boot and run
      • /boot: contain the Linux kernel and the boot loader
      • /dev: contain a list of devices
      • /etc: contain all of the system-wide configuration files
      • /home: in general, each user is given a directory in /home
      • /lib: contain shared library files used by the core system programs
      • /mnt: contain mount points for removable devices
      • /opt: used to install optional software
      • /tmp: intended for the storage of temporary, transient files
      • /usr: contain all the programs and support files used by regular users
    • Links
      • Soft link/symlink/symbolic link: a file that contains a reference to another file or directory in the form of a pathname
      • Hard link: an additional directory entry (name) for an existing file; it refers to the same data rather than to a copy
  4. Chapter 4 Manipulating Files and Directories
    • Some commands:
      • cp: copy files and directories
      • mv: move/rename files and directories
      • mkdir: create directories
      • rm: remove files and directories
      • ln: create hard and symbolic links
    • Wildcards
      • *: match any characters
      • ?: match any single character
      • [characters]: match any character that is a member of the set characters
      • [!characters]: match any character that is not a member of the set characters
      • [[:class:]]: match any character that is a member of the specified class
      • [:alnum:]: match any alphanumeric character
      • [:alpha:]: match any alphabetic character
      • [:digit:]: match any numeral
      • [:lower:]: match any lowercase letter
      • [:upper:]: match any uppercase letter
    • mkdir dir1 ... dirn: make n directories
    • cp item1 ... itemn dir: copy item1 to itemn to dir
      • cp -a: copy the files and directories and all of their attributes
      • cp -i: before overwriting an existing file, prompt the user for confirmation
      • cp -r: recursively copy directories and their contents
      • cp -u: only copy files that either don't exist or are newer than the existing corresponding files in the destination directory
      • cp x y:
        • If both x and y are files, overwrite y with x, if y does not exist, create y with x
        • If x is a file and y is a directory, copy x into y
        • If x and y are both directories, you should use cp -r x y, otherwise there will be an error
    • mv item1 item2: move/rename item1 to item2, mv item0 ... itemn directory: move item0 to itemn into directory
      • mv -i: prompt the user for confirmation before overwriting an existing file
      • mv -u: only move files that either don't exist or are newer than the existing corresponding files in the destination directory
      • mv -v: display informative message as the move is performed
      • mv x y
        • If both x and y are files, overwrite y with x, if y does not exist, it is created with x
        • If x is a file and y is a directory, move x into y, y must exist
        • If x and y are both directories, if y does not exist, create y and move contents of x into y, then delete the empty x; if y does exist, move x and its contents into y
    • rm item1 ... itemn
      • rm -i: prompt the user for confirmation before removing an existing file, otherwise the file will be deleted silently
      • rm -r: recursively delete directories and their contents; this option is required to delete a directory
      • rm -f: delete the targets and ignore nonexistent files and do not prompt
      • rm -v: display informative message as the deletion is performed
    • ln file link: create a hard link to a file, ln -s item link: create a symbolic link to a file or directory
      • Hard links
        • A hard link cannot reference a file outside its own file system; this means a link cannot reference a file that is not on the same disk partition as the link itself
        • A hard link may not reference a directory
        • A hard link and the original file share the same inode, deleting the hard link does not affect the original file, the hard link still works if you delete the original file
      • Symbolic link
        • Symbolic links are created to overcome the limitations of hard links
        • It works by creating a special type of file that contains a text pointer to the referenced file or directory (similar to the shortcut in Windows)
        • A symbolic link and the original file have different inodes; the symbolic link stops working if the original file is moved or deleted, but deleting the symbolic link does not affect the original file
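The inode behaviour of hard and symbolic links described in this chapter can be checked directly. This is a small sketch, assuming bash, awk, and a file system that supports both link types; the file names are invented for the example.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)               # hypothetical scratch directory
cd "$work"

echo "hello" > original.txt
ln original.txt hard.txt        # hard link: a second name for the same inode
ln -s original.txt soft.txt     # symbolic link: a small file holding a pathname

# ls -i prints the inode number first; -d keeps ls from following the link.
inode_original=$(ls -di original.txt | awk '{print $1}')
inode_hard=$(ls -di hard.txt | awk '{print $1}')
inode_soft=$(ls -di soft.txt | awk '{print $1}')

rm original.txt                 # remove the original name

hard_content=$(cat hard.txt)    # still readable: one name for the data remains
if [ -e soft.txt ]; then soft_state="ok"; else soft_state="broken"; fi

echo "$inode_original $inode_hard $inode_soft"
echo "$hard_content $soft_state"
```

The hard link should report the same inode number as the original and keep working after the original name is removed, while the symbolic link has its own inode and ends up dangling.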
  5. Chapter 5 Working with Commands
    • Some commands:
      • type: indicate how a command name is interpreted
      • which: display which executable program will be executed
      • help: get help from shell built-ins
      • man: display a command's manual page
      • apropos: display a list of appropriate commands
      • info: display a command's info entry
      • whatis: display one-line manual page descriptions
      • alias: create an alias for a command
    • Command types
      • An executable program: programs compiled into binaries
      • A command built into the shell itself: shell builtins
      • A shell function: shell scripts incorporated into the environment
      • An alias: user-defined commands, built from other commands
    • type command: display the type of commands
    • which command: only works for executable programs, not builtins nor aliases
    • help command: help is available for each of the shell builtins, it shows the documentation of a command
    • command --help: display usage information
    • man command: display manual page
    • apropos command: display appropriate commands
    • whatis command: display one-line manual page descriptions
    • info command: display a program's info entry
    • alias command:
      • Use ; to separate commands in one line
      • Example: alias foo='cd /mnt; ls -lrt;'
      • Unalias a command: unalias foo
      • Show all aliases defined in the environment: alias
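The four command types listed in this chapter can be told apart with the -t option of the bash builtin type, which prints a single word. The sketch below assumes bash and that ls is an ordinary executable on the PATH; greet and foo are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# type -t prints one word describing how bash would interpret a name.
kind_cd=$(type -t cd)         # a command built into the shell
kind_ls=$(type -t ls)         # normally an executable program found on PATH

# Hypothetical shell function, defined only for this example.
greet() { echo "hi"; }
kind_greet=$(type -t greet)

# Aliases are active in interactive shells; a script has to opt in.
shopt -s expand_aliases
alias foo='cd /tmp; pwd'      # hypothetical alias, modelled on the example above
kind_foo=$(type -t foo)
unalias foo

echo "$kind_cd $kind_ls $kind_greet $kind_foo"
```

In an interactive session the shopt line is unnecessary, because alias expansion is already enabled there.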
  6. Chapter 6 Redirection
    • Some commands:
      • cat: concatenate files
      • sort: sort lines of text
      • uniq: report or omit repeated lines
      • grep: print lines matching a pattern
      • wc: print newline, word and byte counts for each file
      • head: output the first part of a file
      • tail: output the last part of a file
      • tee: read from standard input and write to standard output and files
    • Standard input (stdin, file descriptor 0), standard output (stdout, file descriptor 1), and standard error (stderr, file descriptor 2): by default, stdout and stderr are connected to the screen and stdin is attached to the keyboard
      • Redirecting output
        • Truncate/create a new file: ls -l > file.txt
        • Append to an existing file: ls -l >> file.txt
        • Discard both standard error and standard output:
          • Older version: command > /dev/null 2>&1
          • Newer version: command &> /dev/null
    • cat [..file]: read one or more files and copy them to the standard output
      • In most cases, the cat command can be thought of as analogous to the TYPE command in DOS
      • If cat is provided with no arguments, it will read input from the standard input, which is usually the keyboard, type Ctrl + D to indicate EOF
      • Read input from the keyboard and save the input to a file: cat > file.txt, use Ctrl + D to indicate input EOF
      • Read input from a file: cat < file.txt
    • | pipelines
      • Pipeline feature of shell: with the pipe operator | (vertical bar), the standard output of one command can be piped into the standard input of another
      • Syntax: command1 | command2
      • Filters:
        • Pipelines are often used to perform complex operations on data; the commands used this way are frequently referred to as filters
        • Filters take input, change it somehow, and then output it
        • sort: write sorted concatenation of all files to standard output
          • sort -f: ignore case, sort -r: reverse order, sort -R: random sort, sort -u: only output unique results
        • uniq: accepts a sorted list of data and removes any duplicates from the list
          • uniq is often used with sort
          • uniq -d: only print duplicate lines, one for each group, uniq -c: prefix lines by the number of occurrences
          • sort -u xxx, sort xxx | uniq
        • wc: display the number of lines, words, and bytes contained in files
          • wc -l: display the number of lines, wc -w: display the number of words, wc -c: display the number of bytes, wc -m: display the number of characters
        • grep pattern [file...]: used to find text patterns
          • When grep encounters a pattern in the file, it prints out the lines containing it
          • Regular expressions are allowed in grep
          • grep -i: ignore case during search, grep -v: print only those lines that do not match the pattern
        • head/ tail
          • By default, print 10 lines
          • head -n m: print first m lines, tail -n m: print last m lines
          • Monitor file changes: tail -f file.txt
        • tee: reads standard input and copies it to both standard output and one or more files
          • This is often used as an intermediate step in a pipeline: it saves the intermediate output to a file
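The filters from this chapter are easiest to understand when chained on a small input. The following sketch assumes bash with the usual coreutils, grep, and awk; the file names and the fruit data are made up for the example.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)               # hypothetical scratch directory
cd "$work"

# Invented sample data: one fruit per line, with repeats.
printf '%s\n' pear apple pear fig apple pear > fruit.txt

# sort groups identical lines together so that uniq can collapse them.
distinct=$(sort fruit.txt | uniq | wc -l)

# uniq -c prefixes each line with its count; sort -rn puts the largest first.
top=$(sort fruit.txt | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

# grep -v drops matching lines; tee saves the intermediate result and
# still passes it down the pipeline.
others=$(grep -v pear fruit.txt | tee no_pear.txt | wc -l)

# Send stdout to a file, then point stderr (2) at the same place as stdout (1).
# ls exits non-zero because missing.txt does not exist, hence the "|| true".
ls fruit.txt missing.txt > listing.txt 2>&1 || true
captured=$(wc -l < listing.txt)

echo "$distinct $top $others $captured"
```

Note that the order matters in the last redirection: 2>&1 has to come after the redirection of standard output, otherwise the error message still goes to the screen.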
  7. Chapter 7 Seeing the World as the Shell Sees it
    • Some commands:
      • echo: display a line of text
    • Expansion
      • Pathname expansion: the mechanism by which wildcards work in the shell
      • Tilde expansion: echo ~: display the home directory
      • Arithmetic expansion: echo $((expression)): display the result of the arithmetic expression
        • Notice the division here is integer division
        • Expressions can be nested, but you need the $(()) for each arithmetic expression
      • Brace expansion
        • echo {a,b}-{1,2}: display a-1 a-2 b-1 b-2 (no spaces are allowed inside the braces)
        • echo {a..c}: display a b c
        • echo {001..6}: display 001 002 003 004 005 006
        • echo {{a,b},{1,2}}: display a b 1 2
      • Parameter expansion:
        • x=1; echo $x: display 1
      • Quoting
        • ls -l "hello world"
        • echo this is \$100.0
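The expansions in this chapter all happen before the command itself runs, which can be observed by capturing what echo receives. This is a sketch under the assumption that bash is the shell; the file names are invented for the example.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)               # hypothetical scratch directory
cd "$work"
touch a1.txt a2.txt b1.log "two words.txt"

# Pathname expansion: the pattern is replaced by the matching file names.
txt_files=$(echo a?.txt)
class_match=$(echo [[:lower:]][[:digit:]].*)

# Arithmetic expansion: division is integer division.
quotient=$(( 7 / 2 ))
nested=$(( $((2 + 3)) * 4 ))    # a nested expression with its own $(( ))

# Brace expansion generates strings; they need not name existing files.
pairs=$(echo {a,b}-{1,2})
range=$(echo {a..c})

# Quoting: double quotes keep the space inside one argument,
# and a backslash stops $ from starting a parameter expansion.
quoted_count=$(ls "two words.txt" | wc -l)
price=$(echo this is \$100.0)

echo "$txt_files | $class_match | $quotient $nested | $pairs | $range | $price"
```

Without the double quotes, ls would receive two separate arguments, two and words.txt, and report that neither exists.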
  8. Chapter 9 Permissions
    • Some commands:
      • id: display user identity
      • chmod: change a file's mode
      • umask: set the default file permissions
      • su: run a shell as another user
      • sudo: execute a command as another user
      • chown: change a file's owner
      • chgrp: change a file's group ownership
      • passwd: change a user's password
    • id
      • In the Unix security model, a user has a user ID (uid), a primary group ID (gid), and may belong to additional groups (groups)
      • User accounts are defined in the /etc/passwd file and groups are defined in the /etc/group file, /etc/shadow contains information about the user's password
    • File attributes
      • File types:

        • -: a regular file
        • d: a directory
        • l: a symbolic link; for a symbolic link, the remaining attributes are always rwxrwxrwx, but they are only dummy values, the real file attributes are those of the file the symbolic link points to
        • c: a character special file, this file type refers to a device that handles data as a stream of bytes
        • b: a block special file, this file type refers to a device that handles data in blocks
      • Permission attributes

        Attribute | Files | Directories
        r | Allows a file to be opened and read | Allows a directory's contents to be listed if the execute attribute is also set
        w | Allows a file to be written or truncated, but does not allow files to be renamed or deleted | Allows files within a directory to be created, deleted, and renamed if the execute attribute is also set
        x | Allows a file to be treated as a program and executed | Allows a directory to be entered
    • chmod
      • Only the file's owner or superuser can change the mode of a file or directory

      • The mode can use two representations: the octal representation and the symbolic representation

        Octal | Binary | Symbolic
        0 | 000 | ---
        1 | 001 | --x
        2 | 010 | -w-
        3 | 011 | -wx
        4 | 100 | r--
        5 | 101 | r-x
        6 | 110 | rw-
        7 | 111 | rwx
      • Syntax

        • chmod 664 file: set file attribute to -rw-rw-r--
        • chmod u=rw,g=rw,o=r file: the same as the above command
    • umask
      • View current mask setting: umask

      • Set mask value: umask 0022

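      • Mask value interpretation: 0xyz, where the leading digit covers the special permission bits and is normally left at 0, x is for user, y is for group, z is for other; each bit set in the mask removes the corresponding permission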

      • Octal mask value and permission

        Value | Permission
        0 | rwx
        1 | rw-
        2 | r-x
        3 | r--
        4 | -wx
        5 | -w-
        6 | --x
        7 | ---
      • Common values:

        • 0022: 755 for directories, 644 for files
        • 0002: 775 for directories, 664 for files
    • Change identities
      • Methods
        • Log out and log back in as the alternate user
        • Use the su command
        • Use the sudo command
      • su: run a shell with substitute user and group ids
        • su -l user: abbreviation su - user, if user is not provided, the substitute user is the superuser(root)
        • Use exit to return to the original shell
        • Execute a command as another user: su -c 'command'
      • sudo: execute a command as another user
        • Allows an ordinary user to execute commands as a different user (usually the superuser) in a controlled way
        • This does not require the password of the superuser; users authenticate with their own password
        • sudo -l: list the allowed commands for the invoking user on the current host
    • chown
      • chown user:group file: to change both the owner and the group, use user:group; to change only the owner, use user; to change only the group, use :group; user: changes the owner and sets the group to that user's login group
      • Superuser privilege is required for this command, so use sudo chown user:group file
    • chgrp
      • In older versions of Unix, chown could not change group ownership, so chgrp was used for that purpose
    • passwd [user]
      • Enter passwd to change the password of the current user
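The interaction between umask and chmod can be observed on freshly created files. The sketch below assumes bash and a scratch directory without inherited ACLs; mode_of is a hypothetical helper that simply keeps the ten mode characters printed by ls -ld.

```shell
#!/usr/bin/env bash
work=$(mktemp -d)               # hypothetical scratch directory
cd "$work"

# Hypothetical helper: keep only the file type and the nine permission characters.
mode_of() { ls -ld "$1" | cut -c1-10; }

umask 0022                      # remove write permission for group and others
touch report.txt                # new files start from 666, so the result is 644
mkdir reports                   # new directories start from 777, so the result is 755
default_file=$(mode_of report.txt)
default_dir=$(mode_of reports)

chmod 664 report.txt            # octal: rw- for owner, rw- for group, r-- for others
octal_mode=$(mode_of report.txt)

chmod u=rwx,g=rx,o= report.txt  # symbolic: set each class explicitly
symbolic_mode=$(mode_of report.txt)

echo "$default_file $default_dir $octal_mode $symbolic_mode"
```

The umask only affects files created afterwards; it never changes the mode of a file that already exists, which is what chmod is for.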
  9. Chapter 10 Processes
    • Some commands
      • ps: report a snapshot of current processes
      • top: display tasks
      • jobs: list active jobs
      • bg: place a job in the background
      • fg: place a job in the foreground
      • kill: send a signal to a process
      • killall: kill processes by name
      • shutdown: shutdown or reboot the system
    • How processes work
      • When a system starts, the kernel launches a program called init; init then runs a series of shell scripts called init scripts, which start all the system services. Many services are implemented as daemon programs, so they run in the background
      • A program that launches other programs is described in the process scheme as a parent process producing a child process
      • The kernel maintains information about each process to help keep things organized, which includes the process ID (PID), the memory assigned to each process, and the process's readiness to resume execution
    • View Processes: ps
      • Shows PID, TTY (teletype, refers to the controlling terminal for the process), TIME (the amount of CPU time consumed by the process), CMD (the command executed by the process)
      • Options:
        • ps x: show all processes regardless of what terminal they are controlled by
        • STAT (state, reveals the current status of the process)
          • R: running or ready to run
          • S: sleeping, it is waiting for an event, such as keystroke or network packet
          • D: uninterruptible sleep, it is waiting for I/O such as a disk drive
          • T: stopped
          • Z: zombie, a child process that has terminated but not been cleaned up by its parent
          • <: a high-priority process
          • N: a low-priority process
      • ps aux: gives more information
        • Information: USER, %CPU, %MEM, VSZ (virtual memory size), RSS (resident set size), START
    • View Processes Dynamically with top
      • The result is continuously updating (by default, every 3 seconds)
    • Control Processes
      • Interrupt a process: Ctrl+C
      • Put a process in the background: command &
      • Return a process to the foreground:
        • Find the job number of the process: jobs; say the job number is n
        • Bring the process to the foreground: fg %n
      • Stop a process: Ctrl+Z
        • Ctrl+Z suspends (stops) a process so that it can be resumed later; Ctrl+C asks the process to terminate. A program can catch either signal, for example to clean up before exiting or to ignore the request altogether; only the STOP and KILL signals cannot be caught
        • For a stopped process, you can bring it to foreground or send it to background
      • Signals
        • For Ctrl+C, a signal called INT is sent
        • For Ctrl+Z, a signal called TSTP is sent
        • Send a signal to a process: kill -signal PID, where signal is a number or a name
        • Common signals
          • kill -1: HUP signal, send the process a hangup signal
          • kill -2: INT signal, send the process an interrupt signal
          • kill -9: KILL signal, send the process a kill signal
          • kill -15: TERM signal, send the process a terminate signal
          • kill -18: CONT signal, send the process a continue signal
          • kill -19: STOP signal, send the process a stop signal
          • kill -20: TSTP signal, send the process a terminal stop signal
          • kill -3: QUIT signal, send the process a quit signal
          • kill -11: SEGV signal, send the process a segmentation violation signal
          • kill -28: WINCH signal, send the process a window change signal
      • Shut down the system
        • Function: orderly terminate all the processes on the system, then power off the system
        • Commands: halt, poweroff, reboot, shutdown
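Sending signals with kill can be tried on a harmless background process. This is a minimal sketch, assuming bash; sleep 60 merely stands in for a long-running program, and signal names are used instead of numbers because the numbers vary between platforms.

```shell
#!/usr/bin/env bash
sleep 60 &                      # run a long command in the background
pid=$!                          # $! holds the PID of the last background process

kill -TERM "$pid"               # same as kill -15: ask the process to terminate
term_status=0
wait "$pid" || term_status=$?   # 128 + signal number when a signal ended it

sleep 60 &
pid=$!
kill -STOP "$pid"               # suspend the process; STOP cannot be caught
kill -CONT "$pid"               # let the suspended process continue
kill -KILL "$pid"               # same as kill -9: no chance to clean up
kill_status=0
wait "$pid" || kill_status=$?

echo "$term_status $kill_status"
```

TERM is usually the better first choice, since it gives the program a chance to clean up; KILL is a last resort for processes that do not respond.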
  10. Chapter 11 The Environment
    • Some commands:
      • printenv: print part or all of the environment
      • set: set shell options
      • export: export environment to subsequently executed programs
      • alias: create an alias for a command
    • What is stored in the environment
      • Two types of variables:
        • Shell variables: bits of data set by bash
        • Environment variables: other variables
      • Programmatic data: aliases and shell functions
    • Examine the environment
      • printenv: the result is in key=value format
      • printenv key: print the value of the key
      • echo $key: print the value of the key
    • How is the environment established
      • When bash starts, it reads a series of configuration scripts called startup files: first those that define the default environment shared by all users, then those in the user's home directory that define the personal environment
      • A login shell session reads
        • /etc/profile: global configuration
        • ~/.bash_profile: user startup file
        • ~/.bash_login: if the above one is not found
        • ~/.profile: if the above two are not found
      • A non-login shell session reads
        • /etc/bash.bashrc: global configuration
        • ~/.bashrc: user startup file
    • Modify the environment
      • Which file to modify
        • To add directories to your PATH variable, put those in the .bash_profile/.profile file
        • For everything else, put the changes into the .bashrc file
      • To edit these files, we use a text editor
        • Graphical editors:
          • gedit from GNOME
          • kedit, kwrite, kate from KDE
        • Text-based editors: nano, vi (in most Linux systems this is replaced by vim, which is short for "vi improved"), emacs
        • Some vim commands
          • :set number: display line numbers
          • :wq, :x, ZZ: save and quit
          • :q!: quit without saving
          • gg: go to the first line
          • G: go to the last line
          • ngg, nG: go to the nth line
          • 0: jump to the start of the line
          • $: jump to the end of the line
          • ^: jump to the first non-blank character in the line
          • g_: jump to the last non-blank character in the line
          • i: insert before the cursor
          • a: insert after the cursor
          • I: insert at the beginning of the line
          • A: insert at the end of the line
          • o: open a new line below the current line
          • O: open a new line above the current line
          • r: replace the current character
          • R: replace characters until ESC is pressed
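Returning to the environment itself: the distinction between shell variables and environment variables made at the start of this chapter shows up as soon as a child process is started. The sketch below assumes bash and printenv are available; COURSE is a hypothetical variable name chosen for the example.

```shell
#!/usr/bin/env bash
# COURSE is a hypothetical variable used only for this sketch.
COURSE="COSI 103A"              # shell variable: visible only in this shell
seen_before=$(bash -c 'echo "${COURSE:-unset}"')

export COURSE                   # now part of the environment of child programs
seen_after=$(bash -c 'echo "$COURSE"')

listed=$(printenv COURSE)       # printenv key prints the value of one variable

echo "$seen_before | $seen_after | $listed"
```

This is why settings such as PATH are exported in the startup files: without export, programs launched from the shell would never see them.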
  11. Chapter 14 Package Management
    • Introduction
      • The most important determinant of Linux distribution quality is the packaging system and the vitality of the distribution's support community
      • Package management is a method of installing and maintaining software on the system
        • Nowadays we can install packages from the Linux distributor
        • In the early days, people had to download and compile source code to install software
    • Packaging Systems
      • Different distributions use different packaging systems, and generally a package designed for one distribution is not compatible with another distribution
      • Two main packaging technologies
        • .deb camp from Debian: Debian, Ubuntu, Linux Mint, Raspbian
        • .rpm camp from Red Hat: Fedora, CentOS, Red Hat Enterprise Linux, OpenSUSE
      • How a Package System Works
        • Virtually all software for a Linux system will be found on the Internet, most will be provided by the distribution vendor in the form of package files, and the rest will be available in source code form that can be installed manually
      • Package files
        • A package file is a compressed collection of files that comprise the software package
        • A package may contain programs and data files, metadata files, pre- and post-installation scripts that perform configuration tasks
        • Package files are created by the package maintainer
      • Repositories: packages are often hosted in a central repository
      • Dependencies: shared resources, such as shared libraries, that a program needs in order to run properly
      • Package management systems tools
        • Low-level tools: install and remove package files

        • High-level tools: search metadata and resolve dependencies

          Distributions | Low-Level Tools | High-Level Tools
          Debian style | dpkg | apt, apt-get, aptitude
          Fedora, Red Hat Enterprise Linux | rpm | yum, dnf
    • Common Package Management Tasks
      • Find a package in a repository
        • Debian style: apt-get update; apt-cache search search_string
        • Red Hat style: yum search search_string
      • Install a package from a package file
        • Debian style: dpkg -i package_file
        • Red Hat style: rpm -i package_file
      • List installed packages
        • Debian style: dpkg -l
        • Red Hat style: rpm -qa
      • Determine whether a package is installed
        • Debian style: dpkg -s package_name
        • Red Hat style: rpm -q package_name
      • Display information about a package
        • Debian style: apt-cache show package_name
        • Red Hat style: yum info package_name
      • Find which package installed a file
        • Debian style: dpkg -S file_name
        • Red Hat style: rpm -qf file_name
  12. Chapter 16 Networking
    • Some commands
      • ping: send an ICMP ECHO_REQUEST to network hosts
      • traceroute: print the route packets trace to a network host
      • ip: show/manipulate routing, devices, policy routing and tunnels
      • netstat: print network connections, routing tables, interface statistics, masquerade connections, and multicast memberships
      • ftp: internet file transfer program
      • wget: non-interactive network downloader
      • ssh: OpenSSH SSH client (remote login program)
    • Examine and monitor a network
      • ping
        • Sends a special network packet to a specified host, most devices receiving this packet will reply to it, allowing the network connection to be verified
        • Once started, ping continues to send packets at a specified interval (default is 1 second) until it is interrupted (say, with Ctrl+C)
        • After it is interrupted, ping prints performance statistics
      • traceroute
        • traceroute lists all the hops (routers) that network traffic passes through to get from the local system to a specified host
      • ip
        • ip a: show the network interfaces and their addresses
      • netstat
        • netstat -i: display a table of all network interfaces
        • netstat -e: display additional information
        • netstat -r: display the kernel routing tables
        • netstat -n: show numerical addresses instead of trying to determine symbolic host, port or user names
    • Transport files over a network
      • ftp
        • File Transfer Protocol (FTP) was once the most widely used method of downloading files over the Internet
        • ftp is used to communicate with FTP servers, machines that contain files that can be uploaded and downloaded over a network
        • FTP is not secure because it sends account names and passwords in cleartext, almost all FTP done over the Internet is done by anonymous FTP servers
        • ftp servername: log in to an FTP server
      • lftp
        • It works much like the traditional ftp program but has many additional convenience features including multiple-protocol support, automatic retry on failed downloads, background processes, tab completion of path names, and many more
      • wget
        • It is useful for downloading content from both web and FTP sites
        • Single files, multiple files, and even entire sites can be downloaded
        • wget supports recursive downloads, downloading in the background, and resuming a partially downloaded file
    • Secure communication with remote hosts
      • Before ssh, there were commands like rlogin and telnet, but they transmit all communication in cleartext, which is inappropriate for use in the Internet age
      • Advantages:
        • It authenticates that the remote host is who it says it is, thus preventing so-called man-in-the-middle attacks
        • It encrypts all of the communications between the local and remote hosts
      • SSH consists of two parts
        • An SSH server runs on the remote host, listening for incoming connections, by default on port 22
        • An SSH client is used on the local system to communicate with the remote server
      • Most Linux distributions ship an implementation of SSH called OpenSSH from the OpenBSD project
      • Syntax: ssh user@hostname
      • Use SSH-encrypted tunnel to copy files across the network
        • scp (secure copy), scp source destination
        • sftp (secure file transfer), sftp hostname
  13. Chapter 19 Regular Expressions
    • Introduction
      • Regular expressions are symbolic notations used to identify patterns in text
      • We only consider regular expressions described in the POSIX standard
    • grep
      • The name grep is short for "global regular expression print"; it searches text files for text matching a specified regular expression and outputs any line containing a match to standard output
      • Syntax: grep [options] regex [file...]
      • grep options
        • grep -i: ignore case
        • grep -v: invert match, prints lines that do not match
        • grep -c: print the number of matches instead of the lines themselves
        • grep -l: print the name of each file that contains a match instead of the lines themselves
        • grep -L: print the name of each file that does not contain any matched lines
        • grep -n: prefix each matching line with the line number
        • grep -h: for multi-file searches, suppress the output of filenames
    • Metacharacters and literals
      • We can use literals in regular expressions
      • We can also use metacharacters in regular expressions
        • Metacharacters: ^, $, ., [], {}, -, ?, *, +, (), |, \
    • The any character
      • . matches any character in a character position
    • Anchors
      • The caret ^ and the dollar sign $ are treated as anchors in regular expressions
      • ^ matches the beginning of a line
      • $ matches the end of the line
    • Bracket expressions and character classes
      • Bracket expression matches a single character from a specified set of characters
      • Metacharacters lose their special meaning inside bracket expressions, except ^ and -
        • ^: if it appears at the beginning inside a bracket expression, then the following set of characters must not be present at the given character position
        • A-Z: matches all uppercase letters
      • POSIX character classes
        • [:alnum:]: [A-Za-z0-9]
        • [:word:]: [A-Za-z0-9_]
        • [:alpha:]: [A-Za-z]
        • [:blank:]: Space and tab
        • [:cntrl:]: ASCII control codes, include ASCII characters 0 through 31 and 127
        • [:digit:]: [0-9]
        • [:graph:]: The visible characters, include ASCII characters 33 through 126
        • [:lower:]: [a-z]
        • [:upper:]: [A-Z]
        • [:punct:]: The punctuation characters, in ASCII, equivalent to [-!"#$%&'()*+,./:;<=>?@[\\\]_{|}~]
        • [:print:]: The printable characters, all characters in [:graph:] plus the space character
        • [:space:]: The whitespace characters, in ASCII, equivalent to [ \t\r\n\v\f]
        • [:xdigit:]: Hexadecimal numbers, in ASCII, equivalent to [0-9A-Fa-f]
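      • POSIX basic vs. extended regular expressions
        • BRE (basic regular expressions): only ^ $ . [ ] * are metacharacters; ( ) { } act as metacharacters only when escaped with a backslash
        • ERE (extended regular expressions): adds ( ) { } ? + | as metacharacters; a preceding backslash turns a metacharacter into a literal. Use grep -E for ERE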
      • Alternation
        • grep -E 'AAA|CCC|BBB' file: matches either AAA, or BBB, or CCC in the file
        • Combine alternation with other regular expression, use parentheses on alternation:
          • grep -E '^(aa|bb|cc)' file: matches either aa, or bb, or cc at the beginning of a line in the file
      • Quantifiers
        • ?: matches an element zero or one time
        • *: matches an element zero or more times
        • +: matches an element one or more times
        • {n}: matches an element exactly n times
        • {n,m}: matches an element at least n times, and no more than m times
        • {n,}: matches an element at least n times
        • {,m}: matches an element no more than m times
      • Some applications
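        • find: search for files in a directory hierarchy, the -regex test matches pathnames against a regular expression
          • grep prints a line when the line contains a match, find -regex requires the whole pathname to match the pattern exactly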
        • locate: find files by name
          • locate --regexp pattern: use BRE
          • locate --regex pattern: use ERE
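The anchors, bracket expressions, alternation, and quantifiers listed above can be tried out quickly in Python. This is a minimal sketch that assumes Python's re module as a stand-in for grep -E: the syntax overlaps with POSIX ERE for these features but is not identical (for example, re does not support POSIX classes such as [:digit:]). The grep helper and the sample lines are hypothetical.

```python
import re

# Hypothetical sample lines standing in for the contents of a text file
lines = ["aa first", "bb second", "xaa third", "cc", "Zip: 02453", "zip 1234"]

def grep(pattern, lines, ignore_case=False, invert=False):
    """Return the lines containing a match, loosely mimicking grep -E with -i and -v."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    return [ln for ln in lines if bool(regex.search(ln)) != invert]

anchored = grep(r"^(aa|bb|cc)", lines)     # alternation anchored at the start of a line
five_digits = grep(r"[0-9]{5}$", lines)    # bracket expression, {n} quantifier, end anchor
no_zip = grep(r"zip", lines, ignore_case=True, invert=True)  # like grep -i -v
```

Note that "xaa third" is not in anchored, because ^ ties the alternation to the first character of the line.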

Pro Git

  1. Chapter 1
    • Version control
      • Def: a system that records changes to a file or set of files over time so that you can recall specific versions later
      • Local version control system
      • Centralized version control system
      • Distributed version control system
    • Git
      • Most other VCSs store information as a list of file-based changes
      • Git thinks of its data like a series of snapshots of a miniature filesystem
    • Three states of files in git
      • Modified: the file is changed but not committed to the database yet
      • Staged: a modified file is marked in its current version to go into the next commit snapshot
      • Committed: the file has already been safely stored in the local database

Python for Data Analysis (this is not a required book for this course)

  1. Chapter 4 NumPy Basics
    • Introduction
      • NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python
      • Data analysis applications using NumPy
        • Fast vectorized array operations
        • Common array algorithms like sorting, unique, and set operations
        • Efficient descriptive statistics and aggregating/summarizing data
        • Data alignment and relational data manipulations for merging and joining together heterogeneous datasets
        • Expressing conditional logic as array expressions instead of loops with if branches
        • Group-wise data manipulation
    • The NumPy ndarray Ch 4.1
      • Example 1
        import numpy as np

        data = np.random.randn(2,3) # generate a 2*3 array of random numbers
        data = data * 3 # multiply each element by 3
        data.shape # (2,3)
        data.dtype # dtype('float64')
      • Example 2
        import numpy as np

        data = [[1,2], [3,4]]
        arr = np.array(data)
        arr.ndim # 2
        arr.shape # (2, 2)
        arr.dtype # dtype('int64')
      • Example 3
        import numpy as np

        arr1 = np.zeros((2,3)) # create a 2*3 array of 0
        arr2 = np.ones((1,2)) # create a 1*2 array of 1
      • Arrays can hold different data types (dtypes)
        import numpy as np

        a1 = np.zeros((2,2))
        a1.dtype # dtype('float64')
        a2 = a1.astype(np.int32)
        a2.dtype # dtype('int32')
        a3 = np.ones((2,2), dtype=np.int32)
        a3.dtype # dtype('int32')
      • Any arithmetic operation between equal-size NumPy arrays applies the operation element-wise
      • Indexing and slicing
        import numpy as np

        x = np.ones(5)
        x[2:] = 2 # x is now [1, 1, 2, 2, 2]
        y = x[:2] # a slice is a view on x, not a copy
        y[0] = 3 # y is [3, 1], and x is [3, 1, 2, 2, 2]

        z = np.array([[1,2],[3,4]])
        z[:, :1] # [[1], [3]]
      • Boolean indexing
        import numpy as np

        index = np.arange(3)
        data = np.arange(1, 10).reshape((3, 3))
        data[index == 2] # [[7, 8, 9]], the rows where the mask is True
        cond = index != 2
        data[~cond] # [[7, 8, 9]], ~ inverts the mask
        data[:, index == 2] # [[3], [6], [9]]
        data[data < 5] = 5 # data is now [[5, 5, 5], [5, 5, 6], [7, 8, 9]]
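One consequence of the indexing rules above is easy to miss: a basic slice is a view on the original data, while boolean indexing returns a copy. The following is a small sketch of that difference; the array values are arbitrary.

```python
import numpy as np

x = np.arange(5)            # [0, 1, 2, 3, 4]
view = x[1:3]               # basic slice: a view onto x's data
view[:] = 10                # writes through to x
copy = x[1:3].copy()        # explicit copy: independent of x
copy[:] = -1                # x is not affected by this

mask = x > 5                # boolean array, True where x > 5
selected = x[mask]          # boolean indexing returns a new array (a copy)
selected[:] = 0             # does not modify x
```

Assigning through a mask on the left-hand side, as in x[mask] = 0, does modify x; only the array returned by x[mask] is a copy.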
      • Transposing arrays and swapping axes
        • x.T: transpose of an array
        • np.dot(x, x.T): matrix multiplication
        • transpose function
          import numpy as np

          x = np.arange(1, 7).reshape((2, 3)) # 6 elements are needed to fill a 2*3 array
          x.shape # (2, 3)
          x = np.transpose(x, (1, 0)) # the second argument is a permutation of range(x.ndim)
          x.shape # (3, 2)
    • Universal functions Ch 4.2
      • A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays
      • Functions: np.abs(arr), np.sqrt(arr), np.square(arr), np.exp(arr), np.log(arr), np.log2(arr), np.sign(arr), np.ceil(arr), np.floor(arr), np.modf(arr) (returns two arrays: the fractional parts first, then the integral parts), np.isnan(arr), np.isfinite(arr), np.isinf(arr), np.add/subtract/multiply/divide(a1, a2), np.power(a1, a2), np.maximum/minimum(a1, a2), np.mod(a1, a2), np.greater/less/greater_equal/less_equal/equal/not_equal(a1, a2), np.logical_and/logical_or/logical_xor(a1, a2)
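A few of the ufuncs listed above, applied to a small array. This is only an illustrative sketch with arbitrary values; it shows one unary ufunc, one ufunc with two outputs, and two binary ufuncs.

```python
import numpy as np

arr = np.array([-1.5, 0.25, 2.0])

frac, whole = np.modf(arr)       # two outputs: fractional parts, then integral parts
signs = np.sign(arr)             # unary ufunc: -1, 0, or 1 per element
larger = np.maximum(arr, 0.0)    # binary ufunc: element-wise maximum against a scalar
is_pos = np.greater(arr, 0)      # comparison ufunc, same result as arr > 0
```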
    • Array-oriented programming Ch 4.3
      • Example 1
        import numpy as np
        import matplotlib.pyplot as plt

        points = np.arange(-5, 5, 0.01) # 0.01 is the step size
        xs, ys = np.meshgrid(points, points)
        z = np.sqrt(xs ** 2 + ys ** 2)
        plt.imshow(z, cmap=plt.cm.gray)
        plt.colorbar()
      • Convert conditional logic into array expressions
        • Example 1
          import numpy as np

          xs = np.arange(1, 6)
          ys = np.arange(6, 11)
          rands = np.random.randn(5)
          cond = rands > 0
          # The following are equivalent, but the second one is more efficient
          res1 = [(x if c else y) for x, y, c in zip(xs, ys, cond)]
          res2 = np.where(cond, xs, ys)
        • Example 2
          import numpy as np

          ma = np.arange(1, 17).reshape((4,4))
          rands = np.random.randn(16).reshape((4, 4))
          cond = rands > 0
          res = np.where(cond, 2, ma) # use 2 where cond is True, otherwise keep the element from ma
      • Mathematical and statistical operations
        import numpy as np

        arr = np.random.randn(16).reshape((4, 4))
        arr.mean() # mean of all elements, equivalent to np.mean(arr)
        arr.mean(axis=0) # mean of each column, equivalent to np.mean(arr, axis=0)
        arr.mean(axis=1) # mean of each row, equivalent to np.mean(arr, axis=1)
        arr.cumsum(axis=0) # cumulative sum of each column
        arr.cumprod(axis=1) # cumulative product of each row
        • Other methods: sum, mean, std, var, min, max, argmin, argmax, cumsum, cumprod
        • Boolean array methods
          • Count the positive values: (arr > 0).sum()
          • Return True if at least one value is True: arr.any()
          • Return True if all values are True: arr.all()
        • Sorting
          • One-dimensional array sorting (in place): arr.sort()
          • Sorting each row of a 2D array: arr.sort(1)
          • Sorting each column of a 2D array: arr.sort(0)
        • Unique and set logic
          • np.unique(array)
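The sorting and set-logic bullets above can be sketched as follows. The sketch also uses np.sort and np.isin, which are not mentioned in the notes: np.sort returns a sorted copy instead of sorting in place, and np.isin tests membership element-wise.

```python
import numpy as np

arr = np.array([[9, 1, 8], [3, 7, 2]])
sorted_copy = np.sort(arr, axis=1)   # sorted copy of each row, arr itself is unchanged
arr.sort(axis=0)                     # sorts each column in place

names = np.array(["b", "a", "b", "c", "a"])
uniq = np.unique(names)                  # sorted unique values
is_member = np.isin(names, ["a", "c"])   # element-wise membership test
```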
    • Linear algebra Ch 4.5
      • x.dot(y) is equivalent to np.dot(x, y), also equivalent to x @ y
      • Inverse, and QR decomposition
        import numpy as np
        from numpy.linalg import inv, qr

        x = np.random.randn(5, 5)
        mat = x.dot(x)
        inv(mat) # inverse of mat
        q, r = qr(mat) # QR decomposition of mat
    • Pseudorandom number generation Ch 4.6
      • Set seed: np.random.seed(num)
      • np.random.randn(): standard normal distribution
      • np.random.rand(): uniform distribution on [0, 1)
      • np.random.randint(): random integers from a given range
      • np.random.binomial(): binomial distribution
      • np.random.beta(): beta distribution
      • np.random.chisquare(): chi-square distribution
      • np.random.uniform(): uniform distribution on a given interval
      • np.random.gamma(): gamma distribution
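Seeding and the linear algebra functions can be checked together in a few lines. This is a minimal sketch, assuming a 3*3 matrix and seed 0 purely for illustration: reseeding with the same number reproduces the same draws, and the results of qr and inv can be verified by multiplying them back.

```python
import numpy as np
from numpy.linalg import inv, qr

np.random.seed(0)              # fix the seed so the draws are repeatable
a = np.random.randn(3, 3)
np.random.seed(0)              # the same seed again ...
b = np.random.randn(3, 3)      # ... produces the same numbers

mat = a.T @ a                  # symmetric matrix built from a
q, r = qr(mat)                 # QR decomposition, q @ r reconstructs mat
identity = mat @ inv(mat)      # should be close to the identity matrix
```

Because of floating-point rounding, compare such results with np.allclose and not with ==.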

  2. Chapter 5 Getting started with pandas
    • Introduction to pandas data structures Ch 5.1
      • Series is a one-dimensional array-like object containing a sequence of values
        import pandas as pd

        ls = list(range(10))
        obj = pd.Series(ls, index=list(range(1, 11)))
        obj.values # NumPy array of the values 0 through 9
        obj.index # Index of the labels 1 through 10
        10 in obj # True, in checks the index labels, not the values
        0 in obj # False, 0 is a value but not an index label

        dt = {0: 1, 1: 2, 2: 3}
        obj2 = pd.Series(dt)
        obj2.index # Index of the labels 0 through 2, taken from the dict keys
        obj2.name = "my_series" # give name to a series
        obj2.index.name = "my_index" # give name to the index of a series
      • DataFrame
        • A dataframe represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type
        • A dataframe can have both a row and column index
        • The most common way to construct a dataframe is from a dict of equal-length lists or NumPy arrays
          import pandas as pd

          data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
                  'year': [2000, 2001, 2002, 2001, 2002, 2003],
                  'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
          df1 = pd.DataFrame(data)
          index = [str(x) for x in range(6)] # row labels '0' through '5'
          df = pd.DataFrame(data, columns=['pop', 'year', 'state'], index=index) # alter the order of columns
          # If a requested column is not in the original data, its values will be NaN

          df.columns # Index of column names
          df.index # Index of row names

          df.head(3) # get the first 3 rows
          df['pop'] # get the data in column whose name is pop
          df.loc['1'] # get the data in row whose name is 1
          df.loc[['1', '3']] # get the data in rows whose name is 1 or 3
          df.loc['1', 'pop'] # get the data entry whose column is pop and row is 1

          # add a column
          df['eastern'] = df.state == 'Ohio'
          # delete a column
          del df['eastern']
          # dataframe transpose
          df.T
          # Add names to the dataframe indices
          df.index.name = 'my_index'
          df.columns.name = 'my_columns'
      • Index objects
        • Index objects hold the axis labels of a Series or DataFrame; they are immutable
        • You can also create an index manually
          import numpy as np
          import pandas as pd

          index = pd.Index(np.arange(3))
        • There can be duplicates in pandas index
        • Index operations: append, difference, intersection, union, isin, delete, drop, insert, is_monotonic, is_unique, unique
    • Essential Functionality Ch 5.2
      • Reindexing
        • The reindex method creates a new object whose data is rearranged to conform to the new index
        • If an index label did not exist before, its value is filled with the missing value NaN
          import pandas as pd

          obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
          obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e']) # NaN is associated with index 'e'
        • Other filling related arguments
          • method='ffill': fill the missing values with the previous valid observation
          • method='backfill': fill the missing values with the next valid observation
          • method='nearest': fill the missing values with the nearest valid observation
          • fill_value=n: fill the missing values with n
        • For a dataframe, reindex applies to the row index unless specified otherwise. To reindex the columns, use obj.reindex(columns=list)
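The filling arguments above can be compared side by side. This is a small sketch with made-up data: the Series has gaps at labels 1, 3, and 5, and the DataFrame is reindexed on its columns.

```python
import pandas as pd

obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

forward = obj.reindex(range(6), method="ffill")       # carry the previous value forward
constant = obj.reindex(range(6), fill_value="none")   # use a fixed value instead of NaN

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
by_cols = frame.reindex(columns=["b", "c"])           # column "c" is new, so it is all NaN
```

The method argument requires the existing index to be sorted (monotonic), which is why the example uses the labels 0, 2, 4.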
      • Drop items
        • For a series, pass the index labels to drop; drop returns a new object
          import pandas as pd
          import numpy as np

          obj = pd.Series(np.arange(6), index=np.arange(6))
          obj.drop([1, 2])
        • For dataframe, you can drop either rows or columns
          import pandas as pd
          import numpy as np

          data = pd.DataFrame(np.arange(16).reshape((4, 4)),
          index=['Ohio', 'Colorado', 'Utah', 'New York'],
          columns=['one', 'two', 'three', 'four'])
          rows = ['Ohio', 'Utah']
          cols =['one', 'four']
          data.drop(rows) # Equivalent to data.drop(rows, axis=0), or data.drop(rows, axis='index'), since the default value for axis is 0
          data.drop(cols, axis=1) # Equivalent to data.drop(cols, axis='columns')
      • Indexing, selection, and filtering
        • For a series, indexing works like NumPy array indexing, except that index labels can also be used
          import pandas as pd
          import numpy as np

          obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
          obj['a':'c'] # Label slice, both endpoints are inclusive, result is [0.0, 1.0, 2.0]
          obj[0:3] # Integer slice, result is [0.0, 1.0, 2.0], 3 is exclusive
        • For a dataframe, you can index on both rows and columns
          • df[:2]: the first two rows
          • df['x'] : column x
          • df[['x', 'y']]: column x and column y
          • df[df['x'] > 2]: rows whose value in column x is greater than 2
        • loc and iloc
          • loc uses axis labels, iloc uses integer positions
          • df.loc['r1', 'c1':'cn']: row is r1, columns are from c1 to cn (both c1 and cn are included)
          • df.iloc[:, :]: select all rows and all columns
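The difference between label-based and position-based selection is easiest to see on the slice endpoints. This is a short sketch with a hypothetical 3*4 frame whose labels are r1..r3 and c1..c4.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=["r1", "r2", "r3"],
                  columns=["c1", "c2", "c3", "c4"])

by_label = df.loc["r1", "c1":"c3"]    # label slice: c3 is included
by_position = df.iloc[0, 0:3]         # integer slice: position 3 is not included
filtered = df.loc[df["c1"] > 0, ["c2", "c4"]]  # boolean row filter plus column labels
```

Both by_label and by_position select the same three cells here, even though the slices are written differently.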
        • Arithmetic and data alignment
          • If an index label appears in only one of the two series or dataframes, the result of the arithmetic operation is NaN for that label
          • You can pass the fill_value parameter to the arithmetic methods to substitute a value for the missing ones
          • Some operations
            • df1.add(df2) (df1 + df2), df1.radd(df2) (df2 + df1)
            • df1.sub(df2) (df1 - df2), df1.rsub(df2) (df2 - df1)
            • df1.mul(df2) (df1 * df2), df1.rmul(df2) (df2 * df1)
            • df1.div(df2) (df1 / df2), df1.rdiv(df2) (df2 / df1)
          • Arithmetic operations between dataframe and series
            • By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows
            • If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods df1.add(ls, axis='index')
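The alignment rules above, sketched on small made-up objects: the + operator leaves NaN where a label is missing on one side, add with fill_value fills the gap first, and axis='index' matches a Series against the row labels of a DataFrame.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

plain = s1 + s2                     # labels found on only one side give NaN
filled = s1.add(s2, fill_value=0)   # treat a missing label as 0 before adding

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r1", "r2"])
row_offsets = pd.Series([10, 20], index=["r1", "r2"])
by_rows = df.add(row_offsets, axis="index")  # match on row labels, broadcast across columns
```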
        • Function application and mapping
          • NumPy ufuncs (element-wise array methods) also work with pandas objects
          • Apply functions on columns and rows
            import numpy as np
            import pandas as pd

            df = pd.DataFrame(np.random.randn(4, 3), columns=list('bde')) # sample data
            f = lambda x: x.max() - x.min()
            df.apply(f) # Calculate max - min for each column
            df.apply(f, axis='columns') # Calculate max - min for each row

            g = lambda x: pd.Series([x.min(), x.max()], index=['min', 'max'])
            df.apply(g) # Calculate min and max for each column
          • Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary
          • Apply functions for each item in the dataframe
            import numpy as np
            import pandas as pd

            df = pd.DataFrame(np.arange(16).reshape(4,4))
            f = lambda x: '%2.2f' % x
            df.applymap(f) # element-wise; pandas 2.1+ deprecates applymap in favor of df.map(f)
            # You can use map on a series
            ls = pd.Series(np.arange(16))
            ls.map(f)
        • Sorting and ranking
          • Use ascending=False to sort in descending order
          • ls.sort_index(): sort by series index
          • df.sort_index(): sort by dataframe row index
          • df.sort_index(axis=1): sort by dataframe column index
          • ls.sort_values(): sort series values
          • df.sort_values(by=['a', 'b'], axis=0, ascending=True): sort dataframe values by column a and column b (sort row values in those columns, so axis=0)
          • ls.rank(method='average/first/max/min/dense'): rank a series by different methods
          • df.rank(method='average', axis=0/1): rank a dataframe by rows/columns using different methods
        • Index with duplicates
          • ls.index.is_unique: True if the series index has no duplicates (is_unique is a property, not a method)
          • df.index.is_unique: True if the dataframe row index has no duplicates
          • df.columns.is_unique: True if the dataframe column index has no duplicates
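The sorting and ranking calls above, applied to a small made-up Series with one tie. The sketch shows how method='average' and method='first' treat the tied values differently.

```python
import pandas as pd

s = pd.Series([7, -5, 7, 4], index=["d", "a", "c", "b"])

by_label = s.sort_index()                   # order by the index labels a, b, c, d
by_value = s.sort_values(ascending=False)   # largest value first
avg_rank = s.rank()                         # ties share the average rank (default method)
first_rank = s.rank(method="first")         # ties are broken by order of appearance
no_dupes = s.index.is_unique                # True: every label occurs exactly once
```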
      • Summarizing and computing descriptive statistics Ch 5.3
        • df.sum(), df.mean(), df.cumsum(), df.cumprod(), df.count(), df.min(), ls.argmin() (series only), df.idxmin(), df.median(), df.mad() (removed in pandas 2.0), df.prod(), df.var(), df.std(), df.skew(), df.kurt(), df.diff(), df.pct_change(), df.corr(), df.cov()
        • Common arguments: axis=0/1, skipna=True/False
        • Unique values, value counts, and membership
          • ls.unique(): return unique values
          • ls.value_counts(sort=False/True): return value frequencies
          • pd.value_counts(ls, sort=False/True): return value frequencies
          • mask = ls1.isin(ls2): return a boolean series indicating whether each item in ls1 is in ls2
          • df.apply(pd.value_counts).fillna(0): calculate value frequencies for dataframe
            import numpy as np
            import pandas as pd

            data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                                 'Qu2': [2, 3, 1, 2, 3],
                                 'Qu3': [1, 5, 2, 4, 4]})
            result = data.apply(pd.value_counts).fillna(0)
            """
               Qu1  Qu2  Qu3
            1  1.0  1.0  1.0
            2  0.0  2.0  1.0
            3  2.0  2.0  0.0
            4  2.0  0.0  2.0
            5  0.0  0.0  1.0
            """
  3. Chapter 6 Data loading, storage, and file formats
    • Reading and writing data in text format Ch 6.1
      • Parsing functions in pandas
        • read_csv, read_table, read_fwf, read_clipboard, read_excel, read_hdf, read_html, read_json, read_msgpack, read_pickle, read_sas, read_sql, read_stata, read_feather
      • Parameters: sep=',', header=None/int, names=[c1, ..., cn], index_col=cn
      • Reading text files in pieces
        • pd.isnull(df)
        • pd.options.display.max_rows = n, pd.options.display.max_columns = m
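The bullets above do not show the read_csv options that actually read a file piece by piece, nrows and chunksize. The following minimal sketch assumes a tiny in-memory CSV (csv_text is made up) in place of a large file on disk, and aggregates counts across chunks.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk
csv_text = "key,value\na,1\nb,2\na,3\nc,4\na,5\n"

head = pd.read_csv(io.StringIO(csv_text), nrows=2)   # read only the first 2 data rows

totals = {}
# With chunksize, read_csv yields DataFrames of up to 2 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    for key, count in chunk["key"].value_counts().items():
        totals[key] = totals.get(key, 0) + int(count)
```

Only one chunk is held in memory at a time, so the same loop works for files that do not fit in memory.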
      • Writing data to text format
        • df.to_csv(path)
      • Working with delimited formats
        import csv

        # The with statement closes each file automatically, so no explicit close() is needed
        with open("file_path", newline="") as csvfile:
            reader = csv.reader(csvfile)
            ls = [row for row in reader]
        header, contents = ls[0], ls[1:]

        with open("output", "w", newline="") as output:
            writer = csv.writer(output)
            for row in contents:
                writer.writerow(row)
      • JSON data
        • Convert a JSON string to a Python object: obj = json.loads(json_string)
        • Convert a Python object to a JSON string: json_string = json.dumps(obj) (do not name the result json, it would shadow the module)
        • Read json file into pandas dataframe: df = pd.read_json(file)
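A round trip through json.loads and json.dumps, as a minimal sketch. The JSON text is made up for illustration; note how JSON null becomes Python None and a JSON array becomes a list.

```python
import json

# Hypothetical JSON text
json_string = '{"name": "Wes", "pets": null, "siblings": [{"name": "Scott", "age": 30}]}'

obj = json.loads(json_string)    # JSON text -> Python dict, list, str, int, None
round_trip = json.dumps(obj)     # Python object -> JSON text
```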
      • XML and HTML
        • Read html into a list of dataframes: pd.read_html(file)
    • Binary data formats Ch 6.2
      • One of the easiest ways to store data efficiently in binary format is Python's built-in pickle serialization. All pandas objects have a to_pickle method that writes the data to disk in pickle format
      • Pickle is recommended only as a short-term storage format, because it is hard to guarantee that the format will be stable over time
      • Pandas supports two more binary data formats: HDF5 and MessagePack (MessagePack support was removed in pandas 1.0)
        • Open an HDF5 store: store = pd.HDFStore("xxx.h5")
        • HDFStore supports two storage schemas, format='fixed'/'table'
      • Reading excel file
        • Create an ExcelFile object and then read a sheet from it
          import pandas as pd

          xlsx = pd.ExcelFile("xxx.xlsx")
          df = pd.read_excel(xlsx, "Sheet1")
        • Put it all together: df = pd.read_excel("xxx.xlsx", "Sheet1")
    • Interacting with web APIs Ch 6.3
      • requests package
        import pandas as pd
        import requests

        url = "xxx"
        response = requests.get(url)
        data = response.json() # the JSON response parsed into Python objects (dicts and lists)
        df = pd.DataFrame(data) # optionally pass index= and columns= to choose labels and fields
    • Interacting with databases Ch 6.4
      • sqlite3
        import sqlite3
        import pandas as pd

        # Template query: substitute a real table name and the value to compare c1 against
        query = """
        SELECT * FROM table
        WHERE c1 =
        ORDER BY c2
        LIMIT 5;
        """
        conn = sqlite3.connect("xxx.sqlite")
        cursor = conn.execute(query)
        rows = cursor.fetchall()
        df = pd.DataFrame(rows, columns = [x[0] for x in cursor.description])
  4. Chapter 7 Data cleaning and preparation
    • Handling missing data Ch 7.1
      • All of the descriptive statistics on pandas objects exclude missing data by default
      • For numeric data, missing data is represented by the floating-point value NaN. Use data.isnull() to check for missing data
      • Filtering out missing data: use data.dropna(); for a series this is equivalent to data[data.notnull()]. Arguments: axis=0/1, how='any'/'all' (drop if any value is NaN, or only if all values are NaN)
      • Fill missing values
        • Fill all missing values with one value: data.fillna(val)
        • Fill missing values with different values for each column: data.fillna({"c1": v1, "cn": vn})
        • Fill missing values in place: data.fillna(val, inplace=True)
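The dropna and fillna options above, sketched on a small made-up frame in which one row is partly missing and one row is entirely missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": [1.0, np.nan, 3.0],
                   "c2": [np.nan, np.nan, 6.0]})

any_nan = df.dropna()                  # drop rows with at least one NaN (how='any')
all_nan = df.dropna(how="all")         # drop only rows where every value is NaN
filled = df.fillna({"c1": 0.0, "c2": -1.0})   # a different fill value per column
col_means = df.mean()                  # NaN is skipped by default (skipna=True)
```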
    • Data transformation Ch 7.2
      • Removing duplicates
        • data.duplicated(): return a boolean series indicating whether each value/row duplicates an earlier one
        • data.drop_duplicates(): drop duplicated values/rows, use keep='first'/'last'/False to specify which duplicates to keep
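A short sketch of the duplicate-handling calls above on a made-up frame. It also uses the subset argument of drop_duplicates, which is not mentioned in the notes, to compare only selected columns.

```python
import pandas as pd

df = pd.DataFrame({"k1": ["one", "two", "one", "two"],
                   "k2": [1, 1, 1, 2]})

flags = df.duplicated()                       # True for rows already seen earlier
first_kept = df.drop_duplicates()             # keep the first copy of each row
last_kept = df.drop_duplicates(keep="last")   # keep the last copy instead
by_k1 = df.drop_duplicates(subset=["k1"])     # compare only column k1
```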
      • Transforming data using a function or mapping
        import pandas as pd

        df = pd.DataFrame({'index': [0, 1, 3]}) # sample data with an integer column named index
        mapping = {0: 'a', 1: 'b', 3: 'c'}
        df['char_index'] = df['index'].map(mapping) # add a new column, with the integers mapped to characters
      • Replacing values:
        • data.replace(v1, v2): replace v1 with v2
        • data.replace({v1: v2, v3: v4}): replace v1 with v2, v3 with v4
        • data.replace([v1, v3], [v2, v4]): replace v1 with v2, v3 with v4
      • Renaming axis indices
        • Rename row or column indices
          import pandas as pd

          trans1 = lambda x: x.upper() # applied to each label, which is a plain str
          trans2 = lambda x: x.title() # Capitalize the first letter of each word
          df.index = df.index.map(trans1) # map returns a new Index, so assign it back
          df.columns = df.columns.map(trans2)
        • Rename both index and column names, returning a new dataframe: df.rename(index=trans1, columns=trans2)
      • Detecting and filtering outliers
        • data.describe(): calculate count, mean, std, min, max, and the 25%, 50%, and 75% quantiles of each numeric column
      • Permutation and random sampling
        • You can apply permutation on values in a series or rows of a dataframe with the iloc or the take function
          import numpy as np
          import pandas as pd

          df = pd.DataFrame(np.arange(15).reshape(5,3))
          perm = np.random.permutation(len(df))
          df.iloc[perm] # permute the rows
          df.take(perm) # permute the rows, equivalent to the above iloc indexing
        • Take a random sample of rows: df.sample(n=n, replace=True/False), default replace is False
      • Computing indicator/dummy variables
        • Convert a categorical variable into dummy/indicator matrix
          import pandas as pd
          import numpy as np

          ls = ['a', 'a', 'b', 'b', 'c']
          ser = pd.Series(ls, index=np.arange(1, len(ls) + 1))
          pd.get_dummies(ser, dtype=int) # dtype=int gives 0/1, recent pandas versions default to True/False
          """
             a  b  c
          1  1  0  0
          2  1  0  0
          3  0  1  0
          4  0  1  0
          5  0  0  1
          """
        • Join a dataframe with a series or another dataframe using df.join
          • The series must be named
          • If the dataframes have columns with the same name, use lsuffix and rsuffix to specify the suffix for the left and right dataframe
            import pandas as pd

            df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                               'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
            other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                                  'B': ['B0', 'B1', 'B2']})
            df.join(other, lsuffix='_caller', rsuffix='_other')
        • Use np.random.seed(n) to set seed for random sampling
    • String manipulation Ch 7.3
      • String object methods
        • str.split(sep), str.strip(), str.rstrip(), str.lstrip(), sep.join(ls)
        • char in str, str.index(char) (ValueError if not found), str.find(char) (return -1 if not found); Python strings have no contains method, use char in str (pandas offers Series.str.contains)
        • str.count(char), str.replace(old, new)
        • str.endswith(pattern), str.startswith(pattern)
        • str.lower(), str.upper()
      • Regular expression
        • re package
        • Use regex = re.compile(pattern, flags=re.IGNORECASE) to get a reusable regex object
        • regex.findall(str), regex.search(str), regex.match(str)
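The three regex methods above differ in what they return. The following is a minimal sketch using a simplified e-mail pattern; the pattern and the text are for illustration only and are not a complete e-mail validator.

```python
import re

text = "Dave dave@google.com\nSteve steve@gmail.com"
# Simplified e-mail pattern for illustration only
regex = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", flags=re.IGNORECASE)

all_found = regex.findall(text)   # every non-overlapping match, as a list of strings
first = regex.search(text)        # match object for the first match anywhere, or None
at_start = regex.match(text)      # matches only at the very start of the string
```

Here at_start is None because the text begins with "Dave ", not with an address.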
  5. Chapter 8 Data wrangling: join, combine, and reshape
    • Hierarchical indexing
      import numpy as np
      import pandas as pd

      data = pd.Series(np.random.randn(9),
                       index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                              [1, 2, 3, 1, 3, 1, 2, 2, 3]])
      df = data.unstack() # Convert a series to a dataframe, the inner index level becomes the columns
      df.index # ['a', 'b', 'c', 'd']
      df.columns # [1, 2, 3]
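Besides unstack, a hierarchically indexed Series supports partial selection on either level. This is a small sketch with made-up values, assuming a two-level index like the one above.

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4],
                 index=[["a", "a", "b", "b"], [1, 2, 1, 3]])

outer = data["a"]          # partial indexing on the outer level
inner = data.loc[:, 1]     # select inner label 1 across all outer labels
wide = data.unstack()      # inner level becomes the columns, gaps become NaN
```

The combinations ("a", 3) and ("b", 2) do not exist in data, so the corresponding cells of wide are NaN.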

Python testing with pytest

  1. Chapter 1 Getting started with pytest
    • pytest *_test.py: run all tests in the matching file(s)
    • pytest -v: -v flag controls the verbosity of pytest output in various aspects: test session progress, assertion details when tests fail, fixtures details with --fixtures
    • pytest -r: show extra test summary info as specified by chars
    • Naming rules
      • Test files should be named test_*.py or *_test.py
      • Test methods and functions should be named test_*
      • Test classes should be named Test*
    • Possible outcomes of a test function
      • Passed(.): the test ran successfully
      • Failed(F): the test did not run successfully
      • Skipped(s): the test was skipped
      • xfail(x): the test was not supposed to pass, ran, and failed
      • XPASS(X): the test was not supposed to pass, ran, and passed
      • ERROR(E): an exception happened outside of the test function
    • Running only one test
      • Run test called test_inc in test_math.py: pytest -v test_math.py::test_inc
    • Using options
      • Check for pytest options: pytest --help
      • pytest -m: only run tests matching given mark expression
        • Example
          import pytest

          def inc_one(x):
              return x + 1

          def dec_one(x):
              return x - 1

          # give this test a mark called simple
          @pytest.mark.simple
          def test_inc():
              assert inc_one(1) == 2

          def test_dec():
              assert dec_one(2) == 1

          pytest -v # This runs both tests
          pytest -v -m simple # This runs only the test marked simple
      • collect-only: The --collect-only option shows you which tests will be run with the given options and configuration
      • -m markexpr: Markers are one of the best ways to mark a subset of your test functions so that they can be run together. As an example, one way to run test_replace() and test_member_access(), even though they are in separate files, is to mark them
  2. Chapter 2 Writing test functions
    • Using assert statements: pytest uses the plain assert statement, so unittest helpers such as assertTrue(expr) and assertEqual(x, y) become assert expr and assert x == y
    • Expecting exceptions:
      import pytest

      def my_add(x: int, y: int) -> int:
          return x + y

      def test_add():
          with pytest.raises(TypeError) as excinfo:
              my_add(1, '2')
          exception_msg = excinfo.value.args[0]
          assert exception_msg == "unsupported operand type(s) for +: 'int' and 'str'"
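Comparing the full exception message is brittle, because the wording can change between Python versions. A hedged alternative is the match argument of pytest.raises, which checks the message with a regular expression search. The function name test_add_raises below is hypothetical.

```python
import pytest

def my_add(x, y):
    return x + y

def test_add_raises():
    # match= is searched as a regular expression in the string form of the exception
    with pytest.raises(TypeError, match="unsupported operand") as excinfo:
        my_add(1, "2")
    # excinfo is usable after the with block has finished
    assert excinfo.type is TypeError
```

If no exception is raised inside the with block, or the message does not match, the test fails.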
    • Marking test functions
      import pytest

      def inc(x):
          return x + 1

      def dec(x):
          return x - 1

      @pytest.mark.inc
      def test_inc():
          assert inc(1) == 2

      @pytest.mark.dec
      def test_dec():
          assert dec(2) == 1

      pytest -v # This runs both test_inc and test_dec
      pytest -v -m inc # This runs only test_inc
      pytest -v -m dec # This runs only test_dec
    • Skipping tests
      • We can skip tests that we do not want to run right now, either unconditionally or only when a condition holds
      • Skip a test
        import pytest

        @pytest.mark.skip(reason='not implemented yet')
        def test_func():
            assert True
      • Skip a test with a boolean condition
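        import sys
        import pytest

        # Any boolean works here; this particular check is only an illustration
        condition = sys.version_info < (3, 8)

        @pytest.mark.skipif(condition, reason='not supported until condition is met')
        def test_func():
            assert True
      • Marking tests as expected to fail
        import sys
        import pytest

        # Any boolean works here; this particular check is only an illustration
        condition = sys.platform == 'win32'

        # The test still runs; a failure is reported as xfailed instead of failed
        @pytest.mark.xfail(condition, reason='not supported unless condition is met')
        def test_func():
            assert True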
      • Running a subset of tests
        pytest -v .                              # This runs all tests in the current directory
        pytest -v test_raise.py                  # This runs all tests in a file
        pytest -v test_raise.py::test_raise      # This runs a single test function in a file
        pytest -v test_raise.py::TestRaise       # This runs all tests in a class inside a file
        pytest -v -k '_raise and not delete'     # This runs all tests in the current directory whose name contains `_raise` but not `delete`
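To make the selection commands above concrete, here is a hypothetical test_raise.py. The function apply_raise and every test name in it are invented for illustration; only the node ID and -k syntax come from the notes.

```python
# test_raise.py (hypothetical file; all names are assumptions for illustration)

def apply_raise(salary, percent):
    """Return the salary after a percentage raise."""
    return salary * (100 + percent) / 100


def test_raise():
    # Selected on its own by: pytest -v test_raise.py::test_raise
    assert apply_raise(1000, 10) == 1100


def test_delete_raise():
    # The name contains "delete", so -k '_raise and not delete' leaves it out
    assert apply_raise(1000, 0) == 1000


class TestRaise:
    # Selected as a group by: pytest -v test_raise.py::TestRaise
    def test_negative_raise(self):
        assert apply_raise(1000, -10) == 900
```

Under these assumptions, pytest -v -k '_raise and not delete' would be expected to keep test_raise and test_negative_raise and deselect test_delete_raise.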
  3. Chapter 3 Pytest fixtures
    • Fixtures are functions that pytest runs before the test functions that request them
    • Fixture functions can do whatever setup you need: get data, set up a database connection, share data among multiple test functions, etc.
    • Example
      import sqlite3

      import pytest

      @pytest.fixture
      def db_connect():
          # Create the table if needed, insert one row, and return all rows
          conn = sqlite3.connect('test.db')
          cursor = conn.cursor()
          cursor.execute("create table if not exists test_table(name text);")
          cursor.execute("insert into test_table(name) values('test_name');")
          result = cursor.execute("select * from test_table;")
          res = []
          for row in result:
              res.append(row)
          conn.close()
          return res

      # pytest passes the fixture's return value in as the db_connect argument
      def test_connection(db_connect):
          assert len(db_connect) > 0
    • Sharing fixtures among multiple tests
      • Use a conftest.py file to define fixtures that are shared among multiple test files
      • You can also add a scope parameter to the fixture decorator, for example @pytest.fixture(scope='module'), to control how often the fixture is created. The scope can be function (the default), class, module, package, or session
      • Example
        # content of conftest.py
        import pytest
        import smtplib


        @pytest.fixture(scope="module")
        def smtp_connection():
            return smtplib.SMTP("smtp.gmail.com", 587, timeout=5)

        # content of test_module.py
        def test_ehlo(smtp_connection):
            response, msg = smtp_connection.ehlo()
            assert response == 250
            assert b"smtp.gmail.com" in msg
            assert 0 # for demo purposes


        def test_noop(smtp_connection):
            response, msg = smtp_connection.noop()
            assert response == 250
            assert 0 # for demo purposes
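The db_connect fixture earlier in this chapter closes its connection before the test runs, so the test only sees the rows that were returned. A common alternative is a fixture that yields: the code before yield is setup and the code after it is teardown. The sketch below assumes an in-memory SQLite database; make_test_db, db_conn, and test_row_is_present are hypothetical names chosen for illustration.

```python
import sqlite3

import pytest


def make_test_db():
    """Create an in-memory database holding one row (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table test_table(name text)")
    conn.execute("insert into test_table(name) values ('test_name')")
    return conn


@pytest.fixture
def db_conn():
    conn = make_test_db()
    yield conn    # the test body runs while the fixture is paused here
    conn.close()  # teardown: runs after the test finishes, pass or fail


def test_row_is_present(db_conn):
    rows = db_conn.execute("select name from test_table").fetchall()
    assert rows == [("test_name",)]
```

Keeping the setup in a plain helper is a design choice for this sketch: it lets the setup be reused or checked outside pytest, while the fixture stays a thin wrapper that handles cleanup.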

Lecture 01 2022/01/19

Introduction, based on the TLCL book

Lecture 02 2022/01/20

Navigation and Redirection, based on the TLCL book

Lecture 03 2022/01/24

Argument Expansion, based on the TLCL book

Lecture 04 2022/01/26

Permissions and Processes, based on the TLCL book

Lecture 05 2022/01/27

Grep and Regular Expressions, based on the TLCL book

Lecture 06 2022/01/31

Lecture 07 2022/02/02

Lecture 08 2022/02/03

Lecture 09 2022/02/07

Lecture 10 2022/02/08

  1. Change the default browser by which jupyter notebook is opened
    • Run jupyter notebook --generate-config at the command prompt; this generates a configuration file at ~/.jupyter/jupyter_notebook_config.py
    • Set this property in that configuration file: c.NotebookApp.browser = u'C:/Home/AppData/Local/Google/Chrome/Application/chrome.exe %s'. On Windows, use \\ as the path separator and wrap the browser path in "" inside the outer '', so your configuration should look something like this: c.NotebookApp.browser = '"C:\\Home\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe" %s'
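The quoting rules above are easier to see by checking what the string actually contains. This sketch assumes the setting is eventually split with shell-like rules, as Python's standard webbrowser module does for command strings containing %s; that Jupyter hands the value over this way is an assumption, not something stated in these notes. The short unquoted path is shortened for illustration.

```python
import shlex

# The setting as written in jupyter_notebook_config.py (path taken from the notes)
escaped = '"C:\\Home\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe" %s'

# A raw string spells the same value without doubling the backslashes
raw = r'"C:\Home\AppData\Local\Google\Chrome\Application\chrome.exe" %s'
same_value = escaped == raw

# Shell-style splitting keeps the backslashes only inside the double quotes
quoted_parts = shlex.split(escaped)
unquoted_parts = shlex.split('C:\\Home\\chrome.exe %s')

print(same_value)
print(quoted_parts)
print(unquoted_parts)
```

Printing the three values shows why the inner double quotes matter: without them the splitter treats each backslash as an escape character and drops it from the path.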

Lecture 15 2022/02/28

Lecture 21 2022/03/14

  1. pytest
    • Install pytest:
      • Install: pip install pytest
      • Check pytest version: pytest --version
      • Simplest way to run tests: pytest. This command runs all files of the form test_*.py or *_test.py in the current directory and its subdirectories
    • Simple test example
      # inside test_simple.py
      def inc_one(x):
          return x + 1

      def test_inc_one():
          assert inc_one(1) == 2
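The file-name rule mentioned above (test_*.py or *_test.py) can be mimicked with a few lines of standard-library code. This is only a sketch of the naming patterns, not pytest's actual collection logic, which is configurable and also looks at function and class names; looks_like_test_file is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# The two default file-name patterns described in the notes
PATTERNS = ("test_*.py", "*_test.py")


def looks_like_test_file(filename):
    """Return True if filename matches one of the default patterns."""
    return any(fnmatchcase(filename, pattern) for pattern in PATTERNS)


for name in ("test_simple.py", "simple_test.py", "simple.py", "testsimple.py"):
    print(name, looks_like_test_file(name))
```

Only the first two names match, which is why a file called simple.py would not be picked up by a bare pytest run even if it contains test functions.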

Lecture 22 2022/03/16