The Linux Command Line
- Chapter 1: What is the shell
  - The shell is a program that takes keyboard commands and passes them to the operating system to carry out
  - bash is a shell program from the GNU project; the name is an acronym for Bourne Again Shell
  - We need another program called a terminal emulator to interact with the shell
  - Some simple commands
    - `date`: display the current time and date
    - `cal`: display a calendar of the current month
    - `df`: display the current amount of free space on the disk
    - `free`: display the amount of free memory
    - `exit`: exit a terminal session; you can also use `Ctrl + D`
- Chapter 2: Navigation
  - Some commands:
    - `pwd`: print the name of the current working directory
    - `ls`: list directory contents
    - `cd`: change directory
  - A Unix-like OS organizes files in a hierarchical directory structure. The first directory is the root directory.
  - Windows has a separate file system tree for each storage device; Unix-like systems always have a single file system tree
  - Absolute pathnames start with `/`; relative pathnames start from the working directory. `./` represents the current directory, and `../` represents the parent directory of the current directory
  - Filenames:
    - File names that begin with a period character are hidden; use `ls -a` to display hidden files or directories
    - File names are case sensitive
    - Linux does not rely on file extensions to determine a file's type
    - Do not use spaces in file names
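The absolute and relative pathname rules in Chapter 2 can be sketched in Python. This is a minimal illustration, not how the shell implements it: the helper name `resolve` and the working directory `/home/me` are made up for the example, and `posixpath.normpath` only rewrites the text of the path without looking at the file system (so symbolic links are not considered).

```python
import posixpath


def resolve(working_dir, pathname):
    """Turn a pathname into a normalized absolute path (hypothetical helper)."""
    if posixpath.isabs(pathname):
        # Absolute pathnames start at the root directory "/".
        return posixpath.normpath(pathname)
    # Relative pathnames are interpreted from the working directory.
    return posixpath.normpath(posixpath.join(working_dir, pathname))


print(resolve("/home/me", "/usr/bin"))   # /usr/bin
print(resolve("/home/me", "./notes"))    # /home/me/notes
print(resolve("/home/me", "../other"))   # /home/other
```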
- Chapter 3: Exploring the System
  - Some commands:
    - `ls`: list directory contents
    - `file`: determine file type
    - `less`: view file contents
  - `ls` options
    - `ls -a`: display all files
    - `ls -A`: display almost all files; do not list `.` and `..`
    - `ls -i`: display the inode number of each file
    - `ls -l`: display in long listing format
    - `ls -r`: reverse the order of the listing
    - `ls -t`: sort contents by modification time, newest first
    - `ls -S`: sort contents by file size, largest first
    - `ls -R`: recursive listing
    - `ls -d`: list the directory itself, not its contents
  - Permissions
    - Example: `-rwxr--r--`
    - The first character is the file type: `-` indicates a regular file, `d` a directory, and `l` a link
    - The next three characters are the permissions for the owner, the following three are the permissions for the group, and the last three are the permissions for everyone else
    - `r`: open and read
    - `w`: write or truncate; this attribute does not allow the file itself to be renamed or deleted
    - `x`: treat as a program and execute
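To make the layout of a mode string such as `-rwxr--r--` concrete, here is a small Python sketch that splits it into the four fields described in the Permissions notes. The function name `describe` and the `FILE_TYPES` table are hypothetical names used only for this illustration, and only the three file types mentioned in the notes are covered.

```python
# File type characters mentioned in the notes (not an exhaustive list).
FILE_TYPES = {"-": "regular file", "d": "directory", "l": "symbolic link"}


def describe(mode):
    """Split a 10-character `ls -l` mode string into type/owner/group/others."""
    if len(mode) != 10:
        raise ValueError("expected a 10-character mode string")
    return {
        "type": FILE_TYPES.get(mode[0], "other"),
        "owner": mode[1:4],
        "group": mode[4:7],
        "others": mode[7:10],
    }


print(describe("-rwxr--r--"))
# {'type': 'regular file', 'owner': 'rwx', 'group': 'r--', 'others': 'r--'}
```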
  - Some directories in a Linux system
    - `/`: the root directory
    - `/bin`: contains programs that must be present for the system to boot and run
    - `/boot`: contains the Linux kernel and the boot loader
    - `/dev`: contains a list of devices
    - `/etc`: contains all of the system-wide configuration files
    - `/home`: in general, each user is given a directory in `/home`
    - `/lib`: contains shared library files used by the core system programs
    - `/mnt`: contains mount points for removable devices
    - `/opt`: used to install optional software
    - `/tmp`: intended for the storage of temporary, transient files
    - `/usr`: contains the programs and support files used by regular users
  - Links
    - Soft link / symlink / symbolic link: a file that contains a reference to another file or directory in the form of a pathname
    - Hard link: an additional name (directory entry) for the same file; it is not a copy of the file
- Chapter 4 Manipulating Files and Directories
  - Some commands:
    - `cp`: copy files and directories
    - `mv`: move/rename files and directories
    - `mkdir`: create directories
    - `rm`: remove files and directories
    - `ln`: create hard and symbolic links
  - Wildcards
    - `*`: matches any characters
    - `?`: matches any single character
    - `[characters]`: matches any character that is a member of the set characters
    - `[!characters]`: matches any character that is not a member of the set characters
    - `[[:class:]]`: matches any character that is a member of the specified class
      - `[:alnum:]`: matches any alphanumeric character
      - `[:alpha:]`: matches any alphabetic character
      - `[:digit:]`: matches any numeral
      - `[:lower:]`: matches any lowercase letter
      - `[:upper:]`: matches any uppercase letter
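The wildcard rules above can be tried out without touching the file system. The sketch below uses Python's `fnmatch` module, which understands `*`, `?`, `[characters]`, and `[!characters]` but, as far as I know, not the POSIX `[[:class:]]` forms, so it is only an approximation of what the shell does. The file names and the helper `glob_names` are invented for the example.

```python
from fnmatch import fnmatchcase

# Made-up file names used only for this illustration.
NAMES = ["notes.txt", "data1.csv", "data2.csv", "Data3.csv", "readme"]


def glob_names(pattern, names=NAMES):
    """Return the names matching a shell-style wildcard (case sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(glob_names("*.csv"))      # ['data1.csv', 'data2.csv', 'Data3.csv']
print(glob_names("data?.csv"))  # ['data1.csv', 'data2.csv']
print(glob_names("[dD]ata*"))   # ['data1.csv', 'data2.csv', 'Data3.csv']
print(glob_names("[!d]*"))      # ['notes.txt', 'Data3.csv', 'readme']
```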
  - `mkdir dir1 ... dirn`: make n directories
  - `cp item1 ... itemn dir`: copy item1 through itemn into dir
    - `cp -a`: copy the files and directories and all of their attributes
    - `cp -i`: before overwriting an existing file, prompt the user for confirmation
    - `cp -r`: recursively copy directories and their contents
    - `cp -u`: only copy files that either don't exist or are newer than the existing corresponding files in the destination directory
    - `cp x y`:
      - If both x and y are files, overwrite y with x; if y does not exist, create y as a copy of x
      - If x is a file and y is a directory, copy x into y
      - If x and y are both directories, use `cp -r x y`; otherwise there will be an error
  - `mv item1 item2`: move/rename item1 to item2; `mv item1 ... itemn directory`: move item1 through itemn into directory
    - `mv -i`: prompt the user for confirmation before overwriting an existing file
    - `mv -u`: only move files that either don't exist or are newer than the existing corresponding files in the destination directory
    - `mv -v`: display informative messages as the move is performed
    - `mv x y`:
      - If both x and y are files, overwrite y with x; if y does not exist, it is created from x
      - If x is a file and y is a directory, move x into y; y must exist
      - If x and y are both directories: if y does not exist, create y, move the contents of x into y, then delete the empty x; if y does exist, move x and its contents into y
  - `rm item1 ... itemn`: remove item1 through itemn
    - `rm -i`: prompt the user for confirmation before removing an existing file; otherwise the file is deleted silently
    - `rm -r`: recursively delete directories and their contents; this option must be specified to delete a directory
    - `rm -f`: delete the targets, ignore nonexistent files, and do not prompt
    - `rm -v`: display informative messages as the deletion is performed
  - `ln file link`: create a hard link to a file; `ln -s item link`: create a symbolic link to a file or directory
    - Hard links
      - A hard link cannot reference a file outside its own file system; this means a link cannot reference a file that is not on the same disk partition as the link itself
      - A hard link may not reference a directory
      - A hard link and the original file share the same inode; deleting the hard link does not affect the original file, and the hard link still works if you delete the original file
    - Symbolic links
      - Symbolic links were created to overcome the limitations of hard links
      - They work by creating a special type of file that contains a text pointer to the referenced file or directory (similar to a shortcut in Windows)
      - A symbolic link and the original file have different inodes; the soft link will not work if the original file is moved or deleted, but deleting the soft link does not affect the original file
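The difference between hard and symbolic links described in Chapter 4 can be observed directly. The following is a minimal sketch, assuming a POSIX system whose temporary directory supports both link types; the function name `link_demo` and the file names are invented for the illustration. `os.link` and `os.symlink` play the roles of `ln` and `ln -s`.

```python
import os
import tempfile


def link_demo():
    """Create a hard link and a symlink, delete the original, report what survives."""
    results = {}
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")
        with open(original, "w") as f:
            f.write("hello\n")

        os.link(original, hard)      # like: ln original.txt hard.txt
        os.symlink(original, soft)   # like: ln -s original.txt soft.txt

        # A hard link shares the inode; the symlink is a separate file.
        results["hard_same_inode"] = os.stat(original).st_ino == os.stat(hard).st_ino
        results["soft_same_inode"] = os.lstat(soft).st_ino == os.stat(original).st_ino

        os.remove(original)
        with open(hard) as f:
            results["hard_content_after_delete"] = f.read()
        # os.path.exists follows the symlink, so a dangling link reports False.
        results["soft_works_after_delete"] = os.path.exists(soft)
    return results


print(link_demo())
```

On a typical Linux system the hard link still returns the file contents after the original name is removed, while the symbolic link is left dangling.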
- Chapter 5 Working with Commands
  - Some commands:
    - `type`: indicate how a command name is interpreted
    - `which`: display which executable program will be executed
    - `help`: get help for shell builtins
    - `man`: display a command's manual page
    - `apropos`: display a list of appropriate commands
    - `info`: display a command's info entry
    - `whatis`: display one-line manual page descriptions
    - `alias`: create an alias for a command
  - Command types
    - An executable program: programs compiled into binaries
    - A command built into the shell itself: shell builtins
    - A shell function: shell scripts incorporated into the environment
    - An alias: user-defined commands, built from other commands
  - `type command`: display the type of the command
  - `which command`: only works for executable programs, not for builtins or aliases
  - `help command`: help is available for each of the shell builtins; it shows the documentation of a command
  - `command --help`: display usage information
  - `man command`: display the manual page
  - `apropos command`: display appropriate commands
  - `whatis command`: display one-line manual page descriptions
  - `info command`: display a program's info entry
  - `alias`:
    - Use `;` to separate commands on one line
    - Example: `alias foo='cd /mnt; ls -lrt;'`
    - Remove an alias: `unalias foo`
    - Show all aliases defined in the environment: `alias`
- Chapter 6 Redirection
  - Some commands:
    - `cat`: concatenate files
    - `sort`: sort lines of text
    - `uniq`: report or omit repeated lines
    - `grep`: print lines matching a pattern
    - `wc`: print newline, word, and byte counts for each file
    - `head`: output the first part of a file
    - `tail`: output the last part of a file
    - `tee`: read from standard input and write to standard output and files
  - Standard input (`stdin`, file descriptor 0), standard output (`stdout`, file descriptor 1), and standard error (`stderr`, file descriptor 2); by default, stdout and stderr are linked to the screen and stdin is attached to the keyboard
    - Redirecting output
      - Truncate or create a file: `ls -l > file.txt`
      - Append to an existing file: `ls -l >> file.txt`
      - Discard both standard error and standard output:
        - Older form: `command > /dev/null 2>&1`
        - Newer form: `command &> /dev/null`
  - `cat [file...]`: read one or more files and copy them to standard output
    - In most cases, `cat` can be thought of as analogous to the `TYPE` command in DOS
    - If `cat` is given no arguments, it reads from standard input, which is usually the keyboard; type `Ctrl + D` to indicate EOF
    - Read input from the keyboard and save it to a file: `cat > file.txt`, then use `Ctrl + D` to indicate EOF
    - Read input from a file: `cat < file.txt`
  - `|` pipelines
    - Pipeline feature of the shell: with the pipe operator `|` (vertical bar), the standard output of one command can be piped into the standard input of another
    - Syntax: `command1 | command2`
    - Filters:
      - Pipelines are often used to perform complex operations on data; the commands used this way are frequently referred to as filters
      - Filters take input, change it somehow, and then output it
  - `sort`: write the sorted concatenation of all files to standard output
    - `sort -f`: ignore case
    - `sort -r`: reverse order
    - `sort -R`: random sort
    - `sort -u`: only output unique results
  - `uniq`: accepts a sorted list of data and removes any duplicates from the list
    - `uniq` is often used with `sort`
    - `uniq -d`: only print duplicate lines, one for each group
    - `uniq -c`: prefix lines with the number of occurrences
    - `sort -u xxx` is equivalent to `sort xxx | uniq`
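Why `uniq` expects sorted input, and why `sort -u` behaves like `sort | uniq`, can be seen in a few lines of Python. This is only an illustrative sketch of the filter idea, not how the GNU tools are implemented; the function names `uniq` and `uniq_c` and the sample lines are invented for the example.

```python
from itertools import groupby


def uniq(lines):
    """Collapse runs of adjacent identical lines, as `uniq` does."""
    return [key for key, _ in groupby(lines)]


def uniq_c(lines):
    """Like `uniq -c`: pair each collapsed line with its run length."""
    return [(len(list(group)), key) for key, group in groupby(lines)]


lines = ["pear", "apple", "pear", "fig", "apple", "pear"]

print(uniq(lines))            # unchanged: no duplicates are adjacent
print(uniq(sorted(lines)))    # ['apple', 'fig', 'pear']   (sort | uniq)
print(sorted(set(lines)))     # ['apple', 'fig', 'pear']   (sort -u)
print(uniq_c(sorted(lines)))  # [(2, 'apple'), (1, 'fig'), (3, 'pear')]
```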
  - `wc`: display the number of lines, words, and bytes contained in files
    - `wc -l`: display the number of lines
    - `wc -w`: display the number of words
    - `wc -c`: display the number of bytes
    - `wc -m`: display the number of characters
  - `grep pattern [file...]`: used to find text patterns
    - When grep encounters the pattern in a file, it prints out the lines containing it
    - Regular expressions are allowed in grep
    - `grep -i`: ignore case during the search
    - `grep -v`: print only those lines that do not match the pattern
  - `head` / `tail`
    - By default, print 10 lines
    - `head -n m`: print the first m lines
    - `tail -n m`: print the last m lines
    - Monitor file changes: `tail -f file.txt`
  - `tee`: reads standard input and copies it to both standard output and to one or more files
    - This is often used as an intermediate step in a pipeline: it saves the intermediate output to a file
- Chapter 7 Seeing the World as the Shell Sees it
  - Some commands:
    - `echo`: display a line of text
  - Pathname expansion
    - It is the mechanism by which wildcards work in the shell
  - Tilde expansion: `echo ~` displays the home directory
  - Arithmetic expansion: `echo $((expression))` displays the result of the arithmetic expression
    - Notice the division here is integer division; for example, `echo $((5/2))` displays 2
    - Expressions can be nested, but you need the `$(())` for each arithmetic expression
  - Brace expansion
    - `echo {a,b}-{1,2}`: display a-1 a-2 b-1 b-2 (there must be no spaces inside the braces)
    - `echo {a..c}`: display a b c
    - `echo {001..6}`: display 001 002 003 004 005 006
    - `echo {{a,b},{1,2}}`: display a b 1 2
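The brace expansion results listed above follow a simple rule: every alternative in one group is combined with every alternative in the next, left to right. The Python sketch below mimics that rule for illustration only. It is not bash's actual algorithm, it does not parse brace syntax, and the helper names `expand` and `padded_range` are hypothetical; `padded_range` handles only ascending numeric ranges.

```python
from itertools import product


def expand(*groups):
    """Combine alternatives left to right, as in {a,b}-{1,2}."""
    return ["".join(parts) for parts in product(*groups)]


def padded_range(start, end):
    """Mimic {001..6}: pad every number to the width of the widest endpoint."""
    width = max(len(start), len(end))
    return [str(n).zfill(width) for n in range(int(start), int(end) + 1)]


print(expand(["a", "b"], ["-"], ["1", "2"]))  # ['a-1', 'a-2', 'b-1', 'b-2']
print(padded_range("001", "6"))  # ['001', '002', '003', '004', '005', '006']
```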
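  - Parameter expansion: `x=1; echo $x` displays 1
  - Quoting
    - `ls -l "hello world"`: double quotes keep the two words together as a single argument
    - `echo this is \$100.0`: the backslash escapes the `$` so that it is printed literally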
- Chapter 9 Permissions
  - Some commands:
    - `id`: display user identity
    - `chmod`: change a file's mode
    - `umask`: set the default file permissions
    - `su`: run a shell as another user
    - `sudo`: execute a command as another user
    - `chown`: change a file's owner
    - `chgrp`: change a file's group ownership
    - `passwd`: change a user's password
  - `id`
    - In the Unix security model, a user has a user ID (`uid`), a group ID (`gid`), and may belong to additional groups (`groups`)
    - User accounts are defined in the `/etc/passwd` file and groups are defined in the `/etc/group` file; `/etc/shadow` contains information about the user's password
  - File attributes
    - File types:
      - `-`: a regular file
      - `d`: a directory
      - `l`: a symbolic link; for a symbolic link, the remaining attributes are always `rwxrwxrwx`, but they are only dummy values, and the real file attributes are those of the file the symbolic link points to
      - `c`: a character special file; this file type refers to a device that handles data as a stream of bytes
      - `b`: a block special file; this file type refers to a device that handles data in blocks
    - Permission attributes:

      | Attribute | Files | Directories |
      | --- | --- | --- |
      | `r` | Allows a file to be opened and read | Allows a directory's contents to be listed if the execute attribute is also set |
      | `w` | Allows a file to be written or truncated, but does not allow files to be renamed or deleted | Allows files within a directory to be created, deleted, and renamed if the execute attribute is also set |
      | `x` | Allows a file to be treated as a program and executed | Allows a directory to be entered |
  - `chmod`
    - Only the file's owner or the superuser can change the mode of a file or directory
    - The mode can use two representations: the octal representation and the symbolic representation

      | Octal | Binary | Symbolic |
      | --- | --- | --- |
      | 0 | 000 | `---` |
      | 1 | 001 | `--x` |
      | 2 | 010 | `-w-` |
      | 3 | 011 | `-wx` |
      | 4 | 100 | `r--` |
      | 5 | 101 | `r-x` |
      | 6 | 110 | `rw-` |
      | 7 | 111 | `rwx` |
    - Syntax
      - `chmod 664 file`: set the file attributes to `-rw-rw-r--`
      - `chmod ug=rw,o=r file`: the same as the above command
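The mapping between the octal and symbolic representations used by `chmod` is just a bit test on each octal digit. A minimal Python sketch follows; the function name `octal_to_symbolic` is made up for this illustration.

```python
def octal_to_symbolic(mode):
    """Convert a permission mode such as 0o664 to a string such as 'rw-rw-r--'."""
    fields = []
    for shift in (6, 3, 0):                # owner, group, others
        digit = (mode >> shift) & 0o7      # one octal digit = three bits
        fields.append("".join(
            letter if digit & (4 >> i) else "-"   # r=4, w=2, x=1
            for i, letter in enumerate("rwx")
        ))
    return "".join(fields)


print(octal_to_symbolic(0o664))  # rw-rw-r--
print(octal_to_symbolic(0o755))  # rwxr-xr-x
```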
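  - `umask`
    - View the current mask setting: `umask`
    - Set the mask value: `umask 0022`
    - Mask value interpretation: `0xyz`, where the leading digit covers the special permissions and is usually 0, x is for the user, y is for the group, and z is for others
    - Octal mask value and the permission that remains (for files, the execute bit is additionally left unset by default):

      | Value | Permission |
      | --- | --- |
      | 0 | `rwx` |
      | 1 | `rw-` |
      | 2 | `r-x` |
      | 3 | `r--` |
      | 4 | `-wx` |
      | 5 | `-w-` |
      | 6 | `--x` |
      | 7 | `---` |
    - Common values:
      - `0022`: 755 for directories, 644 for files
      - `0002`: 775 for directories, 664 for files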
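The common values above follow from a simple rule: the mask bits are removed from a starting mode of 777 for directories and 666 for files. A short Python sketch of that arithmetic, with `apply_umask` as a hypothetical helper name:

```python
def apply_umask(mask):
    """Return the default modes (as octal strings) for new directories and files."""
    directory_mode = 0o777 & ~mask   # directories start from rwxrwxrwx
    file_mode = 0o666 & ~mask        # files start from rw-rw-rw-
    return {"directory": format(directory_mode, "03o"),
            "file": format(file_mode, "03o")}


print(apply_umask(0o022))  # {'directory': '755', 'file': '644'}
print(apply_umask(0o002))  # {'directory': '775', 'file': '664'}
```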
  - Change identities
    - Methods
      - Log out and log back in as the alternate user
      - Use the `su` command
      - Use the `sudo` command
    - `su`: run a shell with substitute user and group IDs
      - `su -l user` (abbreviation: `su - user`); if user is not provided, the substitute user is the superuser (root)
      - Use `exit` to return to the original shell
      - Execute a command as another user: `su -c command`
    - `sudo`: execute a command as another user
      - Allows an ordinary user to execute commands as a different user (usually the superuser) in a controlled way
      - This does not require the password of the superuser; the user enters their own password
      - `sudo -l`: list the allowed commands for the invoking user on the current host
  - `chown`
    - `chown user:group file`: to change both the owner and the group, use `user:group`; to change only the owner, use `user`; to change only the group, use `:group`. The form `user:` changes the owner and sets the group to that user's login group
    - Superuser privilege is required for this command, so use `sudo chown user:group file`
  - `chgrp`
    - In older versions of Unix, `chown` could not change group ownership, and `chgrp` was used instead
  - `passwd [user]`
    - Enter `passwd` to change the password of the current user
- Chapter 10 Processes
  - Some commands
    - `ps`: report a snapshot of current processes
    - `top`: display tasks
    - `jobs`: list active jobs
    - `bg`: place a job in the background
    - `fg`: place a job in the foreground
    - `kill`: send a signal to a process
    - `killall`: kill processes by name
    - `shutdown`: shut down or reboot the system
  - How processes work
    - When a system starts, the kernel launches a program called init; init then runs a series of shell scripts called init scripts, which start all the system services. Many services are implemented as daemon programs, so they run in the background
    - A program that launches other programs is expressed in the process scheme as a parent process producing a child process
    - The kernel maintains information about each process to help keep things organized, including the process ID (PID), the memory assigned to each process, and the process's readiness to resume execution
  - View processes: `ps`
    - Shows PID, TTY (teletype, the controlling terminal for the process), TIME (the amount of CPU time consumed by the process), and CMD (the command executed by the process)
    - Options:
      - `ps x`: show all of the user's processes regardless of what terminal they are controlled by; adds STAT (state, the current status of the process)
        - `R`: running or ready to run
        - `S`: sleeping; it is waiting for an event, such as a keystroke or network packet
        - `D`: uninterruptible sleep; it is waiting for I/O such as a disk drive
        - `T`: stopped
        - `Z`: zombie, a child process that has terminated but has not been cleaned up by its parent
        - `<`: a high-priority process
        - `N`: a low-priority process
      - `ps aux`: gives more information
        - Additional columns: `USER`, `%CPU`, `%MEM`, `VSZ` (virtual memory size), `RSS` (resident set size), `START`
  - View processes dynamically with `top`
    - The display is continuously updated (by default, every 3 seconds)
  - Control processes
    - Interrupt a process: `Ctrl+C`
    - Put a process in the background: `command &`
    - Return a process to the foreground:
      - Find the job number of the process with `jobs`; say the job number is n
      - Bring the process to the foreground: `fg %n`
    - Stop a process: `Ctrl+Z`
      - Ctrl+Z suspends (stops) a process without terminating it. Ctrl+C asks a process to terminate; the program can catch the signal so it can clean up before exiting, or not exit at all
      - A stopped process can be brought to the foreground with `fg` or resumed in the background with `bg`
    - Signals
      - For Ctrl+C, a signal called INT is sent
      - For Ctrl+Z, a signal called TSTP is sent; unlike STOP, the program may choose to ignore TSTP
    - Kill a process: `kill -number PID`
    - Common signals (the numbers for CONT, STOP, and TSTP shown here are the usual Linux values and can differ on other systems)
      - `kill -1`: HUP signal, send the process a hangup signal
      - `kill -2`: INT signal, send the process an interrupt signal
      - `kill -9`: KILL signal, send the process a kill signal
      - `kill -15`: TERM signal, send the process a terminate signal
      - `kill -18`: CONT signal, send the process a continue signal
      - `kill -19`: STOP signal, send the process a stop signal
      - `kill -20`: TSTP signal, send the process a terminal stop signal
      - `kill -3`: QUIT signal, send the process a quit signal
      - `kill -11`: SEGV signal, send the process a segmentation violation signal
      - `kill -28`: WINCH signal, send the process a window change signal
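The signal numbers in the list above can be checked on the local machine. The sketch below assumes a POSIX system (several of these signals do not exist on Windows) and uses Python's `signal` module; `signal_number` is a hypothetical helper name. The values for INT, KILL, and TERM are the same everywhere I am aware of, while CONT, STOP, and TSTP depend on the platform, so no expected output is shown for the loop.

```python
import signal


def signal_number(name):
    """Return the number of a signal given a name such as 'INT' or 'SIGINT'."""
    if not name.startswith("SIG"):
        name = "SIG" + name
    return int(getattr(signal, name))


# Print the local numbering for the signals listed in the notes.
for name in ("HUP", "INT", "QUIT", "KILL", "SEGV",
             "TERM", "CONT", "STOP", "TSTP", "WINCH"):
    print(name, signal_number(name))
```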
  - Shut down the system
    - Function: terminate all the processes on the system in an orderly fashion, then power off the system
    - Commands: `halt`, `poweroff`, `reboot`, `shutdown`
- Chapter 11 The Environment
  - Some commands:
    - `printenv`: print part or all of the environment
    - `set`: set shell options
    - `export`: export environment to subsequently executed programs
    - `alias`: create an alias for a command
  - What is stored in the environment
    - Two types of variables:
      - Shell variables: bits of data set by bash
      - Environment variables: other variables
    - Programmatic data: aliases and shell functions
  - Examine the environment
    - `printenv`: the result is in key=value format
    - `printenv key`: print the value of the key
    - `echo $key`: print the value of the key
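Programs see the environment as a set of key=value pairs, and child processes inherit it, which is what `export` relies on. The Python sketch below illustrates both points. The variable names `NOTES_DEMO` and `NOTES_DEMO_UNSET` and the helpers `lookup` and `child_lookup` are invented for this example; it assumes the Python interpreter can start a child copy of itself.

```python
import os
import subprocess
import sys


def lookup(key, default=None):
    """Return the value of an environment variable, like `printenv key`."""
    return os.environ.get(key, default)


def child_lookup(key):
    """Ask a child process for the variable, to show that it is inherited."""
    code = "import os, sys; print(os.environ.get(sys.argv[1], ''))"
    done = subprocess.run([sys.executable, "-c", code, key],
                          capture_output=True, text=True, check=True)
    return done.stdout.strip()


os.environ["NOTES_DEMO"] = "hello"            # hypothetical variable
print(lookup("NOTES_DEMO"))                   # hello
print(lookup("NOTES_DEMO_UNSET", "<unset>"))  # <unset>
print(child_lookup("NOTES_DEMO"))             # hello
```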
  - How is the environment established
    - When bash starts, it reads a series of configuration scripts called startup files, which define the default environment shared by all users, followed by startup files that define the user's personal environment
    - A login shell session reads
      - `/etc/profile`: global configuration
      - `~/.bash_profile`: user startup file
      - `~/.bash_login`: read if the above one is not found
      - `~/.profile`: read if the above two are not found
    - A non-login shell session reads
      - `/etc/bash.bashrc`: global configuration
      - `~/.bashrc`: user startup file
  - Modify the environment
    - Which file to modify
      - To add directories to your PATH variable, put those changes in the `.bash_profile` / `.profile` file
      - For everything else, put the changes into the `.bashrc` file
    - To edit these files, we use a text editor
      - Graphical editors: `gedit` from GNOME; `kedit`, `kwrite`, `kate` from KDE
      - Text-based editors: `nano`, `vi` (on most Linux systems this is replaced by vim, which is short for "Vi IMproved"), `emacs`
      - Some vim commands
        - `:set number`: display line numbers
        - `:wq`, `:x`, `ZZ`: save and quit
        - `:q!`: quit without saving
        - `gg`: go to the first line
        - `G`: go to the last line
        - `ngg`, `nG`: go to the nth line
        - `0`: jump to the start of the line
        - `$`: jump to the end of the line
        - `^`: jump to the first non-blank character in the line
        - `g_`: jump to the last non-blank character in the line
        - `i`: insert before the cursor
        - `a`: insert after the cursor
        - `I`: insert at the beginning of the line
        - `A`: insert at the end of the line
        - `o`: open a new line below the current line
        - `O`: open a new line above the current line
        - `r`: replace the current character
        - `R`: replace characters until `ESC` is pressed
- Chapter 14 Package Management
  - Introduction
    - The most important determinants of Linux distribution quality are the packaging system and the vitality of the distribution's support community
    - Package management is a method of installing and maintaining software on the system
    - Nowadays we can install packages from the Linux distributor
    - In the early days, people needed to download and compile source code to install software
  - Packaging systems
    - Different distributions use different packaging systems, and generally a package designed for one distribution is not compatible with another distribution
    - Two main packaging technologies
      - `.deb` camp from Debian: Debian, Ubuntu, Linux Mint, Raspbian
      - `.rpm` camp from Red Hat: Fedora, CentOS, Red Hat Enterprise Linux, OpenSUSE
  - How a package system works
    - Virtually all software for a Linux system will be found on the Internet; most will be provided by the distribution vendor in the form of package files, and the rest will be available in source code form that can be installed manually
    - Package files
      - A package file is a compressed collection of files that comprise the software package
      - A package may contain programs and data files, metadata files, and pre- and post-installation scripts that perform configuration tasks
      - Package files are created by the package maintainer
    - Repositories: packages are often hosted in a central repository
    - Dependencies: shared libraries and other packages that a piece of software needs in order to run properly
    - Package management system tools
      - Low-level tools: install and remove package files
      - High-level tools: search metadata and resolve dependencies

      | Distributions | Low-Level Tools | High-Level Tools |
      | --- | --- | --- |
      | Debian style | `dpkg` | `apt`, `apt-get`, `aptitude` |
      | Fedora, Red Hat Enterprise Linux | `rpm` | `yum`, `dnf` |
  - Common package management tasks
    - Find a package in a repository
      - Debian style: `apt-get update; apt-cache search search_string`
      - Red Hat style: `yum search search_string`
    - Install a package from a package file
      - Debian style: `dpkg -i package_file`
      - Red Hat style: `rpm -i package_file`
    - List installed packages
      - Debian style: `dpkg -l`
      - Red Hat style: `rpm -qa`
    - Determine whether a package is installed
      - Debian style: `dpkg -s package_name`
      - Red Hat style: `rpm -q package_name`
    - Display information about a package
      - Debian style: `apt-cache show package_name`
      - Red Hat style: `yum info package_name`
    - Find which package installed a file
      - Debian style: `dpkg -S file_name`
      - Red Hat style: `rpm -qf file_name`
- Chapter 16 Networking
  - Some commands
    - `ping`: send an ICMP ECHO_REQUEST to network hosts
    - `traceroute`: print the route packets trace to a network host
    - `ip`: show/manipulate routing, devices, policy routing, and tunnels
    - `netstat`: print network connections, routing tables, interface statistics, masquerade connections, and multicast memberships
    - `ftp`: Internet file transfer program
    - `wget`: non-interactive network downloader
    - `ssh`: OpenSSH SSH client (remote login program)
  - Examine and monitor a network
    - `ping`
      - Sends a special network packet to a specified host; most devices receiving this packet will reply to it, allowing the network connection to be verified
      - Once started, ping continues to send packets at a specified interval (default is 1 second) until it is interrupted (say, with `Ctrl+C`)
      - After it is interrupted, ping prints performance statistics
    - `traceroute`
      - Lists all the routers that network traffic passes through to get from the local system to a specified host
    - `ip`
      - `ip a`: list the network interfaces and their addresses
    - `netstat`
      - `netstat -i`: display a table of all network interfaces
      - `netstat -e`: display additional information
      - `netstat -r`: display the kernel routing tables
      - `netstat -n`: show numerical addresses instead of trying to determine symbolic host, port, or user names
  - Transport files over a network
    - `ftp`
      - File Transfer Protocol (FTP) was once the most widely used method of downloading files over the Internet
      - `ftp` is used to communicate with FTP servers, machines that contain files that can be uploaded and downloaded over a network
      - FTP is not secure because it sends account names and passwords in cleartext; almost all FTP done over the Internet is done by anonymous FTP servers
      - `ftp servername`: log in to an FTP server
    - `lftp`
      - It works much like the traditional ftp program but has many additional convenience features, including multiple-protocol support, automatic retry on failed downloads, background processes, tab completion of path names, and many more
    - `wget`
      - It is useful for downloading content from both web and FTP sites
      - Single files, multiple files, and even entire sites can be downloaded
      - wget supports recursive downloads, downloading files in the background, and completing a partially downloaded file
  - Secure communication with remote hosts
    - Before `ssh`, there were commands like `rlogin` and `telnet`, but they transmit all communication in cleartext, which is inappropriate in the Internet age
    - Advantages:
      - It authenticates that the remote host is who it says it is, thus preventing so-called man-in-the-middle attacks
      - It encrypts all of the communications between the local and remote hosts
    - SSH consists of two parts
      - An SSH server runs on the remote host, listening for incoming connections, by default on port 22
      - An SSH client is used on the local system to communicate with the remote server
    - Most Linux distributions ship an implementation of SSH called OpenSSH from the OpenBSD project
    - Syntax: `ssh user@hostname`
    - Use an SSH-encrypted tunnel to copy files across the network
      - `scp` (secure copy): `scp from to`
      - `sftp` (secure file transfer): `sftp hostname`
- Chapter 19 Regular Expressions
  - Introduction
    - Regular expressions are symbolic notations used to identify patterns in text
    - We only consider regular expressions described in the POSIX standard
  - `grep`
    - grep is short for "global regular expression print"; it searches text files for text matching a specified regular expression and outputs any line containing a match to standard output
    - Syntax: `grep [options] regex [file...]`
    - grep options
      - `grep -i`: ignore case
      - `grep -v`: invert match, print lines that do not match
      - `grep -c`: print the number of matches instead of the lines themselves
      - `grep -l`: print the name of each file that contains a match instead of the lines themselves
      - `grep -L`: print the name of each file that does not contain any matched lines
      - `grep -n`: prefix each matching line with the line number
      - `grep -h`: for multi-file searches, suppress the output of filenames
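To make the effect of the `-i`, `-v`, `-c`, and `-n` options concrete, here is a small Python imitation of grep over a list of lines. It is a sketch for illustration only: the function `grep` and its keyword arguments are invented here, and Python's `re` syntax is similar to, but not the same as, the POSIX syntax that the real grep uses.

```python
import re


def grep(pattern, lines, ignore_case=False, invert=False, count=False, number=False):
    """Return matching lines; the keyword flags imitate -i, -v, -c and -n."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    hits = [(i, line) for i, line in enumerate(lines, start=1)
            if bool(regex.search(line)) != invert]
    if count:
        return len(hits)
    if number:
        return [f"{i}:{line}" for i, line in hits]
    return [line for _, line in hits]


lines = ["Alpha", "beta", "alphabet", "gamma"]
print(grep("alpha", lines))                    # ['alphabet']
print(grep("alpha", lines, ignore_case=True))  # ['Alpha', 'alphabet']
print(grep("alpha", lines, invert=True))       # ['Alpha', 'beta', 'gamma']
print(grep("a$", lines, count=True))           # 3
print(grep("^b", lines, number=True))          # ['2:beta']
```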
  - Metacharacters and literals
    - We can use literals in regular expressions; they match themselves
    - We can also use metacharacters in regular expressions
      - Metacharacters: `^`, `$`, `.`, `[]`, `{}`, `-`, `?`, `*`, `+`, `()`, `|`, `\`
  - The any character
    - `.` matches any character in a character position
  - Anchors
    - The caret `^` and the dollar sign `$` are treated as anchors in regular expressions
      - `^` matches the beginning of a line
      - `$` matches the end of the line
  - Bracket expressions and character classes
    - A bracket expression matches a single character from a specified set of characters
    - Metacharacters do not work inside bracket expressions, except `^` and `-`
      - `^`: if it appears at the beginning of a bracket expression, the following set of characters must not be present at the given character position
      - `-`: indicates a range; for example, `[A-Z]` matches all uppercase letters
  - POSIX character classes
    - `[:alnum:]`: `[A-Za-z0-9]`
    - `[:word:]`: `[A-Za-z0-9_]`
    - `[:alpha:]`: `[A-Za-z]`
    - `[:blank:]`: space and tab
    - `[:cntrl:]`: the ASCII control codes, ASCII characters 0 through 31 and 127
    - `[:digit:]`: `[0-9]`
    - `[:graph:]`: the visible characters, ASCII characters 33 through 126
    - `[:lower:]`: `[a-z]`
    - `[:upper:]`: `[A-Z]`
    - `[:punct:]`: the punctuation characters; in ASCII, equivalent to `[-!"#$%&'()*+,./:;<=>?@[\\\]_{|}~]`
    - `[:print:]`: the printable characters, all characters in `[:graph:]` plus the space character
    - `[:space:]`: the whitespace characters; in ASCII, equivalent to `[ \t\r\n\v\f]`
    - `[:xdigit:]`: hexadecimal digits; in ASCII, equivalent to `[0-9A-Fa-f]`
  - POSIX basic vs. extended regular expressions
    - BRE (basic regular expressions): the characters `?`, `+`, `{`, `}`, `|`, `(`, and `)` are literals unless escaped with a backslash
    - ERE (extended regular expressions): those characters are metacharacters by default; use `grep -E`
  - Alternation
    - `grep -E 'AAA|BBB|CCC' file`: matches either AAA, or BBB, or CCC in the file
    - To combine alternation with other regular expression elements, put parentheses around the alternation: `grep -E '^(aa|bb|cc)' file` matches either aa, or bb, or cc at the beginning of a line in the file
  - Quantifiers
    - `?`: matches an element zero or one time
    - `*`: matches an element zero or more times
    - `+`: matches an element one or more times
    - `{n}`: matches an element exactly n times
    - `{n,m}`: matches an element at least n times, and no more than m times
    - `{n,}`: matches an element at least n times
    - `{,m}`: matches an element no more than m times
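Alternation and the quantifiers listed above behave the same way in Python's `re` module for simple patterns like these, so they can be explored interactively. Note the assumption: Python's regular expression syntax is its own dialect rather than POSIX ERE, and the helper `matches` and the sample strings are made up for the illustration.

```python
import re


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


# Alternation anchored at the start of the line.
print(matches(r"^(aa|bb|cc)", "bbq"))    # True
print(matches(r"^(aa|bb|cc)", "abba"))   # False

# Quantifiers applied to the element "b".
print(matches(r"^ab?c$", "ac"))          # True   (zero or one b)
print(matches(r"^ab?c$", "abbc"))        # False
print(matches(r"^ab*c$", "abbbc"))       # True   (zero or more b)
print(matches(r"^ab+c$", "ac"))          # False  (at least one b required)
print(matches(r"^ab{2,3}c$", "abbc"))    # True   (two or three b)
print(matches(r"^ab{2,3}c$", "abbbbc"))  # False
```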
  - Some applications
    - `find`: search for files in a directory
      - grep tests whether a line contains a pattern; `find -regex` tests whether the whole pathname exactly matches a pattern
    - `locate`: find files by name
      - `locate --regexp pattern`: use BRE
      - `locate --regex pattern`: use ERE
Pro Git
- Chapter 1
  - Version control
    - Def: a system that records changes to a file or set of files over time so that you can recall specific versions later
    - Local version control system
    - Centralized version control system
    - Distributed version control system
  - Git
    - Most other VCSs store information as a list of file-based changes
    - Git thinks of its data as a series of snapshots of a miniature filesystem
    - Three states of files in Git
      - Modified: the file has been changed but not committed to the database yet
      - Staged: a modified file has been marked to go into the next commit snapshot
      - Committed: the file has already been committed to the database
Python for Data Analysis (this is not a required book for this course)
- Chapter 4 NumPy Basics
- Introduction
- NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python
- Data analysis applications using NumPy
- Fast vectorized array operations
- Common array algorithms like sorting, unique, and set operations
- Efficient descriptive statistics and aggregating/summarizing data
- Data alignment and relational data manipulations for merging and joining together heterogeneous datasets
- Expressing conditional logic as array expressions instead of loops with if branches
- Group-wise data manipulation
- The NumPy ndarray Ch 4.1
- Example 1
1
2
3
4
5
6import numpy as np
data = np.random.randn(2,3) # generate a 2*3 array of random numbers
data = data * 3 # multiply each element by 3
data.shape # (2,3)
data.dtype # dtype('float64') - Example 2
import numpy as np
data = [[1, 2], [3, 4]]
arr = np.array(data)
arr.ndim # 2
arr.shape # (2, 2)
arr.dtype # dtype('int64') on most platforms
- Example 3
import numpy as np
arr1 = np.zeros((2, 3)) # create a 2x3 array of 0
arr2 = np.ones((1, 2)) # create a 1x2 array of 1
- There are different data types
import numpy as np
a1 = np.zeros((2,2))
a1.dtype # dtype('float64')
a2 = a1.astype(np.int32)
a2.dtype # dtype('int32')
a3 = np.ones((2,2), dtype=np.int32)
a3.dtype # dtype('int32')
- Any arithmetic operation between equal-size NumPy arrays applies the operation element-wise
- Indexing and slicing
import numpy as np
x = np.ones(5)
x[2:] = 2 # x is now [1, 1, 2, 2, 2]
y = x[:2]
y[0] = 3 # y is [3, 1], and x is [3, 1, 2, 2, 2]
z = np.array([[1,2],[3,4]])
z[:, :1] # [[1], [3]]
- Boolean indexing
import numpy as np
index = np.arange(3)
data = np.arange(1, 10).reshape((3, 3))
data[index == 2] # [[7, 8, 9]]
cond = index != 2
data[~cond] # [[7, 8, 9]]
data[:, index == 2] # [[3], [6], [9]]
data[data < 5] = 5 # data is now [[5, 5, 5], [5, 5, 6], [7, 8, 9]]
- Transposing arrays and swapping axes
x.T
: transpose of an array
np.dot(x, x.T)
: matrix multiplication
- transpose function
import numpy as np
x = np.arange(1, 7).reshape((2, 3)) # 6 elements are needed to fill a 2x3 array
x.shape # (2, 3)
x = np.transpose(x, (1, 0)) # the second argument is a permutation of the axis numbers range(n)
x.shape # (3, 2)
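The ndarray notes above state that arithmetic between equal-size arrays is element-wise and show that a slice shares memory with its source. A minimal sketch of both points, using made-up arrays (the names `a`, `b`, `view`, and `copied` are only for illustration):

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[10.0, 20.0], [30.0, 40.0]])

# Arithmetic between equal-size arrays is applied element by element.
total = a + b
product = a * b

# A slice is a view: writing through it changes the original array.
x = np.ones(5)
view = x[:2]
view[0] = 3.0

# An explicit copy is independent of the original.
copied = x[:2].copy()
copied[1] = -1.0
```

After this runs, `x[0]` is 3.0 because of the write through `view`, while `x[1]` is still 1.0 because `copied` owns its own data.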
- Universal functions Ch 4.2
- A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays
- Functions:
np.abs(arr), np.sqrt(arr), np.square(arr), np.exp(arr), np.log(arr), np.log2(arr), np.sign(arr), np.ceil(arr), np.floor(arr)
np.modf(arr) (returns two arrays: the fractional parts first, then the integral parts)
np.isnan(arr), np.isfinite(arr), np.isinf(arr)
np.add/subtract/multiply/divide(a1, a2), np.power(a1, a2), np.maximum/minimum(a1, a2), np.mod(a1, a2)
np.greater/less/greater_equal/less_equal/equal/not_equal(a1, a2)
np.logical_and/logical_or/logical_xor(a1, a2)
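A few of the ufuncs listed above, applied to a small made-up array. This is only a sketch of the unary and binary forms. Note the return order of `np.modf`.

```python
import numpy as np

arr = np.array([-1.5, 0.25, 4.0])

absolute = np.abs(arr)          # unary ufunc: one input array
frac, whole = np.modf(arr)      # returns (fractional parts, integral parts)
larger = np.maximum(arr, 0.0)   # binary ufunc: element-wise max against 0
is_pos = np.greater(arr, 0)     # same result as arr > 0
```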
- Array-oriented programming Ch 4.3
- Example 1
import numpy as np
import matplotlib.pyplot as plt
points = np.arange(-5, 5, 0.01) # 0.01 is the step size
xs, ys = np.meshgrid(points, points)
z = np.sqrt(xs ** 2 + ys ** 2)
plt.imshow(z, cmap=plt.cm.gray) # plot z, the array computed above
plt.colorbar()
- Convert conditional logic into array expressions
- Example 1
import numpy as np
xs = np.arange(1, 6)
ys = np.arange(6, 11)
rands = np.random.randn(5)
cond = rands > 0
# The following are equivalent, but the second one is more efficient
res1 = [(x if c else y) for x, y, c in zip(xs, ys, cond)]
res2 = np.where(cond, xs, ys)
- Example 2
import numpy as np
ma = np.arange(1, 17).reshape((4, 4))
rands = np.random.randn(16).reshape((4, 4))
cond = rands > 0
res = np.where(cond, 2, ma) # use 2 where cond is True, otherwise keep the element of ma
- Mathematical and statistical operations
import numpy as np
arr = np.random.randn(16).reshape((4, 4))
arr.mean() # mean of all elements, equivalent to np.mean(arr)
arr.mean(axis=0) # mean of each column, equivalent to np.mean(arr, axis=0)
arr.mean(axis=1) # mean of each row, equivalent to np.mean(arr, axis=1)
arr.cumsum(axis=0) # cumulative sum of each column
arr.cumprod(axis=1) # cumulative product of each row
- Other methods:
sum, mean, std, var, min, max, argmin, argmax, cumsum, cumprod
- Boolean array methods
- Calculate the number of positive values:
(arr > 0).sum()
- Return True if at least one value is True:
arr.any()
- Return True if all values are True:
arr.all()
- Sorting
- One-dimensional array sorting (in place):
arr.sort()
- Sorting each row of a 2D array:
arr.sort(1)
- Sorting each column of a 2D array:
arr.sort(0)
- Unique and set logic
np.unique(array)
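The boolean, sorting, and unique methods above can be tried on a small array. The data below is made up for illustration, and copies are sorted so the original stays intact:

```python
import numpy as np

arr = np.array([[3, -1, 2], [0, 5, -4]])

n_positive = int((arr > 0).sum())     # True counts as 1 when summed
any_negative = bool((arr < 0).any())
all_small = bool((arr < 10).all())

by_row = arr.copy()
by_row.sort(1)       # sort within each row, in place
by_col = arr.copy()
by_col.sort(0)       # sort within each column, in place

names = np.array(["b", "a", "b", "c", "a"])
distinct = np.unique(names)   # sorted unique values
```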
- Linear algebra Ch 4.5
x.dot(y)
is equivalent to np.dot(x, y), and also to x @ y
- Inverse and QR decomposition
import numpy as np
from numpy.linalg import inv, qr
x = np.random.randn(5, 5)
mat = x.dot(x)
inv(mat) # inverse of mat
q, r = qr(mat) # QR decomposition of mat
- Pseudorandom number generation Ch 4.6
- Set seed: np.random.seed(num)
- np.random.randn(): samples from the standard normal distribution
- np.random.rand(): samples from the uniform distribution on [0, 1)
- np.random.randint(): random integers drawn uniformly from a given range
- np.random.binomial(): binomial distribution
- np.random.beta(): beta distribution
- np.random.chisquare(): chi-square distribution
- np.random.uniform(): uniform distribution on a given interval
- np.random.gamma(): gamma distribution
- Chapter 5 Getting started with pandas
- Introduction to pandas data structures Ch 5.1
- Series is a one-dimensional array-like object containing a sequence of values
import pandas as pd
ls = list(range(10))
obj = pd.Series(ls, index=list(range(1, 11)))
obj.values # array of the values 0 to 9
obj.index # index labels 1 to 10
10 in obj # True, `in` checks the index labels
0 in obj # False, 0 is a value but not an index label
dt = {0: 1, 1: 2, 2: 3}
obj2 = pd.Series(dt)
obj2.index # index labels 0 to 2, taken from the dict keys
obj2.name = "my_series" # give a name to the series
obj2.index.name = "my_index" # give a name to the index
- DataFrame
- A dataframe represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type
- A dataframe can have both a row and column index
- The most common way to construct a dataframe is from a dict of equal-length lists or NumPy arrays
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df1 = pd.DataFrame(data)
index = [str(x) for x in range(6)]
df = pd.DataFrame(data, columns=['pop', 'year', 'state'], index=index) # alter the order of columns and set row labels
# If a requested column is not in the original data, its values will be NaN
df.columns # column labels
df.index # row labels
df.head(3) # get the first 3 rows
df['pop'] # get the data in the column named pop
df.loc['1'] # get the data in the row labeled '1'
df.loc[['1', '3']] # get the data in the rows labeled '1' and '3'
df.loc['1', 'pop'] # get the entry in row '1' and column pop
# add a column
df['eastern'] = df.state == 'Ohio'
# delete a column
del df['eastern']
# dataframe transpose
df.T
# Add names to the dataframe indices
df.index.name = 'my_index'
df.columns.name = 'my_columns'
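To make the NaN remark above concrete, here is a small sketch in which a column that is not a key of the dict is requested. The `debt` column and the row labels are made up for illustration.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Nevada"],
        "year": [2000, 2001, 2001],
        "pop": [1.5, 1.7, 2.4]}

# "debt" is not a key of data, so the whole column is filled with NaN.
df = pd.DataFrame(data, columns=["year", "state", "pop", "debt"],
                  index=["one", "two", "three"])

row = df.loc["two"]                     # one row, returned as a Series
value = df.loc["three", "pop"]          # a single entry
df["eastern"] = df["state"] == "Ohio"   # add a boolean column
```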
- Index objects
- You can add an index to a Series or DataFrame; pandas Index objects are immutable
- You can also create an index manually
import numpy as np
import pandas as pd
index = pd.Index(np.arange(3))
- There can be duplicates in a pandas index
- Index operations:
append, difference, intersection, union, isin, delete, drop, insert, is_monotonic_increasing (is_monotonic in older pandas versions), is_unique, unique
- Essential Functionality Ch 5.2
- Reindexing
- The reindex method creates a new object with the data rearranged to conform to the new index
- If an index label did not exist before, the corresponding values are filled with NaN
import pandas as pd
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e']) # NaN is associated with index 'e'
- Other filling-related arguments
method='ffill'
: fill the missing values with the previous valid observation
method='backfill'
: fill the missing values with the next valid observation
method='nearest'
: fill the missing values with the nearest valid observation
fill_value=n
: fill the missing values with n
- For a dataframe, reindex applies to the row index unless specified otherwise. To reindex the columns, use df.reindex(columns=list)
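A short sketch of the reindexing options above, on made-up data. Forward filling assumes the original index is sorted, which holds here.

```python
import pandas as pd

obj = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

plain = obj.reindex(range(6))                    # labels 1, 3, 5 become NaN
filled = obj.reindex(range(6), method="ffill")   # carry the previous value forward

frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
cols = frame.reindex(columns=["b", "c"])         # "c" is new, so it is all NaN
```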
- Drop items
- For a series, pass the index labels to drop
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(6), index=np.arange(6))
obj.drop([1, 2]) # returns a new series without labels 1 and 2
- For a dataframe, you can drop either rows or columns
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
rows = ['Ohio', 'Utah']
cols = ['one', 'four']
data.drop(rows) # Equivalent to data.drop(rows, axis=0), or data.drop(rows, axis='index'), since the default value for axis is 0
data.drop(cols, axis=1) # Equivalent to data.drop(cols, axis='columns')
- Indexing, selection, and filtering
- For a series, you can index by label or by position
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['a':'c'] # label slice, both endpoints are inclusive, result is [0.0, 1.0, 2.0]
obj[0:3] # positional slice, 3 is exclusive, result is [0.0, 1.0, 2.0]
- For a dataframe, you can index on both rows and columns
df[:2]
: the first two rows
df['x']
: column x
df[['x', 'y']]
: column x and column y
df[df['x'] > 2]
: rows whose value in column x is greater than 2
- loc and iloc
loc uses axis labels, iloc uses integer positions
df.loc['r1', 'c1':'cn']
: row r1, columns from c1 to cn (both c1 and cn are included)
df.iloc[:, :]
: select all rows and all columns
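The difference between label slices and integer slices is easy to miss, so here is a minimal sketch on a made-up frame whose row and column labels follow the `r1`/`c1` naming used above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)),
                  index=["r1", "r2", "r3"],
                  columns=["c1", "c2", "c3", "c4"])

by_label = df.loc["r1", "c1":"c3"]   # label slice includes c3
by_position = df.iloc[0, 0:3]        # integer slice excludes position 3
filtered = df[df["c1"] > 2]          # rows where column c1 is greater than 2
```

Both selections return the same three values here, because `"c1":"c3"` covers three labels while `0:3` covers positions 0, 1, and 2.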
- Arithmetic and data alignment
- If an index label appears in only one of the series or dataframes, the result of the arithmetic operation is NaN for that label
- You can use the fill_value parameter of the arithmetic methods to provide a fill value for the missing ones
- Some operations
df1.add(df2) (df1 + df2), df1.radd(df2) (df2 + df1)
df1.sub(df2) (df1 - df2), df1.rsub(df2) (df2 - df1)
df1.mul(df2) (df1 * df2), df1.rmul(df2) (df2 * df1)
df1.div(df2) (df1 / df2), df1.rdiv(df2) (df2 / df1)
- Arithmetic operations between dataframe and series
- By default, arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows
- If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods
df1.add(ls, axis='index')
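A sketch of the alignment rules above with made-up data: label `a` exists only in `s1`, and the last line broadcasts a column across the frame by matching on row labels.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

plain = s1 + s2                     # "a" is missing from s2, so the result there is NaN
filled = s1.add(s2, fill_value=0)   # treat the missing "a" in s2 as 0

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["r1", "r2"])
col = df["x"]
by_rows = df.sub(col, axis="index")   # subtract col from every column, matching on row labels
```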
- Function application and mapping
- NumPy ufuncs (element-wise array methods) also work with pandas objects
- Apply functions on columns and rows
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=['a', 'b', 'c']) # example data
f = lambda x: x.max() - x.min()
df.apply(f) # Calculate max - min for each column
df.apply(f, axis='columns') # Calculate max - min for each row
g = lambda x: pd.Series([x.min(), x.max()], index=['min', 'max'])
df.apply(g) # Calculate min and max for each column
- Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary
- Apply functions for each item in the dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(16).reshape(4, 4))
f = lambda x: '%2.2f' % x
df.applymap(f) # element-wise; pandas 2.1+ renames this method to DataFrame.map
# You can use map on a series
ls = pd.Series(np.arange(16))
ls.map(f)
- Sorting and ranking
- Use ascending=False to sort in descending order
ls.sort_index()
: sort a series by its index
df.sort_index()
: sort by dataframe row index
df.sort_index(axis=1)
: sort by dataframe column index
ls.sort_values()
: sort series values
df.sort_values(by=['a', 'b'], axis=0, ascending=True)
: sort dataframe rows by the values in column a, then column b (rows are reordered, so axis=0)
ls.rank(method='average'/'first'/'max'/'min'/'dense')
: rank a series, with the method deciding how ties are broken
df.rank(method='average', axis=0/1)
: rank a dataframe within each column (axis=0) or within each row (axis=1)
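The tie-breaking methods of `rank` are easiest to see on data with a repeated value. A minimal sketch with made-up numbers:

```python
import pandas as pd

ser = pd.Series([7, -5, 7, 4], index=["d", "a", "b", "c"])

by_index = ser.sort_index()                   # order the labels a, b, c, d
by_value = ser.sort_values(ascending=False)   # largest first
avg_rank = ser.rank()                         # ties share the average rank
first_rank = ser.rank(method="first")         # ties ranked in order of appearance

df = pd.DataFrame({"a": [1, 0, 1], "b": [3, 2, 1]})
sorted_df = df.sort_values(by=["a", "b"])     # sort by a, break ties with b
```

The two 7s would occupy ranks 3 and 4, so the default method gives both of them 3.5, while `method="first"` gives 3 to the one that appears first.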
- Index with duplicates
ls.index.is_unique
: True if the series index has no duplicates (is_unique is a property, so it is not called)
df.index.is_unique
: True if the dataframe row index has no duplicates
df.columns.is_unique
: True if the dataframe column index has no duplicates
- Summarizing and computing descriptive statistics Ch 5.3
df.sum(), df.mean(), df.cumsum(), df.cumprod(), df.count(), df.min(), df.idxmin(), df.median(), df.prod(), df.var(), df.std(), df.skew(), df.kurt(), df.diff(), df.pct_change(), df.corr(), df.cov()
ls.argmin() (Series only, integer position of the minimum), df.mad() (removed in pandas 2.0)
Common arguments: axis=0/1, skipna=True/False
- Unique values, value counts, and membership
ls.unique()
: return unique values
ls.value_counts(sort=False/True)
: return value frequencies
pd.value_counts(ls, sort=False/True)
: return value frequencies (top-level function, deprecated in recent pandas versions in favor of the method)
mask = ls1.isin(ls2)
: return a boolean series indicating whether each item in ls1 is in ls2
df.apply(pd.Series.value_counts).fillna(0)
: calculate value frequencies for each column of a dataframe
import numpy as np
import pandas as pd
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})
result = data.apply(pd.Series.value_counts).fillna(0)
"""
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
"""
- Chapter 6 Data loading, storage, and file formats
- Reading and writing data in text format Ch 6.1
- Parsing functions in pandas
read_csv, read_table, read_fwf, read_clipboard, read_excel, read_hdf, read_html, read_json, read_msgpack (removed in pandas 1.0), read_pickle, read_sas, read_sql, read_stata, read_feather
- Parameters:
sep=',', header=None/int, names=[c1, ..., cn], index_col=cn
- Reading text files in pieces
pd.isnull(df)
pd.options.display.max_rows = n, pd.options.display.max_columns = m
- Writing data to text format
df.to_csv(path)
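The parser parameters above can be tried without touching the disk by reading from an in-memory buffer. This is a sketch with made-up semicolon-separated text. `io.StringIO` stands in for a file path.

```python
import io
import pandas as pd

raw = "a;b;c\n1;2;x\n3;4;y\n"

df = pd.read_csv(io.StringIO(raw), sep=";")
# With header=None the first line is treated as data, and names supplies the labels.
no_header = pd.read_csv(io.StringIO(raw), sep=";", header=None,
                        names=["c1", "c2", "c3"])
indexed = pd.read_csv(io.StringIO(raw), sep=";", index_col="c")

out = io.StringIO()
df.to_csv(out, index=False)   # to_csv accepts a path or a file-like object
```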
- Working with delimited formats
import csv
with open("file_path", newline="") as csvfile:
    reader = csv.reader(csvfile)
    ls = [row for row in reader]
# the with block closes the file, so no explicit close() is needed
header, contents = ls[0], ls[1:]
with open("output", "w", newline="") as output:
    writer = csv.writer(output)
    for row in contents:
        writer.writerow(row)
- JSON data
- Convert a JSON string to a Python object:
obj = json.loads(json_string)
- Convert a Python object to a JSON string:
json_string = json.dumps(obj)
- Read a JSON file into a pandas dataframe:
df = pd.read_json(file)
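A round trip through the three JSON calls above, using made-up JSON text. `io.StringIO` stands in for a file so the sketch is self-contained.

```python
import io
import json
import pandas as pd

json_string = '{"name": "Wes", "pets": null, "ages": [30, 38]}'

obj = json.loads(json_string)   # JSON text -> Python dict (null becomes None)
back = json.dumps(obj)          # Python dict -> JSON text

records = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'
df = pd.read_json(io.StringIO(records))   # list of objects -> one row per object
```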
- XML and HTML
- Read the tables in an HTML document into a list of dataframes:
pd.read_html(file)
- Binary data formats Ch 6.2
- One of the easiest ways to store data efficiently in binary format is using Python's built-in pickle serialization. Pandas objects all have a to_pickle method that writes data to disk in pickle format
- Pickle is good as a short-term storage format only; it's hard to guarantee that the format will be stable over time
- Pandas supports two more binary data formats: HDF5 and MessagePack (MessagePack support was removed in pandas 1.0)
- Store in HDF5 format:
store = pd.HDFStore("xxx.h5"), then store['df'] = df
- HDFStore supports two storage schemas,
format='fixed'/'table'
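A minimal pickle round trip for the `to_pickle` method mentioned above. The file name and the temporary directory are assumptions made so that the sketch cleans up after itself.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")   # hypothetical file name
    df.to_pickle(path)                      # write the frame in pickle format
    restored = pd.read_pickle(path)         # read it back
```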
- Reading excel file
- Create an ExcelFile object and then read a sheet from it
import pandas as pd
xlsx = pd.ExcelFile("xxx.xlsx")
df = pd.read_excel(xlsx, "Sheet1")
- Put it all together: df = pd.read_excel("xxx.xlsx", "Sheet1")
- Interacting with web APIs Ch 6.3
- requests package
import pandas as pd
import requests
url = "xxx"
response = requests.get(url)
data = response.json() # parsed JSON, usually a dict or a list of dicts
df = pd.DataFrame(data) # optionally pass index= and columns= to choose the labels
- Interacting with databases Ch 6.4
- sqlite3
import sqlite3
import pandas as pd
# query template: fill in the table name and the value compared with c1
query = """
SELECT * FROM table
WHERE c1 =
ORDER BY c2
LIMIT 5;
"""
conn = sqlite3.connect("xxx.sqlite")
cursor = conn.execute(query)
rows = cursor.fetchall()
df = pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
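A runnable variant of the sqlite3 pattern above, as a sketch. It uses an in-memory database, and the `cities` table, its rows, and the 1.5 threshold are all made up for illustration.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")   # throwaway database, nothing is written to disk
conn.execute("CREATE TABLE cities (name TEXT, pop REAL)")
conn.executemany("INSERT INTO cities VALUES (?, ?)",
                 [("Atlanta", 1.25), ("Tallahassee", 2.6), ("Sacramento", 1.7)])
conn.commit()

# The ? placeholder lets sqlite3 substitute the value safely.
cursor = conn.execute(
    "SELECT name, pop FROM cities WHERE pop > ? ORDER BY pop LIMIT 5", (1.5,))
rows = cursor.fetchall()
df = pd.DataFrame(rows, columns=[col[0] for col in cursor.description])
conn.close()
```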
- Chapter 7 Data cleaning and preparation
- Handling missing data Ch 7.1
- All of the descriptive statistics on pandas objects exclude missing data by default
- For numeric data, missing data is represented using the floating-point value NaN, use data.isnull() to check for missing data
- Filtering out missing data: use data.dropna(); for a series this is equivalent to data[data.notnull()]. Arguments: axis=0/1, how='all'/'any' (drop if all values are NaN, or if any value is NaN)
- Fill missing values
- Fill all missing values with one value:
data.fillna(val)
- Fill missing values with different values for each column:
data.fillna({"c1": v1, "cn": vn})
- Fill missing values in place:
data.fillna(val, inplace=True)
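The difference between `how='any'` and `how='all'` shows up on a frame where one row is entirely missing. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c1": [1.0, np.nan, 3.0],
                   "c2": [np.nan, np.nan, 6.0]})

no_nan_rows = df.dropna()                     # default how='any': keep rows with no NaN
not_all_nan = df.dropna(how="all")            # drop only rows where every value is NaN
filled = df.fillna({"c1": 0.0, "c2": -1.0})   # a different fill value per column
col_mean = df["c1"].mean()                    # NaN is skipped by default
```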
- Data transformation Ch 7.2
- Removing duplicates
data.duplicated()
: return a boolean series indicating whether each value/row duplicates an earlier one
data.drop_duplicates()
: drop duplicated values/rows, use keep='first'/'last'/False to specify which duplicates to keep
- Transforming data using a function or mapping
import numpy as np
import pandas as pd
df = pd.DataFrame({'index': [0, 1, 3]}) # example data
mapping = {0: 'a', 1: 'b', 3: 'c'}
df['char_index'] = df['index'].map(mapping) # add a new column, with integer indices mapped to characters
- Replacing values:
data.replace(v1, v2)
: replace v1 with v2
data.replace({v1: v2, v3: v4})
: replace v1 with v2, v3 with v4
data.replace([v1, v3], [v2, v4])
: replace v1 with v2, v3 with v4
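A small sketch of the duplicate and replace calls above. The frame and the sentinel values -999 and -1000 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "b", "a"], "v": [1, 2, 1]})

dup_mask = df.duplicated()                # True for a row that repeats an earlier row
first = df.drop_duplicates()              # keep the first occurrence (default)
last = df.drop_duplicates(keep="last")    # keep the last occurrence instead

ser = pd.Series([1, -999, 2, -1000])
cleaned = ser.replace({-999: 0, -1000: 0})   # dict form: old value -> new value
```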
- Renaming axis indices
- Rename row or column indices
import pandas as pd
trans1 = lambda x: x.upper() # index.map applies the function to each label
trans2 = lambda x: x.title() # capitalize the first letter of each word
df.index.map(trans1) # returns a new index, df itself is unchanged
df.columns.map(trans2)
- Rename both index and column names:
df.rename(index=trans1, columns=trans2)
- Detecting and filtering outliers
data.describe()
: calculate count, mean, std, min, quartiles (25%, 50%, 75%), and max of each numeric column
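One possible way to go from `describe()` to actually filtering outliers, sketched on made-up data. The threshold of 3 is an arbitrary assumption, and `clip` is just one way to cap the values.

```python
import pandas as pd

df = pd.DataFrame({"a": [0.5, -4.0, 1.0], "b": [3.5, 0.0, -2.0]})

summary = df.describe()                          # count, mean, std, min, quartiles, max
outlier_rows = df[(df.abs() > 3).any(axis=1)]    # rows with any value beyond +/-3
capped = df.clip(lower=-3, upper=3)              # cap values to the interval [-3, 3]
```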
- Permutation and random sampling
- You can permute the values in a series or the rows of a dataframe with iloc or the take function
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(15).reshape(5, 3))
perm = np.random.permutation(len(df))
df.iloc[perm] # permute the rows
df.take(perm) # permute the rows, equivalent to the iloc call above
- Take a random sample of rows:
df.sample(n=n, replace=True/False)
, default replace is False
- Computing indicator/dummy variables
- Convert a categorical variable into dummy/indicator matrix
import pandas as pd
import numpy as np
ls = ['a', 'a', 'b', 'b', 'c']
ser = pd.Series(ls, index=np.arange(1, len(ls) + 1))
pd.get_dummies(ser, dtype=int) # dtype=int gives 0/1; recent pandas versions default to True/False
"""
   a  b  c
1  1  0  0
2  1  0  0
3  0  1  0
4  0  1  0
5  0  0  1
"""
- Merge a dataframe and a series/another dataframe
- The series must be named
- If the dataframes have columns with the same name, use lsuffix and rsuffix to specify the suffix for the left and right dataframe
import pandas as pd
df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})
df.join(other, lsuffix='_caller', rsuffix='_other')
- Use np.random.seed(n) to set seed for random sampling
- String manipulation Ch 7.3
- String object methods
str.split(sep), str.strip(), str.rstrip(), str.lstrip(), sep.join(ls)
char in str, str.index(char) (ValueError if not found), str.find(char) (return -1 if not found); for a pandas series, use ser.str.contains(pattern)
str.count(char), str.replace(old, new)
str.endswith(pattern), str.startswith(pattern)
str.lower(), str.upper()
- Regular expression
re package
- Use regex = re.compile(pattern, flags=re.IGNORECASE) to get a reusable regex object
regex.findall(str), regex.search(str), regex.match(str)
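The three regex methods above differ in where they look. This is a sketch with made-up text and a simplified email pattern, which is an assumption and not a complete email validator.

```python
import re

text = "Dave dave@google.com\nSteve steve@gmail.com\nRob ROB@GMAIL.com"
regex = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}", flags=re.IGNORECASE)

emails = regex.findall(text)    # every non-overlapping match in the text
first = regex.search(text)      # the first match anywhere in the text
at_start = regex.match(text)    # matches only at the very start of the text
```

Here `at_start` is `None`, because the text begins with "Dave " rather than with an address.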
- Data wrangling: join, combine and reshape
- Hierarchical indexing
import numpy as np
import pandas as pd
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
df = data.unstack() # Convert a series with a two-level index to a dataframe
df.index # ['a', 'b', 'c', 'd']
df.columns # [1, 2, 3]
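A deterministic version of the idea above, with made-up values in place of random ones, so the effect of partial indexing and `unstack` can be checked by eye:

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0],
                 index=[["a", "a", "b", "b"], [1, 2, 1, 3]])

inner = data["a"]       # partial indexing on the outer level
wide = data.unstack()   # the inner level becomes the columns
```

Combinations that do not exist in the series, such as `("a", 3)`, show up as NaN in `wide`.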
Python testing with pytest
- Chapter 1 Getting started with pytest
pytest *_test.py
: run all tests in the matching files
pytest -v
: the -v flag controls the verbosity of pytest output in various aspects: test session progress, assertion details when tests fail, fixtures details with --fixtures
pytest -r
: show extra test summary info as specified by chars
- Naming rules
- Test files should be named test_*.py or *_test.py
- Test methods and functions should be named test_*
- Test classes should be named Test*
- Possible outcomes of a test function
- Passed(.): the test ran successfully
- Failed(F): the test did not run successfully
- Skipped(s): the test was skipped
- xfail(x): the test was not supposed to pass, ran, and failed
- XPASS(X): the test was not supposed to pass, ran, and passed
- ERROR(E): an exception happened outside of the test function
- Running only one test
- Run the test called test_inc in test_math.py:
pytest -v test_math.py::test_inc
- Using options
- Check for pytest options:
pytest --help
pytest -m
: only run tests matching given mark expression
- Example
import pytest
def inc_one(x):
    return x + 1
def dec_one(x):
    return x - 1
# give this test a mark called simple
@pytest.mark.simple
def test_inc():
    assert inc_one(1) == 2
def test_dec():
    assert dec_one(2) == 1
pytest -v # This runs both tests
pytest -v -m simple # This runs only the test marked simple
--collect-only
: shows which tests will be run with the given options and configuration
-m markexpr
: markers are one of the best ways to mark a subset of your test functions so that they can be run together. As an example, one way to run test_replace() and test_member_access(), even though they are in separate files, is to mark them
- Chapter 2 Writing test functions
- Using assert statements: pytest uses the plain assert expr statement where unittest uses assertTrue(expr), assertEqual(x, y), ...
- Expecting exceptions:
import pytest
def my_add(x: int, y: int) -> int:
    return x + y
def test_add():
    with pytest.raises(TypeError) as excinfo:
        my_add(1, '2')
    exception_msg = excinfo.value.args[0]
    assert exception_msg == "unsupported operand type(s) for +: 'int' and 'str'"
- Marking test functions
import pytest
def inc(x):
    return x + 1
def dec(x):
    return x - 1
@pytest.mark.inc
def test_inc():
    assert inc(1) == 2
@pytest.mark.dec
def test_dec():
    assert dec(2) == 1
pytest -v # This runs both test_inc and test_dec
pytest -v -m inc # This runs only test_inc
pytest -v -m dec # This runs only test_dec
- Skipping tests
- We can skip tests, unconditionally or based on a condition, when we do not want to run them now
- Skip a test
import pytest
@pytest.mark.skip(reason="not ready yet")
def test_func():
    assert True
- Skip a test with a boolean condition
import sys
import pytest
@pytest.mark.skipif(sys.platform == "win32", reason="example condition: skipped on Windows")
def test_func():
    assert True
- Marking tests as expected to fail
import pytest
@pytest.mark.xfail
def test_func():
    assert True
- Running a subset of tests
pytest -v . # This runs all tests in the current directory
pytest -v test_raise.py # This runs all tests in a file
pytest -v test_raise.py::test_raise # This runs a single test function in a file
pytest -v test_raise.py::TestRaise # This runs all tests in a class inside a file
pytest -v -k '_raise and not delete' # This runs all tests in the current directory whose name contains _raise but not delete
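As a variation on the "Expecting exceptions" item in this chapter, `pytest.raises` also accepts a `match` argument, which avoids comparing the full message by hand. A minimal sketch. `my_div` is a hypothetical helper that is not part of the notes.

```python
import pytest

def my_div(x, y):
    # Hypothetical helper used only for this illustration.
    return x / y

def test_div_by_zero():
    # match does a regex search against the string form of the exception.
    with pytest.raises(ZeroDivisionError, match="division by zero"):
        my_div(1, 0)

def test_div_ok():
    assert my_div(6, 3) == 2
```

Saved in a file such as `test_div.py` (an assumed name), both tests can be run with `pytest -v test_div.py`.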
- Chapter 3 Pytest fixtures
- Fixtures are functions that are run by pytest before the test functions
- Fixtures functions can do whatever you want: get data, set up a database connection, share data among multiple test functions, etc.
- Example
import sqlite3
import pytest
@pytest.fixture
def db_connect():
    conn = sqlite3.connect('test.db')
    cursor = conn.cursor()
    cursor.execute("create table if not exists test_table(name text);")
    cursor.execute("insert into test_table(name) values('test_name');")
    result = cursor.execute("select * from test_table;")
    res = []
    for row in result:
        res.append(row)
    conn.close()
    return res
def test_connection(db_connect):
    assert len(db_connect) > 0
- Sharing fixtures among multiple tests
- Use a conftest.py file to define fixtures that need to be shared among multiple test files
- You can also add a scope parameter to the fixture decorator, @pytest.fixture(scope='module'), to control how often the fixture is set up; scope can be function, class, module, package, or session
- Example
# content of conftest.py
import pytest
import smtplib
@pytest.fixture(scope="module")
def smtp_connection():
    return smtplib.SMTP("smtp.gmail.com", 587, timeout=5)
# content of test_module.py
def test_ehlo(smtp_connection):
    response, msg = smtp_connection.ehlo()
    assert response == 250
    assert b"smtp.gmail.com" in msg
    assert 0 # for demo purposes
def test_noop(smtp_connection):
    response, msg = smtp_connection.noop()
    assert response == 250
    assert 0 # for demo purposes
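Fixtures that open a resource usually need to release it as well. One common pattern, sketched here under the assumption of an in-memory sqlite database, is a fixture that yields the resource and cleans up after the test finishes. `open_db` and `db` are hypothetical names.

```python
import sqlite3
import pytest

def open_db():
    # Hypothetical helper: build a small in-memory database for the tests.
    conn = sqlite3.connect(":memory:")
    conn.execute("create table test_table(name text)")
    conn.execute("insert into test_table(name) values('test_name')")
    return conn

@pytest.fixture
def db():
    conn = open_db()
    yield conn      # the test runs while the fixture is suspended here
    conn.close()    # teardown: runs after the test finishes

def test_row_count(db):
    rows = db.execute("select * from test_table").fetchall()
    assert len(rows) == 1
```

pytest passes the yielded connection to `test_row_count` because the parameter name matches the fixture name.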
Lecture 01 2022/01/19
Introduction, based on the TLCL book
Lecture 02 2022/01/20
Navigation and Redirection, based on the TLCL book
Lecture 03 2022/01/24
Argument Expansion, based on the TLCL book
Lecture 04 2022/01/26
Permissions and Processes, based on the TLCL book
Lecture 05 2022/01/27
Grep and Regular Expressions, based on the TLCL book
Lecture 06 2022/01/31
Lecture 07 2022/02/02
Lecture 08 2022/02/03
Lecture 09 2022/02/07
Lecture 10 2022/02/08
- Change the default browser by which jupyter notebook is opened
- Run jupyter notebook --generate-config at the command prompt; this command generates a configuration file at ~/.jupyter/jupyter_notebook_config.py
- Set this property in the above configuration file:
c.NotebookApp.browser = u'C:/Home/AppData/Local/Google/Chrome/Application/chrome.exe %s'
- If you use Windows, use the path separator \\ and quote the browser path inside "" and then inside '', so your configuration should be something like this:
c.NotebookApp.browser = '"C:\\Home\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe" %s'
Lecture 15 2022/02/28
Lecture 21 2022/03/14
- pytest
- Install pytest:
- Install:
pip install pytest
- Check pytest version:
pytest --version
- Simplest way to run tests:
pytest
, this command will run all files of the form test_*.py or *_test.py in the current directory and its subdirectories
- Simple test example
# inside test_simple.py
def inc_one(x):
    return x + 1
def test_inc_one():
    assert inc_one(1) == 2