An index is an additional structure that is derived from the primary
data
Many database allows you to add and remove indexes without affecting
the contents of the database
Index is mainly used to speed up queries
Any kind of index usually slows down writes, because the index also
needs to be updated every time data is written
Hash Indexes
Key-value indexes are very common, and it's useful building block
for more complex indexes
If the data storage consists only of appending to a file, whenever
you append a new key-value pair for the file, you also update the hash
map to reflect the offset of the data you just wrote. When you want to
look up a value, use the hash map to find the offset in the data file,
seek to that location, and read the value
Avoid storage shortage: break the log into segments of certain size
by closing a segment file when it reaches a certain size, and making
subsequent writes to a new segment file. We can then perform compaction
on these frozen segments by throwing away duplicate keys in the log, and
keeping only the most recent update for each key. We can also merge
compacted segments into a single segment
Issues and solutions
File format: a fast and simple log format should by binary format
file
Deleting records: you need to append a special deletion record to
the data file (called tombstone), the actual deletion happens during
compaction
Crash recovery: if the hash map is totally in memory, then recovery
after crash could be painful, you can save a snapshot of each segment's
hash map on disk to speed up recovery
Partially written records: use checksum to ensure data
integrity
Concurrency control: use only one writer thread to ensure
concurrency
Hash index limitations
The hash tale must fit in memory, so the amount of keys cannot be
very large
Range queries are not efficient, you need to loop up each key in the
range individually
SSTables and LSM-Tree
In hash index, the key-value pairs appear in the order that they
were written, and values later in the log take precedence over values
for the same key earlier in the log
In SSTable (sorted string table), we require that the sequence of
key-value pairs is sorted by key
Advantages of SSTable
Merging segments is simple and efficient, the approach is like the
merging used in the merge sort algorithm
In order to find a particular key in the file, you no longer need to
keep an index of all the keys in the memory
Since read requests need to scan over several key-value pairs in the
requested range anyway, it is possible to group those record into a
block and compress it before writing it to disk
Constructing and maintaining SSTables
When a write comes int, add it to an in-memory balanced tree data
structure called memtable
When the memtable gets bigger than some threshold, write it out to
disk as an SSTable file
In order to serve a read request, first try to find the key in the
memtable, then in the most recent on-disk segment, then in the
next-older segment etc
From time to time, run a merging and compaction process in the
background to combine segment files and to discard overwritten or
deleted values
If the database crashes, the most recent writes (in the memtable)
are lost. In order to avoid this problem, we can keep a separate log on
disk to which every write is immediately appended. Every time the
memtable is written out to an SSTable, the corresponding log can be
discarded
Making an LSM-Tree out of SSTables
LSM-Tree (Log-Structured Merge Tree), maintain key-value pairs. LSM
trees maintain data in two or more separate structures, each of which is
optimized for its respective underlying storage medium; data is
synchronized between the two structures efficiently, in batches
LSM uses memtable in memory and SSTable on disk, the point is that
sequential access is way faster than random access on disk
Performance optimization
The LSM-tree algorithm can be slow when looking up keys that do not
exist in the database, you can use Bloom filters to speed up these
queries
There are also different strategies to determine the order and
timing of how SSTables are compacted and merged
B-Trees
Introduction
B-tree is the standard index implementation in almost all relational
databases, and many nonrelational databases use them too
B-trees break the database down into fixed-size blocks or pages,
traditionally 4Kb in size, and read or write one page at a time. Each
page can be identified using an address or location, which allows one
page to refer to another. We can use these page references to construct
a tree of pages
A B-tree with n keys always has a depth of
Branching factor: the number of references to child pages in one
page of the B-tree
A four-level tree of 4KB pages with a branching factor of 500 can
store up to 256TB
Making B-trees reliable
The basic underlying write operation of a B-tree is to overwrite a
page on disk with new data, the overwrite does not change the location
of the page
Some operations require several different pages to be overwritten
(split/merge is involved), this can be dangerous, because if the
database crashes after only some of the pages have been written, you end
up with a corrupted index
In order to make the database resilient to crashed, it is common for
B-tree implementations to include an additional data structure on disk:
a write-ahead log(WAL, also known as a redo log). The WAL is an
append-only file to which every B-tree modification must be written
before it can be applied to the pages of the tree itself. When the
database comes back up after crash, this log is used to restore the
B-tree back to a consistent state
B-tree optimizations
Instead of overwriting pages and maintaining a WAL for crash
recovery, some databases use a copy-on-write scheme
We can save space in pages by not storing the entire key, but
abbreviating it. Packing more keys into a page allows the tree to have a
higher branching factor, and thus fewer levels (this variant is known as
B+ tree)
In general, pages can be positioned anywhere on disk, there is
nothing requiring pages with nearby key ranges to be nearby on disk.
Many B-tree implementation try to lay out the tree so that leaf pages
appear in sequential order on dis, maintaining this property is
difficult as the tree grows
Comparing B-Trees and LSM-Trees
B-tree implementations are generally more mature than LSM-tree
implementations
As a rule of thumb, LSM-trees are typically faster for writes, but
B-trees are thought to be faster for reads
Advantages of LSM-trees
A B-tree index must write every piece of data at least twice, once
to the WAL and once to the tree page itself (and perhaps again as pages
are split)
Write amplification: one write to the database resulting in multiple
writes to the disk over the course of the database's lifetime
Advantage 1: LSM-trees have higher write throughput than B-trees,
partly because they have lower write amplification, and partly because
they sequentially write compact SSTable files rather than having to
overwrite several pages in the tree
Advantage 2: LSM-trees can be compressed better, and thus often
produce smaller files on disk (B-trees storage engines have
fragmentation, LSM-trees periodically use compression to remove
fragmentation)
Downsides of LSM-trees
Downside 1: the compaction process can sometimes interfere with the
performance of ongoing reads and writes. The impact on throughput and
average response time is usually small, but at higher percentiles the
response time of queries to LSM-trees based engines can be quite
high
Downside 2: the disk's finite write bandwidth needs to be shared
between the initial write and the compaction threads running in the
background, this problem is significant when the write throughput is
high
Downside 3: log-structured storage engine may have multiple copies
of the same key in different segments, then it's harder to offer strong
transactional semantics
Other Indexing Structures
Introduction
The key-value indexes are like a primary key index in the relational
model. A primary key uniquely identified one row in a relational table,
or one document in a document database, or one vertex in a graph
database
Secondary indexes are also very common. You can create several
indexes on the same table in relational databases using the CREATE INDEx
command, and they are often crucial for performing joins
efficiently
A secondary index does not provide unique key-value pairs, i.e.,
there might be many rows with the same key
Both B-trees and log-structured indexes can be used as secondary
indexes
Storing values within the index
The key in an index is the thing that queries search for, but the
value can be one of the two things: it could be the actual row in
question, or it could be a reference to the row stored elsewhere (the
actual place where rows are stored is called a heap file)
Cluster index: the values in index are the actual rows
Covering index/index with included columns: a compromise between a
cluster index and a non-clustered index, it stores some of a table's
columns with the index
Multi-column indexes
The most common type of multi-column index is called a concatenated
index, which simply combines several fields into one key by appending
one column to another
Query with multiple columns cannot be answered using the standard
B-tree or LSM-tree index efficiently
Full-text search and fuzzy indexes
Fuzzy queries: search for similar keys, misspelled words
Full-text search engines commonly allow a search for one word to be
expanded to include synonyms of the word, to ignore grammatical
variations of words, and to search for occurrences of words near other
in the same document, and support various other features that depend on
linguistic analysis of the text
Keeping everything in memory
Some in-memory key-value stores are intended for caching use only,
where it's acceptable for data to be lost if a machine is restarted
The performance advantage of in-memory databases is not due to the
fact that they don't need to read from disk, they can be faster because
they can avoid the overheads of encoding in-memory data structures in a
form that can be written to disk
In-memory databases also provide data models that are difficult to
implement with disk-based indexes ## Transaction Processing or
Analytics