Chapter 1 可靠，可扩展与可维护的应用系统

前言

系统分类
- 数据密集型(data-intensive): 数据量，数据的复杂程度何数据的快速多变性是限制
- 计算密集型(compute-intensive): CPU处理能力是限制
常见系统构成:
- 数据库: 存储数据
- 高速缓存: 缓存复杂或者代价昂贵的操作，加速访问
- 索引: 关键字搜索和过滤
- 流式处理: 持续发送消息到另一个进程，异步处理
- 批处理: 定期处理大量数据

认识数据库系统

可以把数据库，队列，高速缓存等系统归为一大类数据系统(data systems)得原因:
- 随着技术的发展，一些针对不同应用场景进行优化得工具出现了，他们不能被归为传统类型
- 系统需求变得更加广泛，单个组件往往无法完成所有数据处理和存储得需求。需要将任务分解，每个组件负责高效得完成其中一部分，多个组件依靠应用层代码驱动有机衔接起来
系统设计需要注意的问题
- Reliability: 在出现意外得时候，系统应该可以正常运转，虽然性能可能有所降低
- Scalability: 随着规模得增长，系统应该以合理得方式来匹配这种增长
- Maintainability: 随着时间推移，在对现有场景或者新场景进行适配的时候，系统应该高速运转

可靠性

系统的期望
- 应用可以执行用户希望的功能
- 应用可以容忍用户操作错误或者不正确的使用方法
- 系统可以应对典型场景，合理负载压力和数据量
- 系统可以防止未经授权的访问和滥用
可容错(fault-tolerant/resilient)系统: 即使发生了错误，系统应该可以继续正常工作
Fault != Failure:
- Fault: 组件偏离正常规格
- Failure: 整个系统停止
- 不太可能组件故障概率为0，我们应该设计容错机制避免故障引起失效
- 一般来说倾向于容忍故障而不是预防故障，但是对于安全故障来说应该提前预防
硬件故障
- 磁盘平均无故障时间: 10~50年
- 一般应对硬件故障采用硬件冗余: 磁盘配置RAID, 服务器配备双电源, 热拔插CPU, 数据中心备用电源, 发电机
- 随着数据量和计算需求增加，一些云平台可能在硬件没问题的情况下软件崩溃，这时应该注重整体灵活性和弹性而非单机可靠性，因而需要软件容错
软件错误
- 一般来说硬件故障是独立的，但是软件错误可能跨节点导致系统故障
- 一般软件错误需要进行测试，进程隔离，允许进程崩溃重启，监控生产环境的行为表现等操作才能解决
人为失误
- 防止人为失误的多种方法:
- 以最小出错的方式来设计系统: 更多更详细的规定v.s.人们可能存在的逆反
- 想办法分离最容易出错的地方，容易发生故障的接口: 提供一个尽可能完备的砂箱环境来模拟生产环境
- 充分的测试: 单元测试，全系统集成测试，手动测试，自动化测试注意边界条件
- 出现人为失误时提供快速恢复机制: 快速回滚，滚动发布代码，数据校验工具等
- 设置详细而清晰的监控子系统，包括性能指标和错误率
- 推行管理流程并且加以培训
可靠性的重要性
- 无论是强制指标还是非强制指标，都应该对可靠性有着高要求
- 有时为了成本考虑，可以适度降低非关键系统的可靠性要求

可扩展性

随着负载增加，可靠系统有可能变成不可靠的
描述负载
- 负载可以用称为负载参数的若干数字来描述
- 具体负载参数取决于系统的体系结构: 比如Web服务器的每秒请求处理次数，数据库中写入的比例，聊天室的活动用户书量，缓存命中率等
- 有时候平均值很重要，有时候系统瓶颈来自于少数的峰值
描述性能
- 批处理系统如Hadoop中可以用吞吐量(thoughput)，线性系统可以用服务响应时间(response time)
- 延迟(latency): 花费在处理请求上的时间
- 响应时间(response time): 客户端看到的，除了处理请求时间，还包括来回网络延迟和各种排队延迟。响应时间不应该被视作一个常数，而应该视为一种分布
- 随机因素影响: 上下文切换，进程调度，网络数据包丢失和TCP重传，垃圾回收暂停，缺页中断和瓷盘I/O
- 描述响应时间:
  - 平均给响应时间
  - 中位数响应时间(p50)
  - 阈值响应时间(p95, p99, p999)
  - 要考虑长尾效应(tail latencies): 响应比较慢可能是因为数据量大，比如购物价值多
  - 要注意成本效益分析: 优化分位数响应阈值的成本是否小于优化带来的收益
应对负载增加的方法
- 针对负载增加而设计的构架一般不太可能应对超出设计目标10倍的实际负载
- 垂直扩展(升级更强大的机器)，水平扩展(将负载分布到集群中)
- 有一些弹性特征的系统可以自动检测负载增加，然后自动添加更多计算资源。一些手动扩展的系统需要人工分析性能表现和添加资源。如果负载度不可测，自动弹性系统会更高效，手动方式可以减少执行期间的意外情况
- 无状态服务进行分布式扩展比有状态服务更容易
- 超大规模的系统往往针对特定应用而高度定制，很难有一种通用的结构
- 一般来说，可扩展架构通常是从通用模块逐步构建而来，往往有规可循

可维护性

软件系统总体成本里开发成本只占一小部分，绝大部分成本是整个生命周期内持续的投入
软件系统的三个设计原则:
- 可运维性，方便运营团队保持系统平稳运行:
  - 监视系统健康，出现问题及时恢复服务，追踪问题原因
  - 保持软件和平台更新
  - 预测和防范未来的问题
  - 制定流程规范和规范的工具包
- 简单性，使得工程师能够轻松理解系统:
  - 复杂性的表现: 状态空间膨胀，模块紧耦合，相互依赖关系，命名和术语不一致
  - 简化系统设计并不是减少系统功能，主要是消除意外方面的复杂性(坏的软件实现带来的问题)
  - 一个主要手段: 抽象，从而隐藏细节，提供干净易懂的对外接口。好的抽象设计是很有挑战的
- 可演化性，后续工程师可以轻松改进
  - 可以使用敏捷开发模式: TDD(Test driven design)和重构

Chapter 2 数据模型和查询语言

前言

大多数应用程序可以表示为数据模型的一层层叠加，下面是一个例子(可能还有更多的中间层，但是每层都通过一个简介的数据模型隐藏下一层的复杂性):
- 观测现实世界，抽象出对象或者数据结构，以及操作对象/数据结构的API来进行建模
- 在需要对对象/数据结构进行存储的时候，可以采用通过数据模型(JSON/XML文档/表/图)来表示
- 用何种内存/磁盘/网络的字节格式来表示上述存储的对象/数据结构，数据表示需要支持CURD
- 底层上考虑用电流/脉冲波/磁场等表示字节

Chapter 3 Storage and Retrieval

Data Structures That Power Your Database

Overview
- An index is an additional structure that is derived from the primary data
- Many database allows you to add and remove indexes without affecting the contents of the database
- Index is mainly used to speed up queries
- Any kind of index usually slows down writes, because the index also needs to be updated every time data is written
Hash Indexes
- Key-value indexes are very common, and it's useful building block for more complex indexes
- If the data storage consists only of appending to a file, whenever you append a new key-value pair for the file, you also update the hash map to reflect the offset of the data you just wrote. When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value
- Avoid storage shortage: break the log into segments of certain size by closing a segment file when it reaches a certain size, and making subsequent writes to a new segment file. We can then perform compaction on these frozen segments by throwing away duplicate keys in the log, and keeping only the most recent update for each key. We can also merge compacted segments into a single segment
- Issues and solutions
  - File format: a fast and simple log format should by binary format file
  - Deleting records: you need to append a special deletion record to the data file (called tombstone), the actual deletion happens during compaction
  - Crash recovery: if the hash map is totally in memory, then recovery after crash could be painful, you can save a snapshot of each segment's hash map on disk to speed up recovery
  - Partially written records: use checksum to ensure data integrity
  - Concurrency control: use only one writer thread to ensure concurrency
- Hash index limitations
  - The hash tale must fit in memory, so the amount of keys cannot be very large
  - Range queries are not efficient, you need to loop up each key in the range individually
SSTables and LSM-Tree
- In hash index, the key-value pairs appear in the order that they were written, and values later in the log take precedence over values for the same key earlier in the log
- In SSTable (sorted string table), we require that the sequence of key-value pairs is sorted by key
- Advantages of SSTable
  - Merging segments is simple and efficient, the approach is like the merging used in the merge sort algorithm
  - In order to find a particular key in the file, you no longer need to keep an index of all the keys in the memory
  - Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those record into a block and compress it before writing it to disk
- Constructing and maintaining SSTables
  - When a write comes int, add it to an in-memory balanced tree data structure called memtable
  - When the memtable gets bigger than some threshold, write it out to disk as an SSTable file
  - In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment etc
  - From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values
  - If the database crashes, the most recent writes (in the memtable) are lost. In order to avoid this problem, we can keep a separate log on disk to which every write is immediately appended. Every time the memtable is written out to an SSTable, the corresponding log can be discarded
- Making an LSM-Tree out of SSTables
  - LSM-Tree (Log-Structured Merge Tree), maintain key-value pairs. LSM trees maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches
  - LSM uses memtable in memory and SSTable on disk, the point is that sequential access is way faster than random access on disk
- Performance optimization
  - The LSM-tree algorithm can be slow when looking up keys that do not exist in the database, you can use Bloom filters to speed up these queries
  - There are also different strategies to determine the order and timing of how SSTables are compacted and merged
B-Trees
- Introduction
  - B-tree is the standard index implementation in almost all relational databases, and many nonrelational databases use them too
  - B-trees break the database down into fixed-size blocks or pages, traditionally 4Kb in size, and read or write one page at a time. Each page can be identified using an address or location, which allows one page to refer to another. We can use these page references to construct a tree of pages
  - A B-tree with n keys always has a depth of
  - Branching factor: the number of references to child pages in one page of the B-tree
  - A four-level tree of 4KB pages with a branching factor of 500 can store up to 256TB
- Making B-trees reliable
  - The basic underlying write operation of a B-tree is to overwrite a page on disk with new data, the overwrite does not change the location of the page
  - Some operations require several different pages to be overwritten (split/merge is involved), this can be dangerous, because if the database crashes after only some of the pages have been written, you end up with a corrupted index
  - In order to make the database resilient to crashed, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log(WAL, also known as a redo log). The WAL is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. When the database comes back up after crash, this log is used to restore the B-tree back to a consistent state
- B-tree optimizations
  - Instead of overwriting pages and maintaining a WAL for crash recovery, some databases use a copy-on-write scheme
  - We can save space in pages by not storing the entire key, but abbreviating it. Packing more keys into a page allows the tree to have a higher branching factor, and thus fewer levels (this variant is known as B+ tree)
  - In general, pages can be positioned anywhere on disk, there is nothing requiring pages with nearby key ranges to be nearby on disk. Many B-tree implementation try to lay out the tree so that leaf pages appear in sequential order on dis, maintaining this property is difficult as the tree grows
Comparing B-Trees and LSM-Trees
- B-tree implementations are generally more mature than LSM-tree implementations
- As a rule of thumb, LSM-trees are typically faster for writes, but B-trees are thought to be faster for reads
- Advantages of LSM-trees
  - A B-tree index must write every piece of data at least twice, once to the WAL and once to the tree page itself (and perhaps again as pages are split)
  - Write amplification: one write to the database resulting in multiple writes to the disk over the course of the database's lifetime
  - Advantage 1: LSM-trees have higher write throughput than B-trees, partly because they have lower write amplification, and partly because they sequentially write compact SSTable files rather than having to overwrite several pages in the tree
  - Advantage 2: LSM-trees can be compressed better, and thus often produce smaller files on disk (B-trees storage engines have fragmentation, LSM-trees periodically use compression to remove fragmentation)
- Downsides of LSM-trees
  - Downside 1: the compaction process can sometimes interfere with the performance of ongoing reads and writes. The impact on throughput and average response time is usually small, but at higher percentiles the response time of queries to LSM-trees based engines can be quite high
  - Downside 2: the disk's finite write bandwidth needs to be shared between the initial write and the compaction threads running in the background, this problem is significant when the write throughput is high
  - Downside 3: log-structured storage engine may have multiple copies of the same key in different segments, then it's harder to offer strong transactional semantics
Other Indexing Structures
- Introduction
  - The key-value indexes are like a primary key index in the relational model. A primary key uniquely identified one row in a relational table, or one document in a document database, or one vertex in a graph database
  - Secondary indexes are also very common. You can create several indexes on the same table in relational databases using the CREATE INDEx command, and they are often crucial for performing joins efficiently
  - A secondary index does not provide unique key-value pairs, i.e., there might be many rows with the same key
  - Both B-trees and log-structured indexes can be used as secondary indexes
- Storing values within the index
  - The key in an index is the thing that queries search for, but the value can be one of the two things: it could be the actual row in question, or it could be a reference to the row stored elsewhere (the actual place where rows are stored is called a heap file)
  - Cluster index: the values in index are the actual rows
  - Covering index/index with included columns: a compromise between a cluster index and a non-clustered index, it stores some of a table's columns with the index
- Multi-column indexes
  - The most common type of multi-column index is called a concatenated index, which simply combines several fields into one key by appending one column to another
  - Query with multiple columns cannot be answered using the standard B-tree or LSM-tree index efficiently
- Full-text search and fuzzy indexes
  - Fuzzy queries: search for similar keys, misspelled words
  - Full-text search engines commonly allow a search for one word to be expanded to include synonyms of the word, to ignore grammatical variations of words, and to search for occurrences of words near other in the same document, and support various other features that depend on linguistic analysis of the text
- Keeping everything in memory
  - Some in-memory key-value stores are intended for caching use only, where it's acceptable for data to be lost if a machine is restarted
  - The performance advantage of in-memory databases is not due to the fact that they don't need to read from disk, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk
  - In-memory databases also provide data models that are difficult to implement with disk-based indexes ## Transaction Processing or Analytics

My Blog

DDIA Notes

Chapter 1 可靠，可扩展与可维护的应用系统

前言

认识数据库系统

可靠性

可扩展性

可维护性

Chapter 2 数据模型和查询语言

前言

Chapter 3 Storage and Retrieval

Data Structures That Power Your Database