9. Data Storage Structures¶

约 2853 个字 6 行代码预计阅读时间 14 分钟

Abstract

File Organization
- Fixed-Length Records
- Variable-Length Records
Organization of Records in Files
- Heap File Organization
- Sequential File Organization
- Multitable Clustering File Organization
- Table Partition
Data-Dictionary Storage
Storage Access & Buffer manager
- LRU strategy
- Clock algorithm
Columnar Representation

9.1 File Organization¶

The database is stored as a collection of files.
Each file is a sequence of records.
A record is a sequence of fields.

数据库存储为文件的集合，每个文件都是一个记录序列。

One approach:

assume record size is fixed
each file has records of one particular type only
different files are used for different relations

9.1.1 Fixed-Length Records¶

定长记录：

Store record i starting from byte \(n \times (i – 1)\), where n is the size of each record.

Record access is simple but records may cross blocks

Modification: do not allow records to cross block boundaries

第 i 条记录是从第\(n \times (i - 1)\)个字节开始的，其中 n 是每一条记录的定长

如果一个块有4096个字节，一个记录需要100个字节来存放，那么最多只能放下40条记录，而舍去最后的96个字节。否则会造成一条记录存在于两个块中的麻烦情况。

Deletion of record i: alternatives: 删除记录i：

move records \(i + 1,\ldots , n\) to \(i, \ldots. , n – 1\) 依次上移
move record n to i 将第n条记录移动到第i条记录位置（删除位置）
do not move records, but link all free records on a free list

要删除的条打上标记，在一块的头部留下一点空间，用于放置空链表指针，将所有被删除的位置连接在一起，形成一个空记录的链表。

以后如果要往这个块里插入，直接通过指针找到空记录插入即可，随后更新指针。

9.1.2 Variable-Length Records¶

可变长度记录

Variable-length records arise in database systems in several ways:

Storage of multiple record types in a file. 在一个文件中存储多种记录类型。
Record types that allow variable lengths for one or more fields such as strings (varchar)

允许一个或多个字段使用可变长度的记录类型，例如字符串 ( varchar )
Record types that allow repeating fields (used in some older data models).

Variable length attributes represented by fixed size (offset, length), with actual data stored after all fixed length attributes

变长属性用固定大小（偏移量、长度）前缀表示的，实际数据存储在所有定长属性之后

Null values represented by null-value bitmap（空位图）stored in 1 byte

空位图：表示这些属性是否为null（空）。如空位图为0000，表示四个属性都不是空。

Example

定长的 (offset, length) +(定长属性）保存在前面，不定长的保存在后面.

最开始放置不定长的offset和length，再放置定长，最后放置不定长

表示中，前面用两个整数分别表示位置与长度(定长)，“65000”是定长字符串（Bytes 12到Bytes 20）.，0000是空位图（需要 1 byte），表达属性是否为空，这些都是定长的。后面的不定长属性按照实际的长度，依次存放

位置 20 放的 0000，对应的是空位图null-value bitmap, 表示前面四个属性均是非空的, 1 表示空。（放在前面也可以，只要在一个固定位置能找到即可）

前提：每一个记录都是被放在一起的。（有按行存储的方式）

9.1.2.1 Slotted Page Structure¶

Slotted page（分槽页） header contains:

number of record entries
end of free space in the block

一个指针指向 free space 末尾，用来分配内存（新的record位置）
location and size of each record

两头压缩，新插入的写在Free Space的开头和末尾

当删除的时候，一种方法是把后面的记录挪过去，让自由空间更紧凑，这样需要修改 entries, free space 的指针, 偏移量也要调整。也可以暂时不调整，等后面如果需要分配内存但不够用时，再一次性重整之前的空间。

9.1.3 Organization of Records in Files¶

插入到哪个文件的哪个位置？

Heap – record can be placed anywhere in the file where there is space

有位置我就插进去
Sequential – store records in sequential order, based on the value of the search key of each record

根据每条记录的搜索键值按顺序存储记录
In a multitable clustering file organization records of several different relations can be stored in the same file

Motivation: store related records on the same block to minimize I/O

将相关记录存储在同一块上以最小化I/O
B+tree file organization - Ordered storage even with inserts/deletes
Hashing – a hash function computed on search key; the result specifies in which block of the file the record should be placed

9.1.3.1 Heap File Organization¶

Array with 1 entry per block. Each entry is a few bits to a byte, and records fraction of block that is free.对于文件中每一个块，（例如用3个或者4个bit）来表示这个块中的空闲程度。

Free-space map

维护一个空闲块的地图，记录这个块的空闲程度。

Example

3 bits per block, value divided by 8 indicates

如 4 表示 4/8 空闲。

顺序访问比较低效，我们可以有第二层(second level)来表示其中的最大空闲块。

9.1.3.2 Sequential File Organization¶

Suitable for applications that require sequential processing of the entire file

The records in the file are ordered by a search-key

Deletion – use pointer chains
Insertion – locate the position where the record is to be inserted

逐个迁移效率低，我们把插入的放在末尾，通过指针维护秩序。

Need to reorganize the file from time to time to restore sequential order.

为了维持物理意义上的顺序结构（不依赖指针），需要不时地重新组织文件以恢复顺序。

9.1.3.3 Multitable Clustering File Organization¶

Store several relations in one file using a multitable clustering file organization.

将不同关系的表放在一起

Example

对于老师和部分经常一起访问的情况，我们可以把这两个信息放在一起。（如果两个表经常连接，这样比较高效）

但这样对于单独查找某个信息就不太方便。

用于涉及单个部门及其教师的查询
不利于仅涉及部门的查询,导致可变大小的记录
可以添加指针链以链接特定关系的记录

9.1.3.4 Table Partitioning¶

Table partitioning: Records in a relation can be partitioned into smaller relations that are stored separately

一个表太大，对于并行访问可能引发冲突。于是我们可以把表分开，如对于所有老师的表，我们可以把计算机系的老师分出来，数学系的老师分出来。（水平分割）也可以按列存储。

水平partition 和垂直 partition

eg：水平整行记录分离，如将全体学生按专业分类

垂直分隔，按列属性分离，所有的学号一张表，所有的电话一张表

9.2 Data Dictionary Storage¶

The Data dictionary (also called system catalog) stores metadata; that is, data about data, such as

数据字典（也称为系统目录）存储元数据

定义的数据也是数据 (metadata) 我们也需要把它们存储下来。

Information about relations
- names of relations
- names, types and lengths of attributes of each relation
- names and definitions of views
- integrity constraints
User and accounting information, including passwords
Statistical and descriptive data
Physical file organization information
Information about indices

9.3 Storage Access & Buffer manager¶

Blocks are units of both storage allocation and data transfer.

Buffer – portion of main memory available to store copies of disk blocks.

缓冲区——主内存的一部分，可用于存储磁盘块的副本

Buffer manager – subsystem responsible for allocating buffer space in main memory.

缓冲区管理器——负责在主内存中分配缓冲区空间的子系统

如我们要找某块，先在 buffer 中找，如果没找到就从磁盘中读出来放到 buffer 中。当 buffer 满了就需要考虑如何替换，替换哪一块（最近最少原则）。

"LRU（least recent used 最近最少） Example"

看作堆栈的模式，最近使用的放在top，最近最少使用的放在bottom

对于修改过后的数据弹出时，需要先写回磁盘，或者确保修改日志已经写入到磁盘（持久化）

Programs call on the buffer manager when they need a block from disk.

当程序需要磁盘上的块时，程序会调用缓冲区管理器

If the block is already in the buffer, buffer manager returns the address of the block in main memory

如果该块已经在缓冲区中，则缓冲区管理器返回该块在主内存中的地址
If the block is not in the buffer, the buffer manager

如果该块不在缓冲区中
- Allocates space in the buffer for the block
  
  在缓冲区中为块分配空间
  
  在 buffer 里替换空间，如果有空位可以直接分配，否则需要替换。
  - Replacing (throwing out) some other block, if required, to make space for the new block.
    
    如果需要，替换（扔掉）其他一些块，为新块腾出空间。
  - Replaced block written back to disk only if it was modified since the most recent time that it was written to/fetched from the disk.
    
    先让日志持久化，将日志写入到磁盘
- Reads the block from the disk to the buffer, and returns the address of the block in main memory to requester.
  
  将块从磁盘读取到缓冲区，并将块在主存中的地址返回给请求者。

Pinned block

memory block that is not allowed to be written back to disk

如果上面的程序正在访问缓冲区中的某一个块，那么要进行Pin的操作，钉住这个块，不让它离开缓冲区。

Pin done before reading/writing data from a block
Unpin done when read /write is complete
Multiple concurrent pin/unpin operations possible

Keep a pin count, buffer block can be evicted only if pin count = 0
当上层的程序访问完毕，即Pin count = 0（Pin count表示上层有多少个程序正在访问缓冲区的这个块）时，进行Unpin的操作，解除钉住这个块，那么这个块将变成候选的可以被替换出去的块。

Shared and exclusive locks on buffer

当一个块进入到缓冲区以后，上层的某些程序可能在读它，有些程序可能在修改它，则会涉及共享区域的冲突访问问题。

解决方案：加锁。读与读之间不冲突，但读与写、写与写之间都是冲突的。

Readers get shared lock, updates to a block require exclusive lock

Locking rules:

Only one process can get exclusive lock at a time
Shared lock cannot be concurrently with exclusive lock
Multiple processes may be given shared lock concurrently

9.3.1 Buffer-Replacement Policies¶

LRU strategy - replace the block least recently used.
- Idea behind LRU – use past pattern of block references as a predictor of future references
  
  用过去的访问模式推断讲来的访问模式
Warning "LRU can be a bad strategy"
Toss-immediate strategy – frees the space occupied by a block as soon as the final tuple of that block has been processed
Most recently used (MRU) strategy – system must pin the block currently being processed. After the final tuple of that block has been processed, the block is unpinned, and it becomes the most recently used block.
Buffer managers also support forced output of blocks for the purpose of recovery

最好的策略是基于预测的，但是预测本身是很难的，需要利用人工智能的方法。

9.3.2 Clock: An approximation of LRU¶

Arrange block into a cycle, store one reference_bit per block

When pin_count reduces to 0, set reference_bit =1

轮转一次，如果pin_count访问次数为0(表示在上一次循环中，没有外部程序访问这一块)，就将reference_bit置为1，表示可以替换出去（但不是现在）.

对于环中的每一块，如果它的Reference bit为1，那么将其置为0；

现在，新进来一个块，放到刚刚被替换出去的块的位置，它的Reference bit被置为1。或者，假如在循环过程中，哪一块被访问到，就将对应的reference bit置为1。

所以一旦在某一轮遇到reference bit = 0（且pin_count = 0），就可以将其替换出去。因为它历经了1到0,0到0的过程，且每一次pin_count都为0，相当于两轮没有访问过，可以弹出.

reference_bit as the 2nd chance bit

do for each block in cycle {
    if (reference_bit ==1)
        set reference_bit=0;
    else if (reference_bit ==0)
        choose this block for replacement;
} until a page is chosen;