[Repost] Understanding Linux filesystems: ext4 and beyond

Learn the history of ext4, including what’s different from ext3 and the other filesystems that came before it.

The majority of modern Linux distributions default to the ext4 filesystem, just as previous Linux distributions defaulted to ext3, ext2, and — if you go back far enough — ext.

If you’re new to Linux — or to filesystems — you might wonder what ext4 brings to the table that ext3 didn’t. You might also wonder whether ext4 is still in active development at all, given the flurries of news coverage of alternate filesystems such as btrfs, xfs, and zfs.

We can’t cover everything about filesystems in a single article, but we’ll try to bring you up to speed on the history of Linux’s default filesystem, where it stands, and what to look forward to.

I drew heavily on Wikipedia’s various ext filesystem articles, kernel.org’s wiki entries on ext4, and my own experiences while preparing this overview.

A brief history of ext

MINIX filesystem

Before there was ext, there was the MINIX filesystem. If you’re not up on your Linux history, MINIX was a very small Unix-like operating system for IBM PC/AT microcomputers. Andrew Tanenbaum developed it for teaching purposes and released its source code (in print form!) in 1987.

IBM PC/AT from the mid-1980s. Image: MBlairMartin, CC BY-SA 4.0

Although you could peruse MINIX’s source, it was not actually free and open source software (FOSS). The publishers of Tanenbaum’s book required a $69 license fee to operate MINIX, which was included in the cost of the book. Still, this was incredibly inexpensive for the time, and MINIX adoption took off rapidly, soon exceeding Tanenbaum’s original intent of using it simply to teach the coding of operating systems. By and throughout the 1990s, you could find MINIX installations thriving in universities worldwide, and a young Linus Torvalds used MINIX to develop the original Linux kernel, first announced in 1991, and released under the GPL in December 1992.

But wait, this is a filesystem article, right? Yes, and MINIX had its own filesystem, which early versions of Linux also relied on. Like MINIX, it could uncharitably be described as a “toy” example of its kind — the MINIX filesystem could handle filenames only up to 14 characters and address only 64MB of storage. In 1991, the typical hard drive was already 40-140MB in size. Linux clearly needed a better filesystem!

ext

While Linus hacked away on the fledgling Linux kernel, Rémy Card worked on the first ext filesystem. First implemented in 1992 — only a year after the initial announcement of Linux itself! — ext solved the worst of the MINIX filesystem’s problems.

1992’s ext used the new virtual filesystem (VFS) abstraction layer in the Linux kernel. Unlike the MINIX filesystem before it, ext could address up to 2GB of storage and handle 255-character filenames.

But ext didn’t have a long reign, largely due to its primitive timestamping (only one timestamp per file, rather than the three separate stamps for inode change, file access, and file modification we’re familiar with today). A mere year later, ext2 ate its lunch.

ext2

Rémy clearly realized ext’s limitations pretty quickly, since he designed ext2 as its replacement a year later. While ext still had its roots in “toy” operating systems, ext2 was designed from the start as a commercial-grade filesystem, along the same principles as BSD’s Berkeley Fast File System.

Ext2 offered maximum filesizes in the gigabytes and filesystem sizes in the terabytes, placing it firmly in the big leagues for the 1990s. It was quickly and widely adopted, both in the Linux kernel and eventually in MINIX, as well as by third-party modules making it available for MacOS and Windows.

There were still problems to solve, though: ext2 filesystems, like most filesystems of the 1990s, were prone to catastrophic corruption if the system crashed or lost power while data was being written to disk. They also suffered from significant performance losses due to fragmentation (the storage of a single file in multiple places, physically scattered around a rotating disk) as time went on.

Despite these problems, ext2 is still used in some isolated cases today—most commonly, as a format for portable USB thumb drives.

ext3

In 1998, five years after ext2’s adoption, Stephen Tweedie announced he was working on significantly improving it. This became ext3, which was adopted into mainline Linux with kernel version 2.4.15, in November 2001.

Packard Bell computer from the mid-1990s. Image: Spacekid, CC0

Ext2 had done very well by Linux distributions for the most part, but — like FAT, FAT32, HFS, and other filesystems of the time — it was prone to catastrophic corruption during power loss. If you lose power while writing data to the filesystem, it can be left in what’s called an inconsistent state — one in which things have been left half-done and half-undone. This can result in loss or corruption of vast swaths of files unrelated to the one being saved or even unmountability of the entire filesystem.

Ext3, like other filesystems of the late 1990s such as Microsoft’s NTFS, uses journaling to solve this problem. The journal is a special allocation on disk where writes are stored in transactions; if the transaction finishes writing to disk, its data in the journal is committed to the filesystem itself. If the system crashes before that operation is committed, the newly rebooted system recognizes it as an incomplete transaction and rolls it back as though it had never taken place. This means that the file being worked on may still be lost, but the filesystem itself remains consistent, and all other data is safe. Three levels of journaling are available in the Linux kernel implementation of ext3: journal, ordered, and writeback.

  • Journal is the lowest risk mode, writing both data and metadata to the journal before committing it to the filesystem. This ensures consistency of the file being written to, as well as the filesystem as a whole, but can significantly decrease performance.
  • Ordered is the default mode in most Linux distributions; ordered mode writes metadata to the journal but commits data directly to the filesystem. As the name implies, the order of operations here is rigid: first, data is written to the filesystem; only then is the associated metadata committed to the journal, and later flushed from the journal to the filesystem itself. This ensures that, in the event of a crash, committed metadata never refers to data blocks that had not yet reached the disk, and metadata for incomplete writes is discarded when the journal is replayed. In ordered mode, a crash may result in corruption of the file or files being actively written to during the crash, but the filesystem itself, and files not actively being written to, are guaranteed safe.
  • Writeback is the third — and least safe — journaling mode. In writeback mode, like ordered mode, metadata is journaled, but data is not. Unlike ordered mode, metadata and data alike may be written in whatever order makes sense for best performance. This can offer significant increases in performance, but it’s much less safe. Although writeback mode still offers a guarantee of safety to the filesystem itself, files that were written to during or before the crash are vulnerable to loss or corruption.

Like ext2 before it, ext3 uses 32-bit internal addressing. This means that with a blocksize of 4K, the largest filesize it can handle is 2 TiB in a maximum filesystem size of 16 TiB.

ext4

Theodore Ts’o (who by then was ext3’s principal developer) announced ext4 in 2006, and it was added to mainline Linux two years later, in kernel version 2.6.28. Ts’o describes ext4 as a stopgap technology which significantly extends ext3 but is still reliant on old technology. He expects it to be supplanted eventually by a true next-generation filesystem.

Dell Precision 380 workstation. Image: Lance Fisher, CC BY-SA 2.0

Ext4 is functionally very similar to ext3, but brings large filesystem support, improved resistance to fragmentation, higher performance, and improved timestamps.

Ext4 vs ext3

Ext3 and ext4 have some very specific differences, which I’ll focus on here.

Backwards compatibility

Ext4 was specifically designed to be as backward-compatible as possible with ext3. This not only allows ext3 filesystems to be upgraded in place to ext4; it also permits the ext4 driver to automatically mount ext3 filesystems in ext3 mode, making it unnecessary to maintain the two codebases separately.

Large filesystems

Ext3 filesystems used 32-bit addressing, limiting them to 2 TiB files and 16 TiB filesystems (assuming a 4 KiB blocksize; some ext3 filesystems use smaller blocksizes and are thus limited even further).

Ext4 uses 48-bit internal addressing, making it theoretically possible to allocate files up to 16 TiB on filesystems up to 1,048,576 TiB (1 EiB). Early implementations of ext4 were still limited to 16 TiB filesystems by some userland utilities, but as of 2011, e2fsprogs has directly supported the creation of >16TiB ext4 filesystems. As one example, Red Hat Enterprise Linux contractually supports ext4 filesystems only up to 50 TiB and recommends ext4 volumes no larger than 100 TiB.
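
The size limits above follow directly from the width of a block number multiplied by the block size. Here is a minimal sketch of that arithmetic in C; the function names are made up for illustration and are not part of any ext4 tooling.

```c
#include <stdint.h>

/* Largest volume, in bytes, that can be addressed with block numbers that are
 * `block_bits` wide and blocks of `block_size` bytes.
 * Only valid while the product still fits in 64 bits. */
uint64_t max_volume_bytes(unsigned block_bits, uint64_t block_size)
{
    return ((uint64_t)1 << block_bits) * block_size;
}

/* Convert a byte count to whole TiB (1 TiB = 2^40 bytes). */
uint64_t bytes_to_tib(uint64_t bytes)
{
    return bytes >> 40;
}
```

With 4 KiB blocks, 32-bit block numbers give 2^32 × 2^12 = 2^44 bytes (16 TiB), while 48-bit block numbers give 2^60 bytes, which is 1 EiB or roughly a million TiB.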

Allocation improvements

Ext4 introduces a lot of improvements in the ways storage blocks are allocated before writing them to disk, which can significantly increase both read and write performance.

Extents

An extent is a range of contiguous physical blocks (up to 128 MiB, assuming a 4 KiB block size) that can be reserved and addressed at once. Utilizing extents decreases the amount of block-mapping metadata a given file requires, significantly decreases fragmentation, and increases performance when writing large files.
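
To get a feel for how much bookkeeping extents save, here is a rough sketch in C. It assumes a per-extent cap of 32,768 blocks, which is simply what the 128 MiB figure implies at a 4 KiB block size; the constant and function names are illustrative only.

```c
#include <stdint.h>

/* Assumed per-extent cap in blocks: 128 MiB / 4 KiB = 32,768. */
#define MAX_BLOCKS_PER_EXTENT 32768ULL

/* Largest span a single extent can cover for a given block size. */
uint64_t max_extent_bytes(uint64_t block_size)
{
    return MAX_BLOCKS_PER_EXTENT * block_size;
}

/* Best case (perfectly contiguous file): how many extents are needed
 * to map `file_bytes` of data. */
uint64_t min_extents_for_file(uint64_t file_bytes, uint64_t block_size)
{
    uint64_t blocks = (file_bytes + block_size - 1) / block_size;
    return (blocks + MAX_BLOCKS_PER_EXTENT - 1) / MAX_BLOCKS_PER_EXTENT;
}
```

Under these assumptions a contiguous 1 GiB file needs as few as 8 extents, where a per-block mapping would have to track all 262,144 of its 4 KiB blocks individually.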

Multiblock allocation

Ext3 called its block allocator once for each new block allocated. This could easily result in heavy fragmentation when multiple writers are open concurrently. However, ext4 uses delayed allocation, which allows it to coalesce writes and make better decisions about how to allocate blocks for the writes it has not yet committed.

Persistent pre-allocation

When pre-allocating disk space for a file, most file systems must write zeroes to the blocks for that file on creation. Ext4 allows the use of fallocate() instead, which guarantees the availability of the space (and attempts to find contiguous space for it) without first needing to write to it. This significantly increases performance in both writes and future reads of the written data for streaming and database applications.
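
As a rough illustration of pre-allocation from a program's point of view, the sketch below reserves space for a file before any data is written. It uses posix_fallocate(), the portable POSIX interface, rather than the Linux-specific fallocate() call named above; with glibc on Linux it is expected to use the same underlying mechanism when the filesystem supports it. The helper name preallocate_file is made up for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or open) `path` and reserve `len` bytes for it without writing data.
 * Returns 0 on success, or an errno-style error code on failure. */
int preallocate_file(const char *path, off_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return errno;

    /* posix_fallocate() returns the error number directly; it does not set errno. */
    int err = posix_fallocate(fd, 0, len);

    if (close(fd) != 0 && err == 0)
        err = errno;
    return err;
}
```

A database or download client might call something like preallocate_file("data.bin", 1 << 30) once up front, then fill the file in at its own pace, knowing the space is already reserved.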

Delayed allocation

This is a chewy — and contentious — feature. Delayed allocation allows ext4 to wait to allocate the actual blocks it will write data to until it’s ready to commit that data to disk. (By contrast, ext3 would allocate blocks immediately, even while the data was still flowing into a write cache.)

Delaying allocation of blocks as data accumulates in cache allows the filesystem to make saner choices about how to allocate those blocks, reducing fragmentation (write and, later, read) and increasing performance significantly. Unfortunately, it increases the potential for data loss in programs that have not been specifically written to call fsync() when the programmer wants to ensure data has been flushed entirely to disk.

Let’s say a program rewrites a file entirely:

fd = open("file", O_WRONLY | O_TRUNC); write(fd, data, len); close(fd);

With legacy filesystems, close(fd); is sufficient to guarantee that the contents of file will be flushed to disk. Even though the write is not, strictly speaking, transactional, there’s very little risk of losing the data if a crash occurs after the file is closed.

If the write does not succeed (due to errors in the program, errors on the disk, power loss, etc.), both the original version and the newer version of the file may be lost or corrupted. If other processes access the file as it is being written, they will see a corrupted version. And if other processes have the file open and do not expect its contents to change — e.g., a shared library mapped into multiple running programs — they may crash.

To avoid these issues, some programmers avoid using O_TRUNC at all. Instead, they might write to a new file, close it, then rename it over the old one:

fd = open("newfile", O_WRONLY | O_CREAT | O_TRUNC, 0644); write(fd, data, len); close(fd); rename("newfile", "file");

Under filesystems without delayed allocation, this is sufficient to avoid the potential corruption and crash problems outlined above: Since rename() is an atomic operation, it won’t be interrupted by a crash; and running programs will continue to reference the old, now unlinked version of file for as long as they have an open filehandle to it. But because ext4’s delayed allocation can cause writes to be delayed and re-ordered, the rename("newfile", "file") may be carried out before the contents of newfile are actually written to disk, which opens the problem of parallel processes getting bad versions of file all over again.

To mitigate this, the Linux kernel (since version 2.6.30) attempts to detect these common code cases and force the files in question to be allocated immediately. This reduces, but does not prevent, the potential for data loss — and it doesn’t help at all with new files. If you’re a developer, please take note: The only way to guarantee data is written to disk immediately is to call fsync() appropriately.
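
For developers, the write-then-rename pattern with fsync() added might look roughly like the sketch below. The function names are invented for this example, and the final fsync() on the containing directory is a common extra precaution rather than something this article prescribes.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all `len` bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Replace `path` with `data` via a temporary file in the same directory.
 * `tmp_path` must live on the same filesystem as `path`; `dir_path` is the
 * directory containing both. Returns 0 on success, -1 on error. */
int replace_file_durably(const char *dir_path, const char *path,
                         const char *tmp_path, const char *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Flush the new contents to disk BEFORE the rename makes them visible. */
    if (write_all(fd, data, len) != 0 || fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Flush the directory so the rename itself survives a crash. */
    int dfd = open(dir_path, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

The important ordering is fsync() on the new file first, rename() second: a crash at any point should then leave either the complete old contents or the complete new contents under the final name.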

Unlimited subdirectories

Ext3 was limited to 32,000 subdirectories per directory; ext4 allows an unlimited number. Beginning with kernel 2.6.23, ext4 uses HTree indices to mitigate performance loss with huge numbers of subdirectories.

Journal checksumming

Ext3 did not checksum its journals, which presented problems for disk or controller devices with caches of their own, outside the kernel’s direct control. If a controller or a disk with its own cache did writes out of order, it could break ext3’s journaling transaction order, potentially corrupting files being written to during (or for some time preceding) a crash.

In theory, this problem is resolved by the use of write barriers — when mounting the filesystem, you set barrier=1 in the mount options, and the device will then honor fsync() calls all the way down to the metal. In practice, it’s been discovered that storage devices and controllers frequently do not honor write barriers — improving performance (and benchmarks, where they’re compared to their competitors) but opening up the possibility of data corruption that should have been prevented.

Checksumming the journal allows the filesystem to realize that some of its entries are invalid or out-of-order on the first mount after a crash. This thereby avoids the mistake of rolling back partial or out-of-order journal entries and further damaging the filesystem — even if the storage devices lie and don’t honor barriers.

Fast filesystem checks

Under ext3, the entire filesystem — including deleted and empty files — required checking when fsck is invoked. By contrast, ext4 marks unallocated blocks and sections of the inode table as such, allowing fsck to skip them entirely. This greatly reduces the time to run fsck on most filesystems and has been implemented since kernel 2.6.24.

Improved timestamps

Ext3 offered timestamps granular to one second. While sufficient for most uses, mission-critical applications are frequently looking for much, much tighter time control. Ext4 makes itself available to those enterprise, scientific, and mission-critical applications by offering timestamps in the nanoseconds.

Ext3 filesystems also did not provide sufficient bits to store dates beyond January 19, 2038. Ext4 adds an additional two bits here, extending the Unix epoch another 408 years. If you’re reading this in 2446 AD, you have hopefully already moved onto a better filesystem, but it’ll make me posthumously very, very happy if you’re still measuring the time since UTC 00:00, January 1, 1970.
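
The 408-year figure can be checked with a little arithmetic. The sketch below assumes the two extra bits simply extend a signed 32-bit seconds counter upward, so that each extra-bit value adds another 2^32 seconds of range; the function names and the average-year constant are illustrative choices, not ext4 definitions.

```c
#include <stdint.h>

/* Average Gregorian year (365.2425 days) in seconds. */
#define SECONDS_PER_YEAR 31556952LL

/* Latest representable second for a signed 32-bit counter extended by
 * `extra_bits` bits, where each extra-bit value adds another 2^32 seconds. */
int64_t max_timestamp(unsigned extra_bits)
{
    int64_t extra_epochs = ((int64_t)1 << extra_bits) - 1;
    return (int64_t)INT32_MAX + extra_epochs * ((int64_t)1 << 32);
}

/* Rough calendar year for a count of seconds since 1970. */
int64_t approx_year(int64_t seconds_since_epoch)
{
    return 1970 + seconds_since_epoch / SECONDS_PER_YEAR;
}
```

With no extra bits the counter runs out in 2038; with two extra bits the range grows by 3 × 2^32 seconds, about 408 years, which lands in 2446.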

Online defragmentation

Neither ext2 nor ext3 directly supported online defragmentation — that is, defragging the filesystem while mounted. Ext2 had an included utility, e2defrag, that did what the name implies — but it needed to be run offline while the filesystem was not mounted. (This is, obviously, especially problematic for a root filesystem.) The situation was even worse in ext3 — although ext3 was much less likely to suffer from severe fragmentation than ext2 was, running e2defrag against an ext3 filesystem could result in catastrophic corruption and data loss.

Although ext3 was originally deemed “unaffected by fragmentation,” processes that employ massively parallel write processes to the same file (e.g., BitTorrent) made it clear that this wasn’t entirely the case. Several userspace hacks and workarounds, such as Shake, addressed this in one way or another — but they were slower and in various ways less satisfactory than a true, filesystem-aware, kernel-level defrag process.

Ext4 addresses this problem head on with e4defrag, an online, kernel-mode, filesystem-aware, block-and-extent-level defragmentation utility.

Ongoing ext4 development

Ext4 is, as the Monty Python plague victim once said, “not quite dead yet!” Although its principal developer regards it as a mere stopgap along the way to a truly next-generation filesystem, none of the likely candidates will be ready (due to either technical or licensing problems) for deployment as a root filesystem for some time yet.

There are still a few key features being developed into future versions of ext4, including metadata checksumming, first-class quota support, and large allocation blocks.

Metadata checksumming

Since ext4 has redundant superblocks, checksumming the metadata within them offers the filesystem a way to figure out for itself whether the primary superblock is corrupt and needs to use an alternate. It is possible to recover from a corrupt superblock without checksumming — but the user would first need to realize that it was corrupt, and then try manually mounting the filesystem using an alternate. Since mounting a filesystem read-write with a corrupt primary superblock can, in some cases, cause further damage, this isn’t a sufficient solution, even with a sufficiently experienced user!

Compared to the extremely robust per-block checksumming offered by next-gen filesystems such as btrfs or zfs, ext4’s metadata checksumming is a pretty weak feature. But it’s much better than nothing.

Although it sounds like a no-brainer — yes, checksum ALL THE THINGS! — there are some significant challenges to bolting checksums into a filesystem after the fact; see the design document for the gritty details.

First-class quota support

Wait, quotas?! We’ve had those since the ext2 days! Yes, but they’ve always been an afterthought, and they’ve always kinda sucked. It’s probably not worth going into the hairy details here, but the design document lays out the ways quotas will be moved from userspace into the kernel and more correctly and performantly enforced.

Large allocation blocks

As time goes by, those pesky storage systems keep getting bigger and bigger. With some solid-state drives already using 8K hardware blocksizes, ext4’s current limitation to 4K blocks gets more and more limiting. Larger storage blocks can decrease fragmentation and increase performance significantly, at the cost of increased “slack” space (the space left over when you only need part of a block to store a file or the last piece of a file).

You can view the hairy details in the design document.

Practical limitations of ext4

Ext4 is a robust, stable filesystem, and it’s what most people should probably be using as a root filesystem in 2018. But it can’t handle everything. Let’s talk briefly about some of the things you shouldn’t expect from ext4 — now or probably in the future.

Although ext4 can address up to 1 EiB (equivalent to 1,048,576 TiB) of data, you really, really shouldn’t try to do so. There are problems of scale above and beyond merely being able to remember the addresses of a lot more blocks, and ext4 does not now (and likely will not ever) scale very well beyond 50-100 TiB of data.

Ext4 also doesn’t do enough to guarantee the integrity of your data. As big an advancement as journaling was back in the ext3 days, it does not cover a lot of the common causes of data corruption. If data is corrupted while already on disk — by faulty hardware, impact of cosmic rays (yes, really), or simple degradation of data over time — ext4 has no way of either detecting or repairing such corruption.

Building on the last two items, ext4 is only a pure filesystem, and not a storage volume manager. This means that even if you’ve got multiple disks — and therefore parity or redundancy, which you could theoretically recover corrupt data from — ext4 has no way of knowing that or using it to your benefit. While it’s theoretically possible to separate a filesystem and storage volume management system in discrete layers without losing automatic corruption detection and repair features, that isn’t how current storage systems are designed, and it would present significant challenges to new designs.

Alternate filesystems

Before we get started, a word of warning: Be very careful with any alternate filesystem which isn’t built into and directly supported as a part of your distribution’s mainline kernel!

Even if a filesystem is safe, using it as the root filesystem can be absolutely terrifying if something hiccups during a kernel upgrade. If you aren’t extremely comfortable with the idea of booting from alternate media and poking manually and patiently at kernel modules, grub configs, and DKMS from a chroot… don’t go off the reservation with the root filesystem on a system that matters to you.

There may well be good reasons to use a filesystem your distro doesn’t directly support — but if you do, I strongly recommend you mount it after the system is up and usable. (For example, you might have an ext4 root filesystem, but store most of your data on a zfs or btrfs pool.)

XFS

XFS is about as mainline as a non-ext filesystem gets under Linux. It’s a 64-bit, journaling filesystem that has been built into the Linux kernel since 2001 and offers high performance for large filesystems and high degrees of concurrency (i.e., a really large number of processes all writing to the filesystem at once).

XFS became the default filesystem for Red Hat Enterprise Linux, as of RHEL 7. It still has a few disadvantages for home or small business users — most notably, it’s a real pain to resize an existing XFS filesystem, to the point it usually makes more sense to create another one and copy your data over.

While XFS is stable and performant, there’s not enough of a concrete end-use difference between it and ext4 to recommend its use anywhere that it isn’t the default (e.g., RHEL7) unless it addresses a specific problem you’re having with ext4, such as >50 TiB capacity filesystems.

XFS is not in any way a “next-generation” filesystem in the ways that ZFS, btrfs, or even WAFL (a proprietary SAN filesystem) are. Like ext4, it should most likely be considered a stopgap along the way towards something better.

ZFS

ZFS was developed by Sun Microsystems and named after the zettabyte — equivalent to 1 trillion gigabytes — as it could theoretically address storage systems that large.

A true next-generation filesystem, ZFS offers volume management (the ability to address multiple individual storage devices in a single filesystem), block-level cryptographic checksumming (allowing detection of data corruption with an extremely high accuracy rate), automatic corruption repair (where redundant or parity storage is available), rapid asynchronous incremental replication, inline compression, and more. A lot more.
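
To make the checksumming and self-repair idea concrete, here is a minimal Python sketch. It is a toy model, not how ZFS is actually implemented: the `MirroredStore` class, its two in-memory "disks", and the choice of SHA-256 are all assumptions made for illustration. It only shows the principle that a checksum stored apart from the data lets a read detect a silently corrupted copy and repair it from a redundant one.

```python
import hashlib


class MirroredStore:
    """Toy two-way mirror that keeps one SHA-256 checksum per block.

    Hypothetical illustration only; real filesystems store checksums
    in on-disk metadata and handle far more failure modes.
    """

    def __init__(self):
        self.copies = [{}, {}]  # two simulated disks: block number -> bytes
        self.checksums = {}     # block number -> hex digest, kept apart from the data

    def write(self, block_no, data):
        # Record the checksum, then write the block to every copy.
        self.checksums[block_no] = hashlib.sha256(data).hexdigest()
        for disk in self.copies:
            disk[block_no] = data

    def read(self, block_no):
        expected = self.checksums[block_no]
        good = None
        bad_disks = []
        # Verify every copy against the stored checksum.
        for disk in self.copies:
            data = disk[block_no]
            if hashlib.sha256(data).hexdigest() == expected:
                if good is None:
                    good = data
            else:
                bad_disks.append(disk)
        if good is None:
            raise OSError(f"block {block_no}: every copy fails its checksum")
        # Self-heal: overwrite corrupted copies with the verified data.
        for disk in bad_disks:
            disk[block_no] = good
        return good


store = MirroredStore()
store.write(0, b"important data")
store.copies[0][0] = b"important dat4"  # simulate silent corruption on one disk

print(store.read(0))       # b'important data'
print(store.copies[0][0])  # b'important data' (repaired during the read)
```

Without the redundant copy, the same check could still detect the corruption but would have nothing to repair it from, which is why the repair feature depends on redundant or parity storage.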

The biggest problem with ZFS, from a Linux user’s perspective, is the licensing. ZFS is licensed under the CDDL, a semi-permissive license that conflicts with the GPL. There is a lot of controversy over the implications of using ZFS with the Linux kernel, with opinions ranging from “it’s a GPL violation” to “it’s a CDDL violation” to “it’s perfectly fine, it just hasn’t been tested in court.” Most notably, Canonical has included ZFS code inline in its default kernels since 2016 without legal challenge so far.

At this time, even as a very avid ZFS user myself, I would not recommend ZFS as a root Linux filesystem. If you want to leverage the benefits of ZFS on Linux, set up a small root filesystem on ext4, then put ZFS on your remaining storage, and put data, applications, whatever you like on it. Keep root on ext4, though, until your distribution explicitly supports a ZFS root.
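
If you want to confirm that a machine follows this layout, one option is to look at the mount table. The sketch below is a small Python helper that parses text in the `/proc/mounts` format (device, mount point, filesystem type, options, then two numeric fields) and maps each mount point to its filesystem type. The function name `filesystem_types` and the sample device and pool names are hypothetical, chosen only for illustration.

```python
def filesystem_types(mounts_text):
    """Map each mount point to its filesystem type.

    Expects text in the /proc/mounts format. If a mount point appears
    more than once, the later entry wins.
    """
    types = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type = fields[:3]
        types[mount_point] = fs_type
    return types


# Hypothetical sample: an ext4 root plus a ZFS dataset for data.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
tank/data /tank/data zfs rw,xattr,noacl 0 0
"""

types = filesystem_types(SAMPLE)
print(types["/"])           # ext4
print(types["/tank/data"])  # zfs
```

On a running Linux system, you could pass the contents of `/proc/mounts` to the same function instead of the sample string.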

btrfs

Btrfs — short for B-Tree Filesystem, and usually pronounced “butter” — was announced by Chris Mason in 2007 during his tenure at Oracle. Btrfs aims at most of the same goals as ZFS, offering multiple device management, per-block checksumming, asynchronous replication, inline compression, and more.

As of 2018, btrfs is reasonably stable and usable as a standard single-disk filesystem but should probably not be relied on as a volume manager. It suffers from significant performance problems compared to ext4, XFS, or ZFS in many common use cases, and its next-generation features — replication, multiple-disk topologies, and snapshot management — can be pretty buggy, with results ranging from catastrophically reduced performance to actual data loss.

The ongoing status of btrfs is controversial; SUSE Linux Enterprise adopted it as its default filesystem in 2015, whereas Red Hat announced in 2017 that it would no longer support btrfs beginning with RHEL 7.4. It is probably worth noting that production, supported deployments of btrfs use it as a single-disk filesystem, not as a multiple-disk volume manager a la ZFS. Even Synology, which uses btrfs on its storage appliances, layers it atop conventional Linux kernel RAID (mdraid) to manage the disks.

 

via: https://opensource.com/article/18/4/ext4-filesystem

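Author: Jim Salter; Translator: HardworkFish; Proofreader: wxypityonline

This article was originally compiled and translated by LCTT and is proudly presented by Linux.cn (Linux中国).

Reposted from:

  • Original: https://opensource.com/article/18/4/ext4-filesystem
  • Chinese translation adapted from https://linux.cn/article-10000-1.html, with substantial modifications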
