The Google File System

A classic paper in the field of distributed systems, read and translated with the help of ChatGPT and Claude, focusing mainly on the first five chapters. View the original text here

Overview#

GFS is a scalable distributed file system for large distributed data-intensive applications.

Application Scenarios and Design Points of GFS#

  1. GFS runs on large numbers of inexpensive commodity machines, where failures are very common: application bugs, operating system bugs, human errors, and disk/memory/connector/network/power-supply failures, etc. Therefore, the system needs constant monitoring, error detection, fault tolerance, and automatic recovery, treating component failures as the norm rather than the exception.
  2. Files stored in GFS are usually large, typically 100MB or larger, with multi-GB files being common. The system therefore needs to manage large files efficiently; small files are supported but do not need to be optimized for.
  3. File modifications are more often append writes rather than overwrites, and random writes are very rare. File reads are generally large-scale streaming reads (hundreds of KB to several MB) or small-scale (several KB) random reads.
  4. The load mainly comes from: large-scale streaming reads (hundreds of KB to several MB), small-scale (several KB) random reads, and large-scale serialized append writes.
  5. GFS files are often used as production-consumption queues or for multi-way merging, so the system must efficiently implement parallel append semantics for multiple clients to the same file.
  6. High sustained bandwidth is more important than low latency.

Interface#

The GFS interface provides the following operations: create, delete, open, close, read, write, snapshot, record append. Among them, snapshot creates a low-cost copy of a file or directory tree, and record append is an atomic append that allows multiple clients to append to the same file in parallel while guaranteeing the atomicity of each client's append; it is useful for multi-way merging and producer-consumer queues.

Architecture#

GFS Architecture

A GFS cluster consists of one master node and multiple chunkservers. All files are divided into fixed-size chunks, and each chunk is assigned a globally unique immutable 64-bit chunk handle by the master upon creation. Chunks are stored on local disks by chunkservers and are read and written using the chunk handle and byte range. To ensure reliability, each chunk is replicated on multiple chunkservers, with a default of three replicas.
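
As a rough illustration (not the actual GFS implementation; the names and fields below are invented), the per-file and per-chunk metadata implied by this architecture might look like this:

```python
# Minimal sketch of the metadata the master keeps per file and per chunk.
# Names and field choices are illustrative, not the real GFS implementation.
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size
DEFAULT_REPLICATION = 3        # default number of replicas per chunk

@dataclass
class ChunkInfo:
    handle: int                                    # globally unique, immutable 64-bit chunk handle
    version: int                                   # chunk version number (see "Checking Outdated Replicas")
    locations: list = field(default_factory=list)  # chunkservers holding a replica

@dataclass
class FileMetadata:
    path: str                                      # full pathname, e.g. "/home/user/data"
    chunks: list = field(default_factory=list)     # ordered list of ChunkInfo, one per chunk

    def chunk_for_offset(self, offset: int) -> ChunkInfo:
        """Map a byte offset in the file to the chunk that contains it."""
        return self.chunks[offset // CHUNK_SIZE]
```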

The master node stores all file metadata, including namespace permission control information, the mapping of files to chunks, and the location information of chunks. The master node also controls system-level activities such as chunk lease management, garbage collection of orphaned chunks, and migration of chunks between chunkservers. The master node periodically communicates with chunkservers via HeartBeat messages to issue commands and collect status information from chunkservers.

Clients interact with the master to obtain and modify metadata. All data-bearing communications connect directly to chunkservers. Neither clients nor chunkservers cache files (though clients do cache metadata) for the following reasons:

  • The benefits of client caching are minimal. Most requests will stream large files or have a large working set that cannot be cached. Not caching simplifies the client implementation and eliminates the need for the entire system to consider cache invalidation (consistency) issues.
  • Chunkservers also do not need an additional layer of caching because chunks are stored as regular files, and Linux's buffer cache can keep frequently accessed files in memory.

Interaction Process#

  1. The client converts the filename and offset into a chunk index based on a fixed chunk size.
  2. The client sends the filename and chunk index to the master.
  3. The master returns the chunk handles and location information for multiple replicas of the corresponding chunk.
  4. The client caches the information returned by the server using the filename plus the chunk index as the key.
  5. The client sends a data request, specifying the chunk handle and byte range, to the chunkserver holding one of the replicas (generally the nearest one).
  6. Subsequent read requests for the same chunk use the local cache, without needing to interact with the master, until the cache information becomes invalid or the file is reopened.
  7. Typically, the client will include multiple chunks in a single request, and the master will also return multiple chunks that may be needed later to reduce the overhead of subsequent requests.
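
A minimal sketch of the read path above, with a hypothetical master.lookup() RPC and chunkserver read call standing in for the real protocol:

```python
# Sketch of the client-side read path described above. master.lookup() and the
# chunkserver's read_chunk() are hypothetical stand-ins for the real GFS protocol.
CHUNK_SIZE = 64 * 1024 * 1024

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (filename, chunk_index) -> (chunk_handle, replica_locations)

    def read(self, filename, offset, length):
        # 1. Translate (filename, offset) into a chunk index.
        chunk_index = offset // CHUNK_SIZE
        key = (filename, chunk_index)
        # 2-4. Ask the master only on a cache miss, then cache its answer.
        if key not in self.cache:
            handle, locations = self.master.lookup(filename, chunk_index)
            self.cache[key] = (handle, locations)
        handle, locations = self.cache[key]
        # 5. Send the data request to one replica (e.g. the closest one),
        #    identified by chunk handle and byte range within the chunk.
        replica = locations[0]
        return replica.read_chunk(handle, offset % CHUNK_SIZE, length)
```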

Chunk Size#

The chosen chunk size for GFS is 64MB, which is larger than the typical disk block size of operating systems. Each chunk replica is stored on chunkservers as regular Linux files, with space allocation being lazy to avoid space waste due to internal fragmentation. The advantages of choosing a larger chunk size include:

  • Fewer chunks per file, reducing the number of requests for clients to interact with the master to obtain chunkserver location information.
  • A client can perform multiple operations on the same chunk while maintaining a single persistent TCP connection, reducing the overhead of establishing many connections.
  • The total number of chunks in the system is reduced, thereby decreasing the size of metadata that the master needs to store, allowing metadata to remain in memory.

Typically, small files contain only a few chunks, or even just one. If many clients access the same small file, the few chunkservers storing it may become hotspots. In practice, hotspots have not been a major issue because applications mostly read large multi-chunk files sequentially. However, when GFS was first used by a batch-queue system, a hotspot problem did arise: an executable was written to GFS as a single-chunk file and then accessed by hundreds of machines at once, overloading the few chunkservers storing it. The issue was addressed in two ways: 1. increasing the replication factor for such files; 2. having the application stagger its start times so the machines did not read the chunk simultaneously. Another potentially effective fix is to allow clients to read data from other clients in such scenarios.

Metadata#

The master stores three types of metadata: the namespace of files and chunks, the mapping of files to chunks, and the location information of each chunk replica. All metadata remains in the master’s memory, and the master persists the change logs of the first two types of information to local and remote disks. Through log replay, the master’s state can be easily updated, ensuring reliability and consistency in state recovery after crashes.

The master does not persistently store the location information of chunk replicas but queries chunkservers at startup or when a new chunkserver joins the cluster.

Memory State#

Since metadata is stored in memory, operations on the master are generally fast, and it can efficiently traverse the complete state periodically in the background. Periodic scanning is used for: implementing garbage collection, recreating replicas for chunks on failed chunkservers, and executing chunk migrations to ensure load balancing between chunkservers and disk space.

The issue with keeping everything in memory is that the total number of chunks, and hence the capacity of the whole system, is limited by the memory of the master machine. In practice this is rarely a problem: the master maintains less than 64 bytes of metadata for each 64MB chunk, and most chunks are full because most files span many chunks, with only the last one partially filled, so there are few fragmented chunks. Similarly, the file namespace metadata is generally less than 64 bytes per file, since filenames are stored compactly using prefix compression. Finally, if a larger file system needs to be supported, it is simply a matter of adding memory to the master machine.
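
A quick back-of-the-envelope calculation based on the figures above (at most about 64 bytes of metadata per 64MB chunk) shows why master memory is rarely the bottleneck:

```python
# Rough estimate of master memory needed for chunk metadata, using the
# figures from the text: at most ~64 bytes of metadata per 64 MB chunk.
CHUNK_SIZE = 64 * 1024 * 1024          # 64 MB
METADATA_PER_CHUNK = 64                # bytes (upper bound from the text)

stored_bytes = 1 << 50                 # e.g. 1 PB of file data
num_chunks = stored_bytes // CHUNK_SIZE
metadata_bytes = num_chunks * METADATA_PER_CHUNK

print(f"{num_chunks:,} chunks -> ~{metadata_bytes / (1 << 30):.1f} GiB of chunk metadata")
# 16,777,216 chunks -> ~1.0 GiB of chunk metadata
```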

Chunk Location#

The master does not maintain persistent records of which chunkserver has which chunk replicas; it only polls all chunkservers at startup and can ensure that this information is up-to-date because the master controls the arrangement of all chunk locations and monitors the status of all chunkservers through HeartBeat.

Querying at startup eliminates many synchronization issues between the master and chunkservers, such as chunkservers joining or leaving the cluster, renaming, rebooting, crashing, etc. In a relatively large cluster, these events occur frequently. On the other hand, many errors on chunkservers may cause chunks to disappear (e.g., disk failures rendering them unavailable) or renaming operations, and only the chunkserver itself knows whether it truly stores a valid chunk, so it makes little sense to store this information on the master.

Operation Logs#

Operation logs contain the historical record of critical metadata changes and are the only persistent record of metadata; they also provide a logical timeline that defines the order of all concurrent operations. Files and chunks, as well as their versions, are identified by the logical times at which they were created. Because operation logs are so important, they must be stored reliably, and changes are not made visible to clients until the corresponding log records have been persisted. Operation logs are replicated to remote machines, and a change is acknowledged to the client only after its log records have been flushed to both local and remote disks.

The master recovers by replaying operation logs. To minimize the master’s startup time, the logs need to be as small as possible. When the log volume reaches a certain size, the master creates a checkpoint of the current complete state, and upon the next startup, it only needs to replay the logs from the last checkpoint onward. The checkpoint itself is a compact B-tree structure that can be directly mapped to memory for namespace queries without additional parsing.

Creating a checkpoint takes some time, but the master's internal state is structured so that a new checkpoint can be created without blocking in-flight mutations: the master switches to a new log file and builds the checkpoint in a separate thread, and the new checkpoint covers all changes made before the switch. A cluster with millions of files can create a checkpoint within about a minute, and creation is only considered successful once the checkpoint has been written to both local and remote disks. On recovery, the master only needs the latest complete and valid checkpoint plus the operation logs after it; earlier checkpoints and log files can be deleted. Failures during checkpoint creation do not affect correctness, because recovery detects and skips incomplete checkpoints.
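
A minimal sketch of the recovery procedure described above, assuming invented checkpoint and log-record formats:

```python
# Sketch of master recovery as described above: load the newest complete
# checkpoint, then replay only the operation-log records written after it.
# The Checkpoint/LogRecord types and their fields are invented for illustration.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    last_log_seq: int     # sequence number of the last log record it covers
    complete: bool        # checkpoints interrupted mid-write are skipped
    namespace: dict       # a compact B-tree in the real system; a dict here

@dataclass
class LogRecord:
    seq: int
    op: str               # e.g. "create", "delete"
    path: str

def recover(checkpoints, log):
    state, start_seq = {}, 0
    for cp in reversed(checkpoints):          # checkpoints ordered oldest -> newest
        if cp.complete:                       # take the newest complete one
            state, start_seq = dict(cp.namespace), cp.last_log_seq
            break
    for rec in log:                           # replay records after the checkpoint, in order
        if rec.seq <= start_seq:
            continue
        if rec.op == "create":
            state[rec.path] = []
        elif rec.op == "delete":
            state.pop(rec.path, None)
    return state
```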

Consistency Model#

GFS uses a relatively relaxed consistency model that still allows a simple and efficient implementation. GFS guarantees that changes (mutations) to the namespace, such as file creation, are atomic: namespace locking on the master guarantees atomicity and correctness, and the master's operation log defines a global order for these changes.

The state of a file region (referring to the byte range of a file on the storage device) after data changes depends on the type of data change: success/failure, and whether it is a concurrent change.

If all clients, regardless of which replica they read from, always see the same data, then that file region is consistent. If a region is consistent after a mutation and all clients can see what the mutation wrote in its entirety, then the region is defined (the term is hard to translate; it can be understood as the client being able to determine the contents of that file region).

  • Write failure: The result is certainly inconsistent.
  • Non-concurrent (sequential) write success: The result is definitely defined; the content of the corresponding file region is what the client wrote.
  • Concurrent write success: The file region is consistent, but the content written to the corresponding file region may be a mix of multiple write segments, so it is undefined.
  • Sequential/concurrent append success: Since appending is atomic, the content of the file in the successfully appended region is determined, i.e., defined, but prior to this, there may have been successful or failed writes and padding data, etc.

Data changes can be writes or appends: writes require the client to specify an offset, while appends write at the position the client believes is the end of the file. Append operations will be executed atomically at least once, even in the presence of concurrent changes, but the final offset is chosen by GFS, and the successfully written offset will be returned to the client and marked as the starting point of the defined file region containing this append record (there may have been successful or failed writes before).

GFS may insert padding or duplicate append records, and this area will be considered inconsistent. After a series of changes, the final file can be guaranteed to be defined within a certain range and contain the data written in the last change.

GFS achieves this by applying mutations to all replicas of a chunk in the same order and by using chunk version numbers to detect stale replicas, which may have missed mutations because their chunkserver was offline. Stale replicas do not participate in subsequent mutations and are never returned by the master to clients requesting chunk locations; they are garbage collected as soon as possible. Because clients cache chunk location information, a client may read from a stale replica before its cache is refreshed; this window is bounded by the cache entry's timeout and the next open of the file. Since most files are append-only, a stale replica usually just returns a premature end of chunk rather than outdated data. A reader that needs the latest data will fail, retry, ask the master for chunk information again, and immediately receive the latest chunk locations.

Long after a mutation has succeeded, component failures can still corrupt or destroy data. GFS identifies failed chunkservers through regular handshakes between the master and all chunkservers, and detects data corruption via checksums. Once a problem is detected, the data is restored from valid replicas as soon as possible. A chunk is irreversibly lost only if all of its replicas are lost before GFS can react, which typically takes a few minutes; even then, clients receive an error saying the data is unavailable rather than corrupted data.

To live with this relaxed consistency, applications should rely on append writes rather than random overwrites, periodically checkpoint how much has been successfully written, and write self-validating, self-identifying records (application-level checksums). Readers then only process data up to the last checkpoint and verify checksums as they go, which handles both consistency and concurrency issues well. After a write error, the writer can resume incrementally from its last recorded checkpoint. Readers can discard padding and fragmentary records using the checksums, and duplicate records can be filtered out using a unique ID attached to each record.
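
One way an application-level reader might implement these suggestions; the record framing, checksum, and ID scheme below are assumptions, not part of GFS itself:

```python
# Sketch of an application-level reader that tolerates GFS's relaxed
# consistency: it drops padding/corrupt regions via a per-record checksum and
# skips duplicates via a per-record unique ID. The framing is assumed.
import struct
import zlib

def parse_records(data: bytes):
    """Assumed framing: 4-byte length | 8-byte record ID | payload | 4-byte CRC32 of (ID + payload)."""
    seen_ids, pos = set(), 0
    while pos + 4 <= len(data):
        (length,) = struct.unpack_from(">I", data, pos)
        end = pos + 4 + length
        if length < 12 or end > len(data):
            pos += 1                          # padding or garbage: resynchronize byte by byte
            continue
        body = data[pos + 4:end - 4]          # record ID + payload
        (crc,) = struct.unpack_from(">I", data, end - 4)
        if zlib.crc32(body) != crc:
            pos += 1                          # corrupt or padded region: skip it
            continue
        (record_id,) = struct.unpack_from(">Q", body, 0)
        if record_id not in seen_ids:         # drop duplicates left by retried appends
            seen_ids.add(record_id)
            yield body[8:]
        pos = end
```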

System Interaction#

Leases and Change Order#

Changes are operations that alter data or metadata, and each change is applied to all replicas of a chunk. GFS uses a lease mechanism to maintain consistency of changes among all replicas. Before executing a change, GFS selects one replica among all replicas of the chunk to grant a lease, known as the primary. The primary is responsible for determining the serial order of change operations executed on this chunk, and all other replicas follow this order when executing chunk change operations.

Thus, the global mutation order is defined by the order in which the master grants leases and, within a lease, by the serial numbers assigned by the primary. The lease mechanism minimizes management overhead on the master. A lease has an initial timeout of 60 seconds, but as long as the chunk is being mutated, the primary can request and typically receive extensions; these extension requests and grants are piggybacked on the HeartBeat messages exchanged between the master and the chunkservers. In some situations the master may revoke a lease, for example to disable mutations on a file that is being renamed. If communication between the master and the primary is lost, a new lease can safely be granted to another replica after the old lease expires.
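
A minimal sketch of the master-side lease bookkeeping described above; the data structures and timing details are illustrative only:

```python
# Sketch of lease management on the master: grant a lease to one replica,
# extend it while mutations continue, revoke it when needed, and let it
# lapse via timeout if the primary becomes unreachable.
import time

LEASE_TIMEOUT = 60.0   # initial lease duration in seconds

class LeaseTable:
    def __init__(self):
        self.leases = {}   # chunk_handle -> (primary_address, expiry_time)

    def grant_or_get(self, chunk_handle, replicas):
        primary, expiry = self.leases.get(chunk_handle, (None, 0.0))
        if primary is None or time.time() >= expiry:
            primary = replicas[0]                       # pick one replica as primary
            expiry = time.time() + LEASE_TIMEOUT
            self.leases[chunk_handle] = (primary, expiry)
        return primary

    def extend(self, chunk_handle):
        """Piggybacked on HeartBeat while mutations on the chunk are in progress."""
        primary, _ = self.leases[chunk_handle]
        self.leases[chunk_handle] = (primary, time.time() + LEASE_TIMEOUT)

    def revoke(self, chunk_handle):
        """e.g. to stop mutations on a file that is being snapshotted or renamed."""
        self.leases.pop(chunk_handle, None)
```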

Data Flow and Control Flow

The process for a client to execute a change is as follows:

  1. The client requests the chunkserver holding the lease for the specified chunk and the location information of other replicas from the master. If there is currently no lease, the master selects a replica to grant the lease.
  2. The master returns the location information of all replicas and identifies which one is the primary. The client then caches this information for future use. The client only requests the master again if it cannot communicate with the primary or if the primary no longer holds the lease.
  3. The client pushes data to all replicas in any order; each chunkserver stores the data in its internal LRU buffer cache until the data is used or expires. By decoupling data flow and control flow, data flow can be scheduled based on network topology to enhance performance.
  4. Once all replicas confirm receipt of the data, the client sends a write request to the primary, identifying the data previously pushed to all replicas. The primary assigns a continuous sequence number to all received changes and then executes all changes to its local state in order of the sequence number.
  5. The primary forwards the write request to all secondaries, which then execute all changes in the order of the sequence numbers assigned by the primary.
  6. The secondaries all reply to the primary, indicating that they have completed the operation.
  7. The primary replies to the client. If any secondary failed (i.e., not all replicas succeeded), the primary returns the corresponding error to the client, and the modified file region is left in an inconsistent state. The client then retries the failed mutation, repeating steps 3 to 7.

If a write is very large or spans multiple chunks, the GFS client code splits it into multiple write operations, each following the process above. However, if multiple clients write concurrently, their operations may be interleaved with or overwritten by one another, so the shared file region may end up containing fragments from different clients. Because all replicas execute the operations in the same order, the region is consistent, but it is undefined, since no single client's write is seen in its entirety.
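
A sketch of the primary's part of the write path above: assigning serial numbers, applying locally, and forwarding the order to the secondaries. The replica interface here is invented:

```python
# Sketch of the primary's role in the write path: it assigns a serial number
# to each buffered mutation, applies it locally in that order, and forwards
# the same order to the secondaries. apply() is a hypothetical replica call.
class Primary:
    def __init__(self, local_replica, secondaries):
        self.local = local_replica
        self.secondaries = secondaries
        self.next_serial = 0

    def handle_write(self, data_id):
        # The data identified by data_id was already pushed to every replica (step 3).
        serial = self.next_serial
        self.next_serial += 1
        self.local.apply(data_id, serial)              # apply in serial order locally
        failures = []
        for s in self.secondaries:                     # step 5: forward the ordered mutation
            if not s.apply(data_id, serial):
                failures.append(s)
        # Step 7: report failures so the client retries; the region on the
        # failed replicas is inconsistent until a retry succeeds.
        return {"serial": serial, "failed_replicas": failures}
```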

Data Flow#

Decoupling control flow and data flow is intended to make network transmission more efficient. Control flow goes from the client to the primary and then to other secondaries, while data flow is pushed linearly along a carefully selected chain of chunkservers in a pipelined manner, aiming to maximize the bandwidth utilization of each machine and minimize the latency of pushing all data. The total outbound bandwidth of each machine is utilized as much as possible to transmit data quickly, rather than splitting data for multiple receivers.

To avoid network bottlenecks and high-latency links, each machine will choose the nearest machine in the network topology that has not yet received data to forward data. The internal network topology of GFS is very simple, allowing for precise distance estimation through IP addresses. GFS reduces latency by transmitting data in a pipelined manner; once a chunkserver begins receiving data, it will immediately start forwarding.
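
A sketch of how the forwarding chain might be chosen, assuming a distance function estimated from the network topology (e.g., derived from IP addresses):

```python
# Sketch of the linear, pipelined push described above: each machine forwards
# to the closest machine that has not yet received the data. The distance()
# function (topology/IP-based estimate) is an assumption.
def build_push_chain(client_addr, replicas, distance):
    """Order replicas into a forwarding chain, greedily picking the nearest next hop."""
    chain, remaining, current = [], set(replicas), client_addr
    while remaining:
        nearest = min(remaining, key=lambda r: distance(current, r))
        chain.append(nearest)
        remaining.remove(nearest)
        current = nearest
    # client -> chain[0] -> chain[1] -> ..., each hop streaming data onward
    # as soon as it starts receiving (pipelining).
    return chain
```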

Atomic Record Append#

GFS provides an atomic append operation called record append. Unlike general writes, record append does not specify a write position. GFS guarantees that this data will be atomically appended to the end of the file at least once, and then returns the offset to the client. It is somewhat similar to the O_APPEND mode in Unix, where concurrent writes to a file do not lead to data races. Achieving this effect with ordinary writes would require distributed locks.

Record append is a type of change operation that follows the same control flow but has slight differences on the primary. The client pushes the append data to all replicas of the last chunk at the end of the file and then sends the request to the primary. The primary then checks whether appending the record to the current chunk will exceed the maximum size (64MB).

  • If it exceeds, the primary will first fill the current chunk to its maximum size, instruct all secondaries to perform the same operation, and then return to the client that this operation should be retried on the next chunk (the size of record append is limited to 1/4 of the chunk size to avoid extreme fragmentation).
  • If it does not exceed the chunk size, the primary appends the record to its local replica and sends the offset to all secondaries for data writing, ultimately returning a response to the client.
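
A sketch of the primary-side decision for the two cases above; the chunk and replica interfaces are illustrative:

```python
# Sketch of the primary's record-append decision: pad and ask the client to
# retry on the next chunk if the record would overflow, otherwise append at
# the same offset on all replicas. Sizes and method names are illustrative.
CHUNK_SIZE = 64 * 1024 * 1024
MAX_APPEND = CHUNK_SIZE // 4          # record append limited to 1/4 of a chunk

def record_append(primary_chunk, secondaries, record):
    assert len(record) <= MAX_APPEND
    if primary_chunk.used + len(record) > CHUNK_SIZE:
        # Case 1: would overflow -> pad this chunk on every replica and ask
        # the client to retry on the next chunk.
        padding = b"\0" * (CHUNK_SIZE - primary_chunk.used)
        primary_chunk.append(padding)
        for s in secondaries:
            s.append(padding)
        return ("retry_on_next_chunk", None)
    # Case 2: it fits -> append locally, then tell the secondaries to write
    # the record at exactly the same offset.
    offset = primary_chunk.used
    primary_chunk.append(record)
    for s in secondaries:
        s.write_at(offset, record)
    return ("ok", offset)
```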

If the record append fails on any replica, the client will retry this operation. Ultimately, different replicas of the same chunk may contain different data, including all or part of the same record duplicated.

GFS does not guarantee that all replicas are byte-for-byte identical; it only guarantees that the data is written atomically at least once. For a record append to report success, the data must have been written at the same offset on all replicas, after which every replica is at least as long as the end of the record, so any later append is assigned a higher offset. According to the earlier definitions, the region occupied by a successfully appended record is defined (and hence consistent).

Snapshot#

The snapshot operation creates a copy of a file or a directory tree almost instantaneously, while minimizing interruptions to ongoing mutations.

GFS uses copy-on-write to implement snapshots. When the master receives a snapshot request, it first revokes any outstanding leases on the chunks of the files involved, ensuring that any subsequent write to those chunks must first contact the master to find the lease holder. This gives the master an opportunity to create a copy of the chunk first.

After the leases are revoked/expired, the master writes the snapshot operation log to local disk, modifies the memory state, and copies the metadata of the source file/directory tree.

The newly created snapshot file still points to the chunks of the source file. After the snapshot completes, when a client wants to write to one of those chunks, it asks the master for the current lease holder. The master notices that the chunk's reference count is greater than one and delays its reply: it first picks a new chunk handle and asks every chunkserver that holds a replica of the original chunk to create the new chunk. Each chunkserver copies the chunk locally on disk, without going over the network (in GFS's environment, disks are roughly three times as fast as the network links). The master then grants a lease on the new chunk and returns it to the client, and writes to the copied chunk proceed normally from then on.
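
A sketch of the copy-on-write bookkeeping on the master described above, with invented structures and chunkserver calls:

```python
# Sketch of copy-on-write after a snapshot: the snapshot only bumps reference
# counts; the first write to a shared chunk triggers a local clone on every
# chunkserver holding it. clone_local() is a hypothetical chunkserver call.
import itertools

class Master:
    def __init__(self):
        self.refcount = {}       # chunk_handle -> number of files referencing it
        self.locations = {}      # chunk_handle -> list of chunkservers
        self._handles = itertools.count(1)

    def snapshot(self, src_chunks):
        # Metadata-only copy: the new file points at the same chunks.
        for c in src_chunks:
            self.refcount[c] = self.refcount.get(c, 1) + 1
        return list(src_chunks)

    def prepare_write(self, chunk):
        if self.refcount.get(chunk, 1) > 1:
            new_chunk = next(self._handles)
            for server in self.locations[chunk]:
                server.clone_local(chunk, new_chunk)   # local disk copy, no network transfer
            self.locations[new_chunk] = list(self.locations[chunk])
            self.refcount[chunk] -= 1
            self.refcount[new_chunk] = 1
            chunk = new_chunk
        return chunk   # the caller then grants a lease on this chunk as usual
```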

Master Operations#

The master performs all namespace operations, manages all chunk replicas in the system, decides where to place new chunks and their replicas, coordinates various system-wide activities, keeps chunks fully replicated, balances load across chunkservers, and reclaims unused storage.

Namespace Management and Locks#

Many operations on the master are time-consuming. To avoid the impact of a single time-consuming operation on other operations, GFS allows multiple operations to be executed simultaneously, ensuring correct serialization by locking a certain region of the namespace.

Unlike traditional file systems, directories in GFS have no separate data structure (there is no per-directory structure listing the files it contains), and file/directory aliases (soft/hard links) are not supported. Logically, the namespace is just a lookup table mapping full pathnames to metadata, which can be represented efficiently in memory using prefix compression.

Each node in the namespace (the absolute path of a filename or directory name) has an associated read-write lock. The master needs to acquire a certain set of these locks before executing each operation. For example, if an operation is associated with /d1/d2/.../dn/leaf, it needs to acquire read locks for /d1, /d1/d2, ..., /d1/d2/.../dn and a read or write lock for /d1/d2/.../dn/leaf (depending on the needs of the corresponding operation). Here, the leaf node on the path may be a file or a directory in different operations.

Consider a scenario: while /home/user is being snapshotted to /save/user, /home/user/foo is created. The snapshot operation acquires read locks on /home and /save and write locks on /home/user and /save/user, while creating /home/user/foo requires read locks on /home and /home/user and a write lock on /home/user/foo. The two operations conflict on /home/user and are therefore serialized correctly. File creation does not need a write lock on /home/user, because a GFS "directory" holds no data that needs to be modified; a read lock on it is enough to prevent it from being deleted, renamed, or snapshotted.

This locking model supports concurrent change operations in the same directory well, such as concurrently creating multiple files in the same directory. Since there can be many nodes in the namespace, read-write lock objects are lazily created and immediately deleted once they are no longer in use.

To avoid deadlocks, all locks are requested in a consistent order: first sorted by hierarchy in the namespace, and then sorted lexicographically at the same level.
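
A minimal sketch of this locking scheme. For brevity it uses plain locks rather than read-write locks, but it shows the lazily created per-path locks and the consistent acquisition order:

```python
# Sketch of namespace locking: read locks on every ancestor path plus a lock
# on the leaf, always acquired in a consistent order (by depth, then
# lexicographically) to avoid deadlock. The real system uses read-write locks;
# plain locks keep this sketch short.
import threading
from collections import defaultdict

class NamespaceLocks:
    def __init__(self):
        self.locks = defaultdict(threading.Lock)   # lock objects created lazily per path

    @staticmethod
    def _paths_to_lock(path):
        parts = path.strip("/").split("/")
        ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
        return ancestors, "/" + "/".join(parts)    # (ancestor paths, leaf path)

    def acquire(self, path):
        ancestors, leaf = self._paths_to_lock(path)
        # Sort by depth, then lexicographically, so concurrent operations always
        # lock in the same global order.
        ordered = sorted(ancestors + [leaf], key=lambda p: (p.count("/"), p))
        for p in ordered:
            self.locks[p].acquire()
        return ordered

    def release(self, ordered):
        for p in reversed(ordered):
            self.locks[p].release()
```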

Replica Location#

A GFS cluster may have hundreds of chunkservers distributed across different racks, and these chunkservers may be accessed by hundreds of clients from the same or different racks. Communication between machines on different racks may need to pass through one or more switches. The input-output bandwidth of a rack may be smaller than the combined bandwidth of all machines in that rack.

The strategy for selecting replica locations serves two goals: maximizing data reliability and availability, and maximizing network bandwidth utilization.

Spreading replicas across machines only guards against disk or machine failures and only makes full use of each machine's network bandwidth. Replicas must also be spread across racks, so that some replicas of a chunk survive even if an entire rack becomes unavailable (due to a shared resource such as a switch or power supply failing), and so that reads of a chunk can exploit the aggregate bandwidth of multiple racks. On the other hand, write traffic must then flow between racks, which is a trade-off that is accepted.

Replica Creation, Re-replication, and Balancing#

Chunk replicas are created in three situations: chunk creation, re-replication, and rebalancing.

When the master creates a chunk, it selects the location for the replicas, mainly considering several factors:

  1. Store them on chunkservers with disk utilization below average, so that over time, the disk utilization of all chunkservers will be relatively balanced.
  2. Limit the number of recent creations on each chunkserver, because heavy write traffic usually follows soon after a chunk is created, and concentrating too many new chunks on a few servers would risk network congestion.
  3. Disperse chunks as much as possible across chunkservers on different racks.
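
A sketch of how these three factors might be combined when picking chunkservers for a new chunk; the scoring and the server fields are assumptions:

```python
# Sketch of the placement heuristics above: prefer chunkservers with
# below-average disk utilization and few recent creations, and spread the
# chosen replicas across racks. Fields and the ranking are illustrative.
def place_replicas(servers, num_replicas=3):
    """servers: objects with .disk_utilization, .recent_creations, .rack attributes."""
    avg_util = sum(s.disk_utilization for s in servers) / len(servers)
    # Lower sort key is better: below-average utilization, then few recent creations.
    ranked = sorted(servers,
                    key=lambda s: (s.disk_utilization - avg_util, s.recent_creations))
    chosen, racks = [], set()
    for s in ranked:                       # first pass: at most one replica per rack
        if s.rack not in racks:
            chosen.append(s)
            racks.add(s.rack)
        if len(chosen) == num_replicas:
            return chosen
    for s in ranked:                       # fall back if there are too few racks
        if s not in chosen:
            chosen.append(s)
        if len(chosen) == num_replicas:
            break
    return chosen
```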

When the number of available replicas for a chunk falls below the user-specified target, for example, due to chunkservers being unavailable or disk failures causing replica damage, or if the user increases the specified number of replicas, the master will quickly initiate re-replication.

Each chunk that needs to be re-replicated will be prioritized based on the following conditions:

  1. How far it is from the replication target; for example, a chunk that has lost two replicas will have a higher priority than one that has lost one.
  2. Prioritize replicating currently alive files over those that are about to be deleted.
  3. To minimize the impact of failures on running applications, the priority of chunks blocking client requests will be increased.

The master will select the highest-priority chunk and issue instructions to the selected chunkserver to copy data directly from the currently existing valid replicas. The location selection for the new replicas follows the same rules as when creating chunks.

To avoid overwhelming client traffic during replica copying, the master will limit the number of replications executed within the cluster and on each chunkserver. Additionally, each chunkserver will limit the bandwidth usage for replica copying by restricting requests sent to the source chunkserver.

The master will also periodically perform replica rebalancing. It will check the current distribution of replicas and balance the load by moving replicas to more suitable disk spaces. The selection strategy for the location of new replicas is the same as above. After creating new replicas, since the number of replicas has increased, the master needs to select currently existing replicas for deletion, generally preferring to choose chunkservers with below-average free space to balance disk space usage.

Garbage Collection#

When a file is deleted, GFS does not immediately reclaim the physical space it occupies but performs file and chunk-level reclamation during regular GC periods. This approach makes the entire system simpler and more reliable.

When an application deletes a file, the master immediately records a deletion log and renames it to a hidden file containing the deletion time. During the master’s regular scanning of the file system, it will remove hidden files that have existed for more than three days. Before removal, hidden files can still be accessed via their special filenames and can be restored by renaming them back to normal filenames.

Once a hidden file is removed from the namespace, its in-memory metadata is also erased, which severs its connection to its chunks. During a similar regular scan of the chunk namespace, the master identifies orphaned chunks (chunks not reachable from any file) and erases their metadata. In the HeartBeat messages exchanged between the master and chunkservers, each chunkserver reports a subset of the chunks it holds, and the master identifies in its reply the chunks that no longer exist in its metadata; the chunkserver is then free to delete its replicas of those chunks.
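
A sketch of this garbage-collection flow, with an invented hidden-name scheme and simplified metadata:

```python
# Sketch of GFS-style lazy garbage collection: delete renames the file to a
# hidden name carrying a timestamp; a periodic scan drops hidden files older
# than three days and recomputes the set of reachable chunks; HeartBeat replies
# tell each chunkserver which of its replicas are no longer known to the master.
import time

GRACE_PERIOD = 3 * 24 * 3600   # three days, configurable in the real system

class GarbageCollector:
    def __init__(self):
        self.files = {}          # path -> list of chunk handles
        self.chunk_refs = set()  # chunk handles reachable from some file

    def delete(self, path):
        hidden = f"{path}.__deleted__.{int(time.time())}"   # invented naming scheme
        self.files[hidden] = self.files.pop(path)           # still restorable by renaming back

    def scan(self):
        now = time.time()
        for path in list(self.files):
            if "__deleted__" in path and now - int(path.rsplit(".", 1)[1]) > GRACE_PERIOD:
                del self.files[path]                         # metadata gone; its chunks become orphans
        self.chunk_refs = {c for chunks in self.files.values() for c in chunks}

    def heartbeat_reply(self, reported_chunks):
        """Return the reported chunks the master no longer tracks; the chunkserver may delete them."""
        return [c for c in reported_chunks if c not in self.chunk_refs]
```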

The advantages of this GC approach are:

  1. In large-scale distributed systems, this implementation is simple and reliable. The creation of each chunk may fail on certain chunkservers, and messages for deleting replicas may be lost, but the master does not need to retry or remember these failed replicas.
  2. All storage reclamation occurs in the background in batches and is done during relatively idle periods for the master, allowing it to respond more quickly to urgent client requests during normal times.
  3. Delaying the reclamation of space can prevent irreversible deletions caused by accidents.

The main disadvantage of delayed deletion is that it can hinder users' efforts to adjust space when storage is tight (e.g., wanting to delete some files). Applications that repeatedly create and delete temporary files cannot directly reuse storage. The solution is that if a previously deleted file is deleted again, the internal system will accelerate the reclamation process for the corresponding storage, while also allowing applications to use different replicas and reclamation strategies in different namespaces. For example, users can specify that all chunks under a certain directory tree do not need to store replicas, and any file deletion will be executed immediately, resulting in irreversible removal from the file system.

Checking Outdated Replicas#

Replicas may become outdated and lose changes due to chunkservers going offline. The master maintains a version number for each chunk to distinguish the latest replicas from outdated ones. When the master grants a lease on a chunk, it increments its version number and informs all its currently latest replicas, and then the master and these replicas persist the new version number. These steps occur before the client writes to the chunk.

If a replica is unavailable while a mutation happens, its chunk version number falls behind. When the chunkserver restarts and reports its chunks and their version numbers, the master detects that it has stale replicas. If the master sees a version number higher than the one in its records, it assumes that it failed while granting the lease (a failed grant cannot have led to any writes) and adopts the higher version as the current one.

The master removes outdated replicas during regular GC periods, and when returning chunk-related information to clients, it will ignore outdated replicas. As another guarantee: when the master informs the client which chunkserver holds the lease or instructs a chunkserver to copy data from another chunkserver, it will include the chunk version, and then the client and chunkserver will verify the chunk version during operations to ensure they are accessing the latest data.
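
A minimal sketch of the version-number bookkeeping described above:

```python
# Sketch of stale-replica detection: the version number is bumped when a lease
# is granted; a replica reporting an older version is stale, and a higher
# reported version is adopted as current (the master failed mid-grant).
class ChunkVersions:
    def __init__(self):
        self.current = {}    # chunk_handle -> latest version known to the master

    def on_lease_grant(self, chunk):
        self.current[chunk] = self.current.get(chunk, 0) + 1
        return self.current[chunk]     # persisted by the master and the up-to-date replicas

    def check_report(self, chunk, reported_version):
        latest = self.current.get(chunk, 0)
        if reported_version < latest:
            return "stale"             # replica will be removed during garbage collection
        if reported_version > latest:  # master failed while granting a lease
            self.current[chunk] = reported_version
            return "adopted_as_current"
        return "up_to_date"
```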

Fault Tolerance and Diagnosis#

One of the biggest challenges in the design of the GFS system is dealing with frequent component failures. The quality and quantity of components lead to failures being a norm rather than an exception: machines and disks cannot be fully trusted. Component failures can lead to system unavailability or even data corruption.

High Availability#

A GFS cluster contains hundreds of servers, some of which may be unavailable at any given time. GFS keeps the overall system highly available with two simple yet effective strategies: fast recovery and replication.

Rapid Recovery#

Both the master and chunkservers are designed to recover their state and start up within seconds, regardless of how they terminate. In fact, GFS does not distinguish between normal and abnormal terminations; servers are simply killed during routine shutdowns. Clients or other services will briefly pause upon timing out on their unfinished requests and then reconnect to the restarted service to retry.

Chunk Replication#

As mentioned earlier, the master clones existing replicas as needed to keep each chunk fully replicated when it detects that a chunkserver has gone offline or that a replica is corrupted (via checksum verification). In addition, other forms of cross-server redundancy, such as parity or erasure codes, are being explored to meet growing read-only storage demands.

Master Replication#

To ensure reliability, the state of the master is also replicated. Operation logs and checkpoints are replicated across multiple machines, and each state change operation can only be considered committed after the logs have been flushed to local disks and all master replicas.

To keep things simple, a single master is responsible for all mutations and for background activities such as GC. When the master process fails, it can be restarted almost instantly. If the master's machine or disk fails, monitoring infrastructure outside GFS starts a new master process elsewhere using a replica of the operation log. Clients access the master only through a canonical name such as gfs-test, a DNS alias that can be changed if the master is moved to another machine.

Additionally, master replicas, or shadow masters, can provide read-only access to the file system, even when the primary master is offline. However, shadow masters may be slightly behind the primary by a fraction of a second. For files that are not frequently modified or in situations where real-time requirements are relatively weak, this approach can enhance read availability. In fact, file content is read from chunkservers, and applications are generally unaware of whether the file content is outdated. Only directory content or permission control information, such as metadata, may expire within a very short time window.

Each shadow master reads a replica of the growing operation log and applies the same sequence of changes to its own in-memory state, exactly as the primary does. Like the primary, it polls all chunkservers at startup to locate chunk replicas and afterwards monitors their status through frequent handshake messages. It depends on the primary master only for updates to replica locations that result from the primary's own decisions, such as creating or deleting replicas.

Data Integrity#

Each chunkserver uses checksums to check for data corruption. A GFS cluster may have thousands of disks distributed across hundreds of machines, making data corruption and loss during reads or writes common. In such cases, recovery can be performed using other chunk replicas. Checking for data corruption by comparing different replicas on chunkservers is impractical. Additionally, in some operations, such as atomic record appends, it cannot be guaranteed that all replicas are completely consistent, but these replicas are still valid. Therefore, each chunkserver must independently verify data integrity by maintaining checksums.

Each chunk is divided into 64KB blocks, and each block has a 32-bit checksum. Checksums are kept in memory and stored persistently with logging, separate from user data. For a read request, the chunkserver verifies the checksums of the blocks that overlap the read range before returning any data to the requester, whether a client or another chunkserver. If a checksum does not match, the chunkserver returns an error instead of the corrupted data and reports the mismatch to the master. The requester then reads from another replica, while the master clones the chunk from a valid replica and instructs the chunkserver that reported the error to delete its corrupted replica.
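
A sketch of the block-level checksum scheme described above, using CRC32 as a stand-in for the actual checksum function:

```python
# Sketch of per-block checksumming: 64 KB blocks, a 32-bit checksum per block,
# and verification of every block that overlaps a read before any data is
# returned. CRC32 stands in for the real checksum algorithm.
import zlib

BLOCK_SIZE = 64 * 1024

def checksum_blocks(chunk_data: bytes):
    """Compute one 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def read_verified(chunk_data: bytes, checksums, offset: int, length: int) -> bytes:
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):            # verify only the blocks overlapping the read
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            raise IOError(f"checksum mismatch in block {b}; read from another replica")
    return chunk_data[offset:offset + length]
```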

Calculating checksums has a minimal impact on read performance because most read requests span only a few blocks, so only a small portion of additional data is needed for checksums, and the GFS client code will attempt to align reads with block sizes to reduce this overhead. Additionally, the lookup and verification of checksums incur no I/O overhead, and checksum calculations can typically overlap with I/O operations (calculating checksums while reading files).

For record append operations, checksum computation is heavily optimized because appends dominate the workload: the checksum of the last partial block is updated incrementally, and new checksums are computed for the blocks newly filled by the append. Even if the last partial block is already corrupted and the corruption goes undetected now, the new checksum will not match the stored data, so the corruption is detected the next time that block is read.

In contrast, if a write overwrites an existing range of a chunk, the first and last blocks of the range being overwritten must be read and their checksums verified before the write is executed, and new checksums are computed and recorded afterward. Otherwise, corruption in the parts of those blocks that are not overwritten could be hidden by the new checksums.

During idle periods, chunkservers will scan and verify the data of inactive chunks. Once damage is detected, the master will create new replicas and delete the damaged ones. This prevents inactive chunks from quietly becoming corrupted without the master’s knowledge, leading to insufficient valid replicas.

Diagnostic Tools#

Detailed and comprehensive diagnostic logs provide invaluable assistance in problem isolation, debugging, and performance analysis. GFS services generate a lot of diagnostic logs, including key events (such as chunkservers coming online or going offline) and all RPC requests and responses. These logs can be deleted at any time, but we always try to retain them for as long as storage space allows.

RPC logs contain the exact requests and responses sent on the wire. By matching requests with responses and collating records across machines, the entire interaction history can be reconstructed to diagnose a problem. The logs also serve as traces for load testing and performance analysis.

The impact of logs on performance is minimal because these logs are written sequentially and asynchronously. Most recent events are also kept in memory for continuous online monitoring.

Experience#

Production systems are well disciplined and controlled, but users are not, so more infrastructure is needed to keep users from interfering with one another.

Many of our issues are related to disks and Linux. Many disks claim to support a range of IDE protocols, but in reality, they can only reliably respond to the most recent version. Since the protocols are very similar across multiple versions, most of the time, the drivers work normally, but occasional errors can lead to inconsistencies between the driver and kernel states, causing data to be gradually corrupted. This is also our motivation for using checksums, and we have modified the kernel to address these protocol errors.

In the early days, when we were using the Linux 2.2 kernel, we discovered some issues where the time taken for fsync was proportional to the overall size of the file rather than the portion being written. For large-scale operation logs, this was a problem, especially before implementing checkpoints. We initially used synchronous writes to address this issue and later migrated to Linux 2.4.

Another Linux issue was a single (global) reader-writer lock that any thread in an address space must hold when paging data in from disk (read lock) or modifying an mmap-mapped address (write lock). We saw transient timeouts even under light system load and spent a long time looking for resource bottlenecks or intermittent hardware failures. Eventually we found that this single lock was blocking the main network thread from mapping new data into memory while the disk thread was paging in previously mapped data. Since we were limited mainly by network interface bandwidth rather than memory-copy bandwidth, we worked around this by replacing mmap with pread.

Conclusion#

GFS has successfully supported our storage needs and is widely used within Google as a storage platform for search and business data processing.
