Fossil: A Technical Overview of Fossil's Design & Implementation

1.0 Introduction

At its lowest level, a Fossil repository consists of an unordered set of immutable "artifacts". You might think of these artifacts as "files", since in many cases the artifacts are exactly that. But other "structural artifacts" are also included in the mix. These structural artifacts define the relationships between artifacts - which files go together to form a particular version of the project, who checked in that version and when, what was the check-in comment, what wiki pages are included with the project, what are the edit histories of each wiki page, what bug reports or tickets are included, who contributed to the evolution of each ticket, and so forth. This low-level file format is called the "global state" of the repository, since this is the information that is synced to peer repositories using push and pull operations. The low-level file format is also called "enduring" since it is intended to last for many years. The details of the low-level, enduring, global file format are described separately.

This article is about how Fossil is currently implemented. Instead of dealing with vague abstractions of "enduring file formats" as the other document does, this article provides some detail on how Fossil actually stores information on disk.

2.0 Three Databases

Fossil stores state information in SQLite database files. SQLite keeps an entire relational database, including multiple tables and indices, in a single disk file. The SQLite library allows the database files to be efficiently queried and updated using the industry-standard SQL language. SQLite updates are atomic, so even in the event of a system crash or power failure the repository content is protected.

Fossil uses three separate classes of SQLite databases:

The configuration database
Repository databases
Checkout databases

The configuration database is a one-per-user database that holds global configuration information used by Fossil. There is one repository database per project. The repository database is the file that people are normally referring to when they say "a Fossil repository". The checkout database is found in the working checkout for a project and contains state information that is unique to that working checkout.

Fossil does not always use all three database files. The web interface, for example, typically only uses the repository database. And the fossil settings command only opens the configuration database when the --global option is used. But other commands use all three databases at once. For example, the fossil status command will first locate the checkout database, then use the checkout database to find the repository database, then open the configuration database. Whenever multiple databases are used at the same time, they are all opened on the same SQLite database connection using SQLite's ATTACH command.
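The single-connection arrangement can be illustrated with Python's standard sqlite3 module. This is only a sketch of the ATTACH technique, not Fossil's own code: the schema names "repo" and "cfg" and the helper functions are assumptions made for the example.

```python
import sqlite3

def open_all(checkout_path, repo_path, config_path):
    """Open the checkout database, then attach the other two to the same connection."""
    conn = sqlite3.connect(checkout_path)
    # "repo" and "cfg" are illustrative schema names chosen for this sketch.
    conn.execute("ATTACH DATABASE ? AS repo", (repo_path,))
    conn.execute("ATTACH DATABASE ? AS cfg", (config_path,))
    return conn

def attached_names(conn):
    """Return the schema names visible on this connection."""
    # PRAGMA database_list yields one (seq, name, file) row per database.
    return [row[1] for row in conn.execute("PRAGMA database_list")]

# In-memory databases stand in for the three files.
conn = open_all(":memory:", ":memory:", ":memory:")
print(attached_names(conn))
```

Once attached, tables in any of the databases can be named in one SQL statement by prefixing them with the schema name.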

The chart below provides a quick summary of how each of these database files is used by Fossil, with detailed discussion following.

Configuration Database ("~/.fossil" or "~/.config/fossil.db"):

Global settings
List of active repositories used by the all command

Repository Database ("project.fossil"):

Global state of the project encoded using delta-compression
Local settings
Web interface display preferences
User credentials and permissions
Metadata about the global state to facilitate rapid queries

Checkout Database ("_FOSSIL_" or ".fslckout"):

The repository database used by this checkout
The version currently checked out
Other versions merged in but not yet committed
Changes from the add, delete, and rename commands that have not yet been committed
"mtime" values and other information used to efficiently detect local edits
The "stash"
Information needed to "undo" or "redo"

2.1 The Configuration Database

The configuration database holds cross-repository preferences and a list of all repositories for a single user.

The fossil settings command can be used to specify various operating parameters and preferences for Fossil repositories. Settings can apply to a single repository, or they can apply globally to all repositories for a user. If both a global and a repository value exist for a setting, then the repository-specific value takes precedence. All of the settings have reasonable defaults, and so many users will never need to change them. But if changes to settings are desired, the configuration database provides a way to change settings for all repositories with a single command, rather than having to change the setting individually on each repository.
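The precedence rule can be sketched as a simple lookup chain. The function below is a hypothetical illustration using plain dictionaries; it does not reflect how Fossil actually stores or reads settings, and the setting name and values are placeholders.

```python
def effective_setting(name, repo_settings, global_settings, defaults):
    """Return the repository value if set, else the global value, else the default."""
    for source in (repo_settings, global_settings, defaults):
        if name in source:
            return source[name]
    raise KeyError(name)

# Placeholder data: the repository-specific value overrides the global one.
defaults = {"editor": "vi"}
global_settings = {"editor": "nano"}
repo_settings = {"editor": "emacs"}
print(effective_setting("editor", repo_settings, global_settings, defaults))
```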

The configuration database also maintains a list of repositories. This list is used by the fossil all command in order to run various operations such as "sync" or "rebuild" on all repositories managed by a user.

2.1.1 Location Of The Configuration Database

On Unix systems, the configuration database is named by the following algorithm:

1. if environment variable FOSSIL_HOME exists	→	$FOSSIL_HOME/.fossil
2. if file ~/.fossil exists	→	~/.fossil
3. if environment variable XDG_CONFIG_HOME exists	→	$XDG_CONFIG_HOME/fossil.db
4. if the directory ~/.config exists	→	~/.config/fossil.db
5. Otherwise	→	~/.fossil

Another way of thinking of this algorithm is the following:

Use "$FOSSIL_HOME/.fossil" if the FOSSIL_HOME variable is defined
Use the XDG-compatible name (usually ~/.config/fossil.db) on XDG systems if the ~/.fossil file does not already exist
Otherwise, use the traditional Unix name of "~/.fossil"
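The five-step Unix lookup can be written out as a small function. This is a sketch of the algorithm as described above, not a transcription of Fossil's source. The function name and parameters are hypothetical, and the existence checks are passed in as callables so the logic can be exercised without touching the real file system.

```python
import os
import posixpath

def config_db_path(env, home, file_exists=os.path.isfile, dir_exists=os.path.isdir):
    """Return the configuration database path following the five Unix rules in order."""
    # posixpath is used so that the result always has Unix-style separators.
    if "FOSSIL_HOME" in env:                       # rule 1
        return posixpath.join(env["FOSSIL_HOME"], ".fossil")
    legacy = posixpath.join(home, ".fossil")
    if file_exists(legacy):                        # rule 2
        return legacy
    if "XDG_CONFIG_HOME" in env:                   # rule 3
        return posixpath.join(env["XDG_CONFIG_HOME"], "fossil.db")
    xdg_dir = posixpath.join(home, ".config")
    if dir_exists(xdg_dir):                        # rule 4
        return posixpath.join(xdg_dir, "fossil.db")
    return legacy                                  # rule 5

# Example call against the current environment and home directory.
print(config_db_path(os.environ, os.path.expanduser("~")))
```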

This algorithm is complex due to the need for historical compatibility. Originally, the database was always just "~/.fossil". Then support for the FOSSIL_HOME environment variable was added. Later, support for the XDG-compatible configuration filenames was added. Each of these changes needed to continue to support legacy installations.

On Windows, the configuration database is the first of the following for which the corresponding environment variables exist:

%FOSSIL_HOME%/_fossil
%LOCALAPPDATA%/_fossil
%APPDATA%/_fossil
%USERPROFILE%/_fossil
%HOMEDRIVE%%HOMEPATH%/_fossil

The second case is the one that usually determines the name. Note that the FOSSIL_HOME environment variable can always be set to determine the location of the configuration database. Note also that the configuration database file itself is called ".fossil" or "fossil.db" on Unix but "_fossil" on Windows.

The fossil info command will show the location of the configuration database on a line that starts with "config-db:".

2.2 Repository Databases

The repository database is the file that is commonly referred to as "the repository". This is because the repository database contains, among other things, the complete revision, ticket, and wiki history for a project. It is customary to name the repository database after the name of the project, with a ".fossil" suffix. For example, the repository database for the self-hosting Fossil repository is called "fossil.fossil" and the repository database for SQLite is called "sqlite.fossil".

2.2.1 Global Project State

The bulk of the repository database (typically 75 to 85%) consists of the artifacts that comprise the enduring, global, shared state of the project. The artifacts are stored as BLOBs, compressed using zlib compression and, where applicable, using delta compression. The combination of zlib and delta compression yields considerable space savings. For the SQLite project (when this paragraph was last updated on 2020-02-08) the total size of all artifacts is over 7.1 GB, but thanks to the combined zlib and delta compression, that content takes less than 97 MB of space in the repository database, for a compression ratio of about 74:1. The median size of all content BLOBs after delta and zlib compression have been applied is 156 bytes. The median size of BLOBs without compression is 45,312 bytes.

Note that the zlib and delta compression is not an inherent part of the Fossil file format; it is just an optimization. The enduring file format for Fossil is the unordered set of artifacts. The compression techniques are just a detail of how the current implementation of Fossil happens to store these artifacts efficiently on disk.

All of the original uncompressed and un-delta'd artifacts can be extracted from a Fossil repository database using the fossil deconstruct command. Individual artifacts can be extracted using the fossil artifact command. When accessing the repository database with raw SQL through the fossil sql command, the extension function "content()" takes a single argument, the SHA1 or SHA3-256 hash of an artifact, and returns the complete uncompressed content of that artifact.
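The idea that compression is only a storage detail, and that an artifact is always retrieved by the hash of its uncompressed content, can be imitated in a few lines. The sketch below is not Fossil's schema or its content() function: the "artifact" table, its columns, and the Python helpers are assumptions for illustration, only zlib is applied (no delta compression), and SHA3-256 is used as the hash.

```python
import hashlib
import sqlite3
import zlib

def open_store():
    """Create an in-memory store with a simplified, hypothetical artifact table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE artifact(hash TEXT PRIMARY KEY, body BLOB NOT NULL)")
    return conn

def add_artifact(conn, data):
    """Store zlib-compressed bytes under the hash of the uncompressed content."""
    h = hashlib.sha3_256(data).hexdigest()
    conn.execute("INSERT OR IGNORE INTO artifact(hash, body) VALUES (?, ?)",
                 (h, zlib.compress(data)))
    return h

def content(conn, h):
    """Return the uncompressed artifact for a hash, or None if it is unknown."""
    row = conn.execute("SELECT body FROM artifact WHERE hash = ?", (h,)).fetchone()
    return None if row is None else zlib.decompress(row[0])

conn = open_store()
h = add_artifact(conn, b"int main(void){ return 0; }\n" * 200)
assert content(conn, h) == b"int main(void){ return 0; }\n" * 200
```

Because the hash is computed before compression, the storage encoding could be changed without altering the identity of any artifact.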

Going the other way, the fossil reconstruct command will scan a directory hierarchy and add all files found to a new repository database. The fossil import command works by reading the input git-fast-export stream and using it to construct corresponding artifacts which are then written into the repository database.

2.2.2 Project Metadata

The global project state information in the repository database is supplemented by computed metadata that makes querying the project state more efficient. Metadata includes information such as the following:

The names for all files found in any check-in.
All check-ins that modify a given file.
Parents and children of each check-in.
Potential timeline rows.
The names of all symbolic tags and the check-ins they apply to.
The names of all wiki pages and the artifacts that comprise each wiki page.
Attachments and the wiki pages or tickets they apply to.
Current content of each ticket.
Cross-references between tickets, check-ins, and wiki pages.

The metadata is held in various SQL tables in the repository database. The metadata is designed to facilitate queries for the various timelines and reports that Fossil generates. As the functionality of Fossil evolves, the schema for the metadata can and does change. But schema changes do not invalidate the repository. Remember that the metadata contains no new information - only information that has been extracted from the canonical artifacts and saved in a more useful form. Hence, when the metadata schema changes, the prior metadata can be discarded and the entire metadata corpus can be recomputed from the canonical artifacts. That is what the fossil rebuild command does.
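The relationship between canonical artifacts and derived metadata can be sketched with a toy example. Everything here is an assumption made for illustration: the table names, the rebuild function, and the made-up artifact format in which a line beginning with "F " names a file. The point is only that the derived table can be dropped and recomputed at any time from the artifacts alone.

```python
import sqlite3

def rebuild(conn):
    """Discard the derived file-name table and recompute it from the artifacts."""
    conn.execute("DROP TABLE IF EXISTS filename")
    conn.execute("CREATE TABLE filename(name TEXT PRIMARY KEY)")
    # Read all artifacts first, then insert the names extracted from them.
    for (body,) in conn.execute("SELECT body FROM artifact").fetchall():
        for line in body.splitlines():
            if line.startswith("F "):
                conn.execute("INSERT OR IGNORE INTO filename(name) VALUES (?)",
                             (line[2:],))

def file_names(conn):
    """Return the derived file names in sorted order."""
    return [r[0] for r in conn.execute("SELECT name FROM filename ORDER BY name")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artifact(body TEXT)")
conn.executemany("INSERT INTO artifact VALUES (?)",
                 [("F a.c\nF b.c",), ("F b.c\nF c.c",)])
rebuild(conn)
print(file_names(conn))
```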

2.2.3 Display And Processing Preferences

The repository database also holds information used to help format the display of web pages and configuration settings that override the global configuration settings for the specific repository. All of this information (and the user credentials and privileges too) is local to each repository database; it is not shared between repositories by fossil sync. That is because it is entirely reasonable that two different websites for the same project might have completely different display preferences and user communities. One instance of the project might be a fork of the other, for example, which pulls from the other but never pushes and extends the project in ways that the keepers of the other website disapprove of.

Display and processing information includes the following:

The name and description of the project
The CSS file, header, and footer used by all web pages
The project logo image
Fields of tickets that are considered "significant" and which are therefore collected from artifacts and made available for display
Templates for screens to view, edit, and create tickets
Ticket report formats and display preferences
Local values for settings that override the global values defined in the per-user configuration database

Though the display and processing preferences do not move between repository instances using fossil sync, this information can be shared between repositories using the fossil config push and fossil config pull commands. The display and processing information is also copied into new repositories when they are created using fossil clone.

2.2.4 User Credentials And Privileges

Just because two development teams are collaborating on a project and allow push and/or pull between their repositories does not mean that they trust each other enough to share passwords and access privileges. Hence the names and emails and passwords and privileges of users are considered private information that is kept locally in each repository.

Each repository database has a table holding the username, privileges, and login credentials for users authorized to interact with that particular database. In addition, there is a table named "concealed" that maps the SHA1 hash of each user's email address back into their true email address. The concealed table allows just the SHA1 hash of email addresses to be stored in tickets, and thus prevents actual email addresses from falling into the hands of spammers who happen to clone the repository.
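The concealment scheme amounts to a local lookup table keyed by hash. The following sketch shows the idea with Python's hashlib and sqlite3 modules. Only the table name "concealed" comes from the text above; the column names and helper functions are assumptions and may not match Fossil's actual schema.

```python
import hashlib
import sqlite3

def open_concealed():
    """Create an in-memory database with a simplified concealed table."""
    conn = sqlite3.connect(":memory:")
    # Column names are hypothetical.
    conn.execute("CREATE TABLE concealed(hash TEXT PRIMARY KEY, content TEXT)")
    return conn

def conceal(conn, email):
    """Record the address locally and return the SHA1 hash that may be shared."""
    h = hashlib.sha1(email.encode("utf-8")).hexdigest()
    conn.execute("INSERT OR IGNORE INTO concealed(hash, content) VALUES (?, ?)",
                 (h, email))
    return h

def reveal(conn, h):
    """Map a hash back to the address, if this repository knows it."""
    row = conn.execute("SELECT content FROM concealed WHERE hash = ?", (h,)).fetchone()
    return row[0] if row else None

conn = open_concealed()
token = conceal(conn, "alice@example.com")
print(token, reveal(conn, token))
```

A clone that receives only the hash stored in a ticket has no row in its own concealed table, so the lookup returns nothing there.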

The content of the user and concealed tables can be pushed and pulled using the fossil config push and fossil config pull commands with "user" and "email" as the AREA argument, but only if you have administrative privileges on the remote repository.

2.2.5 Shunned Artifact List

The set of canonical artifacts for a project - the global state for the project - is intended to be an append-only database. In other words, new artifacts can be added but artifacts can never be removed. But it sometimes happens that inappropriate content is mistakenly or maliciously added to a repository. The only way to get rid of the undesired content is to "shun" it. The "shun" table in the repository database records the hash values for all shunned artifacts.
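One plausible way a list of shunned hashes could be consulted is to filter incoming content against it. The sketch below is an assumption about usage, not a description of Fossil's implementation: the function name is hypothetical, a Python set stands in for the shun table, and SHA3-256 is used as the artifact hash.

```python
import hashlib

def accept_artifacts(incoming, shunned):
    """Keep incoming artifacts whose hash is not on the shun list."""
    kept = {}
    for data in incoming:
        h = hashlib.sha3_256(data).hexdigest()
        if h in shunned:
            continue  # shunned content is never stored
        kept[h] = data
    return kept

unwanted = b"inappropriate content"
shunned = {hashlib.sha3_256(unwanted).hexdigest()}
kept = accept_artifacts([b"ordinary file", unwanted], shunned)
print(len(kept))
```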

The shun table can be pushed or pulled using the fossil config command with the "shun" AREA argument. The shun table is also copied during a clone.

2.3 Checkout Databases

Fossil allows a single repository to have multiple working checkouts. Each working checkout has a single database in its root directory that records the state of that checkout. The checkout database is named "_FOSSIL_" or ".fslckout". The checkout database records information such as the following:

The name of the repository database file.
The version that is currently checked out.
Files that have been added, removed, or renamed but not yet committed.
The mtime and size of files as they were originally checked out, in order to expedite checking which files have been edited.
Other check-ins that have been merged into the working checkout but not yet committed.
Copies of files prior to the most recent undoable operation - needed to implement the undo and redo commands.
The stash.
State information for the bisect command.

For Fossil commands that run from within a working checkout, the first thing that happens is that Fossil locates the checkout database. Fossil first looks in the current directory. If not found there, it looks in the parent directory. If not found there, the parent of the parent. And so forth until either the checkout database is found or the search reaches the root of the file system. (In the latter case, Fossil returns an error, of course.) Once the checkout database is located, it is used to locate the repository database.
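The upward search can be sketched with pathlib. This is an illustration of the search order described above rather than Fossil's code; the function name is hypothetical, and returning None stands in for the error Fossil reports when no checkout database is found.

```python
from pathlib import Path

# The two checkout database names given in this document.
CHECKOUT_DB_NAMES = ("_FOSSIL_", ".fslckout")

def find_checkout_db(start):
    """Search start and each ancestor directory for a checkout database file."""
    directory = Path(start).resolve()
    # Try the starting directory first, then each parent up to the root.
    for candidate_dir in (directory, *directory.parents):
        for name in CHECKOUT_DB_NAMES:
            candidate = candidate_dir / name
            if candidate.is_file():
                return candidate
    return None

print(find_checkout_db("."))
```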

Notice that the checkout database contains a pointer to the repository database but that the repository database has no record of the checkout databases. That means that a working checkout directory tree can be freely renamed or copied or deleted without consequence. But the repository database file, on the other hand, has to stay in the same place with the same name or else the open checkout databases will not be able to find it.

A checkout database is created by the fossil open command. A checkout database is deleted by fossil close. The fossil close command really isn't needed; one can accomplish the same thing simply by deleting the checkout database.

Note that the stash, the undo stack, and the state of the bisect command are all contained within the checkout database. That means that the fossil close command will delete all stash content, the undo stack, and the bisect state. The close command is not undoable. Use it with care.
