Fossil

Artifact [7ca5bd7b]
Login

Artifact 7ca5bd7bfd6bc8e3df067badc393129d30363562:


<title>Fossil Delta Format</title>
<nowiki>
<h2>Abstract</h2>

<p>Fossil achieves efficient storage and low-bandwidth synchronization
through the use of delta-compression.  Instead of storing
or transmitting the complete content of an artifact, fossil stores or
transmits only the changes relative to a related artifact.
</p>

<p>This document describes the delta-encoding format used by fossil.
The intended audience is developers working on either
<a href="index.wiki">fossil</a> itself, or on tools compatible with
fossil.</p>

<p>Note that the delta-encoding is not a fundamental element of the
state of a fossil repository.  A state of a fossil repository is
defined by the uncompressed and undeltaed content of all artifacts.
The fact the artifacts
are stored on disk using this delta-encoding format is merely an
optimization.  One could, in theory, create an entirely new and
compatible implementation of fossil that used a different delta-encoding
or did no delta-encoding at all.  However, experience has shown that
the delta-encoding described here is both efficient to compute and
results in very small deltas, so its continued use is recommended.</p>

<a name="structure"></a><h2>1.0 Structure</h2>
<img src="delta1.gif" align="left" hspace="10">

<p>A delta consists of three parts, a "header", a "trailer", and a
"segment-list" between them.</p>

<p>Both header and trailer provide information about the target
helping the decoder, and the segment-list describes how the target can
be constructed from the original.</p>

<a name="header"></a><h3>1.1 Header</h3>
<img src="delta6.gif" align="left" hspace="10">

<p>The header consists of a single number followed by a newline
character (ASCII 0x0a). The number is the length of the target in
bytes.</p>

<p>This means that, given a delta, the decoder can compute the size of
the target (and allocate any necessary memory based on that) by simply
reading the first line of the delta and decoding the number found
there. In other words, before it has to decode everything else.</p>

<a name="trailer"></a><h3>1.2 Trailer</h3>
<img src="delta5.gif" align="left" hspace="10">

<p>The trailer consists of a single number followed by a semicolon (ASCII
0x3b). This number is a checksum of the target and can be used by a
decoder to verify that the delta applied correctly, reconstructing the
target from the original.</p>

<p>The checksum is computed by treating the target as a series of
32-bit integer numbers (MSB first), and summing these up, modulo
2^32-1. A target whose length is not a multiple of 4 is padded with
0-bytes (ASCII 0x00) at the end.</p>

<p>By putting this information at the end of the delta a decoder has
it available immediately after the target has been reconstructed
fully.</p>

<a name="slist"></a><h3>1.3 Segment-List</h3>
<img src="delta2.gif" align="left" hspace="10">

<p>The segment-list of a delta describes how to create the target from
the original by a combination of inserting literal byte-sequences and
copying ranges of bytes from the original. This is where the
compression takes place, by encoding the large common parts of
original and target in small copy instructions.</p>

<p>The target is constructed from beginning to end, with the data
generated by each instruction appended after the data of all previous
instructions, with no gaps.</p>

<a name="insertlit"></a><h4>1.3.1 Insert Literal</h4>

<p>A literal is specified by two elements, the size of the literal in
bytes, and the bytes of the literal itself.</p>

<img src="delta4.gif" align="left" hspace="10">
<p>The length is written first, followed by a colon character (ASCII
0x3a), followed by the bytes of the literal.</p>

<a name="copyrange"></a><h4>1.3.2 Copy Range</h4>

<p>A range to copy is specified by two numbers, the offset of the
first byte in the original to copy, and the size of the range, in
bytes. The size zero is special, its usage indicates that the range
extends to the end of the original.</p>

<img src="delta3.gif" align="left" hspace="10">
<p>The length is written first, followed by an "at" character (ASCII
0x40), then the offset, followed by a comma (ASCII 0x2c).</p>

<a name="intcoding"></a><h2>2.0 Encoding of integers</h2>

<p>
The format currently handles only 32 bit integer numbers. They are
written base-64 encoded, MSB first, and without leading
"0"-characters, except if they are significant (i.e. 0 => "0").
</p>

<p>
The base-64 coding is described in
<a href="http://www.ietf.org/rfc/rfc3548.txt">RFC 3548</a>.
</p>

<a name="examples"></a><h2>3.0 Examples</h2>

<a name="examplesint"></a><h3>3.1 Integer encoding</h3>

<table border=1>
<tr>
<th>Value</th>
<th>Encoding</th>
</tr>
<tr>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>6246</td>
<td>1Xb</td>
</tr>
<tr>
<td>-1101438770</td>
<td>2zMM3E</td>
</tr>
</table>

<a name="examplesdelta"></a><h3>3.2 Delta encoding</h3>

<p>An example of a delta using the specified encoding is:</p>

<table border=1><tr><td><pre>
1Xb
4E@0,2:thFN@4C,6:scenda1B@Jd,6:scenda5x@Kt,6:pieces79@Qt,F: Example: eskil~E@Y0,2zMM3E;</pre>
</td></tr></table>

<p>This can be taken apart into the following parts:</p>

<table border=1>
<tr><th>What  </th> <th>Encoding         </th><th>Meaning </th><th>Details</th></tr>
<tr><td>Header</td> <td>1Xb              </td><td>Size    </td><td> 6246	     </td></tr>
<tr><td>S-List</td> <td>4E@0,	         </td><td>Copy    </td><td> 270 @ 0	     </td></tr>
<tr><td>&nbsp;</td> <td>2:th	         </td><td>Literal </td><td> 2 'th'	     </td></tr>
<tr><td>&nbsp;</td> <td>FN@4C,	         </td><td>Copy    </td><td> 983 @ 268	     </td></tr>
<tr><td>&nbsp;</td> <td>6:scenda         </td><td>Literal </td><td> 6 'scenda'	     </td></tr>
<tr><td>&nbsp;</td> <td>1B@Jd,	         </td><td>Copy    </td><td> 75 @ 1256	     </td></tr>
<tr><td>&nbsp;</td> <td>6:scenda         </td><td>Literal </td><td> 6 'scenda'	     </td></tr>
<tr><td>&nbsp;</td> <td>5x@Kt,	         </td><td>Copy    </td><td> 380 @ 1336	     </td></tr>
<tr><td>&nbsp;</td> <td>6:pieces	 </td><td>Literal </td><td> 6 'pieces'	     </td></tr>
<tr><td>&nbsp;</td> <td>79@Qt,	         </td><td>Copy    </td><td> 457 @ 1720     </td></tr>
<tr><td>&nbsp;</td> <td>F: Example: eskil</td><td>Literal </td><td> 15 ' Example: eskil'</td></tr>
<tr><td>&nbsp;</td> <td>~E@Y0,           </td><td>Copy    </td><td>  4046 @ 2176        </td></tr>
<tr><td>Trailer</td><td>2zMM3E           </td><td>Checksum</td><td> -1101438770         </td></tr>
</table>

<p>The unified diff behind the above delta is</p>

<table border=1><tr><td><pre>
bluepeak:(761) ~/Projects/Tcl/Fossil/Devel/devel > diff -u ../DELTA/old ../DELTA/new
--- ../DELTA/old        2007-08-23 21:14:40.000000000 -0700
+++ ../DELTA/new        2007-08-23 21:14:33.000000000 -0700
@@ -5,7 +5,7 @@

  *  If the server does not have write permission on the database
     file, or on the directory containing the database file (and
-    it is thus unable to update database because it cannot create
+    it is thus unable to update the database because it cannot create
     a rollback journal) then it currently fails silently on a push.
     It needs to return a helpful error.

@@ -27,8 +27,8 @@
  *  Additional information displayed for the "vinfo" page:

      +  All leaves of this version that are not included in the
-        descendant list.  With date, user, comment, and hyperlink.
-        Leaves in the descendant table should be marked as such.
+        descendant list.  With date, user, comment, and hyperlink.
+        Leaves in the descendant table should be marked as such.
         See the compute_leaves() function to see how to find all
         leaves.
      +  Add file diff links to the file change list.
@@ -37,7 +37,7 @@

  *  The /xfer handler (for push, pull, and clone) does not do
     delta compression.  This results in excess bandwidth usage.
-    There are some code in xfer.c that are sketches of ideas on
+    There are some pieces in xfer.c that are sketches of ideas on
     how to do delta compression, but nothing has been implemented.

  *  Enhancements to the diff and tkdiff commands in the cli.
@@ -45,7 +45,7 @@
     single file.  Allow diffs against any two arbitrary versions,
     not just diffs against the current check-out.  Allow
     configuration options to replace tkdiff with some other
-    visual differ of the users choice.
+    visual differ of the users choice. Example: eskil.

  *  Ticketing interface (expand this bullet)

</pre></td></tr></table>



<a name="notes"></a><h2>Notes</h2>

<ul>
<li>Pure text files generate a pure text delta.
</li>
<li>Binary files generate a delta that may contain some binary data.
</li>
<li>The delta encoding does not attempt to compress the content.
It was considered to be much
more sensible to do compression using a separate general-purpose
compression library, like <a href="http://www.zlib.net">zlib</a>.
</li>
</ul>