Speed up checksums

Takepatch has been enhanced over time to bring a whole sfio when it is a new file and stream of patches when it is not. The stream of patches can be injected into the weave in one pass, saving the overhead of the original stripdel and re-patching.

This moved the overhead in a big pull to the checksum verification for each delta brought over in the patch. In takepatch there is a loop calling sccs_resum(s, d) per delta brought in, sccs_resum() in turn calls sccs_get() with some flags which cause it to not get the contents of the file, but just to verify the checksum. That operation was walking the whole weave on each call.

To speed up the process, fastsum() was wired into get_reg() if the conditions were just right (basically, the conditions that sccs_resum() sets up). The first time fastsum is called, it sets up a cache of the weave where data blocks are replaced with checksums and line counts of the blocks. Subsequent calls then use the cache to compute d→sum, d→added, d→same, and d→deleted.

Handling a no newline file

The cache needs to run the same no newline state machine that get_reg() does but gets to cheat a little. When we are writing a file out, we can’t put the newline out until we are in the right place in the file. Calculating a checksum, we can add the newline in right away, then subtract \n at the end if the no newline state machine says so.

The way it works is to keep a variable lf_pend which has the serial number of the last block of data printed. That variable will get set to 0 in two conditions: if a deleted portion of that same level of data is seen, and if an ^AE is seen closing down that block.

In the latter case, when a block which set lf_pends ends, and there was an N found after the serial number (when the cache was built), the no_lf flag is set. This stays set until any other active data is seen.

If no_lf is set at the end, then we subtract \n from the sum.

You may be wondering why get_reg() uses lf_pend, but not the no_lf flag. The no_lf is needed in fastsum() because of the I compression. In get_reg(), there is only one E with the same serial as lf_pend. But with the I compression, the matching E might have been removed. We detect closing blocks by the step down to a lower serial number. However there can be many cases where a step down occurs. Only the first one is the real one, so when it is seen, lf_pend is cleared, and no_lf is set.

Packing the weave

There are a number of optimizations we can do as we are not restricted to maintaining sccs structure.

Version 1

The weave is represented in an array of 32 bit unsigned ints. Each SCCS command takes a u32 and each block of data (upto 2^15 - 1 lines long) takes a u32.

commands MSB ^AD <serial> ⇒ 1 0 0 <29 bits of serial> ^AI <serial> ⇒ 1 0 1 <29 bits of serial> ^AE <serial> ⇒ 1 1 0 <29 bits of serial> ^AE <serial>N ⇒ 1 1 1 <29 bits of serial>

data blocks MSB bla bla bla bla bla bla bla bla bla ⇒ 0 <15 bits line count> <16 checksum>

Then walking the weave uses the same changestate/printstate as get_reg() with one change that printstate can be put with the data block, since there is typically one datablock between commands. This reduces the number of printstate calls down. Example:

^AD 3 This is the old text that used to be here ^AE 3 ^AI 3 This is the new text that is now here ^AE 3

Calling printstate with every command means it would be called 4 times, and calling it with each data block means it is called twice.

Version 2

When version 1 was pulled into dev the performance took a dive. Leading the way was changestate() with 250 million calls for a big patch. The data structure for state has been changed from a handcrafted linked list to a dynamically allocate u32 array. This changed meant a big slowdown for a 1/4 billion calls (takepatch went from 12 to 16 seconds).

It’s easy to drop the number of changestate calls because not every command needs to go in. The way I-E pairs are in the weave are a pure stack — push for I, pop for E. That allows for a replacement for a sequence like I1 .. I2 .. I3 .. E3 E2 E1 with using I followed by the serial number that is now active: I1 .. I2 .. I3 .. I2 I1 I0

Storing this structure in the cache means the cache walker wouldn’t need to use changestate to recover what serial contributed the next data block. Instead, a cur variable keeps the current level.

To do the no newline processing, we need to recognize the old E command. That is done by seeing that the serial number is dropping. The cache making asserts this is true, and so it can be relied on when cache walking. The N tag now moves from E to I, which takes some jostling of the way things are stored by swapping I and E:

commands MSB ^AD <serial> ⇒ 1 0 0 <29 bits of serial> ^AE <serial> ⇒ 1 0 1 <29 bits of serial> ^AI <serial> ⇒ 1 1 0 <29 bits of serial> ^AI <serial>N ⇒ 1 1 1 <29 bits of serial>

Doing this, changestate is only used to track D<ser>-E<ser> pairs, and not all D-E pairs, but only active ones (where slist[ser] is true).

The only item that needs to be tracked is the active delete with the largest serial. If we are only storing active, then that would be the top element of state if there is one.

The tracking of the deleted counter needed to look at the next biggest delete. I could have created another interface to look at the next biggest, but instead I set it up to cache the top of the state list locally in dstate and use the state structure to store all the other deletes. This lets me use topstate to get at the second delete in the list. This change also increases performance since changestate is called even less. If there are no overlapping active deletes, changestate is never called.

Doing all of this dropped the number of calls to changestate from around 250 million to 45 million (before the deleted enhancement). This dropped the just under 12 seconds to around 6 seconds for the bugfix version, and just below 16 seconds to just above 6 seconds for the dev version. It closed a 4 second gap to just above 1/2 second.

Version 3

Compress I sequences. After the I-E → I-I transform, a weave can look like:

^I 5 a new first line ^I 6 more foo ^I 5 ^I 0 ^I 2 first line ^I 0 ^I 1 ^I 0

With compression: ^I 5 a new first line ^I 6 more foo ^I 2 first line ^I 0

As explained earlier, the no-newline machine was adjusted to work with this, by splitting out the tracker lf_pend with the final state of no_lf. This lets the stopping condition not need to match the data serial state, but just be less than it.

The building of the cache has an assert (ser < last) which verifies this assumption.

The compression works with no newline by only collapsing I<serial>[N] if neither have the N or both have the N.

Version 4

Look at the weave as a sequence of data blocks, and the conditions which make them active. Turn the conditions into a hash key, and accumulate checksum metadata about blocks with the same key.

Then make that into a list of data blocks meta data. Walk the list testing for active block, and if so, accumulate checksum data.

The details will be presented in 3 parts: the hash, the list, and the walk.

But first…​

REVIEW of the changestate() array

The array is an ordered list of serials, going low to high, and augmented with the high bit set if the serial corresponds to I (add) block.

The weave is a combination of two structures: blocks added (^AI - ^AE in the weave) and range deleted (^AD - ^AE), the latter building on the first.

Added blocks form a nested structure:

^AI <serial N>
line of text from serial N
line 2 of text from serial N
  ^AI <serial N+m>
  a new line of text from serial N+m
  ^AE <serial N+m>
an additional lines of text from serial N
^AE <serial N>

changestate() could track that would be a pure stack: push serial with ^AI and pop with ^AE. The top of stack would be the serial corresponding to any text seen.

Delete regions are laid across this structure:

^AI <serial N>
line of text from serial N
^AD <serial N+k> --------------------- start deleting if N+k active
line 2 of text from serial N
  ^AI <serial N+m>
  a new line of text from serial N+m
  ^AE <serial N+m>
an additional lines of text from serial N
^AD <serial N+k> --------------------- stop deleting
^AE <serial N>

There can be two conditions here: (k > m) || (k < m). A delete region only deletes lines from lines from lower serials.

changestate() handles that with 2 additions: it needs to be able to insert the serial into an ordered list, and it needs to distinguish I from D entries, which is done by setting the high bit on the I entries.

HERE’S THE INTERESTING PART…​

In get_reg(), when walking the weave to generate a particular version, a text block is seen as active if the newest I on the changestate array is in the active serial map and none of the larger serial Ds are active. The smaller I and D entries are ignored.

That means the only interesting part of the changestate array is from the last I until the end of the list (which will be 0 or more Ds).

And now the new stuff…​

HASH

We use the interesting segment of the changestate array as a variable length hash key.

In changestate, the I entries are marked with the hibit set to distinguish them from D entries. In the hash key, the first entry will always be an I and the zero or more rest will always be D. When making the hash key, we turn off the hibit on I entry to make them just a list of serials with known structure.

Then each datablock is checksummed, and line counted, then saved in the hash with the described key. This means data for different blocks with the same conditions for becoming active, will be accumulated in the same hash value.

Example: say the original file is at serial 2 (aka 1.1). Serial 3 adds a big block near the beginning of the file, and a smaller block at the end. Serial 4 deletes a portion in the middle of the first block of serial 3:

|--- ----- serial 1 ---------- -----------| |---------| serial 2 |-----| xxx serial 3

Nothing in serial 1 is deleted. All 4 blocks will have the same key {1}, and will only have 1 entry in the hash. Serial 2’s first block will be cut into 3: before the delete, the delete, and after the delete. They will collapse into to hash entries: {2} and {2,3}. The last block in 2 will also use key {2}. Serial 3 has no data blocks and therefore no hash entries.

Advanced topic: no line feed state machine. Each hash val tracks the seq number of the last line accumulated, and whether it is associated with a no-newline block. When we evaluate all blocks in light of a given serialmap, we save the highest sequence number of an active block, and whether has no terminating line feed. After processing all hash entries, we’ll know if the file ended in newline.

LIST

At the time the hash is being built, a list is also being built and creates an in order memory walk for the walker (discussed below) to use.

This list could be sorted to have the walker go over fewer items, but the addition of sort time vs the fewer items walked didn’t help.

WALK

We walk the list and evaluate each entry to see if it is active (I and no D) and accumulate block sums. And logic to count added/deleted/same and run the no newline state machine. See fastsum(). Small routine.

Future work: incremental sums

We can now think about doing an incremental checksum, similar to cset_resum, with same order of walking. Then each computed sum will serve as the basis for the next sum. It’s a bit of work, and requires a sort of the list to create an interval tree like structure, but this data structure at least allows it to be built. Also there’s something about the no newline state machine hitting some cases where it punts on incremental and computes the whole thing from scratch, but those are as rare as instances of no newline.