                                CVSup

CVSup is a next-generation replacement for sup.  It is a leap
forward, in terms of capability, speed, and server efficiency.
CVSup offers the following advantages:

    * Capability.  CVSup makes some things possible that weren't
      possible before.

	- A client who uses CVSup to maintain a CVS repository can
	  check in local modifications, without interference from
	  CVSup.  CVSup will bring in new revisions, tags, and
	  branches from the server, without disturbing revisions,
	  tags, and branches that were created locally.  This is
	  possible because CVSup parses and understands RCS files.
	  Of course, for this to work, the revision numbers must
	  not collide.

	- CVSup can check out or update any of FreeBSD's sup "releases"
	  (cvs, stable, and current) directly from the CVS repository
	  on the server.  The server doesn't need to store checked-out
	  sources for stable and current.

    * Speed.  CVSup is much faster than sup.  It makes much better use
      of the communication link.

	- Smart updates.  CVSup works directly from the CVS repository
	  on the server.  It can parse and edit RCS files, and it
	  understands how they typically change, and how they are
	  typically used under CVS.  That enables CVSup to perform
	  updates very efficiently.  CVSup currently can recognize
	  and perform seven different kinds of updates:

	    + RCS file updates.  This is the most common kind of
	      update in a CVS repository.  CVSup determines which
	      revisions are present on the server but not on the
	      client.  It extracts the diffs for the new revisions
	      directly from the RCS file, and sends them to the
	      client.  The client merges the new revisions into
	      its RCS file, without disturbing the rest of it.

	      There is a twist for revisions on the main branch.
	      On the main branch, RCS uses reverse diffs -- i.e.,
	      the newest revision is stored in full text form, and
	      the earlier revisions are represented as diffs from
	      that.  This is no problem for CVSup.  The server
	      extracts the reverse diffs from the RCS file, reverses
	      them again to make them forward diffs, and sends them
	      to the client.  The client applies the received diffs
	      to its out-of-date head revision, to reconstruct the
	      full text of the new head.  It then reverses the
	      received diffs once more, and attaches them to the
	      previous head revision in the RCS file.  You got
	      that?  It works!

	      CVSup also detects added symbolic tags, and merges
	      the new tags into the client version of the RCS file.
	      Likewise, it propagates changes in the so-called
	      administrative information, e.g., the default branch
	      that is recorded in the RCS file.

	      Much care was taken to make this technique robust.
	      CVSup produces updated RCS files on the client that
	      are byte-for-byte identical to what is on the server,
	      provided the client hasn't checked in any local
	      revisions, or created any local tags.

	      CVSup verifies its RCS file updates via MD5 checksums,
	      to ensure that no RCS file is ever updated incorrectly.
	      In the event of a checksum mismatch, CVSup falls back
	      on sending the entire RCS file, in compressed form.
	      It must be stressed here that checksum mismatches
	      virtually never happen.  But, especially during the
	      early development of CVSup, the checks have served
	      as a reassuring safety net.  They can be disabled by
	      the user, in order to permit local modifications.

	      The original sup sends the entire file, no matter
	      how minor the change to it.

	    + RCS file updates to the Attic.  CVSup recognizes when
	      an RCS file has been moved into the Attic.  It does
	      the usual RCS update, then moves the new version into
	      the Attic on the client side.

	      The original sup deletes the original file, and sends
	      the full text of the Attic version.

	    + RCS file updates from the Attic.  This is the mirror
	      image of the previous case.  It is handled in a
	      similar way.

	    + Appends to non-RCS files.  When dealing with a non-RCS
	      file, CVSup checks to see whether the update consists
	      of a simple append to the end of the file.  It does
	      that as follows.  The client sends the length of the
	      file, and its MD5 checksum, to the server.  The server
	      computes the MD5 checksum of the same-sized prefix
	      of its newer file.  If the checksums match, then the
	      server recognizes that the file has simply been
	      appended to.  It sends only the new portion to the
	      client.

	      This case applies to most of the CVS administrative
	      files under the CVSROOT subdirectory.  Almost all
	      the files there are log files.  Recognizing appends
	      makes a tremendous difference in the amount of time
	      it takes to update these very large files.

	      The original sup sends the entire files.

	    + Other changes to non-RCS files.  CVSup sends the
	      entire file, just like the original sup.  It's
	      compressed, though (see below).  About the only
	      non-RCS file that has to be sent like this is the
	      "modules" file.

	    + New files.  The client generates 5 random files of the
	      proper length, and sends the MD5 checksum of each back to
	      the server.  If, by chance, the client happened to stumble
	      into creating the right file, the server need not send
	      its actual contents.

	      Just kidding! :-)  New files are sent (compressed) in
	      their entirety.

	    + Deleted files.  CVSup instructs the client to delete
	      them.

	- Streaming communication protocol.  Many existing network
	  file transfer protocols, including sup and NNTP, suffer
	  greatly from the fact that they use a lock-step, request-reply
	  form of communication.  The client sends a request to
	  the server, then waits for the server to reply.  The
	  server replies, then waits for a new request.  This form
	  of communication is bad for three reasons.  First, it
	  uses the full-duplex network connection in a half-duplex
	  fashion, wasting half of its bandwidth.  Second, it is
	  very much impacted by network delays.  No matter how fast
	  the link in terms of raw speed, if there are significant
	  delays then communication slows down.  Third, both the
	  client and the server spend a lot of time idle, waiting
	  for a reply from the other side.  This idle time could
	  be put to better use.

	  CVSup completely eliminates this common problem, by means
	  of a streaming protocol that involves no back-and-forth
	  communication.  Here is a diagram that shows how it works:

	      +-----------+
	      |  Lister   |
	      +-----+-----+
	            |                           +-------------+
	            +-------------------------->| Tree Differ |
	                                        +------+------+
	      +-----------+                            |
	      | Detailer  |<---------------------------+
	      +-----+-----+
	            |                           +-------------+
	            +-------------------------->| File Differ |
	                                        +------+------+
	      +-----------+                            |
	      |  Updater  |<---------------------------+
	      +-----------+

		  CLIENT                            SERVER

	  The five boxes represent independent threads of control.
	  The three threads on the left make up the client, while
	  the two on the right make up the server.  All of the
	  threads run at the same time, in parallel.  They are not
	  processes that are forked; they're actually implemented
	  as application-level threads.

	  The lines connecting the threads represent one-way network
	  communication channels.  Each is implemented using half
	  (one direction) of a TCP connection.

	  The Lister is responsible for telling the server which
	  files are present on the client, and when each was last
	  modified.  It performs a tree walk over the client's
	  repository, and sends the name and modification time of
	  every file to the Tree Differ in the server.  It sends
	  this information intelligently, maintaining a directory
	  stack on each side of the connection, so that only the
	  basenames of the files need to be sent.

	  Meanwhile, on the server, the Tree Differ performs its
	  own tree walk over the server's repository.  It compares
	  its filenames and modification times with those received
	  from the client, and determines what has been added,
	  deleted, or (potentially) modified.

	  Both the Lister and the Tree Differ generate their filename
	  lists in a sorted order.  They do this by sorting the
	  contents of each directory, as it is read.  This is very
	  important, because it allows the Tree Differ to do its
	  work in real time.  Because each list is sorted, the Tree
	  Differ can recognize instantly when a file is missing
	  from either list, or when the modification times disagree.
	  It does it in much the same way as the Unix utility
	  COMM(1).

	  The sorting helps in another way, too.  The sort order
	  is chosen in such a way that Attic files come out close
	  to the corresponding files in the live repository.  That
	  enables the Tree Differ to recognize easily when a file
	  has moved into or out of the Attic.

	  As the Tree Differ finds files that have been added,
	  deleted, moved into or out of the Attic, or touched, it
	  immediately sends detail requests to the client's Detailer.
	  Detail requests say, "Here is the name of a file that I
	  need more information about."  Of course, detail requests
	  are generated only for the small fraction of files that
	  might have possibly changed.

	  The Detailer's job is to analyze files that have been
	  touched, and send the relevant details to the server.
	  What is sent depends on the kind of file.  For an RCS
	  file, the Detailer sends a list of all the revisions and
	  their associated check-in times.  It also sends a list
	  of symbolic tags, and other administrative information.
	  For a non-RCS file, the Detailer sends the size of the
	  file and its MD5 checksum, to enable the server to detect
	  appends.  All this information streams across a one-way
	  channel to the server's File Differ.

	  The File Differ receives file details from the Detailer,
	  performs a similar analysis on the corresponding files
	  on the server, and determines what, if anything, is
	  different between the two versions.  For an RCS file, it
	  might determine that one or more revisions or symbolic
	  tags, or perhaps even entire branches, have been added
	  on the server.  If so, then it generates edit requests
	  and sends them to the Updater.  The edit requests typically
	  contain diffs extracted from the RCS file, as described
	  above.

	  The Updater receives edit requests from the File Differ,
	  and acts upon them.  It is the component that actually
	  updates the files.  Like RCS, it emits each updated
	  version to a temporary file, then moves it atomically on
	  top of the original.

	  The structure described here makes CVSup almost immune
	  to network delays.  After the brief start-up handshake,
	  all communication is strictly one-way.  With three parallel
	  threads on the client, and two on the server, something
	  is virtually always happening on each side of the network.
	  Very little time is wasted in the client or the server.

	- Balanced communication.  Since a network link is typically
	  full-duplex, it is wasteful to send data through it in
	  only one direction.  Yet, that is what sup does, most of
	  the time.  Since file updates propagate from the server
	  to the client, it would seem that the bulk of the network
	  traffic must move in that direction.

	  A principal design goal of CVSup was to make as much use
	  as possible of the reverse channel from the client to
	  the server.  That is where the free bandwidth lies.  In
	  every case where there is a choice, CVSup sends information
	  from the client to the server.  Thus, the client sends
	  its list of files to the server, and the client sends
	  its file details to the server.  This approach has been
	  remarkably effective.  In fact, it is not unusual for
	  CVSup's reverse network traffic to exceed its forward
	  traffic.

	- Compression.  CVSup uses "zlib" to compress the data on
	  each of its four one-way channels.  Despite the savings
	  from sending diffs and so forth, the zlib compression is
	  still very effective.  On each of the three compressed
	  channels, zlib achieves a compression ratio of 65-75%,
	  calculated using the standard formula:

	    ratio = (plain text - compressed text) / plain text

	  In other words, 65-75% of the bytes don't have to be sent
	  down the wire.

    * Server efficiency.  CVSup tries to do as much work as possible
      on the client, rather than on the server.  That makes sense --
      the client is only dealing with a single server, but the server
      may be dealing with many clients.  Luckily, this design choice
      goes together well with the goal of maximizing the flow of
      information from the client to the server.

      There are two other ways in which CVSup reduces the load on the
      server:

	- Minimal forking.  The server forks only once per client
	  connection.  It doesn't execute any other programs.  It
	  is completely self-contained.

	- No diffing.  The server doesn't have to generate diffs,
	  as is necessary for CTM.  All the CVSup server does is
	  extract existing diffs from RCS files.  That can be done
	  very quickly.  Even reversing the main-branch diffs is
	  a simple, linear-time operation.

--
  John Polstra
  jdp@polstra.com

Copyright 1996-2003 John D. Polstra
$Id: Blurb,v 1.4 2003/03/04 19:26:05 jdp Exp $
