URLLIST
The idea behind URLLIST is to 'remember' all the locations to which
we have connected and use them either to find a component that we need
to populate (`nested_populate()`) or just to check whether it is
available in some remote location (`urllist_find()`).
What follows is a brief description of the URLLIST API, the on-disk
file formats it uses, and how `nested_populate()` and `urllist_find()`
make use of them.
The problem of finding which components a given URL has and on which
URLs a given component can be found is inherently a ''many-to-many''
problem (or n x m, where n is the number of URLs and m is the
number of components).
On-Disk File Format
For backward compatibility reasons, and also to avoid duplicating information, we keep the information in two separate files.
 * `BitKeeper/log/urllist` - a KV file that contains component rootkeys as keys and all the URLs on which we’ve seen that component as values. There is no other information in this file. Occasionally, you might see a timestamp at the end of the stored URLs; older BKs (bk-5.1) used to store it. It is deprecated.
 * `BitKeeper/log/urlinfo` - a KV file that contains URLs as keys and information about the URLs as data (fields separated by newlines). BitKeeper will read the fields it knows about and ignore extra fields. The order of the fields must be preserved.
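As a rough illustration of the "read the known fields, ignore the extras" rule, the sketch below parses one urlinfo value. The field order (time, gate, repoID) is taken from the numbered comments in the `urlinfo` struct shown under Data Structures. The decimal encoding of the time and gate fields is an assumption the notes do not confirm, and the `info_fields` type and `parse_info()` function are hypothetical names, not BitKeeper code.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAX_EXTRA 8	/* arbitrary cap for this sketch */

/* Hypothetical holder for one parsed urlinfo value. */
typedef struct {
	time_t	time;			/* field 1: last successful connection */
	int	gate;			/* field 2: is it a gate? */
	char	*repoID;		/* field 3 */
	char	*extra[MAX_EXTRA];	/* unknown trailing fields, kept verbatim */
	int	nextra;
} info_fields;

/*
 * Split a newline-separated value in place and fill in `out`.
 * `data` is modified (newlines become NULs) and must outlive `out`.
 * Returns the number of fields seen.
 */
int
parse_info(char *data, info_fields *out)
{
	int	n = 0;
	char	*p = data, *nl;

	assert(data && out);
	memset(out, 0, sizeof(*out));
	while (p) {
		if ((nl = strchr(p, '\n'))) {
			*nl++ = 0;
		} else if (!*p) {
			break;	/* nothing after the final newline */
		}
		switch (++n) {
		    case 1: out->time = (time_t)strtol(p, 0, 10); break;
		    case 2: out->gate = atoi(p); break;
		    case 3: out->repoID = p; break;
		    default:	/* fields we do not know about */
			if (out->nextra < MAX_EXTRA) {
				out->extra[out->nextra++] = p;
			}
			break;
		}
		p = nl;
	}
	return (n);
}
```

Because unknown fields are kept rather than dropped, a writer built on the same idea could emit them again after the known fields, which is one way to satisfy the requirement that field order be preserved.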
Commands which use and/or alter state
Whenever a urllist is fetched from a remote source (clone, pull), the URLs are translated using the URL of the remote. So a file URL on the remote may get translated to a bk:// URL if the remote was accessed through a bk:// URL.
 * clone - brings in urllist but not urlinfo.
 * rclone - updates local urllist and copies it to the remote.
 * push - updates local urllist.
 * pull - pulls the urllist from the remote and integrates it, considering only the remote.
 * populate - can clean out urllist.
 * unpopulate - if no -f, then will probe and clean the urllist.
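The URL translation described at the start of this section can be sketched as follows. This is a minimal illustration, not the real implementation: `translate_url()` is a hypothetical name, it assumes only `file://` URLs need rewriting, and it reproduces the `file://foo/bar/baz` to `bk://server//foo/bar/baz` example given for `urlinfo_load()` under API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: rewrite a remote-local file:// URL so that it is
 * reachable through `base`, the scheme and host we used to reach the
 * remote (e.g. "bk://server").  Other URLs are copied unchanged.
 * Returns 0 on success, -1 if `out` is too small.
 */
int
translate_url(const char *base, const char *url, char *out, size_t len)
{
	int	n;

	assert(base && url && out);
	if (strncmp(url, "file://", 7) == 0) {
		/* "file://foo/bar/baz" -> base + "/" + "/foo/bar/baz" */
		n = snprintf(out, len, "%s/%s", base, url + 6);
	} else {
		n = snprintf(out, len, "%s", url);
	}
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}
```

The caller supplies the output buffer, so a truncated result is reported as an error instead of being stored as a wrong URL.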
Complexity in a pull
`cd A; bk pull ../B`
A typical component, say comp1, can be (roughly independent dimensions):
 * locally populated or unpopulated or non existent
 * remotely populated or unpopulated
 * have remote changes or not
 * have local changes or not
 * local repo here uses an alias that changes causing populate or unpopulate or no change.
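To get a feel for the size of this state space, the dimensions listed above can be enumerated mechanically. The sketch below treats them as fully independent, which the text only claims roughly, so the total of 3 x 2 x 2 x 2 x 3 = 72 cases is an upper bound and some combinations may not occur in practice. All type and function names here are hypothetical.

```c
#include <assert.h>

enum local_state  { L_POPULATED, L_UNPOPULATED, L_NONEXISTENT, L_NSTATES };
enum remote_state { R_POPULATED, R_UNPOPULATED, R_NSTATES };
enum alias_effect { A_POPULATE, A_UNPOPULATE, A_NOCHANGE, A_NSTATES };

/* One point in the (assumed independent) state space of a component. */
typedef struct {
	enum local_state  local;
	enum remote_state remote;
	int		  remote_changes;	/* 0 or 1 */
	int		  local_changes;	/* 0 or 1 */
	enum alias_effect alias;
} comp_case;

typedef void (*case_fn)(const comp_case *c, void *token);

/* Call fn (if non-null) for every combination; return how many there are. */
int
walk_cases(case_fn fn, void *token)
{
	comp_case c;
	int	l, r, rc, lc, a, n = 0;

	for (l = 0; l < L_NSTATES; l++) {
	    for (r = 0; r < R_NSTATES; r++) {
		for (rc = 0; rc < 2; rc++) {
		    for (lc = 0; lc < 2; lc++) {
			for (a = 0; a < A_NSTATES; a++) {
				c.local = (enum local_state)l;
				c.remote = (enum remote_state)r;
				c.remote_changes = rc;
				c.local_changes = lc;
				c.alias = (enum alias_effect)a;
				if (fn) fn(&c, token);
				n++;
			}
		    }
		}
	    }
	}
	return (n);
}

/* Example visitor: count cases with both local and remote changes. */
void
count_both_changed(const comp_case *c, void *token)
{
	assert(c && token);
	if (c->local_changes && c->remote_changes) (*(int *)token)++;
}
```

A visitor like `count_both_changed()` selects the cases in which a component has both local and remote changes; under the independence assumption that is 18 of the 72.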
The urllist is pulled in from the remote, interpreted from a local perspective, and checked to see whether the pull can be considered safe.
''NOTE: the saved urllist can be wrong.'' The probes done during the safe and populate parts of pull use the remote namespace, which is correct for the components getting populated, but not correct for components with local changes which aren’t involved in the pull. The havekeys will cause components that have local changes relative to the pull, but that match a known gate, to have the gate pulled from the urllist.
Now, if something has local and remote changes, the final resolve will cause the component to get flushed from the urllist.
If a remote is populated and the local is not, then the component is part of the pull, but not part of what is actually pulled, so will be part of the pull-safe test. If the whole pull succeeds, then the urlinfo will be right. If the pull fails, then the urlinfo will represent the remote level, which is a superset of the local level (otherwise the pull would have failed, right?), and that will be fine.
[rick] Has anybody done this type of state space analysis of pull and other operations? The more I understand, the more holes I see, and see some of my earlier suggestions were misdirected.
Data Structures
The data structures that support the on-disk format are slightly different and have been designed to be easier to work with than the on-disk hashes.
{{{#!cplusplus numbers=off
typedef struct {
	// normalized url, output of remote_unparse() with any leading
	// file:// removed
	char	*url;

	// map comp struct pointers to "1" if remote has the needed
	// component tipkey or "0" if they have the component, but not
	// the tipkey.
	hash	*pcomps;	/* populated components found in this URL */
	u32	checked:1;	/* have we actually connected? */
	u32	checkedGood:1;	/* was URL probe successful this time? */
	u32	noconnect:1;	/* probeURL failed for connection problem */

	// From URLINFO file, extras are ignored
	time_t	time;		/* 1 time of last successful connection */
	int	gate;		/* 2 is it a gate?*/
	char	*repoID;	/* 3 */
	char	**extra;	/* extra data we don't parse */
} urlinfo;
}}}
Besides this new `urlinfo` structure, a couple of fields have been
added to the `nested` structure.
{{{#!cplusplus numbers=off
struct nested {
	/* fields elided */

	// urllist
	u32	list_loaded:1;	// urllist file loaded
	u32	list_dirty:1;	// urllist data changed
	char	**urls;		// array of urlinfo *'s
	hash	*urlinfo;	// info about urls

	/* fields elided */
};
}}}
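The notes do not spell out how `list_loaded` and `list_dirty` are used. One plausible reading, assumed here, is the usual lazy-load and write-if-dirty pattern. The sketch uses a hypothetical `mock_nested` stand-in and made-up function names, with counters in place of real file I/O.

```c
#include <assert.h>

/* Stand-in for the two bits added to struct nested; not the real type. */
typedef struct {
	unsigned list_loaded:1;	/* urllist file loaded */
	unsigned list_dirty:1;	/* urllist data changed */
	int	nurls;		/* stand-in for the real URL data */
	int	loads, writes;	/* counters so the sketch can be checked */
} mock_nested;

/* Load at most once; later calls are no-ops. */
void
mock_load(mock_nested *n)
{
	assert(n);
	if (n->list_loaded) return;
	n->loads++;		/* real code would read the KV files here */
	n->list_loaded = 1;
	n->list_dirty = 0;
}

/* Any mutation loads first, then marks the data dirty. */
void
mock_add_url(mock_nested *n)
{
	mock_load(n);
	n->nurls++;
	n->list_dirty = 1;
}

/* Write only if something changed; returns 1 if a write happened. */
int
mock_write(mock_nested *n)
{
	assert(n);
	if (!n->list_loaded || !n->list_dirty) return (0);
	n->writes++;		/* real code would rewrite the KV files here */
	n->list_dirty = 0;
	return (1);
}
```

Under this reading, commands that only read the list never touch the files on disk, and repeated modifications within one command cost a single write.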
API
 * `urlinfo_load()` - initializes the in-memory data structures from the on-disk KV files. Optionally, it can 'normalize' the URLs in the KV files with respect to some `remote` structure. This means that URLs of the form `file://foo/bar/baz` will be translated to `bk://server//foo/bar/baz`. This is similar to what BAM does.
 * `urlinfo_buildArray()` - an auxiliary function that initializes the `n->urls` array.
 * `urlinfo_urlArgs()` - occasionally we need to check URLs beyond those in the KV files, e.g. URLs that were passed by the user as arguments to some command (`-@<url>`). This helper function allows us to add them to the `n->urls` array.
 * `urlinfo_write()` - saves the in-memory representation of the URLs to the two on-disk KV files.
 * `urlinfo_set()` and `urlinfo_get()` - accessors for getting at the `urlinfo` struct corresponding to a given URL. They use a hash, so lookup is O(1).
 * `urlinfo_setFromEnv()` - there are some occasions where the information about the URL is in the environment passed by the BKD. Rather than having a more complicated call to `urllist_set()`, this helper function digs out the information from the environment and fills in the URL information accordingly (e.g. `BKD_GATE`).
 * `urlinfo_addURL()` and `urlinfo_rmURL()` - maintenance functions to add URLs to, or remove them from, the list.
 * `urlinfo_probeURL()` - establishes a connection to the BKD at the given URL and updates what we know about that URL, e.g. which components are populated on the BKD side, whether it is a gate or not, etc. The information is updated in the `n->urls` structure and the whole urllist is marked as dirty so that it can be saved.
 * `urllist_find()` - works as an iterator. Each time it is called, it returns a new URL for the given component. When no more URLs are known for the component, zero is returned. A typical use would be:
{{{#!cplusplus numbers=off
k = 0;
while ((url = urllist_find(n, comp, flags, &k))) {
	/* code here */
}
}}}
See `populate.c` for a real example.
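For readers without the source at hand, the cursor-style iteration above can be mimicked with a small stand-alone mock. It assumes the fourth argument of `urllist_find()` is a resume position that the callee advances, which is what the usage pattern suggests but the notes do not state. `mock_find()`, `count_urls()`, and the flat `comp_url` table are hypothetical and much simpler than the real data structures.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flat table of (component, url) pairs. */
typedef struct {
	const char *comp;
	const char *url;
} comp_url;

/*
 * Return the next URL recorded for `comp`, resuming at *k, or 0 when
 * the table is exhausted.  *k is advanced past the returned entry.
 */
const char *
mock_find(const comp_url *tab, int ntab, const char *comp, int *k)
{
	assert(tab && comp && k && *k >= 0);
	while (*k < ntab) {
		const comp_url *e = &tab[(*k)++];

		if (strcmp(e->comp, comp) == 0) return (e->url);
	}
	return (0);
}

/* Count the URLs known for `comp` using the same loop shape as above. */
int
count_urls(const comp_url *tab, int ntab, const char *comp)
{
	int	k = 0, n = 0;

	while (mock_find(tab, ntab, comp, &k)) n++;
	return (n);
}
```

Keeping the cursor in the caller means the lookup function itself holds no state, so two loops over different components can be interleaved without interfering with each other.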