Joseph K. Myers

Sunday, May 30, 2004

(release date--was invented earlier)

Listswitch, Fast Closest Matching System

Using an in-order list such as


listswitch will write a C function, case-insensitive (optionally), a list assembly (optionally), and an enumeration (optionally) for differentiation between the given list of values.

The nearest match is returned, or no match, if it is impossible to associate the input with any list value.

The worst case time in particular is [n/u] log (u-1) + log (n - u[n/u]), where u is on [2, 256], depending on the nature of the in-order list, not the size.

Is this better than a time of log n? Let n = 3,000 and u = 26 (the most common value).

Then our function is 115 * 4.643856189774725 + 3.321928094887362. Clearly, this is worse than log n, which is 11.550746785383245.

However, each expenditure of listswitch is guaranteed constant cost, equivalent to one 8-bit comparison. Each expenditure of a log n algorithm is a comparison with increasing cost. Therefore, the equivalent cost of our function is precisely 537.365389 operations.

In this case, the expenditures of log n would come to (n-2)/(u-1) (at the final match) which makes sense, if one considers the binary search algorithm like this:

middle node: comparison must evaluate n/2 / (u-1) = 60 bytes before finding a difference.

next midpoint: comparison must evaluate n/2 / (u-1) + n/4 / (u-1) = 90 bytes.

next midpoint: n/2 / (u-1) + n/4 / (u-1) + n/8 / (u-1) = 105 bytes.

This progression continues to reach 119.92.

The geometric sequence has a = n/(u-1) and r = 2^(-1).

The sum for the first k terms is n/(u-1) * (1 + 1/2^(k)) * 2 = n/(u-1) * (2 + 1/2^(k-1)).

Finding the sum of sums for k = 1 to j, we have

sigma { k = 1, j, n/(u-1) * (2 + 1/2^(k-1) }

= n/(u-1) * sigma { k = 1, j, 2 + 1/2^(k-1) }

= n/(u-1) * (2*j + sigma { k = 0, j, (1/2)^k })

= n/(u-1) * (2*j + (1 - (1/2)^j)/(1 - 1/2))

= n/(u-1) * (2*j + 2 * (1 - (1/2)^j))

= n/(u-1) * (2*(j + 1) - 1/2^(j-1)).

j, of course, equals log n, so the sum is n/(u-1) * (2*(log(n) + 1) - (2^(-1))^(log(n)-1))

= n/(u-1) * (2*(log(n) + 1) - (2/n)).

Someone can find this sum in a web browser merely by pasting and executing the following URL in a web browser with JavaScript:

javascript:n=3000,u=26;log=function(x){return Math.log(x)/Math.log(2);};alert(n/(u-1)*(2*(log(n)+1)-(2/n)));

which is 3012.0992 operations.

It would help clarify things to differentiate between the concept of "decisions" and cost/time. A log n decision-making function makes log n/2 decisions on average, but it may perform variable-cost operations after each decision. A true binary search would really run in log n time, and this would be the exact worst case number of operations.

A less efficient binary search is easier to implement, partly due to the shortcomings--even of C--which place some distance between programmers and the bits. A binary search algorithm need never actually reevaluate bits. It is false therefore to assert that log n units of lowest cost can be surpassed, as the worst case.

In the log n function which is being discussed, a full comparison is made at each nodal point. In listswitch, a log n behavior is more closely followed, and so only a byte is compared. The costs shown for listswitch are always the full number of byte comparisons which must be made. The cost shown in the statement that a log n decision-making function makes log n/2 average decisions, does not refer to the bytes compared.

The way the log n search algorithm is implemented, using the definition of u as given previously, there are (2 * n - 1) * u / (u - 1) lowest units of lowest cost in the expected log n search.

Supposing that the log n function were properly implemented, we would ignore those costs.

Ignoring the costs, and paying attention to decision-making only, we get the following results:

The average log n time would be 2 * n * (1 - 0.5^(n/(u-1))) / log n.

The average time of ours is f(n) = log (p(n)-1) + f(n/u - 1), where p(n) is defined recursively by

p(0) = 0, p(l) = 1 + p(l-1) * (1 - 1/u), where f(n) disappears at p(l) <= 2.

(p(l), which of course, is a geometric sequence, is also u - u * ((u-1)/u)^l.)

Is this better than average behavior of log (n/2)? (We are ignoring the increased cost of O(N) operations within the ordinary log n binary search algorithm.)

For the same values, our function is log 25 + log 24.70716607839431 + log 2.2453545348310406, which is log (25 * 24.70716607839431 * 2.2453545348310406) = log (1386.9086849236578), or better than log (1500) = log(n/2).

Our function has an amortized cost of 10.437657087359923 for n = 3000, better than 10.550746785383243.

Our function gets much better as u diminishes, or as n increases to very large values. For instance, in the HTTP server, where u is 5, and n is 8*, we have O(1.6604445653683295), while log(8/2) is O(2).

(*Or at most, u = 16 and n = 47, to make O(3.8308135440330386), compared to average log (n/2), which is O(4.554588851677638).)

For our function, the time function and cost function and number of operations vary directly as a single function.

This is not to say that log n is evil; a log n algorithm has to be used within the 8-bit searches of listswitch, anyway.

To create an exact match, compare strings, which is an O(1) operation*.

(*Note: log(n/2) is simply log(n) - 1, which can pay for a low-level function.)


% listswitch [options] - list.txt

creates list.h and list.c, containing list(). The function name is variable, based on the name of list.txt. No processing will occur if list.txt possess no file extension. Standard input and output will be used if no arguments follow the hyphen, or if no hyphen is present, and no files will be created. Any argument containing a hyphen is equivalent to the hyphen, but the characters preceding the hyphen, within the same argument,are interpreted as options.

I sometimes suggest that 1) programs should use one-letter options, for the sake of typing, and 2) programs which can't use one-letter options should be split into two programs, because they perform too many functions. Performing many functions is not bad, but it is convenient many, many times to access them alone from the shell.

Listswitch does not sort its input. The input must be sorted and unique.

Options a (assemble input) i (case insensitive) and e (enumeration) are available. An enumeration allows one to program using variable names, not numeric output from list(). Since the variable names correspond to the numeric output from listswitch, adding contents to list.txt and regenerating list.c will not invalidate other source files, because the enumerated variable names within list.h will be recreated.

Option a assumes all input is valid C between " and ". Option e assumes that the alphabetic subset of the input never collides with other elements in the C program.


listswitch.tar.gz (1487 bytes)