Joseph K. Myers

Tuesday, November 12, 2002

Line, Normalizing Text

The idea of whitespace is central to the thought of "text." On a computer, however, whitespace is imaginary. The computer does not know that "hey bill" and "hey [space-space] bill" are one and the same. For this reason there is an infinite number of supposedly distinct variations of what amounts to precisely the same text, as you or I would see it.

The function of line is to collapse all this porridge of text and whitespace into a single, "normalized" line.

This special "line" representation has many more purposes than merely unwrapping text. It allows the computer to accurately compare two texts, without being misled by insignificant whitespace. It allows a section of text, the output of ls, the zip codes in your address list--it allows all of them to be changed from the context of "data" to the context of a paragraph or a sentence. It reformats line breaks, indentation, etc. into neat single spaces. Line performs the simple conversion of one kind of information into another.
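As a rough illustration of the comparison use, here is a minimal Python sketch. It is not the code of line itself: the names normalize and same_text are hypothetical, and it assumes that "whitespace" means the ASCII space, tab, newline, carriage return, form feed, and vertical tab.

```python
import re

# Assumption: whitespace is the ASCII set below; line itself may differ.
WS_RUN = re.compile(r"[ \t\n\r\f\v]+")

def normalize(text):
    """Collapse each run of whitespace into one space."""
    return WS_RUN.sub(" ", text)

def same_text(a, b):
    """Compare two texts, ignoring how their whitespace is laid out."""
    return normalize(a) == normalize(b)
```

Under these assumptions, same_text("hey bill", "hey  \n bill") is true, while a text with leading whitespace still differs from one without, because the leading run is reduced to one space rather than removed.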

Mathematical notes:

Line reduces each sequence of whitespace to a single space. That is, the series P of non-whitespace values ("words") will be represented by the members of P joined with spaces (hexadecimal 20). However, the first and last members of P may be "empty": a message may begin or end with whitespace, which is reduced to a single space but not deleted.
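The note above can be sketched directly in Python as "split into P, then join with spaces." This is only an illustration under assumptions, not the actual source of line: the function names words and line are hypothetical, and the whitespace set is assumed to be the ASCII one.

```python
import re

# Assumption: whitespace is ASCII space, tab, newline, CR, form feed, VT.
WS_RUN = re.compile(r"[ \t\n\r\f\v]+")

def words(text):
    """Return the series P: the pieces of text between whitespace runs.

    The first or last member is the empty string when the text begins
    or ends with whitespace.
    """
    return WS_RUN.split(text)

def line(text):
    """Join the members of P with single spaces (hexadecimal 20)."""
    return " ".join(words(text))
```

Because the empty first and last members are kept, joining them puts exactly one space at the start or end of the result, which matches the "reduced, but not deleted" behavior described above.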

Download:

line.tar.gz (457 bytes)

Compile line with the usual "make" command if desired (with the provided Makefile).

Install into /usr/local/bin or its equivalent.

Line is used like any other filter:

line [< file] [> destination]

Line may be used in combination with wrap, e.g., line | wrap, in order to rewrap garbled text (see wrap.txt).
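To show what "filter" means here, the following Python sketch copies a byte stream while collapsing whitespace runs, carrying one flag across reads so that a run split between two chunks still yields a single space. It is a sketch under assumptions, not the program in line.tar.gz: filter_stream, WHITESPACE, and chunk_size are hypothetical names, and the whitespace set is assumed to be the ASCII one.

```python
# Assumption: whitespace is ASCII space, tab, newline, CR, form feed, VT.
WHITESPACE = b" \t\n\r\f\v"

def filter_stream(src, dst, chunk_size=65536):
    """Copy src to dst, writing one space for each whitespace run."""
    in_space = False  # True while the previous byte was whitespace
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        out = bytearray()
        for byte in chunk:
            if byte in WHITESPACE:
                if not in_space:
                    out.append(0x20)  # first byte of a run: emit one space
                    in_space = True
            else:
                out.append(byte)
                in_space = False
        dst.write(bytes(out))
```

Calling filter_stream(sys.stdin.buffer, sys.stdout.buffer) after importing sys would let such a script stand in a pipeline the same way, reading standard input and writing standard output.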

Performance:

The best performance of line on this computer is about 1 MB in 0.02 s, or roughly 50 MB per second.

http://www.myersdaily.org/joseph/unix/line.txt