Joseph K. Myers

Friday, October 25, 2002

Text Encoding

A proposal for the modification of present encoding systems.

The existing encoding methods are troublesome because replacement mechanisms written for ordinary (ASCII) text can fail in the presence of multi-byte characters; a legacy program cannot know where a character truly ends. For example, a wrapping algorithm may mistake part of a multi-byte character for a line break, and thus miscalculate line length. Nor can such characters be counted accurately by implementations that know only single-byte values in the range 0-255.
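The line-break failure can be reproduced with UTF-16, taken here only as one example of an existing multi-byte encoding (the text above does not name a particular one). This is a small sketch; the function name legacy_line_count is hypothetical and stands in for any byte-oriented legacy tool.

```python
def legacy_line_count(raw: bytes) -> int:
    """Count lines the way a byte-oriented tool might: split on byte 0x0A."""
    return len(raw.split(b"\n"))


# 'a', U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE), 'b'.
# The text itself contains no line break.
text = "a\u010ab"
raw = text.encode("utf-16-be")  # bytes: 00 61 01 0A 00 62

# The low byte of U+010A is 0x0A, so the tool sees a break that is not there.
spurious_lines = legacy_line_count(raw)
```

Here spurious_lines is 2 although the text is a single line, so any line length computed from the split would be wrong.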

My proposal is simple. All values 0-127 shall be unchanged. A value in the range 128-254 begins a sequence of bytes representing the number of a character in an unspecified character set. Every further byte of the sequence must also lie in the range 128-254. The value 255 is considered a dead byte: it follows the sequence and indicates the completion of a single non-ASCII character.
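A minimal sketch of an encoder and decoder for this scheme follows. The proposal does not say how a character number is spread over the bytes 128-254, so the sketch assumes one possibility: base-127 digits, most significant first, with digit d stored as byte 128+d. The names encode_char, encode and decode are hypothetical, and characters are given as plain integer ordinals.

```python
DEAD_BYTE = 0xFF   # 255: completes one non-ASCII character
BASE = 127         # assumption: digits 0..126 are stored as bytes 128..254


def encode_char(ordinal: int) -> bytes:
    """Encode one character number; 0-127 pass through unchanged."""
    if ordinal < 0:
        raise ValueError("ordinal must be non-negative")
    if ordinal < 128:
        return bytes([ordinal])
    digits = []
    while ordinal > 0:
        ordinal, d = divmod(ordinal, BASE)
        digits.append(0x80 + d)
    digits.reverse()  # assumption: most significant digit first
    return bytes(digits) + bytes([DEAD_BYTE])


def encode(ordinals) -> bytes:
    """Encode a sequence of character numbers into one message."""
    return b"".join(encode_char(o) for o in ordinals)


def decode(data: bytes) -> list:
    """Decode a message back into a list of character numbers."""
    out = []
    value = None  # None while outside a non-ASCII sequence
    for b in data:
        if b < 0x80:
            if value is not None:
                raise ValueError("ASCII byte inside an unfinished sequence")
            out.append(b)
        elif b == DEAD_BYTE:
            if value is None:
                raise ValueError("dead byte with no sequence before it")
            if value < 128:
                # Assumption: ASCII must always appear as a single byte.
                raise ValueError("ASCII value encoded as a sequence")
            out.append(value)
            value = None
        else:
            value = (value or 0) * BASE + (b - 0x80)
    if value is not None:
        raise ValueError("sequence not finished by a dead byte")
    return out
```

Under these assumptions, encode([72, 233]) gives the bytes 48 81 EA FF (hexadecimal): 72 is ASCII and unchanged, while 233 = 1*127 + 106 becomes the digits 1 and 106, stored as 0x81 and 0xEA, followed by the dead byte.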

Compression schemes for text in this encoding may represent byte strings as they wish, and only processes that deal with the data as the eye sees it need take any special notice of the encoding.

Notes:

It is assumed that there will never be a reason to transgress the canonicality of ASCII characters 0-127.

Byte is used to mean a "normal octet." Bit order is irrelevant.

String is used to mean a sequence of bytes. Message is used as a synonym wherever it reads more naturally.

Character counts need only count the ASCII bytes and the dead bytes. The sum of the two is the number of characters.
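The counting rule can be sketched in a few lines. The function name count_characters is hypothetical, and the sample bytes 0x81 0xEA stand for an arbitrary non-ASCII character; nothing here depends on how the character number is laid out within the sequence.

```python
def count_characters(message: bytes) -> int:
    """Count characters: every ASCII byte (0-127) plus every dead byte (255)."""
    return sum(1 for b in message if b < 0x80 or b == 0xFF)


# "caf", then one non-ASCII character (two sequence bytes and the dead byte).
sample = b"caf" + bytes([0x81, 0xEA, 0xFF])
sample_count = count_characters(sample)
```

The sample holds six bytes but four characters: three ASCII bytes and one dead byte.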

Algorithms matching any character within the ASCII range will never choke on non-ASCII characters encoded according to this proposal.
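The reason is that every byte of a non-ASCII character, the dead byte included, is 128 or greater, so a byte-wise comparison against an ASCII value cannot match inside one. A small sketch, with a hypothetical helper ascii_positions and hand-written sample bytes:

```python
def ascii_positions(data: bytes, target: int) -> list:
    """Return every index where the ASCII byte `target` occurs (bytes only)."""
    if not 0 <= target <= 127:
        raise ValueError("target must be an ASCII value (0-127)")
    return [i for i, b in enumerate(data) if b == target]


# "caf", one non-ASCII character (0x81 0xEA, then dead byte 0xFF),
# a line feed, then "bar".
message = b"caf" + bytes([0x81, 0xEA, 0xFF]) + b"\nbar"

line_feeds = ascii_positions(message, 0x0A)
lines = message.split(b"\n")
```

The only line feed found is the real one at index 6, and splitting on it leaves the non-ASCII character whole at the end of the first line.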

Determining whether a message uses the character set is as simple as detecting any byte value >127. Nor is an encoding specification required; an implementation may and should assume ASCII input until the occurrence of any value other than 0-127.
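The detection step might look like the following sketch, in which the function name first_non_ascii is hypothetical. An implementation could treat everything before the returned index as plain ASCII.

```python
from typing import Optional


def first_non_ascii(message: bytes) -> Optional[int]:
    """Return the index of the first byte above 127, or None if all ASCII."""
    for i, b in enumerate(message):
        if b > 127:
            return i
    return None
```

For a message that is ASCII throughout the result is None, and the message can be handled exactly as legacy text.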

There is presumed to be one character set definition in a single message.

The "character set" is presumed to be similar to ASCII, but far more exhaustive. It should consist of one ordinal value for each character, no skipped values, and no duplication.

Most transmission processes should not make any effort to interpret data except as a standard flow of octets.