Request for String.charCodeArray() method in ECMA-SCRIPT

ECMA-SCRIPT programs that treat strings as numbers have no efficient way to obtain the character encodings used in encryption and message digests, and such computations are slow. Both issues may be resolved by providing the inverse of String.fromCharCode(): a method that translates a string into a numeric array of either UTF-8 or UTF-16 values.

Synopsis

Although JavaScript and ECMA-SCRIPT are scripting languages, it is common for them to be applied to numeric and mathematical purposes, where often strings are treated as numbers.

Problem identification

The following was noted in the source code of a recent MD5 message-digest algorithm for JavaScript:

/* there needs to be support for Unicode here,
 * unless we pretend that we can redefine the MD5
 * algorithm for multi-byte characters (perhaps
 * by adding every four 16-bit characters and
 * shortening the sum to 32 bits). Otherwise
 * I suggest performing MD5 as if every character
 * was two bytes--e.g., 0040 0025 = @%--but then
 * how will an ordinary MD5 sum be matched?
 * There is no way to standardize text to something
 * like UTF-8 before transformation; speed cost is
 * utterly prohibitive. The JavaScript standard
 * itself needs to look at this: it should start
 * providing access to strings as preformed UTF-8
 * 8-bit unsigned value arrays.
 */

This note shows that the main need for such a function is to allow conforming behavior in applications dealing with character code information; the speed benefit is a side effect only.

Application example

Speed only

Consider the character-linear (order-independent) message digest formula:

D = sigma (sqrt(M[i]))

implemented as

function sqrtdigest(m) { // typeof(m) == "string"
  var D = 0, i;
  for (i = 0; i < m.length; i++)
    D += Math.sqrt(m.charCodeAt(i));
  return D;
}

However, this form does not suit the way ECMA-SCRIPT is constructed. The charCodeAt() method of the string must be applied to each character in turn, which makes the code far less efficient than the same loop over a strictly numeric array.

Let the following two values represent the same arbitrary characters:

M1 = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
+ 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa';
M2 = [];
for (i=0; i<100; i++) M2[i] = 97;

The first is a normal ECMA-SCRIPT string. The second is the equivalent array of precomputed character codes.

Timing the digest computation for each of them indicates which data type is better suited to this kind of operation.

T_1_0 = new Date();
for (i = 0; i < 100; i++)
  sqrtdigest(M1);
T_1 = new Date();
document.write('<p>', T_1 - T_1_0,
  ' ms for 10000 characters digest in string.</p>');

This was 427 ms, or 23 K/s.

function sqrtdigest2(m) { // typeof(m) == "object"
  var D = 0, i;
  for (i = 0; i < m.length; i++)
    D += Math.sqrt(m[i]);
  return D;
}
T_2_0 = new Date();
for (i = 0; i < 100; i++)
  sqrtdigest2(M2);
T_2 = new Date();
document.write('<p>', T_2 - T_2_0,
  ' ms for 10000 characters digest in array.</p>');

This was 114 ms, or 88 K/s.

The results show that speed is generally improved, but the primary reason the requested change is necessary is that some operations are otherwise impossible.
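The timing comparison assumes that the two inputs really are equivalent. A minimal sketch of that check follows: it derives the array from the string with charCodeAt() once and confirms that both functions return the same value. The helper name toCharCodes is hypothetical and is not part of the original test.

```javascript
// Digest over a string, one charCodeAt() call per character (as above).
function sqrtdigest(m) {
  var D = 0, i;
  for (i = 0; i < m.length; i++)
    D += Math.sqrt(m.charCodeAt(i));
  return D;
}

// Digest over an array of numbers (as above).
function sqrtdigest2(m) {
  var D = 0, i;
  for (i = 0; i < m.length; i++)
    D += Math.sqrt(m[i]);
  return D;
}

// Hypothetical helper: convert a string to its 16-bit codes once.
function toCharCodes(s) {
  var codes = [], i;
  for (i = 0; i < s.length; i++)
    codes[i] = s.charCodeAt(i);
  return codes;
}

var M1 = new Array(101).join('a'); // 100 copies of 'a'
var M2 = toCharCodes(M1);          // 100 copies of 97
var same = sqrtdigest(M1) === sqrtdigest2(M2);
```

The conversion itself costs one pass of charCodeAt() calls, so in user code it would only pay off when the codes are read more than once.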

Impossibility

Applications requiring access to the real encoding numbers of a string have another disadvantage. A C program, or any program with basic access to each byte, may assume that an encoding has already been determined for purposes such as computing a message digest. For ECMA-SCRIPT this is not the case, because the charCodeAt() method provides access only to the 16-bit Unicode values of each character. Computing the bytes of any other encoding in JavaScript is prohibitively slow.

The ECMA-SCRIPT specification considers strings to be arrays of 16-bit values only, accessible with String.charCodeAt(). A UTF-8 encoding of a string can therefore be generated only by user code, not by a built-in, which makes it so inconvenient as to be prohibitive for any mechanism to deal with a string encoded in UTF-8. This is silly, as the capability for encoding in UTF-8 is already built into most applications of ECMA-SCRIPT.
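One place where UTF-8 encoding is already built in is encodeURIComponent(), which percent-escapes the UTF-8 bytes of its argument. The sketch below, under the hypothetical name utf8BytesViaUri, recovers those bytes in user code. It is meant to show how roundabout the user-level route is, not to be a recommended implementation, and it assumes the string contains no unpaired surrogates, for which encodeURIComponent() throws a URIError.

```javascript
// Hypothetical helper: UTF-8 bytes of a string, recovered from the
// percent-escaped output of the built-in encodeURIComponent().
function utf8BytesViaUri(s) {
  var encoded = encodeURIComponent(s);
  var bytes = [], i;
  for (i = 0; i < encoded.length; i++) {
    if (encoded.charAt(i) === '%') {
      // "%XX" stands for one byte written as two hex digits.
      bytes.push(parseInt(encoded.substring(i + 1, i + 3), 16));
      i += 2;
    } else {
      // Unescaped characters are ASCII, so the code is the byte.
      bytes.push(encoded.charCodeAt(i));
    }
  }
  return bytes;
}

var ascii = utf8BytesViaUri('@%');     // two single-byte characters
var accent = utf8BytesViaUri('\u00E9'); // one two-byte character
```

Every byte makes a detour through a string of hexadecimal digits here, which is the kind of overhead a built-in array method would avoid.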

Consider the case of a password. The digest of the sum of the digests of a password and a random string transferred between client and server may be used to provide authentication without encryption. The typical password consists of 8-bit characters, yet there is no reason it should not be in an encoding of mixed 8-bit and multi-byte characters. A script cannot easily transform 16-bit Unicode to UTF-8 for compatible purposes of cryptography, message digests, or any other mathematical process dealing with a string encoding.
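To illustrate why the encoding matters here, the sketch below runs a deliberately trivial checksum over two byte sequences for the one-character password U+00E9 (e with acute accent): the two bytes of its 16-bit code, and its two UTF-8 bytes. The function toyDigest is a hypothetical stand-in for a real digest such as MD5, and the byte values are written out by hand as assumptions of the example.

```javascript
// Toy checksum for illustration only; NOT a real message digest.
function toyDigest(bytes) {
  var h = 1, i;
  for (i = 0; i < bytes.length; i++)
    h = (h * 31 + bytes[i]) % 65521;
  return h;
}

// The same one-character password, U+00E9, seen two ways.
var asUtf16 = [0x00, 0xE9]; // one 16-bit value, high byte first
var asUtf8  = [0xC3, 0xA9]; // the UTF-8 encoding

var clientValue = toyDigest(asUtf16); // what a script might compute
var serverValue = toyDigest(asUtf8);  // what a byte-oriented program might compute
var match = clientValue === serverValue; // false: the inputs differ
```

Any real digest would disagree in the same way, since the two sides are not hashing the same bytes.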

For this reason I suggest that ECMA-SCRIPT implement a complementary string method, String.charCodeArray(x), which would return an Array of the character encodings of the string. Acceptable values of x would be 8 for an 8-bit-number Array in UTF-8, or 16 for a 16-bit-number Array in UTF-16. The array would be given an attribute .charCode of either 8 or 16. The String.fromCharCode(y) method would also accept a charCodeArray as the argument y, in addition to manually separated values as now. Thus, String.fromCharCode('string'.charCodeArray()) == 'string'.
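The behavior described above can be approximated in user code. The following is only a sketch under stated assumptions: it is a free function named charCodeArray(s, x) rather than a String method, it rejects any x other than 8 or 16, and it encodes an unpaired surrogate as three bytes, a case the proposal does not address.

```javascript
// Hypothetical stand-in for the proposed 'string'.charCodeArray(x).
function charCodeArray(s, x) {
  var out = [], i, c, lo;
  if (x !== 8 && x !== 16) throw new Error('x must be 8 or 16');
  for (i = 0; i < s.length; i++) {
    c = s.charCodeAt(i);
    if (x === 16) { out.push(c); continue; }
    // Combine a surrogate pair into a single code point.
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length) {
      lo = s.charCodeAt(i + 1);
      if (lo >= 0xDC00 && lo <= 0xDFFF) {
        c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
        i++;
      }
    }
    // Emit one to four UTF-8 bytes for the code point.
    if (c < 0x80) {
      out.push(c);
    } else if (c < 0x800) {
      out.push(0xC0 | (c >> 6), 0x80 | (c & 0x3F));
    } else if (c < 0x10000) {
      out.push(0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F),
               0x80 | (c & 0x3F));
    } else {
      out.push(0xF0 | (c >> 18), 0x80 | ((c >> 12) & 0x3F),
               0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
    }
  }
  out.charCode = x; // the .charCode attribute described in the proposal
  return out;
}

var units = charCodeArray('string', 16);
// fromCharCode() does not accept an array today, hence apply().
var roundTrip = String.fromCharCode.apply(null, units) === 'string';
var bytes = charCodeArray('a\u20AC', 8); // 'a' followed by the euro sign
```

A built-in version could return the array without the per-character charCodeAt() calls that this sketch still has to make.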

Open issues and variations

It may be advisable to use String.charCode() in preference to String.charCodeArray(). Although the shorter name is less semantically apt for a method that returns an array, it is likely to be preferable in actual use, and there may later be an almost unanimous wish that the committee had not chosen the longer name when a shorter one was possible. The name needs to be determined before the initial specification of this addition to the String object in ECMA-SCRIPT. Presumably the shorter name, String.charCode(), should be used unless there is an overriding motivation otherwise.

The immutability of the returned charCodeArray is open for discussion. Access to an immutable character array may be critically faster, and immutability would allow the complexity of an ordinary array to be reduced to only the features necessary for 8-bit, unchanging values. However, this would require the addition of a new concept to the ECMA-SCRIPT language, and does not seem necessary. In current practice it is the style to generalize and combine rather than specialize--the allowing of non-numerical indices into an "array" is an example--and it would follow that immutability is not necessary and that it is best to return an ordinary array, in so far as is possible.

Notes

The digest formula is an example and not an actual digest method.

Timings: sqrtdigest

Joseph K. Myers; 2003/04/21; 14566 NW 110th St; Whitewater, KS 67154; USA; 316-799-2882; e_mayilme @ hotmail.com, jmyers @ lilly.csoft.net

http://www.myersdaily.org/joseph/javascript/string-charcode.html