A Study On English Efficiency

Each word in English communicates a tome of thought, and has been used in measureless quantity during the years of our history. Most English is fairly efficient, whether written or spoken. An analysis of text can reveal the effectiveness of words in comparison to the size of a simple, number-based "language."

The procedure involved will "read" a text. At each word, defined as a string of non-whitespace characters, the algorithm acts in one of three ways. If the word is recognized, its assigned number is substituted in its place. If the word is not recognized, its length is compared with the number of digits in the base-10 form of N, the number of words recognized so far (the code below computes this as Math.ceil(Math.log(n+2)/Math.log(10)), which is the digit count of N+1). If the word is longer than that, it is recognized and assigned the value N for future reference; otherwise nothing is recorded. In both of these latter cases, the word itself is left unchanged in the output.

function wordz(s) {
	var h = {}, n = 0, z = s.match(/\S+|\s+/g);
	for (var i=0; i<z.length; i+=2) {
		if (h[z[i]])
		 	z[i] = h[z[i]];
		else if (z[i].length > Math.ceil(Math.log(n+2)/Math.log(10)))
			h[z[i]] = (n++)+'';
	}
	return z.join('');
}
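To make the three cases concrete, here is a minimal tracing sketch. It follows the same rules as wordz but also returns the table of recognized words. The names threshold and traceWordz are hypothetical and are not part of the article's code, and the sketch assumes, as wordz does, that the text begins with a word rather than with whitespace.

```javascript
// Hypothetical helper: the length a word must exceed before it is recorded.
// Same expression as in wordz above, with n = number of words recorded so far.
function threshold(n) {
	return Math.ceil(Math.log(n + 2) / Math.log(10));
}

// Hypothetical tracing variant of wordz: same rules, but it also returns
// the table of recorded words so each substitution can be inspected.
function traceWordz(s) {
	var table = {}, n = 0, z = s.match(/\S+|\s+/g) || [];
	for (var i = 0; i < z.length; i += 2) {   // even indexes hold the words
		var word = z[i];
		if (table[word])
			z[i] = table[word];               // case 1: recognized, substitute
		else if (word.length > threshold(n))
			table[word] = (n++) + '';         // case 2: record, leave unchanged
		// case 3: too short to be worth recording, leave unchanged
	}
	return { output: z.join(''), table: table };
}

var r = traceWordz('the cat and the dog and the bird');
console.log(r.output); // the cat and 0 dog 2 0 bird
console.log(r.table);  // { the: '0', cat: '1', and: '2', dog: '3', bird: '4' }
```

The first occurrence of each word passes through untouched and only later repeats are replaced, which is why short texts with few repeats gain little.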

The "decompression" process reverses the word-by-word operations of the above. Instead of putting numbers in place of words, it rebuilds the same table of recognized words as it reads, and replaces any token that matches an assigned number with the corresponding word.

function wordu(s) {
	var h = {}, n = 0, z = s.match(/\S+|\s+/g);
	for (var i=0; i<z.length; i+=2) {
		if (h[z[i]])
			z[i] = h[z[i]];
		else if (z[i].length > Math.ceil(Math.log(n+2)/Math.log(10)))
			h[n++] = z[i];
	}
	return z.join('');
}
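Whether decompression restores a text exactly can be checked with a round trip. The sketch below repeats the two functions from the listings above so that it runs on its own; roundTrips is a hypothetical helper name. It also shows one case where, as far as the listings go, the round trip appears to fail: a text that already contains a bare number equal to an assigned index.

```javascript
// The two functions, repeated from the listings above so this sketch runs alone.
function wordz(s) {
	var h = {}, n = 0, z = s.match(/\S+|\s+/g);
	for (var i = 0; i < z.length; i += 2) {
		if (h[z[i]])
			z[i] = h[z[i]];
		else if (z[i].length > Math.ceil(Math.log(n + 2) / Math.log(10)))
			h[z[i]] = (n++) + '';
	}
	return z.join('');
}

function wordu(s) {
	var h = {}, n = 0, z = s.match(/\S+|\s+/g);
	for (var i = 0; i < z.length; i += 2) {
		if (h[z[i]])
			z[i] = h[z[i]];
		else if (z[i].length > Math.ceil(Math.log(n + 2) / Math.log(10)))
			h[n++] = z[i];
	}
	return z.join('');
}

// Hypothetical helper: true when decompression restores the text exactly.
function roundTrips(s) {
	return wordu(wordz(s)) === s;
}

console.log(roundTrips('the cat and the dog and the bird')); // true
console.log(roundTrips('aa bb 0'));                          // false
console.log(wordu(wordz('aa bb 0')));                        // aa bb aa
```

In the second text the literal "0" is too short to be recorded, so the compressor passes it through, and the decompressor then reads it as a reference to the first recorded word.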

Results

The article above (including the source examples) is compressed to the following degree.

in=1608; out=1372; difference=15%

 time: 0.029 sec
speed: 55448 B/sec
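The figures above can be reproduced from the two sizes and the elapsed time. The sketch below is inferred from the quoted numbers rather than taken from the article: it assumes the difference is the percentage of the input size that was saved, and that the speed is input bytes per second. The helper name summary is hypothetical.

```javascript
// Hypothetical helper reproducing the summary figures from sizes and elapsed time.
function summary(inBytes, outBytes, seconds) {
	return {
		// percentage of the input size saved (negative when the output is larger)
		difference: Math.round((inBytes - outBytes) / inBytes * 100),
		// input bytes processed per second
		speed: Math.round(inBytes / seconds)
	};
}

console.log(summary(1608, 1372, 0.029)); // { difference: 15, speed: 55448 }
```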

Because the compression technique applies to any run of non-whitespace characters, it works well on any kind of data whose whitespace-separated substrings repeat. The functions defined above, for example, are reduced quite well.

function wordz(s) {
	var h = {}, n = 0, z = s.match(/\S+|\s+/g);
	for (var i=0; i<z.length; i+=2) {
		if (h[z[i]])
		 	z[i] = h[z[i]];
		else if (z[i].length > Math.ceil(Math.log(n+2)/Math.log(10)))
			h[z[i]] = (n++)+'';
	}
	return z.join('');
}

0 wordu(s) {
	2 h = 3 n = 4 z = 5
	6 7 8 9 10 {
		if 11
			12 = 13
		14 if 15 > 16
			h[n++] = z[i];
	}
	19 20
}

Notice that, while the first function is left unchanged, the information it "gives" allows the second function to be greatly abbreviated. This particular example required 0.008 sec for compression and 0.007 sec for decompression, and shows a performance summary of in=489; out=361; difference=26%. If the input is allowed to repeat, the eventual performance approaches in=13720; out=5816; difference=58%, an experiment which was performed in 0.130 sec (105,538 B/sec). Decompression was performed in 0.123 sec: in=5816; out=13720; difference=-136%.
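The effect of repetition can be explored with a small sketch. The helper savingsWhenRepeated is hypothetical. It takes the compressor as a parameter and joins the copies with newlines, which is an assumption: the article does not say how its repeated input was built, so the figures it prints will not match those quoted above.

```javascript
// Hypothetical helper: percentage saved when `copies` copies of `text`,
// joined by newlines, are passed through a compressor such as wordz.
function savingsWhenRepeated(compress, text, copies) {
	var parts = [];
	for (var i = 0; i < copies; i++) parts.push(text);
	var input = parts.join('\n');
	var output = compress(input);
	return Math.round((input.length - output.length) / input.length * 100);
}

// Runs only where wordz from the article is already defined.
if (typeof wordz === 'function') {
	var sample = 'the cat and the dog and the bird';
	console.log(savingsWhenRepeated(wordz, sample, 1));
	console.log(savingsWhenRepeated(wordz, sample, 10));
}
```

Every copy after the first consists almost entirely of recognized words, so the saving grows with the number of copies and then levels off once the fixed cost of the first copy becomes negligible.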


Joseph Myers, e_mayilme @ hotmail.com.