Joseph K. Myers

Monday, March 17, 2003

I am thinking of designing my own server for WWW / free document service (i.e., websites).

Design is importantly a statement of what you want: your intentions, your desires. Fashion--would you like to see this or that? Programs--what would you make them do? I describe the nature of it.

3-18-04. I believe that certainly a feature should not be considered until the philosophy for it is well understood. I believe in knowing what you do before you program.

The server would not be chrooted. Why is it a security superiority to have this done? It isn't; it is a substitute for poor design. The abstraction of the file system is the perfect abstraction, with all security built into it--like the Constitution, if you need something, it is a problem with the file system, and the file system should be fixed--the Constitution should be amended--not the other way around.

(6-24-03. If anything, a chroot jail seems like a kill-fest or a rape. Designed to happen inside of that jail is a violation of all human life and value inside--for programs that run burning and raping. This is the only reason for jails in real life, and the only reason for computer jails. 3-18-04: it is, as I said, for poor design--but I suppose it is an appropriate place.)

The server would never allow access to anything unreadable by "everyone." An internet file has a server serving it--which normally is a "user", functioning as itself. However, the idea is ruined (simply because no one normally runs a webserver as "no user"--and operating systems do not like providing this proper solution--3-18-04: in my opinion, an operating system must support the semantics of a process which has no identity and owns nothing) by the fact that any user should be able to access a file. Jill may be able to access her credit card purchases, but so may Bob and Sally, if the file is on the internet. Therefore the internet server must honor restrictions, if present, on every document--if permission is not granted to "anyone" to read it, then the internet server will deny access to it.
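A minimal sketch of that rule, assuming a POSIX file system where "everyone" corresponds to the "other" read bit. The function name is_world_readable is hypothetical. The check deliberately ignores who the server itself is running as.

```python
import os
import stat


def is_world_readable(path):
    """Return True only if 'other' (everyone) has read permission on path."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # A missing or unreachable file is treated the same as an unreadable one.
        return False
    # Only the "other" bit counts; owner and group permissions are ignored,
    # even if the server happens to run as the owner.
    return bool(mode & stat.S_IROTH)
```

A file with mode 0644 would be served; the same file with mode 0600 would not, even if the server's own user owned it.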

6-24-03. The server would not supply specific status messages. It is a security hole to show someone that "/mail" was found on your server simply by saying "Permission Denied" instead of "Not Found." You don't think so? What if even the name itself were confidential information?--and remember, I believe that the server should be running inside the entire file system, with you free to link it to the web. What if the file named "I killed him" was found this way? Then the court would subpoena it.
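One way this might look, as a sketch only: every failure collapses into the same answer, so a response never reveals whether a private name exists. The name status_for and the choice of 404 as the single failure code are assumptions for illustration.

```python
import os
import stat


def status_for(path):
    """Return 200 for a world-readable file, and 404 for everything else.

    "Does not exist" and "exists but is not readable by everyone" are
    indistinguishable to the client; no 403 is ever produced.
    """
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return 404
    if mode & stat.S_IROTH:
        return 200
    return 404
```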

HTTP authentication is false authentication. The user is not really authorizing himself to the computer; rather, the internet server is masquerading as that user. The internet server may do something "automatically" which the user didn't want to do*, the server requires unnecessary "root" level authority (perfectly ridiculous, and unsafe, for an internet server) to do a user substitution--for the good reason that the realm of identity belongs only to the core of the computer, not a software program, in any case--and HTTP is not a computer login service, and simply cannot provide computer "authorization."

(*This is the fault of the programmer, though not necessarily a mistake. I am a programmer, but even with no mistakes, could I know what you wanted to do? No, I couldn't, and neither could any other programmer, and so none of us should claim to represent you as a user.)

The server won't do caching. Here again is the principle of the computer and the operating system doing it right. Modern computer systems cache all file system data--and organizing it with the open() method is the perfect access to cached data. It is no longer important nor necessary to do otherwise. Also, a server with a modern operating system is able to guarantee its security as well as the integrity of every read and every write, and a more reliable, lower-speed hard disk is able to be used.

3-18-04. read(fd, char b[], int length) can be replicated as mread(fd, char **b, int length), where memory within *b can't be modified by the program. It would be far more efficient to let the operating system do the caching without having to always do another copy-and-write in memory from the cached data to any program, and better than mmap.

8-26-03. Hosts would be mapped to directories in the configuration file. Aliases, subs, etc., would all be defined in terms of the file system. Anything that can be opened within the dir by anyone would be openable in the same manner by the server. Links can go anywhere.

9-26-03. The server would be configureless. Like a file system, it would have all of its "configuration" be part of the way it is actually used.

CGI scripts would actually be "delivery files," and anything with the x mode bit set would be a delivery file. Just to illustrate the performance of this technique, a delivery file's latency is as fast as the native C function call--millions of times per second, or billions.
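A sketch of the test for a delivery file, assuming "the x mode bit" means any of the owner, group, or other execute bits. Restricting the test to regular files is also an assumption, made because directories normally carry an x bit too. The name is_delivery_file is hypothetical.

```python
import os
import stat

# Any of the three execute bits counts (an assumption; the note says "the x mode bit").
ANY_EXEC = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def is_delivery_file(path):
    """Return True if path is a regular file with an execute bit set."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return False
    return stat.S_ISREG(mode) and bool(mode & ANY_EXEC)
```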

In order to be a delivery file (a program opened as a reused pipe instead of a file), some CGI scripts would need to be rewritten. Other scripts would simply be executed by a higher-level delivery file, which would really be running the CGI script in the ordinary way. This technique is also more secure, because the delivery file that does this is safe from contamination.

3-18-04. The way a delivery-file works is through the abstraction of a file descriptor. The server uses a file descriptor to provide the contents of a response to a client. File descriptors without st_size information available (such as the output from most present web server programs) are sent through chunked encoding, to avoid the inefficiency of having to close the connection to the client as the only way of indicating the end of content.
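A sketch of the two cases, assuming that only regular files report a usable st_size. The chunk framing follows HTTP/1.1 (hexadecimal length, CRLF, data, CRLF, and a zero-length chunk to finish). The function names and the block size are hypothetical.

```python
import os
import stat


def has_known_size(fd):
    """Regular files report st_size; pipes and sockets do not (assumption)."""
    return stat.S_ISREG(os.fstat(fd).st_mode)


def chunked(fd, block_size=65536):
    """Yield HTTP/1.1 chunked-encoding frames for everything readable from fd."""
    while True:
        data = os.read(fd, block_size)
        if not data:
            break  # EOF on the descriptor
        # Each chunk: length in hex, CRLF, the bytes, CRLF.
        yield b"%x\r\n" % len(data) + data + b"\r\n"
    # A zero-length chunk marks the end, so the connection can stay open.
    yield b"0\r\n\r\n"
```

A descriptor for which has_known_size is true can be sent with a Content-Length instead; everything else goes through chunked.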

A delivery file is started with a queue of input descriptors and output descriptors. argv[1] is a decimal number N indicating the number of input descriptors, which are fd numbers 4, 6, 8, ..., 2N + 2. Output descriptors are fd numbers 5, 7, 9, ..., 2N + 3. The server waits on these descriptors just as on descriptors from an ordinary file.
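The numbering can be written out directly. This sketch only computes the descriptor numbers from N; it does not open anything. The function names are hypothetical.

```python
def queue_descriptors(n):
    """Return (input_fds, output_fds) for a queue of n requests."""
    inputs = [2 * i + 2 for i in range(1, n + 1)]   # 4, 6, 8, ..., 2n + 2
    outputs = [2 * i + 3 for i in range(1, n + 1)]  # 5, 7, 9, ..., 2n + 3
    return inputs, outputs


def descriptors_from_argv(argv):
    """Read N from argv[1] (a decimal number) and return the descriptor lists."""
    return queue_descriptors(int(argv[1]))
```

For N = 3 this gives inputs 4, 6, 8 and outputs 5, 7, 9, so request k reads from fd 2k + 2 and writes to fd 2k + 3.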

To trick ordinary programs into running as delivery files (but as delivery files which process only the first request, which is a risk that programmers have to take, because a queue may contain any number of requests--though whether it can contain any number is up to the operating system's limit on descriptors; however, probably most programmers will ignore this risk), fd 4 is aliased as 0, and fd 5 is aliased as 1.

Also, envp is defined for the process in part the same as an ordinary "CGI" program would have it, specifically with CONTENT_LENGTH, plus the HTTP headers with capitalized letters and other characters besides A-Za-z0-9 changed to underscores.
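A sketch of the header-name rule as stated: letters are capitalized and every character outside A-Za-z0-9 becomes an underscore. Conventional CGI also prefixes request headers with HTTP_; the note does not mention a prefix, so none is added here. The function names are hypothetical.

```python
import re


def env_name(header_name):
    """Map an HTTP header name to an environment variable name."""
    # Replace anything that is not A-Za-z0-9 with "_", then capitalize.
    return re.sub(r"[^A-Za-z0-9]", "_", header_name).upper()


def build_env(headers):
    """Build the environment mapping from a dict of request headers."""
    return {env_name(name): value for name, value in headers.items()}
```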

Note that the server-file process can either stat its input or read until EOF to get the content-length for each request. The input stream is seekable; it points to the beginning of input, but it can be sought to zero in order to obtain the full, precise HTTP headers of a request. envp, and the entire, stupid "CGI" environment should not be used by intentionally programmed server files.

fd 3 is a special descriptor which allows the delivery file process to wait for another queue without restarting.

Note that more than N descriptors may exist for the delivery file to use. The next queue may be of a different size than the last. This is not harmful, because the entire server will have at most a limited number of possible queue items.

Files would be stored in URL-encoded directories. There would be no decode between requesting and opening. A file notes/stat460.txt would be identical to the file on another server. However, notes/Binomial/Geometric/Hypergeometric Probability Distributions could also exist. The real file would be notes/Binomial%2FGeometric%2FHypergeometric%20Probability%20Distributions (The concern that this would encourage such abusively-long and complicated file names is mitigated by human desire for convenience--and besides, it is more convenient to have the long name once the long name has been typed.) No URL changes would be needed--any server's data could be copied to and hosted on the server trouble-free.
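A sketch of "no decode between requesting and opening": the characters of the URL path are used as the file name exactly as received. The function name path_for_request and the root directory are hypothetical, and the sketch does nothing about ".." segments, which a real server would still have to decide how to treat.

```python
import os


def path_for_request(root, url_path):
    """Map a request path to a file path without any percent-decoding."""
    # The leading "/" is dropped so the path is taken relative to root;
    # "%2F" and "%20" stay exactly as the client sent them.
    return os.path.join(root, url_path.lstrip("/"))
```

With this mapping, /notes/a%20b and /notes/a b are two different files rather than two spellings of one.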

3-18-04. Please observe that this method of handling file names and URLs allows any URL to be served by the server, without loss of generality. The current method of decoding and millions of equivalent URLs means that a decoded file name can't be used on many servers, as well as that a huge loss of efficiency (to no purpose) occurs. For instance, a file named "(j repeats) / j times" can't be served by any other Unix server. The same occurs for many files that can be placed on servers.

However, any file can be served by using preserved encodings. The only shortcoming, which isn't a shortcoming, is that two URLs can't be equivalent. That's a good thing.

9-27-03. The server will only recognize a subset of HTTP/1.1 headers (see ../reference/http-headers.html). Only request headers which exclude the possibility of sending a correct response without supporting them will need to be "seen" and a "Request Unsupported" response sent. (Also see ftp://ftp.isi.edu/in-notes/rfc2616.txt.)

Language should not be negotiated. It's not a server communication. I do recommend that links be made to files like "index.html.fr."

10-1-03. Like mathopd, it will start up with a glistening network of light servers.

Link to the drives used by the server. (3 drives, all in RAID 5, equivalent to 2 drives with fail-safe).

Or wait until they have a >= 250 GB model?

http://www.seagate.com/cda/products/discsales/marketing/detail/0,1081,362,00.html

Fibre channel.

MSI motherboard

http://www.msi.com.tw/program/products/server/svr/pro_svr_detail.php?UID=441&MODEL=MS-9131

http://www.accupc.com/itemDetail.jsp?pid=mbmsk8dmft&refer=PriceGrabber

rated 5 stars with 645 ratings.
MSI K8D MASTER-FT

manual
http://download.msi.com.tw/support/mnu_exe/E9131v1.0.exe

quick start
http://www.msi.com.tw/html/support/manual/note/quick_installation.pdf

Utilities

urlencodedir

This will URL-encode the plain file names so they are ready for Web service.

Presently, there are an infinite number of ways for specifying the same resource. This is a tracking weakness, a security weakness, and a server load weakness. It is also a confusion weakness. When a link is made, the link is not necessarily the same--some people leave in spaces, some use parentheses, some escape some characters, some others. If the files themselves are stored in the same names as the server uses, then using the same name uniformly is obvious: use the name of the file itself. This program makes this possible.

The format used is consistent with domain names. All characters but . and - are encoded. Of course, alphanumerics are not changed, just the same as in domain names.
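A sketch of what urlencodedir might do, under stated assumptions: names are treated as UTF-8, escapes use uppercase hex, and directories are renamed bottom-up so that children are handled before their parents. The function names are hypothetical. Running it twice on the same tree would encode the "%" signs again, so it is meant to be run once on plain names.

```python
import os
import string

# Bytes left unchanged: alphanumerics plus "." and "-", as in domain names.
KEEP = set((string.ascii_letters + string.digits + ".-").encode("ascii"))


def encode_name(name):
    """Percent-encode every byte of name except alphanumerics, '.' and '-'."""
    out = []
    for byte in name.encode("utf-8"):
        out.append(chr(byte) if byte in KEEP else "%%%02X" % byte)
    return "".join(out)


def encode_tree(root):
    """Rename every file and directory under root to its encoded name."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            encoded = encode_name(name)
            if encoded != name:
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, encoded))
```

Note that "_" is encoded as well (as %5F), since only "." and "-" are exempt.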

10-5-03. The server finally started serving serial linear (non-synchronous) connections yesterday. Now the sigpipe handler must be installed to stop sending data (process receives a sigpipe, and the server quits, if a download/transmission is cancelled).
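A sketch of the same idea in Python, where the interpreter ignores SIGPIPE at startup, so a write to a closed connection raises an exception instead of killing the process. In C the equivalent would be installing a handler or ignoring the signal explicitly. The function names are hypothetical.

```python
import os
import signal


def ignore_sigpipe():
    """Make the intent explicit; must be called from the main thread."""
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)


def send_all(fd, data):
    """Write all of data to fd; return False if the client has gone away."""
    view = memoryview(data)
    try:
        while view:
            written = os.write(fd, view)
            view = view[written:]
    except (BrokenPipeError, ConnectionResetError):
        # The download was cancelled: stop sending, but keep the server alive.
        return False
    return True
```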

Server hardware needs wants

10-21-03. There are two approaches to beginning the model of a server-in-development. 1: http-header by http-header; 2: feature by feature. Generally, I think for http, the first way is the best way to do the second way, almost all the time.

Take dates. (Dates with the girl at the grocery store in front of the two little girls. Dates with the whipped topping and date roll.)

The default version can be supported, but honestly no real person can help but be very disappointed. The reading of dates is hard anyway, and they format the dates so that developers can whip out and read the date like porn, yet the sadness is that the use developers have for them is really very hard since the RFC date style is worse than DDDD, for example, or (since it is just as good because we can't read RFC dates, and can't read decimals, and even if we could we couldn't vividly evaluate them or use them for any purpose, just like we rarely need or understand or want the date of the post in the forum, just that it was next to ours) hex, or base64, or 96-bit base64, what D. J. Bernstein likes. (I don't like that, because downloading takes a lot of nanoseconds. Oh, well, it would help sorts, but probably still be an inaccurate count of nanoseconds.)

For an example of this, I think I should always make the date function show an RFC format plus a hex + .hex format, like this: 374a666d6179760a for seconds (64 bits), or 374a666d6179760a.6179760a for an additional 32-bit time if we eventually move (yeah, we probably should; I'm just saying that some applications don't use one kind of time or another).
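A sketch of such a date function. It assumes the 64-bit part is whole seconds since the Unix epoch and the optional 32-bit part is a nanosecond count; the note does not say what the second field measures. The function name server_date is hypothetical.

```python
from email.utils import formatdate


def server_date(seconds, nanoseconds=0):
    """Return (rfc_date, hex_stamp) for a time given as seconds + nanoseconds."""
    # RFC-style date in GMT, as used in HTTP headers.
    rfc = formatdate(seconds, usegmt=True)
    # 16 hex digits for the 64-bit seconds, 8 for the 32-bit fraction.
    hex_stamp = "%016x.%08x" % (seconds, nanoseconds)
    return rfc, hex_stamp
```

The hex form has a fixed width, so plain string comparison sorts stamps in time order.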

Motherboards

http://www.giga-byte.com/MotherBoard/Products/Products_NewProduct_List.htm gigabyte motherboards
http://www.leadtek.com.tw/motherboard/winfast_k8nw_1.shtml Really Cool Motherboard
http://www.tyan.com/products/html/thunderk8w_spec.html tyan motherboard

Features

CGI programs can be executed like a script, or like a server. I think this is a better conceptual approach than FastCGI, etc.

All such programs can be re-niced after one second.

Configuration of the server is preferably kept to a minimum. Too many times, servers like Apache get turned into Apache + incompatible server configuration, where it is easy to write CGI programs and site layouts that are not portable to any other server, because they are dependent on arbitrary settings. Also, features of the server should be functional without configuration.

Essentials

It's easy to know what a server should have:

Resumable downloads. Mime-type customization. Minimal configuration.
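Resumable downloads in HTTP/1.1 rest on the Range request header. A sketch of parsing the single-range form follows; multiple ranges are simply treated as unusable here, which is a simplification. The function name parse_range is hypothetical.

```python
import re

_RANGE = re.compile(r"^bytes=(\d*)-(\d*)$")


def parse_range(header, size):
    """Return inclusive (first, last) byte offsets, or None if unusable."""
    match = _RANGE.match(header.strip())
    if not match or size == 0:
        return None
    first, last = match.groups()
    if first == "":
        # Suffix form "bytes=-N": the last N bytes of the file.
        if last == "" or int(last) == 0:
            return None
        return max(size - int(last), 0), size - 1
    start = int(first)
    end = min(int(last), size - 1) if last else size - 1
    if start > end:
        return None
    return start, end
```

A client resuming a 1000-byte download after 500 bytes would send "bytes=500-" and get back bytes 500 through 999.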

How it works

11-27-03.

On a network a computer accepts access to its ports. A server runs on the computer, "bound" to one of the ports--so when the computer receives traffic on this port, it goes to the server. The server multiplexes to accept connections, and it produces the proper responses, in this case HTTP headers and content.
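A minimal sketch of binding and multiplexing, using Python's selectors module. It answers every request with the same fixed response and does not parse anything; the names make_listener, serve_once, and RESPONSE are hypothetical, and port 0 asks the operating system for any free port.

```python
import selectors
import socket

# A fixed reply: HTTP headers followed by two bytes of content.
RESPONSE = (b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n"
            b"Connection: close\r\n\r\nok")


def make_listener(host="127.0.0.1", port=0):
    """Create a non-blocking socket bound to (host, port) and listening."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen()
    sock.setblocking(False)
    return sock


def serve_once(selector, listener, timeout=1.0):
    """Handle every socket that is ready now; return the number of events."""
    events = selector.select(timeout)
    for key, _mask in events:
        if key.fileobj is listener:
            # New connection: accept it and watch it for a request.
            conn, _addr = listener.accept()
            conn.setblocking(False)
            selector.register(conn, selectors.EVENT_READ)
        else:
            conn = key.fileobj
            data = conn.recv(4096)  # a real server would parse this
            if data:
                conn.sendall(RESPONSE)
            selector.unregister(conn)
            conn.close()
    return len(events)
```

The listener is registered with the selector once, and serve_once is then called in a loop; one process handles all connections without threads.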

A server can run with an ugly URL on most computers, but to actually set up production hosting, it needs to be combined with tools to make the users' domain names.

The whole suite of software itself can be managed easily with something like svscan to run the DNS server, mail server, and web server together.

12-4-03. Using ifconfig, the server owns 10 IP addresses, which are used to help form 5 ports for web servers. User config may change their domain names to select these servers, rather than tacking on bratty, permanence-robbing URLs such as domain:81/. (I.e., imagine trying to switch web hosts this way.) Our server does not want to keep you from switching.

Each server serves two 500 GB arrays on different interfaces. This enables 2N - 1 servers to be used redundantly, with any one of them in a related set able to fail, without any hosts failing. I.e., two different "hosts" are on one server, never the same host. If any single server fails, another server (presumably which has not failed) carries it without any downtime.

We will never allow content-redirection. I.e., index.html.fr has logistic values in its name, that must not be compromised by selecting from either index.html.en or index.html.fr. In addition, such selection is always ambiguous in the most definite sense. However, it is a simple solution to switch to our host without changing links that have previously referred to files without index-language/file-type information (such as the ridiculous idea of requesting /foo and returning /foo.gif or /foo.html, depending on the order of "preferences" in the request--the whole idea being so sinful that it is against the souls of all programmers). Simply use a rudimentary server file in that location, or even put a link in all such locations to a rudimentary server file that serves the purpose of all locations. Have the rudimentary server file redirect users from its own name to the language/file-type name.

12-11-03. Obviously, I can't do anything so incompatible as require all file names to be encoded, so I will provide an option. It should have been designed this way from the first.

I think every attribute of a server I come across I should choose from, and write down here.

simultaneous connection limit = 10000