Faster serving of large file objects

Posted by David Luebbert, 9/29/04 at 11:02:13 AM.

I would like to be able to reliably serve large objects (mainly mp3s and zip archives) from my website (http://www.songtrellis.com). These largish things typically must be stored on disk, in order to preserve addressability within Frontier's object databases. There is a 2Gb limit on the size of Frontier .root files. To work around this restriction, a filespec is usually recorded in a website's tables to point to the directory on disk that stores the large objects. That way we can point to a directory of files that will be served by mainResponder but don’t have to physically allocate space in a root file.

I would like to eventually be able to offer thousands of mp3s on my site which will vary in size from 3Mb up to 50Mb. Storing all of this in the .root table for my website will not be possible and probably is not even desirable, since hours and perhaps days or months may pass between requests for a particular mp3. Space in .root should be reserved for data that is served very frequently.

When a user browses into a filespec that points to a directory, a listing is sent back to the browser which includes links to all of the files stored in the directory. When a user clicks on a particular file link for the first time since Frontier has started and the file is large, a lot of inefficient things happen which lead to very long wait times for the user. (Note that on some systems that run caching software, it’s only the first access of a particular file during a Frontier run that has awful performance. Once the file has been produced by the Manila server and delivered to a caching layer, subsequent web requests can be very fast.)

If you click on a large file object (say a 30Mb mp3 file stored in an on-disk directory), the browser will sit there waiting for Frontier to respond for many seconds. On my PowerBook G4, the wait time is approximately 0.5 seconds per Mb to load, so a 30Mb file stalls for about 15 seconds before the download begins.

Here's what I've learned during my two weeks of exposure to the kernel sources.

Filespecs are handled in mainResponder.respond and finally filter down to the serveFile procedure in respond. serveFile calls the built-in function file.readwholefile(). That function calls file.read with the count of bytes to read set to infinity.

Digging into readverb within the kernel and the routines it calls, I discovered that a call to fifreadhandle in findinfile.c takes most of the machine cycles when you ask to read a multi-megabyte file. The routine fifreadhandle was apparently originally built to read bytes from a file until a prescribed eolmarker was found. This would be necessary behavior for an outliner, which would need to locate paragraph marks in a file and chunk the text in between into different outline paragraphs. The routine, running in this mode, would fetch full buffers of text (1024 bytes on my Mac), scan until an eol was found or the buffer was completely scanned, and then copy the scanned text to an output handle, which would be grown to accommodate the new text.

fifreadhandle can also be called with a null eolmarker and a count of bytes to read of longinfinity. In this case, the routine sees the null eolmarker parameter, skips the EOL scan, and tries to discover the actual end of file by repeatedly attempting to read 1024-byte buffers, growing the output handle to accommodate the newly read characters, and copying the new text to the end of the handle. This is really disastrous if a 50Mb file is read from disk, because in that case we execute roughly 50,000 file reads of 1024 bytes each, grow our output handle by 1024 bytes roughly 50,000 times, and perform roughly 50,000 separate copy operations with all of their associated overhead.
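To make the cost concrete, here is a minimal sketch in portable C of the read-grow-copy pattern described above. It is not the code from findinfile.c: the names read_in_chunks and chunked_result are hypothetical, it uses stdio and realloc in place of Frontier's file and handle routines, and it counts reads and grows only so the overhead is visible.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 1024 /* buffer size mentioned in the post */

/* Hypothetical result record; not a Frontier kernel type. */
typedef struct {
    unsigned char *data; /* grown output buffer (stands in for the handle) */
    size_t size;         /* bytes accumulated so far */
    size_t reads;        /* number of reads that returned data */
    size_t grows;        /* number of times the buffer was regrown */
} chunked_result;

/* Read fp to end of file in CHUNK_SIZE pieces, growing the output
   buffer and copying each piece onto its end.
   Returns 0 on success, -1 on allocation or read failure. */
int read_in_chunks(FILE *fp, chunked_result *out)
{
    unsigned char buf[CHUNK_SIZE];
    size_t n;

    assert(fp != NULL && out != NULL);
    out->data = NULL;
    out->size = 0;
    out->reads = 0;
    out->grows = 0;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        /* One grow and one copy for every chunk that arrives. */
        unsigned char *grown = realloc(out->data, out->size + n);
        if (grown == NULL) {
            free(out->data);
            out->data = NULL;
            return -1;
        }
        out->data = grown;
        memcpy(out->data + out->size, buf, n);
        out->size += n;
        out->reads++;
        out->grows++;
    }
    if (ferror(fp)) {
        free(out->data);
        out->data = NULL;
        return -1;
    }
    return 0;
}
```

For a file of 50 x 1024 x 1024 bytes this loop body runs 51,200 times, which is the per-chunk overhead the post is complaining about.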

We can make a simple optimization to fifreadhandle which eliminates this overhead for the case where it is called with ctbytes == longinfinity and eolmarker == chnul. This is the case that gets exercised when file.read is called with a count of bytes to read == infinity. The payoff here is quite large.

The optimization for unbounded reads with no eol scanning is to take advantage of the fact that both Windows and the Mac are able to accurately report the logical size of a file, that they are able to successfully allocate buffers of a size to match, and that they can successfully execute a read that can transfer the entire file contents into that very large buffer in one read operation.
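As an illustration only, the size-allocate-read-once idea can be sketched in portable C as follows. This is not the actual change to fifreadhandle: the function name read_whole_file is hypothetical, and stdio's fseek/ftell plus malloc stand in for the platform file-size calls and Frontier's handle allocation. Note that ftell returns a long, so on platforms with a 32-bit long this sketch is limited to files under 2Gb.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the entire contents of fp with a single allocation and a
   single read request. Hypothetical helper, not Frontier kernel code.
   On success returns a malloc'd buffer and stores its length in
   *size_out; returns NULL on any failure. The caller frees the buffer. */
unsigned char *read_whole_file(FILE *fp, size_t *size_out)
{
    long length;
    unsigned char *data;

    assert(fp != NULL && size_out != NULL);

    /* Step 1: discover the logical size of the file. */
    if (fseek(fp, 0L, SEEK_END) != 0)
        return NULL;
    length = ftell(fp);
    if (length < 0)
        return NULL;
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return NULL;

    /* Step 2: allocate a buffer of matching size, once.
       (Ask for at least 1 byte so an empty file still gets a buffer.) */
    data = malloc(length > 0 ? (size_t)length : 1);
    if (data == NULL)
        return NULL;

    /* Step 3: transfer the whole file with one read request. */
    if (fread(data, 1, (size_t)length, fp) != (size_t)length) {
        free(data);
        return NULL;
    }

    *size_out = (size_t)length;
    return data;
}
```

The point of the design is that the number of allocations and copy operations no longer depends on the file size; only the single read does.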

As you would expect, doing the file size discovery, the allocation, and the read directly into the handle just once is way faster. With the optimization in place (15 or so lines of code), for a 30Mb file the web browser hangs out doing nothing for 2 or 3 seconds before the download starts progressing. This is 6 or 7 times faster than unoptimized performance.

This optimization is super localized at a low level in one routine and changes none of the code at higher levels.

There's actually an optimization for memory utilization and even better execution time that could be attempted if we were willing to alter mainResponder.respond.

Visiting mainResponder.respond again and tracing what happens after the call to file.readwholefile in serveFile, it takes the file contents and stores the result in a string handle. The handle is passed back to webserver.dispatch. This takes the handle and passes it down to webserver.buildresponse, which copies the handle string contents to the response stream.

This business of copying an enormous file into a buffer handle, and then slowly filling the TCP/IP response stream out of the buffer contents is a bad deal because it still can take several seconds to fill the buffer before data transmission can begin and because the buffer is so huge and creates an enormous memory demand just in order to handle a single URL request. If several folks started large file reads simultaneously, some or all of the requests would necessarily fail in a short memory situation.

The optimization would be to notice filespecs in mainResponder.respond whose file types cannot be altered by processing in respond (mp3, pdf, gif, jpg, mid, and doc come to mind) and somehow pass control to a method that would read file contents into a smaller buffer, sufficient in size to keep the TCP/IP connection busy, and send these smaller buffer contents across the wire until the file is completely copied across the connection. That way we would no longer have to grow a buffer large enough to hold the entire file and would not have to wait for this buffer to fill before sending bytes back to the requesting machine.
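A minimal sketch of that streaming idea, again in portable C and under stated assumptions: stream_file is a hypothetical name, the destination FILE stands in for the TCP/IP connection, and the 32K buffer size is a guess rather than a measured value. It is not the direct file serving code that exists in the kernel.

```c
#include <assert.h>
#include <stdio.h>

#define STREAM_BUF_SIZE (32 * 1024) /* assumed size; tune for the connection */

/* Copy src to dst through one small fixed-size buffer, so memory use
   stays constant no matter how large the file is and the first bytes
   can go out as soon as the first buffer is filled.
   Hypothetical helper: dst stands in for the connection's output stream.
   Returns 0 on success and stores the byte count in *copied_out;
   returns -1 on a read or write error. */
int stream_file(FILE *src, FILE *dst, size_t *copied_out)
{
    unsigned char buf[STREAM_BUF_SIZE];
    size_t n;
    size_t total = 0;

    assert(src != NULL && dst != NULL && copied_out != NULL);

    while ((n = fread(buf, 1, sizeof buf, src)) > 0) {
        /* Send this piece before reading the next one. */
        if (fwrite(buf, 1, n, dst) != n)
            return -1;
        total += n;
    }
    if (ferror(src))
        return -1;

    *copied_out = total;
    return 0;
}
```

With this shape, several simultaneous large downloads each cost one small buffer rather than one file-sized buffer.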

For a 50Mb file, the transmission of the requested file would start almost instantly, eliminating the 2 or 3 second wait still left in the process by the first optimization, and the memory demand required to buffer the transmission to the requester without introducing substantial waits could probably be a very small fraction of what is currently necessary.

Looking in webserver.callresponder, it appears that this type of optimization has been attempted before in Frontier for Manila's benefit. There is code that Andre wrote for version 6.2b10 that looks for a filespecvaluetype returned from mainResponder.respond. It runs under the control of a flEnableDirectFileServing flag. If the filespecvaluetype is present, it executes a special routine to copy the file specified by the filespec into the TCP/IP connection's output buffer. This optimization has been ifdef’ed out of the compile for webserver.callresponder.

This ifdef’ed code for direct file serving appears to be an implementation of the more ambitious optimization I’ve described above. Considering the benefits this code could provide, I wonder if Andre or others can say why it was not put into service.

If we were to reactivate direct file serving, we would have to modify the serveFile routine within mainResponder.respond to return the file name expected by the direct file serving code rather than the file contents. The file name would be packaged as a filespec and then recorded as the paramtable responseBody. Andre’s code, when it receives the response from mainResponder.respond, goes to the responseBody to retrieve the name of the file to transmit and then performs the file transmission.
