HTTP Compression Speeds up the Web
Oct 13, 2000, 15:28 UTC (13 Talkback[s]) (9666 reads) (Other stories by Peter Cranstone)

A longer version of this appeared on WebReference.

by Peter Cranstone

Web traffic is forecast to more than triple over the next three years, and the fastest-growing category is data. Data and content will remain the largest share of Web traffic, and because most of this information is dynamic, it does not lend itself to conventional caching technologies. The issues range from business-to-consumer response and order-confirmation times, to the time required to deliver business information to a road warrior on a wireless device, to the download time for rich media such as music or video. Not surprisingly, the number one complaint among Web users is lack of speed. That's where compression, via mod_gzip, can help.

The Solution: Compression

The idea is to compress data being sent out from your Web server and have the browser decompress it on the fly, reducing the amount of data sent and increasing page display speed. There are two ways to compress data coming from a Web server: dynamically and pre-compressed. Dynamic content acceleration compresses the data on the fly as it is transmitted (useful for e-commerce applications, database-driven sites, and so on). Pre-compressed, text-based data is generated beforehand and stored on the server (.html.gz files, for example).

The goal is to send less data. To do this, the data must be analyzed and compressed in real time, and decompressed at the other end with no user interaction. Since less data (fewer packets) is being sent, it consumes less bandwidth and arrives significantly faster. Network acceleration solutions need to focus on the formats used for data and content, including HTML, XML, SQL, Java, WML, and other text-based languages. Both approaches use HTTP compression and typically shrink HTML files to roughly a third of their original size.
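As a rough sketch of the two approaches (the file names and page content below are made up for illustration), a pre-compressed variant can be generated once ahead of time, while dynamic compression happens per response:

    import gzip
    import shutil

    # Hypothetical page used only for illustration.
    html = "<html><body><table>" + "<tr><td>track</td></tr>" * 1000 + "</table></body></html>"

    # Pre-compressed: write music.htm once, then generate music.htm.gz ahead of
    # time so the server can later send the .gz variant instead of the original.
    with open("music.htm", "w") as f:
        f.write(html)
    with open("music.htm", "rb") as src, gzip.open("music.htm.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Dynamic: compress a freshly generated page in memory, per request.
    def compress_dynamic(page: str) -> bytes:
        return gzip.compress(page.encode("utf-8"))

    print(len(html), "bytes uncompressed,", len(compress_dynamic(html)), "bytes gzipped")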

To get an idea of the improvement in speed involved, here's a live demonstration:

Real time Web server content acceleration test:

  • Before: http://12.17.228.53:8080/music.htm
  • After: http://12.17.228.53/music.htm

Why Compress HTML?

HTML is used in most Web pages and forms the framework in which the rest of the page (images, objects, and so on) appears. Unlike images (GIF, JPEG, PNG), which are already compressed, HTML is just ASCII text, which is highly compressible. Compressing HTML can have a major impact on the performance of HTTP, especially as PPP lines fill up with data and the only way to obtain higher performance is to reduce the number of bytes transmitted. A compressed HTML page appears to pop onto the screen, especially over slower modems.

The Last Mile Problem

The Web is only as strong as its weakest link, which has been and always will be the last mile to the consumer's desktop. Even with the rapid growth of residential broadband, the growth in narrowband users and data far exceeds broadband's limited reach. Jakob Nielsen expects the standard data transmission speed to remain at 56K until at least 2003, so there is a distinct need to reduce download times. Caching data has its benefits, but only content reduction can make a significant difference in response time: it's always going to be faster to download a smaller file than a larger one.

Is Compression Built into the Browser?

Yes. Most browsers released since 1998/1999 support the HTTP 1.1 feature known as "content encoding." Essentially, the browser indicates to the server that it can accept content encoding; if the server is capable, it compresses the data and transmits it. The browser then decompresses the data and renders the page.

Only HTTP 1.1 compliant clients request compressed files. Clients that are not HTTP 1.1 compliant request and receive the files uncompressed, and so do not benefit from the improved download times. Internet Explorer 4 and above, Netscape 4.5 and above, Windows Explorer, and My Computer are all HTTP 1.1 compliant clients by default.

To test your browser, click on this link (it works if you are not behind a proxy server):

http://12.17.228.52:7000/

And you'll get a chart like this:

[Chart: browser capability test results]

To verify that Internet Explorer is configured to use the HTTP 1.1 protocol:

  1. Open the Internet Options property sheet
    • If using IE 4, this is located under the View menu
    • If using IE 5, this is located under the Tools menu
  2. Select the Advanced tab
  3. Under HTTP 1.1 settings, verify that Use HTTP 1.1 is selected (see Figure 1 below).
[Figure 1: IE 4/5 Advanced settings showing the HTTP 1.1 options]
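If the test page above is no longer reachable, a similar check can be made by hand against any server; the sketch below (the URL is just a placeholder) sends an Accept-Encoding header and reports whether the response came back gzip-encoded:

    import urllib.request

    # Placeholder URL; substitute any site you want to test.
    req = urllib.request.Request("http://www.example.com/",
                                 headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        encoding = resp.headers.get("Content-Encoding", "none")
        body = resp.read()   # still compressed if encoding is "gzip"

    print("Content-Encoding:", encoding)
    print("Bytes on the wire:", len(body))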

What is IETF Content-Encoding (or HTTP Compression)?

In a nutshell, it is simply a publicly defined way to compress HTTP content being transferred from Web servers down to browsers, using nothing more than public-domain compression algorithms that are freely available.

"Content-Encoding" and "Transfer-Encoding" are both clearly defined in the public IETF Internet RFC's that govern the development and improvement of the HTTP protocol which is the "language" of the World Wide Web. "Content-Encoding" applies to methods of encoding and/or compression that have been already applied to documents before they are requested. This is also known as "pre-compressing pages." The concept never really caught on because of the complex file maintenance burden it represents and there are few Internet sites that use pre-compressed pages of any description. "Transfer-Encoding" applies to methods of encoding and/or compression used DURING the actual transmission of the data itself.

In modern practice, however, the two are effectively one and the same. Since most HTTP content from major online sites is now dynamically generated, the line has blurred between what happens before a document is requested and while it is being transmitted; a dynamically generated HTML page doesn't even exist until someone asks for it. The original notion of all pages being "static" and already present on disk has become dated, and the once well-defined separation between "Content-Encoding" and "Transfer-Encoding" has turned into a rather pale shade of gray. Unfortunately, the ability of any modern Web or proxy server to supply "Transfer-Encoding" in the form of compression is even less available than the spotty support for "Content-Encoding."

Suffice it to say that regardless of the two publicly defined "Encoding" specifications, if the goal is to compress the requested content (static or dynamic), it really doesn't matter which of the two methods is used: the result is the same. The user receives far fewer bytes than normal and everything happens much faster on the client side. The publicly defined exchange goes like this (a sketch of the server-side decision follows the numbered steps):

  1. A browser that is capable of receiving compressed content indicates this in all of its requests for documents by supplying a request header field such as the following when it asks for something:

     Accept-Encoding: gzip, compress

  2. When the Web server sees that request field, it knows the browser is able to receive compressed data in one of only two formats: standard GZIP or the UNIX "compress" format. It is up to the server to compress the response data using either of those methods (if it is capable of doing so).

  3. If a compressed static version of the requested document is found on the Web server's hard drive, in one of the formats the browser says it can handle, the server can simply send that pre-compressed version instead of the much larger uncompressed original.

  4. If no static document matching any of the compressed formats the browser says it can "Accept" is found on disk, the server can either send the original uncompressed version of the document, or attempt to compress it in real time and send the newly compressed, much smaller version back to the browser.
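A minimal sketch of that server-side decision, under my own assumptions (the file layout, fallback policy, and function names are illustrative and not mod_gzip's actual code):

    import gzip
    import os

    def respond(path: str, accept_encoding: str):
        """Follow the exchange above: prefer a pre-compressed .gz file on disk,
        otherwise compress in real time, otherwise send the original unchanged."""
        wants_gzip = "gzip" in accept_encoding.lower()
        precompressed = path + ".gz"

        # Step 3: a static compressed version already exists on disk.
        if wants_gzip and os.path.exists(precompressed):
            with open(precompressed, "rb") as f:
                return f.read(), {"Content-Encoding": "gzip"}

        with open(path, "rb") as f:
            original = f.read()

        # Step 4: no pre-compressed copy, so compress in real time.
        if wants_gzip:
            return gzip.compress(original), {"Content-Encoding": "gzip"}

        # Client did not advertise gzip support: send the original bytes.
        return original, {}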

Most popular Web Servers are still unable to do this final step.

  • The Apache Web server, with 61 percent of the Web server market, is still incapable of providing real-time compression of requested documents, even though all modern browsers have been requesting compressed content, and been capable of receiving it, for more than two years.

  • Microsoft's Internet Information Server is equally deficient. If it finds a pre-compressed version of a requested document it might send it, but it has no true real-time compression capability.

    IIS 5.0 uses an ISAPI filter to support GZIP compression. It works as follows: the user requests a page, the server sends the page uncompressed and then stores a compressed copy of it in a temporary folder. The next time a user requests the page, the server sends the copy stored in the temp directory.

    The filter then tries to keep the pages in the temp directory current; when one goes stale, it fetches a fresh copy of the page and compresses it again.

  • IBM's WebSphere server has some limited support for real-time compression, but the feature has appeared and disappeared across various releases of WebSphere.

  • The very popular Squid proxy server from NLANR also has no dynamic compression capabilities, even though it is the de facto standard proxy-caching software used just about everywhere on the Internet.

The original designers of the HTTP protocol did not foresee today's reality, in which so many people use the protocol that every single byte counts. The heavy use of pre-compressed graphics formats such as GIF, and the relative difficulty of reducing graphics content any further, make it even more important that all other exchange formats be optimized as much as possible. The same designers also did not foresee that most HTTP content from major online vendors would be generated dynamically, so there is often no chance for a "static" compressed version of the requested document to exist. Public IETF Content-Encoding is still not a "complete" specification for the reduction of Internet content, but it does work, and the performance benefits of using it are both obvious and dramatic.

What is GZIP?

It's a lossless compressed data format. The deflation algorithm used by GZIP (also by zip and zlib) is an open-source, patent-free variation of LZ77 (Lempel-Ziv 1977; see the references below). It finds duplicated strings in the input data: the second occurrence of a string is replaced by a pointer to the previous one, in the form of a (distance, length) pair, where distances are limited to 32K bytes and lengths to 258 bytes. When a string does not occur anywhere in the previous 32K bytes, it is emitted as a sequence of literal bytes. (In this description, "string" means an arbitrary sequence of bytes and is not restricted to printable characters.)
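To see what that duplicate-string elimination does to markup, which is full of repeated tags, here is a quick experiment with Python's zlib module (which implements the same deflate algorithm); the sample table is invented, so the exact ratio is only indicative:

    import zlib

    # Repetitive table markup, typical of a large HTML listing.
    row = "<tr><td>Artist</td><td>Title</td><td>3:42</td></tr>\n"
    html = "<html><body><table>\n" + row * 5000 + "</table></body></html>"

    compressed = zlib.compress(html.encode("ascii"), 9)
    saving = 100 * (1 - len(compressed) / len(html))
    print(f"{len(html)} bytes -> {len(compressed)} bytes ({saving:.1f}% smaller)")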

Technical Overview

HTML/XML/JavaScript/text compression: Does it make sense?

The short answer is "only if it can get there quicker." In 99% of cases it makes sense to compress the data. However, several problems need to be solved to enable seamless transmission from the server to the consumer:

  • Compression should not conflict with MIME types
  • Dynamic compression should not affect server performance
  • The server should be smart enough to know whether the user's browser can decompress the content (a minimal policy check is sketched below)
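A minimal sketch of that policy check (the MIME-type list and names below are my own assumptions, not mod_gzip's configuration):

    # Only compress text-like MIME types, and only for clients that
    # advertise gzip support; images and the like are already compressed.
    COMPRESSIBLE_TYPES = {"text/html", "text/plain", "text/css",
                          "text/xml", "application/x-javascript"}

    def should_compress(content_type: str, accept_encoding: str) -> bool:
        mime = content_type.split(";")[0].strip().lower()
        if mime not in COMPRESSIBLE_TYPES:
            return False
        return "gzip" in accept_encoding.lower()

    print(should_compress("text/html; charset=iso-8859-1", "gzip, deflate"))  # True
    print(should_compress("image/gif", "gzip"))                               # False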

Let's create a simple scenario: an HTML file that contains a large music listing in the form of a table.

http://12.17.228.53:8080/music.htm (679,188 bytes in length)

Let's track this download over a 28.8K modem and compare the results before and after compression. The theoretical throughput of a 28.8K modem is 3,600 bytes per second; reality is more like 2,400 bytes per second, but for the sake of this article we will work at the theoretical maximum. With no modem compression, the file would download in 188.66 seconds. With modem compression running we can expect, on average, a download time of about 90 seconds, which indicates roughly a 2:1 compression factor: the total number of packets transmitted from modem to modem effectively "halved" the file size. But note that the server still had to keep the TCP/IP subsystem open to send all 679,188 bytes to the modem for transmission.

What happens if we compress the data prior to transmission from the server? The file is 679,188 bytes in length. If we compress it using standard techniques (which are not optimized for HTML), we can expect it to shrink to 48,951 bytes, a 92.79% reduction. We are now transmitting only 48,951 bytes (plus some header information, which should also be compressed, but that's another story). Modem compression no longer plays a factor because the data is already compressed.
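The arithmetic is easy to reproduce; all numbers below are the article's own:

    FILE_SIZE = 679_188    # bytes, uncompressed music.htm
    COMPRESSED = 48_951    # bytes after gzip compression
    LINE_RATE = 3_600      # bytes/second, theoretical 28.8K modem throughput

    print("uncompressed:       %.1f s" % (FILE_SIZE / LINE_RATE))      # ~188.7 s
    print("modem compression:  %.1f s" % (FILE_SIZE / 2 / LINE_RATE))  # ~94 s (about 2:1)
    print("gzip on the server: %.1f s" % (COMPRESSED / LINE_RATE))     # ~13.6 s
    print("size reduction:     %.2f%%" % (100 * (1 - COMPRESSED / FILE_SIZE)))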

Where are the performance improvements?

  • Bandwidth is conserved
  • Compression consumes only a few milliseconds of CPU time
  • The server's TCP/IP subsystem only has to serve 48,951 bytes to the modem
  • At a transfer rate of 3,600 bytes per second the file arrives in 13.6 seconds instead of 90 seconds

Compression clearly makes sense as long as it's seamless and doesn't kill server performance.

What else remains to be done?

A lot! Better algorithms need to be invented that compress the data stream more efficiently than gzip; remember, gzip was designed before HTML came along. Any technique that adds a new compression algorithm will require a thin client to decode it, and possibly tunneling techniques to make it "firewall friendly." To sum up, we need:

  1. Improved compression algorithms optimized specifically for HTML/XML.
  2. Header compression. Every time a browser requests a page it sends a block of request headers. In the case of WAP browsers, header information can be as large as 900 bytes; with compression this can be reduced to less than 100 (see the sketch after this list).
  3. Compression for WAP. (Currently WAP/WML does not support a true entropy-encoding technique; it uses binary encoding to compress the tags while ignoring the content.)
  4. Dynamic compression for caching servers.
  5. Real-time compression/encryption with tunneling.
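As a rough illustration of item 2, the snippet below deflates a made-up request-header block; general-purpose deflate alone gives only a modest saving on inputs this small, which is exactly why header-specific compression is listed here as work still to be done:

    import zlib

    # A made-up request-header block, roughly the size a small browser sends.
    headers = (
        "GET /news/index.wml HTTP/1.1\r\n"
        "Host: wap.example.com\r\n"
        "Accept: text/vnd.wap.wml, application/vnd.wap.wmlc, */*\r\n"
        "Accept-Charset: utf-8, iso-8859-1\r\n"
        "Accept-Language: en\r\n"
        "User-Agent: Example-WAP-Browser/1.0 (hypothetical)\r\n"
        "Connection: keep-alive\r\n\r\n"
    )

    compressed = zlib.compress(headers.encode("ascii"), 9)
    print(len(headers), "header bytes ->", len(compressed), "bytes deflated")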

Further Reading

  • comp.compression FAQ
  • GZIP home page
  • HTTP Compression Resources: lists other HTTP compression products
  • RFCs (Requests for Comments)
    • RFC 1952: the GZIP file format specification
    • RFC 2616: the official HTTP 1.1 protocol specification
  • Remote Communications
    • ApacheBench: Apache's free benchmarking tool, modified to support measurement of RC's Apache Web server acceleration module
    • mod_gzip
    • mod_gzip FAQ

Related Stories:
Apache Module Registration: mod_gzip (Aug 29, 2000)


Talkbacks:
  Get rid of indenting.....
Getting rid of the indenting that HTML editors include could speed up the pages some more. By around 8 to 10% if you use tables like I do.   
  Oct 13, 2000, 18:01:59
  Compression
How much faster will this be than simply using built in Modem compression protocols or PPP compression?
  
  Oct 13, 2000, 18:19:35
  mod_gzip?
It is a good idea to compress any website dynamically.

Is it good to have a mod_gzip? Compression should be a
transparent filter in the network stream of the Apache server.
So it should be a pipeline module.

Chang LI
  
  Oct 14, 2000, 04:17:51
  doubt
today's websites are mostly made up of multimedia files like JPEG and MPEG that are
already quite compact. The share of dynamic content is also rising, which increases
the server's load and competes for CPU resources with a compression module. Also
consider that network bandwidth is the dominant factor affecting the user's
perception of latency, and rising technologies like replication and mirroring are
relieving it, as Akamai is doing. It is rather doubtful whether compression is
really so useful.
  
  Oct 14, 2000, 17:22:04
   Re: Compression
> How much faster will this be than simply using built in Modem compression protocols or PPP compression?


This question reminds me of the typical network layer capabilities problem:
should the ends do the most processing or should the network do it instead?
In both cases, the answer is the same, IMO: end to end solutions are usually
better (or at least as good as) point to point solutions. End to end compression
means more computing power on both the sender and the receiver but, as data
goes compressed all along the way, the TOTAL used bandwidth is smaller. So if
you are requesting uncompressed data, modem compression (MNP IIRC) will reduce
bandwidth between you and your provider, but NOT between your provider and the
source of the data. I guess this means data compression is mainly better for
data sources (since they can send more data at the same cost), as well as for
anyone paying for transferred bytes (compression=less bytes ;-)

On PPP compression protocols, IIRC, compression was achieved by dropping known
parts of the headers (for example, the source IP, PPP being a point-to-point
protocol), not by compressing the payload (I may be mistaken, of course, since I
do not know too much about PPP).

HTH

Marcos   
  Oct 16, 2000, 09:26:46
  php and mod_gzip
ok,
this sounds really cool and useful, but can I use it to compress the output from my php-parsed web pages and if so, how?   
  Oct 17, 2000, 01:52:04
  mod_gzip is for static websites only
> ok, this sounds really cool and useful, but can I use
> it to compress the output from my php-parsed web pages
> and if so, how?

No, mod_gzip is not able to compress dynamically generated
web pages, unfortunately. :(

There are two companies offering accelerators that are
capable of compressing both static and dynamic HTML data:
www.packeteer.com with a hardware solution and www.vigos.com
with a software solution.
  
  Oct 18, 2000, 10:09:56
   Re: php and mod_gzip
You must use a proxy... it does not compress dynamic content.

The usual way is to place an HTTP proxy in front of the web servers to compress everything.

  
  Oct 18, 2000, 11:38:11
   Re: Re: php and mod_gzip
> You must use a proxy... it does not compress dynamic content.
> The usual way is to place an HTTP proxy in front of the web servers to compress everything.


ok, should i set up apache+mod_gzip or squid or something else as the proxy?

could you provide links to the info on how to do this, and links to the software for it?
  Oct 18, 2000, 21:04:58
   Re: php and mod_gzip
PHP gzip compression...

http://leknor.com/code/php/view/class.gzip_encode.php.txt   
  Oct 20, 2000, 06:23:43
   Re: Compression
The compression algorithm built into modems is (probably) very old, so new algorithms could mean a significant improvement. Also, this works for any network connection, not just modems.
  Nov 15, 2000, 12:11:24
  The real problem is...
Compression is good, but the real problem is LATENCY.

For true acceleration, you want it faster for all users, broadband and modem and for both static and dynamic content. You can easily see that if it takes you too long to perform compression, you can actually /add/ time to the total page access time. So in the real world, it's typically faster to send broadband users uncompressed data rather than taking the time to reduce the number of bytes.

Because of this problem, Packeteer's product says it operates in pass-through for broadband users. But the question is, how can you really tell an end user's access speed? You simply don't know. All proxy servers will pretty much seem like fast users. So all modem users behind proxy servers (such as AOL) won't get compressed/accelerated content.

And for medium- to high-traffic sites, the latency problem is exacerbated by typical OS scheduling. If you have even twenty simultaneous users asking for a page, the box has to compress and service clients 1-19 before it can service client 20. Waiting that long can make it faster to send even modem users uncompressed data. Each user connection is a separate process that has to get scheduled individually, so increasing the number of users exacts an enormous performance penalty in terms of CPU usage and user response time.

And having lots of slow clients doesn't make things easier, because each of those users is going to burn an apache process longer, which means more scheduling and context swapping.

In short, this is a very difficult problem to solve, and shouldn't be added to existing webservers if you care about performance or have more than a few users.

Disclaimer: Our company has developed specialized appliances that solve the problem, so we know the limitations of existing products and are biased toward our very high-performance solution. More later :)
  Dec 12, 2000, 01:32:09
  Gzip and images
Can I use gzip compression for image transfers? How...
  Dec 14, 2000, 02:43:00