See also SimpleModProxy, ApacheModProxySsl


After a few face-to-face discussions with some of you (Stefanomazzocchi on the phone ranting about setting up Tomcat, Jeremy over lunch at my place a few weeks ago, and several others), and a few odd questions popping up on the list, I feel the need to tell you why my vision is so narrow whenever someone touches the Apache argument.

As I have said several times over the past 6 years, I've learnt how to use Apache (1.3 first, and 2.0 lately) to suit my needs, and I would never envision an HTTP server running without it.

Given my pragmatic vision, it's hard to explain why I am so biased, and probably the best way out of the loop is to share the few things I've learnt that make my everyday life as an administrator easy...

So, here are a few tips for those of you who wonder about my rants.


Why Apache as a front end?

Probably the first and most important question to answer is WHY it is so important to have Apache HTTPd as a front-end for a website.

I believe that for anyone, there's nothing more annoying than hitting a web page, waiting for a few seconds, and then seeing our favorite browser come up with "The connection was refused when attempting to contact http://www.domain.tld/".

In my opinion (and my boss's) it is unacceptable to have "downtime" on a website, and if it happens, whoever connects needs to know what's going on, or, at least, we need to tell them something ... {{{We are sorry, but currently http://www.domain.tld/ is unavailable because of essential system upgrades. We expect to resume all our services in less than 10 minutes. Please, check back later}}} ... sounds so much better (maybe with our nice little logo, and yada, yada, yada).

When I once asked Brian Behlendorf why Apache was doing some odd things in the code, he responded "Call it defensive programming". This explains the entire vision behind Apache: no matter what, Apache can not "go down" and stop responding to HTTP requests. This is its essence, and its design is centered around this idea, so, in my opinion (and experience), it is the one option allowing us to achieve our goal of "zero port 80 downtime".

Apache's design enforces a multi-process model: there is always a minimal parent process bound to port 80 (as safe and minimal as possible), which keeps a pool of child OS processes around to do the actual work. Even in the worst case (a segmentation violation that takes down an entire child process), the parent keeps the port open and replaces the dead child, so the next request still gets an answer.

A Java-based web server can not achieve this. A Java virtual machine is a single OS process, and if something happens to it, it will just exit, unbinding port 80 and leaving our clients with "connection refused".

There is another issue, an important one, about security. Java does not support switching user ID after it has started, and under UNIX operating systems, everyone knows that no one apart from root can bind to ports < 1024.

In our case that is a problem: either I run my service as root (and that is NOT a good idea), or I bind to some port > 1024 (usually 8080). But then complexity arises when forwarding requests for port 80 (our usual HTTP service) to a port above 1024 (8080). Firewall packages or port remappers, any of those solutions involves some degree of complexity.

Apache avoids all that. Being native, it can bind to ports < 1024 and run as a non-privileged user, allowing us to run our servlet container (as well) as a non privileged user.
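
To make the privilege drop concrete, this is roughly what it looks like in httpd.conf. Take it as a minimal sketch: the account name "www" is only a placeholder of mine, so use whichever unprivileged user and group exist on your system.

```apache
# The parent process is started by root, so it can bind port 80
Listen 80

# The children that actually serve requests switch to this
# unprivileged account ("www" is a placeholder, not a requirement)
User  www
Group www
```

The servlet container on port 8080 is then started separately under its own unprivileged account, and never needs root at all.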

But those are not the only advantages. Apache helps us in many more ways, and I hope, at this point, to be able to show you what and how...


What Apache? How Apache?

A very personal choice is what version of Apache you want to run. In my following examples I will assume you're going to use Apache 2.0, as it is now stable and performs much better than the "old" 1.3.

For several months now, most of the sites hosted by VNU (my employer) have been running 2.0 (apart from our old legacy "rolaren" server), and I have never had a single problem with it.

Apache 2.0, though, is somewhat more "difficult" to build and configure: the hardest choice is the selection of the MPM (Multi-Processing Module) to use. Read the manual to choose what suits you best, but in my case the "worker" MPM (multi-process, multi-threaded) is the one giving me the best performance/solidity ratio.

The "www.apache.org" website, on the other hand, uses the "prefork" MPM (multi-process, single-threaded, exactly as Apache 1.3 did), but I feel that under certain operating systems it is slightly slower than "worker". Your choice.
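
Whichever MPM you pick, its process and thread pool is sized in httpd.conf. The block below is only a sketch for "worker": the numbers are starting values in the spirit of a stock configuration, as far as I remember it, and not tuned figures from any of my servers.

```apache
# Only read when httpd was built with --with-mpm=worker
<IfModule worker.c>
    # Child processes started at boot
    StartServers          2
    # Upper bound on simultaneous requests (processes x threads)
    MaxClients          150
    # Idle threads kept around across all children
    MinSpareThreads      25
    MaxSpareThreads      75
    # Threads created by each child process
    ThreadsPerChild      25
    # 0 means a child is never recycled after N requests
    MaxRequestsPerChild   0
</IfModule>
```

With these values Apache runs at most 150 / 25 = 6 child processes, each carrying 25 threads.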

As a reference, I configure Apache 2.0 in the following way:

./configure \
    --with-mpm=worker \
    --enable-modules=all \
    --enable-mods-shared=all \
    --enable-proxy \
    --enable-proxy-http \
    --disable-ipv6

Basically, I use the worker module, all modules are compiled as DSO modules (dynamically loaded, so that I can disable the ones I don't use), including the proxy/proxy-http module, and I don't care for IPv6 support.
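
Since everything is built as a DSO, switching a module on or off is a one-line change in httpd.conf. A small sketch follows; which modules you keep is entirely your call, and the selection here is only an illustration.

```apache
# Modules the configurations on this page rely on
LoadModule proxy_module      modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so
LoadModule rewrite_module    modules/mod_rewrite.so
LoadModule include_module    modules/mod_include.so

# A module I don't need: comment the line out and it is never loaded
#LoadModule userdir_module   modules/mod_userdir.so
```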


Connecting Cocoon

Like Stefano, I had several headaches trying to connect Apache and (name your servlet container of choice). Mod_JK (JK2) doesn't work for me. Mod_webapp works for me, but just for me, because I'm the author, and I was sadly forced to abandon its development. The only solution I see (and the one which currently works best for me) is mod_proxy.

Mod_proxy is a nice little module, especially in Apache 2.0, where its caching part is completely decoupled into another module (mod_cache). It's very small, lightweight, and does the job...

Plus, you have the advantage of choosing whatever servlet container you like for the backend: Orion, WebSphere, Tomcat, Jetty, you name it, as long as it supports HTTP :-) (well, apart from ServletExec, but that's another story, and if someone wants some hints, let me know).

Connecting Cocoon is simple: configure your servlet container to listen on a high port (8080, for example) as a non-privileged user, make sure that it knows that it is a proxied HTTP server (Cocoon, Jetty, Resin, Orion, ... they all have this concept, check out the documentation), and configure Apache as follows. The two lines that do the real work are ProxyPass and ProxyPassReverse at the end:

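LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so

# Never act as a forward (open) proxy
ProxyRequests Off

# Access control for proxied content. As far as I can tell this block
# also covers reverse-proxied requests, so as written only clients on
# localhost get through: widen the "Allow from" line for a public site.
<Proxy *>
  Order deny,allow
  Deny from all
  Allow from localhost
</Proxy>

ProxyPass        / http://localhost:8080/
ProxyPassReverse / http://localhost:8080/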

The ProxyPass line tells Apache that any request whatsoever (from / -slash- onwards) gets "proxied" to localhost:8080, and the ProxyPassReverse line tells Apache to make sure that any Location HTTP header coming back gets rewritten accordingly (just in case your servlet container doesn't let you set the "proxied" configuration).
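
To picture what ProxyPassReverse does, here is the same directive annotated with the header it touches. The path "/some/page" is made up for illustration.

```apache
# Without ProxyPassReverse the client would see the backend address:
#   Location: http://localhost:8080/some/page
# With it, Apache rewrites the header to the public address:
#   Location: http://www.domain.tld/some/page
ProxyPassReverse / http://localhost:8080/
```

Note that only response headers are rewritten this way. Links inside the HTML body are left alone, which is one more reason to tell the servlet container that it is being proxied.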

That's IT. It runs, and it runs smoothly.


Trivially serving static files

Now, Apache is definitely faster than any Java-based servlet container at serving files straight to HTTP clients. This is because nowadays it uses a kernel-level call, sendfile, which gives it far better performance than anything Java can do.

Mod_proxy and its ProxyPass directives don't let us specify a "pattern" for resources to be served straight off the filesystem; they only let us define which URL prefixes are proxied and which are excluded.

In my example, then, I will rewrite my configuration so that Apache serves everything beginning with /static/ straight out of my web application, without even touching the servlet container:

# Make sure that my document root points to the root of the web
# application (where the WEB-INF is located, for instance).
DocumentRoot /export/webapps/cocoon

# We don't proxy any request beginning with the keyword "/static/".
# So, for example, "/static/logo.gif" will be served directly by
# Apache from the "/export/webapps/cocoon/static/logo.gif" file
ProxyPass        /static/ !

# Another one for "favicon.ico", so that explorer and mozilla are happy
ProxyPass        /favicon.ico !

# And now we send back to the servlet engine everything else that does
# not begin with "/static/" or "/favicon.ico"
ProxyPass        / http://localhost:8080/
ProxyPassReverse / http://localhost:8080/

Simple, the ! (exclamation mark) keyword in ProxyPass means "don't" :-)
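
One thing to keep in mind when adding more exclusions: ProxyPass rules are checked in the order they are written and the first match wins, so every "!" line has to come before the catch-all. A sketch, where "/images/" and "/robots.txt" are made-up examples:

```apache
# Exclusions first: these are served by Apache from the document root
ProxyPass        /images/    !
ProxyPass        /robots.txt !

# The catch-all goes last, otherwise it would swallow the paths above
ProxyPass        / http://localhost:8080/
ProxyPassReverse / http://localhost:8080/
```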


The holding page

If you used one of the configurations above, you'll see that if your servlet container is not responding on port 8080 for any reason, you will get a nice "Bad Gateway" error page (HTTP 502 Error).

As that page is quite ugly (I have to admit that the HTTPd freaks are not good HTML artists), you might want to point your clients to a better-designed page (or containing some lame excuse on why your servlet container is down).

You can do that easily (again) by using the ErrorDocument directive. Note, though, that ErrorDocument here needs a local page that Apache can serve by itself (so it must not be proxied). Either you get down and nasty with your mod_alias configuration or, simply, use the second configuration and include the page in your webapp as a static file. Either way, all you have to specify is:

# If mod_proxy cannot connect to the servlet container, we want
# to display a nice static page saying the reason
ErrorDocument 502 /static/unavailable.html
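
If you would rather go the mod_alias way and keep the holding page outside the web application, it could look something like this. It is only a sketch, and "/export/errors" is a hypothetical directory of mine.

```apache
# Serve the holding page from a directory outside the web application
Alias /errors/ "/export/errors/"
<Directory "/export/errors">
    Order allow,deny
    Allow from all
</Directory>

# Keep /errors/ away from the proxy; this has to come before the
# catch-all "ProxyPass /" line
ProxyPass /errors/ !

ErrorDocument 502 /errors/unavailable.html
```

The advantage is that the page survives even when the web application directory is being replaced during an upgrade.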

If (for example) you wanted to use Server-Side Includes to render your page (it might be nice to display something like the host name, or the time when the request was received), you can do so by using SHTML files. This is what I use at home:

<html>
  <head>
    <title><!--#echo var="SERVER_NAME"-->: server off-line</title>
  </head>
  <body>
    <h3><!--#echo var="SERVER_NAME"-->: server off-line</h3>
    <p>
      We are sorry, but the server is temporarily unavailable due to
      maintenance. Our team is working to restore service as soon as
      possible.<br />
      In case of troubles, please feel free to contact our webmaster
      sending an email to
      <a href="mailto:<!--#echo var="SERVER_ADMIN"-->">
        &lt;<!--#echo var="SERVER_ADMIN"-->&gt;
      </a>.
    </p>
    <hr/>
    <p>
      <small>
        <!--#echo var="SERVER_SOFTWARE"--> running on
        <!--#echo var="SERVER_NAME"-->:<!--#echo var="SERVER_PORT"-->
        at <!--#echo var="DATE_LOCAL"-->.
      </small>
    </p>
  </body>
</html>

And to make it work properly, this is how your httpd.conf will have to look:

# Make sure that Server Side Includes are processed and sent
# to the client with mime-type as text/html
AddType text/html .shtml
AddOutputFilter Includes .shtml

# Make sure that our SHTMLs are processed in the static
# directory
<Directory "/export/webapps/cocoon">
    Options +IncludesNoExec
</Directory>

# If mod_proxy cannot connect to the servlet container, we want
# to display a nice static page saying the reason. This is a
# SHTML page (using the Server-Side-Includes filter)
ErrorDocument 502 /static/unavailable.shtml


Putting it all together (step one)

OK, now that we have seen how each piece works, let's put them all together, adding also that any request to /WEB-INF/ should be forbidden straight away (there's no point in proxying them when we know that the servlet container will block them all):

# Make sure that my document root points to the root of the web
# application (where the WEB-INF is located, for instance).
DocumentRoot /export/webapps/cocoon

# Make sure that Server Side Includes are processed and sent
# to the client with mime-type as text/html
AddType text/html .shtml
AddOutputFilter Includes .shtml

# Make sure that our SHTMLs are processed in the static
# directory
<Directory "/export/webapps/cocoon">
    Options +IncludesNoExec
</Directory>

# Block the stupid "WEB-INF" pseudo-url (god I wish web-applications
# were designed with some intelligence... Ok, my fault as well)
<Location /WEB-INF>
    Order deny,allow
    Deny from all
</Location>

# If mod_proxy cannot connect to the servlet container, we want
# to display a nice static page saying the reason. This is a
# SHTML page (using the Server-Side-Includes filter)
ErrorDocument 502 /static/unavailable.shtml

# We don't proxy any request beginning with the keyword "/static/".
# So, for example, "/static/logo.gif" will be served directly by
# Apache from the "/export/webapps/cocoon/static/logo.gif" file
ProxyPass        /static/ !

# Another one for "favicon.ico", so that explorer and mozilla are happy
ProxyPass        /favicon.ico !

# And now we send back to the servlet engine everything else that does
# not begin with "/static/" or "/favicon.ico"
ProxyPass        / http://localhost:8080/
ProxyPassReverse / http://localhost:8080/

Simple, easy, beautiful...


A more complex example: mod_rewrite

This is all nice and clean, but if we want to be really nasty and start serving (for example) all our GIF and JPG files straight via Apache, we need to use mod_rewrite.

I know, mod_rewrite is ugly, and it uses Perl-style regular expressions (so, well, it's even slightly slower), but mod_proxy is way too crude: a URL is either "in" or "out", and it takes over the whole world (you can't really do much else once you've said you're going to forward a URL).

So, mod_rewrite, even if it's ugly, even if it's slower, is our solution. With a couple of rules, we can take the configuration written above to the extreme, and basically do WHATEVER we want with a URL before it even knows about a possible servlet container in the backend.

I suggest you read the mod_rewrite documentation carefully but, as a start, I'm going to rewrite what's written above using mod_rewrite and its flags. From here on, you're on your own :-) :-)

# Make sure that my document root points to the root of the web
# application (where the WEB-INF is located, for instance).
DocumentRoot /export/webapps/cocoon

# Make sure that Server Side Includes are processed and sent
# to the client with mime-type as text/html
AddType text/html .shtml
AddOutputFilter Includes .shtml

# Make sure that our SHTMLs are processed in the static
# directory
<Directory "/export/webapps/cocoon">
    Options +IncludesNoExec
</Directory>

# If mod_proxy cannot connect to the servlet container, we want
# to display a nice static page saying the reason. This is a
# SHTML page (using the Server-Side-Includes filter)
ErrorDocument 502 /static/unavailable.shtml

# The nastiness begins, let's fire up the "rewrite engine"
RewriteEngine On

# Everything that starts with "/static" or "/static/" is served straight
# through: no redirection, no proxying, no nothing, and the [L] flag
# implies that if this rule is matched, no other matching must be
# performed
RewriteRule "^/static/?(.*)" "$0" [L]

# Everything that starts with a NON-CASE-SENSITIVE match (the NC flag)
# of "/WEB-INF" or "/WEB-INF/" is forbidden (the F flag). And again,
# this is the last rule (the L flag), nothing will be processed by the
# rewrite engine if this rule is matched
RewriteRule "^/WEB-INF/?(.*)" "$0" [L,F,NC]

# Everything ending in ".gif", ".jpg" or ".jpeg" will be served again
# directly by Apache, no need to bother the servlet container. As above
# this is the last rule as specified by the [L] flag at the end
RewriteRule "^/(.*)\.gif$" "$0" [L]

RewriteRule "^/(.*)\.(jpg|jpeg)$" "$0" [L]

# Everything else not matched above needs to go to the servlet container
# via HTTP listening on port 8080. The [P] flag (which is required)
# implies that our requests will be handled by mod_proxy.
RewriteRule "^/(.*)" "http://localhost:8080/$1" [P]

# Make sure that if the servlet container specifies a "Location" HTTP
# header during redirection starting with "http://localhost:8080/", we
# can handle it and return to our client the effective (not real)
# location we want to redirect them to. This is _essential_.
ProxyPassReverse / http://localhost:8080/

As I mentioned before, ugly, but really effective. In a few lines we connect the HTTP-based servlet container running Cocoon to Apache, we make sure that if the servlet container falls over, we direct people to an appropriate holding page, we serve all that is under /static, all GIF and all JPEG files straight off without touching Cocoon, and all the rest through our sitemap. As a free bonus, everything that ends in ".shtml" (from disk or from the sitemap) will be passed through the Apache "Server-Side-Includes" filter (mod_include, which is ugly, but sometimes _really_ effective)...
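
Once you are on mod_rewrite you can go one step further and only bypass the servlet container when the file really exists on disk, so that images generated by the sitemap still reach Cocoon. A hedged sketch: ".png" and ".css" are additions of mine, and the two lines are meant to sit after the WEB-INF rule and before the final [P] rule.

```apache
# Test whether the requested URL maps to a real file under the
# document root (in server or virtual host context the URL has not
# been mapped to a filename yet, so we build the path ourselves)
RewriteCond "%{DOCUMENT_ROOT}%{REQUEST_URI}" -f

# If so, leave the URL untouched ("-") and stop rewriting ([L]);
# Apache then serves the file itself. NC ignores case in the extension.
RewriteRule "\.(gif|jpe?g|png|css)$" "-" [L,NC]
```

Anything that does not exist on disk simply falls through to the proxy rule.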


Letting Apache handle error pages

Whenever we want Apache to handle error messages in a consistent way (basically overwriting what Cocoon writes as a body in error pages), we can do that by simply adding a few lines to the configurations we used before:

# Make sure that Apache processes the headers coming back from the proxy
# requests. This will enable also the evaluation of HTTP status codes.
ProxyPassReverse / http://localhost:8080/

# Tell mod_proxy that it should not send back the body-content of
# error pages, but be fascist and use its local error pages if the
# remote HTTP stack is sending an HTTP 4xx or 5xx status code.
ProxyErrorOverride On

# For each individual error we want to handle, let's specify what file
# we want to use. Note that all files must be available through a
# locally accessible directory (as our /static/), and they can even be
# SSI files (SHTML files).
ErrorDocument 404 /static/notfound.shtml
ErrorDocument 500 /static/error.shtml
ErrorDocument 502 /static/unavailable.shtml

This is how it can be done, so that (for example, as suggested by Jeremy), one can configure Cocoon to dump full-stack-traces on the staging server, (or from an interface available only to the internal network), while displaying nicely formatted error messages to our client.
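
One way to get that split is a pair of virtual hosts in front of the same servlet container. This is a sketch under my own assumptions: "staging.domain.tld" is a made-up name, the SSI settings shown earlier are assumed to be in place, and you would still want to restrict the staging host to your internal network.

```apache
NameVirtualHost *:80

# Public site: Cocoon's own error bodies are replaced by ours
<VirtualHost *:80>
    ServerName   www.domain.tld
    DocumentRoot /export/webapps/cocoon

    ProxyErrorOverride On
    ErrorDocument 404 /static/notfound.shtml
    ErrorDocument 500 /static/error.shtml

    ProxyPass        /static/ !
    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/
</VirtualHost>

# Staging site: errors pass through untouched, stack traces and all
<VirtualHost *:80>
    ServerName   staging.domain.tld

    ProxyErrorOverride Off

    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/
</VirtualHost>
```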


Preserving the Host header through a proxy

In some cases, it is quite important to preserve the Host header throughout the proxied request.

For example, to be able to deal with multiple virtual hosts on the backend servlet container, the proxied request MUST include the original Host name requested by our client. Apache allows us to pass this value through using the ProxyPreserveHost directive:

# Make sure that the virtual host name is passed through to the
# backend servlet container for virtual host support.
ProxyPreserveHost On
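
To see what changes on the wire, here is the directive again with the Host header annotated for each case (a sketch, using the same backend address as everywhere else on this page):

```apache
# Client asks for:              Host: www.domain.tld
# Backend sees, directive Off:  Host: localhost:8080
# Backend sees, directive On:   Host: www.domain.tld
ProxyPreserveHost On
ProxyPass        / http://localhost:8080/
ProxyPassReverse / http://localhost:8080/
```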


Putting it all together (step two)

Linking together all the different pieces we've analyzed before, now, we can attempt to write up a do-it-all fragment of our httpd.conf file:

#######################################################################
# GLOBAL CONFIGURATIONS                                               #
#######################################################################

# Make sure that my document root points to the root of the web
# application (where the WEB-INF is located, for instance).
DocumentRoot /export/webapps/cocoon

# Make sure that Server Side Includes are processed and sent
# to the client with mime-type as text/html
AddType text/html .shtml
AddOutputFilter Includes .shtml

# Make sure that our SHTMLs are processed in the static
# directory
<Directory "/export/webapps/cocoon">
    Options +IncludesNoExec
</Directory>

#######################################################################
# ERROR PAGES CONFIGURATION                                           #
#######################################################################

# If mod_proxy cannot connect to the servlet container, we want
# to display a nice static page saying the reason. This is a
# SHTML page (using the Server-Side-Includes filter)
ErrorDocument 502 /static/unavailable.shtml

# For each individual error we want to handle, let's specify what file
# we want to use. Note that all files must be available through a
# locally accessible directory (as our /static/), and they can even be
# SSI files (SHTML files).
ErrorDocument 404 /static/notfound.shtml
ErrorDocument 500 /static/error.shtml

#######################################################################
# MOD_PROXY CONFIGURATIONS                                            #
#######################################################################

# Make sure that if the servlet container specifies a "Location" HTTP
# header during redirection starting with "http://localhost:8080/", we
# can handle it and return to our client the effective (not real)
# location we want to redirect them to. This is _essential_ to handle
# also the error returned by the backend servlet container.
ProxyPassReverse / http://localhost:8080/

# Make sure that the virtual host name is passed through to the
# backend servlet container for virtual host support.
ProxyPreserveHost On

# Tell mod_proxy that it should not send back the body-content of
# error pages, but be fascist and use its local error pages if the
# remote HTTP stack is sending an HTTP 4xx or 5xx status code.
ProxyErrorOverride On

#######################################################################
# MOD_REWRITE CONFIGURATIONS                                          #
#######################################################################

# The nastiness begins, let's fire up the "rewrite engine"
RewriteEngine On

# Everything that starts with "/static" or "/static/" is served straight
# through: no redirection, no proxying, no nothing, and the [L] flag
# implies that if this rule is matched, no other matching must be
# performed
RewriteRule "^/static/?(.*)" "$0" [L]

# Everything that starts with a NON-CASE-SENSITIVE match (the NC flag)
# of "/WEB-INF" or "/WEB-INF/" is forbidden (the F flag). And again,
# this is the last rule (the L flag), nothing will be processed by the
# rewrite engine if this rule is matched
RewriteRule "^/WEB-INF/?(.*)" "$0" [L,F,NC]

# Everything ending in ".gif", ".jpg" or ".jpeg" will be served again
# directly by Apache, no need to bother the servlet container. As above
# this is the last rule as specified by the [L] flag at the end
RewriteRule "^/(.*)\.gif$" "$0" [L]
RewriteRule "^/(.*)\.(jpg|jpeg)$" "$0" [L]

# Everything else not matched above needs to go to the servlet container
# via HTTP listening on port 8080. The [P] flag (which is required)
# implies that our requests will be handled by mod_proxy.
RewriteRule "^/(.*)" "http://localhost:8080/$1" [P]

And that's all... You can roughly copy and paste this example into a <VirtualHost> section of your httpd.conf (obviously after having applied the appropriate modifications), and go...
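
The wrapper itself can be as small as the sketch below. The admin address is a placeholder of mine.

```apache
NameVirtualHost *:80

<VirtualHost *:80>
    # Both values show up in the SSI holding page through
    # SERVER_NAME and SERVER_ADMIN
    ServerName  www.domain.tld
    ServerAdmin webmaster@domain.tld

    # ... paste the fragment above here, "RewriteEngine On" included ...
</VirtualHost>
```

Keep "RewriteEngine On" inside the virtual host: by default, rewrite settings made in the main server configuration are not inherited by virtual hosts.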


Conclusions

I hope to have cleared up some of the doubts about Apache, and why I love it so much... It is a hub, a hub embracing your website and making it work better, faster, more reliably, and fine-tuned exactly as you (or your boss) like it.

And you can trust Apache. I believe that our spirit, the spirit of the entire Cocoon community, is built on top of the original HTTPd vision: "let's make things work so nicely that the world won't have to look for another solution"...

HTTPd does it in its little piece of being an HTTP hub, Jetty does it in its little piece of being a servlet container, Cocoon does it in its little piece of being the best "web-application" framework available on the planet right now. Together, those three little pieces will conquer the world.

Have fun...
PierFumagalli


Notes from CalebRacey - when mounting applications using reverse proxying, it appears that you need to keep the URL path intact for state and session information to be preserved.
e.g. "ProxyPass /cocoon/ http://someserver.com:8080/cocoon/" works, and the state examples that come with Cocoon work properly,
but "ProxyPass /proxy/cocoon/ http://someserver.com:8080/cocoon/" seems to break the state examples by inserting /proxy/ into the URL. The same happens with a PHP application I have on another server, suggesting that this is a general session issue and not specific to Cocoon.


Notes from RBD on proxying with different names
I think that the cause of the above is that servlet containers set the path property of the session cookie to the application's own path (/cocoon). The browser only ever sees /proxy/cocoon/, which does not match, and this means that most browsers will not send back the cookie.
For the brave using open source containers, you could try rolling your own version of your container which supports setting different paths.
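
A possible alternative to patching the container: later Apache releases (2.2 onwards, as far as I know) have a mod_proxy directive, ProxyPassReverseCookiePath, that rewrites the cookie path on the way back. Treat this as an untested sketch built on Caleb's example above.

```apache
# Mount the backend's /cocoon/ under a different public path
ProxyPass        /proxy/cocoon/ http://someserver.com:8080/cocoon/
ProxyPassReverse /proxy/cocoon/ http://someserver.com:8080/cocoon/

# Rewrite "path=/cocoon" in Set-Cookie headers to the public path,
# so that the browser sends the session cookie back
ProxyPassReverseCookiePath /cocoon /proxy/cocoon
```

This only fixes the cookies. Absolute links that the application writes into its pages still point at /cocoon/.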


More notes from CalebRacey - in order to prevent people from using your proxy to cover their tracks, deny access to any request that doesn't start with a /

<LocationMatch "^[^/]"> 
Deny from all 
</LocationMatch> 

People scan for open proxies that will allow proxying of "http://" requests, so that they can set their browser to use them.


Notes from TonyCollen:

I've realized that mod_proxy can also be used in a way that will allow you to proxy through requests to virtualhosts, which I will write up very soon now.

---

Notes from GeoffHoward:

I couldn't get apache to start up under RedHat 7.something with --with-mpm=worker (worked when I changed it to --with-mpm=prefork). Anyone know why? The error log is lost now but mentioned apr_thread and resource not available. The options are described in the httpd docs and none of them make much sense to me. Anyone know a good idiot's summary of the issues involved? I don't have a lot of knowledge of threading issues across different flavors of unix (probably should).

---

Notes from StuartRoebuck:

I note that the current official Cocoon documentation states "don't use Apache's mod_proxy because it is not fully compatible with HTTP/1.1 and disables connection keep-alive".

It would be helpful to have some clarification of whether this is still an issue.

---

Notes from PierFumagalli

Stuart, Apache 1.3's mod_proxy wasn't fully compatible with HTTP/1.1. Apache's mod_proxy has been entirely rewritten for the 2.0 release, and now it works like a rock!!!

---

Notes from BillHumphries

Tony, see p. 188 of the O'Reilly Apache Book for an example of this. The book's examples focus on Mod_Perl.

The proxy is much less of a headache than building additional modules, and handy for whatever heavyweight processes besides a servlet engine you run.

Caleb, you can also limit which hosts may speak to the proxy through:

{{{<Proxy *>

</Proxy>}}}

It's a good idea to limit access to the local machine and whatever IPs you're debugging from.

---

Notes from ScottKelley

 RewriteEngine         on
# for debugging
#RewriteLog            /tmp/mod_rewrite.log
#RewriteLogLevel       1

 ProxyRequests         Off

# make sure the values in ProxyPassReverse match the ones in RewriteMap file
 ProxyPassReverse  /  http://10.0.0.2:80/
 ProxyPassReverse  /  http://10.0.0.1:9101/

# Make sure that the virtual host name is passed through to the
# backend servlet container for virtual host support.
 ProxyPreserveHost On

# Tell mod_proxy that it should not send back the body-content of
# error pages, but be fascist and use its local error pages if the
# remote HTTP stack is sending an HTTP 4xx or 5xx status code.
 ProxyErrorOverride On

 RewriteMap    server  rnd:/usr/local/apache/conf/server_pool.conf

# make sure the status page is handled locally
# and make sure no one uses our proxy except ourself
 RewriteRule    ^/apache-rproxy-status.*  -  [L]
 RewriteRule    ^(http|https)://.*          -  [F]

#
# The S=X statements are critical so you skip to the last rewrite on a match.
# Don't fubar them.
#

# send jsp to 'dynamic' servers
# Vignette helpfully scatters them all around
 RewriteRule   ^/(.*\.(jsp))$         to://${server:dynamic}/$1  [S=5]

# send site to 'dynamic' servers
 RewriteRule   ^/(site/(.*))$         to://${server:dynamic}/$1  [S=4]

# send centric to 'dynamic' servers
 RewriteRule   ^/(centric/(.*))$    to://${server:dynamic}/$1  [S=3]

# send binary to 'dynamic' servers
 RewriteRule   ^/(binary/(.*))$       to://${server:dynamic}/$1  [S=2]

# send console to 'dynamic' servers
 RewriteRule   ^/(console/(.*))$      to://${server:dynamic}/$1  [S=1]

# send the fat media to 'static' servers
# we serve the little stuff locally from this apache instance
 RewriteRule   ^/(fatmedia/(.*))$      to://${server:static}/$1

# now proxy the rewritten rules to the right place
 RewriteRule   ^to://([^/]+)/(.*)      http://$1/$2    [E=SERVER:$1,P,L]

The server_pool.conf file looks like this. These addresses are actually hardware load balanced to many boxes.

dynamic   10.0.0.1:9101
static    10.0.0.2:80


I think the rule for favicon.ico is no longer necessary, since all browsers now understand <link rel="shortcut icon" ... /> --ML

ApacheModProxy (last edited 2009-09-20 23:42:55 by localhost)