Using mod_rewrite with Apache
“The great thing about mod_rewrite is it gives you all the configurability and flexibility of Sendmail. The downside to mod_rewrite is that it gives you all the configurability and flexibility of Sendmail.”
-- Brian Behlendorf, Apache Group
Introduction
On a small web site, navigation around the pages is straightforward, document management is simple, and the process is transparent to both webmaster and site user. However, as the size and complexity of the site increase, the overhead of managing documents manually becomes unsustainable. Where the site is supported by a database, managing site links, both internal and external, can become impossible by conventional means.
Fortunately, help is at hand. The most common web server on the Internet is Apache, and Apache has the tool for the job: mod_rewrite. mod_rewrite is a powerful tool to manage formatting and reformatting of URLs before they are presented to the main web server for processing. Further, it can handle automatic redirections, hot link protection (where another site links to images and other resources, usually without authorisation), virtual folder management and a wealth of other tasks. Once described as 'damn cool voodoo', mod_rewrite has the capacity to transform a nightmare of disparate web resources into a coherent, search-engine friendly web site.
The aim of this tutorial is to cover some frequently used basic functions and capabilities in sufficient detail that you can go away and start developing mod_rewrite configurations to solve problems of your own.
Prerequisites:
A working knowledge of regular expressions (see note 1)
Some familiarity with Apache configuration files & directives
Context:
For this document I'll assume that your site is hosted externally. This implies that directives that must be placed in HTTPD.CONF are not available. However, many Apache directives can be used in several different configuration files, and much of the configuration can be done on a per-directory basis through Apache's .htaccess file (see note 2).
If you're running your own Apache server in house or as a dedicated machine in a data centre then you might want to optimise your performance by configuring HTTPD.CONF instead, and remove the file handling overhead associated with reading .htaccess in the directory tree. This also gives some extra flexibility in configuring for debugging purposes. See Configuring HTTPD.CONF for some notes on this.
Enabling mod_rewrite
Even if mod_rewrite is enabled by the system administrator, its configuration isn't automatically inherited by subdirectories in the directory tree. Create .htaccess like this:
#Apache mod_rewrite rules.
RewriteEngine On
Options FollowSymLinks
All the examples below assume that this header is already present in the .htaccess file.
Note: the Apache documentation requires that Options FollowSymLinks is enabled. Third party hosting services usually enable this elsewhere by default and it can often be omitted in .htaccess. If you DO include it, it is likely that the default of Options All will be overridden. This will interfere with scripting languages such as PHP unless Options ExecCGI is also added. Depending on your requirements other Options directives may also need to be updated.
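One way to avoid overriding the other options is to add FollowSymLinks to the set already in effect rather than replacing it. The following is a minimal sketch of that variant of the header. It assumes your host permits Options in .htaccess at all, which depends on its AllowOverride setting.

```apache
# Apache mod_rewrite rules.
RewriteEngine On
# The leading + adds FollowSymLinks to the options already in effect
# instead of replacing the whole set.
Options +FollowSymLinks
```

With the + prefix, any options set higher up, such as ExecCGI, stay in force.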
Basic Directives
It is beyond the scope of this tutorial to cover all the directives in detail. See the excellent Apache documentation for full details. However, here's a full list of the directives, with the purpose of each and whether it can be used in .htaccess:
| Directive | Purpose | .htaccess? |
|---|---|---|
| RewriteBase | Sets the base URL for operations. | Yes |
| RewriteCond | Defines additional conditions to be applied to a RewriteRule. | Yes |
| RewriteEngine | Enables/disables the rewriting engine. | Yes |
| RewriteLock | Sets a LockFile resource for use with RewriteMap programs. Not required otherwise. | No |
| RewriteLog | Defines a log file for rewriting operations. | No |
| RewriteLogLevel | Sets the amount of information logged. | No |
| RewriteMap | Defines a mapping function for look-up. Allows text file, database or CGI mapping. | No |
| RewriteOptions | Defines special options for the engine. Currently limited to inherit, which causes the engine to use Rewrite directives inherited from the parent directory. | Yes |
| RewriteRule | Defines a basic condition to be met and the transformation to be applied. | Yes |
Processing mod_rewrite directives
For historical reasons the order in which the directives are written in .htaccess doesn't match the order in which they are processed. It's worth knowing because this can affect the order in which flags are recognised, and hence the order in which rules may be applied.
When matching a URL against Rewrite rules the engine scans the directives looking for a rule that applies. Then it scans any immediately preceding RewriteCond directives and checks for a match. The point to note is that if the RewriteRule isn't matched the preceding RewriteCond directives aren't evaluated.
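The processing order described above can be sketched with a single annotated rule. The /old/ and /new/ paths are hypothetical, chosen only to illustrate the sequence.

```apache
# Step 2: evaluated only if the RewriteRule pattern below has matched.
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
# Step 1: the pattern ^/?old/(.*) is tested against the URL path first.
# Step 3: the substitution /new/$1 is applied only if the condition passed too.
RewriteRule ^/?old/(.*) /new/$1 [L]
```

A request for a path that doesn't start with old/ never reaches the RewriteCond at all.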
Canonical names
The Problem
There are frequently several ways to access a particular part of your site. Some may be subject to change, or just an internal short cut. Regardless of the URL the user supplied, he should always see the preferred canonical one.
The solution:
Create .htaccess like this:
# Rewrite the URL for canonical names.
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteRule ^/?(.*) http://www.example.com/$1 [L,R,NE]
What's happening?
The RewriteRule is evaluated first. The pattern ^/?(.*) breaks down as follows:
^ - match the start of the key
/? - match an optional leading forward-slash. The ? makes the slash optional, so the rule matches whether or not the key starts with one (in .htaccess the leading slash has already been stripped).
.* - match zero or more of almost any character
() - defines a back-reference, so that we can refer to it later. In this case it captures everything matched by .*, but not the leading slash, which sits outside the parentheses.
Thus, we're matching any key, with or without a leading slash, and capturing everything after that slash for later use.
The expression will match, for example, /name, /folder, folder or stuff/folder, capturing name, folder, folder and stuff/folder respectively.
The RewriteRule defines a transformation to be applied - http://www.example.com/$1. Backreferences are defined by a $ symbol followed by a digit from 1 to 9, specifying which back-reference is to be used.
The transformation is returned precisely as we see it, but with any back-references substituted. In this case the engine returns a new URL with the contents of the first back-reference appended to it. This will give us a result such as http://www.example.com/folder.
The substitution is only applied if both RewriteCond directives pass as well. The first passes when the host name in the request does not start with www.example.com (the NC flag makes the comparison case-insensitive). The second passes when the host name is not empty, so requests that supply no host name at all are left alone.
Note that there are three flags appended to the Rewrite Rule. These are:
L – Last rule. Stop processing here.
R – Redirect. Send a Redirect with our new URL as a target.
NE – No Escape. Leave any special characters in the original string such as % alone. Normally these would be escaped.
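On its own, the R flag sends a temporary (302) redirect. Since a canonical name is meant to be permanent, you may prefer a 301 so that browsers and search engines remember the preferred address. A sketch of that variant, otherwise identical to the rule above:

```apache
# Rewrite the URL for canonical names, as a permanent redirect.
RewriteCond %{HTTP_HOST} !^www\.example\.com [NC]
RewriteCond %{HTTP_HOST} !^$
# R=301 sets the status code explicitly; plain R defaults to 302.
RewriteRule ^/?(.*) http://www.example.com/$1 [L,R=301,NE]
```

It is worth testing with the temporary redirect first, because browsers cache a 301 and a mistake is harder to undo.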
Changed Names
The problem
A document or folder has been replaced by a version with a new name. The site needs to maintain access to the old name as it is linked to extensively within the site and elsewhere. You might also use this to present a different path to the web from the path in the file system.
The solution
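# Rewrite oldfile.html to newfile.html.
# The optional leading slash (/?) lets the rule match in .htaccess,
# where the leading slash is stripped before matching.
RewriteRule ^/?oldfile\.html$ /newfile.html
# Rewrite /games to /usr/web/games
RewriteRule ^/?games(.*)$ /usr/web/games$1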
What's happening?
The protocol and host name are stripped off first, together with any parameters. They are reapplied automatically later. The engine then processes the remainder.
The first rule looks for a URL of /oldfile.html. Since we're matching beginning and end (^ and $ respectively) that's all we'll match. If we find it we substitute /newfile.html and carry on.
The second rule looks for a URL starting with /games. The parentheses ask the engine to capture everything matching the expression .* (zero or more of any character) as a back-reference. Then we substitute /usr/web/games and append any contents of the back-reference with $1.
The operation of the first rule is self-evident.
Rule 2 will match /games or /games/racing or /games/index.php and return /usr/web/games, /usr/web/games/racing or /usr/web/games/index.php respectively.
Once our rule processing is complete the scheme and host are prepended, and our query string is appended. Thus
http://www.example.com/oldfile.html
becomes
http://www.example.com/newfile.html
and
http://www.example.com/games/index.php?game=checkers
becomes
http://www.example.com/usr/web/games/index.php?game=checkers
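Because (.*) accepts any characters after games, the second rule would also rewrite an unrelated URL such as /gamesroom. If that matters on your site, a slightly tighter pattern can be sketched as follows. The /gamesroom path is a hypothetical example.

```apache
# Match /games itself and anything beneath it, but not /gamesroom.
# (/.*)? captures an optional slash plus the rest of the path.
RewriteRule ^/?games(/.*)?$ /usr/web/games$1
```

Here /games still becomes /usr/web/games and /games/racing becomes /usr/web/games/racing, while /gamesroom is left untouched.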
Hot link protection
The problem
All too often you post your image to your host as part of your carefully designed page, and someone finds it and links to it from their own site, or from a busy forum or blog. Before you know it, your image is being requested hundreds of times a day on sites unrelated to you or your own site. The alternative name for this is bandwidth theft, and it can cost you a packet in bandwidth or lost performance.
The solution
Browsers (optionally) send the address of the page they are working on as part of the request for a file, in the Referer header. mod_rewrite makes this available as HTTP_REFERER, and you can check it and take some action accordingly. If the browser is on a page on your site, HTTP_REFERER will match your domain. If so, fine. If not, you can refuse the request or return some other image or result.
Ensure all the files you wish to protect are easily identifiable. File extensions for images and multimedia are good. Or consider moving the files to a folder of their own and using a more general rule there.
Create .htaccess like this:
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://example\.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www\.example\.com/.*$ [NC]
RewriteRule \.jpg$ - [F]
What's going on?
The rule looks for requests for any file with a .jpg extension. Then it checks that HTTP_REFERER is not blank, and that it doesn't match the domains listed. A blank referrer field, or a match with one of the listed domains, will abandon the rule, so you can list as many domains here as you wish. For files used on multiple sites this will allow requests from any of those sites.
If the rule and all the conditions are matched, mod_rewrite refuses the request: the - leaves the URL unchanged and the F flag returns a 403 Forbidden response. The user will see a broken image symbol.
For multiple file types, repeat the block, changing the extension each time.
For example, to protect JPG, GIF and PNG files use:
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://example\.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www\.example\.com/.*$ [NC]
RewriteRule \.gif$ - [F]
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://example\.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www\.example\.com/.*$ [NC]
RewriteRule \.jpg$ - [F]
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://example\.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www\.example\.com/.*$ [NC]
RewriteRule \.png$ - [F]
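Repeating the block works, but the same protection can probably be expressed more compactly with one rule that lists the extensions as alternatives. A minimal sketch, using the F flag to return 403 Forbidden for blocked requests:

```apache
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://example\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^http://www\.example\.com/ [NC]
# One rule for all three types; jpe?g also covers .jpeg,
# and NC matches upper-case extensions such as .JPG.
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]
```

Keeping the conditions in one place means a new trusted domain only has to be added once.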
Unfortunately, requests from the internet cannot be trusted. HTTP_REFERER is an optional field, so you might not get it at all. Some browsers allow the content of HTTP_REFERER to be changed, removed or spoofed. Proxy servers may replace the HTTP_REFERER field with the top level domain for the request to mask the location of internal documents. All these considerations mean that checking HTTP_REFERER is not a foolproof solution. Since, for the most part, users don't do any of this, mod_rewrite can reduce the problem dramatically.
Blocking an unwanted robot
The Problem
Good robots will request the file /robots.txt and honour the contents. This is enough to stop Google or Bing from trawling your directories. Some robots ignore the file and trawl directories and subdirectories anyway. For a large directory this can cause a heavy server load to no purpose.
The solution
Create a rule that forbids access to the directory we want to protect. We can't block just the host address because that would deny access to legitimate users. We want to ensure only the robot is affected. We achieve this by also checking the User-Agent HTTP information.
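# Match the robot by its User-Agent string and its source address.
RewriteCond %{HTTP_USER_AGENT} ^NameOfRobot.*
RewriteCond %{REMOTE_ADDR} ^123\.45\.67\.[8-9]$
# Forbid (403) anything under /site/archive/. The optional leading slash
# lets the pattern match in .htaccess as well as in the server config.
RewriteRule ^/?site/archive/.+ - [F]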
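If more than one robot is misbehaving from the same addresses, the OR flag on RewriteCond lets several User-Agent tests share one rule. The sketch below assumes a second robot with the hypothetical name OtherRobot.

```apache
# Either User-Agent may match: OR joins this condition to the next one.
RewriteCond %{HTTP_USER_AGENT} ^NameOfRobot [OR]
RewriteCond %{HTTP_USER_AGENT} ^OtherRobot
# The address test must still pass as well.
RewriteCond %{REMOTE_ADDR} ^123\.45\.67\.[8-9]$
RewriteRule ^/?site/archive/.+ - [F]
```

Conditions without OR are combined with an implicit AND, so this reads as (first robot OR second robot) AND the address range.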
From Static to Dynamic
Description
Suppose we rewrite a static page on our site so that it is created dynamically, but we want to install the new page seamlessly. We want to transform the URL so that the dynamic page is loaded without being noticed by the browser or user.
Solution
We just rewrite the URL to the CGI-script and force the handler to be cgi-script so that it is executed as a CGI program. This way a request to /site/example.html internally leads to the invocation of /site/example.cgi.
RewriteBase /site/
RewriteRule ^example\.html$ example.cgi [H=cgi-script]
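The same idea might be generalised to several pages at once. The sketch below assumes that each retired static page has a CGI script of the same name beside it; that naming scheme is an assumption for illustration, not something your site necessarily follows.

```apache
RewriteBase /site/
# Only rewrite when the requested .html file is no longer on disk.
RewriteCond %{REQUEST_FILENAME} !-f
# Map any name.html to name.cgi and run it as a CGI program.
RewriteRule ^(.+)\.html$ $1.cgi [H=cgi-script]
```

Static pages that still exist are served as before, so pages can be converted one at a time.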
1 See http://www.regular-expressions.info for an excellent resource
2 .htaccess is a text file installed in any folder on a web server. If Apache finds this file it will apply directives it finds to any operations on files in that folder.
Thanks to http://www.apache.org for some of these examples.