© 2000 Anthony Grace
3.1 Overview
CGI is a means by which Web servers interface to other application
programs, thus extending the services provided by the Web server.
By way of CGI scripts, users gain program access to Web servers
and can extend the capability of HTML. It should be noted that
the term Web server does not necessarily mean a physical
piece of hardware; it is a piece of software that can reside locally
or on another remote machine. CGI was one of the first practical
techniques for creating dynamic content and it has made possible
all sorts of new functionality in Web pages. CGI has become a
de facto standard used on most modern Web servers today.
CGI scripts are executable files: programs written ( and, where the language requires it, compiled and linked ) in one of a number of languages. The most common language for a CGI program is Perl, although C, C++, Java and JavaScript are popular as well. Perl's dominance as a CGI development language is being challenged not only by some of these other languages, but also by Web development tools. Some of these tools have built into them all of the capabilities usually provided by CGI programming languages. An example would be HAHTsite IDE.
HTML does not have the facility to directly query a database. With CGI, this capability exists. By using CGI scripts, a request can be sent from within HTML, and processed by the HTTP server, to query the database for some specific information. The result can then be displayed in dynamically-built HTML code. This means that there is no need to manually change a Web page whenever data on that page changes. All that has to be done is to simply place the data in a database, and build a CGI script to access the data and display it dynamically. Whenever a request is made to view the page that contains the CGI script, the Web server initiates a request to the database and formats the most current data into a dynamic Web page.
CGI scripts are frequently combined with Java applets to make the Web a true client / server environment. A CGI script is a server-side process that functions as an interface to other application programs and databases. It can reside on the same machine as the HTTP server or on a remote machine.
On a theoretical level, CGI enables us to extend the capability of our server to parse ( interpret ) input from the browser and return information based on user input. On a practical level, CGI is an interface that enables the programmer to write programs that can easily communicate with the server. These programs will work with all servers that understand the CGI protocol. A typical client / server architecture incorporating CGI is shown below:
Figure 11: A Typical CGI Architecture

CGI programs are the code that accepts a request initiated by an HTTP browser and interpreted by an HTTP server, accepts passed data from the HTTP server, and takes an action predefined and described by the programmer. ( The HTTP protocol will be described later ).
Without CGI, we would have to extend the Web server's capabilities by modifying the server code ourselves. This is not a good idea! Besides the need to have a low-level understanding of network programming, HTTP and TCP/IP, it would also mean editing and recompiling the server source code. Alternatively, it would mean writing a custom server for each task. Besides the fact that this is technically difficult and requires access to source code, it will only work for our specific server. If at some stage the server has to be moved to another platform, it would entail starting all over again or porting the code to the new platform; a time-consuming and undesirable task.
Since CGI is a Common Interface, the programmer is not restricted to any specific computer language or platform, making it portable. Some of the alternatives, such as server APIs, are proprietary in nature and restrict us to certain languages and platforms. To write a CGI script, we can use any language that can do the following:
Print to the standard output ( STDOUT )
Read from the standard input ( STDIN )
Read from the environment variables
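The three requirements above can be illustrated with a minimal sketch in Perl. The sub name run_cgi and the text it echoes are assumptions made for illustration only; the environment variables REQUEST_METHOD and CONTENT_LENGTH are described in later sections.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes the environment ( hash reference ), an input
# handle and an output handle, so that the three CGI channels are explicit.
sub run_cgi {
    my ($env, $in, $out) = @_;

    # 1. Read from the environment variables.
    my $method = $env->{REQUEST_METHOD} || 'GET';
    my $length = $env->{CONTENT_LENGTH} || 0;

    # 2. Read from standard input ( only meaningful for POST requests ).
    my $body = '';
    read($in, $body, $length) if $method eq 'POST' && $length > 0;

    # 3. Print to standard output: headers, a blank line, then the content.
    print {$out} "Content-type: text/plain\n\n";
    print {$out} "Method: $method\n";
    print {$out} "Body: $body\n";
}

# When the script is run by a Web server, the real channels are used.
run_cgi(\%ENV, \*STDIN, \*STDOUT);
```

Passing the handles as arguments is a design choice made here so that the same sub can be exercised from the command line without a server.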
Many of the limitations ascribed to CGI are actually limitations of HTML or HTTP. As the standards for the Web in general evolve, so does the capability of CGI.
3.2 The HTTP Protocol
HTTP is a simple, stateless protocol. When a client such as a browser is asked to fetch an HTTP URL, it opens a connection to the HTTP server, sends the request, receives the reply, and displays the contents of the reply to the user. The protocol consists of two phases, the request and response phases. When the client sends a request, the first thing that it specifies is a method, which indicates to the server what type of action is to be performed. The first line of the request also contains the address ( URL ) of a document and the version of the HTTP protocol it is using:
GET /hello.html HTTP/1.0
GET describes the method used to ask for the document named hello.html using HTTP version 1.0. This request is sent, and then the client can send extra header fields containing information that describe to the server such things as: what type of software the client is running and what content types it can understand. This information may be used by the server in the process of constructing its response. After the client sends the request line, it can send any number of these header fields, which are entirely optional:
User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)
Accept: image/gif, image/jpeg, text/*, */*
The most frequently used field is Accept, which can occur one or more times, and specifies the media ( MIME ) types that the client prefers. MIME was originally invented as a way to describe mail message bodies. It has since become a common way of expressing content type information. This could be an HTML page, graphics or sound. The server stores this information in the environment variable $HTTP_ACCEPT. This can be checked by the CGI program to ensure that it returns a file in a format understood by the browser. The User-Agent header provides information about the client software. Mozilla refers to Netscape. After the headers, the client sends a blank line to indicate the end of the header section. After the request header and a blank line, the client can send data if it has made the request using the POST method ( the POST and GET request methods are discussed later ). If the client sent a GET request, there is nothing more to send, so the client simply waits.
At this point we enter the response phase, where it is the server's turn to respond. The first line of the response is a status line that specifies the version of the HTTP protocol that the server is using, a status code and a description of the status code:
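To make the layout of the request phase concrete, the following sketch splits the text of a request into its request line and header fields. The sub name parse_request and the hash layout it returns are assumptions for illustration; a real server does considerably more validation.

```perl
use strict;
use warnings;

# Hypothetical helper: split the text of a request into the request line
# and the header fields. Returns a hash reference.
sub parse_request {
    my ($text) = @_;

    # The header section ends at the first blank line.
    my ($head) = split(/\r?\n\r?\n/, $text, 2);
    my @lines = split(/\r?\n/, $head);

    # First line: method, URL and protocol version.
    my ($method, $url, $version) = split(' ', shift(@lines));

    # Remaining lines: "Name: value" header fields. A field that occurs
    # more than once ( such as Accept ) is joined with commas.
    my %headers;
    foreach my $line (@lines) {
        my ($name, $value) = split(/:\s*/, $line, 2);
        my $key = lc $name;
        $headers{$key} = exists $headers{$key} ? "$headers{$key}, $value" : $value;
    }

    return { method => $method, url => $url,
             version => $version, headers => \%headers };
}

my $request = parse_request(
    "GET /hello.html HTTP/1.0\n" .
    "User-Agent: Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)\n" .
    "Accept: image/gif, image/jpeg, text/*, */*\n" .
    "\n"
);
print "$request->{method} $request->{url} $request->{version}\n";
print "Accept: $request->{headers}{accept}\n";
```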
HTTP/1.0 200 OK
Another common status code is the all-too-familiar 404 Not Found, which means that the requested document was not there. The following table shows a selection of some of the more common status codes:
Status Code    Message
200            Success
204            No Response
301            Document Moved
401            Unauthorised
404            Not Found
501            Not Implemented
A complete set of status codes can be obtained from:
http://www.w3.org/hypertext/WWW/Protocols/HTTP/HTRESP.html
After the status line, the server sends out response headers that tell the client such things as what software the server is running and the content type of the server's response. They also contain various pieces of information about the document to follow. As with the request headers, much of the information here is optional except for the important Content-type header. This tells the client what type of data it is sending so that the browser can format and display the document. The following code extract in Perl checks to see if the browser accepts JPEG or GIF images ( taken from CGI Programming on the World Wide Web by Shishir Gundavaram ):
#!/usr/bin/perl
# Choose a file format based on the media types the browser accepts.
$gif_image = "logo.gif";
$jpeg_image = "logo.jpeg";
$plain_text = "logo.txt";
$accept_types = $ENV{'HTTP_ACCEPT'};
if ($accept_types =~ m|image/gif|)
{
    $html_document = $gif_image;
}
elsif ($accept_types =~ m|image/jpeg|)
{
    $html_document = $jpeg_image;
}
else
{
    # Fall back to plain text if neither image type is accepted.
    $html_document = $plain_text;
}
To determine which types of images are supported, if any, regular expressions are used on the $accept_types variable. We can then open the file, read it and eventually output the data requested to standard output. A typical set of response headers:
Date: Sunday, 12-Mar-00 11:47:12 GMT
Server: Apache/1.3.6
MIME-version: 1.0
Content-type: text/html
Content-length: 1014
Last-modified: Friday, 10-Mar-00 15:52:21 GMT
The exact length of the data sent, in bytes, is indicated in
Content-length. When a request is made with the POST method, the
length of the request body is likewise stored in the $CONTENT_LENGTH
environment variable; this will be explained later. There are
several other response header fields which are not mentioned here
because it is easy to look them up, and they are mostly optional.
However, some of them can cause a little confusion. For example:
the Content-Encoding and Content-Transfer-Encoding fields. The
former is used to specify optional compression or encryption techniques
that can be applied by the server and will have to be decoded
at the other end. Possible values are x-gzip for files compressed
with the gzip program, and x-compress for files compressed with
the standard compress program. Browsers that have the ability
to handle compressed data let the server know by sending an Accept-Encoding
field in the request header. The Content-Transfer-Encoding notifies
gateways and relays that data passing through may need special
handling. This field is not usually needed since most browsers
communicate directly with the server over a binary TCP/IP connection.
The server sends a blank line after the headers to indicate the end of the header section. If the request sent was successful, then the requested data is sent as part of the response. Otherwise, the response will contain a status code indicating the reason for the failure. The HTTP protocol does not set a limit on the size of the documents transmitted; it can handle anything from a 12-byte Hello World! to a multimegabyte dump of a database.
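The structure of the response phase ( status line, headers, blank line, data ) can be sketched as follows. The sub name build_response is an assumption for illustration. Note also that an ordinary CGI script normally prints only the headers and the data, leaving the status line to the server; the complete response is assembled here purely to show its layout.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble a complete response as one string.
# Assumes $body is a plain byte string, so length() gives its size in bytes.
sub build_response {
    my ($code, $message, $type, $body) = @_;

    my $response = "HTTP/1.0 $code $message\r\n";              # status line
    $response .= "Content-type: $type\r\n";                    # type of the data
    $response .= "Content-length: " . length($body) . "\r\n";  # size of the data
    $response .= "\r\n";                                       # blank line ends the headers
    $response .= $body;                                        # the document itself
    return $response;
}

print build_response(200, 'OK', 'text/plain', 'Hello World!');
```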
3.3 Uniform Resource Locators ( URLs )
The Uniform Resource Locator ( URL ) is an unambiguous, straightforward means of indicating the protocol, host, and location of an Internet resource:
http://www.hostname.domain:8080/directory/file.html
The first part of the URL indicates the communications protocol; in this case, HTTP. A colon is used to separate it from the rest of the URL. The second part of the URL, beginning with a double slash, gives the name of the host machine. The communications port is optionally included. By default the HTTP daemon listens on port 80 ( more of this later ). If the port number is different to this, it should be explicitly included. The hostname could be indicated by the dotted IP address, but it is the convention to use the name of the machine. The remainder of the URL is the path, whose format varies dependent on the protocol being used. It could be a file, or it could be a query to retrieve a document from a database or other program.
URLs can be complete, partial or relative. Complete URLs include all of the constituent components of the URL and will always point the browser to the correct location. In partial URLs, the protocol and hostname components are omitted. When browsers encounter links of this kind, they will interpret the URL relative to the current page.
With relative URLs, the protocol, hostname and part of the path are omitted:
otherfile.html
Everything, including the path itself, is interpreted relative to the current document. The pathnames of relative URLs follow the same conventions as their namesakes in Unix. So, the relative URL directory1/file2.html refers to a document one directory below the current document. On the other hand, ../file2.html tells the browser to go up one level in the directory before looking for the document.
Relative URLs are very useful when constructing new pages or testing. We can construct logically linked sets of documents within a site. This allows the entire set of documents to be moved from one place to another, even to remote sites.
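The way a browser interprets a relative path against the current document can be sketched in a few lines of Perl. The sub name resolve_relative and the example paths are assumptions for illustration; the sketch handles only the path component, and a production script would more likely rely on an established URL library.

```perl
use strict;
use warnings;

# Hypothetical helper: resolve a relative path against the path of the
# current document, following the Unix-style conventions described above.
sub resolve_relative {
    my ($base_path, $relative) = @_;

    # Start from the directory that holds the current document.
    my @segments = split(m{/}, $base_path, -1);
    pop @segments;                          # drop the document's own file name

    foreach my $part (split(m{/}, $relative)) {
        if ($part eq '..') {
            # Go up one level, but never above the root.
            pop @segments if @segments > 1;
        }
        elsif ($part ne '.' && $part ne '') {
            # Go down into a directory, or name the file itself.
            push @segments, $part;
        }
    }
    return join('/', @segments);
}

print resolve_relative('/docs/cgi/index.html', 'directory1/file2.html'), "\n";
print resolve_relative('/docs/cgi/index.html', '../file2.html'), "\n";
```

With the assumed current document /docs/cgi/index.html, the first call yields /docs/cgi/directory1/file2.html and the second yields /docs/file2.html.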
File URLs are the most basic of URLs, and refer to documents on the local machine. The general syntax is:
file://path/to/the/file
With this type of URL, the hostname and port are also omitted ( with the exception of FTP ). The full path to the file of interest is listed using the notation of the particular O/S; slash for Unix, backslash for DOS, etc. Most browsers will translate the Unix path notation into the local style. We should never use file URLs if the documents are going to be served over the Web. File URLs are of most use when testing a set of HTML documents on a local machine. When developing a set of linked pages, the best solution is to use relative URLs. This saves the bother of revising all of the links when the finished documents are finally put in place.
3.4 Extra Path Information
CGI programs have no way to interact with their server during execution, so they cannot receive a path parameter, or even ask the server to map a path to a real file system location. Therefore the server has to have some way to translate the path before calling the CGI program. This is the motivation for passing extra path information ( as part of the URL ). The server needs to know where the program name ends and understand that anything following the program name is extra. The following example shows how to call a script with extra path information:
http://localhost.localdomain/cgi-bin/test.pl/cgi/cgi_doc.txt
The server knows how to pre-translate this extra path and send the translation to the CGI program as an environment variable. In the above example, the server knows that test.pl is the name of the program. It treats /cgi/cgi_doc.txt as extra path information and stores it in the environment variable PATH_INFO. The variable PATH_TRANSLATED is also set, which maps the information stored in PATH_INFO to the document root directory. This provides a convenient means of attaching a path along with a request.
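The split that the server performs can be mimicked with a short sketch. The sub name split_extra_path and the document root /home/httpd/html are assumptions for illustration; in a real request the server sets these variables before the script starts.

```perl
use strict;
use warnings;

# Hypothetical helper: mimic the server's split of a request path into the
# program name and the extra path, then map the extra path to the document root.
sub split_extra_path {
    my ($request_path, $script_name, $document_root) = @_;

    # The request path must begin with the program name.
    return unless index($request_path, $script_name) == 0;

    # The extra path is whatever follows the program name.
    my $path_info = substr($request_path, length($script_name));

    return (
        SCRIPT_NAME     => $script_name,
        PATH_INFO       => $path_info,
        PATH_TRANSLATED => $path_info eq '' ? '' : $document_root . $path_info,
    );
}

my %vars = split_extra_path(
    '/cgi-bin/test.pl/cgi/cgi_doc.txt',   # path taken from the example URL
    '/cgi-bin/test.pl',                   # program name known to the server
    '/home/httpd/html',                   # assumed document root
);
print "$_ = $vars{$_}\n" foreach sort keys %vars;
```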
3.5 The GET and POST methods
The keystone of CGI is the concept of environment variables. These are set each time the browser sends information to the server. Probably the most important of these from a Perl CGI perspective, are those that specify the method used to send the data.
When a user clicks on the submit button on any of the forms developed in Prototype 2, the data entered is sent straight to the server. The data can be sent in one of two ways: GET or POST. The POST method is recommended since, unlike the GET method, it does not set any limits on the quantity of input data. Different environment variables are set, depending on which method is used. Data appended to a link is always sent with GET. The type and configuration of the server may also have an effect. The environment variables are stored in a special %ENV hash ( in Perl an associative array of key / value pairs ) which is set each time that a script is run. For example, we could use a link to input data to a script:
<H1>What time is it ?</H1>
<P>In <A HREF="http://hostname.domain/cgi-bin/getlocaltime.cgi?zone=GMT&place=Waterford">
Waterford</A>
The link has two name / value pairs. Whenever the user clicks on the link, the browser sends the corresponding pair of values to the script for processing. There are no spaces in the URLs. This data will be sent by the GET method and will be appended to the URL as a set of name / value pairs.
Similarly, we could add data to the URL in the ACTION attribute of a FORM and send it by the POST method. When the user clicks on the submit button, both the data appended to the URL and the information the user enters in the FORM will be sent to the server. The only difference this time is that the data does not appear at the end of the script's URL in the Location bar of the browser:
<FORM ACTION="http://hostname.domain/cgi-bin/getlocaltime.cgi?zone=GMT&place=Waterford"
METHOD="POST">
As we have just seen, the parameters to a CGI program are transferred either in the body of the request ( POST ) or in the URL itself ( GET ). In both of the above examples, the zone and place variable names would have been defined in an HTML FORM coupled with the values entered by the user. An ampersand ( & ) is used to separate the variable=value pairs. These pairs are then passed by the server to the CGI program, either through Unix environment variables or via standard input ( STDIN ). If the CGI program is invoked using the GET method, the parameters are expected to be embedded in the URL of the request. The server will then transfer them to the program by assigning them to the $QUERY_STRING environment variable. The CGI program would then parse the FORM data and retrieve the parameters from $QUERY_STRING along with any other environment variables from the %ENV hash. If the CGI program is invoked with the POST method, the parameters are contained in the body of the request and passed to the program by the server as standard input ( STDIN ). The script can use the environment variable $REQUEST_METHOD to determine the request method used, and most scripts would cater for both possibilities. Once we know where to find the data, we can retrieve it. With POST, it will be contained in the standard input and we can use $CONTENT_LENGTH to find its exact length in bytes.
A simplified version ( without the formatting and image tags ) of a FORM developed in the previous section is shown in Figure 12, for illustration purposes:
Figure 12: Example of simple HTML FORM
<html>
<head>
<title>form5</title>
</head>
<body>
<form ACTION="http://localhost/cgi-bin/form1.cgi" METHOD="POST">
Please enter your name and password
Name: <input TYPE="text" NAME="name">
Password: <input TYPE="password" NAME="code" SIZE="12">
<input TYPE="submit" VALUE="Submit">
<input TYPE="reset" VALUE="Reset">
</form>
</body>
</html>
The <INPUT> tag enables us to define image maps, input bars, radio buttons, checkboxes, text areas, etc., as a means of accepting input from the user. These details can be looked up in any good HTML book.
The script below displays the set of environment variables; its output is shown in Figure 13. The results depend on how the script is called ( with POST or GET ):
#!/usr/bin/perl
# Print every environment variable that the server has set for this request.
print "Content-type: text/html\n\n";
print "<HTML><HEAD>
<TITLE>Environment Variables</TITLE>
</HEAD><BODY>";
foreach $env_var (keys %ENV)
{
    print "<BR><FONT COLOR=red>
    $env_var</FONT> is set to
    <FONT COLOR=blue>
    $ENV{$env_var}</FONT>";
}
print "</BODY></HTML>";
Figure 13: Output of env_var Script

3.6 The Decoding Process
Before sending any data input by the user in a FORM to a CGI program, each form element name is equated with the value entered by the user to create a key / value pair. In the transferred data, these key / value pairs are separated, as stated already, by an ampersand ( & ). The name and value in each pair are separated by an equals sign ( = ), and spaces are substituted with plus signs ( + ). When the GET method is used, the data is sent as part of the URL. As a consequence of this, the FORM information cannot include any spaces or other special characters that would not normally be permitted in URLs, such as slashes ( / ). This constraint is also applied to the POST method. This means that the browser has to perform special encoding on user input. Encoding involves the replacement of spaces and other special characters with hexadecimal equivalents. For example, all slashes ( / ) are replaced with %2F ( note the use of % as a special character identifier ). This process is known as URL Encoding.
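The encoding performed by the browser can be sketched in Perl as follows. The sub names url_encode and build_query are assumptions for illustration, the set of characters left unencoded is a simplification, and the sketch assumes plain single-byte text.

```perl
use strict;
use warnings;

# Encode one name or value: keep letters, digits and a few safe characters,
# replace every other character with %XX ( hexadecimal ), and finally turn
# spaces into plus signs.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~ -])/sprintf("%%%02X", ord($1))/eg;
    $text =~ tr/ /+/;
    return $text;
}

# Join the encoded pairs with "=" and "&" to form the transferred string.
# Each argument is a [ name, value ] array reference.
sub build_query {
    my (@pairs) = @_;
    return join('&', map { url_encode($_->[0]) . '=' . url_encode($_->[1]) } @pairs);
}

print build_query([ place => 'New Ross' ], [ path => '/cgi/doc.txt' ]), "\n";
```

For the two assumed pairs above, the resulting string is place=New+Ross&path=%2Fcgi%2Fdoc.txt.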
The general algorithm for decoding user-supplied data is as follows:
(1) Determine the request method by looking at the $REQUEST_METHOD environment variable.
(2) If the GET method has been employed, read the input from the $QUERY_STRING and / or the extra path information from $PATH_INFO.
(3) If the POST method has been employed, determine the size of the request from $CONTENT_LENGTH and then read that amount of data from STDIN.
(4) Split the query on the & character, where the format is:
key1=value1&key2=value2
(5) Decode the hexadecimal and + characters in each key / value pair.
(6) Create an associative array ( hash ) of key / value pairs, where each key serves as an index of the hash. This is straightforward in Perl.
The above algorithm is a useful aid to understanding how everything works. In practice, we could use CGI.pm, which is a Perl module for creating and parsing CGI forms. The reason the request method is checked is that one module can be used to take care of both requests. The forms used in our Web site example use the POST method. However, it is still possible to send the information as a query string by entering it in the location bar of the browser; the script should handle both methods of data transfer. We can even save the complete request in our favourites folder, or as a link to another page.
Using the example in Figure 12, the sequence of events is as follows: the user fills out the form, and the browser encodes the information into a string of key / value pairs. If the request method is POST, the server passes the information as standard input to the CGI script. If the request method is GET, the server stores the information in the environment variable $QUERY_STRING. The following is an example of a simple parsing subroutine in Perl:
sub sub_parse_form
{
    # Collect the raw name=value pairs according to the request method.
    if ($ENV{'REQUEST_METHOD'} eq 'GET')
    {
        @pairs = split(/&/, $ENV{'QUERY_STRING'});
    }
    elsif ($ENV{'REQUEST_METHOD'} eq 'POST')
    {
        read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
        @pairs = split(/&/, $buffer);
        # A POST request may also carry data appended to the URL.
        if ($ENV{'QUERY_STRING'})
        {
            @getpairs = split(/&/, $ENV{'QUERY_STRING'});
            push(@pairs, @getpairs);
        }
    }
    else
    {
        print "Content-type: text/html\n\n";
        print "Use Post or Get";
    }
    foreach $pair (@pairs)
    {
        ($key, $value) = split(/=/, $pair);
        # Decode "+" to a space and %XX to the character it stands for.
        $key =~ tr/+/ /;
        $key =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        $value =~ tr/+/ /;
        $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
        # Remove any embedded HTML comments ( such as SSI directives ).
        $value =~ s/<!--(.|\n)*-->//g;
        # Repeated keys are joined into one comma-separated value.
        if ($formdata{$key})
        {
            $formdata{$key} .= ", $value";
        }
        else
        {
            $formdata{$key} = $value;
        }
    }
}
1; # returns true so that require succeeds when called by the main script
Note: the subroutine file does not need to start with the Perl shebang line. It can be called from any other script with: &sub_parse_form;.
3.7 Server Redirection
In addition to creating virtual documents on the fly, CGI programs can also direct the server to retrieve an existing document and return that instead. This is known as server redirection. We need to send a Location header to the server to tell it what document to send. The server then retrieves this document, giving the impression that the client had not requested our CGI program, but this document instead. This is often used to send a generic thank you reply to the user, after they have filled out a comments form. For example:
#!/usr/bin/perl
print "Location: /thanks.html", "\n\n";
exit (0);
The server will then return the /thanks.html file located in the document root directory. Note that we cannot return any content type headers when using server redirection. It is also possible to return documents located elsewhere on the network via their URL:
print "Location: http://www.other.com", "\n\n";
Figure 14: Server Redirection
Other reasons for using server redirection include catering for a document that has been moved, or for load balancing where one URL can distribute the load to several different machines.
3.8 The Expires and Pragma Headers
Most browsers these days can cache any document that we access. This saves the browser the trouble of having to retrieve the same document over and over, thus saving on resources. This can cause problems with virtual documents created by CGI programs because the browser will look to the cache store rather than the server the next time we request the same document. The problem here is that the client browser is not getting back a real-time virtual document. There is, however, a workaround: we can employ the Expires and / or Pragma headers to instruct the client not to cache the document, whenever we want the server to run the CGI script again:
#!/usr/bin/perl
print "Content-type: text/html", "\n";
print "Pragma: no-cache", "\n\n";
.
Or
#!/usr/bin/perl
print "Content-type: text/html", "\n";
print "Expires: Saturday, 18-Mar-00 18:32:09 GMT", "\n\n";
Note: Before using the above headers, the user should check the current version of the HTTP protocol in use, and any extra features added in HTTP 1.1.
3.9 Server Side Includes
There may be times when it is necessary to output a minimum amount of dynamic information, such as the current date, without having to write a CGI script to do so. This can be achieved using Server Side Includes ( SSI ). Not every server supports SSIs, but there are often workarounds ( usually a script ) for those that do not. Both Apache and Netscape offer this feature. Apache can be easily configured to handle SSI with the configuration applet shown later.
Basically, SSIs are directives which are placed into HTML documents to execute other programs or to output data such as the contents of environment variables. Most hit counters on Web sites today are implemented in this way. Whenever a client requests a document from an SSI-enabled server, the server parses the specified document, which is configured for such a purpose, before returning the evaluated document back to the client. There is a list of SSI directives which, though not included here, can be found in most good CGI textbooks.
There are several disadvantages to using SSI. SSIs were never really standardised, are cumbersome to use, and introduce a performance overhead if the server has to continually parse documents before sending them back to the client. In addition, enabling SSI can pose a security risk. The fact that SSI can be configured to allow novice users to execute arbitrary programs on the server is enough to ruin any system administrator's day! For example, a user could embed directives allowing them to execute system commands that output confidential information. A possible solution is to disable that part of SSI using the Options directive with the IncludesNOEXEC option in the access.conf file. Although SSI can be a powerful tool if used cautiously, everything SSI can do, scripting ( possibly combined with proprietary site development tools ) can do better. Perl and Tcl SSI extensions are now beginning to appear and may provide a better alternative. From a performance point of view, a good idea would be to employ the Apache module mod_include with mod_perl. The topic of embedding Perl directly into the server will be looked at later.
3.10 Hidden Fields
One of the most obvious drawbacks of the HTTP protocol is its inability to maintain state. That is, the protocol does not provide any means of accessing data from previous requests. Whenever a user activates a CGI script by pressing a submit button, the server treats that person as a new individual, even if it is the same person who submitted a form to a CGI script a second or two before. One of the most common applications today which requires that state be recorded is the shopping cart or ordering system on the Web. The user may enter a virtual shopping mall which may be represented by a separate order form for each different department. Rather than ask the customer to fill out their name and address for each order form that they complete, the system needs some way to store the information, or state, so that it can be referred to at a later stage. There are two common methods of maintaining state, or information, about visitors to a site: hidden fields and cookies ( which will be covered next ).
Hidden fields form part of the HTML code, and as such are visible to anyone looking at the source code for our page. Hidden fields are usually generated by the same CGI script which processes the form whose data we wish to store. Basically, we will use hidden fields to store information gathered from an earlier form so that it can be combined with the data the user is entering in the present form; we write a script to collect the data from the first form, and then generate the hidden fields which will contain this information in the second form. Finally, when it comes time to process, or parse, the data in the second form, the data from the first form will also be processed. Processing hidden fields is done in exactly the same way as processing any other data collected via a form. The basic idea is that the script which processes the initial data to be stored should also generate the hidden fields in the next form. The syntax to add hidden fields to a form looks like:
foreach $key ( keys %formdata )
{
    # Write one hidden field for each value gathered from the earlier form.
    print "<INPUT TYPE=\"hidden\"
    NAME=\"$key\"
    VALUE=\"$formdata{$key}\">\n";
}
This code would usually be found in a script which would first parse the data from an initial form and then generate a new form into which it would store this data in hidden fields. A very important point here is that this data is not persistent. When the user leaves the series of interconnected pages in our site that store and generate the hidden fields, the connection between the user and said data is lost. The back and forward buttons on most browsers will retrieve a cached page but will not reinvoke the script. For example, if a user jumps to an external page and later returns via a hypertext link or bookmark, all their data will be lost because the state information is stored in the form, not in the URL. The main advantages of hidden fields are ease of implementation and minimal server requirements; they are more suitable for simple applications or for passing short-term data between pages.
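One practical detail when generating hidden fields is that a stored value may itself contain characters such as quotation marks or angle brackets, which would break the generated tag. The following sketch shows one way of guarding against this. The sub names escape_html and hidden_fields are assumptions for illustration and are not part of the scripts developed earlier.

```perl
use strict;
use warnings;

# Escape the characters that have a special meaning inside HTML attribute values.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must come first, so that later entities survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: return one hidden <INPUT> tag per stored form value.
sub hidden_fields {
    my (%formdata) = @_;
    my $html = '';
    foreach my $key (sort keys %formdata) {
        $html .= sprintf(qq{<INPUT TYPE="hidden" NAME="%s" VALUE="%s">\n},
                         escape_html($key), escape_html($formdata{$key}));
    }
    return $html;
}

print hidden_fields(name => 'Ann "Annie" Byrne', town => 'Waterford');
```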
3.11 Persistent Cookies
This method has been adopted by the Internet Engineering Task Force ( IETF ) as a draft specification for maintaining HTTP state information. Cookies were first introduced in Netscape Navigator and have since been adopted by most popular browsers, including Internet Explorer. Netscape's Cookie Specification can be seen at:
http://home.netscape.com/newsref/std/cookie_spec.html
A cookie is a bit of information sent by a Web server to a browser that can be read back later from that browser. It provides a means of marking each visitor to our site, and then looking at that mark when they return. Basically, cookies work by passing data in the HTTP request and response headers. They are relatively safe in that they are text files rather than executables, and therefore cannot contain unwelcome viruses. Some users feel uncomfortable with the notion of being tracked, and have the option of rejecting some or all cookies that may come their way. CGI scripts can be used to generate cookies and embed one or more of them into the HTTP header. When the browser next requests a page from the CGI URL, it passes back any previous cookies it received from that server using the $HTTP_COOKIE environment variable. These are read by the script just like any other CGI parameters. A typical script would look like:
#!/usr/local/perl
require "subparseform.lib";
&sub_parse_form;
# Send one cookie together with its optional attributes.
print "Set-Cookie: $key=$value; expires=Fri, 31-Mar-00 15:49:30 GMT; path=/; domain=wit.ie; secure\n\n";
The Set-Cookie header sets one cookie on the client side, where a key is equal to a value. The expires attribute determines how long the cookie will remain in the cookie file on the users computer. By default, cookies are temporarily stored in the browsers RAM and are lost when the user quits out of the browser. In order to access the information at a later date, it is necessary to add an expiration date to the cookie. The path attribute a particular subset of URLs on our site for which the cookie is valid. In the above example, the cookie is valid for the entire root hierarchy. The domain attribute is set by default to the same domain that sent the cookie. There is only a need for this attribute when our domain is subdivided into smaller subdomains, and we wish to specify which subdomains have access to the cookie. The secure attribute can be used if we are saving sensitive data, or an identification code that leads to sensitive data.
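Reading cookies back is the mirror image of setting them. The sketch below splits the string that the server places in the HTTP_COOKIE environment variable into a hash. The sub name parse_cookies is an assumption for illustration, and the sketch assumes the usual name1=value1; name2=value2 layout.

```perl
use strict;
use warnings;

# Hypothetical helper: split the cookie string found in HTTP_COOKIE
# ( "name1=value1; name2=value2" ) into a hash of names and values.
sub parse_cookies {
    my ($cookie_string) = @_;
    my %cookies;
    return %cookies unless defined $cookie_string;

    foreach my $pair (split(/;\s*/, $cookie_string)) {
        # Split on the first "=" only, since a value may itself contain "=".
        my ($name, $value) = split(/=/, $pair, 2);
        next unless defined $name && length $name;
        $cookies{$name} = defined $value ? $value : '';
    }
    return %cookies;
}

my %cookies = parse_cookies($ENV{HTTP_COOKIE});
print "$_ = $cookies{$_}\n" foreach sort keys %cookies;
```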
Cookies are as easy to implement as hidden fields and provide superior security since cookies are only sent to URLs which match the cookies criteria. This reduces the chances of spoofing and the inadvertent transmission of sensitive data by the server. Persistence is improved over the hidden field method because the cookies are stored in a cookie file on the client machine. These will survive any amount of Web surfing by the user, and even the shutting down of the client.
We can send up to 20 cookies with a maximum size of 4Kb per cookie. Each visitor can store up to 300 cookies from all of the sites they have visited. By means of some clever scripting, these cookies can be combined in order to keep the count down.
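The cookie round trip described above can also be sketched in Python rather than Perl, using the standard library's http.cookies module. This is only an illustration under stated assumptions: the function names build_set_cookie and read_cookies and the cookie name visitor are hypothetical, and the environ argument stands in for the CGI environment in which the server places HTTP_COOKIE.

```python
from http.cookies import SimpleCookie


def build_set_cookie(key, value, path="/", domain=None, expires=None, secure=False):
    """Return one 'Set-Cookie: ...' header line for a CGI response."""
    cookie = SimpleCookie()
    cookie[key] = value
    cookie[key]["path"] = path
    if domain:
        cookie[key]["domain"] = domain
    if expires:
        # Without an expiry date the browser drops the cookie when it exits.
        cookie[key]["expires"] = expires
    if secure:
        cookie[key]["secure"] = True
    return cookie.output()


def read_cookies(environ):
    """Parse the cookies the browser sent back, as found in HTTP_COOKIE."""
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    return {name: morsel.value for name, morsel in cookie.items()}


if __name__ == "__main__":
    # A CGI response: headers first, then a blank line, then the page.
    print(build_set_cookie("visitor", "abc123", domain="wit.ie", secure=True))
    print("Content-type: text/html")
    print()
    print("<html><body>Cookie sent.</body></html>")
```

On a later request a script would call read_cookies(os.environ) to recover the stored values, in the same way that it reads its other CGI parameters.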
3.12 The Apache Web Server
Definition: A Web server is software that listens on a TCP/IP port (or ports ), for an HTTP request and responds by returning an HTML page. The default Web server port is 80.
Apache is based on the public domain HTTP daemon developed at the National Center for Supercomputing Applications ( NCSA ), and comes bundled with Red Hat. The first official release was based on NCSA httpd 1.3 with numerous bug fixes and enhancements. The name is commonly explained as a reference to the many patches that were applied to the NCSA server, hence "a patchy server"; on that account it has nothing to do with the Native American Apache people.
Apache is by far the most popular Web server used on the Internet today. Despite the fact that it is free, it compares well with any commercial Web server on the market. It does not ship with a visual configuration interface, and this has usually meant editing the configuration files manually. However, with Red Hat Linux 6 most of the Apache configuration tasks can be accomplished visually with the Linux Configuration applet.
If the Red Hat Linux 6 Server installation was chosen, then Apache is probably already installed. It is easy to confirm on the command line:
# ps -ax
This gives a list of currently running processes, in which httpd should appear. Or:
# rpm -q apache
This shows whether the Apache package is installed. Alternatively, we could use the following to look for the httpd executable:
# find / -name httpd -print
This document does not deal with installation and compilation specifics. It does attempt to look at some of the basic configuration issues. Further details can be obtained from the Apache website: http://www.apache.org . The Red Hat Linux 6 Apache package is in binary form, meaning that it has already been compiled. If there is a need for a more recent version of Apache, or to use any modules not included with the Red Hat package, it will mean recompiling Apache.
The Apache HTTP Web server is actually a daemon ( httpd ). A daemon is a program that is in continuous execution on a Unix / Linux system. Usually, it is waiting for some event to occur - such as a file to appear in a particular directory or a connection to arrive on some port - before taking some particular action. There are several daemons providing network services - such as ftpd and httpd - and each watches one or more well-known ports and provides a well-known service. Apache can run either under the inetd supervisor daemon or as a standalone daemon ( the usual default ). Inetd is the Internet superserver. It listens to a set of ports and, when a connection or datagram arrives, it starts the program that handles input from that port and hands over control of the port. When the program finishes its work and exits, inetd resumes control. This avoids cluttering up memory with dozens of daemons, each listening on its own port; inetd provides a single-daemon solution. The downside is that a new server process must be started for every connection, so response time is slower, and for this reason inetd is not commonly used for Web servers. Most servers can be run as standalone. However, it is not an either / or proposition; we can run inetd to handle some ports and allow a standalone daemon to handle a port with heavy traffic. The needs of a server may change over time, and the configuration details can be adapted accordingly.
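To make the idea of a standalone daemon concrete, here is a minimal sketch in Python of a program that listens on a TCP port and answers each HTTP request with an HTML page. It is not how Apache is implemented; it uses Python's built-in http.server module. The names HelloHandler, start_server and fetch are hypothetical, and port 0 asks the operating system for any free port, which avoids needing root to bind port 80.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Answers every GET request with one small HTML page."""

    def do_GET(self):
        body = b"<html><body><h1>It works</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real daemon would write an access log


def start_server(port=0):
    """Start listening in a background thread and return the server object."""
    server = HTTPServer(("127.0.0.1", port), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(server):
    """Act as a browser: request / from the server and return status and body."""
    host, port = server.server_address
    connection = http.client.HTTPConnection(host, port, timeout=5)
    connection.request("GET", "/")
    response = connection.getresponse()
    body = response.read()
    connection.close()
    return response.status, body


if __name__ == "__main__":
    server = start_server()
    print(fetch(server))
    server.shutdown()
    server.server_close()
```

The loop inside serve_forever is the "continuous execution" part: the process stays resident and waits for the next connection instead of being started afresh for each one, as it would be under inetd.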
3.12.1 Configuring the Apache Web Server
With Apache installed, we can easily test whether it is working properly by checking to see if the default start page appears; simply enter the URL http://localhost/ and the default page should appear. There are links from this page to view the documentation for Apache. The default start page is the file /home/httpd/html/index.html. In order to get going with a quick Apache Web site, it is possible to defer the configuration work for a time and simply replace index.html with our own home page. From there, more HTML pages, graphics and links to other pages can be added.
The configuration of the Apache server is managed by three configuration files: httpd.conf, srm.conf and access.conf. These files are to be found in the conf directory. To help in getting started, three similar files named srm.conf-dist, access.conf-dist and httpd.conf-dist are also located in the conf directory. Copy or rename these files without the -dist suffix, and then edit them. They are well commented and should be read carefully. Start with the httpd.conf file; this file sets up general attributes of the server, such as the port number and the user it runs as. ServerAdmin must be set to the e-mail address of the administrator; this address is shown to users in server error messages so that problems can be reported. The User option only applies when Apache runs in standalone mode. The default is to run as nobody, so that Apache has no permission to change anything on the system. This minimises the damage that could be done to our system should an attacker compromise Apache. Next edit the srm.conf file; this sets up the root of the document tree, among other attributes. The access.conf file sets permissions for how files can be accessed. It also includes the option to allow Apache to execute CGI programs. With access.conf it is possible, for example, to exclude certain addresses from accessing the server or certain directories on the server.
By far the easiest means of configuring Apache using Red Hat is to use the Configuration applet ( see Figure 14 ). To set the Apache defaults:
(1) Log on as root.
(2) Open the Linux Configuration applet and select Defaults from the Apache Web Server section. The Apache Defaults tab will open.
(3) Enter a name for the server. The default is localhost.
(4) If desired, change the default Apache document root - this is the base directory for storing the site's HTML files - by entering a new document root. If left blank, the documents in our site will be rooted at the default directory, /home/httpd/html.
(5) For the Script Alias entry, we indicate what the scripts will be invoked as, and the physical path translation. For example:
/cgi-bin/ /home/httpd/cgi-bin/
(6) Scroll down the list until the Features check boxes come into view. Select Host Name Lookups to have the hostname of browsers, rather than their IP addresses, appear in the Apache logs.
(7) If required, select May Execute CGI.
(8) Click Accept and, when prompted, activate the changes.
Figure 14: The Red Hat Configuration applet

We are now ready to run httpd, allowing our machine to service HTTP URLs.
3.12.2 Starting httpd
In order to run httpd in standalone mode, run the command:
httpd -f configuration-file
where configuration-file is the pathname of httpd.conf. For example,
/etc/httpd/httpd -f /etc/httpd/conf/httpd.conf
will start up httpd with configuration files in /etc/httpd/conf. Any errors at this stage will show in the error logs. Note that httpd must be started as root in order to use port numbers below 1024, which includes the default port 80. Once httpd has been shown to start correctly, it can be started automatically at boot time by including the appropriate httpd command line in one of our system files - such as /etc/rc.d/rc.local.
3.13 Security
CGI scripts are particularly dangerous Web application components in terms of security, as they operate with relatively open access to the server machine on which they execute; this is limited on Unix machines by account permissions of the Web server process. This does not pose too much of a problem for a single programmer developing for a single Web server. However, CGI programs can be a security nightmare in a corporate or college environment. More often than not, the problem is not one of protecting the server from malicious CGI programmers but protecting it from careless CGI programmers. Many common CGI programming errors can inadvertently lead to malicious clients gaining unauthorised access to the server machine. Refer to:
http://www.w3.org/Security/Faq/www-security-faq.html
Regardless of the quality of the CGI code, there is always the possibility that an attacker will try to subvert the CGI program. For example, we could have an HTML FORM which provides an area for users to enter text. Depending on how the CGI script is coded, if the user were to enter a filename or a shell command instead of the expected text, and the script passed that input unchecked to the file system or to a shell, havoc could ensue on our Web server.
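One common defence against this kind of subversion is to accept only input that matches a strict pattern of allowed characters, and to reject everything else before it reaches the file system or a shell. The following is a minimal sketch in Python, not taken from the project's own scripts; the function names, the 32-character limit and the directory /home/httpd/data are all hypothetical choices made for illustration.

```python
import re

# Hypothetical policy: a user name is 1 to 32 letters, digits, underscores or hyphens.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]{1,32}")


def clean_username(raw):
    """Return the value unchanged if it matches the whitelist, else raise ValueError."""
    if not SAFE_NAME.fullmatch(raw):
        raise ValueError("rejected form input: %r" % raw)
    return raw


def guestbook_path(raw_name, base_dir="/home/httpd/data"):
    """Build a file path from form input only after the input has been validated."""
    return "%s/%s.txt" % (base_dir, clean_username(raw_name))


if __name__ == "__main__":
    print(guestbook_path("anthony"))
    for attempt in ["../../etc/passwd", "name; rm -rf /"]:
        try:
            guestbook_path(attempt)
        except ValueError as error:
            print(error)
```

Listing what is allowed, rather than trying to list every dangerous character, means that an unforeseen trick is rejected by default.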
Another major problem with CGI scripts and particularly the libraries which they use, is that they often contain unwanted or questionable code. There is a profusion of scripts on offer on the Internet. Since many developers download and use libraries and routines in scripts that they are developing, the question of provenance is of primary concern. One single line of malicious code in a CGI script which is thousands of lines long could cause untold damage. Consider the following infamous line of mischief:
rm -rf
The effect of the above single line of code would be to delete the files from the server's hard disk. This could easily be inserted into a large script and go unnoticed and, as such, is an important issue to consider when downloading code from the Internet. The lesson to be learned here is that all downloaded code should be thoroughly reviewed before being placed in the cgi-bin directory. In addition, one should never download and execute object code, only code that can be read and checked for improprieties beforehand. This takes on an even higher level of urgency in the case of an ISP considering the granting of scripting privileges to subscribers, or a college department considering similar cgi-bin access for students.
3.13.1 Securing the Server
The first and most fundamental task when setting up a Web server and / or site involves the correct setting of file permissions for the server and document roots. Every file and directory on a Unix server has a set of permissions that determine who can use that file or directory, and exactly what they can do with it. The correct permissions must be set so that the scripts will work properly and remain secure from hackers.
The meaning of permissions varies slightly for files and directories. A read-protected file cannot be read or opened. A read-protected directory cannot have its contents listed. Write-protecting a file means that we cannot write to ( i.e. change ) it. Write-protecting a directory means that we cannot create, move, rename or delete its contents. A CGI script file without execute permissions cannot be run. A directory without execute permissions means that we cannot read, write or execute the files that it contains.
A strict "need to know" policy needs to be adopted for both the document root ( where the HTML files are stored ) and the server root ( where the log and configuration files are stored ). It is particularly important to get the permissions right in the server root, as this is where CGI scripts, access logs and configuration files are stored. Usually this involves protecting the server from both local and remote users by creating a www user for the Web administrator and a www group for users on the system who need to be able to author HTML documents. We edit the /etc/passwd file to make the server root the home directory for the www user, and the /etc/group file to add any authors to the www group.
The server root is set up such that only the www user can write to the configuration and log directories. This makes sense, as it is usually the system administrator who assumes responsibility for such issues. Under no circumstances should they be world-readable. The cgi-bin directory and its contents should be world-executable and world-readable, but not world-writable. That is, they have a file permission setting of 755. Local HTML Web authors may be granted write access on a discretionary basis. The following is a sample screenshot of the server root for this project. Note the root ownership for user and group; this type of laxity is allowable on a stand-alone machine. On a network machine, a more disciplined approach to ownership would need to be adopted:
Figure 15: File permissions on project server

In the above example there was no need to add other authors to the group with write permission for the HTML directory; there was only one author - the owner, root. All files that we want to serve on the Internet must be readable by the server while it is running under the permissions of user nobody ( see the configuration applet in Fig. 14 ).
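The permission settings discussed above can be applied and checked from a script as well as from the command line. The following Python sketch is illustrative only: the function names are hypothetical, and the paths in the usage note are the Red Hat defaults mentioned earlier. It sets a CGI script to mode 755 and lists anything under a directory that every user on the system may modify.

```python
import os
import stat


def describe(path):
    """Return the permission string that ls -l would show, e.g. '-rwxr-xr-x'."""
    return stat.filemode(os.stat(path).st_mode)


def make_cgi_executable(path):
    """Apply mode 755: the owner may write; everyone may read and execute."""
    os.chmod(path, 0o755)


def find_world_writable(root):
    """List files and directories under root that any user may modify."""
    risky = []
    for folder, subdirs, files in os.walk(root):
        for name in subdirs + files:
            full = os.path.join(folder, name)
            # S_IWOTH is the write bit for "other", i.e. for everybody.
            if os.stat(full).st_mode & stat.S_IWOTH:
                risky.append(full)
    return sorted(risky)
```

For example, find_world_writable("/home/httpd") should return an empty list on a correctly configured server; any path it does report deserves a closer look.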
Some general security tips:
Security tools are freely available and should be used where suitable: http://www.cern.org/
Make sure that the /etc/passwd and /etc/group files are owned by root and are not world-writable.
Make sure that the root password is kept secure and changed regularly.
When writing Perl scripts, make use of the -T flag, which turns on taint checking so that Perl refuses to use unchecked outside data in risky operations, as described in the perlsec manpage.
Investigate any unusual activity by inspecting the logs on a daily basis.
Require user passwords.
Make sure that each user has a logon ID and that logon IDs are not shared.
3.14 Summary of CGI Life Cycle
The ability of CGI programs to create dynamic Web content is actually a side effect of CGI's intended purpose: to define a standard method for an information server to talk with external programs. This original intention goes some way towards explaining why CGI has one of the worst life cycles imaginable. Whenever a server receives a request that involves accessing a CGI program, the server has to create a new process to execute the CGI program and pass to it, via environment variables and standard input, all the parameters necessary to generate a response. This is fine when one page is requested every minute, but what can be done when one request arrives every second? To make matters worse, most Web applications today need some kind of database access. This means that a new database connection must be made every single time the CGI program runs, which can take up to several seconds each time.
Creating a process for every such request requires time and significant server resources, which obviously limits the number of requests that a server can handle concurrently. On Unix systems, this involves a fork() call followed by an exec of the CGI program: the new process is provided with a set of environment variables that describe the HTTP request and that contain information related to the Web server. The FORM data is passed to the process via the $QUERY_STRING environment variable ( for GET requests ) or by piping the data to the program's standard input ( for POST requests ). The Web server also has to redirect the CGI program's standard output and standard error. When the CGI program completes, the server has to parse the program's standard output and return this parsed data to the browser as a response. The CGI program's STDERR is usually written to the server's log file. Thus, the CGI protocol requires the server to perform a considerable amount of processing each time that a CGI program is requested. A new process has to be created, pipes to accommodate the CGI program's input and output have to be created and opened, and the executable file associated with the program must be loaded and run.
Figure 16: The CGI Life Cycle
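The life cycle described above can be imitated in a few lines of Python. This is a simplified sketch of the server's side of the exchange, not the code of any real Web server: run_cgi is a hypothetical function, the CGI program is a small inline stand-in, and only a handful of the standard CGI environment variables are set.

```python
import os
import subprocess
import sys

# A stand-in CGI program, kept inline so that the sketch is self-contained.
CGI_PROGRAM = r'''
import os, sys
body = sys.stdin.read(int(os.environ.get("CONTENT_LENGTH") or 0))
print("Content-type: text/plain")
print()
print("query=" + os.environ.get("QUERY_STRING", ""))
print("body=" + body)
'''


def run_cgi(method, query_string="", post_data=""):
    """Handle one request the CGI way: a brand-new process every time."""
    env = dict(os.environ)
    env.update({
        "GATEWAY_INTERFACE": "CGI/1.1",
        "REQUEST_METHOD": method,
        "QUERY_STRING": query_string,           # GET data travels here
        "CONTENT_LENGTH": str(len(post_data)),  # POST data travels on stdin
    })
    result = subprocess.run(
        [sys.executable, "-c", CGI_PROGRAM],
        input=post_data, env=env, capture_output=True, text=True,
    )
    # The server parses the program's output: headers, a blank line, the body.
    head, _, body = result.stdout.partition("\n\n")
    headers = dict(line.split(": ", 1) for line in head.splitlines())
    # Anything the program wrote to standard error would go to the error log.
    return headers, body, result.stderr


if __name__ == "__main__":
    print(run_cgi("GET", query_string="name=Ann"))
    print(run_cgi("POST", post_data="email=a%40wit.ie"))
```

Every call to run_cgi starts a fresh interpreter, which is exactly the per-request cost that the rest of this section is concerned with.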

Although CGI programs can be written in a wide range of different programming languages, Perl has become the de facto standard. While writing a script in Perl may make it platform independent to some degree, it also requires that a Perl interpreter be started for each request, which takes up even more time and resources.
One other often-overlooked problem with CGI is that CGI programs cannot interact with the server, or take advantage of the server's abilities, once they begin execution, because they are running in separate processes. Also, there is no standard way for CGI programs to communicate with one another. In order to make any improvement to CGI, a more flexible and multi-threaded approach is needed.
3.15 Other solutions
FastCGI
FastCGI is a type of gateway for Web servers that improves performance by loading scripts as persistently running processes. Open Market's FastCGI protocol addresses many of the shortcomings of CGI. Its approach is to implement a software layer that receives control from a Web server's API and invokes scripts or programs that conform to the FastCGI specification.
FastCGI works in a similar manner to CGI, except that it creates a single persistent process for each FastCGI program, as shown in Fig. 17. Processes are not created and destroyed for each request received; instead, FastCGI programs process a request, then wait to receive the next request. This means that it is no longer necessary to create a new process for each request, thus eliminating some of CGI's overhead. In addition, there is a performance boost because data can be kept in memory between requests; with CGI, depending on the functionality of the program in question, the same data might have to be read from disk again on every request. On the other hand, if each process is executing a Perl interpreter, this architecture does not scale very well.
Figure 17: The FastCGI Life Cycle
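The difference between the two life cycles can be shown with a small simulation in Python. This is not the FastCGI wire protocol and does not use any FastCGI library; FakeDatabase is a hypothetical stand-in for an expensive resource such as a database connection, and the two functions simply contrast "set up on every request" with "set up once, then loop".

```python
class FakeDatabase:
    """Stand-in for a slow database connection; counts how often it is opened."""

    connections_opened = 0

    def __init__(self):
        FakeDatabase.connections_opened += 1

    def lookup(self, user):
        return "record for " + user


def handle_cgi_style(requests):
    """CGI model: every request starts from scratch and reconnects."""
    return [FakeDatabase().lookup(user) for user in requests]


def handle_persistent(requests):
    """FastCGI model: set up once, then loop over requests as they arrive."""
    database = FakeDatabase()  # the expensive setup happens only once
    responses = []
    for user in requests:      # the process stays alive between requests
        responses.append(database.lookup(user))
    return responses


if __name__ == "__main__":
    users = ["ann", "bob", "cat"]
    handle_cgi_style(users)
    print("CGI-style connections:", FakeDatabase.connections_opened)
    FakeDatabase.connections_opened = 0
    handle_persistent(users)
    print("Persistent connections:", FakeDatabase.connections_opened)
```

Both functions return the same responses; only the number of times the expensive setup runs differs, and that difference grows with the number of requests.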

Since FastCGI is an open protocol, we can implement FastCGI applications without depending on software from Open Market. Open Market does, however, provide a convenient set of libraries for C, Perl, Tcl and Java, to aid in program development. Also, FastCGI can distribute its processes across multiple servers. Migrating existing CGI programs to FastCGI is not always straightforward, particularly for larger applications; for example, applications designed to terminate after each request will sometimes have memory leaks that would not be harmful in a conventional CGI environment but which accumulate in a persistent process. In addition, migration can be complicated by the need to recompile all code that uses the stdio library with the FastCGI library. Another problem with the FastCGI specification is that it does not facilitate easy interaction between FastCGI programs and the server.
mod_perl
This is another option for improving performance when using the Apache Web server; mod_perl is a module for the Apache server that embeds a copy of the Perl interpreter into the Apache httpd executable, giving complete access to Perl functionality within Apache. One of the problems with CGI is that each time a script is executed, it requires forking extra processes. For a high-traffic site, this can seriously impede server performance. The effect of embedding the Perl interpreter directly into the Web server is that our scripts are precompiled by the server and executed without forking. The source for both Apache and mod_perl can be obtained from:
http://www.apache.org/
mod_perl is not a Perl module; it is a module of the Apache server, and as such it can be considered a deeply integrated Apache / Perl hybrid. Embedding the Perl interpreter into the httpd executable enables a set of mod_perl configuration directives, most of which can be used to assign Perl code to the various stages of a request. In addition, we can embed Perl code into the configuration files using <Perl> </Perl> directives. We can also use Perl for server-side includes. Obviously, embedding one large program into another large program leads to a very large program indeed! Apache, like most Unix Web servers, uses the flock-of-daemons approach to scalability, and the added bulk of mod_perl is multiplied accordingly across every server process. The typical solution to such a problem is to throw lots of memory at it. However, if memory capacity is limited, mod_perl may not be the way to go.
Server Extension APIs
Invoking CGI applications can be slow, as we have seen. To solve this and to give developers access to server internals, Netscape and Microsoft created server-specific APIs. Netscape provides an internal API called NSAPI and Microsoft provides ISAPI. Either of these APIs can be used to write server extensions that can extend the functionality of the server allowing it to handle tasks which were once the domain of external CGI programs. As can be seen in the following diagram, server extensions are located within the main process of a Web server:
Figure 18: The Server Extension Life Cycle

Server-specific APIs usually use linked C or C++ code, making them very fast, and as a result they make full and efficient use of the server's resources. However, they are not easy to maintain or develop, and they frequently pose security and reliability problems. Also, when a server extension crashes, it can bring down the entire server. Still, the main drawback with server extensions is their proprietary nature; they are usually tied to the server API for which they were written, as well as to a particular operating system.
Active Server Pages
Active Server Pages ( ASP ) is a proprietary technique developed by Microsoft for creating dynamic Web content. Using ASP, we can create an HTML page with snippets of embedded code - usually VBScript or JScript, although other scripting languages can be used. Prior to sending the page to the client browser, this code is read and executed by the Web server. ASP has gained a fair amount of popularity among companies where the Microsoft suite of products and O/S are already in place.
3.16 Project Roadmap
To develop the functionality of our Web site further might involve writing a script to take all of the parsed data, open a database connection and update the database accordingly. Another script might e-mail the user with confirmation of registration or an error message, depending on whether the user entered the correct data. Yet another simple script could verify the user's e-mail address and password as an added security feature. Since the primary aim of this document is to serve as a tutorial / guide highlighting the architectural options open to a Web site developer, this document will now continue along that path. We have already determined to seek a more secure, multithreaded approach to Web site construction than that offered by the CGI protocol. With that in mind, the next section presents a more elegant and efficient alternative: Servlets.