Infrastructure Perspective: The HTTP Transaction

Introduction

In this infrastructure perspective, we will look at the process of handling the production of a page from the perspective of the HTTP protocol. Essentially, we think of the web site as one huge function, and HTTP as a mechanism for calling that function with various arguments.

This path does not delve too deeply into exactly how a transaction is handled. Instead, it focuses on the kinds of data flowing into and out of the web site. This is important information, as any data flowing into the web site should be regarded as untrustworthy (having come in over the network).

The HTTP Request

An HTTP request is made up of several important parts, as color-coded here:

POST /info/services/website HTTP/1.0
Connection: Keep-Alive
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.2-2 i586; Nav)
Host: www.cs.uchicago.edu
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
Cookie: cs_mode=edit; cs_login=bla2blahh15861cablah57dblah43blah
Content-type: multipart/form-data;
              boundary=---------------------------62993215913904462331845589843
Content-Length: 296
 
-----------------------------62993215913904462331845589843
Content-Disposition: form-data; name="userid"
 
dustin
-----------------------------62993215913904462331845589843
Content-Disposition: form-data; name="passwd"
 
foobar
-----------------------------62993215913904462331845589843--

The portion in red, the first word of the first line (POST in this example), is the method. This is generally either GET or POST, and this site uses the two literally: GET gets a page from the server, while POST sends new content to the server.

The next, orange portion (/info/services/website in this example) is the path, sometimes also known as the URI. It specifies the document the browser would like to access. It is the part of the original URL which appears after the server's name. So the example request above corresponds to http://www.cs.uchicago.edu/info/services/website.

The third, green portion, running from the Connection line through the Content-Length line, contains the request headers. These contain all sorts of extra data about the request. The most important headers above are Host, Cookie, and Content-type. The Host header corresponds to the server name in the URL. Apache uses the value in this header to determine exactly which server the browser thinks it's talking to, allowing a single Apache process to represent many server names, or virtual hosts. The Cookie header contains information from previous requests that the server has asked the browser to remember; the site uses this information to identify the current user and what mode she's in. Finally, the Content-type header identifies the format of the entity-body.

The fourth, blue section, everything after the blank line that ends the headers, is the entity-body itself. GET requests don't have entity-bodies, but the entity-body of a POST request like this one can be in one of two formats (differentiated by the Content-type header). The default format is application/x-www-form-urlencoded; the format shown here is multipart/form-data, which is required for file uploads and other complex forms.
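The four parts described above can be pulled apart with a few lines of Python. This is only a minimal sketch, not the site's actual code: the function name parse_request is made up for illustration, and the sample request is a shortened, urlencoded variant of the one shown above.

```python
def parse_request(raw):
    """Split a raw HTTP request (bytes) into method, path, version, headers, body."""
    # The headers end at the first blank line; everything after is the entity-body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, path, version = lines[0].split(" ", 2)

    headers = {}
    last = None
    for line in lines[1:]:
        if line[:1] in (" ", "\t") and last is not None:
            # A line starting with whitespace continues the previous header,
            # like the boundary= line of the Content-type header above.
            headers[last] += " " + line.strip()
            continue
        name, _, value = line.partition(":")
        last = name.strip().lower()  # header names are case-insensitive
        headers[last] = value.strip()
    return method, path, version, headers, body


# Hypothetical sample request, shortened from the example above.
raw = (
    b"POST /info/services/website HTTP/1.0\r\n"
    b"Host: www.cs.uchicago.edu\r\n"
    b"Cookie: cs_mode=edit\r\n"
    b"Content-Type: application/x-www-form-urlencoded\r\n"
    b"Content-Length: 27\r\n"
    b"\r\n"
    b"userid=dustin&passwd=foobar"
)
method, path, version, headers, body = parse_request(raw)
```

Every value this returns came in over the network, so it should all be treated as untrustworthy until it has been checked.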

Breaking Down the URL

So the Python infrastructure (in particular, the translate handler, if you're interested) breaks down the URL from the HTTP request into a number of even smaller fragments.

The URL is broken down as shown in Figure 1. The scheme and server portions are self-explanatory. The script_path portion of the URL is the longest section of the path which corresponds to a Python script under docs/. So, for example, if there are Python scripts docs/path.py, docs/path/to.py, and docs/path/to/script/index.py, but not docs/path/to/script/arg1.py or anything longer, the Python infrastructure would break down the URL as shown in Figure 1.

http    ://  www.cs.uchicago.edu  /path/to/script  /arg1/arg2/arg3  ?foo=bar
scheme       server               script_path      args_path        internal or special
Figure 1: Breakdown of example URL.

Anything left between the end of the script_path and the end of the URL or a ? character is the args_path, and tells the Python script at script_path specifically what the user wants to see.
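The longest-match rule can be sketched as follows. This is an illustration rather than the real translate handler: the function name split_path is made up, the set of scripts stands in for the contents of docs/, and the rule that a prefix matches either prefix.py or prefix/index.py is inferred from the example above.

```python
def split_path(url_path, scripts):
    """Split url_path into (script_path, args_path).

    scripts is a set of script file names relative to docs/.
    Returns (None, url_path) when no prefix matches a script.
    """
    segments = [s for s in url_path.split("/") if s]
    # Try the longest prefix first, then progressively shorter ones.
    for n in range(len(segments), 0, -1):
        prefix = "/".join(segments[:n])
        if prefix + ".py" in scripts or prefix + "/index.py" in scripts:
            args_path = "".join("/" + s for s in segments[n:])
            return "/" + prefix, args_path
    return None, url_path


# The scripts named in the example above.
scripts = {"path.py", "path/to.py", "path/to/script/index.py"}
result = split_path("/path/to/script/arg1/arg2/arg3", scripts)
# result is ("/path/to/script", "/arg1/arg2/arg3"), matching Figure 1
```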

All paths of documentation are served up as HTML by the script docs/info/services/website/path.py. So in the current URL, /info/services/website/path is the script_path, and anything remaining after that is the args_path, which tells path.py which path, and possibly which body within that path, you wish to view.

If there's a question mark in the URL, it marks the beginning of the internal or special data. The data is internal if it has an = character in it, otherwise it is special. Internal data is used to perform actions--modifications to site content like adding, editing, or deleting. Special data triggers the execution of special Python scripts (ah! what an apt name!) which do things like change users' passwords, allow them to log in and log out, change modes, etc.
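The internal-versus-special test is simple enough to show directly. In this sketch the function name classify_query and the special name login are made up for illustration; only the rule itself (an = character means internal) comes from the text above.

```python
from urllib.parse import urlsplit, parse_qs


def classify_query(url):
    """Return (kind, data), where kind is 'none', 'internal', or 'special'."""
    query = urlsplit(url).query  # the text after the ? character, if any
    if not query:
        return "none", None
    if "=" in query:
        # Internal data: name=value pairs describing an action.
        return "internal", parse_qs(query, keep_blank_values=True)
    # Special data: a bare name selecting a special script.
    return "special", query


internal = classify_query("http://www.cs.uchicago.edu/path/to/script/arg1?foo=bar")
special = classify_query("http://www.cs.uchicago.edu/path/to/script?login")
```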

Cookies

The Cookie header contains one or more name-value pairs. In the request above, these are cs_mode with value edit and cs_login with value bla2blahh15861cablah57dblah43blah. Cookies are small pieces of information that the server sends to the browser. The browser then sends those pieces of information back to the server on every request. Cookies are a great way to maintain a sense of state for the user. In the case of this site, the user may have two pieces of state: a login, and a mode.
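As a rough illustration, the standard library's http.cookies module can turn the Cookie header value into a dictionary. The helper name parse_cookie_header is made up here, and the site's own code may do this differently.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(value):
    """Return {name: value} for the value of a Cookie request header."""
    jar = SimpleCookie()
    jar.load(value)
    return {name: morsel.value for name, morsel in jar.items()}


cookies = parse_cookie_header(
    "cs_mode=edit; cs_login=bla2blahh15861cablah57dblah43blah"
)
```

Since the browser controls what it sends back, a value such as cs_login has to be verified on the server before it is trusted.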

Form Data

As mentioned above, the request's entity-body is only present for POST requests. POST requests are the result of form submissions with FORM tags like this:
<form action="url" method="POST" enctype="multipart/form-data">
The name and value of each form field in the form are transmitted to the server in the entity-body. The Python infrastructure breaks this information down, making it easy to manipulate.
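One way to break a multipart/form-data body into fields is to hand it to the standard library's MIME parser, since the body uses MIME multipart syntax. This is a sketch under assumptions: the function name parse_multipart and the short boundary XYZ are made up, it only handles plain text fields (not file uploads), and it is not necessarily how the site's infrastructure does it.

```python
from email.parser import BytesParser
from email.policy import default


def parse_multipart(content_type, body):
    """Return {field name: text value} for a multipart/form-data body (bytes)."""
    # Prepend the Content-Type header so the MIME parser learns the boundary.
    header = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    message = BytesParser(policy=default).parsebytes(header + body)
    fields = {}
    for part in message.iter_parts():
        # Each part names its field in its Content-Disposition header.
        name = part.get_param("name", header="content-disposition")
        fields[name] = part.get_payload(decode=True).decode("utf-8")
    return fields


# Hypothetical body with the same two fields as the request shown earlier.
example_body = (
    b'--XYZ\r\nContent-Disposition: form-data; name="userid"\r\n\r\ndustin\r\n'
    b'--XYZ\r\nContent-Disposition: form-data; name="passwd"\r\n\r\nfoobar\r\n'
    b"--XYZ--\r\n"
)
form = parse_multipart("multipart/form-data; boundary=XYZ", example_body)
```

For the default application/x-www-form-urlencoded format, urllib.parse.parse_qs does the equivalent job on the decoded body.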

Processing the Request

When the Python infrastructure receives a request, it determines which Python script is responsible for formulating a response, and dispatches to that script.

In general, if the request method was POST, the Python script will perform some action (changing something in the database, performing a search, etc.) and bounce the browser to a new URL, which it will fetch with the GET method. For GET requests, the script will usually produce a page for the browser to display.
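The POST-then-bounce pattern might look like this in outline. Everything here is hypothetical: respond, perform_action, and render_page are illustrative names, and a real script would do far more validation.

```python
def respond(method, path, perform_action, render_page):
    """Return (status, headers, body) for a request.

    perform_action(path) carries out a POST and returns the URL to bounce to.
    render_page(path) returns the HTML (bytes) for a GET.
    """
    if method == "POST":
        # Act first, then bounce so the browser fetches the result with GET.
        next_url = perform_action(path)
        return 302, {"Location": next_url}, b""
    return 200, {"Content-Type": "text/html"}, render_page(path)


# Stand-in callbacks for illustration only.
post_result = respond(
    "POST", "/info/services/website",
    perform_action=lambda p: "http://www.cs.uchicago.edu" + p,
    render_page=lambda p: b"<html>...</html>",
)
get_result = respond(
    "GET", "/info/services/website",
    perform_action=lambda p: "http://www.cs.uchicago.edu" + p,
    render_page=lambda p: b"<html>...</html>",
)
```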

The Response

The HTTP response from the server is similar in appearance to the original HTTP request. It has a line of basic information, some headers, and an entity-body. Please refer to HTTP documentation for more information on its precise form.

There are several types of responses possible from this site:

  1. A page (the common case)
  2. An error page
  3. A bounce

Any of these types of responses may also include new cookies for the browser.

Web Page Response

When the site sends a web page back to the browser, it sends a response that looks something like this:
HTTP/1.0 200 OK
Date: Mon, 16 Jul 2001 19:35:13 GMT
Server: Apache/1.3.14 (Unix) mod_python/2.7.1 Python/2.0 PHP/4.0.4
Connection: close
Content-Type: text/html
 
<!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN" 
"http://www.w3.org/TR/html4/loose.dtd">
<html><head><title>Department of Computer Science</title></head>
...
The first line indicates that the server is sending a page back. The Content-type header gives the type of the document -- HTML. The headers are terminated by a blank line and followed by the document itself.

Error Pages

The site is quite tolerant of errors, such as mistyping a URL or entering invalid data into a form. However, occasionally errors do occur. In these cases, the server sends a response that looks like this:
HTTP/1.0 404 Not Found
Date: Mon, 16 Jul 2001 19:35:13 GMT
Server: Apache/1.3.14 (Unix) mod_python/2.7.1 Python/2.0 PHP/4.0.4
Connection: close
Content-Type: text/html; charset=iso-8859-1
 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
...

The response looks similar to the normal web page response, but the number and message in the first line are different. This number tells browsers and caches that this is not a normal, cacheable page.
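A client tells the response types apart by reading that first line. The sketch below splits it and groups the number by its first digit, which is how HTTP organizes status codes; the function names are made up for illustration.

```python
def parse_status_line(line):
    """Split a status line such as 'HTTP/1.0 404 Not Found' into its parts."""
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


def status_class(code):
    """Group a status code by its first digit."""
    classes = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "other")


page = parse_status_line("HTTP/1.0 200 OK")
error = parse_status_line("HTTP/1.0 404 Not Found")
```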

Bounce Response

Sometimes, the server wants to tell the browser to make a new request for a new URL. This is called a bounce, and looks like this:
HTTP/1.1 302 Found
Date: Mon, 16 Jul 2001 19:53:58 GMT
Server: Apache/1.3.14 (Unix) mod_python/2.7.1 Python/2.0 PHP/4.0.4
Location: http://www.cs.uchicago.edu/info/services/website
Connection: close
Content-Type: text/html
 
<HTML><BODY>Go 
<a href="http://www.cs.uchicago.edu/info/services/website">here</a>.
</BODY></HTML>

The 302 status tells the browser that it should not display the page, but should begin a new request to a new URL. The Location header specifies that new URL. The entity-body is ignored by all but the oldest browsers, but is sent just in case.
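Producing such a bounce by hand could look roughly like this. The function name bounce is hypothetical, the sketch assumes an ASCII URL, and in practice the server framework would normally write the status line and headers itself.

```python
from html import escape


def bounce(url):
    """Build the bytes of a 302 response that sends the browser to url."""
    # Refuse line breaks: they would let a caller smuggle in extra headers.
    if "\r" in url or "\n" in url:
        raise ValueError("URL must not contain line breaks")
    # Fallback body for browsers that do not follow the Location header.
    body = (
        '<HTML><BODY>Go <a href="%s">here</a>.</BODY></HTML>'
        % escape(url, quote=True)
    ).encode("ascii")
    head = (
        "HTTP/1.1 302 Found\r\n"
        f"Location: {url}\r\n"
        "Connection: close\r\n"
        "Content-Type: text/html\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


response = bounce("http://www.cs.uchicago.edu/info/services/website")
```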

Bounces are nice for a few reasons. First, a browser does not keep URLs which produced bounces in its history list. After performing an action, the site sends a bounce response with a new URL. The URL that triggered the action, then, is not kept in the browser's history list and will not be repeated by a user clicking back.

Second, if a POST request results in a bounce to a new URL, then the new URL will be fetched with the GET method. The site assumes this behavior, and all common browsers do behave this way, but the HTTP/1.1 RFC instructs browsers to use the same method, and instructs servers to use codes 303 or 307 instead. Unfortunately, some browsers do not support these new codes.

As a result, as browsers evolve to adhere more closely to the standard, the site's bouncing mechanism may need to change.

Setting Cookies

When the site wants to send a cookie to the browser, it uses a Set-Cookie header like this:
Set-Cookie: cs_mode=edit; Max-Age=86400; 
            expires=Thu, 12-Jul-2001 18:12:10 GMT; Path=/; secure;
            Domain=www.cs.uchicago.edu;
The first part, cs_mode=edit, gives the name and value of the cookie. The remaining pieces are instructions to the browser regarding when and where the cookie should be sent back to the server. In this case, the cookie will be sent back on every request to www.cs.uchicago.edu until Thursday, July 12, 2001. Because of the secure attribute, the browser sends it only over secure (SSL) connections.
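The standard library's http.cookies module can assemble the value of such a header. This is a sketch only: the helper name make_set_cookie is made up, it omits the expires attribute, and the module emits attributes in its own order, which differs from the example above.

```python
from http.cookies import SimpleCookie


def make_set_cookie(name, value, max_age, domain, path="/", secure=True):
    """Return the text that follows 'Set-Cookie: ' for one cookie."""
    jar = SimpleCookie()
    jar[name] = value
    morsel = jar[name]
    morsel["max-age"] = max_age  # lifetime in seconds
    morsel["path"] = path
    morsel["domain"] = domain
    if secure:
        morsel["secure"] = True  # only send over encrypted connections
    return morsel.OutputString()


header_value = make_set_cookie("cs_mode", "edit", 86400, "www.cs.uchicago.edu")
```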

See the Cookie RFC for more details on the mechanics of cookies.

What about SSL?

Fortunately, all of the above applies just as well for SSL connections. SSL supplies an encrypted channel through which a standard HTTP transaction may take place without modification.