Infrastructure Perspective: The HTTP Transaction
Introduction
This path does not delve too deeply into exactly how a transaction is handled. Instead, it focuses on the kinds of data flowing into and out of the web site. This is important information, as any data flowing into the web site should be regarded as untrustworthy (having come in over the network).
The HTTP Request
    POST /info/services/website HTTP/1.0
    Connection: Keep-Alive
    User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.2-2 i586; Nav)
    Host: www.cs.uchicago.edu
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    Accept-Encoding: gzip
    Accept-Language: en
    Accept-Charset: iso-8859-1,*,utf-8
    Cookie: cs_mode=edit; cs_login=bla2blahh15861cablah57dblah43blah
    Content-type: multipart/form-data; boundary=---------------------------62993215913904462331845589843
    Content-Length: 296

    -----------------------------62993215913904462331845589843
    Content-Disposition: form-data; name="userid"

    dustin
    -----------------------------62993215913904462331845589843
    Content-Disposition: form-data; name="passwd"

    foobar
    -----------------------------62993215913904462331845589843--
The first word of the request, POST, is the method. This is generally either GET or POST, and this site uses the two literally: GET gets a page from the server, while POST sends new content to the server.
Next comes the path, /info/services/website, sometimes also known as the URI. It specifies the document the browser would like to access, and is the part of the original URL that appears after the server's name. So the example request above corresponds to http://www.cs.uchicago.edu/info/services/website.
The lines from Connection through Content-Length are the request headers. These contain all sorts of extra data about the request. The most important headers above are Host, Cookie, and Content-type. The Host header corresponds to the server name in the URL. Apache uses the value in this header to determine exactly which server the browser thinks it's talking to, allowing a single Apache process to represent many server names, or virtual hosts. The Cookie header contains information from previous requests that the server has asked the browser to remember; the site uses this information to identify the current user and what mode she's in. Finally, the Content-type header identifies the format of the entity-body.
Everything after the headers is the entity-body itself. GET requests don't have entity-bodies, but the entity-body of a POST request like this one can be in one of two formats (differentiated by the Content-type header). The default format is application/x-www-form-urlencoded; the format shown here, multipart/form-data, is required for file uploads and other complex forms.
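The four parts of a request described above can be pulled apart with a few lines of Python. This is only a minimal sketch, not the site's actual code: the function name parse_request is hypothetical, the sample request is a shortened version of the example with a form-urlencoded body, and headers spread over several lines are not handled.

```python
def parse_request(raw):
    """Split a raw HTTP request into method, path, version, headers, body."""
    # The headers end at the first blank line; the rest is the entity-body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # The first line holds the method, the path, and the protocol version.
    method, path, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers, body


# Shortened, hypothetical variant of the example request.
raw = (
    b"POST /info/services/website HTTP/1.0\r\n"
    b"Host: www.cs.uchicago.edu\r\n"
    b"Cookie: cs_mode=edit\r\n"
    b"Content-type: application/x-www-form-urlencoded\r\n"
    b"\r\n"
    b"userid=dustin"
)
method, path, version, headers, body = parse_request(raw)
```

Every value returned here came in over the network, so each one should be treated as untrustworthy until it has been checked.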
Breaking Down the URL
The URL is broken down as shown in Figure 1. The scheme and server portions are self-explanatory. The script_path portion of the URL is the longest section of the path which corresponds to a Python script under docs/. So, for example, if there are Python scripts docs/path.py, docs/path/to.py, and docs/path/to/script/index.py, but not docs/path/to/script/arg1.py or anything longer, the Python infrastructure would break down the URL as shown in Figure 1.
Figure 1: Breakdown of example URL.
Anything left between the end of the script_path and the end of the URL or a ? character is the args_path, and tells the Python script at script_path specifically what the user wants to see.
All paths of documentation are served up as HTML by the script docs/info/services/website/path.py. So in the current URL, /info/services/website/path is the script_path, and anything remaining after that is the args_path, which tells path.py which path, and possibly which body within that path, you wish to view.
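The longest-match rule for finding the script_path can be sketched as follows. This is an illustration under stated assumptions, not the infrastructure's real code: split_path is a hypothetical name, the set of scripts stands in for a lookup under docs/, and it is assumed that docs/path/to/script/index.py answers for /path/to/script.

```python
def split_path(url_path, scripts):
    """Return (script_path, args_path) using the longest matching script."""
    parts = [p for p in url_path.split("/") if p]
    # Try the longest candidate first, then drop one component at a time.
    for end in range(len(parts), 0, -1):
        candidate = "/" + "/".join(parts[:end])
        if candidate in scripts:
            rest = "/".join(parts[end:])
            return candidate, ("/" + rest if rest else "")
    # No script matched any prefix of the path.
    return None, url_path


# URL paths served by docs/path.py, docs/path/to.py and
# docs/path/to/script/index.py (the index.py mapping is an assumption).
scripts = {"/path", "/path/to", "/path/to/script"}
result = split_path("/path/to/script/arg1/arg2", scripts)
```

Here result is the pair ("/path/to/script", "/arg1/arg2"): the script at /path/to/script is chosen because nothing longer exists, and the remainder becomes the args_path.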
If there's a question mark in the URL, it marks the beginning of the internal or special data. The data is internal if it has an = character in it; otherwise it is special. Internal data is used to perform actions: modifications to site content like adding, editing, or deleting. Special data triggers the execution of special Python scripts (ah! what an apt name!) which do things like change users' passwords, allow them to log in and log out, change modes, etc.
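The internal/special distinction amounts to a single test on the text after the question mark. A minimal sketch, assuming a hypothetical function name and hypothetical query strings (the real names used by the site are not given here):

```python
def classify_query(url):
    """Return (path, kind, data) where kind is 'internal', 'special' or None."""
    path, mark, data = url.partition("?")
    if not mark:
        # No question mark: the URL carries neither kind of data.
        return path, None, ""
    # An = character anywhere in the data makes it internal.
    kind = "internal" if "=" in data else "special"
    return path, kind, data


# Both query strings below are made up for illustration.
internal = classify_query("/info/services/website?action=edit")
special = classify_query("/info/services/website?login")
```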
Cookies
Form Data
    <form action="url" method="POST" enctype="multipart/form-data">
The name value of each form field in the form is transmitted to the server in the entity-body. The Python infrastructure breaks this information down, making it easy to manipulate.
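One way to break a multipart/form-data entity-body down into fields is to hand it to the standard library's email parser, since the format is MIME-based. This is a sketch of the idea only, not the site's own parser; parse_multipart is a hypothetical name, and file uploads and repeated field names are not handled.

```python
from email import message_from_bytes


def parse_multipart(content_type, body):
    """Return {field name: value} for a multipart/form-data entity-body."""
    # The email parser needs the Content-Type header (which carries the
    # boundary) in front of the body to locate the individual parts.
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = message_from_bytes(raw)
    fields = {}
    if not message.is_multipart():
        return fields
    for part in message.get_payload():
        # Each part names its form field in the Content-Disposition header.
        name = part.get_param("name", header="content-disposition")
        fields[name] = part.get_payload(decode=True).decode("iso-8859-1")
    return fields


boundary = "---------------------------62993215913904462331845589843"
body = (
    "--{b}\r\n"
    'Content-Disposition: form-data; name="userid"\r\n'
    "\r\n"
    "dustin\r\n"
    "--{b}\r\n"
    'Content-Disposition: form-data; name="passwd"\r\n'
    "\r\n"
    "foobar\r\n"
    "--{b}--\r\n"
).format(b=boundary).encode("ascii")
fields = parse_multipart("multipart/form-data; boundary=" + boundary, body)
```

With the body from the example request, fields maps userid to dustin and passwd to foobar.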
Processing the Request
In general, if the request method was POST, the Python script will perform some action (changing something in the database, performing a search, etc.) and bounce the browser to a new URL, which it will fetch with the GET method. For GET requests, the script will usually produce a page for the browser to display.
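The POST-then-bounce pattern described above can be sketched as a small handler. Everything here is an assumption made for illustration: the function name handle, its return shape of (status, headers, body), and the placeholder page content.

```python
def handle(method, path):
    """Return (status, headers, body) for a request to the given path."""
    if method == "POST":
        # Hypothetical: the action (database change, search, ...) runs here.
        # Afterwards the browser is bounced to a URL it will fetch with GET.
        location = "http://www.cs.uchicago.edu" + path
        return 302, {"Location": location}, ""
    # For GET, produce a page for the browser to display.
    body = "<html><body>Page for %s</body></html>" % path
    return 200, {"Content-Type": "text/html"}, body


post_status, post_headers, post_body = handle("POST", "/info/services/website")
get_status, get_headers, get_body = handle("GET", "/info/services/website")
```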
The Response
There are several types of responses possible from this site:
- A page (the common case)
- An error page
- A bounce
Any of these types of responses may also include new cookies for the browser.
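From the browser's side, the three response types can be told apart by the status code on the first line of the response. A rough sketch, assuming a hypothetical helper name and the conventional grouping of codes (2xx for pages, 3xx for bounces, 4xx and 5xx for errors):

```python
def classify_response(status_line):
    """Map a status line such as 'HTTP/1.0 200 OK' to a response type."""
    version, code, reason = status_line.split(" ", 2)
    code = int(code)
    if 200 <= code < 300:
        return "page"
    if 300 <= code < 400:
        return "bounce"
    if code >= 400:
        return "error page"
    # Informational 1xx codes are outside the three types listed above.
    return "other"


kinds = [
    classify_response("HTTP/1.0 200 OK"),
    classify_response("HTTP/1.0 404 Not Found"),
    classify_response("HTTP/1.1 302 Found"),
]
```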
Web Page Response
    HTTP/1.0 200 OK
    Date: Mon, 16 Jul 2001 19:35:13 GMT
    Server: Apache/1.3.14 (Unix) mod_python/2.7.1 Python/2.0 PHP/4.0.4
    Connection: close
    Content-Type: text/html

    <!doctype html public "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
    <html><head><title>Department of Computer Science</title></head>
    ...
The first line indicates that the server is sending a page back. The Content-type header gives the type of the document -- HTML. The headers are terminated by a blank line and followed by the document itself.
Error Pages
    HTTP/1.0 404 Not Found
    Date: Mon, 16 Jul 2001 19:35:13 GMT
    Server: Apache/1.3.14 (Unix) mod_python/2.7.1 Python/2.0 PHP/4.0.4
    Connection: close
    Content-Type: text/html; charset=iso-8859-1

    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <HTML><HEAD>
    ...
The response looks similar to the normal web page response, but the number and message in the first line are different. This number tells browsers and caches that this is not a normal, cacheable page.
Bounce Response
    HTTP/1.1 302 Found
    Date: Mon, 16 Jul 2001 19:53:58 GMT
    Server: Apache/1.3.14 (Unix) mod_python/2.7.1 Python/2.0 PHP/4.0.4
    Location: http://www.cs.uchicago.edu/info/services/website
    Connection: close
    Content-Type: text/html

    <HTML><BODY>Go <a href="http://www.cs.uchicago.edu/info/services/website">here</a>.
    </BODY></HTML>
The 302 status tells the browser that it should not display the page, but should begin a new request to a new URL. The Location header specifies that new URL. The entity-body is ignored by all but the oldest browsers, but is sent just in case.
Bounces are nice for a few reasons. First, a browser does not keep URLs which produced bounces in its history list. After performing an action, the site sends a bounce response with a new URL. The URL that triggered the action, then, is not kept in the browser's history list and will not be repeated by a user clicking back.
Second, if a POST request results in a bounce to a new URL, then the new URL will be fetched with the GET method. The site assumes this behavior, and all common browsers do behave this way, but the HTTP/1.1 RFC (RFC 2616) instructs browsers not to change the method on a 302, and instructs servers to use code 303 (fetch the new URL with GET) or 307 (repeat the request with the same method) instead. Unfortunately, some browsers do not support these new codes.
As a result, as browsers evolve to adhere more closely to the standard, the site's bouncing mechanism may need to change.
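One possible adaptation, shown purely as a sketch and not as what the site does, is to choose the bounce code from the protocol version in the request line: 303 for clients that announce HTTP/1.1 or later, and 302 for older ones. The function name is hypothetical, and a client announcing HTTP/1.1 is not guaranteed to handle 303 correctly.

```python
def bounce_status(request_version):
    """Pick a redirect code from a version string such as 'HTTP/1.0'."""
    major, minor = request_version.split("/")[1].split(".")
    # 303 was introduced with HTTP/1.1, so older clients get plain 302.
    if (int(major), int(minor)) >= (1, 1):
        return 303
    return 302


old_client = bounce_status("HTTP/1.0")
new_client = bounce_status("HTTP/1.1")
```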
Setting Cookies
    Set-Cookie: cs_mode=edit; Max-Age=86400; expires=Thu, 12-Jul-2001 18:12:10 GMT; Path=/; secure; Domain=www.cs.uchicago.edu;
The first part, cs_mode=edit, gives the name and value of the cookie. The remaining pieces are instructions to the browser regarding when and where the cookie should be sent back to the server. In this case, the cookie will be sent back on every request to www.cs.uchicago.edu made over a secure connection (because of the secure attribute) until Thursday, July 12.
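The pieces of a Set-Cookie value can be inspected with Python's standard http.cookies module. This sketch only parses the example header above; it is not meant to show how the site itself produces or reads cookies.

```python
from http.cookies import SimpleCookie

header_value = (
    "cs_mode=edit; Max-Age=86400; expires=Thu, 12-Jul-2001 18:12:10 GMT; "
    "Path=/; secure; Domain=www.cs.uchicago.edu;"
)

cookie = SimpleCookie()
cookie.load(header_value)

# Each cookie is stored as a Morsel keyed by the cookie's name.
morsel = cookie["cs_mode"]
value = morsel.value              # the cookie's value
max_age = morsel["max-age"]       # lifetime in seconds, kept as a string
path = morsel["path"]             # URL prefix the cookie applies to
domain = morsel["domain"]         # server the cookie is sent back to
secure = bool(morsel["secure"])   # only sent over secure connections
```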
See the Cookie RFC for more details on the mechanics of cookies.
What about SSL?