This paper is the intellectual property of the author(s). It was presented at CAUSE98, an EDUCAUSE conference, and is part of that conference's online proceedings. See http://www.educause.edu/copyright.html for additional copyright information.

On Remote Access to Protected Web Content

By Wade B. Komisar

The University of Virginia discusses its investigations of various methods to protect copyrighted and site-licensed web information. Widely used IP filtering methods unfortunately disenfranchise remote ISP-connected users from accessing resources to which they are entitled. Alternative solutions, however, are not readily apparent.

Introduction

This paper describes a continuing experiment and a work in progress. Originally, I had intended to talk about a finished product that enabled a more reasonable approach to securing web content. However, over the past six months things have changed.

The complexity of web content has increased. When the problem of enabling "outside" users to access protected web content came to our attention two years ago, most content was straight HTML. Only a few sites were using frames and tables. Since then, the popularity of JavaScript, Java, cookies, and ActiveX has increased greatly.

Also, our experience with HTTP and HTML has increased. We have learned the hard lesson that there is a wide division between standards and implementations. More important than the (ever-changing) standards themselves are the various liberties that many web authors take, and browsers accommodate.

Therefore, I will be presenting on the more general issue of enabling remote access to IP-protected web resources. I shall outline the problem, discuss our general requirements, talk about alternative solutions, including our own homegrown solution, and do a quick assessment based upon our experience to date.

The Problem

The web is suffering growing pains. They are felt particularly by those who try to extend the technology from its initial roots. The web was born in a democratic, if not chaotic and anarchic, context. Although some rudimentary security, such as htaccess and IP filtering, is available, the strength of the web has always been its ability to allow everybody access to everything, with the browser as the window into the whole web world.

But the reality is that the web world is not one continuous space. There are proprietary interests and concerns, as well as copyrighted, site-licensed, and confidential information. Problems of scale have made htaccess and IP filtering inefficient or inadequate methods for deciding who should be given access and who should not. Simply put, the popularity of the web has outstripped the security technology that the web presently employs. This can be readily seen when we pair two observations: the increase in the popularity of online (web) publishing, and the increase in user demand for web access.

The Increase in On-line Publishing

Online publishing has increased, especially for reference information. Within a university context, such material would be:

journal and book abstracts and indexes.
encyclopedias.
dictionaries.
other material such as electronic books, course material, source documents, etc.

As little as 10 years ago, this material would occupy expensive shelf space, and comprise not only the original volumes, but also addenda that themselves would be months, perhaps years, out of date. More recently, reference material began to arrive in the form of CD-ROMs that were updated more frequently, but still required floor space for dedicated workstations.

Besides requiring space, physical volumes of reference material are not easily shared. Only one person could access the material at a time. CD-ROMs were not much better. Reference workstations often had sign-up sheets. LAN-sharing of CD-ROMs enabled greater simultaneous access, but access was still limited to the number of workstations on the LAN.

Over the past couple of years, more material has become accessible through the web from the publishers' sites. The advantages are obvious.

The material is kept more up-to-date. The web pages are hosted, or at least controlled by the publisher, allowing for quick and timely revisions.
The material is easily shared. Any web browser on any workstation wired to the Internet has the means to access the material.

However, in our proprietary and copyrighted world, reference materials are not freely and publicly available. Companies, such as GaleNet, Cambridge Scientific, LEXIS/NEXIS, and Encyclopedia Britannica charge for access to their material, and limit access to users who have paid, or are members of an organization that has site-licensed the material. The problem thus turns on how to define the members of a particular organization that has a site-license.

The Obsolescence of Subnet-based Security and the Popularity of the Internet

One of the basic internet truisms is that all computers, whether clients or servers, PCs or mainframes, are identified by a unique signature known as the IP address. An IP address is 32 bits of binary information that is more typically represented in "dot" notation consisting of 4 decimal numbers separated by periods.¹
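The relationship between the 32-bit value and the "dot" notation can be sketched in a few lines of Python. This is an illustration only; the function names are hypothetical and are not part of any software described in this paper.

```python
def dotted_to_int(dotted: str) -> int:
    """Pack the four decimal numbers of a dotted address into one 32-bit integer."""
    octets = [int(part) for part in dotted.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError(f"not a dotted IPv4 address: {dotted!r}")
    value = 0
    for octet in octets:
        # Each decimal number contributes 8 of the 32 bits.
        value = (value << 8) | octet
    return value


def int_to_dotted(value: int) -> str:
    """Unpack a 32-bit integer back into dot notation."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))
```

Converting an address such as 128.143.22.36 to an integer and back returns the original string, which is all the "dot" notation is: a more readable spelling of the same 32 bits.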

Alternatively, internet-connected computers can be addressed by their hostnames, a convention that associates IP addresses with alphanumeric representations that (hopefully) denote something meaningful about the computers and their contents. "Hostnames" are heuristics that are more easily remembered. For instance, it is easier to remember that IBM's main web site may be accessed as http://www.ibm.com, rather than http://204.146.18.33. However, both URLs will enable you to access the IBM web page.

Although "hostnames" provide a useful method of remembering internet resources, they imply an organizational order where none may actually exist. This comes from the initial design of IP-address and domain name resolution. The assumption is that internet resources controlled by specific organizations will have similar hostnames and IP addresses. This is because organizations that wish to have an internet presence are allocated a range of IP addresses that come to be considered their subnets, or domains. The University of Virginia's subnet is defined by the starting IP address of 128.143, and the associated domain name of "virginia.edu". The assumption, then, is that all internet resources that belong to the University of Virginia will be in the UVa domain, "virginia.edu", and have IP addresses that start with "128.143". For the most part, this is true. Our major internet services within the "virginia.edu" subnet all have "128.143" IP addresses: 128.143.22.36 (www), 128.143.22.24 (ftp), and 128.143.2.66 (news). In addition, any workstation connecting to the UVa subnet with a network interface card or a modem is assigned an IP address that begins with "128.143" and is thus part of the "virginia.edu" domain.

IP-filtering exploits the assumption that server and client computers of the same organization will be on the same subnet. Thus if there is an internet resource, such as UVa's news server, that should only be accessed by people affiliated with the University, IP-filtering would ensure that only workstations that are on the University subnet would have access.
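A minimal Python sketch of the test an IP filter performs is shown below. It assumes that "addresses that start with 128.143" can be written as the network 128.143.0.0/16; the function name and the default access list are hypothetical.

```python
import ipaddress

# Assumption: the UVa subnet described above, written in CIDR form.
UVA_SUBNET = ipaddress.ip_network("128.143.0.0/16")


def is_allowed(client_ip: str, allowed_networks=(UVA_SUBNET,)) -> bool:
    """Return True when the client's address falls inside any allowed subnet."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in allowed_networks)
```

With this access list, is_allowed("128.143.22.36") is true, while an address on another organization's subnet, such as 204.146.18.33, is refused no matter who is sitting at the keyboard.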

However, as much as IP filtering limits access to the subnet, it also disenfranchises community members whose workstations are on different subnets. This problem has always existed; it has simply been exacerbated by the popularity of the Internet. The people who do not have access to subnet-protected resources are generally:

Off-campus, remote, continuing education or extension students and faculty.
Non-resident faculty who may be on sabbatical or leave, taking vacation, conducting fieldwork, or serving as visiting faculty elsewhere.
Resident faculty, students, and staff who have contracted with a local internet service provider, or AOL, MSN, or CompuServe, to provide internet access.

It is the resident population that has been recently the most vocal about their lack of access. This is partly our own doing. As with many higher education institutions, UVa has always provided modem access to our network, and still does. However, with the increase in the popularity of the Internet, the demand for remote internet access outstripped UVa's economic ability to supply enough modems. We are not in the internet service provider business, and are not able to scale to demand. We recently adopted a policy to encourage users who are tired of the busy UVa modem banks to use internet service providers. We have even contracted with one for pricing breaks and set up a direct high-speed network connection between the ISP and UVa networks.

By subscribing to a local ISP, University personnel enjoy the benefits of timely access to the Internet. However, their workstations are now assigned an IP address that is on their ISP's subnet, rather than the UVa subnet. As a result, they are locked out of internet resources that are protected by IP filtering.

Some internet services, such as news, mail, and ftp, have had, or have recently developed, application-level authentication mechanisms that enable remote access. UVa has implemented some of these services, such as our remotenews server, for remote users. For some web applications, we rely upon htaccess. We also have a customized authentication method that we wrote called Web Wrapper, which provides state and session as well as more robust authentication.

The problem arises when the authentication is for a remote site, which is the case for most online publishers. When UVa contracts with a company to make their dictionary, encyclopedia, or journal index available to the University community, the company needs to open their web site to UVa access. Some companies require a cgi program to be run at UVa to provide authentication. The cgi program is customized for UVa's common authentication method, which is a modified form of the "whois" internet service. The cgi program queries the user for authentication information, queries our centralized "whois" database, and enables access for legitimate users.

However, the more common method of enabling access to licensed materials is the usage of IP filtering. Companies maintain an access list of all the customer subnets and filter web access through this list. However, as we have already shown, IP filtering disenfranchises all legitimate, but remote users.

This is the problem the UVa library administration came to us with. Could we, the Advanced Technology Group, come up with a solution that enabled authenticated web access to a subnet-protected web host? We surmised that we needed a proxy-style solution.

Requirements

A proxy is an internet program that is situated in the protocol stream between the client and the target server. In the case of web access, the proxy basically:

receives the http transmission from browsers and forwards the transmission to the respective target servers,
receives responses from web servers and forwards them to the respective browsers.

Typically a proxy server is utilized to provide an interstitial caching service, which allows for more efficient access of popular web information. However, proxies are also used as a method to provide access to "firewalled" web resources, and it is this ability that we wished to exploit. However, we had some requirements that we needed to satisfy for our security proxy.

Authentication

UVa has had a centralized database of all UVa faculty, staff, and students that is accessible on-line. It is culled from registration and payroll information, and is kept fairly up-to-date. It offers both a "whois" and an LDAP protocol interface, enabling any networked workstation to query for public directory information on individual users. The "whois" interface also includes hidden authentication mechanisms that will verify a user through the submission of the user's last name, University ID, and, optionally, birth date. This is not a heavyweight security scheme. Although University IDs are not public information, University personnel do not take great pains to protect them. However, this authentication method is adequate for accessing site-licensed reference material. We simply need to ensure that our proxy would be able to prompt remote users for their last name and University ID, query our central database, and allow legitimate users access to the licensed material.
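The check itself is simple, as the following Python sketch suggests. The in-memory dictionary is a hypothetical stand-in for the central directory; the real query goes over the modified "whois" interface, whose details are not reproduced here, and the record shown is invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class DirectoryEntry:
    last_name: str
    university_id: str
    birth_date: Optional[str] = None  # hypothetical format, e.g. "1975-03-14"


# Hypothetical stand-in for the central directory; the entry is made up.
DIRECTORY: Dict[str, DirectoryEntry] = {
    "123456789": DirectoryEntry("Doe", "123456789", "1975-03-14"),
}


def authenticate(last_name: str, university_id: str,
                 birth_date: Optional[str] = None,
                 directory: Dict[str, DirectoryEntry] = DIRECTORY) -> bool:
    """Verify last name and University ID, and birth date when it is supplied."""
    entry = directory.get(university_id)
    if entry is None:
        return False
    if entry.last_name.casefold() != last_name.strip().casefold():
        return False
    # The birth date is optional: it is only compared when the user provides it.
    if birth_date is not None and entry.birth_date != birth_date:
        return False
    return True
```

The sketch mirrors the light weight of the scheme: anyone who knows a last name and the matching University ID is accepted.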

Authorization

Some of the material our proxy would front-end may be licensed for a specific segment of the University community; specialized medical and legal resources are only accessible to medical and law faculty and students, respectively. The data in the University "whois" database contains information on department or school affiliation and classification (undergraduate student, graduate faculty, or staff). Our proxy should be able to filter authorized material based upon this information for each protected resource.
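One way to express such per-resource rules is a small table keyed by the protected host, as in the Python sketch below. The rule format, the host medical-index.example.com, and the field names "school" and "classification" are all hypothetical; they only illustrate filtering on affiliation and classification.

```python
# Hypothetical rule table: None means the whole University community is licensed.
RESOURCE_RULES = {
    "www.eb.com": None,
    "medical-index.example.com": {
        "schools": {"medicine"},
        "classifications": {"faculty", "student"},
    },
}


def is_authorized(user: dict, resource_host: str, rules=RESOURCE_RULES) -> bool:
    """Decide whether an already authenticated user may use a protected resource."""
    if resource_host not in rules:
        return False  # resources the proxy does not front-end are never authorized
    rule = rules[resource_host]
    if rule is None:
        return True
    return (user.get("school") in rule["schools"]
            and user.get("classification") in rule["classifications"])
```

A law student would pass the first rule but fail the second, while a medical faculty member would pass both.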

Session and State

One of the challenges of web programming is that the http protocol is stateless. Every transaction, the request from the client and the response by the server, is autonomous, and bears no relation to previous or subsequent transactions, whether they are requests to the same or a different server.

Our proxy needs to remember that a user has been authenticated, and has to keep track of which web resource the user has been authenticated to use. The authentication session needs a time-out, especially in a computer lab or at a kiosk. Further, if the user attempts to access a bookmarked page on a protected web site, our proxy must be able to issue an authentication challenge, and then provide access to the bookmarked page.
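A minimal Python sketch of this kind of session bookkeeping follows. It is not how mIm or Web Wrapper keep state; the class name, the token scheme, and the 15-minute default are assumptions made for the example.

```python
import secrets
import time


class SessionStore:
    """Remember which resource a token was authenticated for, with an idle time-out."""

    def __init__(self, timeout_seconds: float = 900, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock          # injectable so the time-out can be tested
        self._sessions = {}          # token -> (resource_host, last_seen)

    def create(self, resource_host: str) -> str:
        """Start a session after a successful authentication and return its token."""
        token = secrets.token_urlsafe(16)
        self._sessions[token] = (resource_host, self._clock())
        return token

    def check(self, token: str, resource_host: str) -> bool:
        """Return True if the token is still valid for this resource."""
        record = self._sessions.get(token)
        if record is None:
            return False
        host, last_seen = record
        now = self._clock()
        if now - last_seen > self._timeout:
            del self._sessions[token]  # idle too long: force a new challenge
            return False
        if host != resource_host:
            return False
        self._sessions[token] = (host, now)  # activity extends the session
        return True
```

A request for a bookmarked page that arrives without a valid token would fail check(), at which point the proxy can issue the authentication challenge and then continue to the requested page.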

Redirection

The purpose of our proxy is to enable remote users to authenticate themselves so that they may access licensed web resources. It is not intended to get in the way of users whose workstations are on the University subnet. Our proxy needs to be able to distinguish local users from remote users, and redirect local users to access the protected resource directly.

A side benefit of the ability to discriminate between local and remote web clients is the ability to define a workstation or set of workstations, such as a student lab, as being "remote", even though the workstations are on the University subnet. This not only enables testing without dialing in, but also enables the proxy to front-end web access from "public" workstations.
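The local/remote decision, including the override for "public" workstations, might look like the Python sketch below. The campus network is assumed to be 128.143.0.0/16, and the lab range 128.143.200.0/24 is purely hypothetical.

```python
import ipaddress

# Assumption: the campus subnet in CIDR form.
CAMPUS = ipaddress.ip_network("128.143.0.0/16")
# Hypothetical: an on-campus lab that should be treated as "remote".
TREAT_AS_REMOTE = (ipaddress.ip_network("128.143.200.0/24"),)


def route_request(client_ip: str, target_url: str):
    """Send local clients straight to the resource; challenge everyone else."""
    address = ipaddress.ip_address(client_ip)
    forced_remote = any(address in network for network in TREAT_AS_REMOTE)
    if address in CAMPUS and not forced_remote:
        return ("redirect", target_url)      # get out of the way of local users
    return ("authenticate", target_url)      # remote or "public" workstation
```

Listing a test workstation's address range under TREAT_AS_REMOTE is what makes it possible to exercise the authentication path without dialing in.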

Rewriting

One of the challenges to our proxy is ensuring that it will front-end the specified IP-protected web resources and get out of the way for the others. As we will see, alternative proxy solutions approach this problem differently, and this has become one of the critical factors in deciding which proxy solution we will use.

Transparency

Another issue in choosing which proxy solution we will implement is its transparency to the user. Will a user notice its operation beyond the initial authentication screen? Does it slow response time, or alter, or break the format of a web page?

Configuration

Some proxy solutions call for the user to configure their browser. Although this is not necessarily an onerous task, it is prone to user error, which in turn increases the support burden on the organization. Other solutions do not call for user configuration, and thus may be less of a support burden. However, these solutions have their own problems, as we will see.

There are additional requirements beyond those listed above. These include scalability and efficiency, among others. While no less important, they were not as much deciding factors as the issues pertaining to functionality for individual users.

Proxy Implementations

There are proxy solutions available that satisfy our requirements to varying degrees.

Conventional Proxies

Conventional proxy solutions are the most commonly used on the Internet. Supported by the two most popular browsers, they require users to configure their browsers to recognize a proxy server. Once configured, the browser passes its request for a web page to the proxy server using a particular style of URL that includes the web server machine and the absolute address of the requested page. The proxy server processes the browser's request, serves the request from its cache, if implemented, or forwards the request to the target server. Upon receipt of the server's response, the proxy may update its cache, and forward the response to the requesting browser.
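The "particular style of URL" is the absolute form of the request line, for example "GET http://www.eb.com/home.html HTTP/1.0" rather than "GET /home.html HTTP/1.0". The Python sketch below shows how a proxy could pull the target server out of such a line; the function name is hypothetical and the default port of 80 assumes plain http.

```python
from urllib.parse import urlsplit


def parse_proxy_request_line(line: str):
    """Split a proxy-style request line into method, host, port, path and version."""
    method, target, version = line.strip().split(" ", 2)
    parts = urlsplit(target)
    if not parts.scheme or not parts.hostname:
        # A browser that is not configured for a proxy sends only the path.
        raise ValueError("a proxy request must carry an absolute URL")
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    port = parts.port or 80  # assumption: plain http on the default port
    return method, parts.hostname, port, path, version
```

Because the browser itself supplies the full URL on every request, the proxy never has to edit the pages it returns.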

Typically, a proxy server will field all http requests that are generated from a configured browser. However, in both of the major browsers there are two configuration options that allow for some conditional testing.

Manual Configuration

The manual configuration option allows users to enter the hostname or IP address and port for the proxy server. Both Netscape Communicator and Microsoft Internet Explorer also allow the user to specify the domain names of web sites whose requests will not go through the proxy, but will be serviced directly by those domains' web servers.

Automatic Configuration

The automatic configuration option enables the user to enter the URL of a (local or remote) script file, written in JavaScript, JScript, or ECMAScript, that will take as an argument the URL that is being requested. The script will conditionally test the URL to see how to handle the request: whether to access the requested server directly, or through a proxy (and if so, which one, if there are multiple proxies).
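In the browsers, the automatic configuration file defines a JavaScript function, FindProxyForURL(url, host), that returns a string such as "DIRECT" or "PROXY host:port". To stay in one language for the examples in this paper, the decision logic is sketched here in Python rather than JavaScript; the proxy hostname, port, and list of protected hosts are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical: hosts that should be reached through the remote-access proxy.
PROTECTED_HOSTS = {"www.eb.com"}
# Hypothetical proxy name and port.
PROXY_DIRECTIVE = "PROXY proxy.virginia.edu:8080"


def find_proxy_for_url(url: str) -> str:
    """Mirror the decision an automatic configuration script makes for each URL."""
    host = (urlsplit(url).hostname or "").lower()
    if host in PROTECTED_HOSTS:
        return PROXY_DIRECTIVE   # licensed resource: go through the proxy
    return "DIRECT"              # everything else bypasses the proxy
```

Because the file is fetched from a URL, the list of protected hosts can be maintained centrally instead of on each user's workstation.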

We are looking at two products for conventional proxying: the Apache Web Server², with its proxy module implemented, and the Squid Object Cache³. Both of these products were designed for caching. However, each can also provide the services we need, up to a degree, for our remote-access proxy.

Transparent and "Inline" Proxying

Two less standard methods of implementing proxying are similar in two respects.

they do not require the user to configure the proxy server on the browser.
they get in the middle of the network transmission between the browser and target server, by situating themselves in the protocol stream. However, each monitors a different protocol, and has different behaviors.

Transparent Proxying

Transparent proxies operate by analyzing transmissions between browsers and their target servers at the packet level, which is the fundamental protocol level in TCP/IP networking. A transparent proxy usually is incorporated into a router, or relies upon a router to redirect packets that are intended for web servers to the proxy.

Transparent proxying is completely invisible to the user. They have no recourse but to utilize it. However, transparent proxying calls for a fundamental change to a site's networking infrastructure, and places an extra load on a site's primary router to implement the redirection.

Inline Proxying

Inline proxying positions the proxy in the http protocol stream between the browser and target web server. Its usage requires a modified URL that initially calls the inline proxy and includes the target web server in the remaining URL line. For example, the URL for the Encyclopedia Britannica is http://www.eb.com/home.html. A URL with an inline proxy would look something like http://www.inlineproxy.virginia.edu/@www.eb.com/home.html. The inline proxy queries and checks a user's authentication, calls the target server, requests the specified page, receives the server's response, and parses the html in the response so that all further URL references to the target server are front-ended by the inline proxy.
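The URL convention in the example above can be sketched as a pair of Python functions, one that wraps a target URL and one that recovers it from the path the proxy receives. This is an illustration of the "/@host/path" form only, not the mIm source; the function names are hypothetical and only plain http targets are considered.

```python
from urllib.parse import urlsplit

INLINE_PROXY = "http://www.inlineproxy.virginia.edu"


def to_inline_url(target_url: str) -> str:
    """Wrap a target URL so that the request goes to the inline proxy first."""
    parts = urlsplit(target_url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{INLINE_PROXY}/@{parts.netloc}{path}"


def from_inline_path(request_path: str) -> str:
    """Recover the target URL from the path portion the inline proxy receives."""
    if not request_path.startswith("/@"):
        raise ValueError("not an inline proxy path")
    host, _, rest = request_path[2:].partition("/")
    if not host:
        raise ValueError("missing target host")
    return f"http://{host}/{rest}"
```

The user only ever needs the wrapped URL, typically from a link on a library web page, which is why no browser configuration is required.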

Our initial experiment in providing remote access was the creation of the "man-in-the-middle" inline proxy. However, Apache, with the proxy and rewriting modules, can be configured as an inline proxy. The Squid Object Cache has documentation on how it may be configured as a transparent proxy. It also has an "http-accelerator" mode that may allow it to be used as an inline proxy.

Analysis

The analysis that has been done so far is based upon at least six months of experience running the "man-in-the-middle" (mIm) proxy in production, and preliminary experiments with Apache and Squid. Indeed, the reason for an overall assessment of different proxying solutions was engendered by the problems we saw in the mIm solution as time went by.

First, we need to rule out the use of a transparent proxy to do the type of remote user access that we need. Although transparent proxies may be an effective method in some firewall configurations, they would be effective only when we had control over either the initial router or gateway the user was connected to, or the router that connected to the licensed resource. Since the user's router is owned by the ISP, and the licensed resource is sitting behind the publisher's router, we have control over neither.

We are left with either implementing an inline proxy, or a conventional manual or automatic proxy. All three of these solutions satisfy most of the requirements, although some products may call for more customization than others for handling our site-specific authentication/authorization methods. When we first began the project, we focused on the essential difference between the conventional and inline proxies: conventional proxies require the user to configure their own browser, inline proxies do not.

This is not a trivial issue. We have seen, time and time again, that no matter how detailed the documentation, or "easy" the operation, requiring users to configure applications on their own workstations dramatically increases our support load. Not only do we have to create and debug documentation, we need to train our help-desk, and prepare for an increase in the number of calls that will be coming in. If we have a solution that would not incur such a support burden, it would be the first one we try.

Inline Proxy Shortcomings

And so, we wrote our "man-in-the-middle" inline proxy, using Perl⁴, the lib-www perl5 modules⁵, and the Open Group's (now unavailable) Strand software⁶. We named it "man-in-the-middle" ("mIm" for short) after a security paper that came out of Princeton warning that inline proxies can be used as a form of hacking attack⁷. They certainly can be used as "anonymizers": proxies that strip out any specific user or workstation information from the http headers before accessing a web server.

Our mIm was put into production in January of 1998. It front-ended a few different resources: Encyclopedia Britannica, GaleNet, and MathSciNet, among others. And it worked reasonably well. Remote users were able to gain access to web resources that UVa licensed.

The problems we started to have with the mIm proxy fall into three categories.

Rewriting

For an inline proxy to work, the html that comes back from the target web server has to be parsed and edited, so that all anchors that call for additional information from the target web resource are modified to first pass through the inline proxy. To do this the proxy needs an understanding of html, and of where these anchors may be.

For mIm, we used the lib-www-perl5 modules for our html processing. These modules have a strict understanding of html structure, based upon rfc standards⁸. However, we found many web sites producing "broken" html that the major browsers, Communicator and Internet Explorer, accept without a problem. We had to keep adding exceptions to our parsing and editing routines for most of the web resources we tried to front-end.
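To make the rewriting step concrete, here is a Python sketch that edits link attributes with a regular expression instead of a strict parser. This is a deliberately different technique from the one mIm used, chosen because a pattern match does not care whether the surrounding html is well formed; it only handles quoted href, src, and action values, and the function name is hypothetical.

```python
import re
from urllib.parse import urljoin, urlsplit

PROXY_PREFIX = "http://www.inlineproxy.virginia.edu/@"

# Matches href="...", src='...' or action="..." in any letter case.
LINK_ATTR = re.compile(
    r'''(?P<attr>\b(?:href|src|action)\s*=\s*)(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE | re.DOTALL,
)


def rewrite_links(html: str, page_url: str, protected_host: str) -> str:
    """Point every link to the protected host back through the inline proxy."""
    def replace(match):
        # Resolve relative references against the page they appeared on.
        absolute = urljoin(page_url, match.group("url"))
        parts = urlsplit(absolute)
        if parts.scheme != "http" or parts.hostname != protected_host:
            return match.group(0)  # leave links to other sites untouched
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        if parts.fragment:
            path += "#" + parts.fragment
        quote = match.group("q")
        return f"{match.group('attr')}{quote}{PROXY_PREFIX}{parts.netloc}{path}{quote}"

    return LINK_ATTR.sub(replace, html)
```

Even this tolerant approach needs exceptions of its own, for instance for unquoted attribute values, which is the same maintenance burden in a different place.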

Cookies are a mechanism by which web server based programs can maintain state. This is done by having a user's browser store information that is specific to an application. Some regard cookies as intrusive and a possible security risk, because the state information is written to the user's hard disk. Some even use inline "anonymizer" proxies to strip cookies away.

We discovered that some licensed web sites depend upon cookies. And, while it is simple to strip cookies out with an inline proxy, it is a little more complicated to support them. Design and programming time would need to be devoted to this problem.
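One part of supporting cookies is that a cookie set by the target server is scoped to that server's domain and paths, which the browser never visits when it goes through an inline proxy. The Python sketch below shows one possible approach, offered as an assumption rather than a design we have implemented: drop the Domain attribute and move the Path under the "/@host" prefix used in the inline proxy URL. The function name is hypothetical.

```python
from http.cookies import SimpleCookie


def rescope_set_cookie(header_value: str, target_host: str):
    """Rewrite a Set-Cookie value so the browser returns it to the inline proxy."""
    jar = SimpleCookie()
    jar.load(header_value)
    rewritten = []
    for morsel in jar.values():
        original_path = morsel["path"] or "/"
        # Without a Domain attribute the browser scopes the cookie to the
        # host that delivered it, which here is the proxy.
        morsel["domain"] = ""
        # Keep cookies from different target servers apart by path.
        morsel["path"] = f"/@{target_host}{original_path}"
        rewritten.append(morsel.OutputString())
    return rewritten
```

On later requests the browser sends the cookie to the proxy, which can forward it to the target server unchanged. Cookies created by script in the page would not be covered by this approach.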

Initially, we designed our mIm to simply be an http proxy, not considering the need to handle other protocols. But, soon we found that some licensed sites enabled users to FTP download the results of their searches, which in some cases could be quite large.

FTP proxying has some additional challenges. Presently, when mIm receives a web page, it brings the contents of that web page into a buffer, parses and edits it, and then delivers it to the browser. The user will notice a lag in response time, but not a debilitating lag. If the same buffering technique is used with a large ftp download, the lag will be much more noticeable, not to mention taxing the proxy server's resources. The only remedy would be to implement a streaming buffering method for ftp downloads. There will still be a response time lag, but a much shorter one.
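The difference between the two buffering strategies is small in code. The Python sketch below relays data in fixed-size chunks, so the browser begins receiving bytes after the first chunk and the proxy never holds more than one chunk in memory; the function name and the 64 KB chunk size are arbitrary choices for the example.

```python
import io


def relay_stream(source, sink, chunk_size: int = 64 * 1024) -> int:
    """Copy from source to sink one chunk at a time; return the bytes relayed."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break  # end of the download
        sink.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # In-memory stand-ins for the connections to the ftp server and the browser.
    download = io.BytesIO(b"x" * 200_000)
    browser = io.BytesIO()
    relayed = relay_stream(download, browser)
    assert browser.getvalue() == download.getvalue() and relayed == 200_000
```

Streaming only works for content that does not need to be parsed and edited as a whole, which is true of downloaded search results but not of the html pages themselves.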

Dynamic URLs

Probably the greatest potential showstopper to the inline proxy solution is the increasing presence of dynamically generated URLs. These are URLs that are embedded in conditional logic that is wrapped in JavaScript, or in a Java applet. The script or applet generates a specific URL based upon browser-side testing, such as the brand of browser, or data that a user enters in an applet or on a form.

It is problematic enough to parse and edit the (often nonstandard) html that comes from some web sites. It is almost impossible to even guess the nature of dynamically generated URLs. For perhaps this reason alone, the hope of using inline proxies has dwindled.
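A small, made-up page shows why. In the Python sketch below, a static scan of the html finds the ordinary anchor, but the URL the browser will actually request is assembled by script at run time and never appears in the page text, so there is nothing for the proxy to rewrite. The page content and function name are invented for the illustration.

```python
import re

HREF = re.compile(r'href\s*=\s*"([^"]*)"', re.IGNORECASE)

# Invented example page: one static link and one URL built by script.
PAGE = """<a href="/index.html">Index</a>
<script>
var base = "http://www.eb.com/";
var page = (navigator.appName == "Netscape") ? "ns" : "ie";
document.location = base + page + "/search.html";
</script>"""


def statically_visible_urls(html: str):
    """Return the URLs a proxy can find by inspecting the html text alone."""
    return HREF.findall(html)


if __name__ == "__main__":
    print(statically_visible_urls(PAGE))
    # The request the browser will make is not present anywhere in the page.
    print("http://www.eb.com/ns/search.html" in PAGE)
```

Rewriting the string literal "http://www.eb.com/" inside the script might rescue this particular page, but only by guessing at what the script intends to do with it.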

Advantages and Disadvantages of Conventional Proxies

The fundamental value of a conventional proxy solution is that, in exchange for the pain and effort of having users configure their browsers, it can effectively handle the above problems.

Once a browser is configured with a conventional proxy, it knows how to rewrite URLs automatically. By using a centrally managed automatic proxy solution, we will also have some control over which URLs will go through the proxy and which will not.
Conventional proxies already have support for cookies.
Both Apache and Squid enable ftp proxying as well as http proxying, as long as the ftp session is executed by the configured browser.
Any dynamic URL will still go through the proxy.

However there are some disadvantages to conventional proxies.

As we have mentioned, we need to support users configuring their own browsers through documentation and help-desk personnel.
We need to retool our conventional proxies to use the University authentication / authorization methods.
We need to enable conventional proxies to keep state and session.

Conclusions

We are at a crossroads in our investigation on how to deliver IP-protected licensed web resources to our remote users. We have had a modicum of success with our "man-in-the-middle" inline proxy. But, as I have shown, it has proved to be inadequate for front-ending some resources.

Presently, I am investigating the efficacy of Apache and Squid as conventional proxies, not only for remote access, but also as caching proxies for the University. Both enable customization of the authentication and authorization routines. However, it is unclear whether they have an adequate way of keeping session and state. But it would be nice to kill two birds with one stone.

A third option is to combine our mIm experience with our knowledge of conventional proxies, and write a specialized proxy for remote access. If we need to go there, we will.

This leads to a final thought. The reason why we are in this situation is that there has yet to be an accepted and working standard to enable what I call "remote authentication": the ability for a site to authenticate users as members of an organization, without relying upon IP addresses. One hopes that such a standard could be possible, and implemented quickly, for this problem is not going away. As the Internet's and the web's popularity grows, more licensed material is going to appear online, and more users coming from remote sites will demand access. And, as far as I can see, an industry solution is not immediately forthcoming.

Endnotes:

¹Hunt, Craig. TCP/IP Network Administration. 1992. O'Reilly & Associates, Inc. Sebastopol, CA. Pg: 29-30.

²www.apache.org.

³squid.nlanr.net/Squid.

⁴www.perl.com

⁵www.linpro.no/lwp/

⁶www.opengroup.org

⁷www.cs.princeton.edu/sip/pub/spoofing.html

⁸http://www.pmg.lcs.mit.edu/cgi-bin/rfc/view?1866
