Whoa MAMA ...

Created by Matt Pasiewicz (EDUCAUSE) on October 17, 2008

The folks at Opera have created a new search engine, the Metadata Analysis and Mining Application (MAMA). The tool has produced some interesting stats about the structure of the web and is somewhat akin to the pet project I've been working on ... not just akin to it, it might outright put it to shame. I'd encourage you to take a peek at Brian Wilson's early findings ... and I LOVE the following note from his introduction.

What follows is not as informal and witty as a blog nor as dry and formal as a research paper - it lies somewhere in between. Those expecting rigorous academia will forgive the occasional humorous turns of phrase or moments where personal observations and experience intrude - I try to limit it to places where they seem useful or interesting. For blog junkies, this will grow into a long, multi-part saga (hopefully) worthy of a company from Scandinavia.

http://dev.opera.com/articles/view/mama/

In the meantime, I should probably share an update on my own pet project. I've taken major strides forward since my first round of playing. I've started a new project and hope to get back to it in mid-to-late November. This time, I've scanned more than 3,000 .edu addresses and captured more than 56,000 pages (more on that later). Like the MAMA project, I've captured the source, extracted information about the structure of each page (number of links, link types, tag types, word count, etc.), analyzed HTTP header data, and run each page through a validator. In addition, I've captured screenshots of those pages, measured the length of each page (in pixels), and captured GeoIP and whois data. I've also included metadata like FTE, Carnegie Class, etc. so that folks can slice and dice the data as they see fit. I can't wait to get back to it. In the meantime, here are some quick observations about what I've seen thus far ...
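For anyone curious what the "structure of the page" step might look like, here is a minimal sketch using only Python's standard library. It is not the code behind my project or MAMA; the class name PageStats, the function page_stats, and the choice to skip text inside script and style tags are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class PageStats(HTMLParser):
    """Collects tag counts, link targets, and a rough word count for one page."""

    SKIP = {"script", "style"}  # assumption: text inside these is not "words"

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.links = []
        self.words = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Count whitespace-separated tokens in visible text only.
        if not self._skip_depth:
            self.words += len(data.split())


def page_stats(html):
    """Return a small summary dict for one page's HTML source."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    absolute = [h for h in parser.links if h.startswith(("http://", "https://"))]
    return {
        "tag_counts": dict(parser.tags),
        "link_count": len(parser.links),
        "absolute_link_count": len(absolute),
        "word_count": parser.words,
    }
```

Feed it the captured source of a page and you get back a dictionary you can store alongside the other metadata. Real-world markup is messy, so treat the numbers as rough.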

Validation

It looks like a mere 183 home pages were valid, while a whopping 2,434 home pages were invalid according to the W3C Validation Service. I failed to get a response for the remaining 389 pages, and I want to dive into why during my next major round of development. The pages that failed validation averaged 94 errors each.
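To put those counts in proportion, here is a quick back-of-the-envelope calculation. The three counts come straight from the paragraph above; the variable names and the pct helper are just illustrative.

```python
# Counts reported above for home pages run through the W3C Validation Service.
valid, invalid, no_response = 183, 2434, 389

scanned = valid + invalid + no_response   # every home page attempted
validated = valid + invalid               # pages that returned a result


def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


shares = {
    "valid_of_scanned": pct(valid, scanned),
    "invalid_of_scanned": pct(invalid, scanned),
    "no_response_of_scanned": pct(no_response, scanned),
    "valid_of_validated": pct(valid, validated),
}
```

Whichever denominator you pick, valid home pages are a small minority.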

Web Servers

I was unable to capture data on 327 of the servers. Unlike the findings of the MAMA and Netcraft surveys, it looks like Microsoft has the dominant position among sites with .EDU domains, with 1,493 of the sites reporting some flavor of IIS as their server. Various versions of Apache made up a close second at 1,061 sites. This was just for home pages, not for departmental sites, etc. The rest of the servers represent a hodgepodge of different software ... my favorite is the commodore64-HTTPD/1.1. Dude, I'm serious. Very clever. It'd also be interesting to get a breakdown of the data by region and Carnegie classification. I have the data, but haven't gone there yet.
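Grouping "some flavor of IIS" and "various versions of Apache" boils down to normalizing the Server response header before counting. Here is a minimal sketch of one way to do it; the function names server_family and tally_servers and the bucketing rules are my own assumptions for illustration, not the exact logic I used.

```python
from collections import Counter


def server_family(header):
    """Collapse a raw Server header into a coarse family name."""
    if not header or not header.strip():
        return "unknown"  # no header captured for this site
    # Take the product token before any "/version" or "(comment)" part.
    tokens = header.split("/")[0].split()
    if not tokens:
        return "unknown"
    product = tokens[0].lower()
    if "iis" in product:
        return "IIS"
    if product.startswith("apache"):
        return "Apache"
    return product  # everything else keeps its own (lowercased) name


def tally_servers(headers):
    """Count sites per server family from an iterable of Server headers."""
    return Counter(server_family(h) for h in headers)
```

Keeping the long tail under its own lowercased name, rather than lumping it into "other", is what lets oddities like the Commodore 64 server surface.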

Other Characteristics

The average page height was 916 pixels. The screenshot software I used wasn't configured to run either Flash or JavaScript. I was disappointed at first, and although the software is capable of both, I decided the results were surprisingly revealing and elected to proceed. In the end, I was stunned to see how many sites didn't degrade gracefully without Flash and/or JavaScript. I'm sure there's a way to get a feel for that by parsing the actual code, but the imagery was really striking.
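As a rough idea of what "parsing the actual code" could mean, here is a hedged sketch that flags pages using script tags with no noscript element, and pages that appear to embed Flash. I haven't run this against my data; the names FallbackCheck and check_fallbacks are hypothetical, and the heuristic is crude: a missing noscript element does not prove a page breaks without JavaScript, and its presence does not prove the fallback is useful.

```python
from html.parser import HTMLParser


class FallbackCheck(HTMLParser):
    """Records whether a page uses scripts, noscript, or Flash embeds."""

    def __init__(self):
        super().__init__()
        self.uses_script = False
        self.has_noscript = False
        self.uses_flash = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.uses_script = True
        elif tag == "noscript":
            self.has_noscript = True
        elif tag in ("object", "embed", "param"):
            # Heuristic: look for the Flash MIME type or a .swf file
            # anywhere in the tag's attribute values.
            values = " ".join(v.lower() for _, v in attrs if v)
            if "shockwave-flash" in values or ".swf" in values:
                self.uses_flash = True


def check_fallbacks(html):
    """Return coarse flags describing a page's reliance on script and Flash."""
    parser = FallbackCheck()
    parser.feed(html)
    parser.close()
    return {
        "script_without_noscript": parser.uses_script and not parser.has_noscript,
        "uses_flash": parser.uses_flash,
    }
```

Flags like these would only narrow down which screenshots are worth a closer look; the pictures still tell the story better.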

More to Come

There is much more to come, but I couldn't help but chime in on the great work of the MAMA project and share some preliminary findings of my own. I can't wait to circle back to my own pet project, but with MAMA around, am I just spinning my wheels?


 
© Copyright 1999-2009 EDUCAUSE