Whoa MAMA ...

Created by Matt Pasiewicz (EDUCAUSE) on October 17, 2008
The folks at Opera have created a new search engine, the Metadata Analysis and Mining Application (MAMA). This tool has produced some interesting stats about the structure of the web and is somewhat akin to the pet project I've been working on ... not just akin to it, but it might outright put mine to shame. I'd encourage you to take a peek at Brian Wilson's early findings ... and I LOVE the following note from his introduction.
http://dev.opera.com/articles/view/mama/

In the meantime, I should probably share an update on my own pet project. I've taken major strides forward since my first round of playing. I've started a new project and hope to get back to it in mid-to-late November. This time, I've scanned more than 3,000 .edu addresses and captured more than 56,000 pages (more on that later). Like the MAMA project, I captured the source, extracted information about the structure of each page (number of links, link types, tag types, word count, etc.), analyzed HTTP header data, and ran each page through a validator. In addition, I've captured screenshots of those pages, measured the length of each page (in pixels), and captured GeoIP and whois data. I've also included metadata like FTE, Carnegie Class, etc. so that folks can slice and dice the data as they see fit. I can't wait to get back to it. In the meantime, here are some quick observations about what I've seen thus far ...

Validation

It looks like a mere 183 home pages were valid, while a whopping 2,434 home pages were invalid according to the W3C Validation Service. I failed to get a response for the remaining 389 pages, and I want to dive into why during my next major round of development. The pages that failed validation averaged 94 errors each.

Web Servers

I was unable to capture data on 327 of the servers. Unlike the findings from the MAMA and Netcraft surveys, it looks like Microsoft has the dominant position among sites with .edu domains, with 1,493 of the sites reporting some flavor of IIS as their server. Various versions of Apache made up a close second at 1,061 sites. This was just for home pages, not for departmental sites, etc. The rest of the servers represent a hodgepodge of different software ... my favorite is the commodore64-HTTPD/1.1. Dude, I'm serious. Very clever. It'd also be interesting to get a breakdown of the data by region and Carnegie classification. I have the data, but haven't gone there yet.
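For anyone curious how a tally like the one above might be produced, here is a minimal sketch of grouping raw `Server` header values into coarse families. It is not the code behind the numbers in this post: the function names `server_family` and `tally_servers` and the sample header list are hypothetical, and simple prefix matching is an assumption that real-world headers will break in places.

```python
from collections import Counter
from typing import Iterable, Optional


def server_family(header: Optional[str]) -> str:
    """Map a raw HTTP Server header value to a coarse family name.

    Hypothetical helper: the matching rules are deliberately crude.
    """
    if not header or not header.strip():
        return "unknown"  # no header captured for this site
    value = header.strip().lower()
    if value.startswith("microsoft-iis"):
        return "IIS"
    # Apache-Coyote is Tomcat's connector, so keep it out of the Apache bucket.
    if value.startswith("apache") and not value.startswith("apache-coyote"):
        return "Apache"
    return "other"


def tally_servers(headers: Iterable[Optional[str]]) -> Counter:
    """Count how many sites fall into each server family."""
    return Counter(server_family(h) for h in headers)


# Made-up sample values for illustration, not data from the actual scan.
sample_headers = [
    "Microsoft-IIS/6.0",
    "Apache/2.2.3 (Red Hat)",
    "Microsoft-IIS/5.0",
    None,
    "commodore64-HTTPD/1.1",
]
counts = tally_servers(sample_headers)

for family, n in counts.most_common():
    print(f"{family}: {n}")
```

Keeping an explicit "unknown" bucket separates sites where no header was captured from sites running something outside the two big families.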
Other Characteristics

The average page height was 916 pixels. The screenshot software I used wasn't configured to run either Flash or JavaScript. I was disappointed at first, and while the software is capable of it, I decided the results were surprisingly revealing and elected to proceed. In the end, I was stunned to see how many sites offered no graceful degradation for their use of Flash and/or JavaScript. I'm sure there's a way to get a feel for that by parsing the actual code, but the imagery was really striking.

More to Come

There is much more to come, but I couldn't help but chime in on the great work of the MAMA project and share some preliminary findings of my own. I can't wait to circle back to my own pet project, but with MAMA around, am I just spinning my wheels?
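On the idea of getting a feel for missing fallbacks by parsing the actual code, a rough sketch using only Python's standard library might look like the following. The names `TagCounter` and `fallback_report` and the sample page are hypothetical, and the heuristic (a page with `<script>` but no `<noscript>`) is an assumption that will miss plenty of real cases, such as pages that degrade fine without any `<noscript>` at all.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count every start tag seen in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.tags[tag] += 1


def fallback_report(html: str) -> dict:
    """Summarize link count and a crude script/plugin fallback signal."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    uses_script = tags["script"] > 0
    return {
        "links": tags["a"],
        "uses_script": uses_script,
        # Heuristic only: scripts present, but no <noscript> alternative.
        "script_without_noscript": uses_script and tags["noscript"] == 0,
        # <object> and <embed> are how Flash movies are usually included.
        "uses_plugin": tags["object"] + tags["embed"] > 0,
    }


# Made-up page for illustration.
page = """<html><body>
<a href="/admissions">Admissions</a>
<script>var x = 1;</script>
<embed src="intro.swf" type="application/x-shockwave-flash">
</body></html>"""

report = fallback_report(page)
print(report)
```

The same tag counter could feed the other structural stats mentioned earlier (number of links, tag types), though word counts and link types would need more handling than shown here.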
Unless otherwise noted, EDUCAUSE holds the copyright on all materials published by the association, whether in print or electronic form. In certain cases the work remains the intellectual property of the individual author(s) (see Special Circumstances). Content from conference speeches, presentations, blogs, wikis and feeds reflect the opinions of the author, and not necessarily those of EDUCAUSE or its members.