Texting in the Clouds

Created by Matt Pasiewicz (EDUCAUSE) on September 20, 2008

No, this article isn't about SMS or cloud computing ... it's about text, curiosity and tag clouds. On some level that seems a bit passé ... tag clouds again? Isn't that old news? Yes, sure, but Catherine Howell's recent commentary on my little home page experiment left me wondering what else might be gleaned from more experimentation on tag clouds as a means of visualizing bodies of text (as compared to tags about text). So, once again, I decided to invest some more time in a not so little side project. I'm not sure if it's as interesting this time around, but I thought I'd go ahead and share my results (and highlight the process that went into creating them).

First, I examined the full text of 566 articles from 30 issues of EDUCAUSE Review and 18 issues of EDUCAUSE Quarterly. I pursued a slightly different technique for each publication, but in both cases, I removed the HTML markup of each article and then ran the full text through OCLC's Tag Cloud service. I attempted to automate the process by using PHP and cURL to post the text to OCLC. I then ran OCLC's page (the results returned from cURL) through PHP's implementation of Tidy so I could get valid XHTML. From there, I used SimpleXML to load OCLC's page, extract the tag cloud portion of it, and store it in MySQL. OCLC's service is interesting because they offer a little more data than similar services. They expose the total number of matches represented in their results -- and that proved to be very helpful.
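For readers who want to try the first step locally, here is a minimal sketch in Python (not the PHP/cURL/Tidy/SimpleXML pipeline described above). It strips the markup from one article and counts the words itself, as a rough stand-in for the per-word totals the OCLC service returned. The sample article, the `TextExtractor` class and the helper names are all made up for illustration, and nothing here calls OCLC.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags (this includes script/style
        # contents, which a real pipeline would probably want to skip).
        self.chunks.append(data)


def strip_markup(html):
    """Return the text of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def word_counts(text):
    """Count lowercase words; a crude local substitute for a tag cloud service."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words)


# Hypothetical article body, not taken from either publication.
article = "<html><body><h1>Open Content</h1><p>Open data and open tools.</p></body></html>"
counts = word_counts(strip_markup(article))
print(counts.most_common(3))
```

A plain word count like this has no stop-word filtering, so words such as "and" show up alongside the interesting ones.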

Next, I looped through the tag cloud data for each article and "exploded" each of the keywords returned by OCLC into a simple space delimited structure. For instance, if OCLC said a word like "open" was mentioned three times, I'd print "open open open" ... effectively denormalizing their tag cloud data into a structure that I could submit to another service like Wordle. Why did I do that? Well, I was thinking that by reducing the payload of information I sent to Wordle, I'd decrease the time I'd need to spend waiting on it to process the full text of an article. Heh, I'm not sure if I wasted or saved time here, but it was kinda fun nonetheless.
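The "exploding" step is simple enough to sketch in a few lines. This is an illustration in Python rather than the original PHP, and it assumes the keyword data is already available as a word-to-count mapping; the `explode` name and the sample counts are made up.

```python
def explode(counts):
    """Repeat each keyword as many times as it was counted, space delimited."""
    return " ".join(
        " ".join([word] * n) for word, n in counts.items() if n > 0
    )


# Hypothetical counts for one article.
tag_cloud = {"open": 3, "source": 2, "data": 1}
print(explode(tag_cloud))  # open open open source source data
```

The result is plain text that can be pasted into any tool that builds its cloud from raw word frequency.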

Ultimately, I decided that processing a tag cloud from Wordle for each article would be too labor intensive, so I decided to create a cloud for every issue of each publication (or at least every issue where I had easy access to the full text of each article). At this stage, I looped through each article, took the combined set of keywords and then created a report for each issue. From there, I endured the still labor intensive process of sending the results to Wordle. I did this manually, and I'm glad I did. Wordle is interesting ... on some level, it is very frustrating because they don't have any open API to it and because their results don't really seem actionable once you get them. Yes, you can play with the font, color and other attributes that make up their cloud and yes, maybe their results aren't designed for data analysis, but I did encounter a bit of a eureka moment as I started working through the process.
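Rolling the per-article keywords up into one report per issue might look something like the sketch below. It is written in Python with made-up issue labels and counts, and it assumes each article's data is an (issue, word counts) pair; the real data lived in MySQL and its layout isn't described here.

```python
from collections import Counter, defaultdict


def counts_by_issue(articles):
    """Sum keyword counts across all articles belonging to the same issue.

    `articles` is an iterable of (issue_label, {word: count}) pairs.
    """
    totals = defaultdict(Counter)
    for issue, counts in articles:
        totals[issue].update(counts)  # Counter.update adds counts together
    return totals


def issue_report(counts):
    """Space delimited text for one issue, most frequent words first."""
    return " ".join(" ".join([w] * n) for w, n in counts.most_common())


# Hypothetical per-article data; the issue labels are illustrative only.
articles = [
    ("Issue A", {"open": 2, "learning": 1}),
    ("Issue A", {"open": 1, "faculty": 1}),
    ("Issue B", {"students": 2}),
]

for issue, counts in counts_by_issue(articles).items():
    print(issue, "->", issue_report(counts))
```

Each line of output is the kind of text that would then be pasted into Wordle by hand, one issue at a time.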

A Wordle tag cloud doesn't let you link to anything once you get the cloud (thus the actionable part), but it does provide some limited interaction with the data you submitted ... it lets you remove words from the cloud. Why is this interesting? My first thought was yuck! Why in the world would they allow that ... it effectively allows you to skew the results. Nonetheless, it turned out to be convenient ... I started the project by working with data from EQ and occasionally removing some weird noise from smaller tags like, I dunno, think "author" or something like that. As I was wrapping up my work on EQ, I realized that this data is only so interesting if you're from an audience familiar with the publication and its content ... words like teaching, learning, educause, university, students and faculty dominated the tag cloud as you would expect. That would be an important source of information if you were sifting through a larger hodgepodge of data, but seemed kinda boring if you were familiar with the series.

When I started processing the clouds for EDUCAUSE Review, I took a slightly different approach ... I started unscientifically removing some of the more dominant tags so that I could increase the resolution of words that might prove more revealing (again, assuming you already know this publication is about topics stemming from the use of technology in education). I think the results are interesting, but if I had it to do again, I would probably have used the data from OCLC to get empirical data about the most frequently used words and then removed the top x percent of them in a more automated/systematic fashion.
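The "remove the top x percent" idea could be automated roughly as follows. This is only a sketch in Python, under the assumption that "top x percent" means the most frequent x percent of distinct words; the function name, the sample counts and the choice of 40 percent are all illustrative.

```python
import math
from collections import Counter


def drop_top_percent(counts, percent):
    """Remove the most frequent `percent` of distinct words from `counts`.

    The cutoff is rounded up, so any percent above zero removes at least
    one word from a non-empty set of counts.
    """
    ranked = [word for word, _ in Counter(counts).most_common()]
    cutoff = math.ceil(len(ranked) * percent / 100)
    dominant = set(ranked[:cutoff])
    return {word: n for word, n in counts.items() if word not in dominant}


# Hypothetical counts: two dominant words drown out the rest.
counts = {"learning": 50, "students": 40, "wiki": 6, "podcast": 4, "privacy": 3}
print(drop_top_percent(counts, 40))
# {'wiki': 6, 'podcast': 4, 'privacy': 3}
```

Words that tie at the cutoff are split according to the order in which they were first seen, so a more careful version might drop every word at or above a frequency threshold instead.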

In any event, I think the results are fairly intriguing and might speak to the different contexts that one might use when thinking about information visualization. In my case, the broad sweeping trends were noise and I wanted to dive a layer deeper. It reminds me a bit of macro vs. micro economics ... and I wonder if there are any principles that one could glean from it, and whether one could anticipate and inform user interaction given the costs associated with information creation and consumption. Eh, I guess I'm getting a little off topic here ... apologies if this is a little rangy, but hey, it is Saturday after all ;)

Was this interesting? I dunno, but it definitely makes me think about the future of not only text mining and analysis, but also the importance of extending their reach and visualizing them where possible. I've experimented with the Yahoo Term Extraction API in the past, but it is only so interesting (or was at the time). Services like ManyEyes and OpenCalais are looking promising as well and I hope to find some time to investigate them further. I've come away from this experience with a renewed belief that semantic markup is very important and look forward to the growth of both microformats and the semantic web. The years ahead should be very interesting! Heh, would I do it again? Probably not ... or at least not until more of this can be automated. That said, I hope you've found this interesting and look forward to any thoughts.

Oh, gheeze ... if you endured all that, you deserve a link to it too ;)
http://www.educause.edu/educause/tag_sample/index.php?START=ER

Thanks!

Matt

 

Submitted by Matt Pasiewicz (EDUCAUSE) on September 20, 2008 - 4:53pm.

Heh, a reminder for me to read (and also sharing w/ others via comment)
http://www.research.ibm.com/visual/papers/vernacular_visualization.pdf

Submitted by Kevin Guidry (Indiana University) on September 23, 2008 - 3:22pm.

It seems that, like statistical packages (SPSS, SAS, etc.), tools like Wordle that automate data analysis and visualization open up the danger that persons unfamiliar with the methods used by the tools and the limitations of those methods can easily lead themselves and others astray.  Just because one can throw data at a tool and get output doesn't mean that (a) one should do so or (b) one can understand and interpret the results.  I'm certainly not railing against automated tools (abaci and stone tablets for everyone!).  Nor am I in any way faulting what you've done here or accusing you in particular of anything; I think that what you're doing is very cool and admirable!  I'm just sounding a general note of caution.  When something, including data analysis, seems too easy then it probably *is* too easy.

Submitted by Matt Pasiewicz (EDUCAUSE) on September 23, 2008 - 7:56pm.

I appreciate the sentiment and to some degree even share it. Oh, and no worries about my feeling threatened by your note ... I was kinda hoping someone would hop on that bandwagon. It is really quite refreshing. That said, I hope I didn't misrepresent my play or my fondness for tools like Wordle as some type of an academic enterprise or advocacy for one technique over another ... mostly just a little tinkering, kicking the tires and sharing the experience. Literally just a weekend project. I tried to be careful there.

With regard to what folks could be doing with Wordle in particular, I think some might characterize it as somewhat akin to a common poll ... they're similarly opaque and statistically dubious ... perhaps a stimulus for dialog and debate ... perhaps something that just hints at some undercurrent of activity ... or perhaps something that is just fun and entertaining. Is that good or bad? I'm not sure, but on some level, I want to believe that getting folks talking about (and interacting with) data has some intangible, indistinct benefit ... even if it just raises intellectual curiosity and awareness in ways that might not have otherwise existed. Perhaps I'm just naive or overly optimistic, but I guess I want to believe that putting more tools in the hands of more people could be one of the most important things that we could do to raise the bar about awareness of suspect data and about the pros and cons of various sources. I guess I'm also reminded of the open source, open data and open content phenomena and of the pros and cons there.

I'd be interested in hearing more commentary on the lifecycle and "social life of information" and what it will mean for society as more tools, more data and more people begin interacting in new and novel ways and on a scale that was never possible before. The whole visual literacy/net-savvy construct also comes to mind, I guess, but that just doesn't seem to do it justice. All fascinating topics ... Oh, that I had more time!

Submitted by Catherine Howell (La Trobe University) on September 26, 2008 - 9:47am.

The opacity of a (tool? toy?) like Wordle is arguably part of its charm... I really see Wordle primarily as a stimulus to conversation. If people find Wordle data "pictures" entertaining -- clearly, many do -- then I see that as a valuable bonus. :-)

All researchers have a responsibility in the way they "frame" and share representations of data. A statistical chart is a representation, quite as much as a Wordle "picture". Note that I'm not suggesting a chart and a Wordle image are *equivalent* representations -- they are doing different things with information. But no presentational convention is automatically transparent or meaningful; we have to learn how to read them...  Some conventions are obviously able to present data much more densely and richly than Wordle. But there are visual interfaces to data out there which are much, much richer and more "layered". The recent explosion of semantic applications is pointing to new ways of exploring data, visually. The crossover between visual literacy and information literacy is growing - in interesting ways.


 
© Copyright 1999-2009 EDUCAUSE