I am trying to gauge the effort needed to post-process Google Webmaster Tools downloads, specifically the links report, so that we can keep tabs on who is linking to us. It can serve as an alert for backlink and form spam. I don't know whether the 100K-link limit will be an issue. Has anyone already written something like a script that takes a list of links, retrieves each page with curl, and pulls out the line that mentions our university (and maybe the lines before and after)? I would also like to track the results so I can see which links are new this week, and to check whether I have already analyzed a given link. Some perl and curl plus a flat file would probably work; I just don't want to reinvent the wheel. Or is some of this better handled by a private space crawler that could catalog outbound links as well?
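For anyone sizing up the same job, here is a minimal sketch of the workflow described above, written in Python rather than perl and curl purely for illustration. The search term `Example University`, the flat file name `seen_links.txt`, and all function names are hypothetical placeholders, not part of any existing tool. It assumes the link list has already been exported as one URL per line.

```python
import urllib.request
from pathlib import Path

TERM = "Example University"          # hypothetical: the name to search for
SEEN_FILE = Path("seen_links.txt")   # hypothetical flat file of analyzed links


def load_seen(path):
    """Read the flat file of already-analyzed links into a set."""
    if not path.exists():
        return set()
    lines = path.read_text(encoding="utf-8").splitlines()
    return {line.strip() for line in lines if line.strip()}


def new_links(links, seen):
    """Return links not analyzed before, in order, without duplicates."""
    known = set(seen)
    fresh = []
    for link in links:
        link = link.strip()
        if link and link not in known:
            known.add(link)
            fresh.append(link)
    return fresh


def context_lines(text, term, before=1, after=1):
    """Return each matching line with its neighbouring lines."""
    lines = text.splitlines()
    needle = term.lower()
    hits = []
    for i, line in enumerate(lines):
        if needle in line.lower():
            hits.append(lines[max(0, i - before):i + after + 1])
    return hits


def fetch(url, timeout=15):
    """Retrieve a page body as text (the role curl plays in the post)."""
    req = urllib.request.Request(url, headers={"User-Agent": "link-audit-sketch/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def audit(links, seen_file=SEEN_FILE, term=TERM, fetcher=fetch):
    """Fetch only the new links, report matches, and record them as seen."""
    report = {}
    for url in new_links(links, load_seen(seen_file)):
        try:
            report[url] = context_lines(fetcher(url), term)
        except (OSError, ValueError) as exc:   # network failure or bad URL
            report[url] = [[f"ERROR: {exc}"]]
        with seen_file.open("a", encoding="utf-8") as fh:
            fh.write(url + "\n")
    return report
```

Running `audit()` on each week's export would return only the links that were not in the flat file from earlier runs, which covers both the "new this week" and the "already analyzed" questions. Matching is done per line of raw HTML, so a page served as one long line will come back as one long hit.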


Thanks for your suggestion and your question. I think a little more background is in order.

What we have found seems to be part of general black hat search engine optimization. A number of other universities have detected it, but I don't think it is confined to universities. We set up Google Alerts to check for new web pages that have drug names associated with them. We don't have a medical school, so this works with a fairly simple configuration. We picked up a lot of hits in unmoderated blog comments and unauthenticated wiki entries. We forwarded them to the owners, which led to blogs being moderated more closely and to wikis requiring authentication or queuing entries for review.

The perpetrators' response was to be more subtle. Since it is Google/search engine ranking they are after, they began hacking websites. If they succeed, they insert code that says: if the User Agent string is googlebot, for instance, or if Google is in the referrer string, then execute the redirect to the cheap drugs website; if not, display the page normally. That makes it more difficult to check.

So I got the web development team to add me as a "webmaster" for Google. You can be set up as a webmaster if you apply and then place some code that Google supplies in your webpage. Then you have a view, from the "webmaster tools", of some of Google's data. One of the pieces of data is the list of sites that link to your website. In that large list we have found numeric IPs, and IPs from countries and companies that we didn't expect. If we could curl the webpage and then evaluate the link, we could possibly block the URL as a known weak website. It is also good to have information of that sort to see when new links pop up.

I hope that this is clearer. I have included the security list as well, since I believe my last note was a bit too terse to be clear.

One related question that I thought of as I was writing the above: what scanning tools do you use for new webpages? What about for re-authorization?

Thanks,
Jim Moore
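
The user-agent and referrer trick described above can be probed, at least roughly, by requesting the same page under different identities and comparing where each request ends up. The sketch below is only an illustration under stated assumptions. The function names are hypothetical, and the user-agent strings are examples. It only notices HTTP-level redirects, so a redirect done in JavaScript would be missed, and so would a hack that checks whether the visitor is really coming from a Google address. It also flags links whose host is a bare numeric IP.

```python
import ipaddress
import urllib.request
from urllib.parse import urlsplit

# Example identities; both strings are assumptions chosen for illustration.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) link-audit-sketch/0.1"
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def is_numeric_host(url):
    """True if the URL's host is a literal IPv4 or IPv6 address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def fetch_as(url, user_agent, referer=None, timeout=15):
    """Fetch a page with a chosen identity; return (final URL, body text)."""
    headers = {"User-Agent": user_agent}
    if referer:
        headers["Referer"] = referer
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.geturl(), resp.read().decode("utf-8", errors="replace")


def final_host(url):
    """Host name of the URL a request finally landed on."""
    return urlsplit(url).hostname


def looks_cloaked(url, fetcher=fetch_as):
    """Compare the landing host for a browser, a crawler, and a search click."""
    baseline = final_host(fetcher(url, BROWSER_UA)[0])
    as_bot = final_host(fetcher(url, GOOGLEBOT_UA)[0])
    via_search = final_host(fetcher(url, BROWSER_UA, "https://www.google.com/")[0])
    return as_bot != baseline or via_search != baseline
```

A `True` result from `looks_cloaked()` is a reason to look at the page by hand, not proof of a compromise, since some legitimate sites also redirect differently depending on the visitor.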