
We are investigating deploying a searchable directory on our web site. By that, I mean a web form where you can enter a first and last name and get names, emails, phone numbers, and limited other information on people in our campus community.

 

One concern is how to stop email harvesters. By that, I mean people who mine this campus directory to build expansive lists of our community’s contact info.

 

Five techniques I can think of are:

1.       IP-based throttling, but this might block legitimate users who are behind shared IPs due to NAT.

2.       Assigning some kind of identity to browsers, like setting a cookie or other techniques. The problem is it’s super easy to clear out the browser’s memory of a given session, making it appear like a brand new session to the server.

3.       CAPTCHA, but this has usability and accessibility concerns, plus I keep reading about ways CAPTCHAs get defeated.

4.       Server-based intelligence that spots harvesting patterns, but the “intelligence” is the hard part to build and maintain.

5.       Requiring exact last name matches and at least one character from the first name, but this wouldn’t stop someone with an expansive dictionary of common last names from abusing this service.
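
As a rough illustration of technique 1, a per-IP sliding-window throttle could look something like the Python sketch below. The class name, the default limit of 30 searches per 60 seconds, and the injectable clock are all assumptions made for the example and not something from an existing deployment. A generous limit is one way to soften the shared-IP (NAT) problem, at the cost of letting a slow harvester through.

```python
import time
from collections import defaultdict, deque


class SlidingWindowThrottle:
    """Allow at most `limit` searches per client key within `window` seconds.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """

    def __init__(self, limit=30, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behaviour can be tested
        self._hits = defaultdict(deque)  # client key -> timestamps of recent searches

    def allow(self, client_key):
        """Return True and record the search, or False if the key is over its limit."""
        now = self.clock()
        hits = self._hits[client_key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

The client key would typically be the remote IP address, but nothing in the sketch depends on that; a session identifier or a combination of the two could be used instead.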

 

Have you dealt with this, and what did you do?

 

Aren Cambre, '99, '03
Team Lead, Web Technologies Team
Office of Information Technology
Southern Methodist University

 

********** Participation and subscription information for this EDUCAUSE Constituent Group discussion list can be found at http://www.educause.edu/groups/.


Comments

Just a follow-on question…is anyone having to address privacy concerns around use of their campus directory? If so, what controls are you putting in place?

 

Regards

Lonnie Smetana

Manager, Web & Mobile Solutions

University of Manitoba

 

 

 

We decided long ago to stop chasing a moving target. We provide all of the data (phone, email, etc) in our directory, and we don't worry about spam. We already have spam filters to handle that. Besides, most tactics that you may employ just make it more inconvenient for your typical visitors.


Message from troiani@rowan.edu

While these other options certainly help, we've been able to quell most of the harvesting by doing the following:

Output encoded email addresses onto the page (you can use base64, which is nice and easy to decode) and then use a wee bit of JavaScript to decode the info back into a human-readable format. Most bot harvesters aren't going to run the JS.

For accessibility's sake, you can detect the user agent and provide the real address to readers that might have a hard time with JavaScript (at the risk of losing a bit of obfuscation).
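
A minimal sketch of the approach described above, written in Python for the server side. The function names, the `data-email` attribute, the placeholder text, and the inline script are assumptions for illustration, not the code actually running at Rowan. The sketch assumes ASCII-only addresses, since the browser's `atob` returns a byte string rather than decoded UTF-8.

```python
import base64
import html


def obfuscated_email_span(address):
    """Return an HTML span that carries the address as base64 in a data attribute."""
    encoded = base64.b64encode(address.encode("utf-8")).decode("ascii")
    return '<span class="email" data-email="{}">[email hidden]</span>'.format(
        html.escape(encoded, quote=True)
    )


# Hypothetical client-side snippet served with the page. It uses the browser's
# built-in atob() to turn each data attribute back into readable text.
DECODE_SCRIPT = """
<script>
document.querySelectorAll('span.email').forEach(function (el) {
  el.textContent = atob(el.dataset.email);
});
</script>
"""


def decode_like_a_harvester(encoded):
    """Everything a scraper needs once it knows the page uses base64."""
    return base64.b64decode(encoded).decode("utf-8")
```

The last function is there to show the limit of the technique: base64 is an encoding, not encryption or hashing, so it only deters harvesters that neither run scripts nor look for encoded attributes.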

Frank Troiani
Associate Director, University Web Services
Division of Strategic Enrollment Management
Rowan University, Glassboro, NJ


On Feb 18, 2014, at 5:55 PM, "Scott Crevier" <scott.crevier@SNC.EDU> wrote:

We decided long ago to stop chasing a moving target. We provide all of the data (phone, email, etc) in our directory, and we don't worry about spam. We already have spam filters to handle that. Besides, most tactics that you may employ just make it more inconvenient for your typical visitors.


Given the kind of user who is motivated to harvest information using a dictionary attack on a web form, couldn't he probably figure out base64 encoding?

 

From: The EDUCAUSE Web Administrators Constituent Group Listserv [mailto:WEB@LISTSERV.EDUCAUSE.EDU] On Behalf Of Troiani, Francis J.
Sent: Tuesday, February 18, 2014 8:08 PM
To: WEB@LISTSERV.EDUCAUSE.EDU
Subject: Re: [WEB] Preventing misuse of campus directory

 

While these other options certainly help, we've been able to quell most of the harvesting by doing the following:

Output encoded email addresses onto the page (you can use base64, which is nice and easy to decode) and then use a wee bit of JavaScript to decode the info back into a human-readable format. Most bot harvesters aren't going to run the JS.

For accessibility's sake, you can detect the user agent and provide the real address to readers that might have a hard time with JavaScript (at the risk of losing a bit of obfuscation).

Frank Troiani

Associate Director, University Web Services

Division of Strategic Enrollment Management

Rowan University, Glassboro, NJ

 

 


 

I see you’re in Canada, but in America, we have a law called FERPA that generally permits students to force the university not to disclose their identities to third parties. If the student requests this under FERPA, we do not show his info in the student directory and other places.

 

We have a similar mechanism for faculty and staff, but I don’t think it’s generally required by law for them.

 

Aren

 

Message from troiani@rowan.edu

Absolutely, but if there were an actual person involved he could just read the email addresses and copy and paste them. If we used a custom algorithm, the person could easily decode that as well, since the code to do so is right there in the JavaScript (it has to be so the browser can decode them!).

This method is designed to stop bots, which are the main problem when it comes to harvesting emails.

Frank Troiani
Associate Director, University Web Services
Division of Strategic Enrollment Management

Rowan University
201 Mullica Hill Road
Glassboro, NJ 08028

856.256.4410


On Feb 19, 2014, at 5:37 PM, Cambre, Aren <acambre@MAIL.SMU.EDU> wrote:

Given the kind of user who is motivated to harvest information using a dictionary attack on a web form, couldn't he probably figure out base64 encoding?
 

 

I agree that bots are a problem for general email address harvesting, such as scooping up mailto links, but are bots really behind hammering online campus directories? It seems like a human generally needs to be involved there.

 

Aren

 

From: The EDUCAUSE Web Administrators Constituent Group Listserv [mailto:WEB@LISTSERV.EDUCAUSE.EDU] On Behalf Of Troiani, Francis J.
Sent: Thursday, February 20, 2014 8:33 AM
To: WEB@LISTSERV.EDUCAUSE.EDU
Subject: Re: [WEB] Preventing misuse of campus directory

 

Absolutely, but if there were an actual person involved he could just read the email addresses and copy and paste them. If we used a custom algorithm, the person could easily decode that as well, since the code to do so is right there in the JavaScript (it has to be so the browser can decode them!).

This method is designed to stop bots, which are the main problem when it comes to harvesting emails.

 

Frank Troiani

Associate Director, University Web Services

Division of Strategic Enrollment Management

Rowan University

201 Mullica Hill Road

Glassboro, NJ 08028


856.256.4410

 


 

Message from troiani@rowan.edu

Bots can and will hammer any page containing an @.

Another option is to exclude your directory from crawling in your robots.txt file. This is a double-edged sword, though, because you would also be keeping actual users from finding directory results through search engines. Of course, you could specifically allow legitimate search engines like Google, but then the bots can search using Google and get to your directory anyway, so this isn't really effective in my view.
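
To make the robots.txt option concrete, the Python sketch below parses a hypothetical rule set with the standard library's `urllib.robotparser` and checks what a compliant crawler would conclude. The `/directory/` path, the `example.edu` host, and the "SomeHarvester" agent name are assumptions for the example. Keep in mind that robots.txt is purely advisory: a well-behaved crawler runs a check like this before fetching, while a harvester can simply skip it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: let Googlebot index the directory, ask everyone else not to.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /directory/

User-agent: *
Disallow: /directory/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(user_agent, url):
    """Return what a crawler that honours robots.txt would decide for this URL."""
    return parser.can_fetch(user_agent, url)
```

With these rules, `may_crawl("Googlebot", ...)` permits the directory search URL and any other agent is asked to stay out of `/directory/`, which is exactly the arrangement that leaves the indirect route through search results open.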

Where a person might be involved in harvesting is if you set up your directory so it is not "browseable", for instance with no way to click on "A" and receive a list of the people with last names starting with "A". This would then require the bot to submit a name to search, but I'm not sure how well bots would do with this. This is where a devious person could go through, enter "A", and save out a bunch of listings.

A third option, of course, is to include a CAPTCHA, but that's just plain mean and wouldn't stop the last scenario.

Frank Troiani
Associate Director, University Web Services
Division of Strategic Enrollment Management

Rowan University
201 Mullica Hill Road
Glassboro, NJ 08028

856.256.4410


On Feb 20, 2014, at 10:14 AM, Cambre, Aren <acambre@MAIL.SMU.EDU> wrote:

I agree that bots are a problem for general email address harvesting, such as scooping up mailto links, but are bots really behind hammering online campus directories? It seems like a human generally needs to be involved there.
 
Aren
 

 

I probably wasn’t clear. This is a form where the user types in a complete last name and at least a character of the first name, hits submit, then gets results, so there’s not a page that by default is presenting a browseable campus directory.
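
A minimal Python sketch of the search rule just described: an exact last name plus at least one leading character of the first name. The function names, the record layout, the wildcard check, and the `max_results` cap are assumptions added for illustration and are not part of the form as described.

```python
def validate_search(last_name, first_name):
    """Return cleaned (last, first_prefix) or raise ValueError for an incomplete query."""
    last = last_name.strip()
    first = first_name.strip()
    if not last or "*" in last or "%" in last:
        raise ValueError("a complete last name is required (no wildcards)")
    if not first:
        raise ValueError("at least one character of the first name is required")
    return last, first


def search(directory, last_name, first_name, max_results=10):
    """Exact, case-insensitive last-name match plus a first-name prefix match."""
    last, first = validate_search(last_name, first_name)
    matches = [
        person
        for person in directory
        if person["last"].casefold() == last.casefold()
        and person["first"].casefold().startswith(first.casefold())
    ]
    # Capping the result size (an assumption) limits what a single query can yield.
    return matches[:max_results]
```

As noted earlier in the thread, this narrows each query but does not stop someone who iterates over a dictionary of common last names, so it works best alongside some form of throttling.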

 

Aren

 

From: The EDUCAUSE Web Administrators Constituent Group Listserv [mailto:WEB@LISTSERV.EDUCAUSE.EDU] On Behalf Of Troiani, Francis J.
Sent: Thursday, February 20, 2014 9:39 AM
To: WEB@LISTSERV.EDUCAUSE.EDU
Subject: Re: [WEB] Preventing misuse of campus directory

 

Bots can and will hammer any page containing an @.

Another option is to exclude your directory from crawling in your robots.txt file. This is a double-edged sword, though, because you would also be keeping actual users from finding directory results through search engines. Of course, you could specifically allow legitimate search engines like Google, but then the bots can search using Google and get to your directory anyway, so this isn't really effective in my view.

Where a person might be involved in harvesting is if you set up your directory so it is not "browseable", for instance with no way to click on "A" and receive a list of the people with last names starting with "A". This would then require the bot to submit a name to search, but I'm not sure how well bots would do with this. This is where a devious person could go through, enter "A", and save out a bunch of listings.

A third option, of course, is to include a CAPTCHA, but that's just plain mean and wouldn't stop the last scenario.

 

Frank Troiani

Associate Director, University Web Services

Division of Strategic Enrollment Management

Rowan University

201 Mullica Hill Road

Glassboro, NJ 08028


856.256.4410

 


 

CasperJS or PhantomJS are scriptable headless WebKit engines (full web browsers) that spam harvesters regularly use to bypass this type of protection. Any data you present in a browser can be scraped by a bot, even if you're using JavaScript obfuscation techniques.

I'll +1 Scott Crevier:

> We decided long ago to stop chasing a moving target. We provide all of the data (phone, email, etc) in our directory, and we don't worry about spam. We already have spam filters to handle that. Besides, most tactics that you may employ just make it more inconvenient for your typical visitors.

Justin C. Klein Keane, MA MCIT
Security Engineer
University of Pennsylvania, School of Arts & Sciences

The digital signature on this message can be verified using the key at https://sites.sas.upenn.edu/kleinkeane/pages/pgp-key