Identity Finder at The University of Pennsylvania

Last reviewed: August 2017

Case Study


The University of Pennsylvania enacted a comprehensive Social Security number (SSN) policy in May 2007. The stated purpose of the policy was to protect Social Security numbers by eliminating them, converting them to University-specific Penn ID numbers, truncating them to the last four digits, or enforcing strict controls, such as encryption, on the storage of necessary Social Security numbers.
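
As an illustration of the truncation option named in the policy, the following is a minimal Python sketch that masks SSN-like numbers so that only the last four digits remain. The pattern, the mask format, and the function name `truncate_ssns` are assumptions made for this example; they are not taken from the policy or from any tool discussed in this case study.

```python
import re

# Assumed formats: 123-45-6789 or 123456789. Real data may use other separators.
SSN_PATTERN = re.compile(r"\b(\d{3})-?(\d{2})-?(\d{4})\b")


def truncate_ssns(text: str) -> str:
    """Replace each SSN-like number with a mask that keeps the last four digits."""
    return SSN_PATTERN.sub(lambda match: "XXX-XX-" + match.group(3), text)


if __name__ == "__main__":
    print(truncate_ssns("Employee SSN 123-45-6789 on file"))
```

A pattern this simple will also match nine-digit numbers that are not SSNs, which is one reason false positives figure in the product evaluation described later.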

The adoption of this policy posed several immediate challenges to the University information security staff. The most prominent of these challenges was locating social security numbers in University data stores in order to remediate them in accordance with the new policy. Without a clear picture of where our personally identifying information (PII) was stored it would be impossible to embark on any successful policy compliance plan.


After the implementation of the SSN policy, the University of Pennsylvania's School of Arts and Sciences (SAS) was confronted with the challenge of policy compliance. The first step in compliance was finding a technical solution to identify PII. Once the need was explicit, we began a program of exploration and evaluation to determine the nature and scope of the available solutions. At the time there were a number of data loss prevention (DLP) and PII identification solutions available, both open and closed source.

Starting a product evaluation was a daunting task, but identifying our requirements turned out to be almost as difficult. SAS needed a way to manage potentially thousands of endpoints that could contain PII with fewer than three staff members. This staffing constraint mandated that any solution we selected be distributable, so that end users could remediate their own data stores. However, given the number of endpoints in SAS, we also needed a solution that would allow us to manage and track deployment and remediation centrally. Support for an open and unobtrusive information security program was a parallel need. We wanted to ensure that any solution proposed would be flexible enough to allow varying degrees of management. As a baseline we wanted a non-intrusive solution that would guarantee the privacy of the end user but still allow us some central reporting. We also wanted a solution that could be tailored so that, if an end user requested it, we could disable central reporting entirely. We realized that the only manageable way to eliminate the unnecessary use of SSN data was to allow data owners to identify and remove data from their own machines. As long as we provided a tool that we could confirm was installed and run, we would consider the deployment a success, even if we could not track the amount of data identified or the remediation action taken. Ultimately we sought central management tools but distributed remediation tools.

Another pressing concern was the spread of so-called "toxic" data. PII and SSN data at rest were potential policy violations, but leaking that data could become harmful and costly to the PII owners and the University as a whole. We quickly determined that any solution deployed would have to contain toxic data on the endpoint. We did not want any data identified as sensitive to be transmitted over the network or duplicated in any way.

After clearly enumerating all of our product requirements, SAS embarked on a year-long product evaluation of a half dozen industry leaders identified in an informal market survey. We looked at open and closed source solutions. Our testing involved deploying several virtual machines in various configurations that were stocked with a number of fake SSN data stores in several common formats, including Portable Document Format (PDF), Microsoft Office formats (including Access databases), plain text files, database report formats, and other common repository formats. As a baseline for the virtual machine we used a standard allocations image that had been utilized for real work by an employee for enough time to have all the common desktop applications and user data.
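
A test corpus of this kind can be approximated with a short script. The sketch below is not the tooling SAS used; it is a hypothetical Python example that plants the same fake SSNs in a plain text file and a CSV file. The file names, the record wording, and the choice of area numbers 900 to 999 (a range the Social Security Administration does not assign, so values cannot collide with a real SSN) are all assumptions of this example. A scanner that validates area numbers may ignore such values, in which case a different range would be needed.

```python
import csv
import random
from pathlib import Path


def fake_ssn(rng: random.Random) -> str:
    """Return an SSN-shaped string using the unassigned 900-999 area range."""
    area = rng.randint(900, 999)
    group = rng.randint(1, 99)
    serial = rng.randint(1, 9999)
    return f"{area:03d}-{group:02d}-{serial:04d}"


def build_corpus(root: Path, count: int = 5, seed: int = 0) -> list[str]:
    """Write fake SSNs into a text file and a CSV file under root; return them."""
    rng = random.Random(seed)  # seeded so the corpus is reproducible
    ssns = [fake_ssn(rng) for _ in range(count)]
    root.mkdir(parents=True, exist_ok=True)

    # Plain text store: one sentence per planted SSN.
    lines = [f"Employee record, SSN {ssn}" for ssn in ssns]
    (root / "notes.txt").write_text("\n".join(lines) + "\n")

    # CSV store: a header row followed by one row per planted SSN.
    with open(root / "roster.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "ssn"])
        for index, ssn in enumerate(ssns):
            writer.writerow([f"Test Person {index}", ssn])
    return ssns
```

Returning the planted values gives the evaluator a known answer key against which each product's scan results can later be compared.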

We evaluated each product on ease of installation and maintenance, ease of use, ability to accurately identify our target SSN data, identification of false positives (data that did not actually contain PII but was flagged as such), format of reporting, ease of remediation for end users, and integration into a central management interface. Our testing revealed that almost every product required some degree of customization with the help of the vendor in order to meet our PII identification criteria. We did not find that any product was far and away better or worse at finding confidential data. Given the even performance in this factor, it was important to have second-order requirements with which to evaluate each product.
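
The accuracy and false positive criteria can be made concrete by comparing the files a product flags against the files known to contain planted test data. The following Python sketch shows one way to do that using precision and recall; the function name `score_scan` and the sample file names are hypothetical, and this scoring method is an illustration rather than the procedure SAS followed.

```python
def score_scan(planted: set[str], flagged: set[str]) -> dict:
    """Compare files a scanner flagged against files known to hold test SSNs."""
    true_positives = planted & flagged    # planted files the scanner found
    false_positives = flagged - planted   # flagged files with no planted data
    missed = planted - flagged            # planted files the scanner did not find
    precision = len(true_positives) / len(flagged) if flagged else 0.0
    recall = len(true_positives) / len(planted) if planted else 0.0
    return {
        "true_positives": len(true_positives),
        "false_positives": len(false_positives),
        "missed": len(missed),
        "precision": precision,
        "recall": recall,
    }


if __name__ == "__main__":
    planted = {"report.pdf", "notes.txt", "roster.csv", "legacy.mdb"}
    flagged = {"report.pdf", "notes.txt", "install.log"}
    print(score_scan(planted, flagged))
```

Scoring every candidate product against the same planted corpus makes results directly comparable, which matters when, as here, no product clearly outperforms the others.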


Identity Finder was ultimately our selection for PII remediation within SAS and eventually across the University of Pennsylvania. We felt that Identity Finder's polished end user interface and robust features would encourage users to embrace the solution and utilize it to maximum effect. Our deployment plan relied heavily on the end user being able to quickly and easily identify PII and remove it with minimal hassle. Identity Finder was also customizable in such a way as to offer two deployment options: one that reported the location and amount of match data to the console, and one that reported only that the product was installed.

Because of the non-homogeneous nature of the computing environment in SAS (Windows and Apple platforms, domain and non-domain machines, etc.), we realized that a technical support person would have to perform each installation manually. Because we wanted to maximize the effect of each installation, we targeted the installation visit for remediation as well. Our plan called for local support persons to visit machines, install the software, and plan an immediate follow-up session with end users to identify PII and remediate it accordingly.

Because SSNs were used as a primary identifier by the University several years ago, we were immediately concerned with faculty or staff who had been employed at the University for long enough that they might have had to deal with SSN data as part of common business. This group, along with University employees who use SSNs as part of their normal business function, was targeted for initial installation. It has been our experience that the largest stores of SSN data are legacy files and applications that have been migrated forward as users upgrade machines, and that have often been lost or forgotten. This data is particularly worrisome because the data owner may not be aware of the data's existence (and thus the value of the hardware). However, these stores are also the easiest to remediate, as users typically do not object if the stores are simply securely wiped.

Once users who might have access to legacy PII and current users who need access to PII in support of their job function were identified and targeted for Identity Finder, installation progress was tracked through the console. Following this initial rollout, installations followed a department-based deployment, with local support providers being responsible for their own areas. Central security staff manage the console, tracking installation and reviewing scan results periodically to ensure that remediation takes place.

In our approach we have noted that the biggest remediation efforts occur immediately after PII stores are identified. Most commonly PII can be identified and eliminated, and it is rare for the information to be recreated at a later time. For this reason the first scan of any machine is the most important. Scheduled scans have limited value in this scenario, so although we recommend that users schedule follow-up scans of their machines, we are less concerned with subsequent scans.


Technical staff are required to configure Identity Finder clients as well as policy within the console. Additionally, there is no easy way to roll out clients en masse and track them in the console unless there is a central Microsoft Active Directory to tie the clients to groups. Without this ability, a certain amount of manual work is required to accurately identify machines in the console.

Implementation Challenges

One of the largest challenges, given the distributed and diverse nature of the computing population in SAS, is identification of endpoints. Because machines are not necessarily joined to a central Windows domain, they do not carry unique identifiers. It was therefore important that, after installation, technical staff identify new machines in the console and organize them in a meaningful fashion to facilitate follow-up. Without accurate identifying information in the console it is impossible to locate machines that may have stores of PII that have not been removed or protected over time. For this reason the Identity Finder client installer was placed on a repository to which only local support providers had access. This forced end users to contact their support provider to get a copy of the software, allowing for proper tracking of the endpoint and collaborative remediation.

Unintentional empowerment was another challenge. The Identity Finder client includes, by default, many features that we felt could become problematic. For this reason we disabled some of these features, including the ability for users to encrypt their own data stores. Because Identity Finder does not provide key escrow functionality, we wanted to make sure that users were not able to encrypt data, lest they forget or lose their encryption keys. Moving data to the Recycle Bin was also disabled to prevent users from deleting data in an insecure manner. The ability to customize such features within the Identity Finder client became critical to our deployment strategy.

Network shares are also a concern. Although Identity Finder does a good job of scanning shared drives and finding data, there are two concerns. The first is licensing, as Identity Finder is intended to be licensed on a per-user basis rather than a per-machine basis. Luckily we were able to site license the software, so this was not an issue for SAS, but it is a consideration when choosing a solution. The second, and our biggest challenge with network devices, was logistical. Given the distributed nature of multi-user shares, when PII was found it became challenging to pinpoint data owners and coordinate remediation among the various parties who might have access to, and could potentially be using, the identified PII. In addition to posing these logistical difficulties, shared devices tend to be better managed and secured, both physically and in terms of software. For these reasons we chose to target shared repositories only after all endpoints were scanned.

The sheer number of endpoint deployments also created a hurdle for installation. Because we wanted a technical staff member to be on hand to help end users interpret scan results and guide remediation, each deployment took quite a bit of staff time. Although this made deployment slower, it increased effectiveness, allowed for end user education, and overall reduced the chances of new PII being created on each endpoint. The Identity Finder console was a critical tool in tracking installations and assisting in the management of the deployment effort. Using the console, a central information security staff person can quickly get an overview of deployment penetration and focus project management efforts on groups identified as having low or slow deployment, so that resources can be allocated effectively to aid our overall installation.
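
The kind of overview described above can be sketched as a per-group coverage calculation. The Python example below assumes a hypothetical list of (group, installed) records; it does not reflect the Identity Finder console's actual data model or export format, and the function name `coverage_by_group` is invented for this illustration.

```python
from collections import defaultdict


def coverage_by_group(endpoints):
    """Return (group, fraction installed) pairs, lowest coverage first.

    endpoints is an iterable of (group_name, installed) tuples, a
    hypothetical record format assumed for this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [installed, total]
    for group, installed in endpoints:
        counts[group][1] += 1
        if installed:
            counts[group][0] += 1
    report = [(group, done / total) for group, (done, total) in counts.items()]
    return sorted(report, key=lambda row: row[1])


if __name__ == "__main__":
    sample = [
        ("History", True), ("History", False), ("History", False),
        ("Physics", True), ("Physics", True),
    ]
    for group, fraction in coverage_by_group(sample):
        print(f"{group}: {fraction:.0%} installed")
```

Sorting by ascending coverage puts the groups that most need attention at the top of the report.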

Although our initial deployment effort called for local support providers to schedule two appointments with faculty, one for installation and one for follow-up, we have abandoned that strategy. The logistics of coordinating schedules and appointments became unwieldy and productivity was low. Instead, support providers are now encouraged to dovetail client installation with other regular visits to machines. This opportunistic model has proven much more effective for client installs.

Deploying the endpoint client is quite fast. It generally takes less than 5 minutes to perform the installation. The initial scan of machines takes a variable time depending on the size of the drive scanned as well as physical factors (such as access speeds, etc.). Scans could take minutes or hours depending on configuration details and volume of data.

Deploying in a managed environment was much more rapid than deploying to distributed departments. Installation on managed staff machines was simpler and more straightforward than on faculty machines, because unmanaged endpoints tend to have more eccentricities that confound simple installation. Our installations generally proceeded at a ratio of three on managed staff machines to every one on an unmanaged faculty machine.

Future Plans

SAS plans to support the continued deployment of Identity Finder clients and track changes using the management console. Support for Apple clients is sufficient that all Apple workstations will be targeted for installation as well. Using the management console SAS expects to be able to chart deployment across the remaining endpoints throughout the school. As new machines are provisioned they will have Identity Finder clients installed as part of the School's standard build. Data transferred from older machines will be scanned with Identity Finder prior to the data being moved to new machines.

Return on Investment

The financial risk of having PII on endpoints varies with jurisdiction and volume of data. If breach notification is required, the cost of a compromise could be substantial. By helping to identify PII on end user machines and educating users during the deployment, SAS has significantly reduced its exposure to the loss or disclosure of PII. This reduction in risk more than justifies the cost of Identity Finder client and server licenses.


Replicability rating: 3 (on a scale of 1 to 5, where 5 is Highly Replicable)


Effectiveness rating: 5 (on a scale of 1 to 5, where 5 is Highly Effective)

Submitted By

Justin Klein Keane, Senior Information Security Specialist, University of Pennsylvania (last reviewed by Christine Brisson, Director of Information Security and Unix Systems, SAS Computing, University of Pennsylvania)

Except where otherwise noted, this work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0).