Data Collection
The decision to collect SOGI data should be guided by a set of shared principles about how best to enfranchise everyone while recognizing and mitigating the risks, and the process must be informed by an understanding of several key considerations.
Whether to Collect Data
Deciding to collect SOGI data always carries some level of risk. Institutional leaders, in consultation with the core team, must weigh the risks, benefits, and needs, starting with why the data is being collected. Is the data necessary to provide a product or service, reduce discrimination, or directly improve the experience of members of a marginalized group? If the answer is "no," consider not collecting the data. If the answer is "yes," consider collecting only the minimum needed. Institutions that are at high risk—those that are subject to discriminatory laws or policies or that have leaders (whether from government or inside the institution itself) who have displayed a willingness or high likelihood to demand data that would reveal personal SOGI information—should not collect any new SOGI data that is not required by law, should not enter any existing SOGI data into a central database, and should consider deleting any existing data that they are not required to maintain.
Intent
Institutions should declare, at the point of collection, how SOGI data will be used (e.g., aggregate versus individual, anonymized versus open), indicating how each data item can be used. Add prominent links to easy-to-interpret data-usage policies anywhere you are collecting data, along with an email address, phone number, and/or information on an office that can assist with questions. If data is being stored in a centralized location, be explicit about the usage requirements and expectations for downstream apps, the campus community, and third parties. This can be facilitated by the use of forms to request access to data; keep in mind that once data is sent elsewhere, usage policies likely cannot be enforced. In addition, post-violation enforcement responds to any potential harm only after it has already occurred, and many types of harm cannot be undone and will have lasting impacts.
Consent to Collection and Use
Informed consent involves having a full understanding of, and voluntarily giving permission for, an explicit activity for a defined period. Any person providing data has a right to know what data is being collected, what the impacts are of agreeing or declining to provide the data, and how data security and privacy are being provided. In some cases, informed consent is required by law; in cases where it is not, informed consent remains a best practice and should always be required when collecting SOGI data. When consent is requested, the institution or department's privacy policy should be easily and publicly available, without the individual having to navigate to a location separate from where the data is collected. This same principle applies to the ease of requesting usage details at any point and not just at collection.
People should be presented with a request for consent that states what data is being collected, why it is being collected, and who will have access to it, and that offers several ways to opt in or opt out:
- Opt in to all data collection: A single selection to opt in, completing all data entry even if not required
- Opt out of all data collection: A single selection to opt out, choosing not to provide any information, required or optional
- Selectively opt in to certain data collection: Selecting specific data points to opt in to, choosing to provide only required information
- Selectively opt out of certain data collection: Selecting specific data points to opt out of, choosing to provide only required information
Due to the potential risk, the safest solution is to not collect any SOGI data by default and to then allow people to selectively opt in to specific SOGI data collection.
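As a minimal sketch of the approach described above (nothing collected by default, selective opt-in per field), the following Python models a consent record. The class name, the field names in `SOGI_FIELDS`, and the `may_collect` helper are hypothetical illustrations and do not come from any existing system.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical set of SOGI fields an institution might offer; adjust to local needs.
SOGI_FIELDS = {"gender_identity", "pronouns", "honorific", "sexual_orientation"}


@dataclass
class ConsentRecord:
    """Tracks which SOGI fields a person has explicitly opted in to."""
    person_id: str
    purpose: str   # why the data is being collected
    expires: date  # consent covers a defined period
    opted_in: set = field(default_factory=set)  # empty by default: nothing is collected

    def opt_in(self, field_name: str) -> None:
        if field_name not in SOGI_FIELDS:
            raise ValueError(f"unknown field: {field_name}")
        self.opted_in.add(field_name)

    def opt_out(self, field_name: str) -> None:
        self.opted_in.discard(field_name)

    def opt_out_all(self) -> None:
        self.opted_in.clear()

    def may_collect(self, field_name: str, today: date) -> bool:
        """Collection is allowed only for opted-in fields while consent is current."""
        return field_name in self.opted_in and today <= self.expires


consent = ConsentRecord("p-001", "pronoun display in class rosters", date(2030, 1, 1))
consent.opt_in("pronouns")
print(consent.may_collect("pronouns", date(2029, 6, 1)))         # True
print(consent.may_collect("gender_identity", date(2029, 6, 1)))  # False
```

Because `opted_in` starts empty, any code path that forgets to ask for consent collects nothing, which is the safer failure mode.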
Safety Concerns and Other Precautions
- Consider adding escape buttons on websites where the data is being collected or displayed.
- Establish policies concerning the abuse of open text fields, such as intentionally entering incorrect information to undermine the effectiveness of the data, entering hate speech, or using the field to express displeasure at the institution respecting LGBTQIA+ identities.
- Populating data from other sources, such as credit cards, admissions or financial aid applications, student information systems, and student records, is not recommended because there is no way to verify that the data came from the individual and is current. If data is entered by a third party, the system should clearly identify the data source, because such data is often based on assumptions and is frequently incorrect.
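One way to make the data source visible, sketched here under assumptions, is to store the source alongside every value and flag anything the individual did not enter themselves. The `Source` and `FieldValue` names and the wording of the flag are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Source(Enum):
    SELF_REPORTED = "self-reported"
    THIRD_PARTY = "entered by a third party"


@dataclass(frozen=True)
class FieldValue:
    value: str
    source: Source
    recorded_on: date

    def display(self) -> str:
        """Show the value, flagging anything the individual did not enter themselves."""
        if self.source is Source.SELF_REPORTED:
            return self.value
        return f"{self.value} (unverified: {self.source.value} on {self.recorded_on.isoformat()})"


own = FieldValue("Mx.", Source.SELF_REPORTED, date(2024, 9, 1))
other = FieldValue("Ms.", Source.THIRD_PARTY, date(2023, 1, 15))
print(own.display())    # Mx.
print(other.display())  # Ms. (unverified: entered by a third party on 2023-01-15)
```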
Data Architecture and Data Fields
All data related to gender and sexual identity, including pronouns and honorifics, should be treated as sensitive and given a high level of privacy and security protection. Whenever possible, data should be anonymized or aggregated to a level where individuals cannot be identified. Data should not be collected until policies on access control, collection, retention, and deletion are established to prevent instances of data being collected and held pending decisions. If SOGI data is already being collected and appropriate policies are not in place, its use should be restricted and further data collection should be paused until necessary policies are put in place. Assess the risk of the data being linked to a data source that reveals the identity of the corresponding individuals. The goal should be to promote inclusivity and equality while respecting individuals' privacy and rights.
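Aggregation alone does not guarantee that individuals cannot be identified, because very small groups can still single people out. The sketch below illustrates one common technique, small-cell suppression, which the text above does not name. The threshold of 5 is an assumption that should be set by institutional policy, and the function name is hypothetical.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # assumed threshold; set according to institutional policy


def aggregate_with_suppression(responses, min_cell_size=MIN_CELL_SIZE):
    """Count responses per category, hiding counts small enough to single people out."""
    counts = Counter(responses)
    return {
        category: (n if n >= min_cell_size else f"<{min_cell_size}")
        for category, n in counts.items()
    }


responses = ["woman"] * 12 + ["man"] * 9 + ["nonbinary"] * 2
print(aggregate_with_suppression(responses))
# {'woman': 12, 'man': 9, 'nonbinary': '<5'}
```

Suppressed counts can sometimes be recovered by subtraction if an exact total is published alongside them, so totals may need the same treatment.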
Systems should be capable of identifying personally identifiable information (PII) associated with any specific individual and either removing the record completely or de-identifying/anonymizing the record to protect individuals and respect their preferences related to data. The same data might exist elsewhere, such as in emails or on employee devices. A robust and rapid deletion process should be in place that also includes confirmation that data is removed in all forms and from all systems. However, keep in mind that data is not limited to electronic formats or single locations—in most cases, deleting data is not as simple as removing a record.
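The electronic part of such a deletion process might look like the following sketch, which removes a person's record from every known system and then confirms that it is gone. The in-memory dictionaries stand in for real systems, and the system names and the `delete_everywhere` function are hypothetical.

```python
def delete_everywhere(person_id, stores):
    """Delete a person's record from every store and confirm it is gone.

    `stores` maps a system name to a dict of records keyed by person ID.
    Returns a report mapping each system name to True if the record is
    confirmed absent after deletion.
    """
    report = {}
    for name, records in stores.items():
        records.pop(person_id, None)             # remove the record if present
        report[name] = person_id not in records  # confirm removal
    return report


# Hypothetical systems holding copies of the same data.
stores = {
    "student_information_system": {"p-001": {"pronouns": "they/them"}, "p-002": {}},
    "learning_management_system": {"p-001": {"pronouns": "they/them"}},
    "analytics_warehouse": {"p-002": {}},
}
report = delete_everywhere("p-001", stores)
print(all(report.values()))  # True
```

A routine like this covers only the systems it knows about. Copies in emails, on employee devices, or on paper still need their own documented steps.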
Data Architecture
Different approaches (and combinations of approaches) can be used for designing an architecture to capture data, depending on the systems already in place at an institution, as well as any planned changes. It is important to consider this information as part of a formal requirements process when updating existing systems or selecting new systems. Following are four starting points:
- Tiered: The tiered approach is a robust structure for categorizing and organizing data and data collection methods into a defined number of levels. This can support access control based on data granularity and source, depending on the intended use. Data collection protocols should be aligned with the data source and categorization such that, for example, highly sensitive data, regardless of source, will be masked and may also require additional permissions for access. This is the recommended approach.
- Data Access Controls: This is a more personalized (person-centric) approach, allowing individuals to adjust preferences or data profiles within limits established by the institution. As with a tiered implementation, this model allows for fine-grained collection controls, but it puts that control in the hands of the individual rather than the institution. While this approach affords individuals the most flexibility, it may be overwhelming for individuals and as such is not the recommended approach for most situations. You may find over time that you want to transition to access controls on each piece of data.
- Open by Default: The open-by-default approach allows people to manage what data they share, without limitations from the institution. This is commonly found in social media, where the types of data you can provide are nearly unlimited but it remains your choice (in most cases) to provide the data. This least-restrictive data collection option poses the highest risk to vulnerable communities. Establishing an architecture should include identifying sources of data and mechanisms for gathering data, defining access levels and permissions to ensure security and privacy, and categorizing datasets for better understanding. Centralized data is more consistent and easier to access, manage, and secure. Distributed data is more scalable, lowers the risk of data loss, and offers better performance than centralized data. If this option is used, a hybrid approach is recommended to support the needs of openness, security, and scalability.
- Closed by Default: The closed-by-default strategy focuses on restricting access to data. The architecture must utilize authentication and authorization to control who can access the data, protect the data by encrypting it at rest and in transit, minimize levels of access granted, and monitor access and use of the data to prevent unauthorized activities. In a closed-by-default architecture, centralized data is recommended due to control, security, and management. This is most consistent with practices related to research and the IRB review in higher education.
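The tiered and closed-by-default ideas above can be combined in a small sketch: each field is assigned a tier, each role a clearance, and anything not explicitly permitted is masked. The tier names, field assignments, role names, and the `view_record` function are all assumptions made for illustration. Authentication, encryption, and access monitoring are not shown.

```python
from enum import IntEnum


class Tier(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    SENSITIVE = 2  # SOGI fields are placed here in this sketch


# Hypothetical field-to-tier assignments; unlisted fields are treated as SENSITIVE.
FIELD_TIERS = {
    "lived_name": Tier.INTERNAL,
    "gender_identity": Tier.SENSITIVE,
    "sexual_orientation": Tier.SENSITIVE,
}

# Hypothetical role clearances; unlisted roles get no access at all.
ROLE_CLEARANCE = {
    "instructor": Tier.INTERNAL,
    "sogi_data_steward": Tier.SENSITIVE,
}

MASK = "[restricted]"


def view_record(record, role):
    """Return a copy of the record with fields above the role's clearance masked."""
    clearance = ROLE_CLEARANCE.get(role)  # None means closed by default
    visible = {}
    for name, value in record.items():
        tier = FIELD_TIERS.get(name, Tier.SENSITIVE)
        allowed = clearance is not None and clearance >= tier
        visible[name] = value if allowed else MASK
    return visible


record = {"lived_name": "Sam Rivera", "gender_identity": "nonbinary"}
print(view_record(record, "instructor"))
# {'lived_name': 'Sam Rivera', 'gender_identity': '[restricted]'}
print(view_record(record, "unknown_role"))
# {'lived_name': '[restricted]', 'gender_identity': '[restricted]'}
```

Treating unknown fields as sensitive and unknown roles as having no clearance means that a configuration gap results in less access rather than more.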
Data Fields
Various data fields fall under the scope of SOGI data. The following is a non-exhaustive list of examples; apart from sex assigned at birth, an individual's data from all of these categories can change over time:
- Gender Identity: Gender identity cannot be assumed from pronouns, legal sex, or sex assigned at birth. Instead, allow people to self-identify gender. If you provide a list of options for gender identity—for instance, to make research data "cleaner"—use checkboxes rather than radio buttons to allow for multi-selection, including a field for people to write in their own (which should be an option even if other options are checked). Do not include "transgender" in the list of options. One of the most frequent mistakes is the separation of gender terms "woman" from "trans woman" and "man" from "trans man"; these are not separate genders. If the question of whether someone is transgender is important, this can be a separate item.
- Honorifics: Once again, do not assume honorifics from other data. A list may be appropriate to use here; ensure that the list includes gender-neutral titles and include a field for people to write their own. The most commonly used gender-neutral, profession-agnostic title is "Mx."
- Legal Name: Collect legal names only when absolutely necessary and ensure they will not be shared without consent and disclosure. Legal names do need to be collected when they are required for an official legal or regulatory purpose, such as submission of the FAFSA, and in such cases, the format of collection is usually predetermined. If the data field does not follow practices noted under lived name (see below), provide a disclaimer explaining why, citing the limitations of connected systems outside your institution's control.
- Legal Sex: As with legal names, if you're collecting legal sex information, it's most likely due to a reporting requirement. This may also be relevant in some research; there are very few other reasons an educational institution should need this data. Be advised that legal sex may be different across various government-issued documents (e.g., driver's license, birth certificate), so be specific about where this information should be pulled from. Furthermore, if the system is constrained to specific options, be precise about how people should fill out the form. Legal sex markers may include M, F, I, U, X, and E, though each type of government-issued document has its own limitations (for example, a state may have M, F, and X options for driver's licenses or M and F for birth certificates, and it might not provide any legal documents with I or U options), and this list may change over time. Do not assume that everyone in your institution comes from your state or has documents that follow your state's rules. X gender markers should be respected: do not assume a binary gender for the person or force the marker to conform to whichever binary gender seems closer, as a closer one may not exist. Keep in mind that these forms often will not provide pertinent options for all individuals; if an X or a blank is simply translated into an M or F, provide information on why and in what instances this will occur. Unlike sex assigned at birth, a person's legal sex can change.
- Lived Name (first, middle, last): Lived name, which is the name a person uses and might differ from a legal name, is often the only information required to deliver a service (such as accessing a course or grading homework). Best practice is to make this a free-form field that accepts all characters. Be mindful of international audiences here: accept varied character sets and ensure that all connected systems will properly handle them. Do not set a minimum number of characters; a character limit of 500 allows the needed flexibility.
- Other Names: Often people will have other names they have lived by in the past; whether this is due to transition, marriage, or another life circumstance is irrelevant. In addition, people might change their first, middle, or last name individually. If this information must be collected, do not assume someone will have only two names in their lifetime. Best practice is to allow the individual to add as many responses as necessary, in free-form text fields; if this information is being collected to feed into a system that requires first and last names to be separate attributes, allow the person to input multiple full names. Do not refer to this category by a more specific term that excludes other uses (i.e., use "other names" and not "maiden name" as the primary field label).
- Pronouns: Pronouns should never be a required field. Until recently, drop-down menus were the best practice for collecting pronouns. However, given the diversity of gender expression and identity, this is no longer the case. Furthermore, do not use other information—such as gender identity or legal sex—to assume someone's pronouns. Current best practice is to provide an open-text field, which must allow space for people to enter multiple sets of pronouns. Another option would be checkboxes that allow multiple selections for common options along with an open-text field for additional pronoun sets to be added by individual users. In either case, the open text must allow for letters, commas, parentheses, spaces, and slashes and should allow at least 50 characters. Pronouns are often given in the style of "she/her, they/them" or "e/em/eir(s)." Note that some people use "it/its" pronouns or neopronouns (e.g., "bug/bugs," "void/voids," "beep/beeps"). If someone voluntarily adds these to their pronouns, the best course of action is to simply use them to refer to the person. Meanwhile, others do not use pronouns at all, and an option should be available to indicate no pronouns.
- Physical Sex: Do not collect this information unless absolutely necessary. Usually, it is needed only for medical forms, given that many colleges and universities have health centers on campus or are affiliated with teaching hospitals. Understand the desired outcome when collecting this data, and know that a person's physical characteristics and physical sex can change (for example, via gender-affirming medical care). While health insurance relies on a single variable for physical sex, the approach is not effective for understanding an individual's medical needs due to the variation that exists within and across typical physical sex categories (i.e., male, female, and intersex). Physical sex can be broken down into several categories and details depending on the desired information, and assumptions about any item should not be made based on a single physical sex variable, regardless of whether a patient is known to be trans or not. For example:
  - Anatomy/organs
  - Secondary sex characteristics (e.g., facial and body hair, breasts)
  - Hormone balances
  - Chromosomes
- Sex Assigned at Birth: Many jurisdictions in the United States require information about sex at birth to be collected and used, typically to restrict the rights of trans people. Sex at birth is generally assigned by a doctor at the time of birth based on a newborn's apparent physical sex and becomes a person's legal sex at that time. Legal sex and physical sex, however, can be changed later in life (via legal processes, gender-affirming medical care, etc.), while sex assigned at birth does not change. Sex assigned at birth should not be collected or used for any purpose unless specifically required by law.
- Sexual Orientation: A person's identity in relation to the gender or genders to which they are sexually attracted, which is separate from their gender. This information rarely needs to be collected; however, for appropriate use cases, use radio buttons, checkboxes, or dropdowns or allow an individual to write in their chosen sexual orientation.
- Transgender Identification: Use a separate field to collect information on whether or not someone is transgender. Appropriate response options include Yes, No, Don't Know, and Prefer Not to Disclose, presented as either radio buttons or checkboxes. You may also use other options. This information rarely needs to be collected.
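Several of the field rules above translate directly into input handling. The following is a minimal sketch, not a definitive implementation, covering three of them: pronouns as an optional open-text field limited to letters, commas, parentheses, spaces, and slashes; lived name as a free-form field with no minimum length and a 500-character limit; and gender identity as a multi-select with a write-in that stays available even when other options are checked. The function names are hypothetical, and the pronoun limit of 100 characters is an assumption (the guidance asks only that at least 50 be allowed). A separate "no pronouns" option is not shown.

```python
PRONOUN_PUNCTUATION = set(",() /")
PRONOUN_MAX_LEN = 100     # assumed limit; the guidance requires room for at least 50
LIVED_NAME_MAX_LEN = 500  # limit suggested in the guidance


def validate_pronouns(text):
    """Pronouns are never required, so an empty value is valid."""
    if len(text) > PRONOUN_MAX_LEN:
        return False
    # str.isalpha() accepts letters from any script, not only A-Z.
    return all(ch.isalpha() or ch in PRONOUN_PUNCTUATION for ch in text)


def validate_lived_name(text):
    """Free-form: any characters, no minimum length, up to 500 characters."""
    return len(text) <= LIVED_NAME_MAX_LEN


def gender_identity_response(selected, write_in=""):
    """Combine checkbox selections with an optional write-in; both may be present."""
    values = list(selected)
    if write_in.strip():
        values.append(write_in.strip())
    return values


print(validate_pronouns("she/her, they/them"))  # True
print(validate_pronouns("e/em/eir(s)"))         # True
print(validate_pronouns(""))                    # True
print(gender_identity_response(["woman", "nonbinary"], "genderfluid"))
# ['woman', 'nonbinary', 'genderfluid']
```

Note that `len` counts Unicode code points rather than what a reader perceives as characters, so names written with combining marks use up the limit faster than they appear to. Connected systems should be checked against the same limits.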