IT Needs and Challenges

Demand is increasing for staff who can run and support machine learning technology. Setting up, running, and maintaining machine learning technology requires very specific skills. Staff with these responsibilities need to be aware of how machine learning pipelines work and how to help inexperienced researchers design and develop workflows for their research. Other important skills to look for or develop in current staff include the ability to deploy, scale, and manage containers to help minimize virtualization footprints. Experience with Docker or Kubernetes is especially valuable, as Docker is currently the most popular container platform and Kubernetes is one of the most popular and standardized container orchestrators. Additionally, it is valuable to have staff who can conduct a needs assessment for researchers with varying technical knowledge in machine learning. In interviews conducted as part of this research, IT managers reported that finding this combination of skills to support more widespread machine learning research at higher education institutions is currently like trying to find a unicorn.

IT managers preparing to support more of this type of research at their institution should start conducting needs assessments and identifying any researchers who might be looking into machine learning technology on their own. Additionally, managers may want to encourage staff to leverage their professional development resources to build some of the "unicorn skills" to help design and plan for implementation of any machine learning researcher support.

Case Study Example

At Carnegie Mellon University, researchers are using both local HP Z8 workstations and cloud computing to explore new approaches and uses of machine learning. Arun Suggala, a PhD student, is working from the theoretical side of machine learning to try to identify and fix corruptions while training models. Johannes DeYoung is bringing an interdisciplinary program that blends integrative design, art, and technology to students. PhD graduate Liam Li is now working with a start-up firm to help automate stages of the machine learning pipeline, improve experiment tracking, and eliminate some of the pain points of working in the cloud. All of these researchers and more at CMU have required varying levels of expertise and support along the way, and IT staff need to be ready to assist and direct researchers who are using various machine learning approaches and methods.

Institutions must balance the tradeoffs of local versus cloud computing resources. Interviewees on both the research and IT manager sides reported positive experiences with both local and cloud computing resources for machine learning, suggesting that institutions may want to explore a balanced approach that involves adopting both models. Many researchers reported moving away from cloud-based resources, primarily due to their cost of use; however, some researchers have been moving away from NSF computing resources to local machines for reasons beyond cost. NSF resources are available for researchers nationwide and have become frequently overloaded and difficult to book for sufficient consecutive compute time. With a local workstation, these researchers now only have to share resources with a few other researchers at their own institution, and many workstations are powerful enough for several researchers to use at the same time. Interviewees reported that local resources can be a better solution for some of the less technologically knowledgeable users because they often have fewer or less-complex demands for computing, which frees IT from having to orchestrate multiple cloud instances. These more straightforward needs can help simplify the building of machine learning models and eliminate the risk of getting overbilled for cloud compute resources.

Conversely, a few IT managers reported that their institution was starting to shift some functions to cloud-based resources and away from local resources where possible. Interviewees reported several benefits from these shifts to the cloud, including diminished concerns about hardware upgrades every few years and the lower costs of only paying for computing when there is a documented need for it.
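The cost side of the local-versus-cloud tradeoff described above can be framed as a simple break-even calculation. The sketch below is illustrative only: the function names and every figure in the usage note (purchase price, upkeep, hourly rate) are hypothetical assumptions, not numbers reported by interviewees, and the model ignores factors such as staff time, power, and data-transfer fees.

```python
def breakeven_hours(workstation_cost, annual_upkeep, years, cloud_rate_per_hour):
    """Return the compute hours at which owning and renting cost the same.

    All inputs are hypothetical planning figures supplied by the caller.
    """
    if cloud_rate_per_hour <= 0:
        raise ValueError("cloud_rate_per_hour must be positive")
    # Total cost of ownership over the planning period.
    total_local = workstation_cost + annual_upkeep * years
    # Hours of pay-as-you-go cloud compute that the same money would buy.
    return total_local / cloud_rate_per_hour


def cheaper_option(expected_hours, workstation_cost, annual_upkeep, years,
                   cloud_rate_per_hour):
    """Return "local" or "cloud" for the expected usage over the period."""
    threshold = breakeven_hours(workstation_cost, annual_upkeep, years,
                                cloud_rate_per_hour)
    return "local" if expected_hours > threshold else "cloud"
```

As a usage example under these assumed figures, a 12,000 workstation with 500 per year in upkeep over 4 years, compared against a cloud rate of 3.50 per hour, breaks even at 4,000 compute hours: a lab expecting 1,000 hours would lean toward the cloud, while one expecting 6,000 hours would lean toward local hardware.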

Case Study Example

At Rice University, Erik Engquist, Director of the Center for Research Computing, oversees staff facilitators who work with faculty to help find cloud solutions for their projects and lab needs. These facilitators help faculty apply for research credits and set up contacts with vendors on a case-by-case basis. Though this type of support demands initial work up front for the facilitators, once the faculty member has a research plan and an established relationship with a vendor, the facilitator can move into a background role and easily support others.

Efforts are needed to lower the barrier of entry to machine learning. Several of the researchers interviewed for this research project reported challenges of a technical nature with machine learning. Researchers coming from backgrounds in art, civil engineering, or life sciences all reported struggling to find resources, especially those pioneering the use of machine learning in their field and trying to build models on new types of data. One computational biology PhD candidate who is conducting research on the evolutionary history of genomes has struggled to find tools, libraries, or machine learning methods that can satisfactorily answer his questions. In his words, "A lot of machine learning methods are predicated on having large amounts of labelled data that you can learn from. But in our field, we have a lot of data that is not labelled (the 'ground truth' is not known), so it's unclear whether or which machine learning algorithms will be useful."

IT staff can help mitigate some of these barriers to entry by building out infrastructure and processes for onboarding researchers who are delving into machine learning research for the first time. Devoting attention now to developing processes, documentation, and a community of practice might help relieve some of the top-down administrative pressures later as more researchers look to apply machine learning in their labs or research projects. As these processes and the community mature, the overall barrier to entry across the institution will fall, creating more opportunities for faculty and students to experiment with machine learning with fewer demands for support from IT.

Case Study Examples

At the University of California, Berkeley, Anthony Suen, Director of Programs, Division of Computing, Data Science, and Society, is working to expand cloud access to data science educational infrastructure. Suen and others at UC Berkeley are working with colleagues at UC San Diego and the University of Washington and have begun production operations on the NSF-funded CloudBank Program, which aims to simplify the use of public clouds across computer science research and education. The program seeks to develop processes, tools, and educational and outreach materials to address many of the current pain points users face when trying to make effective use of public clouds. This type of program could help lay the foundation for a more community-driven, open-source solution to cloud computing access and may help institutions avoid getting locked in to a single vendor.

Another new venture that UC Berkeley is helping launch is 2i2c (International Interactive Computing Consortium), a nonprofit organization that runs and supports customized JupyterHubs for educational and research users. These hubs are tailored for the communities they serve and are 100% open source. In addition, 2i2c supports, develops, leads, and advocates for open-source tools in interactive computing that are created, used, and controlled by the community.

Jackson State University is working to build machine learning capabilities from the ground up. The mathematics department recently started a multidisciplinary big-data program, with participation from other departments such as public health, physics, biology, and electrical and computer engineering. As at many other institutions, officials at Jackson State are working to improve the machine learning knowledge and skills of their instructors and researchers. Their budget does not allow for hiring full-time big-data professors or building large machine learning infrastructure, so the instructors are working with IT and the VP of research to connect with industry to help build out their infrastructure with grants. The instructors are also working on their own to expand their skills with machine learning and use the collaborative nature of the multidisciplinary program to share their insights with others. One example application being evaluated on an HP Z8 workstation uses machine learning to build a predictive model for traffic conflicts by learning from 67 GB of basic safety messages from connected vehicles.

An example HP Z8 workstation. These types of workstations can provide the compute power for several machine learning research projects at once. (Image credit: HP)