Faculty and Researcher Best Practices

Numerous resources are available online to give students equal access. Equal access to technology for all students, irrespective of the resources available to them, was an issue that many faculty researchers discussed, particularly with the growing number of students taking courses in machine learning. Tools like Jupyter Notebooks are a great introduction for students in machine learning courses, as well as for faculty who are beginning to explore new types of data and to incorporate machine learning into their research. But many of these tools run on local systems, and students or researchers who lack the necessary hardware may encounter barriers, including long run times, in their machine learning work.

Institutions might consider exploring solutions such as Google Colab, which allows students and researchers to run their projects on Google's cloud servers. Interviewees called tools like this a big "equalizer for students," especially when a student's own devices or technologies can't manage their class assignments in a timely manner.

Researchers should build communication lines with IT early. Researchers need to open communication lines with their IT counterparts to ensure they get the support they need. Interviewees reported that a common complaint from the IT side of the house is that researchers tend to buy equipment with their grant money without first consulting IT. This can lead to greater frustrations down the road when that equipment breaks, or when researchers encounter a data security issue, and IT is needed for back-end support. Interviewees reported more positive experiences for both IT staff and researchers when communication lines were opened early and used often between the two groups. Strong communication lines ensure IT can have initial needs conversations, help with the setup of technology, and help end users avoid common issues and pain points.

Building and maintaining open communication lines are also essential when researchers are working with IT to budget time or access to local systems or cloud computing. Lines of communication can ensure that researchers, especially those less familiar with machine learning technology, get access to what they need to accomplish their goals. Interviewees recommended that researchers document and communicate opportunities, improvements, and capabilities within machine learning and AI infrastructure so that future budgets will be available to accommodate the need and growth projected for labs or research projects.

A flexible workflow process can streamline the incorporation of machine learning into research. Researchers use machine learning in their research in many ways, and their technology needs vary widely. In addition to working with IT to help determine needs, some of the researchers we interviewed highlighted the benefits of considering workflow process in planning their projects. Researchers have conferences and grant and publication deadlines to meet, so knowing and communicating the timelines and requirements for their workflow is essential for a smooth and successful research process.

There are many ways researchers can streamline their workflows based on the type and quantity of data they are working with, but flexibility across workstations or platforms is a key ingredient across most of these approaches. Researchers reported that it was extremely helpful to be able to work locally on a workstation, or sometimes even a good laptop or desktop, as they are planning their workflow. One researcher with access to an on-premises workstation highlighted how it let him "stack a lot of jobs and then run them in parallel" for more, faster testing of new models and configurations for his research. Other researchers mentioned the benefits of being able to more easily debug their configurations and being able to "visualize simulation outputs in the early stages of research."
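The "stack a lot of jobs and then run them in parallel" pattern described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `run_job` is a hypothetical stand-in for a short training run (it computes a placeholder score rather than training anything), and the learning rates, batch sizes, and `max_parallel` value are invented. In practice each job would more likely launch a training script on one of the workstation's GPUs.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def run_job(config):
    """Hypothetical stand-in for one short training run.

    Returns a placeholder score so the sketch runs without an ML framework.
    """
    lr, batch_size = config["lr"], config["batch_size"]
    # Placeholder scoring: peaks at lr=0.01, batch_size=64 (arbitrary choice).
    score = 1.0 / (1.0 + abs(lr - 0.01) * 100 + abs(batch_size - 64) / 64)
    return {"config": config, "score": score}


def run_stack(configs, max_parallel=2):
    """Run a stack of queued jobs, at most `max_parallel` at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves the order of the input configurations.
        return list(pool.map(run_job, configs))


# Queue up every combination of the (illustrative) settings.
configs = [
    {"lr": lr, "batch_size": bs}
    for lr, bs in product([0.1, 0.01, 0.001], [32, 64])
]

results = run_stack(configs, max_parallel=2)
best = max(results, key=lambda r: r["score"])
print("Most promising configuration:", best["config"])
```

Limiting `max_parallel` to the number of local GPUs is one simple way to keep a stack of jobs from oversubscribing a single workstation.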

Access to open-source machine learning frameworks such as TensorFlow or PyTorch may be another key to establishing a successful workflow process. As one researcher put it, "We need control over our workflows to test models and write papers. Libraries like TensorFlow help with that." Nvidia's NGC Containers may be a similarly helpful resource, hosting a catalog of curated GPU-optimized software to help users simplify and accelerate their workflows. Libraries and catalogs such as these can be difficult to navigate, even for more experienced researchers, but as these solutions continue to build and expand access, researchers expect their usability to improve so that more researchers, including less experienced ones, can use them with greater ease.

Case Study Example

At Stanford University, researchers in Chelsea Finn's laboratory are using a three-stage technology process to minimize their resource costs and maximize the theories and ideas they can test as they develop new methods in machine learning. These researchers are working in teams, tackling various problems and trying to find the next improvement, the next model, or the next solution they can present to the machine learning community.

Stanford Three-Stage Process of Research

Workstation

  • The HP Z8 workstation, equipped with dual Intel Xeon processors, dual Nvidia Quadro RTX 8000 GPUs, 384 GB of system memory, and HP Z Turbo Drive M.2 storage, allows for variability in model development and enables the researcher to explore and push the boundaries of configurations.
  • Researchers use an HP Z8 workstation to test multiple machine learning models, generally training each one for a few hours at a time.
  • As they test each of their ideas, the researchers can identify the most promising and move on to the next stage of training and development.

On-Premises Cluster

  • Once an idea has shown promise on the workstation, researchers move to an on-premises cluster using 1–10 GPUs for 3–5 days of training.
  • This stage is useful for wider hyperparameter sweeps and is used for jobs that might risk destabilizing a local workstation.
  • At the end of this stage, the researcher can either discard the idea or show that it might provide an important contribution or solve an important problem in their field. In the latter case, they can move on to the final stages of research.

Cloud Use

  • The use of cloud systems is reserved for the final stages of research—using 10–100+ GPUs to run a burst or big sweep of testing and training.
  • This stage sets the researcher up to confidently publish findings against other models currently in their field. Additionally, researchers may take advantage of the cloud when a paper or conference deadline is approaching and they do not have sufficient time to run all the required testing on their local systems.
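The routing logic behind the three stages can be summarized in a short Python sketch. This is an illustration, not the lab's actual tooling: the function name `pick_stage` is hypothetical, and the cutoffs are assumptions read off the ranges above (up to 2 GPUs on the dual-GPU workstation, up to 10 on the on-premises cluster, more than that in the cloud). The source ranges overlap at their edges, so the exact boundaries are a judgment call.

```python
def pick_stage(gpus_needed, deadline_near=False):
    """Suggest a stage for a job, following the three-stage process above.

    Thresholds are illustrative assumptions, not fixed rules.
    """
    if gpus_needed < 1:
        raise ValueError("gpus_needed must be at least 1")
    # Near a paper or conference deadline, bursting to the cloud may be
    # preferred even for jobs that would otherwise fit on local systems.
    if deadline_near:
        return "cloud"
    if gpus_needed <= 2:   # the workstation in the case study has dual GPUs
        return "workstation"
    if gpus_needed <= 10:  # the cluster stage uses roughly 1-10 GPUs
        return "on-premises cluster"
    return "cloud"         # 10-100+ GPUs for final bursts and big sweeps


for gpus in (1, 8, 64):
    print(gpus, "GPU(s) ->", pick_stage(gpus))
```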

A cost/benefit analysis can highlight the differences between a local workstation and cloud computing. Using a three-stage workflow process similar to Stanford's may work best for many researchers, but each institution has its own resources, budgets, and processes to consider. Cloud computing can provide incredible power, but it can be extremely expensive for projects if researchers haven't obtained cloud grants from Google, AWS, or other funding sources. Interviewees recommended communicating with fellow researchers and IT groups to help determine the best path forward.
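One simple starting point for such an analysis is a break-even estimate: how many GPU-hours of cloud use would cost as much as buying a workstation outright. The Python sketch below is a rough illustration only. The purchase price and hourly cloud rate are hypothetical placeholders, not quotes from any vendor, and the calculation ignores power, maintenance, staff time, and any cloud grants.

```python
def break_even_gpu_hours(workstation_cost, cloud_rate_per_gpu_hour):
    """GPU-hours of cloud rental that cost as much as the workstation."""
    if cloud_rate_per_gpu_hour <= 0:
        raise ValueError("cloud rate must be positive")
    return workstation_cost / cloud_rate_per_gpu_hour


# Hypothetical placeholder figures; substitute your institution's numbers.
WORKSTATION_COST = 20_000.0  # one-time purchase price, in dollars (assumed)
CLOUD_RATE = 2.50            # dollars per GPU-hour (assumed)
NUM_LOCAL_GPUS = 2           # dual-GPU workstation, as in the case study

gpu_hours = break_even_gpu_hours(WORKSTATION_COST, CLOUD_RATE)
# Wall-clock hours if both local GPUs are kept busy the whole time.
wall_clock_hours = gpu_hours / NUM_LOCAL_GPUS

print(f"Break-even after {gpu_hours:.0f} GPU-hours "
      f"(about {wall_clock_hours:.0f} hours with {NUM_LOCAL_GPUS} GPUs busy)")
```

If a lab expects to use far fewer GPU-hours than the break-even figure, renting cloud capacity may be cheaper under these assumptions. If it expects far more, local hardware may pay for itself.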

Image from inside Chelsea Finn’s AI and robotics lab at Stanford University (Image credit: Chelsea Finn, Stanford University)