Last modified: 27 Oct 2022

Implementation of the Virtual Computing Laboratories

Overview
Uses of the VCL
Storage Cloud
Compute Nodes
Orchestration
Management Nodes
Reservation System
Provisioning System

Overview

A Virtual Computing Laboratory (VCL) is a collection of managed, reconfigurable host computers, sometimes with an efficient storage system attached.

These are the phases of the VCL implementation plan:

Phase I: preconfigured, dedicated virtual machines (VMs) with full administrator privileges
This is the current (27 Oct 2022) state of implementation.

Since 2009, however, the definition of a VCL computer (host) has broadened to include nearly any lab computer, high-end workstation or server set up with a hypervisor (predominantly type 2). Hosts that run VMs controlling SET Labs critical infrastructure are excluded, although it is sometimes expedient or prudent to run a critical infrastructure VM on some host in the VCL, as was the case for the Moodle VM in the early-to-mid-2010s and is the case for Cadence in 2020.

By 2019, each SET lab's room of computers was considered a VCL in its own right, because the computers ran a Windows 10 VM on top of a Fedora host and their configuration was capable of running at least one additional VM. A 4-core CPU host would likely not be used for additional VMs, since two cores are reserved for OS use and Windows 10 runs better with 2 cores than with 1.

The TPS 202/302 hosts can run several VMs due to CPU core counts of 12 and 16, respectively, and RAM of 32GB and 64GB, respectively. However, TPS 302 hosts were also configured with GPUs, at first for use in "TCSS 556 Advanced Machine Learning", but later for student research.

The original plan was to use a special VM with a guest operating system oriented toward machine learning or the GPU software, and to pass the GPUs through to the VM. That would allow control of the host's resources such that local (in-lab) use of the Windows 10 VM would not be adversely affected. It turned out that PCI pass-through support in VirtualBox was experimental and part of a costly additional extension package, and it could not be made to work. In addition, the in-lab usage is currently very light.

Instead, SET Lab staff provided remote ssh access to the hosts (via "/root/scripts/add_gpu_user"), which allows students to use the full CPU and GPU. Since these computers also use "prosumer-grade" NVMe SSDs for their storage and machine learning is I/O-intensive, there is heavier than normal wear on the SSDs for hosts with popular GPUs ("g2080tia" and "g2080tib").
Phase II: ability to reserve virtual machines (VMs) on VCL hosts and specify applications to be ready to use during the reserved time
The plan was to use Apache VCL for phase II, but its support for VirtualBox was very different from our way of using VirtualBox, and it took too long for a grad student to get something working.

In addition, Apache VCL uses a pool of predefined VMs, and we have found the need to define VMs every quarter vs. reusing predefined ones. We do re-use the VM definitions (as opposed to the virtual disk containing the OS and applications). Reasons for our approach include the desire to have the latest updates to the OS, changing course needs, and newer versions of the OS and/or applications (e.g., Kali version change).
Phase III: ability to reserve "bare metal" hardware (hosts in a VCL) and modify virtual machines, to install and configure operating systems, networks and applications
When this project started in 2009, it was thought there might be several courses that needed bare metal hardware. In 2016, this need materialized for computer science graduate courses, such as "Cloud and Virtualization Systems Engineering" and capstone projects (especially when performance measurements are desired for theses or conference papers). However, since nothing is set up within our VCL system, real computers with remote access microcontrollers (e.g., Dell's DRAC) are provided as needed.

A last-resort alternative to a remote access microcontroller is an IP KVM switch and a web power switch.

Storage Cloud

A storage cloud exists for both VCL 1 (8 nodes) and VCL 3 (originally 12 nodes; 9 as of 2022). However, the storage nodes (sns) are not being used for VM storage and loading at this time, partly because of slow network speeds and because glusterfs has not been configured, but also because there is less need for saving VMs (more disk space and/or better disk utilization on the compute nodes, or cns).

VCL 4, at a 5 to 1 ratio of VMs to sns, would require 12 sns per host (7 hosts in VCL 4, so 7*12=84 storage nodes). That would overwhelm a single 48-port switch. Stacking 2-3 48-port 1Gb switches using 10Gb connections might handle the speeds, but 84 storage nodes would still consume a lot of space, electricity and ventilation. These calculations were made in 2009, when we only had magnetic, spinning hard drives and an estimated 20MB/sec disk write speed. In 2020, with NVMe SSDs on both compute nodes and storage nodes connected to a 10GbE switch, one might be able to increase the ratio of VMs to sns drastically, decreasing the total number of storage nodes needed.
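The arithmetic above can be reproduced with a short sketch. The figure of 60 VMs per host is not stated in the text; it is inferred here from 12 storage nodes per host at a 5 to 1 ratio. The function names are illustrative only, and the switch count ignores the ports needed for the hosts themselves and for stacking links.

```python
import math


def storage_nodes_needed(hosts: int, vms_per_host: int, vms_per_storage_node: int) -> int:
    """Storage nodes required when each node serves a fixed number of VMs."""
    per_host = math.ceil(vms_per_host / vms_per_storage_node)
    return hosts * per_host


def switches_needed(ports_required: int, ports_per_switch: int = 48) -> int:
    """Minimum number of switches that offer enough ports (no uplinks counted)."""
    return math.ceil(ports_required / ports_per_switch)


# ASSUMPTION: 60 VMs per host, inferred from 12 storage nodes * 5 VMs each.
nodes = storage_nodes_needed(hosts=7, vms_per_host=60, vms_per_storage_node=5)
print("storage nodes:", nodes)
print("48-port switches for storage nodes alone:", switches_needed(nodes))

# A hypothetical higher ratio (for example, 30 VMs per storage node) shows
# how quickly the node count drops as per-node throughput improves.
print("at 30:1:", storage_nodes_needed(hosts=7, vms_per_host=60, vms_per_storage_node=30))
```

The 30 to 1 ratio in the last line is purely an illustration of the trend described for faster storage and networking, not a measured value.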

Compute Nodes

Compute nodes are really all we use. Instead of storing VMs on storage nodes, we store them on the compute nodes. The VMs remain up, and therefore consume resources on the host, until the VM owner shuts them down. As a result, each host has dedicated VMs and often cannot handle any more VMs than were initially provisioned on it.

Unless snapshots are planned for in advance, we often run out of disk space on hosts with savvy users who take a lot of snapshots.

Most courses use more than one VM per student or team. That strains our hosts. Sometimes we can ask faculty to allow the use of only two VMs at the same time, rather than simultaneous use of all of the VMs allocated per student. We have had up to nine VMs per student requested in the same quarter for the same course, and that required some education of the faculty about resource limitations and costs.

The use of the VMs is almost always for one quarter only, although some graduate use can extend longer. That way, we can free up a host for the next quarter's needs vs. having to keep the VMs for an undetermined amount of time.

Virtual networks of VMs became important over the years. While it is relatively easy to create and manage a virtual network on the same host, it is more difficult (but still possible, with overlay networking) to manage one across hosts. Often, the demand is for two or three VMs per student and the class size is 30 to 40 students. That combination easily exhausts the resources of a 64-core, 256GB RAM host, so the course needs require splitting the class's VMs across multiple hosts. If all VMs are to be networked on a private subnet, we either have to add a virtual router (another VM) to the configuration, or use overlay virtual networking.
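A rough capacity estimate shows why such a class spills onto more than one host. This is a sketch under stated assumptions: the per-VM sizes (2 virtual CPUs and 4GB RAM), the two cores held back for the host OS, and the rule of no CPU or RAM overcommit are all hypothetical values chosen for illustration, not figures from our configuration.

```python
import math


def hosts_needed(students: int, vms_per_student: int,
                 vcpus_per_vm: int, ram_gb_per_vm: int,
                 host_cores: int = 64, host_ram_gb: int = 256,
                 reserved_cores: int = 2) -> int:
    """Estimate hosts required if every VM runs at once with no overcommit."""
    total_vms = students * vms_per_student
    fit_by_cpu = (host_cores - reserved_cores) // vcpus_per_vm
    fit_by_ram = host_ram_gb // ram_gb_per_vm
    vms_per_host = min(fit_by_cpu, fit_by_ram)  # the scarcer resource decides
    if vms_per_host < 1:
        raise ValueError("a single VM does not fit on the host")
    return math.ceil(total_vms / vms_per_host)


# ASSUMPTION: 40 students, 3 VMs each, 2 vCPUs and 4GB RAM per VM.
print(hosts_needed(students=40, vms_per_student=3, vcpus_per_vm=2, ram_gb_per_vm=4))
```

In practice not every VM runs at the same time, so the real requirement is usually lower than this worst-case figure.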

Phase I: preconfigured VMs
There is a considerable amount of detail here about the use of the scripts to create, manage, monitor and destroy VMs, and how the .vcl file is created and distributed to users.
Phase II: reservable VMs
Not implemented.
Phase III: reservable bare metal
Not implemented.

Orchestration

Starting in 2017, we attempted to create a means for someone to control the set of VMs assigned to a user (or team) in a course. This is the first attempt to document it, although a good deal of documentation is present in ticket 6089.

Definitions

Orchestration is the process of viewing or changing the state of VMs, either when they are running or when they are powered off. The user who orchestrates is called the conductor. The conductor has administrator rights over all of the VMs associated with a user, for all users in a given course.

The conductor often has the same set of VMs as the students, but the conductor's VMs are not orchestrated.

The root user sets up the conductor such that the conductor account can manage the VMs and gather information about them. That VM information is saved in a database to organize and improve the responsiveness of managing the VMs.

Conductor Setup

The first thing that happens after VMs are created for all students is that the instructor or someone the instructor delegates responsibility to is designated as the conductor. The instructor notifies the VCL administrator about the choice of conductor. The VCL administrator then allows the conductor's account to orchestrate with the "setup_orch" script. The general form of the command is:

Usage: setup_orch -c conductor -q quarter class [user]
       where: conductor is the UW Net ID of the controlling account
              quarter is the qqqyyyy quarter name
              class is the name of the course or a unique id
              user is an optional user name

For example, let's say we are setting up a course ("tinfo452") in the Spring 2020 quarter ("spr2020") for an instructor who will be the conductor ("costarec"):

cd /root/scripts
./setup_orch -c costarec -q spr2020 tinfo452

That will cause an "orch" folder to be created in "/classroom/home/tinfo452/costarec", and that folder will contain a subfolder called "tinfo452". Inside the "tinfo452" folder will be one VCL information file for each student, of the form "uwnetid.vcl.yaml". Also in the "tinfo452" folder is a list of all student information, in the "student_info" file.

The "uwnetid.vcl.yaml" file is a YAML version of the ".vcl" file, converted from each student's host home directory's ".vcl" file (e.g., "/classroom/home/tinfo452/srondeau/.vcl"). "/root/scripts/dump_vcl_file" converts the .vcl file into YAML to make it easier to programmatically extract the information using a Perl YAML module.
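As an illustration of the folder layout described above, the following sketch lists the student or team ids found in a conductor's "orch/class" folder by taking the part of each "*.vcl.yaml" file name before the first period. The function name is hypothetical and the demonstration uses a temporary folder rather than a real host path; it does not read the YAML contents.

```python
import tempfile
from pathlib import Path


def owner_ids(orch_class_dir):
    """Return the ids encoded in the *.vcl.yaml file names, sorted by name."""
    ids = []
    for path in sorted(Path(orch_class_dir).glob("*.vcl.yaml")):
        # "srondeau.vcl.yaml" -> "srondeau"
        ids.append(path.name.split(".", 1)[0])
    return ids


# Demonstration with a stand-in for a folder such as
# /classroom/home/tinfo452/costarec/orch/tinfo452
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("srondeau.vcl.yaml", "_452e1.vcl.yaml", "student_info"):
        (Path(demo_dir) / name).write_text("")
    print(owner_ids(demo_dir))
```

The "student_info" file is skipped because it does not match the "*.vcl.yaml" pattern.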

In the student information file, each line is of the form "uwnetid<tab>real name<tab>preferred name". The student information is pulled from a file called "/root/spr2020.tinfo452", which is created especially for orchestration from class list information.
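The tab-delimited format of the student information file is simple to read programmatically. The sketch below assumes exactly three fields per line, as described above; the sample names are made up for illustration and the function and class names are hypothetical.

```python
from typing import List, NamedTuple


class Student(NamedTuple):
    uwnetid: str
    real_name: str
    preferred_name: str


def parse_student_info(text: str) -> List[Student]:
    """Parse lines of the form uwnetid<tab>real name<tab>preferred name."""
    students = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 tab-separated fields")
        students.append(Student(*fields))
    return students


# Made-up sample data; real files are generated from class list information.
sample = "student1\tFirst Example Student\tFirst\nstudent2\tSecond Example Student\tSecond\n"
for student in parse_student_info(sample):
    print(student.uwnetid, "->", student.preferred_name)
```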

If the conductor does not have a user account on the host, one must be created for the conductor, with a base home directory the same as the students' (e.g., "/classroom/home/tinfo452"). In addition, the conductor account must be a member of "vbvmuser" and "sshusers" (e.g., "usermod -a -G sshusers,vbvmuser costarec"). This must be done on every host that has this course's VMs on it.

If the conductor did not previously have a user account, the conductor also does not have any VMs for this course. A VCL information file is still needed for orchestrating the VMs, so a ".vcl" file should be copied from a student to the conductor's cssgate home directory as ".vcl2"; e.g.,

scp /classroom/home/tinfo452/srondeau/.vcl root@cssgate.insttech.washington.edu:/home/INSTTECH/costarec/.vcl2

That ".vcl2" file must have the conductor's UW Net ID and host password on the first line.

If more than one host is involved in the course, the "*.vcl.yaml" files need to be copied to the first host's orch/class folder (e.g., "costarec/orch/tinfo452"), and the conductor's password must be forced to be the same on all hosts. The conductor must also have a VCL information YAML file; for example, "costarec.vcl.yaml", with simple contents like:

---
vcl_user: costarec
password: '452@2020'

That file will allow the conductor access to other hosts for which the conductor has an account with the same password as listed.

If more than one class is being conducted, the information is stored under the first class's home directory (e.g., "/classroom/home/tinfo452/thok/orch"), which is the master directory for the conductor. A symbolic link named for the new class must be created that points to the first class's home directory. For example, if the first class is "tinfo452" and the second is "thok2020":

ln -s /classroom/home/tinfo452 /classroom/home/thok2020

Creating or Updating the VCL Information DB Tables

The conductor has privileges to run "/root/scripts/monitor_vms", which will update the "vclinfo" FirebirdSQL database on cssgate if "-s" is used. Once the VMs are created for a course, monitor_vms should be run on each host used by the given course, via "orch". For example:

orch -c tinfo452 "monitor_vms -c tinfo452 -d -s"

The class given first (after "orch -c") is required and gets the conductor to the right "orch/course" folder containing all VCL information for the course. The command and arguments in double quotes are the actual command the conductor wants to run; in this case, on all hosts for the course "tinfo452", get the VM details ("-d") and save them to the database ("-s"). "monitor_vms" is run only once per host, and no substitutions (e.g., "%v%") are done on its arguments.

If any changes are made via orchestration or by the students, the database should be updated the same way to reflect the current state of the course VMs. The VCL information in the database is timestamped, and the logic to manage it should always respect the latest information.
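The rule that the latest information wins can be sketched as follows. The layout of the "vclinfo" tables is not described here, so the record fields ("owner", "vm", "state", "recorded_at") are hypothetical stand-ins, and the sample timestamps are made up.

```python
from datetime import datetime


def latest_per_vm(records):
    """Keep only the most recently recorded entry for each (owner, vm) pair."""
    latest = {}
    for record in records:
        key = (record["owner"], record["vm"])
        current = latest.get(key)
        if current is None or record["recorded_at"] > current["recorded_at"]:
            latest[key] = record
    return latest


# HYPOTHETICAL records; field names are not taken from the real database.
records = [
    {"owner": "_452e4", "vm": "win10a", "state": "powered off",
     "recorded_at": datetime(2020, 4, 1, 9, 0)},
    {"owner": "_452e4", "vm": "win10a", "state": "running",
     "recorded_at": datetime(2020, 4, 2, 21, 5)},
]
current = latest_per_vm(records)
print(current[("_452e4", "win10a")]["state"])
```

The same effect could be achieved in SQL by selecting the row with the maximum timestamp per VM; the Python form is shown only to make the rule explicit.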

Orchestration Web Page

As of April 2020, a web page to help a conductor orchestrate VMs is under further development. Its scope is currently limited to setting up a virtual network amongst VMs assigned to students in a course.

Orchestration Script

A script called "/usr/local/bin/orch" exists on all compute nodes. It is intended to allow a conductor to manage VMs via each student's VM information and the "/usr/local/bin/ssh_vbm" script. Basically, the conductor serves as a proxy user for the student, who has full control over their own VMs. "orch" is also used via "monitor_vms" to extract information about the VMs from the VM definitions and the state of the VMs, for immediate consumption or for saving in database tables.

A conductor could manage the VMs from the command line, but doing so takes a lot of knowledge about the hypervisor and the interface for managing the VMs. The hope for a website is to make some common tasks easier, in part by allowing the conductor to create custom groups of VMs or students and to apply a change to the entire group.

A common usage scenario is setting up virtual networking. While the VCL administrator strives to set up the VMs as desired out of the box, sometimes needs change during the quarter or as the course progresses. The conductor could change the VMs to be networked as desired.

Another common usage scenario is increasing or decreasing the amount of virtual RAM used by the VMs. One may want to increase a Windows Server VM's default RAM to handle Exchange, for example, and then decrease it for the next class exercise to save host RAM and possibly improve performance.

Here is the general form of the orch command:

Usage: orch [-s] [-h ip] -c class [-u users] [-g users] [-v vm_name] ["cmd_and_args"]

where:

              -s is an optional flag to suppress class/owner output
              -h ip is an optional ip address of the host
              -c class is a required course number or project name
              -u users is an optional list of user accounts for use with cmd
              -v vm_name is an optional VM name for use with cmd
              -g users means to return student info for given users
                 -- returns tab-delimited lines of
                    user/real_name/preferred_name
              cmd_and_args is the ssh_vbm command to run if -g not used;
                           for args, use where needed:
                                     %c% for class,
                                     %u% for user
                                     %v% for VM name

"-s" is used by monitor_vms to keep the output in columns (one value per column). By default, "-s" is not used, so the class/owner lines are printed to clarify what is happening. For example, if one wants to know the status (state) of a particular VM ("win10a") of the tinfo452 course for all students (the "entire class orch command"):

orch -c tinfo452 -v win10a "vcl_status %v%"

The orch command consults the tinfo452 subfolder of the orch folder belonging to the conductor who runs it. The names of the "*.vcl.yaml" files provide the student or team ids (the part before the first period of the file name), and the files themselves contain the VCL information for all of that student's VMs. Each student's VM names are searched for one that matches the desired VM name ("win10a"). If it is found, the "vcl_status" command is sent via ssh to the user and host of the VM, with the VM name substituted for "%v%", as if the user had sent the command personally. For example, if the user is srondeau and the IP address of the host is 140.142.71.13, this orch command:

orch -c tinfo452 -u srondeau -v win10a "vcl_status %v%"

issues a command that looks like:

echo tinfo452/srondeau/ ; ssh srondeau@140.142.71.13 vcl_status win10a

which may return this output:

tinfo452/srondeau/win10a
State:                       powered off (since 2020-04-14T15:24:52.000000000)

More commonly, for the entire class orch command above, there are several users or teams, so this could be the orch command output:

tinfo452/_452e1/win10a
State:                       powered off (since 2020-03-23T16:56:19.000000000)
tinfo452/_452e2/win10a
State:                       powered off (since 2020-04-10T01:07:23.000000000)
tinfo452/_452e3/win10a
State:                       powered off (since 2020-04-13T20:28:34.000000000)
tinfo452/_452e4/win10a
State:                       running (since 2020-04-02T21:03:07.026000000)
tinfo452/_452e5/win10a
State:                       powered off (since 2020-04-08T03:09:32.046000000)
tinfo452/_452e6/win10a
State:                       powered off (since 2020-04-11T07:57:23.855000000)
tinfo452/_452e7/win10a
State:                       running (since 2020-04-09T17:31:24.377000000)
tinfo452/_452e8/win10a
State:                       running (since 2020-04-11T02:58:00.473000000)
tinfo452/_452e9/win10a
State:                       powered off (since 2020-04-12T02:19:23.258000000)
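Output in the form shown above, with a class/owner/VM line followed by a "State:" line, could be summarized with a small parser. This is only a sketch: it assumes the two-line pattern holds for every VM and keeps the timestamp as text, and the function name is hypothetical.

```python
import re

# Matches lines such as: State:    powered off (since 2020-04-14T15:24:52.000000000)
STATE_RE = re.compile(r"^State:\s+(?P<state>.+?)\s+\(since (?P<since>[^)]+)\)\s*$")


def parse_orch_status(output: str):
    """Turn alternating header/State lines into a list of dictionaries."""
    results = []
    header = None
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = STATE_RE.match(line)
        if match and header is not None:
            course, owner, vm = header
            results.append({"class": course, "owner": owner, "vm": vm,
                            "state": match.group("state"),
                            "since": match.group("since")})
            header = None
        elif line.count("/") == 2:
            header = tuple(line.split("/"))  # class/owner/vm
    return results


sample = """tinfo452/_452e3/win10a
State:                       powered off (since 2020-04-13T20:28:34.000000000)
tinfo452/_452e4/win10a
State:                       running (since 2020-04-02T21:03:07.026000000)
"""
running = [r["owner"] for r in parse_orch_status(sample) if r["state"] == "running"]
print(running)
```

A summary like this makes it easy to see which teams have left a VM running.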

The orch command's "cmd_and_args" is any valid command (and its arguments) that ssh_vbm accepts. Since ssh_vbm allows any VBoxManage subcommand to be executed, one can control any aspect of the VM by issuing the subcommand and its arguments, with %c%, %u%, and %v% used where needed in the arguments.
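The placeholder substitution can be illustrated with a few lines of code. How the orch script performs the substitution internally is not shown in this document, so the following is a sketch of the idea only, with hypothetical function names; it builds the ssh argument list but does not run it.

```python
def expand_placeholders(cmd_and_args: str, course: str, user: str, vm: str = "") -> str:
    """Replace %c%, %u% and %v% with the class, user and VM name."""
    return (cmd_and_args.replace("%c%", course)
                        .replace("%u%", user)
                        .replace("%v%", vm))


def build_ssh_command(host_ip: str, course: str, user: str, vm: str, cmd_and_args: str):
    """Build an argument list resembling the ssh command issued per VM."""
    remote = expand_placeholders(cmd_and_args, course, user, vm)
    return ["ssh", f"{user}@{host_ip}", remote]


args = build_ssh_command("140.142.71.13", "tinfo452", "srondeau", "win10a",
                         "vcl_status %v%")
print(" ".join(args))
```

Returning a list rather than a single string keeps the remote command as one argument if the list is later passed to a process-launching function.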

Management Nodes

We started this project with management nodes to support Apache VCL's web-based reservation system, but eventually abandoned them because we never adopted Apache VCL (Phase II).

Provisioning System

Everything is manually provisioned. The scripts help to automate creating VMs for a class, but it is still time-consuming, especially if the guest OS needs to be installed and configured. Windows VMs are very time-consuming, since we either sysprep them (requiring post-creation work) or re-install trial versions to prevent them from expiring during the quarter.

Our ability to use Windows OSes for coursework comes from both a UW Microsoft Campus Agreement (allowing Windows to be virtualized) and the DreamSpark Premium program.

Backup Disk

We originally had backup disks attached to the switch, but did not use them. They were intended to back up the storage nodes, which we don't use to store VM images. We don't back up anything, and we have been fortunate that this has never affected us: we have not lost any VMs in the middle of a quarter. There are quite a few hosts we can choose from should a host fail.