Computing References and Resources

1.0 Computing Tools and How They Interact

There are many different computing tools used at IHME and on Sim Science in particular. Many of these tools have their own trainings and in-depth resources available on the Hub or IHME Learn. Here, we seek to give an overview of how these tools interact and helpful links to other sources of information.

Versions of this document exist for other teams on the Hub, such as the Cancer team and the Demography team. This document adds Sim Science specific information and common practices.

The Cluster: a group of powerful computers provided by IHME for computationally intensive tasks. These computers can all access a shared file system. For researchers on Sim Science, the most common task we use the cluster for is opening Jupyter notebooks to run code. We will go into more depth on how to access the cluster and other tasks below.

For general information on the cluster, please see Module 3 within the IHME Learn training Computational Infrastructure Level 1.

Terminals: You will need to use a terminal for both accessing the cluster and updating Vivarium documents. There are many options for terminals you can use. In section 2, we will review how to use terminals and how to select one for different applications.

GitHub: GitHub is an internet hosting service we use to control versioning for code and documentation. It uses Git to track changes and allows for multiple users to contribute to the same files simultaneously without overwriting each other’s work.

You will use GitHub by “cloning” a repository to your local machine or the cluster, making changes to the documents in that repository, and then uploading those changes to GitHub again so other users can review and access the edits.

GitHub is the online system we use, but the program you will use on your local machine or on the cluster is called “Git”. Git is a “version control system” used to track changes to projects. It can be used whenever you are operating within a GitHub repository.

You can find a training for how to use Git and some basic commands in Module 2 within the IHME Learn training Computational Infrastructure Level 1. We will see an example of how to clone and make edits using Git in the next section. It is worth noting that you can use Git and GitHub from BOTH the cluster and from terminals on your local computer.
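The clone, edit, and upload cycle described above can be sketched in a terminal as follows. This is a minimal sketch rather than the team's exact workflow: to keep it runnable without network access or a token, a local bare repository in a temporary directory stands in for GitHub, and the file name, commit message, and user details are made-up placeholders.

```shell
# A throwaway directory holds everything, so nothing real is touched.
workdir="$(mktemp -d)"

# Assumption: a local bare repository stands in for the repository on GitHub.
git init --quiet --bare "$workdir/remote.git"

# "Clone" the repository to get a working copy you can edit.
git clone --quiet "$workdir/remote.git" "$workdir/my_clone" 2>/dev/null
cd "$workdir/my_clone"

# Make a change, stage it, and record it in a commit.
echo "Draft notes" > notes.txt
git add notes.txt
git -c user.name="Example User" -c user.email="user@example.com" \
    commit --quiet -m "Add draft notes"

# Upload ("push") the commit so that other users can see it.
git push --quiet origin HEAD

# Confirm that the stand-in remote now contains the commit.
pushed_log="$(git --git-dir="$workdir/remote.git" log --all --oneline)"
echo "$pushed_log"
```

With a real repository, the clone step would instead use the repository's GitHub URL, and git status is a handy way to check what has changed before committing.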

Miniconda: Miniconda aims to simplify package management and deployment for Python and R programming. On Sim Science, we use it for updating Vivarium documents. An example of this will be in the next section.

Link to download Miniconda

Text Editors: For updating Vivarium documents, a text editor is needed. Most of the Sim Science team uses Sublime Text. Others on the team prefer Visual Studio Code or Atom (which was created by GitHub and so works well with their system). In addition to these, there are many other options, including Vim, Notepad++, and Gedit. If you already have a favorite text editor, please use that!

Link to download Sublime Text

Link to download Atom

Python:

Python is the programming language most commonly used by the Sim Science team.

Link to download Python

Please see the information below on versioning for Python.

PyCharm: PyCharm is an integrated development environment (IDE) for running Python code that is designed to be user friendly.

Link to download PyCharm

R:

R is another programming language that is commonly used at IHME, but less commonly used on the Sim Science team.

Link to download R and R Studio

R Studio: R Studio is another user-friendly IDE, similar to PyCharm, but for running R code.

Link to download R and R Studio

Jupyter:

Jupyter is a web-based IDE. That essentially means it’s a place to write code and store code that is online and can integrate well with GitHub and the cluster. You can code in multiple languages in Jupyter including Python and R. This is more commonly used by the Sim Science team than PyCharm. Information on installing and using Jupyter is in the Accessing the Cluster section below.

2.0 How to Access a Terminal

A lot of work you will do requires you to access a terminal. A terminal is a way for you, a user, to communicate with a computer or computing system. There are many options for terminals. This section is written for someone not familiar with terminals, so if you have familiarity or favorite software please feel free to use those!

Writing code into a terminal is called using the command line. IHME has a helpful training on the command line in Module 1 of the IHME Learn training Computational Infrastructure Level 1. Command lines technically interact with an operating system on your computer.

Operating systems are things like Windows or macOS (for Mac computers) that help a computer run and provide some basic infrastructure. Opening and saving files all happen through your operating system. You can change or install multiple operating systems, but that is quite uncommon.

Some operating systems require different command lines to accomplish the same task. The IHME cluster uses the operating system Linux, and so the trainings provided are designed for Linux. Linux and macOS are very similar (they are both Unix-like), but Windows is quite different. Therefore, we recommend that Windows users install a terminal that can accept command lines written for Linux while still allowing you to “speak” with your Windows computer.
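To see which family of operating system a terminal is talking to, you can ask it directly. The sketch below uses the standard uname command; the variable names and message wording are illustrative only, and the MINGW/MSYS patterns are an assumption about what Git Bash typically reports on Windows.

```shell
# uname -s prints the kernel name of the system the terminal is running on.
os_name="$(uname -s)"

case "$os_name" in
  Linux)         os_family="Linux (same family as the cluster)" ;;
  Darwin)        os_family="macOS (Unix-like, close to Linux)" ;;
  MINGW*|MSYS*)  os_family="Windows, via a Linux-style terminal such as Git Bash" ;;
  *)             os_family="another system: $os_name" ;;
esac

echo "This terminal is running on: $os_family"
```

Note that uname is a Linux-style command, so it is not expected to work in the Windows Command Prompt.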

Recommended Terminals for Windows Users:

For updating Vivarium docs and interacting with GitHub, we recommend using Git Bash. This installs automatically with Git for Windows. However, some prefer the Windows Subsystem for Linux (WSL) for its user interface and tools.

Link to download Git and Git Bash

Link to download WSL

For accessing the cluster, we recommend using PuTTY or Bitvise. How to access the cluster is covered in more depth in Section 3.

Recommended Terminals for Mac Users:

Since macOS is similar to Linux and is the base operating system on Mac computers, the pre-installed Terminal app can be used for all your terminal needs.

Other Options:

Most terminals can also be used to access the cluster, although the common practice for Windows users on the Sim Science team is to use separate terminals for working on our local machine and for cluster access.

Git can be used for updating Vivarium docs from Command Prompt. Command Prompt is the terminal that is pre-installed on Windows computers, but it is not Linux based. Therefore the command line trainings will not be applicable if you use this option.

For Mac users, there are other terminal options such as iTerm2 which provides more features than Terminal.

Git Tokens:

“Pushing” things to GitHub will create a prompt asking for a username and password. The username is your GitHub username. Counterintuitively, the password is NOT your GitHub password; it is a unique token that you will need to create. This website has information on creating a token. Many Sim Science users set their token to never expire and save the token where they can reference it later. However, this might compromise security in some cases, so regenerating a token periodically is best practice.

There are also ways to set up terminals so that you do not have to enter this information every time. This is covered in the aliases section below.
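One such setup, offered as a hedged sketch rather than the team's documented method, is Git's built-in credential cache, which keeps your username and token in memory for a limited time after you first type them. The one-hour timeout is an arbitrary example, and the setting is applied to a throwaway repository here so that your real configuration is untouched. The cache helper relies on features of Unix-like systems, so it may not apply in Git Bash on Windows.

```shell
# Assumption: a scratch repository, so the setting below stays local to it.
scratch_repo="$(mktemp -d)"
git init --quiet "$scratch_repo"
cd "$scratch_repo"

# Keep credentials in memory for 3600 seconds (one hour) after they are entered.
git config credential.helper "cache --timeout=3600"

# Show the stored setting to confirm it was applied.
helper_setting="$(git config --get credential.helper)"
echo "credential.helper = $helper_setting"
```

To apply the same setting to all of your repositories, the usual approach is to add --global to the git config command.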

How to Install Conda:

Conda is an open-source, cross-platform, language-agnostic package manager and environment management system. In order to accomplish most tasks at IHME, you will need to install it.

For your local computer, you can use this link to install Miniconda. We recommend installing the 64-bit version of Miniconda3.

Link to download Miniconda

On the cluster, you can use the version of conda provided by the Central Comp team. This is simplest and recommended. To do this, log into the cluster and then enter the command /ihme/code/central_comp/miniconda/bin/conda init. This adds information on how to access conda to your .bashrc file. You will need to restart the terminal for the changes to take effect. This is also noted in the bash files section.

There are other ways to install conda, but the above is simplest.
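After restarting the terminal, a quick check such as the following can confirm whether the shell can now find conda. This is a minimal sketch; the variable name and message wording are illustrative.

```shell
# command -v succeeds if the shell can find a command with this name.
if command -v conda >/dev/null 2>&1; then
  conda_status="available"
else
  conda_status="not found (try re-running conda init and restarting the terminal)"
fi

echo "conda: $conda_status"
```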

3.0 How to Access the Cluster

For this section, we will review cluster set up for a first-time user. Multiple other teams have versions of this information available on the Hub and there is duplicated information with the IHME Learn training for the Cluster. This will be a high-level overview focused on Sim Science specific tasks.

Some Hub pages on accessing the cluster:

The cluster is accessed through the Secure Shell protocol or SSH for short. To access the cluster, an SSH “client” is needed. The client is an application that can make SSH connections.

Both Mac and Windows include command-line SSH clients by default. This means that most terminals can be used to access the cluster. As mentioned above, feel free to use any terminal you are familiar with!

IHME Learn provides information on accessing the cluster from the command line in Module 3 within the IHME Learn training Computational Infrastructure Level 1.
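From the command line, the connection comes down to a single ssh command. The sketch below only builds and prints that command instead of running it, so it can be tried anywhere. The host name and port are the ones listed in the PuTTY instructions below, and your_username is a placeholder for your IHME username.

```shell
# Placeholder: replace with your own IHME username.
cluster_user="your_username"
# Login host and port, as listed in the PuTTY instructions.
cluster_host="gen-slurm-slogin-p01.cluster.ihme.washington.edu"
cluster_port=22

# Assemble the command; copy the printed line into a terminal to connect.
ssh_command="ssh -p ${cluster_port} ${cluster_user}@${cluster_host}"
echo "$ssh_command"
```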

Mac users have to use the above method for cluster access. For Windows users, some SSH clients come with a graphical user interface (i.e., you can “click” on things rather than only type commands). These are more intuitive, and we recommend them if you are new to this type of computing work.

As mentioned above, for Windows users these are PuTTY and Bitvise.

Link to download PuTTY or Bitvise

Accessing the Cluster from PuTTY

We provide step-by-step instructions for accessing the cluster for the first time. These instructions are for PuTTY; if you are using a different SSH client, search for similar information on the Hub or ask a team member for help.

For your first time on PuTTY, you will set up and save the instructions for a slurm session. To do this:

  1. Open up PuTTY

  2. Under “Host Name” enter: gen-slurm-slogin-p01.cluster.ihme.washington.edu

  3. Under “Port” enter: 22

  4. Select SSH connection type

  5. Under “Saved Sessions” enter: slurm (or any other name you choose!)

  6. Hit “Save”

[Image: putty_1.png]

Next, and every time you access the cluster afterward, you can simply select slurm from the list of saved sessions and hit “Open”.

[Image: putty_2.png]

Once you open a PuTTY terminal, you will have to enter your username and IHME password. After that you are connected to the cluster and can enter command lines from your trainings!

[Image: putty_3.png]

Your Bash Configuration Files

Bash configuration files contain information and commands that are used when interacting with the cluster. When the cluster tries to execute some command line prompts, it will look to your bash configuration files for information or filepaths.

You should have two bash configuration files: your bash profile (.bash_profile) and your bash rc (.bashrc). The “rc” stands for “run commands” and comes from the predecessors of Unix.

If you make edits to either of these files, you will need to log out and then back into the cluster before they will take effect.

bash profile

The bash profile is only run when you log in to the cluster, and so is usually a shorter file. It contains a few lines with generic profile settings. You should NOT need to edit this file when you first start. For reference, the lines of code in your bash profile are below. If you think these don’t match what you have, ask a friend to help you troubleshoot.

[[ -e ~/.profile ]] && source ~/.profile    ##Loads generic profile settings
[[ -e ~/.bashrc  ]] && source ~/.bashrc     ##Loads bash rc

bash rc

Your bash rc file is run more frequently and can contain helpful settings and information needed to run certain commands. First, you will need to run the below line of code in your terminal to set up conda. Once you’re logged into the cluster on the terminal of your choosing, run:

$ /ihme/code/central_comp/miniconda/bin/conda init

This information is also covered in the terminal access section above.

Following that, below is a block of code designed to be copied and pasted into your bash rc file. If you’re curious about what this code means, there are some comments included or you can ask a friend. It is mainly settings to make the terminal more user friendly.

Section to copy and paste:

# Source global definitions
if [ -f /etc/bashrc ]; then
  . /etc/bashrc
fi

# don't put duplicate lines or lines starting with space in the history.
# See bash(1) for more options
HISTCONTROL=ignoreboth

# append to the history file, don't overwrite it
shopt -s histappend

# for setting history length see HISTSIZE and HISTFILESIZE in bash(1)
HISTSIZE=-1
HISTFILESIZE=2000

# check the window size after each command and, if necessary,
# update the values of LINES and COLUMNS.
shopt -s checkwinsize

# If set, the pattern "**" used in a pathname expansion context will
# match all files and zero or more directories and subdirectories.
# shopt -s globstar

# make less more friendly for non-text input files, see lesspipe(1)
[ -x /usr/bin/lesspipe ] && eval "$(SHELL=/bin/sh lesspipe)"

# set variable identifying the chroot you work in (used in the prompt below)
if [ -z "${debian_chroot:-}" ] && [ -r /etc/debian_chroot ]; then
    debian_chroot=$(cat /etc/debian_chroot)
fi

# set a fancy prompt (non-color, unless we know we "want" color)
case "$TERM" in
    xterm-color|*-256color) color_prompt=yes;;
esac

# uncomment for a colored prompt, if the terminal has the capability; turned
# off by default to not distract the user: the focus in a terminal window
# should be on the output of commands, not on the prompt
force_color_prompt=yes

if [ -n "$force_color_prompt" ]; then
    if [ -x /usr/bin/tput ] && tput setaf 1 >&/dev/null; then
        # We have color support; assume it's compliant with Ecma-48
        # (ISO/IEC-6429). (Lack of such support is extremely rare, and such
        # a case would tend to support setf rather than setaf.)
        color_prompt=yes
    else
        color_prompt=
    fi
fi

if [ "$color_prompt" = yes ]; then
    PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
else
    PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
fi
unset color_prompt force_color_prompt

# If this is an xterm set the title to user@host:dir
case "$TERM" in
xterm*|rxvt*)
    PS1="\[\e]0;${debian_chroot:+($debian_chroot)}\u@\h: \w\a\]$PS1"
    ;;
*)
    ;;
esac

# enable color support of ls and also add handy aliases
if [ -x /usr/bin/dircolors ]; then
    test -r ~/.dircolors && eval "$(dircolors -b ~/.dircolors)" || eval "$(dircolors -b)"
    alias ls='ls --color=auto'
    #alias dir='dir --color=auto'
    #alias vdir='vdir --color=auto'

    alias grep='grep --color=auto'
    alias fgrep='fgrep --color=auto'
    alias egrep='egrep --color=auto'
fi

# colored GCC warnings and errors
#export GCC_COLORS='error=01;31:warning=01;35:note=01;36:caret=01;32:locus=01:quote=01'

# some more ls aliases
alias ll='ls -alF'
alias la='ls -A'
alias l='ls -CF'

# Add an "alert" alias for long running commands.  Use like so:
#   sleep 10; alert
alias alert='notify-send --urgency=low -i "$([ $? = 0 ] && echo terminal || echo error)" "$(history|tail -n1|sed -e '\''s/^\s*[0-9]\+\s*//;s/[;&|]\s*alert$//'\'')"'

# Alias definitions.
# You may want to put all your additions into a separate file like
# ~/.bash_aliases, instead of adding them here directly.
# See /usr/share/doc/bash-doc/examples in the bash-doc package.

if [ -f ~/.bash_aliases ]; then
    . ~/.bash_aliases
fi

# enable programmable completion features (you don't need to enable
# this, if it's already enabled in /etc/bash.bashrc and /etc/profile
# sources /etc/bash.bashrc).
if ! shopt -oq posix; then
  if [ -f /usr/share/bash-completion/bash_completion ]; then
    . /usr/share/bash-completion/bash_completion
  elif [ -f /etc/bash_completion ]; then
    . /etc/bash_completion
  fi
fi

Aliases are another common thing to include in your bash rc file. More information on what these are and some examples can be found below in the aliases section.

Command Line

Once you have accessed the cluster, you can do a number of things! These are best covered through a few different trainings:

  1. You can move files, check permissions, and explore directories using the command line. More information on this can be found in Module 1 within the IHME Learn training Computational Infrastructure Level 1.

  2. You can start jobs on the cluster. Simple tasks are covered in Module 3 within the IHME Learn training Computational Infrastructure Level 1.

If you need help applying any of these trainings to a practical situation, please ask!
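As a small taste of the first item, the sketch below creates a directory, moves a file into it, and inspects the result. It works in a temporary directory so it is safe to run anywhere; all file and directory names are made up for illustration.

```shell
# Work in a temporary directory so no real files are affected.
sandbox="$(mktemp -d)"
cd "$sandbox"

# Create a file and a directory, then move the file into the directory.
echo "age_group,proportion" > proportions.csv
mkdir data
mv proportions.csv data/

# Explore: print the current directory and list the contents of data/.
pwd
ls -l data

# Check permissions: make the file read-only for everyone, then look again.
chmod 444 data/proportions.csv
ls -l data/proportions.csv
```

The first column of the ls -l output shows the permissions for each file.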

Accessing Jupyter or RStudio from the Cluster

The other most common task for a Sim Science researcher on the cluster is to start a Jupyter session. Information on how to do this can be found on the Hub page here. You will also need to update your bash configuration files in order to complete this, which is covered in depth in the section Your Bash Configuration Files.

Once you have started a session, you will be able to create code, test simulation results, or do quick calculations. Once you have finished coding, you’ll want to follow the same steps as outlined above in the Contributing New Documentation section to save the information on GitHub. All of the same Git commands work on the cluster the same way as on your local machine.

You will need to make sure that you have cloned your repository and are in the appropriate working directory while logged into the cluster. Then you can add, check the status, commit, and push information in a similar way. Researchers will generally create a new GitHub repository with a name starting with vivarium_research, e.g. vivarium_research_ciff_sam. This will store code written by researchers, but not the simulation code itself, which is managed by the engineers in a different repository. Having separate repositories ensures that researchers do not disturb engineering workflow.

Aliases and Other Cluster Tips

Aliases:

It can be annoying to type the same information repeatedly every time you access the cluster. To help with this, you can create aliases. These are shorthand commands for commonly typed things.

Here is a Hub page written by the Cost Effectiveness team on how to set up aliases.

Here, we provide a few copy and paste aliases you can add to your bashrc file. Be sure to update the names to match your project and username. Also, note that once you include these you will need to restart your cluster connection for them to take effect. The alias names themselves are arbitrary. While examples are provided, please name these whatever is short and clear for your use.

The aliases below are:

  1. Starting a Jupyter notebook in your project’s repository

  2. Starting an srun session (note: you can change the memory or other parameters before saving)

  3. Checking on your current jobs on the cluster

# 1. Start a Jupyter notebook in your project's repository
alias jupyter_<PROJECT_NAME>="sh /ihme/singularity-images/rstudio/shells/jpy_rstudio_sbatch_script.sh -e <INSERT_ENVIRONMENT_NAME> -c /ihme/code/central_comp/miniconda/bin/activate -t lab -d /ihme/code/<INSERT_USERNAME>/<INSERT_PROJECT_REPO> -A proj_simscience -p i.q"
# 2. Start an srun session (adjust the memory or other parameters as needed)
alias srun_5G="srun --mem=5G -c 1 -A proj_simscience -p all.q --pty bash"
# 3. Check on your current jobs on the cluster
alias squeue_<USERNAME>="squeue -u <INSERT_USERNAME>"

If you ever forget what settings you included in an alias you can enter the command type <ALIAS_NAME> into the terminal and the full alias code will be displayed.

This is useful if you want to change the parameters of a command as well - simply display the alias code, copy and paste the command into the terminal, and then make needed adjustments before running.
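For example, the following sketch defines a throwaway alias and then displays what it stands for. It uses the alias command followed by the alias name, which prints the stored definition in much the same way as type does; the alias name squeue_demo and the username example_user are placeholders.

```shell
# Define a short alias for a longer command (example_user is a placeholder).
alias squeue_demo="squeue -u example_user"

# Display the full command stored behind the alias.
alias_definition="$(alias squeue_demo)"
echo "$alias_definition"
```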

Setting up Easier Cluster Access:

There are ways to configure access so that getting on the cluster is fewer steps.

For those using PuTTY, you can configure settings such that you do not need to type your username and password every time you access the cluster. This Hub page does a very good job of outlining the steps. However, note that for step 2 of “Configure PuTTY Itself”, this author needed to enter “gen-slurm-slogin-p01.cluster.ihme.washington.edu” instead of “cluster-submit1.ihme.washington.edu”, which is listed on the page.

A similar procedure can be used for Bitvise; instructions are on this webpage.

For those using command line to access the cluster, you can do two things for easier access:

  1. Set up an alias to allow for a shorter command line to access the cluster

  2. Configure your computer to not need your username and password every time

For both of these, this Hub page by the Cost Effectiveness team has a good step by step guide to configuring your setup. If you need help with this process, reach out to someone on the team.

Long Cluster Jobs: When your computer goes to sleep, it loses its connection to the cluster, which cuts off any interactive jobs (i.e., srun sessions) that were running. This can be problematic if a command needs to run overnight. There are a few options to work around this, including screen, MOSH, and tmux. If you need to use these, ask a teammate.

File Systems and Storage

It can be confusing to know where to store code and data on the cluster. Our team has created some best practices for data storage.

For code, please create a new directory under /ihme/code with your username. For example, this might be /ihme/code/lutzes. You should clone GitHub repositories to this location and have all Jupyter notebooks and other code stored here.

For data files, there are two locations based on the size of the data file.

  1. For small data files, store these on GitHub in the same location as your code. Examples might include: a list of nicknames, disease severity proportions by age/sex group, or drug efficacy data. The absolute maximum file size on GitHub is 100 megabytes, but be mindful of including any file over 10 megabytes, especially if there are many such files or if the file changes frequently. Too many large files can slow down the process of making new clones of the repository.

  2. For large files, store these in a shared location on the cluster. Consider making a new folder for each project for data storage.

  3. When you decide where to store data, please also consider any data restrictions that might exist.

Regardless of where you store data, it is important to track updates to data files carefully. Engineers might copy and paste a file into a new location, so updating the file might not actually change what is being used in the sim. Therefore, follow these steps:

  1. Use the naming conventions below to ensure consistency.

  2. Always version up rather than replacing a data file that is used by engineering or is not tracked in GitHub (e.g., create a new file with the current date rather than just replacing with a different file of the same name).

  3. Include the exact file name and location in the docs. This means if you version up a data file, you will need to update the docs to reflect the new name. This ensures the engineers are aware of any changes.

For consistency, please use this naming convention for all files: the file name followed by the date as YYYYMMDD, as in FILENAME_20230309.ext. For example, this might be heart_failure_proportions_20230310.csv.
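The versioning steps and naming convention above can be sketched as follows. This is an illustration only: it works in a temporary directory, the file contents are made up, and it assumes the date portion of the convention is the current date written as YYYYMMDD.

```shell
# Work in a temporary directory with a stand-in for an existing data file.
data_dir="$(mktemp -d)"
old_file="$data_dir/heart_failure_proportions_20230310.csv"
echo "age_group,proportion" > "$old_file"

# Build today's date stamp in YYYYMMDD form.
date_stamp="$(date +%Y%m%d)"
new_file="$data_dir/heart_failure_proportions_${date_stamp}.csv"

# Version up: write the updated data to a new dated file and keep the old one.
# The guard avoids overwriting a file that already carries today's date.
if [ ! -e "$new_file" ]; then
  cp "$old_file" "$new_file"
  echo "made_up_group,0.5" >> "$new_file"   # made-up row standing in for an update
fi

# Both versions now sit side by side.
ls "$data_dir"
```

After versioning up, remember to update the file name recorded in the docs so that it matches the new file.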

4.0 Conda Environments

A conda environment is a “workspace” in which you can run code with certain packages installed. You can install a package in a conda environment without affecting any other conda environments: they are isolated from one another.

This allows you to have multiple projects that each have their own separate set of packages and package versions. Below are some common questions on environments.

What is an environment again? It’s a “workspace” that contains a specific collection of packages that you have installed. Basically, it is a shortcut to have all the relevant packages you need for a project in one place.

What are the advantages to having separate environments? Over time, new versions of packages come out. It can therefore be helpful to create new environments to ensure you have the latest package versions.

While you can uninstall and reinstall new versions of packages in existing environments, this can sometimes cause errors in existing code. Therefore, it is helpful to keep environments that work with existing code and to create new environments for new projects and install the latest versions of packages in those.

What environments are available for me to use? The Central Computation team maintains an environment, which anyone at IHME can use, that includes all the packages necessary for accessing GBD results (plus some other common packages). However, this environment is read-only. Read-only means you can use it, but you can’t change it. So if you want any packages not included there, you will need to make your own environment.

Another option is to copy the engineering team’s environment for a particular project. For this option, you will technically make your own environment, but rather than selecting packages by yourself, you will just install everything the engineering team is using. However, since you are making your own environment you can also add new packages or update as needed.

If you are not familiar with environments, we recommend this option as it is straightforward but still allows you to make a personal environment.

Instructions for how to do this are found in the readme section of the engineering GitHub page for your project. For example, these are the CVD environment instructions. If you are having trouble locating these for your projects, ask an engineering team member.

Another common option is to make your own environment for a project. If you are familiar with environments, this is a recommended approach. It is common practice for each researcher to make a new environment for each project they work on. They may even make multiple environments if they want to use different package versions in different parts of a project.

How do you make a new environment? Before you can make a new environment, ensure that you have git and conda installed. Instructions for this can be found above if needed.

Once these are installed, open your preferred terminal. Make sure you are working in the right place (your local machine or the cluster) for where you want this environment to live. Then, run the commands below:

$ conda create --name=ENVIRONMENT_NAME python=3.8
# conda will download Python and base dependencies
$ conda activate ENVIRONMENT_NAME
(ENVIRONMENT_NAME) $ pip install <INSERT PACKAGE NAME HERE>

From here, repeat the pip install line for all packages you wish to include.

How do I install new packages to an existing environment? Once you have made a new environment and activated it, you can add packages using pip install followed by the package name. A list of common packages to install is provided below. You can also include multiple packages in a single command. For convenience, a code snippet you can copy and paste is included here with some common packages.

$ pip install numpy pandas scipy risk_distributions statsmodels matplotlib seaborn db_queries get_draws gbd_mapping

Common Packages:

Packages for data manipulation and statistics:

  • NumPy (usually imported as np)

  • Pandas (usually imported as pd)

  • SciPy

  • risk_distributions (more information)

  • statsmodels (usually imported as sm or smf)

Packages for visualization:

  • Matplotlib (usually imported as plt)

  • Seaborn (usually imported as sns)

Packages for accessing GBD data (shared function information):

  • db_queries

  • get_draws

  • gbd_mapping

Troubleshooting:

Packages usually have to be in your environment before you can import them in Python. If an import command fails, try installing the package to the environment and restarting the Jupyter kernel (for example Kernel -> Restart in the Jupyter Notebook menu).

However, there are some common packages that do not require a pip install and come pre-loaded into Python. A partial list is included below for clarity. These do still need to be imported at the start of a notebook.

  • math

  • warnings

  • random

The IHME specific packages for accessing GBD data should only be used on the cluster (db_queries, get_draws and gbd_mapping). If you are creating an environment on your local machine, these will not install correctly and should be removed from the pip statement above.

Some packages depend on other Python packages or cannot be installed using the pip command. If you attempt to install a package and encounter errors, ask a friend for help.

When should I use the GBD environment vs making my own? In general, it is best practice to use your own environment for project work. However, the GBD environment is helpful for small tasks and non-project work.

I installed a package to this environment on the cluster - why won’t it work? Your local machine and the cluster are different and don’t “speak” between environments. So if you install a package to an environment while on the cluster, it won’t show on your local machine.
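When a package seems to be missing, it can help to confirm which machine and which environment the terminal is actually using. The sketch below relies on the CONDA_DEFAULT_ENV variable, which conda sets when an environment is activated; the variable names and message wording are illustrative.

```shell
# The machine this terminal is running on (local computer or a cluster node).
machine_name="$(uname -n)"

# conda sets CONDA_DEFAULT_ENV on activation; fall back to "none" if it is unset.
active_env="${CONDA_DEFAULT_ENV:-none}"

echo "Machine: $machine_name"
echo "Active conda environment: $active_env"
```

If conda is available, conda env list shows every environment on the current machine, and pip list shows the packages installed in the active one.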

What is Python vs Conda vs Anaconda? Python is the name of a programming language. It defines the syntax used in code.

Conda is a package manager that we use to create and maintain environments. It is designed to allow for easier package installation and control across team members.

Anaconda is a software distribution you can use to install Python and conda, and to create conda environments, on Windows (it is also available for macOS and Linux). It is specifically designed for data science.