Working with Remote Hosts

Speaker: Alex Townsend (Cornell University)

Overview

Overview of what a Cluster is
- Terminology you will encounter, and a visual depiction of how it works.
Convenient Login to Remote Hosts
- Log into remote hosts using ssh.
Git: An enthusiastic user’s point-of-view
- Presentation by Alex Townsend.
- Overview of using git and related tools (e.g. GitHub) effectively.
  - Working with collaborators, managing students.
Game Changing Git
- Wield the git difftool and git mergetool.
Version Control with Large Files
- Special considerations are needed when working with large data.
Configuration is Key
- An example repository to clone and configure your Unix terminal.

Overview of what a Cluster is

Terminology

The server you are logging into is called the remote (host).
You (the user) are referred to as the client.
When referring to clusters, you may come across the following terms:
- The master node (sometimes called head).
- The slave nodes (the workers).
- You often are only allowed to log into the master node.
- There is usually a queuing system (with a submission command such as qsub) that farms jobs out to the available slaves.
- In most scenarios, you get charged by the number of cores / resources you are using.

In Picture Form

In this picture, you are the client. For the majority of clusters you will encounter, you will only be allowed to ssh into the head node. Some clusters will allow you to log in to a worker directly, but this is rare.

Once you have logged into the head node, you will generally farm out computations to the workers by submitting jobs to a queue. Depending on the resources available (e.g. how many other people are currently using the cluster) and how many resources you are requesting, your job may not launch right away.

What the queuing system is and how to farm out work depends on the cluster you have logged into.
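As a rough sketch of what submitting a job can look like, the snippet below writes a small job script for a PBS/Torque-style scheduler and submits it with qsub only if that command is present. The job name, resource request, and file name `hello.pbs` are hypothetical; check your cluster's documentation for the directives and limits it actually accepts.

```shell
# Write a minimal job script (file name and resource values are hypothetical).
cat > hello.pbs <<'EOF'
#!/bin/bash
#PBS -N hello-cluster
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
#PBS -j oe

# PBS starts jobs in your home directory; move to where qsub was run.
cd "$PBS_O_WORKDIR"
echo "Running on $(hostname)"
EOF

# Submit only when a qsub command exists (i.e. on the head node).
if command -v qsub >/dev/null 2>&1; then
    qsub hello.pbs
else
    echo "qsub not found: run this on the cluster's head node"
fi
```

The `#PBS` lines are comments to the shell but are read by the scheduler, which is how the resource request travels with the script.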

Required Setup

If you are not on campus, then you will not be able to log into our cluster. A sign that this is the problem is that ssh just hangs when you try to connect (use ctrl+c to kill it). To have an on-campus IP address, you will need to use the Cornell VPN.

Follow the directions to install and run the VPN here.

Your login is your Cornell NetID and corresponding password (e.g. what you would enter to log into studentcenter).

The Tool

`ssh` - Secure SHell

ssh [opts] <username@remote.host>

username is the username on the remote host.
remote.host is the hostname or IP address of the server you want to log into.

IP Address: 128.253.51.103
Symbolic name: en-cs-totient-01.coecis.cornell.edu

Use -l to specify username (no need for @ anymore).
-p <port>: connect to a specific port (may be necessary depending on the server).
Can forward graphical programs (NOT the entire screen):

Enable X11 forwarding with -X.
Enable "trusted" X11 forwarding with -Y (actually less secure, only use if needed).

We will be using ssh, which is a secure shell login protocol. This tool enables you to log in to the remote and execute commands on it using the terminal on your client.

The instructions below are for Unix systems. If you are using Windows, skim the steps below to understand the workflow, then follow the directions in the SSH for Windows section.

So, since my username on Totient is my Cornell NetID (in this case sjm324), I could log into Totient using either of the following:

# Using the symbolic name
$ ssh sjm324@en-cs-totient-01.coecis.cornell.edu
# Using the IP Address:
$ ssh sjm324@128.253.51.103

Making SSH Convenient

Rarely would you actually want to enter your username or password every time you log into a server. The solution is to use SSH keys. A key comes in two parts:

A public key you copy to any cluster / server you are working with
- Generated with a .pub file name.
A private key that you should never share with anybody.
- Generated with a _rsa suffix on the name of your key by default.

The steps are simple enough:

Generate your SSH key (if you do not have one). Follow the instructions in this tutorial.
- For the following steps, I assume you generated the files
  - The public key ‘id’: ~/.ssh/id_rsa.pub
  - The private key ‘id’: ~/.ssh/id_rsa
- If you entered something else, change the commands below accordingly.
Add the newly generated SSH key to your keychain:
```
# Adds the key to your ssh keyring
$ ssh-add ~/.ssh/id_rsa
```
Note that we are adding the private key to our local ssh keyring, not the public key.

We need to copy the public component of the key to the host. The directions below are inlined from here; replace all instances of <NetID> with your NetID.

# Create the ~/.ssh directory on the remote
$ ssh <NetID>@en-cs-totient-01.coecis.cornell.edu mkdir -p .ssh
# Copy the public part of the key from the client to the host
$ cat ~/.ssh/id_rsa.pub | ssh <NetID>@en-cs-totient-01.coecis.cornell.edu 'cat >> .ssh/authorized_keys'
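
To make the two commands above less opaque, here is a sketch of what they do on the remote side, replayed against a scratch directory on your own machine. The directory and the key line are placeholders. The duplicate check and the `chmod` step are additions that are not in the original directions: OpenSSH servers commonly ignore `authorized_keys` when the file or `~/.ssh` is writable by other users.

```shell
# Scratch stand-in for the remote home directory (placeholder, not a real host).
remote_home="$(mktemp -d)"
# Placeholder standing in for the contents of ~/.ssh/id_rsa.pub.
pubkey="ssh-rsa AAAAexample-placeholder-key user@client"

# Step 1: mkdir -p succeeds whether or not .ssh already exists.
mkdir -p "$remote_home/.ssh"

# Step 2: append the key, skipping it if that exact line is already present.
# The loop runs twice to show that the second attempt adds nothing.
touch "$remote_home/.ssh/authorized_keys"
for attempt in 1 2; do
    if ! grep -qxF "$pubkey" "$remote_home/.ssh/authorized_keys"; then
        printf '%s\n' "$pubkey" >> "$remote_home/.ssh/authorized_keys"
    fi
done

# Step 3 (addition): restrict permissions so only the owner can write.
chmod 700 "$remote_home/.ssh"
chmod 600 "$remote_home/.ssh/authorized_keys"

ls -l "$remote_home/.ssh"
```

If passwordless login still prompts for a password after copying the key, the permissions on the remote `~/.ssh` directory are a reasonable first thing to check.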

At this point, you have now registered your SSH key with Totient. You can now use passwordless login:

$ ssh <NetID>@en-cs-totient-01.coecis.cornell.edu

Typing the entire host name every time is tedious. We will set up a config file on the client (your computer).
```
# Create the file if it does not exist
$ touch ~/.ssh/config
```
Now edit it with your favorite text editor to include
```
Host totient
Hostname en-cs-totient-01.coecis.cornell.edu
User <NetID>
ForwardAgent yes
```

With all that, you should now be able to just execute

$ ssh totient

and log in to the Totient cluster. Every now and then you may be asked to enter the passphrase for your SSH key, but this will not happen often.
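
If you would rather script the config step, the sketch below appends the `totient` entry to a config file only when it is missing. It writes to a scratch file rather than your real `~/.ssh/config`, `netid123` is a placeholder NetID, and the final check assumes an OpenSSH release recent enough to support `ssh -G`, which prints the settings that would be used without connecting.

```shell
# Scratch config file so the real ~/.ssh/config is left untouched.
config="$(mktemp)"
netid="netid123"   # placeholder: substitute your own NetID

# Append the entry only if no "Host totient" line exists yet.
if ! grep -q '^Host totient$' "$config"; then
    cat >> "$config" <<EOF
Host totient
    Hostname en-cs-totient-01.coecis.cornell.edu
    User $netid
    ForwardAgent yes
EOF
fi

# Show what ssh would use for "totient" without opening a connection.
if command -v ssh >/dev/null 2>&1; then
    ssh -G -F "$config" totient | grep -E '^(hostname|user|forwardagent) ' \
        || echo "ssh -G is not supported by this ssh"
fi
```

Running the script a second time against the same file leaves it unchanged, which makes it safe to keep in a setup script.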

SSH for Windows

By default, Windows does not have an SSH client. If you are not using Cygwin or Bash for Windows 10, you will need to install 3rd party software (that is well established and reliable).

To log in to a server as well as generate SSH keys, you will need to download PuTTY (the program to log into remote hosts with) and PuTTYgen (to generate the keys). These can both be downloaded from here.
To set up SSH, follow the tutorial here.

Git: An enthusiastic user’s point-of-view

Presentation courtesy of Alex Townsend.

Abstract:

Git is a version control system that is changing the way collaborative projects (software, courses, research, etc.) can be managed. It is widely regarded as a superior alternative to svn, Dropbox, and Google Docs for mildly tech-savvy individuals. I will discuss the basic workflow for Git and GitHub as an enthusiastic everyday user.

Game Changing Git

Two extraordinarily powerful tools with git are the difftool and the mergetool. They are exactly what they sound like:

The difftool allows you to view what has changed between the last commit and now.
The mergetool allows you to have git present all of the merge conflicts for you automatically. You can then fix the conflicts and write the file(s) to successfully merge!

GUI Options

At this point, the most popular GUI client appears to be meld. If you are developing on OS X, then when you installed Xcode, you received a copy of ‘FileMerge’.

Terminal Options

It is worth mentioning that when you are working on a cluster, you will not be able to launch a GUI client without significant effort. If you are already familiar with vim, then the following commands will enable you to use vim to assist with your merges and diffing.

Tell git to use vim for the git difftool. For starters, vimdiff is not the only tool. If you execute git difftool in your terminal, it should inform you that it is not set, and list the options available. Try them out!
```
# set vimdiff as the difftool
$ git config --global diff.tool vimdiff
# if you want, let `git d` serve as an alias
$ git config --global alias.d difftool
```
Go find a repository and start changing files. Compare the differences between running git diff and git difftool from your terminal. We are confident you will never go back.
Tell git to use vim for the git mergetool. As with before, vimdiff is not the only tool. If you execute git mergetool in your terminal, it should inform you that it is not set, and list the options available. Try them out! Note: you will probably want to keep your difftool and your mergetool as the same thing, if only for coherence in presentation.
```
# note: without --global, these settings apply only to the current repository
# tell git to use vimdiff as the mergetool
$ git config merge.tool vimdiff
# the diff3 conflict style tends to be easier to read
$ git config merge.conflictstyle diff3
# always open up the merge conflicts automatically
$ git config mergetool.prompt false
```
These instructions were pasted from here for convenience. This article is invaluable in explaining what is going on when using vimdiff as your merge tool.
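
If you do not have a conflicted repository handy, the following sketch builds a throwaway one so you can try the tools. The file name, branch name, identity, and commit messages are made up for the example, and the configuration is applied to this scratch repository only.

```shell
# Build a scratch repository with one committed file.
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
git config merge.conflictstyle diff3       # local to this repository
echo "hello" > greeting.txt
git add greeting.txt
git commit -q -m "Add greeting"
base="$(git rev-parse --abbrev-ref HEAD)"  # default branch name varies

# Change the same line on two branches.
git checkout -q -b feature
echo "hello from feature" > greeting.txt
git commit -q -am "Edit greeting on feature"
git checkout -q "$base"
echo "hello from base" > greeting.txt
git commit -q -am "Edit greeting on base"

# The merge is expected to stop with a conflict in greeting.txt.
git merge feature || echo "merge stopped with a conflict, as intended"
git status --short
cat greeting.txt
```

With the diff3 style, the conflicted file shows three sections: your side, the common ancestor (after the `|||||||` marker), and the other side. From inside that directory, running `git mergetool` should open greeting.txt in whichever tool you configured.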

Once you get comfortable, you will probably want to also execute
```
# http://stackoverflow.com/a/1251696/3814202
$ git config --global mergetool.keepBackup false
```
so that all of those *.orig files are not generated after a merge conflict. You’re tracking the code with git after all. If you really break things, you still have both commits on their own and can just redo the merge if necessary.

Version Control with Large Files

Git is far from the only version control system out there. You are likely to encounter many in the wild, but generally speaking the most popular tools are git, svn, and hg (mercurial). While they achieve similar tasks, they differ in fundamental ways.

Since git is a decentralized version control system, every clone carries the full history, so you should avoid tracking large files with it. Any time a binary file (such as your datasets, game assets, or even PDFs) changes, even slightly, another version of it is stored in the history, and binary data often does not compress well against earlier versions. Over time, these small changes produce huge git repositories without any real need.
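
A quick way to see this growth for yourself is the experiment below, which commits two versions of a 1 MiB file to a scratch repository and reports the repository size. The file name and sizes are arbitrary choices for illustration, and the second version is entirely new random bytes rather than a small edit, which is close to a worst case because random data barely compresses.

```shell
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

# First version: 1 MiB of random bytes standing in for a dataset.
head -c 1048576 /dev/urandom > data.bin
git add data.bin
git commit -q -m "Add data"

# Second version: a completely new 1 MiB of random bytes.
head -c 1048576 /dev/urandom > data.bin
git commit -q -am "Update data"

# The working tree holds one 1 MiB file, but the history holds both versions.
ls -l data.bin
git count-objects -vH
du -sk .git
```

The `.git` directory ends up roughly twice the size of the file you can actually see, and everyone who clones the repository downloads all of it.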

None of the options below are perfect, but they are at least functional.

Git Annex

Git Annex is an excellent tool for this purpose. It has been around for a while and is reliable. The downside is that anybody who would want to clone your repository must also have git-annex installed.

Git Large File Storage (LFS)

Git LFS is a relatively new tool that attempts to solve the underlying issues of using git with large files, similar to how git submodules are handled. Instead of tracking the file directly, git stores a brief text file that points to the real contents, which are kept on an LFS server.
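
For a feel of what such a pointer looks like, the sketch below builds one by hand for a scratch file. It follows the three-line layout of the Git LFS pointer format (version, SHA-256 object id, size in bytes) but does not call git-lfs itself, so treat it as an illustration rather than something to commit. The file names and the file size are arbitrary.

```shell
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
head -c 500000 /dev/urandom > dataset.bin   # stand-in for a large data file

# Hash with whichever SHA-256 tool is installed.
if command -v sha256sum >/dev/null 2>&1; then
    oid="$(sha256sum dataset.bin | cut -d ' ' -f 1)"
else
    oid="$(shasum -a 256 dataset.bin | cut -d ' ' -f 1)"
fi
size="$(wc -c < dataset.bin | tr -d ' ')"

# The pointer is what git would track in place of the file itself.
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:%s\nsize %s\n' \
    "$oid" "$size" > dataset.pointer
cat dataset.pointer
wc -c dataset.bin dataset.pointer
```

The pointer stays a few hundred bytes at most no matter how large the data file is, which is why the git history itself stays small.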

This tool is new, and still has a lot of room for growth. Although in theory widespread adoption of this tool would benefit the service providers (GitHub, Atlassian, etc.), this is not quite the reality. Most providers are going to bill you by bandwidth used, meaning if you pay for a 1GB quota and have a 500MB data file, you get to upload and download your file once. Then you will be billed for every clone after that.

The solution at this point in time, if you have the resources, would be to set up your own git-lfs server and host your data there. Note that some service providers will only enable git-lfs for public repositories.

Using just mercurial (`hg`)

Mercurial has a large files extension; you can learn more here.

Using `git` and `svn` together

Since svn is centralized, it is an effective tool for tracking large files. Tracking your code with svn, though, is (in this author’s opinion) less effective than git due to how merge conflicts are resolved. In general, you should be good to go if you use:

git to track your code.
svn to track your data.
- E.g. if you have a data directory in your repository, add that to your .gitignore, and use svn to check out your actual data repo there.
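
The layout above might be set up roughly as follows. The directory name `data`, the file `big.bin`, and the Subversion URL in the comment are hypothetical; the `svn checkout` line is left commented out because it needs a real data repository.

```shell
repo="$(mktemp -d)"
cd "$repo" || exit 1
git init -q

# Keep git from ever tracking the data directory.
echo "data/" >> .gitignore

# In a real project the data would come from svn, for example:
#   svn checkout https://svn.example.edu/my-project-data/trunk data
# Here a placeholder file stands in for the checked-out data.
mkdir -p data
head -c 1024 /dev/urandom > data/big.bin

# check-ignore exits 0 (and -v names the matching rule) for ignored paths.
git check-ignore -v data/big.bin
git status --short
```

Only `.gitignore` shows up as untracked, so the data never enters the git history even if someone runs `git add .`.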

There exists a git-svn tool described in this article, but you may get better mileage if you just control the repositories manually and provide explicit instructions in your README on how to get all the necessary data for your project.

Configuration is Key

You may feel like your terminal is uncomfortable, or wish it could look better. Now that you have the ability to access a cluster, if you want to get some practice cloning a repository as well as make your terminal more appealing in one step, follow the directions here.