Working with Remote Hosts

Speaker: Alex Townsend (Cornell University)

Overview


Overview of what a Cluster is


Terminology

In Picture Form

In this picture, you are the client. For the majority of clusters you will encounter, you will only be allowed to ssh into the head node. Some clusters will allow you to login into a worker directly, but this is generally rare.

Once you have logged into the head node, in order to farm out computations to the workers you will generally submit jobs to a queue. Depending on the resources available (e.g. how many other people are currently using the cluster), and how many resources you are requesting, it may not launch right away.

What the queuing system is and how to farm out work depends on the cluster you have logged into.

Convenient Login to Remote Hosts


Required Setup

If you are not on campus, then you will not be able to log into our cluster. You can identify if this is the reason you are unable to login if you try and ssh in and the command just hangs (use ctrl+c to kill it). To have an on-campus IP address, you will need to use the Cornell VPN.

Follow the directions to install and run the VPN here.

The Tool

ssh [opts] <username@remote.host>

  • username is the username on the remote host.
  • remote.host is the url of the server you want to log into.
    • IP Address: 128.253.51.103
    • Symbolic name: en-cs-totient-01.coecis.cornell.edu
  • Use -l to specify username (no need for @ anymore).
  • -p <port>: connect to a specific port (may be necessary depending on the server).
  • Can forward graphical programs (NOT the entire screen):
    • Enable X11 forwarding with -X.
    • Enable "trusted" X11 forwarding with -Y (actually less secure, only use if needed).

We will be using ssh, which is a secure shell login protocol. This tool enables you to login into and execute commands on the remote using the terminal on your client.

The instructions below are for Unix systems. If you are using Windows, browse through the below to understand the workflow, then follow the directions in the SSH for Windows section below.

So, since my username on Totient is my Cornell NetID (in this case sjm324), I could log into Totient using both of the following:

# Using the symbolic name
$ ssh sjm324@en-cs-totient-01.coecis.cornell.edu
# Using the IP Address:
$ ssh sjm324@128.253.51.103

Making SSH Convenient

Rarely would you actually want to enter your username or password everytime you try and log into a server. The solution is to use SSH Keys. For these, there are

  1. A public key you copy to any cluster / server you are working with
    • Generated with a .pub file name.
  2. A private key that you should never share with anybody.
    • Generated with a _rsa postfix to the name of your key by default.

The steps are simple enough:

  1. Generate your SSH key (if you do not have one). Follow the instructions in this tutorial.
    • For the following steps, I assume you generated the files
      • The public key ‘id’: ~/.ssh/id_rsa.pub
      • The private key ‘id’: ~/.ssh/id_rsa
    • If you entered something else, change the commands below accordingly.
  2. Add the newly generated SSH key to your it to your keychain:

    # Adds the key to your ssh keyring
    $ ssh-add ~/.ssh/id_rsa
    

    Note that we are adding the private key to our local ssh keyring, not the public key.

  3. We need to copy the public component of the key to the host. The directions below are inlined from here, replace all instances of <NetID> with your NetID.

    # Create the ~/.ssh directory on the remote
    $ ssh <NetID>@en-cs-totient-01.coecis.cornell.edu mkdir -p .ssh
    # Copy the public part of the key from the client to the host
    $ cat .ssh/id_rsa.pub | ssh <NetID>@en-cs-totient-01.coecis.cornell.edu 'cat >> .ssh/authorized_keys'
    

    At this point, you have now registered your SSH key with Totient. You can now use passwordless login:

    $ ssh en-cs-totient-01.coecis.cornell.edu
    
  4. Typing the entire host name the entire time is tedious. We will setup a config file on the client (your computer).

    # Create the file if it does not exist
    $ touch ~/.ssh/config
    

    Now edit it with your favorite text editor to include

    Host totient
    Hostname en-cs-totient-01.coecis.cornell.edu
    User <NetID>
    ForwardAgent yes
    

With all that, you should now be able to just execute

$ ssh totient

and login to the Totient cluster. Every now and then you may be asked to enter the password for your actual SSH key, but this will not be that frequent.

SSH for Windows

By default, Windows does not have an SSH client. If you are not using Cygwin or Bash for Windows 10, you will need to install 3rd party software (that is well established and reliable).

  1. To login to a server as well as generate SSH keys, you will need to download Putty (the program to log into remote hosts with) and PuttyGen (to generate the keys). These can both be downloaded from here.
  2. To set up SSH, follow the tutorial here.

Git: An enthusiastic user’s point-of-view


Presentation courtesy of Alex Townsend.

Abstract:

Git is a version control system that is changing the way collaborative projects (software, courses, research, etc.) can be managed. It is widely regarded as a superior alternative to svn, Dropbox, and Google docs for mildly tech-savvy individuals. I will discuss the basic workflow for Git and Github as an enthusiastic everyday user.

Game Changing Git


Two extraordinarily powerful tools with git are the difftool and the mergetool. They are exactly what they sound like:

  1. The difftool allows you to view what has changed between the last commit and now.
  2. The mergetool allows you to have git present all of the merge conflicts for you automatically. You can then fix the conflicts and write the file(s) to successfully merge!

GUI Options

At this point, the most popular GUI client appears to be meld. If you are developing on OSX, then when you installed XCode, you received a copy of ‘FileMerge’.

Terminal Options

It is worth mentioning that when you are working on a cluster, you will not be able to have a gui client launch without significant effort. If you are already familiar with vim, then the following commands will enable you to use vim to assist with your merges and diffing.

  1. Tell git to use vim for the git difftool. For starters, vimdiff is not the only tool. If you execute git difftool in your terminal, it should inform you that it is not set, and list the options available. Try them out!

    # set vimdiff as the difftool
    $ git config --global diff.tool vimdiff
    # if you want, let `git d` serve as an alias
    $ git config --global alias.d difftool
    

    Go find a repository and start changing files. Compare the differences between running git diff and git difftool from your terminal. We are confident you will never go back.

  2. Tell git to use vim for the git mergetool. As with before, vimdiff is not the only tool. If you execute git mergetool in your terminal, it should inform you that it is not set, and list the options available. Try them out! Note: you will probably want to keep your difftool and your mergetool as the same thing, if only for coherence in presentation.

    # tell git to use vimdiff as the mergetool
    $ git config merge.tool vimdiff
    # the diff3 conflict style tends to be easier to read
    $ git config merge.conflictstyle diff3
    # always open up the merge conflicts automatically
    $ git config mergetool.prompt false
    

    These instructions were pasted from here for convenience. This article is invaluable in explaining what is going on when using vimdiff as your merge tool.

    Once you get comfortable, you will probably want to also execute

    # http://stackoverflow.com/a/1251696/3814202
    $ git config --global mergetool.keepBackup false
    

    so that all of those *.orig files after a merge conflict are not generated. You’re tracking the code with git after all, if you really break things you have both commits on their own and can just re-do the merge if necessary.

Version Control with Large Files


Git is far from the only version control system out there. You are likely to encounter many forms in the wild, but generally speaking the most popular tools are git, svn, and hg (mercurial). While they achieve similar tasks, they are somewhat fundamentally different.

Since git is a decentralized version control system, you should avoid tracking large files with it. Any time there is a small change on a binary file (such as your datasets, game assets, or even pdf’s), a binary diff is generated and stored in the history. Over time, these small changes produce huge git repositories without any real need.

None of the options below are perfect, but they are at least functional.

Git Annex

Git Annex is an excellent tool for this purpose. It has been around for a while and is reliable. The downside is that anybody who would want to clone your repository must also have git-annex installed.

Git Large File Storage (LFS)

Git LFS is a relatively new tool that attempts to solve the underlying issues of using git with large files similar to how git submodules are handled. Instead of directly tracking files, a brief text file with a pointer to a repository and commit (an LFS server) are stored.

This tool is new, and still has a lot of room for growth. Although in theory widespread adoption of this tool would benefit the service providers (GitHub, Atlassian, etc.), this is not quite the reality. Most providers are going to bill you by bandwidth used, meaning if you pay for a 1GB quota and have a 500MB data file, you got to upload and download your file once. Then you will get billed for every clone after.

The solution at this point in time, if you have the resources, would be to setup your own git-lfs server and host your data there. Note that some service providers will only enable git-lfs for public repositories.

Using just mercurial (hg)

Mercurial has large files extension, you can learn more here.

Using git and svn together

Since svn is centralized, it is an effective tool for tracking large files. Tracking your code with svn, though, is (in this author’s opinion) less effective than git due to how merge conflicts are resolved. In general, you should be good to go if you use:

  1. git to track your code.
  2. svn to track your data.
    • E.g. if you have a data directory in your repository, add that to your .gitignore, and use svn to clone your actual data repo there.

There exists a git-svn tool described in this article, but you may have better mileage if you just manually control the repositories and provide explicit instructions on your README of how to get all the necessary data for your project.

Configuration is Key


You may feel like your terminal is uncomfortable, or wish it could look better. Now that you have the ability to access a cluster, if you want to get some practice cloning a repository as well as make your terminal more appealing in one step, follow the directions here.