Speaker: Alex Townsend (Cornell University)
git and related tools (e.g. GitHub) effectively.
master node (sometimes called the head node)
slave nodes (the workers).
qsub) that submits
jobs that get farmed out to the (available) slaves.
In this picture, you are the client. For the majority of clusters you will encounter, you will only be allowed to ssh into the head node. Some clusters will allow you to log in to a worker directly, but this is generally rare.
Once you have logged into the head node, you will generally submit jobs to a queue in order to farm out computations to the workers. Depending on the resources available (e.g. how many other people are currently using the cluster) and how many resources you are requesting, your job may not launch right away.
What the queuing system is and how to farm out work depends on the cluster you have logged into.
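As a rough sketch of what submitting a job can look like, assume your cluster runs a PBS/Torque-style scheduler with the qsub command mentioned above. The script below writes a tiny job file and submits it only if qsub is present. The file name hello.pbs, the job name, and the resource requests are illustrative assumptions, not settings for any particular cluster; check your cluster's documentation for the directives it actually accepts.

```shell
# Write a minimal PBS-style job script into a temporary directory.
# (hello.pbs and the resource requests below are illustrative assumptions.)
job_file="$(mktemp -d)/hello.pbs"
cat > "$job_file" <<'EOF'
#!/bin/bash
#PBS -N hello
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
#PBS -j oe
# Start in the directory the job was submitted from.
cd "$PBS_O_WORKDIR"
echo "Hello from $(hostname)"
EOF

# Submit only if a qsub command is actually available on this machine.
if command -v qsub >/dev/null 2>&1; then
    qsub "$job_file"
else
    echo "qsub not found; on the head node you would run: qsub $job_file"
fi
```

The job does not run in your terminal. The scheduler starts it on a worker when resources free up, which is why it may sit in the queue for a while.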
If you are not on campus, then you will not be able to log into our cluster.
You can tell that this is the reason you are unable to log in if you try to ssh in and the command just hangs (use ctrl+c to kill it). To have an on-campus IP address, you will need to use the Cornell VPN.
Follow the directions to install and run the VPN here.
ssh [opts] <username>@<remote.host>
username is the username on the remote host.
remote.host is the URL of the server you want to log into.
-l to specify username (no need for
-p <port>: connect to a specific port (may be necessary depending on the server).
-Y (actually less secure, only use if needed).
We will be using ssh, which is a secure shell login protocol. This tool enables you to log in to the remote host and execute commands on it using the terminal on your client.
The instructions below are for Unix systems. If you are using Windows, read through the steps below to understand the workflow, then follow the directions in the SSH for Windows section below.
So, since my username on Totient is my Cornell NetID (written as <NetID> below), I could log into Totient using either of the following:
# Using the symbolic name
$ ssh <NetID>@en-cs-totient-01.coecis.cornell.edu
# Using the IP address
$ ssh <NetID>@<IP address>
Rarely would you actually want to enter your username or password every time you try to log into a server. The solution is to use SSH keys. For these, there are
_rsa postfix to the name of your key by default.
The steps are simple enough:
Add the newly generated SSH key to your keychain:
# Adds the key to your ssh keyring
$ ssh-add ~/.ssh/id_rsa
Note that we are adding the private key to our local ssh keyring, not the public key.
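The ssh-add step assumes a key pair already exists at ~/.ssh/id_rsa. If you have not generated one yet, ssh-keygen creates it. The sketch below generates a throwaway pair in a temporary directory so that it cannot overwrite a real key; the empty passphrase and the "demo key" comment are assumptions made for the demonstration only.

```shell
# Generate a throwaway RSA key pair in a temporary directory (demo only).
key_dir="$(mktemp -d)"
ssh-keygen -q -t rsa -b 4096 -N "" -C "demo key" -f "$key_dir/id_rsa"

# The private key is id_rsa, the public key is id_rsa.pub.
ls "$key_dir"

# Print the fingerprint of the public key.
ssh-keygen -l -f "$key_dir/id_rsa.pub"
```

For a real key you would typically run ssh-keygen -t rsa -b 4096 on its own, accept the default location, and choose a passphrase when prompted.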
We need to copy the public component of the key to the host. The directions below are inlined from here; replace all instances of <NetID> with your NetID.
# Create the ~/.ssh directory on the remote
$ ssh <NetID>@en-cs-totient-01.coecis.cornell.edu mkdir -p .ssh
# Copy the public part of the key from the client to the host
$ cat ~/.ssh/id_rsa.pub | ssh <NetID>@en-cs-totient-01.coecis.cornell.edu 'cat >> .ssh/authorized_keys'
At this point, you have now registered your SSH key with Totient. You can now use passwordless login:
$ ssh <NetID>@en-cs-totient-01.coecis.cornell.edu
Typing the entire host name every time is tedious. We will set up a config file on the client (your computer).
# Create the file if it does not exist
$ touch ~/.ssh/config
Now edit it with your favorite text editor to include:
Host totient
    Hostname en-cs-totient-01.coecis.cornell.edu
    User <NetID>
    ForwardAgent yes
With all that, you should now be able to just execute
$ ssh totient
and log in to the Totient cluster. Every now and then you may be asked to enter the passphrase for your actual SSH key, but this will not be that frequent.
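If the alias does not behave as expected, one way to check how ssh interprets a config file is ssh -G, which prints the settings it would use for a host without opening a connection. The sketch below assumes a reasonably recent OpenSSH client. It writes the config to a temporary file instead of ~/.ssh/config, and abc123 is a made-up NetID used only for illustration.

```shell
# Write the same Host block to a temporary file instead of ~/.ssh/config.
# (abc123 is a hypothetical NetID.)
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
Host totient
    Hostname en-cs-totient-01.coecis.cornell.edu
    User abc123
    ForwardAgent yes
EOF

# Ask ssh which settings it would use for the alias, without connecting.
resolved="$(ssh -G -F "$cfg" totient | grep -E '^(user|hostname|forwardagent) ')"
echo "$resolved"
```

Against your real config you would simply run ssh -G totient and look for the hostname and user lines.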
Historically, Windows has not shipped with an SSH client (recent Windows 10 releases do include an OpenSSH client). If yours does not have one and you are not using Cygwin or Bash for Windows 10, you will need to install third-party software (that is well established and reliable).
Presentation courtesy of Alex Townsend.
Git is a version control system that is changing the way collaborative projects (software, courses, research, etc.) can be managed. It is widely regarded as a superior alternative to svn, Dropbox, and Google Docs for mildly tech-savvy individuals. I will discuss the basic workflow for Git and GitHub as an enthusiastic everyday user.
Two extraordinarily powerful tools with git are the difftool and the mergetool. They are exactly what they sound like:
difftool allows you to view what has changed between the last commit and now.
mergetool allows you to have git present all of the merge conflicts for you automatically. You can then fix the conflicts and write the file(s) to successfully merge!
At this point, the most popular GUI client appears to be meld. If you are developing on OS X, then when you installed Xcode, you received a copy of ‘FileMerge’.
It is worth mentioning that when you are working on a cluster, you will not be able to launch a GUI client without significant effort. If you are already comfortable with vim, then the following commands will enable you to use vimdiff to assist with your merges and diffing.
Tell git to use vim for the git difftool. For starters, vimdiff is not the only tool. If you execute git difftool in your terminal, it should inform you that it is not set, and list the options available. Try them out!
# set vimdiff as the difftool
$ git config --global diff.tool vimdiff
# if you want, let `git d` serve as an alias
$ git config --global alias.d difftool
Go find a repository and start changing files. Compare the differences between git diff and git difftool from your terminal. We are confident you will never go back.
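If you do not have a repository handy, the sketch below builds a throwaway one so you have something to compare. The file name notes.txt and the demo identity are assumptions made for the example; the interactive vimdiff call is left as a comment because it needs a terminal.

```shell
# Build a throwaway repository with one committed file.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
printf 'alpha\nbeta\n' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Change the file without committing.
printf 'alpha\ngamma\n' > notes.txt

# Plain text diff of the working tree against the last commit.
git --no-pager diff

# The same comparison, side by side in vimdiff (interactive):
#   git difftool --tool=vimdiff --no-prompt

# Names of the files that differ from the last commit.
changed="$(git diff --name-only)"
echo "$changed"
```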
Tell git to use vim for the git mergetool. As before, vimdiff is not the only tool. If you execute git mergetool in your terminal, it should inform you that it is not set, and list the options available. Try them out! Note: you will probably want to keep your difftool and your mergetool as the same thing, if only for coherence in presentation.
# tell git to use vimdiff as the mergetool
$ git config merge.tool vimdiff
# the diff3 conflict style tends to be easier to read
$ git config merge.conflictstyle diff3
# always open up the merge conflicts automatically
$ git config mergetool.prompt false
These instructions were pasted from here for convenience. This article is invaluable in explaining what is going on when you use vimdiff as your merge tool.
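To see what the diff3 conflict style changes, the sketch below manufactures a small conflict in a throwaway repository and prints the conflicted file. The branch name feature, the file name file.txt, and the demo identity are assumptions for the example.

```shell
# Build a throwaway repository that uses the diff3 conflict style.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
git config merge.conflictstyle diff3

# Common ancestor.
echo "original line" > file.txt
git add file.txt
git commit -q -m "Base"
base_branch="$(git symbolic-ref --short HEAD)"

# One change on a side branch.
git checkout -q -b feature
echo "feature line" > file.txt
git commit -q -am "Feature change"

# A competing change on the original branch.
git checkout -q "$base_branch"
echo "mainline line" > file.txt
git commit -q -am "Mainline change"

# The merge stops with a conflict, so its non-zero exit status is expected.
git merge feature >/dev/null 2>&1 || true

# Show the conflict markers git wrote into the file.
cat file.txt
```

With diff3, the conflicted file has an extra section, introduced by a line of `|` characters, that holds the common ancestor's version between your side and theirs. Running git mergetool at this point would open the same three versions in vimdiff.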
Once you get comfortable, you will probably want to also execute
# http://stackoverflow.com/a/1251696/3814202
$ git config --global mergetool.keepBackup false
so that all of those *.orig files are not generated after a merge conflict. You’re tracking the code with git after all; if you really break things, you have both commits on their own and can just redo the merge if necessary.
Git is far from the only version control system out there. You are likely to encounter many forms in the wild, but generally speaking the most popular tools are git, svn (Subversion), and hg (Mercurial). While they achieve similar tasks, they differ in some fundamental ways.
Since git is a decentralized version control system, every clone carries the full history, so you should avoid tracking large files with it. Any time there is a small change to a binary file (such as your datasets, game assets, or even PDFs), another version of that file (or a binary delta, which is often nearly as large for compressed formats) is stored in the history. Over time, these small changes produce huge git repositories without any real need.
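To see this accumulate, the sketch below commits a 1 MiB file of random bytes twice, with only a few bytes changed in between, then lists every file object in the history along with its size. The file name data.bin and the demo identity are assumptions for the example.

```shell
# Build a throwaway repository with one binary file.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
head -c 1048576 /dev/urandom > data.bin
git add data.bin
git commit -q -m "Add data"

# Overwrite a few bytes in place (no truncation) and commit again.
printf 'small change' | dd of=data.bin bs=1 seek=100 conv=notrunc 2>/dev/null
git commit -q -am "Tweak data"

# List every blob reachable from history with its size in bytes.
blobs="$(git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob"')"
echo "$blobs"
```

Both full-size versions of data.bin are now part of the history that every clone has to fetch. Git may delta-compress them when it packs the repository, but that tends to help little for formats that are already compressed.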
None of the options below are perfect, but they are at least functional.
Git Annex is an excellent tool for this purpose. It has been around for a while and is reliable. The downside is that anybody who would want to clone your repository must also have Git Annex installed.
Git LFS is a relatively new tool that attempts to solve the underlying issues of using git with large files, similar to how git submodules are handled. Instead of directly tracking the large files, git stores a brief text file that points to the real content, which lives on an LFS server.
This tool is new, and still has a lot of room for growth. Although in theory widespread adoption of this tool would benefit the service providers (GitHub, Atlassian, etc.), this is not quite the reality. Most providers are going to bill you by bandwidth used, meaning that if you pay for a 1GB quota and have a 500MB data file, you get to upload and download your file once. Then you will get billed for every clone after.
The solution at this point in time, if you have the resources, would be to set up your own git-lfs server and host your data there. Note that some service providers will only enable git-lfs for public repositories.
Mercurial has a large files extension; you can learn more here.
Since svn is centralized, it is an effective tool for tracking large files. Tracking your code with svn, though, is (in this author’s opinion) less effective than git due to how merge conflicts are resolved. In general, you should be good to go if you use:
git to track your code.
svn to track your data.
Create a data directory in your repository, add that to your .gitignore, and use svn to check out your actual data.
There exists a git-svn tool described in this article, but you may have better mileage if you just manually control the repositories and provide explicit instructions in your README on how to get all the necessary data for your project.
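A minimal sketch of the manual split described above: git ignores a data directory, and svn would populate it separately. The directory layout and the sample file are assumptions for the example, and the svn command is left as a comment because it needs a real repository URL.

```shell
# Build a throwaway git repository with an ignored data directory.
repo="$(mktemp -d)"
cd "$repo"
git init -q
mkdir data
echo "data/" >> .gitignore

# Stand-in for a large data file that git should never track.
echo "placeholder" > data/sample.csv

# git check-ignore prints the path when an ignore rule matches it.
ignored="$(git check-ignore data/sample.csv)"
echo "$ignored"

# The data itself would come from svn, for example:
#   svn checkout <repository URL> data
```

Your README can then list the svn checkout command so that collaborators know how to fetch the data after cloning the code.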
You may feel like your terminal is uncomfortable, or wish it could look better. Now that you have the ability to access a cluster, if you want to get some practice cloning a repository as well as make your terminal more appealing in one step, follow the directions here.