Speaker: Alex Townsend (Cornell University)
Topics covered: ssh; using git and related tools (e.g. GitHub) effectively; git difftool and git mergetool.
A cluster setup involves the following pieces:
- The remote (host).
- The client.
- The master node (sometimes called the head node).
- The slave nodes (the workers).
- A queue on the master node (e.g. qsub) that submits jobs, which get farmed out to the (available) slaves.
In this picture, you are the client. For the majority of clusters you will encounter, you will only be allowed to ssh into the head node. Some clusters will allow you to log into a worker directly, but this is rare.
Once you have logged into the head node, you will generally farm out computations to the workers by submitting jobs to a queue. Depending on the resources available (e.g. how many other people are currently using the cluster) and how many resources you are requesting, your job may not launch right away.
What the queuing system is and how to farm out work depends on the cluster you have logged into.
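As a rough illustration of what submitting a job looks like, here is a minimal sketch that assumes a PBS/Torque-style scheduler. The mention of qsub above suggests such a scheduler but does not confirm it. The job name hello_job, the file name hello_job.pbs, and the resource requests are all hypothetical; check your cluster's documentation for the directives it actually accepts.

```shell
# Write a minimal PBS/Torque-style job script. The directives below are
# assumptions about the scheduler, not settings known to match any cluster.
cat > hello_job.pbs <<'EOF'
#!/bin/bash
#PBS -N hello_job
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
#PBS -j oe

# Jobs start in the home directory; move to where the job was submitted from
cd "$PBS_O_WORKDIR"
echo "Running on $(hostname)"
EOF

# Submit only if a qsub command is actually available on this machine
if command -v qsub >/dev/null 2>&1; then
    qsub hello_job.pbs
else
    echo "qsub not found; run this on the cluster's head node"
fi
```

On a PBS-style system, qstat can then be used to see whether the job is still waiting in the queue or already running.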
If you are not on campus, you will not be able to log into our cluster. You can tell this is the problem if you try to ssh in and the command just hangs (use ctrl+c to kill it). To get an on-campus IP address, you will need to use the Cornell VPN.
Follow the directions to install and run the VPN here.
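Rather than waiting for a hung connection and pressing ctrl+c, you can ask ssh to give up by itself after a few seconds. The sketch below wraps this in a small helper. The function name try_login is hypothetical. ConnectTimeout is a standard OpenSSH client option.

```shell
# try_login HOST [SECONDS]: attempt an ssh login, giving up after SECONDS
# (default 10) if the connection cannot be established.
# try_login is a hypothetical helper name, not part of ssh itself.
try_login() {
    ssh -o ConnectTimeout="${2:-10}" "$1"
}
```

For example, try_login sjm324@en-cs-totient-01.coecis.cornell.edu 5 would stop trying after five seconds, which is a quick hint that you may need the VPN.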
We will be using ssh, a secure shell login protocol. This tool enables you to log into the remote and execute commands on it using the terminal on your client.
The instructions below are for Unix systems. If you are using Windows, browse through them to understand the workflow, then follow the directions in the SSH for Windows section below.
The general form of the command is:
ssh [opts] <username@remote.host>
- username is the username on the remote host.
- remote.host is the address of the server you want to log into, given either as an IP address (128.253.51.103) or as a symbolic name (en-cs-totient-01.coecis.cornell.edu).
- Use -l to specify the username (no need for @ anymore).
- -p <port>: connect to a specific port (may be necessary depending on the server).
- X11 forwarding with -X.
- Trusted X11 forwarding with -Y (actually less secure, only use if needed).
So, since my username on Totient is my Cornell NetID (in this case sjm324), I could log into Totient using either of the following:
# Using the symbolic name
$ ssh sjm324@en-cs-totient-01.coecis.cornell.edu
# Using the IP Address:
$ ssh sjm324@128.253.51.103
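Rarely would you want to enter your username and password every time you log into a server. The solution is to use SSH keys. A key comes in two parts: a public key, stored in a file whose name ends in .pub, and a private key. By default, an _rsa postfix is added to the name of your key.
The steps are simple enough. First, generate a key pair, which by default produces:
- ~/.ssh/id_rsa.pub (the public key)
- ~/.ssh/id_rsa (the private key)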
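Generating the key pair itself can be sketched with ssh-keygen. To stay safe, this example writes a throwaway key into a scratch directory instead of ~/.ssh, and it uses an empty passphrase so that it runs unattended. Both choices are assumptions made for the demo, not recommendations.

```shell
# Generate a throwaway RSA key pair in a scratch directory (hypothetical
# location, chosen so that no existing key in ~/.ssh is overwritten).
key_dir="$(mktemp -d)"
if command -v ssh-keygen >/dev/null 2>&1; then
    # -N "" sets an empty passphrase so the demo runs unattended;
    # omit -N for a real key and ssh-keygen will prompt for one.
    ssh-keygen -q -t rsa -b 4096 -N "" -f "$key_dir/id_rsa"
    # Lists the private key (id_rsa) and the public key (id_rsa.pub)
    ls "$key_dir"
fi
```

For a real key you would instead pass -f ~/.ssh/id_rsa (or accept the default location when prompted) and choose a passphrase.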
Add the newly generated SSH key to your keychain:
# Adds the key to your ssh keyring
$ ssh-add ~/.ssh/id_rsa
Note that we are adding the private key to our local ssh keyring, not the public key.
We need to copy the public component of the key to the host. The directions below are inlined from here; replace all instances of <NetID> with your NetID.
# Create the ~/.ssh directory on the remote
$ ssh <NetID>@en-cs-totient-01.coecis.cornell.edu mkdir -p .ssh
# Copy the public part of the key from the client to the host
$ cat ~/.ssh/id_rsa.pub | ssh <NetID>@en-cs-totient-01.coecis.cornell.edu 'cat >> .ssh/authorized_keys'
You have now registered your SSH key with Totient, and can use passwordless login:
$ ssh <NetID>@en-cs-totient-01.coecis.cornell.edu
Typing the entire host name every time is tedious. We will set up a config file on the client (your computer).
# Create the file if it does not exist
$ touch ~/.ssh/config
Now edit it with your favorite text editor to include
Host totient
Hostname en-cs-totient-01.coecis.cornell.edu
User <NetID>
ForwardAgent yes
With all that, you should now be able to just execute
$ ssh totient
and log into the Totient cluster. Every now and then you may be asked to enter the passphrase for your actual SSH key, but this will not happen often.
By default, Windows does not have an SSH client. If you are not using Cygwin or Bash for Windows 10, you will need to install 3rd party software (that is well established and reliable).
Presentation courtesy of Alex Townsend.
Abstract:
Git is a version control system that is changing the way collaborative projects (software, courses, research, etc.) can be managed. It is widely regarded as a superior alternative to svn, Dropbox, and Google Docs for mildly tech-savvy individuals. I will discuss the basic workflow for Git and GitHub as an enthusiastic everyday user.
Two extraordinarily powerful tools with git are the difftool and the mergetool. They are exactly what they sound like:
- difftool allows you to view what has changed between the last commit and now.
- mergetool allows you to have git present all of the merge conflicts for you automatically. You can then fix the conflicts and write the file(s) to successfully merge!
At this point, the most popular GUI client appears to be meld. If you are developing on OSX, then when you installed Xcode, you received a copy of ‘FileMerge’.
It is worth mentioning that when you are working on a cluster, you will not be able to have a GUI client launch without significant effort. If you are already familiar with vim, then the following commands will enable you to use vim to assist with your merges and diffing.
Tell git to use vim for the git difftool. For starters, vimdiff is not the only tool. If you execute git difftool in your terminal, it should inform you that it is not set, and list the options available. Try them out!
# set vimdiff as the difftool
$ git config --global diff.tool vimdiff
# if you want, let `git d` serve as an alias
$ git config --global alias.d difftool
Go find a repository and start changing files. Compare the differences between running git diff and git difftool from your terminal. We are confident you will never go back.
Tell git to use vim for the git mergetool. As before, vimdiff is not the only tool. If you execute git mergetool in your terminal, it should inform you that it is not set, and list the options available. Try them out! Note: you will probably want to keep your difftool and your mergetool as the same thing, if only for coherence in presentation.
# tell git to use vimdiff as the mergetool
# (--global applies the setting to all repositories, matching the difftool setup)
$ git config --global merge.tool vimdiff
# the diff3 conflict style tends to be easier to read
$ git config --global merge.conflictstyle diff3
# always open up the merge conflicts automatically
$ git config --global mergetool.prompt false
These instructions were pasted from here for convenience. This article is invaluable in explaining what is going on when using vimdiff as your merge tool.
Once you get comfortable, you will probably want to also execute
# http://stackoverflow.com/a/1251696/3814202
$ git config --global mergetool.keepBackup false
so that the *.orig files left behind after a merge conflict are not generated. You’re tracking the code with git after all: if you really break things, you have both commits on their own and can just redo the merge if necessary.
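The effect of the diff3 conflict style is easiest to see on a deliberately conflicting merge. The following is a minimal sketch in a throwaway repository. The file name notes.txt, the branch name feature, and the demo identity are all hypothetical, and every setting is applied to the scratch repository only, so your global configuration is not touched.

```shell
# Build a scratch repository with two branches that edit the same line
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Demo User"
git config user.email "demo@example.com"
git config merge.conflictstyle diff3

echo "original line" > notes.txt
git add notes.txt
git commit --quiet -m "Add notes"
base_branch="$(git rev-parse --abbrev-ref HEAD)"

git checkout --quiet -b feature
echo "feature line" > notes.txt
git commit --quiet -am "Change on feature"

git checkout --quiet "$base_branch"
echo "mainline line" > notes.txt
git commit --quiet -am "Change on base branch"

# The merge is expected to stop with a conflict, so ignore its exit status
git merge feature >/dev/null 2>&1 || true

# Show the conflicted file, including the diff3 markers
cat notes.txt
```

With diff3 enabled, the conflicted file contains an extra section, introduced by a line of | characters, holding the common ancestor's version ("original line" here) between the two competing changes. Seeing what both sides started from is what tends to make the conflict easier to read.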
Git is far from the only version control system out there. You are likely to encounter many in the wild, but generally speaking the most popular tools are git, svn, and hg (Mercurial). While they achieve similar tasks, they differ in fundamental ways.
Since git is a decentralized version control system, every clone carries the full history, so you should avoid tracking large files with it. Any time a binary file (such as your datasets, game assets, or even PDFs) changes, even slightly, a new version of it is stored in the history. Over time, these small changes produce huge git repositories without any real need.
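Before committing, it can help to check whether any large files are sitting in your working tree. The helper below is a minimal sketch. The name list_large_files and the 10M threshold are hypothetical choices, and the size argument uses find's own notation.

```shell
# list_large_files DIR SIZE: print regular files under DIR larger than SIZE
# (SIZE uses find's notation, e.g. 10M), skipping any .git directory.
# list_large_files is a hypothetical helper name.
list_large_files() {
    find "$1" -name .git -prune -o -type f -size "+$2" -print
}

# Example: report anything over 10 megabytes below the current directory
list_large_files . 10M
```

Anything this reports is a candidate for one of the large-file approaches described below rather than for a plain git add.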
None of the options below are perfect, but they are at least functional.
Git Annex is an excellent tool for this purpose. It has been around for a while and is reliable. The downside is that anybody who wants to clone your repository must also have git-annex installed.
Git LFS is a relatively new tool that attempts to solve the underlying issues of using git with large files, similar to how git submodules are handled. Instead of tracking the files directly, git stores a brief text pointer file in their place, and the actual content lives on an LFS server.
This tool is new, and still has a lot of room for growth. Although in theory widespread adoption of this tool would benefit the service providers (GitHub, Atlassian, etc.), this is not quite the reality. Most providers are going to bill you by bandwidth used, meaning that if you pay for a 1GB quota and have a 500MB data file, you get to upload and download your file once. After that, you will be billed for every clone.
The solution at this point in time, if you have the resources, would be to set up your own git-lfs server and host your data there. Note that some service providers will only enable git-lfs for public repositories.
Mercurial (hg)
Mercurial has a large files extension; you can learn more here.
git and svn together
Since svn is centralized, it is an effective tool for tracking large files. Tracking your code with svn, though, is (in this author’s opinion) less effective than git due to how merge conflicts are resolved. In general, you should be good to go if you use:
- git to track your code.
- svn to track your data.
Create a data directory in your repository, add it to your .gitignore, and use svn to check out (clone) your actual data repo there.
There exists a git-svn tool described in this article, but you may have better mileage if you just manually control the repositories and provide explicit instructions in your README on how to get all the necessary data for your project.
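The manual layout described above (code in git, data in an svn checkout under data) can be sketched as follows. The helper name ignore_data_dir and the svn URL are hypothetical placeholders, and the svn step is left as a comment because it depends on a data repository of your own.

```shell
# ignore_data_dir REPO_DIR: make sure data/ is listed in REPO_DIR/.gitignore.
# ignore_data_dir is a hypothetical helper name; it is safe to run repeatedly.
ignore_data_dir() {
    touch "$1/.gitignore"
    grep -qxF 'data/' "$1/.gitignore" || echo 'data/' >> "$1/.gitignore"
}

# Run from the top of your git repository
ignore_data_dir .

# Then fetch the data with svn (placeholder URL, substitute your own):
# svn checkout https://svn.example.com/myproject-data/trunk data
```

Putting these same commands in your README gives collaborators the explicit instructions they need to obtain the data after cloning the code.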
You may feel like your terminal is uncomfortable, or wish it could look better. Now that you have the ability to access a cluster, if you want to get some practice cloning a repository as well as make your terminal more appealing in one step, follow the directions here.