
Latest revision as of 17:25, 8 February 2019

Knowledge Base / Expanded FAQ

Note: In the Fall of 2018 this FAQ is receiving a major update to account for retirement of old systems. If you still want to see the old FAQ, a snapshot of it is available at this link. Some of the information on this page has now been moved to the Legacy Systems page.

This page is a comprehensive collection of essential information needed to use SHARCNET, gathered conveniently on a single page of our Help Wiki. If you are a new SHARCNET user, this page most likely contains all you need to get going on SHARCNET. However, there is much more information in this Help Wiki. Please use the search box to find pages that may be relevant to you. You can also go to the Main Page of this wiki for a general table of contents. Finally, you can also look at the list of all articles in this Help Wiki or a list of all categories.




About SHARCNET

What is SHARCNET?

SHARCNET stands for Shared Hierarchical Academic Research Computing Network. Established in 2000, SHARCNET is the largest high performance computing consortium in Canada, involving 18 universities and colleges across southern, central and northern Ontario.

SHARCNET is a member consortium in the Compute/Calcul Canada national HPC platform.

Where is SHARCNET?

The main office of SHARCNET is located in the Western Science Centre at The University of Western Ontario. The SHARCNET high performance clusters are installed at a number of the member institutions in the consortium and operated by SHARCNET staff across different sites.

What does SHARCNET have?

The primary SHARCNET compute system is the Graham heterogeneous cluster located at the University of Waterloo. It is named after Wes Graham, the first director of the Computing Centre at Waterloo. It consists of 36,160 cores and 320 GPU devices, spread across 1,127 nodes of different configurations.

What can I do with SHARCNET?

If you have a program that takes months to run on your PC, you could probably run it within a few hours using hundreds of processors on the SHARCNET clusters, provided your program is inherently parallelisable. If you have hundreds or thousands of test cases to run through on your PC or computers in your lab, then running those cases independently on hundreds of processors will significantly reduce your test cycles.

If you have used Beowulf clusters made of commodity PCs, you may notice a performance improvement on SHARCNET clusters, which have high-speed InfiniBand interconnects, as well as on SHARCNET machines which have large amounts of memory. Also, SHARCNET clusters themselves are connected through a dedicated, private connection over the Ontario Research Innovation Optical Network (ORION).

If you have access to supercomputing facilities elsewhere and you wish to share your ideas with us and SHARCNET users, please contact us. Together we can make SHARCNET better.

Who is running SHARCNET?

The daily operation and development of SHARCNET computational facilities is managed by a group of highly qualified system administrators. In addition, we have a team of high performance technical computing consultants, who are responsible for technical support on libraries, programming and application analysis.

How do I contact SHARCNET?

For technical inquiries, you may send E-mail to help@sharcnet.ca, or contact your local system administrator or HPC specialist. For general inquiries, you may contact the SHARCNET main office.

Getting an Account with SHARCNET and Related Issues

Do I need a SHARCNET account?

At this point, the only cases when you need a local SHARCNET account are:

  • If you want to use a legacy cluster.
  • If you want to edit a wiki page on the sharcnet.ca web portal.

In all other cases, you only need your Compute Canada account. This includes:

  • Using a National System (Graham, Cedar, Niagara, Beluga etc).
  • Using orca (it was recently reconfigured to have the same setup - scheduler, file systems, software stack - as the National Systems).

To apply for a Compute Canada account, follow this link

Please note you do not need a SHARCNET account to attend our training events (webinars, Summer Schools etc).

What is required to obtain a SHARCNET account?

Anyone who would like to use SHARCNET may apply for an account. Please bear in mind the following:

  • There are no shared/group accounts. Each person who uses SHARCNET requires their own account and must not share their password.
  • Applicants who are not faculty (e.g. students, postdocs) require an account sponsor who must already have a SHARCNET account. This is typically one's supervisor.
  • There is no fee for academic access, but account sponsors are responsible for reporting their research activities to Compute Canada, and all academic SHARCNET users must obtain a Compute Canada account before they may apply for a SHARCNET account.
  • All SHARCNET users must read and follow the policies listed here

How do I apply for an account?

Applying for an account is either done through the Compute Canada Database (for academic users) or by contacting SHARCNET (for non-academic use). Detailed step-by-step instructions are provided on the Getting_an_Account_with_SHARCNET page.

How do I update / renew my account?

It is no longer necessary to report to SHARCNET. SHARCNET accounts (for academic users) are automatically activated or deactivated based on the status of your primary role with Compute Canada (your primary CCRI), so as long as one ensures that they have completed the Compute Canada account renewal and reporting process their SHARCNET account will be in good standing.

Compute Canada account holders may renew their account at any time, even after it has expired, by visiting the CCDB and filling out their renewal form. It may take 3-4 business days for your renewal to be confirmed by an account authority at Compute Canada. If you are a sponsored user, you must email us at help@sharcnet.ca to have your SHARCNET account reactivated following expiry.

If anything is unclear or if you have any questions about the Compute Canada account renewal / reporting process please email accounts@computecanada.ca. If you have any questions about account renewals that directly pertain to SHARCNET please email help@sharcnet.ca.

I am changing supervisor or I am becoming faculty, and I already have a SHARCNET account. Should I apply for a new account?

No, you should apply for a new role (CCRI) and indicate that you want your new role to be your primary role. The process is described in detail on the Getting_an_Account_with_SHARCNET page.

I have an existing SHARCNET account and need to link it to a new Compute Canada account, how do I do that?

You first need to get a Compute Canada Role Identifier (CCRI) and then notify SHARCNET that you would like to link your Compute Canada Account (CCI) to your existing SHARCNET account. Detailed step-by-step instructions are provided on the Getting_an_Account_with_SHARCNET page.

What is a role / CCRI ?

A role (CCRI: Compute Canada Role Identifier) is a way to identify you as a person at a point in time. It includes information about your position, institution and department, as well as any other roles that sponsor your role or that your role sponsors.

Each person may have one or more roles that are associated with each of their current and past positions. These various roles ultimately link back to one's CCI (Compute Canada Identifier).

If the roles are created through Compute Canada they are referred to as a CCRI (Compute Canada Role Identifier), although other roles pre-dating Compute Canada also exist.

In practice you only need to be concerned with your role (and the appropriate role from your sponsor) when applying for accounts, running jobs in particular projects associated with a particular group/sponsor, or when viewing the SHARCNET web portal with multiple roles (you may see different information in the web portal depending on which role you have selected to be active).

For further information about roles please see the SHARCNET-specific role information here and the more general Compute Canada specific information here.

Can I have multiple roles ?

Yes. For more information, see Running_jobs#Accounts_and_projects and Frequently_Asked_Questions_about_the_CCDB on the Compute Canada wiki.

Can I just have a cluster account without having a web portal account?

No. The web portal account is an online interface to your account in our user database. It provides a way of managing your information and keeping track of problems you may encounter.

Can I E-mail or call to open an account?

No, please follow the instructions above.

OK, I've seen and heard the word "web portal" enough, what is it anyway?

A web portal is a web site that offers online services. Usually a web portal has a database at the backend, in which people can store and access personal information, but may involve other software services like this wiki. At SHARCNET, registered users can login to the web portal, manage their profiles, submit and review programming and performance related problems, look-up solutions to problems, contribute to our wiki, and assess their SHARCNET usage, amongst other things.

My supervisor forgot all about his/her username/CCRI, so my application can't go through, what should I do?

Please have them send an E-mail to help@sharcnet.ca and we will re-inform them of their login credentials.

My supervisor does not use SHARCNET, why is my supervisor asked to have an account anyway?

Your supervisor's account ID is used to identify which group your account belongs to. We account for all usage and provide defaults at the group level.

Is there any charge for using SHARCNET?

SHARCNET is free for all academic research. If you are working outside of academia we recommend you read our Commercial Access Policy which can be found in the SHARCNET web portal here.

I forgot my password

You can reset your password here, or by clicking the "Forget password" link after trying to sign-in.

I forgot my username

If you forget your username, please send an E-mail to help@sharcnet.ca. Your username for the web portal and cluster account are the same.

My account has been disabled (so I cannot log in). What should I do?

At present all academic SHARCNET accounts are automatically enabled/disabled based on the status of your corresponding Compute Canada roles. If your SHARCNET account is disabled it was most likely due to your Compute Canada account becoming expired as a result of not completing the Compute Canada account renewal / reporting process. You should have been sent an email from Compute Canada indicating why your account was deactivated.

To renew your account (you may do this even after your account is expired), log into the CCDB and complete the reporting process. Once your renewal is approved your SHARCNET account will be automatically reactivated (note that once requested, renewal may take up to 3-4 business days as a local account administrator must verify your reporting information). NOTE: if your sponsor's Compute Canada account was expired and deactivated, you must request that we reactivate your SHARCNET account manually after you've renewed your Compute Canada account - email help@sharcnet.ca.

If you have questions concerning your account please email help@sharcnet.ca.

How do I change the email address associated with my account?

If you wish to use a new email address you have to update your Contact Information at the Compute Canada Database. Contact information is now updated at SHARCNET automatically based on what you have indicated to Compute Canada.

I've changed institutions and want to update the affiliation of my primary role

To change your primary role to your new institution, go to the Compute Canada My Account Add Role page and complete the information to apply for a new role. Be sure to tick both "Make this role primary" and "Disable old roles".

I no longer want my SHARCNET account

If you would like to cease using SHARCNET (including access to all systems and list email), email help@sharcnet.ca. Please let us know if you'd like to disable your corresponding Compute Canada role (resulting in all its associated Compute Canada consortia accounts being disabled as well) or if you'd just like to disable your SHARCNET account independent of your other consortia accounts.

You should only request this if you want your account disabled *now* - if you do not complete the annual renewal process at Compute Canada your account will eventually be deactivated automatically.

The Acceptable Use Policy, in particular pt. 36, outlines our policy in the event that an account is disabled.

You may have your account re-enabled by emailing help@sharcnet.ca.


Logging in to Systems, Transferring and Editing Files

How do I login to SHARCNET?

You access the SHARCNET clusters using ssh. For Graham and other national systems Compute Canada credentials are required. For the remaining systems, listed here, you will require SHARCNET credentials.

Unix/Linux/OS X

To log in to a system, you need to use a Secure Shell (SSH) connection. If you are logging in from a UNIX-based machine, make sure it has an SSH client (ssh) installed (this is almost always the case on UNIX/Linux/OS X). If you have the same login name on both your local system and SHARCNET, and you want to log in to, say, graham, you may use the command:

ssh graham.computecanada.ca

If your Compute Canada username is different from the username on your local systems, then you may use either of the following forms:

ssh graham.computecanada.ca -l username
ssh username@graham.computecanada.ca

If you want to establish an X window connection so that you can use graphics applications such as gvim and xemacs, you can add a -Y to the command:

ssh -Y username@graham.computecanada.ca

This will automatically set the X DISPLAY variable when you login.

Windows

If you are logging in from a computer running Windows and need some pointers, we recommend consulting our SSH tutorial.

What is the difference between Login Nodes and Compute Nodes?

Login Nodes

Most of our clusters have distinct login nodes associated with them that you are automatically redirected to when you login to the cluster (some systems are directly logged into, eg. SMPs and smaller specialty systems). You can use these to do most of your work preparing for jobs (compiling, editing configuration files) and other low-intensity tasks like moving and copying files.

You can also use them for other quick tasks, like simple post-processing, but any significant work should be submitted as a job to the compute nodes. On most login nodes, each process is limited to 1 cpu-hour; this will be noticeable if you perform anything compute-intensive, and can affect IO-oriented activity as well (such as very large scp or rsync operations).

How can I suspend and resume my session?

The program screen can start persistent terminals from which you can detach and reattach. The simplest use of screen is

screen -dR

which will either reattach you to any existing session or create a new one if one doesn't exist. To terminate the current screen session, type exit. To detach manually (you are automatically detached if the connection is lost), press ctrl+a followed by d; you can then resume later as above (ideal for running background jobs). Note that ctrl+a is screen's escape sequence, so you have to do ctrl+a followed by a to get the regular effect of pressing ctrl+a inside a screen session (e.g., moving the cursor to the start of the line in a shell).

For a list of other ctrl+a key sequences, press ctrl+a followed by ?. For further details and command line options, see the screen manual (or type man screen on any of the clusters).

Other notes:

  • If you want to create additional "text windows", use Ctrl-A Ctrl-C. Remember to type "exit" to close it.
  • To switch to a "text window" with a certain number, use Ctrl-A # (where # is 0 to 9).
  • To see a list of window numbers use Ctrl-A w
  • To be presented a list of windows and select one to use, use Ctrl-A " (This is handy if you've made too many windows.)
  • If the program running in a screen "text window" refuses to die (i.e., it needs to be killed) you can use Ctrl-A K
  • For brief help on keystrokes use Ctrl-A ?
  • For extensive help, run "man screen".

What operating systems are supported?

UNIX in general. Currently, Linux is the only operating system used within SHARCNET.

What makes a cluster different than my UNIX workstation?

If you are familiar with UNIX, then using a cluster is not much different from using a workstation. When you login to a cluster, you in fact only log in to one of the cluster nodes. In most cases, each cluster node is a physical machine, usually a server class machine, with one or several CPUs, that is more or less the same as a workstation you are familiar with. The difference is that these nodes are interconnected with special interconnect devices and the way you run your program is slightly different. Across SHARCNET clusters, you are not expected to run your program interactively. You will have to run your program through a queueing system. That also means where and when your program gets to run is not decided by you, but by the queueing system.

What programming languages are supported?

The primary programming languages, C, C++ and Fortran, are supported. Other languages, such as Java, Pascal and Ada, are also supported, but with limited technical support from us. If your program is written in any language other than C, C++ and Fortran, and you encounter a problem, we may or may not be able to solve it within a short period of time. Note: this does not mean you can't use other languages like Matlab, R, Python, Perl, etc. We normally think of those as "scripting" languages, but that doesn't imply that good HPC necessarily requires an explicitly-compiled language like Fortran.

How do I organize my files?

Main file systems on our National systems:

[Figure: Filesystemgraham.png]

How are file permissions handled at SHARCNET?

By default, anyone in your group can read and access your files. You can provide access to any other users by following this Knowledge Base entry.

All SHARCNET users are associated with a primary GID (group ID) belonging to the PI of the group (you can see this by running id username, with your username). This allows groups to share files without any further action, as the default file permissions for all SHARCNET storage locations (e.g. /gwork/user) allow read (list) and execute (enter / access) permissions for the group. For example, they appear as:

  [cc_user@gra-login2 ~]$ ls -ld scratch/
   drwxrwx---+ 12 cc_user cc_user 4096 Jul 18 08:59 scratch/


Further, by default the umask value for all users is 0002, so any new files or directories will continue to provide access to the group.

Should you wish to keep your files private from all other users, you should set the permissions on the base directory to only be accessible to yourself. For example, if you don't want anyone to see files in your home directory, you'd run:

chmod 700 ~/

If you want to ensure that any new files or directories are created with different permissions, you can set your umask value. See the man page for further details by running:

man umask

For further information on UNIX-based file permissions please run:

man chmod
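
As a minimal sketch of the commands above, the following works in a throwaway directory created with mktemp (an assumption made for illustration, so that nothing in your real home directory is touched). It shows that chmod 700 leaves access to the owner only, and that a umask of 077 makes newly created files private. It assumes GNU stat, as found on Linux; the variable names are hypothetical.

```shell
# Work in a temporary directory so nothing real is modified (illustrative only).
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT

# Restrict the directory to its owner, as with "chmod 700 ~/".
chmod 700 "$demo_dir"
dir_mode=$(stat -c '%a' "$demo_dir")
echo "directory mode: $dir_mode"

# With umask 077, new files get no group or other permissions.
# The parentheses run this in a subshell, so the umask change does not persist.
(
  umask 077
  touch "$demo_dir/private_file"
)
file_mode=$(stat -c '%a' "$demo_dir/private_file")
echo "new file mode: $file_mode"
```

To make a umask setting apply to every session, it would typically be placed in a shell startup file such as ~/.bashrc.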

What about really large files or if I get the error 'No space left on device' in ~/project or ~/scratch?

If you need to work with really large files we have tips on optimizing performance with our parallel filesystems here.

How do I transfer files/directories to/from or between cluster?

Unix/Linux

To transfer files to and from a cluster on a UNIX machine, you may use scp or sftp. For example, if you want to upload file foo.f to cluster graham from your machine myhost, use the following command

myhost$ scp foo.f graham.computecanada.ca:

assuming that your machine has scp installed. If you want to transfer a file from Windows or Mac, you need to have scp or sftp for Windows or Mac installed.

To transfer file foo.f between SHARCNET clusters, say from your home directory on orca to your scratch directory on graham, simply use the following command

[username@orc-login2:~]$ scp foo.f graham:/home/username/scratch/

To transfer a whole directory between a UNIX machine and a cluster, use the scp command with the -r option. For instance, to download the subdirectory foo in the directory project in your home directory on graham to your local UNIX machine, run the following on your local machine

myhost$ scp -rp graham.computecanada.ca:project/foo .

Similarly, you can transfer the subdirectory between SHARCNET clusters. The following command

[username@orc-login2:~]$ scp -rp graham:/home/username/scratch/foo .

will download subdirectory foo from your scratch directory on graham to your home directory on orca (note that the prompt indicates you are currently logged on to orca).

The use of -p option above will preserve the time stamp of each file. For Windows and Mac, you need to check the documentation of scp for features.

You may also tar and compress the entire directory and then use scp to save bandwidth. In the above example, first you login to graham, then do the following

[username@gra-login2:~]$ cd project
[username@gra-login2:~]$ tar -cvf foo.tar foo
[username@gra-login2:~]$ gzip foo.tar

Then on your local machine myhost, use scp to copy the tar file

myhost$ scp graham.computecanada.ca:project/foo.tar.gz .

Note that on most Linux distributions, tar has a -z option that compresses the archive with gzip as it is created, so the tar and gzip steps can be combined.
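
The single-step form might look like the sketch below. It builds a small stand-in for the project/foo directory in a temporary location (the file data.txt and the variable names are hypothetical), then creates and lists a gzip-compressed archive with one tar command each.

```shell
# Build a small stand-in for the "foo" directory in a temporary location.
work_dir=$(mktemp -d)
trap 'rm -rf "$work_dir"' EXIT
mkdir -p "$work_dir/project/foo"
echo "sample data" > "$work_dir/project/foo/data.txt"

cd "$work_dir/project" || exit 1

# Create and gzip-compress the archive in one step (-z),
# instead of running tar followed by gzip.
tar -czf foo.tar.gz foo

# List the archive contents to confirm what was packed.
archive_listing=$(tar -tzf foo.tar.gz)
echo "$archive_listing"
```

The resulting foo.tar.gz can then be copied with scp exactly as in the two-step example.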

Windows

See the instructions for transferring files with an SSH client on Windows [1].

How can I best transfer large quantities of data to/from SHARCNET and what transfer rate should I expect?

In general, most users should be fine using scp or rsync to transfer data to and from SHARCNET systems. If you need to transfer a lot of files rsync is recommended to ensure that you do not need to restart the transfer from scratch should there be a connection failure. Although you can use scp and rsync to any cluster's login node(s), it is often best to use gra-dtn1.computecanada.ca - it is dedicated to data transfer.
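
As a hedged illustration of a restartable transfer, the sketch below assembles an rsync command aimed at the data transfer node and prints it for review rather than running it. The username and the local and remote directories are placeholders, not real paths; only the host name comes from the text above.

```shell
# Hypothetical values: replace with your own username and directories.
remote_user="username"
transfer_host="gra-dtn1.computecanada.ca"   # data transfer node named above
local_dir="results/"
remote_dir="scratch/results/"

# --archive preserves timestamps and permissions and recurses into directories;
# --partial keeps partially transferred files so an interrupted run can resume;
# --progress reports how far each file has got.
rsync_cmd="rsync --archive --partial --progress $local_dir $remote_user@$transfer_host:$remote_dir"

# Print the command for review; run it with: eval "$rsync_cmd"
echo "$rsync_cmd"
```

If the connection drops, re-running the same command skips files that already arrived intact, which is the main advantage over scp for large directory trees.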

In general one should expect the following transfer rates with scp:

  • If you are connecting to SHARCNET through a Research/Education network site (ORION, CANARIE, Internet2) and are on a fast local network (this is the case for most users connecting from academic institutions) then you should be able to attain sustained transfer speeds in excess of 10 MB/s. If your path is all gigabit or better, you should be able to reach rates above 50 MB/s.
  • If you are transferring data over the wider internet, you will not be able to attain these speeds, as all traffic that does not enter/exit SHARCNET via the R&E net is restricted to a limited-bandwidth commercial feed. In this case one will typically see rates on the order of 1 MB/s or less.

Keep in mind that filesystems and networks are shared resources and suffer from contention; if they are busy, the above rates may not be attainable.

For transferring large amounts of data (many gigabytes) the best approach is to use the online tool Globus.

How do I access the same file from different subdirectories on the same cluster ?

You should not need to copy large files within the same cluster (e.g. from one user to another, or to use the same file in different subdirectories). Instead of using scp, consider creating a "soft link". Assume that you need access to the file large_file1 in subdirectory /home/user1/subdir1 and you need it to be in your subdirectory /home/my_account/my_dir, from where you will invoke it under the name my_large_file1. Then go to that directory and type:

ln -s /home/user1/subdir1/large_file1    my_large_file1

Another example, assume that in subdirectory /home/my_account/PROJ1 you have several subdirectories called CASE1, CASE2, ... In each subdirectory CASEn you have a slightly different code but all of them process the same data file called test_data. Rather than copying the test_data file into each CASEn subdirectory, place test_data above i.e. in /home/my_account/PROJ1 and then in each CASEn subdirectory issue following "soft link" command:

ln -s ../test_data  test_data

The "soft links" can be removed with the rm command. For example, to remove the soft link from /home/my_account/PROJ1/CASE2, type the following command from that subdirectory:

rm test_data

Typing the above command from the subdirectory /home/my_account/PROJ1 would remove the actual file, and then none of the CASEn subdirectories would have access to it.
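
The PROJ1 example can be tried out safely with the following sketch, which recreates the layout in a temporary directory (an assumption for illustration; the directory and file contents are made up). It links the shared file into each CASEn directory, reads it through a link, and then shows that removing a link leaves the original file in place.

```shell
# Recreate the PROJ1 layout from the example in a temporary location.
proj_dir=$(mktemp -d)
trap 'rm -rf "$proj_dir"' EXIT
mkdir "$proj_dir/CASE1" "$proj_dir/CASE2"
echo "shared input" > "$proj_dir/test_data"

# Link the shared file into each case directory using a relative path.
for case_dir in "$proj_dir"/CASE*; do
  ( cd "$case_dir" && ln -s ../test_data test_data )
done

# Reading through the link returns the contents of the original file.
linked_content=$(cat "$proj_dir/CASE1/test_data")
echo "$linked_content"

# Removing a link deletes only the link, not the file it points to.
rm "$proj_dir/CASE2/test_data"
ls -l "$proj_dir" "$proj_dir/CASE1" "$proj_dir/CASE2"
```

In the ls -l output, a soft link is shown with an arrow pointing at its target, which is a quick way to check what you are about to remove.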

How are files deleted from the /home/userid/scratch filesystems?

All files on /home/userid/scratch that are over 2 months old (not old in the common sense, please see below) are automatically deleted. Data needed for long term storage and reference should be kept in either ~/project or other archival storage areas. The scratch filesystem is checked at the end of the month for files which will be candidates for expiry on the 15th of the following month. On the first day of the month, a login message is posted and a notification e-mail is sent to all users who have at least one file that is a candidate for purging. The e-mail contains the location of a file which lists all the candidates for purging.

An unconventional aspect of this system is that it does not determine the age of a file based on the file's attributes, e.g., the dates reported by the stat, find, ls, etc. commands. The age of a file is determined based on whether or not its data contents (i.e., the information stored in the file) have changed, and this age is stored externally to the file. Once a file is created, reading it, renaming it, changing its timestamps with the touch command, or copying it into another file are all irrelevant in terms of changing its age with respect to the purging system. The file will be expired 2 months after it was created. Only files where the contents have changed will have their age counter "reset".

Unfortunately, there currently exists no method to obtain a listing of the files that are scheduled for deletion. This is something that is being addressed, however there is no estimated time for implementation.

How do I check the age of a file?

We define a file's age as the most recent of:

  • the access time (atime) and
  • the change time (ctime)

You can find the ctime of a file using

[name@server ~]$ ls -lc <filename>

while the atime can be obtained with the command

[name@server ~]$ ls -lu <filename>

We do not use the modify time (mtime) of the file because it can be modified by the user or by other programs to display incorrect information.

Ordinarily, simple use of the atime property would be sufficient, as it is updated by the system in sync with the ctime. However, userspace programs are able to alter atime, potentially to times in the past, which could result in early expiration of a file. The use of ctime as a fallback guards against this undesirable behaviour.
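
Following the definition above, a file's age can be estimated with a small shell function such as the sketch below. It assumes GNU stat (as on Linux), and file_age_days is a hypothetical helper name rather than a command provided on the clusters. It reads the atime and ctime as seconds since the epoch, takes the more recent of the two, and reports the difference from the current time in whole days.

```shell
# Print the age in days of a file, using the most recent of atime and ctime.
# Assumes GNU stat (Linux); file_age_days is a hypothetical helper name.
file_age_days() {
  atime=$(stat -c '%X' "$1") || return 1   # last access, seconds since epoch
  ctime=$(stat -c '%Z' "$1") || return 1   # last status change, seconds since epoch
  if [ "$atime" -gt "$ctime" ]; then newest=$atime; else newest=$ctime; fi
  now=$(date +%s)
  echo $(( (now - newest) / 86400 ))
}

# Example: a file created just now is 0 days old.
sample_file=$(mktemp)
sample_age=$(file_age_days "$sample_file")
echo "$sample_file is $sample_age day(s) old"
rm -f "$sample_file"
```

Combined with find, the same function could be applied to every file in a directory to spot data that is getting old.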

It is also your responsibility to manage the age of your stored data: most of the filesystems are not intended to provide an indefinite archiving service, so when a given file or directory is no longer needed, you need to move it to a more appropriate filesystem, which may well mean your personal workstation or some other storage system under your control. Moving significant amounts of data between your workstation and a Compute Canada system or between two Compute Canada systems should generally be done using Globus.

How to archive my data?

Use tar to archive files and directories

The primary archiving utility on all Linux and Unix-like systems is the tar command. It will bundle a bunch of files or directories together and generate a single file, called an archive file or tar-file. By convention an archive file has .tar as the file name extension. When you archive a directory with tar, it will, by default, include all the files and sub-directories contained within it, and sub-sub-directories contained in those, and so on. So the command tar --create --file project1.tar project1 will pack all the content of directory project1 into the file project1.tar. The original directory will remain unchanged, so this may double the amount of disk space occupied!

You can extract files from an archive using the same command with a different option: tar --extract --file project1.tar. If there is no directory with the original name, it will be created. If a directory of that name exists and contains files of the same names as in the archive file, they will be overwritten. Another option can be added to specify the destination directory where to extract the archive's content.
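
The sketch below walks through create, list and extract on a small example tree built in a temporary directory (the file run1.txt and the restore directory are hypothetical). For the destination directory it uses tar's --directory option, which is one way to extract somewhere other than the current directory.

```shell
# Set up a small example directory and archive it.
base_dir=$(mktemp -d)
trap 'rm -rf "$base_dir"' EXIT
mkdir -p "$base_dir/project1/results"
echo "42" > "$base_dir/project1/results/run1.txt"
tar --create --file "$base_dir/project1.tar" --directory "$base_dir" project1

# Inspect the archive without extracting it.
tar --list --file "$base_dir/project1.tar"

# Extract into a separate destination directory instead of the current one.
mkdir "$base_dir/restore"
tar --extract --file "$base_dir/project1.tar" --directory "$base_dir/restore"
restored=$(cat "$base_dir/restore/project1/results/run1.txt")
echo "restored content: $restored"
```

Listing an archive before extracting it is a cheap way to check whether extraction would overwrite existing files of the same name.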

Compress and uncompress tar files

The tar archiving utility can compress an archive file at the same time it creates it. There are a number of compression methods to choose from. We recommend either xz or gzip, which can be used as follows:

[user_name@localhost]$ tar --create --xz --file project1.tar.xz project1
[user_name@localhost]$ tar --extract --xz --file project1.tar.xz
[user_name@localhost]$ tar --create --gzip --file project1.tar.gz project1
[user_name@localhost]$ tar --extract --gzip --file project1.tar.gz

Typically, --xz will produce a smaller compressed file (a "better compression ratio") but takes longer and uses more RAM while working. --gzip does not typically compress as small, but may be used if you encounter difficulties due to insufficient memory or excessive run time during tar --create. A third option, --bzip2, is also available, that typically does not compress as small as xz but takes longer than gzip.

You can also run tar --create first without compression and then use the commands xz or gzip in a separate step, although there is rarely a reason to do so. Similarly, you can run xz -d or gzip -d to decompress an archive file before running tar --extract, but again there is rarely a reason to do so.

The commands gzip or xz can be used to compress any file, not just archive files:

[user_name@localhost]$ gzip bigfile
[user_name@localhost]$ xz bigfile

These commands will produce the files bigfile.gz and bigfile.xz respectively.

Archival Storage

On Graham, files copied to ~/nearline will be subsequently moved to offline (tape-based) storage. See this link for more details.

How can I check the hidden files in directory?

The "." at the beginning of the name means that the file is "hidden". You have to use the -a option with ls to see it. I.e. ls -a .

If you want to display only the hidden files then type:

ls -d .*

Note: there is an alias which is loaded from /etc/bashrc (see your .bashrc file). The alias is defined by alias l.='ls -d .* --color=auto' and if you type:

l.

you will also display only the hidden files.

How can I count the number of files in a directory?

One can use the following command to count the number of files in a directory (in this example, your home directory):

find /home/$USER -type f   | wc -l

It is always a good idea to archive and/or compress files that are no longer needed on the filesystem (see above). This helps minimize one's footprint on the filesystem and, as such, the impact one has on other users of the shared resource.
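
To find out where most of the files are, the same find command can be run once per subdirectory, as in this sketch. Here target_dir is a small demo tree created with mktemp (the runA and runB names are made up); in practice it would be set to a real directory such as "$HOME".

```shell
# Count regular files under each immediate subdirectory of a target directory.
# target_dir is a small demo tree here; in practice set it to e.g. "$HOME".
target_dir=$(mktemp -d)
trap 'rm -rf "$target_dir"' EXIT
mkdir "$target_dir/runA" "$target_dir/runB"
touch "$target_dir/runA/out1" "$target_dir/runA/out2" "$target_dir/runB/out1"

# One line per subdirectory, largest count first.
for sub_dir in "$target_dir"/*/; do
  file_count=$(find "$sub_dir" -type f | wc -l)
  echo "$file_count $sub_dir"
done | sort -rn

# Overall total for the whole tree.
total_count=$(find "$target_dir" -type f | wc -l)
echo "total: $total_count"
```

Subdirectories with very large counts are usually the best candidates for packing into a single archive.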

How to organize a large number of files?

With parallel cluster filesystems, you will get the best I/O performance by writing data to a small number of large files. Since all metadata operations on each of our parallel filesystems are handled by a single file server, the server can become overwhelmed when many files are being accessed, leading to poor overall I/O performance for all users. If your workflow involves storing data in a large number of files, it is best to pack these files into a small number of larger archives, e.g. using the tar command:

tar cvf archiveFile.tar directoryToArchive
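Before deleting the original directory it is worth confirming that the archive is complete. The following is a minimal sketch, assuming the directory contains only regular files and subdirectories (symbolic links would throw the comparison off); the function name pack_and_check is hypothetical:

```shell
# Hypothetical helper: pack DIRECTORY into ARCHIVE and confirm that every
# regular file made it in before the caller deletes anything.
pack_and_check() {
    local archive="$1" dir="$2" on_disk in_tar
    tar cf "$archive" "$dir" || return 1
    on_disk=$(find "$dir" -type f | wc -l)
    # directory entries in a tar listing end with "/", so filter them out
    in_tar=$(tar tf "$archive" | grep -vc '/$')
    [ "$on_disk" -eq "$in_tar" ]
}
```

A typical use would be pack_and_check archiveFile.tar directoryToArchive && rm -r directoryToArchive. A single file can later be restored with tar xf archiveFile.tar followed by the path of that file inside the archive.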

For better performance with many files inside your archive, we recommend using DAR (Disk ARchive utility), which is a disk analog of tar (Tape ARchive). dar can extract files from anywhere in the archive much faster than tar. The dar command is available by default on SHARCNET systems. It can be used to pack files into a dar archive by doing something like:

dar -s 1G -w -c archiveFile -g directoryToArchive

In this example we split the archive into 1GB chunks, and the archive files will be named archiveFile.1.dar, archiveFile.2.dar, and so on. To list the contents of the archive, you can type:

dar -l archiveFile

To temporarily extract files for post-processing into current directory, you would type:

dar -R . -O -x archiveFile -v -g pathToYourFile/fileToExtract

I am unable to connect to one of the clusters; when I try, I am told the connection was closed by the remote host

The most likely cause of this behaviour is repeated failed login attempts. Part of our security policies involves blocking the IP address of machines that attempt multiple logins with incorrect passwords over a short period of time---many brute-force attacks on systems do exactly this: looking for poor passwords, badly configured accounts, etc. Unfortunately, it isn't uncommon for a user to forget their password and make repeated login attempts with incorrect passwords and end up with that machine blacklisted and unable to connect at all.

A temporary solution is simply to attempt to log in from another machine. If you have access to another machine at your site, you can shell to that machine first, and then shell to the SHARCNET system (as that machine's IP shouldn't be blacklisted). In order to have your machine unblocked, you will have to email support@computecanada.ca, as a system administrator must manually intervene in order to fix it.

NOTE: there are other situations that can produce this message, however they are rarer and more transient. If you are unable to log in from one machine, but can from another, it is most likely the IP blacklisting that is the problem and the above will provide a temporary work-around while your problem ticket is processed.

I am unable to ssh/scp from SHARCNET to my local computer

Most campus networks are behind some sort of firewall. If you can ssh out to SHARCNET, but cannot establish a connection in the other direction, then you are probably behind a firewall and should speak with your local system administrator or campus IT department to determine if there are any exceptions or workarounds in place.

SSH tells me SOMEONE IS DOING SOMETHING NASTY!?

Suppose you attempt to login to SHARCNET, but instead get an alarming message like this:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
fe:65:ab:89:9a:23:34:5a:50:1e:05:d6:bf:ec:da:67.
Please contact your system administrator.
Add correct host key in /home/user/.ssh/known_hosts to get rid of this message.
Offending key in /home/user/.ssh/known_hosts:42
RSA host key for requin has changed and you have requested strict checking.
Host key verification failed. 

SSH begins a connection by verifying that the host you're connecting to is authentic. It does this by caching the host's "hostkey" in your ~/.ssh/known_hosts file. At times, a hostkey may be changed legitimately; when this happens, you may see such a message. It's a good idea to verify this with us. You may also be able to check the fingerprint yourself by logging into another SHARCNET system and running:

ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub 

If the fingerprint is OK, the normal way to fix the problem is to simply remove the old hostkey from your known_hosts file. You can use your choice of editor if you're comfortable doing so (it's a plain text file, but has long lines). On a unix-compatible machine, you can also use the following very small script (Substitute the line(s) printed in the warning message illustrated above for '42' here.):

perl -pi -e 'undef $_ if (++$line == 42)' ~/.ssh/known_hosts
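If perl is not available, the same edit can be sketched with sed. This version keeps a backup copy so that a mistyped line number can be undone; the function name remove_known_host_line and the .bak suffix are assumptions made for illustration:

```shell
# Hypothetical helper: delete line LINE from a known_hosts-style FILE,
# keeping the previous contents in FILE.bak.
remove_known_host_line() {
    local file="$1" line="$2"
    # refuse anything that is not a plain line number
    case "$line" in
        ''|*[!0-9]*) echo "not a line number: $line" >&2; return 1 ;;
    esac
    cp "$file" "$file.bak" || return 1
    # sed prints every line of the backup except the requested one
    sed "${line}d" "$file.bak" > "$file"
}
```

For the warning shown above this would be remove_known_host_line ~/.ssh/known_hosts 42. OpenSSH also provides ssh-keygen -R hostname, which removes the stored keys for a host by name rather than by line number.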

Another solution is brute-force: remove the whole known_hosts file. This throws away any authentication checking, and your first subsequent connection to any machine will prompt you to accept a newly discovered host key. If you find this prompt annoying and you aren't concerned about security, you can avoid it by adding a text file named ~/.ssh/config on your machine with the following content:

StrictHostKeyChecking no

Ssh works, but scp doesn't!

If you can ssh to a cluster successfully, but cannot scp to it, the problem is likely that your login scripts print unexpected messages which confuse scp. scp is based on the same ssh protocol, but assumes that the connection is "clean": that is, that it does not produce any un-asked-for content. If you have something like:

echo "Hello, Master; I await your command..."

scp will be confused by the salutation. To avoid this, simply ensure that the message is only printed on an interactive login:

if [ -t 0 ]; then
    echo "Hello, Master; I await your command..."
fi

or in csh/tcsh syntax:

if ( -t 0 ) then
    echo "Hello, Master; I await your command..."
endif
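An alternative sketch for bash and other POSIX-style shells (not csh/tcsh) checks the shell's own option flags instead of the terminal. The function name greet_if_interactive is hypothetical:

```shell
# Print a greeting only when the shell is interactive. The shell lists its
# active option flags in $-, and "i" is present only for interactive shells.
greet_if_interactive() {
    case "$-" in
        *i*) echo "Hello, Master; I await your command..." ;;
        *)   : ;;   # non-interactive (e.g. started by scp): stay silent
    esac
}
```

Calling greet_if_interactive from a login script prints the message at a normal prompt and nothing when the shell was started by scp or by a batch script.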

How do I edit my program on a cluster?

We provide a variety of editors, such as the traditional text-mode emacs and vi (vim), as well as a simpler one called nano. If you have X on your desktop (and tunneled through SSH), you can use the GUI versions (xemacs, gvim).

If your desktop supports FUSE, it's very convenient to simply mount your home tree like this:

mkdir sharcnet
sshfs graham.computecanada.ca: sharcnet

you can then use any local editor of your choice.

If you run emacs on your desktop, you can also edit a remote file from within your local emacs client using Tramp, opening and saving a file as /username@cluster.computecanada.ca:path/file.

Compiling and Running Programs

For information about compiling your programs on orca, graham and other national Compute Canada systems, please see the Installing software in your home directory page on the Compute Canada wiki.

For information about how to compile on older SHARCNET systems, see Legacy Systems.

How do I run a program interactively?

For running interactive jobs on graham and other national systems, see the Running jobs page on the Compute Canada wiki.

If trying interactive jobs on legacy systems, see Legacy Systems.

My application runs on Windows, can I run it on SHARCNET?

It depends. If your application is written in a high-level language such as C, C++ or Fortran and is system independent (meaning it does not depend on any third-party libraries that are available only for Windows), then you should be able to recompile and run your application on SHARCNET systems. However, if your application depends entirely on software that is available only for Windows, it will not run on the Linux compute nodes. In general it is impossible to convert code at the binary level between Windows and any UNIX platform. For options relating to running Windows in virtual machines there is a Creating a Windows VM page at the Compute Canada Wiki.

My application runs on Windows HPC clusters, can I run it on SHARCNET clusters?

If your application does not use any Windows-specific APIs then you should be able to recompile and run it on SHARCNET UNIX/Linux based clusters.

My program needs to run for more than seven (7) days; what can I do?

The seven-day run-time limit on legacy systems cannot be exceeded. This is done primarily to encourage the practice of checkpointing, but it also prevents users from monopolizing large amounts of resources outside of dedicated allocations with long-running jobs, ensures that jobs free up nodes often enough for the scheduler to start large jobs in a modest amount of time, and allows us to drain all systems for maintenance within a reasonable time-frame.

In order to run a program that requires more than this amount of wall-clock time, you will have to make use of a checkpoint/restart mechanism so that the program can periodically save its state and be resubmitted to the queues, picking up from where it left off. It is crucial to store checkpoints so that one can avoid lengthy delays in obtaining results in the event of a failure. Investing time in testing and ensuring that one's checkpoint/resume works properly is inconvenient but ensures that valuable time and electricity are not wasted unduly in the long run. Redoing a long calculation is expensive.

Although you are encouraged to always use checkpointing for long-running workloads, there are a small number of nodes available for 28-day run times on the national general purpose systems Graham and Cedar.

Handling long jobs with chained job submission

On systems that use the Slurm scheduler (e.g. Orca and Graham), job dependencies can be used to make the start of one job contingent on the completion of another. The dependency is expressed with the optional --dependency argument to sbatch, written as follows in the job submit script:

    #SBATCH --dependency=afterok:<jobid>

Other strategies for resubmitting jobs for long running computations on the Slurm scheduled systems are described on the Compute Canada Wiki.
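As an illustration of chained submission, the following sketch submits the same script several times from the command line, making each job wait for the previous one to finish successfully. It assumes sbatch is on your PATH; the function name submit_chain is hypothetical:

```shell
# Hypothetical helper: submit SCRIPT COUNT times so that each job starts
# only after the previous one has finished successfully.
submit_chain() {
    local script="$1" count="$2" jobid prev="" i=1
    while [ "$i" -le "$count" ]; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency=afterok:"$prev" "$script") || return 1
        fi
        # --parsable prints "jobid" or "jobid;cluster"; keep only the id
        prev="${jobid%%;*}"
        echo "submitted job $prev"
        i=$((i + 1))
    done
}
```

For example, submit_chain myjob.sh 4 queues four jobs that run one after another. With afterok, a job that fails leaves the later jobs in the chain unable to start, so the job script should exit with status 0 only when its checkpoint was written correctly.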

How do I checkpoint/restart my program?

Checkpointing is a valuable strategy that minimizes the loss of valuable compute time should a long running job be unexpectedly killed by a power outage, node failure, or hitting its runtime limit. On the national systems checkpointing can be accomplished manually by creating and loading your own custom checkpoint files or by using the Distributed MultiThreaded CheckPointing (DMTCP) software without having to recompile your program. For further documentation of the checkpointing and DMTCP software see the Checkpoints page at the Compute Canada Wiki site.

If your program is MPI based (or any other type of program requiring a specialized job starter to get it running), it will have to be coded specifically to save state and restart from that state on its own. Please check the documentation that accompanies any software you are using to see what support it has for checkpointing. If the code has been written from scratch, you will need to build checkpointing functionality into it yourself---output all relevant parameters and state such that the program can be subsequently restarted, reading in those saved values and picking up where it left off.
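The save-and-resume idea can be sketched in a few lines of shell. This is only an illustration of the pattern, not a description of how DMTCP or any particular application does it; the function name run_with_checkpoint and its arguments are hypothetical, and the third argument stands in for a job hitting its run-time limit:

```shell
# Hypothetical sketch: a loop that saves its position after every step and
# resumes from the saved position when started again.
# Usage: run_with_checkpoint CHECKPOINT_FILE LAST_STEP [MAX_STEPS_THIS_RUN]
run_with_checkpoint() {
    local ckpt="$1" last="$2" max_steps="${3:-$2}" step=0 worked=0
    # resume from the checkpoint file if a previous run left one behind
    [ -f "$ckpt" ] && step=$(cat "$ckpt")
    while [ "$step" -lt "$last" ] && [ "$worked" -lt "$max_steps" ]; do
        step=$((step + 1))
        # ... one unit of real work would go here ...
        worked=$((worked + 1))
        # write to a temporary file first so a crash cannot leave a
        # half-written checkpoint, then rename it into place
        echo "$step" > "$ckpt.tmp" && mv "$ckpt.tmp" "$ckpt"
    done
    echo "$step"
}
```

Each call continues from the step recorded in the checkpoint file, so resubmitting the same job repeatedly eventually reaches the last step. A real program would save its full state (arrays, random number generator state, iteration counters) rather than a single number.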

How can I know when my job would start?

The Slurm scheduler can report expected start times for queued jobs as output from the squeue command. For example, the following command returns the current jobs for user 'username' with columns for job id, job name, job state, estimated start time (N/A if there is no estimate), and node list (or the reason the job is pending):

$ squeue -u username -o "%.10i%.24j%.12T%.24S%.24R"
    JOBID                    NAME       STATE              START_TIME        NODELIST(REASON)
 12345678                  mpi.sh     PENDING                     N/A              (Priority)

It is important to note that the estimated start time listed in the START_TIME column (if available) can change substantially over time. This start time estimate is based on the current state of the compute nodes and list of jobs in the queue. Because the state of the compute nodes and list of jobs in the queue are constantly changing the start time estimates for pending jobs can change for several reasons (running jobs end sooner than expected, higher priority jobs enter the queue, etc). For more information regarding the variables that affect wait times in the queue see the job scheduling policy page at the Compute Canada Wiki site.

Is package X preinstalled on system Y, and, if so, how do I run it?

The software packages that are installed and maintained on the national systems are listed at the available software page of the Compute Canada Wiki site. Some packages have specific documentation for running on the national systems. For the packages that have specific Compute Canada instructions follow the link in the 'Documentation' column of the list of globally installed modules table.

For legacy SHARCNET systems the list of preinstalled packages (with running instructions) can be found on the SHARCNET software page.

Command 'top' gives me two different memory size (virt, res). What is the difference between 'virtual' and 'real' memory?

'virt' refers to the total virtual address space of the process, including virtual space that has been allocated but never actually instantiated, including memory which was instantiated but has been swapped out, and memory which may be shared. 'res' is memory which is actually resident - that is, instantiated with real RAM pages. Resident memory is normally the more meaningful value, since it may be judged relative to the memory available on the node. (Recognizing, of course, that the memory on a node must be divided among the resident pages for all the processes, so an individual thread must always strive to keep its working set a little smaller than the node's total memory divided by the number of processors.)

There are two cases where the virtual address space size is significant. One is when the process is thrashing - that is, has a working set size bigger than available memory. Such a process will spend a lot of time in 'D' state, since it's waiting for pages to be swapped in or out. A node on which this is happening will have a substantial paging rate expressed in the 'si' column of output from vmstat (the 'so' column is normally less significant, since si/so do not necessarily balance.)

The second condition where virtual size matters is that the kernel does not implement RLIMIT_RSS, but does enforce RLIMIT_AS (virtual size). We intend to enforce a sanity-check RLIMIT_AS, and in some cases do. The goal is to avoid a node becoming unusable or crashing when a job uses too much memory. Current settings are very conservative, though - 150% of physical memory.

In this particular case, the huge V size relative to R is almost certainly due to the way Silky implements MPI using shared memory. Such memory is counted as part of every process involved, but obviously does not mean that N * 26.2 GB of RAM is in use.

In this case, the real memory footprint of the MPI rank is 1.2 GB - if you ran the same code on another cluster which didn't have numalink shared memory, both resident and virtual sizes would be about that much. Since most of our clusters have at least 2GB per core, this code could run comfortably on other clusters.

Can I use a script to compile and run programs?

Yes. For instance, suppose you have a number of source files main.f, sub1.f, sub2.f, ..., subN.f. To compile this source code into an executable myprog, you would likely type the following command:

ifort main.f sub1.f sub2.f ... subN.f -llapack -o myprog 

Here, the -o option specifies the executable name myprog rather than the default a.out, and the option -llapack at the end tells the compiler to link your program against the LAPACK library, if LAPACK routines are called in your program. If you have a long list of files, typing the above command every time can be really annoying. You can instead put the command in a file, say, mycomp, then make mycomp executable by typing the following command:

chmod +x mycomp

Then you can just type

./mycomp

at the command line to compile your program.

This is a simple way to minimize typing, but it may wind up recompiling code which has not changed. A widely used improvement, especially for larger/many source files, is to use make. make permits recompilation of only those source files which have changed since last compilation, minimizing the time spent waiting for the compiler. On the other hand, compilers will often produce faster code if they're given all the sources at once (as above).
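The timestamp comparison that make performs can be sketched in plain shell. This is only a rough illustration of the idea (make tracks dependencies per file and rebuilds only what is needed, which this does not); the function name needs_rebuild is hypothetical:

```shell
# Hypothetical helper: succeed (return 0) when TARGET is missing or older
# than any of the listed source files, i.e. when a rebuild is needed.
needs_rebuild() {
    local target="$1" src
    shift
    [ -e "$target" ] || return 0
    for src in "$@"; do
        # -nt is true when the left file was modified more recently
        [ "$src" -nt "$target" ] && return 0
    done
    return 1
}
```

A compile script could then start with something like: if needs_rebuild myprog main.f sub1.f sub2.f; then ./mycomp; fi, so that the compiler is run only when a source file has changed since myprog was last built.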

I have a program that runs on my workstation, how can I have it run in parallel?

If the program was written without parallelism in mind, then there is very little that you can do to run it automatically in parallel. Some compilers are able to translate some serial portions of a program, such as loops, into equivalent parallel code, which allows you to exploit the parallel architecture found mostly in symmetric multiprocessing (SMP) systems. Also, some libraries are able to use parallelism internally, without any change in the user's program. For this to work, your program needs to spend most of its time in the library, of course - the parallel library doesn't speed up your program itself. Examples of this include threaded linear algebra and FFT libraries.

However, to gain the true parallelism and scalability, you will need to either rewrite the code using the message passing interface (MPI) library or annotate your program using OpenMP directives. We will be happy to help you parallelize your code if you wish. (Note that OpenMP is inherently limited by the size of a single node or SMP machine - most SHARCNET resources

Also, the preceding answer pertains only to the idea of running a single program faster using parallelism. Often, you might want to run many different configurations of your program, differing only in a set of input parameters. This is common when doing Monte Carlo simulation, for instance. It's usually best to start out doing this as a series of independent serial jobs. It is possible to implement this kind of loosely-coupled parallelism using MPI, but often less efficient and more difficult.
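One common way to organize such a set of independent serial jobs is to keep one parameter set per line in a text file and let each job pick its own line. The following is a minimal sketch, assuming a Slurm job array (where the scheduler sets SLURM_ARRAY_TASK_ID for each array task); the function name nth_params and the file name params.txt are hypothetical:

```shell
# Hypothetical helper: print one line of a parameter file, so that each
# independent serial job can pick its own set of input parameters.
# The line number defaults to the Slurm array task id.
nth_params() {
    local file="$1" index="${2:-$SLURM_ARRAY_TASK_ID}"
    sed -n "${index}p" "$file"
}

# Inside a job script submitted with "sbatch --array=1-100 job.sh",
# the program could then be started as:
#   ./myprog $(nth_params params.txt)
```

Each array task runs as its own serial job, so a failure in one parameter set does not affect the others.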

Where can I find available resources?

Information about available computational resources is available to the public as follows:

Can I find my job submission history?

Yes, for SHARCNET maintained legacy systems, you may review the history by logging in to your web account.

For national Compute Canada systems and systems running Slurm, you can see your job submission history from a specific date YYYY-MM-DD by running the following command:

 sacct --starttime YYYY-MM-DD --format=User,JobID%15,Jobname%25,partition%25,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist

where YYYY-MM-DD is replaced with the appropriate date.

How many jobs can I submit in one cluster?

Currently Graham has a limit of 1000 submitted jobs per user.

How are jobs scheduled?

Job scheduling is the mechanism which selects waiting jobs ("queued") to be started ("dispatched") on nodes in the cluster. On all of the major SHARCNET production clusters, resources are "exclusively" scheduled so that a job will have complete access to the CPUs, GPUs or memory that it is currently running on (it may be pre-empted during the course of its execution, as noted below). Details as to how jobs are scheduled follow below.

How long will it take for my queued job to start?

On national Compute Canada systems and systems running with Slurm, you can see the estimated time your queued jobs will start by running:

 squeue --start -u USER

and replace USER with the name of the account that submitted the job.

What determines my job priority relative to other groups?

The priority of different jobs on the systems is ranked according to the usage by the entire group. This system is called Fairshare. More detail is available here.

Why did my job get suspended?

Sometimes your job may appear to be in a running state, yet nothing is happening and it isn't producing the expected output. In this case the job has probably been suspended to allow another job to run in its place briefly.

Jobs are sometimes preempted (put into a suspended state) if another higher-priority job must be started. Normally, preemption happens only for "test" jobs, which are fairly short (always less than 1 hour). After being preempted, a job will be automatically resumed (and the intervening period is not counted as usage.)

On contributed systems, the PI who contributed equipment and their group have high-priority access and their jobs will preempt non-contributor jobs if there are no free processors.

My job cannot allocate memory

If you did not specify the amount of memory your job needs when you submitted the job, resubmit the job specifying the amount of memory it needs.

If you specified the amount of memory your job needed when it was submitted, then the memory requested was completely consumed. Resubmit your job with a larger memory request. (If this exceeds the memory available, then you will have to make your job use less memory.)

Some specific scheduling idiosyncrasies:

One problem with cluster scheduling is that for a typical mix of job types (serial, threaded, various-sized MPI), the scheduler will rarely accumulate enough free CPUs at once to start any larger job. When a job completes, it frees N cpus. If there's an N-cpu job queued (and of appropriate priority), it'll be run. Frequently, jobs smaller than N will start instead. This may still give 100% utilization, but each of those jobs will complete, probably at different times, effectively fragmenting the N into several smaller sets. Only a period of idleness (lack of queued smaller jobs) will allow enough cpus to collect to let larger jobs run.

Note that clusters enforce runtime limits - if the job is still running at the end of the stated limit, it will be terminated. Note also that when a job is suspended (preempted), this runtime clock stops: suspended time doesn't count, so it really is a limit on "time spent running", not elapsed/wallclock time.

How do I run the same command on multiple clusters simultaneously?

If you're using bash and can login with the SSH authentication agent connection forwarding enabled (the -A flag; ie. you've set up ssh keys; see Choosing_A_Password#Use_SSH_Keys_Instead.21 for a starting point) add the following environment variable and function to your ~/.bashrc shell configuration file:

~/.bashrc configuration: multiple cluster command
export SYSTEMS_I_NEED="graham.computecanada.ca orca.computecanada.ca"
 
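function clusterExec {
  for clus in $SYSTEMS_I_NEED; do
     # ping exits with status 0 only if the login node answered within 1 second
     if ping -q -w 1 "$clus" &> /dev/null; then
       echo ">>> $clus:"; echo ""
       # "$*" passes the whole command line, not just its first word
       ssh "$clus" ". ~/.bashrc; $*"
     else
       echo ">>> $clus down"; echo ""
     fi
   done
}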

You can select the relevant systems in the SYSTEMS_I_NEED environment variable.

To use this function, reset your shell environment (ie. log out and back in again), then run:

clusterExec uptime

You will see the uptime on the cluster login nodes, otherwise the cluster will appear down.

If you have old host keys (not sure why these should change...) then you'll have to clean out your ~/.ssh/known_hosts file and repopulate it with the new keys. If you suspect a problem contact an administrator for key validation or email help@sharcnet.ca. For more information see Knowledge_Base#SSH_tells_me_SOMEONE_IS_DOING_SOMETHING_NASTY.21.3F.

How do I load different modules on different clusters?

SHARCNET maintained systems provide the environment variables named:

  • $CLUSTER, which is the system's hostname (without sharcnet.ca or computecanada.ca), and
  • $CLU, which resolves to a three-character identifier that is unique for each system (typically the first three letters of the cluster's name).

You can use these in your ~/.bashrc to load certain software on a particular system, but not others. For example, you can create a case statement in your ~/.bashrc shell configuration file based on the value of $CLU:

~/.bashrc configuration: loading different modules on different systems
case $CLU in
  orc)
    # load 2014.6 Intel compiler...
    module unload intel
    module load intel/2014.6
  ;;
  gra)
    # load 2018.3 Intel compiler...
    module load intel/2018.3
  ;;
  *)
    # This runs if nothing else matched.
  ;;
esac

Programming and Debugging

What is MPI?

MPI stands for Message Passing Interface, a standard for writing portable parallel programs which is well-accepted in the scientific computing community. MPI is implemented as a library of subroutines which is layered on top of a network interface. The MPI standard has provided both C/C++ and Fortran interfaces so all of these languages can use MPI. There are several MPI implementations, including OpenMPI and MPICH. Specific high-performance interconnect vendors also provide their own libraries - usually a version of MPICH layered on an interconnect-specific hardware library.

For an MPI tutorial refer to the MPI tutorial.

In addition to C/C++ and Fortran versions of MPI, there exist other language bindings as well. If you have any special needs, please contact us.

What is OpenMP?

OpenMP is a standard for programming shared memory systems using threads with compiler directives instrumented in the source code. It provides a higher-level approach to utilizing multiple processors within a single machine while keeping the structure of the source code as close to the conventional form as possible. OpenMP is much easier to use than the alternative (Pthreads) and thus is suitable for adding modest amounts of parallelism to pre-existing code. Because OpenMP is expressed as a set of pragmas (compiler directives), your code can still be compiled by a serial compiler and should still behave the same.

OpenMP for C/C++ and Fortran are supported by many compilers, including the PathScale and PGI for Opterons, and the Intel compilers for IA32 and IA64 (such as SGI's Altix.). OpenMP support has been provided in the GNU compiler suite since v4.2 (OpenMP 2.5), and starting with v4.4 supports the OpenMP 3.0 standard.

How do I run an OpenMP program with multiple threads?

An OpenMP program uses a single process with multiple threads rather than multiple processes. On multicore (i.e. practically all) systems, threads will be scheduled on available processors and thus run concurrently. In order for each thread to run on its own processor, one needs to request the same number of CPUs as the number of threads to use. To run an OpenMP program ompHello that uses four threads, use the following job submission script.

The option --cpus-per-task=4 specifies to reserve 4 CPUs per process.

#!/bin/bash
#SBATCH --account=def-someuser
#SBATCH --time=0-0:5
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./ompHello

For a basic OpenMP tutorial refer to OpenMP tutorial.

How do I measure the cpu time when running multi-threaded job?

If you submit a job through the scheduler, then timing information will be collected by the scheduler itself and stored for later query.

If you are running an OpenMP program interactively, you can use the time utility to collect information.

In a typical example using 8 threads,

export OMP_NUM_THREADS=8
time ./ompHello

Your output will be something like:

real	0m1.633s
user	0m1.132s
sys	0m0.917s

In this example the real and user time are comparable, so the particular example program is not benefitting from multithreading. In general, real time should be less than user time if parallel execution is occurring.
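The comparison can be made explicit by dividing user time by real time. The following is a small sketch using awk for the floating-point division; the function name cpu_ratio is hypothetical:

```shell
# Hypothetical helper: divide user CPU seconds by real (wall-clock) seconds.
# User time is summed over all threads, so a ratio near or below 1 suggests
# little parallel speed-up, while a ratio approaching the thread count
# suggests the threads were kept busy.
# Usage: cpu_ratio REAL_SECONDS USER_SECONDS
cpu_ratio() {
    awk -v real="$1" -v user="$2" 'BEGIN { printf "%.2f\n", user / real }'
}
```

With the figures shown above, cpu_ratio 1.633 1.132 prints 0.69, well short of the 8 threads requested.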

What mathematics libraries are available?

Every system has the basic linear algebra libraries BLAS and LAPACK installed. Normally, these interfaces are contained in vendor-tuned libraries. On Intel-based (Xeon) clusters it's probably best to use the Intel math kernel library (MKL). On Opteron-based clusters, AMD's ACML library is available. However, either library will work reasonably well on both types of systems. If one expects to do a large amount of computation, it is generally advisable to benchmark both libraries so that one selects the one offering best performance for a given problem and system.

One may also find the GNU Scientific Library (GSL) useful for particular needs. The GNU Scientific Library is an optional package, available on any machine.

For a detailed list of libraries on each cluster, please check the documentation on the corresponding SHARCNET satellite web sites.

How do I use mathematics libraries such as BLAS and LAPACK routines?

First you need to know which subroutine you want to use. You need to check the references to find which routines meet your needs. Then place calls to those routines in your program and compile your program to use the particular libraries that have those routines. For instance, if you want to compute the eigenvalues, and optionally the eigenvectors, of an N by N real nonsymmetric matrix in double precision, you will find that the LAPACK routine DGEEV will do that. All you need to do is to have a call to DGEEV, with required parameters as specified in the LAPACK documentation, and compile your program to link against the LAPACK library.

Now to compile the program, you need to link it to a library that contains the LAPACK routines you call in your code. The general recommendation is to use Intel's MKL library, which has a module loaded by default on most Compute Canada/SHARCNET systems. The instructions on how to link your code with these libraries at compile time are provided on the MKL page.


My code is written in C/C++, can I still use those libraries?

Yes. Most of the libraries have C interfaces. If you are not sure about the C interface or you need assistance in using those libraries written in Fortran, we can help you out on a case-by-case basis.

What packages are available?

Various packages have been installed on Compute Canada/SHARCNET clusters at users' requests. The full, up to date list is available on the Compute Canada documentation wiki (link). If you do not see a package that you need on this list, please request it by submitting a problem ticket.

You can also search this wiki or the Compute Canada wiki for the package you are interested in to see if there is any additional information about it available.

We can also help you with compiling/installing a package into your own file space if you prefer that approach.

What interconnects are used on SHARCNET clusters?

Currently, several different interconnects are being used on SHARCNET clusters: Quadrics, Myrinet, InfiniBand and standard IP-based ethernet.

Debugging serial and parallel programs

A debugger is a program which helps to identify mistakes ("bugs") in programs, either at run time or "post-mortem" (by analyzing the core file produced by a crashed program). Debuggers can be either command-line or GUI (graphical user interface) based. Before a program can be debugged, it needs to be (re-)compiled with a switch, -g, which tells the compiler to include symbolic information in the executable. For MPI problems on the HP XC clusters, -ldmpi includes the HP MPI diagnostic library, which is very helpful for discovering incorrect use of the API.

SHARCNET highly recommends using our commercial debugger DDT. It has a very friendly GUI, and can also be used for debugging serial, threaded, MPI, and CUDA (GPGPU) programs. A short description of DDT and cluster availability information can be found on its software page. Please also refer to our detailed Parallel Debugging with DDT tutorial.

SHARCNET also provides gdb (installed on all clusters, type "man gdb" to get a list of options and see our Common Bugs and Debugging with gdb tutorial).


What is NaN ?

NaN stands for "Not a Number". It is an undefined or unrepresentable value, typically encountered in floating point arithmetic (e.g. the square root of a negative number). To debug this in your program one typically has to unmask or trap floating point exceptions. This is fairly straightforward with Fortran compilers (e.g. with Intel's ifort one simply needs to add one switch, "-fpe0"), but somewhat more complicated with C/C++ codes, where the best solution is to use the feenableexcept() function. There are further details in the Common Bugs and Debugging with gdb tutorial.

My program exited with an error code XXX - what does it mean?

Your application crashed, producing an error code XXX (where XXX is a number). What does it mean? The answer may depend on your application. Normally, user codes do not use the first 130 or so error codes, which are reserved for operating system level error codes. On most of our clusters, typing

 perror  XXX

will print a short description of the error. (This is a MySQL utility, and for XXX>122 it will start printing only MySQL-related error messages.) An accurate list of system error codes for the current operating system can be found on our clusters by printing the contents of the file /usr/include/asm-x86_64/errno.h (/usr/include/asm-generic/errno.h on some systems).
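When the number comes from the shell (the value of $? after the program ends), a first classification can be done without any extra tools. The sketch below relies on the usual shell convention that a status above 128 means the program was terminated by a signal, with the signal number equal to the status minus 128. The function name describe_exit is hypothetical, and the wording of the messages is only illustrative.

```shell
#!/bin/sh
# describe_exit STATUS  (hypothetical helper)
# Classifies an exit status as reported by the shell in $?.
describe_exit() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "exit 0: success"
    elif [ "$code" -gt 128 ]; then
        # By shell convention, 128+N means "terminated by signal N".
        sig=$((code - 128))
        name=$(kill -l "$sig" 2>/dev/null) || name=unknown
        echo "exit $code: terminated by signal $sig ($name)"
    else
        echo "exit $code: application or system error code"
    fi
}

# Usage: run a program, capture its status, then describe it.
# Here the "program" is a child shell that sends itself SIGTERM.
status=0
sh -c 'kill -TERM $$' || status=$?
describe_exit "$status"
```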


Getting Help

I have encountered a problem while using a Compute Canada/SHARCNET system and need help, who should I talk to?

If you have access to the Internet, we encourage you to use the problem ticketing system (described in detail below). This is the most efficient way of reporting a problem as it minimizes email traffic and will likely result in you receiving a faster response than through other channels.

You are also welcome to contact system administrators and/or high performance technical computing consultants at any time. You may find their contact information on the directory page.

How long should I expect to wait for support?

Unfortunately, Compute Canada/SHARCNET does not have adequate funding to provide support 24 hours a day, 7 days a week. User support and system monitoring are limited to regular business hours: there is no official support on weekends or holidays, or outside 9:00 - 17:00 EST.

Please note that this includes monitoring of our systems and operations, so typically when there are problems overnight or on weekends/holidays system notices will not be posted until the next business day.

Compute Canada Problem Ticket System

What is a "problem ticket system"?

This is a system that allows anyone with a Compute Canada account to start a persistent email thread that is referred to as a "problem ticket". When a user submits a new ticket it will be brought to the attention of an appropriate and available Compute Canada/SHARCNET staff member for resolution.

You can interact with the ticket system entirely via email. There is also a web interface to see tickets you have submitted in the past.

What do I need to specify in a ticket?

To help us address your question faster, please try to do the following when submitting a ticket:

  1. specify which of our systems is involved
  2. if the problem pertains to a job, then report the jobid associated with the job; this is an integer that is returned by the scheduler when you submit the job
  3. report the exact commands necessary to duplicate the problem, as well as any error output that helps identify the problem; if relevant, this should include how the code is compiled, how the job is submitted, and/or anything else you are doing from the command line relating to the problem
  4. if you'd like for a particular staff member to be aware of the ticket, mention them
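One way to gather the items above into a single text that can be pasted into the ticket email is sketched below. The helper name ticket_report is hypothetical (it is not a SHARCNET or Compute Canada tool), and the job ID 12345 and the failing ls command are placeholders for your own job ID and the command that reproduces the problem.

```shell
#!/bin/sh
# ticket_report JOBID COMMAND [ARGS...]  (hypothetical helper)
# Prints the system name, the job ID, the exact command, its output,
# and its exit status, so they can be copied into a ticket.
ticket_report() {
    jobid=$1
    shift
    echo "system:  $(uname -n)"
    echo "date:    $(date)"
    echo "job id:  $jobid"
    echo "command: $*"
    echo "--- output (stdout and stderr) ---"
    rc=0
    "$@" 2>&1 || rc=$?
    echo "--- exit status: $rc ---"
}

# Usage: reproduce the problem and save everything to a file.
report=$(mktemp)
ticket_report 12345 ls /no/such/dir > "$report"
cat "$report"
```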

How do I submit a ticket?

In general, you can submit a new ticket by emailing support@computecanada.ca with the email address associated with your Compute Canada account. If you are using another email address, please provide your full name, your Compute Canada default username (if available) and your university or institution.

If you like, you can also target your inquiry more specifically, by using the following addresses to submit your ticket:

Fall 2018 changes to the ticket system

In the Fall of 2018 SHARCNET retired its separate ticket system. Now all tickets are handled by the Compute Canada ticket system. The help@sharcnet.ca address will still work for ticket submission; it simply redirects the email to the Compute Canada ticket system. The new system does not make its tickets visible to all users, so you cannot search all existing tickets via a web interface like you could in the old system. Also, in the new system there is no web form to submit a ticket; instead it must always be done by email. On the plus side, the new system is capable of handling file attachments.

How do I give other users access to my files?

There are two ways to provide other users with access to your files. The first is by changing the file attributes of your directories directly with the chmod command and the second is by using file access control lists (ACLs). Using ACLs is more flexible as it allows you to specify individual users and groups and their respective privileges, whereas using chmod is more coarse grained and only allows you to set the permissions for your group and global access.

Enabling Per-user/group Access: chmod Method

Suppose you have a program and some files in:

/home/account/research/projectx

that you want to provide access to some users and/or groups.

The first step is to give the "top" directory you control world execute permission. This allows other users to cd (change directory) into subdirectories under it. Only world execute permission is needed; world read permission is not. (Enabling world read permission allows anyone to see all file and subdirectory names in that directory, so you may wish to keep it turned off.)

The "top" directories you control on SHARCNET are these (where $USER is your Compute Canada userid):

  • /home/$USER
  • /scratch/$USER

(Please note that project file space is special in that the top project directory is owned by the PI of the research group. By default all group members have read access to the contents of their project directory.)

So if you want to provide access to the directory:

/home/$USER/research/projectx

you would run (once) the following command:

chmod o+x /home/$USER

or equivalently (since this is your home directory):

chmod o+x ~

If you also want to be sure others cannot see the files and subdirectories in your home directory, then add a -rw (which turns off the ability to read the directory contents or the ability to write into and delete directory contents) to the chmod command as follows:

chmod o+x-rw /home/$USER

Similarly, you can set your "top" scratch directory if that is where you want to provide access, like this:

chmod o+x /scratch/$USER

or with the added -rw as follows:

chmod o+x-rw /scratch/$USER

NOTE: If unsure and you want to err on the side of keeping things private, use "chmod o+x-rw DIRECTORY_NAME".

Now repeat this process for all directories in the path you want to provide access to except the last one. For example, to provide access to this projectx directory:

/home/$USER/research/projectx

you would need to run:

chmod o+x /home/$USER
chmod o+x /home/$USER/research

For the last directory, i.e., projectx, provide both read and execute permissions, and, if you want to allow others to write to that directory, also allow write permission.

To only provide read and execute permission with the last directory, run:

chmod o+rx /home/$USER/research/projectx

and to provide read, write, and execute permission, run:

chmod o+rwx /home/$USER/research/projectx

More realistically however, you would like others to be able to do either of the following:

  1. read everything in the projectx directory (and disallow others' ability to write/update/delete), or,
  2. read everything and be able to modify/write contents within the projectx directory.

To do the former (i.e., Item 1), run on the last directory (i.e., "projectx" in this example):

chmod -R o+rX-w /home/$USER/research/projectx

and to do the latter (i.e., Item 2), run on the last directory (i.e., "projectx" in this example):

chmod -R o+rwX /home/$USER/research/projectx
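The read-only variant (Item 1) of the steps above can be tried safely on a throwaway directory before touching real data. The sketch below wraps the same chmod commands in a hypothetical helper called share_readonly; it assumes plain absolute paths without trailing slashes, and it uses a temporary directory in place of /home/$USER.

```shell
#!/bin/sh
# share_readonly TOP TARGET  (hypothetical helper, not a SHARCNET tool)
# Adds "x" for others on every directory from TARGET's parent up to TOP,
# then grants read-only access (o+rX-w) on TARGET and everything under it.
share_readonly() {
    top=$1
    target=$2
    case "$target" in
        "$top"/*) ;;
        *) echo "error: $target is not under $top" >&2; return 1 ;;
    esac
    dir=$(dirname "$target")
    while :; do
        if [ "$dir" = / ] || [ "$dir" = . ]; then
            echo "error: walked past $top" >&2
            return 1
        fi
        chmod o+x "$dir"
        if [ "$dir" = "$top" ]; then break; fi
        dir=$(dirname "$dir")
    done
    chmod -R o+rX-w "$target"
}

# Build a small stand-in for /home/$USER/research/projectx.
demo=$(mktemp -d)
mkdir -p "$demo/research/projectx"
echo "results" > "$demo/research/projectx/notes.txt"
chmod -R o-rwx "$demo"

share_readonly "$demo" "$demo/research/projectx"
ls -ld "$demo" "$demo/research" "$demo/research/projectx"
```

In the ls output, the last three permission characters are the "other" bits: the two parent directories should show only x, while projectx shows r-x. Regular files under projectx gain read permission only, because the capital X adds execute solely to directories and to files that are already executable.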

Now, you can tell the users you want to be able to access this directory its FULL PATH, i.e.,

/home/yourloginname/research/projectx

and those users will be able to run:

cd /home/yourloginname/research/projectx

to have the access you've granted. (Don't tell the user a path with $USER in it; that won't work: you must use the full path. If you are unsure, "cd" to that directory and run the "pwd" command which will output the full path to the "present working directory".)

NOTE: The "other" permission settings in this section allow ANY other user to perform the actions permitted by the permissions you've set. If this is too open, then read the sections below that use the setfacl command.

Disabling Per-user/group Access: chmod Method

At some point, you will want to revoke permissions granted to others. If you had previously provided access to your "projectx" directory using these commands:

chmod o+x /home/$USER
chmod o+x /home/$USER/research
chmod -R o+rwX /home/$USER/research/projectx

then you would revoke access using:

chmod o-rwx /home/$USER
chmod o-rwx /home/$USER/research
chmod -R o-rwx /home/$USER/research/projectx

Be aware that this will revoke all "other" access. If other users rely on access to other directories under /home/$USER, then you will not want to run:

chmod o-x /home/$USER

as that will prevent those users from accessing those other directories.
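Before revoking anything, it can help to see which files and directories currently carry any "other" permission at all. The following is a minimal sketch using find; the helper name list_other_access is hypothetical, and the demonstration runs on a temporary directory rather than on your home directory.

```shell
#!/bin/sh
# list_other_access DIR  (hypothetical helper)
# Lists everything under DIR that has at least one "other" permission bit:
#   004 = other read, 002 = other write, 001 = other execute.
list_other_access() {
    find "$1" \( -perm -004 -o -perm -002 -o -perm -001 \) -print
}

# Demonstration on a throwaway directory.
demo=$(mktemp -d)
chmod 700 "$demo"
echo "private" > "$demo/private.txt"
echo "shared"  > "$demo/open.txt"
chmod 640 "$demo/private.txt"
chmod 644 "$demo/open.txt"

list_other_access "$demo"
```

Only open.txt is listed here, since it is the only entry with an "other" bit set. On a real account one would run, for example, list_other_access /home/$USER/research.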

Controlling Access to Files/Directories Using setfacl

An Access Control List (ACL) is a list of users and groups, with their associated file access privileges, that is attached to a file/directory. Using ACLs allows fine-grained control over which users and/or which groups of users can access files and/or directories.

NOTE: If you are (or will be) receiving assistance from multiple SHARCNET staff members, you may find it much easier to grant access to the sn_staff group at one time instead of to each individual staff member. (Similarly, the "cc_staff" group can be used to provide access to all Compute Canada staff.)


Enabling Per-user/group Access: setfacl Method

Although you can use the setfacl command to grant permissions everywhere needed, it is simpler to use the chmod command to set execute permissions on your "top" directory and all directories below the "top" one first. Suppose you want to grant access to the following "projectx" directory (to everything in and under it):

/home/$USER/research/projectx

where $USER is your userid (i.e., Compute Canada login). The "top" directory is:

/home/$USER

so to give others execute permission on it you would run:

chmod o+x /home/$USER

If you prefer giving only a specific user, called USERNAME, access then use setfacl to do this instead:

setfacl -m u:USERNAME:x /home/$USER

Notice the 'u' in "u:USERNAME": the 'u' means "user". Replace USERNAME with the user's name.

If you want to only give a specific group, e.g., sn_staff, access then use setfacl as follows:

setfacl -m g:sn_staff:x /home/$USER

Notice the 'g' in "g:sn_staff": the 'g' means "group". Replace "sn_staff" with the name of the group you want to provide access to.

Similarly, you will want to provide access to the directories under the "top" one except the last one. If you wanted to grant access to your "projectx" directory located here:

/home/$USER/research/projectx

then you will need to grant execute permission to both /home/$USER and /home/$USER/research, e.g.,

chmod o+x /home/$USER
chmod o+x /home/$USER/research

or use setfacl to do the same for some USERNAME:

setfacl -m u:USERNAME:x /home/$USER
setfacl -m u:USERNAME:x /home/$USER/research

or some group (e.g., sn_staff):

setfacl -m g:sn_staff:x /home/$USER
setfacl -m g:sn_staff:x /home/$USER/research

For the last directory, you will want to grant one of the following on all content within that directory:

  1. read and execute permissions without the ability to modify/write,
  2. read, write, and execute permissions.

To do Item 1 (i.e., grant read and execute but no write) with setfacl for some user name to the "projectx" directory:

setfacl -R -m u:USERNAME:rX /home/$USER/research/projectx

and for some group, e.g., sn_staff, one would write:

setfacl -R -m g:sn_staff:rX /home/$USER/research/projectx

To do Item 2 (i.e., grant read, write, and execute), use rwX in place of rX in the same commands.
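Because one setfacl command is needed per directory level, it is easy to miss one. The sketch below prints the full set of commands for review instead of running them. The helper name acl_commands is hypothetical; it assumes plain absolute paths without spaces or trailing slashes, and it uses the ricky and sn_user names from this page's examples.

```shell
#!/bin/sh
# acl_commands WHO PERMS TOP TARGET  (hypothetical helper)
#   WHO    u:USERNAME or g:GROUPNAME
#   PERMS  rX (read-only) or rwX (read and write)
# Prints the setfacl commands needed; it does not run them.
acl_commands() {
    who=$1
    perms=$2
    top=$3
    target=$4
    case "$target" in
        "$top"/*) ;;
        *) echo "error: $target is not under $top" >&2; return 1 ;;
    esac
    dir=$(dirname "$target")
    while :; do
        if [ "$dir" = / ] || [ "$dir" = . ]; then
            echo "error: walked past $top" >&2
            return 1
        fi
        # Each parent directory only needs execute (traverse) permission.
        printf 'setfacl -m %s:x %s\n' "$who" "$dir"
        if [ "$dir" = "$top" ]; then break; fi
        dir=$(dirname "$dir")
    done
    # The shared directory itself gets the requested permissions, recursively.
    printf 'setfacl -R -m %s:%s %s\n' "$who" "$perms" "$target"
}

acl_commands u:ricky rX /home/sn_user /home/sn_user/research/projectx
```

After checking the printed commands, they can be copied and run by hand. Printing rather than executing is a deliberate choice here, since a mistyped path in a recursive setfacl is tedious to undo.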

Now, you can tell the users you want to be able to access this directory its FULL PATH, i.e.,

/home/yourloginname/research/projectx

and those users will be able to run:

cd /home/yourloginname/research/projectx

to have the access you've granted. (Don't tell the user a path with $USER in it -- that won't work: you must use the full path. If you are unsure, "cd" to that directory and run the "pwd" command which will output the full path to the "present working directory".)


Disabling Per-user/group Access: setfacl Method

At some point, you will want to revoke permissions granted to others. If you had previously provided access to your "projectx" directory using these chmod commands:

chmod o+x /home/$USER
chmod o+x /home/$USER/research

then you would revoke access using:

chmod o-rwx /home/$USER
chmod o-rwx /home/$USER/research

Know that this will revoke all "other" users' access through these directories.

If you used setfacl, then run the same command you previously used but replace -m with -x and omit the permissions field (e.g., the trailing ":x" or ":rwX").

For example, if you granted permissions using:

setfacl -m u:USERNAME:x /home/$USER
setfacl -m u:USERNAME:x /home/$USER/research
setfacl -R -m u:USERNAME:rwX /home/$USER/research/projectx

or:

setfacl -m g:sn_staff:x /home/$USER
setfacl -m g:sn_staff:x /home/$USER/research
setfacl -R -m g:sn_staff:rwX /home/$USER/research/projectx

then you would revoke these permissions using (respectively):

setfacl -x u:USERNAME /home/$USER
setfacl -x u:USERNAME /home/$USER/research
setfacl -R -x u:USERNAME /home/$USER/research/projectx

or:

setfacl -x g:sn_staff /home/$USER
setfacl -x g:sn_staff /home/$USER/research
setfacl -R -x g:sn_staff /home/$USER/research/projectx

You can verify that users don't have access using getfacl.
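A minimal sketch of such a check is shown below. It only looks for a named-user entry in getfacl output; a user may still have access through a group entry or through the "other" permissions. The helper name acl_has_user is hypothetical, and the sample text stands in for real getfacl output so that the sketch does not depend on any particular directory.

```shell
#!/bin/sh
# acl_has_user USERNAME  (hypothetical helper)
# Reads getfacl output on standard input and succeeds if it contains
# an entry of the form "user:USERNAME:...".
acl_has_user() {
    grep -q "^user:$1:"
}

# Sample text in the format printed by getfacl.
sample='user::rwx
user:ricky:r-x
group::r-x
mask::r-x
other::--x'

if printf '%s\n' "$sample" | acl_has_user ricky; then
    echo "ricky still has an ACL entry"
else
    echo "no ACL entry for ricky"
fi
```

On a real directory the same function would be used as: getfacl /home/$USER/research 2>/dev/null | acl_has_user USERNAME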

A Brief Overview of the getfacl and setfacl Commands

One can see the ACL for a particular file/directory with the getfacl command, e.g.:

$ getfacl /home/sn_user
getfacl: Removing leading '/' from absolute path names
# file: home/sn_user
# owner: sn_user
# group: sn_user
user::rwx
group::r-x
other::--x

One uses the setfacl command to modify the ACL for a file/directory. For example, to add read and execute permissions on this directory for user ricky:

$ setfacl -m u:ricky:rx /home/sn_user

Now there is an entry for user:ricky with r-x permissions:

$ getfacl /home/sn_user
getfacl: Removing leading '/' from absolute path names
# file: home/sn_user
# owner: sn_user
# group: sn_user
user::rwx
user:ricky:r-x
group::r-x
mask::r-x
other::--x

To remove an ACL entry one uses the setfacl command with the -x argument, e.g.:

$ setfacl -x u:ricky /home/sn_user

Now there is no longer an entry for ricky:

$ getfacl /home/sn_user
getfacl: Removing leading '/' from absolute path names
# file: home/sn_user
# owner: sn_user
# group: sn_user
user::rwx
group::r-x
mask::r-x
other::--x

Note that if one wants to provide access to a nested directory then execute permission must also be granted on each of its parent directories, one at a time as shown in the sections above; the -R flag only recurses down into a directory's contents and does not affect its parents. Please see the man pages for these commands (man getfacl; man setfacl) for further information. If you'd like help utilizing ACLs please email help@sharcnet.ca.

I am new to parallel programming, where can I find quick references at SHARCNET?

SHARCNET has a number of training modules on parallel programming using MPI, OpenMP, pthreads and other frameworks. Each of these modules has working examples that are designed to be easy to understand while illustrating basic concepts. You may find these along with copies of slides from related presentations and links to external resources on the Main Page of this training/help site.

I am new to parallel programming, can you help me get started with my project?

Absolutely. We will be glad to help you with everything from planning the project, architecting your application programs with appropriate algorithms and choosing efficient tools to solve associated numerical problems, to debugging and analyzing your code. We will do our best to help you speed up your research. If your programming project would involve a significant amount of staff time, you should consider applying for Dedicated Programming support. (We run the competition annually; see https://www.sharcnet.ca/my/research/programming).

Can you install a package on a cluster for me?

Certainly. We suggest you make the request by sending e-mail to help@sharcnet.ca with the specific request.

I am in the process of purchasing computer equipment for my research, would you be able to provide technical advice on that?

If you tell us what you want, we may be able to help you out.

Does SHARCNET have a mailing list or user group?

Yes. You may subscribe to one or more mailing lists on the email list page available once you log into the web portal. To find it, please go to MyAccount - Settings - Details in the menu bar on the left and then click on Mail on the "details" page. Don't forget to save your selections.

How do I add/remove myself to/from a SHARCNET mailing list?

To add/remove yourself to/from a SHARCNET mailing list, do the following:

  1. Log in to the SHARCNET portal: https://www.sharcnet.ca/
  2. Click on the My Account menu item.
  3. Click on the Settings menu item under My Account.
  4. Click on the Details menu item under Settings.
  5. Click on the Mail link near the bottom of the page.

The page that appears has checkboxes that allow you to add/remove yourself to/from a SHARCNET mailing list. To add yourself, click the checkbox on the line of the mailing list you are interested in; to remove yourself, uncheck the checkbox instead. Finally, when done, be sure to click the Save button at the bottom of the page to record these changes.

Does SHARCNET provide any training on programming and using the systems?

Yes. SHARCNET provides workshops on specific topics from time to time and offers courses at some sites. Every summer (usually late May to early June), SHARCNET holds an annual HPC Summer School with a variety of in-depth, hands-on workshops. Many materials from past workshops/presentations can be found on SHARCNET's web portal.

SHARCNET also offers a series of online seminars (so-called "General interest webinars"), typically delivered every second Wednesday at lunch time. These are announced via the SHARCNET events mailing list and one can see the schedule at the SHARCNET event calendar. Past seminars are recorded and posted on our youtube channel. A full listing of the past webinars is available on the Online Seminars page.

Attending SHARCNET Webinars

SHARCNET makes a number of seminar events available online (New User Seminar, general interest talks, etc.) using software/services from Vidyo. Vidyo allows both the presenter and the attendees to offer or participate in online seminars by using their web browser or installing a small application. If this is your first Vidyo seminar please join the seminar ahead of the official start, to sort out any technical issues. Vidyo is supported on most platforms, both "stationary" (Windows, MacOS, Linux) and mobile (iOS, Android).

Please note that if your device has a microphone (highly recommended) and/or webcam, they will be used by Vidyo to transmit your audio and video to all seminar participants. They will be on by default, but you can always disable them by clicking on a corresponding button at the bottom of your Vidyo window. We ask that all attendees keep their microphones muted, unless you want to ask something.

We normally record our seminars, and make them available to all SHARCNET users. All recent and new webinars are posted on our youtube channel, http://youtube.sharcnet.ca. The links to the video recordings, slides and abstracts can be found on our online seminars page.

If you do not have headphones and/or a microphone, we provide a toll-free call-in option: 1-855-728-4677, ext 5542.

To receive email notifications about upcoming General Interest seminars, SHARCNET users should enable the "Events" mailing list in their settings; if you are not a SHARCNET user, please send an email to syam@sharcnet.ca.

Please note that times for our webinars are for the Eastern Time (EST/EDT) zone.

Research at SHARCNET

Where can I find what other people do at SHARCNET?

You may find some of the research activities at SHARCNET by visiting our research initiatives and researcher profile pages.

I have a research project I would like to collaborate on with SHARCNET, who should I talk to?

You may contact SHARCNET head office or contact members of the SHARCNET technical staff.

How can I contribute compute resources to SHARCNET so that other researchers can share it?

Most people's research is "bursty" - there are usually sparse periods of time when some computation is urgently needed, and other periods when there is less demand. One problem with this is that if you purchase the equipment you need to meet your "burst" needs, it'll probably sit, underutilized, during other times.

An alternative is to donate control of this equipment to SHARCNET, and let us arrange for other users to use it when you are not. We prefer to be involved in the selection and configuration of such equipment. Some of SHARCNET's most useful clusters were created this way: Goblin, Wobbie and others were purchased with user contributions, and Orca's newest/fastest nodes are contributed. Our promise to contributors is that, as much as possible, they should obtain as much benefit from the cluster as if it were not shared. Owners get preferential access. Naturally, owners are also able to burst to higher peak usage, since their equipment has been pooled with other contributions. (Technically, SHARCNET cannot itself own such equipment; it remains owned by the institution in question, and will be returned to the contributor upon request.) If you think this model will also work for you and you would like to contribute your computational resources to help the research community at SHARCNET, you can contact us to make such an arrangement.

I do not know much about computation, nor is it my research interest. But I am interested in getting my research done faster with the help of the high performance computing technology. In other words, I do not care about the process and mechanism, but only the final results. Can SHARCNET provide this type of help?

We will be happy to bring the technology of high performance computing to you to accelerate your research, if at all possible. If you would like to discuss your plan with us, please feel free to contact our high performance computing specialists. They will be happy to listen to your needs and are ready to provide appropriate suggestions and assistance.

I am a faculty member from a non-SHARCNET member institution. Could I apply for an account and sponsor my students' accounts?

As long as you and your students can obtain a Compute Canada account you will be able to obtain SHARCNET accounts. See above for further information on what is required to obtain an account.

I need access to more CPU cores or storage than are available by default, what programs exist to support demanding computation?

SHARCNET participates in the Compute Canada NRAC (National Resource Allocation Competition) and provides a continual competition for groups that require more than the default level of access to our resources. Please see Dedicated Resources for further information.

I heard SHARCNET offers fellowships, where can I get more information?

SHARCNET no longer actively runs a fellowship program. You may find information regarding past fellowships and other dedicated resource opportunities on the Research Fellowships page of the web portal.

I would like to do some research at SHARCNET as a visiting scholar, how should I apply?

In general, you will need to find a hosting department or a person affiliated with one of the SHARCNET institutions. You may also contact us directly for more specific information.

I would like to send my students to SHARCNET to do some work for me. How should I proceed?

See above.



Contacting SHARCNET

How do I contact SHARCNET for research, academic exchanges, and technical issues?

Please contact SHARCNET head office.

How do I contact SHARCNET for business development, education and other issues?

Please contact SHARCNET head office.

How do I contact a specific staff member at SHARCNET?

See staff directory for contact information.

How to Acknowledge SHARCNET in Publications

How do I acknowledge SHARCNET in my publications?

We recommend one cite the following:

This work was made possible by the facilities of the Shared Hierarchical 
Academic Research Computing Network (SHARCNET:www.sharcnet.ca) and Compute/Calcul Canada.

I've seen different spellings of the name, what is the standard spelling of SHARCNET?

We suggest the spelling SHARCNET, all in upper case.


What types of research programs / support are provided to the research community?

Our overall intent is to provide support that can both respond to the range of needs that the user community presents and help to increase the sophistication of the community and enable new and larger-in-scope applications making use of SHARCNET's HPC facilities. The range of support can perhaps best be understood in terms of a pyramid:

Level 1

At the apex of the pyramid, SHARCNET supports a small number of projects with dedicated programmer support. The intent is to enable projects that will have a lasting impact and may lead to a "step change" in the way research is done at SHARCNET. Inter-disciplinary and inter-institutional projects are particularly welcomed. For the latest information about the program, including application guidelines, please see the Programming Competition page in our web portal. For information about projects that have been supported please see: Dedicated Programming Support Projects.

Level 2

The middle layers of support are provided through a number of initiatives.

These include:

  • Programming support of more modest duration (several days to one month engagement, usually part time)
  • Training on a variety of topics through workshops, seminars and online training materials
  • Consultation. This may include user-initiated interactions on particular programs, algorithms, techniques, debugging, optimization etc., as well as unsolicited help to ensure effective use of SHARCNET systems
  • Site Leaders play an important role in working with the community to help researchers connect with SHARCNET staff and to obtain appropriate help and support.

Level 3

The base level of the pyramid handles the very large number of small requests that are essential to keeping the user community working effectively with the infrastructure on a day-to-day basis. Several of these can be answered by this FAQ; many of the issues are presented through the ticketing system. The support is largely problem oriented with each problem being time limited.