From Documentation
Jump to: navigation, search
Note: This page's content only applies to the SHARCNET's legacy systems; it does not apply to Graham.


Introduction

You should read this knowledge base entry to learn how to best organize their files at SHARCNET. It is common to use multiple filesystems and important to understand the limitations of each.

If you run into problems or have demanding storage needs please open a ticket or contact our technical team staff for assistance.

Storage at SHARCNET

Narwhal, Requin, Whale and Saw have SFS (basically HP's version of the Lustre parallel filesystem) for /work (for those that haven't been migrated to global work) and /scratch

Global work is based on Lustre.

All other filesystems in SHARCNET are NFS (including /home) except for node-local storage, eg. /tmp

SFS

One should aim to minimize the number of "metadata" operations, eg. creating, renaming/moving, copying, closing, deleting files. These operations always involve a single server (the "MDS") which can be overloaded. When the operations are read-only (file lookups) it is can be very fast because much of the information can be cached in ram, but for write/update operations such as file creation, renaming, deletion, or when read data is not cached, the rate is only about 1000/second. This sounds like a lot, but a single user with a moderate number of jobs can easily saturate the server, drastically reducing performance for all other users and jobs.

Metadata contention is frequently the problem with Whale - consider 150 jobs, each of which updates a set of 45 result files every 10 second timestep. if each update is done by open-append-close, the metadata load is around 1400/second, already saturating the server, even though the aggregate data (content) rate may be only a few KB/s. the 3000 other processes on the cluster will be unable to make efficient progress. the problem can be solved by putting all the output into one file, or by keeping the files open and simply flushing after each update. both aproaches will perform IO on data content servers while avoiding metadata load.

File content is handled by multiple servers and can support very high bandwidth. Best performance can be obtained using large (>4MB) transfers and many separate files. aspects of this are tunable - the block size and stripe count can be selected per file or directory (eg. see this section below)

There are three main places where bottlenecks occur, and the worst case will dominate. first, the node's network may be the bottleneck - this is the case on Whale, for instance, where each node has only a 1 Gb connection (110 MB/s theoretical, ~80 MB/s realistic max). the network interface of the storage server may also be a bottleneck, though it is often the same as the node interface within a cluster. finally, the disk array being served will have a characteristic max bandwidth (around 100 MB/s on most SHARCNET storage). this last factor is an important feature of SFS storage that is tunable, though:

NFS

In this filesystem the same server handles metadata and content, but often seems to deliver better metadata performance. content bandwidth is about the same as a single SFS server (a little under 100 MB/s).

/tmp

Use /tmp if possible as it's a filesystem that is mounted from the node's local disk. It isn't large or even very fast, but it's not being shared by hundreds of other jobs so this often means a net win (definitely for meta-data heavy operations).

One way to exploit /tmp is to have your job stage data from /work or /scratch at the start of the run; the job can then use /tmp for all it's I/O, and then the results can be copied back to /work or /scratch at the end of the job.

general tips

How to identify a job that spends a lot of time waiting on I/O

The easiest way to quickly identify jobs that may be spending too much time waiting on I/O is by looking at your completed jobs in our web portal or via the job scheduler at the command line (see Monitoring_Jobs for details). If your User Time is much less than your Alloc. Time then your job is not spending time computing and this may indicate that it is waiting on I/O.

Metadata

The general rule of thumb is to minimize the amount of metadata operations.

For example, if you are appending (scattering) results to several files, try to keep the files open (possibly flushing periodically) to minimize metadata operation overhead instead of opening and closing them whenever a write needs to take place. You could also merge the files (ie. keep all the output in a single file) to further reduce the metadata overhead.

coping with lots of files per directory

All filesystems prefer directories containing few files - in most cases, modern filesystems will use a hash to do lookups, so "large" number of files per directory is probably > 10,000. Using very large directories (eg. 100,000) will work on some systems, but not others, and will be noticeably slow where it works. If you are stuck with a slow directory, keep in mind that by default, "ls" is aliased to "ls --color. This means that each file has to be examined to determine its file type. using "ls -C" will be much faster by avoiding metadata operations. "echo *" is even faster.

In general one should try to consolidate files (eg. archive and unpack on demand with the tar command) and/or use nested directories to categorize the files to avoid having too many files per directory.

If one is stuck with a large directory that requires archiving, it probably makes sense to pack every X files into an archive, where X may be 1000. In this case one can create a small helper script called tarup, containing this text:

sqsub -r 7d -o "$1.out" tar cvfj "$1.tar.bz2" "$@"

make it executable:

chmod +x tarup

This script accepts a list of files and submits a job to generate a compressed tarfile of all of the files with a name based on the first file.

Then one can use the "-n" option of the xargs command to call it for groups of 1000 files like this:

 ls . | xargs -n 1000 ./tarup

file formats

If your files are smallish (say, < 1 MB) and there are few of them, then using ASCII for your data is sensible, convenient and efficient enough.

Once your data becomes large, you should using binary formatting. This will use less space, resulting in faster transfers and read operations (without parsing overhead).

If you need to ensure your data is portable you should use a library routine like HDF or netCDF to write output. These have the added bonus of supporting parallel programs / filesystems.

If your data compresses well you could also consider using a library routine like zlib to automatically compress your output when you write the file.

compression

CPU cycles are cheap and temporal, diskspace is not. As such it is worthwhile to compress your files.

One may use the tar command (as explained below or the gzip command directly to compress files. etc. In general bzip2 provides better compression than gzip but may run a bit slower.

One can also use a library routine like zlib to automatically compress your output when you write the file.

How to manage really large files

Several SHARCNET systems use the SFS file system (a version of Lustre from HP). This means the storage presented as /work and /scratch is actually the amalgamation of multiple pools of storage ("OSTs") on multiple servers. Lustre provides the ability to "stripe" a file across multiple OSTs so that sequential portions of a large file are stored on successive OSTs.

  • if the cluster interconnect is faster than a single OST, this may provide higher bandwidth.
  • striping may or may not be default, depending on the cluster.
  • once a file is assigned to OSTs, its ability to expand is determined by free space on those OSTs.
  • if a file happens to involve full OSTs, it will not be extendable at all, and you may think that the cluster is out of space.
  • this is irrelevant for small files, since the portion of the file on an OST (before going into to the next OST) is typically 1 MB.
  • some kind of access to the file, such as querying its size, are slower for striped files, since each OST has to be contacted.


Depending on the vintage of the filesystem, the lfs command, which is used to manage striping, may behave differently. At SHARCNET there are two versions, one on requin, and another for the CentOS clusters. Their use is as follows:

lfs on requin

For optimal handling of large files, both for performance and expandability, the filesystem should be told to stripe across multiple OSTs:

lfs setstripe <file or directory> <size> -1 <servers>

where <file or directory> is the file to or directory to be striped, <size> is the size in bytes to allocate on each OST, and <servers> is the number of OSTs to stripe across (note that the penultimate parameter is a minus one and not a -l).

For example:

lfs setstripe /scratch/isaac/test/ 4m -1 8

this stripes the file across 8 OSTs (or 8-way stripe). Please note that each server has a little different version of LFS. You can check this in detail as

[isaac@requin377 ~]$ lfs
lfs > help setstripe
setstripe: Create a new file with a specific striping pattern or
set the default striping pattern on an existing directory or
delete the default striping pattern from an existing directory
usage: setstripe <filename|dirname> <stripe_size> <stripe_index> <stripe_count>
      or 
      setstripe <filename|dirname> [--size|-s stripe_size]
                                   [--index|-i stripe_index]
                                   [--count|-c stripe_count]
      or 
      setstripe -d <dirname>   (to delete default striping)
      stripe_size:  Number of bytes on each OST (0 filesystem default)
      Can be specified with k, m or g (in KB, MB and GB respectively)
      stripe_index: OST index of first stripe (-1 filesystem default)
      stripe_count: Number of OSTs to stripe over (0 default, -1 all)


lfs on CentOS systems

For optimal handling of large files, both for performance and expandability, the filesystem should be told to stripe across multiple OSTs:

lfs setstripe -s <size> -c <servers> <file or directory>

where <file or directory> is the file to or directory to be striped, <size> is the size in bytes to allocate on each OST, and <servers> is the number of OSTs to stripe across.

For example:

lfs setstripe -s 4m -c 8 /scratch/isaac/test

stripes the file across 8 OSTs (or 8-way stripe). You can see the difference between this version of lfs and the one on requin as follows:

 lfs > help setstripe
 setstripe: Create a new file with a specific striping pattern or
 set the default striping pattern on an existing directory or
 delete the default striping pattern from an existing directory
 usage: setstripe [--size|-s stripe_size] [--count|-c stripe_count]
                 [--index|-i|--offset|-o start_ost_index]
                 [--pool|-p <pool>] <directory|filename>
       or 
       setstripe -d <directory>   (to delete default striping) 
	stripe_size:  Number of bytes on each OST (0 filesystem default)
	              Can be specified with k, m or g (in KB, MB and GB
	              respectively)
	start_ost_index: OST index of first stripe (-1 default)
	stripe_count: Number of OSTs to stripe over (0 default, -1 all)
	pool:         Name of OST pool to use (default none)
general concerns

If you want to restripe a file, you still need to copy it. For example, to restripe "foo", first set the striping in the directory then do "mv foo foo-old && cp foo-old foo && rm foo-old". The 'cp' is necessary in order to create a new file and thus pick up the new stripe settings.

Saw is a good server to practice this on. For example, Saw currently has 93TB free, and all of the OSTs have at least 2T available, which means that the "minimum maximum" size of an 8-way striped file is 16TB. The maximum effective size of a file is the stripe-count times the lowest available space on any of its OSTs.

The following points apply:

  • specifying a directory causes all newly created files under it to be striped (most likely what you want),
  • a smaller size/more servers adds performance as multiple servers can be streaming data at once, but also reduces performance due to overhead (suggested size might be on the order of 1-4MB and servers on the order of 4-8), and
  • more servers increases the maximum file size by decreasing the amount stored on each file server, but also decreases the maximum file size by making it more likely a full server will be in set.
  • there is a disadvantage to striping too large. If any of the OSTs that the file is on goes down for some reason then any access to the file will hang. Access usually recovers once the problem is fixed.
  • on Requin the network bandwidth is around 800 MB/s, so maximal bandwidth to a single file is achieved by striping across ~8 storage servers.

How to manage lots of files

To determine how many files you have in a particular directory one may run the following command:

 find /work/$USER -type f   | wc -l

At present SHARCNET does not limit the number of files a user may store in a given filesystem, but the cumulative effect of storing many millions of files has a severe impact on the performance of our parallel filesystem. All metadata operations on each of our parallel filesystems are handled by a single metadata server and depending on how files are being accessed this can become overwhelmed and lead to poor overall I/O performance for all users. Further than this, if one stores too many files in a particular directory then performance will severely degrade (see above: Analyzing_I/O_Performance#coping_with_lots_of_files_per_directory).

One way to cope with lots of files is to consolidate them into larger files. One may compress and archive files using the tar command:

 tar -cvjf tar_archive.tar.bz2 /work/$USER/directory_to_archive

The j is for bzip2 compression, one could alternatively use z for gzip. bzip2 is generally better at reducing file size but runs a bit slower than gzip. Then one can unpack this with:

 tar -xvjf tar_archive.tar.bz2

While the above archival methods may serve to rectify the immediate situation, to ensure the best possible performance for your programs moving forward we recommend that you consider optimizing your workflow to minimize the number of files that are created by your jobs and the amount of metadata operations that your program incurs. Doing so will lead to the best performance as more time can be spent computing instead of waiting for the storage system.

For example, if you are appending results to several files, try to keep the files open (possibly flushing periodically, as necessary) to minimize the metadata operation overhead instead of opening and closing them whenever a write needs to take place. Going further, one can merge the files (ie. keep all the output in a single file) to further reduce the overhead. Other storage methods (eg. using a database instead of files) may also yield far better results depending on your program's I/O requirements.

SHARCNET's Technical Team can assist you in this matter and we strongly suggest that you contact us at help@sharcnet.ca if you would like assistance in optimizing your storage usage.

Other resources

These two articles provide a good walk-through of using strace to identify and measure I/O activity: