At the introductory level, one should read this knowledge base entry to learn how to organize files at SHARCNET.
For users who have demanding storage needs we recommend speaking with a consultant to determine the best approach for your work.
Storage at SHARCNET
All other filesystems in SHARCNET are NFS (including /home) except for node-local storage, eg. /tmp
One should aim to minimize the number of "metadata" operations, eg. creating, renaming/moving, copying, closing, deleting files. These operations always involve a single server (the "MDS") which can be overloaded. When the operations are read-only (file lookups) it is can be very fast because much of the information can be cached in ram, but for write/update operations such as file creation, renaming, deletion, or when read data is not cached, the rate is only about 1000/second. This sounds like a lot, but a single user with a moderate number of jobs can easily saturate the server, drastically reducing performance for all other users and jobs.
Metadata contention is frequently the problem with Whale - consider 150 jobs, each of which updates a set of 45 result files every 10 second timestep. if each update is done by open-append-close, the metadata load is around 1400/second, already saturating the server, even though the aggregate data (content) rate may be only a few KB/s. the 3000 other processes on the cluster will be unable to make efficient progress. the problem can be solved by putting all the output into one file, or by keeping the files open and simply flushing after each update. both aproaches will perform IO on data content servers while avoiding metadata load.
File content is handled by multiple servers and can support very high bandwidth. Best performance can be obtained using large (>4MB) transfers and many separate files. aspects of this are tunable - the block size and stripe count can be selected per file or directory (eg. see this section below)
There are three main places where bottlenecks occur, and the worst case will dominate. first, the node's network may be the bottleneck - this is the case on Whale, for instance, where each node has only a 1 Gb connection (110 MB/s theoretical, ~80 MB/s realistic max). the network interface of the storage server may also be a bottleneck, though it is often the same as the node interface within a cluster. finally, the disk array being served will have a characteristic max bandwidth (around 100 MB/s on most SHARCNET storage). this last factor is an important feature of SFS storage that is tunable, though:
In this filesystem the same server handles metadata and content, but often seems to deliver better metadata performance. content bandwidth is about the same as a single SFS server (a little under 100 MB/s).
Use /tmp if possible as it's a filesystem that is mounted from the node's local disk. It isn't large or even very fast, but it's not being shared by hundreds of other jobs so this often means a net win (definitely for meta-data heavy operations).
One way to exploit /tmp is to have your job stage data from /work or /scratch at the start of the run; the job can then use /tmp for all it's I/O, and then the results can be copied back to /work or /scratch at the end of the job.
The general rule of thumb is to minimize the amount of metadata operations.
For example, if you are appending (scattering) results to several files, try to keep the files open (possibly flushing periodically) to minimize metadata operation overhead instead of opening and closing them whenever a write needs to take place. You could also merge the files (ie. keep all the output in a single file) to further reduce the metadata overhead.
coping with lots of files per directory
All filesystems prefer directories containing few files - in most cases, modern filesystems will use a hash to do lookups, so "large" number of files per directory is probably > 10,000. Using very large directories (eg. 100,000) will work on some systems, but not others, and will be noticeably slow where it works. If you are stuck with a slow directory, keep in mind that by default, "ls" is aliased to "ls --color. This means that each file has to be examined to determine its file type. using "ls -C" will be much faster by avoiding metadata operations. "echo *" is even faster.
If your files are smallish (say, < 1 MB) and there are few of them, then using ASCII for your data is sensible, convenient and efficient enough.
Once your data becomes large, you should using binary formatting. This will use less space, resulting in faster transfers and read operations (without parsing overhead).
If your data compresses well you could also consider using a library routine like zlib to automatically compress your output when you write the file.
compress your files. gzip. etc.
One can also use a library routine like zlib to automatically compress your output when you write the file.
How to manage really large files
Several SHARCNET systems use the SFS file system (a HP variant of an earlier version of Lustre). This means the storage presented as /work and /scratch is actually the amalgamation of the storage on several file servers (technically called an Object Store Targets (OST)). By default, files are not split across the file servers, but reside on one of them. This means
- a file can only be as large as the remaining space on which ever server it is stored on, and
- it is possible to run out of space despite there appearing to be lots of free space,
where the later occurs because the file server the file is on becomes full (why the systems says out of space) despite other servers still having space (why the systems says there is still space).
The long and short is that for effective handling of large files, the system needs to be told to split them across multiple file servers. This is done via the following command
lfs setstripe <file or directory> <size> -1 <servers>
where <file or directory> is the file to or directory to split, <size> is the size in bytes to allocate on each file server, and <servers> is the number of file servers to split across (note that is a minus one and not a -l).
For example, you may try in Saw
lfs setstripe /scratch/isaac/test/ 4m -1 8
this stripes the file across 8 bricks (or 8-way stripe). Please note that each server has a little different version of LFS. You can check this in detail as
[isaac@saw377 ~]$ lfs lfs > help setstripe setstripe: Create a new file with a specific striping pattern or set the default striping pattern on an existing directory or delete the default striping pattern from an existing directory usage: setstripe <filename|dirname> <stripe_size> <stripe_index> <stripe_count> or setstripe <filename|dirname> [--size|-s stripe_size] [--index|-i stripe_index] [--count|-c stripe_count] or setstripe -d <dirname> (to delete default striping) stripe_size: Number of bytes on each OST (0 filesystem default) Can be specified with k, m or g (in KB, MB and GB respectively) stripe_index: OST index of first stripe (-1 filesystem default) stripe_count: Number of OSTs to stripe over (0 default, -1 all)
If you want to restripe a file on whale you still need to copy it. For example, to restripe "foo", first set the striping in the directory then do "mv foo foo-old && cp foo-old foo && rm foo-old". The 'cp' is necessary in order to create a new file and thus pick up the new strip settings.
Saw is the best server to practice this. For example, Saw currently has 93TB free, and all of the OSTs have at least 2T available, which means that the "minimum maximum" size of an 8-way striped file is 16TB. The maximum effective size of a file is the stripe-count times the lowest available space on any of its OSTs.
The following points apply
- specifying a directory causes all newly created files under it to be split (most likely what you want),
- a smaller size/more servers adds performance as multiple servers can be streaming data at once, but also reduces performance due to overhead (suggested size might be on the order of 1-4MB and servers on the order of 4-8), and
- more servers increases the maximum file size by decreasing the amount stored on each file server, but also decreases the maximum file size by making it more likely a full server will be in set.
- there is a disadvantage to striping too large. If any of the OSTs that the file is on goes down for some reason then any access to the file will hang. Access usually recovers once the problem is fixed.
- on Bull or Requin the network bandwidth is around 800 MB/s, so maximal bandwidth to a single file is achieved by striping across ~8 storage servers
These two articles provide a good walk-through of using strace to identify and measure I/O activity: