
Introduction

At the introductory level, one should read this knowledge base entry to learn how to organize files at SHARCNET.

For users who have demanding storage needs we recommend speaking with a consultant to determine the best approach for your work.

Storage at SHARCNET

Bull, Narwhal, Requin, Whale and Saw use SFS (essentially HP's version of the Lustre parallel filesystem) for /work and /scratch.

All other filesystems in SHARCNET (including /home) are NFS, except for node-local storage such as /tmp.

SFS

One should aim to minimize the number of "metadata" operations, e.g. creating, renaming/moving, copying, closing, and deleting files. These operations always involve a single server (the "MDS"), which can become overloaded. When the operations are read-only (file lookups) they can be very fast, because much of the information can be cached in RAM, but for write/update operations such as file creation, renaming and deletion, or when the data being read is not cached, the rate is only about 1000 operations per second. This sounds like a lot, but a single user with a moderate number of jobs can easily saturate the server, drastically reducing performance for all other users and jobs.

Metadata contention is frequently the problem on Whale. Consider 150 jobs, each of which updates a set of 45 result files every 10-second timestep. If each update is done by open-append-close, the metadata load is around 1400 operations per second, already saturating the server, even though the aggregate data (content) rate may be only a few KB/s; the 3000 other processes on the cluster will be unable to make efficient progress. The problem can be solved by putting all the output into one file, or by keeping the files open and simply flushing after each update, as in the sketch below. Both approaches perform I/O against the data content servers while avoiding metadata load.
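
The following is a minimal sketch of the keep-open-and-flush approach, written in C with hypothetical file names and loop counts. Metadata operations occur only at startup and shutdown; the per-timestep output touches only the content servers.

/* Minimal sketch of the keep-open-and-flush approach (hypothetical
 * file names and counts).  Metadata operations happen only at startup
 * and shutdown; per-timestep output touches only the content servers. */
#include <stdio.h>
#include <stdlib.h>

#define NFILES 45
#define NSTEPS 1000

int main(void)
{
    FILE *out[NFILES];
    char name[64];

    /* open once: NFILES metadata operations for the whole run */
    for (int i = 0; i < NFILES; i++) {
        snprintf(name, sizeof(name), "result_%02d.dat", i);
        out[i] = fopen(name, "a");
        if (out[i] == NULL) {
            perror("fopen");
            return EXIT_FAILURE;
        }
    }

    for (int step = 0; step < NSTEPS; step++) {
        double value = step * 0.1;           /* stand-in for real results */
        for (int i = 0; i < NFILES; i++) {
            fprintf(out[i], "%d %g\n", step, value);
            fflush(out[i]);                  /* push data without a close/open cycle */
        }
    }

    /* close once: NFILES metadata operations at the end */
    for (int i = 0; i < NFILES; i++)
        fclose(out[i]);
    return EXIT_SUCCESS;
}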

File content is handled by multiple servers and can support very high bandwidth. Best performance is obtained using large (>4 MB) transfers and many separate files. Aspects of this are tunable: the block size and stripe count can be selected per file or directory (see the How to manage really large files section below).

There are three main places where bottlenecks occur, and the worst one will dominate. First, the node's network may be the bottleneck; this is the case on Whale, for instance, where each node has only a 1 Gb connection (110 MB/s theoretical, ~80 MB/s realistic maximum). The network interface of the storage server may also be a bottleneck, though within a cluster it is often the same as the node interface. Finally, the disk array being served has a characteristic maximum bandwidth (around 100 MB/s on most SHARCNET storage); this last factor is the one aspect of SFS storage that is tunable, by striping files across multiple servers (see How to manage really large files below).

NFS

With NFS, the same server handles both metadata and content, but it often seems to deliver better metadata performance than SFS. Content bandwidth is about the same as a single SFS server (a little under 100 MB/s).

/tmp

Use /tmp if possible: it is a filesystem mounted from the node's local disk. It isn't large or even very fast, but it is not shared by hundreds of other jobs, so using it is often a net win (certainly for metadata-heavy operations).

One way to exploit /tmp is to have your job stage data from /work or /scratch to /tmp at the start of the run; the job can then use /tmp for all its I/O, and the results can be copied back to /work or /scratch at the end of the job, as in the sketch below.
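
A minimal sketch of this staging pattern follows, in C with hypothetical paths; in practice the staging copies are just as often done in the job submission script as in the program itself.

/* Hypothetical sketch of staging through node-local /tmp.
 * The paths and file names are placeholders. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* stage in: one bulk copy over the shared filesystem */
    if (system("cp /scratch/username/input.dat /tmp/input.dat") != 0)
        return EXIT_FAILURE;

    /* all per-timestep I/O goes to local disk only */
    FILE *in  = fopen("/tmp/input.dat", "r");
    FILE *out = fopen("/tmp/output.dat", "w");
    if (in == NULL || out == NULL)
        return EXIT_FAILURE;
    /* ... read input, compute, write results ... */
    fclose(in);
    fclose(out);

    /* stage out: one bulk copy back at the end of the job */
    if (system("cp /tmp/output.dat /scratch/username/output.dat") != 0)
        return EXIT_FAILURE;
    return EXIT_SUCCESS;
}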

General tips

Metadata

The general rule of thumb is to minimize the number of metadata operations.

For example, if you are appending (scattering) results to several files, keep the files open (flushing periodically if needed) rather than opening and closing them every time a write needs to take place; this minimizes metadata overhead. You could also merge the files (i.e. keep all the output in a single file) to reduce the metadata overhead further, as in the sketch below.
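
Below is a minimal sketch of the merged-output variant, with hypothetical names: every record goes into a single stream, tagged with the index of the logical result set it belongs to, so only one file is ever opened.

/* Sketch of the merged-output variant (hypothetical names): one open file,
 * each record tagged with the index of the result set it belongs to. */
#include <stdio.h>

#define NSETS 45

int main(void)
{
    FILE *out = fopen("all_results.dat", "a");   /* one metadata operation */
    if (out == NULL)
        return 1;

    for (int step = 0; step < 1000; step++) {
        for (int set = 0; set < NSETS; set++)
            fprintf(out, "%d %d %g\n", set, step, step * 0.1);
        fflush(out);                             /* data only, no metadata */
    }

    fclose(out);
    return 0;
}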

Coping with lots of files per directory

All filesystems prefer directories containing few files. Most modern filesystems use a hash to do lookups, so a "large" number of files per directory is probably more than 10,000. Very large directories (e.g. 100,000 files) will work on some systems but not others, and will be noticeably slow where they do work. If you are stuck with a slow directory, keep in mind that by default "ls" is aliased to "ls --color", which means each file has to be examined (a metadata operation) to determine its type. Using "ls -C" is much faster because it avoids those metadata operations; "echo *" is faster still.

File formats

If your files are smallish (say, < 1 MB) and there are few of them, then using ASCII for your data is sensible, convenient and efficient enough.

Once your data becomes large, you should switch to a binary format. This uses less space, resulting in faster transfers and faster reads (there is no parsing overhead).
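
As a rough illustration (with a hypothetical array and file names), writing a block of doubles with fwrite() stores 8 bytes per value and needs no parsing when read back, whereas the equivalent fprintf() text output is both larger on disk and slower to read:

/* Sketch: writing the same data as binary vs. ASCII (hypothetical names). */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double data[N];          /* ... filled in by the computation ... */

    /* binary: 8 bytes per value, no parsing needed when read back */
    FILE *bin = fopen("results.bin", "wb");
    fwrite(data, sizeof(double), N, bin);
    fclose(bin);

    /* ASCII: larger on disk and must be parsed on every read */
    FILE *txt = fopen("results.txt", "w");
    for (int i = 0; i < N; i++)
        fprintf(txt, "%.17g\n", data[i]);
    fclose(txt);

    return 0;
}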

If you need to ensure your data is portable, you should use a library such as HDF or netCDF to write output. These have the added bonus of supporting parallel programs and filesystems.
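
For instance, here is a minimal sketch that writes an array of doubles in a portable, self-describing format, assuming the HDF5 1.8+ C API; the file and dataset names are placeholders.

/* Minimal HDF5 sketch (assumes the HDF5 1.8+ C API; names are placeholders).
 * Compile with the h5cc wrapper supplied with HDF5. */
#include <hdf5.h>

#define N 1000

int main(void)
{
    double data[N];
    for (int i = 0; i < N; i++)
        data[i] = i * 0.1;                     /* stand-in for real results */

    hsize_t dims[1] = { N };
    hid_t file  = H5Fcreate("results.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/results", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}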

If your data compresses well, you could also consider using a library such as zlib to compress your output automatically as you write the file (see the sketch in the Compression section below).

Compression

Compress your files when they are not in active use, for example with gzip.

One can also use a library such as zlib to compress output automatically as the file is written.
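
For example, here is a minimal zlib sketch (hypothetical file name; link with -lz) that writes gzip-compressed output directly, so the data never sits uncompressed on disk:

/* Minimal zlib sketch: write gzip-compressed output directly
 * (hypothetical file name; link with -lz). */
#include <zlib.h>
#include <stdio.h>

int main(void)
{
    gzFile out = gzopen("results.dat.gz", "wb");
    if (out == NULL) {
        fprintf(stderr, "gzopen failed\n");
        return 1;
    }

    for (int step = 0; step < 1000; step++)
        gzprintf(out, "%d %g\n", step, step * 0.1);   /* compressed as written */

    gzclose(out);
    return 0;
}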

How to manage really large files

Several SHARCNET systems use the SFS file system (a version of Lustre from HP). This means the storage presented as /work and /scratch is actually the amalgamation of multiple pools of storage ("OSTs") on multiple servers. Lustre provides the ability to "stripe" a file across multiple OSTs so that sequential portions of a large file are stored on successive OSTs.

  • if the cluster interconnect is faster than a single OST, this may provide higher bandwidth.
  • striping may or may not be default, depending on the cluster.
  • once a file is assigned to OSTs, its ability to expand is determined by free space on those OSTs.
  • if a file happens to be assigned to full OSTs, it will not be extendable at all, and you may mistakenly think that the cluster is out of space.
  • this is irrelevant for small files, since the portion of a file stored on one OST (before moving on to the next) is typically 1 MB.
  • some kinds of access to the file, such as querying its size, are slower for striped files, since each OST has to be contacted.

For optimal handling of large files, both for performance and expandability, the filesystem should be told to stripe across multiple OSTs:

lfs setstripe <file or directory> <size> -1 <servers>

where <file or directory> is the file or directory to be striped, <size> is the number of bytes to allocate on each OST, and <servers> is the number of OSTs to stripe across (note that the penultimate parameter is minus one, the digit, not a lowercase letter l).

For example, on Saw you might run:

lfs setstripe /scratch/isaac/test/ 4m -1 8

This sets an 8-way stripe with 4 MB per stripe: newly created files under that directory will be striped across 8 OSTs. Note that each system runs a slightly different version of the lfs utility; you can check the exact syntax with its built-in help:

[isaac@saw377 ~]$ lfs
lfs > help setstripe
setstripe: Create a new file with a specific striping pattern or
set the default striping pattern on an existing directory or
delete the default striping pattern from an existing directory
usage: setstripe <filename|dirname> <stripe_size> <stripe_index> <stripe_count>
      or 
      setstripe <filename|dirname> [--size|-s stripe_size]
                                   [--index|-i stripe_index]
                                   [--count|-c stripe_count]
      or 
      setstripe -d <dirname>   (to delete default striping)
      stripe_size:  Number of bytes on each OST (0 filesystem default)
      Can be specified with k, m or g (in KB, MB and GB respectively)
      stripe_index: OST index of first stripe (-1 filesystem default)
      stripe_count: Number of OSTs to stripe over (0 default, -1 all)


If you want to restripe an existing file, you still need to copy it. For example, to restripe "foo", first set the desired striping on the directory, then do "mv foo foo-old && cp foo-old foo && rm foo-old". The 'cp' is necessary in order to create a new file, which picks up the new stripe settings.

Saw is the best system on which to practice this. For example, Saw currently has 93 TB free and all of its OSTs have at least 2 TB available, which means that the guaranteed ("minimum maximum") size of an 8-way striped file is 16 TB. The maximum effective size of a file is the stripe count times the lowest available space on any of its OSTs.

The following points apply:

  • specifying a directory causes all newly created files under it to be striped (most likely what you want).
  • a smaller stripe size and more servers improve performance, since multiple servers can be streaming data at once, but also add some overhead (a suggested size is on the order of 1-4 MB, with on the order of 4-8 servers).
  • more servers increase the maximum file size by decreasing the amount stored on each file server, but also decrease it by making it more likely that a full server will be in the set.
  • there is a disadvantage to striping too widely: if any of the OSTs holding the file goes down for some reason, any access to the file will hang. Access usually recovers once the problem is fixed.
  • on Bull or Requin the network bandwidth is around 800 MB/s, so maximal bandwidth to a single file is achieved by striping across ~8 storage servers. Narwhal's network bandwidth is 250 MB/s, so striping more than 2-way will not improve performance. On Whale, a single OST is about the same speed as the network, so striping does not improve bandwidth.

Other resources

These two articles provide a good walk-through of using strace to identify and measure I/O activity: