This self-study tutorial will discuss issues in handling large amount of data in HPC and also discuss a variety of parallel I/O strategies for doing large-scale Input/Output (I/O) with parallel jobs. In particular we will focus on using MPI-IO and then introduce parallel I/O libraries such as NetCDF, HDF5 and ADIOS.
- 1 HPC I/O Issues & Goal
- 2 Parallel filesystem
- 3 Data Formats
- 4 Serial and Parallel I/O
HPC I/O Issues & Goal
Many today’s problems are increasingly computationally expensive, requiring large parallel runs on large distributed-memory machines (clusters). There would be basically three big I/O activities in these types of jobs. First is the HPC application requires to read initial dataset or conditions from the designated file. Secondly, mostly at the end of a calculation, data need to be stored on disk for follow-up runs or post-processing. As you may guess, parallel applications commonly need to write distributed arrays to disk Thirdly, the application state needs to be written into a file for restarting the application in case of a system failure The figure below shows a simple sketch of I/O bottleneck problem when using many cpus or nodes in a parallel job. As Amdahl’s law says, the speedup of a parallel program is limited by the time needed for the sequential fraction of the program. So, if the I/O part in the application works sequentially as shown, the performance of the code would be not as scalable as desired.
- Reading initial conditions or datasets for processing
- Writing numerical data from simulations for later analysis
- Checkpointing to files
Efficient I/O without stressing out the HPC system is challenging
We will go over the physical problem and limitation in handling data with memory or hard-disk but it is simply expected that load/store operation from memory or hard-disk takes much more time than multiply operations in CPU. Commonly, the total execution time consists of computation time in CPU, communication tim in inter-connection or network and I/O time. So, the efficient I/O handling in high performance computing is a key factor to get best performance.
- Load and store operations are more time-consuming than multiply operations
- Total Execution Time = Computation Time + Communication Time + I/O time
- Optimize all the components of the equation above to get best performance!!
Disk access rates over time
An HPC system, I/O related systems are typically slow as compared to its other parts. The figure in this slide shows how the internal drive access rate has been improved over the time. From 1960 to 2014 top supercomputer speed increased by 11 orders of magnitude. However, as shown in the figure, a Single hard disk drive capacity in the same period of time grew by 6 orders and furthermore average internal drive access rate which we can store data at grew by 3-4 orders of magnitude. So, this discrepancy explains that we are producing much more data which we cannot possibly store it at the proportional rate and hence we need to pay special attention to how to store the data appropriately.
Here is a memory and storage latency. Memory/storage latency refers to delays in transmitting data between the CPU and medium. Most CPUs operate at 1 nano second time scale. As shown the figure, for example, Writing to L2 cache takes about 10 times than CPU operation. As such, accessing to memory has a physical limitation and it also affects the I/O operations.
How to calculate I/O speed
Before we proceed more, we better make sure two following performance measurements. Firstly, there is ‘IOPs’. IOPs means I/O operations per second. The operation includes read/write and so on and IOPs is an inverse of latency (think about period (latency) and frequency(IOPs)). And also there is ‘I/O Bandwidth’. The bandwidth is defined as ‘quantity you read/write’. I believe all of you are quite used to this terminology from Internet connection at your home or office. Anyway, here is an information chart for several I/O devices. As you can see, Top-of-the-line SSDs on a PCI Express can push to unto 1GB IOPs. However, the device is still very expensive so it’s not a right fit for several hundreds terabyte supercomputing systems.
One thing I would like to emphasize is that parallel filesystems are optimized for efficient I/O by multiple users on multiple machines/nodes. So, it does not result in “supercomputing” performance in I/O.
- IOPs = Input / Output operations per second (read/write/open/close/seek) ; essentially an inverse of latency
- I/O Bandwidth = quantity you read / write
Parallel (distributed) filesystems are optimized for efficient I/O by multiple users on multiple machines/nodes, do not result in “supercomputing” performance
- disk-access time + communication over the network (limited bandwidth, many users)
I/O Software + Hardware stack
- I/O Hardware --> Parallel filesystem --> I/O Middleware --> High-end I/O library --> Application
When it comes to organizing parallel I/O, there are several layers of abstraction you should keep in mind. First of all, let’s start from the bottom. There is a I/O hardware which is a physical array or hard-disks attached to the cluster. On top of that, we are running parallel file system.
In SHARCNET, for most of system we are running Lustre which is an open-source filesystem. The purpose of the parallel filesystem is to maintain the logical partitions and provide efficient access to data. Then we have I/O middleware on top of the parallel filesystem. It organizes access from many processes by optimizing two-phase I/O, disk I/O and data flow over the network and also provides data sieving by converting many small non-contiguous I/O requests into fewer/bigger requests. Then there would be a high-end I/O library such as HDF5, NetCDF and so on. What it does is that it maps application abstractions to storage abstractions I/O in terms of the data structures of the code. So, data is stored directly to the disk by calling this library and this library is implemented to work quite efficiently. It is better to use this kind of libraries since SHARCNET supports both of HDF5 and NetCDF. You could also use I/O middleware which is MPI-IO. In today’s talk, I will focus more on MPI-IO which is a part of MPI-2. However, I will also discuss the pros and cons of different approaches. And then, as you may see, there is the application which is mostly your program and your program will decide whether to use high-end I/O library or I/O middleware.
In SHARCNET, we do have a parallel filesystem designed to scale to tens of thousand of computing nodes efficiently. For better performance, files can be striped across multiple drives. It means file does not reside on a single hard drive but multiple drives so that while a hard drive taking reading operation and another drive can send back the data to the program.
In order to avoid that two or more different processes access to a same file, parallel file systems use locks to manage this kind of concurrent file access. What actually happens is that the Files are pieced into ‘lock’ units and scattered across multiple hard drives. Then, Client nodes which is computing node obtain locks on units that they access before I/O occurs
- Files can be striped across multiple drives for better performance
- Locks used to manage concurrent file across in most parallel file system
- Files are pieced into ‘lock’ units (scattered across many drives)
- Client nodes obtain locks on units that they access before I/O occurs
- Enables caching on clients
- Locks are reclaimed from clients when others desire access
The most important part we should know is that the parallel filesystem is optimized for storing large shared files which can be possibly accessible from many computing nodes. So, it shows very poor performance to store many small size files. As you may get told in our new user seminar, we strongly recommend users not to generate millions of small size files.
Also, how you read and write, your file format, the number of files in a directory, and how often you ls command, affects every user! Quite often we got a ticket reporting that the user cannot even ‘ls’ in his or her /work directory. Most of cases for this situation are caused by a user doing very high I/O activities in the directory and it obviously makes system slower. The file system is shared over the ethernet network on a cluster: hammering the file system can hurt process communications which mostly related to MPI communication. That also affects others too.
Please note that the file systems are not infinite: bandwidth, IOPs, number of files, space, . .
- Optimized for large shared files
- Poor performance under many small reads/writes (high IOPs): Do not store millions of small files
- Your use of it affects everybody! (Different from case with CPU and RAM which are not shared)
- Critical factors: how you read / write, file format, # of files in a directory and how often per sec
- File system is shared over the ethernet network on a cluster: heavy I/O can prevent the processes from communication
- File systems are LIMITED: bandwidth, IOPs, # of files, space and etc.
Best Practices for I/O
What would be best practices for I/O.
First of all, it is always recommended to make a plan for your data needs such as how much will be generated and how much do you need to save and where to keep.
In SHARCNET, /home directory has 10GB quota but no expiry, /work directory has 1TB with no expiry and finally /scratch directory has unlimited quota with 4 month expiry. So, please plan it before you run a job.
And please minimize use of ‘ls’ or ‘du’ command especially in a directory with many files.
Regularly check your disk usage with quota command. Furthermore, please take warning signs that should prompt careful consideration when you have more than 100,000 files in your space and average data file size less than 100MB (if writing lots of data)
Please do ‘housekeeping’ regularly to maintain a reasonable number of file and quota. Gzip and tar command are very popular to compress multiple files and group them. So, you could reduce the number of files using these commands.
- Make a plan for your data needs: How much will you generate? How much do you need to save? Where will you keep it?
- Monitor and control usage: Minimize use of filesystem commands like ‘ls’ and ‘du’ in large directories
- Check your disk usage regularly with ‘quota’
- more than 100K files in your space
- average data file size less than 100 MB for large output
- Do ‘housekeeping’ (gzip, tar, delete) regularly
First of all, there is a ASCII or someone refers it as ‘text’ format. It is a human readable file format but not efficient. So, it’s good for a small input or parameter file to run a code. The ASCII format takes larger amount of storage than other types of formats and automatically it costs more for read/write operation. You could check your code implementation if you could find ‘fprintf’ in C code or open command with ‘formatted’ option in FORTRAN code.
ASCII = American Standard Code for Information Interchange
- pros: human readable, portable (architecture independent)
- cons: inefficient storage (13 bytes per single precision float, 22 bytes per double precision, plus delimiters), expensive for read/write
fprintf() in C open(6,file=’test’,form=’formatted’);write(6,*) in F90
Binary format is much ‘cheaper’ in computational sense than ASCII. ASCII has 13 for single precision and 22 for double precision. The table shows an experiment in writing 128M doubles into different locations; /scratch and /tmp on GPCS system in SciNet. As you can see, it is apparent that binary writing takes way shorter time than ASCII format.
|ASCII||173 s||260 s|
|Binary||6 s||20 s|
pros: efficient storage (4 bytes per single precision float, 8 bytes per double precision, no delimiters), efficient read / write cons: have to know the format to read, portability (endians)
fwrite() in C open(6,file=’test’,form=’unformatted’); write(6)in F90
While the binary format works fine and efficient, sometimes there would be a need to store additional information such as number of variables in the array, dimensions and size of the array and so on. So, the metadata is a useful to describe the binary. In case of passing the binary files to someone else or some other programs, it would be very helpful to include those information and to use the meta data format. By the way, it could be also done by using the high-end libraries such as HDF5 and NetCDF.
- Encodes data about data: number and names of variables, their dimensions and sizes, endians, owner, date, links, comments, etc.
Database data format is good for many small records. Using the database, data organizing and analysis can be greatly simplified. CHARENTE supports three different database packages. It is not quite common in the numerical simulation, though.
- very powerful and flexible storage approach
- data organization and analysis can be greatly simplified
- enhanced performance over seek / sort depending on usage
- open-sourcesoftware: SQLite(serverless), PostgreSQL, mySQL
Standar scientific dataset libraries
There are standard scientific dataset libraries. As mentioned in the previous slide, these libraries are very good not just storing the large-scale arrays in an efficient way but also they include data descriptions that the metadata format is good at. Moreover, the libraries provide data portability across platforms and languages which means the binaries generated in one machine can be read in other machines without a problem. The libraries store data automatically with compression. It could be extremely useful. For example, if you run a large-scale simulation and needs to store large dataset in particular with many repeating value such as zero, then the libraries can compress those repeating values efficiently so that you could save the data storage dramatically.
- HDF5 = Hierarchical Data Format
- NetCDF = Network Common Data Format
- Open standards and open-source libraries
- Provide data portability across platforms and languages
- Store data in binary with optional compression
- Include data description
- Optionally provide parallel I/O
Serial and Parallel I/O
In large parallel calculations your dataset is distributed across many processors/nodes. As shown in the right, for example, the calculation domain is decomposed into several work-load pieces and each node takes each allocation. Therefore, each node will compute the allocated domain and try to store the data into the disk. Unfortunately, in this case using parallel filesystem isn’t sufficient – you must organize parallel I/O yourself. It will be discussed shortly. For the file format, there are a couple of options such as a raw binary without metadata information or using high-end libraries (HDF5/NetCDF).
- In large parallel calculations your dataset is distributed across many processors/nodes
- In this case using parallel filesystem isn’t enough – you must organize parallel I/O yourself
- Data can be written as raw binary, HDF5 and NetCDF.
Serial I/O (Single CPU)
When you try to write your data from memory in multiple computing node to a single file on the disk, there would be a couple of approaches. The simplest approach is to set a ‘spokesperson’ to collect all of data from other members in the communication. Once the data is entirely collected using communication it writes the data into a file as a regular serial I/O. It is a really simple solution and easy to implement but there are several following problems. Firstly, the bandwidth for writing is limited by the rate of one client and it applies to the memory limit as well. Secondly, the operation time linearly increases with the amount of data or problem size, and moreover it increases with number of member processes because it will take longer time to collect all data into a single node or cpu. Therefore, this type of approach cannot scale.
Pros: trivially simple for small I/O some I/O libraries not parallel Cons: bandwidth limited by the rate one client can sustain may not have enough memory on a node to hold all data won’t scale (built-in bottleneck)
Serial I/O (N processors)=
What you can do instead is to organize each participating process to do a serial I/O. In other words, all processes perform I/O to individual files. It is somewhat efficient than the previous model but up to a certain limit.
Firstly, when you have a lot of data, you will end up with many files. One file per processors. If you run a large-calculation with many iterations with many variables, even single simulation run could generate over a thousand output files. In this case, as we discussed before, the parallel filesystem performs poor. Again we reviewed I/O best practices and hundreds of thousand files are strongly prohibited.
Secondly, output data often has to be post-processed into a file. It is additional step and it would be quite inefficient surely. Furthermore, when each processors tries to access to the disk about same time uncoordinated I/O may swamp the filesystem (file locks!)
Pros: no interprocess communication or coordination necessary possibly better scaling than single sequential I/O Cons: as process counts increase, lots of (small) files, won’t scale data often must be post-processed into one file uncoordinated I/O may swamp the filesystem (file locks!)
Parallel I/O (N processe to/from 1 file)
The best approach is to do an appropriate parallel I/O. So then each participating process write the data simultaneously into a single file using the parallel I/O. The only thing you should be aware of is that you may want to do this parallel I/O in a coordinated fashion. Otherwise, it will swamp the filesystem.
Pros: only one file (good for visualization, data management, storage) data can be stored canonically avoiding post-processing will scale if done correctly Cons: uncoordinated I/O will swamp the filesystem (file locks!) requires more design and thought