This self-study tutorial will discuss issues in handling large amount of data in HPC and also discuss a variety of parallel I/O strategies for doing large-scale Input/Output (I/O) with parallel jobs. In particular we will focus on using MPI-IO and then introduce parallel I/O libraries such as NetCDF, HDF5 and ADIOS.
HPC I/O Issues & Goal
Many today’s problems are increasingly computationally expensive, requiring large parallel runs on large distributed-memory machines (clusters). There would be basically three big I/O activities in these types of jobs. First is the HPC application requires to read initial dataset or conditions from the designated file. Secondly, mostly at the end of a calculation, data need to be stored on disk for follow-up runs or post-processing. As you may guess, parallel applications commonly need to write distributed arrays to disk Thirdly, the application state needs to be written into a file for restarting the application in case of a system failure The figure below shows a simple sketch of I/O bottleneck problem when using many cpus or nodes in a parallel job. As Amdahl’s law says, the speedup of a parallel program is limited by the time needed for the sequential fraction of the program. So, if the I/O part in the application works sequentially as shown, the performance of the code would be not as scalable as desired.
- Reading initial conditions or datasets for processing
- Writing numerical data from simulations for later analysis
- Checkpointing to files
Efficient I/O without stressing out the HPC system is challenging
We will go over the physical problem and limitation in handling data with memory or hard-disk but it is simply expected that load/store operation from memory or hard-disk takes much more time than multiply operations in CPU. Commonly, the total execution time consists of computation time in CPU, communication tim in inter-connection or network and I/O time. So, the efficient I/O handling in high performance computing is a key factor to get best performance.
- Load and store operations are more time-consuming than multiply operations
- Total Execution Time = Computation Time + Communication Time + I/O time
- Optimize all the components of the equation above to get best performance!!
Disk access rates over time
An HPC system, I/O related systems are typically slow as compared to its other parts. The figure in this slide shows how the internal drive access rate has been improved over the time. From 1960 to 2014 top supercomputer speed increased by 11 orders of magnitude. However, as shown in the figure, a Single hard disk drive capacity in the same period of time grew by 6 orders and furthermore average internal drive access rate which we can store data at grew by 3-4 orders of magnitude. So, this discrepancy explains that we are producing much more data which we cannot possibly store it at the proportional rate and hence we need to pay special attention to how to store the data appropriately.
Here is a memory and storage latency. Memory/storage latency refers to delays in transmitting data between the CPU and medium. Most CPUs operate at 1 nano second time scale. As shown the figure, for example, Writing to L2 cache takes about 10 times than CPU operation. As such, accessing to memory has a physical limitation and it also affects the I/O operations.
How to calculate I/O speed
Before we proceed more, we better make sure two following performance measurements. Firstly, there is ‘IOPs’. IOPs means I/O operations per second. The operation includes read/write and so on and IOPs is an inverse of latency (think about period (latency) and frequency(IOPs)). And also there is ‘I/O Bandwidth’. The bandwidth is defined as ‘quantity you read/write’. I believe all of you are quite used to this terminology from Internet connection at your home or office. Anyway, here is an information chart for several I/O devices. As you can see, Top-of-the-line SSDs on a PCI Express can push to unto 1GB IOPs. However, the device is still very expensive so it’s not a right fit for several hundreds terabyte supercomputing systems.
One thing I would like to emphasize is that parallel filesystems are optimized for efficient I/O by multiple users on multiple machines/nodes. So, it does not result in “supercomputing” performance in I/O.
- IOPs = Input / Output operations per second (read/write/open/close/seek) ; essentially an inverse of latency
- I/O Bandwidth = quantity you read / write
Parallel (distributed) filesystems are optimized for efficient I/O by multiple users on multiple machines/nodes, do not result in “supercomputing” performance
- disk-access time + communication over the network (limited bandwidth, many users)
I/O Software + Hardware stack
- I/O Hardware --> Parallel filesystem --> I/O Middleware --> High-end I/O library --> Application
When it comes to organizing parallel I/O, there are several layers of abstraction you should keep in mind. First of all, let’s start from the bottom. There is a I/O hardware which is a physical array or hard-disks attached to the cluster. On top of that, we are running parallel file system.
In SHARCNET, for most of system we are running Lustre which is an open-source filesystem. The purpose of the parallel filesystem is to maintain the logical partitions and provide efficient access to data. Then we have I/O middleware on top of the parallel filesystem. It organizes access from many processes by optimizing two-phase I/O, disk I/O and data flow over the network and also provides data sieving by converting many small non-contiguous I/O requests into fewer/bigger requests. Then there would be a high-end I/O library such as HDF5, NetCDF and so on. What it does is that it maps application abstractions to storage abstractions I/O in terms of the data structures of the code. So, data is stored directly to the disk by calling this library and this library is implemented to work quite efficiently. It is better to use this kind of libraries since SHARCNET supports both of HDF5 and NetCDF. You could also use I/O middleware which is MPI-IO. In today’s talk, I will focus more on MPI-IO which is a part of MPI-2. However, I will also discuss the pros and cons of different approaches. And then, as you may see, there is the application which is mostly your program and your program will decide whether to use high-end I/O library or I/O middleware.
In SHARCNET, we do have a parallel filesystem designed to scale to tens of thousand of computing nodes efficiently. For better performance, files can be striped across multiple drives. It means file does not reside on a single hard drive but multiple drives so that while a hard drive taking reading operation and another drive can send back the data to the program.
In order to avoid that two or more different processes access to a same file, parallel file systems use locks to manage this kind of concurrent file access. What actually happens is that the Files are pieced into ‘lock’ units and scattered across multiple hard drives. Then, Client nodes which is computing node obtain locks on units that they access before I/O occurs
- Files can be striped across multiple drives for better performance
- Locks used to manage concurrent file across in most parallel file system
- Files are pieced into ‘lock’ units (scattered across many drives)
- Client nodes obtain locks on units that they access before I/O occurs
- Enables caching on clients
- Locks are reclaimed from clients when others desire access