



# Parallel Computing Models

David McCaughan, *HPC Analyst* SHARCNET, University of Guelph

dbm@sharcnet.ca



### **Parallel Computing Models**



- Parallel Random Access Memory (PRAM)
  - conceptual model of multiple processors accessing memories
  - the idea of "clusters" broadens our notion of parallelism
- · How is memory access controlled?
  - Exclusive Read, Exclusive Write (EREW)
    - · only one processor can access a memory location at one time
  - Concurrent Read, Exclusive Write (CREW)
    - multiple processors can read a memory location, but only one at a time can write to it
  - Exclusive Read, Concurrent Write (ERCW)
  - · multiple processors can write to a memory location but reads are exclusive
  - Concurrent Read, Concurrent Write (CRCW)
    - · processors may freely read or write to memory locations without restriction

HPC Resources



# Flynn's Taxonomy (1966)



- · Flynn characterized systems by
- number of instruction streams
  - number of data streams

#### Instructions

single multiple SISD MISD •1 processor odd concept •1 program •vector processors (?) •von Neumann SIMD **MIMD** •1 CPU / many ALUs •multiple CPUs each with memory •each with memory \*synchronous •asynchronous

HPC Resources



#### SISD



- · Traditional model of sequential computation
  - one processor
  - one memory space
- Sequential algorithms state a step-wise solution to a problem using simple instructions and only the following three classes of operation
  - sequence
  - iteration
  - branching
- · Processor executes one instruction at a time
- · Single Instruction Single Data (SISD)



























- interconnection networks
  - multistage interconnection networks (MIN)
  - · mesh, toroid, hypercube
- routing solutions
  - pipelining
  - · wormhole routing





- Modern processors don't really see memory directly
  - view it through their cache (often multiple levels of it)
  - cache holds copies of values from main memory in very fast local storage
- · Cache has such a dramatic effect on system performance it cannot be ignored (unit stride movement through memory, or close to it, is very common)

HPC Resources



- ignoring it can produce unusual behaviour
- · Traditional strategies for cache coherence
  - write-through
    - · variables written to cache are written to memory immediately
    - this works with a shared bus as other caches can see the memory update on the bus and update their contents
    - · places pressure on bus bandwidth
  - writeback
    - · cache only writes back contents when it needs the space
    - · caches coordinate contents with one another via dedicated hardware



- As we consider more complex models cache coherence remains an issue
  - shared memory systems become increasingly complex
    - especially true for non-uniform memory models
  - in a message passing environment the issue can be ignored between nodes
    - · communication between nodes is explicit
- Programmer typically does not have to deal with cache coherence issues directly
  - overall performance can be affected depending on the cache coherence policy implemented by the system vendor









- processors nammer available bandwidth
- does not scale well
  increasing complexity (particularly the switches)
  - · increasing cost





## Non-uniform Memory/ Clusters



- Allow models to scale by abandoning uniform memory access
  - processors organized into nodes containing processors and local (fast) memory
  - nodes connected using an interconnect
  - local memory will be much faster than accessing memory on another node
    - shared memory systems may make global memory appear as a single memory space
    - in the extreme case is message passing in a cluster where there is no direct remote memory access

HPC Resources

HPC Resources



- increased opportunity for contention



SHARCNET" **Hypercube Topologies** Minimize number of hops by

organizing node adjacency on a hypercube

maximize performance

- Worst case number of hops for n nodes is log<sub>2</sub>n
  - e.g.

HPC Resources

- 256 nodes in 8-cube has max 8 hops (1024 connections)
- have max 30 hops (225 connections)





HPC Resources

# SHARCNET™ **Toroid Topologies** Mesh topology with "wraparound" connections between nodes on the edge of the mesh · Cuts number of hops in the worst case by 50%

· More complex to realize than simple meshes



HPC Resources



#### Routing



- Pipelining
  - all data in a given communication can be routed using the same path
  - data packets sent one after another after the initial packet
  - path length (number of hops) only an issue for the first packet
    - · remaining packets will arrive in sequence immediately after the first
    - · significantly improves performance in multi-hop interconnects
- Wormhole routing (William Dally, MIT)
  - Allow data to route itself through an interconnection network
    - · bits encoding destination sent at beginning of message
    - · each router interprets the leading bits to determine outgoing path, strips these bits and sends rest of message unchanged
    - · data sent by pipelining
    - the visual metaphor is of the head of the message moving through nodes like a worm with the data trailing behind