Supercomputing for Mortals

How to get started with HPC on your laptop

[Image: the Summit supercomputer]

All Hail to the Summit!

The US has recently wrestled the HPC supercomputer crown back from China with Summit, a 200-petaFLOP goliath with over 9,000 22-core IBM POWER9 CPUs and a mind-boggling 27,000+ NVIDIA Tesla V100 GPUs.

These huge computing resources are usually built for specific, computationally expensive numeric problems. In the case of Summit, its main use case is nuclear weapons simulation, though interestingly, this isn't one of the use cases advertised on its website!

HPC Architectures

All modern High Performance Computing (HPC) architectures are of the basic design shown below.

[Figure: typical HPC cluster architecture]

The HPC Development Challenge

The challenge is how to build a software framework to distribute and parallelise code over such a hardware architecture.

Historically, the main framework used has been the Message Passing Interface, or MPI. In MPI, tasks are distributed over the HPC architecture and communicate with each other via messages. The challenge with MPI is that the software engineer has to decide how to split the tasks, what the message interface looks like, and which communication model the tasks will run under. These communication models are broken down into the patterns below, with a small code sketch following the list:

  • Point to Point - where two tasks send messages to each other directly
  • Broadcast - where data is published to all tasks
  • Scatter - where data is broken into partitions and distributed to multiple tasks
  • Gather - essentially the reverse of scattering, where a single task gathers data back from multiple tasks
  • Reduce - where a single task aggregates the data from each remote task with an operation such as a sum; essentially a Gather combined with a computation
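
To make these patterns concrete, here is a minimal sketch using the mpi4py Python bindings (the choice of mpi4py is an assumption for illustration; classic MPI codes are often written against the C or Fortran bindings instead):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this task's ID
size = comm.Get_size()   # total number of tasks

# Scatter: the root task partitions the data and hands one partition to each task
data = [np.full(4, i, dtype='d') for i in range(size)] if rank == 0 else None
chunk = comm.scatter(data, root=0)

# each task processes its own partition independently
local_sum = chunk.sum()

# Reduce: the root task aggregates the partial results with a sum
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print('total:', total)

Launched with, for example, mpirun -n 4 python example.py - and even this toy pipeline already forces the scatter/reduce 'plumbing' decisions onto the engineer.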

For simple parallel problems, such as algorithms which can be classed as embarrassingly parallel, the design choice can be relatively simple. However, as the set of algorithms and processing pipelines becomes more complicated, the MPI implementation becomes challenging. Often, with MPI, more time can be spent on the 'plumbing' than on writing the code to solve the domain problem. It's no surprise that HPC software development with MPI is a very specialist skill, and the advantages of large-scale computing are out of reach for the 'average' software engineer or data scientist.

Python Dask to the Rescue

This is where the Python Dask library comes to the rescue. Dask excels in that it provides a familiar NumPy- and Pandas-like interface for common numeric, scientific and data science problems, with a 'bag' API for more general-purpose map/reduce-style computation suited to unstructured data. The real power of Dask, though, comes from the fact that it builds an optimised task graph for you, so you can concentrate your effort on solving the domain problem, not on how to maximise the resources of the HPC.
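
As a quick taste of the bag API, here is a minimal map/reduce-style word count over unstructured text (the input records are made up for the example):

import dask.bag as db

# a bag of unstructured text records
lines = db.from_sequence([
    'the quick brown fox',
    'jumps over the lazy dog',
    'the end',
])

word_counts = (lines.map(str.split)   # map: tokenise each record
                    .flatten()        # flatten into a single bag of words
                    .frequencies())   # reduce: count occurrences of each word

print(word_counts.compute())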

Show Me the Code

Let's see some code. Here's how you get a local Dask cluster started:

from dask.distributed import Client, progress

# start a local cluster in-process: one worker with four threads and a 2GB memory limit
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='2GB')
client  # in a notebook, this displays the cluster's allocated resources

Once spun up and available, Dask should return a message telling you the number of cores and the amount of memory the cluster has allocated. If you're using a shared HPC cluster with other workloads running, you may not get all the cores and memory you request.

Let's create a large numeric array of data, using the familiar NumPy-like syntax.

import dask.array as da

# a 10,000 x 10,000 array of random numbers, split into 1,000 x 1,000 chunks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
x
dask.array<random_sample, shape=(10000, 10000), dtype=float64, chunksize=(1000, 1000)>

Let's now carry out a simple computation on this array.

# add the array to its transpose, then take the mean of part of every other row
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z
dask.array<mean_agg-aggregate, shape=(5000,), dtype=float64, chunksize=(500,)>

Notice here that z has not returned an answer, but a pointer to a new Dask array. By default, all calls to the Dask API are lazy, and it's only when you issue a .compute() call that the whole preceding task graph gets submitted to the cluster for computation and evaluation, like this:

z.compute()
array([0.99524228, 1.01138959, 1.00196082, ..., 0.99702404, 1.00168843,
       0.99625349])
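
For longer-running graphs, the progress helper imported earlier is useful: persist() starts the computation on the cluster in the background, and progress() displays a live progress bar while the workers churn.

# persist() kicks off computation asynchronously and returns immediately;
# progress() then shows a progress bar for the running task graph
z = z.persist()
progress(z)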

The power of Dask is that you can get started on a laptop with Python, then transfer your algorithm to an HPC cluster and scale up the computation with no change to the code.
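
In practice, that transfer is just a one-line change to how the Client is created: instead of spinning up a local cluster, you point it at a scheduler already running on the HPC cluster. Every Dask call after that line is unchanged (the address below is a hypothetical placeholder; substitute your own scheduler's host and port).

# connect to a remote Dask scheduler instead of starting a local cluster;
# the address is a made-up placeholder
client = Client('tcp://hpc-scheduler.example.com:8786')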

HPC for Everyone

Ah, you'll say, I don't have access to an HPC like Summit. True, most people don't have access to an HPC cluster. I'm personally lucky that I work for a company that has its own private HPC environment. Unless you work in academia, or for a large industrial company, typically in the automotive, aerospace/defence or pharmaceutical industries, you're unlikely to be able to access that level of compute.

This is where cloud computing comes to the rescue, in particular services such as Amazon EC2 Spot Instances. Spot Instances allow you to request compute resources at substantial discounts to on-demand rates. This is because Amazon has the right to interrupt and pause your compute with only two minutes' notice. For example, at the time of writing this article, you can have an m4.16xlarge (64 vCPU, 256GB RAM) Spot Instance at ~$1 per hour, which is incredible. This particular configuration comes with a potential interruption rate of greater than 20%, but if you optimise your Dask workload to suit, for example by keeping all 64 vCPUs busy so that jobs finish quickly, roughly 80% of the time you will see no interruptions at all.
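
A hedged sketch of requesting such an instance with the boto3 AWS SDK (the AMI ID and maximum price below are made-up placeholders; check current Spot pricing for your region and use an AMI available in your account):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# request a single one-time m4.16xlarge Spot Instance;
# ImageId is a placeholder - substitute your own AMI
response = ec2.request_spot_instances(
    SpotPrice='1.20',   # maximum price you are willing to pay per hour
    InstanceCount=1,
    Type='one-time',
    LaunchSpecification={
        'ImageId': 'ami-0123456789abcdef0',
        'InstanceType': 'm4.16xlarge',
    },
)
print(response['SpotInstanceRequests'][0]['SpotInstanceRequestId'])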

Don’t Panic

So there you have it. Supercomputing is now available to everyone. All you have to do is work out what massive computational problem to solve. I'd recommend looking at this video for inspiration.