Overview of Vector Supercomputer SX-ACE and Its Applications

Hiroaki Kobayashi
Tohoku University
koba@tohoku.ac.jp

Russian Supercomputing Days
Moscow, Russia
September 26-27, 2016
About WSSP

Toward Future HPC Technologies

Tohoku University, Japan Agency for Marine-Earth Science and Technology (JAMSTEC), High Performance Computing Center Stuttgart (HLRS), and NEC Corporation are pleased to announce that the 23rd Workshop on Sustained Simulation Performance (WSSP) will be held on March 16th and 17th, 2016 in Sendai, Miyagi, Japan. The purpose of the workshop is to discuss future supercomputers, through the latest research efforts in large-scale computing with high performance and high efficiency.

We look forward to seeing you at the workshop.

Technical Program Overview

In the workshop, two keynote talks are scheduled.

Keynote Talk I

Parallel Algorithms: Theory, Practice and Education

Prof. Vladimir Voevodin
(Moscow State Univ.)
Missions of Cyberscience Center
As a National Supercomputer Center

High-Performance Computing Center founded in 1969

- Offering leading-edge high-performance computing environments to academic users nationwide in Japan
  - 24/7 operations of large-scale vector-parallel and scalar-parallel systems
  - 1500 users registered in AY 2015
- User support
  - Benchmarking, analyzing, and tuning users’ programs
  - Holding seminars and lectures
- Supercomputing R&D in collaboration with NEC
  - Designing next-generation high-performance computing systems and their applications for highly-productive supercomputing
  - 57-year history of collaboration between Tohoku University and NEC on High Performance Vector Computing
- Education
  - Teaching and supervising BS, MS, and Ph.D. students as a cooperative laboratory of the Graduate School of Information Sciences, Tohoku University

Tohoku Univ.'s New Supercomputer System (2015.2.20~)

- Vector Parallel System (brand-new vector system installed in 2015): SX-ACE, 2560 nodes (5 clusters), 4 cores and 64GB per node; 707 TFlops, 160TB memory
- Scalar Parallel System (Xeon IvyBridge cluster installed in 2014): LX 406Re-2 (24 cores, 128GB), 68 nodes; 31.3 TFlops, 8.5TB memory
- Storage: 1PB Lustre, 3PB shared disk
- Interconnect and networks: IXS, Ethernet, HPCI
- 3D Tiled Display Wall System

Peak performance as of April 2016: 10 Petaflops (K computer), 9.2 Petaflops (other systems)
## Organization of Tohoku Univ. SX-ACE System

<table>
<thead>
<tr>
<th>Level</th>
<th>Size</th>
<th>Peak Performance (VPU+SPU)</th>
<th>Memory BW</th>
<th>Memory Cap.</th>
</tr>
</thead>
<tbody>
<tr>
<td>Core</td>
<td>1</td>
<td>69Gflop/s (68GF+1GF)</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Node (1 CPU socket)</td>
<td>4 Cores</td>
<td>276Gflop/s</td>
<td>256GB/s</td>
<td>64GB</td>
</tr>
<tr>
<td>Cluster</td>
<td>512 Nodes</td>
<td>141Tflop/s (139TF+2TF)</td>
<td>131TB/s</td>
<td></td>
</tr>
<tr>
<td>Total System</td>
<td>5 Clusters</td>
<td>707Tflop/s (697TF+10TF)</td>
<td>655TB/s</td>
<td>160TB</td>
</tr>
</tbody>
</table>
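
The cluster and total-system figures on this slide follow from the per-node figures multiplied by the node count. The sketch below reproduces that arithmetic; the constant and function names are illustrative, and the use of 1024 for the memory-capacity conversion is an assumption made to match the slide's 160TB.

```python
# Per-node figures taken from the slide.
NODES_PER_CLUSTER = 512
CLUSTERS = 5
NODE_GFLOPS = 276.0       # VPU + SPU peak per node
NODE_MEM_BW_GB_S = 256.0  # memory bandwidth per node
NODE_MEM_GB = 64.0        # memory capacity per node


def scale(per_node: float, nodes: int, unit_divisor: float) -> float:
    """Aggregate a per-node quantity over `nodes` and convert to a larger unit."""
    return per_node * nodes / unit_divisor


total_nodes = NODES_PER_CLUSTER * CLUSTERS

cluster_tflops = scale(NODE_GFLOPS, NODES_PER_CLUSTER, 1000.0)
total_tflops = scale(NODE_GFLOPS, total_nodes, 1000.0)
cluster_bw_tb_s = scale(NODE_MEM_BW_GB_S, NODES_PER_CLUSTER, 1000.0)
total_bw_tb_s = scale(NODE_MEM_BW_GB_S, total_nodes, 1000.0)
# Assumption: binary conversion (1TB = 1024GB) to match the slide's 160TB.
total_mem_tb = scale(NODE_MEM_GB, total_nodes, 1024.0)

print(f"cluster: {cluster_tflops:.1f} Tflop/s, {cluster_bw_tb_s:.1f} TB/s")
print(f"system:  {total_tflops:.1f} Tflop/s, {total_bw_tb_s:.1f} TB/s, {total_mem_tb:.0f} TB")
```
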

### IXS Node BW
- **IXS Node BW:** 4GB/s x2

### File System (3PB+1PB)

### 3D Tiled Display System

### Academic Network (TAINS, SINET)
## Features of Tohoku Univ. SX-ACE System

### Significant Performance Improvement with Lower Power and Less Space

<table>
<thead>
<tr>
<th></th>
<th>SX-9</th>
<th>SX-ACE</th>
<th>Ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of Cores</td>
<td>1</td>
<td>4</td>
<td>4x</td>
</tr>
<tr>
<td><strong>CPU Performance</strong></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Total Flop/s</td>
<td>118.4Gflop/s</td>
<td>276Gflop/s</td>
<td>2.3x</td>
</tr>
<tr>
<td>Memory Bandwidth</td>
<td>256GB/sec</td>
<td>256GB/sec</td>
<td>1</td>
</tr>
<tr>
<td>ADB Capacity</td>
<td>256KB</td>
<td>4MB</td>
<td>16x</td>
</tr>
<tr>
<td><strong>Total Performance, Footprint, Power Consumption</strong></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Total Flop/s</td>
<td>34.1Tflop/s</td>
<td>706.6Tflop/s</td>
<td>20.7x</td>
</tr>
<tr>
<td>Total Memory Bandwidth</td>
<td>73.7TB/s</td>
<td>655TB/s</td>
<td>8.9x</td>
</tr>
<tr>
<td>Total Memory Capacity</td>
<td>18TB</td>
<td>160TB</td>
<td>8.9x</td>
</tr>
<tr>
<td>Power Consumption (Max)</td>
<td>590kVA</td>
<td>1,080kVA</td>
<td>1.8x</td>
</tr>
<tr>
<td>Footprint</td>
<td>293m²</td>
<td>430m²</td>
<td>1.5x</td>
</tr>
</tbody>
</table>

### Powerful CPU/Node Performance and Higher B/F rate

<table>
<thead>
<tr>
<th></th>
<th>SX-ACE</th>
<th>K computer</th>
<th>Ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>CPU (Node) Performance</strong></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Clock Frequency</td>
<td>1GHz</td>
<td>2GHz</td>
<td>0.5x</td>
</tr>
<tr>
<td>Flop/s per Core</td>
<td>64Gflop/s</td>
<td>16Gflop/s</td>
<td>4x</td>
</tr>
<tr>
<td>Cores per CPU</td>
<td>4</td>
<td>8</td>
<td>0.5x</td>
</tr>
<tr>
<td>Flop/s per CPU</td>
<td>256Gflop/s</td>
<td>128Gflop/s</td>
<td>2x</td>
</tr>
<tr>
<td>Bandwidth</td>
<td>256GB/s</td>
<td>64GB/s</td>
<td>4x</td>
</tr>
<tr>
<td>Bytes per Flop (B/F)</td>
<td>1</td>
<td>0.5</td>
<td>2x</td>
</tr>
<tr>
<td>Memory Capacity</td>
<td>64GB</td>
<td>16GB</td>
<td>4x</td>
</tr>
</tbody>
</table>

A Balanced System for High Sustained Performance, resulting in High Productivity in the Wide Area of Applications in Academia and Industry
Features of the SX-ACE Vector Processor

- 4-core configuration, each core with a high-performance Vector Processing Unit (VPU) and a Scalar Processing Unit (SPU)
  - 272Gflop/s of VPU + 4Gflop/s of SPU per socket
    - 68Gflop/s + 1Gflop/s per core
  - 1MB private ADB per core (4MB per socket)
    - Software-controlled on-chip memory for vector load/store
    - 4x compared with SX-9
    - 4-way set-associative
    - MSHR with 512 entries (address+data)
    - 256GB/s to/from Vec. Reg.
      - 4B/F for Multiply-Add operations
  - 256 GB/s memory bandwidth, Shared with 4 cores
    - 1B/F in 4-core Multiply-Add operations
      ~ 4B/F in 1-core Multiply-Add operations
  - 128 memory banks per socket
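
The B/F (bytes per flop) figures above can be checked with a short calculation. This is a minimal sketch: the function and constant names are illustrative, and the 64 Gflop/s multiply-add peak per core is taken from the comparison table earlier in the deck (256 Gflop/s per CPU divided over 4 cores).

```python
def bytes_per_flop(bandwidth_gb_s: float, peak_gflops: float) -> float:
    """Return the bytes-per-flop (B/F) ratio for a bandwidth and a peak flop rate."""
    return bandwidth_gb_s / peak_gflops


MEM_BW_GB_S = 256.0     # memory bandwidth per socket, shared by 4 cores
CORE_MA_GFLOPS = 64.0   # multiply-add peak per core
CORES_PER_SOCKET = 4

# One core active: it can use the whole socket bandwidth.
bf_one_core = bytes_per_flop(MEM_BW_GB_S, CORE_MA_GFLOPS)
# All four cores active: the same bandwidth feeds four times the flops.
bf_four_cores = bytes_per_flop(MEM_BW_GB_S, CORE_MA_GFLOPS * CORES_PER_SOCKET)

print(f"1 core:  {bf_one_core:.1f} B/F")
print(f"4 cores: {bf_four_cores:.1f} B/F")
```
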

Other improvements and new mechanisms enhance vector processing capability, especially for efficient handling of short vector operations and indirect memory accesses:

- Out of Order execution for vector load/store operations
- Advanced data forwarding in vector pipes chaining
- Shorter memory latency than SX-9
Performance Evaluations of SX-ACE
Performance Evaluation of SX-ACE by using HPCG

⭐ HPCG (High Performance Conjugate Gradients) is designed
  • to exercise computational and data access patterns that more closely match a broad set of important applications, and
  • to give incentive to computer system designers to invest in capabilities that will have impact on the collective performance of these applications.

✓ HPL for the TOP500 is increasingly unreliable as a true measure of system performance for a growing collection of important science and engineering applications.

⭐ HPCG is a complete, stand-alone code that measures the performance of basic operations in a unified code:

✓ Driven by a multigrid preconditioned conjugate gradient algorithm that exercises the key kernels on a nested set of coarse grids:
  • Sparse matrix-vector multiplication.
  • Sparse triangular solve.
  • Vector updates.
  • Global dot products.
  • Local symmetric Gauss-Seidel smoother.

✓ Reference implementation is written in C++ with MPI and OpenMP support.
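
To make the kernel list concrete, here is a minimal, single-process sketch of a preconditioned conjugate gradient solver in pure Python. It is not the HPCG reference code: it uses one symmetric Gauss-Seidel sweep as the preconditioner instead of a multigrid cycle, a toy 1D Laplacian instead of the 27-point stencil, and all function names are illustrative. It does exercise the same kinds of kernels: sparse matrix-vector products, vector updates, dot products, and a symmetric Gauss-Seidel sweep.

```python
import math


def laplacian_1d_csr(n):
    """Build a 1D Laplacian (tridiagonal -1, 2, -1) in CSR form as a toy SPD matrix."""
    vals, cols, rowptr = [], [], [0]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            if 0 <= j < n:
                cols.append(j)
                vals.append(2.0 if j == i else -1.0)
        rowptr.append(len(vals))
    return vals, cols, rowptr


def spmv(A, x):
    """Sparse matrix-vector product y = A x."""
    vals, cols, rowptr = A
    return [sum(vals[k] * x[cols[k]] for k in range(rowptr[i], rowptr[i + 1]))
            for i in range(len(x))]


def dot(a, b):
    """Dot product (a global reduction in a distributed setting)."""
    return sum(p * q for p, q in zip(a, b))


def symgs(A, r):
    """One symmetric Gauss-Seidel sweep (forward, then backward) on A z = r, starting from z = 0."""
    vals, cols, rowptr = A
    n = len(r)
    z = [0.0] * n
    for order in (range(n), range(n - 1, -1, -1)):
        for i in order:
            s, diag = r[i], 0.0
            for k in range(rowptr[i], rowptr[i + 1]):
                if cols[k] == i:
                    diag = vals[k]
                else:
                    s -= vals[k] * z[cols[k]]
            z[i] = s / diag
    return z


def pcg(A, b, tol=1e-10, max_iter=200):
    """Preconditioned CG; returns the solution and the number of iterations used."""
    x = [0.0] * len(b)
    r = b[:]
    z = symgs(A, r)
    p = z[:]
    rz = dot(r, z)
    for it in range(1, max_iter + 1):
        Ap = spmv(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # vector update
        r = [ri - alpha * api for ri, api in zip(r, Ap)]   # vector update
        if math.sqrt(dot(r, r)) < tol:
            return x, it
        z = symgs(A, r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


A = laplacian_1d_csr(20)
b = spmv(A, [1.0] * 20)   # right-hand side whose exact solution is all ones
x, iterations = pcg(A, b)
print(iterations, max(abs(xi - 1.0) for xi in x))
```

The Gauss-Seidel sweep is the hard part to vectorize because each row update depends on earlier rows, which is why the optimizations later in the deck rely on coloring and hyperplane orderings.
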
Breakdown of the HPCG Benchmark

**Benchmarking Flow**

- Problem Setup
- Opt.
- Validation
- CG
- CG
- CG
- CG
- CG
- CG
- Report

**Evaluation metric**

**Ver. 2.4**

\[ \text{GFlop/s} = \frac{\text{frefnops}}{\text{times}[0] + \text{fNumberOfCgSets} \times \text{times}[7]/10.0} \times 10^{-9} \]

**Ver. 3.0**

\[ \text{GFlop/s} = \frac{\text{frefnops}}{\text{times}[0] + \text{fNumberOfCgSets} \times (\text{times}[7]/10.0 + \text{times}[9]/10.0)} \times 10^{-9} \]

Note: frefnops is the total number of floating-point operations for CG (number of iterations = 50).

Ver. 2.4 → Ver. 3.0: setup overhead (the added times[9]/10.0 term) is considered for individual CG iterations!
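
The two rating formulas can be compared numerically. The sketch below reuses the symbol names from the slide (`frefnops`, `times`, `fNumberOfCgSets`); the function name and all input values are hypothetical and chosen only to show how the extra Ver. 3.0 term lowers the reported rate.

```python
def hpcg_gflops(frefnops, times, f_number_of_cg_sets, version="3.0"):
    """Evaluate the HPCG rating in GFlop/s using the Ver. 2.4 or Ver. 3.0 formula."""
    overhead = times[7] / 10.0
    if version == "3.0":
        overhead += times[9] / 10.0   # additional setup term introduced in Ver. 3.0
    elif version != "2.4":
        raise ValueError(f"unsupported version: {version}")
    seconds = times[0] + f_number_of_cg_sets * overhead
    return frefnops / seconds / 1.0e9


# Hypothetical inputs, for illustration only (not measured data).
times = [0.0] * 10
times[0] = 100.0   # seconds charged to the CG phase
times[7] = 20.0    # overhead term present in both versions
times[9] = 10.0    # overhead term added in Ver. 3.0
frefnops = 3.0e12
cg_sets = 5

g24 = hpcg_gflops(frefnops, times, cg_sets, version="2.4")
g30 = hpcg_gflops(frefnops, times, cg_sets, version="3.0")
print(f"Ver. 2.4: {g24:.2f} GFlop/s, Ver. 3.0: {g30:.2f} GFlop/s")
```

With identical timings, the Ver. 3.0 rating can never exceed the Ver. 2.4 rating, because its denominator contains one more non-negative term.
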
Optimizations of the HPCG Benchmark for SX-ACE

- Data packing for vector-friendly matrix memory allocation of sparse matrices
- Parallelization of 27-point stencil computation by using coloring and hyperplane methods
- Selective reusable-data caching and blocking for effective use of ADB
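
As an illustration of the data-packing idea, the sketch below converts a CSR matrix to the ELL (ELLPACK) layout and performs a sparse matrix-vector product with it. This is a generic textbook form of ELL, not the code used on SX-ACE, and the function names are illustrative. Each row is padded to the same number of entries, so the inner loop runs over all rows with unit stride, which is the access pattern a vector unit prefers.

```python
def csr_to_ell(vals, cols, rowptr, pad_col=0):
    """Pack a CSR matrix into ELL form: one dense 'column' of length n per nonzero slot."""
    n = len(rowptr) - 1
    width = max(rowptr[i + 1] - rowptr[i] for i in range(n))
    ell_vals = [[0.0] * n for _ in range(width)]      # ell_vals[k][i]: k-th nonzero of row i
    ell_cols = [[pad_col] * n for _ in range(width)]  # padding points at a valid column, value 0
    for i in range(n):
        for k, idx in enumerate(range(rowptr[i], rowptr[i + 1])):
            ell_vals[k][i] = vals[idx]
            ell_cols[k][i] = cols[idx]
    return ell_vals, ell_cols


def ell_spmv(ell_vals, ell_cols, x):
    """Compute y = A x from the ELL layout."""
    n = len(x)
    y = [0.0] * n
    for vk, ck in zip(ell_vals, ell_cols):
        # Inner loop over all rows: long, unit-stride in vk and y, gather on x.
        for i in range(n):
            y[i] += vk[i] * x[ck[i]]
    return y


# 3x3 tridiagonal example: [[2,-1,0],[-1,2,-1],[0,-1,2]]
vals = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
cols = [0, 1, 0, 1, 2, 1, 2]
rowptr = [0, 2, 5, 7]

ell_vals, ell_cols = csr_to_ell(vals, cols, rowptr)
y = ell_spmv(ell_vals, ell_cols, [1.0, 2.0, 3.0])
print(y)
```

ELL suits matrices such as the HPCG stencil matrix, where every row has nearly the same number of nonzeros, so little storage is wasted on padding.
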

Figure: HPCG results (Gflop/s) for the variants Original, JAD+color, ELL+color, ELL+Hyper, +ADB, and +Blocking, and MG results (Gflop/s) with selective caching versus all caching. Single CPU performance (max 256Gflop/s).

Scalability of the HPCG Benchmark

Figure: HPCG performance (Gflop/s) and efficiency (%) for HPCG 2.4 and HPCG 3.0 versus the number of processes (cores), from 10 to 40960, with 104x200x552 grids per process. The efficiency axis spans 0% to 12.5%.
Efficiency Evaluation of HPCG Performance
(As of ISC16 data)

<table>
<thead>
<tr>
<th>System</th>
<th>Efficiency (%)</th>
<th>HPCG/W (MF/W)</th>
</tr>
</thead>
<tbody>
<tr>
<td>SX-ACE</td>
<td>10.8</td>
<td>107.5</td>
</tr>
<tr>
<td>K/FX</td>
<td>3.7</td>
<td>49.2</td>
</tr>
<tr>
<td>Xeon</td>
<td>2.2</td>
<td>36.6</td>
</tr>
<tr>
<td>BlueGene</td>
<td>1.5</td>
<td>38.6</td>
</tr>
<tr>
<td>GPU</td>
<td>1.4</td>
<td>58.0</td>
</tr>
<tr>
<td>Xeon Phi</td>
<td>1.0</td>
<td>25.8</td>
</tr>
<tr>
<td>Sunway TaihuLight</td>
<td>0.3</td>
<td>24.1</td>
</tr>
</tbody>
</table>

Efficiency of Sustained Performance to Peak [%]

Power-Efficiency [Mflops/W]
Necessary Peak Perf. to Obtain the Same Sustained Performance
Normalized by SX-ACE (As of ISC 16 data)

<table>
<thead>
<tr>
<th>System</th>
<th>Eff. (%)</th>
<th>HPCG/W (MF/W)</th>
<th>Req. Peak</th>
</tr>
</thead>
<tbody>
<tr>
<td>SX-ACE</td>
<td>10.8</td>
<td>107.5</td>
<td>1.0x</td>
</tr>
<tr>
<td>K/FX</td>
<td>3.7</td>
<td>49.2</td>
<td>3.1x</td>
</tr>
<tr>
<td>Xeon</td>
<td>2.2</td>
<td>36.6</td>
<td>5.6x</td>
</tr>
<tr>
<td>BlueGene</td>
<td>1.5</td>
<td>38.6</td>
<td>7.3x</td>
</tr>
<tr>
<td>GPU</td>
<td>1.4</td>
<td>58.0</td>
<td>8.7x</td>
</tr>
<tr>
<td>Xeon Phi</td>
<td>1.0</td>
<td>25.8</td>
<td>10.2x</td>
</tr>
<tr>
<td>Sunway TaihuLight</td>
<td>0.3</td>
<td>24.1</td>
<td>36.0x</td>
</tr>
</tbody>
</table>
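
The "Req. Peak" column expresses how much more peak performance a system needs to match SX-ACE's sustained HPCG performance, which is the ratio of the two efficiencies. The sketch below computes that ratio from the rounded efficiencies in the table; the names are illustrative. Because it starts from rounded percentages, some results differ slightly from the table, whose values were presumably derived from unrounded data.

```python
# Efficiency of sustained HPCG performance to peak, in percent (from the table).
EFFICIENCY_PERCENT = {
    "SX-ACE": 10.8,
    "K/FX": 3.7,
    "Xeon": 2.2,
    "BlueGene": 1.5,
    "GPU": 1.4,
    "Xeon Phi": 1.0,
    "Sunway TaihuLight": 0.3,
}


def required_peak_ratio(efficiency: float, reference_efficiency: float) -> float:
    """Peak needed, relative to the reference system, for equal sustained performance."""
    return reference_efficiency / efficiency


reference = EFFICIENCY_PERCENT["SX-ACE"]
ratios = {name: required_peak_ratio(eff, reference)
          for name, eff in EFFICIENCY_PERCENT.items()}

for name, ratio in ratios.items():
    print(f"{name:18s} {ratio:5.1f}x")
```
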

Leading Science and Engineering Fields supported by the Supercomputer of Tohoku University

Next-Generation CFD Analysis

Turbine Design

Perpendicular Magnetic Recording Medium Design

Nano Material Design

Tsunami Inundation Analysis

Earthquake Analysis

Industrial Use

Antenna Analysis

Heat Shock Analysis

Combustion Flow Simulation

Ozone-hole Analysis

MRJ

The number of fatalities due to heat waves has increased in Europe, North America, and Asia.

Heatstroke

- The number of people hospitalized with heatstroke is increasing in Japan
  - 58,000 patients in 2014
  - 12,000 fatalities from 1968 to 2014 in Japan

Changes in body temperature depend strongly on individual differences
- body size, age, sex, tendency to perspire, etc.

Heatstroke Risk Simulator

- Simulating the changes in body temperature
  - Developed by Prof. Hirata (Nagoya Institute of Tech.)

The body temperatures of children, the elderly, and pregnant women rise easily.
Scalability

- Parallelization of the “temprise_k” subroutine with MPI
- 866 x 320 x 160
- 5400 steps

DO K=1,MODELZ
  DO J=1,MODELY
    DO I=1,MODELX
      ! temperature update calculation
    END DO
  END DO
END DO

One case (simulation) completed within one minute
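
One common way to parallelize a triple loop like the one above with MPI is to give each rank a contiguous block of the outermost K index. The slide does not say how "temprise_k" is actually decomposed, so the sketch below is an assumption for illustration only: it computes the K range each rank would own, with hypothetical names, and takes MODELZ = 160 on the assumption that the 866 x 320 x 160 grid maps to MODELX x MODELY x MODELZ.

```python
def block_range(n: int, nprocs: int, rank: int) -> tuple[int, int]:
    """Return the inclusive, 1-based (start, end) slice of 1..n owned by `rank`."""
    base, extra = divmod(n, nprocs)
    # The first `extra` ranks get one additional index each.
    start = rank * base + min(rank, extra) + 1
    count = base + (1 if rank < extra else 0)
    return start, start + count - 1


MODELZ = 160   # assumed extent of the K loop
NPROCS = 6     # hypothetical number of MPI ranks

ranges = [block_range(MODELZ, NPROCS, rank) for rank in range(NPROCS)]
for rank, (k_start, k_end) in enumerate(ranges):
    print(f"rank {rank}: K = {k_start}..{k_end}")
```

In a real MPI code each rank would run the J and I loops over its own K range and exchange boundary planes with its neighbors when the update needs values from adjacent K levels.
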
Performance of Magneto-Hydro-Dynamics Simulation on SX-ACE

Ref: Y. Yamamoto, R. Egawa, Y. Isobe, and Y. Tsuji, “Performance evaluation of DNS code based on high-order accuracy finite difference methods,” Japan-Russia Workshop @ Nagoya, Dec 10, 2015.
Future Vector Systems R&D*

*This work is partially conducted with NEC, but the contents do not reflect any future products of NEC
# Timeline of the Cyberscience Center HPC System
## Development and R&D For the Future

**Systems & Facility**

- SX-9 (29TF)
- New HPC Building Construction (1,500m²)
- 5-Cluster SX-ACE (707TF), LX 406Re2 (31TF), Storage Systems (4PB), 3D Tiled Display

**Projects**

- Design and procurement process for enhancement of server, storage & visualization systems
- Design and procurement process of the next supercomputer system
- Feasibility study for future HPC systems
- Next system with 30x or more??

**R&D for the next system**
The Time for Vector Computing has Come Again!

Modern and future microprocessors aggressively introduce vector computing mechanisms for efficient processing of data-level parallelism

- Intel Many Cores with AVX (Advanced Vector eXtension)
  - Xeon with 256b-AVX2 and Xeon Phi with 72 AVX-512b
- ARM with SVE (Scalable Vector Extension)
  - 128b~2,048b-width Vector Extension
  - ARM64 with 512bit-vector for post-K computer in 2020?
- Power9 with 256~512b-VSX (Vector-Scalar eXtension)
- GPU is the powerful vector computing platform
  - Pascal GP100 has 30 SMs(Streaming Multiprocessors), 2K-element vector processing capability

Vector instructions, once a powerful performance innovation of supercomputing in the 1970s and 1980s, became an esoteric technology in the 1990s. But like the mythical phoenix reemerge, vector instructions have arisen from the ashes. Here is the history of a technology that went from new to old then back to new.

But first, a few definitions. A vector instruction is an SIMD instruction, Single Instruction Multiple Data. A vector instruction refers to vector registers where multiple data resides. For example, a Cray-1's vector register contained up to 64 64-bit double-precision floating point numbers. The Cray-1 had eight of these registers. Many operations, for example: add and multiply can be issued to add or multiply two vector registers and place the result in a third vector register.
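
The register-based model described above can be imitated in software by processing long arrays in chunks of one vector register, a technique often called strip mining. The sketch below is only an analogy in plain Python: the 64-element length mirrors the Cray-1 register size mentioned in the text, and the names are illustrative.

```python
VECTOR_LENGTH = 64  # elements per vector register, as on the Cray-1


def vector_add(a, b):
    """Add two equal-length arrays one register-sized strip at a time."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    c = [0.0] * len(a)
    for start in range(0, len(a), VECTOR_LENGTH):
        end = min(start + VECTOR_LENGTH, len(a))
        # On vector hardware this strip would be a load, one vector add, and a store.
        c[start:end] = [x + y for x, y in zip(a[start:end], b[start:end])]
    return c


a = [float(i) for i in range(150)]
b = [2.0 * i for i in range(150)]
c = vector_add(a, b)
print(c[:3], c[-1])
```

With 150 elements the loop issues three strips of 64, 64, and 22 elements; the short final strip is the kind of short-vector case that the SX-ACE improvements listed earlier are meant to handle efficiently.
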

Vectors Become New

In 1976 Cray Research and Seymour Cray created the Cray-1, the first commercially successful supercomputer with vector instructions. The first Cray-1 delivered to Los Alamos National Laboratory...
The time for Vector Computing has Come Again!

🌟 NEC’s Next Generation Vector Supercomputer

Project Aurora 2018

Successor to SX-ACE
The highest single core performance.
The largest memory bandwidth per core.
High affinity for PC cluster systems.

Source: NEC
Summary

★ SX-ACE, brand-new vector supercomputer of Tohoku University

✓ high single-core performance with 256-element vector processing

✓ a high-bandwidth memory subsystem

✓ No. 1 computing efficiency and power efficiency in the HPCG benchmark ranking

★ The time for vector systems has come again

✓ Many modern processors employ vector-processing mechanism, and their vector lengths are increasing year by year.

✓ However, escalation of vector processing capability is not the only factor.

✓ The memory subsystem is now a key factor in increasing sustained vector processing performance.
Acknowledgements to Members of Tohoku Univ.-NEC Joint Research Division of HPC Technologies and Applications

★ Founded in June, 2014, 4-year period

★ Objectives

- R&D on HPC technologies to exploit high-sustained performance of science and engineering applications on current HPC Systems and to realize Future HPC Systems
- Evaluation and Improvement of the current HPC environments through migration of SX-9 applications to SX-ACE
- Detailed Evaluation and Analysis of Modern HPC Systems, not only Vector Systems but also Scalar-Parallel and Accelerator-Based Systems
- Feasibility study of a future highly balanced HPC system for high sustained performance of practical applications in the post-peta scale era

★ Faculty Members

- Hiroaki Kobayashi, Professor and division director
- Hiroyuki Takizawa, Associate Professor
- Ryusuke Egawa, Associate Professor
- Akihiko Musa (NEC), Visiting Professor
- Mitsuo Yokokawa (Kobe Univ), Visiting Professor
- Shintaro Momose (NEC), Visiting Associate Professor
- Masayuki Sato, Assistant Professor
- In collaboration with visiting researchers from NEC and the technical staff of Cyberscience Center

HPC R&D
HPC System Operation
Human Resource Development
HPC User Support