# RISC-V at International Supercomputing Conferences: Talks, New Products, Trends

Valeria Puzikova, PhD (Physics and Mathematics), software development expert





# Valeria Puzikova

PhD (Physics and Mathematics), software development expert, head of the mathematical libraries development team, YADRO

- Since 2010 I have been developing and implementing numerical methods in C/C++ with CUDA/MPI/OpenMP for problems in linear algebra, computational fluid dynamics, and AR/VR.
- Worked at Huawei, Fortum, the Ivannikov Institute for System Programming of the RAS (ISP RAS), Bauman Moscow State Technical University, and others.

# HPC on RISC-V: Why Is Now the Time?

HPC in global RISC-V ecosystem trends

Examples of talks from the 2024 HPC RISC-V workshops

RISC-V HW: what to test HPC SW on today and what to expect

Conclusions and useful links



# HPC on ARM: Fujitsu Fugaku, No. 1 Since 2020



- For the first time, an ARM-based supercomputer (and a homogeneous one at that) became No. 1 in the Top500.
- The only system in history to be No. 1 in all major supercomputer rankings.
- Still No. 1 in HPCG, HPL-AI, and Graph500.

- **World record:** the core count was increased by 4.5%, while Linpack performance grew by 6.4% and HPCG performance by a factor of 5.4.
- **Outperforms by 45%** the performance of all other supercomputers in the **HPCG Top10**.

# Fujitsu Had Been Preparing the SW Ecosystem Since 2014

# Moral: developing the SW ecosystem for HW is <u>important</u> and <u>takes a long time</u>!



**Fig. 8.** Joint developments by Fujitsu and independent application software vendors that can run on FX1000, FX700, and Fugaku.







# Some Statistics from the RISC-V World






# The Projected Growth Is Impressive

\* CAGR: compound annual growth rate.



# By 2030 RISC-V Will Hold at Least One Fifth of the Markets





# Near-Term Forecasts for High-Performance Cores

# **Industry outlook: Datacenter & Cloud**



RISC-V offers unique opportunity for accelerators



Custom computing for AI and other emerging workloads



Achieve your performance and power targets

RISC-V CPU core market will grow 115% CAGR, capturing >14% of all CPU cores by 2025

Semico Research, December 2021




# RISC-V Advantages: Modular Architecture





# The RISC-V Vector Extension (RVV)









### **Vector Length Agnostic Code**

- VL is loaded prior to executing the vector instruction with a special instruction
- No need to handle "loop tails"
- Makes the code "vector length agnostic"

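```
/* The _MM_* names are assumed to be wrapper macros over the EPI builtins, as on the slide. */
void axpy(double a, double *dx, double *dy, int n) {
  int i;
  /* Request a vector length for n 64-bit elements (LMUL = 1); gvl is the granted length. */
  long gvl = __builtin_epi_vsetvl(n, __epi_e64, __epi_m1);
  /* Broadcast the scalar a into a vector register. */
  __epi_1xf64 v_a = _MM_SET_f64(a, gvl);

  for (i = 0; i < n; i += gvl) {
    /* Re-request the length on every pass: the final pass is granted only n - i elements. */
    gvl = __builtin_epi_vsetvl(n - i, __epi_e64, __epi_m1);
    __epi_1xf64 v_dx = _MM_LOAD_f64(&dx[i], gvl);
    __epi_1xf64 v_dy = _MM_LOAD_f64(&dy[i], gvl);
    /* v_res = v_dy + v_a * v_dx */
    __epi_1xf64 v_res = _MM_MACC_f64(v_dy, v_a, v_dx, gvl);
    _MM_STORE_f64(&dy[i], v_res, gvl);
  }
}
```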
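The same strip-mining idea can be tried without RISC-V hardware. The sketch below is a portable scalar emulation, not the EPI API: `grant_vl` and `max_vl` are hypothetical names standing in for a `vsetvl`-style request and the hardware vector length. It illustrates why no separate loop tail is needed: the last pass is simply granted fewer elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vsetvl-style request: grants at most
 * max_vl elements, or fewer when fewer remain. */
static size_t grant_vl(size_t remaining, size_t max_vl) {
    return remaining < max_vl ? remaining : max_vl;
}

/* dy[i] += a * dx[i], processed in chunks of the granted length. */
void axpy_stripmined(double a, const double *dx, double *dy,
                     size_t n, size_t max_vl) {
    assert(max_vl > 0);
    for (size_t i = 0; i < n; ) {
        size_t vl = grant_vl(n - i, max_vl);
        /* This inner loop stands in for one vector multiply-accumulate. */
        for (size_t j = 0; j < vl; ++j)
            dy[i + j] += a * dx[i + j];
        i += vl;
    }
}
```

Calling `axpy_stripmined` with different `max_vl` values on the same input gives the same result, which is the property the slide calls vector length agnostic.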



# Development of RISC-V Matrix Extensions

https://habr.com/ru/companies/yadro/articles/827430/



https://habr.com/ru/companies/yadro/articles/827434/



https://habr.com/ru/companies/yadro/articles/827432/



https://habr.com/ru/companies/yadro/articles/833948/



## HPC in Global RISC-V Ecosystem Trends

# A New Global Trend: the Open HPC Paradigm

# Not Only the HPC SW Stack Should Be Open Source, but HPC HW Too







# RISC-V HPC Initiatives

The European Processor Initiative (EPI), a European effort to build its own technological independence, envisages developing processors and accelerators based on RISC-V; ARM-based solutions are not recognized as part of it.





Europe's largest HPC centers, BSC (Barcelona Supercomputing Center) and EPCC (Edinburgh Parallel Computing Center), are actively building RISC-V competence centers with grant support from EuroHPC (a government initiative).

# HPC Trends in RISC-V HW and SW Development

- High-performance cores are appearing: several IP providers are developing processors for data centers.
- Not only CPUs but also **accelerators** (GPU, AI) are being developed on the RISC-V architecture.





- The first RISC-V HPC system is expected in 2025-2026 as part of the sixth incarnation of the European supercomputer MareNostrum (BSC).
- The RISE (RISC-V Software Ecosystem) consortium focuses on adapting the key software stack to RISC-V and on accelerating the development of open-source software for RISC-V.

# RISE RISC-V Optimization Guide



Vendor-agnostic porting and optimization guide

- Does not cover CPU-specific microarchitecture
- Best practices for high-performance RISC-V cores
  - Including assembly code examples

Zero can be folded into any instruction with a register operand. There's no need to initialize a temporary register with 0 for the sole purpose of using that register in a subsequent instruction. The following table identifies cases where a temporary register can be eliminated by prudent use of x0.

| Do                         | Don't                                 |
|----------------------------|---------------------------------------|
| fmv.d.x f0,x0              | li x5,0<br>fmv.d.x f0,x5              |
| amoswap.w.aqrl a0,x0,(x10) | li x5,0<br>amoswap.w.aqrl x6,x5,(x10) |
| sb x0,0(x5)                | li x6,0<br>sb x6,0(x5)                |
| bltu x0,x7,1f              | li x5,0<br>bltu x5,x7,1f              |





https://riscv-optimization-guide.riseproject.dev/

# Topics of the HPC RISC-V Workshops

- Use cases and case studies with RISC-V
- Lessons learned from using RISC-V in HPC
- Industry papers exploring the use of RISC-V
- Porting code to RISC-V
- New RISC-V-based hardware and accelerators
- Tools and techniques that help use RISC-V for HPC
- Developments in HPC libraries for porting them to RISC-V
- RISC-V extensions that accelerate HPC applications
- Compiler and runtime support for RISC-V
- The RISC-V ecosystem
- Looking ahead: how RISC-V can advance the HPC community
- And anything related to RISC-V and HPC!



- Organizing <u>workshops</u> on HPC on RISC-V at specialized international conferences
- Goal: popularizing RISC-V in HPC (helping port HPC SW to RISC-V, etc.)

# HPC RISC-V Workshops: 2024







### Strategic EU-level perspective on RISC-V

RISC-V: the cornerstone ISA for the next generation of HPC infrastructures

17<sup>th</sup> January 2024 | **Alexandra Kourfali** | Munich, DE

### HiPEAC 2024

- RISC-V Workshop: RISC-V: the cornerstone ISA for the next generation of HPC infrastructures
  - Organizers: E4 and BSC
  - Accepted
- Full Day workshop
- Munich, Germany
- January 17-19, 2024
- More details next meeting...

## **Upcoming RISC-V HPC Events**



- HPC Asia RISC-V Workshop
  - https://riscv.epcc.ed.ac.uk/community/hpcasia24-workshop/
  - 25th Jan 2024



- Fourth International workshop on RISC-V for HPC
  - https://riscv.epcc.ed.ac.uk/community/isc24-workshop/
  - 16th May 2024



- RISC-V Summit Europe
  - https://riscv.epcc.ed.ac.uk/community/isc24-workshop/
  - 24-27th June 2024

## Workshop at HPC-Asia

- Workshop HPC-Asia (Nagoya, Japan): Third International Workshop on RISC-V for HPC (RVHPC)
  - Michael Wong, Nick Brown, and John Davis submitted a workshop proposal
  - Conference, end of January 2024, about 500 people
  - Accepted
  - ½ Day (morning)

# Examples of Talks from the 2024 HPC RISC-V Workshops



# Challenges of Building an Open Source Ecosystem (1/2)

### **Problem Statement**

- Silicon companies want to release hardware for which optimised OSS stacks are already present
- OSS community wants to support all platforms which users would like to run on
- Many different micro-architectures to bring to market
- Representative hardware doesn't exist yet (although RISC-V vector CPUs do exist)
- Avoid leaking of proprietary information before hardware is released
- Limited resources in:
  - Silicon companies can't port everything themselves
  - OSS community needs to prioritise work in terms of impact



# Challenges of Building an Open Source Ecosystem (2/2)

### Call to Action

- Participate in consortia / standardisation bodies
  - RISE, RVI, UXL
- Contribution to frameworks and tools used in multiple projects
  - Frameworks / APIs such as oneAPI, xsimd, OpenMP runtimes
  - Compilers are ubiquitous. Improvements in toolchain needed for one project helps many others
    - Raise issues for GCC and LLVM
- Improvements in OS packages, such as SIMD for frequently used operations (e.g. zlib (de)compression)
- RISE put out <u>RFPs</u> for prioritised development work
- Contribute to your favourite project

Open source software very open to pull requests, e.g.

- OS bring up (Fedora, Ubuntu, Android)
- OpenBLAS has had branch risc-v since August 2022
- Linux kernel supports hardware interfaces for new non-ISA specs (e.g. hardware probe)
- 60/100 last commits in QEMU have a 'riscv' tag

## How to Optimise Code

- What are good / bad practices?
  - Use up-to-date toolchain. Most recently released, or better close to tip of tree
  - LLVM has autovec, GCC in the works
  - RISE optimisation guide imminently available (RV64GCV)
- Micro-architecture agnostic
  - Can double check codegen on Compiler Explorer
  - Proxy for performance via dynamic instruction counts in emulator
- Optimised for 'generic' target
  - Stick to intrinsics, and v1.0 vector spec
  - What LMUL, what ILP, what order? Don't over optimise. Strip-mined loops are good.
  - Don't optimise for order: OoO doesn't care, and different in-order cores may prefer different ordering
  - Performance estimation via LLVM MCA, for example



# WebRISC-V: a web-based educational simulator

Providing a simple, easily accessible educational tool to test RISC-V programs on a pipelined processor.





# SW stack for future HPC machines based on RISC-V 128 bit

## Opportunities and challenges for RISC-V and 128 bit

### Heterogeneity

ISA extension is about managing heterogeneity in a homogeneous way:

- Base RISC-V ISA on all clusters
- Various set of ISA extension on different clusters

### **Operating System**

A 128 bit address space can provide a unified view of the 100 M cores:

- Single system image of the machine
- A 128 bit process spans the whole machine with a single virtual address space

### Issues:

- Distributed system issues. Like what is the status of a 128 bit process?
- Threads migration across clusters?
- Need for transactions?

### Starting point:

PGAS and its variants

### **Languages & Tools**

A single address space for a process means:

- The compiler could work on the full application at once
- Room for a generalised OpenMP-like programming model
- Potentially replace MPI by VM operations

### Opportunity:

MLIR-based DSL.

### Issues:

- Need for transactions?

| Name  | Registers | Register width | Address width |
|-------|-----------|----------------|---------------|
| RV32  | 32        | 32 bit         | 32 bit        |
| RV32E | 16        | 32 bit         | 32 bit        |
| RV64  | 32        | 64 bit         | 64 bit        |
| RV128 | 32        | 128 bit        | 128 bit       |

[1] A. Waterman and K. Asanović, "Chapter 6, RV128I Base Integer Instruction Set, Version 1.7," in The RISC-V Instruction Set Manual - Volume I: Unprivileged ISA, 20191213, The RISC-V Foundation, 2019. Available online at https://riscv.org/technical/specifications/

- Addresses and integers are 128 bit wide.
- Still base ISA + ISA extensions, just like 32 and 64 bit:
  - Integer multiply & divide
  - 32 bit floating-point
  - 64 bit floating-point
  - 128 bit floating-point
  - etc.

The challenge is how to take advantage of RISC-V and 128 bit to improve

- The heterogeneity of the machine
- The operating system stack
- Programming languages and tools



# RISC-V for AI: enabling modern workloads on modern HW

### **Use case: Man Down Detection**

**Goal**: real-time distributed AI surveillance at a large scale

**Target**: detect people lying on the ground in distress

**Design**: leverage YOLO-V5 on multiple RISC-V-based edge nodes in a tree structure connected via FastFlow



**Challenges**: limited RISC-V ecosystem, need to port:

- FastFlow
- AI library (e.g., PyTorch)

**Outcome**: implemented with the Fast Federated Learning (FFL) framework:

- based on C/C++ for performance (libtorch + FastFlow)
- supports both federated training and distributed inference

## **Accelerated PyTorch WiP: preliminary results**

| System            | Cores | Total [s] | [ms]/image |
|-------------------|-------|-----------|------------|
| k230              | 1     | 254.11    | 79.41      |
| Milk-V (OpenBLAS) | 1     | 254.91    | 79.66      |
| Milk-V            | 64    | 137.91    | 43.09      |
| Milk-V (OpenBLAS) | 64    | 25.88     | 8.08       |
| Intel             | 1     | 11.76     | 3.67       |
| Intel             | 64    | 1.95      | 0.61       |

4-layer (2 convolutional + 2 fully connected) DNN performance on 100 batches of 32 MNIST images
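The two numeric columns of the table are related by simple arithmetic: 100 batches of 32 images is 3200 images, so the per-image latency is the total time divided by 3200. The following sketch (helper names are illustrative, not from the talk) reproduces the [ms]/image column to within rounding and lets one compare rows.

```c
#include <assert.h>

/* Milliseconds per image, given the total time in seconds for
 * `batches` batches of `batch_size` images each. */
double ms_per_image(double total_s, int batches, int batch_size) {
    assert(batches > 0 && batch_size > 0);
    return total_s * 1000.0 / ((double)batches * batch_size);
}

/* Speedup of a run relative to a baseline (ratio of total times). */
double speedup(double baseline_s, double run_s) {
    assert(run_s > 0.0);
    return baseline_s / run_s;
}
```

For example, `ms_per_image(254.11, 100, 32)` gives about 79.41, matching the k230 row, and `speedup(254.91, 25.88)` is about 9.85, the gain of Milk-V with OpenBLAS when going from 1 to 64 cores.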





# Performance analysis (& optimization) of BERT on RISC-V

### We focus on BERT + inference

- Useful across several NLP tasks
- Illustrative of the potential of architectures and space for optimization in transformers
- Inference typically deployed on low-power CPUs, typically with SIMD

# Results - C910, 8x8 microkernel, square matrices



- Auto-Baseline vs. OpenBLAS
  - 1.72x improvement vs. OpenBLAS RVV Generic
  - Similar performance to OpenBLAS C910
- Auto-Op1 (bcast)
  - 2.38x improvement vs. Auto-Baseline
- Auto-Op2 (gather)
  - 2.62x improvement vs. Auto-Baseline
- Auto-Op3 (load reorder)
  - 2.90x improvement vs. Auto-Baseline
  - 2.59x improvement vs. C910 OpenBLAS
- Auto-Op4 (SW pipelining)
  - 2.88x improvement vs. Auto-Baseline

# RISC-V HW: What to Test HPC SW on Today and What to Expect


# Clusters

## Barcelona Supercomputing Center (<u>BSC</u>)

| Board                                 | OS            | Details          |
|---------------------------------------|---------------|------------------|
| PolarFire                             | Fedora        | 4 cores w/ 2 GB  |
| BeagleV                               | Fedora        | 2 cores w/ 8 GB  |
| Unmatched                             | Fedora/Ubuntu | 4 cores w/ 16 GB |
| Allwinner D1<br>(Vector<br>extension) | Fedora        | 1 core w/ 2 GB   |

## Edinburgh Parallel Computing Center (EPCC)

| Board                      | Processor (SoC) | # Cores | DRAM (GB) | Qty |
|----------------------------|-----------------|---------|-----------|-----|
| NezhaSTU                   | C906 (D1)       | 1       | 0.5       | 4   |
| MangoPi MQ-Pro             | C906 (D1)       | 1       | 1         | 2   |
| HiFive Unmatched           | U74 (FU740)     | 4       | 16        | 1   |
| StarFive VisionFive V1     | U74 (JH7100)    | 2       | 8         | 3   |
| StarFive VisionFive V2     | U74 (JH7110)    | 4       | 8         | 15  |
| Lichee Pi 4A<br>(on order) | C910 (TH1520)   | 4       | 16        | 2   |

## E4: Monte Cimone (V1)

### 4x E4 RV007 1U Custom Server Blades:

- 2x SiFive U740 SoC with 4x U74 RV64GCB cores
- 16GB of **DDR4**
- 1TB node-local NVME storage
- PCIe expansion card w/InfiniBand HCAs
- Ethernet + IB parallel networks

## E4: Monte Cimone (V2 = V1 + SG 2042)





<sup>\*</sup> Source: https://excalibur.ac.uk/excalibur-events-isc-23/

# Milk-V Pioneer: IP for Data Centers and AI/ML







- **Pioneer Box** 
  - 1x SG2042 CPU
  - 1x Developer Board
  - 250W ATX Power supply
  - Intel AX210 WiFi 6E / BT5.2 card
  - Dual 10G SFP Network Card
  - Graphics card AMD Radeon RX550 4GB
  - Nice and compact enclosure with carrying handle
  - 1TB NVMe SSD
  - 2x 16G DDR4
  - Powerful RGB CPU cooler

Milk-V Vega is the world's first RISC-V 10GbE switch, made by Shenzhen MilkV Technology (Milk-V).

Intended for:

- broadband access networks,
- video surveillance platforms and audiovisual services,
- smart city systems, etc.



### > 1 TFLOPS (FP64)

- 64 Cores
- 2 GHz
- 120 W TDP
- 3200 MHz (Max DIMM Frequency)
- 1 Gbit Ethernet
- 1 LPC



- Up to 256 GB RAM
- 4 MB L1 Cache
- 16 MB L2 Cache
- 64 MB L3 Cache
- 2 SPI Flash Interface
- 2 General SPI Controller

\* Source: https://servernews.ru/1081875

# Banana Pi BPI-F3: the 8-Core SpacemiT K1 Processor

## https://docs.banana-pi.org/en/BPI-F3/BananaPi\_BPI-F3

- SpacemiT X60 cores (4 of the 8 cores have the Integrated Matrix Extension).
  - 256-bit vector registers.
  - 1.3x Arm Cortex A55.
- <u>Benchmarks</u>: 2.0 TOPS AI.
- Specification: <a href="https://github.com/space-mit/riscv-ime-extension-spec">https://github.com/space-mit/riscv-ime-extension-spec</a>



BPI-F3 SpacemiT K1
Octa-core RISC-V



Available to order.

**7 794 ₽** per lot

# EUPILOT: Developing RISC-V Accelerators for HPC and AI



- Consortium: 19 partners
- Creating a European platform for HPC and AI.
- Achieving European digital sovereignty in HPC.

## **Target: Chips** → **Deployments**

- A foundation for European exascale systems.
- Extending the RISC-V ecosystem to the HPC and HPDA domains.
- Hardware: Chips → Modules → Boards
- Systems: Boards → Systems → Liquid Immersion Deployments
- Software: Drivers → OS → Compilers → Frameworks → Apps

http://pulp-platform.org



# Occamy from ETH Zurich

A 432-core, Multi-TFLOPs RISC-V-Based 2.5D Chiplet System for Ultra-Efficient (Mini-)Floating-Point Computation

Our latest design Occamy: 0.75 TFLOP/s, 400+ cores

### **Dual Chiplet System Occamy:**

- 216+1 RISC-V Cores
- 0.75 TFLOP/s
- GF12LPP
- Area: 73mm<sup>2</sup>

2x 16GByte HBM2e DRAMs Micron

2.5D Integration

### Silicon Interposer Hedwig:

- Technology: 65nm, passive (only BEOL)
- Area: 26.3mm x 23.05mm

### Carrier PCB:

- RO4350B (Low-CTE, high stability)
- 52.5mm x 45mm
- Initial discussions 20<sup>th</sup> of October 2020
- Started on 20<sup>th</sup> of April 2021
- Taped out Chiplet on 1<sup>st</sup> of July 2022
- Taped out Interposer on 15<sup>th</sup> of October 2022
- Currently being assembled









@pulp\_platform



M = MAC(int)

F = FMAC(float)

**CPUs** 

# Semidynamics Designs for Big Data & AI/ML






# Atrevido 423-V8 from Semidynamics for Big Data & AI/ML

## **The Semidynamics Proposal**

- Powerful Out Of Order based on RISC-V
- Combine CPU with Vector and Tensor unit to create powerful AI-capable Compute building blocks
- Enable Hypervisor Support for Containerization
- Enable Crypto for Security / Privacy
- Easy to combine with custom logic / Unit 3 custom instructions
- Use of Gazzillion™ Technology to efficiently manage large data sets

### **Benefits**

- Easy to program
- High Performance for Parallel Codes
- Zero Communication Latency









<sup>\*</sup> Source: https://www.eejournal.com/article/want-tailormade-screamingly-high-performance-risc-v64-ip/

# Ventana Veyron (V2)

## **Veyron V2: Momentum to Mainstream with Complete Platform**



### Highlights

- +40% performance, 32 cores per cluster, 4nm
- UCIe chiplet
- RISC-V Vector Extension support
- Ventana AI Matrix Extensions
- Server-class IOMMU
- RISE support
- Domain Specific Acceleration

Figure labels (Veyron V2 chiplet): CPU cluster of up to 32 Veyron V2 cores (RVA23) on a coherent bus; MOP cache; 128 KB L1 D-cache; 512 KB I-cache; 1 MB L2 D-cache; 128 MB physically sliced cache.

CHI-over-UCIe D2D

### Available as UCIe-Compatible Chiplets or IP



### **Vector Unit**

- Full RVV1.0 "V" support plus new standard and custom RISC-V extensions:
  - Vector crypto
  - FP16 / BF16
  - Widening 8x8 int8 and BF16 matrix multiplies
- DSA Chiplet VLEN = DLEN =512
  - 32 64B-wide vector registers
  - 64B-wide fully pipelined load, store, and register operations
    - No double pumping of datapaths
    - 64B load plus 64B store per cycle (with arbitrary alignments)
  - Area and power efficient high-performance design
    - Separate vector register-operation scheduler, register file, and execution pipes from the "scalar" core
    - Five parallel execution pipes: Arithmetic, Mask, Permute, Load data, Store data
    - Out-of-order execution across execution pipes and within pipes without register renaming
      - LMUL chaining
      - Interleaving of LMUL>1 operations and complex operations within each pipe based on dependencies
    - No speculative register execution but full speculative load/store execution
      - No speculative execution recovery buffers


HPC Server


# Ventana Veyron: Plans



Figure: the Ventana Veyron software stack by layer: Applications (computer vision, speech, natural language processing, autonomous systems, recommendations, finance), Models (ResNet, VGGNet, YOLO, HMM, LSTM, GPT, BERT, SLAM, ControlNet, content filter, gradient boosted, ARIMA, Monte Carlo), Frameworks, Runtimes, Libraries, Hypervisors, Containers, Operating System, Firmware, Platform (early boot, BIOS). Recognizable logos include PyTorch, ONNX, ONNX Runtime, Glow, TFRT, MLIR, ROCm, OpenCL, CUDA, KVM, Kubernetes, Docker, Linux, Fedora, Ubuntu, OpenOCD, TianoCore, and ACPI.

Ventana Veyron AI/ML Server

Over 16,000 RISC-V Cores

# ET-SoC-1: Esperanto's RISC-V Supercomputer on a Chip

Over 1,000 RISC-V Cores









Over 6,000 RISC-V Cores



# Esperanto ET-Minion



## ET-Minion is an Energy-Efficient RISC-V CPU with a Vector/Tensor Unit

CPU is tailored for Massively Parallel ML Applications

### esperanto.ai

### ET-MINION IS A CUSTOM BUILT 64-BIT RISC-V PROCESSOR

- In-order pipeline with low gates/stage to improve MHz at low voltages
- Architecture and circuits optimized to enable low-voltage operation
- Two hardware threads of execution
- Software configurable L1 data-cache and/or scratchpad

### ML OPTIMIZED VECTOR/TENSOR UNIT

512-bit wide integer per cycle

128 8-bit integer operations per cycle, accumulates to 32-bit Int

256-bit wide floating point per cycle
16 32-bit single precision operations per cycle

32 16-bit half precision operations per cycle

New multi-cycle Tensor Instructions

- Can run for up to 512 cycles (up to 32K operations) with one tensor instruction
- Reduces instruction fetch bandwidth and reduces power
- RISC-V integer pipeline put to sleep during tensor instructions

Vector transcendental instructions



ET-Minion RISC-V Core and Tensor/Vector unit optimized for low-voltage operation to improve energy-efficiency

RISC-V is the right choice for future merged ML/HPC Systems



Optimized for energy-efficient ML operations. Each ET-Minion can deliver peak of 128 Int8 GOPS per GHz
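The "128 Int8 GOPS per GHz" figure follows directly from the per-cycle numbers listed for the vector/tensor unit: peak throughput is operations per cycle times clock frequency. A minimal sketch of that arithmetic, with an illustrative helper name and a core-count parameter added for convenience (no core count is assumed here):

```c
#include <assert.h>

/* Peak throughput in giga-operations per second:
 * (operations per cycle) x (clock in GHz) x (number of cores). */
double peak_gops(double ops_per_cycle, double clock_ghz, int cores) {
    assert(ops_per_cycle >= 0.0 && clock_ghz >= 0.0 && cores >= 0);
    return ops_per_cycle * clock_ghz * cores;
}
```

With the slide's 128 8-bit integer operations per cycle, `peak_gops(128.0, 1.0, 1)` returns 128; the same formula gives 16 for the 16 single-precision operations per cycle at 1 GHz.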

RISC-V is not only the best choice, RISC-V is the only logical choice for future ML/HPC systems

Making systems easier to program with scalable set of processors with one instruction set should be the goal

- x86 and ARM processors too heavyweight to serve as both main CPU and accelerator
- GPU's too hard to program, can't be the main processor
- Only RISC-V has the ability for both:
  - High performance main cores: e.g. Tenstorrent, SemiDynamics, Ventana, Andes, RIVOS, ET-Maxion and others
  - Lightweight RVV vector cores: Esperanto's ET-Minions and likely many others

RISC-V is now mature and ready to start the revolution for future ML/HPC computing systems

Dave's prediction: RISC-V based system will win the Green500 in the next 5 years

# Vortex: OpenCL Compatible RISC-V GPGPU





<u>Runs on FPGA</u>; there is a <u>pipeline</u> for running NVIDIA CUDA programs

### **ISA Considerations**

| Operation Type      | Considerations                                                                                            |
|---------------------|-----------------------------------------------------------------------------------------------------------|
| Vertex/Frag Shaders | V Extension or Vec4 Custom                                                                                |
| Number of Registers | Typically GPUs have more Vector Registers (e.g. 128) to avoid use of stack in a multithreaded environment |
| Data Types          | Single Precision / Half Precision / fixed point (8 or + for HDR )                                         |
| ISA Width           | Often wide instructions 128-bit with embedded shuffle and write masks                                     |
| Constant Register   | GPUs have also a number of constant registers for uniforms                                                |
| ABI                 | How to map Varyings / Uniforms / Attributes ?                                                             |

Figure (CUDA-to-Vortex translation flow): CUDA source code → NVVM IR → NVVM-SPIR-V translator → SPIR-V → SPIR-V-OpenCL translator → OpenCL → POCL → object file, linked with the RISC-V library → execution on Vortex (RISC-V GPU).

# Translating applications in Rodinia benchmark: Vortex (v0.2.2), NVPTX-SPIR-V translator (v0.1.0)

| application    | feature         | support? |
|----------------|-----------------|----------|
| b+tree         | -               | yes      |
| bfs            | -               | yes      |
| cfd            | double3 type    | yes      |
| huffman        | atomic          | yes      |
| pathfinder     | memory hierarchy | yes     |
| gaussian       | -               | yes      |
| hotspot        | -               | yes      |
| hotspot3D      | -               | yes      |
| lud            | memory hierarchy | yes     |
| nw             | -               | yes      |
| streamcluster  | -               | yes      |
| particlefilter | d2i             | on going |
| backprop       | log2f           | on going |
| lavaMD         | d2i             | on going |
| kmeans         | texture         | no       |
| hybrid sort    | texture         | no       |
| leukocyte      | texture         | no       |

- Vortex:
  - Support RISC-V RV32IMF ISA
  - Scalability: up to 64 cores with optional L2 and L3 caches
- Performance:
  - 1024 total threads running at 250 MHz
  - 128 Gflops of compute bandwidth
  - 16 GB/s of memory bandwidth

- Software: OpenCL 1.2 support
- Supported FPGAs:
  - Intel Arria 10
  - Intel Stratix 10


# Conclusions

- **RISC-V is developing rapidly**: new ISA extensions, server-class processors, accelerators, interconnects.
- Porting HPC codes and implementing HPC algorithms on RISC-V is a lengthy process; the RISC-V HPC SIG urges starting it now.
- **A pressing task is porting SparseBLAS to RISC-V**; a good starting point is the Eigen, SuiteSparse, and Kokkos libraries.
- Testing currently relies on clusters built from existing RISC-V boards and on simulators; there is also a RISC-V GPU on FPGA.
- Worldwide, this work has been under way for more than three years.



### **SIG-HPC Initiatives**

- Guide and enable the community
  - Virtual Memory
    - SV57, SV57K, SV64, SV128
  - Accelerators
  - ISA Extensions
  - HPC Software Stack
    - Starting with HPC Libraries
  - HPC SW & HW ecosystem & roadmap

## (!) Updates on <u>GitHub</u>:

- HPC SIG
- AI/ML & Graphics SIG
- Vector SIG









Moscow, Rochdelskaya St. 15, bldg. 13, +7 800 777-06-11

yadro.com