

# RaPping and Compilation for Highly Dynamic Parallelism

#### **Gregor Snelting**

#### Presentation at ANDREAS-50, 26.2.16



#### www.kit.edu



### SFB Transregio InvasIC: FAU, KIT, TUM

- new language, soft- and hardware for highly dynamic parallelism
- 3 universities, 15 PIs, 50 doctoral researchers
- DFG funding 2010–2018 (2022?)



### SFB Transregio InvasIC: FAU, KIT, TUM

- new language, soft- and hardware for highly dynamic parallelism
- 3 universities, 15 PIs, 50 doctoral researchers
- DFG funding 2010–2018 (2022?)

Scenario: big heterogenous PGAS architectures, several kCores / Tile



### SFB Transregio InvasIC: FAU, KIT, TUM

- new language, soft- and hardware for highly dynamic parallelism
- 3 universities, 15 PIs, 50 doctoral researchers
- DFG funding 2010–2018 (2022?)

Scenario: big heterogenous PGAS architectures, several kCores / Tile Goal: support highly dynamic parallelism through all system levels



SFB Transregio InvasIC: FAU, KIT, TUM

- new language, soft- and hardware for highly dynamic parallelism
- 3 universities, 15 PIs, 50 doctoral researchers
- DFG funding 2010–2018 (2022?)

Scenario: big heterogenous PGAS architectures, several kCores / Tile Goal: support highly dynamic parallelism through all system levels

Sonderforschungsbereich/Transregio 89 Transregional Collaborative Research Center 89



## **RaP to the Future**



Fundamental idea: Resource-aware programming

- applications dynamically determine their resource requirements
- allocate and release resources (in particular CPUs) dynamically
- supported by language, compiler, OS, and dedicated hardware
- ⇒ optimize resource utilization, energy consumption, fault tolerance

## **RaP to the Future**



Fundamental idea: Resource-aware programming

- applications dynamically determine their resource requirements
- allocate and release resources (in particular CPUs) dynamically
- supported by language, compiler, OS, and dedicated hardware
- ⇒ optimize resource utilization, energy consumption, fault tolerance

Slogan (J. Teich): "Invade, infect, combat, retreat"



## Example: InvasIC Multi-Tile MPSoC



tiles: RISC multicore, memory, I/O, TCPA, ... 3 applications have invaded the example tiles



Demonstrator implemented via FPGAs

## **Fundamental InvasIC Ingredients**



- Dynamic X10 + Compiler: see below
- constraint system: attribute hierarchy for resource selection
- OctoPos OS: dedicated scheduling, load balancing
- Agent system: priorizes / allocates dynamic resource requests
- CiC: hardware support on local tile for app code dispatch / monitoring
- NoC: network on chip
- i-core: adaptive hardware supporting specific accelerators
- TCPAs: support (numerical) loop parallelization

## **Fundamental InvasIC Ingredients**



- Dynamic X10 + Compiler: see below
- constraint system: attribute hierarchy for resource selection
- OctoPos OS: dedicated scheduling, load balancing
- Agent system: priorizes / allocates dynamic resource requests
- CiC: hardware support on local tile for app code dispatch / monitoring
- NoC: network on chip
- i-core: adaptive hardware supporting specific accelerators
- TCPAs: support (numerical) loop parallelization

Further aspects:

- applications: HPC, robotics
- dark silicon: dynamic selection of non-overheated cores etc.
- predictability: soft realtime and security
- hardware sensors: temperature, speed, fault status etc

## PP Group @ InvasIC



People: Gregor Snelting, (Matthias Braun), Sebastian Buchwald, Manuel Mohr, Andreas Zwinkau

## PP Group @ InvasIC



People: Gregor Snelting, (Matthias Braun), Sebastian Buchwald, Manuel Mohr, Andreas Zwinkau

Contributions:

- framework design (based on X10)
- fundamental RaP examples
- compiler, library, RTS
- code generation for MPSoCs
- optimizations for InvasIC OS / hardware
- interfaces to simulator, OS, TCPA subcompiler [FAU], ...
- application support

## **Basic RaP Patterns & Constraints**



fundamental "invade/infect/retreat" pattern (X10) uses closure for app code:

```
val ilet = (id:IncarnationID) => {
    do_something(id);
    };
    claim = Claim.invade(constraints)
    claim.infect(ilet)
    claim.retreat()
```

## **Basic RaP Patterns & Constraints**



fundamental "invade/infect/retreat" pattern (X10) uses closure for app code:

```
val ilet = (id:IncarnationID) => {
    do_something(id);
  };
  4 claim = Claim.invade(constraints)
  5 claim.infect(ilet)
  6 claim.retreat()
```



## **Basic RaP Patterns & Constraints**



fundamental "invade/infect/retreat" pattern (X10) uses closure for app code:

```
1 val ilet = (id:IncarnationID) => {
        do something(id):
2
                                                              Hint
                                                                                   PredicateConstraints
  };
3
                                                          ScalabilityCurve
                                                                                     FPUAvailable
4 claim = Claim.invade(constraints)
                                                            AppClass
                                                                                     LocalMemory
5 claim.infect(ilet)
                                                         PotentiallyFewerPEs
                                                                                        Type
6 claim.retreat()
                                                         PotentiallyMorePEs
                                                                                    LatencyToMaster
                                                                                    LatencyToMemory
                                                                                      ThisPlace
                                                           SetConstraints
"reinvade()" allows OS to adapt resources
"reinvade(nC)" allows app to reallocate
                                                            PEQuantity
                                                                                    MultipleConstraints
                                                         PartitionConstraints
1 claim.infect(ilet)
                                                                                        AND
2 // optimize resource allocation:
                                                                                        OR
                                                           PlaceCoherent
3 val changed1 = claim.reinvade()
                                                         LatencvWithinTeam
4 claim.infect(ilet)
                                                            TileSharing
5 // respecify resource needs
6 val changed2 = claim.reinvade(otherConstraints)
```

Constraint

## **Example: Invasive Quicksort**



Sequential X10 code:

```
public static def sort(data: Array[Int](1)) =
    gsort(data. 0. data.size-1):
2
3
4 private static def gsort(
    data: Array[Int](1),
5
    left: Int,
6
    right: Int)
7
8 {
    if (data.size>1) {
9
      val p = partition(data, left, right);
10
   val i = p.first;
11
  val j = p.second;
12
      gsort(data. left. i):
13
      qsort(data, j, right);
14
    }
15
16 }
```

8

## **Goal: Parallelize Recursion**



Pedagogical engineering on this part:

1 qsort(data, left, j ); 2 qsort(data, i, right);

Requirements:

- Invasive
- Multi-core
- Multi-tile
- Dynamically Resource-aware

Disclaimer:

Pedagogical example
 e.g. does not parallelize partition step

## Handling the two parts



After decision to parallelize: two parallel program paths

- 1. qsort()
- 2. invade() + (copying) + qsort() + (copying) + retreat()
- Important: large and small part, not left and right
- Hope:  $t_{large} \approx t_{invade} + t_{copying_1} + t_{small} + t_{copying_2} + t_{retreat}$
- ⇒ Always handle large part locally
- $\Rightarrow$  try to handle small part on new PE



## **RaP 1: Invasive Single-Tile**



small part runs on new core in local tile (X10: "ThisPlace")

## **RaP 1: Invasive Single-Tile**



small part runs on new core in local tile (X10: "ThisPlace")

"isInvasionWorthwhile()": user defined; checks new thread for small part; based on problem size, processor speed, invade latencies, ...

Decision tree:



### RaP 1: X10 Code



```
finish {
1
      if (large.size > 1) { async qsort(large); }
2
      if (isInvasionWorthwhile(small)) {
3
        val constraints = new AND(); // constraint system
Λ
5
        constraints.add(new PEOuantity(0. 1)):
        constraints.add(new ThisPlace());
6
        constraints.add(new ScalabilityHint([100, 190]));
7
        val claim = Claim.invade(constraints):
8
        if (claim.size() == 1) {
9
           val ilet = (id: IncarnationID) => gsort(small):
10
          claim.infect(ilet):
11
        } else { gsort(small); }
12
13
        claim.retreat():
      } else if (small.size > 1) {
14
        gsort(small);
15
      3
16
    }
17
```

## **RaP 2: Invasive Multi-Tile**



small part runs on new tile

## RaP 2: Invasive Multi-Tile



small part runs on new tile

"isRemoteWorthwhile()" checks new tile for small part; based on problem size, tile attributes, network latencies, ...

Decision tree:



## RaP 2: X10 Code



```
finish {
1
      if (large.size > 1) { async gsort(large); }
2
      if (isInvasionWorthwhile(small)) {
3
         val constraints = new AND();
4
         constraints.add(new PEQuantity(0, 1));
5
6
         if (isRemoteWorthwhile(small)) {
           constraints.add(new PreferOtherPlace());
7
           constraints.add(new ScalabilitvHint([100. 160]):
8
         } else {
9
           constraints.add(new ThisPlace());
10
           constraints.add(new ScalabilitvHint([100. 190]):
12
         val claim = Claim.invade(constraints);
13
14
         if (claim.size() == 1) {
           val ilet = (id: IncarnationID) => gsort(small);
15
           claim.infect(ilet);
16
         } else {
17
           gsort(small);
18
         3
19
         claim.retreat():
20
       } else if (small.size > 1) {
21
22
         gsort(small):
23
    } // finish
24
```

## RaP 3: Fine Tuning



Other sensible constraints in this example: ThroughputToMaster, ThroughputToMemory, ThroughputWithinTeam, LatencyToMaster, LatencyToMemory, LatencyWithinTeam, ...

## RaP 3: Fine Tuning



Other sensible constraints in this example:

ThroughputToMaster, ThroughputToMemory, ThroughputWithinTeam, LatencyToMaster, LatencyToMemory, LatencyWithinTeam, ...

#### General Principle:

- Application dynamically crafts its constraints based on *locally* available data, e.g. the input size
- The *larger* the input size, the *weaker* the constraints
  - Three "levels" in quicksort example
- Application decides when it is certainly not sensible to parallelize
- Application supplies scalability hints for invade() call
- OS ("Agent system") decides which application actually gets PEs

## **The Compiler**



working compiler was built with only a few PJ!

- front end: adapted from IBM X10 compiler
- back end: PP@KIT, based on libFIRM (~> Goos, Hack et al.) "FIRM is like llvm, only better"

## **The Compiler**



working compiler was built with only a few PJ!

- front end: adapted from IBM X10 compiler
- back end: PP@KIT, based on libFIRM (→ Goos, Hack et al.)
   "FIRM is like llvm, only better"
- generates code for SPARC (Leon 5)
- "invade" etc implemented as OctoPos calls
- compiler tested and integrated into demonstrator (FPGA system)

## **The Compiler**



working compiler was built with only a few PJ!

- front end: adapted from IBM X10 compiler
- back end: PP@KIT, based on libFIRM (→ Goos, Hack et al.)
   "FIRM is like llvm, only better"
- generates code for SPARC (Leon 5)
- "invade" etc implemented as OctoPos calls
- compiler tested and integrated into demonstrator (FPGA system)

Current work:

- specialised optimizations for InvasIC inference of app characteristics for better load balancing, infer minimal execution time for "isInvasionWorthwhile()", infer efficient i-core configurations (accelerators, permutation instructions, ...), efficient copying of data to other places,
- generate code for NoC, i-Core

## Case Study: Multigrid Software



- Invasive multigrid: solver for part.diff.equations from TUM+KIT
- discretization with dynamically varying grid resolutions
- application: simulation of laser engraving on metal plate
   ⇒ heat equation (Laplace operator) must be discretized and solved
- grid levels:  $N \times N$ ,  $N/2 \times N/2$ ,  $N/4 \times N/4$ , ...
- result at timesteps 50/200/450:







## X10 Code + Dynamic Loads



```
1 vcvcle(N. x. b);
2 r = computeResidual(N, x, b)
3 while |r| > threshold:
4 vcvcleIteration(N, x, b)
5 r = computeResidual(N, x, b)
 6
7
8 vcycleIteration(N, x, b):
9 smoother(N, x, b) # pre-smooth
10 r = residual(N, x, b) # residual
12 nc = reinvade(N, claim) # reinvade claim
13
14 \text{ Nr} = \text{N}/2
   # restricted level
15 rr = restrict(N, r)
   # restrict residual to new claim
16 er = (Nr. ℚ)
   # setup error with 0 values
17
18 nc2 = vcycleIteration(Nr, er, rr)
   # vcycle
19
20 # redistribute
21 if (nc != nc2):
22 nc = nc2
23 x.redistribute(Nr. nc)
24 b.redistribute(Nr. nc)
25
26 e = prolongate(Nr, er)
   # prolongate error
27 x = x + e
   # apply correction
28
29
  smoother(N, x, b)
                            # post-smooth
  ruary 9, 2016
return nc # possibly modified claim
```

18

RaPping and Compilation for Highly Dynamic Parallelism

## X10 Code + Dynamic Loads



```
1 vcvcle(N. x. b);
2 r = computeResidual(N, x, b)
3 while |r| > threshold:
   vcvcleIteration(N. x. b)
   r = computeResidual(N, x, b)
 5
 6
 7
8
  vcycleIteration(N, x, b):
  smoother(N, x, b)
 9
                             # pre-smooth
   r = residual(N. x. b) # residual
10
12
  nc = reinvade(N, claim) # reinvade claim
13
14 \text{ Nr} = \text{N}/2
   # restricted level
15 rr = restrict(N, r)
   # restrict residual to new claim
16 er = (Nr. ℚ)
   # setup error with 0 values
17
18 nc2 = vcycleIteration(Nr, er, rr)
   # vcycle
19
   # redistribute
20
21 if (nc != nc2):
22 nc = nc2
  x.redistribute(Nr. nc)
23
24
  b.redistribute(Nr. nc)
25
26 e = prolongate(Nr, er)
   # prolongate error
27 x = x + e
   # apply correction
28
29
   smoother(N, x, b)
                             # post-smooth
   uary 9, 2016
return nc # possibly modified claim
```

18

Four multigrid applications; combat mode on 38 PEs:



RaPping and Compilation for Highly Dynamic Parallelism

### Conclusion



- InvasIC: hard- + software support for resource-aware parallel programming
- Dynamic RaPping based on local estimations:
  - 1. check whether additional resources required,
  - 2. invade resources ( = obtain a "claim" of PEs),
  - 3. infect them ( = load app code + data to new PEs + execute),
  - 4. combat ( = OS balances dynamic resource requests),
  - 5. retreat ( = free invaded claims)

19

### Conclusion

- InvasIC: hard- + software support for resource-aware parallel programming
- Dynamic RaPping based on local estimations:
  - 1. check whether additional resources required,
  - 2. invade resources ( = obtain a "claim" of PEs),
  - 3. infect them ( = load app code + data to new PEs + execute),
  - 4. combat ( = OS balances dynamic resource requests),
  - 5. retreat ( = free invaded claims)
- PP@KIT: language + compiler + applications
- next step: validate resource utilization / energy consumption / fault tolerance in combat setting on demonstrator system

### Thank You!

