In the first quarter of 2000, the Leibniz-Rechenzentrum (LRZ) installed a Hitachi SR8000-F1 system, built by the Japanese company Hitachi, as the Höchstleistungsrechner in Bayern (HLRB). In early 2002 the system was expanded by more than half. As a result, the fastest computer in Europe was, for a time, available at LRZ.

The decision for this machine was made in summer 1999 on the basis of current benchmark programs from the users of high-performance computers; the aim of the selection was a machine with the best cost/benefit ratio for realistic applications.

In its first stage, the system comprised 112 pseudo-vector nodes, each consisting of 8 effectively usable CPUs. Each node delivers a peak performance of 12 GFlop/s and has 8 GByte of main memory; four nodes are equipped with 16 GByte. This yields a peak performance of 1.3 TFlop/s (112 x 12 GFlop/s = 1344 GFlop/s). The nodes are connected by a three-dimensional crossbar that provides a bandwidth of 950 MByte/s between any two nodes and a latency (the delay of the first message sent between two nodes) of 19 microseconds. Further details on the configuration of the HLRB in its second stage are given in a separate table.
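The latency and bandwidth figures above can be combined into a rough estimate of how long a single message between two nodes takes. The following is a minimal sketch, assuming a simple linear model (time = latency + size / bandwidth) and 1 MByte = 10^6 bytes; neither the model nor the function names come from the LRZ description, and real transfer times depend on the message protocol and network load.

```python
# Linear latency/bandwidth model for one point-to-point message.
# The two constants are the figures quoted in the text; the model itself
# is an illustrative assumption, not an LRZ measurement.

LATENCY_S = 19e-6        # 19 microseconds
BANDWIDTH_BPS = 950e6    # 950 MByte/s, assuming 1 MByte = 10**6 bytes


def transfer_time(message_bytes: float) -> float:
    """Estimated time in seconds to send one message between two nodes."""
    return LATENCY_S + message_bytes / BANDWIDTH_BPS


def effective_bandwidth(message_bytes: float) -> float:
    """Achieved bytes per second for a message of the given size."""
    return message_bytes / transfer_time(message_bytes)


# Message size at which half of the asymptotic bandwidth is reached:
# latency and pure transfer time are then equal.
HALF_BANDWIDTH_BYTES = LATENCY_S * BANDWIDTH_BPS

if __name__ == "__main__":
    for size in (8, 1024, 64 * 1024, 8 * 1024 * 1024):
        t_us = transfer_time(size) * 1e6
        bw_mb = effective_bandwidth(size) / 1e6
        print(f"{size:>9} bytes: {t_us:10.1f} us, {bw_mb:7.1f} MByte/s")
```

Under this model, a message has to be roughly 18 KByte (19 microseconds x 950 MByte/s) before half of the quoted bandwidth is reached; for shorter messages the latency dominates.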

The following picture shows the final configuration of the Hitachi SR8000-F1/168 with 168 nodes in the LRZ machine room. The machine is 10 metres long overall and up to 8 metres wide.

Hitachi SR8000-F1/168 with 168 compute nodes

Access to the HLRB is intended for projects from all over Germany whose execution is urgently called for on scientific grounds and would not be possible on any other available platform. Unlike the high-performance computers previously available at LRZ, the HLRB therefore cannot be opened to the Bavarian universities in general; instead, resources are allocated per project (and nationwide) by a review board. Well-vectorizing programs are given priority on the HLRB, but the architecture of the SR8000-F1 is flexible enough that the HLRB can also be used as an MPP system.

Technical data:

Final configuration 2002

Peak performance of the whole system: 2.0 TFlop/s

Expected application performance: 600 GFlop/s

Main memory: 1376 GByte
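The aggregate figures follow directly from the per-node values quoted in this document (12 GFlop/s per node, 8 GByte per node, four nodes with 16 GByte). The sketch below reproduces that arithmetic; the function and constant names are chosen here for illustration only.

```python
# Aggregate peak performance and memory from the per-node figures in the text.
# All names are illustrative; the numbers are those quoted in the document.

PEAK_PER_NODE_GFLOPS = 12      # peak performance of one node
STANDARD_NODE_GBYTE = 8        # main memory of a regular node
LARGE_NODE_GBYTE = 16          # main memory of a large-memory node
LARGE_NODES = 4                # number of large-memory nodes
EXPECTED_APPLICATION_GFLOPS = 600


def peak_gflops(nodes: int) -> int:
    """Peak performance of a system with the given number of nodes."""
    return nodes * PEAK_PER_NODE_GFLOPS


def total_memory_gbyte(nodes: int, large_nodes: int = LARGE_NODES) -> int:
    """Total main memory, counting large-memory nodes separately."""
    return (nodes - large_nodes) * STANDARD_NODE_GBYTE + large_nodes * LARGE_NODE_GBYTE


def application_fraction(nodes: int) -> float:
    """Expected application performance as a fraction of peak."""
    return EXPECTED_APPLICATION_GFLOPS / peak_gflops(nodes)


if __name__ == "__main__":
    print("first stage peak :", peak_gflops(112), "GFlop/s")
    print("final stage peak :", peak_gflops(168), "GFlop/s")
    print("final stage memory:", total_memory_gbyte(168), "GByte")
    print(f"expected application share of peak: {application_fraction(168):.1%}")
```

With 168 nodes this gives 2016 GFlop/s and 1376 GByte, matching the table; the expected application performance of 600 GFlop/s corresponds to roughly 30% of peak.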


Characteristics:

Shared memory for each group of 9 processors (i.e. one SMP node)
Distributed memory across the whole system
Very flexible and highly efficient use of each node as an SMP system or as a pseudo-vector processor (PVP)
Comparison with true vector processors:
At least equal performance; considerably better performance for small data sets
Parallelization with OpenMP within a node; across the whole machine with message passing (MPI)
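The two-level parallelization described above (message passing between nodes, shared-memory threads within a node) implies a two-level split of the work. The following sketch shows only the index arithmetic of such a split in plain Python; it is not OpenMP or MPI code, the function names are hypothetical, and the block-wise distribution is one common choice rather than anything prescribed by LRZ.

```python
# Two-level block decomposition: first across nodes (the level that MPI
# would handle), then across the CPUs of each node (the level that OpenMP
# would handle). Purely illustrative index arithmetic.
from typing import List, Tuple

CPUS_PER_NODE = 8  # CPUs usable for computation per node, as stated in the text


def split_range(start: int, stop: int, parts: int) -> List[Tuple[int, int]]:
    """Split [start, stop) into `parts` contiguous chunks differing by at most one item."""
    base, extra = divmod(stop - start, parts)
    chunks = []
    lo = start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def two_level_decomposition(
    n_items: int, nodes: int, cpus_per_node: int = CPUS_PER_NODE
) -> List[List[Tuple[int, int]]]:
    """Return, for every node, the list of per-CPU index ranges."""
    return [
        split_range(lo, hi, cpus_per_node)
        for lo, hi in split_range(0, n_items, nodes)
    ]


if __name__ == "__main__":
    layout = two_level_decomposition(n_items=1000, nodes=4)
    for node, cpu_ranges in enumerate(layout):
        print(f"node {node}: {cpu_ranges}")
```

Each node receives one contiguous block, which is then divided among its 8 CPUs, so data exchanged between nodes stays limited to block boundaries while the CPUs of one node share the block in memory.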

Field of application:

Vectorizable and/or coarse-grained parallelizable programs with very high resource requirements

LRZ: Hardware Description of the SR8000-F1

Edition: 2003-03-12

The following table shows details of the HLRB:

Number of SMP-Nodes 168

CPUs per Node 8 (COMPAS, 9 physical)

Number of Processors 168*8 = 1344

Peak Performance per CPU 1.5 GFlop/s ¹

Peak Performance per Node 12 GFlop/s¹

Peak Performance of the whole System 2016 GFlop/s¹

LINPACK Performance of the whole System 1645 GFlop/s¹

Expected Efficiency (from LRZ Benchmarks) > 600 GFlop/s¹

Performance from main memory (most unfavourable case) > 244 GFlop/s¹

Memory per node 8 GBytes
(ca. 6.5 GByte in user space)
4 Nodes with 16 GByte each

Memory of the whole system 1376 GBytes

Processor Characteristics
      Clock Frequency: 375 MHz
      Number of Floating-Point Registers: 160 (Global: 32, Slide: 128)
      Number of Integer Registers: 32
      Data Cache Size: 128 KB
      DCache Line Size: 128 B
      DCache Copy Back or Write Through: Write through
      DCache Set Associativity: 4-way
      DCache Mapping: direct
      Bandwidth Registers to L1 DCache: 12 GByte/s
               relative to frequency: 32 Bytes/cycle
               relative to compute performance: 1 DP Word / theor. Flop²
      Bandwidth to Memory: 4 GBytes/s
               relative to frequency: 10 Bytes/cycle
               relative to compute performance: 1/3 DP Words / theor. Flop²
      Instruction Cache: 64 KB
      ICache Set Associativity: 2-way
      ICache Mapping: direct

Aggregated Disk Storage 10 TBytes³

Disk storage for HOME-Directories (/home) 800 GBytes

Disk storage for temporary and pseudo-temporary data 5 TBytes³

Aggregated I/O Bandwidth to /home > 600 MByte/s

Aggregated I/O Bandwidth to temporary data (/tmpxyz, /ptmp) 2.4 GByte/s

Communication bandwidth measured unidirectionally between two nodes (available bidirectionally)
       using MPI without RDMA: 770 MByte/s
       using MPI and RDMA: 950 MByte/s
       hardware: 1000 MByte/s

Communication capacity of the whole system (2 x unidirectional bisection bandwidth) with MPI and RDMA 2 x 79 = 158 GByte/s (Hardware: 2 x 84 = 168 GByte/s)

¹ 1 GFlop/s = 1 Giga floating-point operations per second = 1000000000 (1 followed by 9 zeros) floating-point operations per second.
² Machine Balance: number of double-precision (64-bit) words per theoretically possible floating-point operation
³ 1 TByte = 1 TeraByte = 1000 GBytes
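The machine-balance and bytes-per-cycle entries in the processor characteristics can be recomputed from the clock frequency, the per-CPU peak performance and the quoted bandwidths. The sketch below does this, assuming decimal units (1 GByte = 10^9 bytes, in line with footnote ³) and 8-byte double-precision words; the function names are illustrative.

```python
# Derived processor ratios from the table: bytes per cycle and machine
# balance (double-precision words per theoretical floating-point operation).
# Assumes 1 GByte = 10**9 bytes and 8-byte DP words.

CLOCK_HZ = 375e6            # clock frequency
PEAK_FLOPS_PER_CPU = 1.5e9  # peak performance per CPU
WORD_BYTES = 8              # one double-precision word

CACHE_BANDWIDTH_BPS = 12e9  # registers to L1 data cache
MEMORY_BANDWIDTH_BPS = 4e9  # bandwidth to main memory


def flops_per_cycle() -> float:
    """Theoretical floating-point operations per clock cycle and CPU."""
    return PEAK_FLOPS_PER_CPU / CLOCK_HZ


def bytes_per_cycle(bandwidth_bps: float) -> float:
    """Bandwidth expressed relative to the clock frequency."""
    return bandwidth_bps / CLOCK_HZ


def words_per_flop(bandwidth_bps: float) -> float:
    """Machine balance: DP words delivered per theoretical flop."""
    return bandwidth_bps / WORD_BYTES / PEAK_FLOPS_PER_CPU


if __name__ == "__main__":
    print("flops per cycle        :", flops_per_cycle())
    print("cache  bytes per cycle :", bytes_per_cycle(CACHE_BANDWIDTH_BPS))
    print("memory bytes per cycle :", round(bytes_per_cycle(MEMORY_BANDWIDTH_BPS), 2))
    print("cache  words per flop  :", words_per_flop(CACHE_BANDWIDTH_BPS))
    print("memory words per flop  :", round(words_per_flop(MEMORY_BANDWIDTH_BPS), 3))
```

The cache values reproduce the table exactly (32 Bytes/cycle, 1 word per flop), and the memory balance comes out at 1/3 word per flop. For the memory bandwidth the calculation gives about 10.7 Bytes/cycle where the table lists 10, presumably because the quoted 4 GBytes/s is itself a rounded figure.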

Imprint, Matthias Brehm, Reinhold Bader, Ralf Ebner, 2003-03-12.

