Foreword | vii | |
Preface | xvii | |
PART 1 GENERAL CONCEPTS |
Multiprocessing and Scalability | 3 | (38) |
Multiprocessor Architecture | 6 | (7) |
Single versus Multiple Instruction Streams | 7 | (1) |
Message-Passing versus Shared-Memory Architectures | 8 | (5) |
| 13 | (7) |
| 14 | (2) |
| 16 | (4) |
| 20 | (17) |
Scalable Interconnection Networks | 24 | (7) |
| 31 | (2) |
| 33 | (1) |
Summary of Hardware Architecture Scalability | 34 | (1) |
Scalability of Parallel Software | 35 | (2) |
Scaling and Processor Grain Size | 37 | (2) |
| 39 | (2) |
Shared-Memory Parallel Programs | 41 | (46) |
| 41 | (5) |
| 46 | (4) |
| 48 | (1) |
| 48 | (1) |
| 49 | (1) |
| 49 | (1) |
| 49 | (1) |
| 50 | (1) |
| 50 | (2) |
Basic Program Characteristics | 51 | (1) |
Parallel Application Execution Model | 52 | (1) |
Parallel Execution under a PRAM Memory Model | 53 | (2) |
Parallel Execution with Shared Data Uncached | 55 | (1) |
Parallel Execution with Shared Data Cached | 56 | (2) |
Summary of Results with Different Memory System Models | 58 | (1) |
Communication Behavior of Parallel Applications | 59 | (1) |
Communication-to-Computation Ratios | 59 | (3) |
| 62 | (22) |
Classification of Data Objects | 62 | (2) |
Average Invalidation Characteristics | 64 | (1) |
Basic Invalidation Patterns for Each Application | 65 | (2) |
| 67 | (1) |
| 67 | (2) |
| 69 | (2) |
| 71 | (2) |
| 73 | (1) |
| 73 | (3) |
Summary of Individual Invalidation Distributions | 76 | (1) |
| 76 | (1) |
Effect of Number of Processors | 76 | (2) |
Effect of Finite Caches and Replacement Hints | 78 | (2) |
Effect of Cache Line Size | 80 | (3) |
Invalidation Patterns Summary | 83 | (1) |
| 84 | (3) |
System Performance Issues | 87 | (30) |
| 88 | (1) |
| 89 | (6) |
Nonuniform Memory Access (NUMA) | 90 | (1) |
Cache-Only Memory Architecture (COMA) | 91 | (2) |
Direct Interconnect Networks | 93 | (1) |
| 93 | (1) |
| 94 | (1) |
Latency Reduction Summary | 95 | (1) |
| 95 | (16) |
| 96 | (4) |
| 100 | (3) |
Multiple-Context Processors | 103 | (5) |
Producer-Initiated Communication | 108 | (2) |
| 110 | (1) |
| 111 | (5) |
| 112 | (1) |
| 113 | (3) |
| 116 | (1) |
| 117 | (26) |
Scalability of System Costs | 117 | (17) |
Directory Storage Overhead | 119 | (8) |
| 127 | (5) |
| 132 | (1) |
Summary of Directory Storage Overhead | 133 | (1) |
Implementation Issues and Design Correctness | 134 | (8) |
Unbounded Number of Requests | 134 | (2) |
Distributed Memory Operations | 136 | (3) |
| 139 | (1) |
Error Detection and Fault Tolerance | 139 | (2) |
| 141 | (1) |
| 142 | (1) |
Scalable Shared-Memory Systems | 143 | (30) |
| 143 | (7) |
| 144 | (1) |
| 144 | (2) |
| 146 | (1) |
IEEE Scalable Coherent Interface | 147 | (2) |
| 149 | (1) |
| 150 | (7) |
| 151 | (1) |
| 152 | (2) |
| 154 | (1) |
Kendall Square Research KSR-1 and KSR-2 | 155 | (2) |
Reflective Memory Systems | 157 | (2) |
| 157 | (1) |
| 158 | (1) |
Non-Cache-Coherent Systems | 159 | (3) |
| 159 | (1) |
| 160 | (1) |
| 161 | (1) |
Vector Supercomputer Systems | 162 | (4) |
| 163 | (1) |
| 164 | (2) |
Virtual Shared-Memory Systems | 166 | (4) |
| 166 | (1) |
| 167 | (2) |
MIT/Motorola *T and *T-NG | 169 | (1) |
| 170 | (3) |
PART 2 EXPERIENCE WITH DASH |
| 173 | (32) |
| 174 | (7) |
| 175 | (2) |
| 177 | (3) |
| 180 | (1) |
| 181 | (3) |
| 184 | (14) |
| 185 | (2) |
| 187 | (5) |
| 192 | (1) |
| 193 | (5) |
| 198 | (3) |
| 198 | (2) |
| 200 | (1) |
| 200 | (1) |
Protocol General Exceptions | 201 | (1) |
| 202 | (3) |
Prototype Hardware Structures | 205 | (32) |
| 206 | (5) |
SGI Multiprocessor Bus (MPBUS) | 206 | (1) |
| 207 | (3) |
| 210 | (1) |
| 211 | (1) |
| 211 | (7) |
| 218 | (6) |
| 224 | (2) |
Network and Network Interface | 226 | (3) |
| 229 | (3) |
Logic Overhead of Directory-Based Coherence | 232 | (4) |
| 236 | (1) |
Prototype Performance Analysis | 237 | (40) |
| 237 | (9) |
Overall Memory System Bandwidth | 238 | (2) |
Other Memory Bandwidth Limits | 240 | (1) |
Processor Issue Bandwidth and Latency | 241 | (3) |
| 244 | (1) |
Summary of Memory System Bandwidth and Latency | 244 | (2) |
Parallel Application Performance | 246 | (14) |
Application Run-Time Environment | 246 | (1) |
| 247 | (3) |
| 250 | (7) |
Application Speedup Summary | 257 | (3) |
| 260 | (11) |
| 260 | (4) |
Alternative Memory Operations | 264 | (7) |
| 271 | (6) |
PART 3 FUTURE TRENDS |
| 277 | (28) |
TeraDASH System Organization | 277 | (9) |
TeraDASH Cluster Structure | 278 | (2) |
| 280 | (3) |
| 283 | (1) |
TeraDASH Directory Structure | 284 | (2) |
TeraDASH Coherence Protocol | 286 | (10) |
Required Changes for the Scalable Directory Structure | 286 | (2) |
Enhancements for Increased Protocol Robustness | 288 | (6) |
Enhancements for Increased Performance | 294 | (2) |
| 296 | (7) |
| 297 | (1) |
Potential Application Speedup | 298 | (5) |
| 303 | (2) |
Conclusions and Future Directions | 305 | (6) |
| 306 | (1) |
| 307 | (1) |
| 308 | (3) |
Appendix Multiprocessor Systems | 311 | (6) |
References | 317 | (16) |
Index | 333 | |