Architecture Design


Contents

1 Introduction

Overview

Evidence of soft errors

Types of soft errors

Cost-effective solutions to mitigate the impact of soft errors

Faults

Errors

Metrics

Dependability models

Reliability

Availability

Miscellaneous models

Permanent faults in complementary metal oxide semiconductor technology

Metal failure modes

Gate oxide failure modes

Radiation-induced transient faults in CMOS transistors

The alpha particle

The neutron

Interaction of alpha particles and neutrons with silicon crystals

Architectural fault models for alpha particle and neutron strikes

Silent data corruption and detected unrecoverable errors

Basic definitions: SDC and DUE

SDC and DUE budgets

Soft error scaling trends

SRAM and latch scaling trends

DRAM scaling trends

Summary

Historical anecdote

References

 

2 Device- and circuit-level modeling, measurement, and mitigation

Overview

Modeling circuit-level SERs

Impact of alpha particle or neutron on circuit elements

Critical charge

Timing vulnerability factor

Masking effects in combinatorial logic gates

Vulnerability of clock circuits

Measurement

Field data collection

Accelerated alpha particle tests

Accelerated neutron tests

Mitigation techniques

Device enhancements

Circuit enhancements

Summary

Historical anecdote

References

 

3 Architectural vulnerability analyses

Overview

AVF basics

Does a bit matter?

SDC and DUE equations

Bit-level SDC and DUE FIT equations

Chip-level SDC and DUE FIT equations

False DUE AVF

Case study: false DUE from lockstepped checkers

Process-kill versus system-kill DUE AVF

ACE principles

Types of ACE and Un-ACE bits

Point-of-strike model versus propagated fault model

Microarchitectural Un-ACE bits

Idle or invalid state

Misspeculated state

Predictor structures

Ex-ACE state

Architectural Un-ACE bits

NOP instructions

Performance-enhancing operations

Predicated false instructions

Dynamically dead instructions

Logical masking

AVF equations for a hardware structure

Computing AVF with Little's law

Implications of Little's law for AVF computation

Computing AVF with a performance model

Limitations of AVF analysis with performance models

ACE analysis using the point-of-strike fault model

AVF results from an Itanium 2 performance model

ACE analysis using the propagated fault model

Summary

Historical anecdote

References

 

4 Advanced architectural vulnerability analysis

Overview

Lifetime analysis of RAM arrays

Basic idea of lifetime analysis

Accounting for structural differences in lifetime analysis

Impact of working set size for lifetime analysis

Granularity of lifetime analysis

Computing the DUE AVF

Lifetime analysis of CAM arrays

Handling false-positive matches in a CAM array

Handling false-negative matches in a CAM array

Effect of cooldown in lifetime analysis

AVF results for cache, data translation buffer, and store buffer

Unknown components

RAM arrays

CAM arrays

DUE AVF

Computing AVF using SFI into an RTL model

Comparison of fault injection and ACE analyses

Random sampling in SFI

Determining if an injected fault will result in an error

Case study of SFI

The Illinois SFI study

SFI methodology

Transient faults in pipeline state

Transient faults in logic blocks

Summary

Historical anecdote

References

 

5 Error coding techniques

Overview

Fault detection and ECC for state bits

Basics of error coding

Error detection using parity codes

Single-error correction codes

Single-error correct double-error detect code

Double-error correct triple-error detect code

Cyclic redundancy check

Error detection codes for execution units

AN codes

Residue codes

Parity prediction circuits

Implementation overhead of error detection and correction codes

Number of logic levels

Overhead in area

Scrubbing analysis

DUE FIT from temporal double-bit error with no scrubbing

DUE rate from temporal double-bit error with fixed-interval scrubbing

Summary

Historical anecdote

References

 

6 Fault detection via redundant execution

Overview

Sphere of replication

Components of the sphere of replication

The size of sphere of replication

Output comparison and input replication

Fault detection via cycle-by-cycle lockstepping

Advantages of lockstepping

Disadvantages of lockstepping

Lockstepping in the Stratus ftServer

Lockstepping in the Hewlett-Packard NonStop Himalaya architecture

Lockstepping in the IBM Z-series processors

Fault detection via RMT

RMT in the Marathon Endurance server

RMT in the Hewlett-Packard NonStop Advanced Architecture

RMT within a single-processor core

A simultaneous multithreaded processor

Design space for SMT in a single core

Output comparison in an SRT processor

Input replication in an SRT processor

Two techniques to enhance performance of an SRT processor

Performance evaluation of an SRT implementation

Alternate single-core RMT implementation

RMT in a multicore architecture

DIVA: RMT using specialized checker processor

RMT enhancements

Relaxed input replication

Relaxed output comparison

Partial RMT

Summary

Historical anecdote

References

 

7 Hardware error recovery

Overview

Classification of hardware error recovery schemes

Reboot

Forward error recovery

Backward error recovery

Forward error recovery

Fail-over systems

DMR with recovery

Triple modular redundancy

Pair-and-spare

Backward error recovery with fault detection before register commit

Fujitsu SPARC64 V: parity with retry

IBM Z-series: lockstepping with retry

Recovery in an SRT processor

ReVive: backward error recovery using global checkpoints

SafetyNet: backward error recovery using local checkpoints

Backward error recovery with fault detection after I/O commit

Summary

Historical anecdote

References

 

8 Software detection and recovery

Overview

Fault detection using

Fault detection using

Fault detection using software RMT

Error detection by duplicated instructions

Software-implemented fault tolerance

Configurable transient fault detection via dynamic binary translation

Fault detection using hybrid RMT

CRAFT: A Hybrid RMT Implementation

CRAFT evaluation

Fault detection using RVMs

Application-level recovery

Forward error recovery using software RMT and AN codes for fault detection

Log-based backward error recovery in database systems

Checkpoint-based backward error recovery for shared memory programs

OS-level and VMM-level recoveries

Summary

References