ARCHITECTURE DESIGN
Contents
1 Introduction
Overview
Evidence of soft errors
Types of soft errors
Cost-effective solutions to mitigate the impact of soft errors
Faults
Errors
Metrics
Dependability models
Reliability
Availability
Miscellaneous models
Permanent faults in complementary metal oxide semiconductor technology
Metal failure modes
Gate oxide failure modes
Radiation-induced transient faults in CMOS transistors
The alpha particle
The neutron
Interaction of alpha particles and neutrons with silicon crystals
Architectural fault models for alpha particle and neutron strikes
Silent data corruption and detected unrecoverable errors
Basic definitions: SDC and DUE
SDC and DUE budgets
Soft error scaling trends
SRAM and latch scaling trends
DRAM scaling trends
Summary
Historical anecdote
References
2 Device- and circuit-level modeling, measurement, and mitigation
Overview
Modeling circuit-level SERs
Impact of alpha particle or neutron on circuit elements
Critical charge
Timing vulnerability factor
Masking effects in combinatorial logic gates
Vulnerability of clock circuits
Measurement
Field data collection
Accelerated alpha particle tests
Accelerated neutron tests
Mitigation techniques
Device enhancements
Circuit enhancements
Summary
Historical anecdote
References
3 Architectural vulnerability analyses
Overview
AVF basics
Does a bit matter?
SDC and DUE equations
Bit-level SDC and DUE FIT equations
Chip-level SDC and DUE FIT equations
False DUE AVF
Case study: false DUE from lockstepped checkers
Process-kill versus system-kill DUE AVF
ACE principles
Types of ACE and Un-ACE bits
Point-of-strike model versus propagated fault model
Microarchitectural Un-ACE bits
Idle or invalid state
Misspeculated state
Predictor structures
Ex-ACE state
Architectural Un-ACE bits
NOP instructions
Performance-enhancing operations
Predicated false instructions
Dynamically dead instructions
Logical masking
AVF equations for a hardware structure
Computing AVF with Little's law
Implications of Little's law for AVF computation
Computing AVF with a performance model
Limitations of AVF analysis with performance models
ACE analysis using the point-of-strike fault model
AVF results from an Itanium 2 performance model
ACE analysis using the propagated fault model
Summary
Historical anecdote
References
4 Advanced architectural vulnerability analysis
Overview
Lifetime analysis of RAM arrays
Basic idea of lifetime analysis
Accounting for structural differences in lifetime analysis
Impact of working set size for lifetime analysis
Granularity of lifetime analysis
Computing the DUE AVF
Lifetime analysis of CAM arrays
Handling false-positive matches in a CAM array
Handling false-negative matches in a CAM array
Effect of cooldown in lifetime analysis
AVF results for cache, data translation buffer, and store buffer
Unknown components
RAM arrays
CAM arrays
DUE AVF
Computing AVF using SFI into an RTL model
Comparison of fault injection and ACE analyses
Random sampling in SFI
Determining if an injected fault will result in an error
Case study of SFI
The Illinois SFI study
SFI methodology
Transient faults in pipeline state
Transient faults in logic blocks
Summary
Historical anecdote
References
5 Error coding techniques
Overview
Fault detection and ECC for state bits
Basics of error coding
Error detection using parity codes
Single-error correction codes
Single-error correct double-error detect code
Double-error correct triple-error detect code
Cyclic redundancy check
Error detection codes for execution units
AN codes
Residue codes
Parity prediction circuits
Implementation overhead of error detection and correction codes
Number of logic levels
Overhead in area
Scrubbing analysis
DUE FIT from temporal double-bit error with no scrubbing
DUE rate from temporal double-bit error with fixed-interval scrubbing
Summary
Historical anecdote
References
6 Fault detection via redundant execution
Overview
Sphere of replication
Components of the sphere of replication
The size of the sphere of replication
Output comparison and input replication
Fault detection via cycle-by-cycle lockstepping
Advantages of lockstepping
Disadvantages of lockstepping
Lockstepping in the Stratus ftServer
Lockstepping in the Hewlett-Packard NonStop Himalaya architecture
Lockstepping in the IBM Z-series processors
Fault detection via RMT
RMT in the Marathon Endurance server
RMT in the Hewlett-Packard NonStop Advanced Architecture
RMT within a single-processor core
A simultaneous multithreaded processor
Design space for SMT in a single core
Output comparison in an SRT processor
Input replication in an SRT processor
Two techniques to enhance performance of an SRT processor
Performance evaluation of an SRT implementation
Alternate single-core RMT implementation
RMT in a multicore architecture
DIVA: RMT using specialized checker processor
RMT enhancements
Relaxed input replication
Relaxed output comparison
Partial RMT
Summary
Historical anecdote
References
7 Hardware error recovery
Overview
Classification of hardware error recovery schemes
Reboot
Forward error recovery
Backward error recovery
Forward error recovery
Fail-over systems
DMR with recovery
Triple modular redundancy
Pair-and-spare
Backward error recovery with fault detection before register commit
Fujitsu SPARC64 V: parity with retry
IBM Z-series: lockstepping with retry
Recovery in an SRT processor
ReVive: backward error recovery using global checkpoints
SafetyNet: backward error recovery using local checkpoints
Backward error recovery with fault detection after I/O commit
Summary
Historical anecdote
References
8 Software detection and recovery
Overview
Fault detection using
Fault detection using
Fault detection using software RMT
Error detection by duplicated instructions
Software-implemented fault tolerance
Configurable transient fault detection via dynamic binary translation
Fault detection using hybrid RMT
CRAFT: a hybrid RMT implementation
CRAFT evaluation
Fault detection using RVMs
Application-level recovery
Forward error recovery using software RMT and AN codes for fault detection
Log-based backward error recovery in database systems
Checkpoint-based backward error recovery for shared-memory programs
OS-level and VMM-level recoveries
Summary
References