Abstract

We present here a report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach.
The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.

Biographies

Marc Snir is the Director of Argonne’s Mathematics and Computer Science Division and the Michael Faiman and Saburo Muroga Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. His research is focused on HPC, with recent work on programming models, performance analysis, and resilience. Snir received his PhD from the Hebrew University of Jerusalem. He spent time at NYU, where he worked on the NYU Ultracomputer, and at IBM Research, where he led the research team that worked on the software for the IBM SP and Blue Gene systems. At UIUC, he headed the CS department and led the creation of the Illinois Informatics Institute. Marc Snir is an AAAS, ACM, IEEE, and Argonne Fellow. He has recently received the IEEE Award for Excellence in Scalable Computing and the IEEE Computer Society Seymour Cray Computer Engineering Award.
Pavan Balaji holds appointments as a Computer Scientist at the Argonne National Laboratory, as an Institute Fellow of the Northwestern-Argonne Institute of Science and Engineering at Northwestern University, and as a Research Fellow of the Computation Institute at the University of Chicago. He leads the Programming Models and Runtime Systems group at Argonne. His research interests include parallel programming models and runtime systems for communication and I/O, modern system architecture (multicore, accelerators, complex memory subsystems, high-speed networks), and cloud computing systems. He has nearly 100 publications in these areas and has delivered nearly 120 talks and tutorials at various conferences and research institutes. He is a recipient of several awards including the U.S. Department of Energy Early Career award in 2012, TEDx Midwest Emerging Leader award in 2013, Crain’s Chicago 40 under 40 award in 2012, Los Alamos National Laboratory Director’s Technical Achievement award in 2005, Ohio State University Outstanding Researcher award in 2005, five best-paper awards, and various others. He serves as the worldwide chairperson for the IEEE Technical Committee on Scalable Computing (TCSC). He has also served as a chair or editor for nearly 50 journals, conferences, and workshops, and as a technical program committee member in numerous conferences and workshops. He is a senior member of the IEEE and a professional member of the ACM. More details are available at http://www.mcs.anl.gov/~balaji.
Todd Munson received a BS in Computer Science from the University of Nebraska in 1995, and an MS in 1996 and PhD in 2000 in Computer Science from the University of Wisconsin at Madison. He is a Computational Scientist in the Mathematics and Computer Science Division at Argonne National Laboratory and a Senior Fellow in the Computation Institute at the University of Chicago and Argonne National Laboratory. The primary focus of his research is algorithms and applications of numerical optimization and variational inequalities. He has been widely recognized for his contributions. Among other honors he was awarded a Presidential Early Career Award for Scientists and Engineers from the White House, an Early Career Scientist and Engineer Award from the U.S. Department of Energy in 2006, and the Beale-Orchard-Hayes Prize from the Mathematical Programming Society in 2003. He has twice been invited to the White House to meet the President of the United States (Bush 41 and Bush 43).
Andrew A Chien is the William Eckhardt Professor in Computer Science at the University of Chicago. He is also a Senior Fellow at UC’s Computation Institute and a Senior Computer Scientist at Argonne National Laboratory. His research interests include parallel computing, computer architecture, and cloud computing. From 2005 to 2010, Chien was Vice President of Research at Intel Corporation where he launched new initiatives in parallel software, mobile computing, cloud computing, and exascale research. From 1998 to 2005, Chien was the SAIC Endowed Chair Professor in the Department of Computer Science and Engineering where he founded the Center for Networked Systems at the University of California San Diego. From 1990 to 1998, he was a Professor of Computer Science at the University of Illinois at Urbana-Champaign and the National Center for Supercomputing Applications (NCSA). He has served on numerous advisory committees for the National Science Foundation, Department of Energy, and universities such as Stanford, EPFL, and Cal-Berkeley. Chien earned BS, MS, and PhD degrees at the Massachusetts Institute of Technology, and is a Fellow of the ACM, IEEE, and AAAS.
Pradip Bose is a research scientist at IBM T. J. Watson Research Center, where he manages a department on power-efficient, resilient systems. He holds a PhD from the University of Illinois at Urbana-Champaign. He has been associated with the definition and pre-silicon modeling of virtually all POWER-series processors, beginning with the original pre-product super scalar RISC project at IBM. He is a member of IBM’s Academy of Technology and an IEEE Fellow.
Al Geist is a Corporate Research Fellow at Oak Ridge National Laboratory. He is the Chief Technology Officer of the Leadership Computing Facility. His recent research is on exascale computing and resilience needs of hardware and software.
Saurabh Bagchi is a Professor in the School of Electrical and Computer Engineering and the Department of Computer Science (by courtesy) at Purdue University in West Lafayette, Indiana. He is a Senior Member of IEEE and ACM, a Distinguished Speaker for ACM, an IMPACT Faculty Fellow at Purdue (2013–14), and an Assistant Director of the CERIAS security center at Purdue. He leads the Dependable Computing Systems Laboratory (DCSL), where his group performs research in practical system design and implementation of dependable distributed systems. Since 2011, he has been serving as a Visiting Scientist with IBM Austin Research Lab.
Mattan Erez is an Associate Professor at the Department of Electrical and Computer Engineering at the University of Texas at Austin. His research focuses on improving the performance, efficiency, and scalability of computing systems through advances in hardware architecture, software systems, and programming models. The vision is to increase the cooperation across system layers and develop flexible and adaptive mechanisms for proportional resource usage. Erez received a BSc in Electrical Engineering and a BA in Physics from the Technion, Israel Institute of Technology, and his MS and PhD in Electrical Engineering from Stanford University.
Sarita V Adve is Professor in Computer Science at the University of Illinois. Her research interests are broadly in computer architecture and systems. She leads the SWAT project, one of the early projects to explore holistic software-driven solutions for hardware resiliency. She is an ACM Fellow, an IEEE Fellow, and an ABI Women of Vision award winner in innovation.
Sven Leyffer is a senior computational mathematician in the Mathematics and Computer Science Division at Argonne National Laboratory, and a Senior Fellow of the Computation Institute. He obtained his PhD from the University of Dundee, UK, and has held postdoc positions at Dundee, Northwestern University, and Argonne. He is a Fellow of the Society for Industrial and Applied Mathematics.
Nathan DeBardeleben received his PhD in Computer Engineering from Clemson University in 2004 and started at Los Alamos National Laboratory the same year. DeBardeleben has been influential in defining the field of HPC resilience, its challenges and potentials. He has co-authored a handful of governmental position papers on the subject as well as his own research publications. In his own research, his focus is on characterizing the impact of soft errors on systems and applications. DeBardeleben is on numerous reliability program committees, runs his own workshop (Fault-tolerance for HPC at Extreme Scale (FTXS)) and runs the Los Alamos National Laboratory resilience site (http://institute.lanl.gov/resilience/).
Christian Engelmann is Task Lead of the System Software Team in the Computer Science and Mathematics Division at Oak Ridge National Laboratory. He earned his PhD in Computer Science in 2008 and his MSc in Computer Science in 2001, both from the University of Reading, UK. He also obtained a German Certified Engineer diploma in Computer Systems Engineering in 2001 from the University of Applied Sciences, Berlin. Engelmann’s research aims at computer science challenges for extreme-scale HPC system software, such as dependability, scalability, and portability. His primary expertise is in HPC resilience, that is, providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. His secondary expertise is in HPC hardware/software co-design through lightweight simulation of extreme-scale systems with millions of processor cores to study the impact of hardware properties on parallel application performance.
Jim Belak is a senior scientist in the Condensed Matter and Materials Division at Lawrence Livermore National Laboratory. He is Co-PI and Deputy Director for the Exascale Co-design Center for Materials in Extreme Conditions (ExMatEx), a joint project with Los Alamos National Laboratory, ORNL, SNL-A, Stanford, and CalTech, funded by the DOE Office of Advanced Scientific Computing Research. The goal of ExMatEx is to use the supercomputer codes used to study matter under extreme conditions to guide the design of future supercomputers and use the understanding gained to refactor and create new supercomputer codes. He earned his PhD in Condensed Matter Physics from Colorado State University.
Fred Johnson is currently with SAIC serving as senior SAIC technical advisor to the DOE NNSA Advanced Simulation & Computing organization. He has retired as the Senior Technical Manager for Computer Science in DOE/ASCR where he was the Program Manager responsible for fundamental computer science research and research on high-performance system software and tools including programming models, debugging and performance evaluation tools, software component architectures for high-performance systems, and next-generation runtime and OSs.
Pedro Diniz earned his PhD from the University of California, Santa Barbara, in Computer Science in 1997. Since then he has been a Research Assistant Professor of Computer Science with the University of Southern California in Los Angeles, California. He has also been involved in several research projects focusing on programming technology and execution models addressing productivity-related issues as well as fault-tolerance for large-scale high-performance architectures. He has participated in various scientific proposal review boards at the National Science Foundation as well as at the European Commission in Brussels. Over the last 20 years he has been heavily involved in the scientific community having participated as part of the technical program committee of over 15 international conferences in the area of HPC, reconfigurable and field-programmable computing.
Paul Coteus is an IBM Fellow in the Systems Department at the Thomas J. Watson Research Center. Coteus completed his PhD in Physics at Columbia University and joined IBM in 1988, leaving his position as Assistant Professor of Physics at the University of Colorado. He has directed and designed advanced packaging for high-speed electronics, memory systems, and processor complexes. He is currently the Chief Engineer of Data Centric Systems, and also leads the system engineering for the full line of Blue Gene Supercomputers, honored in 2008 with the National Medal of Technology and Innovation. He is an IEEE Fellow, a member of IBM’s Academy of Technology, and an IBM Master Inventor. He has authored more than 90 papers in the field of electronic packaging, and holds over 120 US patents.
Rinku Gupta is a senior scientific developer at Argonne National Laboratory. She received her MS degree in Computer Science from Ohio State University in 2002. She has several years of experience developing systems and infrastructure for enterprise HPC. Her research interests lie primarily in middleware libraries, programming models, and the design of fault-tolerance frameworks in HEC systems. More details about her are available at http://www.mcs.anl.gov/~rgupta.
Franck Cappello holds a Senior Computer Scientist position at Argonne National Laboratory where he leads the resilience effort. He is the main PI of the G8 ‘Enabling Climate Simulation at Extreme Scale’ project gathering research groups from six countries. He is also the initiator and co-director of the INRIA-Illinois-ANL Joint Laboratory on Petascale Computing. Before moving to the USA, he led the Grand-Large and Grid’5000 projects in France at INRIA, focusing on high-performance issues and research methodology for large-scale distributed systems. He has authored more than 130 papers and contributed to more than 70 program committees. He is an editorial board member of the international Journal of Grid Computing, Journal of Grid and Utility Computing, and Journal of Cluster Computing. He served on the steering committees of IEEE HPDC and IEEE/ACM CCGRID. He is the Program Co-Chair of ACM HPDC 2014 and ACM CAC 2014.
Rob Schreiber is a Distinguished Technologist at Hewlett Packard Laboratories. Schreiber’s research spans sequential and parallel algorithms for matrix computation, compiler optimization for parallel languages, and high-performance computer design. With Moler and Gilbert, he developed the sparse matrix extension of Matlab. He created the NAS CG parallel benchmark. He was a designer of the High Performance Fortran language. At HP, he led the development of PICO, a system for synthesis of custom hardware accelerators. His recent work concerns architectural uses of CMOS nanophotonic communication and NVM architecture. He is an ACM Fellow, a SIAM Fellow, and was awarded, in 2012, the Career Prize from the SIAM Activity Group in Supercomputing.
Dean Liberty is a Fellow at Advanced Micro Devices (AMD). He leads the Reliability/Availability/Serviceability (RAS) Architecture and Strategy team, focusing on long-term planning, detailed architecture, and short-term implementation for resilience in AMD processors. Dean has been in the computer industry for over 30 years, and involved in HPC systems for over 20 years. His experience covers a range of hardware and software, and his interests lie in bridging the gap between the two.
Eric Van Hensbergen is currently a principal design engineer at ARM Research in Austin, Texas. His current research focuses on exploring energy-efficient approaches to HPC through balance-driven co-design. Prior to joining ARM, he was a research staff member in the Future Systems department at IBM’s Austin Research Lab. Over 12 years at IBM Research, he worked on distributed OSs for HPC, low-power dense server and network processor appliance blades, DRAM power management, full system simulation, HPC, hypervisors, and the Linux OS. Before coming to IBM, he worked for four years at Lucent Technologies Bell Laboratories on the Plan 9 and Inferno OSs.
Sriram Krishnamoorthy received his BE degree from the College of Engineering-Guindy, Anna University, Chennai, and his MS and PhD degrees from The Ohio State University, Columbus, Ohio. He is currently a research scientist at Pacific Northwest National Laboratory. His research focuses on parallel programming models, fault tolerance, and compile-time/runtime optimizations for HPC. He has over 60 peer-reviewed conference and journal publications, receiving best-paper awards for his publications at the International Conference on High Performance Computing (HiPC’03) and the International Parallel and Distributed Processing Symposium (IPDPS’04). He is a recipient of the U.S. Department of Energy Early Career award and Pacific Northwest National Laboratory’s Ronald L. Brodzinski Award for Early Career Exceptional Achievement in 2013. He is a senior member of the IEEE and a professional member of ACM.
Subhasish Mitra directs the Robust Systems Group in the Department of Electrical Engineering and the Department of Computer Science of Stanford University, where he is the Chambers Faculty Scholar of Engineering. Before joining Stanford, he was a Principal Engineer at Intel Corporation. His research interests include robust system design, VLSI design, CAD, validation and test, and emerging nanotechnologies. His research results have seen widespread proliferation in industry, and have been recognized by several prestigious awards including the Presidential Early Career Award for Scientists and Engineers from the White House, the Intel Achievement Award, Intel’s highest corporate honor, and several best-paper awards for publications at major conferences and journals. He is a Fellow of the IEEE.
Jon Stearley is a senior member of technical staff at Sandia National Laboratories. His interests include historical and live mining of system logs to identify the root causes of faults, the propagation of errors, and their effects on user jobs, towards faster fixes today and better designs tomorrow.
Saverio Fazzari works for Booz Allen acting as a senior technical advisor to DARPA and other government agencies for numerous programs. Fazzari has a strong background in all areas of semiconductor design and fabrication, from algorithm development through device implementation. His specialty is advanced circuit design and development strategies with a focus on hardware cyber security issues including trusted design and fabrication. His background includes extensive commercial experience leading production innovation and development across all facets of the electronic design process.
Jacob A Abraham is a Professor in the Department of Electrical and Computer Engineering at the University of Texas at Austin. He is also director of the Computer Engineering Research Center and holds a Cockrell Family Regents Chair in Engineering. He received a bachelor’s degree in Electrical Engineering from the University of Kerala, India, in 1970. His MS degree, in Electrical Engineering, and PhD, in Electrical Engineering and Computer Science, were received from Stanford University, California, in 1971 and 1974, respectively. From 1975 to 1988 he was on the faculty of the University of Illinois, Urbana, Illinois.
William Carlson is a member of the research Computing Sciences Staff at the IDA Center for Computing Sciences where, since 1990, his focus has been on applications and system tools for large-scale parallel and distributed computers. He also leads the UPC Language Effort, a consortium of industry and academic research institutions aiming to produce a unified approach to parallel C programming based on global address space methods. Carlson graduated from Worcester Polytechnic Institute in 1981 with a BS degree in Electrical Engineering. He then attended Purdue University, receiving MSEE and PhD degrees in Electrical Engineering in 1983 and 1988, respectively. From 1988 to 1990, Carlson was an Assistant Professor at the University of Wisconsin–Madison, where his work centered on performance evaluation of advanced computer architectures.
Robert W Wisniewski is an ACM Distinguished Scientist and the Chief Software Architect for Extreme-Scale Computing and a Senior Principal Engineer at Intel Corporation. He has published over 60 papers in the area of HPC, computer systems, and system performance, and has filed over 50 patents. Before coming to Intel, he was the chief software architect for Blue Gene Research and manager of the Blue Gene and exascale research software team at the IBM T.J. Watson Research Facility, where he was an IBM Master Inventor and led the software effort on Blue Gene/Q, which was the fastest machine in the world on the June 2012 Top 500 list, and occupied four of the top 10 positions. Prior to working on Blue Gene, he worked on the K42 scalable OS project targeted at scalable next-generation servers and the DARPA HPCS project on continuous program optimization that utilizes integrated performance data to automatically improve application and system performance. Before joining IBM Research, and after receiving a PhD in Computer Science from the University of Rochester, he worked at Silicon Graphics on high-end parallel OS development, parallel real-time systems, and real-time performance monitoring.

Information, rights and permissions

Information

Published In

Article first published online: March 21, 2014
Issue published: May 2014

Keywords

  1. Resilience
  2. fault-tolerance
  3. exascale
  4. extreme-scale computing
  5. high-performance computing

Rights and permissions

© The Author(s) 2014.

Authors

Affiliations

Marc Snir
Argonne National Laboratory, IL, USA
Robert W Wisniewski
Intel Corporation, CA, USA
Jacob A Abraham
University of Texas at Austin, TX, USA
Sarita V Adve
University of Illinois at Urbana-Champaign, IL, USA
Saurabh Bagchi
Purdue University, IN, USA
Pavan Balaji
Argonne National Laboratory, IL, USA
Jim Belak
Lawrence Livermore National Laboratory, CA, USA
Pradip Bose
IBM T.J. Watson Research Center, NY, USA
Franck Cappello
Argonne National Laboratory, IL, USA
Bill Carlson
IDA Center for Computing Sciences, MD, USA
Andrew A Chien
The University of Chicago, IL, USA
Paul Coteus
IBM T.J. Watson Research Center, NY, USA
Nathan A DeBardeleben
Los Alamos National Laboratory, NM, USA
Pedro C Diniz
USC Information Sciences Institute, CA, USA
Christian Engelmann
Oak Ridge National Laboratory, TN, USA
Mattan Erez
University of Texas at Austin, TX, USA
Saverio Fazzari
Booz Allen Hamilton, VA, USA
Al Geist
Oak Ridge National Laboratory, TN, USA
Rinku Gupta
Argonne National Laboratory, IL, USA
Fred Johnson
Sriram Krishnamoorthy
Pacific Northwest National Laboratory, WA, USA
Sven Leyffer
Argonne National Laboratory, IL, USA
Dean Liberty
Advanced Micro Devices, MA, USA
Subhasish Mitra
Stanford University, CA, USA
Todd Munson
Argonne National Laboratory, IL, USA
Rob Schreiber
Hewlett Packard, CA, USA
Jon Stearley
Sandia National Laboratories, NM, USA
Eric Van Hensbergen

Notes

Marc Snir, Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Avenue Argonne, IL 60439. Email: [email protected]

This article was published in The International Journal of High Performance Computing Applications.
