Plan on having about ten percent of the cluster failed or failing at any given time. If you need a machine with 10 nodes operational, you had best plan on having 12 nodes and some spare parts. The larger the cluster, the more failed hardware you can expect; really large clusters have hardware failures on a more or less continuous basis. Alternatively, you can just build a lot of extra nodes and take bad nodes offline as the cluster “burns in” (this seems expensive and wasteful to me).

Run the cluster on a good UPS. This is not optional. You need clean power to get good hardware life, and with this many computers the investment in a UPS will pay for itself in longer hardware life.
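
The ten percent rule of thumb above can be turned into a quick sizing calculation. The following is only a sketch: the function name nodes_to_provision and its parameters are made up for illustration, and it assumes the failed fraction stays roughly constant as the cluster grows, which is a simplification. It uses integer arithmetic to round up so that the result never falls short of the required node count.

```python
def nodes_to_provision(required_nodes: int, failed_percent: int = 10) -> int:
    """Return how many nodes to buy so that `required_nodes` are expected
    to be working when `failed_percent` of the cluster is down.

    Assumption (not from the text): the failed share is a fixed
    percentage of the total, regardless of cluster size.
    """
    if required_nodes < 0:
        raise ValueError("required_nodes must be non-negative")
    if not 0 <= failed_percent < 100:
        raise ValueError("failed_percent must be in the range 0..99")

    healthy_percent = 100 - failed_percent
    # Ceiling division with integers: total * healthy_percent / 100
    # must be at least required_nodes.
    return -(-required_nodes * 100 // healthy_percent)


if __name__ == "__main__":
    for needed in (10, 32, 100):
        print(needed, "working nodes ->", nodes_to_provision(needed), "installed")
```

For 10 working nodes at ten percent failed, this gives 12 installed nodes, which matches the figure above. Spare parts are a separate budget item and are not counted here.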


Consumer-grade electronics are designed for an operational life of two years.
