Common causes of a system crash are device driver and hardware problems. When a system crashes, it will either display a diagnostic screen (the Blue Screen of Death, or BSOD) or restart, depending on your system's configuration. A BSOD may display helpful information, such as the name of the device driver that crashed. If it does, update that driver, or uninstall the related hardware or driver if possible. Otherwise, you can use the diagnostic information to search the Internet for possible solutions to your problem. If your system is configured to restart on a system failure (System->Advanced->Startup and recovery, Settings->System failure=automatically restart), then investigating a kernel memory dump (e.g. \Windows\Minidump\*.dmp) can help identify the faulty component.
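Before analysing a dump, it helps to know which dump files exist and which one is the most recent. The following is a minimal sketch in Python that lists the newest .dmp files in a folder. The function name `recent_dumps` and the default limit are illustrative assumptions, and the folder you pass in depends on where your Windows installation writes its dumps.

```python
from pathlib import Path


def recent_dumps(dump_dir, limit=5):
    """Return up to `limit` .dmp files found in dump_dir, newest first."""
    folder = Path(dump_dir)
    if not folder.is_dir():
        # No dump folder usually means no dump has been written yet.
        return []
    files = [p for p in folder.glob("*.dmp") if p.is_file()]
    # Sort by modification time so the latest crash comes first.
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


# Example call (the path is an assumption; adjust it for your system):
# for dump in recent_dumps(r"C:\Windows\Minidump"):
#     print(dump.name, dump.stat().st_size, "bytes")
```

The script only locates the files. Reading their contents still requires a debugger or a dedicated dump viewer.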
Excessive heat can cause a system to fail over time, so monitoring the temperature while stress testing the system can help identify this. The solution could be fixing or cleaning the fans, reapplying the CPU thermal grease, or adding more cooling.
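If your monitoring tool can export its readings, a short script can summarise a stress-test run. The sketch below assumes a CSV export with two columns named `seconds` and `celsius`; those column names, the function name and the 90 °C limit are assumptions for illustration only. Check your CPU's documentation for its real maximum temperature.

```python
import csv
import io


def summarize_temperatures(csv_text, limit_c=90.0):
    """Report the peak temperature and when the limit was first reached."""
    reader = csv.DictReader(io.StringIO(csv_text))
    samples = [(float(r["seconds"]), float(r["celsius"])) for r in reader]
    if not samples:
        return None
    # Hottest sample of the whole run.
    peak_time, peak = max(samples, key=lambda s: s[1])
    # Times of all samples at or above the (assumed) limit.
    over = [t for t, c in samples if c >= limit_c]
    return {
        "peak_c": peak,
        "peak_at_s": peak_time,
        "first_over_limit_s": over[0] if over else None,
    }
```

A run whose temperature keeps climbing until the crash points towards a cooling problem. A crash at moderate temperatures suggests looking elsewhere.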
Problems can occur if your computer runs out of system resources because some process or driver doesn't release memory, handles, semaphores, etc. back to the operating system. After a long period of uptime, Windows runs out of resources and dies a terrible death. What can you do about this? Identify the offending software, if you can, and disable it. The leak can, however, also be caused by a bug in the operating system itself.
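One way to identify the offending software is to note each process's handle count or memory use at regular intervals (for example from Task Manager) and look for a process whose numbers only ever go up. The sketch below illustrates that idea. The function name and both thresholds are arbitrary assumptions, not established values.

```python
def leak_suspects(history, min_growth=1.5, min_rising_fraction=0.8):
    """Return names of processes whose resource use keeps growing.

    history maps a process name to a list of counts sampled over time
    (handles, bytes of memory, etc.), oldest sample first.
    """
    suspects = []
    for name, counts in history.items():
        if len(counts) < 3 or counts[0] <= 0:
            continue  # too little data to judge
        steps = list(zip(counts, counts[1:]))
        rising = sum(1 for before, after in steps if after > before)
        grew_overall = counts[-1] / counts[0] >= min_growth
        mostly_rising = rising / len(steps) >= min_rising_fraction
        if grew_overall and mostly_rising:
            suspects.append(name)
    return sorted(suspects)
```

A process that rises and falls with its workload is normal. Steady growth that never comes back down, even when the program is idle, is the pattern to look for.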
A computer can also crash at random. What do we mean by this? Many things can bring down a computer. Typical examples are a spike on the power line or a strong burst of electromagnetic interference (e.g. from mobile phones, electric motors, etc.). If your system is running at its limits due to overclocking, or your components are running at the top of their temperature range, small external influences can push your system over the edge, resulting in a terrible death. If you believe in Chaos theory (and most scientists now do), then you also have to believe that computers will just crash unexpectedly from time to time. How often depends on the design tolerances built into your hardware. What can you do about this?
- Do as the military do. Buy military specification computer hardware that has higher tolerances.
- Do what NASA does. Run 3 computers at the same time, expecting one to give the wrong answer or crash.
- Do what most big banks do. Run a hot standby system that can take over the job of the main computer in a few seconds.
- Do what the telecommunications industry does. Buy equipment with N+1 redundancy and switch traffic off the faulty hardware. Almost all telecommunications hardware also has a built-in auto-reboot function. Why? Because they know it will eventually fail.
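The "run 3 computers" approach in the list above relies on majority voting: the answer that most of the replicas agree on wins, and a crashed or faulty replica is outvoted. The following is a minimal sketch of that voting step only, not a description of any real flight system. The function name is made up for illustration, and `None` is assumed to stand for a replica that crashed without answering.

```python
from collections import Counter


def majority_vote(results):
    """Return the answer given by a strict majority of replicas.

    results holds one entry per replica; None marks a replica that
    crashed. Raises RuntimeError when no answer has a majority.
    """
    answered = [r for r in results if r is not None]
    if answered:
        value, count = Counter(answered).most_common(1)[0]
        # The majority is measured against all replicas, crashed ones included.
        if count > len(results) / 2:
            return value
    raise RuntimeError("no majority among replicas")
```

With three replicas, the system tolerates one wrong answer or one crash. If two replicas fail at the same time, the vote fails and the error has to be handled some other way.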
Timing issues. Some software and hardware bugs only show up on very rare occasions. Classic examples are hardware or software interrupts occurring in a critical section of code. What can you do about these types of bugs? Almost nothing as a user. They have plagued software since the first line of code was written. They are very difficult to find and are almost never picked up during software testing. Problems can occur in drivers, the operating system, your hardware, everywhere. As everyone is always on a tight deadline, endurance testing often doesn't make it into a software developer's test plan.
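To show what a critical section is, the sketch below uses Python threads as a stand-in for interrupts. Several threads update a shared counter, and the read-modify-write step is protected by a lock. The class and function names are illustrative assumptions. Without the lock, two threads could read the same old value and one update would be lost, but only on the rare occasions when the timing lines up, which is why such bugs slip through testing.

```python
import threading


class SharedCounter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Critical section: read, modify and write must not be interrupted
        # by another thread doing the same thing.
        with self._lock:
            current = self.value
            self.value = current + 1


def run_workers(counter, workers=4, increments=10_000):
    """Increment the counter from several threads and return the total."""
    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

With the lock in place, the final total always equals workers multiplied by increments. Drivers and operating system code face the same problem with real interrupts and use their own locking mechanisms to guard against it.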
Mundane program bugs are, of course, also a major cause of failure.