Why Your Computer Crashes
By Amir Majidimehr
Ever wonder why your computer “crashes”? What is a crash anyway? To understand that, we first need to step back and understand the architecture of a PC.
When you turn on your machine, the hardware automatically starts one program: we call this the operating system or, among the people who write the code for it, the “kernel.” (Strictly speaking, a small piece of startup code built into the machine runs first and loads it.) As the name implies, the kernel is the core of your machine. It sits between the hardware and the programs that run on top of it. Examples are Windows, macOS, Linux, and iOS (Apple’s mobile operating system).
The kernel’s job is to provide an environment in which application programs can run and, in doing so, to hide the complexity and differences of the hardware below it. For example, when you ask your word processor to open a document, the exact same code in it opens the file whether it is stored on the hard disk or on a flash thumb drive. These are very different hardware devices, yet from the application’s point of view, or in how you browse the files stored on them, they appear identical. This sharply reduces the work for application developers and your own effort in managing your files.
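As a rough illustration of this point, here is a minimal Python sketch. The function name `read_document` and the throwaway file are assumptions made up for the example. The only thing it is meant to show is that the application code contains nothing specific to the storage device, because the kernel takes care of that part.

```python
import os
import tempfile


def read_document(path):
    """Return the text of a document, wherever it happens to be stored."""
    # Nothing here names a hard disk, a thumb drive, or any other device.
    # The operating system works out which hardware holds the file.
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# Demonstration with a throwaway file in the system's temporary folder.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "letter.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("Dear reader")
    text = read_document(sample_path)

print(text)
```

The same function should work unchanged if the path pointed at a file on a thumb drive instead; only the path string would differ.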
So at a high level you have three pieces stacked on top of each other. The hardware is at the bottom. The kernel sits on top of that. And all the applications run above the kernel. There is one piece of hardware and one kernel but many applications. By the way, your desktop, the thing that shows all your files and such, is also an application, albeit one that ships with the operating system and always runs.
The next important concept is that there is no way to write perfect software of any real complexity. The number of possible paths through any sizable program is effectively unlimited, so there is no way to verify that every one of them is correct before the software is released. Further, software may rely on other components in the operating system or elsewhere which may themselves have flaws, or “bugs” as we call them. This is an aptly named problem: anyone who has tried to chase bugs to kill them knows that you can get most of them, but invariably a few get away.
To make you feel even better, modern audio/video electronics have also gotten so complex that many devices such as TVs, Blu-ray players, and cable and satellite set-top boxes run an operating system (usually a variant of Linux). So don’t be surprised if those devices also crash like your computer can!
On top of the software bugs, we also have to deal with hardware that can have faulty software embedded in it (called “firmware”) or a faulty design. Hardware can also flat out break, something that thankfully our software doesn’t do.
A hard disk that fails may stop responding all of a sudden, in which case a program that is trying to save its file to it hangs indefinitely. Or it may corrupt the data it was told to write and keep going as if nothing has happened. This doesn’t happen often but it can. And when it does, figuring out that it occurred can be incredibly tough if not impossible. But again, this is not a common occurrence so don’t lose sleep over it.
Failures, then, can occur up and down the “stack” of hardware, kernel and applications. A failure manifests itself very differently, however, depending on where in the stack it occurs.
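Let’s start with the easy part and look at what happens when the problem is in an application. As an example, assume we have a program that expects a number from 1 to 9 as input and you instead type in a name. The program attempts to use that string of characters as a number and things go bad from there on. One of two situations arises at this point:
- Your program keeps going but does the wrong thing (including hanging, which means chasing its tail forever and not responding to you). It is then your job to realize something has gone wrong and not trust the output of the application. The important thing here is that nothing crashes and the system keeps going.
- The program crashes (is removed from the system) with the operating system putting up a notice. We call this an “exception” or “fault.”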
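The number-versus-name example can be sketched in a few lines of Python. Both function names below are made up for illustration, and the “exception” Python raises here is a language-level error rather than a fault reported by the CPU, so treat this as an analogy for the two outcomes and not as a picture of what the hardware does.

```python
def sloppy_choice(text):
    """Turn the first character into a number without checking it."""
    # Bug: assumes the character is a digit. A letter yields nonsense,
    # yet the program keeps running with the wrong value.
    return ord(text[0]) - ord("0")


def strict_choice(text):
    """Turn the input into a number, refusing anything that is not one."""
    value = int(text)  # raises ValueError for input such as a name
    if not 1 <= value <= 9:
        raise ValueError("expected a number from 1 to 9")
    return value


# Outcome one: no error at all, but the result is meaningless.
wrong_value = sloppy_choice("Alice")

# Outcome two: the bad input is detected and an exception is raised.
try:
    strict_choice("Alice")
    outcome = "accepted"
except ValueError:
    # Left uncaught, this exception would end the program with an error report.
    outcome = "rejected"

print(wrong_value, outcome)
```

The first function is the more dangerous of the two: it hands back a value well outside 1 to 9 and nothing tells you so.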
The engine that does all the work in your computer is the Central Processing Unit or CPU. The CPU runs both application code and that of the operating system. In the case of the errant program above, the CPU happily executes whatever it is told in the form of code in that program. During this operation, however, the CPU is always checking to see if the application is doing something it should not be doing, such as going outside its bounds. Should the application attempt to do so, the CPU halts execution of your application at that precise moment and calls special code in the operating system to complain. That code then verifies what has occurred and pops up the crash message saying the application has done something wrong and is being terminated. See some examples for macOS and Windows to the right.
So let’s review again. Your program is running at full speed at potentially billions of instructions per second. But on every instruction a check is made to make sure it is not, on purpose or accidentally, accessing anything that is not its own. The latter is the key here: when a program has bugs, sooner or later it starts to execute random or incorrect instructions. That code invariably generates requests for data that is outside of its bounds (or requests that are “illegal,” such as attempting to write on top of its own code). The CPU stops on that precise instruction and reports to the kernel that something has gone wrong, resulting in the crash message displayed by the operating system with the program in question named.
Now here is the good news. Application programs are partitioned well enough that they cannot take the computer down with them when they crash (there are some notable exceptions to this but for now, let’s go with this simplification). So in essence, your computer cannot crash because a program has done something wrong. So don’t go reinstalling a program hoping it will fix a system crash. Most likely it will not.
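Both ideas, the fault on an illegal access and the partition between programs, can be observed with a small experiment. The Python sketch below starts a second, deliberately misbehaving program that asks to read memory address 0, then checks what happened to it. The variable names are made up for the example, and how the failure is reported is an assumption that varies by system: Linux and macOS typically report a negative exit code, while Windows reports the error in a different way.

```python
import subprocess
import sys

# Code for the misbehaving program: it asks to read memory at address 0,
# which no ordinary program is allowed to touch.
BAD_PROGRAM = "import ctypes; ctypes.string_at(0)"

# Run it as a separate program (a child process) and wait for it to end.
result = subprocess.run(
    [sys.executable, "-c", BAD_PROGRAM],
    capture_output=True,
)

# A nonzero exit code means the child did not finish normally. On Linux
# and macOS a negative code means the system terminated it with a signal.
child_failed = result.returncode != 0
print("child exit code:", result.returncode)
print("this program is still running")
```

The second program is stopped, yet the program that launched it carries on and the rest of the system is untouched.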
Now let’s take what we just learned and apply it to a situation where the system does actually crash. Even though the kernel is “king” so to speak and has lots of power in your system, it also lets the CPU monitor its behavior just as it does for applications. As with user applications, the kernel has its own boundaries of where its code and data exist and it allows the CPU to warn it if its own code attempts to access what it should not.
Now imagine an errant piece of code in the operating system that gets triggered because you did something unusual. Let’s say you plug a device into a USB port of your computer and that device has a faulty “driver” (a piece of kernel code that interfaces with that piece of hardware). As soon as you plug in the cable, the bug gets triggered. Let’s say that causes an incorrect access to a location outside of the kernel’s own code and data. The CPU dutifully catches that event and reports it to the same piece of code it called when an application crashed.
The behavior is radically different now. The operating system examines the nature of this “fault” and realizes that its own code was the source of the problem. Fearing that continuing to run may lead to more drastic failures such as corrupting user data and, importantly, losing track of what has gone wrong, it deliberately takes itself down, popping up the message that every user in the world hates: the system has crashed. In Windows, this is the Blue Screen of Death, often abbreviated to BSOD. A sample is on the right.
Contrary to the popular belief that it has none, macOS also has a crash message, as seen below.
What happens next is that the kernel attempts to take a snapshot of critical memory data so that it can be analyzed later to potentially find the cause of the crash. I say potentially because while the failure endpoint is known, what got us there may be totally obscured. An operating system bug may corrupt some data that is not used until hours or even days later, and only then does the visible crash occur. The snapshot of the system at the crash point then has little useful information as to why we got there, as so much has happened since.
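The gap between cause and crash can be imitated on a small scale. In the Python sketch below, all names are made up for illustration: one function quietly stores bad data, unrelated work happens, and a different function fails later. The error report captured at the point of failure, a miniature stand-in for a crash snapshot, names the code that tripped over the bad data and not the code that put it there.

```python
import traceback

settings = {"retries": 3}


def buggy_update(cfg):
    # The real bug: text is stored where a number is expected.
    # Nothing fails at this point.
    cfg["retries"] = "three"


def unrelated_work():
    # Stands in for everything that happens between the bug and the crash.
    return sum(range(1000))


def use_setting(cfg):
    # The failure surfaces here, far from the code that caused it.
    return cfg["retries"] + 1


buggy_update(settings)
unrelated_work()

try:
    use_setting(settings)
    report = ""
except TypeError:
    # Capture the error report as it looks at the moment of failure.
    report = traceback.format_exc()

print(report)
```

The report mentions `use_setting` only. Finding `buggy_update` takes the kind of detective work that goes beyond reading the snapshot.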
Operating system companies like Microsoft collect crash data (for both applications and the kernel) and work on resolving the underlying problems based on frequency of occurrence. So be sure to give consent to have the computer upload such information to them after you have restarted your computer. Additional crash “dumps” also help the engineers triangulate the problem better, resulting in higher odds that a solution is found.
Having spent years tracing through crash dumps to find and fix operating system bugs, I can speak firsthand to the difficulty of the detective work required to trace a problem back to its root cause. Some bugs literally took months of intense code review and crash analysis to unravel. So don’t be surprised if there is no quick resolution to your problem from the system provider for these crashes.
As an end user, you can also attempt to troubleshoot what may have caused the system to crash. That goes beyond the scope of this introductory article, but know that there is a bit of self-help available. Suffice it to say, you may be able to find out whether it was indeed a broken device, or the driver for that printer, which caused it.
There is a common myth that your computer crashes because it runs out of memory. That just doesn’t happen! It almost doesn’t matter how much memory your computer has; in normal use you will not exhaust it in a way that brings the system down. No, you read that right. There is essentially no relationship between the two. I can have a computer with two gigabytes of memory and run eight gigabytes’ worth of programs and nothing will crash!
The reason is that the operating system uses the hard disk as an extension of system memory. So as long as you have hard disk space, you can keep running programs. And since the hard disk is much larger than your computer’s memory, you essentially have unlimited ability to use more memory by running as many applications as you like. Now, if you reach the limit of free hard disk space, the operating system will complain, but usually in the form of refusing to run more programs or of existing programs crashing as they fail to get space to store their data. The operating system itself will almost invariably stay operational. You can stop some programs, recover space and keep going. The technical term for this feature is “virtual memory.”
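One visible side of virtual memory is that the memory a program asks for is not the same thing as physical memory chips being filled up. The Python sketch below is an illustration under assumptions: the size and variable names are made up, and it relies on the common behavior in which the operating system supplies real memory only for the parts a program actually touches. It does not show data being moved to the hard disk, which happens out of sight.

```python
import mmap

SIZE = 256 * 1024 * 1024  # ask for 256 MiB of address space

# An anonymous mapping: memory that is not tied to any named file.
region = mmap.mmap(-1, SIZE)

# Touch only the first and last byte. The space in between has been
# promised to the program but has not been used for anything yet.
region[0] = 1
region[SIZE - 1] = 2

first, last = region[0], region[SIZE - 1]
region_length = len(region)
region.close()

print(region_length, first, last)
```

The program sees one continuous block of the size it asked for, while the operating system decides behind the scenes where that data really lives.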
Likewise, running out of disk space should just result in error messages and not an outright system crash. So don’t go adding memory or disk space to your computer to stop system crashes. It will not help (although it sometimes changes the system’s behavior enough to make it act differently).
So there. You may not know how your operating system runs things, but now you know a bit about what makes it not do that!