Software Reliability and Randomness are slippery concepts that may be conceptually easy to understand, but hard to pin down. As programmers, we can write the equivalent of ‘hello world’ in dozens of languages on hundreds of platforms and once the program is functioning – it is reliable. It will produce the same results every time it is executed. Yet systems built from thousands of modules and millions of lines of code function less consistently than our hello world programs – and are functionally less reliable.
As programmers we often look for a source of randomness in our programs, and it is hard to find. Fundamentally we see computers as deterministic systems without any inherent entropy (for our purposes – randomness). For lack of true random numbers we generate Pseudo Random Numbers (PRNs), which are not really random. They are used in generating simulations, and in generating session keys for secure connections, and this lack of true randomness in computer generated PRNs has been the source of numerous security vulnerabilities.
In this post I am going to discuss how software can be “unreliable”, deterministic behavior, parallel systems / programming, how modern computer programs / systems can be non-deterministic (random), and how that is connected to software reliability.
The topics of software reliability, deterministic behavior, and randomness in computers is a field that is massively deep and complex. The discussions in this blog are high level, lightweight, and I make some broad generalizations and assertions that are mostly correct (if you don’t look to closely) – but hopefully still serve to illustrate the discussion.
I also apologize in advance for this incredibly dry and abstract post.
Hardware reliability, more precisely “failure” is most often occurs when some device in a system breaks (the smoke comes out), and the system no longer functions as expected. Software failures do not involve broken hardware or devices. Software failures are based on the concept that there are a semi-infinite number of paths (or states) through a complex software package, and the vast majority will result in the software acting and functioning as expected. However there are some paths through the code that will result in the software not functioning as expected. When this happnes, the software and system are doing exactly what the code is telling it to do – so from that perspective, there is no failure. However from the concept of a software failure, the software is not doing what is expected – which we interpret as a software failure, which provides a path to understand the concept of software reliability.
Deterministic operation in software means that a given program with a given set if inputs will function in exactly the same manner every time it is executed – without any unexpected behaviors. For the most part this characteristic is what allows us to effectively write software. If we carry this further, and look at software on simple (8 / 16 bit) microprocessors / microcontrollers, where the software we write runs exclusively on the device, operation is very deterministic.
In contrast – on a modern system, our software exists in a relatively high level on top of APIs (application programming interfaces), libraries, services, and a core operating system – and in most cases this is a multitasking/multi-threaded/multi-cored environment. In the world of old school 8 / 16 bit microprocessors / microcontrollers, none of these layers exist. When we program for that environment, our program is compiled down to machine code that runs exclusively on that device.
In this context, our program not only operates deterministically in how the software functions, but the timing and interactions external to the microprocessor is deterministic. In the context of modern complex computing systems, this is generally not the case. In any case, the very deterministic operation of software on dedicated microprocessor makes it ideal for real world interactions and embedded controllers. This is why this model is used for toasters, coffee pots, microwave ovens and other appliances. The system is closed – meaning its inputs are limited to known and well defined sources, and its functions are fixed and static, and generally these systems are incredibly reliable. After all how often it is necessary to update the firmware on an appliance?
If this war our model the world of software and software reliability, we would be ignoring much of what has happened in the world of computing over the last decade or two. More importantly – we need to understand that this model is an endpoint, not the whole story, and to understand where we are today we need to look further.
One of the most pervasive trends in computing over the last decade (or so) is the transition from increasingly faster single threaded systems to increasingly parallel systems. This parallelism is accomplished through multiple computing cores on a single device and through multiple processing threads on a single core, which are both mechanisms to increase the ability of the processor to produce more work by being able to support concurrently running programs. A typical laptop today can have two to four cores and support two hardware threads per core, resulting in 8 relatively independent processes running at the same time. Servers with 16 to 64 cores would have qualified as supercomputers (small ones) a decade ago are now available off the shelf.
Parallel Programming: the Masochistic Way
Now – back in the early 80s as an intern at Cray, my supervisor spent one afternoon trying to teach me about how Cray computers (at that time) were parallel coded. As one of the first parallel processing systems, and as systems where every cycle was expensive – much of the software was parallel programmed in assembly code. The process is exactly how would imagine. There was a hardware scheduler that would transfer data to/from each processor to main memory every so many cycles. In between these transfers the processors would execute code. So if the system had four processors, you would write assembly code for each processor to execute some set of functions that were time synchronized ever so many machine cycles, with NOPs (no operation) occasionally used to pad the time. NOPs were considered bad practice since cycles were precious and not to be wasted on a NOP. At the time, it was more than I wanted to take on, and I was shuffled back to hardware troubleshooting.
Over time I internalized this event, and learned something about scalability. It was easy to imagine somebody getting very good at doing two (maybe even 3 or 4) dissimilar time synchronous parallel programs. Additionally, since many programs also rely on very similar parallel functions, it was also easy to imagine somebody getting good at writing programs that did the same thing across a large number of parallel processors. However, it is much harder to imagine somebody getting very good at writing dissimilar time synchronous parallel programs effectively over a large number of parallel processors. This is in addition to the lack of scalability inherent in assembly language.
Parallel Programming – High Level Languages
Of course in the 80s or even the 90s, most computer programmers did not need to be concerned with parallel programming, and every Operating System was single threaded, and the argument of the day was Cooperative multitasking versus Preemptive multitasking. Much like the RISC vs CISC argument from the prior decade, these issues were rendered irrelevant by the pace of processor hardware improvements. Now many of us walk around with the equivalent that Cray supercomputer in our pockets.
In any case the issue of parallel programming was resolved in two parts. The first being the idea of a multi-tasking operating systems with a scheduler – the core function that controls what programs are running (and how long they run) in parallel at any one time. The second being the development of multi-threaded programming in higher level languages (without the time synchronization of early Crays).
Finally getting back to my original point… The result today is that all modern operating systems have some privileged block of code – the kernel running continuously, but have a number of other services that run the OS, including the memory manager and the task scheduler.
The key to this whole story is that these privileged processes manage access to shared resources on the computer. Of these two, the task manager is the most interesting – mostly due the arcane system attributes it uses to determine which processes have access to which core / thread on the processor. This is one of the most complex aspects of a multitasking / multi-core / multithreaded (hardware) system. The attributes the scheduler looks at include affinity flags that processes use to indicate core preference, priority flags, resource conflicts and hardware interrupts.
The net result is that if we take any set of processes on a highly parallel system there are some characteristics of this set that are sufficiently complex and impacted by unknown external elements that they are random – truly random. For example if we create three separate processes that generate a pseudo random number set based on some seed (using unique values in each), and point all of them to some shared memory resource- where the value is read as input and the output is written back. Since the operation of the task scheduler means that the order of execution of these three threads is completely arbitrary, it is not possible to determine what the sequence is deterministically – the result would be something more random than a PRNG. A not so subtle (and critical) assumption is that the system has other tasks and processes it is managing, which directly impact the scheduler, introducing entropy to the system.
Before we go on, lets take a closer look at this. Note that if some piece of software functions the same (internally and externally) every time it executes, it is deterministic. If this same piece of software functions differently based on external factors that are unrelated to this software, that is non-deterministic. Since kernel level resource managers (memory, scheduler, etc) function in response to system factors and factors from each and every running process – that means that from the perspective of any one software package, certain environmental factors are non-deterministic (i.e. random). In addition to the scheduling and sequencing aspects identified above, memory allocations will also be granted or moved in a similar way.
Of course this system level random behavior is only half the story. As software packages are built to take advantage of gigabytes of RAM, and lots of parallel execution power, they are becoming a functional aggregation of dozens (to hundreds) of independently functioning threads or processes, which introduce a new level of sequencing and interdependancies which are dependent on the task manager.
Bottom Line – Any sufficiently complex asynchronous and parallel system will have certain non-deterministic characteristics based on the number of independent sources that will influence access / use of system shared resources. Layer the complexity of parallel high level programming, and certain aspects of program operation are very non-deterministic.
Back to Software Reliability
Yes we have shown that both multitasked parallel hardware and parallel programmed software contribute to some non-deterministic behavior in operation, but we also know that for the most part software is relatively reliable. Some software is better and some is worse, but there clearly is some other set of factors in play.
The simple and not very useful answer is “better coding” or “code quality”. A slightly more insightful answer would tell you that code that depends on or uses some non-deterministic feature of the system is probably going to be less reliable. An obvious example is timing loops. Back in the days of single threaded programs and single threaded platforms, programmers would introduce relatively stable timing delays with empty timing loops. This practice was easy, popular and produced fairly consistent timing – showing deterministic behavior. As systems hardware and software have evolved, the assumptions these coding practices rely on become less and less valid. Try writing a timing loop program on a modern platform and the results can be workable much of the time, but it can also vary by orders of magnitude – in a very non-deterministic manner. There are dozens of programming practices like this that use to work just fine, but no longer do – but they don’t completely break, just operate a little bit randomly. In many cases, the behavior is close enough to “correct” that the program appears to function, but not very reliably.
Another coding practice that used to work on single threaded systems was to call some function and expect the result would be available on the next line of code. It worked on single threaded systems because execution was handed off to that function, and did not return until it was complete. Fast forward to today, and if this is written as a parallel program – the expected data may not be there when your code thinks is should be. There is a lesson here – high level parallel programming languages make writing parallel code fairly easy, but that does not mean that writing robust parallel programs is easy. Parallel inter-dependencies issues can be just as ugly as parallel assembly code on a Cray system.
A single piece of code running exclusively on a dedicated processor is very deterministically, but parallel programmed software on a multitasking parallel hardware system can be very non-deterministic, and difficult to test. Much of software reliability is based on how little a given software package depends on these non-deterministic features. Managing software reliability and failure mechanisms requires that programmers understand the system beyond the confines of the program.