There are basically two ways in which one can make use of a modern CPU with multiple cores for computationally intensive work.
- Using multiple threads within one program (multithreading).
- Using multiple (single-threaded) programs that communicate (multiprocessing).
In the first case, all data is implicitly shared. In the second case, data must be explicitly shared or communicated.
The first option is often said to be more convenient. I would like to make the case that implicit sharing usually makes the task more difficult, because all of the shared data then has to be managed.
Please note that I’m a mechanical engineer who enjoys programming as part of my engineering toolbox. Since I don’t have a computer science background, I might not describe things in the way they are usually thought of in computer science. Since I am mostly familiar with UNIX-like operating systems and I do most of my programming in Python these days, I might label some concepts in UNIX or Python terms. For that, I ask the reader’s indulgence.
Functional programming languages can be said to side-step shared data access because they avoid mutable data and changing state. Since I don’t have much experience with those, I cannot really comment on them.
But I strongly suspect that TANSTAAFL applies. If functional languages make multithreading easier, there will be something else that they make harder.
All programs basically run on a machine that processes instructions. This can be the actual CPU, or a kind of virtual machine that processes higher-level bytecode.
In both contexts there is the fundamental concept of atomicity. An action is said to be atomic if it can be carried out in a single, uninterruptible instruction. This is important because in a multithreaded program only data access that is atomic does not need to be protected by a lock of some kind on the data.
It is worth noting that generally both writing and reading data need to be protected with locks.
Reading needs to be protected because otherwise another thread might modify the data during the read, leading to unpredictable results. Say for example we have an array of integer values. One thread is calculating the sum of the members of this array. While that is happening, another thread is incrementing all members of the array by one. So what is the result of the summation? If the summation runs ahead of the increment, the sum is that of the original array. The other way around, the sum is that of the incremented array. But if reading and incrementing are interleaved, the result could be anything in between! And the really weird thing is that the first two of these outcomes could both be said to be correct!
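The summation example can be imitated without any real threads. The following is only a sketch: the function name `sum_with_interruption` and its parameter `reads_before_increment` are made up for illustration. It pretends that the incrementing thread runs to completion at one chosen moment during the summation, which makes the otherwise timing-dependent outcome reproducible.

```python
def sum_with_interruption(values, reads_before_increment):
    """Sum `values`, simulating a second thread that increments every
    element after the reader has consumed `reads_before_increment` items.

    This is a single-threaded imitation of a race, not real threading.
    """
    data = list(values)  # work on a copy so the caller's list is untouched
    total = 0
    for i in range(len(data)):
        if i == reads_before_increment:
            # The simulated "other thread" gets scheduled here and
            # increments the whole array in one go.
            for j in range(len(data)):
                data[j] += 1
        total += data[i]
    return total


values = [10, 20, 30, 40]
# The reader finishes first: sum of the original array.
print(sum_with_interruption(values, len(values)))  # 100
# The incrementer finishes first: sum of the incremented array.
print(sum_with_interruption(values, 0))  # 104
# Interleaved: something in between.
print(sum_with_interruption(values, 2))  # 102
```

With real threads the moment of interruption is chosen by the scheduler, so any of these values could come out of the same program on different runs.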
Writing needs to be protected because two threads modifying the same data also leads to unpredictable results. Say in the array example above that one thread is incrementing the members of the array while another is multiplying them by a factor. Here again we have an uncertain outcome: each array member might be incremented and then multiplied, or multiplied and then incremented.
From these examples it is clear that the results are highly sensitive to the timing of different threads. This is called a race condition.
The examples also show that not all such issues cause the program to crash. In fact it would be better if such issues did cause a crash because then it would be clear that there is a problem. As is, multiple runs of the program could generate different results from the same input data. That is unacceptable.
Since in a multithreading environment everything that is visible in a global context is shared, all accesses to that data need to be protected unless the programmer knows with certainty that only one thread will access it.
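As a minimal sketch of such protection in Python, the summing and incrementing threads from the earlier example can both take the same `threading.Lock` before touching the array. The function name `run_once` is hypothetical. The lock does not decide which thread goes first, but it does guarantee that the sum is either that of the original array or that of the incremented array, never something in between.

```python
import threading


def run_once(values):
    """Run a summing thread and an incrementing thread on shared data,
    with every access to the data protected by one lock."""
    data = list(values)
    lock = threading.Lock()
    result = []

    def summer():
        with lock:  # nobody can modify `data` while we are reading it
            result.append(sum(data))

    def incrementer():
        with lock:  # nobody can read `data` while we are half-way done
            for i in range(len(data)):
                data[i] += 1

    threads = [threading.Thread(target=summer),
               threading.Thread(target=incrementer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0], data


total, final = run_once(range(1000))
# `total` is 499500 or 500500, depending on which thread got the lock first.
print(total, final[:3])
```

Note that the result can still differ between runs; the lock only removes the "in between" outcomes. If a specific order is required, that has to be arranged separately.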
And while protection mechanisms exist, they can cause their own share of problems, such as deadlock and resource starvation.
In multiprocessing, we typically have one process (the “parent”) launch several “children” to do some work.
Communication between these processes can happen in several ways.
- Child processes can be given arguments when they start and they return a numerical result code on completion. If processes are started with fork, they also inherit the data of the parent.
- Processes can talk to each other over pipes or sockets.
- Programs can save data in files for other programs to access.
- Processes can explicitly share memory, e.g. via mmap.
Of course the first way is a one-time communication.
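A minimal sketch of this first way in Python, using the standard `subprocess` module. The child program is passed inline with `-c` only to keep the example self-contained, and the name `sum_is_even` is made up for illustration. The parent hands the child its input as command-line arguments and gets back nothing but the numerical exit code.

```python
import subprocess
import sys

# Source of the child program. It receives numbers as arguments and
# reports via its exit code: 0 if their sum is even, 1 if it is odd.
CHILD = """
import sys
numbers = [int(a) for a in sys.argv[1:]]
sys.exit(0 if sum(numbers) % 2 == 0 else 1)
"""


def sum_is_even(numbers):
    """Ask a child process whether the sum of `numbers` is even."""
    args = [sys.executable, "-c", CHILD] + [str(n) for n in numbers]
    proc = subprocess.run(args)
    return proc.returncode == 0


print(sum_is_even([1, 2, 3]))  # True
print(sum_is_even([1, 2, 4]))  # False
```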
The second way is interactive, and messages generally arrive in the order they are sent. Sending messages provides automatic serialization; you generally cannot read a message before it was sent. One only has to take into account that it is sometimes possible to read messages partially. The main downside is that this approach is not really suitable for large amounts of data.
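The second way could look like the sketch below, which assumes a simple line-based protocol invented for this example: the parent writes one line of numbers to the child's standard input and reads one line with the sum from its standard output. Terminating every message with a newline and reading with `readline` is one way to deal with partial reads. The name `sum_batches` is hypothetical.

```python
import subprocess
import sys

# Child program: for every line of whitespace-separated integers on
# stdin, answer with one line containing their sum.
CHILD = """
import sys
for line in sys.stdin:
    numbers = [int(x) for x in line.split()]
    print(sum(numbers), flush=True)
"""


def sum_batches(batches):
    """Send each batch to a child process over a pipe, collect the sums."""
    results = []
    with subprocess.Popen([sys.executable, "-c", CHILD],
                          stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                          text=True) as proc:
        for batch in batches:
            proc.stdin.write(" ".join(str(n) for n in batch) + "\n")
            proc.stdin.flush()  # make sure the message actually leaves
            # readline() blocks until the child has sent a complete line,
            # so the reply cannot be read before it was written.
            results.append(int(proc.stdout.readline()))
    # Leaving the with-block closes the pipes and waits for the child.
    return results


print(sum_batches([[1, 2, 3], [10, 20], [7]]))  # [6, 30, 7]
```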
For large amounts of data, the third way works well, especially on an SSD. I would use this in combination with the first or second way, as in sending the child process a message such as “process the data from <filename>”. If the sender takes care to write and close the file before sending the message to the receiving process, access to the data is automatically serialized.
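The combination of a file and a start-up argument might be sketched as follows. The JSON format, the temporary file, and the name `sum_via_file` are choices made for this example only. The important detail is the order: the parent writes and closes the file first, and only then starts the child that reads it.

```python
import json
import os
import subprocess
import sys
import tempfile

# Child program: read a JSON list from the file named in its first
# argument and print the sum of its elements.
CHILD = """
import json, sys
with open(sys.argv[1]) as f:
    values = json.load(f)
print(sum(values))
"""


def sum_via_file(values):
    """Hand `values` to a child process through a file on disk."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(values, f)
        # The file is closed at this point, so the child cannot see a
        # half-written file: it is only started afterwards.
        proc = subprocess.run([sys.executable, "-c", CHILD, path],
                              capture_output=True, text=True, check=True)
        return int(proc.stdout)
    finally:
        os.remove(path)


print(sum_via_file(list(range(100000))))  # 4999950000
```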
Shared memory is also a good way to share large amounts of data, but here we have the same problem as with multithreading: serializing data access. Some environments (like Python) have serialization for some kinds of shared memory built in.
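As a rough sketch of explicitly shared memory, the example below lets a parent and a child map the same file with Python's `mmap` module. It assumes a UNIX-like system, and the layout (three 64-bit integers) and the name `add_via_shared_mapping` are invented for illustration. Access is serialized in the crudest possible way: the parent does not look at the result until the child has exited. Two processes working on the mapping at the same time would need a real locking scheme.

```python
import mmap
import os
import struct
import subprocess
import sys
import tempfile

# Child program: map the file named in its first argument, read two
# 64-bit integers from it and store their sum in the third slot.
CHILD = """
import mmap, struct, sys
with open(sys.argv[1], "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:
        a, b = struct.unpack_from("<qq", m, 0)
        struct.pack_into("<q", m, 16, a + b)
"""


def add_via_shared_mapping(a, b):
    """Let a child process add two numbers through a shared mapping."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as f:
            # Layout: a, b, result; each a little-endian 64-bit integer.
            f.write(struct.pack("<qqq", a, b, 0))
            f.flush()
            with mmap.mmap(f.fileno(), 0) as m:
                # run() returns only after the child has exited, so the
                # child's write is complete before we read the result.
                subprocess.run([sys.executable, "-c", CHILD, path],
                               check=True)
                return struct.unpack_from("<q", m, 16)[0]
    finally:
        os.remove(path)


print(add_via_shared_mapping(2, 40))  # 42
```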
Because data is implicitly shared in multithreading, the programmer has the burden of explicitly protecting access.
In multiprocessing, data generally has to be made available explicitly. This can often provide automatic access serialization. Process isolation implicitly provides protection for the in-memory data of a process.
For these reasons, I generally prefer multiprocessing over multithreading. For me the advantage of easy access to data is overshadowed by the overhead of serializing data access.