Async vs. Threads

Concurrency models compared

There are many concurrency models available, each useful in different situations. The choice can be overwhelming if you are not familiar with them, especially since various combinations and hybrids are also possible. In this article I will attempt to describe how each model works and what the advantages and disadvantages of each are, along with some guidance on when to use which.

Synchronization, race conditions, and deadlock

I may use some terms you are unfamiliar with; I will explain them here. If you already know this stuff, feel free to skip ahead to the first concurrency model.

Race conditions

A race condition happens when code behaves differently depending on when it is scheduled, when it is interrupted, or other factors outside your control. Race conditions are very hard to debug because they are probabilistic and won't happen every time: your code might run fine 99% of the time but fail the other 1%, or work fine on your computer but fail 70% of the time on a customer's laptop. Race conditions are easier to prevent than to fix.

A classic example of a race condition, written here as runnable Python:

        import threading

        money = 5

        def spend():
            global money
            while True:
                # Check-then-act: another thread can change money between
                # this check and the subtraction, so money can go negative.
                if money >= 10:
                    print("Spent 10 money on buying another goldfish")
                    money -= 10
                    print("now I have $", money)

        def earn():
            global money
            while True:
                money += 1

        threading.Thread(target=earn).start()
        for _ in range(10):
            threading.Thread(target=spend).start()

Despite checking that you have enough money to replace your dead goldfish, this code will inevitably go into the negatives eventually. Why? Suppose money goes over 10 for the first time, and the thread is then interrupted. A second thread checks money >= 10 too, and also succeeds. Then both threads subtract 10 from money, and it becomes -10. The bank will not be pleased.

Race conditions can happen in any code, even without any form of concurrency, as long as some resource is shared; even files on the hard drive can cause race conditions. However, they are much more likely with some concurrency models than with others.

Synchronization Primitives

A synchronization primitive is any structure that helps you control the order in which your code runs and thereby prevent race conditions.

The simplest primitive is the lock. Only one thread can hold a lock at a time; if a thread wants a lock that another thread currently holds, it must wait until the other thread is done with it. By making sure you only access a data structure while holding its lock, you can prevent multiple threads from accessing the data at the same time.

For example, applying a lock to the example above would lead to this:

        import threading

        money = 5
        money_lock = threading.Lock()

        def spend():
            global money
            while True:
                money_lock.acquire()
                # The check and the update both happen while holding the
                # lock, so no other thread can change money in between.
                if money >= 10:
                    print("Spent 10 money on buying another goldfish")
                    money -= 10
                    print("now I have $", money)
                money_lock.release()

        def earn():
            global money
            while True:
                money_lock.acquire()
                money += 1
                money_lock.release()

        threading.Thread(target=earn).start()
        for _ in range(10):
            threading.Thread(target=spend).start()

Since all modifications of money require holding money_lock, we can be sure it doesn't change between checking money >= 10 and reducing it by 10.

Other primitives exist too, like RWLocks, which allow multiple simultaneous readers but only one writer at a time.

Synchronization is notorious for being tricky to do correctly.

Deadlock

A deadlock happens when no process can make progress because each is waiting for another process, which is in turn waiting for yet another, and so forth. Deadlocks typically happen when synchronization primitives are misused.

A deadlock is itself a type of race condition, so its absence must be verified by reading the code rather than by testing.

Consider this code:

        import threading

        money_lock = threading.Lock()
        stuff_lock = threading.Lock()
        money = 5
        stuff = 0

        def buy_stuff():
            global money, stuff
            money_lock.acquire()
            if money >= 10:
                money -= 10
                stuff_lock.acquire()   # may block while sell_stuff holds stuff_lock
                stuff += 1
                stuff_lock.release()
            money_lock.release()

        def sell_stuff():
            global money, stuff
            stuff_lock.acquire()
            if stuff >= 1:
                stuff -= 1
                money_lock.acquire()   # may block while buy_stuff holds money_lock
                money += 10
                money_lock.release()
            stuff_lock.release()

        threading.Thread(target=buy_stuff).start()
        threading.Thread(target=sell_stuff).start()

Can you see how this will deadlock? buy_stuff can acquire the money lock while, at the same time, sell_stuff acquires the stuff lock. Then buy_stuff will attempt to acquire the stuff lock, but has to wait for sell_stuff to release it. Meanwhile, sell_stuff waits for buy_stuff to release the money lock before it can continue. Neither can do anything.

A common solution to this problem is to have a fixed order of locks and only acquire them in that order.
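
For the example above, that would mean both functions always take money_lock before stuff_lock. A minimal sketch of sell_stuff rewritten to follow that order:

        def sell_stuff():
            global money, stuff
            # Locks are always taken in the same fixed order
            # (money_lock, then stuff_lock), so the circular wait
            # from the example above can no longer occur.
            money_lock.acquire()
            stuff_lock.acquire()
            if stuff >= 1:
                stuff -= 1
                money += 10
            stuff_lock.release()
            money_lock.release()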

Note that lock based deadlocks are just one type of deadlock, many others exist too.

Multiprocessing

Multiprocessing offers the highest level of isolation of all the concurrency models. Code runs in separate processes, which share no memory and can only communicate via pipes, over the network, or via the disk. It is best used when you have many long-running tasks that require minimal communication.

The high level of isolation gives processes a high degree of freedom in what they can do. A Python program can spawn a C executable as a process to do some computation for it, or ask a remote computer to perform the computation. A process can spawn subprocesses and then exit, and the subprocesses can finish the computation and store the results independently.

Communication is quite tricky, though. All data sent between processes must be serialized to bytes, which is computationally expensive. Some types of objects, like open files or sockets, cannot always be shared, and what exactly can be shared depends on the operating system. On Linux you can efficiently hand data to a process at its start using fork, but after that any new data must be serialized.
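
As a concrete illustration, here is a minimal sketch using Python's multiprocessing module; the sum-of-squares worker is just a stand-in for a real computation. Arguments and results are pickled behind the scenes when they cross the process boundary:

        from multiprocessing import Process, Queue

        def worker(result_queue):
            # The result is serialized (pickled) when it travels back
            # through the queue's underlying pipe.
            result_queue.put(sum(i * i for i in range(10_000_000)))

        if __name__ == "__main__":
            q = Queue()
            p = Process(target=worker, args=(q,))
            p.start()
            print("result:", q.get())
            p.join()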

Advantages

  • No synchronization needed, since there is no shared memory
  • If one process crashes the rest can keep going
  • Processes can easily be built in completely different languages or use different interpreters, or you can use existing third party programs as processes to do your computation.
  • Processes can sometimes even be run on a different computer entirely; this is known as distributed computing
  • Bypasses the GIL or similar synchronization structures that could slow down your program.
  • Suitable for both CPU and IO bound tasks

Disadvantages

  • There is quite some overhead to spawning a new process, so you want to spawn as few as possible
  • Communication is difficult: it is slow and objects need to be serialized
  • Each process needs its own copy of the code and objects, leading to higher memory use. Your OS may be able to deduplicate some of this memory (for example via copy-on-write), but a lot of duplication will typically remain.
  • You need to write a good amount of safety code to deal with processes crashing, invalid messages, etc.

Threading (no GIL)

Threading is similar to multiprocessing, except that the threads share some or all of their memory. This makes communication much easier: threads can pick up tasks directly and immediately place results in the relevant data structure.

Threads are scheduled by the operating system and are outside your control: a thread may be paused at any point in its execution. This can cause race conditions, where the result of the program changes depending on the exact order of operations, a source of bugs that are tricky to debug.

Most languages offer synchronization primitives like locks and semaphores to help you control what can run at what time and in what order.
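
A minimal sketch of this style in Python, where worker threads pull tasks from a shared queue and append results to a shared list, with a lock guarding the list. (Python itself normally has a GIL, covered in the next section; treat this purely as an illustration of the shared-memory model.)

        import threading
        import queue

        tasks = queue.Queue()
        results = []
        results_lock = threading.Lock()

        def worker():
            while True:
                n = tasks.get()
                if n is None:       # sentinel value: no more work
                    return
                squared = n * n
                with results_lock:  # guard the shared results list
                    results.append(squared)

        threads = [threading.Thread(target=worker) for _ in range(4)]
        for t in threads:
            t.start()
        for n in range(20):
            tasks.put(n)
        for _ in threads:
            tasks.put(None)   # one sentinel per worker
        for t in threads:
            t.join()
        print(sorted(results))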

Advantages

  • Less memory use since threads can share memory
  • Usually quite easy to set up
  • Suitable both for IO and CPU heavy tasks

Disadvantages

  • Possibility of race conditions if not carefully synchronized
  • Some overhead; tasks need to run long enough to justify the cost of spawning a thread
  • Synchronization primitives have their own overhead, which in some cases can make threading plus synchronization slower overall than a single-threaded solution.

Threading (GIL)

A GIL, or Global Interpreter Lock, is a feature of Python and some other interpreted languages: a lock the interpreter holds whenever it is executing code. This prevents more than one thread from performing computation at the same time; however, the lock is released while waiting on IO, so many IO tasks can still run simultaneously.

It's a common myth that the GIL removes every performance benefit of multithreading, but this is not the case. In GIL languages the main use case for threading is when you need to perform multiple IO tasks, and at most one CPU task, at the same time.
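
A minimal Python sketch of that use case, with hypothetical example URLs: the GIL is released while a thread waits on the network, so the downloads overlap even though only one thread runs Python code at a time.

        import threading
        import urllib.request

        def fetch(url):
            # The GIL is released during the blocking network call,
            # so all downloads proceed in parallel.
            with urllib.request.urlopen(url) as response:
                print(url, len(response.read()), "bytes")

        # Hypothetical URLs, for illustration only.
        urls = ["https://example.com/a", "https://example.com/b"]
        threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
        for t in threads:
            t.start()
        for t in threads:
            t.join()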

As with GIL-free threading, threads can be preempted by the operating system at any point, so the same caveats about race conditions and the same synchronization primitives apply here.

Advantages

  • Allows running multiple IO operations in parallel
  • This includes IO operations without explicit non-blocking support
  • No manual preemption needed, which means you can safely run long tasks and trust that the rest of your code will still get a chance to run

Disadvantages

  • Threads can still be preempted at any time, so you still need synchronization primitives
  • There is still overhead when context switching or using locks
  • Not suitable for multiple CPU bound tasks

Async

Async works using a scheduler running inside your process itself. You can schedule a task to run either as soon as possible or in response to a certain external event (like an IO operation completing). When a task finishes or yields, the scheduler runs a different task. All tasks run in the same thread and process. Often you divide a task into multiple parts and, at the end of each part, ask the scheduler to schedule the next part later. The await syntax is syntactic sugar around this process.
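
A minimal sketch using Python's asyncio, where asyncio.sleep stands in for a real IO operation; every await is a point where the scheduler may switch to another task:

        import asyncio

        async def job(name):
            print(name, "started")
            # Control returns to the scheduler until the (fake) IO completes.
            await asyncio.sleep(1)
            print(name, "finished")

        async def main():
            # Both jobs run concurrently in a single thread.
            await asyncio.gather(job("a"), job("b"))

        asyncio.run(main())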

In terms of use case, async is very similar to threading with a GIL: it can perform only one CPU-bound task at a time but many IO-bound tasks. The main difference is manual preemption: you explicitly choose the places where your code may be preempted. This removes most of the need for synchronization, though race conditions can still happen.

Another major difference is that the tasks are handled within the interpreter itself, which makes spawning new tasks very cheap. While with threading you can typically go up to a few tens of threads before the overhead becomes too much, you can easily have hundreds of simultaneous async tasks, continuously created and destroyed without much issue.

A common myth is that race conditions are impossible in async code. This is not the case, but they are certainly easier to prevent than with multithreading. Many async frameworks do not offer their own synchronization primitives, but such primitives are much easier to write yourself, since you can easily guarantee that a segment of code runs atomically. Threading primitives will not work here: they block the whole thread while waiting, which would deadlock the scheduler.

When running a long CPU-heavy task in an async context, you need to manually insert interruption points, places where other async code is allowed to run. Typically this takes the form of await sleep(0), which waits as little as possible but still lets other code run if it has been queued for a while. If all your CPU tasks are short, this is unnecessary.
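
A sketch of what that looks like with asyncio, using a long-running loop as a stand-in for the CPU-heavy task:

        import asyncio

        async def crunch(n):
            total = 0
            for i in range(n):
                total += i * i
                if i % 100_000 == 0:
                    # Yield so queued IO tasks get a chance to run.
                    await asyncio.sleep(0)
            return total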

Note that async code does not need to use any special syntax or the await statement. Whenever you use a callback function you are writing async code.
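
For instance, asyncio also exposes callback-style scheduling directly, without any await; a small sketch:

        import asyncio

        def tick(loop, n):
            print("tick", n)
            if n < 3:
                # Schedule the next part of the task as a callback.
                loop.call_later(0.5, tick, loop, n + 1)
            else:
                loop.stop()

        loop = asyncio.new_event_loop()
        loop.call_soon(tick, loop, 1)
        loop.run_forever()
        loop.close()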

Advantages

  • Very efficient, can run many IO tasks at the same time
  • Easy to synchronize since race conditions can only happen near await statements or between callbacks
  • Easy to work with, natural syntax for spawning tasks

Disadvantages

  • Not suitable for multiple CPU tasks
  • Manual preemption may lead to starvation if a task does not allow preemption often enough
  • Hard to integrate async with normal code
  • IO operations are not automatically async; an operation without async support may block the entire thread, removing the benefit.

Combining Techniques

You can use multiple models side by side. For example, you might run most of your code async but have a separate process run a CPU task in the background. In Python it's quite common to use a thread for blocking IO and then wrap waiting for that thread in an async interface.
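
In Python that pattern can look like the following sketch, where asyncio.to_thread runs the blocking call in a worker thread and hands back an awaitable; time.sleep stands in for a blocking IO call:

        import asyncio
        import time

        def blocking_read():
            # Stand-in for a blocking IO call with no async support.
            time.sleep(1)
            return "data"

        async def main():
            # The blocking call runs in a worker thread; other async
            # tasks keep running while we await its result.
            result = await asyncio.to_thread(blocking_read)
            print(result)

        asyncio.run(main())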

Multiprocessing requires IO to communicate, so it effectively transforms a CPU operation into an IO operation. You can then wait for the process to finish concurrently with other IO operations, using threading or async.
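
A sketch of that combination, with a sum-of-squares function standing in for the CPU-heavy work: the computation runs in a process pool, and awaiting its result is, from this process's point of view, just another IO operation.

        import asyncio
        from concurrent.futures import ProcessPoolExecutor

        def heavy(n):
            # CPU-bound stand-in computation.
            return sum(i * i for i in range(n))

        async def main():
            loop = asyncio.get_running_loop()
            with ProcessPoolExecutor() as pool:
                # Awaiting the future suspends this task, not the thread.
                result = await loop.run_in_executor(pool, heavy, 10_000_000)
                print("result:", result)

        if __name__ == "__main__":
            asyncio.run(main())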

Honorable mentions: SIMD and GPU

In some cases these types of concurrency might not be the best way to improve performance. If you have a very large amount of data and need to perform a relatively simple operation on each datapoint, you may be able to use SIMD or your GPU. SIMD is usually done through wrapper libraries like numpy. I am far from an expert on these topics, so I encourage you to do further research to determine whether they could work for your situation.

Thank you

Thank you for reading; I hope this article helped you choose which type of concurrency to use. If you enjoyed this article, consider reading my other article here