Python Multiprocessing: Pool vs Process – Comparative Analysis

Introduction To Python Multiprocessing

Multiprocessing can improve performance considerably when work can be split across CPU cores. We came across Python Multiprocessing when we had the task of evaluating millions of Excel expressions using Python code.

In such a scenario, evaluating the expressions serially is too slow to be practical. So, we decided to use Python Multiprocessing.

Generally, to run tasks concurrently, you use either threads or processes. We first decided to use threads. However, while doing research, we learned that the Global Interpreter Lock (GIL) in CPython lets only one thread execute Python bytecode at a time, so threads do not speed up CPU-bound work. Digging further, we learned that Python's multiprocessing module provides two classes for running work in separate processes: Process and Pool.

In the following sections, I briefly describe our experience with the Pool and Process classes and compare their performance. The comparison should help you choose the right approach for your own multiprocessing task.

Python Multiprocessing: The Pool and Process class

Though Pool and Process both execute tasks in parallel, they do so in different ways.

The Pool keeps a fixed number of worker processes and hands tasks to them from a queue in FIFO order as workers become free. It works like the map step of a map-reduce architecture: it maps the inputs to the worker processes and collects the outputs from all of them. With Pool.map, it waits for all the tasks to finish and then returns the output as a list, in the same order as the inputs. Only the worker processes exist in memory at any time; tasks that have not started yet wait in the queue instead of each occupying a process.
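
The behaviour described above can be sketched in a few lines. This is a minimal illustration, not our production code: square is a made-up stand-in for real work, and the worker count of 2 is an arbitrary assumption.

```python
from multiprocessing import Pool


def square(n):
    """Stand-in for one unit of work (hypothetical, not an Excel evaluator)."""
    return n * n


def squares_with_pool(values, workers=2):
    """Map `square` over `values` using a pool of `workers` processes."""
    with Pool(processes=workers) as pool:
        # map() blocks until every task is done and returns the results
        # as a list in the same order as the inputs.
        return pool.map(square, values)


if __name__ == "__main__":
    print(squares_with_pool([1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the main module (Windows and macOS by default); without it, each worker would try to create its own pool.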

Python-Process

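With the Process class, every task gets its own operating-system process, so all of the processes are in memory once they are started. Scheduling is left to the operating system: when a running process is suspended (for example, while it waits on IO), the OS pre-empts it and schedules another process for execution.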
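
For comparison, here is a minimal sketch of the one-process-per-task style. The worker function and the use of a Queue to return results are assumptions made for illustration; Process itself does not return values, so some channel like this is needed.

```python
from multiprocessing import Process, Queue


def worker(task_id, value, results):
    """Hypothetical task: square a value and report it with its task id."""
    results.put((task_id, value * value))


def run_each_in_own_process(values):
    """Start one Process per value and gather the results by task id."""
    results = Queue()
    procs = [
        Process(target=worker, args=(i, v, results))
        for i, v in enumerate(values)
    ]
    for p in procs:
        p.start()  # every process exists in memory from here on
    # Drain the queue before join(). Completion order is not guaranteed,
    # so results are keyed by task id and reordered afterwards.
    gathered = dict(results.get() for _ in procs)
    for p in procs:
        p.join()
    return [gathered[i] for i in range(len(values))]


if __name__ == "__main__":
    print(run_each_in_own_process([1, 2, 3]))  # [1, 4, 9]
```

Note that the number of live processes here equals the number of tasks, which is exactly what becomes a problem when the task count is large.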

When to use Pool and Process

I think choosing an appropriate approach depends on the task at hand. The Pool lets each worker process handle many jobs, which may make it easier to parallelize your program. If you have a million tasks to execute in parallel, you can create a Pool with as many processes as there are CPU cores and then pass the list of the million tasks to Pool.map. The pool will distribute those tasks to the worker processes (typically the same number as available cores), collect the return values as a list, and pass it to the parent process. Launching a million separate processes would be far less practical (and would probably break your OS).
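
A rough sketch of that pattern follows. The evaluate function is a placeholder and the chunksize of 1000 is an arbitrary assumption; chunksize is a real Pool.map parameter that sends tasks to workers in batches, which usually reduces inter-process overhead when there are many small tasks.

```python
import os
from multiprocessing import Pool


def evaluate(x):
    """Placeholder for evaluating one expression (hypothetical)."""
    return x * 2 + 1


def evaluate_all(inputs, chunksize=1000):
    """Run `evaluate` over all inputs with one worker per CPU core."""
    workers = os.cpu_count() or 1  # cpu_count() can return None
    with Pool(processes=workers) as pool:
        # Tasks are shipped to workers in batches of `chunksize`.
        return pool.map(evaluate, inputs, chunksize=chunksize)


if __name__ == "__main__":
    out = evaluate_all(range(100_000))
    print(len(out), out[:3])  # 100000 [1, 3, 5]
```

The demo uses 100,000 inputs to keep the run short; the same call works for a million, with only the worker processes alive at any time.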

On the other hand, if you have a small number of tasks to execute in parallel and only need each task done once, it may be perfectly reasonable to use a separate multiprocessing.Process for each task rather than setting up a Pool.

We used both the Pool and Process classes to evaluate Excel expressions. Following are our observations about the two classes:

Task number

As we have seen, the Pool keeps only its worker processes in memory, while the Process approach creates one process per task and keeps all of them in memory. So, when the number of tasks is small, we can use the Process class, and when the number of tasks is large, we can use the Pool. With a large number of tasks, using one process per task can exhaust memory and destabilize the system. The Pool, on the other hand, has some overhead in creating it. Hence, with a small number of tasks, performance can suffer when a Pool is used.

IO operations

The Pool distributes tasks among its worker processes in a FIFO manner. Within each worker, the allocated tasks execute serially. So, if a task performs a long IO operation, the worker waits until the IO operation is completed and does not pick up another task in the meantime. This leads to an increase in execution time. With the Process class, each task has its own process, so when one process blocks on IO, the operating system suspends it and schedules another. So, in the case of long IO operations, it is advisable to use the Process class.
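
The effect of blocked workers can be explored with a small experiment. This is only a sketch: fake_io simulates IO with time.sleep, and the task and worker counts are arbitrary assumptions. It also shows one common workaround that we did not test in our own runs, namely giving the pool more workers than cores when tasks mostly wait.

```python
import time
from multiprocessing import Pool


def fake_io(task_id, delay=0.2):
    """Simulate a blocking IO call with sleep (a stand-in, not real IO)."""
    time.sleep(delay)
    return task_id


def run_io_tasks(task_ids, workers):
    """Run the simulated IO tasks and return (results, elapsed seconds)."""
    start = time.perf_counter()
    with Pool(processes=workers) as pool:
        # A worker that is sleeping cannot take the next task from the queue.
        results = pool.map(fake_io, task_ids)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    for workers in (2, 8):
        results, elapsed = run_io_tasks(range(8), workers)
        print(f"{workers} workers: {elapsed:.2f}s")
```

With 8 tasks, the 2-worker run has to go through several rounds of waiting, while the 8-worker run can overlap all the waits; exact timings depend on the machine and on process start-up cost.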

Python Multiprocessing: Performance Comparison

In our case, the performance using the Pool class was as follows:

  1. Using the pool: 6 secs
  2. Without using the pool: 10 secs

Process() works by launching an independent system process for every parallel task you want to run. When we used the Process class for this workload, the machine became unstable because 1 million processes were created and loaded in memory.

To test further, we reduced the number of arguments in each expression and ran the code for 100 expressions.

The performance using the Pool class is as follows:

  1. Using the pool: 4 secs
  2. Without using the pool: 3 secs

Then, we increased the number of arguments in each expression to 250 and executed those expressions.

The performance using the Pool class is as follows:

  1. Using the pool: 0.6 secs
  2. Without using the pool: 3 secs
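
If you want to run a similar comparison on your own workload, a timing harness along these lines may help. It is a sketch under stated assumptions: evaluate is a CPU-bound placeholder rather than our Excel evaluator, the input sizes and the worker count of 4 are arbitrary, and the timings it prints will not match the figures above.

```python
import time
from multiprocessing import Pool


def evaluate(n):
    """CPU-bound placeholder (hypothetical, not an Excel expression)."""
    return sum(i * i for i in range(n))


def timed(label, fn):
    """Call fn(), print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.3f}s")
    return result


def run_serial(inputs):
    return [evaluate(n) for n in inputs]


def run_pool(inputs, workers=4):
    with Pool(processes=workers) as pool:
        return pool.map(evaluate, inputs)


if __name__ == "__main__":
    inputs = [20_000] * 200
    serial = timed("without pool", lambda: run_serial(inputs))
    pooled = timed("with pool", lambda: run_pool(inputs))
    assert serial == pooled  # both paths must give the same answers
```

Varying the length of inputs and the size of each task should show the same trade-off we saw: pool start-up cost dominates for small jobs and is repaid on larger ones.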

To summarize, the Pool class works better when there are many tasks and little IO wait. The Process class works better when the tasks are few and the IO operations are long. What was your experience with Python Multiprocessing? I would be more than happy to have a conversation about this.