Pandas Easy Parallelization with df.iterrows() or For Loop

Do you need to parallelize a df.iterrows() or for loop in Pandas? This article describes two ways to do it: Python's multiprocessing module and Dask. When each row can be processed independently, spreading the work across CPU cores can cut the run time substantially.

df.iterrows() Parallelization in Pandas

The first example shows how to parallelize independent operations. Consider the following snippet:

from langdetect import detect
detect("War doesn't show who's right, just who's left.")

result:

'en'

The call detects the language of a short text.

Applying the operation to a single column of a DataFrame looks like this:

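langs = []

# df is assumed to be an existing DataFrame with a 'title' column
for title in df.title:
    if isinstance(title, str):
        langs.append(detect(title))
    else:
        # non-string values (for example NaN) get the placeholder 0
        langs.append(0)

df['lang'] = langs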

For 10 K titles, this loop takes about 3 minutes.
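
To reproduce a measurement like this on your own data, a minimal timing sketch could look as follows. It uses only the standard library. `fake_detect`, `label_titles` and `timed` are hypothetical names, and `fake_detect` is a stand-in for langdetect's `detect` so the snippet runs without extra dependencies; the sample titles are made up.

```python
import time

def fake_detect(text):
    # hypothetical stand-in for langdetect.detect; always answers 'en'
    return 'en'

def label_titles(titles, detector):
    # same logic as the loop above: strings are detected, anything else gets 0
    return [detector(t) if isinstance(t, str) else 0 for t in titles]

def timed(fn, *args):
    # returns the function result together with the elapsed wall-clock seconds
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

titles = ["War doesn't show who's right, just who's left.", None, 42]
langs, seconds = timed(label_titles, titles, fake_detect)
print(langs, f"{seconds:.6f}s")
```

Passing the real `detect` and `df['title']` instead of the stand-ins gives the serial baseline to compare against.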

A much more efficient way is to process the titles in parallel. This requires only a small code modification:

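import multiprocessing as mp
from langdetect import detect

def func(arg):
    # each argument is one (index, row) pair produced by df.iterrows()
    idx, row = arg

    if isinstance(row['title'], str):
        return detect(row['title'])
    else:
        return 0

# the guard is required on platforms that start workers with "spawn"
if __name__ == '__main__':
    # create the pool after func is defined so the workers can find it
    with mp.Pool(processes=mp.cpu_count()) as pool:
        langs = pool.map(func, df.iterrows())
    df['lang'] = langs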

Here mp.cpu_count() returns the number of CPU cores in the system, and processes= sets the number of worker processes in the pool. For 10 K titles, the execution time drops from 3 minutes to 20 seconds.
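
Each (idx, row) pair is pickled and sent to a worker, although only the title is used. One possible variation, sketched here under assumptions, sends just the title values and lets the pool batch them with chunksize. `detect_or_zero`, `parallel_detect` and the stand-in `fake_detect` are hypothetical names, not part of langdetect or Pandas, and the chunksize of 256 is an arbitrary starting value.

```python
import multiprocessing as mp

def fake_detect(text):
    # hypothetical stand-in for langdetect.detect so the sketch has no dependencies
    return 'en'

def detect_or_zero(title):
    # worker function: defined at module level so it can be pickled by reference
    return fake_detect(title) if isinstance(title, str) else 0

def parallel_detect(titles, processes=None, chunksize=256):
    # processes=None lets Pool choose its default worker count
    with mp.Pool(processes=processes) as pool:
        # chunksize groups titles into batches to cut inter-process overhead
        return pool.map(detect_or_zero, titles, chunksize=chunksize)
```

With a DataFrame, the call would be `df['lang'] = parallel_detect(df['title'].tolist())`, placed under an `if __name__ == '__main__':` guard and with `fake_detect` swapped for the real `detect`.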

Pandas + Dask Parallelization

An alternative approach is to parallelize with the Python library Dask. This solution is a bit faster and more readable.

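import dask.dataframe as ddf
from langdetect import detect

def func(row):
    # with axis=1, apply passes each row to func as a Series
    if isinstance(row['title'], str):
        return detect(row['title'])
    else:
        return 0

df_dask = ddf.from_pandas(df, npartitions=16)
# meta declares the name and dtype of the resulting Series;
# in a script, run the compute step under an if __name__ == '__main__': guard
df['output'] = df_dask.apply(func, axis=1, meta=('output', 'object')).compute(scheduler='processes')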

Here 16 is the number of partitions, which you can match to the number of cores you want to use. The number of cores can be obtained with mp.cpu_count().
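
To get a feel for what npartitions means, the following pure-Python sketch splits row positions into roughly equal contiguous blocks, one per partition. It is an illustration only and not Dask's actual partitioning code; `split_into_partitions` is a hypothetical helper.

```python
def split_into_partitions(n_rows, npartitions):
    # spread n_rows over npartitions contiguous ranges whose sizes differ by at most 1
    base, extra = divmod(n_rows, npartitions)
    bounds = []
    start = 0
    for i in range(npartitions):
        # the first `extra` partitions take one additional row
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

parts = split_into_partitions(10_000, 16)
print(len(parts), parts[0], parts[-1])
```

With 10 K rows and 16 partitions, each block holds 625 rows, and each block can be handed to a separate worker.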

This approach is roughly 5-15% faster than the previous solution.
