Do you need to parallelize df.iterrows() or a for loop in Pandas? This article describes two different ways to do it. The optimization can speed up such operations significantly.
df.iterrows() Parallelization in Pandas
The first example shows how to parallelize independent operations. Let's consider the following example:
from langdetect import detect
detect("War doesn't show who's right, just who's left.")
result:
'en'
It detects the language of a short text.
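langdetect returns ISO 639-1 language codes and works for many languages. A quick sketch with a non-English text (note that the detector is non-deterministic by default, so fixing the seed makes results reproducible; output may still vary for very short texts):
from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0  # make detection reproducible across runs
detect("Das ist ein kurzer Satz.")
result:
'de'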
If we apply the operation to a single column of a DataFrame like:
langs = []
for title in df.title:
    # detect the language only for string titles, fall back to 0 for NaN etc.
    if type(title) is str:
        langs.append(detect(title))
    else:
        langs.append(0)
df['lang'] = langs
For 10K titles this takes about 3 minutes.
A much more efficient way is to process the titles in parallel. This requires only a small code modification:
import multiprocessing as mp

def func(arg):
    idx, row = arg
    # detect the language only for string titles
    if type(row['title']) is str:
        return detect(row['title'])
    else:
        return 0

pool = mp.Pool(processes=mp.cpu_count())
langs = pool.map(func, [(idx, row) for idx, row in df.iterrows()])
pool.close()
df['lang'] = langs
Here processes=mp.cpu_count() sets the pool size to the number of available cores. For 10K titles the execution time drops from 3 minutes to about 20 seconds.
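Note that on platforms where new processes are spawned rather than forked (Windows, newer macOS), the pool should be created under an if __name__ == '__main__': guard and the worker function must be defined at module level. A minimal sketch of this variant, which also passes only the title values to the workers instead of whole rows so that less data has to be pickled (assuming the same df with a title column):
import multiprocessing as mp
from langdetect import detect

def detect_lang(title):
    # fall back to 0 for non-string values such as NaN
    if isinstance(title, str):
        return detect(title)
    return 0

if __name__ == '__main__':
    # the context manager closes the pool automatically
    with mp.Pool(processes=mp.cpu_count()) as pool:
        df['lang'] = pool.map(detect_lang, df['title'].tolist())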
Pandas + Dask Parallelization
An alternative approach is parallelization with the Python library Dask. This solution is a bit faster and more readable.
import dask.dataframe as ddf

def func(row):
    # detect the language only for string titles
    if type(row['title']) is str:
        return detect(row['title'])
    else:
        return 0

df_dask = ddf.from_pandas(df, npartitions=16)
df['lang'] = df_dask.apply(func, axis=1, meta=('lang', 'object')).compute(scheduler='multiprocessing')
Here 16 is the number of partitions, which should roughly match the number of cores you want to use. The number of available cores can be obtained with mp.cpu_count().
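For example, to tie the number of partitions to the available cores (a short sketch, assuming the same df as above):
import multiprocessing as mp
import dask.dataframe as ddf

# one partition per core is a reasonable starting point
df_dask = ddf.from_pandas(df, npartitions=mp.cpu_count())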
This approach is roughly 5-15% faster than the previous solution.
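If you want to verify the difference on your own data, a simple timing sketch (assuming func and df_dask as defined above):
import time

start = time.perf_counter()
langs = df_dask.apply(func, axis=1, meta=('lang', 'object')).compute(scheduler='multiprocessing')
print(f"Dask version took {time.perf_counter() - start:.1f} s")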