How Not to Use pandas' "apply"

Recently, I tripped over a use of the `apply` function in pandas in perhaps one of the worst possible ways. The scenario is this: we have a DataFrame of moderate size, say 1 million rows and a dozen columns. We want to perform some row-wise computation on the DataFrame and, based on it, generate a few new columns.

Let’s also assume that the computation is rather complex, so the wonderful vectorized operations that come with pandas are out of the question (the official performance enhancement tips are a nice read on this). Luckily, the computation has been packaged as a function that returns a few values:

```
def complex_computation(a):
    # do lots of work here...
    # ...
    # and finally it's done.
    return value1, value2, value3
```

We want to put the computed results together into a new DataFrame.

A natural solution is to call the `apply` function of the DataFrame and pass in a function which does the said computation:

```
def func(row):
    v1, v2, v3 = complex_computation(row[['some', 'columns']].values)
    return pd.Series({'NewColumn1': v1,
                      'NewColumn2': v2,
                      'NewColumn3': v3})

df_result = df.apply(func, axis=1)
```

According to the documentation of `apply`, the result depends on what `func` returns. If we pass in such a `func` (returning a Series instead of a single value), the result will be a nice DataFrame containing the three columns as named.
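To see this return-type behavior in isolation, here is a tiny toy example (not part of the original scenario): returning a scalar from the applied function yields a Series, while returning a `Series` yields a DataFrame whose columns come from the Series index:

```
import pandas as pd

df_toy = pd.DataFrame({'x': [1, 2], 'y': [10, 20]})

# Returning a scalar per row: the overall result is a Series.
s = df_toy.apply(lambda row: row['x'] + row['y'], axis=1)

# Returning a Series per row: the overall result is a DataFrame,
# with the per-row Series index becoming the column names.
d = df_toy.apply(lambda row: pd.Series({'sum': row['x'] + row['y'],
                                        'prod': row['x'] * row['y']}),
                 axis=1)
```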

Expressed in a more loopy manner, the following yields an equivalent result:

```
v1s, v2s, v3s = [], [], []
for _, row in df.iterrows():
    v1, v2, v3 = complex_computation(row[['some', 'columns']].values)
    v1s.append(v1)
    v2s.append(v2)
    v3s.append(v3)
df_result = pd.DataFrame({'NewColumn1': v1s,
                          'NewColumn2': v2s,
                          'NewColumn3': v3s})
```

However, at first glance, the loopy version just does not seem as elegant as the `apply` version. Plus, leaving the work of putting the results together to pandas seems like a good idea – could pandas perform some magic in the background and make the loop complete faster?

That was what I thought, but it turns out we have just constructed a silent memory-eating monster with such use of `apply`. To see that, let’s put together the above pieces of code and consider a minimal reproducible example (the pandas version here is `0.16.2`):

```
import pandas as pd
import numpy as np
%load_ext memory_profiler

def complex_computation(a):
    # Okay, this is not really complex, but this is just for illustration.
    # To keep reproducibility, we can't make it order a pizza here.
    # Anyway, pretend that there is no way to vectorize this operation.
    return a[0] - a[1], a[0] + a[1], a[0] * a[1]

def func(row):
    v1, v2, v3 = complex_computation(row.values)
    return pd.Series({'NewColumn1': v1,
                      'NewColumn2': v2,
                      'NewColumn3': v3})

def run_apply(df):
    df_result = df.apply(func, axis=1)
    return df_result

def run_loopy(df):
    v1s, v2s, v3s = [], [], []
    for _, row in df.iterrows():
        v1, v2, v3 = complex_computation(row.values)
        v1s.append(v1)
        v2s.append(v2)
        v3s.append(v3)
    df_result = pd.DataFrame({'NewColumn1': v1s,
                              'NewColumn2': v2s,
                              'NewColumn3': v3s})
    return df_result

def make_dataset(N):
    np.random.seed(0)
    df = pd.DataFrame({
        'a': np.random.randint(0, 100, N),
        'b': np.random.randint(0, 100, N)
    })
    return df

def test():
    from pandas.util.testing import assert_frame_equal
    df = make_dataset(100)
    df_res1 = run_loopy(df)
    df_res2 = run_apply(df)
    assert_frame_equal(df_res1, df_res2)
    print('OK')

df = make_dataset(1000000)
```

Before anything else, let’s check correctness on a small set of input data (i.e. that both implementations yield identical results):

```
test()
# OK
```

And now it’s time for some `%memit`. The loopy version gives:

```
%memit run_loopy(df)
# peak memory: 272.18 MiB, increment: 181.38 MiB
```

How about the elegant `apply`?

```
%memit run_apply(df)
# peak memory: 3941.29 MiB, increment: 3850.10 MiB
```

Oops, that’s over 10 times the memory usage! Not good. Apparently, in order to achieve its flexibility, the `apply` function somehow has to store all the intermediate `Series` that appeared along the way, or something like that.

Speed-wise we have:

```
%timeit run_loopy(df)
# 1 loops, best of 3: 36.2 s per loop
%timeit run_apply(df)
# 1 loops, best of 3: 2min 48s per loop
```

Looping is slow; but it is actually a lot faster than this way of using `apply`! The overhead of creating a Series for every input row is just too much.
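That per-row overhead is easy to measure directly. As a rough micro-comparison (a toy sketch, not the original benchmark), constructing one small `pd.Series` per row costs far more than the three plain list appends the loopy version performs:

```
import timeit
import pandas as pd

# What apply does per row here: build a small Series from a dict.
t_series = timeit.timeit(
    lambda: pd.Series({'NewColumn1': 1, 'NewColumn2': 2, 'NewColumn3': 3}),
    number=10000)

# What the loopy version does per row: three plain list appends.
v1s, v2s, v3s = [], [], []
def append_row():
    v1s.append(1)
    v2s.append(2)
    v3s.append(3)

t_lists = timeit.timeit(append_row, number=10000)
```

On any recent machine, `t_series` should come out orders of magnitude larger than `t_lists`, which accounts for much of the slowdown observed above.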

Combining both its memory and time inefficiency, I have just presented to you one of the worst possible ways to use the `apply` function in pandas. For some reason, this did not appear obvious to me when I first encountered it.

**TL;DR**: When applying a function to a DataFrame row-wise with `DataFrame.apply`, be careful of what the function returns – making it return a `Series` so that `apply` results in a DataFrame can be very memory-inefficient on input with many rows. And it is slow. Very slow.