How Not to Use pandas' "apply"

Recently, I tripped over a use of the apply function in pandas in perhaps one of the worst possible ways. The scenario is this: we have a DataFrame of a moderate size, say 1 million rows and a dozen columns. We want to perform some row-wise computation on the DataFrame and, based on its results, generate a few new columns.

Let’s also assume that the computation is rather complex, so the wonderful vectorized operations that come with pandas are out of the question (the official performance enhancement tips are a nice read on this). And luckily it has been packaged as a function that returns a few values:

def complex_computation(a):
    # do lots of work here...
    # ...
    # and finally it's done.
    return value1, value2, value3

We want to put the computed results together into a new DataFrame.

A natural solution is to call the apply function of the DataFrame and pass in a function that performs the said computation:

def func(row):
    v1, v2, v3 = complex_computation(row[['some', 'columns']].values)
    return pd.Series({'NewColumn1': v1,
                      'NewColumn2': v2,
                      'NewColumn3': v3})
df_result = df.apply(func, axis=1)

According to the documentation of apply, the result depends on what func returns. If we pass in a func like this one, returning a Series instead of a single value, the result will be a nice DataFrame containing the three columns as named.
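To make this concrete, here is a toy illustration of the documented behavior (the column names here are made up):

import pandas as pd

df_toy = pd.DataFrame({'x': [1, 2], 'y': [10, 20]})

# When func returns a scalar, apply returns a Series:
df_toy.apply(lambda row: row['x'] + row['y'], axis=1)
# 0    11
# 1    22
# dtype: int64

# When func returns a Series, apply returns a DataFrame, with the
# index of the returned Series becoming the column labels:
df_toy.apply(lambda row: pd.Series({'diff': row['y'] - row['x'],
                                    'sum': row['x'] + row['y']}), axis=1)
#    diff  sum
# 0     9   11
# 1    18   22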

Expressed in a more loopy manner, the following yields an equivalent result:

v1s, v2s, v3s = [], [], []
for _, row in df.iterrows():
    v1, v2, v3 = complex_computation(row[['some', 'columns']].values)
    v1s.append(v1)
    v2s.append(v2)
    v3s.append(v3)
df_result = pd.DataFrame({'NewColumn1': v1s,
                          'NewColumn2': v2s,
                          'NewColumn3': v3s})

However, at first glance, the loopy version just does not seem as elegant as the apply version. Plus, leaving the work of putting the results together to pandas seems like a good idea: could pandas perform some magic in the background, making the loop complete faster?

That was what I thought, but it turns out that this use of apply has just constructed a silent memory-eating monster. To see that, let’s put the above pieces of code together and consider a minimal reproducible example (the pandas version here is 0.16.2):

import pandas as pd
import numpy as np
%load_ext memory_profiler

def complex_computation(a):
    # Okay, this is not really complex, but this is just for illustration.
    # To keep reproducibility, we can't make it order a pizza here.
    # Anyway, pretend that there is no way to vectorize this operation.
    return a[0]-a[1], a[0]+a[1], a[0]*a[1]

def func(row):
    v1, v2, v3 = complex_computation(row.values)
    return pd.Series({'NewColumn1': v1,
                      'NewColumn2': v2,
                      'NewColumn3': v3})

def run_apply(df):
    df_result = df.apply(func, axis=1)
    return df_result

def run_loopy(df):
    v1s, v2s, v3s = [], [], []
    for _, row in df.iterrows():
        v1, v2, v3 = complex_computation(row.values)
        v1s.append(v1)
        v2s.append(v2)
        v3s.append(v3)
    df_result = pd.DataFrame({'NewColumn1': v1s,
                              'NewColumn2': v2s,
                              'NewColumn3': v3s})
    return df_result

def make_dataset(N):
    np.random.seed(0)
    df = pd.DataFrame({
            'a': np.random.randint(0, 100, N),
            'b': np.random.randint(0, 100, N)
         })
    return df

def test():
    from pandas.util.testing import assert_frame_equal
    df = make_dataset(100)
    df_res1 = run_loopy(df)
    df_res2 = run_apply(df)
    assert_frame_equal(df_res1, df_res2)
    print('OK')

df = make_dataset(1000000)

Before anything else, let’s verify correctness on a small set of input data, i.e. that both implementations yield identical results:

test()
# OK

And now it’s time for some %memit. The loopy version gives:

%memit run_loopy(df)
# peak memory: 272.18 MiB, increment: 181.38 MiB

How about the elegant apply?

%memit run_apply(df)
# peak memory: 3941.29 MiB, increment: 3850.10 MiB

Oops, that’s more than 10 times the memory usage! Not good. Apparently, in order to achieve its flexibility, the apply function somehow has to hold on to all the intermediate Series that appeared along the way, or something like that.
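For a rough sense of what each of those intermediate Series costs, we can price one in isolation. This is only a back-of-the-envelope sketch: it counts just the data array and the strings backing the index, while the true per-object overhead (the Series and Index objects themselves) is larger still:

import sys
import pandas as pd

s = pd.Series({'NewColumn1': 1.0,
               'NewColumn2': 2.0,
               'NewColumn3': 3.0})

# A crude lower bound on one intermediate Series: its data array
# plus the strings backing its index.
print(sys.getsizeof(s.values) + sum(sys.getsizeof(i) for i in s.index))

# Compare with the plain tuple the loopy version works with.
print(sys.getsizeof((1.0, 2.0, 3.0)))

A few hundred bytes per row may not sound like much, but keeping a million such objects alive at once makes gigabytes of peak memory far less surprising.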

Speed-wise we have:

%timeit run_loopy(df)
# 1 loops, best of 3: 36.2 s per loop

%timeit run_apply(df)
# 1 loops, best of 3: 2min 48s per loop

Looping is slow, but it is still a lot faster than this way of using apply! The overhead of creating a Series for every input row is just too much.
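In fact, if we are looping anyway, the same idea can be pushed one step further: iterrows itself builds a Series for every input row. A variant that walks over the raw NumPy values avoids per-row Series on both the input and the output side (a sketch along the lines of the example above; I have not benchmarked it here):

def run_values(df):
    v1s, v2s, v3s = [], [], []
    # df[['a', 'b']].values is a plain NumPy array, so no per-row
    # Series gets constructed, for either the input or the output.
    for a in df[['a', 'b']].values:
        v1, v2, v3 = complex_computation(a)
        v1s.append(v1)
        v2s.append(v2)
        v3s.append(v3)
    return pd.DataFrame({'NewColumn1': v1s,
                         'NewColumn2': v2s,
                         'NewColumn3': v3s})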

Combining its memory and time inefficiency, I have just presented to you one of the worst possible ways to use the apply function in pandas. For some reason, this was not obvious to me when I first encountered it.

TL;DR: When applying a function to a DataFrame row-wise with DataFrame.apply, be careful about what the function returns: making it return a Series so that apply produces a DataFrame can be very memory inefficient on inputs with many rows. And it is slow. Very slow.