Create Return Function on CPU and GPU

In most algorithms used by the toolkit, we want to precompute the Return Matrix before doing the iterations, to improve performance. Let me explain in the context of an example.
We are solving the income fluctuation problem with discrete VFI:

V(a,z) = \max_{a'} \{ F(a',a,z) + \beta \sum_{z'} V(a',z') \pi(z,z') \}

where

F(a',a,z) = u( (1+r)a+wz-a' )

is the “static” return function we want to precompute. Conceptually, we write a function that, given the inputs (a',a,z) and other parameters, returns the payoff F. So we first write a function like this one:

function F = ifp_returnfn(aprime,a,z,r,w)

c = (1+r)*a+w*z-aprime;

F = -inf;
if c>0
    F = log(c);
end

end %end function

Then we fill the F matrix, an array with three dimensions (a' by a by z):

F_matrix = zeros(n_a,n_a,n_z);
tic
for z_c=1:n_z
    z_val = z_grid(z_c);
    for a_c=1:n_a
        a_val = a_grid(a_c);
        for ap_c=1:n_a
            aprime_val = a_grid(ap_c);
            F = ifp_returnfn(aprime_val,a_val,z_val,r,w);
            F_matrix(ap_c,a_c,z_c) = F;
        end
    end
end
toc

In the toolkit, this is done in a different way depending on whether the user is on CPU or GPU.

On a CPU, the array F_matrix is built in the function CreateReturnFnMatrix_Case1_Disc. The implementation is similar to the nested loops I wrote above, and sometimes it uses a parfor.

On a GPU, the array F_matrix is built in the function CreateReturnFnMatrix_Case1_Disc_Par2 using a very different method, based on arrayfun (for GPU). First we move the relevant grids to the GPU and reshape them:

a_grid_gpu = gpuArray(a_grid); 
z_grid_gpu = gpuArray(z_grid);

aprime_vals = reshape(a_grid_gpu,[n_a,1,1]);
a_vals      = reshape(a_grid_gpu,[1,n_a,1]);
z_vals      = reshape(z_grid_gpu,[1,1,n_z]);

Then we call arrayfun:

ReturnFn = @(aprime,a,z,r,w) ifp_returnfn(aprime,a,z,r,w);
tic
F_matrix_gpu = arrayfun(ReturnFn, aprime_vals,a_vals,z_vals,r,w);
toc

The important thing to note here is that we are relying on implicit (automatic) expansion: the arrays for a', a and z need to have compatible sizes, but they do not need to have the same size.
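To make "compatible sizes" concrete, here is a small standalone sketch (the grid sizes are made up for illustration): each dimension of the inputs must either match or equal 1, and Matlab expands the singleton dimensions automatically.

```matlab
% Illustrative sizes: 3 asset grid points, 2 shock values
A = reshape(1:3,[3,1,1]);   % size 3x1 (plays the role of a')
B = reshape(1:3,[1,3,1]);   % size 1x3 (plays the role of a)
C = reshape(1:2,[1,1,2]);   % size 1x1x2 (plays the role of z)
% Sizes are compatible: in each dimension they are equal or 1,
% so elementwise operations expand the result to 3x3x2
D = A + B + C;
size(D)   % is [3 3 2]
```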

Is it possible to do the same arrayfun on the CPU, so the toolkit could use the same functions?!

It turns out that arrayfun on the CPU behaves differently from arrayfun on the GPU, which is quite annoying and a source of confusion: arrayfun on the CPU wants all input arguments to have the same size, which is of course inefficient. So we can do this (it works):

aprime_vals = reshape(a_grid,[n_a,1,1]);
a_vals      = reshape(a_grid,[1,n_a,1]);
z_vals      = reshape(z_grid,[1,1,n_z]);

aprime_vals = repmat(aprime_vals,[1,n_a,n_z]);
a_vals      = repmat(a_vals,[n_a,1,n_z]);
z_vals      = repmat(z_vals,[n_a,n_a,1]);
r_vals = r*ones(n_a,n_a,n_z);
w_vals = w*ones(n_a,n_a,n_z); 

tic
F_matrix_cpu = arrayfun(ReturnFn, aprime_vals,a_vals,z_vals,r_vals,w_vals);
toc

But this is more verbose, and also slower than the nested loops (which is weird, since with arrayfun we are telling Matlab that the loop iterations are independent).
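As an aside, since R2016b implicit expansion also works for ordinary elementwise operations on the CPU, so one alternative is to vectorize the body of the return function directly, avoiding both arrayfun and repmat. This is just an illustration, not what the toolkit does:

```matlab
% Sketch: vectorized CPU version using implicit expansion (R2016b+)
aprime_vals = reshape(a_grid,[n_a,1,1]);
a_vals      = reshape(a_grid,[1,n_a,1]);
z_vals      = reshape(z_grid,[1,1,n_z]);

c = (1+r)*a_vals + w*z_vals - aprime_vals;  % expands to [n_a,n_a,n_z]
F_matrix_vec = -inf(n_a,n_a,n_z);           % default payoff for c<=0
F_matrix_vec(c>0) = log(c(c>0));            % log utility only where c>0
```

The drawback is that the return function's body has to be rewritten in vectorized form, whereas arrayfun works with the scalar function unchanged.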

In my humble opinion, Matlab should spend some time optimizing arrayfun for the cpu 🙂


Nice clear explanation of what is going on internally 👍

Yeah, that different behaviour of arrayfun() on cpu and gpu is odd.

The gpu use of arrayfun() is also far easier to generalize to higher dimensions than the nested loops: adding another combination of variables (say 2 d, 1 a, 3 z) takes just an extra two lines for the gpu, but an extra ten or so to write another nested loop. You can half sidestep this with gridvals (joint grids) instead of grid (stacked columns), but only half.
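To illustrate the "extra two lines" point: suppose we added a second exogenous shock e on a grid e_grid with n_e points (the names ReturnFn2 and e_grid are made up for this example). On the gpu it is just one extra reshape and one extra argument:

```matlab
% Hypothetical extra shock e: only two extra lines on the gpu
e_vals = reshape(gpuArray(e_grid),[1,1,1,n_e]);
% ReturnFn2 = @(aprime,a,z,e,r,w) ... would be the extended return function
F_matrix_gpu = arrayfun(ReturnFn2, aprime_vals,a_vals,z_vals,e_vals,r,w);
```

With nested loops the same change means writing a fourth loop level and re-indexing the output array by hand.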

I’m guessing here, but I think that the cpu version of arrayfun() is just stuck as legacy code (Matlab doesn’t like to break existing functions), and that if Matlab rewrote it from scratch it would behave like the gpu version, because the latter is clearly superior.


Thanks! Yes, I agree with you. If you look at the Matlab documentation for arrayfun on the cpu, https://uk.mathworks.com/help/matlab/ref/arrayfun.html, it says that arrayfun was introduced before 2006.

The key weakness is in these lines from the doc:

B = arrayfun(func,A1,…,An) applies func to the elements of the arrays A1,…,An, so that B(i) = func(A1(i),…,An(i)). The function func must take n input arguments and return a scalar. The arrays A1,…,An all must have the same size.

GPU support and automatic expansion came much later (automatic expansion arrived in R2016b). The corresponding lines for arrayfun on the gpu are:

B = arrayfun(func,A1,…,An) applies func to the elements of the arrays A1,…,An, so that B(i,j,…) = func(A1(i,j,…),…,An(i,j,…)). The function func must take n input arguments and return a scalar. The sizes of A1,…,An must match or be compatible.

In the context of the toolkit, arrayfun on the cpu is really a no-go, since one would have to replicate even scalar parameters, as I did in my example! This wastes a lot of memory: with, say, n_a=500 and n_z=20, each replicated scalar parameter alone takes 500*500*20*8 bytes = 40 MB (less of a concern on the cpu than on the gpu, but still very inefficient).


For more info on this, see this question and its answers on the Matlab forum: https://uk.mathworks.com/matlabcentral/answers/2103516-arrayfun-vs-loops-again

Joss from Mathworks replied to my question and explained why arrayfun on the cpu is unlikely to ever be improved. In one word: backward compatibility. So your guess was spot on, @robertdkirkby
