Runtime improvement to FHorz agent distribution iteration with semi-exogenous state

As a user you don’t have to follow this post, just enjoy faster runtimes for creating agent distribution in FHorz models with semi-exogenous shocks. Is explaining internal details of an improvement.

I have a FHorz model with co-author with small prod(n_d) and large prod(n_semiz), and N_j=64. Turned out the agent dist was taking longer than the value fn (132s vs 102s), which seemed like something that must be improvable. So I’ve found a way to improve the agent dist with semi-exogenous shocks. It slashed our agent dist runtime from 132s to 52s.

Core idea was that pi_semiz (transition matrix for semi-exogenous shocks) typically contains lots of zeros. So if we can get rid of these zeros before we create Gammatranspose=sparse() [the matrix used for first step of Tan improvement] then the sparse() will be faster. I found a way to do exactly this, at the cost of a small setup overhead. Since the overhead is once-off, and the gain is in sparse() every jj, there are substantial runtime gains as long as it is finite horizon (if N_j=5 runtime is largely unchanged, but if N_j=64 the gains are substantial). [Note, does not work for InfHorz, as not worth the setup overhead since we only do sparse() once.]

Following is copy-paste of a code that gives more detail, first half covers the basic ideas. Second half contains the old and new content of the agent dist command, you cannot run it, is just archive.

%% When we do agent distribution iteration we create sparse() matrix.
% Because the transition matrix for semiz often contains lots of zeros, we
% could make this faster if we can get rid of the zeros. To see the basic
% idea of what we are trying to exploit.
% [In principle you could do this same thing to pi_z, but we already use the
% Tan improvement, so we don't have this sparse() step involving pi_z. This 
% trick is not going to beat the Tan improvement.]

% Here we do a 'first version' which is the step creating sparse() we do to iterate the agent
% distribution with semi-exogenous shocks.
% The 'second version' does the same thing, but dropping lots zeros before doing sparse().

N=1000;
Nshort=500;

% When we do sparse() with N-by-N and lots of zeros
VV=[rand(N,Nshort),zeros(N,N-Nshort)]; % So (at most) Nshort out of N per line are non-zero
II=ceil(N*rand(N,N));
JJ=repmat((1:1:N)',1,N);
tic;
GG=sparse(II,JJ,VV,N,N);
toc
% and the run time was 0.02s

% Now we do the same, but without all those zeros
VV2=VV(:,1:Nshort);
II2=II(:,1:Nshort);
JJ2=JJ(:,1:Nshort);
tic;
GG2=sparse(II2,JJ2,VV2,N,N);
toc
% now the run time is just 0.009s

% So even with Nshort as high as N/2 the runtime is substantially cut, and
% the runtime gains get much faster as Nshort shrinks

% Obviously with pi_semiz, these zeros won't all just be the last elements.
% So it will take some time to set up the second version, the question then
% becomes is the runtime cost of setting up for the second version worth it
% to get the runtime gains from sparse() in the second version.

% Based on running some actual examples: for the second version, the runtime 
% gains from the faster sparse() are going to be worthwhile as long as you
% are going to be doing it 5+ times.
% So in infinite horizon, stick to first version.
% But for finite horizon, second version gives runtime gains. The runtime
% gains will be neglible for N_j=5, but will be substantial for N_j=60.

%% Create smaller version of transition matrix, so we have less zeros
% Trying to get a folded down version that drops zeros
% Pretend AA is the probability transition matrix, which is N-by-N
% We 'know' that many of the elements of AA are zeros
% So we want to create AAshort, which will be N-by-Nshort
% Nshort is 'the largest number of nonzero elements in any given row'

N=10;
AA=zeros(N,N);

M=50; % number of non-zeros
ind=ceil(N*N*rand(M,1));
AA(ind)=round(15*rand(M,1))

Nshort=max(sum((AA>0),2))

[BB1, idx] = sort(AA,2); % puts all the zeros on the left of the matrix

AAshort=BB1(:,end-Nshort+1:end)

idxshort=idx(:,end-Nshort+1:end)

% This is the setup we need to do for the second version.





%% Rest of this code won't run, is just copy pasted from a file to show full details of how the code for iterating agent distribution changed.

%% StationaryDist_FHorz_Iteration_SemiExo_nProbs_raw

%% New Version
% It is likely that most of the elements in pi_semiz_J are zero, we can
% take advantage of this to speed things up. Ignore for a moment the
% dependence on d and j, and pretend it is just a N_semiz-by-N_semiz
% matrix. Then we can calculate N_semizshort=max(sum((pi_semiz>0),2)), the
% maximum number of non-zeros in any row of pi_semiz. And we then use this
% in place of N_semiz as the second dimension.

disp('NEw')
tic;
N_semizshort=max(max(max(sum((pi_semiz_J>0),2))));
% Create smaller version of pi_semiz_J that eliminates as many non-zeros as possible
[pi_semiz_J_short, idx] = sort(pi_semiz_J,2); % puts all the zeros on the left of the matrix

pi_semiz_J_short=pi_semiz_J_short(:,end-N_semizshort+1:end,:,:);

size(pi_semiz_J)
size(pi_semiz_J_short)

idxshort=idx(:,end-N_semizshort+1:end,:,:);

Policy_dsemiexo=reshape(Policy_dsemiexo,[N_a*N_semiz*N_z,1,N_j]);
semizindex_short=repmat(repelem((1:1:N_semiz)',N_a,1),N_z,1)+N_semiz*(0:1:N_semizshort-1)+gather((N_semiz*N_semizshort)*(Policy_dsemiexo-1))+(N_semiz*N_semizshort*N_dsemiz)*shiftdim((0:1:N_j-1),-1); % index for semiz, plus that for semiz' (in the semiz' dim) and dsemiexo; their indexes in pi_semiz_J
pi_semiz_J_short=gather(pi_semiz_J_short);


size(idxshort)

temp=(0:1:N_semiz);
temp2=temp(idxshort(semizindex_short));

% Policy_aprime is currently [N_a,N_semiz*N_z,N_probs,N_j]
Policy_aprimesemizz=repelem(reshape(gather(Policy_aprime),[N_a*N_semiz*N_z,N_probs,N_j]),1,N_semizshort)+repmat(N_a*(idxshort(semizindex_short)-1),1,N_probs,1)+repelem(N_a*N_semiz*(0:1:N_z-1)',N_a*N_semiz,1); % Note: add semiz' index following the semiz' dimension, add z' index following the z dimension for Tan improvement
% Policy_aprimesemizz is currently [N_a,N_semiz*N_z,N_probs*N_semizshort,N_j]
size(Policy_aprimesemizz)



% semizindex is [N_a*N_semiz*N_z,N_semiz,N_j]
% used to index pi_semiz_J which is [N_semiz,N_semiz,N_dsemiz,N_j]

PolicyProbs2=reshape(PolicyProbs,[N_a*N_semiz*N_z,N_probs,N_j]);
PolicyProbs2=repelem(gather(PolicyProbs2),1,N_semizshort,1).*repmat(pi_semiz_J_short(semizindex_short),1,N_probs,1); % is of size [N_a*N_semiz*N_z,N_probs*N_semiz,N_j]
% WHY DONT I CREATE PolicyProbs ON GPU, THEN gather()? UNLESS IT RUNS OUT OF GPU MEMORY THIS SHOULD BE FASTER?

toc

pi_z_J=gather(pi_z_J);

size(Policy_aprimesemizz)
size(PolicyProbs2)


II2=repelem((1:1:N_a*N_semiz*N_z)',1,N_semizshort*N_probs); % Index for this period (a,semiz), note the N_probs-copies
N_bothz=N_semiz*N_z;
jj=1;
tic;
GammatransposeNew=sparse(Policy_aprimesemizz(:,:,jj),II2,PolicyProbs2(:,:,jj),N_a*N_bothz,N_a*N_bothz); % Note: sparse() will accumulate at repeated indices [only relevant at grid end points]
toc

GammatransposeNewStore=GammatransposeNew;

tic;

StationaryDistNew=zeros(N_a*N_semiz*N_z,N_j,'gpuArray'); % StationaryDist cannot be sparse
StationaryDistNew(:,1)=jequaloneDistKron;
StationaryDist_jj=sparse(gather(jequaloneDistKron)); % use sparse matrix

for jj=1:(N_j-1)
    Gammatranspose=sparse(Policy_aprimesemizz(:,:,jj),II2,PolicyProbs2(:,:,jj),N_a*N_bothz,N_a*N_bothz); % Note: sparse() will accumulate at repeated indices [only relevant at grid end points]
    
    % First step of Tan improvement
    StationaryDist_jj=reshape(Gammatranspose*StationaryDist_jj,[N_a*N_semiz,N_z]);

    % Second step of Tan improvement
    StationaryDist_jj=reshape(StationaryDist_jj*pi_z_J(:,:,jj),[N_a*N_bothz,1]);

    StationaryDistNew(:,jj+1)=gpuArray(full(StationaryDist_jj));
end
toc

%% Old Version
disp('Old')
tic;
% Policy_aprime is currently [N_a,N_semiz*N_z,N_probs,N_j]
Policy_aprimesemizz=repelem(reshape(gather(Policy_aprime),[N_a*N_semiz*N_z,N_probs,N_j]),1,N_semiz)+repmat(N_a*(0:1:N_semiz-1),1,N_probs)+repelem(N_a*N_semiz*(0:1:N_z-1)',N_a*N_semiz,1); % Note: add semiz' index following the semiz' dimension, add z' index following the z dimension for Tan improvement
% Policy_aprimesemizz is currently [N_a,N_semiz*N_z,N_probs*N_semiz,N_j]

Policy_dsemiexo=reshape(Policy_dsemiexo,[N_a*N_semiz*N_z,1,N_j]);

% precompute; Don't want `PolicyProbs` on GPU anyway, so leave these in CPU RAM
semizindex=repmat(repelem((1:1:N_semiz)',N_a,1),N_z,1)+N_semiz*(0:1:N_semiz-1)+gather((N_semiz*N_semiz)*(Policy_dsemiexo-1))+(N_semiz*N_semiz*N_dsemiz)*shiftdim((0:1:N_j-1),-1); % index for semiz, plus that for semiz' (in the semiz' dim) and dsemiexo; their indexes in pi_semiz_J
pi_semiz_J=gather(pi_semiz_J);
% semizindex is [N_a*N_semiz*N_z,N_semiz,N_j]
% used to index pi_semiz_J which is [N_semiz,N_semiz,N_dsemiz,N_j]

PolicyProbs=reshape(PolicyProbs,[N_a*N_semiz*N_z,N_probs,N_j]);
PolicyProbs=repelem(gather(PolicyProbs),1,N_semiz,1).*repmat(pi_semiz_J(semizindex),1,N_probs,1); % is of size [N_a*N_semiz*N_z,N_probs*N_semiz,N_j]
% WHY DONT I CREATE PolicyProbs ON GPU, THEN gather()? UNLESS IT RUNS OUT OF GPU MEMORY THIS SHOULD BE FASTER?

toc

pi_z_J=gather(pi_z_J);

II2=repelem((1:1:N_a*N_semiz*N_z)',1,N_semiz*N_probs); % Index for this period (a,semiz), note the N_probs-copies
N_bothz=N_semiz*N_z;
jj=1;
tic;
Gammatranspose=sparse(Policy_aprimesemizz(:,:,jj),II2,PolicyProbs(:,:,jj),N_a*N_bothz,N_a*N_bothz); % Note: sparse() will accumulate at repeated indices [only relevant at grid end points]
toc

GammatransposeStore=Gammatranspose;

tic;

StationaryDist=zeros(N_a*N_semiz*N_z,N_j,'gpuArray'); % StationaryDist cannot be sparse
StationaryDist(:,1)=jequaloneDistKron;
StationaryDist_jj=sparse(gather(jequaloneDistKron)); % use sparse matrix

% Precompute; II2 used only for sparse matrix creation...best done on CPU
% II2=repelem((1:1:N_a*N_semiz*N_z)',1,N_semiz*N_probs); % Index for this period (a,semiz), note the N_probs-copies

for jj=1:(N_j-1)
    Gammatranspose=sparse(Policy_aprimesemizz(:,:,jj),II2,PolicyProbs(:,:,jj),N_a*N_bothz,N_a*N_bothz); % Note: sparse() will accumulate at repeated indices [only relevant at grid end points]
    
    % First step of Tan improvement
    StationaryDist_jj=reshape(Gammatranspose*StationaryDist_jj,[N_a*N_semiz,N_z]);

    % Second step of Tan improvement
    StationaryDist_jj=reshape(StationaryDist_jj*pi_z_J(:,:,jj),[N_a*N_bothz,1]);

    StationaryDist(:,jj+1)=gpuArray(full(StationaryDist_jj));
end

toc

%%

% New times were
% 2.7s
% 0.08s
% 5.2s

% Old times were
% 2.2s
% 0.24s
% 16.3s

% In an actual life-cycle model with 64 time periods that I'm doing the
% runtime gain was to slash the time from 132s to 52s for computing the
% agent distribution. So can be big for models with large N_j, and where
% N_semiz is large but the maximum number of non-zeros per row of
% pi_semiz_J is much smaller than N_semiz.
1 Like

Currently I have only implemented the nProbs with semiz and z version.

Hopefully I can get the rest done this week.

Turns out it was easy to roll out everywhere, only took an hour or two. So now all done :smiley:

Note: for models where pi_semiz is dense (no zeros), this new approach is fractionally slower (say 1-3% slower), but I think that is worthwhile for the large speed gains when there are lots of zeros.

2 Likes