```julia
for i = 0:I-1, j = 0:J-1
    A[i+1,j+1] = A[i+1,j] + A[i,j+1]
end
```
This loop is not trivial to parallelize because it requires serial execution of both the i and j loops. The store into A[i+1,j+1] will be loaded on the next iteration of the j loop as A[i+1,j], and on the next iteration of the i loop as A[i,j+1]. We define the store into A[i+1,j+1] as the source, and the subsequent loads from this memory location as the targets, which must read the stored value.
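To make the source/target relationship concrete, here is a small sketch (not from the post; `find_dependences` is a made-up helper) that replays the loop for small trip counts, recording which iteration last wrote each array slot and which later iterations read it:

```julia
# Enumerate (source, target) iteration pairs by tracking the last writer of each location.
function find_dependences(I, J)
    writer = Dict{NTuple{2,Int},NTuple{2,Int}}()      # memory location => iteration that last stored it
    deps = Tuple{NTuple{2,Int},NTuple{2,Int}}[]       # (source iteration, target iteration) pairs
    for i in 0:I-1, j in 0:J-1
        for loc in ((i+1, j), (i, j+1))               # loads: A[i+1,j] and A[i,j+1]
            haskey(writer, loc) && push!(deps, (writer[loc], (i, j)))
        end
        writer[(i+1, j+1)] = (i, j)                   # store: A[i+1,j+1]
    end
    return deps
end

# Every observed dependence steps forward by exactly one iteration of either loop:
for (s, t) in find_dependences(4, 4)
    @assert t == (s[1], s[2] + 1) || t == (s[1] + 1, s[2])
end
```

Every pair found steps forward by one iteration of either the i or the j loop, matching the two source/target relationships described above.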
If the relationships between sources and targets across loop iterations are cast as a dependency graph with coordinates corresponding to the iteration space indices, we can model and transform those relationships using dependence polyhedra, allowing application of optimization techniques from linear programming and operations research.
In general, we define polyhedra using a system of inequalities.
$$
\mathbf{A}\vec{x} \le \vec{b}
$$
where inequalities involving vectors apply elementwise, bold capital letters (e.g., $\mathbf{A}$) indicate matrices, and the vector arrow (e.g., $\vec{x}$) indicates column vectors.
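For example, the iteration space of the loop above ($0 \le i \le I-1$, $0 \le j \le J-1$) fits this form, treating $I$ and $J$ as symbolic constants on the right-hand side:

$$
\begin{bmatrix} -1 & 0 \\ 1 & 0 \\ 0 & -1 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} i \\ j \end{bmatrix}
\le
\begin{bmatrix} 0 \\ I - 1 \\ 0 \\ J - 1 \end{bmatrix}.
$$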
For dependence polyhedra, we define one polyhedron per dependence. The aim is to describe how i and j relate in the source and target through a system of inequalities. We use the s subscript to indicate the source (parent) and the t subscript the target (child). With this notation, the store into $A[i_s+1, j_s+1]$ and subsequent load from $A[i_t+1, j_t]$ imposes two constraints: $i_s + 1 = i_t + 1$ and $j_s + 1 = j_t$. Encoding this information, as well as the loop bounds, yields the following system of inequalities, which we can interpret as a polyhedron giving the space of all dependent iterations (note that we can disprove a dependence by proving its dependence polyhedron is empty, which we will demonstrate in a future blog post).
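Spelled out, that system collects the loop bounds and the two equalities (each equality standing in for a pair of opposite inequalities so it fits the $\mathbf{A}\vec{x} \le \vec{b}$ form above):

$$
\begin{aligned}
0 \le i_s &\le I - 1, &\qquad 0 \le i_t &\le I - 1,\\
0 \le j_s &\le J - 1, &\qquad 0 \le j_t &\le J - 1,\\
i_s + 1 &= i_t + 1, &\qquad j_s + 1 &= j_t.
\end{aligned}
$$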
As seen in the orthogonalize_loops post, we like to define loop schedules as affine functions of the indices. We'll discuss valid schedules and their interpretation in a subsequent post, but for now, suffice it to say that we need to prove things about the differences between the source and target schedules at certain levels of the loop nest. The source iteration must occur before the target iteration, which means that at a given level (i.e., for a given row of the scheduling matrix) we generally need to prove that the source's schedule value either equals the target's or comes before it. Letting $\phi^l_t(\vec{t})$ and $\phi^l_s(\vec{s})$ be the affine schedules at level $l$ for the target and source respectively, such that
$$
\phi^l_t(\vec{t}) = \alpha^l_t + \vec{\beta}^{lT}_t \vec{t}
$$
and all other schedules are defined similarly, with the superscript T indicating transposition.
Then we may wish to constrain ourselves to solutions that satisfy
$$
\phi^l_t(\vec{t}) - \phi^l_s(\vec{s}) \ge \delta
$$
for some δ.
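As an illustrative check (this example is not from the post), take the original loop order as the schedule: $\phi^0(i,j) = i$ at the outer level and $\phi^1(i,j) = j$ at the inner level, for both source and target. For the first dependence ($i_t = i_s$, $j_t = j_s + 1$):

$$
\phi^0_t(\vec{t}) - \phi^0_s(\vec{s}) = i_t - i_s = 0, \qquad
\phi^1_t(\vec{t}) - \phi^1_s(\vec{s}) = j_t - j_s = 1,
$$

so the outer level only achieves $\delta = 0$ for this dependence, while the inner ($j$) level achieves $\delta = 1$. The second dependence is the mirror image: the $i$ level achieves $\delta = 1$ and the $j$ level only $\delta = 0$. Each loop therefore carries one of the two dependences, which is why neither can be parallelized as written.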
Is there some way we can find constraints on the allowed values of $\alpha_t$ and $\vec{\beta}_t$? Enter Farkas' lemma!
According to the affine form of Farkas' lemma given in Uday's thesis, an affine function $\psi(\vec{x})$ is non-negative everywhere in a polyhedron $D$ iff it is a non-negative linear combination of the faces:
$$
\psi(\vec{x}) \equiv \lambda_0 + \vec{\lambda}_P^T\left(\vec{b} - \mathbf{A}\vec{x}\right), \qquad \lambda_0 \ge 0,\; \vec{\lambda}_P \ge \vec{0}.
$$
That is, if some $\lambda_0 \ge 0$ and $\vec{\lambda}_P \ge \vec{0}$ exist such that the equality holds, then $\psi(\vec{x})$ is non-negative everywhere inside the polyhedron.
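As a toy example (not from the thesis or the post), take $D = \{x : 0 \le x \le 5\}$, whose faces are $x \ge 0$ and $5 - x \ge 0$. The affine function $\psi(x) = 7 - x$ is non-negative everywhere on $D$, and indeed

$$
\psi(x) = 7 - x = 2 + 0 \cdot x + 1 \cdot (5 - x),
$$

with $\lambda_0 = 2 \ge 0$ and $\vec{\lambda}_P = [0, 1]^T \ge \vec{0}$.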
To apply that to our problem of dependence, we let
$$
\psi\left(\begin{bmatrix}\vec{s} \\ \vec{t}\end{bmatrix}\right) = \phi^l_t(\vec{t}) - \phi^l_s(\vec{s}) - \delta = \lambda_0 + \vec{\lambda}_P^T\left(\vec{b} - \mathbf{A}\vec{x}\right)
$$
and then use Fourier-Motzkin elimination to remove the $\lambda$s.
Returning to the first dependence of our motivating example, we can write this identity out and match coefficients.
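A sketch of that step (the intermediate form here is a reconstruction, but it reproduces the constraints below): substitute the equalities $i_t = i_s$ and $j_t = j_s + 1$ into $\psi$, leaving a polyhedron in $(i_s, j_s)$ whose non-redundant faces are $i_s \ge 0$, $I - 1 - i_s \ge 0$, $j_s \ge 0$, and $J - 2 - j_s \ge 0$. The Farkas identity then reads

$$
\alpha^l_t - \alpha^l_s + \left(\beta^l_{i_t} - \beta^l_{i_s}\right) i_s + \left(\beta^l_{j_t} - \beta^l_{j_s}\right) j_s + \beta^l_{j_t} - \delta
\equiv \lambda_0 + \lambda_1 i_s + \lambda_2\left(I - 1 - i_s\right) + \lambda_3 j_s + \lambda_4\left(J - 2 - j_s\right).
$$

The symbolic parameters $I$ and $J$ appear only in the $\lambda_2$ and $\lambda_4$ terms, and not at all on the left-hand side, so matching their coefficients forces $\lambda_2 = \lambda_4 = 0$.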
We realize we can just drop them. Matching the remaining coefficients and requiring the leftover $\lambda$s to be non-negative, our final constraints are
$$
\begin{aligned}
\beta^l_{j_s} &\le \beta^l_{j_t} \\
\beta^l_{i_s} &\le \beta^l_{i_t} \\
\delta &\le \beta^l_{j_t} + \alpha^l_t - \alpha^l_s
\end{aligned}
$$
Thus, the coefficients for the target must be at least as great as those for the source. Additionally, if the offsets $\alpha^l_t$ and $\alpha^l_s$ are equal, then $\beta^l_{j_t}$ must be at least our desired $\delta$.
Repeating this procedure for the other dependency would similarly yield
$$
\begin{aligned}
\beta^l_{j_s} &\le \beta^l_{j_t} \\
\beta^l_{i_s} &\le \beta^l_{i_t} \\
\delta &\le \beta^l_{i_t} + \alpha^l_t - \alpha^l_s
\end{aligned}
$$
Note that to parallelize a hyperplane, we need $\delta$ for all unsatisfied dependencies to be $0$. One approach is to let $\beta^0_{i_t} = \beta^0_{i_s} = \beta^0_{j_t} = \beta^0_{j_s} = 1$ to satisfy both dependencies in the outer loop. Then we can freely parallelize the inner loop.
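Concretely, that choice corresponds to a wavefront execution of the original loop. Here is a sketch in Julia (not from the post; `recurrence!` and `recurrence_wavefront!` are made-up names, and A is indexed 1-based, so the 0-based A[i+1,j+1] becomes A[i+2,j+2]): the outer loop walks the hyperplane $k = i + j$, and iterations within a wavefront are independent, so the inner loop can be threaded.

```julia
# Serial reference: the original loop nest.
function recurrence!(A, I, J)
    for i in 0:I-1, j in 0:J-1
        A[i+2, j+2] = A[i+2, j+1] + A[i+1, j+2]
    end
    return A
end

# Wavefront version: outer loop over k = i + j (β coefficients of 1 for both i and j).
# Iterations on wavefront k only read values written on wavefront k - 1,
# so they can safely run in parallel.
function recurrence_wavefront!(A, I, J)
    for k in 0:(I-1)+(J-1)
        Threads.@threads for i in max(0, k - (J-1)):min(I-1, k)
            j = k - i
            A[i+2, j+2] = A[i+2, j+1] + A[i+1, j+2]
        end
    end
    return A
end

I, J = 100, 100
A0 = rand(I+1, J+1)
@assert recurrence!(copy(A0), I, J) == recurrence_wavefront!(copy(A0), I, J)
```

Each element is computed exactly once from the same two operands in either version, so the results match exactly.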