Farkas' Lemma

2022-03-24

Consider the loop

for i = 0:I-1, j = 0:J-1
    A[i+1,j+1] = A[i+1,j] + A[i,j+1]
end

This loop is not trivial to parallelize because, as written, it requires serial execution of both the i and j loops. The store into A[i+1,j+1] is loaded on the next iteration of the j loop as A[i+1,j], and on the next iteration of the i loop as A[i,j+1]. For example, iteration (i,j) = (0,0) stores A[1,1], which iteration (0,1) loads as A[i+1,j] and iteration (1,0) loads as A[i,j+1]. We define the store into A[i+1,j+1] as the source, and the subsequent loads from this memory location as the targets, which must observe the stored value.

If the relationships between sources and targets across loop iterations are cast as a dependency graph with coordinates corresponding to the iteration space indices, we can model and transform those relationships using dependence polyhedra, allowing application of optimization techniques from linear programming and operations research.

In general, we define polyhedra using a system of inequalities.

$$\begin{aligned}
\textbf{A}\vec{x} \le \vec{b}
\end{aligned}$$

where inequalities involving vectors apply elementwise, bold capital letters indicate matrices, and the vector arrow $\vec{x}$ is used to indicate column vectors.

For dependence polyhedra, we define one per dependence. The aim is to describe how $i$ and $j$ relate in the source and target through a system of inequalities. We use the $s$ subscript to indicate the source (parent) and the $t$ subscript the target (child). With this notation, the store into $A[i_s+1, j_s+1]$ and the subsequent load from $A[i_t+1,j_t]$ impose two constraints: $i_s+1=i_t+1$ and $j_s+1=j_t$. Encoding this information, as well as the loop bounds, yields the following system of inequalities, which we can interpret as a polyhedron giving the space of all dependent iterations (note that we can disprove a dependence by proving its dependence polyhedron is empty, which we will demonstrate in a future blog post).

$$\begin{aligned}
\begin{bmatrix}
1 & 0 & 0 & 0\\ -1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & -1 & 0 & 0\\
0 & 0 & 1 & 0\\ 0 & 0 & -1 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 0 & -1\\
1 & 0 & -1 & 0\\ -1 & 0 & 1 & 0\\ 0 & 1 & 0 & -1\\ 0 & -1 & 0 & 1
\end{bmatrix}
\begin{bmatrix}i_s\\j_s\\i_t\\j_t\end{bmatrix}
\le
\begin{bmatrix}I-1\\0\\J-1\\0\\I-1\\0\\J-1\\0\\0\\0\\-1\\1\end{bmatrix}
\end{aligned}$$

Pruning a few redundant bounds, we get

$$\begin{aligned}
\begin{bmatrix}
1 & 0 & 0 & 0\\ -1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & -1 & 0 & 0\\
1 & 0 & -1 & 0\\ -1 & 0 & 1 & 0\\ 0 & 1 & 0 & -1\\ 0 & -1 & 0 & 1
\end{bmatrix}
\begin{bmatrix}i_s\\j_s\\i_t\\j_t\end{bmatrix}
\le
\begin{bmatrix}I-1\\0\\J-2\\0\\0\\0\\-1\\1\end{bmatrix}
\end{aligned}$$
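As a quick numerical sanity check (my own illustration, with arbitrary example bounds I = 4 and J = 5), the pruned system can be written down and tested against candidate iteration pairs:

```julia
# Membership test for the pruned dependence polyhedron A*x <= b,
# with x = [i_s, j_s, i_t, j_t]. Rows encode, in order:
# i_s <= I-1, -i_s <= 0, j_s <= J-2, -j_s <= 0,
# i_s - i_t <= 0, i_t - i_s <= 0 (i.e. i_s == i_t),
# j_s - j_t <= -1, j_t - j_s <= 1 (i.e. j_t == j_s + 1).
I, J = 4, 5  # arbitrary example loop bounds

A = [ 1  0  0  0
     -1  0  0  0
      0  1  0  0
      0 -1  0  0
      1  0 -1  0
     -1  0  1  0
      0  1  0 -1
      0 -1  0  1]
b = [I-1, 0, J-2, 0, 0, 0, -1, 1]

in_dependence_polyhedron(x) = all(A * x .<= b)

in_dependence_polyhedron([2, 3, 2, 4])  # true:  (2,3) stores A[3,4], which (2,4) loads as A[i+1,j]
in_dependence_polyhedron([2, 3, 3, 4])  # false: i_s != i_t, so this pair is not dependent via this access
```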

As seen in the orthogonalize_loops post, we like to define loop schedules as affine functions of the indices. We'll discuss valid schedules and their interpretation in a subsequent post, but for now, suffice it to say that we need to prove things about the differences between schedules at certain levels of the loop nest. The source iteration must occur before the target iteration, which means we generally need to prove that one level of the loop (i.e., one row of the scheduling matrix) either equals, or iterates before, another. Letting $\phi^l_t\left(\vec{t}\right)$ and $\phi^l_s\left(\vec{s}\right)$ be the affine schedules at level $l$ for the target and source respectively, such that

$$\begin{aligned}
\phi^l_t\left(\vec{t}\right) &= \alpha_{t}^l + \vec{\beta}_{t}^T\vec{t}
\end{aligned}$$

and all other schedules are defined similarly, with the superscript $T$ indicating transposition.

Then we may wish to constrain ourselves to solutions that satisfy

$$\begin{aligned}
\phi^l_t\left(\vec{t}\right) - \phi^l_s\left(\vec{s}\right) \ge \delta
\end{aligned}$$

for some $\delta$.

Is there some way we can find constraints on allowed values of $\alpha_{t}$ and $\vec{\beta}_{t}$? Enter Farkas' lemma!

According to the affine form of Farkas' lemma given in Uday's thesis, an affine function $\psi\left(\vec{x}\right)$ is non-negative everywhere in a polyhedron $\mathcal{D}$ iff it is a non-negative linear combination of the faces:

$$\begin{aligned}
\psi\left(\vec{x}\right) \equiv \lambda_0 + \vec{\lambda}^T_P\left(\vec{b} - \textbf{A}\vec{x}\right), \quad \lambda_0\ge 0,\ \vec{\lambda}_P\ge 0.
\end{aligned}$$

That is, if some $\lambda_0\ge 0$, $\vec{\lambda}_P\ge 0$ exist such that the equality holds, then $\psi\left(\vec{x}\right)$ is non-negative everywhere inside the polyhedron.
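As a quick one-dimensional sanity check (a made-up illustration, not from the thesis): take the polyhedron $0 \le x \le N$, whose faces are $x \ge 0$ and $N - x \ge 0$. The affine function $\psi(x) = N - x + 2$ is non-negative everywhere on it, and indeed

$$\begin{aligned}
\psi(x) = 2 + 0\cdot x + 1\cdot\left(N - x\right)
\end{aligned}$$

with $\lambda_0 = 2 \ge 0$ and $\vec{\lambda}_P = [0, 1]^T \ge 0$.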

To apply that to our problem of dependence, we let

$$\begin{aligned}
\psi\left(\left[\vec{s},\vec{t}\right]\right) &= \phi^l_t\left(\vec{t}\right) - \phi^l_s\left(\vec{s}\right) - \delta\\
&= \lambda_0 + \vec{\lambda}^T_P\left(\vec{b} - \textbf{A}\vec{x}\right)
\end{aligned}$$

and then use Fourier-Motzkin elimination to remove the $\lambda$s.
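Fourier-Motzkin elimination itself is mechanical enough to sketch in a few lines. The following is a minimal, unoptimized sketch (my own illustration, not code from the thesis) that eliminates a single variable from a system $\textbf{A}\vec{x} \le \vec{b}$ by pairing every lower bound on that variable with every upper bound; rows that don't involve the variable pass through unchanged.

```julia
# Minimal Fourier-Motzkin sketch: eliminate variable k from the system A*x <= b.
# Rows with a zero coefficient on x_k are kept as-is; every row with a positive
# coefficient (an upper bound on x_k) is combined with every row with a negative
# coefficient (a lower bound) so that x_k cancels. No redundancy removal.
function fourier_motzkin(A::AbstractMatrix, b::AbstractVector, k::Integer)
    pos = findall(>(0), A[:, k])
    neg = findall(<(0), A[:, k])
    zer = findall(iszero, A[:, k])
    rows = [A[i, :] for i in zer]   # constraints not involving x_k
    rhs  = [b[i] for i in zer]
    for i in pos, j in neg
        # scale row i by -A[j,k] > 0 and row j by A[i,k] > 0, then add
        push!(rows, -A[j, k] * A[i, :] + A[i, k] * A[j, :])
        push!(rhs,  -A[j, k] * b[i]    + A[i, k] * b[j])
    end
    return permutedims(reduce(hcat, rows)), rhs
end
```

Repeatedly applying this to the columns corresponding to the $\lambda$s (and simplifying) reproduces the by-hand eliminations carried out below.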

Returning to the first dependence of our motivating example, we have

$$\begin{aligned}
\psi^l\left(\left[\vec{s},\vec{t}\right]\right) &= \psi^l\left(\left[i_s,j_s,i_t,j_t\right]\right) \\
&= \phi^l_t\left(\left[i_t,j_t\right]\right) - \phi^l_s\left(\left[i_s,j_s\right]\right) - \delta\\
&= \alpha_{t}^l + \beta^l_{i_t}i_t + \beta^l_{j_t}j_t - \alpha_{s}^l - \beta^l_{i_s}i_s - \beta^l_{j_s}j_s - \delta\\
&= \lambda_0 + \lambda_1\left(I-1 - i_s\right) + \lambda_2\left(i_s\right) + \lambda_3\left(J-2 - j_s\right) + \lambda_4\left(j_s\right) \\
&\ \ \ \ + \lambda_5\left(i_t - i_s\right) + \lambda_6\left(i_s - i_t\right) + \lambda_7\left(-1 + j_t - j_s\right) + \lambda_8\left(1 + j_s - j_t\right)
\end{aligned}$$

We can now produce the following system of equations by matching the coefficients of the induction variables, of the loop-bound parameters $I$ and $J$, and of the constant terms:

$$\begin{aligned}
-\beta^l_{i_s} &= -\lambda_1 + \lambda_2 - \lambda_5 + \lambda_6\\
-\beta^l_{j_s} &= -\lambda_3 + \lambda_4 - \lambda_7 + \lambda_8\\
\beta^l_{i_t} &= \lambda_5 - \lambda_6\\
\beta^l_{j_t} &= \lambda_7 - \lambda_8\\
0 &= \lambda_1\\
0 &= \lambda_3\\
\alpha^l_t - \alpha^l_s - \delta &= \lambda_0 - \lambda_1 - 2\lambda_3 - \lambda_7 + \lambda_8
\end{aligned}$$

and imposing the inequalities $\lambda_i\ge0$, $i = 0,\ldots,8$.

We can perform a couple rounds of Gaussian elimination to simplify this to:

$$\begin{aligned}
\beta^l_{i_t}-\beta^l_{i_s} &= \lambda_2\\
\beta^l_{j_t}-\beta^l_{j_s} &= \lambda_4\\
\beta^l_{i_t} &= \lambda_5 - \lambda_6\\
\beta^l_{j_t} &= \lambda_7 - \lambda_8\\
\beta^l_{j_t} + \alpha^l_t - \alpha^l_s - \delta &= \lambda_0
\end{aligned}$$

before proceeding with Fourier-Motzkin elimination to eliminate the Farkas multipliers (the $\lambda$s).

To first eliminate $\lambda_2$, we have the following inequalities involving $\lambda_2$:

$$\begin{aligned}
\lambda_2 &\le \beta^l_{i_t}-\beta^l_{i_s}\\
\lambda_2 &\ge \beta^l_{i_t}-\beta^l_{i_s}\\
\lambda_2 &\ge 0
\end{aligned}$$

Eliminating $\lambda_2$ produces

$$\begin{aligned}
\beta^l_{j_t}-\beta^l_{j_s} &= \lambda_4\\
\beta^l_{i_t} &= \lambda_5 - \lambda_6\\
\beta^l_{j_t} &= \lambda_7 - \lambda_8\\
\beta^l_{j_t} + \alpha^l_t - \alpha^l_s - \delta &= \lambda_0\\
0 &\le \beta^l_{i_t}-\beta^l_{i_s}\\
\lambda_0 &\ge 0\\
\lambda_4 &\ge 0\\
\lambda_5 &\ge 0\\
\lambda_6 &\ge 0\\
\lambda_7 &\ge 0\\
\lambda_8 &\ge 0
\end{aligned}$$

Similarly, eliminating $\lambda_0$, $\lambda_4$, $\lambda_5$, and $\lambda_7$, we have:

$$\begin{aligned}
\lambda_4 &\le \beta^l_{j_t}-\beta^l_{j_s}\\
\lambda_4 &\ge \beta^l_{j_t}-\beta^l_{j_s}\\
\lambda_4 &\ge 0\\
\lambda_5 &\le \beta^l_{i_t} + \lambda_6\\
\lambda_5 &\ge \beta^l_{i_t} + \lambda_6\\
\lambda_5 &\ge 0\\
\lambda_7 &\le \beta^l_{j_t} + \lambda_8\\
\lambda_7 &\ge \beta^l_{j_t} + \lambda_8\\
\lambda_7 &\ge 0\\
\lambda_0 &\le \beta^l_{j_t} + \alpha^l_t - \alpha^l_s - \delta\\
\lambda_0 &\ge \beta^l_{j_t} + \alpha^l_t - \alpha^l_s - \delta\\
\lambda_0 &\ge 0
\end{aligned}$$

yielding

$$\begin{aligned}
0 &\le \beta^l_{j_t}-\beta^l_{j_s}\\
0 &\le \beta^l_{i_t} + \lambda_6\\
0 &\le \beta^l_{j_t} + \lambda_8\\
0 &\le \beta^l_{j_t} + \alpha^l_t - \alpha^l_s - \delta\\
0 &\le \beta^l_{i_t}-\beta^l_{i_s}\\
\lambda_6 &\ge 0\\
\lambda_8 &\ge 0
\end{aligned}$$

Finally, eliminating $\lambda_6$ and $\lambda_8$:

$$\begin{aligned}
\lambda_6 &\ge -\beta^l_{i_t}\\
\lambda_6 &\ge 0\\
\lambda_8 &\ge -\beta^l_{j_t}\\
\lambda_8 &\ge 0
\end{aligned}$$

Since $\lambda_6$ and $\lambda_8$ appear only in lower bounds, with no upper bounds to pair them against, we can simply drop them. Thus, our final inequalities are

$$\begin{aligned}
\beta^l_{j_s} &\le \beta^l_{j_t}\\
\beta^l_{i_s} &\le \beta^l_{i_t}\\
\delta &\le \beta^l_{j_t} + \alpha^l_t - \alpha^l_s
\end{aligned}$$

Thus, the coefficients for the target must be at least as great as those for the source. Additionally, if the offsets $\alpha^l_t$ and $\alpha^l_s$ are equal, then $\beta^l_{j_t}$ must be at least our desired $\delta$.

Repeating this procedure for the other dependence (the load from $A[i_t, j_t+1]$) would similarly yield

$$\begin{aligned}
\beta^l_{j_s} &\le \beta^l_{j_t}\\
\beta^l_{i_s} &\le \beta^l_{i_t}\\
\delta &\le \beta^l_{i_t} + \alpha^l_t - \alpha^l_s
\end{aligned}$$
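Putting the two constraint sets together, checking the legality of a candidate schedule level is just a handful of comparisons. A small sketch (a hypothetical helper of my own, written directly against the inequalities above):

```julia
# Check the derived legality constraints for one schedule level l.
# α_s, α_t are the source/target offsets; β_* are the induction-variable
# coefficients; δ is the progress required of this level.
function satisfies_both_deps(α_s, α_t, β_is, β_js, β_it, β_jt, δ)
    common = (β_js <= β_jt) && (β_is <= β_it)
    dep1 = δ <= β_jt + α_t - α_s   # store -> load of A[i+1,j]
    dep2 = δ <= β_it + α_t - α_s   # store -> load of A[i,j+1]
    return common && dep1 && dep2
end

satisfies_both_deps(0, 0, 1, 1, 1, 1, 1)  # true:  the skewed hyperplane i + j carries both dependences
satisfies_both_deps(0, 0, 0, 1, 0, 1, 1)  # false: the hyperplane j makes no progress on the second dependence
```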

Note that to parallelize a hyperplane, we need $\delta$ for all unsatisfied dependencies to be 0. One approach is to let $\beta^0_{i_t} = \beta^0_{i_s} = \beta^0_{j_t} = \beta^0_{j_s} = 1$ to satisfy both dependencies in the outer loop. Then we can freely parallelize the inner loop, as sketched below.
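Concretely, choosing the outer hyperplane $i + j$ turns the nest into a wavefront. A sketch in the same pseudocode style as the original loop: the outer loop over $w = i + j$ stays serial, while the iterations of the inner loop for a fixed $w$ are mutually independent and could be run in parallel.

```julia
# Wavefront sketch: w = i + j is executed serially; for a fixed w, every
# (i, j) on the hyperplane touches independent entries of A, so the inner
# loop could be marked parallel (e.g. with Threads.@threads).
for w = 0:(I-1)+(J-1)
    for i = max(0, w-(J-1)):min(w, I-1)
        j = w - i
        A[i+1, j+1] = A[i+1, j] + A[i, j+1]
    end
end
```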

Reference: Effective Automatic Parallelization and Locality Optimization using the Polyhedral Model, by Uday Bondhugula.