Let's say we're given a loop nest like:
using LinearAlgebra, Test
# Overwrite A with B / U, i.e. solve X * U = B in place, with U upper triangular.
function trisolve!(A, B, U)
    M, N = size(A)
    @assert size(B) == (M, N)
    @assert size(U) == (N, N)
    for m = 1:M
        for n = 1:N
            A[m, n] = B[m, n]
        end
        for n = 1:N
            A[m, n] /= U[n, n]
            for k = n+1:N
                A[m, k] -= A[m, n] * U[n, k]
            end
        end
    end
end
M, N = 4, 7
B = rand(M, N);
U = UpperTriangular(rand(N, N));
A = similar(B);
trisolve!(A, B, U)
@test A ≈ B / U
When we analyze it in LLVM, scalar evolution presents address evolution as a function of loop trip counts, which essentially transforms the loop into the one below:
using OffsetArrays, LinearAlgebra, Test
function trisolve_preprocess!(_A, _B, _U)
    M, N = size(_A)
    @assert size(_B) == (M, N)
    @assert size(_U) == (N, N)
    # shift all arrays to 0-based indexing, as the canonicalized form sees them
    A = OffsetArray(_A, -1, -1)
    B = OffsetArray(_B, -1, -1)
    U = OffsetArray(_U, -1, -1)
    for m = 0:M-1
        for n = 0:N-1
            A[m, n] = B[m, n]
        end
        for n = 0:N-1
            A[m, n] /= U[n, n]
            for k = 0:N-n-2
                A[m, k+n+1] -= A[m, n] * U[n, k+n+1]
            end
        end
    end
end
M, N = 4, 7
B = rand(M, N);
U = UpperTriangular(rand(N, N));
A = similar(B);
trisolve_preprocess!(A, B, U)
@test A ≈ B / U
That is, all arrays become 0-indexed (which is most convenient), and all loops start at 0 – which is perhaps a reasonable canonicalization, but can cause us some difficulty when building up constraint systems.
Intuitively, both versions of the code represent the exact same program with the same traversal order of memory addresses, and thus should be interpreted the same way.
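As a quick sanity check of that claim (assuming both functions above are already defined in the session), we can verify they produce the same result for the same inputs:
using LinearAlgebra, Test
M, N = 4, 7
B = rand(M, N);
U = UpperTriangular(rand(N, N));
A_v1 = similar(B); A_v2 = similar(B);
trisolve!(A_v1, B, U)
trisolve_preprocess!(A_v2, B, U)
@test A_v1 ≈ A_v2 ≈ B / U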
To understand an issue with the second representation, let us first consider the dependence between the store A[m, n] = A[m, n] / U[n, n] and the loads used for updating in the inner loop, that is, A[m, k] for the first version:
A[m, k] = A[m, k] - A[m, n] * U[n, k]
or A[m, k+n+1] for the second version:
A[m, k+n+1] = A[m, k+n+1] - A[m, n] * U[n, k+n+1]
Let's take for granted that the loads at a particular address (A[m, k] and A[m, k+n+1], respectively) must happen before the store A[m, n]. To see that this is the case, consider that once we compute A[m, n] for a particular value of n, say n = x, we update all values of A[m, y] where x < y and y is still within bounds. Only on a later iteration, after n has been incremented y - x times, do we finally perform the A[m, n] store to that address. That is, for any particular address, the store to it happens after all the loads from it.
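To make that ordering concrete, here is a small (hypothetical) helper that records, for a fixed column of A in the first version, the order in which that column's address is touched as n advances; the update loads always precede the divide store:
# For column `col`, list the operations on A[m, col] in program order (1-based version).
function trace_column_ops(N, col)
    ops = String[]
    for n = 1:N
        n == col && push!(ops, "divide (store) A[m, $col] at n = $n")
        for k = n+1:N
            k == col && push!(ops, "update (load)  A[m, $col] at n = $n")
        end
    end
    return ops
end
trace_column_ops(4, 3)  # two update loads, then the divide store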
For the first example, we have the iteration spaces for A[m, n], hereafter called the target, or t, and the load A[m, k], henceforth referred to as the source, or s:
$$
\begin{bmatrix}
-1 & 0 & 0 & 1 & 0 \\
-1 & 0 & 0 & 0 & 1 \\
 0 & 1 & 0 & -1 & 0 \\
 0 & 0 & 1 & 0 & -1
\end{bmatrix}
\begin{bmatrix} 1 \\ M \\ N \\ m_t \\ n_t \end{bmatrix} \ge 0
\qquad
\begin{bmatrix}
-1 & 0 & 0 & 1 & 0 & 0 \\
-1 & 0 & 0 & 0 & 1 & 0 \\
-1 & 0 & 0 & 0 & -1 & 1 \\
 0 & 1 & 0 & -1 & 0 & 0 \\
 0 & 0 & 1 & 0 & -1 & 0 \\
 0 & 0 & 1 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix} 1 \\ M \\ N \\ m_s \\ n_s \\ k_s \end{bmatrix} \ge 0
$$
Our dependence constraint system for this relation is thus (we have a dependence when $m_s = m_t$ and $k_s = n_t$):
$$
A_1 = \begin{bmatrix}
-1 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 \\
 0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & -1 & 1 \\
 0 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\
 0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\
 0 & 0 & 1 & 0 & 0 & 0 & 0 & -1
\end{bmatrix},
\qquad
E_1 = \begin{bmatrix}
 0 & 0 & 0 & 1 & 0 & -1 & 0 & 0 \\
 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1
\end{bmatrix}
$$
$$
0 \le A_1 \begin{bmatrix} 1 \\ M \\ N \\ m_t \\ n_t \\ m_s \\ n_s \\ k_s \end{bmatrix},
\qquad
0 = E_1 \begin{bmatrix} 1 \\ M \\ N \\ m_t \\ n_t \\ m_s \\ n_s \\ k_s \end{bmatrix}
$$
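To make these matrices concrete, here is a small check (with hypothetical example values) that a pair of iterations touching the same address satisfies both systems; take M = 4, N = 7, the store at (m_t, n_t) = (2, 5), and the load at (m_s, n_s, k_s) = (2, 3, 5):
A1 = [
    -1  0  0  1  0  0  0  0
    -1  0  0  0  1  0  0  0
     0  1  0 -1  0  0  0  0
     0  0  1  0 -1  0  0  0
    -1  0  0  0  0  1  0  0
    -1  0  0  0  0  0  1  0
    -1  0  0  0  0  0 -1  1
     0  1  0  0  0 -1  0  0
     0  0  1  0  0  0 -1  0
     0  0  1  0  0  0  0 -1
];
E1 = [
    0 0 0 1 0 -1 0 0
    0 0 0 0 1 0 0 -1
];
x = [1, 4, 7, 2, 5, 2, 3, 5];  # [1, M, N, m_t, n_t, m_s, n_s, k_s]
all(A1 * x .>= 0) && all(E1 * x .== 0)  # true: this pair is in the dependence polyhedron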
However, for ease of comparison with the second example, let us shift this to use 0-based indexing:
$$
\begin{bmatrix}
 0 & 0 & 0 & 1 & 0 \\
 0 & 0 & 0 & 0 & 1 \\
-1 & 1 & 0 & -1 & 0 \\
-1 & 0 & 1 & 0 & -1
\end{bmatrix}
\begin{bmatrix} 1 \\ M \\ N \\ m_t \\ n_t \end{bmatrix} \ge 0
\qquad
\begin{bmatrix}
 0 & 0 & 0 & 1 & 0 & 0 \\
 0 & 0 & 0 & 0 & 1 & 0 \\
 0 & 0 & 0 & 0 & -1 & 1 \\
-1 & 1 & 0 & -1 & 0 & 0 \\
-1 & 0 & 1 & 0 & -1 & 0 \\
-1 & 0 & 1 & 0 & 0 & -1
\end{bmatrix}
\begin{bmatrix} 1 \\ M \\ N \\ m_s \\ n_s \\ k_s \end{bmatrix} \ge 0
$$
Our dependence constraint system for this relation is thus (we have a dependence when $m_s = m_t$ and $k_s = n_t$):
$$
A_1 = \begin{bmatrix}
 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
-1 & 1 & 0 & -1 & 0 & 0 & 0 & 0 \\
-1 & 0 & 1 & 0 & -1 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0 & -1 & 1 \\
-1 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\
-1 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\
-1 & 0 & 1 & 0 & 0 & 0 & 0 & -1
\end{bmatrix},
\qquad
E_1 = \begin{bmatrix}
 0 & 0 & 0 & 1 & 0 & -1 & 0 & 0 \\
 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1
\end{bmatrix}
$$
For the second example, our individual constraint systems look like
$$
\begin{bmatrix}
 0 & 0 & 0 & 1 & 0 \\
 0 & 0 & 0 & 0 & 1 \\
-1 & 1 & 0 & -1 & 0 \\
-1 & 0 & 1 & 0 & -1
\end{bmatrix}
\begin{bmatrix} 1 \\ M \\ N \\ m_t \\ n_t \end{bmatrix} \ge 0
\qquad
\begin{bmatrix}
 0 & 0 & 0 & 1 & 0 & 0 \\
 0 & 0 & 0 & 0 & 1 & 0 \\
 0 & 0 & 0 & 0 & 0 & 1 \\
-1 & 1 & 0 & -1 & 0 & 0 \\
-1 & 0 & 1 & 0 & -1 & 0 \\
-2 & 0 & 1 & 0 & -1 & -1
\end{bmatrix}
\begin{bmatrix} 1 \\ M \\ N \\ m_s \\ n_s \\ k_s \end{bmatrix} \ge 0
$$
and our dependence constraints are
$$
A_2 = \begin{bmatrix}
 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
-1 & 1 & 0 & -1 & 0 & 0 & 0 & 0 \\
-1 & 0 & 1 & 0 & -1 & 0 & 0 & 0 \\
 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
-1 & 1 & 0 & 0 & 0 & -1 & 0 & 0 \\
-1 & 0 & 1 & 0 & 0 & 0 & -1 & 0 \\
-2 & 0 & 1 & 0 & 0 & 0 & -1 & -1
\end{bmatrix},
\qquad
E_2 = \begin{bmatrix}
 0 & 0 & 0 & 1 & 0 & -1 & 0 & 0 \\
-1 & 0 & 0 & 0 & 1 & 0 & -1 & -1
\end{bmatrix}
$$
$$
0 \le A_2 \begin{bmatrix} 1 \\ M \\ N \\ m_t \\ n_t \\ m_s \\ n_s \\ k_s \end{bmatrix},
\qquad
0 = E_2 \begin{bmatrix} 1 \\ M \\ N \\ m_t \\ n_t \\ m_s \\ n_s \\ k_s \end{bmatrix}
$$
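The same kind of check (again with hypothetical example values) for the second system; note that the equality now relates n_t to n_s + k_s + 1, which is where the constant offset enters. With M = 4, N = 7, the store at (m_t, n_t) = (2, 4), and the load at (m_s, n_s, k_s) = (2, 2, 1):
A2 = [
     0  0  0  1  0  0  0  0
     0  0  0  0  1  0  0  0
    -1  1  0 -1  0  0  0  0
    -1  0  1  0 -1  0  0  0
     0  0  0  0  0  1  0  0
     0  0  0  0  0  0  1  0
     0  0  0  0  0  0  0  1
    -1  1  0  0  0 -1  0  0
    -1  0  1  0  0  0 -1  0
    -2  0  1  0  0  0 -1 -1
];
E2 = [
     0 0 0 1 0 -1  0  0
    -1 0 0 0 1  0 -1 -1
];
x = [1, 4, 7, 2, 4, 2, 2, 1];  # n_t = 4 = n_s + k_s + 1
all(A2 * x .>= 0) && all(E2 * x .== 0)  # true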
Now, we want a schedule, i.e. values of
$$
\phi_s = \begin{bmatrix} \phi_{m_s} & \phi_{n_s} & \phi_{k_s} \end{bmatrix},\ \omega_s,
\qquad
\phi_t = \begin{bmatrix} \phi_{m_t} & \phi_{n_t} \end{bmatrix},\ \omega_t
$$
such that
$$
\phi_{m_t} m_t + \phi_{n_t} n_t + \omega_t - \phi_{m_s} m_s - \phi_{n_s} n_s - \phi_{k_s} k_s - \omega_s \ge 0
$$
for all values of $m_t, n_t, m_s, n_s, k_s$ within the iteration space. Farkas' lemma tells us that this is true iff we have $\lambda_0, \lambda \ge 0$ such that
$$
\lambda_0 + \lambda' \begin{bmatrix} A \\ E \\ -E \end{bmatrix}
\begin{bmatrix} 1 \\ M \\ N \\ m_t \\ n_t \\ m_s \\ n_s \\ k_s \end{bmatrix}
=
\begin{bmatrix} (\omega_t - \omega_s) & 0 & 0 & \phi_t' & -\phi_s' \end{bmatrix}
\begin{bmatrix} 1 \\ M \\ N \\ m_t \\ n_t \\ m_s \\ n_s \\ k_s \end{bmatrix}
$$
Because this identity must hold for all admissible values of the induction and program variables, we can match coefficients term by term. Thus, if the following is satisfied
$$
\begin{bmatrix} A' & E' & -E' \end{bmatrix} \lambda
=
\begin{bmatrix} (\omega_t - \omega_s) - \lambda_0 \\ 0 \\ 0 \\ \phi_t \\ -\phi_s \end{bmatrix}
$$
for some $\lambda \ge 0$, then the target iteration will take place after the source iteration for all possible values of the loop induction variables (i.e. $m_t$, $k_s$, etc.) and program variables (i.e. $M$ and $N$) that are within our dependence constraints. Our focus here is on $\Delta\omega$, i.e. $\omega_t - \omega_s$. A non-zero $\Delta\omega$ represents shifting operations across loop iterations, a peculiar transform we'd rather not resort to applying. Our linear program's objective function minimizes it (as well as the $\phi$ values).
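As a toy illustration of what a non-zero $\Delta\omega$ means (hypothetical f and g, not taken from the loop nest above): the same producer/consumer dependence can be satisfied either within an iteration, or by shifting one of the statements across the loop boundary, which forces prologue and epilogue code, much like the trisolve_ω1! example further below:
f(i) = i^2; g(v) = v + 1
n_iters = 4
xs = zeros(Int, n_iters); ys = zeros(Int, n_iters)
# ω = 0 style schedule: produce and consume xs[i] in the same iteration
for i = 1:n_iters
    xs[i] = f(i)
    ys[i] = g(xs[i])
end
# shifted schedule: xs[i] is produced one iteration before it is consumed
xs[1] = f(1)
for i = 1:n_iters-1
    ys[i] = g(xs[i])
    xs[i+1] = f(i+1)
end
ys[n_iters] = g(xs[n_iters])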
But here we have a problem. Let's look at the first row of $[A'\ E'\ -E']$ for our first example:
$$
\begin{bmatrix}
1 & 0 & 0 & -1 & -1 & 0 & 0 & 0 & -1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 \\
0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 & -1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & -1 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \lambda_0 \\ \lambda \end{bmatrix}
=
\begin{bmatrix} (\omega_t - \omega_s) \\ 0 \\ 0 \\ \phi_t \\ -\phi_s \end{bmatrix}
$$
and the second is
$$
\begin{bmatrix}
1 & 0 & 0 & -1 & -1 & 0 & 0 & 0 & -1 & -1 & -2 & 0 & -1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 \\
0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 & -1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & -1 & 0 & -1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & -1 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \lambda_0 \\ \lambda \end{bmatrix}
=
\begin{bmatrix} (\omega_t - \omega_s) \\ 0 \\ 0 \\ \phi_t \\ -\phi_s \end{bmatrix}
$$
Now, one of the chief optimizations we are interested in is register tiling. As part of this, we want to hoist the loads and stores out of the inner loop. Let's say we thus want to match indices in our load and store. So for the first example, that means selecting
$$
\phi_t = \begin{bmatrix} 0 & 1 \end{bmatrix},
\qquad
\phi_s = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix}
$$
For the second example, we have
$$
\phi_t = \begin{bmatrix} 0 & 1 \end{bmatrix},
\qquad
\phi_s = \begin{bmatrix} 0 & 1 & 1 \end{bmatrix}
$$
We could put these into a linear program solver, but it should be quickly apparent, or at least easy to confirm, that a valid solution is to set $\lambda_{12} = 1$ and all others to 0:
$$
\begin{bmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 0 \\ 0 & -1 \end{bmatrix}
\begin{bmatrix} \lambda_0 \\ 1 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ -1 \end{bmatrix},
\qquad
\begin{bmatrix} 1 & -1 \\ 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & -1 \\ 0 & -1 \end{bmatrix}
\begin{bmatrix} \lambda_0 \\ 1 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ -1 \\ -1 \end{bmatrix}
$$
For the first example, we can set $\lambda_0 = 0$, while for the second we must set $\lambda_0 = 1$, which isn't a problem. The $\lambda_{12}$ came from the equality matrix, which sets the loop induction variables indexing into each of the array's dimensions equal to one another. We have two copies of these columns, with flipped signs. This suggests that matching schedules to line up with the indices used for indexing into arrays is relatively straightforward: we don't have to search long for optimal $\lambda$ values. If the offset is 0, as in the first example, we can immediately find the combination that equals the corresponding schedule, without needing to run a linear program.
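Numerically, the two products above are easy to double-check (a throwaway computation, just mirroring the columns picked out of $[A'\ E'\ -E']$):
cols1 = [1 0; 0 0; 0 0; 0 0; 0 1; 0 0; 0 0; 0 -1]    # λ0 column and the n_t = k_s column
cols2 = [1 -1; 0 0; 0 0; 0 0; 0 1; 0 0; 0 -1; 0 -1]  # λ0 column and the n_t = n_s + k_s + 1 column
cols1 * [0, 1]   # first example, λ0 = 0
cols2 * [1, 1]   # second example, λ0 = 1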
However, if we have an offset, the direction matters. If it is positive, $\lambda_0$ picks up the slack. Otherwise, we need to check whether some linear combination of the other rows can compensate (by running the linear program solver), possibly resorting to a non-zero $\Delta\omega$.
Let's look at an example from this very same loop nest. Earlier, we compared the reduction load in the innermost loop with the store A[m, n] /= U[n, n]. This store happens after the reduction load. But what about the store A[m, n] = B[m, n]? This store happens before the reduction load, thanks to being split into an earlier loop in the original source.
This means we're in almost the same situation as before, but the source and target roles are switched; for the first example:
$$
\begin{bmatrix}
1 & 0 & 0 & -1 & -1 & 0 & 0 & 0 & -1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 \\
0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 & -1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & -1 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \lambda_0 \\ \lambda \end{bmatrix}
=
\begin{bmatrix} (\omega_t - \omega_s) \\ 0 \\ 0 \\ -\phi_s \\ \phi_t \end{bmatrix}
$$
and the second is
$$
\begin{bmatrix}
1 & 0 & 0 & -1 & -1 & 0 & 0 & 0 & -1 & -1 & -2 & 0 & -1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 & 0 \\
0 & 0 & 1 & 0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 & -1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & -1 & 0 & -1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & -1 & 0 & -1 & 0 & 1
\end{bmatrix}
\begin{bmatrix} \lambda_0 \\ \lambda \end{bmatrix}
=
\begin{bmatrix} (\omega_t - \omega_s) \\ 0 \\ 0 \\ -\phi_s \\ \phi_t \end{bmatrix}
$$
Because we do not have an offset in the first example, the solution is exactly as it was before: $\lambda_{12} = 1$, and all others, including $\lambda_0$, are 0. For the second example, however, the constant offset now enters with the opposite sign, and because the $\lambda$s are non-negative, our previous solution cannot be adapted. Note that we have
$$
\phi_s = \begin{bmatrix} 0 & 1 \end{bmatrix},
\qquad
\phi_t = \begin{bmatrix} 0 & 1 & 1 \end{bmatrix}
$$
So let's try running a linear program where we force Δω=0 to see if any feasible solution exists at all:
using JuMP, HiGHS
# Check whether c .== C * x has a solution with x .>= 0; the objective merely breaks ties.
function checksat(c, C)
    model = Model(HiGHS.Optimizer)
    N = size(C, 2)
    @variable(model, x[1:N] >= 0)
    @constraint(model, c .== C * x)
    @objective(model, Min, sum(x))
    optimize!(model)
    return model
end
c = [0; 0; 0; 0; -1; 0; 1; 1];
C = [
1 0 0 -1 -1 0 0 0 -1 -1 -1 0 -1 0 1
0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 1 1 0 0 0 0
0 1 0 -1 0 0 0 0 0 0 0 1 0 -1 0
0 0 1 0 -1 0 0 0 0 0 0 0 1 0 -1
0 0 0 0 0 1 0 0 -1 0 0 -1 0 1 0
0 0 0 0 0 0 1 -1 0 -1 0 0 -1 0 1
0 0 0 0 0 0 0 1 0 0 -1 0 -1 0 1
];
checksat(c, C)
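Rather than reading the solver log, we can also ask JuMP for the verdict directly (standard JuMP API; the exact status code may depend on the solver version):
model = checksat(c, C)
termination_status(model)  # expected: INFEASIBLE, i.e. no λ ≥ 0 satisfies the system with Δω = 0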
Uh oh! So what's the problem with simply moving ahead and accepting non-zero Δω? It has the effect of shifting operations across loop iterations. If we plow forward from here and satisfy all our constraints, we could end up with a solution such as:
function trisolve_ω1!(_A, _B, _U)
    A = OffsetArray(_A, -1, -1)
    B = OffsetArray(_B, -1, -1)
    U = OffsetArray(_U, -1, -1)
    M, N = size(A)
    for n = 0:N-1
        Unn = n == 0 ? zero(eltype(U)) : U[n-1, n-1]
        for m = 0:M-1
            Amn = A[m, n] = B[m, n]
            # the divide for column n-1 has been shifted into this iteration (ω = 1)
            n > 0 && (A[m, n-1] /= Unn)
            for k = 0:n-1
                Amn -= A[m, k] * U[k, n]
            end
            A[m, n] = Amn
        end
    end
    # epilogue: the divide for the final column happens after the loop
    let n = N
        Unn = U[n-1, n-1]
        for m = 0:M-1
            A[m, n-1] /= Unn
        end
    end
end
instead of the much simpler
function trisolve_ω0!(_A, _B, _U)
    A = OffsetArray(_A, -1, -1)
    B = OffsetArray(_B, -1, -1)
    U = OffsetArray(_U, -1, -1)
    M, N = size(A)
    for n = 0:N-1
        Unn = U[n, n]
        for m = 0:M-1
            Amn = B[m, n]
            for k = 0:n-1
                Amn -= A[m, k] * U[k, n]
            end
            A[m, n] = Amn / Unn
        end
    end
end
which corresponds to all $\Delta\omega = 0$.
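Both orderings do compute the same thing; assuming the definitions above (and the earlier OffsetArrays, LinearAlgebra, and Test imports), a quick check:
M, N = 4, 7
B = rand(M, N);
U = UpperTriangular(rand(N, N));
A = similar(B); trisolve_ω1!(A, B, U); @test A ≈ B / U
A = similar(B); trisolve_ω0!(A, B, U); @test A ≈ B / U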
So there is obviously some failure in our analysis or representation if two representations of the same program could produce such different outcomes. It'd be great to try and understand this at a deeper level at some point. I imagine there is some insight there that I'm missing on why this should be so, that may also suggest an ideal solution.
That is, why is something as simple as redefining $k_t^* = k_t + 1$, a shift of part of the polyhedron, able to solve this problem? Looking at the numbers, I can see why, but that is purely mechanical; I have no geometric interpretation or insight.
using LinearAlgebra
# W is the identity, except that it subtracts the k row (the last row) from the constant row;
# applied as C2 = W * C, this rewrites every constraint column in terms of k* = k + 1.
W = Matrix{Int}(I, size(C, 1), size(C, 1));
W[1, end] = -1;
C2 = W * C;
checksat(c, C2)
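And indeed, the shifted system is feasible (same caveat as before about reading the status programmatically via standard JuMP calls):
model2 = checksat(c, C2)
termination_status(model2)       # expected: OPTIMAL, a λ ≥ 0 now exists with Δω = 0
value.(all_variables(model2))    # the λ found by the solver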
That said, we know enough to come up with solutions. Let's first consider options.
One possibility is of course fixing it when laying out the operations. Because trisolve_ω0! and trisolve_ω1! are equivalent in order, it is possible to figure out how to generate trisolve_ω0! from trisolve_ω1!'s schedule. One approach is to check when we have non-zero omegas: if we have ω = 1 but no parents within the given loop iteration, we can move the operation to the end of the previous iteration; similarly, ω = −1 without children within an iteration could be shifted to the start of the next iteration, as sketched below.
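A rough sketch of that post-processing rule, using a hypothetical Op record (none of these names come from an actual implementation):
struct Op
    name::Symbol
    ω::Int
    parents::Vector{Symbol}   # ops within the same loop iteration that this op depends on
    children::Vector{Symbol}  # ops within the same loop iteration that depend on this op
end

# Move ω = 1 ops with no in-iteration parents to the end of the previous iteration,
# and ω = -1 ops with no in-iteration children to the start of the next iteration.
function normalize_ω(ops::Vector{Op})
    map(ops) do op
        if op.ω == 1 && isempty(op.parents)
            (Op(op.name, 0, op.parents, op.children), :end_of_previous_iteration)
        elseif op.ω == -1 && isempty(op.children)
            (Op(op.name, 0, op.parents, op.children), :start_of_next_iteration)
        else
            (op, :in_place)
        end
    end
end

normalize_ω([Op(:divide_store, 1, Symbol[], Symbol[])])  # the delayed divide moves back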
Another possibility is to try to change the problem we solve, such that the constant offset between indices is 0. There are a few different approaches we could imagine taking here, e.g. shifting the polyhedra, or transferring offsets into ωs within the simplices we build.
The first approach is reasonable, and can be done in conjunction with the second. However, I am taking the second approach first because, intuitively, we would like solving for Δω=0 in the LP step to also mean that values associated with the same address are touched at the same time, potentially allowing us to reduce the number of load and store operations needed, or at least increase locality.
Exploring the second option in more depth, the question is, at what point do we apply the adjustment, and where? I think we should leave the array indices, loop nest objects, and even dependence polyhedra unchanged. Instead, we can update the constraint systems to reflect applying a shift.
This is an easily invertible linear transform that can be applied to both sides of the simplices when we use them to check the direction of each dependency (we glossed over this earlier in the blog post, but the basic approach is to construct mirroring simplices assuming opposite directions, and then check which is violated; we currently don't support dependencies that switch direction, which is another TODO for the future, as one can break the iteration space into regions corresponding to each direction). We can solve this transformed problem (which can be fast: $\lambda_{12} = 1$, as discussed before), and then easily recover the solution in the original space:
julia> checksat(c, C2)
julia> csol = Int.((W.//1)\c)
8-element Vector{Int64}:
1
0
0
0
-1
0
1
1
julia> checksat(csol, C)
which of course has Δω=0.
However, the important thing is that our dependence edges were built in the transformed problem, and it's relatively straightforward to apply these omegas as shifts in indices, without needing as much analysis as option 1 would require as a pure post-processing step.
Note, though, that the transformation of index offsets into loop and ω shifts must be applied on a per-store basis, as stores must be assigned the same schedule as all loads feeding directly into them. Thus, transformations must be applied to these sets of operations together, and we must search for conflicts within a set where removing one offset would simply add it elsewhere. That is, for each store, we can check all edges connected to that store and its directly connected loads, and update these together as a set; we must ensure that the group update is actually a simplification. For example:
A = rand(5, 5);  # hypothetical example data; any sufficiently large matrix works
truncaxes(A, n) = firstindex(A, n):lastindex(A, n)-1
for j = truncaxes(A, 2), i = truncaxes(A, 1)
    A[i+1, j+1] = A[i+1, j] + A[i, j+1]
end
The two loads have direct connections to the store, thus all three memory operations must be shifted as a group. We cannot shift i or j for the set of all three operations in a way that would actually remove the offsets from the E matrices of the dependence polyhedra, so we must keep these offsets, as the small check below illustrates.
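Using the offsets of the three accesses relative to (i, j) (a throwaway computation, not part of any implementation): shifting the whole group by a common amount leaves the pairwise offsets, and hence the constants in the E matrices, unchanged:
accesses = Dict(:store => (1, 1), :load_i => (0, 1), :load_j => (1, 0))
rel(a) = (a[:store] .- a[:load_i], a[:store] .- a[:load_j])
rel(accesses)                                              # ((1, 0), (0, 1))
shifted = Dict(k => v .- (1, 1) for (k, v) in accesses)    # shift i and j for all three ops
rel(shifted)                                               # still ((1, 0), (0, 1))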