Orthogonalize Indices

2022-03-14

The convolution operation $C = A * B$ is key to many signal processing applications. When used in a convolutional neural network, backpropagation also requires the calculation of the gradient of the convolution with respect to its arguments during the 'reverse pass'. Ignoring differentiation with respect to B, let's take conv! and convpullback! as representatives of the forward and reverse passes for convolution.

julia> using OffsetArrays

julia> function conv!(_C, _A, _B)
    A = OffsetArray(_A, OffsetArrays.Origin(0,0))
    B = OffsetArray(_B, OffsetArrays.Origin(0,0))
    C = OffsetArray(_C, OffsetArrays.Origin(0,0))
    I, J = size(B)
    M, N = size(C)
    for n = 0:N-1, m = 0:M-1, i = 0:I-1, j = 0:J-1
        C[m,n] += A[m + i, n + j] * B[i, j]
    end
    return C
end

julia> function convpullback!(_Ā, _B, _C̄)
    Ā = OffsetArray(_Ā, OffsetArrays.Origin(0,0))
    B = OffsetArray(_B, OffsetArrays.Origin(0,0))
    C̄ = OffsetArray(_C̄, OffsetArrays.Origin(0,0))
    I, J = size(B)
    M, N = size(C̄)
    for n = 0:N-1, m = 0:M-1, i = 0:I-1, j = 0:J-1
        Ā[m + i, n + j] += B[i, j] * C̄[m, n]
    end
    return Ā
end

While the forward pass is easy to optimize via register tiling, this is not the case for the pullback: the index into $\bar{A}$ is dependent on all four loop induction variables, making it impossible to hoist these loads and stores out of any loops. This forces us to re-load and re-store memory on every iteration, requiring several additional CPU instructions per multiplication.

I'll add a post detailing register tiling, but for now just note that aside from reducing the loads and stores to $\bar{A}$ by factors equal to the loops they're hoisted out of, tiling also results in several times fewer loads from B and from $\bar{C}$ , enabling code to start attaining significant fractions of peak flops. This reduced demand on memory bandwidth can deliver performance gains of about an order of magnitude in typical cases.

Now onto this post's core question: can we re-index the memory accesses so that $\bar{A}$ is dependent on only two loops? That is, we want to produce a new set of loops that looks more like

for w in W, x in X
    Āwx = Ā[w, x]
    for y in Y, z in Z
        Āwx += B[???] * C̄[???]
    end
    Ā[w, x] = Āwx
end

allowing a register-efficient tiled access pattern. If we can, what are the new ranges W, X, Y, and Z, and what are the new indices into B and $\bar{C}$ ? Can we develop a general algorithm?

Often, a human can look at a problem and reason their way to a solution without too much difficulty. However, our goal is to create a general algorithm to remove the need for expert human intervention. Ultimately, this should allow machine generation of optimal code even for challenging cases like reverse-mode automatic differentiation of expressions containing loops. Starting from a forward pass written with naive loops, it is an express goal that an AD tool like Enzyme + the new LoopVectorization will deliver performance which matches or beats the state of the art across a wide range of loops. That said, human reasoning is a useful starting point for building an intuition of the problem, which in turn can help point to general algorithms.

We want to set w = m + i and x = n + j, so W must iterate over the full range of values attained by m + i, i.e. w = 0:M+N-2, and x = 0:N+J-2. We might naively set the indices of B[i,j] to B[y,z]; what would this imply about the indices of $\bar{C}$ , and about the ranges Y and Z?

First, it's straightforward to find that we must have C̄[w-y, x-z], as m = w - i = w - y, and we can prove n similarly.

For the bounds on y, note that y = i, and 0 <= i <= I-1, thus 0 <= y <= I-1. However, we also have that 0 <= m <= M-1, therefore 0 <= w-y <= M-1, which means y <= w and y >= w - (M-1). Taking the intersection yields max(0, w - (M-1)) <= y <= min(I-1, w). We can apply the same argument for z to produce the bounds max(0, x - (N-1)) <= z <= min(J-1, x).

To treat this algorithmically, we represent the loop bounds via a system of inequalities:

[ 1  0  0  0    [ m         [ M-1
 -1  0  0  0      n    .<=     0
  0  1  0  0      i           N-1
  0 -1  0  0      j ]          0
  0  0  1  0                  I-1
  0  0 -1  0                   0
  0  0  0  1                  J-1
  0  0  0 -1 ]                 0 ]
# A * [m;n;i;j] <= b

and the indices with a matrix

[ 1 0 1 0   * [ m         [ m + i
  0 1 0 1       n    .==    n + j
  1 0 0 0       i           m
  0 1 0 0       j ]         n
  0 0 1 0                   i
  0 0 0 1 ]                 j     ]

# X * [m;n;i;j] == [index expressions...]

Letting the index matrix be X, we can then frame our objective as finding some invertible matrix K such that the first two rows of X*K are linearly independent single-element vectors, with 1 as the single element. We can permute the columns of K arbitrarily to simplify this to say X*K's first two rows should be hcat(I(2), 0). With this, we can insert $\textbf{I} = \textbf{KK}^{-1}$ to apply the transform. This also defines our new loop induction variables $\textbf{K}^{-1}\begin{bmatrix}m\\n\\i\\j\end{bmatrix}= \begin{bmatrix}w\\x\\y\\z\end{bmatrix}$ . Define $\textbf{w}$ as our new vector of induction variables, we can also express our new loop inequalities simply as $\textbf{AK}^{-1}\textbf{v} <= \textbf{b}$ .

(Note that $\textbf{K}^{-1}$ is a simple representation of a linear loop schedule – it is a linear function that gives the order in which iterations of the loops are evaluated. We will discuss affine schedules in much greater detail in a future post.)

The only remaining task is finding the matrix $K$ . As $K$ must be both an integer matrix and invertible, it is unimodular. As the first two rows of $\textbf{XK}$ correspond to the identity matrix, we know the first two rows of $X$ equal the first two rows of $\textbf{K}^{-1}$ .

From here, we realize that we can find such a matrix by modifying the algorithms typically used to bring matrices into reduced forms, such as reduced echelon form or the Hermite normal form. We can use column pivots and row additions (using integer multipliers), as these preserve the determinant and maintain classification as an integer matrix. We diagonalize one row at a time. When we encounter a row we cannot diagonalize, that means this index cannot produce a unimodular matrix in combination with the preceding indices, so we reject it and move on to the next index. In this way our matrix $\textbf{K}^{-1}$ can be a subset of up to size(X,2) rows of $\textbf{X}$ , not just the first two. The important characteristics here are that it allows us to choose which indices to prioritize – e.g. those that would enable register tiling – while still simplifying or maintaining simplicity in some other indices. If we run out of rows of $\textbf{X}$ , then the algorithm will still succeed, but then $\textbf{K}^{-1}$ will include rows not present in $\textbf{X}$ .

Inputting the loop bounds (as 0 <= i_0 <= M-1, 0 <= i_1 <= N-1, 0 <= i_2 <= O-1, and 0 <= i_3 <= P-1) and the indices, we get the following output post-transformation:

Loop 0 lower bounds:
i_0 >= 0
Loop 0 upper bounds:
i_0 <=  ( M + O - 2 )
Loop 1 lower bounds:
i_1 >= 0
Loop 1 upper bounds:
i_1 <=  ( N + P - 2 )
Loop 2 lower bounds:
i_2 >=  ( - M + 1 )  + i_0
i_2 >= 0
Loop 2 upper bounds:
i_2 <= i_0
i_2 <=  ( O - 1 )
Loop 3 lower bounds:
i_3 >=  ( - N + 1 )  + i_1
i_3 >= 0
Loop 3 upper bounds:
i_3 <= i_1
i_3 <=  ( P - 1 )

New ArrayReferences:
ArrayReference 0 (dim = 2):
{ Induction Variable: 0 }
 ( M + O - 1 )  * ({ Induction Variable: 1 })



ArrayReference 1 (dim = 2):
{ Induction Variable: 2 }
 ( O )  * ({ Induction Variable: 3 })



ArrayReference 2 (dim = 2):
{ Induction Variable: 0 } - { Induction Variable: 2 }
 ( M )  * ({ Induction Variable: 1 } - { Induction Variable: 3 })

This is the desired/expected outcome. We can now register-tile the loop.

Now let us move to a more artificial example, a (deliberately) badly-written matrix-multiply:

julia> using OffsetArrays

julia> function badmul!(C,A,B)
           M,N = size(C)
           K = size(B,1); fill!(C,0)
           for i in 0:M+N+K-3, l in max(0,i+1-N):min(M+K-2,i), j in max(0,l+1-K):min(M-1,l)
               C[j,i-l] += A[j,l-j]*B[l-j,i-l]
           end
           C
       end
badmul! (generic function with 1 method)

julia> M,K,N = 5,6,7
(5, 6, 7)

julia> A = OffsetArray(rand(M,K),-1,-1);

julia> B = OffsetArray(rand(K,N),-1,-1);

julia> C = OffsetArray(rand(M,N),-1,-1);

julia> badmul!(C,A,B)
5×7 OffsetArray(::Matrix{Float64}, 0:4, 0:6) with eltype Float64 with indices 0:4×0:6:
 1.9536   1.20026   2.20549  1.11528   1.77055  2.14641  1.75909
 1.54733  1.10697   1.6769   1.13867   1.06312  1.71708  1.65325
 1.76249  0.908232  1.63373  0.995246  1.19762  1.69865  1.71706
 1.46832  0.939424  1.78488  1.36437   1.09265  1.69105  1.35382
 1.27257  0.770909  1.14775  0.596495  0.92855  1.21593  1.26251

julia> parent(A)*parent(B)
5×7 Matrix{Float64}:
 1.9536   1.20026   2.20549  1.11528   1.77055  2.14641  1.75909
 1.54733  1.10697   1.6769   1.13867   1.06312  1.71708  1.65325
 1.76249  0.908232  1.63373  0.995246  1.19762  1.69865  1.71706
 1.46832  0.939424  1.78488  1.36437   1.09265  1.69105  1.35382
 1.27257  0.770909  1.14775  0.596495  0.92855  1.21593  1.26251

Entering the badmul! loopnest into the orthoganlizer produces:

Skewed loop nest:
Loop 0 lower bounds:
i_0 >= 0
Loop 0 upper bounds:
i_0 <=  ( M - 1 )
Loop 1 lower bounds:
i_1 >= 0
Loop 1 upper bounds:
i_1 <=  ( N - 1 )
Loop 2 lower bounds:
i_2 >= 0
Loop 2 upper bounds:
i_2 <=  ( O - 1 )

New ArrayReferences:
ArrayReference 0 (dim = 2):
{ Induction Variable: 0 }
 ( M )  * ({ Induction Variable: 1 })



ArrayReference 1 (dim = 2):
{ Induction Variable: 0 }
 ( O )  * ({ Induction Variable: 2 })



ArrayReference 2 (dim = 2):
{ Induction Variable: 2 }
 ( M )  * ({ Induction Variable: 1 })

We recover a perfectly-normal triple-loop matrix multiplication which can be straightforwardly optimized via register and cache tiling.

Once we get such a solution, we can test if it is satisfiable with respect to the dependency constraints. If so, we can add them as constraints before running the ILP solver to produce a schedule.