2022-12-19

This post is motivated by this Discourse thread, which provided the following code:

using LinearAlgebra, MKL, BenchmarkTools, LoopVectorization

# pure-Julia three-argument dot product
product_1(x, M) = dot(x, M, x)

# BLAS matrix-vector multiply, then a dot product
function product_2(x, M)
    Mx = M * x
    return dot(x, Mx)
end

# hand-written loop exploiting symmetry via the upper triangle
function product_3(x, M)
    n = length(x)
    result = zero(eltype(x))

    @inbounds for i = 1:n
        result += x[i]^2 * M[i, i]

        @simd for j = 1:i-1
            result += 2 * x[i] * x[j] * M[j, i]
        end
    end

    return result
end

The author expressed surprise that making the matrix M symmetric wasn't always faster, and that product_3, which explicitly takes advantage of that symmetry, was the slowest.
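For reference, product_3 is built on the symmetric decomposition of the quadratic form: when $M_{ji} = M_{ij}$,

$$\begin{aligned} x^\top M x = \sum_{i=1}^{n} M_{ii} x_i^2 + 2\sum_{i=1}^{n}\sum_{j=1}^{i-1} M_{ji} x_i x_j, \end{aligned}$$

i.e. the diagonal terms plus twice the strict upper triangle.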

Setting a baseline:

BLAS.set_num_threads(1) # single threaded product_2
x = rand(200);
A = rand(length(x), 2+length(x)) |> y -> y*y';
B = Symmetric(A, :U);
@btime product_1($x,$A)
@btime product_2($x,$A)
@btime product_3($x,$A)
@btime product_1($x,$B)
@btime product_2($x,$B)
@btime product_3($x,$B)

product_1 and product_3 are implemented in "pure Julia", while product_2 calls out to the BLAS library for both the matrix-vector multiply and the vector-vector dot product.
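Roughly speaking, for the Symmetric argument product_2's two steps correspond to explicit BLAS calls like the following (a sketch for illustration only; the actual dispatch happens inside LinearAlgebra's mul! and dot):

# A sketch of what product_2 reduces to when given a Symmetric matrix:
function product_2_blas(x, B::Symmetric)
    Mx = BLAS.symv(B.uplo, parent(B), x) # dsymv: Mx = B * x
    return BLAS.dot(x, Mx)               # ddot:  dot(x, Mx)
end

Benchmarking all six combinations, I get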

julia> @btime product_1($x,$A)
3.662 μs (0 allocations: 0 bytes)
508214.49429146934

julia> @btime product_2($x,$A)
2.366 μs (1 allocation: 1.77 KiB)
508214.49429146945

julia> @btime product_3($x,$A)
5.368 μs (0 allocations: 0 bytes)
508214.49429146847

julia> @btime product_1($x,$B)
7.090 μs (0 allocations: 0 bytes)
508214.49429146847

julia> @btime product_2($x,$B)
1.734 μs (1 allocation: 1.77 KiB)
508214.49429146945

julia> @btime product_3($x,$B)
8.916 μs (0 allocations: 0 bytes)
508214.49429146847

julia> versioninfo()
Julia Version 1.10.0-DEV.175
Commit 0d469bd0f8* (2022-12-18 20:11 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
Threads: 36 on 36 virtual cores

My CPU is Cascade Lake, an Intel Skylake-X derivative featuring AVX512 and 1 MiB of L2 cache per core. MKL is winning, especially for symmetric matrices, despite the fact that we've limited it to a single thread, and that it needs to allocate and write memory. using MKL is encouraged for best performance when possible; the default library, OpenBLAS, is likely to be slower.

Coming up second best will not do, especially when there's such an obvious deficit in product_2 – the need to allocate and write memory.

Asymptotically, the performance is limited by the rate at which we can load the matrix from memory. Thus we should certainly see Symmetric winning, at least so long as we take advantage of the structure to avoid loading any element more than once.
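As a quick sanity check on the traffic involved (simple arithmetic, assuming every element is loaded from L2 exactly once per call):

# Expected per-call load volume for a 200x200 Float64 matrix:
N = 200
N^2 * sizeof(Float64)               # 320_000 bytes for a full dense traversal
(N * (N + 1) ÷ 2) * sizeof(Float64) # 160_800 bytes touching only one triangle

Over the 1_000_000 calls profiled below, those figures correspond to roughly 320 GB and 161 GB of L2 to L1 load volume, respectively; keep them in mind when reading the LIKWID data volumes later.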

Let's take a quick look using LIKWID to try to get a better idea of the performance. For convenience, I define

using LIKWID, DataFrames, PrettyTables

foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} =
    foreach(_ -> Base.donotdelete(f(arg1, args...)), 1:N)

function filtered_print(metrics)
    subsetdict = Dict(
        "FLOPS_DP" => ["Runtime (RDTSC) [s]", "Runtime unhalted [s]", "CPI", "DP [MFLOP/s]", "AVX512 DP [MFLOP/s]", "Packed [MUOPS/s]", "Scalar [MUOPS/s]", "Vectorization ratio"],
        "CYCLE_STALLS" => ["Total execution stalls", "Stalls caused by L1D misses [%]", "Stalls caused by L2 misses [%]", "Execution stall rate [%]", "Stalls caused by L1D misses rate [%]", "Stalls caused by L2 misses rate [%]", "Stalls caused by memory loads rate [%]"],
        "CACHES" => ["L2 to L1 load bandwidth [MBytes/s]", "L2 to L1 load data volume [GBytes]", "L1 to L2 evict bandwidth [MBytes/s]", "L1 to L2 evict data volume [GBytes]", "L1 to/from L2 bandwidth [MBytes/s]", "L1 to/from L2 data volume [GBytes]"],
        "BRANCH" => ["Branch misprediction rate", "Instructions per branch"],
    )
    ms = String[]
    t1vs = Float64[]
    for (k, vs) = metrics
        v = first(vs) # extract thread 1 only; FIXME in case of multithreading!
        submetrics = subsetdict[k]
        append!(ms, submetrics)
        for m = submetrics
            push!(t1vs, v[m])
        end
    end
    pretty_table(DataFrame(Metric = ms, var"Thread 1" = t1vs))
end

foreachf is there so that I can run a function many times. Here, I will keep N = 1_000_000 across profiles to keep results consistent. filtered_print gives more concise output. Let's compare product_1 with product_3 using the dense matrix; I manually filtered the results:

julia> met, evt = @perfmon ("FLOPS_DP", "CYCLE_STALLS", "CACHES", "BRANCH") foreachf(product_1, 1_000_000, x, A);

julia> filtered_print(met)
┌────────────────────────────────────────┬────────────┐
│                                 Metric │   Thread 1 │
│                                 String │    Float64 │
├────────────────────────────────────────┼────────────┤
│                    Runtime (RDTSC) [s] │    4.31409 │
│                   Runtime unhalted [s] │    4.09009 │
│                                    CPI │    0.50087 │
│                           DP [MFLOP/s] │    14603.8 │
│                    AVX512 DP [MFLOP/s] │    13769.3 │
│                       Packed [MUOPS/s] │     1760.9 │
│                       Scalar [MUOPS/s] │    754.986 │
│                    Vectorization ratio │    69.9913 │
│                 Total execution stalls │  2.76645e9 │
│        Stalls caused by L1D misses [%] │    83.0887 │
│         Stalls caused by L2 misses [%] │   0.625892 │
│               Execution stall rate [%] │    22.5896 │
│   Stalls caused by L1D misses rate [%] │    18.7694 │
│    Stalls caused by L2 misses rate [%] │   0.141387 │
│ Stalls caused by memory loads rate [%] │    27.0086 │
│     L2 to L1 load bandwidth [MBytes/s] │    64062.1 │
│     L2 to L1 load data volume [GBytes] │    275.821 │
│    L1 to L2 evict bandwidth [MBytes/s] │    77.0877 │
│    L1 to L2 evict data volume [GBytes] │   0.331903 │
│     L1 to/from L2 bandwidth [MBytes/s] │    64139.2 │
│     L1 to/from L2 data volume [GBytes] │    276.153 │
│              Branch misprediction rate │ 6.67933e-5 │
│                Instructions per branch │    7.20618 │
└────────────────────────────────────────┴────────────┘

julia> m, e = @perfmon ("FLOPS_DP", "CYCLE_STALLS", "CACHES", "BRANCH") foreachf(product_3, 1_000_000, x, A);

julia> filtered_print(m)
┌────────────────────────────────────────┬────────────┐
│                                 Metric │   Thread 1 │
│                                 String │    Float64 │
├────────────────────────────────────────┼────────────┤
│                    Runtime (RDTSC) [s] │    5.50317 │
│                   Runtime unhalted [s] │    5.21599 │
│                                    CPI │   0.607209 │
│                           DP [MFLOP/s] │    8882.12 │
│                    AVX512 DP [MFLOP/s] │    7275.45 │
│                       Packed [MUOPS/s] │    935.599 │
│                       Scalar [MUOPS/s] │    1554.33 │
│                    Vectorization ratio │    37.5753 │
│                 Total execution stalls │  3.94969e9 │
│        Stalls caused by L1D misses [%] │    41.0844 │
│         Stalls caused by L2 misses [%] │   0.956054 │
│               Execution stall rate [%] │    25.1506 │
│   Stalls caused by L1D misses rate [%] │     10.333 │
│    Stalls caused by L2 misses rate [%] │   0.240453 │
│ Stalls caused by memory loads rate [%] │    26.4968 │
│     L2 to L1 load bandwidth [MBytes/s] │    30344.9 │
│     L2 to L1 load data volume [GBytes] │    167.261 │
│    L1 to L2 evict bandwidth [MBytes/s] │    146.758 │
│    L1 to L2 evict data volume [GBytes] │   0.808929 │
│     L1 to/from L2 bandwidth [MBytes/s] │    30491.7 │
│     L1 to/from L2 data volume [GBytes] │     168.07 │
│              Branch misprediction rate │ 0.00199066 │
│                Instructions per branch │    6.42476 │
└────────────────────────────────────────┴────────────┘

Note that product_3 executed over 50% more scalar MUOPS/s than packed. Each scalar MUOP contributes far fewer FLOPs than a packed one, so the total DP [MFLOP/s] still wasn't much higher than the AVX512-only value. product_1 did much better in this respect. It also had more instructions per branch. However, it did require a much higher L2 to L1 load data volume, and thus also higher bandwidth, as we expected, because it didn't take advantage of symmetry.

CPI is cycles per instruction; the higher the value, the more clock cycles were spent per instruction retired. We also see that the execution stall rate was higher for product_3.

For comparison, here is product_2 with the Symmetric matrix:

┌────────────────────────────────────────┬─────────────┐
│                                 Metric │    Thread 1 │
│                                 String │     Float64 │
├────────────────────────────────────────┼─────────────┤
│                    Runtime (RDTSC) [s] │     2.10946 │
│                   Runtime unhalted [s] │     1.95101 │
│                                    CPI │     0.41146 │
│                           DP [MFLOP/s] │     30664.5 │
│                    AVX512 DP [MFLOP/s] │     30663.7 │
│                       Packed [MUOPS/s] │     3833.36 │
│                       Scalar [MUOPS/s] │         0.0 │
│                    Vectorization ratio │       100.0 │
│                 Total execution stalls │   7.93344e8 │
│        Stalls caused by L1D misses [%] │     73.9433 │
│         Stalls caused by L2 misses [%] │     13.8556 │
│               Execution stall rate [%] │     13.5926 │
│   Stalls caused by L1D misses rate [%] │     10.0508 │
│    Stalls caused by L2 misses rate [%] │     1.88334 │
│ Stalls caused by memory loads rate [%] │     14.0704 │
│     L2 to L1 load bandwidth [MBytes/s] │     74542.3 │
│     L2 to L1 load data volume [GBytes] │     156.499 │
│    L1 to L2 evict bandwidth [MBytes/s] │     1823.74 │
│    L1 to L2 evict data volume [GBytes] │     3.82889 │
│     L1 to/from L2 bandwidth [MBytes/s] │     76366.0 │
│     L1 to/from L2 data volume [GBytes] │     160.328 │
│              Branch misprediction rate │ 0.000151162 │
│                Instructions per branch │     19.2526 │
└────────────────────────────────────────┴─────────────┘

Note the vectorization ratio of 100.0: zero scalar MUOPS, and the total DP [MFLOP/s] only slightly above the AVX512-only value, meaning nearly every operation was a 512-bit one. The CPI was also low. product_2 hit over 30 GFLOP/s (30664.5 MFLOP/s). We see much better use of execution resources (i.e. vectorization) and far fewer branches; the execution stall rate and CPI were lower as well.

While LoopModels will be able to handle symmetry and arbitrary affine loop nests, LoopVectorization.jl is unfortunately limited to rectangular loop nests, so it cannot handle the symmetric case. I also plan on adding cache tiling to LoopModels; LoopVectorization.jl does not support that either, so performance likely will not scale to larger sizes, nor does it consider things like alignment. But a 200x200 matrix fits neatly in cache, and has column sizes that are an integer multiple of 64 bytes, so none of this matters here. With that in mind, here is a dense @turbo implementation:

function dotturbo(x, A)
    s = zero(promote_type(eltype(x), eltype(A)))
    @turbo for n ∈ axes(A, 2)
        t = zero(s)
        for m ∈ axes(A, 1)
            t += x[m] * A[m, n]
        end
        s += t * x[n]
    end
    s
end

Note that we manually implemented an optimization also used in LinearAlgebra's dot: we hoisted the multiplication by x[n] out of the inner loop. It's my intention to automate this as well, but LoopVectorization.jl's transforms are fairly limited, and my performance development time has of course moved elsewhere.
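For contrast, a hypothetical non-hoisted variant (dotturbo_nohoist is a name introduced here purely for illustration) would pay for an extra multiply on every inner iteration:

# Hypothetical variant without the hoist, for comparison only:
function dotturbo_nohoist(x, A)
    s = zero(promote_type(eltype(x), eltype(A)))
    @turbo for n ∈ axes(A, 2), m ∈ axes(A, 1)
        s += x[m] * A[m, n] * x[n] # multiplies by x[n] on every iteration
    end
    s
end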

For dotturbo itself, I get

julia> @btime dotturbo($x,$A)
1.875 μs (0 allocations: 0 bytes)
508214.49429146945

At least we're now winning the dense case. And thanks to the occasional GC activity triggered by product_2's memory allocations, we're not too far behind the Symmetric case in average performance:

julia> @benchmark dotturbo($x,$A)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max):  1.869 μs …  6.652 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.887 μs              ┊ GC (median):    0.00%
Time  (mean ± σ):   1.892 μs ± 56.790 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▄█▇▅▃
▂▂▃▆██████▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▂▂▁▁▁▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
1.87 μs        Histogram: frequency by time        2.03 μs <

Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark product_2($x,$B)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max):  1.733 μs … 107.171 μs  ┊ GC (min … max): 0.00% … 93.21%
Time  (median):     1.768 μs               ┊ GC (median):    0.00%
Time  (mean ± σ):   1.843 μs ±   1.972 μs  ┊ GC (mean ± σ):  2.04% ±  1.88%

▁█▇▁
▁▃████▅▃▃▂▂▂▂▂▂▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
1.73 μs         Histogram: frequency by time        2.12 μs <

Memory estimate: 1.77 KiB, allocs estimate: 1.

1.892 vs 1.843 microseconds.

The LIKWID results of dotturbo:

┌────────────────────────────────────────┬────────────┐
│                                 Metric │   Thread 1 │
│                                 String │    Float64 │
├────────────────────────────────────────┼────────────┤
│                    Runtime (RDTSC) [s] │    2.03067 │
│                   Runtime unhalted [s] │    1.92501 │
│                                    CPI │   0.620848 │
│                           DP [MFLOP/s] │    29055.0 │
│                    AVX512 DP [MFLOP/s] │    28545.1 │
│                       Packed [MUOPS/s] │    3769.91 │
│                       Scalar [MUOPS/s] │   0.844273 │
│                    Vectorization ratio │    99.9776 │
│                 Total execution stalls │  9.37384e8 │
│        Stalls caused by L1D misses [%] │    116.368 │
│         Stalls caused by L2 misses [%] │    1.49769 │
│               Execution stall rate [%] │    16.2874 │
│   Stalls caused by L1D misses rate [%] │    18.9533 │
│    Stalls caused by L2 misses rate [%] │   0.243935 │
│ Stalls caused by memory loads rate [%] │    19.4328 │
│     L2 to L1 load bandwidth [MBytes/s] │  1.37391e5 │
│     L2 to L1 load data volume [GBytes] │    277.845 │
│    L1 to L2 evict bandwidth [MBytes/s] │    406.024 │
│    L1 to L2 evict data volume [GBytes] │     0.8211 │
│     L1 to/from L2 bandwidth [MBytes/s] │  1.37797e5 │
│     L1 to/from L2 data volume [GBytes] │    278.666 │
│              Branch misprediction rate │ 9.64224e-5 │
│                Instructions per branch │    12.6511 │
└────────────────────────────────────────┴────────────┘

The timing reported here is again similar, but the CPI is much higher; it seems that, despite being dense, we nonetheless needed fewer instructions, which isn't surprising, as LoopVectorization.jl's objective function tries to minimize MUOP count. Being dense, the bandwidth needs were higher.

Attaining comparable performance here smells like blood in the water: let's take advantage of symmetry to win this benchmark.

Let's start with product_3 and apply one transform at a time, discussing its problems as we go. First, we can hoist a lot of arithmetic out of the inner loop: the multiplications by 2 and by x[i].

function product_4(x, _M)
    n = length(x)
    result_diag = zero(eltype(x))
    result_supd = zero(eltype(x))
    M = _M isa Symmetric ? parent(_M) : _M
    Base.require_one_based_indexing(x)
    Base.require_one_based_indexing(M)
    @inbounds for i = 1:n
        result_diag += x[i]^2 * M[i, i]
        result_innr = zero(eltype(x))
        @simd for j = 1:i-1
            result_innr += x[j] * M[j, i]
        end
        result_supd += x[i] * result_innr
    end
    return 2result_supd + result_diag
end

yielding

julia> @btime product_4($x,$B)
3.977 μs (0 allocations: 0 bytes)
508214.49429146934

Better; now we're more in line with product_1. If our math is slow, maybe we just need to ask the compiler to make it fast?

@fastmath function product_5(x, _M) # only change is @fastmath
    n = length(x)
    result_diag = zero(eltype(x))
    result_supd = zero(eltype(x))
    M = _M isa Symmetric ? parent(_M) : _M
    Base.require_one_based_indexing(x)
    Base.require_one_based_indexing(M)
    @inbounds for i = 1:n
        result_diag += x[i]^2 * M[i, i]
        result_innr = zero(eltype(x))
        @simd for j = 1:i-1
            result_innr += x[j] * M[j, i]
        end
        result_supd += x[i] * result_innr
    end
    return 2result_supd + result_diag
end

so we're now beating product_1's time on the dense array:

julia> @btime product_5($x,$B)
3.503 μs (0 allocations: 0 bytes)
508214.49429146934

But we're still a far cry from @turbo. Unfortunately, because LoopVectorization.jl only supports rectangular loops, we're limited to applying it to the innermost loop, hamstringing it:

@fastmath function product_6(x, _M) # change: @turbo on the inner loop
    n = length(x)
    n == 0 && return zero(eltype(x))
    M = _M isa Symmetric ? parent(_M) : _M
    Base.require_one_based_indexing(x)
    Base.require_one_based_indexing(M)
    result_diag = x[1]^2 * M[1, 1]
    result_supd = zero(eltype(x))
    # @turbo doesn't support length-0 iteration,
    # so we need to make sure length(1:i-1) > 0
    @inbounds for i = 2:n
        result_diag += x[i]^2 * M[i, i]
        result_innr = zero(eltype(x))
        @turbo for j = 1:i-1
            result_innr += x[j] * M[j, i]
        end
        result_supd += x[i] * result_innr
    end
    return 2result_supd + result_diag
end

yet

julia> @btime product_6($x,$B)
1.795 μs (0 allocations: 0 bytes)
508214.4942914694

julia> @benchmark product_6($x,$B)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max):  1.663 μs …  5.431 μs  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.680 μs              ┊ GC (median):    0.00%
Time  (mean ± σ):   1.717 μs ± 73.691 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

▂▇█▅▁                               ▁
▃█████▅▅▃▂▂▂▂▂▂▁▁▁▂▁▁▂▁▁▂▁▂▂▂▁▁▁▁▂▂▂▄█▇▆▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
1.66 μs        Histogram: frequency by time        1.88 μs <

Memory estimate: 0 bytes, allocs estimate: 0.

I'm not sure why the minimum time from @btime was so much higher than the mean time from @benchmark.

But we're now at least in clear contention with MKL's symv!+dot. So, if we could only apply @turbo to the inner loop, why did it help so much? LLVM is really bad at producing code for architectures with big vectors, and AVX512 in particular. On a Zen3 CPU with AVX2, I got

julia> @btime product_5($x,$B)
2.537 μs (0 allocations: 0 bytes)
502848.6107780409

julia> @btime product_6($x,$B)
2.063 μs (0 allocations: 0 bytes)
502848.61077804095

while on an Apple M1, I got

julia> @btime product_5($x,$B)
3.604 μs (0 allocations: 0 bytes)
580770.6876433745

julia> @btime product_6($x,$B)
3.521 μs (0 allocations: 0 bytes)
580770.6876433745

so the performance advantage shrinks as the vector width decreases. This is mostly because LLVM doesn't handle loop remainders efficiently: the larger the vectors, the larger the remainders on average, and the worse LLVM's performance.
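As a rough illustration (remainder_fraction is a helper introduced here just for this estimate), we can count the fraction of the triangular nest's inner iterations that land in a vector remainder for a given width W:

# Fraction of iterations j = 1:i-1, i = 2:N left over after vectorizing by W:
remainder_fraction(N, W) = sum(i -> i % W, 1:N-1) / sum(1:N-1)
remainder_fraction(200, 4) # ≈ 0.015 with AVX2   (W = 4)
remainder_fraction(200, 8) # ≈ 0.035 with AVX512 (W = 8)

The fraction more than doubles going from AVX2 to AVX512, and LLVM's scalar remainder code is much slower per iteration than the vector body, so wider vectors amplify the penalty.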

So, what's left, what's missing? What would @turbo on the outer loop do differently? For one thing, reducing vectors to scalars is expensive, so @turbo will only do so once at the end, while our code does it on every iteration of the outer loop. How can we avoid that?

We can tell @turbo we want to accumulate into a SIMD vector instead of a scalar by defining our accumulator to be a VectorizationBase.Vec instead of a scalar. Then, we just have to accumulate the vector into a scalar ourselves before returning the value:

using VectorizationBase
@fastmath function product_7(x, _M) # change: Vec accumulators
    n = length(x)
    T = eltype(x)
    n == 0 && return zero(T)
    M = _M isa Symmetric ? parent(_M) : _M
    Base.require_one_based_indexing(x)
    Base.require_one_based_indexing(M)
    result_diag = Vec(zero(T))
    @turbo for i = 1:n
        result_diag += x[i]^2 * M[i, i]
    end
    result_supd = Vec(zero(T))
    # @turbo doesn't support length-0 iteration,
    # so we need to make sure length(1:i-1) > 0
    @inbounds for i = 2:n
        result_innr = Vec(zero(T))
        @turbo for j = 1:i-1
            result_innr += x[j] * M[j, i]
        end
        result_supd += x[i] * result_innr
    end
    return VectorizationBase.vsum(2result_supd + result_diag)
end

Now I get:

julia> @btime product_7($x,$B)
1.354 μs (0 allocations: 0 bytes)
508214.4942914695

Well ahead of any other implementation. The LIKWID results look quite good, especially the CPI:

┌────────────────────────────────────────┬─────────────┐
│                                 Metric │    Thread 1 │
│                                 String │     Float64 │
├────────────────────────────────────────┼─────────────┤
│                    Runtime (RDTSC) [s] │     1.59119 │
│                   Runtime unhalted [s] │     1.50887 │
│                                    CPI │    0.321119 │
│                           DP [MFLOP/s] │     22122.9 │
│                    AVX512 DP [MFLOP/s] │     22121.3 │
│                       Packed [MUOPS/s] │      2765.7 │
│                       Scalar [MUOPS/s] │    0.538865 │
│                    Vectorization ratio │     99.9805 │
│                 Total execution stalls │   4.70885e8 │
│        Stalls caused by L1D misses [%] │     98.6231 │
│         Stalls caused by L2 misses [%] │     3.33533 │
│               Execution stall rate [%] │     10.3786 │
│   Stalls caused by L1D misses rate [%] │     10.2357 │
│    Stalls caused by L2 misses rate [%] │     0.34616 │
│ Stalls caused by memory loads rate [%] │     11.8953 │
│     L2 to L1 load bandwidth [MBytes/s] │     99520.2 │
│     L2 to L1 load data volume [GBytes] │     158.482 │
│    L1 to L2 evict bandwidth [MBytes/s] │     512.454 │
│    L1 to L2 evict data volume [GBytes] │    0.816064 │
│     L1 to/from L2 bandwidth [MBytes/s] │   1.00033e5 │
│     L1 to/from L2 data volume [GBytes] │     159.298 │
│              Branch misprediction rate │ 0.000174263 │
│                Instructions per branch │     9.08776 │
└────────────────────────────────────────┴─────────────┘

With a CPI of $<\frac{1}{3}$, we averaged over 3 instructions per clock cycle! Our vectorization ratio was also excellent, of course.

Now that we have the fastest time, are we done?

No; as hinted above, there's another important optimization @turbo applies in the dense case: it cuts down on the x[j] reloads in the inner j loop by unrolling the outer i loop. That is, if we unroll the i loop by 8, we can reuse each x[j] load 8 times (once per column M[j,i]) before discarding it and moving on to the next inner-loop iteration. In this way, we need $\frac{1}{8}$ as many loads from $x$ in the inner loop.

function product_8(x, _M)
    N = length(x)
    T = eltype(x)
    N == 0 && return zero(T)
    M = _M isa Symmetric ? parent(_M) : _M
    Base.require_one_based_indexing(x)
    Base.require_one_based_indexing(M)
    # we unroll the outer loop by 8
    remainder = N & 7
    n = remainder
    if n != N # guard
        # masks for the 8x8 diagonal block: lane j of mask_i is 1 below the
        # diagonal (j < i), 1/2 on it (j == i, so the final doubling counts it
        # exactly once), and 0 above it. NOTE: this definition is an assumed
        # reconstruction; it was missing from the code as originally posted.
        Base.Cartesian.@nexprs 8 i ->
            mask_i = Vec(ntuple(j -> T(j < i ? 1.0 : j == i ? 0.5 : 0.0), Val(8))...)
        # 4 accumulators; the 8 per-column sums a_i get folded into these below
        Base.Cartesian.@nexprs 4 i -> acc_i = Vec(zero(T))
        while true
            Base.Cartesian.@nexprs 8 i -> a_i = Vec(zero(T))
            # columns i = n+1:n+8
            if n != 0
                # we'll loop normally up through row n
                @turbo for j = 1:n
                    Base.Cartesian.@nexprs 8 i -> begin
                        a_i += x[j] * M[j, i+n]
                    end
                end
            end
            # 8x8 diagonal block
            @turbo for j = 1:8
                Base.Cartesian.@nexprs 8 i -> begin
                    a_i += x[j+n] * (M[j+n, i+n] * mask_i)
                end
            end
            Base.Cartesian.@nexprs 4 i -> acc_i += x[i+n] * a_i
            Base.Cartesian.@nexprs 4 i -> acc_i += x[4+i+n] * a_{4+i}
            n += 8
            n == N && break
        end # while true
        acc_1 += acc_3
        acc_2 += acc_4
        acc_1 += acc_2
        ret = 2 * VectorizationBase.vsum(acc_1)
        remainder == 0 && return ret
    else
        ret = zero(T)
    end
    # we know remainder != 0, because N == 0 returned early
    r = Base.oneto(remainder)
    # product_5 seemed fast for small inputs
    return ret + @views product_5(x[r], M[r, r])
end

Now performance is starting to look pretty good:

julia> @btime product_8($x,$B)
987.308 ns (0 allocations: 0 bytes)
508214.49429146945

We're now under 1 microsecond, leaving all the competition well behind, so I think I'll call it quits for today.

LIKWID results:

┌────────────────────────────────────────┬─────────────┐
│                                 Metric │    Thread 1 │
│                                 String │     Float64 │
├────────────────────────────────────────┼─────────────┤
│                    Runtime (RDTSC) [s] │     1.16956 │
│                   Runtime unhalted [s] │     1.10838 │
│                                    CPI │    0.437218 │
│                           DP [MFLOP/s] │     28353.2 │
│                    AVX512 DP [MFLOP/s] │     28350.3 │
│                       Packed [MUOPS/s] │     3544.51 │
│                       Scalar [MUOPS/s] │     1.46549 │
│                    Vectorization ratio │     99.9587 │
│                 Total execution stalls │    3.9451e8 │
│        Stalls caused by L1D misses [%] │     111.613 │
│         Stalls caused by L2 misses [%] │     6.21932 │
│               Execution stall rate [%] │     11.7638 │
│   Stalls caused by L1D misses rate [%] │     13.1299 │
│    Stalls caused by L2 misses rate [%] │    0.731625 │
│ Stalls caused by memory loads rate [%] │     13.9601 │
│     L2 to L1 load bandwidth [MBytes/s] │    125761.0 │
│     L2 to L1 load data volume [GBytes] │     147.678 │
│    L1 to L2 evict bandwidth [MBytes/s] │     633.472 │
│    L1 to L2 evict data volume [GBytes] │    0.743871 │
│     L1 to/from L2 bandwidth [MBytes/s] │   1.26395e5 │
│     L1 to/from L2 data volume [GBytes] │     148.422 │
│              Branch misprediction rate │ 0.000247131 │
│                Instructions per branch │     11.7534 │
└────────────────────────────────────────┴─────────────┘

While we lost our impressive CPI (though still over 2 IPC), this implementation required little more than half as many CPU instructions as any of the other implementations. Without unrolling, each inner-loop iteration required a load from x, a load from A, and an fma instruction. On a CPU such as this one, capable of 2 aligned loads and 2 fma (fused multiply-add) instructions per clock cycle, removing most of the x[j] reloads from the inner loop thus lifts a bottleneck from the innermost loop. Because that bottleneck existed in the form of a huge number of unnecessary instructions, removing them doesn't show up as improved CPI, but simply as a far lower instruction count. The instruction counts per call were about $7605$, $14097$, and $14225$ for product_8, product_7, and product_2, respectively.
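As a rough sanity check on those counts (a sketch using the same cost model as the alignment discussion below; per-loop setup and remainder code are ignored, which is why the absolute numbers undershoot):

# Back-of-envelope vector-op counts for the triangular inner loops, W = 8:
N, W = 200, 8
fmas    = cld(N * (N + 1) ÷ 2, W) # ≈ 2513 vector fmas either way
A_loads = fmas                    # one vector load from A per fma
x_loads_plain    = fmas           # product_7: reload x on every iteration
x_loads_unrolled = cld(N, W)^2    # product_8: each x load reused 8 times, ≈ 625
fmas + A_loads + x_loads_plain    # ≈ 7539
fmas + A_loads + x_loads_unrolled # ≈ 5651

The measured gap is larger than this estimate because product_7 also pays loop setup and remainder costs on every one of its N - 1 inner loops, while product_8 pays them only once per block of 8 columns.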

I used small arrays that fit into this CPU's large 1 MiB L2 cache (which could fit a 360x360 matrix and vector), so that we weren't limited by memory bandwidth and could focus on interesting optimizations. As the arrays grow larger, it turns into a problem of streaming the matrix through memory, where there isn't really anything that can be done, aside from multithreading (first to split up the problem among many cores' L2 caches, and ultimately to take advantage of the higher memory bandwidth multiple cores have access to versus a single core).
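A minimal sketch of what that multithreading could look like (hypothetical and not benchmarked here; product_threaded is a name introduced for illustration, and the naive column chunking is load-imbalanced because early columns have shorter inner loops):

using Base.Threads
# Hypothetical threaded version of product_4's upper-triangle loop:
# each task accumulates privately, then the partial results are summed.
function product_threaded(x, M)
    n = length(x)
    T = eltype(x)
    n == 0 && return zero(T)
    tasks = map(Iterators.partition(1:n, cld(n, nthreads()))) do cols
        @spawn begin
            diag = zero(T)
            supd = zero(T)
            @inbounds for i in cols
                diag += x[i]^2 * M[i, i]
                innr = zero(T)
                @simd for j = 1:i-1
                    innr += x[j] * M[j, i]
                end
                supd += x[i] * innr
            end
            2supd + diag
        end
    end
    return sum(fetch, tasks)
end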

Another important note is that this implementation optimistically assumes that stride(A,2)%8==0.
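That assumption does hold for our 200x200 test matrix:

julia> stride(A, 2) % 8
0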

If we want to try to address alignment, we could consider taking product_7 and peeling off enough initial iterations of the inner loop to align it. Not crossing 64-byte boundaries with loads from A will make them cheaper, but at the cost of reloading x more often. Given aligned loads and a vector width of 8, we could estimate the cost of this as roughly $\left(\frac{N}{8}\right)\left(\frac{N}{8}\right) + \frac{N(N+1)}{2\cdot 8} + N = \frac{N^2}{64} + \frac{N(N+1)}{16} + N$. The $\left(\frac{N}{8}\right)\left(\frac{N}{8}\right)$ reflects the $\frac{N}{8}$ outer-loop iterations times the $\frac{N}{W}=\frac{N}{8}$ loads from $x$ per inner loop. The $\frac{N(N+1)}{2}$ is the number of elements we load from $A$, and the $\frac{1}{8}$ again divides by $W$. The final $+N$ counts the x[i] loads, insignificant thanks to the smaller exponent.

The base pointer of a large array, e.g. A, will normally be 64 byte aligned.

julia> Int(pointer(A))%64
0

Assuming this, the worst case scenario for alignment would be that isodd(stride(A,2)%8). Then, with AVX512, only 1 out of 8 columns will be aligned.
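We can demonstrate this with a hypothetical 201-row matrix (A2 is a throwaway example; 201 is odd, so, assuming the base pointer is 64-byte aligned as above, successive column start addresses cycle through every 8-byte offset within a cache line):

A2 = rand(201, 8) # odd stride(A2, 2)
[Int(pointer(A2, 1 + (k - 1) * stride(A2, 2))) % 64 for k = 1:8]
# [0, 8, 16, 24, 32, 40, 48, 56]: only one column in eight is 64-byte aligned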

With that, the cost of loading from A becomes

$$\begin{aligned} \left(\frac{N(N+1)}{16}\right) \left(\frac{2\cdot 7}{8}\right) + \left(\frac{N(N+1)}{16}\right) \left(\frac{1}{8}\right) = \frac{15N(N+1)}{128} \end{aligned}$$

That is, loading from $A$ becomes $\frac{15}{8}$ times more expensive. We have an increase in cost of $\frac{7N(N+1)}{128}$.

If we try to address this by peeling loop iterations at the start of each column to align the loads from $A$, we run into one of two complications. Either

1. We can no longer unroll the outer loop, increasing the cost of reloading $x$ in the inner loop from $\frac{N^2}{64}$ to $\frac{N^2}{8}$, an increase in cost of $\frac{16N^2}{128} - \frac{2N^2}{128} = \frac{14N^2}{128}$. That is roughly twice the $\frac{7N(N+1)}{128}$ increase the unrolled product_8 suffers from misalignment, so under this model, peeling without unrolling loses to simply tolerating the misaligned loads.

2. Still unroll, and use shuffle/permute instructions to line up the loads of x correctly with the loads from A, which may now all be offset in order to align them. This approach would require generating different specializations for the different offset patterns, each with its own fixed shuffle sequence. As we're dominated by loads, the shuffles may be essentially free, i.e. able to execute concurrently, making this likely the fastest, but most complicated, approach. That is especially true on Zen4, which has fairly fast shuffles but slow loads, and on consumer AVX512 Intel CPUs, which can use port 5 for shuffle instructions but not for floating-point arithmetic, again making shuffles more or less free to execute alongside the loads and floating-point work.

P.S. A goal of LoopModels is for @turbo to generate code as fast as or faster than the above from code as simple as

@turbo function dotturbo(x, A)
    s = zero(Base.promote_eltype(x, A))
    @assert eachindex(x) == axes(A, 1) == axes(A, 2)
    for i = eachindex(x), j = eachindex(x)
        s += x[i] * A[i, j] * x[j]
    end
    s
end

when A isa Symmetric, and of course also doing the right thing when A isa Adjoint, or a plain A isa Matrix. Our code should express our intent, simply and generically. The compiler should figure out how to do the right thing for the types.