Short stride support
Branch short-spread-stride has a version of ElemSpread that uses a short stride equal to the size of the mask needed for one pack of elements. For example, the stride is 64 when working with bitblocks of 8-bit elements on AVX-512.
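For reference, the stride arithmetic can be sketched as follows. The helper name shortStride is hypothetical and is not taken from the branch, and the 128-bit case assumes a 128-bit bitblock on ARM NEON.

```cpp
#include <cstddef>

// Hypothetical helper (not Parabix API): the number of elements in one pack,
// which is also the number of mask bits needed for one expand operation.
constexpr std::size_t shortStride(std::size_t blockWidthBits, std::size_t fieldWidthBits) {
    return blockWidthBits / fieldWidthBits;
}

// 512-bit bitblock, 8-bit elements: 64 mask bits per pack.
static_assert(shortStride(512, 8) == 64, "AVX-512 with 8-bit elements");
// Assumed 128-bit bitblock, 8-bit elements: 16 mask bits per pack.
static_assert(shortStride(128, 8) == 16, "128-bit block with 8-bit elements");
```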
The idea is to be able to make progress whenever enough data for a full mvmd_expand operation is available.
The popcount attribute does not work for this stride length, so I have implemented the rate on the source input stream as BoundedRate(0, 1) and set the processed item count explicitly. It would be good to update popcount rates to support this case.
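To make the intended rate concrete, here is a rough model of the item counts, assuming a 64-bit stride and assuming each stride consumes as many source items as there are 1 bits in its mask word. The names StrideCounts and simulateStrides are hypothetical; this is plain C++ and not the kernel code from the branch.

```cpp
#include <cstdint>
#include <vector>

// Cumulative item counts after a stride (hypothetical model, not Parabix API).
struct StrideCounts {
    uint64_t maskProcessed;    // always advances by the stride (64)
    uint64_t sourceProcessed;  // advances by the popcount of the mask word
};

// Portable population count of a 64-bit word.
inline unsigned popcount64(uint64_t x) {
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// One mask word per stride. The source advances by between 0 and 64 items
// per 64 mask items, which is what a bounded rate of (0, 1) expresses.
inline std::vector<StrideCounts> simulateStrides(const std::vector<uint64_t>& maskWords) {
    std::vector<StrideCounts> counts;
    uint64_t maskDone = 0;
    uint64_t sourceDone = 0;
    for (uint64_t word : maskWords) {
        maskDone += 64;
        sourceDone += popcount64(word);
        counts.push_back({maskDone, sourceDone});
    }
    return counts;
}
```

In this model a mask word of all ones consumes a full pack of 64 source items, while a mask word of zero consumes none.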
The following two commands both produce correct results on ARM and AVX-512.
bin/nfd --ByteMerging ../QA/Normalization/NF-source --short-strides -UnalignedLoads > nfs.nfd2
bin/nfd --ByteMerging ../QA/Normalization/NF-source --short-strides >nfs.nfd3
However, both modes segfault sporadically with larger files. Are there limitations in the pipeline infrastructure that could cause this? One thing I have observed is that there are sometimes a large number of strides in which the ProcessedItemCount on the source does not advance; this happens when the mask stream has a long run of zeroes.
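A small diagnostic along these lines might help correlate the crashes with zero runs in the mask. It is only a sketch: longestStalledRun is a hypothetical name, it assumes a 64-bit stride with the mask already available as one 64-bit word per stride, and it treats a stride as stalled when its mask word is entirely zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical diagnostic (not Parabix API): length of the longest run of
// consecutive strides whose mask word is all zero, i.e. strides in which
// the source processed item count would not advance.
inline std::size_t longestStalledRun(const std::vector<uint64_t>& maskWords) {
    std::size_t best = 0;
    std::size_t current = 0;
    for (uint64_t word : maskWords) {
        current = (word == 0) ? current + 1 : 0;
        if (current > best) {
            best = current;
        }
    }
    return best;
}
```

Comparing this value for inputs that crash against inputs that do not would show whether long stalls are actually associated with the segfaults.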