Why would you think in sets of ten, when there should actually just be one pattern in 15? Then it just becomes a challenge to arrange your code to work on these blocks of 15.
We could probably code 16 versions of the block of 15 code that repeat and are nicely aligned for SIMD.
In my mind a SIMD version would work by predefining a string buffer 15 elements long with Fizz in positions 0,3,6,... and Buzz in positions 0,5,10. These are constants that don't ever change. Then vectorized code would only write to positions that change, the numbers: 1,2,4,7,8,11,13,14. Most of the time these positions would have fixed width too (lengths of large numbers don't change often) so you can use constant write addresses. So 8 SIMD threads could handle a block, and write everything blockwise.
I was thinking blocks of 10 because I can itoa(the number of tens) and then concat with the subset of 0, 1, 2, 3, 4, etc. that I care about. I guess with blocks of 15 you just need to do two itoas, and worry about two types of block vs 3.
We could probably code 16 versions of the block of 15 code that repeat and are nicely aligned for SIMD.