I'd expect this to behave quite differently to cuFFT: the transforms are likely to be small (either length-8 1D FFTs or 8x8 2D FFTs), so synchronisation overhead is likely to dominate if one were to try to parallelise within a single transform (other than via SIMD). However, the small size does mean that each transform can be written out in full to have "perfect" data transfer and branching behaviour, so the work parallelises well at JPEG's natural granularity (the 8x8 pixel blocks).
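To make the "written out" point concrete, here is a minimal sketch in Python of a length-8 FFT as straight-line code, assuming a plain radix-2 decimation-in-time layout. The names `fft8`, `fft8x8` and `transform_blocks` are hypothetical and only for illustration; this is not how cuFFT or any particular JPEG codec does it. On a GPU the same idea would presumably map one block to one thread (or a small group of lanes), but that mapping is an assumption, not something shown here.

```python
import math

# Twiddle factors for N = 8 are known constants, so nothing is computed at run time.
_R = math.sqrt(0.5)
_W1 = complex(_R, -_R)    # exp(-2*pi*i * 1/8)
_W3 = complex(-_R, -_R)   # exp(-2*pi*i * 3/8)


def fft8(x):
    """Length-8 FFT as straight-line code: no loops, no data-dependent branches."""
    x0, x1, x2, x3, x4, x5, x6, x7 = x

    # 4-point FFT of the even-indexed samples (x0, x2, x4, x6).
    a0 = x0 + x4; a1 = x0 - x4; a2 = x2 + x6; a3 = x2 - x6
    e0 = a0 + a2; e1 = a1 - 1j * a3; e2 = a0 - a2; e3 = a1 + 1j * a3

    # 4-point FFT of the odd-indexed samples (x1, x3, x5, x7).
    b0 = x1 + x5; b1 = x1 - x5; b2 = x3 + x7; b3 = x3 - x7
    o0 = b0 + b2; o1 = b1 - 1j * b3; o2 = b0 - b2; o3 = b1 + 1j * b3

    # Final butterflies: X[k] = E[k] + W^k O[k], X[k+4] = E[k] - W^k O[k].
    t1 = _W1 * o1
    t2 = -1j * o2
    t3 = _W3 * o3
    return [e0 + o0, e1 + t1, e2 + t2, e3 + t3,
            e0 - o0, e1 - t1, e2 - t2, e3 - t3]


def fft8x8(block):
    """2D FFT of one 8x8 block: transform the rows, then the columns."""
    rows = [fft8(r) for r in block]
    cols = [fft8(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major order


def transform_blocks(blocks):
    """Each block is independent, so this map is where the parallelism would go."""
    return [fft8x8(b) for b in blocks]
```

Every call does exactly the same sequence of operations on a fixed set of operands regardless of the input values, and no block needs data from any other block, so there is nothing to synchronise between them.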