I would expect a developer to profile their application and find the real bottlenecks before applying something like this. For example, if they often run complex, multi-stage image processing pipelines, then offloading the entire pipeline to the GPU could give a significant speedup, since the data only has to cross the bus once in each direction.

In addition (iirc) CPU-GPU buses have got quite a bit faster in the last 5 years. They're still a large bottleneck, yes, but for expensive, highly parallel computations on small pieces of data the transfer doesn't completely dominate the total cost.
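To make the trade-off concrete, here's a rough back-of-the-envelope model. It's only a sketch: the function names are mine, and the image size, bus bandwidth and per-stage timings are made-up numbers rather than measurements of this framework or of any real hardware. It assumes the data is uploaded once, every stage runs on the GPU, and the result is downloaded once.

```python
def gpu_pipeline_time(data_bytes, bus_bytes_per_s, stage_gpu_times):
    """Estimated wall time: one upload, all stages on the GPU, one download."""
    transfer = 2 * data_bytes / bus_bytes_per_s  # there and back again
    return transfer + sum(stage_gpu_times)


def offload_pays_off(data_bytes, bus_bytes_per_s, stage_cpu_times, stage_gpu_times):
    """True if the modelled GPU pipeline beats running every stage on the CPU."""
    gpu_total = gpu_pipeline_time(data_bytes, bus_bytes_per_s, stage_gpu_times)
    return gpu_total < sum(stage_cpu_times)


# Hypothetical numbers: 8 MB image, 8 GB/s effective bus bandwidth,
# each stage takes 2 ms on the CPU and 0.2 ms on the GPU.
image = 8_000_000
bus = 8e9

one_stage = offload_pays_off(image, bus, [0.002], [0.0002])
five_stages = offload_pays_off(image, bus, [0.002] * 5, [0.0002] * 5)

print(one_stage)    # single stage: the transfer eats the gain
print(five_stages)  # five stages: the transfer is amortised
```

With these numbers the single-stage case loses (about 2.2 ms on the GPU against 2 ms on the CPU) while the five-stage pipeline wins (about 3 ms against 10 ms), which is the point: the transfer cost is fixed, so it matters less the more work you do per byte moved. Real numbers will obviously differ, so measure before deciding.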

EDIT:

I've also noticed that this framework uses OpenGL (ES) for its offloading. Given that, the computation could easily be offloaded to an integrated (i.e. non-discrete) GPU, which shares memory with the CPU and so could avoid most of the data movement cost.