With respect to the pretraining data, it's true that we're probably SOL there in terms of verification. But for fine-tuning, they could still publish the dataset so that others can both reproduce their results and audit for contamination.
If we're comparing benchmark deltas between fine-tuned variants that share the same base model, publishing the fine-tuning data seems like the bare minimum that should accompany performance claims.
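For what a third-party contamination audit could look like in the simplest case, here's a naive sketch using word-level n-gram overlap between a fine-tuning example and a benchmark item (the function names and toy strings are mine, and a real audit would normalize text and scan full datasets):

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(train_example, benchmark_item, n=8, threshold=1):
    """Flag a training example sharing >= `threshold` n-grams with a benchmark item."""
    overlap = ngrams(train_example, n) & ngrams(benchmark_item, n)
    return len(overlap) >= threshold

# Toy usage: a fine-tuning example that copies a benchmark question verbatim.
bench = "What is the capital of France? Answer with the city name only."
train = "Q: What is the capital of France? Answer with the city name only. A: Paris"
print(contaminated(train, bench))  # True: verbatim overlap with the benchmark
```

Obviously this misses paraphrased contamination, but even this crude check is impossible without the published dataset, which is the point.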
Even if I grant that their subscription product has some secret sauce they want to keep close to their chest (ignoring for a moment that their paid product is GPT-4 based), not doing the same for the models they release to the open source community free of charge under a commercially permissive license seems suspect.
I realize this sort of open source contribution is mostly for marketing purposes, but I think skepticism of the performance claims is still warranted nonetheless.