With respect to the pretraining data, it's true that we're probably SOL there in terms of verification. But for fine-tuning, they could still publish the dataset so that others can both reproduce their results and audit for contamination.
If we're comparing benchmark deltas between fine-tuned variants that share the same base model, publishing the fine-tuning data seems like the bare minimum that should accompany performance claims.
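For what a third-party contamination audit could look like in the simplest case, here's a naive sketch using word-level n-gram overlap between a fine-tuning example and a benchmark item (the function names and toy strings are mine, and a real audit would normalize text and scan full datasets):

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(train_example, benchmark_item, n=8, threshold=1):
    """Flag a training example sharing >= `threshold` n-grams with a benchmark item."""
    overlap = ngrams(train_example, n) & ngrams(benchmark_item, n)
    return len(overlap) >= threshold

# Toy usage: a fine-tuning example that copies a benchmark question verbatim.
bench = "What is the capital of France? Answer with the city name only."
train = "Q: What is the capital of France? Answer with the city name only. A: Paris"
print(contaminated(train, bench))  # True: verbatim overlap with the benchmark
```

Obviously this misses paraphrased contamination, but even this crude check is impossible without the published dataset, which is the point.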
Even if I grant that their subscription product has some secret sauce they want to keep close to their chest (ignoring for a moment that their paid product is GPT-4 based), not doing the same for the models they release to the open source community free of charge under a commercially permissive license seems suspect.
I realize this sort of open source contribution is mostly for marketing purposes, but I think skepticism of the performance claims is still warranted nonetheless.