Based on your paper, it seems you didn't compare against other model-based evaluation methods such as the Fréchet Embedding Distance (FED). I would certainly like to see a correlation study between the model-based evaluation methods.
Thanks for your feedback. While we couldn't compare with every model-based method out there, YiSi (August 2019) and BERTScore, which was presented at ICLR 2020 this week (April 27th, 2020), are very strong methods to compare against and reflect the state of the art. All comparisons are welcome!
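For anyone who wants to run such a correlation study themselves, the core computation is small. Below is a minimal sketch: the human judgements and metric outputs are hypothetical placeholder numbers, not real NUBIA or BERTScore scores, and `pearson` is a plain implementation so the snippet has no dependencies.

```python
# Sketch of a metric-vs-human correlation study. All score arrays below are
# hypothetical placeholders, not outputs of any real metric.
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx ** 0.5 * vary ** 0.5)

human = [0.9, 0.2, 0.6, 0.8, 0.4]                 # human quality judgements
metric_scores = {
    "metric_A": [0.85, 0.30, 0.55, 0.75, 0.35],   # hypothetical metric outputs
    "metric_B": [0.40, 0.45, 0.50, 0.42, 0.48],
}

for name, scores in metric_scores.items():
    print(f"{name}: Pearson r = {pearson(human, scores):.3f}")
```

In practice you would replace the placeholder arrays with per-example scores from each metric on a dataset with human annotations, and typically report a rank correlation (Kendall tau or Spearman) alongside Pearson.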
What do you think about the train/test discrepancy? I.e., will practitioners have to fine-tune NUBIA's models on their training dataset in order to evaluate on their test dataset?
Three datasets are at play here:
- The general dataset used to pretrain the language model before it is fine-tuned to extract semantic similarity, logical entailment, and grammaticality (i.e., Wikipedia)
- The dataset used to fine-tune the semantic similarity module and the logical inference scorer
- The dataset of human judgements used to train the aggregator
So far, the experiments have actually shown that, without any fine-tuning, the NUBIA model trained to assess machine translations agrees better with human judgement on image captions than the metrics specifically designed to assess image captions.
For more specialized cases, say scoring medical reports, where grammaticality may not matter as much, it may have to be fine-tuned. This is not unlike training human experts, who are taught "what to look for".
The nice thing about this modular architecture and its interpretable scores is the flexibility it provides: you can study individual components and their emergent properties, then make a judgement call on whether or not to fine-tune.
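To make the modular idea concrete, here is a minimal sketch of the overall shape: several modules each produce an interpretable scalar feature, and an aggregator blends them into one score. The module bodies and the fixed linear weights are hypothetical stand-ins; NUBIA's actual modules are fine-tuned language models, and its aggregator is trained against human judgements rather than hand-weighted.

```python
# Minimal sketch of modular, interpretable scoring. Every module body here is
# a hypothetical placeholder, NOT NUBIA's actual implementation.

def semantic_similarity(ref, hyp):
    # Stand-in: word-overlap (Jaccard) ratio instead of a fine-tuned LM.
    r, h = set(ref.split()), set(hyp.split())
    return len(r & h) / max(len(r | h), 1)

def logical_entailment(ref, hyp):
    # Placeholder feature; a real module would run an NLI model.
    return semantic_similarity(ref, hyp)

def grammaticality(hyp):
    # Placeholder; a real module would score fluency of the hypothesis.
    return 1.0

def aggregate_score(ref, hyp, weights=(0.6, 0.3, 0.1)):
    features = (semantic_similarity(ref, hyp),
                logical_entailment(ref, hyp),
                grammaticality(hyp))
    # Fixed linear blend for illustration; NUBIA trains this aggregation
    # step to correlate with human judgement.
    return sum(w * f for w, f in zip(weights, features))
```

Because each feature is exposed individually, a practitioner can inspect, reweight, or fine-tune one module (e.g. down-weight grammaticality for medical reports) without touching the others.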
The aggregators in NUBIA are pretrained to correlate with human judgement, so for now it should only be used for inference. The longer-term idea is to use it as a loss function to optimize translation, image captioning, and summarization systems. It's too big for that as is, but that's what we're working towards.
I think the question here is more along the lines of "If I now have, say, radiology reports, do I use NUBIA out of the box, or do I need to make it read radiology reports and develop a sense of what high-quality radiology reports look like before using it?"
Super cool! If all the components of this metric are models, thoughtful versioning and documentation will be important. What are your thoughts on this?
This is actually something we struggled with, and it is why we released everything this way. There is something to be said about why BLEU and ROUGE have stood the test of time: they are extremely simple to use and static. Hopefully we can bridge this gap.
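One lightweight way to handle the versioning concern (a hypothetical sketch, not NUBIA's actual release process) is to ship a manifest pinning the version and checksum of every component model, and report it alongside any score so results stay reproducible across releases. The checkpoint names and hash placeholders below are illustrative.

```python
# Hypothetical versioning manifest for a model-based metric; checkpoint names
# and hashes are illustrative placeholders, not NUBIA's actual artifacts.
import json

MANIFEST = {
    "metric_version": "0.1.0",
    "components": {
        "semantic_similarity": {"checkpoint": "sem-sim-v1", "sha256": "<hash>"},
        "logical_entailment":  {"checkpoint": "entail-v1",  "sha256": "<hash>"},
        "grammaticality":      {"checkpoint": "gram-v1",    "sha256": "<hash>"},
        "aggregator":          {"checkpoint": "agg-v1",     "sha256": "<hash>"},
    },
}

# Emitted alongside reported scores so others can reproduce them exactly.
print(json.dumps(MANIFEST, indent=2))
```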