There is no such stable test. Just as humans can memorize answers and develop simple heuristics to pass a test without understanding it, so can an LLM. You've probably met people with perfect grades who can't do much in practice — that's essentially how these LLMs operate.
The creators of an LLM just feed it a bunch of edge-case questions, and whenever people invent new ones, they feed those in as well. So proving it doesn't understand will always be a moving target, just as designing tests that measure people's understanding is a moving target: otherwise people would simply study the old tests and practice those.