If I were doing a test like this, I'd run the benchmark a large number of times (at least ten) in a variety of typical usage scenarios, which needn't be strictly controlled. Ideally I'd also use several handsets of the same model (assuming the variance between units is small). Then I'd filter out the outliers and take the mean. I like this kind of testing because it accounts for confounds that may only show up in a single highly controlled lab setup, where the results won't reflect what people can expect to see when they use their phone in a typical way.
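
For what it's worth, here's a rough sketch of the "filter for outliers and take a mean" step. I haven't said which outlier rule I'd use, so the Tukey fence (1.5 × IQR) below is just one common choice I'm assuming for illustration, and the function name `robust_mean` and the sample scores are made up.

```python
from statistics import mean, quantiles


def robust_mean(scores, k=1.5):
    """Mean of the scores after dropping values outside the Tukey fences.

    Assumption: outliers are defined as values more than k * IQR below the
    first quartile or above the third quartile. Other rules would work too.
    """
    if len(scores) < 4:
        # Too few runs to estimate quartiles sensibly; just average them.
        return mean(scores)

    # Quartiles of the observed runs (the middle value, the median, is unused).
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr

    # Keep only the runs that fall inside the fences.
    kept = [s for s in scores if low <= s <= high]
    return mean(kept)


# Hypothetical scores from ten runs; the 40 stands in for a run that was
# disrupted by something unrelated (say, a background update).
runs = [100, 102, 98, 101, 99, 103, 97, 100, 101, 40]
print(robust_mean(runs))
```

With only ten or so runs a single bad one can drag a plain mean down a lot, which is why I'd filter first rather than just average everything.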