Pokémon is increasingly used to evaluate modern large language models, but current practices lack standardization and depend heavily on game-specific harnesses. Pokémon Red involves three major tasks: navigation, combat control, and training a competitive Pokémon team. We find that each comes with a limitation: navigation is too hard, combat control is too simple, and team training is too expensive. We address these issues in Lmgame Bench, a new framework offering standardized evaluations and initial results across diverse games.