Hacker News | zhisbug's comments




Pokémon is increasingly used to evaluate modern large language models, but current practices lack standardization and depend heavily on game-specific harnesses. Pokémon Red involves three major tasks: navigation, combat control, and training a competitive Pokémon team. We find that each comes with limitations: navigation tasks are too hard, combat control is too simple, and Pokémon team training is too expensive. We address these issues in Lmgame Bench, a new framework offering standardized evaluations and initial results across diverse games.


where other models top out in a few moves



We find that spatial perception and spatial reasoning remain very difficult even for the strongest models, such as o3 and Claude 3.7.




Sliding tile attention accelerates Hunyuan video generation by 3x with no quality drop and no need for training.

