For every example-agent they gave, an ordinary 'dumb' (as in 'non-intelligent') service would've sufficed...
So, to give an example of what's worked really well for me: I'm working for an app hosting startup named Wasmer, and we host a decent number of apps. Some of these are malicious. To detect the malicious ones effectively, we have an app-vetter agent named Herman. Herman reads the index page of every newly created app, along with a screenshot of it, and flags the app if he thinks it's malicious. Then a human (usually me) inspects the app and makes the final call on whether it should be banned.
This allows us to scan a large number of apps and filter out the noise of non-malicious ones. Doing this with a 'dumb' service wouldn't really be feasible, and an LLM's context fits perfectly here since it gets both an image and the source code. An LLM is also quite 'omniscient', in that it knows, for example, that DANA is a bank in Malaysia, something I personally had no idea about.
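To make the flow concrete, here is a minimal sketch of what such a vetting pass could look like. It assumes the OpenAI Node SDK and a vision-capable model; the prompt wording, the model name, and how the screenshot is captured are my guesses for illustration, not how Herman actually works.

```typescript
// Minimal sketch of an app-vetting pass in the spirit of "Herman", using the
// OpenAI Node SDK. The model name, prompt wording, and passing the screenshot
// in as a base64 string are assumptions, not Wasmer's actual setup.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function vetApp(
  appUrl: string,
  screenshotB64: string, // PNG screenshot of the index page, base64-encoded
): Promise<{ malicious: boolean; reason: string }> {
  // Fetch the index page source so the model sees both the code and the rendering.
  const html = await (await fetch(appUrl)).text();

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text:
              "You vet newly deployed apps for phishing, scams and malware. " +
              "Given the index page HTML and a screenshot, reply with JSON " +
              '{"malicious": boolean, "reason": string}.\n\n' +
              html,
          },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${screenshotB64}` },
          },
        ],
      },
    ],
  });

  // A flagged app still goes to a human for the final ban decision.
  return JSON.parse(response.choices[0].message.content ?? "{}");
}
```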
I think tedious and time-consuming chores like this are a great use of agents. Next in line for my experimentation is to use agents for 'fuzzy' integration testing, where the LLM simply has access to a browser, CLI tools and the UAT specifications, and may (in an isolated environment) do whatever it wants. It should then report back any findings and improvement suggestions through an MCP integration with our ticketing system. Essentially, using the hallucinations to find issues.
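For the report-back step, something like the official MCP TypeScript SDK could drive a ticketing tool. The server command and the `create_issue` tool name below are assumptions about whatever MCP server wraps the ticketing system, not a documented interface.

```typescript
// Sketch of reporting a finding to a ticketing system over MCP, using the
// @modelcontextprotocol/sdk client. The server command ("linear-mcp-server")
// and the tool name/arguments ("create_issue") are assumed placeholders.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function reportFinding(title: string, description: string) {
  const transport = new StdioClientTransport({
    command: "linear-mcp-server", // assumed: any MCP server wrapping your ticketing system
  });
  const client = new Client(
    { name: "fuzzy-integration-tester", version: "0.1.0" },
    { capabilities: {} },
  );

  await client.connect(transport);
  try {
    // Assumed tool; check the server's actual tools via client.listTools().
    await client.callTool({
      name: "create_issue",
      arguments: { title, description },
    });
  } finally {
    await client.close();
  }
}
```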
I tried doing LLM-based tests, coined them "agentic tests", and it worked quite well:
The idea was to use Stagehand [1] as the testing framework and integrate it with Linear, which is our ticketing system. During a hackathon I whipped something together: first the 'agent' read a UAT from Linear, then passed it into a quite heavily prompted Stagehand. The prompts instructed Stagehand to run the UAT to the best of its ability and take very structured notes on what failed at each step. Once the Stagehand process was done, 'the agent' reported back into Linear which steps succeeded and which failed.
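Roughly, the pipeline could look like the sketch below, using Stagehand's act/extract page API (as in recent versions) and the Linear SDK. The prompt wording, the "one UAT step per line" convention and the app URL are assumptions about how the ticket is written, not part of either SDK or my exact hackathon code.

```typescript
// Sketch of the hackathon pipeline: read a UAT from Linear, drive the steps
// with Stagehand, and report structured results back as a Linear comment.
import { Stagehand } from "@browserbasehq/stagehand";
import { LinearClient } from "@linear/sdk";
import { z } from "zod";

const linear = new LinearClient({ apiKey: process.env.LINEAR_API_KEY });

async function runAgenticTest(issueId: string, appUrl: string) {
  const issue = await linear.issue(issueId);
  // Assumed convention: the UAT lives in the issue description, one step per line.
  const steps = (issue.description ?? "").split("\n").filter(Boolean);

  const stagehand = new Stagehand({ env: "LOCAL" });
  await stagehand.init();
  const page = stagehand.page;
  await page.goto(appUrl);

  const results: string[] = [];
  for (const step of steps) {
    try {
      await page.act(`Perform this UAT step to the best of your ability: ${step}`);
      const check = await page.extract({
        instruction: `Did the previous step ("${step}") succeed? Note anything that looked broken.`,
        schema: z.object({ succeeded: z.boolean(), notes: z.string() }),
      });
      results.push(`${check.succeeded ? "PASS" : "FAIL"}: ${step} - ${check.notes}`);
    } catch (err) {
      results.push(`FAIL: ${step} - ${String(err)}`);
    }
  }

  await stagehand.close();
  // Report the structured step results back into the ticket.
  await linear.createComment({ issueId, body: results.join("\n") });
}
```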
Fundamentally the idea was sound, but there were some limitations in both the Linear SDK and in Stagehand. With some better tooling (or a novel system), I predict this sort of agentic testing will work very well, especially for exploratory testing where the agent may be prompted to act like either a 90 year old grandma or a 16 year old turbogamer. Privacy-safe usage heatmaps could also be generated automatically to test the UX, since each run yielded slightly different approaches to achieving the UATs.