Hermes timing out on long tool calls — anyone hit a clean retry pattern?
I'm seeing ~6% of long-running tool calls drop with a 504 around the 28s mark. I tried per-call timeouts and exponential backoff, but the planner loses context on the second attempt. Has anyone wired retries at the agent level rather than the tool level?
Specifically: when a tool call exceeds 25s I want the agent to *resume* its prior plan instead of restarting reasoning. The catch is that Hermes' replay context window already has the failed call in it, and on retry it tends to just call the same tool the same way and time out again.
What I'm trying now is a wrapper that strips the failed tool result from the context and re-injects a synthetic 'use a faster tool' nudge. Feels hacky. Open to better ideas.
try {
  return await hermes.run(task, { timeout: 25_000 });
} catch (e) {
  // On timeout, resume the existing plan instead of restarting reasoning.
  if (isTimeout(e)) return await hermes.resume(task.id);
  throw e;
}
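
For reference, here is a rough sketch of the context-stripping wrapper idea in TypeScript. Everything in it is an assumption for illustration: the `Message` shape, the `stripFailedCall` name, and the nudge text are hypothetical and not taken from Hermes' actual context format or API. It drops both the failed tool call and its result (so the planner does not just replay the same call), then appends the nudge as a system message.

```typescript
// Hypothetical context message shape; Hermes' real format may differ.
type Message =
  | { role: "system" | "user" | "assistant"; content: string }
  | { role: "tool_call"; callId: string; tool: string; args: unknown }
  | { role: "tool_result"; callId: string; content: string };

// Return a new context without the failed call/result pair,
// followed by a synthetic nudge. The input array is not mutated.
function stripFailedCall(
  context: Message[],
  failedCallId: string,
  nudge: string
): Message[] {
  const kept = context.filter(
    (m) =>
      !(
        (m.role === "tool_call" || m.role === "tool_result") &&
        m.callId === failedCallId
      )
  );
  return [...kept, { role: "system", content: nudge }];
}
```

Earlier successful tool calls stay in place, so the prior plan is still visible to the planner; only the timed-out attempt is hidden. Whether a system-role nudge is the right channel depends on how the agent weighs system messages, which I have not verified.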