I'm seeing ~6% of long-running tool calls drop with a 504 around the 28s mark. Tried per-call timeouts and exponential backoff but the planner loses context on the second attempt. Curious if anyone wired retries at the agent level vs the tool level.
tstry {
return await hermes.run(task, { timeout: 25_000 });
} catch (e) {
if (isTimeout(e)) return await hermes.resume(task.id);
throw e;
}