To start with, we have tried restarting the services every night. The biggest issue is that the only way to enable triggering is a manual mouse click. We noticed there is a flag in the DB for global triggering, but the execution service has to be restarted before it picks up a new value. So when the value is set to true, the execution service must restart to notice the change, but once it has restarted, triggering is already enabled while the worker agents might still take a few moments to reconnect to the server. That is the main problem: tasks execute and fail because the agent is not connected.
We were thinking of using the REST API to check how many workers are connected, but the API shuts down when the execution service restarts, so we can't rely on it during the restart.
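For reference, this is roughly the check we had in mind, written as a minimal Python sketch. It assumes the API becomes reachable again once the execution service is back up, which we have not confirmed. `wait_for_agents` and `get_connected_count` are hypothetical names: the caller supplies `get_connected_count`, which would wrap whatever REST call actually returns the number of connected agents, and no real endpoint is shown here. Connection errors are treated as "service still restarting" and retried until a timeout.

```python
import time
from typing import Callable


def wait_for_agents(
    get_connected_count: Callable[[], int],
    expected: int = 3,
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until at least `expected` agents report as connected.

    `get_connected_count` is a hypothetical callable supplied by the caller.
    Returns True once enough agents are connected, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if get_connected_count() >= expected:
                return True
        except OSError:
            # Covers ConnectionError and similar network failures:
            # the API is assumed to be down while the service restarts.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The idea would be to run this right after the nightly restart and only allow tasks to trigger once it returns True. It does not solve the problem that triggering is already enabled as soon as the service comes up; it only tells us when all 3 agents are back.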
I'm aware that the 3rd-party tool might not replicate a real human mouse click 100%, and the KeyNotFoundException does not happen on every occasion, but when it does happen we want to know where the issue might be originating from.
Currently our SMC console sessions are set to 30 minutes. We do, however, have the logging output set, and we are not sure whether this causes memory or CPU spikes.
Our setup is currently as follows: 1 BPA Server with the management and execution services, and 3 worker agents, each on a different server. We run between 600 and 700 tasks, with schedules ranging from every few minutes to once every 3 months. Most of our triggers are schedule-based.