This article applies to BoKS Manager 7.0. For information regarding BoKS Manager 6.7, please see Hotfix: Speedup of clntd send bridge (HFBM-0093).
A) Batched messages for clntd on the Server Agents are queuing up on the Master, apparently because the bridge cannot send them as fast as they are produced.
B) If a file with queued messages was corrupt, queue processing halted.
Resolution / Workaround
These issues are resolved in the hotfix HFBM-0097, available for download from the HelpSystems Community Portal.
The clntd send bridge for batched messages keeps an array of active file descriptors to remote machines (connected or in the process of being connected). The size of this array was previously 40; it has been increased and is now configurable. This should greatly improve throughput when many hosts are down. The maximum is 1024, but because that is normally also the limit on the number of file descriptors for a process, and the bridge needs some other open files, the default is 994. The value can be decreased by setting BOKS_INIT_NFD or BRIDGE_CLNTD_S_MAX_SOCKETS in the ENV file. BOKS_INIT_NFD affects all processes started by boks_init, so we recommend using BRIDGE_CLNTD_S_MAX_SOCKETS if you need to decrease the maximum number of network sockets that the bridge can use. If the calculated value is less than 40, 40 is used.
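The limits described above can be illustrated with a minimal sketch. This is not the bridge's actual code: the function name is hypothetical, the way BOKS_INIT_NFD feeds into the calculation is not modeled because this article does not describe it, and capping an explicit setting at 1024 is an assumption based on the stated maximum.

```python
# Illustrative only: models the documented floor (40), default (994)
# and maximum (1024) for the bridge's socket array.
MIN_SOCKETS = 40       # documented lower bound
DEFAULT_SOCKETS = 994  # documented default
MAX_SOCKETS = 1024     # documented maximum


def effective_max_sockets(configured=None):
    """Return the array size for a given BRIDGE_CLNTD_S_MAX_SOCKETS value.

    `configured` is None when the variable is not set in the ENV file.
    """
    if configured is None:
        value = DEFAULT_SOCKETS
    else:
        # Assumption: values above the documented maximum are capped.
        value = min(configured, MAX_SOCKETS)
    # Documented: anything below 40 is replaced by 40.
    return max(value, MIN_SOCKETS)
```

For example, under these assumptions an unset variable gives 994, and a setting of 10 is raised to 40.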
The strategy for how often the clntd send bridge retries a connection to a host that it failed to communicate with has also been changed. The old strategy was to connect quite quickly, and then slowly back off until it tried once every 10 minutes. This was based on the assumption that hosts were normally up.
In practice, a host that cannot be contacted is often down for a long time. In that case bandwidth is wasted attempting to contact it: each connection attempt times out only after some 5-10 seconds, and in the meantime it occupies a slot in the array described above, blocking messages to another host.
Now, when a host is marked as down, the bridge retries once after 5 minutes (in the hope that the machine was just restarted) and then once every 30 minutes. The latter value is configurable using BRIDGE_CLNTD_S_MAX_CONNECT_RETRY_MINUTES in the ENV file and can be set to between 5 and 1440 minutes. With high values, it will take a long time for BoKS to discover that a host is back up again.
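As a rough illustration of the new retry schedule, the sketch below lists the waiting time before each reconnection attempt. The function name is hypothetical, and rejecting out-of-range values with an error is an assumption; this article does not say how the bridge treats a setting outside 5 to 1440.

```python
FIRST_RETRY_MINUTES = 5  # documented: first retry after 5 minutes


def retry_delays_minutes(max_retry_minutes=30, attempts=4):
    """Return the delay in minutes before each of the first `attempts` retries.

    `max_retry_minutes` stands in for BRIDGE_CLNTD_S_MAX_CONNECT_RETRY_MINUTES.
    """
    if not 5 <= max_retry_minutes <= 1440:
        # Assumption: the real bridge may clamp or ignore such values instead.
        raise ValueError("retry interval must be between 5 and 1440 minutes")
    delays = [FIRST_RETRY_MINUTES] + [max_retry_minutes] * (attempts - 1)
    return delays[:attempts]


print(retry_delays_minutes())  # [5, 30, 30, 30]
```

With the maximum setting of 1440, a host that comes back just after a failed retry may not be contacted again for up to a day, which is the trade-off noted above.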
As the 'boksdiag fque -bridge' command does not show the actual processing done by the clntd send bridge, simple monitoring has been added to the bridge. It only monitors batched messages, not messages sent directly to a host by e.g. cadm, and the purpose is to see if the bridge is processing messages or not.
By default the monitoring output is written to $BOKS_var/monitoring/bridge_clntd_s.stat, and an entry looks like:
firstname.lastname@example.org Wed Dec 16 16:10:52 2015
Non-batch msgs: 198
Batch connect fail: 1100
Batch write ok: 1945
Batch write fail: 0
Batch read ok: 1942
Batch read fail: 3
The normal ENV variables can be used to redirect the output or change the monitoring interval.
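To check whether the bridge is making progress, the counters in an entry can be compared between two points in time. The following is a minimal sketch of a parser for one entry, assuming the layout shown in the sample above (a header line followed by "name: number" lines); the function name is hypothetical and the real file may contain details not covered here.

```python
def parse_stat_entry(lines):
    """Collect the 'name: number' counters from one monitoring entry."""
    counters = {}
    for line in lines:
        # Split on the last colon so counter names may contain spaces.
        name, sep, value = line.rpartition(":")
        value = value.strip()
        # Lines whose last field is not a plain number (such as the
        # header line with the timestamp) are skipped.
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


sample = [
    "Wed Dec 16 16:10:52 2015",
    "Non-batch msgs: 198",
    "Batch connect fail: 1100",
    "Batch write ok: 1945",
    "Batch write fail: 0",
    "Batch read ok: 1942",
    "Batch read fail: 3",
]
counters = parse_stat_entry(sample)
```

If "Batch write ok" does not increase between two entries while messages are queued, the bridge is probably not processing batch messages.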
Previously, while the bridge was processing a non-batch message, it did no processing of batch messages. This meant that heavy use of non-batch messages, for example to fetch keystroke log files, would slow down batch processing. The code has been rewritten so that the bridge can now process batch messages at the same time as it is processing a non-batch message.
When the clntd send bridge encountered a queued batch message to a machine to which there was already an outstanding message, it stopped queue processing until the outstanding message was processed. This meant that consecutive messages to the same machine were processed one by one, and messages later in the queue were not processed until these were done. This impacted performance. This has been changed so that in this case the new message is added to an internal queue and processing of the batch queue continues.
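The change in queue handling can be sketched as follows. This is a simplified model, not the bridge's implementation: the function name and the (host, message) tuple layout are assumptions made for illustration.

```python
from collections import defaultdict, deque


def dispatch(batch_queue, busy_hosts=()):
    """Walk the batch queue once without stalling on busy hosts.

    Returns (sent, deferred): messages that can be sent immediately, and
    per-host internal queues of messages waiting for an outstanding one.
    """
    busy = set(busy_hosts)
    deferred = defaultdict(deque)
    sent = []
    for host, message in batch_queue:
        if host in busy:
            # Old behavior: stop here until the host is free.
            # New behavior: park the message and keep going.
            deferred[host].append(message)
        else:
            busy.add(host)
            sent.append((host, message))
    return sent, {host: list(queue) for host, queue in deferred.items()}


queue = [("hostA", "m1"), ("hostA", "m2"), ("hostB", "m3")]
sent, deferred = dispatch(queue)
```

In this example the second message to hostA is parked, so the message to hostB is sent without waiting for hostA.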
Batch messages are stored in files. If for some reason one of these files is corrupt, this would stop batch processing. This has been fixed so the offending file is copied for later analysis and processing continues with the next file. Warning messages are also written to boks_errlog:
WARNING: Got error
WARNING: Copying to bad file:
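The general pattern of setting a corrupt file aside and continuing can be sketched like this. The function name, the use of ValueError to signal corruption, the directory for bad files and the log text are all assumptions for illustration; they do not reproduce the bridge's actual file locations or the exact boks_errlog messages.

```python
import logging
import shutil
from pathlib import Path


def process_queue_files(paths, handle, bad_dir):
    """Process each queue file; copy unreadable ones aside and continue."""
    bad_dir = Path(bad_dir)
    bad_dir.mkdir(parents=True, exist_ok=True)
    processed, quarantined = [], []
    for path in map(Path, paths):
        try:
            handle(path)
        except ValueError as exc:
            # Keep a copy for later analysis instead of halting the queue.
            target = bad_dir / path.name
            shutil.copy2(path, target)
            logging.warning("error in %s (%s), copied to %s", path, exc, target)
            quarantined.append(path)
            continue
        processed.append(path)
    return processed, quarantined
```

Here `handle` is whatever routine reads and sends one file; the point is only that one failure no longer blocks the files behind it.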
To implement this fixed functionality, apply the hotfix HFBM-0097, available for download from the FoxT customer services support site, to your Master / Failover Master.
The boks_bridge made many calls to a routine to check if a network connection was in progress for a host. This routine searched through a list which consumed a lot of CPU time. The logic is now changed to make fewer calls to this routine and the routine has been changed to use a hash to speed it up.
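The effect of replacing a list search with a hash lookup can be illustrated with a small sketch. The class and method names are hypothetical; the bridge's own data structure is not described in this article beyond the fact that it now uses a hash.

```python
class ConnectionsInProgress:
    """Track hosts with a connection attempt in progress.

    A set is hash-based, so the membership check takes roughly constant
    time on average instead of scanning every entry as a list would.
    """

    def __init__(self):
        self._hosts = set()

    def start(self, host):
        self._hosts.add(host)

    def finish(self, host):
        # discard() does nothing if the host is not present.
        self._hosts.discard(host)

    def is_connecting(self, host):
        return host in self._hosts


tracker = ConnectionsInProgress()
tracker.start("hostA")
```

With hundreds of sockets in use, the difference between scanning a list on every call and a single hash lookup adds up quickly.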