This article applies to BoKS Manager 6.7.0 and 6.7.1. For information regarding BoKS Manager 7.0, please see Hotfix: Speedup of clntd send bridge (HFBM-0097).
Batched messages for clntd on the clients are queuing up on the Master, apparently because the bridge cannot send them as quickly as they are produced.
Resolution / Workaround
These issues are resolved in hotfix HFBM-0093, available for download from the HelpSystems Community Portal.
The clntd send bridge for batched messages keeps an array of active file descriptors to remote machines (connected, or in the process of being connected). The size of this array has been increased from 40 and is now configurable. This should greatly improve throughput when many hosts are down. The array can hold at most 1024 entries, but because 1024 is normally also the limit on the number of file descriptors for a process, and the bridge needs some other open files, the default is 994. The value can be decreased by setting BOKS_INIT_NFD or BRIDGE_CLNTD_S_MAX_SOCKETS in the ENV file. BOKS_INIT_NFD affects all processes started by boks_init, so we recommend using BRIDGE_CLNTD_S_MAX_SOCKETS if you need to decrease the maximum number of network sockets that the bridge can use. If the calculated value is less than 40, 40 is used.
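The limits described above can be summarised in a short sketch. This is illustrative Python, not BoKS code: the function name and its `configured` parameter are hypothetical, and the capping of values above 1024 is an assumption, since the article only states that 1024 is the maximum. How the bridge derives its "calculated value" from BOKS_INIT_NFD is not modelled here.

```python
# Illustrative only: models the limits stated in the article, not BoKS source code.
ARRAY_MIN = 40       # floor stated in the article
ARRAY_MAX = 1024     # maximum array size stated in the article
ARRAY_DEFAULT = 994  # default stated in the article


def effective_max_sockets(configured=None):
    """Return the array size implied by the stated limits.

    `configured` is a hypothetical stand-in for a value taken from
    BRIDGE_CLNTD_S_MAX_SOCKETS; None means the variable is not set.
    """
    if configured is None:
        return ARRAY_DEFAULT
    # Assumption: values above the maximum are capped rather than rejected.
    return max(ARRAY_MIN, min(configured, ARRAY_MAX))
```

Under these assumptions, a setting of 200 yields 200, while a setting of 10 is raised to the floor of 40.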
The strategy for how often the clntd send bridge retries a connection to a host that it failed to talk to has also been changed. The old strategy was to connect quite quickly, and then slowly back off until it tried once every 10 minutes. This was based on the assumption that hosts were normally up.
In practice, a host that cannot be contacted will often be down for a long time. In that case, bandwidth is wasted attempting to contact it: each connection attempt times out only after some 5-10 seconds, and in the meantime it occupies a slot in the array described above, blocking messages to another host.
Now, when a host is marked as down, the bridge retries once after 5 minutes (in the hope that the machine was just restarted) and then once every 30 minutes. The latter value is configurable using BRIDGE_CLNTD_S_MAX_CONNECT_RETRY_MINUTES in the ENV file and can be set to between 5 and 1440 minutes. With high values, it will take a long time for BoKS to discover that a host is back up again.
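To make the new schedule concrete, the following minimal sketch lists when retries would occur, in minutes after a host is marked as down. The function and constant names are hypothetical, and raising an error for out-of-range values is an assumption; the article does not say how the bridge treats a setting outside 5 to 1440.

```python
from itertools import islice

FIRST_RETRY_MINUTES = 5      # first retry, as stated in the article
DEFAULT_RETRY_MINUTES = 30   # default interval for later retries
ALLOWED_RANGE = (5, 1440)    # documented range for the configurable interval


def retry_offsets(max_retry_minutes=DEFAULT_RETRY_MINUTES):
    """Yield retry times in minutes since the host was marked as down.

    `max_retry_minutes` stands in for BRIDGE_CLNTD_S_MAX_CONNECT_RETRY_MINUTES.
    """
    low, high = ALLOWED_RANGE
    if not low <= max_retry_minutes <= high:
        # Assumption: the real bridge may clamp or ignore such values instead.
        raise ValueError(f"interval must be between {low} and {high} minutes")
    elapsed = FIRST_RETRY_MINUTES
    yield elapsed
    while True:
        elapsed += max_retry_minutes
        yield elapsed


print(list(islice(retry_offsets(), 4)))  # [5, 35, 65, 95]
```

With the maximum setting of 1440, the second retry in this model would not happen until a full day after the first, which illustrates why high values delay the discovery of a recovered host.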
As the 'boksdiag fque -bridge' command does not show the actual processing done by the clntd send bridge, simple monitoring has been added to the bridge. It only monitors batched messages, not messages sent directly to a host by e.g. cadm, and the purpose is to see whether or not the bridge is processing messages.
By default monitoring messages end up in $BOKS_var/monitoring/bridge_clntd_s, and an entry has the following format:
email@example.com Wed Dec 16 16:10:52 2015
Non-batch msgs: 198
Batch connect fail: 1100
Batch write ok: 1945
Batch write fail: 0
Batch read ok: 1942
Batch read fail: 3
The normal ENV variables can be used to redirect the output or change the monitoring interval.
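For sites that want to check these counters from a script, a monitoring entry could be read along the following lines. This is a hedged sketch: the parser and helper names are hypothetical, the sample text simply repeats the counter lines shown above, and the code does not depend on the header line's format or on whether counters are cumulative or per interval.

```python
import re

# Counter lines copied from the example entry above.
SAMPLE_ENTRY = """\
Non-batch msgs: 198
Batch connect fail: 1100
Batch write ok: 1945
Batch write fail: 0
Batch read ok: 1942
Batch read fail: 3
"""

# Matches "<name>: <integer>"; any other line (such as the header) is skipped.
COUNTER_LINE = re.compile(r"^\s*(?P<name>[^:]+?)\s*:\s*(?P<value>\d+)\s*$")


def parse_counters(entry):
    """Return a dict mapping counter names to integer values."""
    counters = {}
    for line in entry.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters


def batch_successes(counters):
    """Sum the successful batch writes and reads reported in one entry."""
    return counters.get("Batch write ok", 0) + counters.get("Batch read ok", 0)


counters = parse_counters(SAMPLE_ENTRY)
print(batch_successes(counters))  # 3887
```

A non-zero result suggests the bridge handled batch traffic for that entry, which is the question the monitoring is meant to answer.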
Previously, while the bridge was processing a non-batch message, it did not process any batch messages. As a result, heavy use of non-batch messages, for example to fetch keystroke log files, slowed down batch processing. The code has been rewritten so that the bridge can now process batch messages while it is processing a non-batch message.
To implement this fixed functionality, apply the hotfix HFBM-0093, available for download from the FoxT customer services support site, to your Master / Failover Master.
Update for version 3 of HFBM-0093
Four problems have been found and corrected:
Update for version 4 of HFBM-0093
Three additional changes:
Update for version 5 of HFBM-0093
Version 5 includes a fix to an issue where the use of 'boksdiag fque -bridge -delete
It also includes a fix for another issue that, in very special circumstances, could cause the bridge to access memory that had already been freed, normally resulting in a core dump.
Update for version 6 of HFBM-0093
Update for version 7 of HFBM-0093
WARNING: Got error
WARNING: Copying to bad file:
Update for version 8 of HFBM-0093
Fixes for Master-to-Server Agent send functionality.
Update for version 9 of HFBM-0093
Version 9 includes a fix for an issue that was introduced in hotfix HFBM-0131-1 and also affected this hotfix. When installed on a Replica, this could prevent some messages being sent to the Master until BoKS was restarted.