This article applies to BoKS Manager 6.7.0 and 6.7.1. For information regarding BoKS Manager 7.0, please see Hotfix: Speedup of clntd send bridge (HFBM-0097).

Description

Batched messages for clntd on the clients are queuing up on the Master, apparently because the bridge cannot send them as quickly as they are produced.


Resolution / Workaround

These issues are resolved in hotfix HFBM-0093, available for download from the HelpSystems Community Portal.

1)

The clntd send bridge for batched messages has an array in which it stores the active file descriptors for remote machines (connected or in the process of being connected). The size of this array has been increased from 40 and is now configurable. This should greatly improve throughput when many hosts are down. The array can hold a maximum of 1024 entries, but because 1024 is normally also the limit on the number of file descriptors for a process, and the bridge needs some other open files, the default is 994. The value can be decreased by setting BOKS_INIT_NFD or BRIDGE_CLNTD_S_MAX_SOCKETS in the ENV file. BOKS_INIT_NFD affects all processes started by boks_init, so we recommend using BRIDGE_CLNTD_S_MAX_SOCKETS if you need to decrease the maximum number of network sockets that the bridge can use. If the calculated value is less than 40, 40 will be used.
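
The sizing rules above can be summarized in a short sketch. It is an illustration only, not the bridge's actual code: the function name effective_max_sockets is hypothetical, and treating 1024 as a hard upper clamp on a configured value is an assumption based on the stated array maximum.

```python
from typing import Optional

ARRAY_MAX = 1024        # stated maximum size of the descriptor array
DEFAULT_SOCKETS = 994   # stated default
MIN_SOCKETS = 40        # stated floor ("if less than 40, 40 will be used")


def effective_max_sockets(configured: Optional[int] = None) -> int:
    """Return the socket limit under the rules described above.

    `configured` stands in for a value taken from the ENV file
    (hypothetical parameter); None means "not set".
    """
    value = DEFAULT_SOCKETS if configured is None else configured
    # Assumption: values above the array maximum are clamped to it.
    return max(MIN_SOCKETS, min(value, ARRAY_MAX))


print(effective_max_sockets())     # 994
print(effective_max_sockets(10))   # 40
```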

2)

The strategy for how often the clntd send bridge retries a connection to a host that it failed to reach has also been changed. The old strategy was to retry quite quickly at first, and then slowly back off until it tried once every 10 minutes. This was based on the assumption that hosts were normally up.
In practice, a host that cannot be contacted is often down for a long time. In that case, bandwidth is wasted attempting to contact it: each connection attempt times out after some 5-10 seconds, and in the meantime it occupies a slot in the above array, blocking messages to another host.
Now, when a host is marked as down, the bridge retries once after 5 minutes (in the hope that the machine was just restarted) and then once every 30 minutes. The latter value is configurable using BRIDGE_CLNTD_S_MAX_CONNECT_RETRY_MINUTES in the ENV file and can be set to between 5 and 1440 minutes. High values mean that it will take a long time for BoKS to discover that a host is back up.
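
To make the new schedule concrete, the following sketch lists the times, in minutes after a host is marked as down, at which retries would occur. It is a simplified model rather than the bridge's implementation: the function name retry_offsets is hypothetical, and rejecting out-of-range settings with an error is an assumption (the article does not say how the bridge handles them).

```python
from itertools import islice
from typing import Iterator

FIRST_RETRY_MINUTES = 5          # stated: one retry after 5 minutes
DEFAULT_MAX_RETRY_MINUTES = 30   # stated default for later retries
MIN_SETTING, MAX_SETTING = 5, 1440


def retry_offsets(max_retry_minutes: int = DEFAULT_MAX_RETRY_MINUTES) -> Iterator[int]:
    """Yield retry times in minutes since the host was marked as down."""
    # Assumption: an out-of-range setting is rejected here for illustration.
    if not MIN_SETTING <= max_retry_minutes <= MAX_SETTING:
        raise ValueError("setting must be between 5 and 1440 minutes")
    elapsed = FIRST_RETRY_MINUTES
    yield elapsed
    while True:
        elapsed += max_retry_minutes
        yield elapsed


print(list(islice(retry_offsets(), 4)))  # [5, 35, 65, 95]
```

With the maximum setting of 1440, a host that comes back just after a retry could go unnoticed for up to a day, which is the trade-off mentioned above.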

3)

As the 'boksdiag fque -bridge' command does not show the actual processing done by the clntd send bridge, simple monitoring has been added to the bridge. It only monitors batched messages, not messages sent directly to a host by, for example, cadm. Its purpose is to show whether or not the bridge is processing messages.
By default, monitoring output is written to $BOKS_var/monitoring/bridge_clntd_s, and each entry has the following format:

bridge_clntd_s@somehost.com Wed Dec 16 16:10:52 2015
Non-batch msgs: 198
Batch connect fail: 1100
Batch write ok: 1945
Batch write fail: 0
Batch read ok: 1942
Batch read fail: 3
Idle: 55%

Where:

  • "Non-batch msgs" is the number of non-batch messages processed. All other numbers except "Idle" refer to batched messages.
  • "Batch connect fail" is the number of failed connection attempts to hosts that are down or not reachable on the network.
  • "Batch write ok" and "Batch write fail" are the numbers of messages written successfully and with an error, respectively.
  • "Batch read ok" is the number of replies received.
  • "Batch read fail" is the number of timeouts or premature connection closes when reading a reply.
  • "Idle" is the percentage of time spent waiting for new messages to send or for replies to messages already sent.

The normal ENV variables can be used to redirect the output or change the monitoring interval.
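
For sites that want to track these counters over time, a monitoring entry in the format shown above can be parsed with a few lines of Python. This is a minimal sketch, not a supplied tool: the function name parse_entry is hypothetical, and it assumes each entry is one header line followed by "name: value" lines exactly as in the sample.

```python
from typing import Any, Dict

SAMPLE = """bridge_clntd_s@somehost.com Wed Dec 16 16:10:52 2015
Non-batch msgs: 198
Batch connect fail: 1100
Batch write ok: 1945
Batch write fail: 0
Batch read ok: 1942
Batch read fail: 3
Idle: 55%"""


def parse_entry(text: str) -> Dict[str, Any]:
    """Split one monitoring entry into source, timestamp and counters."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    source, _, timestamp = lines[0].partition(" ")
    counters: Dict[str, int] = {}
    for line in lines[1:]:
        name, _, raw = line.partition(":")
        # "Idle" carries a trailing percent sign; strip it before converting.
        counters[name.strip()] = int(raw.strip().rstrip("%"))
    return {"source": source, "timestamp": timestamp, "counters": counters}


entry = parse_entry(SAMPLE)
print(entry["counters"]["Batch connect fail"])  # 1100
```

A steadily rising "Batch write ok" count between entries is a simple indication that the bridge is processing batched messages.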

4)

Previously, while the bridge was processing a non-batch message, it did not process any batch messages. This meant that heavy use of non-batch messages, for example to fetch keystroke log files, slowed down batch processing. The code has been rewritten so that the bridge can now process batch messages while it is processing a non-batch message.

To obtain these fixes, apply hotfix HFBM-0093, available for download from the FoxT customer services support site, to your Master / Failover Master.

Update for version 3 of HFBM-0093

Four problems have been found and corrected:

  1. An earlier unrelated fix to the low-level internal communication code did not work properly with the boks_bridge. In certain cases this made a receive bridge get stuck reading data from its server (e.g. boks_master). Meanwhile the remote bridge closed the connection, so there was a forked bridge left with a network socket stuck in CLOSE_WAIT. The low-level communication code has been fixed.
  2. An issue in the new code caused non-batched connection attempts to an unavailable host to hang until the OS timed out the connection. The OS timeout is much longer than the internal BoKS timeout. Attempts to send non-batched messages to other hosts during this time failed. The proper timeout is now applied.
  3. Another issue in the new code caused non-batched messages to be processed more slowly when the only batched messages being processed at the same time were addressed to hosts that were down.
  4. The clntd send bridge (bridge_clntd_s) did not notice that the remote side had closed the connection until some time afterwards (~two minutes). During that time the sockets remained in the CLOSE_WAIT state.

Update for version 4 of HFBM-0093

Three additional changes:

  1. Speedup: When the clntd send bridge encountered a queued batch message to a machine to which there was already an outstanding message, it stopped queue processing until the outstanding message was processed. This meant that if there were consecutive messages to the same machine, it would process these one by one and not until these were done did it process messages later in the queue. This impacted performance. This has been changed so in that case the new message is just added to an internal queue and the batch queue continues to be processed.
  2. Monitoring: The monitoring stat files for the bridges did not have the .stat extension. This meant they were not included in the boksinfo data. The files now have the .stat extension.
  3. Monitoring: An entry has been added to the monitoring file for the clntd send bridge showing the number of batched messages queued since the last report.
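
The speedup in item 1 can be illustrated with a small model of one pass over the batch queue. This is a sketch under stated assumptions, not the bridge's code: the function name scan_queue and the (host, payload) message shape are hypothetical, and the model assumes at most one outstanding message per host.

```python
from typing import Dict, List, Tuple

Message = Tuple[str, str]  # (host, payload); hypothetical shape


def scan_queue(batch_queue: List[Message]) -> Tuple[List[Message], Dict[str, List[str]]]:
    """One pass over the batch queue using the version 4 strategy.

    A message to a host that already has an outstanding message is
    parked in a per-host internal queue, and scanning continues.
    """
    outstanding = set()
    sent: List[Message] = []
    deferred: Dict[str, List[str]] = {}
    for host, payload in batch_queue:
        if host in outstanding:
            deferred.setdefault(host, []).append(payload)
        else:
            outstanding.add(host)
            sent.append((host, payload))
    return sent, deferred


queue = [("hostA", "m1"), ("hostA", "m2"), ("hostB", "m3")]
sent, deferred = scan_queue(queue)
print(sent)      # [('hostA', 'm1'), ('hostB', 'm3')]
print(deferred)  # {'hostA': ['m2']}
```

Under the old behavior described above, the pass would have stopped at the second hostA message, and the hostB message would have waited until hostA was done.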

Update for version 5 of HFBM-0093

Version 5 fixes an issue where using 'boksdiag fque -bridge -delete ' could, in certain cases, disrupt the bridge's internal queue of hosts with queued messages. When this happened, messages queued to hosts not yet in that queue were not processed by the bridge until it was restarted.

It also fixes another issue that, in very special circumstances, could cause the bridge to access memory that had already been freed, which normally caused a core dump.

Update for version 6 of HFBM-0093

  1. If the boks_bridge failed to create a new queue file, it would lose messages. This has been fixed so if it fails to create a new file it will simply continue adding messages to the last queue file.
  2. The boks_bridge queue files were compacted too often. As this is an expensive operation (potentially copying lots of data), it impacted the performance of the boks_bridge. Now compaction will be done at most every 10 minutes.
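
The compaction limit in item 2 is a form of rate limiting. The sketch below shows one common way to express "at most every 10 minutes"; the class name CompactionThrottle is hypothetical and the code is not taken from boks_bridge.

```python
import time
from typing import Callable, Optional

COMPACT_MIN_INTERVAL = 10 * 60  # seconds; stated limit of once per 10 minutes


class CompactionThrottle:
    """Allow an expensive operation at most once per interval."""

    def __init__(self,
                 min_interval: float = COMPACT_MIN_INTERVAL,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._min_interval = min_interval
        self._clock = clock            # injectable so the logic can be tested
        self._last: Optional[float] = None

    def should_compact(self) -> bool:
        """Return True, and record the time, if compaction is allowed now."""
        now = self._clock()
        if self._last is None or now - self._last >= self._min_interval:
            self._last = now
            return True
        return False
```

A caller would check should_compact() whenever compaction would previously have been triggered, and skip the copy when it returns False.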

Update for version 7 of HFBM-0093

  1. In very special cases it was possible for the bridge to get stuck in an infinite loop so it would not process any new messages. This has been fixed.
  2. Batch messages are stored in files. If for some reason one of these files was corrupt, batch processing stopped. This has been fixed so that the offending file is copied for later analysis and processing continues with the next file. The following warning messages are also written to boks_errlog:


WARNING: Got error () when reading queue file


and


WARNING: Copying to bad file: /BAD.
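
The "copy aside and continue" behavior described in item 2 follows a general pattern that can be sketched as follows. Everything here is hypothetical and for illustration only: the function name, the caller-supplied read_file callable, and the quarantine directory do not reflect the actual file names, locations or queue file format used by BoKS.

```python
import shutil
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def process_queue_files(files: Iterable[Path],
                        read_file: Callable[[Path], None],
                        quarantine_dir: Path) -> Tuple[List[Path], List[Path]]:
    """Process each file; copy unreadable ones aside and keep going.

    `read_file` is a hypothetical reader that raises OSError or
    ValueError when a file cannot be read or parsed.
    """
    processed: List[Path] = []
    quarantined: List[Path] = []
    for path in files:
        try:
            read_file(path)
        except (OSError, ValueError):
            # Keep a copy for later analysis, then move on to the next file.
            quarantine_dir.mkdir(parents=True, exist_ok=True)
            target = quarantine_dir / path.name
            shutil.copy2(path, target)
            quarantined.append(target)
            continue
        processed.append(path)
    return processed, quarantined
```

The key point is that a failure on one file is recorded and skipped rather than aborting the whole loop.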

Update for version 8 of HFBM-0093

Fixes for Master-to-Server Agent send functionality.

  1. Bugfix: With the default value for BRIDGE_CLNTD_S_MAX_SOCKETS, it was possible for the bridge to use more than 1024 file descriptors, which caused select() to return an error. The bridge now reserves 200 file descriptors for internal use to avoid this issue.
  2. Speedup: The bridge made many calls to a routine that checks whether a network connection is in progress for a host. This routine searched through a list, which consumed a large amount of CPU time. The logic has been changed to make fewer calls to this routine, and the routine now uses a hash to speed up the lookup.
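
The speedup in item 2 replaces a linear list search with a hashed lookup. The sketch below shows the general idea using a Python set, which is hash-based; the class name ConnectionTracker and its methods are hypothetical and do not describe the bridge's internal data structures.

```python
class ConnectionTracker:
    """Track hosts with a connection attempt in progress.

    A set gives average constant-time membership tests, whereas scanning
    a list takes time proportional to the number of hosts.
    """

    def __init__(self) -> None:
        self._in_progress = set()

    def start(self, host: str) -> None:
        """Record that a connection attempt to `host` has begun."""
        self._in_progress.add(host)

    def finish(self, host: str) -> None:
        """Record that the attempt to `host` completed or failed."""
        self._in_progress.discard(host)

    def is_connecting(self, host: str) -> bool:
        """Return True if an attempt to `host` is still in progress."""
        return host in self._in_progress
```

With many hundreds of sockets in use and the check made frequently, the difference between the two lookup strategies becomes significant.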

Update for version 9 of HFBM-0093

Version 9 includes a fix for an issue that was introduced in hotfix HFBM-0131-1 and also affected this hotfix. When installed on a Replica, this could prevent some messages from being sent to the Master until BoKS was restarted.



Last Modified On: May 25, 2018