This article applies to BoKS Manager 7.0. For information regarding BoKS Manager 6.7, please see Hotfix: Speedup of clntd send bridge (HFBM-0093).


Description

A) Batched messages for clntd on the Server Agents are queuing up on the Master, apparently because the bridge cannot send them as fast as they are produced.


B) If a file with queued messages was corrupt, queue processing halted.


Resolution / Workaround

These issues are resolved in the hotfix HFBM-0097, available for download from the HelpSystems Community Portal.

Version 1

A:

1)

The clntd send bridge for batched messages stores the active file descriptors to remote machines (connected or in the process of connecting) in an array. This array previously held 40 entries; it is now larger and configurable, which should greatly improve throughput when many hosts are down. The maximum size is 1024, but because that is normally also the per-process limit on file descriptors and the bridge needs some other open files, the default is 994. To lower the value, set BOKS_INIT_NFD or BRIDGE_CLNTD_S_MAX_SOCKETS in the ENV file. BOKS_INIT_NFD affects all processes started by boks_init, so we recommend BRIDGE_CLNTD_S_MAX_SOCKETS if you only need to reduce the maximum number of network sockets that the bridge can use. If the calculated value is less than 40, 40 is used.
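The sizing rules above can be illustrated with a small Python sketch. It is not the bridge's actual code: the function name, the way the two ENV values are combined, and the reserve of 30 descriptors (inferred only from 1024 - 994) are assumptions made for illustration.

```python
# Hypothetical sketch of the sizing rules; the bridge's real calculation
# is not documented in this article.
MAX_SLOTS = 1024     # upper bound stated in the article
MIN_SLOTS = 40       # floor stated in the article
RESERVED_FDS = 30    # assumption: the gap implied by the default, 1024 - 994


def effective_slots(boks_init_nfd=1024, max_sockets=None):
    """Return the assumed number of array slots for remote connections.

    boks_init_nfd stands in for BOKS_INIT_NFD and max_sockets for
    BRIDGE_CLNTD_S_MAX_SOCKETS (None means "not set in ENV").
    """
    slots = min(boks_init_nfd, MAX_SLOTS) - RESERVED_FDS
    if max_sockets is not None:
        slots = min(slots, max_sockets)
    # A calculated value below 40 is replaced by 40.
    return max(slots, MIN_SLOTS)


if __name__ == "__main__":
    print(effective_slots())                  # nothing set in ENV
    print(effective_slots(max_sockets=200))   # lowered for the bridge only
    print(effective_slots(boks_init_nfd=50))  # calculated value below 40
```

With nothing set, the sketch yields the documented default of 994; a very low setting is raised to the floor of 40.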

2)

The strategy for how often the clntd send bridge retries a connection to a host that it failed to communicate with has also been changed. The old strategy was to reconnect quite quickly at first, and then slowly back off until it tried once every 10 minutes. This was based on the assumption that hosts were normally up.
In practice, a host that cannot be contacted is often down for a long time, and attempts to contact it waste resources: each connection attempt takes some 5-10 seconds to time out, and in the meantime it occupies a slot in the above array, blocking messages to another host.
Now when a host is marked as down, the bridge retries once after 5 minutes (in the hope that the machine was just restarted) and then once every 30 minutes. The latter value is configurable using BRIDGE_CLNTD_S_MAX_CONNECT_RETRY_MINUTES in the ENV file and can be set to between 5 and 1440 minutes. With high values, it takes a long time for BoKS to discover that a host is back up again.
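As a rough illustration of the new schedule, the following Python sketch lists when retries would occur after a host is marked as down. The function name is hypothetical, and two details are assumptions: that each later interval is counted from the previous retry, and that an out-of-range setting is rejected (the bridge may handle it differently).

```python
# Hypothetical sketch of the retry schedule described above.
FIRST_RETRY_MIN = 5           # first retry, from the article
DEFAULT_MAX_RETRY_MIN = 30    # default for later retries, from the article
MIN_ALLOWED, MAX_ALLOWED = 5, 1440


def retry_offsets(count, max_retry_minutes=DEFAULT_MAX_RETRY_MIN):
    """Return minutes after 'marked down' at which the first `count` retries occur.

    max_retry_minutes stands in for BRIDGE_CLNTD_S_MAX_CONNECT_RETRY_MINUTES.
    """
    if not MIN_ALLOWED <= max_retry_minutes <= MAX_ALLOWED:
        # Assumption: reject out-of-range values in this sketch.
        raise ValueError("value must be between 5 and 1440 minutes")
    offsets = []
    elapsed = 0
    for attempt in range(count):
        elapsed += FIRST_RETRY_MIN if attempt == 0 else max_retry_minutes
        offsets.append(elapsed)
    return offsets


if __name__ == "__main__":
    print(retry_offsets(4))        # default setting
    print(retry_offsets(3, 1440))  # largest allowed setting
```

Under these assumptions the default gives retries at 5, 35, 65 and 95 minutes, while the largest setting delays the second retry by a full day.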

3)

As the 'boksdiag fque -bridge' command does not show the actual processing done by the clntd send bridge, simple monitoring has been added to the bridge. It monitors only batched messages, not messages sent directly to a host by e.g. cadm, and its purpose is to show whether the bridge is processing messages.
By default the output is written to $BOKS_var/monitoring/bridge_clntd_s.stat, and an entry looks like:

bridge_clntd_s@somehost.com Wed Dec 16 16:10:52 2015
Non-batch msgs: 198
Batch connect fail: 1100
Batch write ok: 1945
Batch write fail: 0
Batch read ok: 1942
Batch read fail: 3
Idle: 55%

Where:

  • "Non-batch msgs" is the number of non-batch messages processed. All other numbers except "Idle" refer to batched messages.
  • "Batch connect fail" is the number of failed connection attempts to hosts that are down or not reachable on the network.
  • "Batch write ok" and "Batch write fail" are the numbers of messages written successfully or with an error.
  • "Batch read ok" is the number of replies received.
  • "Batch read fail" is the number of timeouts or premature connection closes when reading a reply.
  • "Idle" is the percent of time spent waiting for new messages to send or replies to messages sent.

The normal ENV variables can be used to redirect the output or change the monitoring interval.
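For scripted checks, an entry in this format can be parsed with a few lines of Python. This is a minimal sketch that handles a single entry shaped like the example above; the function name is hypothetical, and how multiple entries are separated in the real file is not covered here.

```python
# Minimal sketch: parse one monitoring entry in the format shown above.
SAMPLE = """bridge_clntd_s@somehost.com Wed Dec 16 16:10:52 2015
Non-batch msgs: 198
Batch connect fail: 1100
Batch write ok: 1945
Batch write fail: 0
Batch read ok: 1942
Batch read fail: 3
Idle: 55%
"""


def parse_entry(text):
    """Return the source, timestamp and integer counters of one entry."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    # Header line: "<bridge>@<host> <timestamp>"
    source, _, timestamp = lines[0].partition(" ")
    counters = {}
    for ln in lines[1:]:
        name, _, value = ln.partition(":")
        # "Idle" carries a trailing percent sign; the others are plain integers.
        counters[name.strip()] = int(value.strip().rstrip("%"))
    return {"source": source, "timestamp": timestamp, "counters": counters}


if __name__ == "__main__":
    entry = parse_entry(SAMPLE)
    print(entry["source"], entry["timestamp"])
    for name, value in entry["counters"].items():
        print(f"{name}: {value}")
```

A monitoring script could, for example, compare "Batch write ok" between two entries to confirm that the bridge is still making progress.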

4)

During the time the bridge was processing a non-batch message, it did no processing of batch messages. This meant that a heavy use of non-batch messages, for example to fetch keystroke log files, would slow down batch processing. The code has been rewritten so the bridge will now be able to process batch messages at the same time as it is processing a non-batch message.

5)

When the clntd send bridge encountered a queued batch message to a machine to which there was already an outstanding message, it stopped queue processing until the outstanding message was processed. If there were consecutive messages to the same machine, it processed these one by one, and only when they were done did it process messages later in the queue. This impacted performance. Now, in that case, the new message is simply added to an internal queue and processing of the batch queue continues.
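The changed behavior can be sketched in Python as follows. This illustrates the idea only, not the bridge's implementation; the function name and the data structures are assumptions.

```python
# Hypothetical sketch: park messages for busy hosts instead of stalling the queue.
from collections import defaultdict, deque


def dispatch(batch_queue):
    """Walk the batch queue once.

    batch_queue is an iterable of (host, message) pairs. Returns the
    messages that can be sent immediately and the per-host internal
    queues of messages that must wait for an outstanding one.
    """
    outstanding = set()            # hosts that already have a message in flight
    deferred = defaultdict(deque)  # per-host internal queue
    sent_now = []
    for host, message in batch_queue:
        if host in outstanding:
            # Old behavior: stop here until the outstanding message is done.
            # New behavior: park the message and keep walking the queue.
            deferred[host].append(message)
        else:
            outstanding.add(host)
            sent_now.append((host, message))
    return sent_now, dict(deferred)


if __name__ == "__main__":
    queue = [("hostA", "m1"), ("hostA", "m2"), ("hostB", "m3")]
    sent, waiting = dispatch(queue)
    print(sent)
    print({host: list(msgs) for host, msgs in waiting.items()})
```

In the example, the second message to hostA is parked, so the message to hostB is no longer held up behind it.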

B:

Batch messages are stored in files. Previously, if one of these files was corrupt for some reason, batch processing stopped. This has been fixed so that the offending file is copied for later analysis and processing continues with the next file. Warning messages are also written to boks_errlog:

WARNING: Got error () when reading queue file

and

WARNING: Copying to bad file: /BAD.
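The "copy aside and continue" pattern can be sketched in Python as below. Everything specific in it is hypothetical: the validity check, the directory layout, the function names and the printed message are stand-ins, not the bridge's actual file format, paths or log text.

```python
# Hypothetical sketch: set unreadable queue files aside and keep processing.
import shutil
import tempfile
from pathlib import Path


def is_readable(path):
    """Stand-in validity check (assumption): non-empty UTF-8 text counts as good."""
    try:
        return bool(path.read_text(encoding="utf-8"))
    except (UnicodeDecodeError, OSError):
        return False


def process_queue_dir(queue_dir, bad_dir):
    """Process files in name order; copy unreadable ones to bad_dir and continue."""
    processed, set_aside = [], []
    bad_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(queue_dir.iterdir()):
        if not path.is_file():
            continue
        if is_readable(path):
            processed.append(path.name)
        else:
            # Keep a copy for later analysis, then move on to the next file.
            shutil.copy2(path, bad_dir / path.name)
            print(f"warning: set aside unreadable queue file {path.name}")
            set_aside.append(path.name)
    return processed, set_aside


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        queue, bad = Path(tmp) / "queue", Path(tmp) / "bad"
        queue.mkdir()
        (queue / "001").write_text("message", encoding="utf-8")
        (queue / "002").write_bytes(b"\xff\xfe\x00")  # not valid UTF-8
        (queue / "003").write_text("message", encoding="utf-8")
        print(process_queue_dir(queue, bad))
```

The point of the pattern is that one damaged file costs only that file's messages; the files after it are still processed.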

To implement this fixed functionality, apply the hotfix HFBM-0097, available for download from the FoxT customer services support site, to your Master / Failover Master.

Version 2

Speedup:

The boks_bridge made many calls to a routine that checks whether a network connection is in progress for a host. This routine searched through a list, which consumed a lot of CPU time. The logic now makes fewer calls to this routine, and the routine itself uses a hash to speed up the lookup.
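The difference between the two lookups can be illustrated with a short Python sketch. The class and method names are hypothetical; the sketch only shows the general technique of replacing a linear list search with a hash-based membership test.

```python
# Hypothetical sketch: hash-based "connection in progress" lookup.
class InProgress:
    """Track hosts with a connection attempt in progress."""

    def __init__(self):
        # A set is hash-based: membership tests take roughly constant time,
        # whereas scanning a list grows with the number of hosts.
        self._hosts = set()

    def start(self, host):
        self._hosts.add(host)

    def finish(self, host):
        self._hosts.discard(host)

    def is_connecting(self, host):
        return host in self._hosts


if __name__ == "__main__":
    tracker = InProgress()
    tracker.start("hostA")
    print(tracker.is_connecting("hostA"))
    tracker.finish("hostA")
    print(tracker.is_connecting("hostA"))
```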


Bug fixes:

  • The use of 'boksdiag fque -bridge -delete ' could in certain cases disrupt the bridge's internal queue of hosts with queued messages. As a result, messages queued to hosts not yet in that queue were not processed by the bridge until it was restarted. This issue is fixed.
  • In very special circumstances, the bridge could access memory that had already been freed, which normally caused a core dump. This issue is fixed.
  • If the boks_bridge failed to create a new queue file, it would lose messages. This has been fixed so if it fails to create a new file it will simply continue adding messages to the last queue file.
  • The boks_bridge queue files were compacted too often. As this is an expensive operation (potentially copying lots of data), it impacted the performance of the boks_bridge. Now compaction will be done at most once every 10 minutes.
  • In very special cases it was possible for the bridge to get stuck in an infinite loop so it would not process any new messages. This has been fixed.
  • With the default value for BRIDGE_CLNTD_S_MAX_SOCKETS, it was possible for the bridge to use more than 1024 file descriptors, which would cause select() to give an error. Now the bridge reserves 200 file descriptors for internal use to avoid this problem.
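The compaction limit in the list above ("at most once every 10 minutes") is a simple throttle, which can be sketched in Python as follows. The class name and the injected clock value are assumptions for illustration; the bridge's real implementation is not shown in this article.

```python
# Hypothetical sketch: allow an expensive operation at most once per interval.
class CompactionThrottle:
    def __init__(self, min_interval_s=600.0):
        self.min_interval_s = min_interval_s  # 10 minutes, from the article
        self._last = None                     # time of the last compaction

    def should_compact(self, now):
        """Return True (and record the time) if compaction is allowed at `now`.

        `now` is a timestamp in seconds, e.g. from time.monotonic().
        """
        if self._last is None or now - self._last >= self.min_interval_s:
            self._last = now
            return True
        return False


if __name__ == "__main__":
    throttle = CompactionThrottle()
    for t in (0, 300, 600, 1199, 1200):
        print(t, throttle.should_compact(t))
```

Requests that arrive inside the 10-minute window are simply skipped, so the cost of copying queue data is bounded regardless of how often compaction is requested.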



Last Modified On: May 25, 2018