While investigating some issues for the inks company today, I came across a problem we’ve seen before with Juniper’s WX appliances.
A site has problems transferring data across the network between two Junipers. Since we’ve already been through the mill with this application, there is a special exclusion application definition for it. So we try the usual trick: disable the NSC (disk-based) compression and TCP acceleration by putting the failing server into the exclusion application. We re-run the transfer, and the problem persists.
Since we did that, we’ve added several other Juniper sites to the network, along with several hundred users. This means the pair of servers involved is facing increased load from the other functions the server carries. (The application comes in three parts: one gathers the data, one processes it, and a third supplies that information to the rest of the organisation. The issues we’ve seen so far have all related to the gathering of the information, since that is the process that generates and logs errors, and a failure there is also noticed as missing information.)
The issue is not in the gathering of the data itself, but in the probable undersizing of the system(s) running the entire application. The system appears to have a finite amount of network buffer space with which to send and receive data. With acceleration enabled for the entire system (both the gathering and publishing sides), the transfers hit several points of TCP zero window on the gathering side (which is what was being measured today). The zero window means the server is running out of buffer space; it normally signifies a bottleneck within the system that prevents received data from being processed quickly enough. That isn’t surprising, since the process contacts remote databases and drags information from them into a consolidated central store (a data warehouse). A zero window can also mean the operating system itself can accept no more data because its buffer space is exhausted.
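If you want to see the symptom for yourself, here’s a minimal Python sketch that reproduces a zero-window stall locally. The host, port, buffer size, and timeout are all made up for the test; capture the loopback traffic in Wireshark while it runs and you’ll see the receiver advertising window 0, just as the gathering server was.

```python
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 5001   # hypothetical test endpoint

def slow_receiver():
    # A receiver with a deliberately tiny buffer that never reads,
    # standing in for a busy server that can't drain its socket.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
    srv.bind((HOST, PORT))
    srv.listen(1)
    conn, _ = srv.accept()
    time.sleep(30)   # never call conn.recv(), so the window fills
    conn.close()

threading.Thread(target=slow_receiver, daemon=True).start()
time.sleep(0.5)   # give the listener time to start

sender = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sender.connect((HOST, PORT))
sender.settimeout(5)
sent = 0
try:
    while True:
        sender.send(b"x" * 4096)   # stalls once the peer advertises window 0
        sent += 4096
except socket.timeout:
    # 'sent' includes data parked in the local send buffer as well
    print(f"stalled after {sent} bytes: the receiver's window has closed")
finally:
    sender.close()
```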
By adding the gathering connections into an application definition that excludes the TCP acceleration function, the volume of data being received is throttled, because the native TCP window now comes into play for the server we have the issue with (it sits in South Africa, while the data warehouse is in Arizona, USA, so the round-trip time is long). In the past this restored the gathering function, but in the intervening time we’ve added several more users connecting through Junipers to the publishing side of the application, and this has increased the load on the total system buffers, since more of them are now occupied by data leaving the system on its way to its subscribers. In the end, disabling TCP acceleration for both sides of the application (gathering and publishing) reduced the load on the server and let the system complete both tasks without errors.
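The throttling effect of the native window is easy to quantify: TCP throughput is capped at roughly the window size divided by the round-trip time. The figures below are assumptions for illustration (a classic 64 KB window with no window scaling, and a guessed 300 ms RTT for South Africa to Arizona), not measurements from this network.

```python
# Throughput ceiling imposed by the TCP receive window on a long-RTT path:
#   throughput <= window / RTT
window_bytes = 64 * 1024   # 64 KB window, no window scaling (assumed)
rtt_seconds = 0.30         # ~300 ms South Africa <-> Arizona (assumed)

throughput_mbps = window_bytes * 8 / rtt_seconds / 1e6
print(f"ceiling: ~{throughput_mbps:.2f} Mbit/s")   # ~1.75 Mbit/s
```

So even a healthy connection on that path would crawl along at under 2 Mbit/s with acceleration off, which is exactly the relief valve the overloaded server needed.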
I’m fairly convinced that an analysis of this server will show it is heavily overloaded, and that a good dose of memory and some careful tuning of the operating system’s network stack would allow TCP acceleration to be re-enabled. (Part of this comes from a comment that processing the data can take “quite a long time, nearly an hour a site”, when only 10Mb of data is transferred across the network itself.)
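That comment is easy to sanity-check. Whether that “10Mb” means megabits or megabytes, the effective rate over an hour is tiny, which says the time is going into processing, not transfer:

```python
seconds = 3600   # "nearly an hour a site"
for label, bits in [("10 megabits", 10e6), ("10 megabytes", 80e6)]:
    print(f"{label}: ~{bits / seconds / 1e3:.1f} kbit/s effective")
# 10 megabits  -> ~2.8 kbit/s effective
# 10 megabytes -> ~22.2 kbit/s effective
```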
So if you’ve got issues with TCP acceleration, and they affect a single server, it might be an idea to ensure that server is operating correctly. Check the application load, the amount of swap space in use, and the amount of memory reserved for network buffers. It isn’t true that you can simply enable TCP acceleration without an impact on the network: you are automatically making each connection work harder, and that takes more resources in the system to support, especially if the system load remains constant.
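As a starting point, here’s a rough sketch of those checks, assuming a Linux host (the /proc paths are Linux-specific, and the servers in this story may well run something else entirely):

```python
# Quick health check of the items above: swap in use, free memory,
# and TCP buffer memory. Linux-only, since it reads /proc directly.

def read_meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])   # values are reported in kB
    return info

mem = read_meminfo()
swap_used_kb = mem["SwapTotal"] - mem["SwapFree"]
print(f"swap in use:    {swap_used_kb} kB")
print(f"memory free:    {mem['MemFree']} kB")

# /proc/net/sockstat reports TCP buffer memory in pages, e.g.
#   "TCP: inuse 12 orphan 0 tw 4 alloc 15 mem 3"
with open("/proc/net/sockstat") as f:
    for line in f:
        if line.startswith("TCP:"):
            fields = line.split()
            pages = int(fields[fields.index("mem") + 1])
            # assumes the common 4 kB page size
            print(f"TCP buffer mem: {pages} pages (~{pages * 4} kB)")
```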
So there is an impact on servers under TCP acceleration, and it ain’t necessarily all good. Make sure your system can cope with the extra network buffer utilisation it needs.