Application Performance II

I thought I would just tell a few quick stories about my encounters with application efficiency and how the network was blamed. In the spirit of humility, I'll offer where I feel I did poorly and where I did well, and what lessons I learned.

City Police Database
The biggest, most obvious example of this was when I was doing some consulting work for a city government, which shall of course remain nameless. They had gotten a new application (and when I say new, I mean it--it had not been seriously proven in any other installations). I was asked to make the application highly available. The application was not ready to be installed yet, but I was to do the up-front work. The parameters were thus:
  • We don't want to buy any new network equipment
  • The redundancy must go across multiple sites, which are connected by DS3 links several hops away.
So I devised a method I thought was pretty good to fit these parameters. It used pre-existing Cisco 3845 routers: one at the "main site" and one at the "backup site." The three servers that made up the application (two application servers and one database server) went on a new VLAN that was routed by the 3845. The VLAN at the main site had one IP subnet and the VLAN at the backup site had another. But each server had a loopback interface, and those loopback interfaces were the ones DNS pointed to. The 3845 at the main site used Cisco's IP SLA with tracking, probing the application's TCP port on its physical address. If the application went down, the router stopped advertising the loopback /32 via EIGRP. The backup site would then take over, because it had a route to that /32 address with an administrative distance of 200--higher, and thus less preferred, than EIGRP's external administrative distance of 170--so its route only won once the main site's advertisement disappeared.
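As a rough sketch, the main-site side of that arrangement might look something like the following. All addresses, ports, and AS numbers here are hypothetical stand-ins, not the actual configuration:

```
! Hypothetical sketch of the main-site 3845.
! Probe the application's TCP port on the server's physical address.
ip sla 10
 tcp-connect 10.1.10.5 8080
 frequency 10
ip sla schedule 10 life forever start-time now
!
! Track the probe, and tie a static route for the server's loopback
! /32 to it, so the route (and its redistribution into EIGRP as an
! external) disappears when the application stops answering.
track 1 ip sla 10 reachability
ip route 10.255.0.5 255.255.255.255 10.1.10.5 track 1
!
router eigrp 100
 redistribute static
```

The backup site would carry a floating static route for the same /32 at administrative distance 200, which stays dormant as long as the EIGRP external route (AD 170) from the main site is present.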

All that is background, which is actually not related to my point. All of this redundancy stuff should probably not have been put in, at least not without a careful up-front study of the application itself. That's another lesson: do your homework first and save yourself a lot of headaches. If you have built a Cadillac simply to transport monkeys, you've probably wasted your time and effort (and the customer's money). My personality type is ISTP, which, if you know psychology, tells you why I went ahead and did all this.

It turns out that the application was horribly written in many ways: it didn't run as a service but as a user-level program, so we had to set the servers to auto-login, and so on. But the biggest problem--and the point of this section--was the performance. It had never been load-tested, nor had it been run over anything less than a gigabit network. So when it was slow, it was of course the fault of the network, and more specifically, the network engineer who obviously didn't know what he was doing.

Here's what we determined after I did a careful analysis of the application (without seeing any code or getting much help from the programmers, who didn't speak English very well). The client component of the application, which sat on the user's workstation, was doing SQL queries in the background. Instead of doing a query like:
  • select COMPONENT from TABLE where SOMETHING > 100;
it was doing something like:
  • select * from TABLE;
then, once it had downloaded the huge amount of data, it would "post-filter" it on the client. Yes, definitely a network problem, when every query resulted in downloading over 100 megs of data that the client software would then weed down just to show 10 records.

It was really difficult maintaining humility in this, and I must confess that though I did maintain humility while investigating, I blew a gasket once I discovered the problem.

Pharmaceutical Database
There was a particular pharmaceutical company I consulted for that wanted a fast link between their main site and another site, which housed many of their users as well as the off-site backups. They already had a DS3 between the sites, but they wanted to put in an additional 100 Mbps connection, which they did. I then configured the routers to route traffic primarily over the 100 Mbps connection and secondarily over the DS3.

Everything was fine and we verified the routing was going the correct way, etc. But then the database backups started going much slower than they had before. After investigating, I discovered that the DS3 ran more-or-less directly from their office in West Bend, WI, to the office in Waukesha, WI. However, the 100 Mbps connection ran from West Bend, WI, up to the CO in St Paul, MN, then down to Waukesha, WI.

Obviously, the latency over the 100 Mbps connection (25-30 milliseconds) was significantly higher than the latency over the DS3, since the distance was so much greater. So this was a network problem, if you will. However, again, the way the database was synchronizing, it was doing thousands of little queries rather than a bulk transfer.

Here were some possible solutions, in no particular order.
  1. Modify all the computers to use a higher TCP window size using the Window scale option. Without window scaling, you have to consider latency when determining the maximum speed of a TCP session. It goes like this:

    Window Size (bytes)
    ------------------- * 8 bits/byte = bps
       Latency (sec)

    So 65535 / .025 * 8 = ~21 Mbps, meaning we could not get more than about 21 Mbps from this connection.
  2. Make the syncing go over the DS3.
  3. Ask the programmers to re-write the way the database synchronizes.
  4. Make the carrier reroute the 100 Mbps connection.
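To make the arithmetic behind option 1 concrete, here is a small Python sketch of that throughput ceiling (the function name is mine, not from any library):

```python
# Without TCP window scaling, a sender can have at most one receive
# window of unacknowledged data in flight per round trip, so
# throughput is capped at window size divided by round-trip time.

def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Throughput ceiling in bits per second for a given window and RTT."""
    return window_bytes / rtt_seconds * 8  # bytes/sec -> bits/sec

# The classic 65535-byte window over the ~25 ms path described above:
print(f"{max_tcp_throughput_bps(65535, 0.025) / 1e6:.1f} Mbps")  # ~21 Mbps
```

Note that this is a per-session ceiling: raising the window (via window scaling) or lowering the RTT are the only ways to lift it, which is why options 1 and 4 were on the list at all.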
First we enabled the TCP window scaling option on the two servers that needed to synchronize across this connection. That didn't seem to help. Then we dug into just how the application was working and discovered the little queries that made the syncing work, which would be unaffected by window sizing. So we tried option 3, since that was what we had determined to be the best. The programmers laughed and said it was a networking problem and our fault, because it used to work better. We probably should have pushed them harder and gotten management more involved, but we didn't. We tried option 4, but that was impossible: the carrier was based in MN and would have charged roughly a bazillion dollars to run fiber directly between the two sites--and this carrier had originally been chosen precisely because they did a good price/sales job on management.

Ultimately we made the syncing go over the DS3, using offset lists in EIGRP on /32 routes for those two servers. That also meant asymmetric routing whenever anything else communicated across the WAN with one of those two servers (besides the server-to-server traffic, of course). No one cared, as long as it "worked better" and the redundancy was still there.
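For what it's worth, the offset-list trick looks roughly like this (hypothetical addresses, AS number, and interface name; the offset simply inflates the EIGRP metric enough that the DS3 path wins for just those two hosts):

```
! Match only the two servers' /32 routes.
access-list 10 permit host 10.255.0.5
access-list 10 permit host 10.255.0.6
!
router eigrp 100
 ! Add a large offset to the metric of those routes as learned
 ! inbound on the 100 Mbps link, making the DS3 path preferred.
 offset-list 10 in 500000 FastEthernet0/0
```

Every other destination keeps preferring the 100 Mbps link, which is what produces the asymmetric routing mentioned above.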

A few lessons from these experiences:
  • Do your homework! Don't rush ahead without asking a lot of good questions. Some things may seem obvious based on your past experience, but don't assume.
  • If you determine something to be the best option and it gets laughed off by the people you're suggesting it to, go to management ready to explain all the ins-and-outs of the problem as you understand it.
  • Whatever your personality type, attempt to show restraint in running ahead with solutions. Brain-storm, but remember that brainstorming is not meant to be implemented on a whim. Sleep on it and run it by colleagues and other smart people.
Remember, work as a team to get it resolved. If you blame someone else, try to put yourself in their shoes and always speak with humility and remind people you are on the same team--you just want to find the best solution. This kind of thinking is often contagious, and usually management will appreciate it. Of course, this isn't always true, and you can't control other people, but if you keep your cool, things will go better, at least for you.
