Thoughts: March 2011

I thought I would just tell a few quick stories about my encounters with application efficiency and how the network was blamed. In the spirit of humility, I'll offer where I feel I did poorly and where I did well, and what lessons I learned.

City Police Database
The biggest, most obvious example of this was when I was doing some consulting work for a city government, which shall of course remain nameless. They had gotten a new application (and when I say new, I mean it--it had not been seriously proven in any other installations). I was asked to make the application high-availability. The application was not ready to be installed yet, but I was to do the up-front work. The parameters were thus:

We don't want to buy any new network equipment
The redundancy must go across multiple sites, which are connected by DS3 links several hops away.

So I devised a method I thought was pretty good to fit these parameters. It used pre-existing Cisco 3845 routers: one at the "main site" and one at the "backup site." The 3 servers that made up the program (two application servers and one database server) went on a new VLAN that was routed by the 3845. The VLAN at the main site had one particular IP subnet and the VLAN at the backup site had another subnet. But each server had a loopback interface, and those interfaces were the ones that DNS pointed to. The 3845 at the main site used Cisco's IP SLA with tracking, tracking the application's TCP port on its physical address. If the application went down, the router stopped advertising the loopback /32 via EIGRP. The backup site's route would then take over, because it was a route to that same /32 address but with an administrative distance of 200, which is higher (and therefore less preferred) than EIGRP's external administrative distance of 170, so it was only used once the main site's route was withdrawn.
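The failover logic above boils down to "the lowest administrative distance wins, and the backup route is only used when the preferred one is gone." Here is a minimal Python sketch of that selection rule. It is not router configuration, and the loopback address and route labels are made up for illustration; only the two administrative distances (170 and 200) come from the setup described.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    prefix: str          # destination, here the server loopback /32
    source: str          # label for where the route came from (illustrative)
    admin_distance: int  # lower value is preferred


def best_route(candidates: List[Route]) -> Optional[Route]:
    """Pick the candidate with the lowest administrative distance, or None if empty."""
    return min(candidates, key=lambda r: r.admin_distance, default=None)


def routes_for_loopback(main_site_app_up: bool) -> List[Route]:
    """Model which routes to the loopback exist, given the tracked application state."""
    loopback = "10.0.0.10/32"  # hypothetical address, not from the original setup
    # The backup-site route is always present, with the less-preferred distance of 200.
    routes = [Route(loopback, "backup-site floating route", 200)]
    # The main site's EIGRP external route (distance 170) exists only while
    # the tracked TCP port on the application server is answering.
    if main_site_app_up:
        routes.append(Route(loopback, "main-site EIGRP external", 170))
    return routes


if __name__ == "__main__":
    for app_up in (True, False):
        chosen = best_route(routes_for_loopback(app_up))
        print(f"application up={app_up}: traffic follows {chosen.source}")
```

The point of the sketch is only that the backup route never has to be "turned on." It is always there and simply loses the comparison while the main site's route exists.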

All that is background, which is actually not related to my point. All of this redundancy stuff should probably not have been put in, at least without a careful up-front study of the application itself. That's another lesson: do your homework first, and save a lot of headache. If you have built a Cadillac simply to transport monkeys, you've probably wasted your time and efforts (and the customer's money). My personality type is ISTP, which if you know psychology, tells you why I went ahead and did all this.

It turns out that the application was horribly written in many ways--it didn't run as a service, it ran as a user-level program, so we had to set the servers to auto-login, etc. But the biggest problem was, and the point of this section, the performance. It had never been load-tested, nor had it been run over anything less than a gigabit network. So when it was slow, it was of course the fault of the network, and more specifically, the network engineer who obviously didn't know what he was doing.

Here's what we determined after I did a careful analysis of the application (without seeing any code or finding this out from the programmers who didn't speak English very well). The client component of the application, which sat on the user's workstation, was doing SQL queries in the background. Instead of doing a query like:

select COMPONENT from TABLE where SOMETHING > 100;

it was doing something like:

select * from TABLE;

then once it downloaded the huge amount of data, it would "post-filter" that. Yes, definitely a network problem when every query resulted in the downloading of over 100 megs of data, which would then be weeded out by the client software just to show 10 records.
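To make the difference concrete, here is a small Python sketch using the standard-library sqlite3 module. The table, column names, and row counts are invented for illustration (I never saw the real schema), and an in-memory database has no network in the path, so the number of rows fetched stands in for the amount of data pulled across the wire.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db(rows: int = 10_000) -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical schema and some filler data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        "id INTEGER PRIMARY KEY, something INTEGER, component TEXT, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO records (something, component, payload) VALUES (?, ?, ?)",
        ((i % 1000, f"component-{i}", "x" * 200) for i in range(rows)),
    )
    return conn


def client_side_filter(conn: sqlite3.Connection, threshold: int) -> Tuple[List[str], int]:
    """What the application did: fetch every row, then filter on the client."""
    all_rows = conn.execute("SELECT * FROM records").fetchall()
    # Columns come back as (id, something, component, payload).
    matches = [row[2] for row in all_rows if row[1] > threshold]
    return matches, len(all_rows)


def server_side_filter(conn: sqlite3.Connection, threshold: int) -> Tuple[List[str], int]:
    """What it should have done: let the database filter and return one column."""
    rows = conn.execute(
        "SELECT component FROM records WHERE something > ?", (threshold,)
    ).fetchall()
    return [row[0] for row in rows], len(rows)


if __name__ == "__main__":
    db = build_demo_db()
    slow, fetched_slow = client_side_filter(db, 998)
    fast, fetched_fast = server_side_filter(db, 998)
    print(f"client-side filter fetched {fetched_slow} rows to keep {len(slow)}")
    print(f"server-side filter fetched {fetched_fast} rows to keep {len(fast)}")
```

Both functions return the same records. The only difference is how much data has to move before the client can show them.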

It was really difficult maintaining humility in this, and I must confess that though I did maintain humility while investigating, I blew a gasket once I discovered the problem.

Pharmaceutical Database
There was a particular pharmaceutical company I consulted for that wanted a fast link between their main site and another site, which housed many of their users as well as their off-site backups. They already had a DS3 between the sites, but they wanted to put in an additional 100 Mbps connection, which they did. I then configured all the routers, etc, to route things primarily down the 100 Mbps connection and secondarily down the DS3.

Everything was fine and we verified the routing was going the correct way, etc. But then the database backups started going much slower than they had before. After investigating, I discovered that the DS3 ran more-or-less directly from their office in West Bend, WI, to the office in Waukesha, WI. However, the 100 Mbps connection ran from West Bend, WI, up to the CO in St Paul, MN, then down to Waukesha, WI.

Obviously, the latency over the 100 Mbps connection was significantly higher (25-30 milliseconds) than the latency over the DS3, since the distance was so much greater. So this was a network problem, if you will. However, again, the way the database was synchronizing, it was doing thousands of little queries rather than a bulk transfer.

Here were some possible solutions, in no particular order.

1. Modify all the computers to use a larger TCP window size using the window scale option. Without window scaling, the window tops out at 65,535 bytes, so you have to consider latency when determining the maximum speed of a TCP session. It goes like this:

Window Size (bytes)
------------------------- * 8 bits/byte = bps
Round-trip latency (sec)

So 65535 / .025 * 8 = ~21 Mbps. So we could not get more than 21 Mbps from a single TCP session over this connection.
2. Make the syncing go over the DS3.
3. Ask the programmers to rewrite the way the database synchronizes.
4. Make the carrier reroute the 100 Mbps connection.
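The window-size arithmetic from the first option can be sketched in a few lines of Python. The function names are mine, and the 25 ms figure is the low end of the latency measured above, treated here as the round-trip time.

```python
def max_tcp_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound for one TCP session: one window per round trip, in bits per second."""
    return window_bytes / rtt_seconds * 8


def window_needed_bytes(target_bps: float, rtt_seconds: float) -> float:
    """Window size needed to keep a link of target_bps full at the given round-trip time."""
    return target_bps * rtt_seconds / 8


if __name__ == "__main__":
    rtt = 0.025  # 25 ms, from the measurements above
    unscaled = max_tcp_throughput_bps(65535, rtt)
    print(f"65535-byte window at 25 ms: about {unscaled / 1e6:.0f} Mbps")
    needed = window_needed_bytes(100e6, rtt)
    print(f"window needed to fill 100 Mbps at 25 ms: about {needed / 1000:.0f} KB")
```

Run the other way around, the same formula says that filling the 100 Mbps link at 25 ms takes a window of roughly 312 KB, which is why window scaling was the first thing to try.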

First we enabled the TCP window scaling option for the two servers that needed to synchronize across this connection. That didn't seem to help. Then we dug into just how the application was working and discovered the little queries that made the syncing work. Each one had to wait for a full round trip before the next could go out, so they would be unaffected by window sizing. So we tried option 3, because it was what we determined to be the best. The programmers laughed and said it was a networking problem and it was our fault because it used to work better. We probably should have pushed them harder and gotten management involved more, but we didn't. We tried option 4 but that was impossible because the carrier was based in MN and would have charged roughly a bazillion dollars to run fiber directly between the two sites, and this carrier was originally chosen precisely because they did a good price/sales job on the management.
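Here is a rough Python model of why the little queries hurt so much and why a bigger window could not save them. It assumes the queries run strictly one after another, each waiting for its answer. The query count, the data size, and the 2 ms figure for the DS3 are made-up numbers for illustration; only the 25 ms latency and the 100 Mbps link speed come from the story.

```python
def serial_query_time_seconds(
    num_queries: int, rtt_seconds: float, per_query_server_seconds: float = 0.0
) -> float:
    """Time for queries issued one at a time: each pays a full round trip."""
    return num_queries * (rtt_seconds + per_query_server_seconds)


def bulk_transfer_time_seconds(total_bytes: int, link_bps: float, rtt_seconds: float) -> float:
    """Idealized single bulk transfer: one round trip plus time on the wire."""
    return rtt_seconds + total_bytes * 8 / link_bps


if __name__ == "__main__":
    queries = 10_000  # hypothetical number of small sync queries

    over_new_link = serial_query_time_seconds(queries, 0.025)  # 25 ms path
    over_ds3 = serial_query_time_seconds(queries, 0.002)       # hypothetical 2 ms path
    print(f"{queries} serial queries at 25 ms: {over_new_link:.0f} s")
    print(f"{queries} serial queries at 2 ms:  {over_ds3:.0f} s")

    # The same data sent as one stream (hypothetical 50 MB) over 100 Mbps at 25 ms.
    bulk = bulk_transfer_time_seconds(50_000_000, 100e6, 0.025)
    print(f"one 50 MB bulk transfer at 100 Mbps, 25 ms: {bulk:.1f} s")
```

In this model bandwidth never appears in the serial case at all. Latency is the whole cost, which is why the faster but longer link made the sync slower.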

Ultimately we made the syncing go over the DS3 using offset lists in EIGRP for /32 routes for those two servers. That also meant that there would be asymmetric routing when anything communicated across the WAN with one of those two servers (besides the server-to-server communication of course). No one cared, as long as it "worked better" and redundancy was there.

Three lessons from these experiences:

Do your homework! Don't rush ahead without asking a lot of good questions. Some things may seem obvious based on your past experience, but don't assume.
If you determine something to be the best option and it gets laughed off by the people you're suggesting it to, go to management ready to explain all the ins-and-outs of the problem as you understand it.
Whatever your personality type, attempt to show restraint in running ahead with solutions. Brainstorm, but remember that the results of brainstorming are not meant to be implemented on a whim. Sleep on it and run it by colleagues and other smart people.

Remember, work as a team to get it resolved. If you blame someone else, try to put yourself in their shoes and always speak with humility and remind people you are on the same team--you just want to find the best solution. This kind of thinking is often contagious, and usually management will appreciate it. Of course, this isn't always true, and you can't control other people, but if you keep your cool, things will go better, at least for you.

Application Efficiency, Part I
The website Ethereal Mind, which I often read, had a recent post in response to another post by Matthew Norwood entitled Programming Bad Performance. It touched on something every network engineer feels when application performance is slow somewhere on the network. It's a topic I have dealt with a lot in my professional career, so I uncharacteristically weighed in with a comment, which I present below in a modified form...

It seems that the onus, or burden of proof, is often on the network engineer to figure out just what the problem is, since the network is this mysterious entity that most people don’t know much about. I have found two types of server admins in my experience:

The ones that assume it’s their problem and never think about the network, but then it proves to be a network problem (often when attempting to establish communication between a service network and an internal network).
The ones that assume (along with all the users, usually) that it is a network problem and very quickly start pointing fingers.

Over the years I have learned a lot more than I ever wanted to about efficient database programming and about analyzing server CPU, memory, and disk utilization, etc, simply as a means to find what the real problem is. Greg at Etherealmind points out that proving it’s not the network, or at least casting plausible reason to believe it’s not, will get you home on time. I agree with that, but in a world of people saying, “It’s not my problem” and washing their hands of it, I want to put forth a little more effort to get to the bottom of things.

Something else that I have learned is that humility is really important. If you start acting cocky and the problem turns out to be yours, you really look like, and are, a jerk. Taking more of a “Let’s figure this out together” attitude is much more likely to lead to success and team unity in the end. The real art is being able to foster and maintain that attitude when other people seem to be out for blood.

The Jesus Connection
I'll make this my first post that crosses my two categories of blogging. For anyone wanting to live life the way Jesus would, the previous paragraph shows one or two ways of doing that. The Bible often speaks of humility... three places that jump out to me:

Do nothing out of selfish ambition or vain conceit. Rather, in humility value others above yourselves, not looking to your own interests but each of you to the interests of the others (Philippians 2:3-4).
For those who exalt themselves will be humbled, and those who humble themselves will be exalted (Matthew 23:12).
When pride comes, then comes disgrace, but with humility comes wisdom (Proverbs 11:2).

Unity is also key to solving problems. It's amazing how humility and unity go together as well (I think that's pretty obvious). If everyone is cocky, it does not help team unity--it just creates divides. Be completely humble and gentle; be patient, bearing with one another in love. Make every effort to keep the unity of the Spirit through the bond of peace. There is one body and one Spirit, just as you were called to one hope (Ephesians 4:2-4).

Three of my favorite topics: humility, unity, and hope!
