(This article is adapted from a very interesting post written by Sunil Uttamchandani, Co-founder and Director of Services at Mithi Software, a Pune-based Software Products company specializing in software for email, collaboration and other enterprise products. The article first appeared on the Mithi Blog and is adapted & reproduced here for the benefit of PuneTech readers with permission.)
Most of the education of a software developer is centered around programming: keeping code clean, maintainable, debuggable and well-tested, ensuring that customers don’t run into bugs, and making sure that if they do, the bugs are easy to find. However, in real life, one of the most difficult categories of bugs to find is the “invisible” bug. The first thing you notice about such a bug is that a customer complains about it, but you are unable to reproduce it in your environment. Now, if there is one thing you cannot convince a customer of, it is that the bug is caused by some misconfiguration of the software infrastructure in the customer’s environment. All bugs are bugs in your product, irrespective of what actually caused them.
On the Mithi Software blog, Sunil Uttamchandani talks about how their products (email servers and other enterprise collaboration software) often run into “Intangible/Invisible Network Obstacles” when dealing with customer bugs.
Here he describes a recent experience.
A Ghost In the Network
Recently during a POS (proof on site) exercise with a prospective customer, we had to perform a test in which an email client would send mail to a large number of recipients from our cloud email setup and capture performance test results. As a regular practice, we set up the SMTP controls on our server to allow this test, ran the test from our environment and then asked the client to repeat the same test in their environment.
The test failed in the client environment.
We enabled the SMTP scanning engines for their source IP to capture detailed information (which would naturally slow down the mail flow), and we found that the client could deliver a few mails, but would give up after a little while. It would simply show the progress bar, but would not move ahead. The logs on our server showed that there were no more connections coming from that client. As a first troubleshooting step, we removed the scanning controls and simplified the SMTP rules in our product so that no checks were made for their source IP address, to speed things up. We did another round of testing, with similar results: just a few more mails went through and the process hung again. During this phase, we never managed to send mail to all their recipients. After a few mails, the system would simply do nothing and the client would eventually time out.
On the face of it, all looked well in the client’s environment, since the other users/programs in the client’s environment were going about their business with no issues.
Without assuming anything, we performed the test from our office to eliminate any issues on the server side. Once this succeeded, we re-did the test from our environment with the client’s data, and that too went through successfully. Everything now pointed to the client’s environment!
There was obviously some firewall policy, proxy, or transparent firewall in the network that was disrupting the test over the given Internet link. At our request, once the firewall policies were bypassed for connections to our servers, the test went through successfully.
This shows two things. First, network administrators and firewalls often interfere with network connections in complicated, difficult-to-debug ways. Second, the job of determining the root cause of the problem always falls upon the product vendor.
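The narrowing-down in this story (run the same send test from different networks and see where it stalls) can be sketched in a few lines of Python. This is only an illustration, not Mithi’s actual tooling: the function names, the one-connection-per-message approach, and the 30 second timeout are all assumptions made for the example.

```python
import smtplib
import time
from email.message import EmailMessage


def make_smtp_sender(host, port, sender, timeout=30.0):
    """Return a send_one(rcpt) callable that delivers one small test mail.

    Hypothetical helper: opens a fresh SMTP connection per message so a
    stall shows up as a timeout on one specific message.
    """
    def send_one(rcpt):
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = rcpt
        msg["Subject"] = "delivery probe"
        msg.set_content("test message")
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            smtp.send_message(msg)
    return send_one


def run_batch(send_one, recipients):
    """Send to each recipient in turn; stop at the first failure.

    Returns a list of (recipient, elapsed_seconds, error_or_None).
    """
    results = []
    for rcpt in recipients:
        start = time.monotonic()
        try:
            send_one(rcpt)
            error = None
        except Exception as exc:  # timeouts, resets, SMTP errors
            error = f"{type(exc).__name__}: {exc}"
        results.append((rcpt, time.monotonic() - start, error))
        if error is not None:
            break
    return results


def first_failure_index(results):
    """Index of the first failed message, or None if all succeeded."""
    for i, (_, _, error) in enumerate(results):
        if error is not None:
            return i
    return None
```

Running the same batch from the vendor’s office and from the customer’s network, and comparing where (or whether) the first failure occurs, gives a concrete record to show the customer instead of a hung progress bar.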
More Examples of Real Life Network Problems
If you think this is an isolated problem, think again. Sunil goes on to point out a bunch of other cases where similar ghost bugs bothered them:
Several times, our help desk receives tickets for such “intangible” problems in the network which are difficult to troubleshoot since there is some element in the network which is interfering with the normal flow. Clients find it difficult to accept these kinds of issues since on the face of it all seems to be well. Some real life examples of such issues we face:
- At one of our customer sites, the address book on the clients’ machines suddenly stopped working. Clients connect to the address book over the LDAP port 389. We found that although a telnet to the LDAP port worked fine from a random set of clients, the address book still could not access the server over port 389. It turned out to be a transparent firewall which had a rate control.
- Several of our customers complain of duplicate mail. This typically happens when MS Outlook, as a client, sends a mail but retains it in the Outbox when it doesn’t receive a proper acknowledgement from the server. It then resends the mail and may do so repeatedly until its transaction completes successfully. On the face of it, it appears to be a server issue, while actually it’s a network quality issue. Difficult to prove. I’ve personally spent hours on the phone trying to convince customers to clean up their networks. One of our customers, after a lot of convincing, did some hygiene work on their network and the problem “magically” vanished.
- One of our customers complained that their remote outgoing mail queue was rising rapidly. We found that the capacity of the Internet link (provided by the ISP) to relay mail had suddenly dropped. So mails were going, but very slowly, and hence the queues were rising. Apparently there had been no change in the network which could explain this. After some analysis, we were quite convinced that the ISP had probably introduced an SMTP proxy in the network, which had some rate control or tar pit policies. The ISP refused to acknowledge this. To prove our hypothesis, we routed the mail from our hosted servers over a different port (not port 25, which is the default for SMTP). As soon as we did this, the mail flow became normal, even though we were sending through the same Internet link. As of the time of this writing, the ISP has yet to acknowledge that there is an impediment in the path for port 25.
These and several more incidents show that problems in the network environment are challenging to troubleshoot and accept.
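The LDAP and port 25 examples above share a pattern: a single manual check (one telnet) looks fine, while repeated or port-specific traffic is throttled. A rough way to gather evidence is to open many connections to the same port and compare the outcome with another port on the same link. The sketch below is an assumption-laden illustration: the function names and default values are invented for this example, and it only measures TCP connection setup, so a proxy that slows down the SMTP conversation itself would not show up here.

```python
import socket
import time


def probe_port(host, port, attempts=20, timeout=5.0, pause=0.0):
    """Open `attempts` TCP connections in a row to host:port.

    Returns a list of (succeeded, elapsed_seconds) samples. A rate
    control in the path may let the first few through and then start
    refusing or delaying the rest.
    """
    samples = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                ok = True
        except OSError:  # refused, reset, unreachable, or timed out
            ok = False
        samples.append((ok, time.monotonic() - start))
        if pause:
            time.sleep(pause)
    return samples


def summarize(samples):
    """Condense probe samples into counts and the slowest success."""
    successes = [elapsed for ok, elapsed in samples if ok]
    return {
        "attempts": len(samples),
        "succeeded": len(successes),
        "failed": len(samples) - len(successes),
        "slowest_success": max(successes) if successes else None,
    }
```

For instance, comparing `summarize(probe_port(server, 25))` with the same call on the alternate port (where `server` is a placeholder for the mail host) mirrors the experiment Sunil describes, and the numbers can be handed to the ISP or the customer’s network team.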
So What Next?
In other words, to be able to keep customers happy, software developers need to have a very good and detailed understanding of the various IT infrastructure environments in which their product is likely to be deployed, and be able to come up with inventive strategies by which to isolate which part of the infrastructure is actually causing the problem.
I’ve no expertise in network security nor in how SMTP servers work, but these aren’t bugs as far as my development experience is concerned, as they are not *programming errors*. I’ll call them infrastructure/network issues/problems.
Secondly, they might require some network health probing system in order to assess whether a given client infrastructure has all the prerequisites required by their email server (correct me if Mithi has different stuff).
Sagar, the point that this article is trying to make is that no matter where the error lies, if the customer is not getting what he wants, then it is a bug in your product, and he will expect you to fix it. In fact, the main point of the article is that developers should be aware of the environment and misconfigurations and not just programming errors.
As for your second point – everyone has network health probing, but if you read the scenarios in detail, you’ll realize that the obstacles were much more subtle and only got triggered under very specific conditions – and the combinations are infinite. It would be difficult, almost impossible, to build a health probing system that could catch all those errors.
This is a really interesting post. The IT infrastructure thought will definitely be in my bag of tools the next time I deal with such a bug.
Thanx for sharing 🙂