
Invisible Bugs or Why Every Developer Must Understand Details of IT Infrastructure

(This article is adapted from a very interesting post written by Sunil Uttamchandani, Co-founder and Director of Services at Mithi Software, a Pune-based Software Products company specializing in software for email, collaboration and other enterprise products. The article first appeared on the Mithi Blog and is adapted & reproduced here for the benefit of PuneTech readers with permission.)

Most of the education of a software developer is centered on programming: keeping code clean, maintainable, debuggable and well-tested, ensuring that customers don’t run into bugs, and making sure that if they do, the bugs are easy to find. However, in real life, one of the most difficult categories of bugs to find is the “invisible” bug. The first thing you notice about such a bug is that a customer complains about it, but you are unable to reproduce it in your environment. Now, if there is one thing you cannot convince a customer of, it is that the bug is caused by some misconfiguration of the infrastructure in the customer’s environment. As far as the customer is concerned, all bugs are bugs in your product, irrespective of what actually caused them.

On the Mithi Software blog, Sunil Uttamchandani talks about how their products (email servers and other enterprise collaboration software) often run into “Intangible/Invisible Network Obstacles” when dealing with customer bugs.

Here he describes a recent experience.

A Ghost In the Network

Recently, during a POS (proof on site) exercise with a prospective customer, we had to perform a test in which an email client would send mail to a large number of recipients through our cloud email setup, and capture the performance results. As is our regular practice, we set up the SMTP controls on our server to allow this test, ran the test from our environment, and then asked the client to repeat the same test in their environment.

The test failed in the client environment.

We enabled the SMTP scanning engines for their source IP to capture detailed information (which would naturally slow down the mail flow), and we found that the client could deliver a few mails, but would give up after a little while. It would simply show the progress bar, but would not move ahead. The logs on our server showed that there were no more connections coming from that client. As a first troubleshooting step, we removed the scanning controls and simplified the SMTP rules in our product so that no checks were made for their source IP address, to speed things up. We did another round of testing, with similar results: just a few more mails went through and the process hung again. During this phase, we could not once send mail to all their recipients. After a few mails, the system would simply do nothing and the client would eventually time out.

On the face of it, all looked well in the client’s environment, since the other users/programs in the client’s environment were going about their business with no issues.

Without assuming anything, we performed the test from our office to eliminate any issues on the server side. Once this succeeded, we repeated the test from our environment with the client’s data, and that too went through successfully. All pointers now led to the client’s environment!

There obviously was some firewall policy, proxy, or transparent firewall in the network that was blocking the test over the given Internet link. At our request, the firewall policies were bypassed for connections to our servers, and the test then went through successfully.

This shows two things. First, network administrators and firewalls often interfere with network connections in complicated and difficult-to-debug ways. Second, the job of determining the root cause of the problem always falls upon the product vendor.

More Examples of Real Life Network Problems

If you think this is an isolated problem, think again. Sunil goes on to point out a bunch of other cases where similar ghost bugs bothered them:

Several times, our help desk has received tickets for such “intangible” problems, which are difficult to troubleshoot because some element in the network is interfering with the normal flow. Clients find it difficult to accept these kinds of issues, since on the face of it all seems to be well. Some real life examples of such issues we face:

  • At one of our customer sites, the address book on the clients’ machines suddenly stopped working. Clients connect to the address book over the LDAP port 389. We found that although a telnet to the LDAP port worked fine from a random set of clients, the address book was still not able to access the server over port 389. It turned out to be a transparent firewall which had a rate control.
  • Several of our customers complain of duplicate mail. This typically happens when MS Outlook, as a client, sends a mail but retains it in the Outbox because it does not receive a proper acknowledgement from the server. It then resends the mail, and may do so repeatedly until the transaction completes successfully. On the face of it, this appears to be a server issue, while it is actually a network quality issue. This is difficult to prove. I’ve personally spent hours on the phone trying to convince customers to clean up their networks. One of our customers, after a lot of convincing, did some hygiene work on their network and the problem “magically” vanished.
  • One of our customers complained that their remote outgoing mail queue was rising rapidly. We found that the capacity of the Internet link (provided by the ISP) to relay mail had suddenly dropped. Mails were going out, but very slowly, and hence the queues were rising. Apparently there had been no change in the network which could explain this. After some analysis, we were quite convinced that the ISP had probably introduced an SMTP proxy in the network, which had some rate control or tar pit policies. The ISP refused to acknowledge this. To prove our hypothesis, we routed the mail from our hosted servers over a different port (not port 25, which is the default for SMTP). As soon as we did this, the mail flow became normal, even though we were sending through the same Internet link. As of the time of this writing, the ISP is yet to acknowledge that there is an impediment in the path for port 25.
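
A pattern common to these incidents is that a single connection works (the telnet check passes), but repeated connections are throttled or dropped. A minimal Python sketch of a probe for this is given here. It is not Mithi's tooling; the function names, the number of attempts and the timeout are all assumptions made for illustration. It opens a series of TCP connections to one port, times each one, and reports the attempt at which connections first fail.

```python
import socket
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProbeResult:
    attempt: int            # 1-based attempt number
    ok: bool                # True if the TCP connection was established
    seconds: float          # time taken to connect (or to fail)
    error: Optional[str] = None


def probe_port(host: str, port: int, attempts: int = 20,
               timeout: float = 5.0, pause: float = 0.0) -> List[ProbeResult]:
    """Open `attempts` TCP connections one after another and time each one.

    A rate-limiting device in the path tends to show up as early attempts
    succeeding and later ones timing out or slowing down sharply.
    """
    results = []
    for i in range(1, attempts + 1):
        start = time.monotonic()
        try:
            # Connect and close immediately; only the handshake is measured.
            with socket.create_connection((host, port), timeout=timeout):
                pass
            results.append(ProbeResult(i, True, time.monotonic() - start))
        except OSError as exc:  # covers refusals, resets and timeouts
            results.append(
                ProbeResult(i, False, time.monotonic() - start, str(exc)))
        time.sleep(pause)
    return results


def first_failure(results: List[ProbeResult]) -> Optional[int]:
    """Return the attempt number of the first failed connection, if any."""
    for r in results:
        if not r.ok:
            return r.attempt
    return None
```

Running `probe_port` from the customer's network against the same server on port 25 and on an alternative port, and then again from a network known to be clean, gives numbers that can be shown to a network administrator or ISP. Note that this only exercises the TCP handshake. A proxy that interferes at the SMTP or LDAP protocol level would need a probe that actually speaks that protocol.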

These and several more incidents show that problems in the network environment are challenging to troubleshoot, and hard for customers to accept.

So What Next?

In other words, to be able to keep customers happy, software developers need to have a very good and detailed understanding of the various IT infrastructure environments in which their product is likely to be deployed, and be able to come up with inventive strategies by which to isolate which part of the infrastructure is actually causing the problem.

Business Continuity Management Lifecycle and Key Contractual Requirements

(This overview of Business Continuity Management is a guest post by Dipali Inamdar, Head of IT Security in Geometric)

In emergency situations like pandemic outbreaks, power failures, riots, strikes and infrastructure issues, it is important that your business does not stop functioning. A plan to ensure this is called a Business Continuity Plan (BCP), and it is of prime importance to your business in ensuring minimum disruption and smooth functioning of your operations. Earlier, most companies would document business continuity plans only if their clients asked for them, and would focus mainly on IT recovery. But the scenario has changed now. Corporations of all sizes have realized the importance of keeping their business functioning at all times, and hence they are working towards a well-defined business continuity management framework. Business continuity (BC) is often understood as a process to handle events that could disrupt business. However, BC is more than just recovery. The plan should also ensure proper business resumption after recovering from the disruption.

Business continuity management is a continuous life cycle as follows:

[Figure: Business Continuity Planning Lifecycle]

How does one start with BCM?

Business Impact Analysis (understanding the organization)

The first step is to conduct a Business Impact Analysis. This helps you identify critical business systems and processes, and how their outage (downtime) could affect your business. You cannot put plans in place for all processes without considering the financial investment needed to do so. The CEO’s inputs and client BC requirements also serve as inputs to the impact analysis.

Defining the plan (Determining BCM strategy)

The next step is to identify the situations that could lead to disruption of the identified critical processes.

The situations could be categorized as:

  • Natural and environmental: earthquakes, floods, hurricanes, etc.
  • Human related: strikes, terrorist attacks, pandemic situations, thefts, etc.
  • IT related: critical systems failure, virus attacks, etc.
  • Others: business competition, power failure, client BC contractual requirements

It might not be feasible to have plans for each and every situation, as implementing the defined plans needs to be practically possible. After the situations have been identified, one needs to identify the different threats, their severity (how serious the impact on the business will be if the threat materializes) and their probability of occurrence (how likely the threat is to materialize). Based on threat severity and probability levels, the critical risks are identified.
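
One common way to combine severity and probability is a simple score, sketched below in Python. The 1 to 5 scales, the multiplication, the threshold of 12 and the example threats with their ratings are all assumptions made for illustration; they are not taken from any standard or from the author's own method, and each organization would choose its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Threat:
    name: str
    severity: int      # assumed scale: 1 (minor) to 5 (business stops)
    probability: int   # assumed scale: 1 (rare) to 5 (almost certain)

    @property
    def score(self) -> int:
        # Risk score as severity times probability (an assumed convention).
        return self.severity * self.probability


def critical_risks(threats: List[Threat], threshold: int = 12) -> List[Threat]:
    """Return the threats whose score meets the threshold, highest first."""
    picked = [t for t in threats if t.score >= threshold]
    return sorted(picked, key=lambda t: t.score, reverse=True)


# Made-up ratings, purely to show the mechanics.
threats = [
    Threat("Power failure", severity=4, probability=4),
    Threat("Pandemic", severity=5, probability=2),
    Threat("Earthquake", severity=5, probability=1),
    Threat("Virus attack", severity=3, probability=4),
]

for t in critical_risks(threats):
    print(f"{t.name}: {t.score}")
```

With these made-up ratings, only the power failure and the virus attack cross the threshold, so those are the risks that would get detailed plans first. The ratings themselves are a judgment call and should be revisited whenever the plan is reviewed.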

Implementing the plan (Developing and implementing BCP response)

The identified risks and any additional client-specific BCP requirements serve as inputs to the creation of BCPs. A BCP should focus on mitigation plans for the identified risks. It should be comprehensive, detailing the roles and responsibilities of all the response teams. A proper budget needs to be allocated. Once the plan is documented, it should be implemented.

Implementation measures under a BCP could include having redundant infrastructure, signing Service Level Agreements (SLAs) with service providers, having a backup power supply, sending backup tapes to offshore sites, cross-training people, and stocking proper medicines or masks to address pandemic situations.

The BCP should also have proper plans in place to resume business as usual. Business resumption is a critical aspect of the business continuity framework.

Testing and improving plan (Exercising, maintaining and reviewing)

Once the plans are documented and implemented, they should be tested regularly. The tests could be scheduled, or carried out as and when the need arises. One can simulate different scenarios, such as moving people to other locations, taking the primary infrastructure down, testing UPS and diesel generator capacity, calling tree tests, evacuation drills, having backups for senior management to take decisions, transport arrangements, etc.

The tests will help you identify areas of the BCP which need improvement. The expected and actual results need to be compared to find the gaps. The test results need to be published to senior management. The plan needs to be reviewed regularly to add the latest threats and mitigations for the critical ones, which makes this a continuous lifecycle. One can schedule internal audits or apply for BS25999 certification to ensure proper compliance with BCP requirements.

Pune faces threats such as irregular power supply and pandemic outbreaks, which could lead to business disruptions. One needs to have detailed plans for critical threats to ensure continuity of critical operations. The plans should also have detailed procedures to ensure proper business resumption. Plans may be documented, but what matters most is the actual action taken during an emergency.

Important note: Contractual requirements

When signing off specific contractual requirements with clients, certain precautions must be taken as follows:

  • Before signing stringent SLAs, check that there is a provision for exclusions or relaxations during disaster situations, as you will not be able to meet the SLAs during disaster scenarios.
  • When BCP requirements are defined in client contracts, the responsibilities of, and expectations from, the client should also be clearly documented and agreed upon to ensure effective execution of the BCP.
  • BCP requirements can only be implemented effectively when proper budget allocations are planned. So for specific BCP requirements, cost negotiations with the client are important. Usually this is ignored, so it is important that the sales team be apprised before agreeing on BCP requirements with the client.
  • Do not sign off on vague BCP requirements. They should be clear, specific and practically achievable.
  • Before signing any contract which has a penalty clause, review it thoroughly to ensure that compliance with those clauses is practically possible.

About the author: Dipali Inamdar

Dipali Inamdar, Head of IT Security at Geometric Ltd, has more than 11 years of experience in the Information Technology and Information Security domains. She is a certified CISA, ISO27001 Lead Auditor, BS25999 Lead Auditor and ISO2000 Internal Auditor. She has worked on Information Security and Business Continuity Management in sectors spanning BPO, IT and ITES companies, and finance. She currently operates out of Pune and is very passionate about her field. See her LinkedIn profile for more details.
