Registration and Fees: The event is free for all. Register here.
Details: PLUG meeting
The PLUG meeting is open to all; there are no charges or prerequisites to attend. If you are interested in FOSS (Free/Open Source Software), you are welcome at the meeting. If you want to give a talk or a demo, you are welcome to do so.
Details: Groovy – Grails Discussion
The Groovy language and the Grails framework have slowly but surely grown in prominence. Grails uses the best ideas from the Ruby on Rails world while continuing to leverage the tried, tested and trusted Java platform, as well as established frameworks like Spring and Hibernate. However, there has hardly been any community behind Groovy and Grails in India. The Java meet for this month is an attempt to facilitate discussion amongst Groovy and Grails enthusiasts in Pune and India.
The Groovy – Grails meet will commence with a Grails introduction and demo by Harshad Oak. This will be followed by a discussion about Groovy and Grails, their current state in India, and their future prospects.
About the Speaker – Harshad Oak
Harshad is the founder of Rightrix Solutions and editor of IndicThreads.com. He is the author of 3 books on Java technology and several articles. For his contributions to technology and the developer community, he has been recognized as an Oracle ACE Director and Sun Java Champion.
To initiate discussion prior to the meet or continue it after the meet, join these groups.
Where: India International Multiversity, Sakal Nagar, Baner Road,
Registration and Fees: This event is free for everyone. There is no need to register.
Details:
This is a lecture series organized by Dr. Jha on “Knowledge Representation” using the Nyaya and Navya-Nyaya techniques. Navya-Nyaya developed a sophisticated language and conceptual scheme that allowed it to raise, analyse, and solve problems in logic and epistemology. The lecture series will be an introductory course which will cover the basics and also look at the design principles of Sanskrit, the inference schemas used in Nyaya, etc.
The idea here is to get a fresh perspective on Knowledge Representation and to look at how these techniques could be applied to today’s IT problems, ranging from better modeling in databases to better common-sense representation systems.
About the Speaker – Prof. V.N. Jha
Prof. V. N. Jha is a specialist in various branches of Sanskrit learning and Navya Nyaya. He has long worked to promote Sanskrit studies through multi-disciplinary approaches, in order to make such studies relevant to contemporary knowledge domains. He has been a visiting professor and lecturer in several countries, has contributed over 40 books and over 100 articles, and has supervised more than 25 PhD students.
Anand Lunia, Executive Director & CFO, Seedfund is visiting Pune this week, and is interested in meeting Pune entrepreneurs in one-on-one meetings. If you fit that description, you can send him an email at anand AT NOSPAM seedfund dot in.
Seedfund believes in investing small amounts (1 crore to 5 crore) in early stage startups and they are mainly interested in the following sectors: internet or media or mobile or telecom or retail or consumer-facing plays, and also for things that aren’t yet “sectors”.
This message brought to you by the Pune OpenCoffee Club, which you should join if you are a Pune entrepreneur, or you are interested in the Pune startup ecosystem.
Rohit Ghatol of the Pune GTUG (Google Technologies User Group) recently conducted a well-received tutorial on the Google Web Toolkit. He is now trying to gauge whether there is enough interest to conduct another tutorial, this time on the Google Gadgets and Google OpenSocial development platforms. If you are interested, please let him know by filling out this survey.
Virtualization is fast emerging as a game-changing technology in the enterprise computing space. What was once viewed as a technology useful for testing and development is going mainstream and is affecting the entire data-center ecosystem. This article on the important use-cases of server virtualization by Milind Borate, is the second in PuneTech’s series of articles on virtualization. The first article gave an overview of server virtualization. Future articles will deal with the issue of management of virtual machines, and other types of virtualization.
Is server virtualization a new concept? It isn’t, because traditional operating systems do just that. An operating system provides a virtual view of a machine to the processes running on it. Resources are virtualized.
Each process gets a virtual address space.
A process’ access privileges control what files it can access. That is storage virtualization.
The scheduler virtualizes the CPU so that multiple processes can run without conflicting with each other.
Network is virtualized by providing multiple streams (for example, TCP) over the same physical link.
Storage and network are weakly virtualized in traditional operating systems because some global state is shared by all processes. For example, the same IP address is shared by all processes. In the case of storage, the same namespace is used by all processes. Over time, some files/directories become de-facto standards. For example, all processes look at the same /etc/passwd file.
Today, the term “server virtualization” means running multiple OSs on one physical machine. Isn’t that just adding one more level of virtualization? An additional level generally means added costs, lower performance, higher maintenance. Why then is everybody so excited about it? What is it that server virtualization provides in addition to traditional OS offerings? An oversimplified version of the question is: If I can run two processes on one OS, why should I run two OSs with one process each? This document enumerates the drivers for running multiple operating systems on one physical machine, presenting a use case, evaluating the virtualization based solution, suggesting alternates where appropriate and discussing future trends.
Use case: I have two different applications. One runs on Windows and the other runs on Linux. The applications are not resource intensive and a dedicated server is under-utilized.
Analysis: This is a weak argument in an enterprise environment because enterprises want to standardize on one OS and one OS version. Even if you find Windows and Linux machines in the same department, the administrators are two different people. I wonder if they would be willing to share a machine. On the other hand, you might find applications that require conflicting versions of, say, some library, especially on Linux.
Alternative solution: Wine allows you to run Windows applications on Linux. Cygwin allows you to run Linux applications on Windows. Unfortunately, it’s not the same as running the application directly on the required OS. I won’t bet that a third party application would run out of the box under these virtual environments.
Future: Some day, developers will get fed up with writing applications for a particular OS and then porting them to others. Java provides us with a host/OS-independent virtual environment. Java wants programmers to write code that is not targeted at a particular OS. It succeeded in some areas, but there is still a lot of software written for a particular OS. Why did everybody not move to Java? I guess because Java does not let me do everything that I can do using OS APIs. In a way, that’s Java’s failure in providing a generic virtual environment. In the future, we will see more and more software developed over OS-independent APIs. Databases would be the next target for establishing generic APIs.
Use case: I have two different applications. If I install both on the same machine, both fail to work. In fact, they might actually work together, but the combination is not supported by my vendor.
Analysis: In the current state of affairs, an OS is not just hardware virtualization. The gamut of libraries, configuration files, daemons is all tied up with an OS. Even though an application does not depend on the exact kernel version, it very much depends on the library versions. It’s also possible that the applications make conflicting changes to some configuration file.
Alternative solution: OpenVZ modifies Linux to provide multiple “containers” inside the same OS. The machine runs a single kernel but provides multiple isolated environments. Each isolated environment can run an application that would be oblivious to the other containers.
Future: I think operating systems need to support containers by default. The process-level isolation provided at the memory and CPU level needs to be extended to storage and network as well. On the other hand, I also hope that application writers desist from depending on shared configuration and shared libraries, and pay some attention to backward compatibility.
Use case: In case an application or the operating system running the application faults, I want my other applications to run unaffected.
Analysis: A faulty application can bring down the entire server, especially if the application runs in a privileged mode and can be attacked over a network. A kernel driver bug or operating system bug can also bring down a server. Operating systems are getting more stable, though, and servers going down due to operating system bugs are rare nowadays.
Alternative solution: Containers can help here too. Containers provide better isolation amongst applications running on the same OS. But bugs in kernel-mode components cannot be addressed by containers.
Future: In the near future, we are likely to see micro-kernel-like architectures built around virtual machine monitors. Lightweight operating systems could be developed to work only with virtual machine monitors. Such a solution will provide fault isolation without incurring the overheads of a full operating system running inside a virtual machine.
Use case: I want to build a datacenter with utility/on-demand/SLA-based computing in mind. To achieve that, I want to be able to live-migrate an application to a different machine. I can run the application in a virtual machine and live-migrate the virtual machine.
Analysis: The requirement is to migrate an application. But, migrating a process is not supported by existing operating systems. Also, the application might do some global configuration changes that need to be available on the migration target.
Alternative solution: OpenVZ modifies Linux to provide multiple “containers” inside the same OS. OpenVZ also supports live migration of a container.
Future: As discussed earlier, operating systems need to support containers by default.
Use case: My operating system does not support the cutting edge hardware I bought today.
Analysis: Here again, I’m not really bothered about the operating system; it’s just that my applications run only on this OS. Also, enterprises like to use the same OS version throughout the organization. If an enterprise sticks to an old OS version, that version does not work with the new hardware it wants to buy. If an enterprise is willing to move to a newer OS, the new version does not work with the existing old hardware.
But the real issue here is the lack of standardization across hardware and driver development models. I fail to understand why every wireless LAN card needs a different driver. Can hardware vendors not standardize the I/O ports and commands so that one generic driver works for all cards? On top of that, every OS, and even every OS version, has a different driver development model. That means every piece of hardware requires a different driver for each OS version.
Alternative solution: I cannot think of a good alternative solution. One specific issue, the unavailability of wireless LAN card drivers for Linux, is addressed by NdisWrapper, which allows us to access a wireless card on Linux by loading a Windows driver.
Future: We either need hardware-level standardization or the ability to run the same driver on all versions of all operating systems. It would be good to have wrappers, like NdisWrapper, for all types of drivers and all operating systems. A hardware driver should write to a generic API provided by the wrapper framework. The generic API should be implemented by the operating system vendors.
Use case: I want to manage hardware infrastructure for software development. Every developer and QA engineer needs dedicated machines. I can quickly provision a virtual machine when the need arises.
Analysis: Software under development fails more often than a released product. Software developers and QA engineers want an isolated test environment so that bugs can be correctly attributed to the right application. Also, software development environments require frequent reprovisioning, as the product under development needs to be tested under different operating systems.
Alternative solution: Containers would work for most software development. I think, the exception is kernel level development.
Future: Virtual machines found an instant market in software QA labs. Virtual machines will continue to flourish in this market.
Use case: I want to ship some data to another machine. Instead of setting up an identical application environment on the other machine to access the data, I want to ship the entire machine image itself. Shipping a physical machine image does not work because of hardware-level differences. Hence, I want to ship a virtual machine image.
Analysis: This is another hard reality of life. Data formats are not compatible across multiple versions of a software product. Portable data formats are used for human-readable documents. File-system data formats are also stable to a large extent, and you can mount a FAT file-system or an ISO 9660 file-system on virtually any version of any operating system. The same level of compatibility has not been established for other structured data, and I don’t see that happening in the near future. Even if this hurdle is crossed, you need to worry about correctly shipping all the application configuration, which itself could be different for the same software running on different OSs.
Alternative solution: OpenVZ container could be a light-weight alternative to a complete virtual machine.
Future: The future seems inclined towards “computing in a cloud”. Network bandwidth is increasing, and so is the trend towards outsourced hosting. Mail and web services have been outsourced for a long time. Oracle On Demand allows us to outsource database hosting. Google (Writely) wants us to outsource document hosting. Amazon allows us to outsource both storage and computation. In the future, we will be completely oblivious to the location of our data and applications. The only process running on your laptop would be an improved web browser. In that world, only the system software engineers who build these datacenters would be worried about hardware and operating system compatibilities. But they too will not be overly bothered, because data-center consolidation will reduce the diversity in hardware and OS.
Use case: I want to replace desktop PCs with thin clients. A central server will run a VM for each thin client. The thin client will act as a display server.
Analysis: Thin clients could bring down the maintenance costs substantially. Thin client hardware is more resilient than a desktop PC. Also, it’s easier to maintain the software installed on a central server than managing several PCs. But, it’s not required to run a full virtual machine for each thin client. It’s sufficient to allow users to run the required applications from a central server and make the central storage available.
Alternative solution: Unix operating systems are designed to be server operating systems, and thin X terminals are still prevalent in the Unix desktop market. Microsoft Windows, the most prevalent desktop OS, is designed as a desktop OS, but Microsoft has also added substantial support for server-based computing. Microsoft’s Terminal Services allows multiple users to connect to a Windows server and launch applications from a thin client. Several commercial thin clients can work with Microsoft Terminal Services or similar services provided by other vendors.
Future: Before the world moves to computing in a global cloud, an intermediate step would be enterprise-wide desktop application servers. Thin clients would become prevalent due to reduced maintenance costs. I hope to see Microsoft come up with better licensing for server-based computing. On Unix, floating application licenses are the norm: with a floating application license, a server (or a cluster of servers) can run only a fixed number of application instances as per the license, and it does not matter which user or thin client launches the application. Such floating licensing from Microsoft would help.
Server virtualization is a “heavy” solution for the problems it addresses today. These problems could be addressed by operating systems in a more efficient manner with the following modifications:
Support for containers.
Support for live migration of containers.
Decoupling of hardware virtualization and other OS functionalities.
If existing operating systems muster enough courage to deliver these modifications, server virtualization will have a tough time. However, it’s unrealistic to expect complete overhauls of existing operating systems: it’s possible to implement containers as a part of the OS, but decoupling hardware virtualization from the OS is a hard job. Instead, we are likely to see new lightweight operating systems designed to run only in server virtualization environments. Such a lightweight operating system will have the following characteristics:
It will do away with functionality already implemented in virtual machine monitor.
It will not worry about hardware virtualization.
It might be a single user operating system.
It might expect all processes to be co-operative.
It will have a minimal kernel mode component. It will be mostly composed of user mode libraries providing OS APIs.
Existing virtual machine monitors would also take up more responsibility in order to support lightweight operating systems:
Hardware support: The hardware supported by a VMM will be of primary importance. The OS only needs to support the virtual hardware made visible by VMM.
Complex resource allocation and tracking: I should get a finer control over resources allocated to virtual machines and be able to track resource usage. This involves CPU, memory, storage and network.
I hope to see a lightweight OS implementation targeted at server virtualization in the near future. It would be a good step towards modularizing operating systems.
Thanks to Dr. Basant Rajan and V. Ganesh for their valuable comments.
About the Author – Milind Borate
Milind Borate is the CTO and VP of Engineering at Druvaa, a Pune-based continuous data protection startup. He has over 13 years of experience in enterprise product development and delivery. He worked at Veritas Software as Technical Director for SAN-FS and served on the Veritas patent filter committee. Milind has filed over 15 patent applications (4 allotted) and co-authored “Undocumented Windows NT” in 1998. He holds a BE (CS) degree from the University of Pune and an MTech (CS) degree from IIT Bombay.
This article was written when Milind was at Coriolis, a startup he co-founded before Druvaa.
Have you ever wondered how much planning and co-ordination it takes to roll out Indicas smoothly off the Tata Motors assembly line? Consider this – A typical automobile consists of thousands of parts sourced from hundreds of suppliers, and a manufacturing and assembly process that consists of dozens of steps. All these different pieces need to tie-in in an extremely well synchronized manner to realize the end product.
How is this achieved? Well, like most complex business challenges, this too is addressed by a combination of efficient business processes and Information Technology. The specific discipline of software that addresses these types of problems is known as “Supply Chain Management” (SCM).
Pune has a strong manufacturing base and leads the nation in automotive and industrial sectors. Companies such as Tata Motors, Bajaj Auto, Kirloskar Oil Engines, Cummins, and Bharat Forge are headquartered in Pune. The manufacturing industry has complex production and materials management processes. This has resulted in a need for effective systems to help in decision making in these domains. The discipline that addresses these decision making processes is referred to as ‘Advanced Planning & Scheduling’ (acronym: APS). APS is an important part of SCM. This article briefly discusses some of the basic concepts of SCM/APS, their high-level technology requirements, and mentions some Pune based companies active in this area. Note, given Pune’s manufacturing background, it is no accident that it is also a leader in SCM related software development activities in India.
Introduction to SCM
Supply chain management (SCM) is the process of planning, implementing and controlling the operations of the supply chain as efficiently as possible. Supply Chain Management spans all movement and storage of raw materials, work-in-process inventory, and finished goods from point-of-origin to point-of-consumption. SCM software focuses on supporting the above decision-making business processes that cover demand management, distribution, logistics, manufacturing and procurement. APS specifically deals with the manufacturing processes. Note that SCM needs to be distinguished from ERP, which deals with automating business process workflows and transactions across the entire enterprise.
‘Decision Making’ is vital in SCM and leads to a core set of requirements for SCM software. Various decision making and optimization strategies are widely used. These include Linear Programming, Non-Linear Programming, Heuristics, Genetic Algorithms, Simulated Annealing, etc. These decision making algorithms are often implemented in C or C++. (In some cases, FORTRAN still continues to be leveraged for specific mathematical programming scenarios.) Some solutions use standard off-the-shelf optimization packages/solvers such as ILOG Linear Programming Solver as a component of the overall solution.
Consider a typical process/paint manufacturer such as Asian Paints. They make thousands of different end products that are supplied to hardware stores from hundreds of depots and warehouses, to meet the end consumer demand. The products are manufactured in various plants and then shipped to the warehouses in trucks and rail-cars. Each plant has various manufacturing constraints such as 1) a given batch mixer can only make certain types of paints, 2) to reduce mixer cleaning requirements, different color paints need to be produced in the order of lighter to darker shades. Now, to make it more interesting, there are many raw material constraints! Certain raw materials can only be procured with a long lead time. An alternative raw material might be available earlier, but it is very expensive! How do we decide? How many decisions are we talking about? And remember, these decisions have to be synchronized, since optimizing any one particular area in isolation can lead to extremely bad results for the other, and an overall sub-optimal solution. In optimization language, you can literally end up dealing with millions of variables in solving such a problem.
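To make this concrete, here is a deliberately tiny sketch of such an optimization model in Python, using the open-source SciPy solver rather than a commercial package like the ILOG solver mentioned above. The products, costs and capacities are invented for illustration and do not come from any vendor's actual solution.

```python
# Toy production-planning LP: minimize production cost subject to mixer-hour,
# raw-material and demand constraints. All numbers are made up.
from scipy.optimize import linprog

# Decision variables: litres of "light" and "dark" paint to produce this week.
c = [12.0, 15.0]                 # cost per litre of light and dark paint

# Capacity constraints (A_ub @ x <= b_ub):
#   mixer hours:  0.02*light + 0.03*dark <= 160 available hours
#   pigment:      0.50*light + 0.80*dark <= 5000 kg in stock
A_ub = [[0.02, 0.03],
        [0.50, 0.80]]
b_ub = [160.0, 5000.0]

# Demand must be met: at least 2000 litres of light, 1500 litres of dark.
bounds = [(2000, None), (1500, None)]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print(result.x, result.fun)      # optimal quantities and total cost
```

A real APS model has the same basic shape, except with thousands of products, plants, time periods and constraints instead of two variables, which is where the "millions of variables" come from.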
SCM software also has a fairly specific set of GUI requirements. A typical factory planner will deal with thousands of customer orders, machines, raw material parts and processing routings. Analyzing and acting on this information is often challenging. A rich, role-based user workflow for the planner is critical. GUIs are usually browser-based with custom applets wherever functionality richness is needed. In very specific cases, ‘thick’ desktop-based clients (typically developed in Java) are also required for some complex workflows. Alerts and problem-based navigation are commonly used to present large amounts of information in a prioritized, actionable format. Rich analytical OLAP-type capabilities are also required in many cases.
Integration is an important part of SCM software architecture. SCM software typically interacts with various Enterprise IT systems such as ERP, CRM, Data-Warehouses, and other legacy systems. Many inter-enterprise collaboration workflows also require secure integration with customer/partner IT systems via the internet. Both batch and real-time integration workflows are required. Real-time integration can be synchronous or asynchronous. Batch data can sometimes (e.g. in Retail SCM) run into terabytes and lead to batch uploads of millions of lines. Loading performance and error checking becomes very important.
Consider a computer manufacturer such as Dell. They are renowned for pioneering the rapid-turnaround configure-to-order business. Dell’s assembly plants source material from different suppliers. In order to get maximum supply chain efficiencies, they actively manage raw material inventory levels. Any excess inventory results in locked-in capital and a reduction in Return on Investment (ROI). In order to achieve effective raw material inventory management, Dell needs to share its production and material requirements data with its suppliers so that they can supply parts at the right time. To achieve this, there needs to be seamless real-time collaboration between the Dell procurement planner and the suppliers. Data is shared in a secure fashion via the internet, and rapid decisions, such as changing quantities or selecting alternate parts and suppliers, are made in real time.
SCM in Pune
Most of the large manufacturing companies in Pune leverage some kind of SCM software solution. These are typically sourced from SCM software industry leaders such as SAP, i2 and Oracle. In some cases, home-grown solutions are also seen.
Many small and mid-sized software product companies in Pune are focused on the SCM domain. Some offer comprehensive end-to-end solutions, while others focus on specific industry niche areas. Note that by their very nature, SCM processes are fairly complex and specifically tailored to individual companies. As a result, many SCM products are highly customizable and require varying degrees of onsite development, which makes software services an integral part of most of these SCM product companies.
Pune-based FDS Infotech has been developing an SCM and ERP software suite for over a decade. They have a wide customer base in India. A representative example of their solution can be seen at Bharat Forge, where their SCM/APS solution is being used to maximize the efficiency of the die shop. This is achieved through better schedule generation that considers all the requisite manpower, machine and raw-material constraints.
Entercoms, also based in Pune, is primarily focused on the Service Parts Management problem in SCM. Their customers include Forbes Marshall and Alfa-Laval.
SAS, a global leader in business intelligence and data-analytics software also develops various SCM solutions, with specific focus on the Retail segment. Their Retail solution focuses on a wide variety of problems such as deciding the right merchandizing strategies, planning the right assortments for the stores, forecasting the correct demand, etc. They boast a wide global customer base. Their Pune R&D center is involved in multiple products, including their Retail solution.
In addition to these three, many other small SCM software companies in Pune work on specific industry niches.
About the Author
Amit Paranjape is one of the driving forces behind PuneTech. He has been in the supply chain management area for over 12 years, most of it with i2 in Dallas, USA. He has extensive leadership experience across Product Management/Marketing, Strategy, Business Development, Solutions Development, Consulting and Outsourcing. He now lives in Pune and is an independent consultant providing consulting and advisory services for early stage software ventures. Amit’s interest in other fields is varied and vast, including General Knowledge Trivia, Medical Sciences, History & Geo-Politics, Economics & Financial Markets, Cricket.
About the author: Manas is interested in a variety of things like psychology, philosophy, sociology, photography, movie making etc. But since there are only 24 hours in a day and most of it goes in sleeping and earning a living, he amuses himself by writing software, reading a bit and sharing his thoughts.
Ultimately, we build systems so that they will be used at some point, and the systems we build will be used only if they are usable. If a system is not usable, how will anyone use it?
It is very common to come across doors that are supposed to be pushed open (there is even a sticker near the handle bar that says PUSH), yet most people pull them. It is equally common to see doors that are supposed to be pulled, but people end up pushing them. It’s not really a problem with the people; it’s a problem with the design of the door. It’s a usability issue. There are also doors that get pushed exactly when they are expected to be pushed. It’s all in the design, isn’t it?
When Gmail was launched, it was an instant hit. Even though it didn’t do as many things as other email services did at that point of time (like support for all browsers, drafts, etc.), it was still a revolution by itself. And the reason was simple – usability. What was the need for the GNOME project? Wasn’t it the usability of GNU/Linux for non-programmers?
A system is not an island. It is connected with various other entities by various means. And it shares one relationship or the other with those entities. A conceptual model effectively captures these various entities and what kind of relationship our system has with these external entities.
The model of Encyclopaedia Britannica is that a group of experts together create an encyclopedia of important topics which can then be read by people. The model of Wikipedia is that people (and that includes everyone in the world) can help in writing the encyclopedia that everyone can read. The model of Google Knol is that experts can write articles on specific subjects which everyone can read. The readers can suggest improvements that the original authors can incorporate.
The conceptual model must be made usable. The entire system will eventually be built on top of the conceptual model.
2. Interface usability
So, we have settled that a system maintains some relationships with some external entities. This relationship is exposed through interfaces and we need to think through the usability of those interfaces.
Gmail is an excellent example of interface usability. As I mentioned earlier, there were several email services when Gmail was introduced but its interface usability was far superior. This is an example of how the same conceptual model can be presented to the user with completely different interfaces.
Conceptual usability is a must to help the user understand the system: Wikipedia is an encyclopedia which can be read and modified by anyone. And interface usability is a must to help the user *do* something with the system: on each page of Wikipedia, everywhere your eye goes, you’ll find a link to edit the page or a section thereof. It almost invites you to modify the page.
Methodology and Evaluation
There are methodologies for designing usable systems. Usable systems are still created by people whose thought process naturally evaluates usability at every step; however, these methods can make the system designer better grounded in the real world, and more efficient too.
Much like a system cannot be declared as functional till it is tested for the same, usability of the system has to be tested with equal vigor. After all, if the system is not usable, who is going to use it (even if it is functional)? And there is only one way to test the usability of the system – by letting target users use it and that too without providing any guidance. Closely observing the users can be an eye opener many times.
And the system designer/developer cannot verify the system’s usability themselves. In the course of development, the developer has learnt too much about the system and knows exactly how it works. The user, however, does not have that much knowledge about the system and may not attempt to do things in the same way. If you don’t believe me, go and re-read the doors example.
And what else?
Ultimately, it is possible to educate people how the system works. But the willingness of people to be educated will depend on why they need to use the system and how often. So, keep it as the last resort.
And first thing is still the last thing. We need to create usable systems because nobody uses unusable systems. And very few systems are usable by accident. Most of the usable systems are developed with usability as a focus.
I am liveblogging the CSI Pune lecture on Business Intelligence and Data Warehousing. These are quick-n-dirty notes, so please forgive the uneven flow and typos. This page is being updated every few minutes.
There’s a large turnout – over 100 people here.
Business Intelligence is an area that covers a number of different technologies for gathering, storing, analyzing and providing access to data that will help a large company make better business decisions. Includes decision support systems (i.e. databases that run complex queries (as opposed to databases that run simple transactions)), online analytical processing (OLAP), statistical analysis, forecasting and data mining. This is a huge market, with major players like Microsoft, Cognos, IBM, SAS, Business Objects, SPSS in the fray.
What kind of decisions does this help you with? How to cut costs. Better understanding of customers (which ones are credit worthy? which one are at most risk of switching to a competitor’s product?) Better planning of flow of goods or information in the enterprise.
This is not easy because the amount of data is exploding. There’s too much data; humans can’t make sense of all of it.
To manage this kind of information you need a big storage platform and a systematic way of storing all the information and being able to analyze the data (with the aforementioned complex queries). Collect together data from different sources in the enterprise. Pull from various production servers and stick it into an offline, big, fat database. This is called a data warehouse.
The data needs to be cleaned up quite a lot before it is usable. Inconsistencies between data from different data sources. Duplicates. Mis-matches. If you are combining all the data into one big database, it needs to be consistent and without duplicates. Then you start analyzing the data. Either with a human doing various reports and queries (OLAP), or the computer automatically finding interesting patterns (data mining).
Business Intelligence is an application that sits on top of the Data Warehouse.
Lots of difficult problems to be solved.
Many different data sources: flat files, CSVs, legacy systems, transactional databases. Need to pick updates from all these sources on a regular basis. How to do this incrementally and efficiently? How often – daily, weekly, monthly? Parallelized loading for speed. How to do this without slowing down the production system. Might have to do this during a small window at night. So now you have to ensure that the loading finishes in the given time window.
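As a rough sketch of the incremental-loading idea, here is a minimal Python example that pulls only the rows changed since the last run, using a "last updated" watermark. The table and column names are hypothetical, and real ETL tools layer parallel loading, error checking and bulk inserts on top of this.

```python
# Incremental ("delta") extraction using a watermark, sketched with sqlite3.
# Assumes each source table has an updated_at column; names are hypothetical.
import sqlite3

def incremental_load(source_conn, warehouse_conn, table, watermark):
    """Copy only rows changed since the last load; return the new watermark."""
    rows = source_conn.execute(
        f"SELECT id, amount, updated_at FROM {table} WHERE updated_at > ?",
        (watermark,),
    ).fetchall()

    # Upsert into a staging table so that re-running a failed load is harmless.
    warehouse_conn.executemany(
        f"INSERT OR REPLACE INTO stg_{table} (id, amount, updated_at) "
        "VALUES (?, ?, ?)",
        rows,
    )
    warehouse_conn.commit()

    # The largest updated_at we saw becomes the watermark for the next run.
    return max((r[2] for r in rows), default=watermark)
```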
This is the first lecture of a 6-lecture series. Next lectures will be Business Applications of BI. This will give an idea of which industries benefit from BI – specific examples: e.g. banking for assessing credit risk, fraud, etc. Then Data Management for BI. Various issues in handling large volumes of data; data quality, transformation and loading. These are huge issues, and need to be handled very carefully, to ensure that the performance remains acceptable in spite of the huge volumes. Next lecture is technology trends in BI. Where is this technology going in the future. Then one lecture on role of statistical techniques in BI. You’ll need a bit of a statistical background to appreciate this lecture. And final session on careers in BI. For detailed schedule and other info of this series, see the Pune Tech events calendar, which is the most comprehensive source of tech events info for Pune.
SAS R&D India works on Business Applications of BI (5 specific verticals like banking), on Data management, on some of the solutions. A little of the analytics – forecasting. Not working on core analytics – that is only at HQ.
We are trying to get the slides used in this talk from the speaker. Hopefully in a few days. Please check back by Monday.
And this is not confined to national boundaries. It is one of only two (as far as I know) Pune-based companies to be featured in TechCrunch (actually TechCrunchIT), one of the most influential tech blogs in the world (the other Pune company featured in TechCrunch is Pubmatic).
Why all this attention for Druvaa? Other than the fact that it has a very strong team that is executing quite well, I think two things stand out:
It is one of the few Indian product startups that are targeting the enterprise market. This is a very difficult market to break into, both because of the risk-averse nature of the customers and because of the very long sales cycles.
Unlike many other startups (especially consumer oriented web-2.0 startups), Druvaa’s products require some seriously difficult technology.
The rest of this article talks about their technology.
Druvaa has two main products. Druvaa inSync allows enterprise desktop and laptop PCs to be backed up to a central server with over 90% savings in bandwidth and disk storage utilization. Druvaa Replicator allows replication of data from a production server to a secondary server near-synchronously and non-disruptively.
We now dig deeper into each of these products to give you a feel for the complex technology that goes into them. If you are not really interested in the technology, skip to the end of the article and come back tomorrow when we’ll be back to talking about google keyword searches and web-2.0 and other such things.
Druvaa Replicator
Overall schematic set-up for Druvaa Replicator
This is Druvaa’s first product, and is a good example of how something that seems simple to you and me can become insanely complicated when the customer is an enterprise. The problem seems rather simple: imagine an enterprise server that needs to be on, serving customer requests, all the time. If this server crashes for some reason, there needs to be a standby server that can immediately take over. This is the easy part. The problem is that the standby server needs to have a copy of all the latest data, so that no data is lost (or at least very little data is lost). To do this, the replication software continuously copies all the latest updates of the data from the disks on the primary server side to the disks on the standby server side.
This is much harder than it seems. A simple implementation would simply ensure that every write of data that is done on the primary is also done on the standby storage at the same time (synchronously). This is unacceptable because each write would take unacceptably long and this would slow down the primary server too much.
If you are not doing synchronous updates, you need to start worrying about write order fidelity.
Write-order fidelity and file-system consistency
If a database writes a number of pages to the disk on your primary server, and if you have software that is replicating all these writes to a disk on a stand-by server, it is very important that the writes should be done on the stand-by in the same order in which they were done at the primary servers. This section explains why this is important, and also why doing this is difficult. If you know about this stuff already (database and file-system guys) or if you just don’t care about the technical details, skip to the next section.
Imagine a bank database. Account balances are stored as records in the database, which are ultimately stored on the disk. Imagine that I transfer Rs. 50,000 from Basant’s account to Navin’s account. Suppose Basant’s account had Rs. 3,00,000 before the transaction and Navin’s account had Rs. 1,00,000. So, during this transaction, the database software will end up doing two different writes to the disk:
write #1: Update Basant’s bank balance to 2,50,000
write #2: Update Navin’s bank balance to 1,50,000
Let us assume that Basant and Navin’s bank balances are stored on different locations on the disk (i.e. on different pages). This means that the above will be two different writes. If there is a power failure, after write #1, but before write #2, then the bank will have reduced Basant’s balance without increasing Navin’s balance. This is unacceptable. When the database server restarts when power is restored, it will have lost Rs. 50,000.
After write #1, the database (and the file-system) is said to be in an inconsistent state. After write #2, consistency is restored.
It is always possible that at the time of a power failure, a database might be inconsistent. This cannot be prevented, but it can be cured. For this, databases typically do something called write-ahead-logging. In this, the database first writes a “log entry” indicating what updates it is going to do as part of the current transaction. And only after the log entry is written does it do the actual updates. Now the sequence of updates is this:
write #0: Write this log entry “Update Basant’s balance to Rs. 2,50,000; update Navin’s balance to Rs. 1,50,000” to the logging section of the disk
write #1: Update Basant’s bank balance to 2,50,000
write #2: Update Navin’s bank balance to 1,50,000
Now if the power failure occurs between writes #0 and #1 or between #1 and #2, then the database has enough information to fix things later. When it restarts, before the database becomes active, it first reads the logging section of the disk and goes and checks whether all the updates that were claimed in the logs have actually happened. In this case, after reading the log entry, it needs to check whether Basant’s balance is actually 2,50,000 and Navin’s balance is actually 1,50,000. If they are not, the database is inconsistent, but it has enough information to restore consistency. The recovery procedure consists of simply going ahead and making those updates. After these updates, the database can continue with regular operations.
(Note: This is a huge simplification of what really happens, and has some inaccuracies – the intention here is to give you a feel for what is going on, not a course lecture on database theory. Database people, please don’t write to me about the errors in the above – I already know; I have a Ph.D. in this area.)
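For readers who want to see the sequence in code, here is a toy Python sketch of the write-ahead-logging scheme described above. In-memory dictionaries and a plain text file stand in for disk pages and the log device; it is emphatically not how a real database is built.

```python
# Toy write-ahead logging: the log record is flushed to stable storage
# (write #0) before the balance "pages" are updated (writes #1 and #2),
# and recovery simply replays the log. The log stores the new absolute
# balances, so replaying it more than once is harmless.
import os, json

LOG_PATH = "txn.log"
balances = {"Basant": 300000, "Navin": 100000}   # stand-in for disk pages

def transfer(frm, to, amount):
    # write #0: append the log entry and force it to disk first
    with open(LOG_PATH, "a") as log:
        log.write(json.dumps({frm: balances[frm] - amount,
                              to: balances[to] + amount}) + "\n")
        log.flush()
        os.fsync(log.fileno())
    # writes #1 and #2: only now touch the pages (their mutual order is free)
    balances[frm] -= amount
    balances[to] += amount

def recover():
    # on restart, re-apply every logged update before going active
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                balances.update(json.loads(line))

transfer("Basant", "Navin", 50000)
```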
Note that in the above scheme the order in which writes happen is very important. Specifically, write #0 must happen before #1 and #2. If for some reason write #1 happens before write #0 we can lose money again. Just imagine a power failure after write #1 but before write #0. On the other hand, it doesn’t really matter whether write #1 happens before write #2 or the other way around. The mathematically inclined will notice that this is a partial order.
Now if there is replication software that is replicating all the writes from the primary to the secondary, it needs to ensure that the writes happen in the same order. Otherwise the database on the stand-by server will be inconsistent, and can result in problems if suddenly the stand-by needs to take over as the main database. (Strictly speaking, we just need to ensure that the partial order is respected. So we can do the writes in this order: #0, #2, #1 and things will be fine. But #2, #0, #1 could lead to an inconsistent database.)
Replication software that ensures this is said to maintain write order fidelity. A large enterprise that runs mission critical databases (and other similar software) will not accept any replication software that does not maintain write order fidelity.
Why is write-order fidelity difficult?
I can hear you muttering, “Ok, fine! Do the writes in the same order. Got it. What’s the big deal?” Turns out that maintaining write-order fidelity is easier said than done. Imagine that your database server has multiple CPUs. The different writes are being done by different CPUs. And the different CPUs have different clocks, so that the timestamps used by them are not necessarily in sync. Multiple CPUs are now the default in server-class machines. Further imagine that the “logging section” of the database is actually stored on a different disk. For reasons beyond the scope of this article, this is the recommended practice. So, the situation is that different CPUs are writing to different disks, and the poor replication software has to figure out what order this was done in. It gets even worse when you realize that the disks are not simple disks, but complex disk arrays that have a whole lot of intelligence of their own (and hence might not write in the order you specified), and that there is a volume manager layer on the disk (which can be doing striping and RAID and other fancy tricks) and a file-system layer on top of the volume manager layer that is doing buffering of the writes, and you begin to get an idea of why this is not easy.
Naive solutions to this problem, like using locks to serialize the writes, result in unacceptable degradation of performance.
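To see why, here is a Python sketch of exactly that naive scheme: one global lock hands out a sequence number for every write, so the standby can simply replay writes in sequence order. Write order is preserved trivially, but every CPU now queues on a single lock for every I/O, which is the performance problem just mentioned. (This is only the naive approach, not Druvaa's patent-pending technique.)

```python
# Naive write-order preservation: serialize all writes through one lock.
# Correct, but the lock becomes a bottleneck on a multi-CPU server.
import threading

class NaiveReplicatedDisk:
    def __init__(self, local_disk, standby_link):
        self.local_disk = local_disk      # anything with a write(offset, data) method
        self.standby_link = standby_link  # list/queue of records shipped to the standby
        self.lock = threading.Lock()
        self.seq = 0

    def write(self, offset, data):
        with self.lock:                   # every write from every CPU waits here
            self.seq += 1
            self.local_disk.write(offset, data)
            self.standby_link.append((self.seq, offset, data))

# The standby applies records in increasing seq order, so it can never see
# write #1 before write #0.
```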
Druvaa Replicator has patent-pending technology in this area, where they are able to automatically figure out the partial order of the writes made at the primary, without significantly increasing the overheads. In this article, I’ve just focused on one aspect of Druvaa Replicator, just to give an idea of why this is so difficult to build. To get a more complete picture of the technology in it, see this white paper.
Druvaa inSync
Druvaa inSync is a solution that allows desktops/laptops in an enterprise to be backed up to a central server. (The central server is also in the enterprise; imagine the central server being in the head office, and the desktops/laptops spread out over a number of satellite offices across the country.) The key features of inSync are:
The amount of data being sent from the laptop to the backup server is greatly reduced (often by over 90%) compared to standard backup solutions. This results in much faster backups and lower consumption of expensive WAN bandwidth.
It stores all copies of the data, and hence allows timeline-based recovery. You can recover any version of any document as it existed at any point of time in the past. Imagine you plugged in your friend’s USB drive at 2:30pm, and that resulted in a virus that totally screwed up your system. Simply use inSync to restore your system to the state that existed at 2:29pm and you are done. This is possible because Druvaa backs up your data continuously and automatically. This is far better than having to restore from last night’s backup and losing all data from this morning.
It intelligently senses the kind of network connection that exists between the laptop and the backup server, and will correspondingly throttle its own usage of the network (possibly based on customer policies) to ensure that it does not interfere with the customer’s YouTube video browsing habits.
Data de-duplication
Overview of Druvaa inSync. 1. Fingerprints computed on laptop sent to backup server. 2. Backup server responds with information about which parts are non-duplicate. 3. Non-duplicate parts compressed, encrypted and sent.
Let’s dig a little deeper into the claim of 90% reduction of data transfer. The basic technology behind this is called data de-duplication. Imagine an enterprise with 10 employees. All their laptops have been backed up to a single central server. At this point, data de-duplication software can realize that there is a lot of data that has been duplicated across the different backups, i.e. the 10 different backups contain a lot of files that are common: most of the files in the C:\WINDOWS directory, for example, or all those large PowerPoint documents that got mail-forwarded around the office. In such cases, the de-duplication software can save disk space by keeping just one copy of the file and deleting all the other copies. In place of each deleted copy, it can store a shortcut indicating that if this user tries to restore this file, it should be fetched from the other backup and then restored.
Data de-duplication doesn’t have to be at the level of whole files. Imagine a long and complex document you created and sent to your boss. Your boss simply changed the first three lines and saved it into a document with a different name. These files have different names, and different contents, but most of the data (other than the first few lines) is the same. De-duplication software can detect such copies of the data too, and are smart enough to store only one copy of this document in the first backup, and just the differences in the second backup.
The way to detect duplicates is through a mechanism called document fingerprinting. Each document is broken up into smaller chunks. (How to determine what constitutes one chunk is an advanced topic beyond the scope of this article.) Now, a short “fingerprint” is created for each chunk. A fingerprint is a short string (e.g. 16 bytes) that is uniquely determined by the contents of the entire chunk. The computation of a fingerprint is done in such a way that if even a single byte of the chunk is changed, the fingerprint changes. (It’s something like a checksum, but a little more complicated to ensure that two different chunks cannot accidentally have the same checksum.)
All the fingerprints of all the chunks are then stored in a database. Now every time a new document is encountered, it is broken up into chunks, fingerprints are computed, and these fingerprints are looked up in the database of fingerprints. If a fingerprint is found in the database, then we know that this particular chunk already exists somewhere in one of the backups, and the database will tell us the location of the chunk. Now this chunk in the new file can be replaced by a shortcut to the old chunk. Rinse. Repeat. And we get 90% savings of disk space. The interested reader is encouraged to google Rabin fingerprinting, shingling, and rsync for hours of fascinating algorithms in this area. Before you know it, you’ll be trying to figure out how to use these techniques to find out who is plagiarising your blog content on the internet.
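Here is a minimal Python sketch of that fingerprint-and-look-up loop, using fixed-size chunks and SHA-256 as the fingerprint. inSync's actual chunking and fingerprinting are more sophisticated (content-defined chunks, as hinted above), so treat this purely as an illustration of the idea.

```python
# Chunk-level de-duplication: store each unique chunk once, and represent a
# document as the list of its chunk fingerprints ("a recipe of shortcuts").
import hashlib

CHUNK_SIZE = 4096
chunk_store = {}        # fingerprint -> chunk bytes (the single stored copy)

def backup(data: bytes):
    """Return the list of fingerprints that represents this document."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in chunk_store:   # unseen chunk: this is the data actually sent
            chunk_store[fp] = chunk
        recipe.append(fp)           # seen chunk: only the shortcut is kept
    return recipe

def restore(recipe):
    return b"".join(chunk_store[fp] for fp in recipe)
```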
Back to Druvaa inSync. inSync does fingerprinting at the laptop itself, before the data is sent to the central server. So, it is able to detect duplicate content before it gets sent over the slow and expensive net connection and consumes time and bandwidth. This is in contrast to most other systems that do de-duplication as a post-processing step at the server. At a Fortune 500 customer site, inSync was able to reduce the backup time from 30 minutes to 4 minutes, and the disk space required on the server went down from 7TB to 680GB. (source.)
Again, this was just one example used to give an idea of the complexities involved in building inSync. For more information on other distinguishing features, check out the inSync product overview page.
Have questions about the technology, or about Druvaa in general? Ask them in the comments section below (or email me). I’m sure Milind/Jaspreet will be happy to answer them.
Also, this long, tech-heavy article was an experiment. Did you like it? Was it too long? Too technical? Do you want more articles like this, or less? Please let me know.
It’s the middle of the night, and your prepaid phone runs out of credits, and you need to make a call urgently. Don’t you wish that you could re-charge your prepaid mobile over the internet? Pune-based startup ApnaBill allows you to do just that. Fire up a browser, select your operator (they have partnerships with all major service providers), pay from your bank account or by credit card, and receive an SMS/e-mail with the recharge PIN. Done. They have extended this model to satellite TV (TataSky, Dish), with more such coming out of the pipeline.
PuneTech interviewed co-founder and lead developer Mayank Jain where he talks about various things, from technical challenges (does your hosting provider have an upper limit on number of emails you can send out per day?), to unexpected problems that will slow down your startup (PAN card!), and advice for other budding entrepreneurs (start the paperwork for registration/bank accounts as soon as possible).
On to the interview.
Overview of ApnaBill:
Simply put, ApnaBill.com is an online service for facilitating prepaid and postpaid utility bill payments.
Available now are prepaid utility bill payments such as prepaid mobile recharge and prepaid vouchers for Tata Sky, World Phone, Dish TV, etc.
Organizationally, ApnaBill.com is an offshoot of Four Fractions. It aims at being the single point of contact between service providers and customers, thereby minimizing transactional costs. This benefit is passed directly on to our customers, as we do NOT charge our customers any transaction fees. It’s an ApnaBill.com policy and will apply to our entire product line.
Apart from regular Utility Bill Payments, we are also exploring some seemingly blue ocean verticals which have not been targeted by the online bill payment sector – yet.
Monetization strategy:
We have structured our business model such that, despite absorbing the transaction cost, we’ll be able to make profits. Margins would definitely be low, but the sheer volume of transactions (which we would attract because of the no-transaction-charge policy) would keep our figures positive.
Moreover, profit generated from transactions is just one revenue source. Once we have a good traction, our advertisement revenue sources would also become viable.
We are definitely looking at a long term brand building.
Technical Challenges – Overview
Contrary to popular belief, technology is generally the simplest ingredient in a startup – especially because the startup can exercise full control over how it is used and deployed. And with increasingly cheaper computing resources, this space is becoming even smoother.
However, the following were real challenges that we faced and solved:
Being a web 2.0 startup, we faced some major cross browser issues.
Minimizing client-side bandwidth usage and page display times
Database versioning.
Thankfully, ApnaBill.com runs Ruby on Rails under the hood – and all the solutions we designed just fit into the right grooves.
Technical Challenges – Details
Ruby on Rails is one of the best frameworks a web developer can ask for. Solutions to all of the above problems practically come bundled with it.
The Prototype JavaScript library solves a lot of common cross-browser issues. To eradicate them completely, we used an additional PNG hack from Pluit Solutions and IE7.js, which lets the IE6 browser render PNG images with transparency. Once you have sanity in terms of cross-browser issues, you can actually start focusing on feature development.
To overcome mail capping limits for shared hosts, we devised our own modules which would schedule mails if they were crossing the mail caps. However, we later discovered that there’s a great Ruby gem – ar_mailer to do just that. We are planning to make the shift.
Minimizing client-side page load times was an interesting problem. We used Yahoo’s YSlow to detect where we lagged in terms of page load speed, and introduced the necessary changes, like moving JS to the bottom of pages and CSS to the top, which helped us a lot in reducing load times. Yahoo also has a JS minifier – YUI Compressor – which works great in reducing JavaScript files to up to 15%. We also deployed a simple page-name-based JS deployment scheme which blocks any JavaScript from loading on particular pages (for example, the homepage). This gives us ultra-fast page loads.
If you see our homepage, no JS loads up when the page is loading up. However, once the page is loaded, we initiate a delayed JS load which renders our news feed in the end.
Database versioning is an inbuilt feature in Rails. We can effectively revert back to any version of ApnaBill.com (in terms of functionality) with standard Rails framework procedures.
Non-technical challenges:
Integrating various vendors and services was visibly the biggest challenge we overcame during the (almost) 9 months development cycle of ApnaBill.com.
Getting the organization up and running was another big challenge. The paperwork takes a lot of valuable time – which if visioned properly, can be minimized to a manageable amount.
Payment gateways are a big mess for startups. They are costly, demand huge chunks of money as security deposits, and have very high transaction costs. Those that are cheap lack even basic courtesy and quality of service. Sooner or later, the backbone of your business becomes the single most painful factor in your business process – especially when you have no control over its functioning.
Thankfully, there are a few payment gateways which are above all of this. We hope to make an announcement soon.
The founders of ApnaBill - from left, Mayank, Sameer and Sandeep.
The process of founding ApnaBill:
When and how did you get the idea of founding ApnaBill? How long before you finally decided to take the plunge and start in earnest? What is your team like now?
In June 2007, one of the founding members of Four Fractions saw a friend of his cribbing about how he could not recharge his prepaid mobile phone from the comfort of his home. He had to walk about 1 km to the nearest local shop to get his phone connection recharged.
This idea caught the founder’s attention, and he, along with others, formed Four Fractions on 20th December 2007 to launch ApnaBill.com as one of their flagship products.
ApnaBill.com was opened for public transactions on 15th June 08. The release was a birthday present to ApnaBill.com’s co-founder’s mom.
Our team is now 5 people strong, spread across New Delhi and Pune. As of now, we are self funded and are actively looking for seed funding.
What takes most of the time:
As I mentioned earlier, getting various services integrated took most of the time. If we had to just push out our own product (minus all collaborations), it would have taken us less than 3 months.
There was this funny thing that set us back by almost 1 month…
We applied for a PAN card for Four Fractions. First, our application somehow got lost in the process. Then someone in the government department managed to put down our address as 108 when it was supposed to be 10 B (8 and B are very similar looking).
None of us ever envisioned this – but it happened. We lost a precious month sorting this issue out. And since all activities were dependent on official papers, other things like bank accounts, payment gateway integrations, etc. also got pushed back. But I am glad we sorted this out in the end. Our families supported us through this all the way.
Every process, like creating bank accounts, getting PAN cards, etc., is still very slow and manual in nature. If we can somehow improve on these, the ecosystem can prove very helpful for budding startups.
About the co-founders:
There are 3 co-founders of ApnaBill.com:
Sameer Jain: Sameer is the brain behind our revenue generation streams and marketing policies. He is a Post Grad from Delhi University in International Marketing.
Sandeep Kumar: Sandeep comes from a billing (technical) background. With him, he has brought vast knowledge about billing processes and solid database know-how.
Myself (Mayank Jain): I come from a desktop application development background. I switched to Ruby on Rails almost 18 months ago – and since then, I have been a devoted Ruby evangelist and Rails developer.
Luckily, we have a team which is just right. We have two polarizing ends – Sandeep and Sameer. One of them is constantly driving the organization to minimize costs, while the other is driven towards maximizing revenue from all possible sources. I act as the glue between the two of them. Together, we are constantly driving the organization forward.
About selection for proto.in:
Proto.in was the platform we had been preparing for over almost 2 months. We had decided our launch dates in such a way that we would launch and be LIVE just in time for Proto.in.
Being recognized for your efforts is a big satisfaction.
Proto.in was also a huge learning experience. Interacting directly with our potential users gave us an insight into how they perceive ApnaBill.com and what they want out of it. We also came across some interesting revenue-generation ideas while interacting with the startup veterans at Proto.
There are a lot of people who are currently doing a job somewhere, but who harbor a desire to start something on their own. Since you have already gone that route, what suggestions would you have for them?
Some tips I would like to share with my peer budding entrepreneurs…
Focus, focus and focus!
If you are an internet startup, book your domain before anything else and get the right hosting partner.
Start the paperwork for firm/bank accounts registration as soon as possible.
Write down your financial/investment plan on paper before you start. Some plan is way better than no plan!
Adopt a proper development process for the tech team. With a process in place, development activities can be tracked rationally.
Get someone to manage your finances – outsourcing is a very attractive option.
The most important factor for a startup, above anything else, is to keep fighting through the adverse scenarios. Almost everything will spring up in your face as a problem, but a team that can work together to find a solution makes it to the end.
Just remember, more than the destination, it is the journey that would count.