
optimization: a technical overview

(This is the fourth in the PuneTech series of articles on optimization by Dr. Narayan Venkatasubramanyan, an Optimization Guru and one of the original pioneers in applying Optimization to Supply Chain Management. The first one was an ‘overview’ case study of optimization. The second was architecture of a decision support system. The third was optimization and organizational readiness for change.

For Dr. Narayan Venkatasubramanyan’s detailed bio, please click here. For the full series of articles, click here.)

this is a follow-up to optimization: a case study. frequent references in this article to details in that article would make this one difficult to read for someone who hasn’t at least skimmed through that.

the problem of choice

the wikipedia article on optimization provides a great overview of the field. it does a thorough job by providing a brief history of the field of mathematical optimization, breaking down the field into its various sub-fields, and even making a passing reference to commercially available packages that help in the rapid development of optimization-based solutions. the rich set of links in this page lead to detailed discussions of each of the topics touched on in the overview.

i’m tempted to stop here and say that my job is done but there is one slight problem: there is a complete absence of any reference to helicopter scheduling in an offshore oil-field. not a trace!

this brings me to the biggest problem facing a young practitioner in the field: what to do when faced with a practical problem?

of course, the first instinct is to run with the technique one is most familiar with. being among the few in our mba program that had chosen the elective titled “selected topics in operations research” (a title that i’m now convinced was designed to bore and/or scare off prospective students who weren’t self-selected card-carrying nerds), we came to the problem of helicopter scheduling armed with a wealth of text-book knowledge.

an overview of linear programming

(figure: a system of linear constraints on two variables. the lines represent the constraints; the blue region is the set of all “permissible values”. the objective function is used to choose one (“the most optimal”) of the blue points. image via wikipedia)

having recently studied linear and integer programming, we first tried to write down a mathematical formulation of the problem. we knew we could describe each sortie in terms of variables (known as decision variables). we then had to write down constraints that ensured the following:

  • any set of values of those decision variables that satisfied all the constraints would correspond to a sortie
  • any sortie could be described by a permissible set of values of those decision variables

this approach is one of the cornerstones of mathematical programming: given a practical situation to optimize, first write down a set of equations whose solutions have a one-to-one correspondence to the set of possible decisions. typically, these equations have many solutions.

click here for an animated presentation that shows how the solutions to a system of inequalities can be viewed graphically.

the other cornerstone is what is called an objective function, i.e., a mathematical function in those same variables that were used to describe the set of all feasible solutions. the solver is directed to pick the “best” solution, i.e., one that maximizes (or minimizes) the objective function.

the set of constraints and the objective function together constitute a mathematical programming problem. the solution that maximizes (or minimizes) the objective function is called an optimal solution.

linear programming – an example

googling for “linear programming examples” leads to millions of hits, so let me borrow an example at random from here: “A farmer has 10 acres to plant in wheat and rye. He has to plant at least 7 acres. However, he has only $1200 to spend and each acre of wheat costs $200 to plant and each acre of rye costs $100 to plant. Moreover, the farmer has to get the planting done in 12 hours and it takes an hour to plant an acre of wheat and 2 hours to plant an acre of rye. If the profit is $500 per acre of wheat and $300 per acre of rye how many acres of each should be planted to maximize profits?”

the decisions the farmer needs to make are: how many acres of wheat to plant? how many acres of rye to plant? let us call these x and y respectively.

so what values can x and y take?

  • since we know that he has only 10 acres, it is clear that x+y must be no more than 10 (i.e., x+y <= 10).
  • the problem says that he has to plant at least 7 acres. we have two choices: we can be good students and write down the constraint “x+y >= 7” or we can be good practitioners and demand to know more about the origins of this constraint (i’m sure every OR professional of long standing has scars to show from the times when they failed to ask that question.)
  • the budget constraint implies that 200x + 100y <= 1200. again, should we not be asking why this farmer cannot borrow money if doing so will increase his returns?
  • finally, the time constraint translates into x + 2y <= 12. can he not employ farm-hands to increase his options?
  • the non-negativity constraints (x, y >= 0) are often forgotten. in the absence of these constraints, the farmer could plant a negative amount of rye because doing so would seem to get him more land, more money, and more time. clearly, this is practically impossible.

as you will see if you were to scroll down that page, these inequalities define a triangular region in the x,y plane. all points on that triangle and in its interior represent feasible solutions: i.e., if you were to pick a point, say (5,2), it means that the farmer plants 5 acres of wheat and 2 acres of rye. it is easy to confirm that this represents no more than 10 acres, no less than 7 acres, no more than $1200 and no more than 12 hours. but is this the best solution? or is there a better point within that triangle?

this is where the objective function helps. the objective is to maximize the profit earned, i.e., maximize 500x + 300y. from among all the points (x,y) in that triangle, which one has the highest value for 500x + 300y?

this is the essence of linear programming. LPs are a subset of problems that are called mathematical programs.
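
to make this concrete, here is a minimal, purely illustrative sketch of how the farmer’s problem could be handed to an off-the-shelf solver (it assumes python with the scipy library; linprog minimizes, so we negate the profit):

    from scipy.optimize import linprog

    # maximize 500x + 300y by minimizing its negative (linprog minimizes)
    c = [-500, -300]

    # inequality constraints, written as A_ub @ [x, y] <= b_ub
    A_ub = [
        [1, 1],      # x + y <= 10         : only 10 acres available
        [-1, -1],    # -(x + y) <= -7      : at least 7 acres must be planted
        [200, 100],  # 200x + 100y <= 1200 : budget
        [1, 2],      # x + 2y <= 12        : planting time
    ]
    b_ub = [10, -7, 1200, 12]

    # bounds enforce the easily-forgotten non-negativity constraints
    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    print("wheat acres:", result.x[0], "rye acres:", result.x[1], "profit:", -result.fun)

for this particular example the solver should land on 4 acres of wheat and 4 acres of rye, for a profit of $3200.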

real life isn’t always lp

in practice, not all mathematical programs are equally hard. as we saw above, if all the constraints and the objective function are linear in the decision variables and if the decision variables can take on any real value, we have a linear program. this is the easiest class of mathematical programs. linear programming models can be used to describe, sometimes approximately, a large number of commercially interesting problems like supply chain planning. commercial packages like OPL, GAMS, AMPL, etc. can be used to model such problems without having to know much programming. packages like CPLEX can solve problems with millions of decision variables and constraints and produce an optimal solution in reasonable time. lately, there have been many open source solvers (e.g., GLPK) that have been growing in their capability and competing with commercial packages.

(figure: a cutting-plane algorithm for an integer program. integer programming problems constrain the solution to specific discrete values: while the blue lines bound the “feasible region”, the solution is only allowed to take on values represented by the red dots. this makes the problem significantly more difficult. image via wikipedia)

in many interesting commercial problems, the decision variables are required to take on discrete values. for example, a sortie that carries 1/3 of a passenger from point a to point b and transports the other 2/3 on a second flight from point a to point b would not work in practice. a helicopter that lands 0.3 times at point c and 0.7 times at point d is equally impractical. these variables have to be restricted to integer values. such problems are called integer programming problems. (there is a special class of problems in which the decision variables are required to be 0 or 1; such problems are called 0-1 programming problems.) integer programming problems are surprisingly hard to solve. they occur routinely in scheduling as well as in any situation that involves discrete decisions. commercial packages like CPLEX include a variety of sophisticated techniques to find good (although not always optimal) solutions to such problems. what makes these problems hard is the reality that the solution time for such problems grows exponentially with the growth in the size of the problem.
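
purely as an illustration, here is the farmer’s problem again, this time written as an integer program using the open-source pulp package (assuming it is installed): the only change is that the acreages are declared integer, which forces the solver to use techniques like branch-and-bound rather than plain LP.

    from pulp import LpProblem, LpVariable, LpMaximize, value

    prob = LpProblem("farmer_ip", LpMaximize)

    # declaring the variables integer is what turns the LP into an IP
    x = LpVariable("acres_of_wheat", lowBound=0, cat="Integer")
    y = LpVariable("acres_of_rye", lowBound=0, cat="Integer")

    prob += 500 * x + 300 * y            # objective: profit
    prob += x + y <= 10                  # land
    prob += x + y >= 7                   # minimum planting
    prob += 200 * x + 100 * y <= 1200    # budget
    prob += x + 2 * y <= 12              # time

    prob.solve()                         # pulp's bundled CBC solver handles the integrality
    print(value(x), value(y), value(prob.objective))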

another class of interesting commercial problems involves non-linear constraints and/or objective functions. such problems occur routinely in situations such as refinery planning where the dynamics of the process cannot be described (even approximately) with linear functions. some non-linear problems are relatively easy because they are guaranteed to have a unique minimum (or maximum). such well-behaved problems are easy to solve because one can always move along an improving path and find the optimal solution. when the functions involved are non-convex, you could have local minima (or maxima) that are worse than the global minimum (or maximum). such problems are relatively hard because short-sighted algorithms could find a local minimum and get stuck in it.

fortunately for us, the helicopter scheduling problem had no non-linear effects (at least none that we accounted for in our model). unfortunately for us, the discrete constraints were themselves extremely hard to deal with. as we wrote down the formulation on paper, it became quickly apparent that the sheer size and complexity of the problem was beyond the capabilities of the IBM PC-XT that we had at our disposal. after kicking this idea around for a bit, we abandoned this approach.

resorting to heuristics

we decided to resort to a heuristic approach, i.e., an approach that used a set of rules to find good solutions to the problem. the approach we took involved the enumeration of all possible paths on a search tree and then an evaluation of those paths to find the most efficient one. for example, if the sortie was required to start at point A and drop off m1 men at point B and m2 men at point C, the helicopter could

  • leave point A with the m1 men and proceed to point B, or
  • leave point A with the m2 men and proceed to point C, or
  • leave point A with the m1 men and some of the m2 men and proceed to point B, or
  • leave point A with the m1 men and some of the m2 men and proceed to point C, or
  • . . .

if we were to select the first possibility, it would drop off the m1 men and then consider all the options available to it (return to A for the m2 men? fly to point D to refuel?).

we would then traverse this tree enumerating all the paths and evaluating them for their total cost. finally, we would pick the “best” path and publish it to the radio operator.

at first, this may seem ridiculous: the explosion of possibilities meant that this tree was dauntingly large.

there were several ways around this problem. firstly, we never really explicitly enumerated all possible paths. we built out the possibilities as we went, keeping the best solution found so far. although the number of possible paths that a helicopter could fly in the course of a sortie was huge, there were simple rules that directed the search in promising directions so that the algorithm could quickly find a “good” sortie. once a complete sortie had been found, the algorithm could then use it to prune searches down branches that seemed to hold no promise for a better solution. the trick was to tune the search direction and prune the tree without eliminating any feasible possibilities. of course, aggressive pruning would speed up the search but could end up eliminating good solutions. similarly, good rules to direct the search could help find good solutions quickly but could defer searches in non-obvious directions. since we were limited in time, the search tree was never completely searched; if the rules were poor, good solutions could be pushed so late in the search that they were never found, at least not in time to be implemented.
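
as a rough sketch of the general idea (not the actual code, and with every name invented for illustration): explore options depth-first, remember the best complete sortie found so far, and prune any branch whose most optimistic completion cannot beat it.

    import math

    def branch_and_bound(start, expand, is_complete, lower_bound):
        """depth-first search with pruning: a bare-bones skeleton of the idea.

        expand(state) yields (next_state, step_cost) pairs, ordered by whatever
        rules direct the search toward promising sorties first.
        lower_bound(state) is an optimistic estimate of the cheapest possible
        completion of a partial sortie.
        is_complete(state) says whether the state describes a finished sortie.
        """
        best = {"cost": math.inf, "plan": None}

        def visit(state, cost_so_far):
            # prune: even the most optimistic completion cannot beat the incumbent
            if cost_so_far + lower_bound(state) >= best["cost"]:
                return
            if is_complete(state):
                best["cost"], best["plan"] = cost_so_far, state
                return
            for next_state, step_cost in expand(state):
                visit(next_state, cost_so_far + step_cost)

        visit(start, 0.0)
        return best["plan"], best["cost"]

if lower_bound is allowed to over-estimate (aggressive pruning), the search runs faster but can discard the best sortie; if expand orders its candidates poorly, good sorties surface too late to be used. those are exactly the trade-offs described above.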

one of the nice benefits of this approach was that it allowed the radio operator to lock down the first few steps in the sortie and leave the computer to continue to search for a good solution for the remainder of the sortie. this allowed the optimizer to continue to run even after the sortie had begun. this bought the algorithm precious time. allowing the radio operator the ability to override also had the added benefit of putting the user in control in case what the system recommended was infeasible or undesirable.

notice that this approach is quite far from mathematical programming. there is no guarantee of an optimal solution (unless one can guarantee that pruning was never too aggressive and that we exhaustively searched the tree, neither of which could be guaranteed in practical cases). nevertheless, this turned out to be quite an effective strategy because it found a good solution quickly and then tried to improve on the solution within the time it was allowed.

traditional operations research vs. artificial intelligence

this may be a good juncture for an aside: the field of optimization has traditionally been the domain of operations researchers (i.e., applied mathematicians and industrial engineers). even though the field of artificial intelligence in computer science has been the source of many techniques that effectively solve many of the same problems as operations research techniques do, OR-traditionalists have always tended to look askance at their lowly competitors due to the perceived lack of rigour in the AI techniques. this attitude is apparent in the wikipedia article too: after listing all the approaches that are born from mathematical optimization, it introduces “non-traditional” methods with a somewhat off-handed “Here are a few other popular methods:” i find this both amusing and a little disappointing. there have been a few honest attempts at bringing these two fields together but a lot more can be done (i believe). it would be interesting to see how someone steeped in the AI tradition would have approached this problem. perhaps many of the techniques for directing the search and pruning the tree are specific instances of general approaches studied in that discipline.

if there is a moral to this angle of our off-shore adventures, it is this: when approaching an optimization problem, it is tempting to shoot for the stars by going down a rigorous path. often, reality intrudes. even when making technical choices, we need to account for the context in which the software will be used, how much time there is to solve the problem, what are the computing resources available, and how it will fit into the normal routine of work.

other articles in this series

this article is the fourth in the series of short explorations related to the application of optimization. i’d like to share what i’ve learned over a career spent largely in the business of applying optimization to real-world problems. interestingly, there is a lot more to practical optimization than models and algorithms. each of the links leads to a piece that dwells on one particular aspect.

optimization: a case study
architecture of a decision-support system
optimization and organizational readiness for change
optimization: a technical overview (this article)

About the author – Dr. Narayan Venkatasubramanyan

Dr. Narayan Venkatasubramanyan has spent over two decades applying a rare combination of quantitative skills, business knowledge, and the ability to think from first principles to real world business problems. He currently consults in several areas including supply chain and health care management. As a Fellow at i2 Technologies, he tackled supply chain problems in areas as diverse as computer assembly, semiconductor manufacturing, consumer goods, steel, and automotive. Prior to that, he worked with several airlines on their aircraft and crew scheduling problems. He topped off his days at IIT-Bombay and IIM-Ahmedabad with a Ph.D. in Operations Research from the University of Wisconsin-Madison.

He is presently based in Dallas, USA and travels extensively all over the world during the course of his consulting assignments. You can also find Narayan on Linkedin at: http://www.linkedin.com/in/narayan3rdeye


Why Analytics Matter in Business Intelligence – CSI Pune Lecture – 6th March

Computer Society of India – Pune Chapter presents the 5th lecture in a series on Data warehousing. The first lecture gave an overview of BI and DW. The second lecture was about how these techniques are used by businesses. The third was about data management for business intelligence. The fourth lecture talked about technology trends in BI. This is the fifth in the series:

What: Why Analytics Matter in Business Intelligence by Ajit Ghanekar of SAS R&D India.

When: Friday, March 6th, 2009, 6:30pm to 8:30pm

Where: Dewang Mehta Auditorium, Persistent Systems,402, Senapati Bapat Road, Pune
Entry: Free for CSI Members & Students, Rs. 100 for others. Rs. 50 for Persistent employees.  Register here.

Details – Why Analytics Matter in Business Intelligence

One of the areas which adds significant value to business is the application of analytics to solving complex problems. These can be in areas such as scoring, risk management, fraud detection, forecasting, and so on. The focus of this session will be to give an introduction to the role of statistical techniques in BI applications.

It is not necessary to have attended the previous lectures.

For more information about other tech events in Pune, see the PuneTech events calendar.

About the speaker – Ajit Ghanekar

Ajit is a Senior Software Specialist – Analytics at SAS Research & Development, India, and has 10 years of experience in developing analytical solutions in areas like Statistical Inference, Modeling, and Time Series, in the Banking and Pharma domains. Currently, he is engaged in the SAS Credit Risk Management Solution.

Ajit has a Masters in Statistics from Pune University and a PG Diploma in Banking and Finance from SIBM.


Should you use a file-system or a database?

Whether to use a file-system or a database to store the data of your application has been a contentious issue since the 80s. It was something we worried about even when I was doing my Ph.D. in Databases in the 90s. Now Jaspreet Singh, of Pune-based startup Druvaa has weighed in on this issue on Druvaa’s blog. His post is republished here with permission.

This topic has been on my plate for some time now. It’s interesting to see how databases have come a long way and have clearly overshadowed file-systems for storing structured or unstructured information.

Technically, both of them support the basic features necessary for data access. For example, both of them –

  • manage data to ensure its integrity and quality
  • allow shared access by a community of users
  • use a well-defined schema for data access
  • support a query language

But file-systems seriously lack some of the critical features necessary for managing data. Let’s take a look at some of these features.

Transaction support
Atomic transactions guarantee complete failure or success of an operation. This is especially needed when there is concurrent access to the same data-set. This is one of the basic features provided by all databases.

But most file-systems don’t have this feature. Only the lesser-known file-systems – Transactional NTFS (TxF), Sun ZFS, Veritas VxFS – support it. Most of the popular open-source file-systems (including ext3, xfs, reiserfs) are not even POSIX compliant.
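
As a rough illustration of what this buys you, here is a sketch using Python's built-in sqlite3 module (table and values invented for the example): either both updates inside the transaction are applied, or neither is.

    import sqlite3

    conn = sqlite3.connect("example.db")
    conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT OR IGNORE INTO accounts VALUES ('a', 100), ('b', 0)")
    conn.commit()

    try:
        with conn:  # opens a transaction; commits on success, rolls back on any error
            conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
            conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
    except sqlite3.Error:
        pass  # the whole transaction was rolled back; no half-applied transfer

    print(dict(conn.execute("SELECT name, balance FROM accounts")))

Doing the same thing with two plain file writes leaves a window in which a crash produces a half-applied change, which is exactly the gap being pointed out here.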

Fast Indexing
Databases allow indexing based on any attribute or data-property (i.e. SQL columns). This enables fast retrieval of data based on the indexed attribute. This functionality is not offered by most file-systems, i.e. you can’t quickly access “all files created after 2 PM today”.

Desktop search tools like Google Desktop or Mac Spotlight offer this functionality, but for this they have to scan and index the complete file-system and store the information in an internal relational database.
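
For instance, once file metadata sits in an indexed table, the “all files created after 2 PM today” query becomes a one-liner. A minimal sketch with sqlite3 (table, columns, and timestamp are made up for illustration):

    import sqlite3

    conn = sqlite3.connect("catalog.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS files (
                        path TEXT PRIMARY KEY,
                        size INTEGER,
                        created_at TEXT)""")
    # the index lets the query below avoid scanning every row
    conn.execute("CREATE INDEX IF NOT EXISTS idx_files_created ON files(created_at)")

    # 'all files created after 2 PM today'
    rows = conn.execute(
        "SELECT path FROM files WHERE created_at > ?", ("2009-03-06 14:00:00",)
    ).fetchall()
    print(rows)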

Snapshots
Snapshot is a point-in-time copy/view of the data. Snapshots are needed for backup applications, which need consistent point-in-time copies of data.

The transactional and journaling capabilities enable most databases to offer snapshots without stopping access to the data. Most file-systems, however, don’t provide this feature (ZFS and VxFS being the only exceptions). Backup software has to depend on either the running application or the underlying storage for snapshots.

Clustering
Advanced databases like Oracle (and now MySQL) also offer clustering capabilities. The “g” in “Oracle 11g” actually stands for “grid”, or clustering capability. MySQL offers shared-nothing clusters using synchronous replication. This helps the databases scale up and support larger and more fault-tolerant production environments.

File systems still don’t support this option 🙁  The only exceptions are Veritas CFS and GFS (Open Source).

Replication
Replication is a commodity feature with databases and forms the basis for disaster-recovery plans. File-systems still have to evolve to handle it.

Relational View of Data
File systems store files and other objects only as a stream of bytes, and have little or no information about the data stored in the files. Such file systems also provide only a single way of organizing the files, namely via directories and file names. The associated attributes are also limited in number, e.g. type, size, author, creation time, etc. This does not help in managing related data, as disparate items do not have any relationships defined.

Databases, on the other hand, offer easy means to relate stored data. They also offer a flexible query language (SQL) to retrieve the data. For example, it is possible to query a database for “contacts of all persons who live in Acapulco and sent emails yesterday”, but impossible in the case of a file system.

File-systems need to evolve and provide capabilities to relate different data-sets. This will help the application writers to make use of native file-system capabilities to relate data. A good effort in this direction was Microsoft WinFS.

Conclusion

The only disadvantage of using a database as the primary storage option seems to be the additional cost associated with it. But I see no reason why file-systems of the future won’t borrow these features from databases.

Disclosure

Druvaa inSync uses a proprietary file-system to store and index the backed up data. The meta-data for the file-system is stored in an embedded PostgreSQL database. The database-driven model was chosen to store additional identifiers with each block – size, hash and time. This helps the filesystem to –

  1. Divide files into variable sized blocks
  2. Data deduplication – Store single copy of duplicate blocks
  3. Temporal File-system – Store time information with each block. This enables faster time-based restores.

Business Intelligence Technology Trends: CSI Pune Lecture – 30 Jan

Computer Society of India – Pune Chapter presents the 4th lecture in a series on Data warehousing. The first lecture gave an overview of BI and DW. The second lecture was about how these techniques are used by businesses. The third was about data management for business intelligence. This is the fourth in the series:

What: Technology trends in Business Intelligence by Prasad Kulkarni of SAS R&D India.
When: Friday, January 30th, 2009, 6:30pm to 8:30pm
Where: Damle Hall, Damle Path, Behind Indsearch, Off Law College Road
Registration and Fees: Free for CSI Members & Students, Rs. 100 for others. Register here.

Details – Technology trends in Business Intelligence

This lecture will cover technological advances in the BI domain. It will start with a discussion of general trends in BI and will relate them to technology. The primary focus is on the different technologies currently used, their necessity, and the type of problems they solve in the business intelligence domain. It will discuss areas like SOA (Service oriented architecture), SaaS (Software as a service), MDM (Master data management), Real time warehousing, Click stream data warehouses, Federated/integrated search, Web 2.0, Data visualization and so on. Participants will learn how such technologies are solving problems specific to the BI domain.

It is not necessary to have attended the previous lecture.

For more information about other lectures in this series, and in general other tech events in Pune, see the PuneTech events calendar.

About the speaker – Prasad Kulkarni

Prasad Kulkarni has been working with SAS Research and Development India Pvt. Ltd. for the past 8 years as Associate Director – Platform Research and Development. He leads the core technology group at SAS R&D Pune. Prasad holds a post-graduate degree in computer management from the University of Pune and has 12 years of experience in the field of information technology. He has worked with product development setups in India. With SAS, his focus areas are Metadata Management, Data Warehousing, Data Visualization and Data Access.


The Great Debate: PostgreSQL vs. MySQL – with Jim Mlodgenski, 23 Jan


This information was sent in by @nikkhils of EnterpriseDB. Thanks!

What: “The Great Debate: PostgreSQL vs. MySQL” with Jim Mlodgenski, Senior Database Architect, EnterpriseDB
When: Friday, 23 Jan, 6pm
Where: Dewang Mehta Auditorium, Persistent, S.B. Road
Registration and Fees: This event is free for all to attend, thanks to Persistent Systems

Details:
For years, the common industry perception has been that MySQL is faster and easier to use than PostgreSQL. PostgreSQL is perceived as more powerful, more focused on data integrity, and stricter at complying with SQL specifications, but correspondingly slower and more complicated to use.

Like many perceptions formed in the past, these things aren’t as true with the current generation of releases as they used to be. DBAs, developers, and IT managers and decision-makers will benefit from this hour-long presentation about the pros and cons of using PostgreSQL or MySQL, which will include a discussion about the ongoing trend towards using open source in the enterprise.

About the Speaker – Jim Mlodgenski

Jim is one of EnterpriseDB’s first employees and joined the company in May, 2005. As Senior Database Architect he has been responsible for EnterpriseDB’s technical pre-sales, professional services, providing customized solutions and training.

Prior to joining EnterpriseDB, Jim was a partner and architect at Fusion Technologies, a technology services company founded by EnterpriseDB’s chief architect, Denis Lussier. For nearly a decade, Jim developed early designs and concepts for Fusion’s consulting projects and specialized in Oracle application development, Web development, and open source.

Jim received a BS degree in Physics from Rensselaer Polytechnic Institute.

Jim has spoken at many international open-source conferences and is the author of many white papers on RDBMSs.


Understanding Data De-duplication

Druvaa is a Pune-based startup that sells fast, efficient, and cheap backup (Update: see the comments section for Druvaa’s comments on my use of the word “cheap” here – apparently they sell even in cases where their product is priced above the competing offerings) software for enterprises and SMEs. It makes heavy use of data de-duplication technology to deliver on the promise of speed and low-bandwidth consumption. In this article, reproduced with permission from their blog, they explain what exactly data de-duplication is and how it works.

Definition of Data De-duplication

Data deduplication or Single Instancing essentially refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted, leaving only one copy (single instance) of the data to be stored. However, indexing of all data is still retained should that data ever be required.

Example
A typical email system might contain 100 instances of the same 1 MB file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy reducing storage and bandwidth demand to only 1 MB.

Technological Classification

The practical benefits of this technology depend upon various factors like –

  1. Point of Application – Source Vs Target
  2. Time of Application – Inline vs Post-Process
  3. Granularity – File vs Sub-File level
  4. Algorithm – Fixed size blocks Vs Variable length data segments

A simple relation between these factors can be explained using the diagram below –

Deduplication Technological Classification

Target Vs Source based Deduplication

Target-based deduplication acts on the target data storage media. In this case the client is unmodified and not aware of any deduplication. The deduplication engine can be embedded in the hardware array, which can be used as a NAS/SAN device with deduplication capabilities. Alternatively it can also be offered as an independent software or hardware appliance which acts as an intermediary between the backup server and storage arrays. In both cases it improves only the storage utilization.

Target Vs Source Deduplication

On the contrary, source-based deduplication acts on the data at the source before it’s moved. A deduplication-aware backup agent is installed on the client, which backs up only unique data. The result is improved bandwidth and storage utilization. But this imposes additional computational load on the backup client.

Inline Vs Post-process Deduplication

In target-based deduplication, the deduplication engine can either process data for duplicates in real time (i.e. as and when it is sent to the target) or after it has been stored in the target storage.

The former is called inline deduplication. The obvious advantages are –

  1. Increase in overall efficiency as data is only passed and processed once
  2. The processed data is instantaneously available for post storage processes like recovery and replication reducing the RPO and RTO window.

The disadvantages are –

  1. Decrease in write throughput
  2. Extent of deduplication is lower – only the fixed-length block approach can be used

Inline deduplication only processes incoming raw blocks and does not have any knowledge of the files or file-structure. This forces it to use the fixed-length block approach (discussed in detail later).

Inline Vs Post Process Deduplication

Post-process deduplication acts asynchronously on the stored data, and has exactly the opposite advantages and disadvantages to those of inline deduplication listed above.

File vs Sub-file Level Deduplication

The duplicate removal algorithm can be applied at the full-file or sub-file level. Full-file level duplicates can be easily eliminated by calculating a single checksum of the complete file data and comparing it against existing checksums of already backed-up files. It’s simple and fast, but the extent of deduplication is quite limited, as it does not address the problem of duplicate content found inside different files or data-sets (e.g. emails).

The sub-file level deduplication technique breaks the file into smaller fixed or variable size blocks, and then uses a standard hash-based algorithm to find identical blocks.
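
A toy sketch of the file-level variant in Python (the hash choice and chunked reading are arbitrary details for illustration, not Druvaa’s implementation): hash each file once and treat files with the same checksum as duplicates.

    import hashlib

    def file_checksum(path):
        """Single checksum of the complete file, read in 1 MB chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def find_duplicates(paths):
        """Keep one path per unique content; report the rest as duplicates."""
        seen, duplicates = {}, []
        for path in paths:
            digest = file_checksum(path)
            if digest in seen:
                duplicates.append((path, seen[digest]))  # path duplicates seen[digest]
            else:
                seen[digest] = path
        return duplicates

The sub-file variant applies the same idea to blocks rather than whole files, which is where the fixed vs. variable length question below comes in.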

Fixed-Length Blocks v/s Variable-Length Data Segments

Fixed-length block approach, as the name suggests, divides the files into fixed size length blocks and uses simple checksum (MD5/SHA etc.) based approach to find duplicates. Although it’s possible to look for repeated blocks, the approach provides very limited effectiveness. The reason is that the primary opportunity for data reduction is in finding duplicate blocks in two transmitted datasets that are made up mostly – but not completely – of the same data segments.

Data Sets and Block Alignment

For example, similar data blocks may be present at different offsets in two different datasets. In other words, the block boundary of similar data may be different. This is very common when some bytes are inserted into a file: when the changed file is processed again and divided into fixed-length blocks, all blocks downstream of the insertion appear to have changed.

Therefore, two datasets with a small amount of difference are likely to have very few identical fixed length blocks.

Variable-Length Data Segment technology divides the data stream into variable length data segments using a methodology that can find the same block boundaries in different locations and contexts. This allows the boundaries to “float” within the data stream so that changes in one part of the dataset have little or no impact on the boundaries in other locations of the dataset.
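
The two approaches can be contrasted with a toy sketch (the window, mask, and minimum size below are arbitrary choices for illustration, not any vendor’s algorithm): the first function cuts every N bytes, while the second cuts wherever a crude rolling fingerprint of the most recent bytes hits a chosen pattern, so the boundaries move with the content.

    import hashlib, random

    def fixed_chunks(data, size=4096):
        """Fixed-length blocks: an early insertion shifts every later boundary."""
        return [data[i:i + size] for i in range(0, len(data), size)]

    def content_defined_chunks(data, mask=0x3FF, min_size=512):
        """Cut where a rolling fingerprint of the recent bytes matches a pattern,
        so boundaries 'float' with the content rather than with the byte offset."""
        chunks, start, fp = [], 0, 0
        for i, byte in enumerate(data):
            fp = ((fp << 1) + byte) & 0xFFFFFFFF   # crude rolling fingerprint
            if i - start >= min_size and (fp & mask) == mask:
                chunks.append(data[start:i + 1])
                start = i + 1
        chunks.append(data[start:])
        return chunks

    def digests(chunks):
        return {hashlib.sha256(c).hexdigest() for c in chunks}

    random.seed(0)
    original = bytes(random.getrandbits(8) for _ in range(200_000))
    modified = original[:5000] + b"a few inserted bytes" + original[5000:]

    # with fixed blocks, almost nothing downstream of the insertion matches;
    # with content-defined chunks, most chunks are typically still identical
    print(len(digests(fixed_chunks(original)) & digests(fixed_chunks(modified))))
    print(len(digests(content_defined_chunks(original)) &
              digests(content_defined_chunks(modified))))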

ROI Benefits

Each organization has a capacity to generate data. The extent of savings depends upon – though it is not directly proportional to – the number of applications or end users generating data. Overall, the deduplication savings depend upon the following parameters –

  1. No. of applications or end users generating data
  2. Total data
  3. Daily change in data
  4. Type of data (emails/ documents/ media etc.)
  5. Backup policy (weekly-full – daily-incremental or daily-full)
  6. Retention period (90 days, 1 year etc.)
  7. Deduplication technology in place

The actual benefits of deduplication are realized once the same dataset is processed multiple times over a span of time for weekly/daily backups. This is especially true for variable length data segment technology which has a much better capability for dealing with arbitrary byte insertions.

Numbers
While some vendors claim bandwidth/storage saving ratios of 1:300, our customer statistics show that the results are between 1:4 and 1:50 for source-based deduplication.


OpenSocial Developers Conference in Pune – 20th Dec

What: A conference for all OpenSocial Developers
When: 20th December 9:30am to 6:30pm
Where: Tower C, Panchshil Tech Park, Yerwada, Pune – 411006
Registration and Fees: This is a free conference, but attendance is by invitation only. If you register now, you might still get an invitation.

Details:

A group of OpenSocial enthusiasts from Pune have come together to create this conference. The event aims to unite OpenSocial application developers from all over the country to share/code/have fun and maybe inspire others to take up OpenSocial development.

This event will also help towards awareness of OpenSocial and building a strong OpenSocial developers community.

Who should attend?

Anyone who has developed an application based on the OpenSocial platform, or anyone who wants to learn how to create OpenSocial applications.

What’s the menu?

There are two tracks of speakers going on in two different halls. See the detailed schedule for more information. Another section is dedicated to a codelab. A few developers will develop an application for the “India I Care” NGO. If you want to participate, just let the organizers know on the Developer Garage Mailing list. At the end there will be an “Application show case” where OpenSocial application owners can demo their applications.

Blogging and Twittering

Follow @devgarage on twitter for official Developer Garage updates. In general, people blogging or tweeting about this event are expected to use the odgpune tag, which means that searching for this term will give you everything you wanted to know about this event. (And please use that term in your own blogs or tweets.)


Techstart.in: Nurturing the passion of our engineering students

This week on PuneTech, we are going to feature a bunch of initiatives started by people who are passionate about helping students in our engineering colleges (actually anybody interested in technology, student or not, engineering or not) to be more, achieve more, learn more, all with the help of mentors from industry who would like to see all these talented students reach their true potential. Watch this space over the next few days for more such initiatives or better yet, subscribe for updates via RSS, Email, or twitter. Today, we are featuring Techstart.in. (Update: the next post in this series is about KQInfoTech’s industry-supported “free” PG Diploma in Systems Programming.)

Techstart.in is a group that aims to create special interest technology clubs for students, with each club mentored by one or more people from industry who have experience in that area, and are willing to spend time with the students to guide them. The club will have loosely structured activities, projects to complete, possibly presentations and discussions, all planned and guided by the mentors.

The club was started by Freeman Murray and has since been joined by a number of mentors – but there is no such thing as too many mentors. So, you should seriously consider signing up. The only qualifications you need are that you should be passionate about this, and you should have a little industry experience.

Freeman explains techstart.in thus:

The basic idea is to find people with practical industry experience willing to spend some time each month creating or identifying useful exercises that people interested in their field could do to develop their skills, and posting them on a blog or mailing list. Additionally, they spend time each week facilitating a discussion among the participants on a mailing list.

The intention is not to compete with existing online resources for technical training and support, but to provide some more human support and mentorship for people on the path. Mentors can and should encourage participants to engage in the existing online communities surrounding their technologies. Their guidance as to which communities to engage with, and how to engage, could still be invaluable.

In this way, over time, people can develop significant skills in fields where they don’t have formal training while they continue their studies or work full time.

We all crib about the quality of technical education, but with the Internet we have the opportunity to do something about it. We can help the eager, young, and motivated who want to get into high-tech but are overwhelmed by the amount of information available on the internet, or get blocked by elementary problems.

It shouldn’t take much time: for mentors, just a couple of hours a month to research the monthly activities and post links to learning resources participants should look into, and then a couple of hours each week responding to questions and facilitating discussion on the mailing list. For participants, activities should take 5 – 10 hours of effort each month, plus some additional time sharing with the community through the blog and the mailing lists.

If there’s a field you are passionate about and feel more people should get into, please think about setting up a small club for it on the techstart wiki. If you see a club where people are exploring a technology you’ve been curious about, by all means join the community.

The initial clubs we have are in blogging, advanced Java and open source technology. Amit is also mentoring a group to write some automatic deployment scripts in PHP.

Find out more on the wiki – http://techstart.in

How to participte – Students

To participate in one of the techstart clubs simply visit the club’s website, or join its mailing list. Make sure to introduce yourself to the community when you join, and read over any introductory material the mentor has put up on the website or in the group.

How to participate – Mentors

Start your own club or make yourself available for mentoring people on a project – simply create a mailing list for it on Google Groups or any other public mailing list site, and add a description of it and yourself to the wiki. At least once a month, post exercises to the list that participants can do to strengthen their skills, and spend some time every week monitoring the list, encouraging discussion, and helping people with problems. That’s it!

Contact Freeman (freemanATpobox.com) to get the wiki key. If you’d like to join the discussion about how to make TechStart better, please join the TechStart Google Group.

Clubs already formed

Title: Bloggers Club

Mentors: Freeman Murray, Tarun Chandel

Description:

This track is for people interested in writing on the internet. All participants will set up and customize their blog initially, and then every two weeks participants are encouraged to share their next post with the group. The group will give feedback on the writer’s style, grammar and ideas. Members are additionally encouraged to comment on each other’s blogs and do cross-linking. Occasionally, exercises relating to Google Analytics and SEO will be given to the club members. Twitter, videoblogging, Google Analytics, SEO, RSS and feed readers will also be discussed in time.

Mailing List: groups.google.com/group/bloggers-club

Title: Java Insights 101

Mentor: Parag Shah

Description:

This learning track is for developers who have completed at least one course in Core Java (or are familiar with basic principles of Java, like syntax, compiling, and running Java programs) and would like to improve their understanding of the Java language and ecosystem.

URL: adaptivelearningsolutions.blogspot.com/2008/11/javainsights-101.html

Mailing List: groups.google.com/group/adaptive-learning

Title: Open Source Technology

Mentor: Tarun Dua

Description:

This is a track for technologists who want to build upon their understanding of the free and open ecosystem provided by Open Source software and relatively open and portable datasets. You don’t dig a well every time you want to drink water, so why insist on hacking a new solution when a more efficient solution already exists as Open Source? Leverage what already exists in the ecosystem instead of re-inventing the wheel.

URL: linux-delhi.com

Mailing List: groups.google.com/group/linux-delhi-techstart

Title: Automatic Deployment Script

Mentor: Amit Singh

Description:

The aim of this project is to automate the process of deploying websites written in PHP. A very basic script exists at my blog; we will be enhancing it by adding continuous integration, database migration, etc.

URL: http://sourceforge.net/projects/adscript

Mailing List: http://groups.google.co.in/group/adscript

Mentors already signed up

Name (Affiliation) – Skills & Interests

  • Freeman Murray (upStart) – Software development training, startup culture, internet video, internet advocacy
  • Subhransu Behera (EnTrip) – Ruby on Rails, Web Application Development, Linux System Programming, Fedora Packaging
  • Parag Shah (Adaptive Software Solutions) – Software development, software development training, new media technologies
  • Tarun Dua (E2ENetworks) – “Efficient technology operations is the key to effective delivery of technology where it matters most.”
  • Amit Singh (Pune It Labs Pvt. Ltd.) – Web Application Development

PUG Community Day: Sharepoint 3.0; Windows Mobile. 13 Dec

What: Pune (Microsoft Technologies) User Group Community Day featuring presentations on SharePoint Services 3.0 and Windows Mobile Line of Business Solution Accelerator
When: Saturday 13 December, 4pm onwards
Where: SEED Infotech Ltd., Nalanda, S No – 39, Hissa No – 2/2, CTS 943, Opp Gandhi Lawns, Erandwana, Pune 411 004,
Registration and Fees: This event is free for all. No registration required.

(For a full list of tech activities in Pune this weekend, see yesterday’s post on PuneTech)

Session 1 – Windows SharePoint Services 3.0

This is the first introductory session on Windows SharePoint Services 3.0. It is a must for developers and users who are new to this technology and want to start off with Windows SharePoint development.

Built on Microsoft Windows Server 2003, Windows SharePoint Services also provides a foundation platform for building Web-based business applications that can flex and scale easily to meet the changing and growing needs of your business. Robust administrative controls for managing storage and Web infrastructure give IT departments a cost-effective way to implement and manage a high-performance collaboration environment. With a familiar, Web-based interface and close integration with everyday tools including the Microsoft Office system, Windows SharePoint Services is easy to use and can be deployed rapidly.
WSS 3.0 enables the user to create web sites with the following capabilities:

  • Collaborate
  • Manage Document and Integrity of Content
  • Content Management Capabilities
  • Integration with Office Application

Speakers:

Sudhir Kesharwani is from Accenture and has over 6 years of experience in Microsoft technologies, including 16 months of experience in SharePoint application development. He holds many certifications, including MCTS – Microsoft Office SharePoint Server 2007, MCTS – Windows SharePoint Services 3.0, MCPD – EA and MCSD.

and

Akhilesh Nirapure is from Accenture and has over 2.5 years of experience in Microsoft technologies, including 16 months of experience in SharePoint application development. He holds certifications including MCTS in Microsoft Office SharePoint Server 2007 (MCTS – MOSS) and MCPD – EA.

Session 2 – Windows Mobile – Line of Business Solution Accelerator 2008

The Windows Mobile Line of Business Solution Accelerator is a sample line of business application that showcases the latest design principles and technologies in the mobile space.

It delivers new innovations and development best practices to the Windows Mobile platform using Visual Studio 2008, the .NET Compact Framework 3.5, and SQL Server Compact 3.5, and includes a working Supply Chain application with over 5,000 lines of commented code.

In this presentation, we’ll see how we can use the provided sample code in our applications and how we can build Line of Business apps more rapidly.

Speakers:

Mayur Tendulkar is a Microsoft Windows Mobile professional who loves talking about mobile and embedded technologies. He has been a Microsoft Student Partner Lead since 2006, representing the student and professional community. He has delivered sessions at Microsoft Virtual TechDays and PUG DevCon 2008. Apart from that, he conducts regular community activities in Pune. You can read his blog at http://blog.mayurtendulkar.com or reach him at mayur.tendulkar{at}gmail.com

Supply Chain Management in Consumer Goods – An In-Depth Look

Amit Paranjape, a regular contributor and primary adviser to PuneTech, had earlier written an article giving an overview of Supply Chain Management, and companies in Pune that develop software products in this area. This article, the next in the series, goes into details of the problems that SCM software products need to tackle in a consumer goods supply chain. This is a longer-than-usual article, hence posted on a Friday so you can read it over the weekend (assuming you are not attending one of the various tech activities happening in Pune this weekend.)

Here is a story about a packet of ‘Star Glucose Biscuits’ in ‘SuperMart’ on FC Road in Pune, told from the point of view of Supply Chain Management. Buckle up your seat belts because this story has tension, drama, emotion, and suspense (will the biscuits reach the shops in time for the T20 World Cup Promotion?)

Overview

The story begins at the Star Biscuits Factory in Bangalore where flour, sugar and other raw material are converted to the finished cases of biscuits. From Bangalore, the biscuits are shipped to a regional Distribution Center on the outskirts of Pune. This center then ships the biscuits to the local depots in different parts of cities such as Mumbai, Pune and from there they ultimately end up at the neighboring retail store, such as SuperMart on FC Road, Pune. In this seemingly simple journey are hidden a host of difficult business decisions and issues that arise on a daily basis. And to complicate matters further, we will throw in a few ‘interesting’ challenges as well! Throughout this story, we will take a deeper look at how the various business processes, and software programs associated with planning this entire supply chain network work in concert to bring you the extra energy and extra confidence of Star Glucose Biscuits.

This chain, from the raw materials all the way to the finished product sitting on the retail shelves, is called the supply chain, and managing it efficiently is called supply chain management. Supply chain management is one of the most important aspects of running a manufacturing business, and doing it well has been the key to the phenomenal success of such giants as Walmart and Dell. The basic conflict that SCM is trying to tackle is this: you must have the right quantity of goods at the right place at the right time. Too few biscuits in the store on Sunday, and you lose money because you have to turn customers away. Too many biscuits in the store and you have excess inventory. This is bad in a number of ways:

  1. It eats up shelf space in the store, or storage space in your warehouse. Both of these cost money.
  2. Your money, your working capital, is tied up in excess inventory which is sitting uselessly in the warehouse.
  3. If the biscuits remain unsold, you lose a lot of money.

The same trade-off is repeated with intermediate goods at each step of the supply chain.

The Supply Chain in detail

Schematic of a supply chain. From bottom to top: multiple suppliers supply raw materials to multiple factories. Finished goods are then sent to regional distribution centers. From there it goes to smaller regional depots, and finally to individual stores.

At the Star Biscuit factory in Bangalore, they are gearing up to meet forecasted production requirements that were recently communicated by the Star Biscuits headquarters (HQ) in Mumbai. This is the ‘demand’ placed on this factory. These production requirements consist of weekly quantities spread over next 12 weeks. The factory planning manager now has to plan his factory to meet these requirements on time.

Let us see what all he needs to take into account. First, he needs to figure out the raw material requirements – wheat flour, oil, sugar, flavors, etc. as well as packaging material. Each of them has different procurement lead-times and alternative suppliers. He needs to pick the time to place orders with the right suppliers so that the material is available on time for the manufacturing process.

The manufacturing process itself consists of two primary steps – making the biscuits from the flour, and packaging the biscuits into individual boxes and cases. Typically, multiple parallel making and packing lines work together to achieve the desired output. The packing process scheduling is often complicated further by a series of different sizes and packaging configurations.

Ensuring that the right amount of material is available at the right time is called Material Requirements Planning (MRP), and in the old days, that was good enough. However, this can no longer be done purely in isolation. Even if the amounts of the different raw materials required are predicted precisely, it can be problematic if the various making and packing machines do not have the capacity to handle the load of processing the raw materials. Hence, another activity, called capacity planning, needs to be undertaken, and the capacity plan needs to be synchronized with the materials requirement plan, otherwise excess raw material inventory will result, due to sub-optimal loading of the machines and excessive early procurement of raw material. Excess inventory translates to excess working capital cash requirement, which in the current hyper-competitive world is not good! Luckily, today there are sophisticated APS (Advanced Planning & Scheduling) software tools, far superior to traditional MRP systems, that enable him to do material and capacity planning simultaneously.
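
A toy illustration of the material-requirements side of that calculation (the bill of materials, quantities, and lead times are all invented for the example): explode the weekly case requirements through the bill of materials and offset each purchase order by its procurement lead time.

    # per-case bill of materials and procurement lead times (illustrative numbers)
    BOM = {"wheat_flour_kg": 2.0, "sugar_kg": 0.5, "oil_litre": 0.3, "carton": 1}
    LEAD_TIME_WEEKS = {"wheat_flour_kg": 2, "sugar_kg": 1, "oil_litre": 1, "carton": 3}

    def material_plan(weekly_case_requirements):
        """For each raw material, how much to order and in which week to order it,
        so that it arrives just as production needs it."""
        orders = []
        for week, cases in enumerate(weekly_case_requirements):
            for material, qty_per_case in BOM.items():
                order_week = week - LEAD_TIME_WEEKS[material]  # negative means already late
                orders.append((order_week, material, cases * qty_per_case))
        return sorted(orders)

    # the 12 weekly production requirements (cases) communicated by HQ
    for order_week, material, qty in material_plan([5000, 5200, 4800, 6000] + [5000] * 8)[:6]:
        print(order_week, material, qty)

A real APS run would also net these requirements against on-hand and on-order inventory and check them against making and packing capacity, which is exactly the synchronization described above.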

Happy with the production plan for the next 12 weeks, the factory planner then makes sure that individual making and packing lines have their detailed production schedules for the next two weeks. The finished cases leave his factory on truck loads. But where do they go from here? The journey to SuperMart where our customer wants to purchase the final product is still far from over!

The next stop is a big distribution center (DC) for the western region that sits on the outskirts of Pune. This distribution center is housed in a large warehouse with multi-level stacking pallets (each pallet contains multiple cases) of multiple different products from the manufacturer. A set of conveyors and fork-lifts enable material to flow smoothly from inbound truck docks to stocking area, and from the stocking area, to the outbound truck docks. These products come not only from our Star Biscuits factory in Bangalore, but from various other Star Biscuits factories located all over India. In fact, some of these products could also be directly imported from the parent company of Star Biscuits in the UK (the ones with the bitter, dark chocolate!). This Pune distribution center stocks and stores this material in proximity to the western region – with specific emphasis on the large Greater Mumbai and Pune markets. How are all these warehousing related activities smoothly managed? The DC manager takes full advantage of a Warehouse Management System Software (WMS). The truck loading and load containerization is managed by a Transportation Management Module.

From here, outbound shipments are sent to smaller regional depots that are located in the cities, nearer to the stores. From these depots, the biscuits are finally shipped to the stores to meet the end customer demand. How is this demand calculated? Clearly, it is impossible to predict the demands coming from individual customers at the store a few weeks in advance! Hence it is necessary to ‘forecast’ the demand.

Forecasting demand and determining stock levels

Who decides how much material to stock? And how is it calculated? Clearly, as we briefly indicated earlier, keeping too much product is costly and keeping too little results in stockouts (material unavailability on the store shelf), thereby resulting in unhappy customers and lost sales. Too much product equals excess working capital (similar to the excess raw material problem) and is not good for the company’s financial performance. Too little, and we will run out if there are any major demand swings (commonly referred to as ‘demand variability’). To hold just enough stock on hand to buffer against demand variability, a ‘safety stock quantity’ is maintained in the warehouse. This quantity is computed by the central Supply Chain Management (SCM) team at HQ.

The actual computation of the safety stock for different products in the distribution center is done using a statistical computation (in some cases, a manual override is also applied over the computed value). The most common technique uses the Poisson distribution, historical data on demand and supply variability, demand and supply lead times, and desired customer service levels. Customer service levels are assigned based on an “ABC” classification of the products. ‘A’ category items are the fast movers, have a high revenue share, and are typically assigned a 99% customer service level. Roughly speaking, a 99% customer service level implies that the safety stock quantity is adequate to guard against demand variability 99 times out of 100. Proactive daily planning, which involves monitoring the stocks of all products based on actual outbound shipments, can often help react even to that ‘1 in 100’ case with rapid corrective measures.
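
A simplified, textbook version of such a computation (the numbers are invented, it assumes Python with scipy, and real systems also fold in supply variability and manual overrides): model lead-time demand as Poisson and find the smallest stock level that meets the target service level.

    from scipy.stats import poisson

    def reorder_point(mean_daily_demand, lead_time_days, service_level):
        """Smallest stock level whose probability of covering lead-time demand
        is at least the service level, with Poisson-distributed demand."""
        mu = mean_daily_demand * lead_time_days       # expected lead-time demand
        return int(poisson.ppf(service_level, mu))

    def safety_stock(mean_daily_demand, lead_time_days, service_level):
        """The buffer held over and above expected lead-time demand."""
        mu = mean_daily_demand * lead_time_days
        return reorder_point(mean_daily_demand, lead_time_days, service_level) - mu

    # an 'A' item: 40 cases/day average demand, 5-day replenishment lead time, 99% service
    print(reorder_point(40, 5, 0.99), safety_stock(40, 5, 0.99))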

The forecasting process for all the products is done at Star Biscuits Head Quarters. Let us continue with our example of ‘Star Glucose Biscuits’. The modern forecasting process is more accurately referred to as a ‘Demand Planning’ process. Statistical forecast is one input to the overall process. Statistical forecasts are derived from shipment history data and other input measures such as seasonality, competitor data, macro-economic data, etc. Various statistical algorithms are used to come up with a technique that reduces the forecasting error. Forecasting error is typically measured in ‘MAPE’ (Mean Absolute Percentage Error) or ‘MAD’ (Mean Absolute Deviation).
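
Both error measures are simple to compute; here is a minimal sketch with invented weekly numbers:

    def mape(actuals, forecasts):
        """Mean Absolute Percentage Error, in percent (undefined if an actual is 0)."""
        return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)) / len(actuals)

    def mad(actuals, forecasts):
        """Mean Absolute Deviation, in the same units as the demand itself."""
        return sum(abs(a - f) for a, f in zip(actuals, forecasts)) / len(actuals)

    weekly_actuals   = [120, 135, 150, 90]    # cases actually sold
    weekly_forecasts = [110, 140, 160, 100]
    print(mape(weekly_actuals, weekly_forecasts), mad(weekly_actuals, weekly_forecasts))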

The statistical forecast is then compared with the sales forecast and the manufacturing forecast in a consensus planning process. In many companies this is done as part of a wider 'Sales & Operations Planning' process. Often, a combination of 'top-down' and 'bottom-up' forecasting is used. Individual forecasts at the product level are aggregated up the product hierarchy into product-group forecasts; similarly, aggregate product-group forecasts are disaggregated down the same hierarchy to the individual product level. The two views are then compared and contrasted, and the expert demand planner takes the final decision. Aggregate forecasting is important, since it often reduces the forecast error. In the case of Star Glucose Biscuits, the product hierarchy would first combine all sizes (e.g. 100 gm, 200 gm), then aggregate across sugar-based biscuit types, and finally into an 'All Biscuits' group.
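A toy sketch of this bottom-up aggregation and top-down disaggregation might look like the following; the product names, volumes and proportional split rule are assumptions for illustration only.

```python
# Toy sketch: 'bottom-up' aggregation and 'top-down' disaggregation over a
# two-level hierarchy (SKU -> product group). All figures are invented.

sku_forecast = {                 # bottom-up: SKU-level statistical forecasts
    "Glucose 100 gm": 800,
    "Glucose 200 gm": 500,
}

# Bottom-up: aggregate the SKU forecasts into a product-group forecast
group_forecast_bottom_up = sum(sku_forecast.values())          # 1300

# Top-down: the planner forecasts the group directly (say 1400) and pushes it
# back down in proportion to each SKU's share of the bottom-up total
group_forecast_top_down = 1400
total = sum(sku_forecast.values())
sku_forecast_top_down = {
    sku: group_forecast_top_down * qty / total
    for sku, qty in sku_forecast.items()
}

print(group_forecast_bottom_up)   # compare the two views...
print(sku_forecast_top_down)      # ...before the planner takes the final call
```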

The end result of the demand planning process is the final consensus forecast, calculated in weekly time intervals (commonly referred to in planning terminology as 'buckets') over a time horizon of 8-12 months. These demand forecasts drive the entire Star Biscuits supply chain. Modern Demand Planning software tools simplify the overall process by providing statistical forecasting, OLAP-based aggregation/disaggregation, and interactive, collaborative web-based workflows across different sets of users.

Managing the Supply Chain

Life would be easy if all we had was one DC in Pune and one factory in Bangalore, supplying to one store. But Star Biscuits has a much more extensive network! They have multiple factories throughout India, and the same biscuits can be produced at each factory. How much of the demand should be allocated to which factory? This problem is addressed by the Supply Chain Management (SCM) team, which works in close concert with the Demand Planning team. The allocation is made based on various criteria such as shipment times, capacities, costs, etc. In making these sourcing, transportation and procurement decisions, minimizing cost and maximizing customer service are among the top business objectives.
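At its core, this allocation is an optimization problem of the kind discussed earlier in this series. Here is a deliberately tiny sketch using a linear program (via scipy's linprog): two factories, two DCs, and invented costs, capacities and demands. Real SCM tools layer many more constraints and business rules on top of this.

```python
# Toy demand-allocation problem: which factory ships how much to which DC,
# minimizing freight cost subject to factory capacity and DC demand.
# Costs, capacities and demands are invented for illustration.
from scipy.optimize import linprog

# Decision variables: cases shipped, in the order
# [Bangalore->Pune, Bangalore->Delhi, Faridabad->Pune, Faridabad->Delhi]
cost = [4, 9, 8, 3]                      # freight cost per case (illustrative)

# Factory capacity constraints (<=)
A_ub = [[1, 1, 0, 0],                    # Bangalore capacity
        [0, 0, 1, 1]]                    # Faridabad capacity
b_ub = [10000, 8000]

# DC demand constraints (==)
A_eq = [[1, 0, 1, 0],                    # Pune DC demand
        [0, 1, 0, 1]]                    # Delhi DC demand
b_eq = [7000, 6000]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 4)
print(res.x)        # optimal shipment quantities per lane
print(res.fun)      # total freight cost
```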

Now, getting back to Star Biscuits: they have over 300 products across 5 factories and 4 DCs. The DCs, in turn, receive demand from nearly 100 depots that supply thousands of stores all over India. Determining the optimal allocation of demand to factories, safety stocks at the DCs, and all the transportation requirements is beyond unaided human abilities. A decision problem of this scale and complexity needs computer help! Luckily, advanced SCM software tools can help the SCM team make these decisions fairly efficiently. Good SCM tools allow user interactivity, optimization, and support for business-specific rules and heuristics. The SCM process thus determines the 12-week demand, in weekly buckets, for our factory in Bangalore, where we started.

To summarize the overall supply chain: we saw how product demand gets forecast at HQ by the demand planning group. The SCM group then decides how to allocate and source this demand across the different factories. They also decide the ideal safety stock levels at the DCs. The WMS group ensures the efficient management of distribution center activities. The factory planning team decides the most efficient way to produce the biscuit demand allocated to their plant. And the transportation management team is tasked with shipping material across this network in the best possible way, reducing cost and cutting down delivery times.

Dealing with drastic changes

And all of this is just to do “normal” business in “normal” times.
All the processes described earlier are great if the business runs at a stable, reasonably predictable pace. Our safety stock policies guard against day-to-day variability. But what about drastic changes? Unfortunately, in the current environment, the only thing that is constant is 'change'.

Here is what happened at Star Biscuits. One day, out of the blue, the entire planning team was thrown into a mad scramble by a new request from the marketing department. In order to react to a marketing campaign launched by one of their top competitors, the marketing department had launched a new cricket promotion of its own for the next month.

Promotions are extremely important in the consumer goods industry. They entail targeted customer incentives, advertising spending and custom packaging, all in a synchronized fashion. The success of promotions can often make or break a consumer goods company's annual performance; promotion-driven sales often contribute a large double-digit percentage of total sales.

This particular cricket promotion involved a special packaging requirement, with star logos on the packet. Not only was the target customer demand upped by 50%, the offer also had a 'Buy 1 Get 1 Free' incentive, which doubles the quantity shipped per sale. As a result, the total volume to be shipped was going up to nearly 300% of normal (1.5 × 2 = 3 times the usual quantity).

The SVP in charge of Supply Chain was trying his best to get a handle on the problem, and was getting irritated by the constant pressure he was under from the SVP of Marketing and the CEO.

The demand planning team had to quickly revise its demand numbers to meet the new targets. The real trouble was brewing in the SCM team, which had to rapidly decide where to source this sudden demand spike. While cost optimization is important, meeting customer demand 'at all costs' was the key here. The Bangalore factory was already running at 90% capacity and was in no position to produce much more. Luckily for the SCM team, their SCM tool quickly ran a series of scenarios and presented possible alternatives. These scenarios looked at options such as contract packing, other factories, expedited raw material shipments, direct shipments from the factories to the stores, etc. One of the resulting scenarios seemed to fit the bill: the bulk of the extra demand would be routed to the alternative factory in Faridabad, which had some spare capacity. From there, the product would be shipped directly (where feasible) to the Mumbai and Pune depots, where a large chunk of the promotion-driven demand was expected. The rest of the country's demand would be met by the conventional route, from the Bangalore factory. The new package also created demand for new packaging material with the cricket logos; additional scenarios were generated that sourced this material from packaging suppliers in the Middle East. (Interestingly, in time-crunched promotions, packaging material often ends up being the bottleneck!)
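For a flavour of what such scenario analysis involves, here is a rough sketch that screens hypothetical sourcing scenarios for capacity feasibility and ranks the feasible ones by cost. All the scenario names and numbers are invented, and a real SCM tool would evaluate far richer cost and service trade-offs.

```python
# Rough sketch: compare hypothetical sourcing scenarios for a demand spike.
# Each scenario is checked for capacity feasibility, then the cheapest
# feasible one is picked. All data is invented for illustration.

scenarios = [
    {"name": "Bangalore only",          "capacity": 11000, "cost_per_case": 10.0},
    {"name": "Faridabad + direct ship", "capacity": 19000, "cost_per_case": 11.5},
    {"name": "Contract packer",         "capacity": 25000, "cost_per_case": 14.0},
]

promo_demand = 18000   # spiked promotion demand, in cases (illustrative)

feasible = [s for s in scenarios if s["capacity"] >= promo_demand]
best = min(feasible, key=lambda s: s["cost_per_case"])
print(best["name"])    # -> "Faridabad + direct ship"
```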

Satisfied with this approach, the SVP of Supply Chain ordered his team to come up with process improvements to prevent such scrambles in the future. Luckily, there was an easy solution: the Demand Planning software tool supported an integrated promotions planning and demand planning workflow. Such workflows take in promotions-related data (timing, costs, volumes, competitor strategies) and plan future promotions efficiently, instead of reacting to them at the last minute. In turn, effective promotion planning not only drives revenues but also further improves supply chain efficiency.

The SVP is happy, but what happened to our end customer on FC Road, Pune? Well, she walked away happy with her promotion pack of Star Glucose biscuits, completely oblivious to what had happened behind the scenes!

About the Author – Amit Paranjape

Amit Paranjape is one of the driving forces behind PuneTech. He has been in the supply chain management area for over 12 years, most of it with i2 in Dallas, USA. He has extensive leadership experience across Product Management/Marketing, Strategy, Business Development, Solutions Development, Consulting and Outsourcing. He now lives in Pune and works as an independent consultant, providing consulting and advisory services to early-stage software ventures. Amit's interests in other fields are varied and vast, including General Knowledge Trivia, Medical Sciences, History & Geo-Politics, Economics & Financial Markets, and Cricket.