Building EKA – The world’s fastest privately funded supercomputer

Eka, built by CRL, Pune, is the world's 4th fastest supercomputer, and the fastest one that didn't use government funding. This is the same supercomputer referenced in Yahoo!'s recent announcement about cloud computing research at the Hadoop Summit. This article describes some of the technical details of Eka's design and implementation. It is based on a presentation by the Eka architects, organized by CSI Pune and MCCIA Pune.

Interconnect architecture

The most important decision in building a massively parallel supercomputer is the design of how the different nodes (i.e. processors) of the system are connected together. If all nodes are connected to each other, parallel applications scale really well (linear speedup), because communication between nodes is direct and has no bottlenecks. But unfortunately, building larger and larger such systems (i.e. ones with more and more nodes) becomes increasingly difficult and expensive because the complexity of the interconnect increases as n². To avoid this, supercomputers have typically used sparse interconnect topologies like the star, ring, torus (e.g. IBM's Blue Gene/L), or hypercube (Cray). These are more scalable as far as building the interconnect for really large numbers of nodes is concerned. However, the downside is that nodes are not directly connected to each other and messages have to go through multiple hops before reaching the destination. Here, unless the applications are designed very carefully to reduce message exchanges between different nodes (especially those that are not directly connected to each other), the interconnect becomes a bottleneck for application scaling.
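As a rough illustration of why this matters (a generic sketch, not Eka's actual topology), here is how the wiring cost and the worst-case hop count compare for a fully connected network versus a simple 2D torus:

    # Generic illustration of the interconnect trade-off, not Eka's topology:
    # a full mesh needs O(n^2) links but every message is one hop away,
    # while a 2D torus needs only O(n) links but pays for it in extra hops.
    import math

    def full_mesh(n):
        links = n * (n - 1) // 2   # every node wired directly to every other node
        max_hops = 1               # all communication is direct
        return links, max_hops

    def torus_2d(n):
        side = int(math.sqrt(n))   # assume n is a perfect square for simplicity
        links = 2 * side * side    # 4 links per node, each link shared by 2 nodes
        max_hops = 2 * (side // 2) # worst case: half-way around each of the 2 rings
        return links, max_hops

    for n in (64, 1024, 4096):
        print(n, "full mesh:", full_mesh(n), "2D torus:", torus_2d(n))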

In contrast to those systems, Eka uses an interconnect designed using concepts from projective geometry. The details of the interconnect are beyond the scope of this article. (Translation: I did not understand the really complex mathematics that goes on in those papers. Suffice it to say that before they are done, fairly obscure branches of mathematics get involved. However, one of these days, I am hoping to write a fun little article on how a cute little mathematical concept called Perfect Difference Sets (first described in 1938) plays an important role in designing supercomputer interconnects over 50 years later. Motivated readers are encouraged to try and see the connection.)

To simplify – Eka uses an interconnect based on projective geometry concepts. This interconnect gives linear speedup for applications, but the complexity of building the interconnect increases only near-linearly with the number of nodes.

The upshot of this is that to achieve a given application speed (i.e. number of teraflops), Eka ends up using fewer nodes than its peers. This means that it costs less and uses less power, both of which are major problems that need to be tackled in designing a supercomputer.

Handling Failures

A computer that includes thousands of processors, thousands of disks, and thousands of network elements soon finds itself on the wrong side of the laws of probability as far as failures are concerned. If one component of a system has an MTBF (mean time between failures) of 10,000 hours, and the system has 3,000 components, then you can start expecting something to fail roughly once every 3 to 4 hours (10,000 / 3,000 ≈ 3.3 hours, assuming failures are independent – not exactly accurate, but ease of understanding trumps accuracy in most cases).
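For the curious, the back-of-the-envelope arithmetic looks like this (assuming independent component failures, so that failure rates simply add up):

    # Back-of-the-envelope failure arithmetic, assuming independent failures:
    # failure rates of the components add up, so the system-level MTBF is
    # roughly the component MTBF divided by the number of components.
    component_mtbf_hours = 10_000
    num_components = 3_000

    system_mtbf_hours = component_mtbf_hours / num_components
    print(f"expect a failure roughly every {system_mtbf_hours:.1f} hours")  # ~3.3 hours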

If an application is running on 500 nodes, and has been running for the last 20 hours, and one of the nodes fails, the entire application has to be restarted from scratch. And this happens often, especially before an important deadline.

A simple solution is to save the state of the entire application every 15 minutes. This is called checkpointing. When there is a failure, the system is restarted from the last checkpoint and hence ends up losing only 15 minutes of work. While this works well enough, it can get prohibitively expensive. If you spend 5 minutes out of every 15 minutes in checkpointing your application, then you’ve effectively reduced the capacity of your supercomputer by 33%. (Another way of saying the same thing is that you’ve increased your budget by 50%.)
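A quick sketch of that overhead arithmetic (using the illustrative 5-minutes-out-of-every-15 figure from above, not a measured Eka number):

    # Checkpointing overhead arithmetic for the example above.
    checkpoint_minutes = 5    # time spent writing each checkpoint
    interval_minutes = 15     # wall-clock length of each checkpoint cycle

    useful_fraction = (interval_minutes - checkpoint_minutes) / interval_minutes
    capacity_lost = 1 - useful_fraction     # 5/15, i.e. a third of the machine
    extra_budget = 1 / useful_fraction - 1  # to win the lost capacity back, buy 50% more

    print(f"capacity lost: {capacity_lost:.0%}, extra budget needed: {extra_budget:.0%}")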

The projective geometry architecture also allows the compute nodes to be partitioned in such a way that checkpointing and state saving can be done for only a subset of the nodes involved. The whole system need not be reset in case of a failure – only the affected subset. In fact, with the projective geometry architecture, this can be done in a provably optimal manner. The result is improved efficiency: checkpoints are much cheaper and faster, and hence can be taken more frequently. This means that the system can handle failures much better.

Again, I don’t understand the details of how projective geometry helps in this – if someone can explain that easily in a paragraph or two, please drop me a note.

The infrastructure

The actual supercomputer was built in just 6 weeks. However, other aspects took much longer. It took a year of convincing to get the project funded, and another year to construct the physical building and the rest of the infrastructure. Eka uses:

  • 2.5 MW of electricity
  • 400 tons of cooling capacity
  • 10 km of electrical cabling
  • 10 km of Ethernet cabling
  • 15 km of InfiniBand cabling

The computing infrastructure itself consists of:

  • 1,800 blades, 4 cores each, at 3 GHz per core
  • HP SFS clusters
  • 28 TB of memory
  • 80 TB of storage on simple SATA disks, with 5.2 Gbps throughput
  • Lustre distributed file system
  • 20 Gbps InfiniBand DDR. Eka was on the cutting edge of InfiniBand technology. They sourced their InfiniBand hardware from an Israeli company and were amongst the first users of its releases – including beta, and even alpha, quality stuff.
  • Multiple Gigabit Ethernet networks
  • Linux is the underlying OS. Any Linux will work – Red Hat, SUSE, your favorite distribution.

It's the software, stupid!

One of the principles of the Eka project is to be a one-stop shop for tackling problems that require huge amounts of computational power. Their tagline for the project has been "from atoms to applications": they want to ensure that the project takes care of everything for their target users, from the hardware all the way up to the application. This meant that they had to work on:

  • High-speed, low-latency interconnect research
  • System architecture research
  • System software research – compilers etc.
  • Mathematical library development
  • Large-scale scientific problem solving
  • Application porting, optimization, and development

Each of the bullet items above is a non-trivial bit of work. Take, for example, "Mathematical library development." Since they came up with a novel architecture for Eka's interconnect, all parallel algorithms that run on Eka also have to be adapted to work well with that architecture. To get the maximum performance out of a supercomputer, you have to rewrite your algorithms to take advantage of the strengths of the interconnect design while avoiding its weaknesses. Requiring users to understand and code for such things has always been the bane of supercomputing research. Instead, the Eka team provides mathematical libraries of the important functions needed by applications, tailored specifically to the Eka architecture. This means that people who have existing applications can run them on Eka without major modifications.
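To illustrate the general idea (this is a generic sketch, not Eka's actual libraries or API), the application below expresses its math through a standard interface – here NumPy, which dispatches to whatever tuned BLAS is installed – so the architecture-specific optimization can live inside the library rather than in the application code:

    # Generic sketch of the "tuned library" idea, not Eka's actual API.
    # The application only calls a standard interface; the library underneath
    # (here, whatever BLAS backs NumPy) carries the architecture-specific tuning.
    import numpy as np

    def simulate_step(state, coupling):
        # The application expresses its math and never mentions the
        # interconnect topology or how the work is laid out on nodes.
        return coupling @ state

    state = np.random.rand(1024)
    coupling = np.random.rand(1024, 1024)
    state = simulate_step(state, coupling)
    print(state.shape)  # (1024,)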

Applications

Of the top 10 supercomputers in the world, Eka is the only system that was fully privately funded. All other systems used government money, so all of them are for captive use. This means that Eka is the only system in the top 10 that is available for commercial use without strings attached.

There are various traditional applications of HPC (high-performance computing), which is what Eka is mainly targeted at:

  • Aerodynamics (aircraft design) and crash testing (automobile design)
  • Biology – drug design, genomics
  • Environment – global climate, ground water
  • Applied physics – radiation transport, supernovae, simulating exploding galaxies
  • Lasers and energy – combustion, ICF (inertial confinement fusion)
  • Neurobiology – simulating the brain

But as businesses go global and start dealing with huge quantities of data, it is believed that Eka-like capabilities will soon be needed to tackle these business needs:

  • Integrated worldwide supply chain management
  • Large scale data mining – business intelligence
  • Various recognition problems – speech recognition, machine vision
  • Video surveillance, e-mail scanning
  • Digital media creation – rendering, cartoons, animation

But that is not the only output the Tatas expect from their investment (of $30 million). They are also hoping to tap the expertise gained during this process for consulting and services:

  • Consultancy: Need & Gap Analysis and Proposal Creation
  • Technology: Architecture, design, and planning of high-performance systems
  • Execution: Implementation, testing, and commissioning of high-performance systems
  • Post sales: HPC managed services, Operations Optimization, Migration Services
  • Storage: Large scale data management (including archives, backups and tapes), Security and Business Continuity
  • Visualization: Scalable visualization of large amounts of data

and more…

This article is based on a presentation given by Dr. Sunil Sherlekar, Dr. Rajendra Lagu, and N. Seetha Rama Krishna, of CRL, Pune, who built Eka. For details of their background, see here. However, note that I’ve filled in gaps in my notes with my own conjectures, so errors in the article, if any, should be attributed to me.

36 thoughts on “Building EKA – The world’s fastest privately funded supercomputer”

  1. @potax, In general, the article does not ascribe any of the ideas to the corresponding inventors (and neither did the actual presentation). At the end of the article I’ve just mentioned the names of the people who did the actual presentation.

    If we start naming people whose ideas are involved, I would guess that it would be quite a long list – and in any case, I wouldn’t be the right person to be making that list.

    See related article

  2. @Tej, it is true that the Tatas have been quite visionary. In general, they have a track record of doing that. They have been pioneers in a lot of the major industries of India: Steel, airlines, automotive, software.
    They have also pioneered higher education: IISc, TIFR, TISS. And they are also pioneering corporate social responsibility (CSR): 67% of the Tata group of companies is owned by social trusts, which means that 67% of their profits go to social causes without any extra effort. (All this info is from the same presentation.)

  3. @Orochi, yeah your questions would probably trigger it to take over their nuclear arsenal and turn your town into glass.

  4. Fresh air – that's what I feel while reading your post. Indian mathematicians are known to be among the best. I consider Eka as, in the end, "just" an application of geometric understanding – though it involves far more skill than the word "just" suggests. Next, from Europe, and especially from France, we know that the next "Indian threat" to come is India's serious ability to take over aircraft R&D. Hence, once again, is it a little "perverse" to remind them that emerging powers will get the right tools to achieve that on their own?
    From my standpoint as a Madagascan, this news means that soon our universities will have those tools too, for less than half the price of what you call the top-ten fastest computers.

  5. Mr Orchid,
    Your tech calls are not at all "TECH", because you guys ask customer care questions like these:

    1) How do I print my voicemail !!

    2) I don’t need any of that SQL stuff — I just want a database!

    3) Hi, I’m supposed to pack [zip] my database and send it to you. What should I pack it in

    4) Customer: What is Microsoft Word?
    Tech Support: A program that lets you type up documents.
    Customer: Hey! Don’t give me any of you computer jargon crap, ok? I’m not a computer programmer!

    5) I want a system that I can afford, but not one that will go obsolete in six or seven years

    6) Wait, that password looks really gray. I’m going to type it in again.

    7) Where is the lower case?

  6. @chinaman

    Nod of course we all must wait for China! Hail China! The Center of the Planet! The most powerful nation!

    NOW FREE TIBET FIRST MF’ers

  7. My guess about the "Perfect Difference Set" (PDS henceforth) is that it could be used to reduce the number of interconnections between nodes. Reading the article on PDS on MathWorld didn't strike much of a chord, but looking at the "Perfect Ruler" article did 🙂. So the set {1, 2, 5, 7} can be used to "get" to any number between 1 and 6 (5-1=4, 7-1=6, 5-2=3, etc.). So we can get to 6 numbers (or points) from 4 points. Mapping the same to a node network, rather than connecting each node to every other node, we can arrange the nodes using a PDS representation and interconnect each node with only a few nodes. The rest can be reached via minimal hops. For example, if Node "5" wants to reach Node "3", it could do so via Node "2", which is connected to Node "3". Am looking forward to reading your article which will comprehensively throw light on this.

    I had a query related to Eka and the announcement that it would be running Hadoop. My understanding of Hadoop, having worked on it for over 1.5 years, is as follows:
    "Hadoop allows parallelization of jobs as defined by the programmer. The jobs get distributed to multiple nodes where they are executed; the parallelization here is hence "virtual". The processors of the system are not physically connected to each other."
    How does Eka change or affect Hadoop? Hadoop is doing things at the software layer which Eka achieves at the system level. Am I right about this? I stay in Pune, but did not attend the seminar (shame on me :-(). Could you throw some light on this or point me to someone who can answer these queries?

    Regards,
    Abhay
    P.S: Hoping to catch you at the Pune OpenCoffee Meet on 4th April.

  8. I've not yet read the details of the network topology, but nothing sounds astonishingly new to me in the other key points. Using MPI and BLAS has solved portability and performance issues when migrating to a new architecture for years now, while the commodity parts used are pretty similar to a Dell or HP cluster (you can find many of them in the Top500). Blades + InfiniBand is a pretty usual setup.

    Oh, and don't think you'll get one of the top 10 computers for cheap. If computing power gets cheaper, the DOE and DOD will just buy larger machines and reach a level of performance you'll never be able to match anyway :] Don't forget that the laptop I'm writing this message on would have appeared at #500 in the Top500 of 2000, and is, in that respect, supercomputing made cheap.

  9. The interconnect paper doesn’t involve any projective geometry that I can see, only graph theory. Am I missing something?

  10. @Mukul, Thanks. I am hoping to do a similar write up about pubmatic one of these days.

    @Abhay, I think you got it mostly right with the perfect ruler stuff. The only difference is that a perfect ruler is constrained to deal with positive numbers, and hence its efficiency is halved. Doing nodes and hops is more akin to the perfect differences. Which means that with your set {1, 2, 5, 7} you can actually reach 13 nodes (not 6). Think of the nodes as arranged in a circle, so 5 - 7 gives you -2, which wraps around to 13 - 2 = 11. I found this patent which appears to be relevant, but it has too much legal mumbo-jumbo to be easily readable. One of these days, I want to understand this stuff and write a more useful article.
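    For anyone who wants to check this, here's a quick Python sketch (my own illustration, not anything from the Eka talk) verifying that {1, 2, 5, 7} is a perfect difference set mod 13 – every nonzero "distance" around the 13-node circle is covered by exactly one pair:

        # Verify that {1, 2, 5, 7} is a perfect difference set modulo 13:
        # every nonzero residue mod 13 shows up exactly once as a difference a - b.
        from collections import Counter

        pds, modulus = [1, 2, 5, 7], 13
        diffs = Counter((a - b) % modulus for a in pds for b in pds if a != b)

        assert set(diffs) == set(range(1, modulus))         # all 12 distances are reachable
        assert all(count == 1 for count in diffs.values())  # each by exactly one pair
        print(sorted(diffs))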

    And, yes, see you at OpenCoffee Club.

    @Aurelian, I apologize for not being able to bring out the details of the topology better – it was not covered in the talk. I've asked Dr. Sherlekar to take a look at this page and answer some of the comments. Let's hope he does that. (But the weekend has started here, so I don't know when (if?) he'll get a chance to do that.)

    @TFox, If you look at the Wikipedia articles on finite geometry and projective planes, you'll notice that it all starts looking very similar to the hypergraph theory involved in the bus interconnect papers.

  11. Great job. Both on the article and on the computer.

    To understand checkpointing:
    Imagine changing tires on a car. or:

    1. Load problem into computer.
    2. Check the tires.
    3. While running, use the same load structure, to ‘checkpoint’ the tires every mile. ( i.e. a full check point every 4 miles ).
    4. if a tire fails, reload problem, from last complete checkpoint.

    IBM's Project Stretch would checkpoint every 15 minutes and would log the results of every instruction. It was possible to step right to the single instruction that failed.

  12. This is inspiring! I wish I could be the first man in Malaysia to build a supercomputer in the Southeast Asia region. Pray for me, all!

    Maybe in the next 10 years I'll fulfill my dream. God, gimme strength!

  13. Navin,

    Kudos for PuneTech!

    I enjoyed reading this post, well written.

    Got to know about EKA last year, just before the SC07 Reno conference where it was announced. It's a laudable achievement. The Tatas are visionary, no doubt. What I am intrigued about is how this will get monetized from a business perspective. Does anyone remember Param? It was probably tied up in red tape, and EKA is not under government control, so that is certainly different.

    Tata is on a roll – right from EKA to the Nano to Jaguar. I am waiting for some big announcements in the computing and software space. At this moment there are hardly any 'innovative' software products which are 'created and delivered' right out of India – there are so many 'me-toos'. There once was Tally, which captured the Indian market – I am yet to learn about other such software for India and for the world. Other than that, there is the huge services industry, with China at its heels.

    On a lighter note, after reading about the electricity consumed by EKA, and about the impending power cuts in Pune starting this Saturday, I wish we had more people using Green IT datacentre optimization products such as the ones created by Evergrid 🙂

  14. @Shaloo, thanks.

    I agree that the monetization story of Eka will be very interesting to watch. I'm guessing that they are making some money off Yahoo! in the cloud computing research initiative. Also, there are some hints in the Applications section above to the directions this could go in. However, I think the bottom line really is that even they don't know how this will pan out and whether they will recover their money. Which is why I think the investment is visionary.

    While it is true that there aren’t lots of products out of India that have captured the imagination of the world, I think there are lots that are just bubbling under the popular consciousness and just need some more visibility. To take an example off the top of my head – the Zoho suite continues to garner accolades. In addition, the startup culture is really beginning to take off in India, and I think it will result in products sometime soon.

    Finally, the point about electricity was brought up by one of the audience members after the presentation, and Dr. Lagu pretty much waffled on the answer. ("We might build a new facility near a dam.") And Eka is pretty low on the Green500 list of supercomputers.

    Evergrid’s work in this area is certainly very interesting and timely. I loved Srinidhi’s talk on this topic in Pune last year. It would be great to have an article about Evergrid on Punetech (either the green stuff, or the load balancing work that you are doing in Pune). That is somewhere on my to-do list, but would get done much faster if you volunteer to write it 🙂

  15. @CHINAMAN I AM ALSO WAITING FOR A CHINESE CHEAP SUPERCOMPUTER PROBABLY MADE OF CHEAP PLASTICS AND CRAP CONTAINING LEAD AND ARSENIC AND TONS OF POLLUTING GASES IN IT LOL 🙂
