Eka, built by CRL, Pune is the world’s 4th fastest supercomputer, and the fastest one that didn’t use government funding. This is the same supercomputer referenced in Yahoo!’s recent announcement about cloud computing research at the Hadoop Summit. This article describes some of the technical details of Eka’s design and implementation. It is based on a presentation by the Eka architects conducted by CSI Pune and MCCIA Pune.
The most important decision in building a massively parallel supercomputer is the design of how the different nodes (i.e. processors) of the system are connected together. If all nodes are connected to each other, parallel applications scale really well (linear speedup), because communication between nodes is direct and has no bottlenecks. But unfortunately, building larger and larger such systems (i.e. ones with more and more nodes) becomes increasingly difficult and expensive because the complexity of the interconnect increases as n2. To avoid this, supercomputers have typically used sparse interconnect topologies like Star, Ring, Torus (e.g. IBM’s Blue Gene/L), or hypercube (Cray). These are more scalable as far as building the interconnect for really large numbers of nodes is concerned. However, the downside is that nodes are not directly connected to each other and messages have to go through multiple hops before reaching the destination. Here, unless the applications are designed very carefully to reduce message exchanges between different nodes (especially those that are not directly connected to each other), the interconnect becomes a bottleneck for application scaling.
In contrast to those systems, Eka uses an interconnect designed using concepts from projective geometry. The details of the interconnect are beyond the scope of this article. (Translation: I did not understand the really complex mathematics that goes on in those papers. Suffice it to say that before they are done, fairly obscure branches of mathematics get involved. However, one of these days, I am hoping to write a fun little article on how a cute little mathematical concept called Perfect Difference Sets (first described in 1938) plays an important role in designing supercomputer interconnects over 50 years later. Motivated readers are encouraged to try and see the connection.)
To simplify – Eka uses an interconnect based on Projective Geometry concepts. This interconnect gives linear speedup for applications but the complexity of building the interconnect increases only near-linearly.
The upshot of this is that to achieve a given application speed (i.e. number of Teraflops), Eka ends up using fewer nodes than its compatriots. This means it that it costs less and uses less power, both of which are major problems that need to be tackled in designing a supercomputer.
A computer that includes 1000s of processors, 1000s of disks, and 1000s of network elements soon finds itself on the wrong side of the law of probability as far as failures are concerned. If one component of a system has a MTBF (mean time between failures) of 10000 hours, and the system has 3000 components, then you can start expecting things to fail once every 10 hours. (I know that the math in that sentence is probably not accurate, but ease of understanding trumps accuracy in most cases.)
If an application is running on 500 nodes, and has been running for the last 20 hours, and one of the nodes fails, the entire application has to be restarted from scratch. And this happens often, especially before an important deadline.
A simple solution is to save the state of the entire application every 15 minutes. This is called checkpointing. When there is a failure, the system is restarted from the last checkpoint and hence ends up losing only 15 minutes of work. While this works well enough, it can get prohibitively expensive. If you spend 5 minutes out of every 15 minutes in checkpointing your application, then you’ve effectively reduced the capacity of your supercomputer by 33%. (Another way of saying the same thing is that you’ve increased your budget by 50%.)
The projective geometry architecture also allows for a way to partition the compute nodes in such a way that checkpointing and status saving can be done only for a subset of the nodes involved. The whole system need not be reset in case of a failure – only the related subset. In fact, with the projective geometry architecture, this can be done in a provably optimal manner. This results in improved efficiency. Checkpoints are much cheaper/faster, and hence can be taken more frequently. This means that the system can handle failures much better.
Again, I don’t understand the details of how projective geometry helps in this – if someone can explain that easily in a paragraph or two, please drop me a note.
The actual supercomputer was built in just 6 weeks. However, other aspects took much longer. It took an year of convincing to get the project funded. And another year to build the physical building and the rest of the infrastructure. Eka uses
- 2.5MW of electricity
- 400ton cooling capacity
- 10km of electrical cabling
- 10km of ethernet cabling
- 15km of infiniband cabling
The computing infrastructure itself consists of:
- 1800 blades, 4 cores each. 3Ghz for each core.
- HP SFS clusters
- 28TB memory
- 80TB storage. Simple SATA disks. 5.2Gbps throughput.
- Lustre distributed file-system
- 20Gbps infiniband DDR. Eka was on the cutting edge of Infiniband technology. They sourced their infiniband hardware from an Israeli company and were amongst the first users of their releases – including beta, and even alpha quality stuff.
- Multiple Gigabit ethernets
- Linux is the underlying OS. Any Linux will work – RedHat, SuSe, your favorite distribution.
Its the software, stupid!
One of the principles of the Eka project is to be the one-stop shop for tackling problems that require huge amounts of computational powers. Their tagline for this project has been: from atoms to applications. They want to ensure that the project takes care of everything for their target users, from the hardware all the way up to the application. This meant that they had to work on:
- High speed low latency interconnect research
- System architecture research
- System software research – compilers etc.
- Mathematical library development
- Large scientific problem solving.
- Application porting, optimization and development.
Each of the bullet items above is a non-trivial bit of work. Take for example “Mathematical library development.” Since they came up with a novel architecture for the interconnect for Eka, all parallel algorithms that run on Eka also have to be adapted to work well with the architecture. To get the maximum performance out of your supercomputer, you have to rewrite all your algorithms to take advantages of the strengths of your interconnect design while avoiding the weaknesses. Requiring users to understand and code for such things has always been the bane of supercomputing research. Instead, the Eka team has gone about providing mathematical libraries of the important functions that are needed by applications specifically tailored to the Eka architecture. This means that people who have existing applications can run them on Eka without major modifications.
Of the top 10 supercomputers in the world, Eka is the only system that was fully privately funded. All other systems used government money, so all of them are for captive use. This means that Eka is the only system in the top 10 that is available for commercial use without strings attached.
There are various traditional applications of HPC (high-performance computing) (which is what Eka is mainly targeted towards):
- Aerodynamics (Aircraft design). Crash testing (Automobile design)
- Biology – drug design, genomics
- Environment – global climate, ground water
- Applied physics – radiation transport, supernovae, simulate exploding galaxies.
- Lasers and Energy – combustion, ICF
- Neurobiology – simulating the brain
But as businesses go global and start dealing with huge quantities of data, it is believed that Eka-like capabilities will soon be needed to tackle these business needs:
- Integrated worldwide supply chain management
- Large scale data mining – business intelligence
- Various recognition problems – speech recognition, machine vision
- Video surveillance, e-mail scanning
- Digital media creation – rendering; cartoons, animation
But that is not the only output the Tatas expect from their investment (of $30 million). They are also hoping to tap the expertise gained during this process for consulting and services:
- Consultancy: Need & Gap Analysis and Proposal Creation
- Technology: Architecture & Design & Planning of high performance systems
- Execution: Implement, Test and Commissioning of high performance system
- Post sales: HPC managed services, Operations Optimization, Migration Services
- Storage: Large scale data management (including archives, backups and tapes), Security and Business Continuity
- Visualization: Scalable visualization of large amounts of data
This article is based on a presentation given by Dr. Sunil Sherlekar, Dr. Rajendra Lagu, and N. Seetha Rama Krishna, of CRL, Pune, who built Eka. For details of their background, see here. However, note that I’ve filled in gaps in my notes with my own conjectures, so errors in the article, if any, should be attributed to me.