TechWeekend LiveBlog: NoSQL + Database in the Cloud #tw6
This is a quick-and-dirty live-blog of TechWeekend 6 on NoSQL and Databases in the Cloud.
First, I (Navin Kabra) gave an overview of NoSQL systems. Since I was talking, I wasn’t able to live-blog it.
When not to use NoSQL
Next, Dhananjay Nene talked about when not to use NoSQL. Main points:
- People know SQL. They can leverage it much faster than if they were to use one of the non-standard interfaces of these new-fangled systems.
- When reporting is very important, having SQL is much better. Reporting systems support SQL. Re-doing that with NoSQL will be more difficult.
- Consistency and transactions are often important. Going to NoSQL usually involves giving them up. Unless you are really, really sure you don’t need them, this issue might come back to bite you.
- If you’re considering using NoSQL, you’d better know what the CAP theorem is and really understand what the C, A, and P in it mean (consistency, availability, and partition tolerance). Don’t even consider NoSQL until you’re very well versed in these concepts.
- An RDBMS can really scale quite a lot – especially if you optimize it well. So 90% of the time, it is very likely that the RDBMS is good enough for your situation and you don’t need NoSQL. Don’t go for NoSQL unless you are really sure that your RDBMS won’t scale.
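To make the reporting point concrete, here is a minimal sketch (not from the talk) that uses Python's built-in sqlite3 module as a stand-in for any RDBMS. The `orders` table, its columns, and both function names are made up for illustration. The SQL report is a single declarative query, while the same report over schemaless records has to be coded by hand.

```python
import sqlite3

# Hypothetical sample data: (customer, amount) pairs.
ORDERS = [("alice", 10.0), ("bob", 5.0), ("alice", 7.5), ("bob", 2.5)]

# In-memory SQLite database, used here only as a stand-in for an RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)", ORDERS)


def report_with_sql(connection):
    """Total spend per customer, expressed as one SQL query."""
    cursor = connection.execute(
        "SELECT customer, SUM(amount) FROM orders "
        "GROUP BY customer ORDER BY customer"
    )
    return cursor.fetchall()


def report_by_hand(records):
    """The same report written as application code over plain records."""
    totals = {}
    for record in records:
        name = record["customer"]
        totals[name] = totals.get(name, 0.0) + record["amount"]
    return sorted(totals.items())


records = [{"customer": c, "amount": a} for c, a in ORDERS]
print(report_with_sql(conn))
print(report_by_hand(records))
```

Both functions produce the same totals. The difference is that existing reporting tools can generate and run the SQL version for you, whereas the hand-written version has to be maintained for every new report.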
MongoDB the Infinitely Scalable
Next up is BG, talking about MongoDB, the Infinitely Scalable. They are using MongoDB in production for http://paisa.com (Infinitely Beta). The main points he made:
- Based on the idea that JSON is a well understood format for data, and it is possible to build a database based on JSON as the primary data structuring format.
- The data is stored on disk using BSON, a binary format for storing JSON
- MongoDB does not really allow joins; but with proper structuring of your data, you will not need joins
- You can do very rich querying, deeply nested, in MongoDB
- MongoDB has native support for ‘sharding’ (i.e. breaking up your data into chunks to be spread across multiple servers). This is really difficult to do.
- MongoDB is screaming fast.
- It is free and open source, but it is also backed by a commercial company, so you can get paid support if you want. There are hosting solutions (including free plans) where you can host your MongoDB instances (e.g. http://mongohq.com)
- You store “documents” in MongoDB. Since you can’t really do joins, the solution is to de-normalize your data. Everything you need should be in the one document, so you don’t need joins to fetch related data. e.g. if you were storing a blog post in MongoDB, you’ll store the post, all its meta-data, and all the comments in a single document.
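As a rough illustration of the de-normalized blog post described above, here is a sketch using a plain Python dictionary, which is the kind of JSON-like structure a document database stores. The field names (`title`, `tags`, `comments`, and so on) and the helper functions are assumptions made up for this example, not a schema from the talk, and no MongoDB driver is involved.

```python
import json

# One self-contained "document": the post, its metadata, and all its
# comments live together, so no join is needed to render the page.
post = {
    "title": "TechWeekend 6",
    "author": "navin",
    "tags": ["nosql", "mongodb"],
    "body": "Notes from the event...",
    "comments": [
        {"author": "asha", "text": "Great summary"},
        {"author": "ravi", "text": "What about joins?"},
    ],
}


def add_comment(document, author, text):
    """Append a comment inside the post document itself."""
    document["comments"].append({"author": author, "text": text})


def comments_by(document, author):
    """Return the comment texts by one author; a nested lookup, no join."""
    return [c["text"] for c in document["comments"] if c["author"] == author]


add_comment(post, "asha", "Thanks for the answer")
print(comments_by(post, "asha"))

# The whole document serializes to JSON in one piece.
print(json.dumps(post, indent=2))
```

The trade-off of this layout is that data shared across documents (for example, an author's display name) is duplicated and has to be updated in every document that embeds it.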
MongoDB Use Cases:
- Great for “web stuff”
- High Speed Logging (because MongoDB has extremely fast writes)
- Data Warehousing – great because of the schema flexibility
- Storing binary files via GridFS – which are queryable!
MongoDB is used in production by these popular services:
- The http://bit.ly URL-shortening service
FourSquare recently had a major unplanned downtime – because they did not understand how to really use MongoDB. That underscores the importance of understanding the guarantees given by your NoSQL system – otherwise you could run into major problems including downtime, or even data loss. See this blog post for more on the FourSquare outage.
Some stats about use of MongoDB at paisa.com. 54 million documents. 80GB of data. 6GB of indexes. All of this on 2 nodes (master-slave setup).
Gautam Rege now talking about his experiences with Redis. Main points made:
- Redis is a key-value database with an attitude. Nothing more.
- Important feature: in (key, value), the value can be a list, hash, set.
- 1 million key lookups in 40ms. Because it keeps data in memory.
- Persistence is lazy – save to disk every x seconds. So you can lose data in case of a crash. So you need to be sure that your app can handle this.
- Redis is a “main memory database” (which can handle virtual memory – so your database does not really have to fit in memory)
- All get and set operations on Redis are atomic. A lot of concurrency problems and race conditions disappear because of atomicity.
- Sets in Redis allow union, intersection, difference. Accessed like a hash.
- Sorted sets combine hashes and arrays. Can look up by key, but can also scan sequentially.
- Redis allows real-time publish-subscribe.
- Redis is simple. Redis is for specific small applications. Not intended for being the general purpose database for your app. Use where it makes sense. For example:
- Lots of small updates
- Vote up, vote down
- Tagging. Implementing a tagging solution is a pain – becomes easy with Redis
- Cross-referencing small data
- Don’t use Redis for ORM (object-relational mapping)
- Don’t use Redis if memory is limited
- Sites like digg use Redis for tagging
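To show why set operations make tagging easy, here is a small in-memory sketch. It is not a Redis client: `TagIndex` is a hypothetical stand-in class whose methods are named after the Redis set commands SADD, SINTER, and SUNION and mimic their semantics using Python sets. The `tag:<name>` key layout and the `tag_item` helper are also assumptions for illustration.

```python
from collections import defaultdict


class TagIndex:
    """In-memory stand-in for a key -> set store (not a real Redis client)."""

    def __init__(self):
        self._sets = defaultdict(set)

    def sadd(self, key, *members):
        """Add members to the set at key; return how many were new."""
        before = len(self._sets[key])
        self._sets[key].update(members)
        return len(self._sets[key]) - before

    def sinter(self, *keys):
        """Members present in every one of the given sets."""
        sets = [self._sets.get(k, set()) for k in keys]
        return set.intersection(*sets) if sets else set()

    def sunion(self, *keys):
        """Members present in at least one of the given sets."""
        return set().union(*(self._sets.get(k, set()) for k in keys))


def tag_item(index, item, *tags):
    """Record an item under each tag: one set per tag, keyed 'tag:<name>'."""
    for name in tags:
        index.sadd("tag:" + name, item)


index = TagIndex()
tag_item(index, "post:1", "nosql", "redis")
tag_item(index, "post:2", "nosql", "mongodb")

# Items carrying both tags, and items carrying either tag.
print(index.sinter("tag:nosql", "tag:redis"))
print(index.sunion("tag:redis", "tag:mongodb"))
```

With one set per tag, "everything tagged X and Y" becomes a single intersection instead of a multi-way join over a tagging table, which is the convenience the talk was pointing at.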
Saranya Sriram talking about SQL Azure and data in the cloud. SQL Azure is pretty much SQL Server in the cloud, retrofitted for the cloud:
- Exposes a RESTful interface
- Has language bindings for python, rails, java, etc.
- Gives full SQL / Relational database in the cloud
- The standard tools used to access SQL Server locally can also be used to access SQL Azure in the cloud
- For Azure you get a cloud simulation on your local machine to develop and test your application. For SQL Azure, you simply test with your local SQL Server edition. If you don’t have a SQL Server license, you can download SQL Server Express, which is free.
- You can develop applications in Microsoft Visual Studio. You can incorporate PHP also in this.
- You can also use Eclipse for developing applications.
- SQL Azure has a maximum size limit of 50GB. (Started with 1 GB last year)
- There is no free plan for Azure. You have to pay. “Enthusiasts” can use it free for 180 days. If you sign up for the Bizspark program (for small startups, for the first 3 years) it is free. Similarly students can use it for free by signing up for the DreamSpark program. (Actually, the Bizspark and DreamSpark programs give you free access to lots of Microsoft software.)