MongoDB vs CouchDB

For background: I am working on a project that involves epigenetic data. Our focus is to store data from different projects together with our own generated data. Storing could be easy: just put the files in a directory. But we want more than just “saving the files”: we want to store and retrieve this data, and by retrieving I do not mean just copying a file. It is necessary to search the data by its metadata, properties, and genomic regions.

For this project I chose MongoDB. Even though MongoDB is not perfect, it was the best option for this project. One of the team members came with the question: “Why MongoDB and not CouchDB?”

So, I picked some main “features” of both databases and compared them.

The features are Accessing the data, API, Queries and Indexes, MapReduce, and Sharding.
Accessing the data is important when we have to see what is happening in the database, or when we have some problem in the system and have to look deeper into the data. The API is how our application will communicate with the database. Queries and Indexes are how we will retrieve the data from the database. MapReduce is very important because of the size of our data and how we need to process it. Sharding is important for two reasons: parallelizing queries and handling data growth.

Accessing the data:
MongoDB has a nice command line application (mongo) where it is possible to insert, query, and perform operations on the data using JavaScript. MongoDB also has a web interface to check the database status and some GUIs to access the database, but mainly all operations are done using the “mongo” tool.
CouchDB does not have a shell. Commands can be sent using the Unix tool “curl” or the tool Futon. But it is not possible to run more complex queries or check the data status in an easy way. For that, it is necessary to define the query beforehand and then execute it.

API:
– MongoDB has official APIs for Python, C, C++, Java, JavaScript, and other languages. All these APIs transform the data and queries into a Binary JSON document, called BSON. This BSON document is sent to the server, and the answer is another BSON document, which the API transforms into the native objects/structures of the language.

– CouchDB’s main API is built over the HTTP protocol, where all commands are HTTP requests. CouchDB provides some abstraction layers for Java and C (not C++). The big problem is the overhead generated by this approach. It is one of the main reasons why it was not chosen as the project database.
Explaining the overhead: MongoDB clients (in this case, our own software) keep their connections to the server in a connection pool, through which they send and receive the data (binary JSON documents). CouchDB does not use permanent connections with the client: it is necessary to create a connection for each request, and the data is sent as an uncompressed HTTP request. Putting it in numbers: in my tests, I was able to insert approximately 15k regions (chr, start, end) per second using MongoDB, and with CouchDB it was less than 5k insertions per second.
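
To give a feeling for what these rates mean in practice, here is a small back-of-the-envelope calculation. The two rates are the ones from my tests above; the dataset size of 100 million regions is a hypothetical number chosen only for illustration, and the calculation assumes the rate stays constant.

```python
# Insertion rates reported in the post (regions per second).
MONGODB_RATE = 15_000
COUCHDB_RATE = 5_000  # upper bound: the post reports "less than 5k"

# Hypothetical dataset size, chosen only for illustration.
NUM_REGIONS = 100_000_000


def load_time_hours(num_regions: int, rate_per_second: int) -> float:
    """Hours needed to insert num_regions at a constant insertion rate."""
    return num_regions / rate_per_second / 3600


mongo_hours = load_time_hours(NUM_REGIONS, MONGODB_RATE)
couch_hours = load_time_hours(NUM_REGIONS, COUCHDB_RATE)

print(f"MongoDB: about {mongo_hours:.2f} h")
print(f"CouchDB: at least {couch_hours:.2f} h")
```

Under these assumptions the CouchDB load takes at least three times as long, which adds up quickly when data sets are loaded repeatedly.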

Queries and Indexes:
– MongoDB supports dynamic queries: db.collection.find({query…}, {fields}). Indexes can be created to speed up the queries.

– CouchDB has commands to define “views”, which are CouchDB’s queries. The views must be defined before being executed. The indexes are created per view, which means that if we have different views of the same data, the index can be duplicated. This is a really important point in our project, where we will perform different kinds of queries on the region collections and multiple indexes would waste RAM.
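
To make “dynamic query” more concrete, here is a minimal in-memory sketch in Python. It is not MongoDB code and does not talk to any server: the `matches` and `find` helpers and the sample regions are hypothetical, written only for this illustration. The filter document built by `overlap_query` uses the same shape a MongoDB find() filter would have (equality plus the `$lte` and `$gte` operators), and the point is that the query is just data built at run time, with nothing defined in advance.

```python
from typing import Any, Dict, List

Region = Dict[str, Any]


def matches(doc: Region, query: Dict[str, Any]) -> bool:
    """Return True if doc satisfies every condition in the filter document.

    Only equality, $lte and $gte are supported in this toy matcher.
    """
    for field, condition in query.items():
        value = doc.get(field)
        if isinstance(condition, dict):
            for op, operand in condition.items():
                if op == "$lte":
                    if value is None or not value <= operand:
                        return False
                elif op == "$gte":
                    if value is None or not value >= operand:
                        return False
                else:
                    raise ValueError(f"unsupported operator: {op}")
        elif value != condition:
            return False
    return True


def find(collection: List[Region], query: Dict[str, Any]) -> List[Region]:
    """Full scan over the collection; a real database would use an index."""
    return [doc for doc in collection if matches(doc, query)]


def overlap_query(chrom: str, start: int, end: int) -> Dict[str, Any]:
    """Filter for regions on chrom that overlap the interval [start, end]."""
    return {"chr": chrom, "start": {"$lte": end}, "end": {"$gte": start}}


# Made-up sample data, for illustration only.
regions = [
    {"chr": "chr1", "start": 100, "end": 200, "experiment": "exp_a"},
    {"chr": "chr1", "start": 500, "end": 900, "experiment": "exp_b"},
    {"chr": "chr2", "start": 150, "end": 250, "experiment": "exp_a"},
]

hits = find(regions, overlap_query("chr1", 150, 600))
by_experiment = find(regions, {"experiment": "exp_a"})
print(len(hits), len(by_experiment))
```

Both queries run against the same collection without any prior definition, which is the flexibility I am looking for. In a real database, one index on the queried fields could serve such filters instead of the full scan done here.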

MapReduce:
– MongoDB has a built-in MapReduce and also the aggregation framework, where operations like the SQL aggregation commands (count, sum, group by) can be executed. The queries, MapReduce, and aggregation commands are defined and executed separately.

– CouchDB uses MapReduce directly in all queries, possibly making them execute faster, but with MongoDB it is also possible to execute MapReduce, as a separate command. In fact, I do not agree with the idea of using MapReduce as the query backend. MapReduce was designed for offline and batch processing, not for real-time processing. MapReduce concepts are not optimized for fast queries, but for parallel batch processing. For the queries that we are going to perform in our project, like retrieving all regions that belong to some experiment, a simpler solution of index + multiple shards could be faster.
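
For readers not familiar with the model, the following is a minimal pure-Python sketch of the MapReduce idea: count regions per experiment. It does not use MongoDB or CouchDB; the function names and the sample data are hypothetical and only illustrate the map, group, and reduce phases.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Region = Dict[str, Any]


def map_region(region: Region) -> Iterator[Tuple[str, int]]:
    """Map phase: emit one (experiment, 1) pair per region."""
    yield region["experiment"], 1


def reduce_counts(key: str, values: List[int]) -> int:
    """Reduce phase: sum all values emitted for one key."""
    return sum(values)


def map_reduce(
    documents: Iterable[Region],
    mapper: Callable[[Region], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Run the mapper over every document, group by key, then reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


# Made-up sample data, for illustration only.
regions = [
    {"chr": "chr1", "start": 100, "end": 200, "experiment": "exp_a"},
    {"chr": "chr1", "start": 500, "end": 900, "experiment": "exp_b"},
    {"chr": "chr2", "start": 150, "end": 250, "experiment": "exp_a"},
]

counts = map_reduce(regions, map_region, reduce_counts)
print(counts)
```

Note that every document passes through the mapper, whatever the question is. That fits batch jobs well, but for “give me all regions of experiment X” an index lookup touches far less data.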

Sharding:
– MongoDB has built-in sharding. It means that it is possible to create and use a cluster of MongoDB instances with just a few commands, without needing any special software or configuration. When we add a new MongoDB instance/shard to the cluster, MongoDB quietly redistributes the data onto this shard, which makes data growth easier to handle. We did some tests with MongoDB shards and the results were good, with an almost linear gain in query performance.

– CouchDB also supports sharding, but it needs special and not so easy configuration. It is necessary to configure a lot of files, and if we think of a long-term project, where in the future other people will have to maintain it, this kind of configuration is not attractive.
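
The idea behind sharding can be sketched in a few lines of Python. This is a simplified illustration and not how MongoDB implements it: MongoDB splits the data into chunks and migrates them between shards, while this toy version just takes a hash of the key modulo the number of shards. All names (`shard_for`, `distribute`, `scatter_gather`) and the generated data are hypothetical.

```python
import hashlib
from typing import Any, Callable, Dict, List

Region = Dict[str, Any]


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard from a stable hash of the shard key (toy routing rule)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(
    documents: List[Region], shard_key: str, num_shards: int
) -> List[List[Region]]:
    """Place every document on exactly one shard according to its key."""
    shards: List[List[Region]] = [[] for _ in range(num_shards)]
    for doc in documents:
        shards[shard_for(str(doc[shard_key]), num_shards)].append(doc)
    return shards


def scatter_gather(
    shards: List[List[Region]], predicate: Callable[[Region], bool]
) -> List[Region]:
    """Run the same filter on every shard and merge the partial results.

    Here the shards are visited one after another; in a real cluster each
    shard would evaluate its part in parallel.
    """
    results: List[Region] = []
    for shard in shards:
        results.extend(doc for doc in shard if predicate(doc))
    return results


# Made-up regions, for illustration only.
regions = [
    {"_id": i, "chr": f"chr{i % 3 + 1}", "start": i * 10, "end": i * 10 + 5}
    for i in range(1000)
]

shards = distribute(regions, "_id", 4)
chr1_hits = scatter_gather(shards, lambda doc: doc["chr"] == "chr1")
print([len(shard) for shard in shards], len(chr1_hits))
```

Each shard holds only a part of the data, so each one has less to scan for the same query. That is where the almost linear gain comes from when the shards work in parallel.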

Conclusion:
CouchDB is more suitable for more stable data, where all documents in a given collection have the same fields and the queries will not change much. In this project, one of the main concerns is data and query flexibility, so depending on previously defined queries could constrain development and user interaction. Another important point is how our data size will grow: MongoDB is designed to provide a robust solution to store the data and to allow it to grow over time, and for that, all we have to do is add new shards to the cluster, without having to change the project code.

Needless to say, the users will never touch the database. The database is a tool where we will store and retrieve the data, not the main focus of the project.

4 thoughts on “MongoDB vs CouchDB”

  1. MongoDB is great! All these features (ReplicaSet, Sharding, MapReduce, etc) combined with good APIs and tooling, makes Mongo a very good choice for you. We use Mongo massively here at Eventials to save statistics and run M/R jobs, for now we are very happy with it. Good comparison post, thanks!

  2. Thank you very much for at-least making me decide.. where should i go no.. I was too much confused.. where to go.. first it was quite harder to finalize.. NODE.JS or GOLANG… after GOOOOOOOOOOOOGLING GOOOOOGLING GOOOOGLING.. I could figure out.. NODE is better… and then mongoDB is good for so..
    now i’ll start goooogling.. where to start now from?… with node.js and mongoDB..

    Thanks
