More comments on Bioinformatics Software Development

I just found these comments in this Nature editorial.

Their secret sauce appears to boil down to five ingredients: developers must possess sufficient proximity to and understanding of the research problem at hand; timing of the software release should correspond with the emergence of the problem in the research community that it addresses; software should have extensibility and interoperability; the algorithm implemented by the software should ideally be novel and indicative of profound insight; and, finally, a broad range of users should be able to run and operate the program.

Completely agree.

 The underappreciation of computational science is manifest in several ways. First, developing a mathematical algorithm to answer a research question is seen as more intellectually valuable than developing a software implementation for a broad community of users (in fact, both sets of skills are needed, but rarely found in the same person).

Usually I hear the expression “code monkey” for trainees who work only on the implementation. But it is hard for someone on the outside to understand how hard it is to build a good bioinformatics system.

Comments on “The anatomy of successful computational biology software”

Hi, just a few comments on this note: http://www.nature.com/nbt/journal/v31/n10/full/nbt.2721.html.

Firstly, I think that for a tool to be a big success, it should have:

* An initial user base

* A vacuum in terms of tools; it is hard to compete with an already existing tool

* A mathematical/statistical interpretation

* Genericity: all the mentioned tools are very generic, or they sit at the starting point of any analysis pipeline

* Simplicity: a simple interface and simple commands

* Multiplatform support

But I strongly disagree with some points:

Gentleman: I have found that real hardcore software engineers tend to worry about problems that are just not existent in our space. They keep wanting to write clean, shiny software, when you know that the software that you’re using today is not the software you’re going to be using this time next year. At Genentech (S. San Francisco, California), we develop testing and deployment paradigms that are on somewhat shorter cycles.

For me, he is talking about prototypes, not real software. He is worried about small prototype programs or even scripts. Even Li says it:

Li: People not doing the computational work tend to think that you can write a program very fast. That, I think, is frankly not true. It takes a lot of time to implement a prototype. Then it actually takes a lot of time to really make it better.

There is also another problem:

Taylor: I don’t think there are good incentives for contributing to and improving existing software instead of inventing something new. The latter is more likely to be publishable.

Some argue that software development is not science; you have to prove that you are doing science there.

Some very important points are:

Trapnell: [..]  The computational folks need to learn more about statistics. The biology folks need to understand basic computation in order to even be able to communicate with the biostatistics crowd.

and

Krzywinski: In terms of data visualization, the idea that we can show all the data that we are collecting is long gone. We now need to look at the differences in the data sets, and help the user focus on the things that are important.

(Like EpiExplorer does)

By the way, did you notice that there is no epigenetics software in the list? (Only a critique of the tools for finding peaks.)

Enlarge your data now

I was helping a MongoDB user with sharding one time. His chunks weren’t splitting and I was trying to diagnose the issue. His shard key looked reasonable, he didn’t have any errors in his log, and manually splitting the chunks worked. Finally, I looked at how much data he was storing: only a few MB per chunk. “Oh, I see the problem,” I told him. “It looks like your chunks are too small to split, you just need more data.”

“No, my data is huge, enormous,” he said.

“Um, okay. If you keep inserting data, it should split.”

“This is a bug. My data is big.”

We argued back and forth a bit, but I managed to back off from having called his data small and convince him it wasn’t a bug. That day I learned that people take their data size very personally.

MongoDB vs CouchDB

For background: I am working on a project that involves epigenetic data. Our focus is to store data from different projects together with our own generated data. Storing could be easy: just put the files in a directory. But we want more than just “saving the files”: we want to store and retrieve this data, and by retrieving I do not mean just copying the file. It is necessary to search the data by its metadata, properties, and genomic regions.

For this project I chose MongoDB. Even though MongoDB is not perfect, it was the best option for this project. One of the members came up with the question: “Why MongoDB and not CouchDB?”

So I took some main “features” of both databases and made a comparison between them.

The features are Accessing the data, API, Queries and Indexes, MapReduce, and Sharding.
Accessing the data is important when we have to see what is happening in the database, or when there is a problem in the system and we have to look deeper into the data. The API is how our application will communicate with the database. Queries and indexes are how we will retrieve the data from the database. MapReduce is very important because of the size of our data and how we can process it. Sharding is important for two reasons: query parallelization and handling data growth.

Accessing the data:
MongoDB has a nice command-line application (mongo) where it is possible to insert, query, and perform operations on the data using JavaScript. MongoDB also has a web interface to check the database status, and some GUIs to access the database, but mostly all operations are done using the “mongo” tool.
CouchDB does not have a shell. Commands can be sent using the Unix tool “curl” or through the Futon web interface. But it is not possible to run more complex ad-hoc queries or inspect the state of the data in an easy way; for that, it is necessary to define the query beforehand and then execute it.

API:
– MongoDB has official APIs for Python, C, C++, Java, JavaScript, and other languages. All these APIs transform the data and queries into a binary JSON document, called BSON. This BSON document is sent to the server, and the answer is another BSON document, which the API transforms back into the language’s native objects/structures.
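As a minimal sketch of what this looks like from the Python API (a local server is assumed, and the database and collection names are hypothetical):

```python
from pymongo import MongoClient

# Connect to a local MongoDB server; the driver keeps a connection pool.
client = MongoClient("localhost", 27017)
collection = client["epigenetics"]["regions"]  # hypothetical names

# The driver encodes this dict as BSON before sending it; the server's
# BSON reply is decoded back into plain Python objects.
collection.insert_one({"chr": "chr1", "start": 1000, "end": 2000})
print(collection.find_one({"chr": "chr1"}))
```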

– CouchDB’s main API is built on top of the HTTP protocol, where all commands are HTTP requests. CouchDB provides some abstraction layers, for example for Java and C (not C++). The big problem is the overhead generated by this approach; it is one of the main reasons why it was not chosen as the project database.
Explaining the overhead: MongoDB clients (in this case, our software) keep their connections to the server in a connection pool, through which they send and receive the data as binary BSON documents. CouchDB does not use permanent connections with the client: it is necessary to create a connection for each request, and the data is sent as an uncompressed HTTP request. Putting it in numbers: in my tests, I was able to insert approximately 15k regions (chr, start, end) per second using MongoDB, while with CouchDB it was less than 5k insertions per second.
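To give an idea of where such numbers come from, here is a rough sketch of this kind of comparison in Python (not the original benchmark; database names, ports, and document shapes are assumptions, with MongoDB on 27017 and CouchDB on its default 5984):

```python
import time

import requests
from pymongo import MongoClient

# Ten thousand toy regions (chr, start, end).
regions = [{"chr": "chr1", "start": i, "end": i + 100} for i in range(10000)]

# MongoDB: BSON documents over a pooled, persistent connection.
mongo = MongoClient("localhost", 27017)["bench"]["regions"]
t0 = time.time()
for region in regions:
    mongo.insert_one(dict(region))  # insert a copy; insert_one adds an _id
print("MongoDB:", len(regions) / (time.time() - t0), "inserts/s")

# CouchDB: one uncompressed HTTP request per document.
requests.put("http://localhost:5984/bench")  # create the database
t0 = time.time()
for region in regions:
    requests.post("http://localhost:5984/bench", json=region)
print("CouchDB:", len(regions) / (time.time() - t0), "inserts/s")
```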

Queries and Indexes:
– MongoDB supports dynamic queries: db.collection.find({query…}, {fields}). Indexes can be created to speed up the queries.
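A small sketch of such a dynamic query through the Python driver (field names are hypothetical):

```python
from pymongo import MongoClient

collection = MongoClient("localhost", 27017)["epigenetics"]["regions"]

# Ad-hoc query with a projection; nothing has to be declared in advance.
cursor = collection.find(
    {"chr": "chr1", "start": {"$gte": 1000}},  # query document
    {"chr": 1, "start": 1, "end": 1},          # fields to return
)

# A compound index created to speed up exactly this kind of query.
collection.create_index([("chr", 1), ("start", 1)])
```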

– CouchDB has commands to define “views”, which are CouchDB’s queries. The views must be defined before being executed. The indexes are created per view, which means that if we have different views over the same data, the index data can be duplicated. This is a really important point in our project, where we will perform different kinds of queries on the regions collections, and multiple indexes would waste RAM.
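For contrast, here is a sketch of what a predefined view looks like over CouchDB’s HTTP API (names are hypothetical; the map function itself must be JavaScript, stored as a string, and every additional query shape needs another view with its own index):

```python
import requests

# A design document holding one view; the map function is JavaScript.
view = {
    "views": {
        "by_chr": {
            "map": "function(doc) { emit(doc.chr, [doc.start, doc.end]); }"
        }
    }
}
requests.put("http://localhost:5984/bench/_design/regions", json=view)

# Only now can the query be executed; its index belongs to this view alone.
result = requests.get(
    "http://localhost:5984/bench/_design/regions/_view/by_chr",
    params={"key": '"chr1"'},  # keys are passed JSON-encoded
)
print(result.json())
```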

MapReduce:
– MongoDB has built-in MapReduce and also the aggregation framework, where operations like the SQL aggregation commands (count, sum, group by) can be executed. Queries, MapReduce, and aggregation commands are defined and executed separately.
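For example, a group-by over the regions could be written with the aggregation framework like this (a sketch; field names are hypothetical):

```python
from pymongo import MongoClient

collection = MongoClient("localhost", 27017)["epigenetics"]["regions"]

# Roughly "SELECT chr, COUNT(*), SUM(end - start) FROM regions GROUP BY chr".
pipeline = [
    {"$group": {
        "_id": "$chr",
        "count": {"$sum": 1},
        "covered": {"$sum": {"$subtract": ["$end", "$start"]}},
    }}
]
for row in collection.aggregate(pipeline):
    print(row)
```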

– CouchDB uses MapReduce directly in all queries, possibly making them execute faster, whereas in MongoDB MapReduce runs as a separate command. In fact, I do not agree with the idea of using MapReduce as the query back end. MapReduce was designed for offline, batch processing, not for real-time processing; its concepts are optimized not for fast queries but for parallel batch processing. For the queries that we are going to perform in our project, like retrieving all regions that belong to some experiment, a simpler solution (an index plus multiple shards) could be faster.
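As a sketch of that simpler solution, the “regions of an experiment” lookup needs only a plain indexed query, which every shard can answer in parallel (the experiment field name is hypothetical):

```python
from pymongo import MongoClient

collection = MongoClient("localhost", 27017)["epigenetics"]["regions"]

# One index turns the query into a direct lookup; no MapReduce pass needed.
collection.create_index("experiment_id")  # hypothetical field
for region in collection.find({"experiment_id": "exp42"}):
    print(region)
```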

Sharding:
– MongoDB has built-in sharding. This means that it is possible to create and use a cluster of MongoDB instances with just a few commands, without needing any special software or configuration. When we include a new MongoDB instance/shard in the cluster, MongoDB quietly rebalances the data onto this shard, supporting data growth in an easier way. We did some tests with MongoDB shards and the results were good, with an almost linear gain in query speed.
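To illustrate “just a few commands”: through the Python driver, turning a collection into a sharded one boils down to something like this (a sketch assuming a running mongos router; the database, collection, and shard key are hypothetical):

```python
from pymongo import MongoClient

# Connect to the mongos router in front of the cluster.
router = MongoClient("localhost", 27017)

# Enable sharding for the database, then choose the shard key.
router.admin.command("enableSharding", "epigenetics")
router.admin.command(
    "shardCollection", "epigenetics.regions", key={"chr": 1, "start": 1}
)
```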

– CouchDB also supports sharding, but it requires special and not so easy configuration. It is necessary to configure a lot of files, and if we think of a long-term project, where in the future other people will have to maintain it, this configuration is not attractive.

Conclusion:
CouchDB is more suitable for more stable data, where all documents in a given collection have the same fields and the queries will not change much. In this project, one of the main focuses is data and query flexibility, so depending on previously defined queries could constrain development and user interaction. Another important point is how our data size will grow: MongoDB is designed to provide a robust solution to store the data and to allow it to grow over time; all we will have to do is add new shards to the cluster, without having to change the project code.

Nevertheless, remember that the users will never touch the database directly. The database is a tool for storing and retrieving the data, not the main focus of the project.

Snakes

from http://www.theverge.com/2013/4/10/4208308/how-to-complete-snake-and-accept-the-emptiness-of-life :

It takes 13 minutes and seven seconds to complete Snake, the decades-old game that enjoyed a renascence through Nokia’s early mobile phones. 13 minutes, seven seconds, one hundred pellets. But what is this endless pursuit of pellets for? What reward lies at the end of this snake’s insatiable desire for food? Nothing. Victory in life only results in death. Immortalized in a two-minute GIF, this foreboding tale of how reptilian consumerism breeds nihilism is a mesmerizing journey of birth, life, and death.
