So you want to be a computational biologist?

A nice review that I found in Nature Biotechnology: http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html

A very important point:

You’re a scientist, not a programmer

The perfect is the enemy of the good. Remember you are a scientist and the quality of your research is what is important, not how pretty your source code looks. Perfectly written, extensively documented, elegant code that gets the answer wrong is not as useful as a basic script that gets it right. Having said that, once you’re sure your core algorithm works, spend time making it elegant and documenting how to use it. Use your biological knowledge as much as possible—that’s what makes you a computational biologist.

and:

Be suspicious and trust nobody

The following experiment is often performed during statistics training. First, a large matrix of random numbers is created and each column is designated as ‘case’ or ‘control’. A statistical test is then applied to each row to test for significant differences between the case data and the control data. You should not be surprised to learn that hundreds of rows come back with P values indicating statistical significance. Biological datasets, such as those generated by genomics experiments are just like this, large and full of noise. Your data analysis will produce both false positives and false negatives; and there may be systematic bias in the data, introduced either in the experiment or during the analysis.

There is a temptation, even among biologists trained in statistical techniques, to throw caution to the wind when particular software or pipelines produce an interesting result. Instead, treat results with great suspicion, and carry out further tests to determine whether the results can be explained by experimental error or bias. If multiple approaches agree, then your confidence in those answers increases. But for many findings, validation and further work in the laboratory may be necessary. Knowledge of biology is vital in the interpretation of computational results. Setting traps, or tests, as mentioned above, is only part of this. Those tests are meant to ensure that your software or pipeline is working as you expect it to work; it doesn’t necessarily mean that the answers produced are correct.
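The random-matrix experiment described above is easy to try at home. Here is a minimal sketch of it, under my own assumptions: every value is standard normal noise, each row has 10 "case" and 10 "control" values, and the test is a simple two-sided z-test with known variance (the article does not say which test is used). The function name and parameters are mine, and the Bonferroni threshold at the end is just one common correction, not something the article prescribes.

```python
import random
from statistics import NormalDist, fmean


def count_significant_rows(n_rows=10_000, n_per_group=10, alpha=0.05, seed=0):
    """Count rows of pure noise whose case/control difference has p < alpha."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    std_normal = NormalDist()
    # Standard error of the difference of two means when sigma = 1 is known.
    std_err = (2.0 / n_per_group) ** 0.5
    hits = 0
    for _ in range(n_rows):
        # Both groups come from the same distribution: there is no real effect.
        cases = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        z = (fmean(cases) - fmean(controls)) / std_err
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p_value < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    n_rows = 10_000
    naive = count_significant_rows(n_rows=n_rows, alpha=0.05)
    corrected = count_significant_rows(n_rows=n_rows, alpha=0.05 / n_rows)
    print(f"rows with p < 0.05: {naive} of {n_rows}")
    print(f"rows surviving a Bonferroni threshold: {corrected}")
```

With 10,000 rows and a 0.05 cutoff, roughly 5% of the rows (about 500) should look "significant" even though the data is pure noise, which is the point of the exercise.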

Have fun ;)

More comments on Bioinformatics Software Development

I just found these comments in this Nature editorial.

Their secret sauce appears to boil down to five ingredients: developers must possess sufficient proximity to and understanding of the research problem at hand; timing of the software release should correspond with the emergence of the problem in the research community that it addresses; software should have extensibility and interoperability; the algorithm implemented by the software should ideally be novel and indicative of profound insight; and, finally, a broad range of users should be able to run and operate the program.

Completely agree.

The underappreciation of computational science is manifest in several ways. First, developing a mathematical algorithm to answer a research question is seen as more intellectually valuable than developing a software implementation for a broad community of users (in fact, both sets of skills are needed, but rarely found in the same person).

I often hear the expression “code monkey” used for trainees who work only on the implementation. But it is hard for an outsider to understand how hard it is to build a good bioinformatics system.

Comments on “The anatomy of successful computational biology software”

Hi, just a few comments on this note: http://www.nature.com/nbt/journal/v31/n10/full/nbt.2721.html

First, I think that for a tool to be a big success, it needs:

* An initial user base

* A vacuum in terms of tools: it is hard to compete with an already existing tool.

* Mathematical/statistical interpretation

* Generality: all the mentioned tools are very generic, or they sit at the starting point of an analysis pipeline

* Simplicity: a simple interface and simple commands

* Multiplatform support

But I strongly disagree with some points:

Gentleman: I have found that real hardcore software engineers tend to worry about problems that are just not existent in our space. They keep wanting to write clean, shiny software, when you know that the software that you’re using today is not the software you’re going to be using this time next year. At Genentech (S. San Francisco, California), we develop testing and deployment paradigms that are on somewhat shorter cycles.

To me, he is talking about prototypes and not real software. He is worried about small prototype programs or even scripts. Even Li says it:

Li: People not doing the computational work tend to think that you can write a program very fast. That, I think, is frankly not true. It takes a lot of time to implement a prototype. Then it actually takes a lot of time to really make it better.

There is also another problem:

Taylor: I don’t think there are good incentives for contributing to and improving existing software instead of inventing something new. The latter is more likely to be publishable.

Some argue that software development is not science. You have to prove that you are doing science there.

Two very important points are:

Trapnell: [..]  The computational folks need to learn more about statistics. The biology folks need to understand basic computation in order to even be able to communicate with the biostatistics crowd.

and

Krzywinski: In terms of data visualization, the idea that we can show all the data that we are collecting is long gone. We now need to look at the differences in the data sets, and help the user focus on the things that are important.

(Like EpiExplorer does)

By the way, did you notice that there is no epigenetics software in the list? (Only a critique of the tools for finding peaks.)

Trying to come back

So, I keep receiving emails with comments on the post “Aprenda a programar em dez anos”. I really appreciate it. Even though it is an old text, it is still very relevant and still makes sense nowadays.

I am looking for a place to write about my projects, what I am doing, why I was away for such a long time… I thought about a new blog, but why start a new one if I have this old and (maybe) cool place? Are there still people reading this blog?

So, a little bit about what I am doing:
I have been living in Germany since 2010 and, since the end of 2011, I have been doing my PhD in the Bioinformatics group at the Max Planck Institute for Informatics. My area is computational epigenetics, where I work with a lot of data, so you will very probably see a lot of posts here about big data and the current hype topics related to (epi)genetics.

I will look into something cool, probably MongoDB, and write about it soon.

RNA Sequence Support in Genoogle

While many people are at the beach, drinking and discussing Flamengo's sixth championship title, I am implementing support for RNA sequences in Genoogle.

At first it seemed simple: define a new alphabet, removing the T from DNA and adding the U. However, I am doing much broader work. Now that Genoogle is starting to mature, I am reorganizing the class structure and the JUnit tests. As an example of the change, the “diff” is more than 3,700 lines of code. I believe it is worth it. Early next year I want to make a more formal announcement of Genoogle on the main Brazilian and international bioinformatics mailing lists.
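The "simple" part of the idea, deriving the RNA alphabet from the DNA one by dropping T and adding U, can be sketched in a few lines. This is only an illustration in Python: Genoogle itself is written in Java, and the names below (`DNA_ALPHABET`, `RNA_ALPHABET`, `is_valid`, `dna_to_rna`) are hypothetical and not taken from its source code.

```python
# Hypothetical names for illustration; not Genoogle's actual classes.
DNA_ALPHABET = frozenset("ACGT")
# The RNA alphabet is the DNA alphabet without T, plus U.
RNA_ALPHABET = (DNA_ALPHABET - {"T"}) | {"U"}


def is_valid(sequence, alphabet):
    """Return True if every symbol of the sequence belongs to the alphabet."""
    return all(symbol in alphabet for symbol in sequence.upper())


def dna_to_rna(sequence):
    """Rewrite a DNA sequence in the RNA alphabet by replacing T with U."""
    if not is_valid(sequence, DNA_ALPHABET):
        raise ValueError(f"not a DNA sequence: {sequence!r}")
    return sequence.upper().replace("T", "U")


if __name__ == "__main__":
    rna = dna_to_rna("gattaca")
    print(rna, is_valid(rna, RNA_ALPHABET))
```

The real work is everything around this: every place in the code that assumed a single alphabet has to learn that there are now two.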

Besides the work on Genoogle's internal structure, I would like to work on its web interface. I am a bit short on time, but if you feel like working with GWT (http://code.google.com/webtoolkit/), a little (really very little) SQL and a lot of Java, contact me so we can talk!

Genoogle wants YOU!

So, I have just released a new version of Genoogle.
This version, together with the previous one, is a set of improvements and fixes that I had been wanting to make, such as using environment variables for some settings and improving how the results are displayed. I find that one of the most tedious parts is releasing a new version. Even with a checklist, it is tedious to write the changelist, pick the version, build the binaries, copy them to another repository, check that everything worked and announce it. The programming part, the development, is relatively easy; the hard part is releasing and getting people to use it.

So, that is why I am advertising it here again.
If you like genetics, programming, development and algorithms, and are looking for an interesting project to get involved in, visit the Genoogle page, download the binaries and the sources, join the group, and take part!

Genoogle is a competitor to NCBI-BLAST, one of the most used and cited pieces of software in bioinformatics, and I am sure that Genoogle is capable of grabbing a share of those users!

I have not yet made an announcement on the big mailing lists, because I still want more users, to find and fix more bugs, and to have a more stable version. So, Genoogle needs YOU!

Genoogle!

Genoogle is the software I developed during my master's degree. Unlike the vast majority of academic dissertations and theses, I decided to research a technique (the scientific side) and to build (the engineering side) working software that implements the technique researched and developed.

The result is Genoogle, a “Similar DNA Sequences Searching Engine and Tools”, that is, a search engine for similar genetic sequences, plus tools. Given an input sequence, it searches for similar sequences in a set of sequences. The search uses data indexing techniques, to optimize the search algorithm, and parallelization, to exploit the computing capacity of multicore machines. Genoogle offers a text mode, a web page and web services as interfaces.
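To give a feeling for what "indexing" buys you here, below is a toy sketch of the general idea: index every fixed-length subsequence (k-mer) of the database once, then look up the k-mers of the query instead of scanning every sequence. This is not Genoogle's actual algorithm or data structure, only the textbook version of the idea; the function names, the toy sequences and the choice of k = 4 are all assumptions made for the example, and parallelization is left out entirely.

```python
from collections import Counter, defaultdict


def build_kmer_index(sequences, k):
    """Map each k-mer to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def rank_candidates(query, index, k):
    """Rank database sequences by how many k-mer hits they share with the query."""
    hits = Counter()
    for pos in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict.
        for seq_id, _ in index.get(query[pos:pos + k], ()):
            hits[seq_id] += 1
    return hits.most_common()


if __name__ == "__main__":
    # Toy database; real sequence collections are far larger.
    database = {
        "seq1": "ACGTACGTGA",
        "seq2": "TTTTGGGGCC",
        "seq3": "GGACGTAC",
    }
    index = build_kmer_index(database, k=4)
    print(rank_candidates("ACGTAC", index, k=4))
```

A real search engine would then extend and score the candidate hits with a proper alignment step; the index only narrows down where to look.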

Genoogle is licensed under the GPL 3. In other words, the source code is open to modify, study, tinker with and, above all, improve! In fact, there are several areas of Genoogle that could be improved, from building the query pages with JSP, to code optimization, writing tests, and implementing and testing new algorithms.
So, if you work in bioinformatics or programming, or just like a challenge: visit genoogle.pih.bio.br to get to know the project (and take part!), subscribe to the mailing list at http://groups.google.com/group/genoogle and download the software and the sources at http://svn.pih.bio.br/genoogle-packages/current/.

All comments are welcome!