So you want to be a computational biologist?

A nice review that I found in Nature Biotechnology: http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html

A very important point:

You’re a scientist, not a programmer

The perfect is the enemy of the good. Remember you are a scientist and the quality of your research is what is important, not how pretty your source code looks. Perfectly written, extensively documented, elegant code that gets the answer wrong is not as useful as a basic script that gets it right. Having said that, once you’re sure your core algorithm works, spend time making it elegant and documenting how to use it. Use your biological knowledge as much as possible—that’s what makes you a computational biologist.

and:

Be suspicious and trust nobody

The following experiment is often performed during statistics training. First, a large matrix of random numbers is created and each column is designated as ‘case’ or ‘control’. A statistical test is then applied to each row to test for significant differences between the case data and the control data. You should not be surprised to learn that hundreds of rows come back with P values indicating statistical significance. Biological datasets, such as those generated by genomics experiments are just like this, large and full of noise. Your data analysis will produce both false positives and false negatives; and there may be systematic bias in the data, introduced either in the experiment or during the analysis.
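The experiment described above is easy to try at home. Below is a minimal Python sketch, not taken from the review: it fills a matrix with standard normal noise, labels 10 columns as case and 10 as control, and applies a two-sample z-test to every row. The matrix size, the group sizes, the seed, and the function name count_false_positives are all assumptions chosen for illustration.

```python
import math
import random


def count_false_positives(n_rows=10000, n_case=10, n_control=10, alpha=0.05, seed=0):
    """Count rows of pure noise that a two-sample z-test calls significant."""
    rng = random.Random(seed)
    # Standard error of the difference of two means when every value is N(0, 1).
    std_err = math.sqrt(1.0 / n_case + 1.0 / n_control)
    significant = 0
    for _ in range(n_rows):
        case = [rng.gauss(0.0, 1.0) for _ in range(n_case)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_control)]
        z = (sum(case) / n_case - sum(control) / n_control) / std_err
        # Two-sided p-value under the standard normal distribution.
        p_value = math.erfc(abs(z) / math.sqrt(2.0))
        if p_value < alpha:
            significant += 1
    return significant


if __name__ == "__main__":
    hits = count_false_positives()
    print(f"{hits} of 10000 random rows have p < 0.05")
```

With 10,000 rows and a 0.05 threshold, roughly 5% of the rows (about 500) should look significant by chance alone. Dividing the threshold by the number of rows (a Bonferroni correction) removes almost all of them.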

“Knowledge of biology is vital in the interpretation of computational results.”

There is a temptation, even among biologists trained in statistical techniques, to throw caution to the wind when particular software or pipelines produce an interesting result. Instead, treat results with great suspicion, and carry out further tests to determine whether the results can be explained by experimental error or bias. If multiple approaches agree, then your confidence in those answers increases. But for many findings, validation and further work in the laboratory may be necessary. Knowledge of biology is vital in the interpretation of computational results. Setting traps, or tests, as mentioned above, is only part of this. Those tests are meant to ensure that your software or pipeline is working as you expect it to work; it doesn’t necessarily mean that the answers produced are correct.

Have fun ;)

Building MongoDB with a custom Boost library on Linux (C++ client driver too) :)

This is much more a note to myself than a tutorial, but I have not found any up-to-date reference for it.
By the way, I tested this during the 2.5.x development cycle. It will probably change. (And will. (As it did.))

Just the commands.

Download Boost, unzip it, and go to its folder:

./bootstrap.sh
./b2 variant=debug link=static threading=multi --with-system --with-thread --with-date_time --with-regex --with-serialization --with-program_options --with-filesystem stage
mkdir /opt/boost
./b2 install --prefix=/opt/boost

For mongo:

cd ~
mkdir mongo
cd mongo
git clone https://github.com/mongodb/mongo.git
cd mongo
mkdir /opt/mongo
scons --full -j 64 --prefix=/opt/mongo/ --cpppath=/opt/boost/include --libpath=/opt/boost/lib

It takes some time because it builds *everything*.

Done!

For testing:

cd ~
mkdir db
/opt/mongo/mongod --dbpath db

Building MongoDB with clang and libc++

I am working on a project in which I use MongoDB as the data storage system. I really like to use the “trunk” version, directly from git [https://github.com/mongodb/mongo]. Even on a Mac, I used to compile MongoDB with gcc, because MongoDB had a problem with the clang suite. Today I realized that it has been fixed: https://jira.mongodb.org/browse/SERVER-8467.

Also, since I updated to Mavericks, I have had small problems building my project, mainly in the linking phase: for some reason it was linking the binaries with libc++ where it was expected to link them with libstdc++ (because MongoDB and Boost had been built with gcc and libstdc++).

So, after some time, I compiled the Boost libraries, MongoDB, and my project using clang/clang++ and libc++.

I will explain the steps here. If you want to know why I used one parameter or another, please ask in the comments.

For that, I did:

mkdir /opt/boost-clang-libc++/
mkdir /opt/mongo-clang-libc++/

I downloaded the Boost libraries from http://sourceforge.net/projects/boost/files/boost/1.54.0/ and uncompressed them. Inside the directory I did:

./bootstrap.sh --prefix=/opt/boost-clang-libc++/
./b2 variant=debug link=static threading=multi toolset=clang cxxflags="-stdlib=libc++ -arch x86_64" linkflags="-stdlib=libc++" --with-system --with-thread --with-date_time --with-regex --with-serialization --with-program_options --with-filesystem stage
./b2 install --prefix=/opt/boost-clang-libc++/

Patience… patience… built!

For MongoDB:

git clone https://github.com/mongodb/mongo.git
cd mongo
# Remember that this is a VERY unstable version! You can check out a tag or download the source code of a stable MongoDB release instead.
scons --full install --64=FORCE64 -j 16 --prefix /opt/mongo-clang-libc++/ --use-system-boost --extrapath=/opt/boost-clang-libc++/ --osx-version-min=10.8 --libc++  # --osx-version-min=10.8 works on Mavericks; you can use 10.7 as well.

Wait, wait… built!

To run MongoDB you have to tell the system where Boost’s dynamic libraries are:

export DYLD_LIBRARY_PATH=/opt/boost-clang-libc++/lib:$DYLD_LIBRARY_PATH

I suggest you put it in your .bash_profile.

So…
To compile a program that uses, for example, boost_filesystem and the MongoDB C++ library, you have to run:

clang++ yourprogram.cpp -o program -L/opt/mongo-clang-libc++/lib -lmongoclient /opt/boost-clang-libc++/lib/libboost_filesystem.a

I have one problem…

– I have one problem.
– Let’s use XML and Perl
– Now I have three problems!

Let’s use Java instead!
Now you have a ProblemFactory problemFactory = new ProblemFactory(ProblemFactory.PROBLEM_FACTORY);
Let’s use C instead!
Now I am rewriting the string library! Now I am rewriting malloc()!
Let’s use Erlang.
Now I have no idea what I am doing.
Let’s use LISP.
Now I feel superior, but it takes 200 times longer to run.
Let’s use FORTRAN!
And now every computer scientist hates me.
Let’s use Python.
Now I just “import work” and go home.
“Knock, knock.”
“Who’s there?”

very long pause….

“Java.”
If you put a million monkeys at a million keyboards, one of them will eventually write a Java program.
The rest of them will write Perl programs.
A Cobol programmer made so much money doing Y2K remediation that he was able to have himself cryogenically frozen when he died. One day in the future, he was unexpectedly resurrected.
When he asked why he was unfrozen, he was told:
“It’s the year 9999 – and you know Cobol”

Q: How many Prolog programmers does it take to change a lightbulb?
A: Yes.

More comments on Bioinformatics Software Development

I just found these comments in this Nature editorial.

Their secret sauce appears to boil down to five ingredients: developers must possess sufficient proximity to and understanding of the research problem at hand; timing of the software release should correspond with the emergence of the problem in the research community that it addresses; software should have extensibility and interoperability; the algorithm implemented by the software should ideally be novel and indicative of profound insight; and, finally, a broad range of users should be able to run and operate the program.

Completely agree.

 The underappreciation of computational science is manifest in several ways. First, developing a mathematical algorithm to answer a research question is seen as more intellectually valuable than developing a software implementation for a broad community of users (in fact, both sets of skills are needed, but rarely found in the same person).

I usually hear the expression “code monkey” used for trainees who work only on the implementation. But it is hard for outsiders to understand how hard it is to build a good bioinformatics system.

Comments on “The anatomy of successful computational biology software”

Hi, just a few comments on this note: http://www.nature.com/nbt/journal/v31/n10/full/nbt.2721.html

First, I think that for a tool to be a big success, it needs:

* An initial user base

* A vacuum in terms of tools: it is hard to compete with an already existing tool.

* A mathematical/statistical interpretation

* To be generic: all the tools mentioned are very generic or sit at the starting point of an analysis pipeline

* Simplicity: a simple interface, simple commands

* To be multiplatform

But I strongly disagree on some points:

Gentleman: I have found that real hardcore software engineers tend to worry about problems that are just not existent in our space. They keep wanting to write clean, shiny software, when you know that the software that you’re using today is not the software you’re going to be using this time next year. At Genentech (S. San Francisco, California), we develop testing and deployment paradigms that are on somewhat shorter cycles.

For me, he is talking about prototypes and not real software. He is concerned with small prototypes or even scripts. Even Li says so:

Li: People not doing the computational work tend to think that you can write a program very fast. That, I think, is frankly not true. It takes a lot of time to implement a prototype. Then it actually takes a lot of time to really make it better.

There is also another problem:

Taylor: I don’t think there are good incentives for contributing to and improving existing software instead of inventing something new. The latter is more likely to be publishable.

Some argue that software development is not science. You have to prove that you are doing science there.

Two very important points are:

Trapnell: [..]  The computational folks need to learn more about statistics. The biology folks need to understand basic computation in order to even be able to communicate with the biostatistics crowd.

and

Krzywinski: In terms of data visualization, the idea that we can show all the data that we are collecting is long gone. We now need to look at the differences in the data sets, and help the user focus on the things that are important.

(Like EpiExplorer does)

By the way, did you notice that there is no epigenetics software in the list? (Only a critique of the tools for finding peaks.)

Enlarge your data now

I was helping a MongoDB user with sharding one time. His chunks weren’t splitting and I was trying to diagnose the issue. His shard key looked reasonable, he didn’t have any errors in his log, and manually splitting the chunks worked. Finally, I looked at how much data he was storing: only a few MB per chunk. “Oh, I see the problem,” I told him. “It looks like your chunks are too small to split, you just need more data.”

“No, my data is huge, enormous.” he said.

“Um, okay. If you keep inserting data, it should split.”

“This is a bug. My data is big.”

We argued back and forth a bit, but I managed to back off from having called his data small and convince him it wasn’t a bug. That day I learned that people take their data size very personally.

MongoDB vs CouchDB

For background: I am working on a project that involves epigenetic data. Our focus is to store data from different projects together with our own generated data. Storing it could be easy: just put the files in a directory. But we want more than just “saving the files”: we want to store and retrieve this data, and by retrieving I do not mean just copying a file. We need to search the data by its metadata, properties, and genomic regions.

For this project I chose MongoDB. Even though MongoDB is not perfect, it was the best option for this project. One of the team members came up with the question: “Why MongoDB and not CouchDB?”

So I took some of the main “features” of both databases and compared them.

The features are: accessing the data, API, queries and indexes, MapReduce, and sharding.
Accessing the data is important when we have to see what is happening in the database, or when we have a problem in the system and have to look deeper into the data. The API is how our application communicates with the database. Queries and indexes are how we retrieve the data from the database. MapReduce is very important because of the size of our data and how we can process it. Sharding is important for two reasons: query parallelization and handling data growth.

Accessing the data:
MongoDB has a nice command-line application (mongo) with which it is possible to insert, query, and perform operations on the data using JavaScript. MongoDB also has a web interface for checking the database status, and some GUIs for accessing the database, but mostly all operations are done using the “mongo” tool.
CouchDB does not have a shell. Commands can be sent using the Unix tool “curl” or the tool Futon, but it is not possible to make more complex queries or check the data status in an easy way. For that, it is necessary to define the query first and then execute it.

API:
– MongoDB has official APIs for Python, C, C++, Java, JavaScript, and other languages. All these APIs transform the data and queries into a binary JSON document, called BSON. This BSON document is sent to the server, and the answer is another BSON document, which the API transforms into the objects/structures of the language in use.

– CouchDB’s main API is built on the HTTP protocol: all commands are HTTP requests. CouchDB provides some abstraction layers for Java and C (not C++). The big problem is the overhead generated by this approach. It is one of the main reasons why it was not chosen as the project database.
Explaining the overhead: MongoDB clients (in this case, the software we developed) keep their connections to the server in a connection pool, through which they send and receive the data (BSON documents). CouchDB does not keep permanent connections with the client: it is necessary to create a connection for each request, and the data is sent as an uncompressed HTTP request. Putting it in numbers: in my tests, I was able to insert approximately 15k regions (chr, start, end) per second using MongoDB, while with CouchDB it was fewer than 5k insertions per second.

Queries and Indexes:
– MongoDB supports dynamic queries: db.collection.find({query…}, {fields}). Indexes can be created to speed up the queries.

– CouchDB has commands to define “views”, which are CouchDB’s queries. A view must be defined before it is executed. Indexes are created per view, which means that if we have different views over the same data, the index can be duplicated. This is a really important point in our project, where we will perform different kinds of queries on the region collections, and multiple indexes would waste RAM.
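To make the idea of a dynamic query concrete, here is a minimal Python sketch of a query document that selects the regions overlapping a genomic interval. The field names chr, start, and end come from the regions mentioned above; the helper names overlap_query and matches, and the sample data, are hypothetical. The small matches function only imitates, in memory, the two comparison operators used here ($lte and $gte), so the example runs without a server. With a driver such as PyMongo, the same dictionary could be passed to the collection's find method.

```python
def overlap_query(chrom, start, end):
    """Build a MongoDB-style filter for regions overlapping [start, end] on chrom."""
    return {"chr": chrom, "start": {"$lte": end}, "end": {"$gte": start}}


def matches(document, query):
    """In-memory evaluator for equality, $lte and $gte only (illustration)."""
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict):
            for operator, operand in condition.items():
                if operator == "$lte":
                    ok = value is not None and value <= operand
                elif operator == "$gte":
                    ok = value is not None and value >= operand
                else:
                    raise ValueError(f"unsupported operator: {operator}")
                if not ok:
                    return False
        elif value != condition:
            return False
    return True


# Hypothetical sample data standing in for a regions collection.
regions = [
    {"chr": "chr1", "start": 100, "end": 200},
    {"chr": "chr1", "start": 500, "end": 900},
    {"chr": "chr2", "start": 150, "end": 250},
]

query = overlap_query("chr1", 150, 600)
hits = [region for region in regions if matches(region, query)]
print(hits)
```

The point of the sketch is that the query is just data built at run time: nothing has to be declared on the server before it is executed, which is what a CouchDB view would require.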

MapReduce:
– MongoDB has built-in MapReduce and also the aggregation framework, in which operations like the SQL aggregation commands (count, sum, group by) can be executed. Queries, MapReduce, and aggregation commands must be defined and executed separately.

– CouchDB uses MapReduce directly in all queries, possibly making them execute faster, whereas in MongoDB MapReduce is executed as a separate command. In fact, I do not agree with the idea of using MapReduce as the querying backend. MapReduce was designed for offline batch processing, not for real-time processing. Its concepts are optimized for parallel batch processing, not for fast queries. For the queries that we are going to perform in our project, like retrieving all regions that belong to some experiment, a simpler solution, an index plus multiple shards, could be faster.
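For readers who have not used MapReduce, the following plain Python sketch shows the shape of the computation: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step folds each group. It counts regions per experiment, which is the equivalent of a SQL "count ... group by". It does not call MongoDB or CouchDB; the experiment field and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_region(region):
    """Map step: emit one (experiment, 1) pair per region."""
    yield region["experiment"], 1


def reduce_counts(key, values):
    """Reduce step: add up all values emitted for one key."""
    return sum(values)


def map_reduce(documents, mapper, reducer):
    """Run mapper over every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


# Hypothetical regions, each tagged with the experiment it belongs to.
regions = [
    {"experiment": "exp_a", "chr": "chr1", "start": 100, "end": 200},
    {"experiment": "exp_a", "chr": "chr2", "start": 300, "end": 400},
    {"experiment": "exp_b", "chr": "chr1", "start": 150, "end": 250},
]

counts = map_reduce(regions, map_region, reduce_counts)
print(counts)
```

Every document is visited on each run, which is why this model suits batch jobs better than interactive lookups that an index can answer directly.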

Sharding:
– MongoDB has built-in sharding. This means that it is possible to create and use a cluster of MongoDB instances with just a few commands, without needing any special software or configuration. When we add a new MongoDB instance/shard to the cluster, MongoDB quietly moves data onto this shard, making it easier to support data growth. We did some tests with MongoDB shards and the results were good, with an almost linear gain in query performance.

– CouchDB also supports sharding, but it needs special and not so easy configuration. It is necessary to configure a lot of files, and in a long-term project, where other people will have to maintain it in the future, this kind of configuration is not attractive.
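As a rough mental model of range-based sharding, the toy Python class below splits the shard key space into contiguous chunks and sends each document to the shard that owns the chunk containing its key. This is only a sketch under stated assumptions: the class name, the split points, and the use of start as shard key are made up, and the real MongoDB routing (the mongos router, the config servers, and the balancer that migrates chunks) is not modeled at all.

```python
import bisect


class RangeShardedCollection:
    """Toy model: n sorted split points define n + 1 chunks, one owner each."""

    def __init__(self, split_points, chunk_owners):
        if len(chunk_owners) != len(split_points) + 1:
            raise ValueError("need exactly one owner per chunk")
        self.split_points = sorted(split_points)
        self.chunk_owners = list(chunk_owners)
        self.shards = {name: [] for name in chunk_owners}

    def shard_for(self, key):
        """Return the shard owning the chunk [lower, upper) that contains key."""
        return self.chunk_owners[bisect.bisect_right(self.split_points, key)]

    def insert(self, document, key_field="start"):
        """Store the document on the shard chosen by its shard key."""
        self.shards[self.shard_for(document[key_field])].append(document)


# Hypothetical cluster: three chunks, one per shard.
cluster = RangeShardedCollection([1000, 2000], ["shard0", "shard1", "shard2"])
for start in (500, 1000, 1500, 2500):
    cluster.insert({"chr": "chr1", "start": start, "end": start + 100})

print({name: len(docs) for name, docs in cluster.shards.items()})
```

A query restricted to a key range only has to touch the shards whose chunks intersect that range, and queries that span several shards can run on them in parallel, which fits the near-linear gain seen in the tests above.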

Conclusion:
CouchDB is more suitable for more stable data, where all documents in a given collection have the same fields and the queries will not change much. In this project, one of the main concerns is data and query flexibility, so depending on previously defined queries can constrain development and user interaction. Another important point is how our data size will grow: MongoDB is designed to provide a robust solution for storing the data and allowing it to grow over time; all we will have to do is add new shards to the cluster, without having to change the project code.

Needless to say, the users will never touch the database. The database is a tool with which we store and retrieve the data, not the main focus of the project.

Snakes

from http://www.theverge.com/2013/4/10/4208308/how-to-complete-snake-and-accept-the-emptiness-of-life :

It takes 13 minutes and seven seconds to complete Snake, the decades-old game that enjoyed a renascence through Nokia’s early mobile phones. 13 minutes, seven seconds, one hundred pellets. But what is this endless pursuit of pellets for? What reward lies at the end of this snake’s insatiable desire for food? Nothing. Victory in life only results in death. Immortalized in a two-minute GIF, this foreboding tale of how reptilian consumerism breeds nihilism is a mesmerizing journey of birth, life, and death.

Reading man pages with Preview

Reading man pages in the terminal is really boring, especially using the up, down, space, and “/” keys.

On Mac OS X it is possible to read the man pages in Preview.

Just put this code into your .bash_profile / .profile file and be happy :)

pman() { man -t "$@" | open -f -a Preview; }

After restarting bash, just use the pman command instead of the usual man.

Just to explain:

the -t option formats the man page as PostScript and writes it to stdout, from where it is read by open (the -f option makes it read from standard input) and opened in Preview (-a Preview).

ISS Transit Over The Moon

I’d like to write about programming.
In fact, I am trying to solve a problem with Oracle XML in a project that I know only superficially. For some unknown reason, when the Oracle version changed, some old commands stopped working, and it became necessary to use getClobVal() to get the value of the XML node. Please don’t ask me which version changed; I only want to fix it.

Now… while I was waiting to deploy the new war file, I was looking through my feeds and found this amazing picture:

It is only the ISS transiting in front of the Moon.
Amazing.

source: ISS Transit Over The Moon

Trying to come back

So, I keep receiving emails about comments on the post “Aprenda a programar em dez anos”. I really appreciate it. Even though it is an old text, it is still very relevant and still makes sense nowadays.

I am looking for a place to write about my projects, what I am doing, why I was away for so long… I thought about a new blog, but why a new one if I have this old and (maybe) cool place? Are there still people reading this blog?

So, a little bit about what I am doing:
I have been living in Germany since 2010, where, since the end of 2011, I have been doing my PhD at the Max Planck Institute for Informatics, in the Bioinformatics group. My area is computational epigenetics, where I work with a lot of data, so you will very probably see a lot of texts here about big data and the current hypes related to (epi-)genetics.

I will find something cool, probably about MongoDB, and I will write about it soon.

Building MySQL 5.5.14 on Mac OS X 10.7 Lion with Xcode 4

When I updated my Mac OS X to version 10.7, I had two very disappointing surprises:
First, the “old” Xcode 4 no longer ran; it was necessary to update to a newer version, a very nice 3 GB download.
Second, you open the project in Xcode, choose build/install, and when you run mysqld: segfault!

You spend an hour trying to figure out what happened, checking the configuration, debugging, and you find a weird error when using pthread.

What happened:
The pthread library in Xcode 4.1 has a bug (or feature): the pthread_init function is not exported or does not exist, I do not know exactly which.

How to fix it:
Open the configure.cmake file from the MySQL source and go to line 394, which should be:
CHECK_FUNCTION_EXISTS (pthread_init HAVE_PTHREAD_INIT)

Change it to:

IF (NOT CMAKE_OSX_SYSROOT)
CHECK_FUNCTION_EXISTS (pthread_init HAVE_PTHREAD_INIT)
ENDIF (NOT CMAKE_OSX_SYSROOT)

Generate the configuration for Xcode again:
cmake . -G "Xcode"

Open the project in Xcode, build, install, or run mysqld, and be happy :-)

PS: I tested with MySQL 5.5.14 and Xcode 4.1.

source: https://github.com/mxcl/homebrew/issues/6277
https://github.com/CharlieRoot/homebrew/blob/44db2406d3d71cb5882627581a7dee1c4acbe0a6/Library/Formula/mysql.rb

Nothing new.

This blog, which in its golden days reached more than five thousand visits per month, now struggles bitterly to get a thousand and a few. Fair enough: it is abandoned, I am not giving it much attention, and honestly I do not feel much like writing. At this start of the year, while I wait for my visa to go work in Germany, I am dedicating my time to three extremely important things: not killing, not dying, and not getting anyone pregnant. Other than that, I am enjoying myself, mostly outside the nerd world.

In the nerd world, I dedicate my time mainly to Genoogle and its web interface, to the world of Twitter (@felipealbrecht), and to playing the Starcraft 2 beta. Nothing much, nothing very interesting; the coolest thing I did was give a talk at InfoGlobo about Google and development methods.

Other than that, that is it: my move to Germany, which should have happened on the 10th, has been postponed, in principle, to March 22. There I will work at Karlsruhe Technology Consulting, initially on the development of software for energy conservation.

By the way, I am still looking for charitable nerd souls who feel like developing Genoogle, both its core and its web interface.

My favorite author.

Seriously, I could say that my favorite author is Tolkien, Carl Sagan, or the guy who wrote The Little Prince, but my favorite author is Fábio Hernandes.

Fábio Hernandes wrote for VIP many years ago. I remember being 16 to 18 years old, maybe a bit more or less, reading his texts on the last page of the magazine. I remember his uncle, a wise man from the countryside; I remember, but I also forget, many of the lessons written between his lines. Then today, for no reason, I searched for him on Google and discovered his blog and his Twitter: @ohomemsincero.

If… if I had not stopped reading him in recent years, or even if I had read some of his texts in the last month, I would not have made such childish mistakes.

Just to round this off, a short text of his:

“In a perfect world, love affairs would end at the right time. At the last kiss that worked. At the last time love and generosity triumphed over hatred and pettiness. But that does not happen. We always go past the ideal point when ending relationships. It is the curse of men and women in love.”

My laws.

A while ago, more than 6 years ago, I was very involved with Buddhism and, as is usual in our lives, I was stuck among many doubts: I was at the beginning of university, several friends of mine had temporarily left my life, and I felt lost about what to do with my life, and even disappointed.
To try to make sense of my life, I started writing down what was important to me, how I saw the world, my values, and my goals. In the end, I compiled everything into a list of 12 items.

It may not be the best writing, it may contain clichés and obvious things, or even contradictory ones, but I do not care: this is not a literary work, it is my path. I admit that I keep failing many times, but always with the wish to one day manage to fulfill them.

Below is a copy and paste of my file “leis.txt”:


--
1 - Pain brings strength.
2 - Life is not rosy.
3 - Ignorance brings happiness.
4 - Surpass yourself at every moment.
5 - Do what must be done. Do it! Do not wait and do not fear the price.
6 - Always be ready to die, but never willing.
7 - See beyond what sight shows.
8 - Know what you want.
9 - Time passes; what is right is not permanent.
10 - Fight until your strength runs out. When it runs out, renew it.
11 - Dreams exist to become reality.
12 - Be honest with everyone, especially with yourself.

--

Without a doubt these are not the easiest path and laws to
follow, but they are the ones I agree with and always try to believe in.
They are only 12 lines, little text, but for me each one
holds a great deal of content and meaning.

I could spend my whole life writing about the meaning of each
line, but that is not my goal. I want each person who reads them
to think about them and see the answer beyond what is written.
Many will not understand, others only a little, but if one person understands them
and improves them, I think they will have fulfilled the purpose of their existence.

In this short passage much more is said than can be expressed
in letters. The word is something that exists; the idea is not. I do not want to pass on
the word, I want to pass on the idea.

Sixth day of July, 2003 - 21:45

Felipe Fernandes Albrecht

--

Retrospective 2009

In 2009:

  • I buried my father and missed him deeply, but I take away a great lesson: a car is not a toy.
  • I saw one of the greatest demonstrations of love, from my mother for my father. That was another great lesson: stay close to those you love, especially in difficult moments.
  • I ended a long relationship. Another lesson: stay close to those who add something.
  • I went back to work at a place where I had already worked and had not liked, and I worked there for less than 45 days. Know where you want to work and where you want to get to.
  • I practically learned a new “extra” profession, from which I basically earned my whole living this year: learn new things, get by with them, and see opportunities in the difficulties. A cliché? Maybe, but it is hard in practice.
  • Amid difficulties, both personal and bureaucratic, I finished my master’s degree: stay focused and complete your goals.
  • I started a relationship, but it lasted only a few months: get to know people, respect them, but know the kind of people you want by your side.
  • I started a new practice, yoga: do not get stuck in prejudices and clichés, and do something completely different from what you are used to doing.
  • I went to Germany and the Netherlands: take risks!
  • Thanks to yoga, I met wonderful people. And thanks to these people, I am going to work in Germany: meet people, and know who you want by your side.
  • I launched my first big project: Genoogle (http://genoogle.pih.bio.br). I do not know whether it will be a success or how many people will use it, but I did something. I did it!
  • I made new friendships and strengthened the old ones.
  • I had crushes, some unrequited, others that I did not want to return, but I fell in love, I enjoyed all of them, and I keep only good memories of them.
  • I lost more than 12 kilos: take care of your health. Do not lose weight for your looks, but for your health. Okay, the girls come as a bonus ;-)
  • I taught classes and gave talks at universities. It was really nice, and I think I would be a good teacher.
  • I embraced my stance as a Java Boy, but I know that C# awaits me in 2010. I do not want to be tied to any technology.
  • I started using Twitter. After having bad-mouthed it, I started using it and liked it, but I keep my following list very lean. If you want to follow me on Twitter: http://twitter.com/felipealbrecht.
  • I read books, some nerdy, many technical, several stories, few novels, but all of them taught me something.
  • I talked, got chewed out a lot, chewed out a few people myself, but no hard feelings remained.
  • I took risks and lost a few times, but I learned a lot in this past year. And I am already learning a lot in 2010.

This is a nerd blog, not a personal one, but honestly, I like making mistakes. This past week I made some extremely childish mistakes, but, honestly, they were good. I learned more about people’s personalities (something difficult for a nerd), and the learning has taken root.

So that is it, a good 2010 to everyone! And I promise more nerdy texts this year!

RNA sequence support in Genoogle.

While many people are at the beach, drinking, and discussing Flamengo’s sixth championship title, I am implementing RNA sequence support in Genoogle.

At first it seemed simple: define a new alphabet, removing the T from DNA and adding the U. However, I am doing a much broader job. Now that Genoogle is starting to mature, I am better organizing the class structure and the JUnit tests. As an example of the change, the “diff” is more than 3,700 lines of code. I believe it is worth it. Early next year I want to make a more formal announcement of Genoogle on the main Brazilian and foreign bioinformatics mailing lists.
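The "simple" version of the change, a second alphabet in which U takes the place of T, can be illustrated with a few lines of Python. This is only a sketch of the idea and not Genoogle's code (Genoogle is written in Java); the names DNA_ALPHABET, RNA_ALPHABET, and dna_to_rna are made up for the example.

```python
DNA_ALPHABET = frozenset("ACGT")
# The RNA alphabet is the DNA alphabet without T and with U added.
RNA_ALPHABET = (DNA_ALPHABET - {"T"}) | {"U"}


def dna_to_rna(sequence):
    """Return the RNA spelling of a DNA sequence, replacing every T with U."""
    sequence = sequence.upper()
    unknown = set(sequence) - DNA_ALPHABET
    if unknown:
        raise ValueError(f"not a DNA sequence, unexpected symbols: {sorted(unknown)}")
    return sequence.replace("T", "U")


if __name__ == "__main__":
    print(dna_to_rna("gattaca"))
```

Validating against the alphabet before converting keeps a sequence that already contains U, or any other unexpected symbol, from slipping through silently.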

Besides the work on Genoogle’s internal structure, I would like to work on its web interface. I am a bit short on time, but if you feel like working with GWT (http://code.google.com/webtoolkit/), a little, very little SQL, and a lot of Java, contact me so we can talk!

Genoogle wants YOU!

So, I have just released a new version of Genoogle.
This version, together with the previous one, is a set of improvements and fixes that I had been wanting to make, such as using environment variables for some settings and improving how results are displayed. I have noticed that one of the most tedious parts is releasing a new version. Even with a checklist, it is a bit of a chore to write the changelist, get the version number, generate the binaries, copy them to another repository, check that everything worked, and announce it. The programming and developing part is relatively easy; the hard part is releasing and getting people to use it.

So, that is why I am advertising here again.
If you like genetics, programming, development, or algorithms, and are looking for an interesting project to get involved in, visit the Genoogle page, download the binaries and the sources, join the group, and participate!

Genoogle is a competitor of NCBI-BLAST, one of the most used and cited pieces of software in bioinformatics, and I am sure that Genoogle is capable of grabbing a share of these users!

I have not yet made an announcement on the big mailing lists, because I still want more users, to find and fix more bugs, and to have a more stable version. So, Genoogle needs YOU!

Genoogle!

Genoogle is the software I developed during my master’s degree. Unlike the vast majority of academic dissertations and theses, I decided to research a technique (the scientific side) and to build (the engineering side) working software that implements the technique researched and developed.

The result is Genoogle, which is a “Similar DNA Sequences Searching Engine and Tools”, that is, a search engine for similar genetic sequences, plus tools. Given an input sequence, it searches for similar sequences in a set of sequences. This search uses data indexing techniques, to optimize the search algorithm, and parallelization, to use the computing capacity of multicore computers. Genoogle offers text mode, a web page, and web services as interfaces.
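To give a feeling for what indexing buys in a sequence search, here is a small Python sketch of one common approach: an inverted index from fixed-length subsequences (k-mers) to the sequences that contain them. It is an illustration under assumptions, not Genoogle's actual algorithm or data structures; the function names, the value of k, and the sample sequences are all made up.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every k-mer to the set of names of the sequences that contain it."""
    index = defaultdict(set)
    for name, sequence in sequences.items():
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].add(name)
    return index


def rank_candidates(index, query, k):
    """Count shared k-mers per indexed sequence; best candidates come first."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            hits[name] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy database.
database = {
    "seq1": "ACGTACGTGA",
    "seq2": "TTTTGGGGCC",
    "seq3": "ACGTTTTT",
}

index = build_kmer_index(database, k=4)
print(rank_candidates(index, "ACGTAC", k=4))
```

Only the sequences that share at least one k-mer with the query are ever looked at, so the expensive alignment step can be limited to a short list of candidates, and those candidates can be processed in parallel on a multicore machine.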

Genoogle is licensed under the GPL 3. In other words, the source code is open to modify, study, tinker with, and above all: improve! Indeed, there are several things to improve in Genoogle, from building the query pages with JSP, to code optimization, test creation, and the implementation and testing of new algorithms.
So, if you work in bioinformatics or programming, or you like challenges: visit genoogle.pih.bio.br to get to know the project (and participate!), join the mailing list at http://groups.google.com/group/genoogle and download the software and the sources from http://svn.pih.bio.br/genoogle-packages/current/.

All comments are welcome!