If publishing in BMC Bioinformatics is that simple

Last week I read an article by Fourment and Gillings, A comparison of common programming languages used in bioinformatics [pubmed][doi], in BMC Bioinformatics. As the title says, it compares programming languages often used in bioinformatics: Perl, Python, C, C++, C# (.NET) and Java. The authors stress that each language has advantages for different bioinformatic applications. Fine, I can agree with that, but…

The more I think about the article, the more it vexes me. Besides kicking in open doors, the methods and results leave much to be desired. None of the figures has error bars or standard deviations, which are necessary and would have been trivially easy to add. All programs were written by the same person, who had varying experience in Perl, C++ and Java; the other languages were learned while writing the programs. I think this is a recipe for disaster. Every language has its peculiarities, which can be avoided or used to the fullest only with some decent experience in that language. A colleague of mine (our resident Python expert) classified the BLAST parser as ‘rather messy’ after one short glance.
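To show how little work error bars would be, here is a minimal Python sketch that times a task several times and reports the mean and sample standard deviation. It is not the authors' benchmark code: `time_runs`, `summarize` and `toy_workload` are names I made up, and the workload is a stand-in for a real parser.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Run func() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, sample standard deviation) of a list of durations."""
    return statistics.mean(durations), statistics.stdev(durations)


def toy_workload():
    # Hypothetical stand-in for "parse a BLAST output file".
    return sum(len(str(i)) for i in range(50_000))


if __name__ == "__main__":
    mean, sd = summarize(time_runs(toy_workload, repeats=5))
    print(f"{mean:.4f} s +/- {sd:.4f} s over 5 runs")
```

With a handful of repeats per language and task, the standard deviation is exactly what an error bar in a figure would show.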

I simply cannot get my head around the fact that Python parses a 9.6 gig BLAST output file in 38 minutes while Perl does the same thing in a little more than 7… 38 minutes! A 9.6 gig BLAST output file! I tried these scripts myself on some BLAST output I had lying around (not nearly as large) and found huge differences in processing time using the same script and the same BLAST file. Also… 9.6 gig! They mention the query sequence, but not the database they searched… How do you end up with a relevant BLAST output that large?

I think this article still needs a lot of work before the numbers it reports convince me. I am willing to agree that Python is better than Perl at some things and vice versa, but I have strong objections to how this study was performed. Although it is nice that someone presents actual numbers and figures on how different languages perform, I do not think this is good enough to be published in BMC Bioinformatics.

E-values!

This actually works!

#!/usr/bin/perl
use warnings;
use strict;
print "1e-1" * "3", "\n";

This certainly does make my life easier with respect to HMMER and BLAST E-values!
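For comparison, a small Python sketch of the same idea. Python does not turn strings into numbers on its own, so the conversion has to be written out with `float()`. The helper `passes_cutoff` and its default cutoff are made up for illustration.

```python
import math

# Python does not coerce strings to numbers implicitly, so the E-value
# string has to be converted explicitly before any arithmetic.
evalue_text = "1e-1"          # as it might appear in HMMER or BLAST output
evalue = float(evalue_text)   # float() accepts scientific notation


def passes_cutoff(text, cutoff=1e-5):
    """Return True if the E-value given as text is at or below the cutoff."""
    return float(text) <= cutoff


if __name__ == "__main__":
    print(math.isclose(evalue * 3, 0.3))  # compare floats with a tolerance
    print(passes_cutoff("3e-20"))
```

Note that `"1e-1" * 3` in Python repeats the string three times instead of multiplying, so the explicit `float()` call really is needed.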

BBC 2007: Day 2

Second day at the BBC in Leuven. Quite a few interesting stories that day! One talk was about using structure information to analyze how proteins bind to protein domains that are common in signal transduction. The second speaker actually told us something similar to what I did for my Master's. Though not the same, he did use some of the ideas we also had for comparing protein interaction networks. He gave a link to the pre-published text and I am printing it at this very moment!

The keynote speaker M. Madan Babu gave a brilliant presentation about the structure, evolution and dynamics of transcription regulation networks. He was the first speaker after the break, and his microphone stopped working. When that was fixed, the projector broke down. He could still laugh about it though. When he could finally continue, he told us about regulatory motifs in yeast. When analyzing the network they found a limited set of motifs that were predominant in the network. These motifs were analyzed in an evolutionary context by looking at duplications of the transcription factors and their target genes. Only in rare instances were these motifs explained by duplications, which was counterintuitive. The a priori assumption that transcriptional “hubs” control relatively more duplicated genes also turned out not to hold. They did find enrichment for some types of motifs in specific processes such as DNA replication and sporulation. Feed-forward loops, for example, are enriched in slow processes.
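To make the motif idea concrete: a feed-forward loop is a triple in which regulator X controls Y, and both X and Y control target Z. The Python sketch below finds such triples in a toy directed network. It is only an illustration of the motif definition, not the method used in the talk, and the function name and toy edges are made up.

```python
def feed_forward_loops(edges):
    """Return sorted (x, y, z) triples with edges x->y, y->z and x->z."""
    targets = {}
    for regulator, target in edges:
        targets.setdefault(regulator, set()).add(target)

    loops = []
    for x, x_targets in targets.items():
        for y in x_targets:
            if y == x:
                continue  # skip self-regulation
            for z in targets.get(y, set()):
                # z must be a third, distinct node that x also regulates
                if z != x and z != y and z in x_targets:
                    loops.append((x, y, z))
    return sorted(loops)


# Hypothetical toy network: (regulator, target) pairs.
toy_network = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D"), ("D", "E")]

if __name__ == "__main__":
    print(feed_forward_loops(toy_network))
```

Deciding whether a motif is "predominant" would additionally require comparing such counts against randomized networks, which this sketch does not do.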

When looking at the chromosomal localization of target genes and transcription factors, they found a clear preference for target genes to be concentrated on one or at most two chromosomes. Even within chromosomes, target genes display regional preferences or avoidance. This mapping of preferences could help optimize the expression of exogenous genes regulated by endogenous transcription factors.

The following talks included one on the evolution of chromalveolates, which was very interesting, as well as a talk about MANTiS, an orthology database that is supposed to go online in December. Instead of Inparanoid or bi-directional best hits they use phylogenetic trees, which of course is much better. Instead of general orthology it can infer orthologs vs. paralogs and in-paralogs vs. out-paralogs. The result depends on the quality of the trees used and on how the gene families have been determined, but it is good that this is being done so others can use it.
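To illustrate why a tree tells you more than best hits, here is a Python sketch of the simple species-overlap rule: if the two subtrees below a node share a species, treat the node as a duplication (pairs across it are paralogs), otherwise as a speciation (pairs across it are orthologs). This is my own toy illustration, not a description of how MANTiS works; the `species|gene` label format and the toy tree are assumptions, and the sketch does not separate in-paralogs from out-paralogs.

```python
from itertools import product


def leaves(tree):
    """Return all leaf labels under a nested-tuple binary tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def species_of(leaf):
    # Assumed label format: "species|gene".
    return leaf.split("|")[0]


def classify_pairs(tree, result=None):
    """Label each gene pair 'ortholog' or 'paralog' by species overlap."""
    if result is None:
        result = {}
    if isinstance(tree, str):
        return result
    left, right = tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    overlap = {species_of(g) for g in left_leaves} & {species_of(g) for g in right_leaves}
    label = "paralog" if overlap else "ortholog"
    # Each pair is labeled once, at the node where the two genes diverge.
    for a, b in product(left_leaves, right_leaves):
        result[frozenset((a, b))] = label
    classify_pairs(left, result)
    classify_pairs(right, result)
    return result


# Hypothetical gene family: one duplication after the split from yeast.
toy_tree = ((("human|A1", "mouse|A1"), ("human|A2", "mouse|A2")), "yeast|A")

if __name__ == "__main__":
    for pair, label in classify_pairs(toy_tree).items():
        print(sorted(pair), label)
```

In this toy family a best-hit approach could easily pair human A1 with mouse A2, while the tree makes clear they are separated by a duplication.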

Wrap up: After a bad start with the poster session on Monday, the BBC took off with some very interesting talks. I especially liked the keynote talk by Madan Babu. I’ve noticed that a lot of the research presented in the talks, and especially on the posters, involved making bioinformatic tools for biologists who will not use them. I know I am a bit pessimistic about this, but as a molecular biologist myself I can only wonder.

BBC 2007: Day 1

First day at the BBC in Leuven. My low expectations about the organization were confirmed, but the talks were good and that's what counts. First was one of the keynote speakers: Charles Lawrence. He works on RNA secondary structure prediction. He warned us about optimizing likelihoods, free energies etc., because the optimum might not represent the actual population of RNA structures in a sample. Even though an MFE structure is the optimum, one must not forget entropy. The same is true for sequence alignments, so it is definitely something for me to think about. He recommended an article to read: Miyazawa et al. Prot. Eng. 1994.

The next couple of talks were about microarray data and (transcription) networks, of which my knowledge is limited. The bottom line of most of these talks was “This is how it is normally done, but our algorithm is better”. This is good I suppose, but the last session of the day was the most interesting.

The last session had many interesting talks, but the third one stood out: Victor Guryev of the Hubrecht lab at Utrecht showed us many CNVs within lab rat strains, thereby showing high variation within a species. He could identify CNVs by finding regions where the coverage by WGS mapping was twofold or more higher than average. The regions of these CNVs seem to be conserved in human. Great stuff! I'll keep my eyes open for the publication.
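The coverage idea is simple enough to sketch in a few lines of Python: given read depth per window along a chromosome, report stretches where the depth is at least twice the average. This is only my reading of the principle from the talk, not the actual pipeline; the function name, the windowed input and the toy numbers are assumptions.

```python
def high_coverage_regions(coverage, fold=2.0):
    """Return (start, end) window ranges (end exclusive) with depth >= fold * mean."""
    if not coverage:
        return []
    threshold = fold * sum(coverage) / len(coverage)
    regions = []
    start = None
    for i, depth in enumerate(coverage):
        if depth >= threshold:
            if start is None:
                start = i  # a candidate region opens here
        elif start is not None:
            regions.append((start, i))  # region closed by a normal window
            start = None
    if start is not None:
        regions.append((start, len(coverage)))  # region runs to the end
    return regions


# Hypothetical per-window read depth along one chromosome.
toy_coverage = [10, 10, 10, 40, 40, 10, 10, 10, 10, 10]

if __name__ == "__main__":
    print(high_coverage_regions(toy_coverage))
```

A real analysis would have to deal with mapping artifacts and repeats and would probably use a more robust baseline than the plain mean, which is itself pulled up by the amplified windows.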

Benelux Bioinformatics Conference 2007

Today is the first day of the Benelux Bioinformatics Conference at Leuven. I have been here since Saturday, so I'm sort of getting used to the French keyboard layout :) . Leuven is a beautiful little city with lots of old buildings and good food, so I can recommend visiting. Anyway, I have to be off to the conference. I'll try to make a post tonight about the first day.

What I do: building gene trees

Since October last year I’ve been working as a PhD student at the Theoretical Biology group of the Faculty of Science at Utrecht University. I actually work for the Physiological Chemistry group of Prof. dr. Bos at the Academic Medical Centre, but that’s another story… Below I will explain a bit about what I am doing with my current project. I will try to keep it as uncomplicated as possible.

My project involves studying the evolution of signaling pathways in eukaryotes, and specifically trying to understand how new signaling pathways emerge. A signaling pathway is a chain of events in the cell, carried out by proteins, which has evolved to ‘let the cell know’ what happens outside of the cell so it can react accordingly.

My project is based on the observation that complex eukaryotes (like us) tend to have had many duplications of key proteins, and that the copies have gained their own functions and regulate different processes. My job, in short, is to find out approximately when, why and how this happened for some specific protein families.
