The primary quest of man (barring survival and self-preservation) has been to understand his place in the universe. What distinguishes planet Earth from the rest of the known universe is the presence of life. Arguably, the fundamental problems faced by humans are life science problems.
Computational science and the life sciences were once treated as entirely separate, even contrasting, fields. The emergence of bioinformatics shattered that mindset. As newer challenges emerge, interdisciplinary approaches to studying the life sciences are becoming more popular. Relatively new computational approaches are now showing a lot of promise, and are opening up exciting new frontiers in systems and computational biology. Computational science is being used not just to analyse data but also to model and simulate living organisms.
Open source software and technology are heavily exploited in these efforts, and it would not be a stretch to say that this keeps the research open and freely accessible.
The data explosion
Today, the data produced by experiments in the life sciences is growing at twice the rate of the storage capacity of computer hardware. New data is already arriving faster than we can make sense of it. Indeed, life sciences data today is accessed as much by non-life sciences professionals as by those in the life sciences field.
Genomes are one of the main causes of the data explosion. A genome is the encoding of an organism's DNA, so to speak, and is a reflection of the complexity of life. The code size of a genome (measured in number of base pairs) ranges from 1.8Kbp (kilo base pairs) for a virus to about 3.2Gbp for humans. While a single human genome takes up less than a gigabyte, it is customary to store 100 times as much data per individual to eliminate errors. With the cost of DNA sequencing falling by orders of magnitude over the past few years, the storage requirements for the genomes of a sizeable population will certainly strain the available storage capacity on the planet. Clearly, newer approaches to minimising storage requirements are needed and are being actively explored.
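The 'less than a gigabyte' figure can be checked with a back-of-the-envelope calculation. The sketch below assumes each base (A, C, G or T) is packed into two bits, which is a simplification made here, not a description of any particular file format. The function name and constants are purely illustrative.

```python
# Back-of-the-envelope genome storage estimate.
# Assumption (ours): each base A, C, G or T is packed into 2 bits.
BITS_PER_BASE = 2


def genome_size_bytes(base_pairs: int, coverage: int = 1) -> float:
    """Bytes needed to store `base_pairs` bases, repeated `coverage` times."""
    return base_pairs * BITS_PER_BASE * coverage / 8


HUMAN_BP = 3_200_000_000  # about 3.2Gbp, the figure quoted in the text
VIRUS_BP = 1_800          # about 1.8Kbp, the figure quoted in the text

human_gb = genome_size_bytes(HUMAN_BP) / 1e9
human_redundant_gb = genome_size_bytes(HUMAN_BP, coverage=100) / 1e9

print(f"Human genome, packed:   {human_gb:.1f} GB")
print(f"With 100x the raw data: {human_redundant_gb:.0f} GB")
print(f"Small virus, packed:    {genome_size_bytes(VIRUS_BP):.0f} bytes")
```

Under these assumptions, one person's genome fits on a memory card, but the 100-fold redundancy multiplied across a large population quickly runs into petabytes.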
The Ensembl (ensembl.org) genome browser-cum-database uses open source technology (MySQL, Perl, Apache, Git, etc) not just to make the genomic and other data interactively accessible, but also to enable researchers to annotate and update genomic data. It is a massive collaborative effort. In true open source spirit, it has a developer site and allows you to install and host your own Ensembl site.
A genome database is quite unlike other databases, for instance, those that have details about cell phone users or registered voters. This is because genomic data is continuously changing. The changes could be because the data becomes more species-specific and more complex. As more research is conducted using the model organisms whose genome databases are available, more details get added.
While existing database technology is already in use in most cases, the need for newer paradigms is being felt, especially to help explore the complex interrelationships of the data and to search for complex sequences. Technologists are on the lookout for an approach to integrate various types of biological data, which would make it easier to draw more useful conclusions. Bio4j (bio4j.com) is a bioinformatics platform built on a graph database, which integrates data from various protein and gene ontology databases.
The complexity of life processes
Living organisms are examples of complex systems, functioning by means of complex pathways which in turn may also interact with each other. In other words, living organisms are not merely a sum of their genes; the interaction between genes gives rise to many emergent properties, which cannot be studied only by summing up the functions of individual genes.
WikiPathways (wikipathways.org) is a community resource for curating cell pathways. It is an open resource for users and contributors alike.
Cytoscape (www.cytoscape.org) is an open source software platform, specifically designed to visualise complex biological networks. It also offers web-based interactive views of public bioscience databases. Indeed, the best network and exploratory data analysis tools need to be exploited to analyse the complexity of cell processes.
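To give a flavour of what network analysis means in practice, here is a minimal sketch in plain Python. It does not use Cytoscape or any of its APIs. The gene names and interactions are hypothetical, invented only for illustration. The sketch builds an undirected interaction network and ranks the nodes by how many partners they have, a simple way to spot 'hub' genes.

```python
from collections import defaultdict

# Hypothetical gene interactions, invented for illustration only.
interactions = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
    ("geneD", "geneE"),
]


def build_network(edges):
    """Turn a list of pairs into an undirected adjacency map."""
    network = defaultdict(set)
    for a, b in edges:
        network[a].add(b)
        network[b].add(a)
    return network


def rank_by_degree(network):
    """Nodes sorted by number of partners (highest first), then by name."""
    return sorted(network, key=lambda node: (-len(network[node]), node))


network = build_network(interactions)
for node in rank_by_degree(network):
    print(node, len(network[node]), sorted(network[node]))
```

Real pathway data is far larger and carries edge types and directions, which is where dedicated tools earn their keep.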
Data is exploding at a rate far greater than that at which humans can assimilate or make sense of it. Data mining and machine learning techniques help researchers with decision making, classification and clustering of biological data, and also help them understand complex relationships and find patterns in that data.
Orange (orange.biolab.si) is an open source tool available for data mining through visual programming or Python scripting. It also has extra features for text mining, bioinformatics and data analytics. Other machine learning software tools such as scikit-learn can be used equally well to mine biological data.
Biological text mining works on literature produced in the areas of medical and molecular biology. The technique helps in the extraction of information about biological processes and diseases. For example, information mined from biology literature has helped to determine the connection between magnesium and migraine headaches. Interestingly, this connection was discovered by Don R. Swanson, an American information scientist, and was clinically validated only later.
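The reasoning behind this kind of discovery is often summarised as follows: if one set of papers links A to B, and a separate set links B to C, then a connection between A and C is a hypothesis worth testing. The sketch below illustrates that idea on toy data. The 'documents' and their terms are made up for illustration and are not the result of mining any real literature, and the function name is ours.

```python
def linking_terms(documents, a, c):
    """Terms that co-occur with `a` in some documents and with `c` in others.

    Returns an empty set if `a` and `c` already appear together, since the
    link is then stated openly rather than hidden in the literature.
    """
    with_a, with_c = set(), set()
    for terms in documents:
        if a in terms and c in terms:
            return set()
        if a in terms:
            with_a |= terms - {a}
        if c in terms:
            with_c |= terms - {c}
    return with_a & with_c


# Toy documents, each reduced to a set of key terms (illustrative only).
documents = [
    {"migraine", "platelet aggregation"},
    {"migraine", "vascular tone"},
    {"migraine", "aura"},
    {"magnesium", "platelet aggregation"},
    {"magnesium", "vascular tone"},
    {"magnesium", "bone density"},
]

print(sorted(linking_terms(documents, "magnesium", "migraine")))
```

Real text mining systems add term extraction, synonym handling and statistical ranking on top of this basic intersection step.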
Epidemiological models to study the spread of diseases
Mathematical models are especially useful to study and manage infectious diseases. Such models help to predict the degree of vulnerability of different areas, the rate of spread of the disease and also the measures that need to be taken to control the spread of the disease. Mathematical models and software help authorities calculate the effectiveness of public medical interventions like mass vaccination programs.
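A classic example of such a model is the SIR model, which divides a population into susceptible, infected and recovered groups. The sketch below is a minimal version, stepped forward in time with simple Euler integration. The parameter values are arbitrary choices for illustration and do not describe any real disease. Vaccination is modelled crudely, by moving a fraction of the population straight into the recovered (immune) group before the outbreak starts.

```python
def simulate_sir(population, infected, beta, gamma, days,
                 vaccinated_fraction=0.0, dt=0.1):
    """Euler integration of the SIR model.

    beta  = infection rate per day, gamma = recovery rate per day.
    Returns (peak infected, final susceptible, final infected, final recovered).
    """
    s = (population - infected) * (1.0 - vaccinated_fraction)
    i = float(infected)
    r = population - s - i  # vaccinated people start out immune
    peak = i
    for _ in range(int(days / dt)):
        new_infections = beta * s * i / population * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        peak = max(peak, i)
    return peak, s, i, r


def herd_immunity_threshold(beta, gamma):
    """Fraction that must be immune for the outbreak to shrink: 1 - 1/R0."""
    r0 = beta / gamma
    return max(0.0, 1.0 - 1.0 / r0)


# Illustrative parameters only (R0 = beta / gamma = 3).
N, BETA, GAMMA = 100_000, 0.3, 0.1

peak_none, *_ = simulate_sir(N, 10, BETA, GAMMA, days=200)
peak_vax, *_ = simulate_sir(N, 10, BETA, GAMMA, days=200,
                            vaccinated_fraction=0.7)

print(f"Herd immunity threshold: {herd_immunity_threshold(BETA, GAMMA):.0%}")
print(f"Peak infected, no vaccination:  {peak_none:,.0f}")
print(f"Peak infected, 70% vaccinated:  {peak_vax:,.0f}")
```

Comparing the two runs shows why coverage above the threshold matters: once enough people are immune, each case infects fewer than one other person on average and the outbreak dies out instead of growing.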
The R statistical computing environment has a set of Outbreak Tools under its R-epi project. These tools help make sense of the diverse and complex outbreak data by statistically analysing and visualising it. SAGES (Suite for Automated Global Electronic bioSurveillance) is a collection of open source software tools that can be used to closely examine the spread of any disease.
Nobody expected that in vivo and in vitro would be followed by in silico. In silico is another way of saying that the organ or organism is living inside a computer or is being simulated by it. Some specialised companies market virtual organs or simulated disease models.
In silico biosimulation of diseases is helping scientists to model and understand them like never before. These models not only help simulate the disease but also help create virtual populations which are susceptible to the disease. The simulation of complex biological networks and pathways helps in the processes of drug discovery and testing. It enables researchers to take informed decisions, thus optimising clinical trials to check the effectiveness of new drugs.
PySB is a Python framework for modelling, simulating and visualising biochemical processes. Biopython is a set of Python tools for computations related to biology. Finally (for this article, at least), Bio-Linux is a Linux distribution for biologists and life science researchers. Based on Ubuntu 14.04 LTS, it offers a large collection of software packages for the life sciences. The open source model is also encouraging new forms of distributed research and crowdsourcing, one of them being the Open Source Drug Discovery project (osdd.net).
The use of computer science to deal with challenges in the life sciences is a huge leap for research in this area, as it has opened up many new opportunities. It has also introduced new techniques that make it easier to understand the complexity of living systems. Moreover, it helps us do away with many tedious procedures that need to be performed repeatedly in a wet lab. Undoubtedly, the new computational approaches are transforming the life sciences for the better and, additionally, making research more open, collaborative and accessible.