Numbers, data, lists: the end of Biology or the dawn of a new era?

These notes were the reference for a lecture given at the Danby Society of Downing College on 14 February 2018 under the title ‘Any questions? The challenges of Science in a data driven world’.

The era of the History of Science ushered in by the publication of Newton’s Principia Mathematica is coming to an end; we need to take stock of this and consider the possibilities that a new, data-centered and data-driven reality offers us. The inflexion point that Newton represents does not mean that there was no Science before then. There was a lot that we would recognize as Science; it just had a different fabric and was called Natural Philosophy. But Newton’s magnum opus brought together observations, theory, deduction, generality and prediction (the elements of modern Science) in a manner that had not been done before, and showed the way to the future. That future is starting to feel like the past now, as it is replaced by a different way of practicing Natural Philosophy. We can either surrender to the trend that data rules or try to marry the tradition that has brought us so much intellectual and practical progress with the beast that ‘big data’ represents. We have a choice.

One of the achievements of Newton was to build on the work of his predecessors, particularly that of Johannes Kepler, an astrologer with a deep interest in deciphering the laws of planetary motion. He was a Copernican, i.e. he believed that the planets orbit the sun, against the prevalent view of the time that they, along with the sun, revolved around the earth. He also had a gift for geometry and calculation, which he used in his day job: horoscope making. Kepler’s legacy to Newton was the derivation of empirical laws for the movements of the planets that proved not only right but a match to the fire Newton was to ignite. But the reason for Kepler’s achievement was not only his insight but also the data he had access to, the data he sought. At the time the scientific community (in so far as there was ‘Science’) was small and Kepler knew where to find data. He knew about Copernicus’s ideas and knew that at the time they were more of a view than a fact anchored in observation; in fact people still fitted the data to the geocentric theory, with the dire consequence of models that were difficult to comprehend. Taken with the possibility that the sun was at the center of the then known Universe, he set out to think about it, with data. And the best data, he knew, belonged to one Tycho Brahe, a colourful Danish character living in exile in Prague who had made the most accurate observations of planetary motion of the time, in particular of Mars. It was when he put these data together with his intuition and geometrical talent that he produced the three celebrated laws that helped Newton make his case.

Many years later, in 1859, Darwin published ‘On the Origin of Species’, which is a cornerstone of the biological sciences and, many believe with reason, the beginning of modern Biology. Behind that famous book there are over twenty years of work: patiently gathering data, corresponding with breeders and naturalists, poring over collections and observations of plants and animals, and threading it all together around an idea whose seed was laid on board the Beagle, that there is a unity to the natural world that spans millions of years, that what we see today is the result of descent with modification: Evolution by Natural Selection. The statement ‘I think’ in the 1837 sketch of the first phylogenetic tree is a witness to the fact that there was an idea at the inception of the theory. And so, as with Kepler’s laws, what Darwin delivers twenty years later uses data to shape an idea and not to beget the idea. The data is sought, screened, used selectively and forged to shape the idea and, of course, the idea is in turn shaped by the data. Mendeleev and the periodic table, Watson and Crick and the structure of DNA, Schrödinger and Heisenberg and quantum mechanics are all examples of the same. This has been the way of Science for the last 400 years. But this is changing, and changing fast.

Advances in computer science and social developments have ushered in an unprecedented ability to gather and store data from any source in enormous quantities. They have also unleashed a deeply hidden human hunger for data. Google is but one reference for what this means, and there are many others. When you go to your supermarket you are being tracked and analyzed through your loyalty card; this is why the discounts offered on the back of your receipt match your pattern of consumption, and why, if you bother to look in your junk email box, you will find advertisements subtly tailored to your internet search patterns. And it is not just the supermarkets and the companies that want to know about you but also governments and social media giants, screening emails and internet searches; sometimes with reason, sometimes without. You’d be naïve to think that you are reading this and that nobody (well, not a human but a machine) is watching. And we are curious too; remember that Google processes 40,000 searches per second. Google is becoming our surrogate memory: where the skill used to be to remember, today it is to ‘know how to search’. We also take advantage of this ‘data wind’ to create and to expose ourselves: 300 hours of video are uploaded to YouTube every minute, and your phone is full of pictures that you will not have time to look at in the next year. We live in the era of data and with this comes the need to analyse it, to use it, perhaps to understand it, i.e. to turn it into information. But with so much of it we need special mechanisms that trawl through it and produce patterns.

Today we hear about Machine Learning (ML), Deep Learning (DL), Artificial Intelligence (AI) and much more, much in the way that we use computers: without understanding how they work. These terms and what they represent are permeating our lives. I don’t know much about them (though people in my lab do and try to teach me) but what I know tells me that these are powerful methods to analyse data and to find patterns in a volume of data that a human being could not go through in a lifetime. And it is this, the combination of our ability to collect large amounts of data and the methods developed to trawl through it, that is changing the fabric of Science.

Often I find myself wondering whether, if one could feed all the data available to Kepler to some ML algorithms (and I mean ALL, as he had to discern which data to use and which not, which is why he sought out Brahe), they would come up with Kepler’s laws, particularly the third one, which is the result of a careful selection of data and the answer to a simple but precise question. We could ask the same question of Darwin. If we expected a positive outcome, in both cases we would have to accept that data can abolish the influence of the history of intellectual thought and of personal intuition, which was so important in both cases and in others too. Kepler built on Brahe but also on Copernicus and Galileo, while Darwin built on his grandfather and also on Lamarck, Malthus and Chambers. A different and equally useful question is whether Kepler and Darwin, given all the information we can gather today about celestial mechanics or about species around the world, would come up with their views. My suspicion is that they would not, because they might become overwhelmed. A key element of their epiphanies was the slow cooking of ideas built piecemeal from small, incremental pieces of selected data, turning data into information, information into principles, principles into laws.
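
The point about Kepler’s third law being ‘the answer to a simple but precise question’ can be illustrated with a minimal sketch. It assumes that somebody has already done the hard part: chosen the two relevant quantities (distance from the sun and orbital period) and asked how one scales with the other. The numbers are modern rounded values for the six planets known to Kepler, not Brahe’s observations, and the function name is made up for this illustration. With the question fixed, an ordinary least-squares fit on logarithms is enough to recover an exponent close to 3/2.

```python
import math

# Assumption: modern rounded values, used only for illustration.
# name: (semi-major axis in astronomical units, orbital period in years)
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def fit_power_law(pairs):
    """Least-squares fit of log(T) = k * log(a) + c; returns (k, c)."""
    xs = [math.log(a) for a, _ in pairs]
    ys = [math.log(t) for _, t in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    c = mean_y - k * mean_x
    return k, c


# The exponent k in T ~ a**k; Kepler's third law corresponds to k = 3/2.
exponent, intercept = fit_power_law(list(PLANETS.values()))
print(f"T is proportional to a^{exponent:.3f}")
```

The fit is trivial once the two columns have been selected; nothing in the code decides that period and distance are the quantities worth comparing, which is the part of the work the text attributes to Kepler.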

Why am I talking to you about this? After all, you know most of it in one way or another. I guess I am doing two things. On the one hand, I am reflecting on the obvious: that we are moving into a data-driven society in which what we do and how we go about it is determined, shaped and fostered by data, big data. On the other hand, as a scientist I feel (and I am not the only one) that this is changing the essence of Science and the method that has been prevalent since the XVII century. Sometimes it is possible to use data creatively for a good purpose, as the work of the late Hans Rosling shows (see his Gapminder website). Notwithstanding this example and the more sinister commercial uses of data and statistics, the question for those of us who do Science is: what about it? And nowhere is the significance of data more obvious than in the Biomedical Sciences, for it is here that, because of the nature of biological systems, numbers and data bloom in unsuspected ways.

Data? Numbers? Don’t look at the sky, look at the earth! There might well be 10 trillion stars in the Universe but, from the moment of inception, a human being grows about 37 trillion cells, i.e. with a population of about 7 billion, there are more than 10^23 human cells kicking around on the planet. It is ironic that there is a project, the ‘Human Cell Atlas’, which aims to create a directory of this, and to the human panoply you may want to add other animals. In addition, there might well be 200 different cell types but, for example, the reason why you are here and able to listen to me is that you have one hundred billion neurons, each of which makes, on average, 7,000 synapses! There is an important difference between stars and cells that bears on these numbers: pretty much all those 10 trillion stars are the same whereas, as we are starting to see, cells are different from each other and will make more cells that will be different again. A three-year-old child has 1 quadrillion synapses and they are all doing different things! These numbers are mind-boggling but you should not be surprised: that is what makes any animal what it is. More numbers: it turns out that our trillions of cells share their existence with one trillion bacteria, and don’t forget that your body right now, as you listen to me, is making 200 million new cells every minute, of which 30 million are red blood cells. What a machine! We can go on: there are about 8 million species on the earth (give or take one million) and, at an average gene content of 40,000 genes per species (not including or getting philosophical about how much is coded in a gene), this gives a figure in the hundreds of billions, approaching a trillion, for the total number of genes in the biosphere.
A recent fad comes from the technical development that enables us to look at gene expression at the level of individual cells. What we find is that cells are very different from each other in terms of the genes they express; that the trillions of cells we have use combinations of our 70,000-odd genes to create their own identities; that during evolution living systems explore the combinatorial space that results from having 70,000 elements at work, a canvas of 70,000 colours with which to create a living system. As it happens, more likely than not the 200 million cells that you are making every minute will have patterns of gene expression different from each other and from those of the cells that will come out in the next hour. Exhausting, but also intriguing. Does it all matter?
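
For readers who want to check the orders of magnitude, here is a rough sketch that multiplies out the round figures quoted above. The inputs are the lecture’s own round numbers, not measurements. The last calculation rests on an assumption that is not in the text: it treats each gene as simply ‘on’ or ‘off’ in a given cell, which is a crude way of putting a size on the combinatorial space of expression patterns.

```python
import math

# Round figures as quoted in the text (not measurements).
CELLS_PER_HUMAN = 37e12
HUMANS = 7e9
NEURONS = 100e9
SYNAPSES_PER_NEURON = 7_000
SPECIES = 8e6
GENES_PER_SPECIES = 40_000
GENES_PER_CELL = 70_000

# Straight multiplications of the quoted figures.
human_cells = CELLS_PER_HUMAN * HUMANS
synapses = NEURONS * SYNAPSES_PER_NEURON
genes_in_biosphere = SPECIES * GENES_PER_SPECIES

# Assumption (not in the text): a gene is either "on" or "off" in a cell,
# so there are 2**GENES_PER_CELL conceivable expression patterns.
# The number is too large for a float, so work with its base-10 logarithm.
log10_patterns = GENES_PER_CELL * math.log10(2)

print(f"human cells on the planet: {human_cells:.1e}")
print(f"synapses in one brain:     {synapses:.1e}")
print(f"genes in the biosphere:    {genes_in_biosphere:.1e}")
print(f"expression patterns:       about 10^{log10_patterns:.0f}")
```

Under the on/off assumption the count of possible patterns is a number with more than twenty thousand digits, which gives a sense of how little of that space any organism, or any catalogue, can sample.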

There may be 10^80 atoms in the Universe but the dynamic nature of living systems has easily surpassed this number in terms of the number of genes scrutinized by Natural Selection in the course of Life History up to now. As Stephen J. Gould puts it:

For sheer excitement, Evolution as an empirical reality, beats any myth of human origins by light years. A genealogical nexus stretching back nearly a billion years and now ranging from bacteria to the highest Redwood tree, to human footprints on the moon. Can any tale of Zeus or Wotan top this? When truth value and visceral thrill thus combine then, indeed, as Darwin stated in closing his great book, there is grandeur in this view of life!

But the global view that Gould invites us to admire is being lost in the fog of its details: the numbers and structure of genes, of cells, of organisms. We are losing sight of the picture and the questions it raises.

Are there no more questions? Is our obsession with data a reflection that we know all we need to know? Perhaps, surely, there are realities lurking behind these numbers that we cannot fathom or that we are too lazy to figure out. Perhaps, as Umberto Eco says, it is that ‘When we cannot provide a definition by essence for something and so, to be able to talk about it, to make it comprehensible or in some way perceivable, we list its properties’. We have been doing this for centuries but now it is done in a manner that escapes our comprehension; this, to me, is how projects like the ‘Human Cell Atlas’ feel: a catalogue, not a map. Projects like this, to which we can add the ‘many genomes’, ‘all species genomes’ or ‘prehistoric genomes’, are expensive bean-counting exercises; no doubt they have some value, but they are expensive counting, listing and classifying without a reason, exercises that use ML and DL to try to grasp a meaning. Something might, and I don’t doubt will, come out of them. Somebody, somewhere, sometime will have a question. In the meantime we should not forget how we got here: that we have laid down a basis for understanding this complex web of cells and genes without knowing its details, the microscopic details that we are so keen on at the moment. In turn, let’s face it, these details have yet to give us something more than lists. They will, but only if we do not forget that there are questions that require an answer.

So, how do those of us who have been brought up in the classical tradition deal with this? What is the future of the biomedical sciences? Importantly, is there room for the ‘old way’? Or is it that we do not need any more Feynmans or Keplers or Darwins, that there are no more important questions? What follows is a personal view, which may well be wrong but which tries to rationalize what is happening before it devours some of us.

The biomedical sciences are slowly becoming the science of big data and one can envision two streams emerging. One is the gathering, classification and, where possible, analysis of lists. The 1000 Genomes Project, the cancer genome project, the species genome project and the Human Cell Atlas are examples of such endeavours. They are more technical than intellectual and in the long run will deliver useful information, though it is likely that most of it, particularly the transcriptomics, will have to be redone in the future when technology and analysis settle down. At the moment we feel we do not have the means to cope with and analyse this information, but the main reason for this is that we have not thought of the questions that it can answer. There might already be enough data to answer some questions, but we need to think of the questions. Gene ensembles will be linked to diseases and this information will be used to create cures and medical treatments. We shall slowly define what an individual is, and in the already evident variation we shall discover much. These projects, basically the development of utilitarian approaches, are becoming the realm of Biology because its nature lends itself to this. In some ways this reflects the state of a collective that has run out of ideas or that, not having any, wants to make a project out of lists, and there are many lists to make in Biology. Today much funding goes into these projects. Accepting that this is the way it is or, at least, the substrate for the future, I can see three activities in the future biomedical sciences:

  1. Large institutes and institutions that pool resources and provide exceptional working environments for young(ish) scientists, where a redefined biosciences will develop. These institutions will accumulate funding (at the expense of research in universities and small research groups) and will enable some people to pursue questions of merit, but for the most part what they will be doing is producing what we now call ‘papers/publications’, occasionally adding elements of value to the data mountain. For the most part they will add quality data. Hype will be an important element of their production and ‘too big to fail’ will be their motto. Occasionally there might be something significant but it will be difficult, as it is now, to distinguish the signal from the noise.
  2. A different strand will have a very applied slant and will thrive on data. Some of the substrate for this work will come from the large institutions (1) but, principally, it will rely on a new, very applied Biology, where a third-generation biotech (now emerging) and the introduction of engineering approaches will lead to very exciting and novel approaches to, and results on, biological problems. Society will profit a lot from this and I suspect that this is where the most interesting developments will happen. The interesting element of this strand is that the value of the research will be determined, principally, by its practical value and not by its publication impact.
  3. Finally, there will also be room for a more classical approach, but with a twist. There will be room for people interested in questions (some of them will be able to work in the institutions summarized in 1, as long as they don’t fall into the many traps that such places contain) and they will have a chance to work on those questions much as mathematicians do today with problems like Fermat’s Last Theorem and the Riemann hypothesis. Somebody will define such questions, perhaps as David Hilbert did in mathematics, and these people will work in isolation or in small groups to answer them. The ability to understand and access data will be essential here. On occasion, as happens in Physics, these questions might lead to consortia to test hypotheses or predictions.

In the long term, and with developing limitations in funding, it is very likely that 1 will be pushed into various forms of 2 or the mega-projects of 3. A devaluation of the current notion of publication (which is undergoing a transformation) will help this shift, as publication is an important component of 1. In the long term there is a chance that the classical scientific method will be restored in the biomedical sciences, though it will be in a new form.

As Bob Dylan said

Come gather ’round people
Wherever you roam
And admit that the waters
Around you have grown
And accept it that soon
You’ll be drenched to the bone.
If your time to you
Is worth savin’
Then you better start swimmin’
Or you’ll sink like a stone
For the times they are a-changin’.

Not only is there nothing wrong with these changes, but they bring challenges and opportunities. If we are honest, people of my generation should not be pining (as some do) for a bygone time. Much of the problem of today’s biomedical sciences is that they have not adapted to the times: the system of decision making, attribution and research orientation is the one that operated in the 80s and 90s, a system largely run by people who were successful at a very different time with a very different system and who do not cater for a constituency that has grown up with different needs and aspirations and, importantly, in a different world. Those of us from that era should be jealous of the possibilities that have opened up for people with an interest in Science and the background to tackle it, rather than nostalgic for a period and for ways which have already made their contribution. What is important, though, is not to forget that data helps answer questions and make progress, and that the power of what we can do today is not in the making of lists but in answering Questions which, even if we can’t formulate them today, we should strive to find. Let’s not be complacent in the midst of the embarrassment of data we live in. And let us not forget that this data revolution has not yet delivered anything that matches the discoveries that people made with less data: the laws of genetics, the structure and function of DNA, the fundamentals of genetic circuits… It is disingenuous to think, as some people do these days, that one can reach fundamental principles just by looking at data or by throwing data into ML or DL algorithms.

The important bit in computational biology is ‘Biology’ and the need to know what to compute. Questions are missing in today’s changing times and it is important that we start making the list that matters most, the list of important questions in Biology. Let’s do this before we forget that we can do it.