The Launch of the Human Cell Atlas Project: A Catalogue of Trillions of Cells Without a Question?

A catalogue is not a map and if you want to transform a catalogue into a map, you probably need to understand what the map is for.

The launch of the Human Cell Atlas Project (https://www.broadinstitute.org/research-highlights-human-cell-atlas) has caught the eyes and the ears of newspapers, the mind of some scientists and, obviously, the imagination of the leaders of the project. If you see the headlines, it might tempt you. After all, the prospect and challenge of learning the secrets of 35 trillion cells (3.5 x 10^13) is an impressive undertaking. The number reminds us of Avogadro’s number and, of course, there is the challenge and the excitement for the technologists and the computer scientists associated with such large numbers. There is a lot of money and human power at stake and, insofar as I can tell, very little forethought. Yes, there are the challenges of the logistics behind such projects, but one can think of many projects that aim to collect and organize data from biological systems and do not make it to the headlines. This one, I suspect, builds on an ongoing trend for genomic data gathering, organization and analysis inspired by our technological developments and also on the fact that, when it comes to data, Biology rules because biological data are about the generation, control, selection and maintenance of variability at different scales, from molecular to organismal. This does not mean that all of it is meaningful and this, I think, is the main oversight of a project which has more cloth than substance. One often has the feeling that biologists are searching for their own Manhattan project. First it was the human genome (then any genome and many genomes), then the connectome (or connectomes), then, more quietly, the human proteome (it is more complicated) and, now that we are getting tired of -omes, we begin to talk about Atlases.

Don’t get me wrong, there is something useful here but, are there not better investments of talent and money in the same or related fields at this point in time? I have followed the single cell field from its emergence and have recently commented on its current status (http://amapress.gen.cam.ac.uk/?p=1765), so I will not repeat myself since much of what I said there applies to this project. However, let me point out the surprising naïveté behind this project and the lack of understanding of what maps, Atlases if you want to give them a name, mean in Biology. The idea of this project is to obtain information about the transcriptome of individual cells and use this as a reference base for... what? What is the map for? We are told that “Without maps of different cell types, where they are located in the body, and the genes they express, we cannot describe all cellular activities and understand the biological networks that direct them” and, with the usual references to health and medicine, that “A cell atlas has the potential to transform our approach to biomedicine. It would help identify markers and signatures for different diseases, uncover new targets for therapeutic intervention, and provide a direct view of human biology in vivo, removing the distorting aspects of cell culture”. At first sight there is nothing wrong with this, and a lot of promise, except that we already have experience of similar projects at a smaller scale and, when you look at them, these projects pose many questions which we have not answered yet. And many of those questions have a common denominator: what kind of questions can one ask of such data beyond cataloguing and organizing? Forget the hype, what is this for? What is this about? If there are no clear answers to these questions, why go bigger blindly?

A significant problem with this idea of cataloguing (a catalogue is not a map) trillions of cells into classes based on their identities lies in what we have already learnt from similar smaller scale experiments: populations of cells in a similar state exhibit large heterogeneities in gene expression which bedevil their identity (however we define it) and which will vary between individuals and certainly with age. We use these heterogeneities to classify but we don’t understand their meaning. So, which age shall we look at? Which individuals? DNA is constant, RNA is not. Furthermore, at a technical level we are still not sure how deep we have to go into the sequence analysis of the cells to get meaningful information, and we still do not know how to read this information. Most of the projects already out there seem to be more about sequencing more cells, faster and deeper, about new algorithms to organize the data and about data visualization than about Biology. The Biology tends to be left out, as if the structure of the data will yield the Biology for free. Sure, there are biologists in most projects. But in several years of maps at a smaller scale than what is proposed here, I have not seen much ‘sense’ in this field and the questions that it poses are left unanswered or, worse, unuttered. What is more worrying, in a field like the hematopoietic system, which appears to be a test bed for this field, there are two or three papers a month with little cross referencing and few general messages beyond slight reshuffles of the maps that years of classical Biology had produced and, occasionally, a few new markers for cell types. And yet, I am sure that there is much that can be learnt from a proper look at the data which, in turn, would determine how we gather the data. There is something to ponder here, in particular the question of what we want these catalogues for.

By the mid-1800s physicists knew that the macroscopic variables of a physical system depended on the atoms and molecules that made it up. So, for example, the Temperature and the Pressure of a gas were/are a consequence of the velocities and relative positions of the constituent molecules. The question was how to relate the two. If they had had much of today’s technology, they (some) might have opted for measuring the momentum and position of every molecule and, through lengthy calculations, working out the answer. There are about 600 trillion molecules in a nanomole of a gas so, you can see the headline: Physicists engage in a Molecular Atlas that will transform physics and engineering: Trillions of molecules to be mapped in terms of their velocities and relative positions. Furthermore, we would be told that this will be done for a couple of the fundamental molecules in the Universe, thus ushering in a new era of technology. This, of course, does not make much sense and it was the prescience of JC Maxwell and, in particular, L Boltzmann that showed the way to deal with the problem. In this manner they created Statistical Physics which, in addition to producing very exciting Science, did transform Physics and Engineering. Now, Biology of course is not Physics, and the identity of a cell is not the state of a molecule, but there are analogies that can be used in the analysis of transcriptomes and maybe we should engage deeper with those analogies rather than going for the trivial. It is possible that what we want to know is not the ‘identity’ of every cell (though I would argue that our view of this is, still, primitive, superficial and inaccurate) but what those individual identities average to, what it is that is being read at the higher level of organization by the cells.
After all, experiments (experiments, not cataloguing exercises) tell us that the heterogeneities that are observed in the analysis of single cells in populations are dynamic, though we do not know or understand what the meaning of those heterogeneities is. This observation, alone, suggests that if you want to transform a catalogue into a map, you probably need to understand the meaning of these dynamic heterogeneities. Moreover, for all we know (or ignore) we still do not understand what the macro-variables that cells measure are, what is being represented in those heterogeneities. Doing a map of the trillions of cells of an organism (this being human cells only to reinforce an anthropocentric view which, while justifiable, is narrow minded) without knowing what these are related to is not that useful or clever. And if you want to compare this to the Human Genome Project, just remember one thing: the genome is the same in every cell and, for the most part, in every individual of a species, and it is this conservation that makes SNPs and polymorphisms useful. However, with transcriptomes there will be surprises and we need to think about what we want to measure before we go too far.
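
To make the analogy with the gas concrete, here is a minimal sketch (mine, not part of any atlas project) of why Statistical Physics does not need every molecule. It assumes an ideal monatomic gas in equilibrium, so that each velocity component is Gaussian with variance kT/m, and it uses argon purely as an illustrative choice. The function names are hypothetical.

```python
import random

K_B = 1.380649e-23    # Boltzmann constant, J/K
M_ARGON = 6.6335e-26  # mass of one argon atom, kg (illustrative choice of gas)


def sample_velocities(n, temperature, mass=M_ARGON, seed=0):
    """Draw n molecular velocities (vx, vy, vz) for an ideal gas in equilibrium.

    Assumption: each component is Gaussian with variance k_B * T / m,
    which is the Maxwell-Boltzmann distribution of velocities.
    """
    rng = random.Random(seed)
    sigma = (K_B * temperature / mass) ** 0.5
    return [
        (rng.gauss(0.0, sigma), rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        for _ in range(n)
    ]


def temperature_from_velocities(velocities, mass=M_ARGON):
    """Estimate T from the mean kinetic energy: (3/2) k_B T = (1/2) m <v^2>."""
    mean_sq_speed = sum(
        vx * vx + vy * vy + vz * vz for vx, vy, vz in velocities
    ) / len(velocities)
    return mass * mean_sq_speed / (3.0 * K_B)


if __name__ == "__main__":
    # A modest sample already pins down the macroscopic variable;
    # cataloguing hundreds of trillions of molecules adds nothing to it.
    for n in (100, 10_000, 100_000):
        estimate = temperature_from_velocities(sample_velocities(n, 300.0))
        print(n, round(estimate, 1))
```

The point of the sketch is only that, once you know which macroscopic variable you are after, a small sample and the right relation get you there. For transcriptomes we do not yet know what the equivalent of Temperature is, which is precisely the problem.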

James Briscoe has written about the semi-comic statement of Microsoft to ‘solve’ cancer within ten years and pointed out how biological systems have this habit of ‘fighting against anything we try to do’ (https://briscoelab.org/2016/10/02/the-three-billion-dollar-question/). Nowhere is this better exemplified than in the single cell transcriptomics field, where the intrinsic tendency to generate heterogeneities can fox the best experimental designs. We still have much to learn.

One suspects that, if and when this project gets under way, it will become one of those ‘too big to fail’ exercises that are becoming frequent in Biology: projects which take funding away from hypothesis-driven or hypothesis-seeking projects which might provide a good focus for these accounting exercises. I would argue that, so far, the field of single cell analysis of transcriptomes has yielded some information but little insight; it has revealed the presence of heterogeneities in expression which are dynamic and pose some questions which are not often discussed but which need to be addressed before going too far, perhaps, in the wrong direction. We may take a page from the history of Physics and think that, perhaps, we should understand Pressure, Volume, Temperature and their relationships, before we attempt to deduce them from the molecules which, probably, underlie them. Because, as I have said before, a catalogue is not a map and their purposes are very different: a map helps you navigate, a catalogue puts order into a collection of items but does not, in principle, have a defined purpose.

In the end the project will generate data and the brief of the Broad Institute (https://www.broadinstitute.org/news/international-human-cell-atlas-initiative-gets-underway) contains some interesting statements: “By making the Atlas freely available to scientists all over the world, scientists hope to transform research into our understanding of human development and the progression of diseases such as asthma, Alzheimer’s disease and cancer. In the future, the reference map could also point the way to new diagnostic tools and treatments”. We have heard that before, haven’t we?

I still miss the Biology in the project. If you want a big project, might it not be better to think first and apply the current technology to a system or a problem which can teach us how to approach, one day, a human cell Atlas?