How the Coming Data Deluge Will Reshape Neuroscience

And connectomics said, “Après moi, le déluge.”

Neuroscience is on the brink of a revolution. A brain fully reconstructed on the cellular scale is the end goal of a newborn field called connectomics, and although we are a long way off, it’s looking like it might actually be possible. We are now at the point where precise, large-scale reproduction of brain wiring is actually happening. If you look back through my posts you will see a number of projects that gathered enormous data sets to this end. For example, an open access database called FlyCircuit is a supercomputer-based model of the fruit fly brain based on pictures of over 16,000 individual neurons and counting. Once the FlyCircuit group generated all this data, they had to come up with special software to analyze how those neurons are positioned and connected inside the brain. This enormous amount of data is freely available online for people to use in their own analysis. In another paper, Davi Bock and his colleagues took 3.2 million extremely high-magnification images of a 0.008 cubic millimetre chunk of mouse brain, which together took up 36 terabytes of disk space (data also freely available). The challenge then was to analyze it, and while they arguably didn’t get very much information from their analysis, mining their data in different ways could likely produce a lot more from this one tiny bit of brain. So in neuroscience, we are now getting to the point where data is in major surplus.

In fact this is the trend throughout the scientific world; genomics, climate science and astronomy are just a few of the disciplines that are being hit with the data deluge (a cheesy, alliterative buzz name coined elsewhere). Historically, though, the case has been exactly the opposite. Until recently, the bottleneck of research was at the data acquisition end. In a recent special issue of Science on the data deluge, Richard Baraniuk explains that we have just moved out of an age where data is hard to acquire and into one where the challenge is what to do with our wealth of it. An indicator of this shift came in 2007, when for the first time the world produced more data than it could store. And now, on a yearly basis, we produce double the amount of data we can store. If you’ve read or heard tech-optimist Ray Kurzweil, you will know his Law of Accelerating Returns catch-stat that just about everything to do with data doubles every two years. As it turns out this is more or less the case for the world’s data storage capacity and data acquisition too, except that, as Baraniuk informs us, the world’s data storage capacity is growing more slowly (20% slower per year) than the amount of data produced, meaning that if the trend continues, our world will get ever more data rich, only to have to throw away a good portion of that wealth.
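To get a feel for how quickly that gap could widen, here is a back-of-the-envelope sketch in Python. It is a toy model, not Baraniuk's calculation, and it rests on assumptions of mine: data production doubles every two years, storage capacity matched production exactly at year zero, and "20% slower" means storage grows at 0.8 times the annual growth rate of production. The function names are made up for illustration.

```python
def annual_rate_from_doubling(doubling_years: float) -> float:
    """Annual fractional growth implied by a given doubling time (in years)."""
    return 2.0 ** (1.0 / doubling_years) - 1.0


def unstored_fraction(years: int, production_rate: float, storage_rate: float) -> float:
    """Fraction of data produced in a given year that exceeds storage capacity.

    Assumes production and storage capacity were equal at year 0 and that
    each grows at a constant annual rate afterwards.
    """
    produced = (1.0 + production_rate) ** years
    storable = (1.0 + storage_rate) ** years
    return max(0.0, 1.0 - storable / produced)


# Doubling every two years works out to roughly 41% growth per year.
production_rate = annual_rate_from_doubling(2.0)
# Assumption: "20% slower" read as 0.8 times the production growth rate.
storage_rate = 0.8 * production_rate

for years in (0, 1, 5, 10):
    frac = unstored_fraction(years, production_rate, storage_rate)
    print(f"year {years:2d}: {frac:.1%} of that year's data has nowhere to go")
```

Under a different reading of "20% slower" (say, 20 percentage points lower) the numbers change, but the shape of the result does not: any persistent difference in growth rates compounds, so the share of data we have to discard keeps climbing.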

Mining Neuroscience Data

In neuroscience, we aren’t at that point yet, and are probably a long way off, but I would argue that the field of connectomics in particular is going in the same direction.

In fact, in the special issue of Science that I mentioned before, Huda Akil and friends wrote a perspective piece on “Challenges and Opportunities in Mining Neuroscience Data.” They start by coining the phrase “neural choreography,” which they use to refer to the complex dance of neural communication and connectivity that makes up our brains and minds. In their article they stress that the brain can’t be understood from a purely reductionist approach and so we need to move from the very basics of neurobiology to the more complex. Akil and friends put forth that as we push the study of neural choreography forward, new layers of function will emerge, and thus the overarching goal in the field is to go from studying neuronal genes and proteins, to neurons, to neural circuits and finally to thought and behavior. This isn’t really news; I would wager that most neuroscientists, regardless of how zeroed in they are on their favorite gene or cell-type, have at least entertained the idea that they are joining in a process that will hopefully end in a full understanding of the brain and mind. However, Akil and friends do give some insightful details on some of the efforts that are starting to look at neural circuitry on a huge scale, and the resulting amounts of data.

They start with the details of a new initiative: the Human Connectome Project (HCP). Obviously echoing the Human Genome Project, the HCP aims to comprehensively map the major connections between brain areas, as well as compile information about each of the regions. The HCP is projected to comprise a petabyte of data, or 1000 terabytes, and all this data will be made available online for easy browsing and analysis. The HCP is currently using two main neuro-imaging methods to map out brain connections. Both of these methods use variations on magnetic resonance imaging (MRI) to scan brains of living human beings. The first method, called diffusion mapping, exploits the preferential movement of water molecules along neural tracts (i.e. nerves or axon tracts) to determine the orientations of neural fibres that connect brain areas to one another. The second technique is called Resting State Functional MRI. Functional MRI (fMRI) is essentially just video MRI that assesses which brain areas are more active than others during the scan. To do this, experimenters and physicians use a specialized MRI setup that tracks the levels of oxygen being used by each brain area. Generally, fMRI is performed on people doing some kind of mental task – thus determining which brain areas are involved in that task. However, as the name suggests, Resting State fMRI is based on resting fluctuations in the amount of oxygen used by brain areas – the subjects are not performing any mental tasks. During rest though, the brain is still active – thoughts flit through your head, muscles twitch, etc. So by watching for brain activity during rest, and keeping track of when certain brain areas become active, you can infer that, since area B became active right after area A did, perhaps they are connected.
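The "area B lights up right after area A" logic can be sketched as a lagged correlation between two activity traces. What follows is a toy illustration in Python, not the HCP's actual analysis pipeline: the two signals are synthetic (area B is simply area A delayed by two scans plus noise), and all the names and numbers are made up for the example.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(a, b, lag):
    """Correlation between a[t] and b[t + lag], for lag >= 0."""
    if lag == 0:
        return pearson(a, b)
    return pearson(a[:-lag], b[lag:])


def best_lag(a, b, max_lag):
    """Lag (0..max_lag) at which b follows a most closely."""
    return max(range(max_lag + 1), key=lambda lag: lagged_correlation(a, b, lag))


# Synthetic "resting" activity: area A fluctuates at random,
# and area B echoes it two scans later with some noise added.
rng = random.Random(0)
n_scans, true_lag = 300, 2
area_a = [rng.gauss(0, 1) for _ in range(n_scans)]
area_b = [
    (area_a[t - true_lag] if t >= true_lag else 0.0) + 0.5 * rng.gauss(0, 1)
    for t in range(n_scans)
]

lag = best_lag(area_a, area_b, max_lag=5)
print(f"B tracks A best at a lag of {lag} scans "
      f"(r = {lagged_correlation(area_a, area_b, lag):.2f})")
```

A strong correlation like this only suggests a connection. Two areas could just as easily both be driven by a third one, which is why the inference above ends in "perhaps."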

It is important to note that these techniques do not get down to the cellular level – the resolution of fMRI is something on the order of millimetres cubed, a volume that can house hundreds of thousands of cells. So the HCP does not purport to generate a replica of the brain on the cellular level, but regardless, the data produced will be valuable in making predictions as to information flow in the human brain, and will also hopefully give us a look at how brain wiring differs between individuals.

While Akil and friends are clearly optimistic about the study of the human connectome, they offer much less information on what they call “microconnectomes,” connectomes on the cellular level. In particular they cite the Brainbow mouse, which is a strain of mice genetically engineered to express fluorescent proteins in their neurons, resulting in a brain that has neurons that fluoresce in many different colors. The idea behind these mice is that the many different-colored neurons should make it easy to tell neurons apart and thus simplify studying the circuitry. However, despite producing some beautiful and award-winning images, the Brainbow mouse has yet to really deliver on the connectomics scene. So while Akil and friends mention that yes, the effort to map circuitry on the cellular level is underway, they don’t give much insight. Personally I think that Akil and friends have their priorities a bit backward, but I’m an unabashed cellular junkie so don’t trust me. However, I will offer that the basis of neural computation is the passing around of tiny electrical impulses between neurons that are arranged into extremely complex circuits. And so, to truly understand the brain, we need to understand it on a cellular level. Despite this oversight by the authors though, it has been a good year for connectomics, and if you would like to hear about some of the advances you should have a look at some of my older posts that I mentioned at the beginning of this piece.

Although their article might leave you with a skewed impression that the HCP is the be-all and end-all in the field, Akil and friends do a good job of hitting home the scale of these projects and the need for compiling and sharing all this data to optimize its analysis. In particular they focus on the Neuroscience Information Framework (NIF), an online database of databases that provides open access to all types of neuroscience data. Akil and friends note that although this is a formidable organization, it has its drawbacks. For instance, the most notable roadblock to accessing the NIF’s wealth of data is the project-to-project and database-to-database disparity in terminology. To deal with this the NIF created Neurolex, a standardized naming system that allows the NIF’s users to effectively search its holdings. The challenge now is getting people to use this system when submitting data. Akil and friends argue that neuroinformatics approaches such as this allow for navigation and integration of multiple tiers of huge amounts of data, which will in turn facilitate the untangling of neuroscientific questions that are becoming more and more complex. They leave off with eight recommendations, which essentially boil down to:
1.) Share your data,
2.) But make sure it is standardized.
3.) And get used to the fact that machines and software that most of us don’t understand are here to stay and are the way of the future.

One major issue on the road to this type of collaborative neuroinformatics approach is the cultural shift that the scientific community will need to undergo; scientists aren’t necessarily all that willing to share their data fully and irrevocably – what if someone else makes the discovery from their data before they do? The point, however, is not to open up your data on its own; the point is to amass enough data to be able to synthesize something new. Along that road there are certainly going to be hardships – we likely will need “data stewards” to manage the databases, we’ll need new software, we’ll need storage space, and new ways of understanding huge amounts of data – but by creating more data and amassing it we gain a new level of depth and descriptive ability that will undoubtedly bear fruit.

Where Does All This Data and the Coming Flood Leave Us?

In a 2008 article, Chris Anderson, editor of Wired magazine, put forth the idea that the scientific method as we know it is becoming obsolete. The work flow of “hypothesize – model – test” is soon to be antiquated given the massive amount of data now available to us. He takes up the example set by Google and its ability to track activity on the internet. Google has access to information on how people behave – specifically, which websites we go to and all the info pertaining to our cyber travels. Anderson asks, who cares why people do what they do on the internet? “The point is that they do it, and we can track and measure it with unprecedented fidelity.” His point is that we can make as many hypotheses about people’s internet behavior as we want, but it doesn’t matter – the data is there just waiting to be mined – “With enough data, the numbers speak for themselves.”

At one point I think he hits the central point without really realizing it, though. He criticizes the fact that scientists are taught to be wary of correlation, learning early on that correlation is not causation. He follows that with:

“There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”

What he misses though, is that when we get to this level we are no longer dealing with correlation – we are dealing with big enough numbers that the data is fully descriptive, truly representative, exhaustive. By collecting all data from the internet, you are tracking the activity of every person using it, and so you have no need for models because you have the real thing recorded. The tale that Google is telling with the internet is one of description, with no need for hypotheses or trust in correlation. They have moved past the point of needing to figure out how to collect the data. In their field, they have the data – all of it. The hard part now is figuring out ways to interpret it.

What does this say about neuroscience though? Not much. But we can look forward to the days when our acquisition techniques produce fully descriptive data – when we can connecto-type individual brains and enter a new (and probably very scary) age of brain science. At that point we will likely have to do the same as Google and take a back seat to connectome-trawling algorithms that will describe the brain replicas that we feed into the supercomputers needed to store them. This amounts to true naive observation, description without preconceived notions (if, of course, we can design the algorithms to be objective). Naive observation, followed by informed induction of generalizations from the data, is, in my opinion, probably the best way to conduct science. If you can pick through the data without any bias and pull out the rules, laws and facts as you come across them, there is no need for making assumptions and hypotheses going into the experiment. However, we aren’t there yet, so in the meantime, we’ll have to continue wading through our ankle-deep data. Welcome the flood when it truly hits though, because it will transform the way we understand the brain and ourselves.

One Response to How the Coming Data Deluge Will Reshape Neuroscience

  1. Denise says:

    I guess it’s a battle between a reductionist approach to study a single protein or single cell at a time and build upwards from there and this more holistic, top-down approach which involves generating more data than can possibly be analyzed right now, only to have to figure out later how to garner any relevant or important information from it. Perhaps if we figure out how to combine the two approaches, observation of descriptive data with solid hypotheses about how the brain functions (along with better ways to reduce bias), we might be able to get our answer.

    The solution to the data deluge and the overflow of PhDs that can’t get jobs? The PhD Student Tenure Track Position (aka eternal PhD student). We’ll create an army of highly qualified, overtrained and underpaid PhD students to sift through the data for their infinitely long PhD theses.
