Hi, I'm Kyle Burris. I'm a first-year statistics PhD student here at Duke. >> Hi, I'm Callie Mao, and I'm an undergraduate statistics student at Duke. This is Dr. David Dunson, Professor of Statistical Science. >> Hi, I'm David. I work a lot in Bayesian statistical methods, and I've been doing that for about 20 or 25 years now. A lot of it is motivated by high-dimensional, interesting biomedical applications. >> Could you tell us about some of the applications of Bayesian methods to the health data that you're working with? >> Yeah, we're motivated by a whole enormous variety of high-dimensional health data. I think there's been this enormous profusion of high-dimensional, really complicated data in a variety of fields, so I'll just give one example that we're really excited about lately, which is neuroscience. There's a lot of new imaging technology that allows you to actually measure the connection structure in an individual's brain. You can go in and do a brain scan called a diffusion tensor imaging scan, and also get structural imaging, and based on that imaging you can actually recreate the fibers connecting different regions of your brain. And so you guys might be wired slightly differently, and that might be related to your personality traits, tendency to have depression, creativity, intelligence in different domains. We'd like to figure out those relationships, and so we've been working on that a lot lately using Bayesian methods. The data are really intriguing: they're curves linking up different regions of the brain, they have a network-type structure, and we're trying to build models for discovering low-dimensional structure in the brain related to phenotypes like intelligence, creative reasoning, etc. >> And so how can Bayesian statistics in particular tell us more about the brain?
>> It's interesting that most of the literature recently has been dominated by optimization methods, or really simple methods that take the data and look at tiny pieces of it separately. So imagine that we have a big network in your brain, okay? Everyone in the study has a slightly different network, and we'd like to know how the features of the network are related to traits of that individual. Can we do that using non-Bayesian methods? Well, we could take little pieces of the network and do a separate test for the relationship between, say, IQ and each link in the network, get a p-value, and do that for every possible link. Then maybe do some sort of adjustment to avoid having too many false discoveries. That testing approach turns out to do extremely badly. The other thing people do in modern statistics is an optimization type of approach: you take that brain network and maybe do some sort of singular value decomposition or matrix factorization to learn some low-dimensional structure. You can do that really fast, but the problem with that type of approach, which has really dominated a lot of the literature, is that it just gives you a point estimate, okay? So let's say I want to study the relationship between some mental health disorder, or aging, and brain structure. Well, if I just get a point estimate of that, that's just one guess, maybe one best guess, of what's going on. It doesn't tell me how uncertain I am in that, so I can't really publish an article on it or feel confident about that result. It might be that there are 10,000 other different relationships that are equally consistent with the data, and I've just estimated one of them. And so the really distinct characteristic of Bayesian methods is their ability to characterize uncertainty, uncertainty in scientific inferences, in this case.
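The "test every link separately" approach described above can be sketched in a few lines. This is an illustrative example on synthetic data, not the actual analysis from the interview: one correlation test per network edge against a trait, followed by a Benjamini-Hochberg adjustment to limit false discoveries.

```python
# Sketch of mass-univariate edge testing on synthetic data: only edge 0
# is truly related to the trait; the rest are noise.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_edges = 100, 500

edges = rng.normal(size=(n_subjects, n_edges))        # synthetic edge weights
iq = 2.0 * edges[:, 0] + rng.normal(size=n_subjects)  # trait driven by edge 0

# One p-value per edge.
pvals = np.array([stats.pearsonr(edges[:, j], iq)[1] for j in range(n_edges)])

# Benjamini-Hochberg step-up procedure at FDR level 0.05.
order = np.argsort(pvals)
thresh = 0.05 * np.arange(1, n_edges + 1) / n_edges
passed = np.nonzero(pvals[order] <= thresh)[0]
k = passed.max() + 1 if passed.size else 0
discoveries = order[:k]
print(discoveries)  # edge 0 survives the adjustment
```

Even with the adjustment, each edge is still analyzed in isolation, which is exactly the weakness the interview points out: the tests ignore the shared network structure across edges and carry no joint measure of uncertainty.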
And so I can say, what's the posterior probability that there's any relationship between IQ and brain structure, say. I can also say, well, what's the posterior probability that there's a relationship between IQ and brain structure in a particular region. The other thing people do is they'll just take the data in the brain network and extract certain features, called topological features of the network. They might take three, or four, or five of these different features and then just do a statistical analysis based on that. But that fixates on some particular features, and if you collect too many you'll end up having false positives. In a Bayesian approach you can holistically model the entire brain structure flexibly while allowing uncertainty. So I think it's been quite exciting. We've already found very intriguing relationships between brain structure and Alzheimer's disease, and also big differences between individuals with low creative reasoning scores and those with high creative reasoning scores in terms of connections in the frontal lobe across hemispheres, and so it's been a lot of fun. >> You just talked a little bit about the advances that Bayesian statistics is making in a variety of applications such as neuroscience and brain imaging. But what advances are being made in the field of Bayesian statistics in general? >> Yeah, it's been really exciting, I think. If you back up in time, there have been these different revolutions in Bayesian statistics. Prior to the advent of Markov chain Monte Carlo methods in the late 80s and early 90s, it was more of a small philosophical field where people would work on toy problems. And then once we had more computational tools, it exploded into being a very application-driven, application-motivated field that could have a substantial impact on a lot of application areas.
But that's been changing a little recently, as the data become so big that the traditional algorithms people were using in the 90s and early 2000s are no longer usable. And so one of the really exciting things in Bayesian statistics is trying to design completely new algorithms, new ways to do inference, and new types of models for hugely high-dimensional, complicated data while allowing uncertainty. Can we talk to people in computer science and machine learning, and figure out ways to scale up algorithms using big computing systems and distributed computing? That's been really interesting and intriguing, while also maintaining theoretical guarantees that these methods do well. So that's been a big part of our work, and there's been an explosion in this area in general, I think. >> Can you tell us a bit more about the application of Bayesian methods to big data? >> Yeah, so it's interesting, because it depends on what exactly people mean by big data. A lot of the people in the literature working on Bayesian methods have focused on big data meaning really large sample sizes, so we might have 100 million subjects or something. But if you have a really large sample size, and the number of variables you're collecting isn't that big, and you're doing something like, say, a logistic regression, which is the toy example people almost always use in motivating these types of methods, then actually there's often not that much motivation for doing a Bayesian approach relative to some fast optimization approach. Because if I have 50 parameters and a sample size of 100 million, well, my posterior distribution might be really super concentrated right around what we would get by just doing, say, maximum likelihood estimation.
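The posterior-concentration point can be made concrete with an even simpler model than logistic regression. This is an illustrative sketch with hypothetical numbers, using a one-parameter conjugate Beta-Binomial model: with an enormous sample size and few parameters, the posterior piles up almost entirely on the maximum likelihood estimate.

```python
# Sketch: posterior concentration at huge n (hypothetical counts).
from scipy import stats

n, successes = 100_000_000, 30_000_000
mle = successes / n  # maximum likelihood estimate: 0.3

# Flat Beta(1, 1) prior -> Beta(successes + 1, n - successes + 1) posterior.
posterior = stats.beta(successes + 1, n - successes + 1)
lo, hi = posterior.ppf([0.025, 0.975])

# The posterior mean matches the MLE to many decimal places, and the
# 95% credible interval is only about 2e-4 wide.
print(mle, posterior.mean(), hi - lo)
```

With the whole posterior squeezed into an interval that narrow, a fast optimization approach and a full Bayesian analysis give essentially the same answer, which is the interview's point about when Bayes adds little.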
And so the really interesting problems in big data are when you actually have really large numbers of variables, or you're trying to fit a really flexible model like a nonparametric model, or you have something like rare events. For example, in computational advertising, we might be looking at people going from a large number of websites to a small set of client websites. There might be hundreds of thousands of websites and then a hundred client websites, and those transitions can be really rare. So even though you have millions of people, you have rare events, and we found that the uncertainty in those transitions is super important; if you just do an optimization approach, you might be quite misled. And so this problem of scaling up to characterize the complexity of big data while allowing uncertainty is really important in most of these settings where you have really high-dimensional data but the sample size is also enormous, I think. >> All right, well, thank you so much for your time. >> Thank you, it's been fun. >> Pleasure speaking with you.
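The rare-events point from the interview can be illustrated with a small sketch using hypothetical counts: a transition observed only a handful of times out of a million visits. The point estimate is a single number, while a simple Beta posterior shows the transition probability is only known to within roughly an order of magnitude.

```python
# Sketch: uncertainty in a rare transition probability (hypothetical counts).
from scipy import stats

visits, transitions = 1_000_000, 3
point_estimate = transitions / visits  # 3e-06, a single best guess

# Flat Beta(1, 1) prior over the transition probability.
posterior = stats.beta(transitions + 1, visits - transitions + 1)
lo, hi = posterior.ppf([0.025, 0.975])

# The 95% credible interval spans roughly a factor of eight, despite the
# million observed visits.
print(point_estimate, lo, hi)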