Century of Data - Abstracts
'Big Data' Asymptotics via Approximate Message Passing
David Donoho, Stanford University
We will discuss asymptotics for fitting of linear models in the large p, large n setting where there are about as many variables as observations.
We will review how approximate message passing algorithms make explicit the basic emergent phenomenon in these cases: the existence of 'extra Gaussian noise' caused not by the noise in the observations but by the large n/large p regime itself.
Our approach makes the origin of this extra noise heuristically clear, quantifies its appreciable size, and offers many new insights, for example about phase transitions in algorithmic behavior.
This talk covers joint work with Andrea Montanari and Iain Johnstone, and draws on several papers, including some involving Montanari's coauthors Mohsen Bayati, Arian Maleki, and Adel Javanmard.
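As a rough illustration of the kind of iteration the abstract refers to, the following is a minimal sketch of approximate message passing for a sparse linear model. The soft-thresholding denoiser, the threshold rule `alpha * tau`, the Gaussian design matrix, and all problem sizes are assumptions made for this example; none of them are taken from the talk. The quantity `tau` is an empirical estimate of the effective noise level seen by the denoiser. In this sketch it starts out positive even though the observations are noiseless, which is one way to see noise that does not come from the observations.

```python
import numpy as np

def soft_threshold(v, theta):
    """Soft-thresholding denoiser: shrink each entry toward zero by theta."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def amp(A, y, alpha=2.0, n_iter=50):
    """AMP sketch with a soft-thresholding denoiser (alpha is an assumed tuning constant)."""
    n, p = A.shape
    x = np.zeros(p)
    z = y.copy()
    tau_history = []
    for _ in range(n_iter):
        # Empirical effective noise level of the pseudo-data below.
        tau = np.linalg.norm(z) / np.sqrt(n)
        tau_history.append(tau)
        # Pseudo-data: heuristically, the true signal plus Gaussian noise of level tau.
        pseudo_data = x + A.T @ z
        x = soft_threshold(pseudo_data, alpha * tau)
        # Onsager correction: previous residual times (number of nonzeros) / n.
        onsager = z * (np.count_nonzero(x) / n)
        z = y - A @ x + onsager
    return x, tau_history

# Synthetic demo: n observations, p variables, k nonzero coefficients (all hypothetical).
rng = np.random.default_rng(0)
n, p, k = 500, 1000, 50
A = rng.normal(size=(n, p)) / np.sqrt(n)   # columns have roughly unit norm
x_true = np.zeros(p)
x_true[rng.choice(p, size=k, replace=False)] = rng.normal(size=k)
y = A @ x_true                             # noiseless observations

x_hat, tau_history = amp(A, y)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative error: {rel_err:.3e}")
print(f"effective noise level, first vs last iteration: {tau_history[0]:.3e}, {tau_history[-1]:.3e}")
```

The Onsager term is what distinguishes this iteration from plain iterative soft thresholding; without it, the pseudo-data would not be expected to behave like signal plus Gaussian noise.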
Robust Sparse Quadratic Discrimination
Jianqing Fan, Princeton University
We propose a novel Rayleigh-quotient-based sparse quadratic dimension reduction method -- named QUADRO -- for analyzing high dimensional data. Unlike in the linear setting where Rayleigh quotient optimization coincides with classification, these two problems are very different under nonlinear settings. One major challenge of Rayleigh quotient optimization is that the variance of quadratic statistics involves all fourth cross-moments of predictors, which are infeasible to compute for high-dimensional applications and may accumulate too many stochastic errors. This issue is resolved by considering a family of elliptical models. Moreover, for heavy-tailed distributions, robust estimates of mean vectors and covariance matrices are employed to guarantee uniform convergence in estimating nonpolynomially many parameters, even though only the fourth moments are assumed. Computationally, we propose an efficient linearized augmented Lagrangian method to solve the constrained optimization problem. Theoretically, we provide explicit rates of convergence in terms of Rayleigh quotient under both Gaussian and general elliptical models.
Thorough numerical results on both synthetic and real datasets are also provided to back up our theoretical results.
This is joint work with Tracy Ke, Han Liu, and Lucy Xia.
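To make the central quantity concrete, here is a minimal sketch of an empirical Rayleigh quotient for a quadratic statistic, written as between-class separation over a weighted within-class variance. The abstract does not state the definition, so the particular form, the class weight `pi`, and the names `Omega` and `delta` are assumptions for illustration. This is a plug-in computation on sample values and does not implement the QUADRO estimator itself.

```python
import numpy as np

def quadratic_statistic(X, Omega, delta):
    """Evaluate Q(x) = x^T Omega x - 2 delta^T x for every row x of X."""
    return np.einsum("ij,jk,ik->i", X, Omega, X) - 2.0 * X @ delta

def rayleigh_quotient(q1, q2, pi=0.5):
    """Squared mean difference divided by the pi-weighted within-class variance."""
    between = (q1.mean() - q2.mean()) ** 2
    within = pi * q1.var() + (1.0 - pi) * q2.var()
    return between / within

# Tiny deterministic check with hypothetical values.
q_class1 = np.array([0.0, 2.0])   # mean 1, variance 1
q_class2 = np.array([4.0, 6.0])   # mean 5, variance 1
rq = rayleigh_quotient(q_class1, q_class2)

X_demo = np.array([[1.0, 2.0]])
q_demo = quadratic_statistic(X_demo, np.eye(2), np.array([1.0, 0.0]))
```

Because Q is quadratic in the predictors, the within-class variance in the denominator depends on fourth moments of the predictors, which is the computational difficulty the abstract describes.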
Are Observational Studies Any Good?
David Madigan, Columbia University
Observational healthcare data, such as administrative claims and electronic health records, play an increasingly prominent role in healthcare. Pharmacoepidemiologic studies in particular routinely estimate temporal associations between medical product exposure and subsequent health outcomes of interest, and such studies influence prescribing patterns and healthcare policy more generally. Some authors have questioned the reliability and accuracy of such studies, but few previous efforts have attempted to measure their performance. The Observational Medical Outcomes Partnership (OMOP, http://omop.org) has conducted a series of experiments to empirically measure the performance of various observational study designs with regard to predictive accuracy for discriminating between true drug effects and negative controls. In this talk, I describe the past work of the Partnership, explore opportunities to expand the use of observational data to further our understanding of medical products, and highlight areas for future research and development.
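One common way to score how well a study design separates true effects from negative controls is the area under the ROC curve (AUC). The sketch below computes it directly as the fraction of (true effect, negative control) pairs that the design ranks correctly. The effect estimates are invented for illustration, and treating AUC as the performance measure is an assumption of this example rather than something the abstract states.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical effect estimates produced by one study design.
true_effect_estimates = [1.8, 2.4, 1.1, 3.0]       # drug-outcome pairs with a known effect
negative_control_estimates = [0.9, 1.2, 1.0, 0.7]  # pairs believed to have no effect

score = auc(true_effect_estimates, negative_control_estimates)
print(score)
```

An AUC of 0.5 would mean the design ranks true effects no better than chance, while 1.0 would mean perfect separation.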
Survival Models and Health Sequences
Peter McCullagh, University of Chicago
Medical investigations focusing on patient survival often generate not only a failure time for each patient but also a sequence of measurements on patient health at annual or semi-annual check-ups while the patient remains alive. Such a sequence of random length accompanied by a survival time is called a survival process. Ordinarily, robust health is associated with longer survival, so the two parts of a survival process cannot be assumed independent. This talk is concerned with a general technique, temporal realignment, for constructing statistical models for survival processes. A revival model is a regression model in the sense that it incorporates covariate and treatment effects into both the distribution of survival times and the joint distribution of health outcomes. It also allows the sequence of health outcomes to be used clinically for predicting the subsequent trajectory, including the residual survival time.
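As a small illustration of temporal realignment, the sketch below re-indexes each health measurement by the time remaining until failure rather than by the time since recruitment. The `SurvivalProcess` class, its field names, and the example numbers are hypothetical, and the sketch ignores censoring; it shows only the data transformation, not the regression model.

```python
from dataclasses import dataclass

@dataclass
class SurvivalProcess:
    failure_time: float   # time from recruitment to failure
    visits: list          # (time since recruitment, health measurement) pairs

def realign(process):
    """Re-index each measurement by time remaining until failure."""
    return [
        (process.failure_time - t, health)
        for t, health in process.visits
        if t <= process.failure_time
    ]

# Hypothetical patient: failure at year 5, measured at years 0, 1 and 4.
patient = SurvivalProcess(failure_time=5.0, visits=[(0.0, 10.0), (1.0, 9.0), (4.0, 3.0)])
realigned = realign(patient)
print(realigned)
```

After realignment, patients with different survival times can be compared at the same distance from failure.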
Reproducibility and Cross-Study Replicability of Prognostic Signatures from High Throughput Genomic Data
Giovanni Parmigiani, Harvard University
Numerous gene signatures of patient prognosis for late-stage, high-grade ovarian cancer have been published, but diverse data and methods have made these difficult to compare objectively. However, the corresponding large volume of publicly available expression data creates an opportunity to validate previous findings and to develop more robust signatures. We thus built a database of uniformly processed and curated public ovarian cancer microarray data and clinical annotations, and re-implemented and validated 14 prognostic signatures published between 2007 and 2012. In this lecture I will describe the methodology and tools we developed for evaluating published signatures in this context. I will also use this application as the springboard for a more general discussion on how to evaluate statistical learning methods based on a collection of related studies.
Modeling Visual Cortex V4 in Naturalistic Conditions with Invariant and Sparse Image Representations
Bin Yu, University of California, Berkeley
The functional organization of cortex area V4 in the mammalian ventral visual pathway is far from being well understood. V4 plays an important role in the recognition of shapes and objects and in visual attention, but its complexity makes it hard to analyze. In particular, no current model of V4 has shown good predictions for neuronal responses to natural images and there is no consensus on its primary role.
In this talk, we present an analysis of electrophysiological data on V4 neuron responses to natural images. We propose a new computational model whose prediction performance for V4 neurons is comparable to that for V1 neurons. Our model does not rely on any pre-defined image features but only on invariance and sparse coding principles. We interpret our model using sparse principal component analysis and discover two groups of neurons: those selective to texture and those selective to contours. This supports the thesis that one primary role of V4 is to extract objects from background in the visual field. Moreover, our study also confirms the diversity of V4 neurons. Among those selective to contours, some are selective to orientation, others to acute curvature features.
This is joint work with J. Mairal, Y. Benjamini, M. Oliver, B. Willmore, and J. Gallant.