Statistical Learning and Computation for Big Data Analysis
Fall 2014
Thursday Noons, Hill Center
11:40am -- 12pm (Room 502, Lunch)
12:00pm -- 1pm (Room 552, Talk)
Rutgers Busch Campus, 110 Frelinghuysen Rd Piscataway
Sponsored by Yahoo Labs
The Big Data seminar meets weekly or biweekly for presentations by invited researchers, emphasizing either the theory or practice of statistical learning with big data. Lunch is served starting at 11:40am, and talks run from 12pm to 1pm.
September 25, 2014
Speaker: Kilian Weinberger, Associate Professor in Computer Science, Washington University
Title: Learning with Marginalized Corruption
Abstract: If infinite amounts of labeled data are provided, many machine learning algorithms become perfect. With finite amounts of data, regularization or priors have to be used to introduce bias into a classifier. We propose a third option: learning with marginalized corrupted features. We (implicitly) corrupt existing data as a means to generate additional, infinitely many, training samples from a slightly different data distribution; this is computationally tractable because the corruption can be marginalized out in closed form. Our framework leads to machine learning algorithms that are fast, generalize well and naturally scale to very large data sets. We showcase this technology in the context of risk minimization with linear classifiers and deep learning for domain adaptation. We further show that our framework is not limited to features: marginalized corrupted labels and graph edges have promising applications in tag prediction of natural images and label propagation within protein-protein interaction networks.
Bio: Kilian Q. Weinberger is an Associate Professor in the Department of Computer Science & Engineering at Washington University in St. Louis. He received his Ph.D. from the University of Pennsylvania in Machine Learning under the supervision of Lawrence Saul and his undergraduate degree in Mathematics and Computer Science from the University of Oxford. During his career he has won several best paper awards at ICML, CVPR, AISTATS and KDD (runner-up award). In 2011 he was awarded the Outstanding AAAI Senior Program Chair Award and in 2012 he received an NSF CAREER Award. Kilian Weinberger's research is in machine learning and its applications. In particular, he focuses on high-dimensional data analysis, resource-efficient learning, metric learning, machine-learned web-search ranking, transfer and multi-task learning, as well as biomedical applications. Before joining Washington University in St. Louis, Kilian worked as a research scientist at Yahoo! Research in Santa Clara.
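As a rough illustration of the closed-form marginalization the abstract refers to (my own sketch, not the speaker's code): for squared loss and unbiased dropout-style "blankout" corruption of a linear model's features, the expectation over infinitely many corrupted copies of each training point can be computed exactly, so no corrupted copies are ever sampled.

```python
import numpy as np

def expected_sq_loss(w, X, y, q):
    """E[(y - w.x_tilde)^2] where x_tilde zeroes each feature with prob q
    and rescales survivors by 1/(1-q) so that E[x_tilde] = x."""
    resid = y - X @ w                            # loss on the clean data
    var_term = (q / (1 - q)) * (X**2) @ (w**2)   # variance added by corruption
    return np.mean(resid**2 + var_term)

# Quick check against brute-force Monte Carlo corruption (sizes are invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
w, q = rng.normal(size=10), 0.3

mc = np.mean([
    np.mean((y - (X * (rng.random(X.shape) > q) / (1 - q)) @ w) ** 2)
    for _ in range(2000)
])
print(expected_sq_loss(w, X, y, q), "~", mc)  # the two should agree closely
```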
Spring 2014
Thursday Noons, Hill Center
11:40am -- 12pm (Room 502, Lunch)
12:00pm -- 1pm (Room 552, Talk)
Rutgers Busch Campus, 110 Frelinghuysen Rd Piscataway
The Big Data seminar meets weekly or biweekly for presentations by invited researchers, emphasizing either the theory or practice of statistical learning with big data. Lunch is served starting at 11:40am, and talks run from 12pm to 1pm.
January 30th, 2014
Speaker: Stephen Burley, Distinguished Professor, Department of Chemistry and Chemical Biology, Rutgers University
Title: Integrative Structural Biology and the Big Data Revolution: Role of the Protein Data Bank
Abstract: The mission and activities of the Protein Data Bank (www.pdb.org) will be described in some detail, with particular emphasis on the challenges and opportunities presented by the development of integrative structural biology.
February 13, 2014
Speaker: Visa Koivunen, Academy Professor, Aalto University, Finland; Visiting Professor, Princeton University (sabbatical)
Title: Tensor models and techniques for analyzing high-dimensional data
Abstract: Analyzing high-dimensional and high-volume datasets is of interest in many fields of engineering and science. For example, modeling of multidimensional MIMO channels, analyzing sensor data collected by mobile terminals, and analyzing fMRI and surveillance data are among the emerging application areas. Tensors, which in a simple definition are multi-way arrays, accommodate high-dimensional data sets naturally. Various tensor decompositions based on multilinear models are powerful tools to explore and reveal important information in high-dimensional data sets. In this presentation we will give a tutorial overview of tensor representations and methods for data analysis. In particular, tensor factorization methods and low-rank modeling techniques are considered. In addition, recent advances in sparse methods and regularization for reducing dimensionality, simplifying visualization and selecting variables when employing tensor models are introduced. Furthermore, statistically robust procedures for analyzing tensor data are proposed and their performance is studied in the face of outliers.
Bio: Visa Koivunen (IEEE Fellow) received his D.Sc. (EE) degree with honors from the University of Oulu, Finland. He received the primus doctor (best graduate) award among the doctoral graduates of 1989-1994. He is a member of Eta Kappa Nu. From 1992 to 1995 he was a visiting researcher at the University of Pennsylvania, Philadelphia, USA. From 1997 to 1999 he was on the faculty at Tampere University of Technology. Since 1999 he has been a full Professor of Signal Processing at Helsinki University of Technology, Finland, now known as Aalto University. He was appointed Academy Professor (distinguished professor nominated by the Academy of Finland) for the years 2010-2014. He is one of the Principal Investigators in the SMARAD Center of Excellence in Research nominated by the Academy of Finland. From 2003 to 2006 he was also an adjunct full professor at the University of Pennsylvania, Philadelphia, USA. During his sabbatical in 2007 he was a Visiting Fellow at Princeton University, NJ, USA, and he has been a visiting fellow at Princeton multiple times; he was also a part-time Visiting Fellow at Nokia Research Center (2006-2012). He is currently on sabbatical at Princeton University for the full 2013-2014 academic year. Dr. Koivunen's research interests include statistical, communications and sensor array signal processing. He has published about 350 papers in international scientific conferences and journals. He co-authored papers receiving the best paper award at IEEE PIMRC 2005, EUSIPCO 2006, EUCAP 2006 and COCORA 2012, and was awarded the IEEE Signal Processing Society Best Paper Award for 2007 (with J. Eriksson). He has served as an associate editor for IEEE Signal Processing Letters and IEEE Transactions on Signal Processing, is a co-editor of the IEEE JSTSP special issue on Smart Grids, and is a member of the editorial board of IEEE Signal Processing Magazine. He has been a member of the IEEE Signal Processing Society technical committees SPCOM-TC and SAM-TC, and was the general chair of the IEEE SPAWC 2007 conference in Helsinki, Finland, in June 2007.
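As a small illustration of the tensor factorization methods mentioned in the abstract, here is a minimal CP (PARAFAC) decomposition by alternating least squares for a 3-way array, written in plain NumPy. It is a generic textbook sketch, not the speaker's code, and the tensor sizes and rank below are made up.

```python
import numpy as np

def cp_als(T, rank, n_iter=200, seed=0):
    """Rank-R CP (PARAFAC) decomposition of a 3-way tensor by alternating
    least squares: T[i,j,k] ~= sum_r A[i,r] * B[j,r] * C[k,r]."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A, B, C = (rng.normal(size=(n, rank)) for n in (I, J, K))
    for _ in range(n_iter):
        # Each update solves a least-squares problem with the other two
        # factors fixed (an MTTKRP followed by a small R x R pseudoinverse).
        A = np.einsum('ijk,jr,kr->ir', T, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', T, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', T, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Sanity check on a synthetic rank-3 tensor.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.normal(size=(d, 3)) for d in (8, 9, 10))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(T, rank=3)
T_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(T - T_hat) / np.linalg.norm(T))  # relative error, should be small
```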
February 28, 2014 (Note the special date)
Speaker: H. Vincent Poor, Professor of Electrical Engineering, Princeton University (note: this talk is on Friday, not Thursday)
Title: Privacy in the Smart Grid: An Information Theoretic Framework
Abstract: The proliferation of electronic data generated in smart grid and other applications has made potential leakage of private information through such data an important issue. This talk will first describe a fundamental information theoretic framework for examining, in a general setting, the tradeoff between the privacy of data and its measurable benefits. This framework will then be used to investigate two problems arising in the smart grid. The first of these is smart-meter privacy, in which the tradeoff between the privacy of information that can be inferred from meter data and the usefulness of that data is examined. The second is competitive privacy, which models situations in which multiple parties (e.g., power companies) need to exchange information to collaborate on tasks (e.g., management of a shared grid) without revealing company-sensitive data.
Bio: H. Vincent Poor is the Michael Henry Strater University Professor of Electrical Engineering at Princeton, where he is also the dean of the School of Engineering and Applied Science. His research interests are in the areas of information theory, statistical signal processing and stochastic analysis, and their applications in smart grid, wireless networks and related fields. His publications in these areas include the recent book Mechanisms and Games for Dynamic Spectrum Allocation, published by Cambridge University Press in 2014. Dr. Poor is a Fellow of the IEEE and a member of the National Academy of Engineering and the National Academy of Sciences.
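One way to make the privacy notion in the abstract concrete is to measure leakage as the mutual information between a private variable and the released data. The toy sketch below (my illustration, not the speaker's framework or code; the joint distributions are invented) shows how coarsening a meter reading reduces I(X;Y), at some cost in utility.

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits for a joint pmf given as a 2-D array."""
    p_xy = p_xy / p_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# X = private appliance state, Y = released meter reading (both binary here).
joint_fine   = np.array([[0.45, 0.05],
                         [0.05, 0.45]])   # readings track the appliance closely
joint_coarse = np.array([[0.30, 0.20],
                         [0.20, 0.30]])   # readings are heavily smoothed
print(mutual_information(joint_fine))    # ~0.53 bits leaked
print(mutual_information(joint_coarse))  # ~0.03 bits leaked, but less useful data
```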
March 13, 2014
Speaker: Kenneth W. Church, IBM Research
Title: Big Data Goes Mobile
Abstract: What is "big"? Time and space? Expense? Pounds? Power? Size of machine? Size of market? We will discuss many of these dimensions, but focus on throughput and latency (the mobility of data). If our clouds can't import and export data at scale, they may turn into roach motels where data can check in but can't check out. DataScope is designed to make it easy to import and export hundreds of terabytes on disks. Amdahl's laws have stood up remarkably well to the test of time. These laws explain how to balance memory, cycles and IO; there is an opportunity to extend them to balance for mobility as well.
Bio: Ken is currently at IBM working on Siri-like applications of speech on phones. Before that, he was the Chief Scientist of the HLTCOE at JHU, and he has also worked at Microsoft and AT&T. Education: MIT (undergraduate and graduate). He enjoys working with large datasets. Back in the 1980s, the Associated Press newswire (1 million words per week) was considered big, but he has since had the opportunity to work with much larger datasets such as AT&T's billing records and Bing's web logs. He has worked on many topics in computational linguistics including web search, language modeling, text analysis, spelling correction, word-sense disambiguation, terminology, translation, lexicography, compression, speech (recognition and synthesis), and OCR, as well as applications that go well beyond computational linguistics such as revenue assurance and virtual integration (using screen scraping and web crawling to integrate systems, such as billing and customer care, that traditionally don't talk to each other as well as they could). Service: past president of the ACL and former president of SIGDAT (the organization that organizes EMNLP). Honors: AT&T Fellow.
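For readers unfamiliar with the balance argument, the sketch below works through Amdahl's classic rules of thumb as they are usually quoted (roughly one byte of memory and one bit per second of I/O per instruction per second). It is my own back-of-the-envelope sketch, not material from the talk, and the machine parameters are invented.

```python
# Toy Amdahl-style balance check; all hardware numbers below are assumptions.
cpu_ips       = 3e9     # instructions per second per core (assumed)
cores         = 16
mem_bytes     = 64e9    # installed RAM in bytes (assumed)
io_bits_per_s = 10e9    # sustained disk/network bandwidth in bits/s (assumed)

instr_per_s = cpu_ips * cores
print("memory balance (rule of thumb ~1):", mem_bytes / instr_per_s)
print("I/O balance    (rule of thumb ~1):", io_bits_per_s / instr_per_s)
# Ratios far below 1 suggest the system is compute-heavy relative to its
# memory or I/O, i.e. data mobility, not cycles, is the likely bottleneck.
```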
March 27, 2014
Speaker: Howard Karloff, Yahoo! Labs, NYC
Title: Maximum Entropy Summary Trees
Abstract: Given a very large, node-weighted, rooted tree on, say, n nodes, if one has only enough space to display a k-node summary of the tree, what is the most informative way to draw it? We define a type of weighted tree that we call a "summary tree" of the original tree, which results from aggregating nodes of the original tree subject to certain constraints. We suggest that the best choice of summary tree (among those with a fixed number of nodes) is the one that maximizes the information-theoretic entropy of a natural probability distribution associated with the summary tree, and we provide a (pseudopolynomial-time) dynamic-programming algorithm to compute this maximum entropy summary tree when the weights are integral. The result is an automated way to summarize large trees and retain as much information about them as possible, while using (and displaying) only a fraction of the original node set. We also provide an additive approximation algorithm and a greedy heuristic that are faster than the optimal algorithm and generalize to trees with real-valued weights. This is joint work with Ken Shirley of AT&T Labs and Richard Cole of NYU.
Bio: After receiving his PhD from Berkeley, Howard Karloff taught at the University of Chicago and Georgia Tech before leaving Georgia Tech as a full professor to join AT&T Labs--Research in 1999. He left AT&T Labs in 2013 to join Yahoo Labs in New York. An editor of ACM's Transactions on Algorithms and an ACM Fellow, he has served on the program committees of numerous conferences, chaired the program committee of the 1998 Symposium on Discrete Algorithms (SODA), was general chair of the 2012 Symposium on Theory of Computing (STOC), and will be general chair of STOC 2014. He is the author of numerous journal and conference articles and the Birkhauser book "Linear Programming." His research interests span algorithms and optimization and extend to more applied areas of computer science such as databases, networking, and machine learning.
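A small sketch of the objective being maximized: given the aggregate weights of the k nodes of a candidate summary tree, the associated probability distribution is simply the normalized weights, and the score is its entropy. This is my illustration of the quantity described in the abstract, not the authors' dynamic-programming algorithm, and the weights below are made up.

```python
import numpy as np

def summary_entropy(node_weights):
    """Entropy (bits) of the distribution obtained by normalizing the
    aggregate weights of the k nodes of a candidate summary tree."""
    w = np.asarray(node_weights, dtype=float)
    p = w / w.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Two hypothetical 4-node summaries of the same tree:
print(summary_entropy([97, 1, 1, 1]))     # lopsided aggregation, low entropy
print(summary_entropy([25, 25, 25, 25]))  # balanced aggregation, 2.0 bits
# The maximum entropy summary tree is the k-node summary maximizing this score.
```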
April 3, 2014
Speaker: Wei Liu, IBM Research
Title: Handling Big Data: A Machine Learning Perspective
Abstract: With the rapid development of the Internet, tremendous amounts of data including images and videos, up to millions or billions of items, can now be collected for training machine learning models. Inspired by this trend, my current work is dedicated to developing large-scale machine learning techniques that make classification and nearest neighbor search practical on big data. My first approach is to explore data graphs to aid classification and nearest neighbor search. A graph offers an attractive way of representing data and discovering essential information such as the neighborhood structure. However, both the graph construction process and graph-based learning techniques become computationally prohibitive at large scale. To this end, I propose an efficient large-graph construction approach and subsequently apply it to develop scalable semi-supervised learning and unsupervised hashing algorithms. To address other practical application scenarios, I further develop advanced hashing techniques that incorporate supervised information or leverage unique formulations to cope with new forms of queries such as hyperplanes. All of the machine learning techniques I have proposed emphasize and pursue excellent performance in both speed and accuracy. The addressed problems, classification and nearest neighbor search, are fundamental to many practical problems across various disciplines. I therefore expect that the proposed solutions based on graphs and hashing will have a substantial impact on a great number of realistic large-scale applications.
Bio: Wei Liu received the M.Phil. and Ph.D. degrees in electrical engineering from Columbia University, New York, NY, USA in 2012. Currently, he is a research staff member at the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA, where he was the Josef Raviv Memorial Postdoctoral Fellow for one year starting in 2012. His research interests include machine learning, data mining, computer vision, and information retrieval. Dr. Liu is the recipient of the 2011-2012 Facebook Fellowship.
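For context, the sketch below shows the generic graph-based semi-supervised learning step (label diffusion over a normalized affinity matrix) that the scalable methods in the abstract approximate. It is a toy dense-matrix illustration with invented affinities, not the speaker's large-graph construction or hashing algorithms.

```python
import numpy as np

def label_propagation(W, y, alpha=0.9, n_iter=100):
    """Generic graph-based semi-supervised learning sketch: diffuse labels
    over a (small, dense here) affinity matrix W. y holds one-hot labels for
    the labeled nodes and zeros elsewhere. Scalable variants replace W with
    a sparse or low-rank approximation."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))          # symmetrically normalized affinity
    F = y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * y  # diffuse, then mix the seeds back in
    return F.argmax(axis=1)

# Two clusters of 3 nodes each, one labeled seed per cluster (toy affinities).
W = np.array([[0, 1, 1, 0,   0, 0],
              [1, 0, 1, 0,   0, 0],
              [1, 1, 0, 0.1, 0, 0],
              [0, 0, 0.1, 0, 1, 1],
              [0, 0, 0,   1, 0, 1],
              [0, 0, 0,   1, 1, 0]], dtype=float)
y = np.zeros((6, 2)); y[0, 0] = 1; y[3, 1] = 1
print(label_propagation(W, y))  # expected cluster labels: [0 0 0 1 1 1]
```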
April 10, 2014
Speaker: Silvio Lattanzi, Senior Research Scientist, Google Research New York
Title: Large-scale graph mining
Abstract: The amount of data available and requiring analysis has grown at an astonishing rate in recent years. To cope with this deluge of information it is fundamental to design new algorithms that analyze data efficiently. In this talk, we describe our effort to build a large-scale graph-mining library. We first describe the general framework and a few relevant problems that we are trying to solve. We then describe in detail two results: local algorithms for clustering, and learning from noisy feedback in a crowdsourcing system.
Bio: Silvio Lattanzi is a Senior Research Scientist at Google Research New York. He received his PhD from Sapienza University of Rome. During his PhD he interned twice at Google Research Mountain View and once at Yahoo! Research Santa Clara, and he also spent a semester visiting the University of Texas at Austin. His main research interests are in large-scale graph mining, information retrieval and probabilistic algorithms. Silvio has published several papers in top-tier conferences in information retrieval, social network analysis and algorithms. He has also served on the program committee or senior program committee of several top conferences, including WWW, WSDM and KDD.
April 24, 2014
Speaker: John Langford, Senior Researcher, Microsoft Research NYC
Title: Learning to Interact
Abstract: Large quantities of data are not explicitly labeled in the manner of traditional supervised learning. Instead, they come from observations. How do we effectively learn to intervene given this data source? I will address both the process of learning as well as some new work about the process of optimally and efficiently gathering information.
Bio: John Langford is a machine learning research scientist, a field which he says "is shifting from an academic discipline to an industrial tool". He is the author of the weblog hunch.net and the principal developer of Vowpal Wabbit. John works at Microsoft Research New York, of which he was one of the founding members, and was previously affiliated with Yahoo! Research, Toyota Technological Institute, and IBM's Watson Research Center. He studied Physics and Computer Science at the California Institute of Technology, earning a double bachelor's degree in 1997, and received his Ph.D. in Computer Science from Carnegie Mellon University in 2002. He was the program co-chair for the 2012 International Conference on Machine Learning.
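A concrete example of why interaction data differs from supervised data, as discussed in the abstract: only the reward of the action actually taken is logged, so evaluating a different policy requires importance weighting by the logging probabilities. The sketch below (my illustration with invented numbers, not the speaker's code) shows the standard inverse-propensity-score estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n = 3, 100_000
true_reward = np.array([0.2, 0.5, 0.8])          # unknown in practice

logging_probs = np.array([0.6, 0.3, 0.1])        # behavior policy (assumed logged)
actions = rng.choice(n_actions, size=n, p=logging_probs)
rewards = rng.binomial(1, true_reward[actions])  # only the chosen action's reward is seen

target_policy = np.array([0.0, 0.0, 1.0])        # policy we want to evaluate offline
ips = np.mean(target_policy[actions] / logging_probs[actions] * rewards)
print(ips)  # unbiased estimate of the target policy's value, close to 0.8
```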
Fall 2013
Thursday Noons, Hill Center
11:40am -- 12pm (Room 502, Lunch)
12:00pm -- 1pm (Room 552, Talk)
Rutgers Busch Campus, 110 Frelinghuysen Rd Piscataway
October 3rd
Speaker: Edo Liberty, Senior Research Scientist, Yahoo! Research NYC
Bio: Edo received his B.Sc. in Physics and Computer Science from Tel Aviv University and his Ph.D. in Computer Science from Yale University, under the supervision of Steven Zucker. During his PhD he spent time at both UCLA and Google as an engineer and a researcher. After that, he joined the Program in Applied Mathematics at Yale as a post-doctoral fellow. In 2009 he joined Yahoo! Labs in Israel. He recently moved to New York to lead the machine learning group, which focuses on the theory and practice of (very) large-scale data mining and machine learning; in particular, the theoretical foundations of machine learning, optimization, scalable scientific computing, and machine learning systems and platforms.
Title: Simple and Deterministic Matrix Sketches
Abstract: A sketch of a matrix A is another matrix B which is significantly smaller than A, but still approximates it well. Finding such sketches efficiently is an important building block in modern algorithms for approximating, for example, the PCA of massive matrices. This task is made more challenging in the streaming model, where each row of the input matrix can be processed only once and storage is severely limited. In this work, we adapt a well-known streaming algorithm for approximating item frequencies to the matrix sketching setting. Our experiments corroborate the algorithm's scalability and improved convergence rate. The presented algorithm is deterministic, simple to implement, and elementary to prove.
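Since the abstract describes the algorithm only at a high level, here is a minimal NumPy sketch of the idea as I understand it from the corresponding paper: maintain a small buffer of rows and, whenever it fills, shrink all directions by the squared median singular value, which frees at least half the buffer. The error bound in the comment is the one stated in that paper; treat this as an illustrative sketch rather than a reference implementation.

```python
import numpy as np

def frequent_directions(A, ell):
    """Streaming sketch B (ell x d, assumes ell <= d) of A (n x d), processing
    one row at a time. The paper's guarantee (as I recall it) is
    ||A^T A - B^T B||_2 <= 2 ||A||_F^2 / ell."""
    n, d = A.shape
    B = np.zeros((ell, d))
    for row in A:
        zero_rows = np.where(~B.any(axis=1))[0]   # empty slots in the sketch
        if len(zero_rows) == 0:
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2              # squared median singular value
            s = np.sqrt(np.maximum(s**2 - delta, 0.0))
            B = np.diag(s) @ Vt                   # at least half the rows become zero
            zero_rows = np.where(~B.any(axis=1))[0]
        B[zero_rows[0]] = row                     # insert the new row
    return B

rng = np.random.default_rng(0)
A = rng.normal(size=(5000, 40)) @ rng.normal(size=(40, 40))
B = frequent_directions(A, ell=20)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
print(err, "<=", 2 * np.linalg.norm(A, 'fro')**2 / 20)  # bound should hold
```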
October 17th
Speaker: Ping Li, Department of Statistics & Biostatistics and Department of Computer Science, Rutgers University
Title: Flexible Statistical Modeling from Massive Data by Boosting and Trees (and Comparisons with Deep Learning)
Abstract: Logistic regression has been around for perhaps 100 years. In textbooks, the derivative of the log likelihood is written as {y_k - p_k}, where y_k = 0 or 1 is the k-th class label and p_k is the class probability. About 5 years ago, I observed that, due to the sum-to-zero constraint, the derivative can also be written as {y_k - p_k} - {y_0 - p_0} when using the 0-th class as a baseline. The second derivative can, of course, be written differently too. It turns out that using these new derivatives can lead to almost unbelievably substantial improvements in classification accuracy in the boosting framework (e.g., MART and LogitBoost).
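A tiny numeric illustration of the observation in the abstract (my own example, not the speaker's MART/LogitBoost code): for a single multi-class example, the two gradient forms differ only by the baseline class's term, which is possible because the textbook gradient sums to zero across classes.

```python
import numpy as np

# One multi-class example with invented model probabilities p and one-hot label y.
p = np.array([0.2, 0.5, 0.3])   # class probabilities for classes 0, 1, 2
y = np.array([0.0, 1.0, 0.0])   # true class is 1

textbook  = y - p                      # classic form: y_k - p_k
baseline0 = (y - p) - (y[0] - p[0])    # form using class 0 as the baseline
print(textbook)                        # [-0.2  0.5 -0.3]
print(baseline0)                       # [ 0.   0.7 -0.1]
print(textbook.sum(), baseline0[0])    # ~0 (sum-to-zero); baseline entry is exactly 0
```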
November 8th (Jointly with ECE/SIP Seminar)
Speaker: Yann LeCun, Professor, New York University (jointly with the ECE/SIP Seminar; note the special date)
Title: Computer Perception with Deep Learning
Abstract: Pattern recognition tasks, particularly perceptual tasks such as vision and audition, require the extraction of good internal representations of the data prior to classification. Designing feature extractors that turn raw data into suitable representations for a classifier often requires a considerable amount of engineering and domain expertise.
Bio: Yann LeCun is the founding director of the Center for Data Science at New York University, and Silver Professor of Computer Science, Neural Science, and Electrical Engineering at the Courant Institute of Mathematical Sciences, the Center for Neural Science, and the ECE Department at NYU-Poly.
November 14th
Speaker: Sanjiv Kumar, Google Research NYC
Title: Learning binary representations for fast similarity search in massive databases
Abstract: Binary-coding-based approximate nearest neighbor (ANN) search in huge databases has attracted much attention recently due to its fast query time and drastically reduced storage needs. There are several challenges in developing a good ANN search system. A fundamental question that comes up often is: how difficult is ANN search in a given dataset? In other words, which data properties affect the quality of ANN search, and how? Moreover, different application scenarios call for different types of learning methods. In this talk, I will discuss what makes ANN search difficult, and a variety of binary coding techniques for non-negative data, data that lives on a manifold, and matrix data.
Bio: Sanjiv Kumar is currently a Research Scientist at Google Research, NY. He received his PhD from The Robotics Institute, Carnegie Mellon University in 2005, and a Masters from the Indian Institute of Technology Madras, India in 1997. During 1997-2000, he worked at the National University Hospital Singapore developing a robotic colonoscopy system, and at the National Robotics Engineering Consortium, Pittsburgh, USA on a robotic transportation system. His research interests include large-scale machine learning and computer vision, graphical models, medical imaging and robotics.
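To make the setting concrete, here is a generic sketch of binary-code ANN search: sign-of-random-projection codes ranked by Hamming distance. The learned codes that are the focus of the talk would replace the random projection matrix with one trained on the data; all sizes below are invented and this is not the speaker's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_bits = 10_000, 64, 32

X = rng.normal(size=(n, d))
R = rng.normal(size=(d, n_bits))           # random hyperplanes (illustrative only)
codes = (X @ R > 0)                        # n x n_bits boolean code matrix

def hamming_search(q, k=5):
    """Return the indices of the k database items closest to q in Hamming space."""
    q_code = (q @ R > 0)
    dist = (codes != q_code).sum(axis=1)   # Hamming distance to every item
    return np.argsort(dist)[:k]

q = X[123] + 0.05 * rng.normal(size=d)     # a slightly perturbed database item
print(hamming_search(q))                   # item 123 should rank at or near the top
```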
November 21st
Speaker: Yi Wang, Professor of Radiology, Weill Cornell Medical College, NYC
Title: Bayesian image reconstruction to decode biomarkers from noisy incomplete data in MRI
Abstract: What is seen in a voxel (~1 mm³) in current medical imaging is a complex sum of contributions from millions of cells in that voxel and is invariably highly contaminated by noise. Decoding critical cellular information about diseases from noisy image data is often an ill-posed inverse problem. Fortunately, there is abundant prior information in medical imaging, such as anatomic structure, that can be used to regularize the inverse problem via a Bayesian approach. We will demonstrate Bayesian reconstruction in magnetic resonance imaging (MRI), which is very sensitive to the presence of many diseases. One example is quantitative susceptibility mapping, which estimates from MRI data the molecular polarizability in the scanner magnet that reflects essential cellular activities. Another example is 4D imaging at high spatio-temporal resolution to capture the dynamic transport processes that perfuse and vitalize tissue.
Bio: Yi Wang (PhD 1994, University of Wisconsin-Madison) is the Faculty Distinguished Professor of Radiology and Professor of Biomedical Engineering at Cornell University. Dr. Wang is a Fellow of the ISMRM and AIMBE, and an active grant reviewer for many agencies including the NIH and the European Research Council. As a PI he has been awarded multiple NIH grants, and he has published more than 130 papers in peer-reviewed scientific journals as well as a textbook, "Principles of Magnetic Resonance Imaging." Dr. Wang has been a very active researcher in MRI. He has invented several key technologies in cardiovascular MRI, including the multi-station stepping-table platform, bolus-chase MRA, time-resolved contrast-enhanced MRA, and navigator motion compensation for cardiac MRI. He has also pioneered quantitative susceptibility mapping (QSM), a vibrant new field in MRI for studying the magnetic susceptibility properties of tissues in health and disease.
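As a generic illustration of the Bayesian regularization idea in the abstract (not the QSM or 4D reconstruction pipelines): for a linear forward model with Gaussian noise and a Gaussian prior, the MAP reconstruction is Tikhonov-regularized least squares, which turns an underdetermined problem into a well-posed one. The problem sizes and prior strength below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n_meas, n_vox = 50, 100                    # fewer measurements than unknowns (ill-posed)

A = rng.normal(size=(n_meas, n_vox))       # forward model (assumed known)
x_true = np.zeros(n_vox); x_true[::10] = 1.0
y = A @ x_true + 0.05 * rng.normal(size=n_meas)

lam = 1.0                                  # prior strength / noise trade-off (assumed)
# MAP estimate under a Gaussian prior: minimize ||A x - y||^2 + lam * ||x||^2.
x_map = np.linalg.solve(A.T @ A + lam * np.eye(n_vox), A.T @ y)
print(np.linalg.norm(x_map - x_true) / np.linalg.norm(x_true))  # relative error
```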