Technical Reports
2007-001: Simultaneous model selection via rate-distortion theory, with applications to clustering and significance analysis of gene expression data, by Rebecka Jörnsten [3/27/07] ... In this paper we introduce a simultaneous approach to subset model selection, applicable to both model selection in clustering and significance analysis of differential expression. Our approach draws on results from rate-distortion theory. The rate-distortion formulation allows us to turn the combinatorial model selection into a fast and simple line search. ...

2007-002: Quantile scale curves, by Kesar Singh, David E. Tyler and Jingshan Zhang [3/28/07] A concept of scale curve based on data-depth was introduced in Liu, Parelius and Singh (1999). A scale curve describes the growth of a scalar scale measure of a multivariate population/data-cloud, as the distribution/data progresses center-outwardly. This article proposes a new concept for a scale curve, called the quantile scale curve, which is conceptually and computationally simpler than a data-depth based scale curve. The article focuses on the data-analytic utility of the proposed quantile scale curve, and in particular on a proposed graphical test for detecting linear and nonlinear association between two groups of variables. Other problems addressed using the concept of a quantile scale curve are: exploring heavy tailedness (via a bootscale plot), and multivariate location and scale testing. The article includes a distributional result on simplicial volumes which is of independent interest.

2007-003: Penalized linear unbiased selection, by Cun-Hui Zhang [4/20/07] We introduce MC+, a fast, continuous, nearly unbiased, and accurate method of penalized variable selection in high-dimensional linear regression. The LASSO is fast and continuous, but biased. The bias of the LASSO interferes with variable selection. Subset selection is unbiased but computationally costly.
The MC+ has two elements: a minimax concave penalty (MCP) and a penalized linear unbiased selection (PLUS) algorithm. The MCP provides the minimum non-convexity of the penalized loss given the level of bias. The PLUS computes multiple local minimizers of a possibly non-convex penalized loss function in a certain main branch of the graph of such solutions. Its output is a continuous piecewise linear path extending from the origin to an optimal solution for zero penalty. We prove that for a universal penalty level, the MC+ has high probability of correct selection under much weaker conditions compared with existing results for the LASSO for large n and p, including the case of $p \gg n$. We provide estimates of the noise level for proper choice of the penalty level. We choose the sparsest solution within the PLUS path for a given penalty level. We derive degrees of freedom and Cp-type risk estimates for general penalized LSE, including the LASSO estimator, and prove their unbiasedness. We provide necessary and sufficient conditions for the continuity of the penalized LSE under general sub-square penalties. Simulation results overwhelmingly support our claim of superior variable selection properties and demonstrate the computational efficiency of the proposed method.

2007-004: Clustering with multiple distance metrics - mixture models with profile transformations, by Rebecka Jörnsten [4/23/07] Clustering methods often require the selection of a distance metric; how do we define data objects as 'close' enough to be grouped together, or 'far' enough apart to be separated? Choosing an appropriate distance metric is not always easy. We consider high-dimensional gene expression data as an example. The shape of a gene's expression profile across experimental conditions is often considered to be the most informative, which translates to choosing correlation as a similarity metric.
However, when genes with a similar expression profile exhibit expression differences on a scale of two-fold to ten-fold, correlation comparisons do not suffice, implying that a Euclidean distance metric is more appropriate. We propose a model-based clustering approach, MIXT (MIXture modeling with profile Transformations), which incorporates multiple distance metrics simultaneously. The modeling framework constitutes a between-cluster parameterization, allowing for direct and objective cluster comparisons. With this more efficient parameterization, we detect clusters that a standard model-based clustering approach may miss. We demonstrate the utility of the MIXT model via the analysis of a time-course gene expression data set, with two experimental factors, and discuss the biological relevance of the gene clusters identified.

2007-005: Some performance bounds for least squares regression with L1 regularization, by Tong Zhang [9/24/07] We derive performance bounds for L1-regularized least squares regression that can be directly compared to performance bounds for the Dantzig selector in [3]. Our main result for L1-regularization non-trivially dominates that of [3] in the following sense: the condition for our bound to apply is strictly weaker, and when the condition holds, the performance guarantee proved for L1-regularization is non-trivially better than that of the Dantzig selector.

2007-006: A practical procedure to find matching priors for frequentist inference, by Juan (Jane) Zhang and John E. Kolassa [10/24/07] We give a practical way to find the matching priors proposed by Welch and Peers (1963) and Peers (1965). Then we investigate the use of saddlepoint approximations combined with matching priors and obtain p-values of the test of interest. The advantage of our procedure is the flexibility of choosing different initial conditions so that one can adjust the performance of the test. Two examples have been studied via Monte Carlo simulation.
One relates to the ratio of two exponential means, and the other is about the logistic regression model. One of the numerical studies is under small sample size settings.

2007-007: General maximum likelihood empirical Bayes estimation of normal means, by Wenhua Jiang and Cun-Hui Zhang [12/26/07] We propose a general maximum likelihood empirical Bayes (GMLEB) method for the estimation of a mean vector based on observations with iid normal errors. We prove that under mild moment conditions on the means, the average mean squared error (MSE) of the GMLEB is within an infinitesimal fraction of the minimum average MSE among all separable estimators which use a single deterministic estimation function on individual observations, provided that the risk is of greater order than $(\log n)^5/n$. We also prove that the GMLEB is simultaneously uniformly approximately minimax when the p-th moment of the unknown means is between $(\log n)^{\kappa_1}/n$ and $n^{p/2}/(\log n)^{\kappa_2}$. Simulation experiments demonstrate that the GMLEB outperforms the James-Stein and several state-of-the-art threshold estimators in a wide range of settings without much downside.

2007-008: Information-theoretic optimality of variable selection with concave penalty, by Cun-Hui Zhang [12/31/07] We prove the optimality of the MC+ [16] in the sense that the amount of information it requires for consistent variable selection in the linear regression model is of the same order as the minimum possible under mild conditions on deterministic or random design matrices. A similar result has been proved for the LASSO when the design matrix has iid normal entries [13], but due to the estimation bias, the LASSO does not enjoy this optimality property in general without two restrictive assumptions. Similar but less explicit optimality results can be obtained for the SCAD and other methods with "unbiased" concave penalty on the squared loss, with a more complicated version of our proof.
Some simulation results are reported to support our claims.
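
The estimation bias of the LASSO that reports 2007-003 and 2007-008 contrast with concave penalties can be seen in the simplest setting: a single coefficient under an orthonormal design, where both penalized estimates have closed forms. The sketch below compares the soft-threshold rule (LASSO) with the standard univariate minimax concave penalty (MCP) rule. It is only an illustration under that assumption, not the PLUS algorithm from the reports; the function names, the default `gamma=3.0`, and the sample values are hypothetical choices made for this example.

```python
def soft_threshold(z, lam):
    """Univariate LASSO estimate: shrinks every nonzero estimate by lam."""
    if abs(z) <= lam:
        return 0.0
    sign = 1.0 if z > 0 else -1.0
    return sign * (abs(z) - lam)


def mcp_threshold(z, lam, gamma=3.0):
    """Univariate MCP estimate (requires gamma > 1).

    Minimizes 0.5 * (z - t)**2 + penalty(|t|), where the penalty's
    derivative is lam * max(0, 1 - |t| / (gamma * lam)).
    """
    if gamma <= 1.0:
        raise ValueError("gamma must exceed 1 for a unique minimizer")
    a = abs(z)
    if a <= lam:
        return 0.0          # small signals are set to zero, as with the LASSO
    if a <= gamma * lam:
        sign = 1.0 if z > 0 else -1.0
        return sign * (a - lam) / (1.0 - 1.0 / gamma)
    return z                # large signals are left unshrunk (no bias)


if __name__ == "__main__":
    lam = 1.0  # illustrative penalty level
    for z in (0.5, 2.0, 5.0, -5.0):
        print(z, soft_threshold(z, lam), mcp_threshold(z, lam))
```

For an observation well above the threshold, the soft-threshold rule still subtracts `lam`, while the MCP rule returns the observation unchanged. This is the sense in which the concave penalty is described as nearly unbiased.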