Statistics Seminar

Fall 2018

Seminars are held from 4:00 p.m. – 5:00 p.m. in Griffin-Floyd 100 unless otherwise noted. Refreshments are available before the seminars from 3:30 p.m. – 4:00 p.m. in Griffin-Floyd Hall 103.

Date    Speaker and Title (abstracts below)

Sep 27  Ming Yuan (Columbia University)
        On the Sample Complexity for Approximating High Dimensional Functions of Few Variables

Oct 4   Sebastien Haneuse (Harvard T.H. Chan School of Public Health)
        On the analysis of two-phase designs in cluster-correlated data settings

Oct 11  Veronika Rockova (University of Chicago Booth School of Business)
        Theory for BART

Oct 18  Hongcheng Liu (University of Florida)
        Second order optimality conditions in high-dimensional learning

Oct 25  Hongyuan Cao (Florida State University)
        Regression analysis of longitudinal data with omitted asynchronous longitudinal covariate

Nov 1   Pierre Jacob (Harvard University)
        Unbiased Markov chain Monte Carlo with couplings

Nov 8   Masayo Y. Hirose (The Institute of Statistical Mathematics)
        A Second-order Empirical Bayes Confidence Interval in the Presence of High Leverage for Small Area Inference

Nov 15  Subharup Guha (University of Florida)
        A Nonparametric Bayesian Technique for High-dimensional Regression

Abstracts

On the Sample Complexity for Approximating High Dimensional Functions of Few Variables
Ming Yuan, Columbia University

We investigate the optimal sample complexity of recovering a general high dimensional sparse function, and the means for trading off sample and computational complexities. Exploiting the connection between approximation of a smooth function and exact recovery of a grid function, we identify the optimal sample complexity for recovering a high dimensional sparse function based on point queries. Our result provides a precise characterization of the potential loss of information when restricting to point queries as opposed to the more general linear queries, as well as of the effects of measurement error on recovery.

On the analysis of two-phase designs in cluster-correlated data settings
Sebastien Haneuse, Harvard T.H. Chan School of Public Health

In public health research, the information that is readily available may be insufficient to address the primary question(s) of interest. One cost-efficient way forward, especially in resource-limited settings, is to conduct a two-phase study in which the population is initially stratified, at phase I, by the outcome and/or some categorical risk factor(s). At phase II, detailed covariate data are ascertained on a sub-sample within each phase I stratum. While analysis methods for two-phase designs are well established, they have focused exclusively on settings in which participants are assumed to be independent. As such, when participants are naturally clustered (e.g., patients within clinics), these methods may yield invalid inference. To address this, we develop a novel analysis approach based on inverse-probability weighting (IPW) that permits researchers to specify a working covariance structure, appropriately accounts for the sampling design, and ensures valid inference via a robust sandwich estimator. In addition, to enhance statistical efficiency, we propose a calibrated IPW estimator that makes use of information available at phase I but not used in the design. A comprehensive simulation study is conducted to evaluate small-sample operating characteristics, including the impact of using naive methods that ignore correlation due to clustering, as well as to investigate design considerations. Finally, the methods are illustrated using data from a one-time survey of the national anti-retroviral treatment program in Malawi.
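To make the weighting idea concrete, the following is a minimal sketch, not the speaker's implementation, of an inverse-probability-weighted linear estimator with a cluster-robust sandwich variance; the data layout, sampling probabilities, and function name are illustrative assumptions.

import numpy as np

def ipw_cluster_linear(X, y, pi, cluster):
    """IPW weighted least squares with a cluster-robust sandwich variance.

    X       : (n, p) design matrix for phase-II subjects
    y       : (n,)   outcomes
    pi      : (n,)   phase-II sampling probabilities (known from the design)
    cluster : (n,)   cluster labels (e.g., clinic IDs)
    """
    w = 1.0 / pi                           # inverse-probability weights
    Xw = X * w[:, None]
    bread = np.linalg.inv(X.T @ Xw)        # (sum_i w_i x_i x_i')^{-1}
    beta = bread @ (Xw.T @ y)              # weighted least-squares estimate
    resid = y - X @ beta
    # "meat": weighted score contributions accumulated within clusters
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        idx = cluster == g
        s_g = (w[idx, None] * X[idx] * resid[idx, None]).sum(axis=0)
        meat += np.outer(s_g, s_g)
    vcov = bread @ meat @ bread            # sandwich variance estimator
    return beta, vcov

A calibrated version would replace the design weights with weights adjusted so that the weighted phase-II totals of variables known at phase I match their full phase-I totals, which is the route to the efficiency gain described above.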

Theory for BART
Veronika Rockova, University of Chicago Booth School of Business

The remarkable empirical success of Bayesian additive regression trees (BART) has raised considerable interest in understanding why and when this method produces good results. Since its inception nearly 20 years ago, BART has become widely used in practice and yet, theoretical justifications have been unavailable. To narrow this yawning gap, we study estimation properties of Bayesian trees and tree ensembles in nonparametric regression (such as the speed of posterior concentration, reluctance to overfit, variable selection and adaptation in high-dimensional settings). Our approach rests upon a careful analysis of recursive partitioning schemes and associated sieves of approximating step functions. We develop several useful tools for analyzing additive regression trees, showing their optimal performance in both additive and non-additive regression. Our results constitute a missing piece of the broader theoretical puzzle as to why Bayesian machine learning methods like BART have been so successful in practice.
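For readers less familiar with the method, BART models the regression function as a sum of trees (Chipman, George and McCulloch, 2010); the display below is the standard formulation and is included here only as background, not taken from the talk itself:

\[ y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad f(x) = \sum_{t=1}^{T} g(x;\, \mathcal{T}_t, \mathcal{M}_t), \]

where each g(x; \mathcal{T}_t, \mathcal{M}_t) is a step function returning the leaf value, from \mathcal{M}_t, of the leaf of tree \mathcal{T}_t into which x falls. The posterior over the ensemble f, induced by priors on the trees and leaf values, is the object whose concentration properties the talk addresses.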

Second order optimality conditions in high-dimensional learning
Hongcheng Liu, University of Florida

In modern data-driven applications, high dimensionality has become a looming challenge: when a learning problem has many more fitting parameters than samples, traditional statistical theories and tools may fail as a result of overfitting. This talk will focus on a previously developed regularization scheme, the folded concave penalty (FCP). For FCP-based learning, there remain open questions as to (i) whether tractable stationary points are sufficient to ensure the desired statistical performance; (ii) whether the statistical performance can be algorithm-independent; and (iii) whether high-dimensional learning is possible beyond the common assumption of restricted strong convexity. My answers to the above questions are all affirmative. This talk will present theoretical evidence and numerical experiments to showcase the efficacy of certain pseudo-polynomial-time computable stationary points that are characterized by the second-order necessary conditions of the FCP-based formulations.
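As background (standard material, not taken from the abstract), an FCP-based estimator solves a penalized problem of the form

\[ \min_{\beta \in \mathbb{R}^p} \; \frac{1}{n} \sum_{i=1}^{n} \ell(\beta;\, x_i, y_i) \;+\; \sum_{j=1}^{p} P_\lambda(|\beta_j|), \]

where P_\lambda is a folded concave penalty; the SCAD penalty of Fan and Li (2001) is a canonical example: for a > 2,

\[ P_\lambda(t) = \begin{cases} \lambda t, & 0 \le t \le \lambda, \\ \dfrac{2a\lambda t - t^2 - \lambda^2}{2(a-1)}, & \lambda < t \le a\lambda, \\ \dfrac{\lambda^2(a+1)}{2}, & t > a\lambda. \end{cases} \]

Because P_\lambda is nonconvex, the optimization problem can have many stationary points, which is why identifying which stationary points carry statistical guarantees, the subject of the talk, matters.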

Regression analysis of longitudinal data with omitted asynchronous longitudinal covariate
Hongyuan Cao, Florida State University

Long-term follow-up with longitudinal data is common in many medical investigations. In such studies, some longitudinal covariates may be omitted for various reasons. A naive approach that simply ignores the omitted longitudinal covariate can lead to biased estimators. In this talk, we propose new unbiased estimation methods that accommodate an omitted longitudinal covariate. In addition, if the omitted longitudinal covariate is asynchronous with the longitudinal response, a two-stage approach is proposed for valid statistical inference. Asymptotic properties of the proposed estimators are established. Extensive simulation studies provide numerical support for the theoretical findings. We illustrate the performance of our method on a dataset from an HIV study.

Unbiased Markov chain Monte Carlo with couplings
Pierre Jacob, Harvard University

Markov chain Monte Carlo methods provide consistent approximations of integrals as the number of iterations goes to infinity. However, these estimators are generally biased after any fixed number of iterations, which complicates parallel computation and other tasks. In this talk, I will explain how to remove this burn-in bias by using couplings of Markov chains and a telescopic sum argument due to Glynn & Rhee (2014). The resulting unbiased estimators can be computed independently in parallel, and various methodological developments follow. I will discuss the benefits and limitations of the proposed framework in various settings of Bayesian inference. This is joint work with John O’Leary and Yves F. Atchade, available at arxiv.org/abs/1708.03625.
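To illustrate the construction, here is a minimal sketch of the coupled-chain estimator under simplifying assumptions: a one-dimensional standard normal target, random-walk Metropolis-Hastings, lag one, and no burn-in. It is not the speaker's code, and the function names are ours.

import numpy as np

def log_pi(x):
    # log-density of the target, here a standard normal (illustrative choice)
    return -0.5 * x * x

def norm_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def max_coupling(mu1, mu2, s, rng):
    # Maximal coupling of N(mu1, s^2) and N(mu2, s^2): the two draws are equal
    # with the largest possible probability (rejection sampler).
    x = rng.normal(mu1, s)
    if rng.uniform() * norm_pdf(x, mu1, s) <= norm_pdf(x, mu2, s):
        return x, x
    while True:
        y = rng.normal(mu2, s)
        if rng.uniform() * norm_pdf(y, mu2, s) > norm_pdf(y, mu1, s):
            return x, y

def mh_step(x, s, rng):
    # single-chain random-walk Metropolis-Hastings step
    prop = rng.normal(x, s)
    return prop if np.log(rng.uniform()) < log_pi(prop) - log_pi(x) else x

def coupled_mh_step(x, y, s, rng):
    # Propose jointly via the maximal coupling and accept with a common uniform,
    # so the two chains can meet exactly.
    xp, yp = max_coupling(x, y, s, rng)
    log_u = np.log(rng.uniform())
    x_new = xp if log_u < log_pi(xp) - log_pi(x) else x
    y_new = yp if log_u < log_pi(yp) - log_pi(y) else y
    return x_new, y_new

def unbiased_estimate(h, s, rng, max_iter=10**5):
    # Telescoping-sum estimator with lag 1 and no burn-in:
    # h(X_0) + sum_{t=1}^{tau-1} [ h(X_t) - h(Y_{t-1}) ], tau = meeting time.
    x = rng.normal(0.0, 5.0)       # X_0 from an over-dispersed initial law
    y = rng.normal(0.0, 5.0)       # Y_0 from the same initial law
    est = h(x)
    x = mh_step(x, s, rng)         # advance X one step ahead (the lag)
    for _ in range(max_iter):
        if x == y:                 # chains have met: remaining terms vanish
            break
        est += h(x) - h(y)
        x, y = coupled_mh_step(x, y, s, rng)
    return est

rng = np.random.default_rng(1)
# average of independent unbiased estimators of E[X^2] = 1 under N(0, 1)
print(np.mean([unbiased_estimate(lambda x: x * x, 2.0, rng) for _ in range(2000)]))

In practice a burn-in and a time-averaged variant are used to reduce variance; the independent replicates are then averaged, and a confidence interval follows from the central limit theorem.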

A Second-order Empirical Bayes Confidence Interval in the Presence of High Leverage for Small Area Inference
Masayo Y. Hirose, The Institute of Statistical Mathematics

In small area estimation, the second-order empirical Bayes confidence interval, whose coverage error is of third order in the number of areas, is widely used when the sample size within each area is too small to make reliable direct estimates based on the design-based approach. In particular, Yoshimori and Lahiri (2014) proposed such an empirical Bayes confidence interval, which achieves a shorter length than the confidence interval based on the direct estimator. However, their interval may have an issue in the presence of high leverage. In this talk, we will introduce an empirical Bayes confidence interval, proposed in Hirose (2017), that requires milder conditions than the Yoshimori and Lahiri (2014) interval. Moreover, we will show that our confidence interval is more tractable. Furthermore, we will report the results of a simulation study demonstrating the overall superiority of our confidence interval over the other methods.

A Nonparametric Bayesian Technique for High-dimensional Regression
Subharup Guha, University of Florida

The methodology discussed in this talk is motivated by recent high-throughput investigations in biomedical research, especially in cancer. Advances in array-based and next-generation sequencing technologies allow for simultaneous measurements of biological units (e.g., genes) on a relatively small number of subjects. Practitioners often wish to select important genes involved with disease processes and develop efficient prediction models for patient-specific clinical outcomes, such as continuous survival times or categorical tumor subtypes. The analytical challenges posed by such data include not only high dimensionality, but also the existence of considerable gene-gene correlations induced by biological interactions.

We propose an efficient, nonparametric Bayesian framework for simultaneous variable selection, clustering and prediction in high-throughput regression settings with continuous or discrete outcomes, called VariScan. The statistical model utilizes the sparsity induced by Poisson-Dirichlet processes (PDPs) to group the covariates into lower-dimensional latent clusters consisting of covariates with similar patterns for the subjects. The data are permitted to direct the choice of a suitable cluster allocation scheme, choosing between PDPs and their special case, Dirichlet processes. Subsequently, the latent clusters are used to build a nonlinear prediction model for the responses using an adaptive mixture of linear and nonlinear elements, thus achieving a balance between model parsimony and flexibility.

Contrary to conventional belief, cluster detection is shown to be a posteriori consistent for a general class of models as the number of covariates and subjects grows, guaranteeing the high accuracy of the model-based clustering procedure. Through simulation studies and analyses of benchmark cancer data sets, we demonstrate that the VariScan technique compares favorably to, and often outperforms, well-known statistical and machine learning techniques for Big Data in terms of prediction accuracy of the responses.