Publications
Hypothesis selection via sample splitting for valid powerful testing in matched observational studies
William Bekerman, Abhinandan Dalal, Carlo del Ninno, Dylan S Small
Under Review
[link] [code]
Observational studies are valuable tools for inferring causal effects in the absence of controlled experiments. However, these studies may be biased due to the presence of some relevant, unmeasured set of covariates. One approach to mitigate this concern is to identify hypotheses likely to be more resilient to hidden biases by splitting the data into a planning sample for designing the study and an analysis sample for making inferences. We devise a powerful and flexible method for selecting hypotheses in the planning sample when an unknown number of outcomes are affected by the treatment. We investigate the theoretical properties of our method and conduct extensive simulations that demonstrate pronounced benefits, especially at higher levels of allowance for unmeasured confounding. Finally, we demonstrate our method in an observational study of the multi-dimensional impacts of a devastating flood in Bangladesh.
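The planning/analysis split described in this abstract can be illustrated with a minimal sketch. It is not the paper's selection method: it assumes matched-pair differences per outcome, uses a sign test with the standard sensitivity bound (under hidden bias at most Γ, the chance a pair difference is positive is at most Γ/(1+Γ)), and applies a hypothetical selection rule (`planning_alpha`) followed by a Bonferroni correction in the analysis sample. All function and parameter names are illustrative.

```python
import random
from math import comb


def sign_test_upper_pvalue(diffs, gamma=1.0):
    """Worst-case one-sided sign-test p-value when hidden bias is at most gamma.

    gamma = 1 gives the ordinary sign test; zero differences (ties) are dropped.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    t = sum(1 for d in nonzero if d > 0)
    p_plus = gamma / (1.0 + gamma)  # largest chance of a positive difference
    return sum(comb(n, k) * p_plus**k * (1.0 - p_plus)**(n - k)
               for k in range(t, n + 1))


def split_select_test(pair_diffs, planning_frac=0.2, gamma=1.5,
                      planning_alpha=0.2, alpha=0.05, seed=0):
    """pair_diffs maps outcome name -> treated-minus-control differences,
    with every list in the same matched-pair order."""
    n_pairs = len(next(iter(pair_diffs.values())))
    order = list(range(n_pairs))
    random.Random(seed).shuffle(order)
    n_plan = int(planning_frac * n_pairs)
    plan, analysis = order[:n_plan], order[n_plan:]

    # Planning sample: keep outcomes that look resilient to bias gamma
    # (hypothetical rule, chosen only for illustration).
    selected = [name for name, d in pair_diffs.items()
                if sign_test_upper_pvalue([d[i] for i in plan], gamma)
                <= planning_alpha]

    # Analysis sample: test only the selected outcomes, Bonferroni-corrected.
    rejected = [name for name in selected
                if sign_test_upper_pvalue(
                    [pair_diffs[name][i] for i in analysis], gamma)
                <= alpha / len(selected)]
    return selected, rejected
```

Because the planning pairs are discarded before inference, the analysis-sample tests stay valid regardless of how the outcomes were selected; the price is the smaller analysis sample.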
An observational study on effects of contact, collision, and non-contact sports participation on cognitive and emotional health
Hannah A Jin, William Bekerman, Dylan S Small, Amanda Rabinowitz
Under Review
[link]
In light of controversy regarding collision sports like football, this study uses the Adolescent Brain Cognitive Development (ABCD) dataset to determine how the intensity of different types of sports participation during childhood relates to cognitive and emotional outcomes in 9974 young adolescents. We calculated sports volume as the number of hours that the child participated in certain sports in a year during the child’s most active period. Our outcomes are measured when the youths are aged 8.9-11.1 years: the NIH Toolbox Cognition score and the Parent Child Behavior Checklist (CBCL) emotional problems score. We fit linear mixed effects models with each outcome as the dependent variable, with contact sports volume, collision sports volume, non-contact sports volume, and demographic covariates as fixed effects, and with random intercepts for family and site ID. Non-contact sports volume was associated with more emotional problems, while contact sports volume was associated with fewer emotional problems. There was no significant association between collision sports volume and emotional problems. Compared to participants with no history of traumatic brain injury (TBI), participants with possible mild TBI and TBI with loss of consciousness (TBI-LOC) had more emotional problems. This study provides insights into how sports participation relates to cognitive and emotional outcomes during a crucial developmental time window.
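One way to write down the mixed effects model described in this abstract is sketched below. The notation is an assumption, not taken from the paper: i indexes children, f families, and s sites; x collects the demographic covariates; and the normal distributions for the random intercepts and errors are the conventional choice rather than something the abstract states.

```latex
\[
y_{ifs} = \beta_0
        + \beta_1\,\mathrm{Contact}_{ifs}
        + \beta_2\,\mathrm{Collision}_{ifs}
        + \beta_3\,\mathrm{NonContact}_{ifs}
        + \mathbf{x}_{ifs}^{\top}\boldsymbol{\gamma}
        + u_f + v_s + \varepsilon_{ifs},
\]
\[
u_f \sim N(0, \sigma_u^2), \qquad
v_s \sim N(0, \sigma_v^2), \qquad
\varepsilon_{ifs} \sim N(0, \sigma^2).
\]
```

Here y is either the cognition score or the emotional problems score, and the sign of each volume coefficient corresponds to the direction of the associations reported above.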
RZiMM-scRNA: A regularized zero-inflated mixture model framework for single-cell RNA-seq data
Xinlei Mi, William Bekerman, Anil K Rustgi, Peter A Sims, Peter D Canoll, Jianhua Hu
Annals of Applied Statistics
[link] [code]
Applications of single-cell RNA sequencing (scRNA-seq) in various biomedical research areas have been blooming. This new technology provides unprecedented opportunities to study disease heterogeneity at the cellular level. However, unique characteristics of scRNA-seq data, including large dimensionality, high dropout rates, and possibly batch effects, make such data difficult to analyze. Failing to address these issues appropriately obstructs scientific discovery. Herein we propose a unified Regularized Zero-inflated Mixture Model framework designed for scRNA-seq data (RZiMM-scRNA) to simultaneously detect cell subgroups and identify gene differential expression based on a developed importance score, accounting for both dropouts and batch effects. We conduct extensive simulation studies in which we evaluate the performance of RZiMM-scRNA and compare it with several popular methods, including Seurat, SC3, K-means, and hierarchical clustering. Simulation results show that RZiMM-scRNA achieves superior clustering performance and more accurate biomarker detection than alternative methods, especially when cell subgroups are less distinct, supporting the robustness of our method.
Our empirical investigations focus on two brain tumor studies dealing with astrocytoma of various grades, including the most malignant of all brain tumors, glioblastoma multiforme (GBM). Our goal is to delineate cell heterogeneity and identify driving biomarkers associated with these tumors. Notably, RZiMM-scRNA successfully identifies a small group of oligodendrocyte cells, which has drawn much attention in biomedical literature on brain cancers. In addition, our method discovers several new biomarkers which are not discussed in the original studies, including PLP1, BCAN, and PTPRZ1 (all associated with the development and malignant growth of glioma), as well as CAMK2B, which is downregulated in glioma and GBM and implicated in neurodevelopment, brain function, learning, and memory processes.
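To make the zero-inflated mixture idea concrete, here is a minimal sketch of scoring one cell against candidate subgroups. It is not the RZiMM-scRNA implementation: it assumes a zero-inflated normal likelihood per gene (an exact zero is treated as a dropout with probability `pi_zero`), omits regularization and batch effects entirely, and uses hypothetical names throughout.

```python
import math


def zi_normal_loglik(x, mu, sigma, pi_zero):
    """Log-likelihood of one expression value under a zero-inflated normal.

    An exact zero is attributed to dropout (probability pi_zero, in (0, 1));
    any other value comes from N(mu, sigma^2) with probability 1 - pi_zero.
    """
    if x == 0:
        return math.log(pi_zero)
    z = (x - mu) / sigma
    return (math.log(1.0 - pi_zero)
            - 0.5 * z * z
            - math.log(sigma * math.sqrt(2.0 * math.pi)))


def assign_cluster(cell, cluster_means, sigma, pi_zero):
    """Return the index of the subgroup with the highest log-likelihood.

    cell: expression values, one per gene
    cluster_means: one list of per-gene means for each subgroup
    sigma, pi_zero: per-gene standard deviations and dropout probabilities
    """
    scores = [
        sum(zi_normal_loglik(x, mu[g], sigma[g], pi_zero[g])
            for g, x in enumerate(cell))
        for mu in cluster_means
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this simplified form a dropout contributes the same term to every subgroup, so zeros do not pull a cell toward a low-expression cluster; only the observed nonzero values drive the assignment.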
Comparison of CYGNSS and Jason-3 wind speed measurements via Gaussian processes
William Bekerman, Joseph Guinness
Data Science in Science
[link] [code]
Wind is a critical component of the Earth system and has unmistakable impacts on everyday life. The CYGNSS satellite mission improves observational coverage of ocean winds via a fleet of eight micro-satellites that use reflected GNSS signals to infer surface wind speed. We present analyses characterizing variability in wind speed measurements among the eight CYGNSS satellites and between antennas, using a Gaussian process model that leverages comparisons between CYGNSS and Jason-3 during a one-year period from September 2019 to September 2020. The CYGNSS sensors exhibit a range of biases, mostly between −1.0 m/s and +0.2 m/s with respect to Jason-3, indicating that some CYGNSS sensors are biased with respect to one another and with respect to Jason-3. The biases between the starboard and port antennas within a CYGNSS satellite are smaller. Our results are consistent with, yet sharper than, a more traditional paired comparison analysis. We also explore the possibility that the bias depends on wind speed, finding some evidence that CYGNSS satellites have positive biases with respect to Jason-3 at low wind speeds. However, we argue that there are subtle issues associated with estimating wind speed-dependent biases, so additional careful statistical modeling and analysis is warranted.
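The "more traditional paired comparison analysis" mentioned in this abstract can be sketched as a per-satellite mean difference over collocated measurements. This is only the baseline idea, not the Gaussian process model from the paper; the input layout (satellite id, CYGNSS wind speed, Jason-3 wind speed) and all names are assumptions for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def paired_bias_by_satellite(records):
    """Estimate each satellite's bias relative to the reference instrument.

    records: iterable of (satellite_id, cygnss_speed, jason3_speed) tuples,
             one per collocated measurement pair, speeds in m/s.
    Returns {satellite_id: (bias, standard_error)}; satellites with fewer
    than two pairs are skipped because no standard error can be computed.
    """
    diffs = defaultdict(list)
    for sat, cygnss, jason3 in records:
        diffs[sat].append(cygnss - jason3)

    result = {}
    for sat, d in diffs.items():
        if len(d) < 2:
            continue
        result[sat] = (mean(d), stdev(d) / sqrt(len(d)))
    return result
```

A negative bias means the satellite reads lower than Jason-3. Unlike a spatio-temporal model, this estimate uses only pairs that are already matched in space and time, which is one reason a model-based analysis can give sharper results.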
Determining decomposition levels for wavelet denoising using sparsity plot
William Bekerman, Madhur Srivastava
IEEE Access
[link] [code]
We present a method to select decomposition levels for noise thresholding in wavelet denoising. It is essential to determine the decomposition levels accurately to avoid inadequate noise reduction and/or signal distortion by noise thresholding. We introduce the concept of a sparsity plot that captures the abrupt transition from noisy to noise-free Detail components, readily revealing the cut-off for the maximum decomposition level. The method uses the sparsity parameter to determine the noise presence in each Detail component and measures the magnitude change in the sparsity values to distinguish between noisy and noise-free Detail components. The method is tested on both model and experimental signals, and proves effective for various signal lengths and types, as well as different Signal-to-Noise Ratios (SNRs). The method can be embedded in any wavelet denoising method to improve its performance. The code is available via GitHub and denoising.cornell.edu, as well as the corresponding author's group website.
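A rough sketch of the sparsity-plot idea follows. It is not the released code: it assumes a Haar wavelet, takes the sparsity of a Detail component to be max|d| / sum|d| (an assumed definition), and uses a hypothetical jump rule (`factor`) to locate the transition from noisy to noise-free levels.

```python
from math import sqrt


def haar_step(signal):
    """One level of the Haar transform; signal length must be even."""
    approx = [(signal[i] + signal[i + 1]) / sqrt(2)
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / sqrt(2)
              for i in range(0, len(signal), 2)]
    return approx, detail


def sparsity(detail):
    """Assumed sparsity parameter: largest magnitude over total magnitude.

    Near 1/len(detail) for noise-like coefficients, near 1 when a few
    coefficients dominate. An all-zero component is given 0.0.
    """
    total = sum(abs(d) for d in detail)
    return max(abs(d) for d in detail) / total if total > 0 else 0.0


def sparsity_by_level(signal, levels):
    """Sparsity of the Detail component at levels 1..levels."""
    values, approx = [], list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        values.append(sparsity(detail))
    return values


def noisy_levels(values, factor=3.0):
    """Number of leading levels before the first abrupt sparsity jump.

    Hypothetical rule: a level is the first noise-free one when its
    sparsity is at least `factor` times the previous level's.
    """
    for j in range(1, len(values)):
        if values[j - 1] > 0 and values[j] >= factor * values[j - 1]:
            return j
    return len(values)
```

Plotting the output of `sparsity_by_level` against the level index gives the sparsity plot; `noisy_levels` is just one simple way to read the cut-off from it programmatically.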