Abstract
This thesis focuses on regression methodology for prediction and
classification in situations where there are many predictors but a limited
number of observations. This situation is common in chemometrics and
microarray data. In chemometrics, we obtain the absorbance level of each
sample at hundreds or thousands of wavelengths (variables) in a
near-infrared (NIR) calibration. In microarray data, we
have the expression levels of thousands of genes or proteins (variables) from
each sample. When the variables are put into a regression analysis, we have a
response vector and hundreds or thousands of predictors in the model.
The challenge in regression analysis is how to infer the pattern in
the data when the number of samples is limited.
The situation where we have a large number of variables but a limited
number of samples in a dataset raises many problems. We address some of
them in this research, including parameter estimation, variable
selection, and inference, and develop methodology to deal with them. We
examine the question of variable selection in NIR calibration and
conclude that variable selection does not guarantee better
prediction. A case-by-case investigation is necessary to determine
whether all available variables are relevant for prediction. In
microarray data, we find that procedures that select variables for
logistic regression based on multivariate information give a better model
fit than selection based on t-statistics.
We deal with the problem of parameter estimation with such a large number
of variables by considering some models where we can put all of the
available variables in the model. Selecting differentially
expressed genes from logistic regression with random effects is not
feasible because of the limited amount of information. To deal with this
inference problem, we investigate a linear mixed model where we assume
the random effects follow a mixture of three normal distributions. The
mixture components correspond to genes that are down-regulated, not
differentially expressed, and up-regulated. Inference on each gene then
becomes a question of which mixture component the gene belongs to. In this context,
estimation of fold-change and identification of differentially expressed
genes can be done simultaneously. We conclude that the method performs
reasonably well in identifying the genes. This is validated by a spike-in
study and a simulation. In applying the model to find coregulated genes,
the method identifies the genes, although its performance depends on the
amount of information in the data.