Regression on high-dimensional predictor space : with application in chemometrics and microarray data

Gusnanto, Arief

Date: 2004-12-17

Location: Sal Lennart Nilsson, Nobels väg 15A, Karolinska Institutet

Time: 9.00

Department: Institutionen för medicinsk epidemiologi och biostatistik / Department of Medical Epidemiology and Biostatistics

Abstract

This thesis focuses on regression methodology for prediction and classification in situations where there are many predictors but a limited number of observations. This situation is common in chemometrics and microarray data. In chemometrics, we obtain the absorbance level of each sample at hundreds or thousands of wavelengths (variables) in a near-infrared (NIR) calibration. In microarray data, we have the expression levels of thousands of genes or proteins (variables) from each sample. When the variables are put into a regression analysis, we have a response vector and hundreds or thousands of predictors in the model. The challenge in regression analysis is how to infer the pattern in the data when the number of samples is limited.

A dataset with a large number of variables but a limited number of samples raises many problems. We address some of them in this research, including parameter estimation, variable selection, and inference, and develop methodology to deal with them. We examine variable selection in NIR calibration and conclude that variable selection does not guarantee better prediction. A case-by-case investigation is necessary to determine whether all available variables are relevant for prediction. In microarray data, we infer that procedures that select variables for logistic regression based on multivariate information give a better model fit than selection based on t-statistics.

We address parameter estimation with such a large number of variables by considering models that can include all of the available variables. Selecting differentially expressed genes from a logistic regression with random effects is not feasible because of the limited amount of information. To deal with this inference problem, we investigate a linear mixed model in which the random effects are assumed to follow a mixture of three normal distributions. The mixture components correspond to genes that are down-regulated, not differentially expressed, and up-regulated. Inference on each gene then becomes a question of which mixture component the gene belongs to. In this context, estimation of fold-change and identification of differentially expressed genes can be done simultaneously. We conclude that the method performs reasonably well in identifying the genes. This is validated by a spike-in study and simulation. When the model is applied to find coregulated genes, it identifies them, although its performance depends on the amount of information in the data.
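
As a rough illustration of the mixture idea described above, the following minimal Python sketch assigns a gene to one of three normal components by posterior probability. It is not the thesis's estimation procedure: the component weights, means, and standard deviations are hypothetical values chosen for illustration, the function names are invented here, and the gene effect is treated as if it were observed directly, whereas the abstract describes estimating it within a linear mixed model.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical mixture parameters: (weight, mean, sd) per component.
# These numbers are assumptions for illustration, not values from the thesis.
COMPONENTS = {
    "down": (0.05, -1.5, 0.5),  # down-regulated genes
    "non": (0.90, 0.0, 0.2),    # genes that are not differentially expressed
    "up": (0.05, 1.5, 0.5),     # up-regulated genes
}


def posterior_membership(effect):
    """Posterior probability of each component given a gene effect (Bayes' rule)."""
    weighted = {
        name: weight * normal_pdf(effect, mean, sd)
        for name, (weight, mean, sd) in COMPONENTS.items()
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


def classify(effect):
    """Assign the gene to the component with the highest posterior probability."""
    post = posterior_membership(effect)
    return max(post, key=post.get)
```

Under these assumed parameters, a gene whose effect is near zero is assigned to the "non" component, while a large positive or negative effect is assigned to "up" or "down". The effect itself plays the role of a log fold-change estimate, which is one way to read the abstract's remark that fold-change estimation and identification can be done simultaneously.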

URI: http://hdl.handle.net/10616/43385

Issue date: 2004-11-26

Publication year: 2004

ISBN: 91-7140-153-9
