Karolinska Institutet

Regression on high-dimensional predictor space : with application in chemometrics and microarray data

thesis
posted on 2024-09-02, 21:20 authored by Arief Gusnanto

This thesis focuses on regression methodology for prediction and classification in situations where there are many predictors but a limited number of observations. This situation is common in chemometrics and microarray data. In chemometrics, we obtain the absorbance level of each sample at hundreds or thousands of wavelengths (variables) in a near-infrared (NIR) calibration. In microarray data, we have the expression levels of thousands of genes or proteins (variables) from each sample. When the variables are put into a regression analysis, we have a response vector and hundreds or thousands of predictors in the model. The challenge in regression analysis is how to infer the pattern in the data when the number of samples is limited.

A dataset with a large number of variables but a limited number of samples raises many problems. We address some of them in this research, namely parameter estimation, variable selection, and inference, and develop methodology to deal with them. We examine variable selection in NIR calibration and conclude that variable selection does not guarantee better prediction. A case-by-case investigation is necessary to determine whether all available variables are relevant for prediction. In microarray data, we infer that procedures that select variables for logistic regression based on multivariate information give a better model fit than selection based on t-statistics.

We deal with the problem of parameter estimation with such a large number of variables by considering models in which all of the available variables can be included. Selecting differentially expressed genes from logistic regression with random effects is not possible because of the limited amount of information. To deal with this inference problem, we investigate a linear mixed model in which the random effects are assumed to follow a mixture of three normal distributions. The mixture components correspond to genes that are down-regulated, not differentially expressed, and up-regulated. The inference on each gene then becomes a question of which mixture component the gene belongs to. In this context, estimation of fold-change and identification of differentially expressed genes can be done simultaneously. We conclude that the method performs reasonably well in identifying the genes. This is validated by a spike-in study and by simulation. When the model is applied to find coregulated genes, the method identifies the genes, although its performance depends on the amount of information in the data.
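
As a rough illustration of the mixture idea described above, the following sketch computes, for each gene, the posterior probability of belonging to each of three normal components and assigns the gene to the most probable one. It is a simplified sketch under stated assumptions, not the thesis's estimation procedure: the per-gene effects are treated as directly observed, the mixture weights, means, and standard deviations are hypothetical fixed values rather than estimates from a linear mixed model, and the function names are made up for this example.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of a normal distribution, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def component_posteriors(effects, weights, means, sds):
    """Posterior probability of each mixture component for each gene effect.

    Returns an array of shape (n_genes, n_components) whose rows sum to 1.
    """
    effects = np.asarray(effects, dtype=float)[:, None]  # shape (n_genes, 1)
    weighted = np.asarray(weights) * normal_pdf(
        effects, np.asarray(means), np.asarray(sds)
    )
    return weighted / weighted.sum(axis=1, keepdims=True)

# Hypothetical mixture parameters (assumptions for illustration only):
# components are ordered as down, non, up.
weights = [0.1, 0.8, 0.1]
means = [-2.0, 0.0, 2.0]
sds = [0.5, 0.5, 0.5]

# Hypothetical per-gene effects on a log fold-change scale.
effects = [-2.1, 0.05, 1.9]

post = component_posteriors(effects, weights, means, sds)
labels = np.array(["down", "non", "up"])[post.argmax(axis=1)]
```

Here each gene is classified by its largest posterior probability. In a full analysis the mixture parameters and the gene effects would be estimated jointly from the data, which is what allows fold-change estimation and gene identification to happen at the same time.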

History

Defence date

2004-12-17

Department

  • Department of Medical Epidemiology and Biostatistics

Publisher/Institution

Karolinska Institutet

Publication year

2004

Thesis type

  • Doctoral thesis

ISBN-10

91-7140-153-9

Language

  • eng

Original publication date

2004-11-26

Author name in thesis

Gusnanto, Arief

Original department name

Department of Medical Epidemiology and Biostatistics

Place of publication

Stockholm
