Exploring Missing Data Handling and Batch Effects in DNA Methylation Analysis

DNA methylation is a fundamental epigenetic modification of the human genome that influences biological processes such as gene expression and cellular development. In methylation data analysis, values are reported either as percentages (0 to 100) or as proportions (0 to 1), depending on the chosen scale. This research project investigates the handling of missing data in such datasets, focusing on a specific function known as 'dmrseq'.

One of the key challenges in DNA methylation analysis is that analysing the data without distributional assumptions often fails to account for their underlying structure. Moreover, batch effects may arise during data collection and reduce the reliability of the analysis. For instance, consider a scenario with 75 control samples and 75 cases, where each sample is collected in one of three different ways (sampling types A, B, and C). To test whether specific regions are significant, these batch effects must be taken into account. This calls for a mixed model or other suitable models, which can be investigated in this thesis.
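To make the scenario concrete, the following minimal sketch simulates one region's methylation level for 75 controls and 75 cases spread over sampling types A, B, and C. All numbers (batch shifts, case effect, noise level, and the unbalanced allocation of sampling types) are assumptions chosen for illustration, not values from real data. The adjustment shown is a simple within-batch (stratified) comparison, used here as a stand-in for a mixed model; a mixed model would instead treat sampling type as a random intercept, for example with lme4 in R or MixedLM in statsmodels.

```python
import random
from statistics import mean


def simulate_samples(seed=0):
    """Simulate one region's methylation (0-1 scale) for 75 controls and 75 cases.

    Hypothetical setup: each sampling type shifts the measured methylation,
    and cases are collected mostly with type C, so batch and case status
    are confounded.
    """
    rng = random.Random(seed)
    batch_shift = {"A": 0.00, "B": 0.05, "C": 0.10}  # assumed batch effects
    true_case_effect = 0.03                          # assumed case effect
    design = {                                       # assumed allocation
        "control": {"A": 40, "B": 25, "C": 10},
        "case": {"A": 10, "B": 25, "C": 40},
    }
    samples = []
    for group, counts in design.items():
        for batch, n in counts.items():
            for _ in range(n):
                value = 0.50 + batch_shift[batch] + rng.gauss(0.0, 0.02)
                if group == "case":
                    value += true_case_effect
                # Keep values on the 0-1 proportion scale.
                samples.append((group, batch, min(max(value, 0.0), 1.0)))
    return samples


def naive_difference(samples):
    """Case mean minus control mean, ignoring sampling type."""
    cases = [v for g, _, v in samples if g == "case"]
    controls = [v for g, _, v in samples if g == "control"]
    return mean(cases) - mean(controls)


def batch_adjusted_difference(samples):
    """Case-control difference computed within each batch, weighted by batch size."""
    total, weighted = 0, 0.0
    for batch in sorted({b for _, b, _ in samples}):
        cases = [v for g, b, v in samples if g == "case" and b == batch]
        controls = [v for g, b, v in samples if g == "control" and b == batch]
        n = len(cases) + len(controls)
        weighted += n * (mean(cases) - mean(controls))
        total += n
    return weighted / total


samples = simulate_samples()
naive = naive_difference(samples)
adjusted = batch_adjusted_difference(samples)
print(f"naive difference:          {naive:.3f}")
print(f"batch-adjusted difference: {adjusted:.3f}")
```

Under these assumptions the naive comparison absorbs part of the sampling-type shift into the apparent case effect, while the within-batch comparison stays close to the simulated effect of 0.03. With only three sampling types, whether to model batch as a fixed or a random effect is itself a design question the thesis can examine.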

Our approach addresses these challenges with a mixed model that accounts for batch effects while testing the significance of given genomic regions. You will implement and assess this mixed model approach using synthetic data and real-world methylation data.

Research methodology

Use the 'dmrseq' function to handle missing data, in particular by employing a 'windows approach', on in silico data.
Develop and implement a mixed model to account for batch effects when testing the significance of genomic regions on in silico data.
Apply this methodology to real DNA methylation data, exploring various scenarios and assessing the model's performance.
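
As an illustration of the first step, the sketch below shows one possible reading of a window approach to missing values: read counts from neighbouring CpG sites are pooled into fixed-width genomic windows, sites without coverage are skipped, and a window with no covered site is reported as missing. This is an assumption made for illustration and is not the implementation used inside 'dmrseq'; the function name, the 1000 bp window width, and the toy counts are all hypothetical.

```python
def window_methylation(positions, meth_counts, coverage, window_bp=1000):
    """Pool CpG read counts into fixed-width genomic windows.

    A CpG with zero coverage is treated as missing and skipped. A window
    with no covered CpG yields None rather than an imputed value.
    Returns a list of (window_start, proportion_or_None, n_covered_cpgs).
    """
    windows = {}
    for pos, m, cov in zip(positions, meth_counts, coverage):
        start = (pos // window_bp) * window_bp
        meth_sum, cov_sum, n_cpg = windows.get(start, (0, 0, 0))
        if cov > 0:  # missing loci contribute nothing to the pooled counts
            meth_sum, cov_sum, n_cpg = meth_sum + m, cov_sum + cov, n_cpg + 1
        windows[start] = (meth_sum, cov_sum, n_cpg)
    return [
        (start, (m / c if c > 0 else None), n)
        for start, (m, c, n) in sorted(windows.items())
    ]


# Toy in silico sample: genomic position, methylated reads, total reads per CpG.
positions = [120, 480, 910, 1300, 1750, 2400]
meth_counts = [8, 0, 5, 3, 0, 0]
coverage = [10, 0, 10, 12, 0, 0]

result = window_methylation(positions, meth_counts, coverage)
for start, proportion, n_cpg in result:
    print(start, proportion, n_cpg)
```

Pooling counts rather than averaging per-site proportions weights each site by its coverage, so a poorly covered site has less influence on the window estimate. The last window in the toy data has no covered site and therefore stays missing.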

Goal

The primary objective of this master's thesis project is to comprehensively study the treatment of missing data in DNA methylation analysis. We will leverage the 'dmrseq' function, which offers an elegant approach to address missing values, particularly through a 'windows approach.'

Learning outcome

This research project aims to enhance our understanding of missing data handling and batch effects in DNA methylation analysis. By applying advanced statistical models and methodologies, we anticipate improving the accuracy and reliability of epigenetic research.

Qualifications

No prior knowledge of biology or medicine is required.
Proficiency in a programming language such as R or Python is essential.
Familiarity with methylation data analysis is advantageous but not mandatory.
Strong analytical and problem-solving skills.
Willingness to work with real data and develop computational models.
Ability to collaborate and communicate findings effectively.

Supervisor

Thu Thi Nguyen

Collaboration partners

Marcin Wojewodzic - Norwegian Cancer Registry

References

Detection and inference of differentially methylated regions from Whole Genome Bisulfite Sequencing
Analyze Illumina Infinium DNA methylation arrays
Whole-Genome Bisulfite Sequencing Data Standards and Processing Pipeline
https://github.com/ben-laufer/DMRichR