Main project
Due date: 11:59pm on Sunday, 21 December 2025.
Use the corresponding invite link in this Google Doc (accessible with your EPFL account) to accept the project, and either join an existing team or create a new one. Once this is done, go to the course GitHub organization and locate the repo titled main-project-TEAM-NAME to get started.
Intermediate submission
- By Sunday, November 9, you should have chosen a team and a topic.
- On Friday, November 21, your team will submit a 1-2 page writeup. Your writeup should provide a preliminary introduction to the topic you will study and clear motivation for why it is interesting and/or relevant.
Final submission
You are required to hand in a PDF version of your report (report.pdf, max 20 pages) and the source code used. You should not show the actual code in the PDF report, unless you want to point out something specific.
Your README.md should contain instructions on reproducing the PDF report from the Quarto file. This can be useful if you have issues with the automatic generation of the PDF report right before the deadline. If you are a team of three students, your README.md should also include a brief description of each team member's contributions.
Checklist:
The goal of this project is quite broad, and students are free to come up with their own ideas. Simulation studies are the designated topic. While we suggest a list of topics below, groups interested in studying one of the methodological concepts from this course in more depth (even one not listed below) are encouraged to approach the teachers during the exercise sessions and discuss their ideas. Prospective topics for the final project will be revealed gradually during the lectures.
Part of the grade for the final project will be awarded for value added (e.g., a simulation study answering a previously unclear question). All of the prospective topics introduced during the lectures will have this element, and by halfway through the semester (when work on the final project starts) the examples should make it clear what the project should aspire to. We will also discuss this in person at some point, likely in Week 6. The remaining part of the grade will be awarded for
- quality of the report (clarity, readability, structure, referencing, etc.)
- graphical considerations (well chosen graphics with captions, referenced from the main text)
- concepts explored beyond the scope of the course (in the soft sense that they were not fully covered during classes)
- overall quality (scope, correctness, demonstration of understanding, etc.)
A project seriously lacking in any of the criteria above will be penalized.
Topics for the final project
Unless otherwise stated, you are not required to code everything from scratch (though you will need to if no source code is available).
1. Cross-validation for PCA
- A simulation study comparing EM with the repaired CV for PCA, both covered in Week 4.
- Implement a third approach to CV for PCA based on matrix completion (boils down to performing SVD with missing data). Details about this approach are given in the supplementary notes and a deeper dive into the matter is covered in Perry (2009), Section 5.
- Compare the three methods.
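As a starting point for the matrix-completion approach, here is a minimal sketch of iterative SVD imputation: missing entries are repeatedly refilled with a low-rank fit. This is only an illustration of the "SVD with missing data" idea, not the exact algorithm from the supplementary notes or Perry (2009); the function name and defaults are our own.

```python
import numpy as np

def svd_impute(X, rank, n_iter=100, tol=1e-8):
    """Iteratively impute missing entries of X with a rank-`rank` SVD fit
    (an EM-style matrix-completion step; a simplified illustration only)."""
    X = X.copy()
    mask = np.isnan(X)
    X[mask] = np.nanmean(X)  # crude initial fill: global mean of observed entries
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X_hat = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-`rank` fit
        delta = np.sqrt(np.mean((X[mask] - X_hat[mask]) ** 2))
        X[mask] = X_hat[mask]                          # refill missing cells
        if delta < tol:
            break
    return X

# Toy check: a rank-1 matrix with one entry deleted should be recovered closely.
rng = np.random.default_rng(0)
u, v = rng.normal(size=(20, 1)), rng.normal(size=(1, 5))
X_full = u @ v
X_miss = X_full.copy()
X_miss[0, 0] = np.nan
X_rec = svd_impute(X_miss, rank=1)
print(abs(X_rec[0, 0] - X_full[0, 0]))  # small reconstruction error
```

In a CV scheme, one would hold out entries of the data matrix, impute them this way, and score the imputations against the held-out values for each candidate rank.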
2. Comparison of variable selectors in regression
- Hastie et al. (2020) have some surprising results in their simulation study, but one important method (adaptive lasso) is omitted. Try to recreate the study with adaptive lasso included (and perhaps elastic net, too?).
- The project should address the following question: when it comes to variable selection, which method should one choose, in which settings, and for which aim/criterion (explainability, predictive accuracy, sparsity, or number of correct covariates)?
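Since the adaptive lasso is the method to add, here is one common way to implement it via the rescaling trick: fit an initial estimator, rescale the columns by the resulting weights, run an ordinary lasso, and map the coefficients back. The ridge initializer and tuning choices below are assumptions for illustration, not a prescription.

```python
import numpy as np
from sklearn.linear_model import LassoCV, Ridge

def adaptive_lasso(X, y, gamma=1.0):
    """Adaptive lasso via the rescaling trick: a weighted lasso whose
    weights come from an initial ridge fit (one common choice)."""
    beta_init = Ridge(alpha=1.0).fit(X, y).coef_
    w = np.abs(beta_init) ** gamma + 1e-8  # avoid division by zero
    X_scaled = X * w                        # lasso on X_j * w_j ...
    lasso = LassoCV(cv=5).fit(X_scaled, y)
    return lasso.coef_ * w                  # ... then map back: beta_j = w_j * b_j

# Toy example: sparse truth, 3 of 10 coefficients nonzero.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
beta = np.zeros(10)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(scale=0.5, size=200)
coef = adaptive_lasso(X, y)
print(np.flatnonzero(np.abs(coef) > 0.1))  # should contain the true support {0, 1, 2}
```

For the actual study, the selection accuracy of this estimator would be compared against the methods in Hastie et al. (2020) across signal-to-noise regimes.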
3. Comparison of cross-validation methods for data with temporal structure
When short-range temporal dependence (autocorrelation) is not accounted for, "simple" cross-validation methods can break down, as the validation and training samples are no longer independent. For instance, "simple" CV approaches can lead to underestimation of smoothing parameters (overfitting). Several methods for dealing with such issues have been proposed in the literature. Among these, we mention the following:
- removing distance-based buffers around hold-out points in LOOCV
- block cross-validation
- neighbourhood cross-validation, recently proposed by Wood (2024)
The project aims at
- exploring why cross-validation might fail in the presence of temporal (or spatial) dependence, and
- surveying some of the modified cross-validation methods proposed in the literature to deal with autocorrelation, and comparing them in a nonparametric regression setting (e.g., with smoothing splines).
Naturally, different short-range autocorrelation schemes should be investigated (e.g., AR or MA Gaussian processes).
References related to this subject: Chu and Marron (1991), Arlot and Celisse (2010), Roberts et al. (2016).
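To make the first two schemes concrete, here is a minimal sketch of buffered LOOCV and block CV as index generators; the function names and the choice of buffer width are ours, and real studies would tune both.

```python
import numpy as np

def hblock_loocv_splits(n, h):
    """Buffered LOOCV: for validation point i, also drop the h nearest
    neighbours on each side from the training set (a simple
    distance-based buffer for temporally ordered data)."""
    for i in range(n):
        train = np.r_[0:max(0, i - h), min(n, i + h + 1):n]
        yield train, np.array([i])

def block_cv_splits(n, n_blocks):
    """Block CV: hold out contiguous blocks in turn, so validation points
    are distant from most training points."""
    blocks = np.array_split(np.arange(n), n_blocks)
    for b in blocks:
        yield np.setdiff1d(np.arange(n), b), b

# Example: 10 points, buffer of h = 2 around validation point 5.
tr, va = list(hblock_loocv_splits(10, 2))[5]
print(tr)  # [0 1 2 8 9] -- indices 3..7 excluded
```

Plugging these splits into a smoothing-spline fit, and contrasting the selected smoothing parameters with those from plain LOOCV under AR or MA errors, gives exactly the kind of comparison the project asks for.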
4. The EM algorithm for different patterns of missingness
- Comparison of the performance of the EM algorithm for different percentages of missing values and for different missing-data mechanisms.
- Optional (if you have time and energy): in a setting with missing data, consider comparing parameter estimates obtained via EM with those obtained (via maximum likelihood) after imputation. You can choose one (or more) of the imputation methods described here.
- The project should address the following question: When you are faced with a missing data problem, when is the EM algorithm a good option for statistical inference (the estimation process)?
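A first step for such a study is a data generator that deletes values under MCAR, MAR, and MNAR. The two-column Gaussian setup and the median-based thresholds below are arbitrary illustrative choices, not the required design.

```python
import numpy as np

def make_missing(X, mechanism="MCAR", p=0.2, rng=None):
    """Delete entries of column 1 of X under a chosen missing-data
    mechanism (a minimal two-column illustration)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    X = X.copy()
    n = X.shape[0]
    if mechanism == "MCAR":       # missingness independent of the data
        miss = rng.random(n) < p
    elif mechanism == "MAR":      # depends only on the fully observed column 0
        miss = rng.random(n) < 2 * p * (X[:, 0] > np.median(X[:, 0]))
    elif mechanism == "MNAR":     # depends on the value being deleted itself
        miss = rng.random(n) < 2 * p * (X[:, 1] > np.median(X[:, 1]))
    else:
        raise ValueError(mechanism)
    X[miss, 1] = np.nan
    return X

# Roughly 20% of column 1 goes missing under each mechanism.
rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=1000)
for mech in ("MCAR", "MAR", "MNAR"):
    Xm = make_missing(X, mech, rng=np.random.default_rng(3))
    print(mech, np.isnan(Xm[:, 1]).mean())
```

Running EM (and, optionally, imputation-based estimation) on datasets produced this way, and tracking the bias of the estimated mean and covariance as p and the mechanism vary, yields the comparison the topic asks for.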
5. Diving into one of the course topics, e.g., MM algorithms or Monte Carlo integration (Week 7)
Consult with the teacher to define the project and ensure its feasibility.