Project 3
Hydrological system in Switzerland
Due date: Monday, March 31
Use the corresponding invite link in this Google Doc (accessible with your EPFL account) to accept the project, and either join an existing team or create a new one.
You are required to hand in a PDF version of your report, report.pdf (generated from the Quarto file report.qmd, if applicable), and the Quarto file itself. The report.qmd file should contain all the code necessary to reproduce your results; you should not show the actual code in the PDF report unless you want to point out something specific.
Your README.md should contain instructions on reproducing the PDF report from the quarto file. This can be useful if you have issues with the automatic generation of the PDF report right before the deadline.
An alternative to Quarto (or R Markdown) is to use LaTeX to produce the report. In that case, you will also need to submit the source code, in which the code chunks should be well commented and the references to the figures in the report should be clear.
Checklist:
Introduction: Local causal discovery
For a target variable \(T\), local causal discovery methods aim at learning its Markov blanket, i.e., the direct causes (parents), direct effects (children), and spouses (direct causes of the direct effects) of \(T\), yielding its “neighbourhood” in the causal DAG
\(\Rightarrow\) not interested in the causal DAG of the entire system \((T,\mathbf{X})\) but only in the causal DAG around \(T\)
Popular constraint-based methods:
PC algorithm can be used to extract the Markov blanket of any variable of interest
HITON-PC (Aliferis et al. 2003): identifies the Markov blanket of a target variable using conditional independence tests
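As a minimal illustration (not part of the graded tasks), the sketch below recovers the Markov blanket of a target on simulated toy data with the bnlearn package; the variable names, the IAMB method and the test level are arbitrary choices, not prescriptions for the project.

```r
## Toy sketch: Markov blanket recovery with bnlearn
library(bnlearn)

set.seed(1)
n  <- 500
X1 <- rnorm(n); X2 <- rnorm(n); X4 <- rnorm(n)
Tt <- X1 + rnorm(n)              # X1 -> T  (parent of T)
X3 <- Tt + X2 + rnorm(n)         # T -> X3 (child), X2 -> X3 (spouse of T)
dat <- data.frame(X1 = X1, X2 = X2, T = Tt, X3 = X3, X4 = X4)

## constraint-based Markov blanket learning (IAMB); alpha is the level
## of the conditional independence tests
mb_T <- learn.mb(dat, node = "T", method = "iamb", alpha = 0.01)
mb_T  # ideally X1 (parent), X3 (child) and X2 (spouse); X4 is excluded
```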
Data: Hydrologically simulated dataset
- Data simulated using the hydrological modelling system PREVAH (PREcipitation-Runoff-EVApotranspiration Hydrotope model); see this paper and this one
\(\Rightarrow\) developed to improve the understanding of the spatio-temporal variability of hydrological processes in catchments with complex topography; see this paper
- Dataset consists of 307 catchments in Switzerland for which
- discharge [mm]
- precipitation [mm]
- snowmelt [mm]
- soil moisture [mm]
- temperature [°C]
- (actual) evapotranspiration [mm]
- radiation [W \(m^{-2}\)]
were simulated at daily resolution from 1981 to 2016
Data: Hydrologically simulated dataset
The simulations were calibrated and validated for each catchment by running the model with observed meteorological input data (precipitation and temperature) and assessing the simulated discharge values against observations
Flood events in the catchments are mainly driven by snowmelt (Alps), by rainfall (Jura, Plateau, and Southern Alps), or by a mixture of the two (Pre-Alps); see this paper
\(\Rightarrow\) the system of hydrological and climate variables is spatially dynamic
Source: Part of the data is downloaded from Envidat and the other part is courtesy of Massimiliano Zappa from the WSL
The goal
Aim of the study: We are interested in understanding the hydrological and climatological drivers of extreme discharges during spring (March-April-May)
To make the analysis simpler, for a given catchment, consider a binary variable indicating whether the discharge exceeds its 90% empirical quantile or not.
\(\rightarrow\) stationarity is implicitly assumed. Is it reasonable? Maybe we can do better…
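As a starting point, here is a minimal sketch of the stationary version of this indicator in R; the data frame catch and its columns date and discharge are assumed names, and a time-varying threshold could replace the single empirical quantile.

```r
## Minimal sketch (assumed layout): one catchment's daily series in a
## data frame `catch` with columns `date` (Date) and `discharge` (mm)
library(dplyr)
library(lubridate)

spring <- catch %>%
  filter(month(date) %in% 3:5)                    # March-April-May only

q90 <- quantile(spring$discharge, probs = 0.90)   # stationary 90% threshold

spring <- spring %>%
  mutate(flood = as.integer(discharge > q90))     # 1 = extreme discharge day
```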
The goal of this project is thus to investigate
which variables intervene on the propensity of flood events
how far causal discovery can pay off in terms of prediction accuracy
Tasks: For a given catchment
Determine the 90% quantile of the discharge levels (stationary or non-stationary?) and create the binary flood event variable
Consider all the daily variables (except discharge at time \(t\)) as well as their versions lagged by three days (you can investigate whether this lag is informative about floods and consider alternatives if not)
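A possible way to build these lagged covariates is sketched below (column names are assumptions; the lags are computed on the full daily series before restricting to spring, so that early-March rows use their late-February predecessors).

```r
## Minimal sketch (assumed column names): covariates lagged by 3 days
library(dplyr)
library(lubridate)

vars <- c("precipitation", "snowmelt", "soil_moisture",
          "temperature", "evapotranspiration", "radiation", "discharge")

spring_lagged <- catch %>%
  arrange(date) %>%
  mutate(across(all_of(vars), ~ lag(.x, n = 3), .names = "{.col}_lag3")) %>%
  filter(month(date) %in% 3:5)   # keep spring days only after lagging
```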
Tasks: For a given catchment
- Model the effect of these variables on the binary flood event variable (target variable \(T\)) by performing classification using logistic regression, where
  - 3.1. you consider all the variables defined in 2., and reduce the set of features/covariates by
  - 3.2. performing standard variable selection, e.g., using an \(L_1\) penalty (LASSO)
  - 3.3. using the PC algorithm for causal discovery and selecting only the variables in the Markov blanket around \(T\)
  - 3.4. using a local causal discovery method (such as those implemented in the R packages bnlearn and MXM) to get the Markov blanket around \(T\) and keeping only the corresponding variables
- Compare the prediction performance on a test set (the last two years, say)
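To fix ideas, here is a rough sketch of how tasks 3.1, 3.2 and 4 could be wired together; it assumes a data frame spring_lagged combining the flood indicator and the lagged covariates from the sketches above, and the reduced covariate sets of 3.3 and 3.4 can be slotted in by swapping the covariate vector.

```r
## Rough sketch (assumed names): train/test split and two of the fits
library(glmnet)
library(lubridate)

test_idx <- year(spring_lagged$date) >= 2015          # last two years as test set
train <- na.omit(spring_lagged[!test_idx, ])
test  <- na.omit(spring_lagged[test_idx, ])

covars <- setdiff(names(train), c("date", "flood", "discharge"))

## 3.1 logistic regression on all covariates
fit_full <- glm(reformulate(covars, response = "flood"),
                family = binomial, data = train)

## 3.2 LASSO-penalised logistic regression, lambda chosen by cross-validation
x_tr <- as.matrix(train[, covars]); x_te <- as.matrix(test[, covars])
fit_lasso <- cv.glmnet(x_tr, train$flood, family = "binomial", alpha = 1)

## 3.3 / 3.4: replace `covars` by the variables selected by the PC algorithm
## (e.g. via the pcalg package) or by a local Markov blanket method
## (e.g. bnlearn::learn.mb), then refit glm() as in 3.1

## 4. misclassification rates on the test set
p_full  <- predict(fit_full,  newdata = test, type = "response")
p_lasso <- predict(fit_lasso, newx = x_te, s = "lambda.min", type = "response")
c(full  = mean((p_full  > 0.5) != test$flood),
  lasso = mean((p_lasso > 0.5) != test$flood))
```

Misclassification rate is only one option; a threshold-free measure such as the AUC may be more informative given the 90/10 class imbalance of the flood indicator.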
Tasks
Do this
- for a low-elevation and a high-elevation catchment,
- adding a brief description of the chosen local causal discovery method to the report