The ensemble of CMIP6 daily predictor variables for statistical downscaling
This is the technical documentation for the daily predictor variables of a subset of Coupled Model Intercomparison Project Phase 6 (CMIP6) global climate models (GCMs) that can be used for statistical downscaling. The documentation provides a general description of the datasets included in the ensemble of predictor variables, the methodology for how the variables were created, a description of how the folders and files of data are organized for download, and a summary of how CanESM2 and CanESM5 predictor variables available through the Canadian Climate Data and Scenarios (CCDS) site may differ.
On this page
Equilibrium climate sensitivity
- What is equilibrium climate sensitivity?
- Involvement of ECS in the IPCC assessment reports and CMIPs
- Limitations of ECS
- Selection of CMIP6 GCMs for the predictors dataset
Description of input datasets
- Reanalysis datasets
Preprocessing of predictor variables
- Datasets downloaded: daily time series
- Programs used for processing
- Double precision and pressure level isolation
- Unit Conversion
Format of predictor datasets
- Structure of grid-box directories and predictor files
Differences between CanESM5 and CanESM2
- Availability of standardized datasets
Dataset licence
References
One of the ways of obtaining local-scale climate change scenarios is to use regression-based statistical downscaling of GCMs. In this approach, an empirical relationship between GCM predictors (i.e., near-surface and upper-level atmosphere circulation variables) and surface predictands (such as observed temperature or precipitation from a station) is derived by linear or non-linear transfer functions. For this purpose, an ensemble of daily predictor variables was produced from CanESM5, MPI-ESM1.2-HR, NorESM2-MM, and two reanalysis datasets.
A total of 26 predictor variables are included in each ensemble, composed of both raw and derived variables, with multiple atmospheric variables available at three different pressure levels. Predictor variables are available at the daily scale on a 64 by 128 latitude-longitude global Gaussian grid with T42 spectral truncation. The historical simulation for 1979-2014 as well as the four Tier 1 Shared Socioeconomic Pathways (SSPs) prioritized by the Intergovernmental Panel on Climate Change (IPCC) and Scenario Model Intercomparison Project (ScenarioMIP) (SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5) and SSP1-1.9 (due to its relevance for the Paris Agreement) for 2015-2100 are available for each GCM.Reference 7 Two reanalysis dataset options are available for the historical period 1979-2014 (ECMWF ERA5 and NCEP-DOE Reanalysis 2).
The selection of GCMs for inclusion in the CMIP6 predictors dataset was determined by three factors. Firstly, the equilibrium climate sensitivity (ECS) must have been calculated according to the Gregory methodology and the selected GCMs must cover a range of ECS values (see sections 1.1. and 1.2.). Secondly, the GCM must have run the historical simulation and as many of the five SSPs as possible (SSP1-1.9, SSP1-2.6, SSP2-4.5, SSP3-7.0, and SSP5-8.5). Thirdly, for the relevant simulations, the seven base variables at all three included pressure levels (if applicable) must be available for download on the Earth System Grid Federation (ESGF) website.
1. Equilibrium climate sensitivity
Inclusion of a subset of CMIP6 GCMs for the CMIP6 predictors dataset was done with the intent to include GCMs that span a range of ECS values. This includes CanESM5 and GCMs produced by other, non-Canadian, modelling organizations.
1.1. What is equilibrium climate sensitivity?
Understanding the Earth’s response to changes in atmospheric carbon dioxide (CO2), and determining its sensitivity to perturbations in CO2 level, is a fundamental goal of climate science.Reference 13 Equilibrium climate sensitivity (ECS), one of the earliest and simplest concepts applied to gauge the climate sensitivity of climate models, is a measure used extensively by the climate modelling community.Reference 4
ECS is a hypothetical value representing the increase in globally averaged surface temperature once a climate system reaches equilibrium after an instantaneous doubling of atmospheric CO2 concentration.Reference 4Reference 13 The most common method for calculating the ECS of a GCM is the Gregory method, which has been used prior to and since CMIP5 and produces a value also termed the effective ECS.Reference 4 Using the Gregory method, atmospheric CO2 is instantaneously quadrupled, instead of doubled, and the model is run for 150 years, instead of to equilibrium.Reference 4 The surface temperature at equilibrium can then be extrapolated for a doubling of CO2, assuming that the model response is roughly linear, so that a doubling produces half of the warming expected from a quadrupling of CO2.Reference 4
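The extrapolation described above can be illustrated with a minimal sketch. It assumes the usual form of the Gregory method, in which the annual global-mean net radiative flux anomaly is regressed against the surface temperature anomaly and the fitted line is extended to zero net flux. The function name and the synthetic, noise-free series are hypothetical and are not taken from any of the models described here.

```python
import numpy as np

def gregory_ecs(delta_t, net_flux):
    """Estimate effective ECS from annual means of an abrupt-4xCO2 run.

    delta_t  : global-mean surface temperature anomaly (K), one value per year
    net_flux : global-mean top-of-atmosphere net flux anomaly (W/m^2)
    """
    # Ordinary least-squares fit: net_flux = intercept + slope * delta_t
    slope, intercept = np.polyfit(delta_t, net_flux, 1)
    # Equilibrium is where the fitted net flux reaches zero
    warming_4x = -intercept / slope
    # Assume the 2xCO2 response is half of the 4xCO2 response
    return warming_4x / 2.0

# Synthetic 150-year series (illustrative values only, not model output)
years = np.arange(150)
delta_t = 8.0 * (1.0 - np.exp(-years / 30.0))
net_flux = 8.0 - 1.0 * delta_t

ecs = gregory_ecs(delta_t, net_flux)
print(round(ecs, 2))
```

With real model output the points scatter around the fitted line, so the estimate depends on the regression and on the assumption of linearity.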
1.2. Involvement of ECS in the IPCC assessment reports and CMIPs
Since the establishment of ECS as a standard metric of climate sensitivity and model response to atmospheric CO2, the CMIP has mandated that every contributing GCM estimate the ECS as one of the requirements for participation.Reference 4 The Diagnostic, Evaluation and Characterization of Klima (DECK) experiments are those that every GCM must simulate as a condition for entry into the CMIP.Reference 4 The instantaneous quadrupling of CO2 and the resultant run is one of these required experiments.Reference 4 As such, each generation of GCMs prepared for the CMIPs has produced an ECS range. While past ranges have been quite consistent with each other over generations of models (1.5°K to 4.5°K), the GCMs participating in the current CMIP (phase 6) have produced a wider ECS range (1.8°K to 5.6°K), with a greater number of models producing higher values and numerous models exceeding the previous upper limit of the range.Reference 4Reference 13 Calculations of ECS started with the first IPCC report in the 1990s, and the CMIP6 ECS range is the largest of any generation of models since that time.Reference 4
1.3. Limitations of ECS
It should be noted that ECS is an uncertain quantity and not without weaknesses or assumptions. As previously mentioned, ECS is a hypothetical quantity, as a large instantaneous change in atmospheric CO2 is not a realistic scenario for a climate system. An instantaneous change does not allow for any time-dependent or time-varying responses and effects such as feedbacks. The ECS also does not measure any quantity aside from the change in temperature. However, despite any shortcomings due to its simplicity, ECS is a widely used measure in climate science as it provides highly relevant information about how a climate system responds to perturbations and targets for global temperature thresholds.Reference 13
1.4. Selection of CMIP6 GCMs for the predictors dataset
Based on the criteria listed in the Overview section, the GCMs currently included in the CMIP6 predictors dataset are CanESM5, NorESM2-MM, and MPI-ESM1.2-HR. In the case that more than one version of the same model met the aforementioned criteria, the model with the higher resolution was selected. As models with a finer grid generally are able to reproduce climate responses and systems with less error and bias when compared to observations, preference was shown for models with higher atmospheric spatial resolution. Additional models may be added to the predictors dataset in the near future. See Table 1 for a full list of all datasets included in the predictors ensemble and an overview of each dataset.
Table 1. Availability of datasets for each model included in the predictors ensemble. Pressure levels apply for all non surface-level variables. Reanalysis datasets do not have an ECS or a variant ID, and are only available for the historical time period; therefore, multiple columns are marked ‘not applicable’ (n/a).
|Model||ECS (°K)||Pressure levels (hPa)||SSPs||Variant||Leap years|
|ECMWF ERA5||n/a||500, 850, 1000||n/a||n/a||Yes|
|NCEP-DOE Reanalysis 2||n/a||500, 850, 1000||n/a||n/a||Yes|
|CanESM5||5.6||500, 850, 1000||SSP1-1.9, SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP5-8.5||r1i1p1f1||No|
|NorESM2-MM||2.5||500, 850, 1000||SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP5-8.5||r1i1p1f1||No|
|MPI-ESM1.2-HR||3.0||500, 850, 1000||SSP1-2.6, SSP2-4.5, SSP3-7.0, SSP5-8.5||r1i1p1f1||Yes|
2. Description of input datasets
2.1. Reanalysis datasets
The National Centers for Environmental Prediction-Department of Energy (NCEP-DOE) Atmospheric Model Intercomparison Project (AMIP)-II Reanalysis (also called NCEP-DOE Reanalysis 2) as well as European Centre for Medium-Range Weather Forecasts (ECMWF) Atmospheric Reanalysis Fifth Generation (ERA5) datasets were included as part of the predictors dataset.
NCEP-DOE Reanalysis 2 is an improved version of its predecessor, NCEP/NCAR Reanalysis 1, as it includes updated parameterizations of physical processes and error fixes.Reference 6 ERA5 builds on past ECMWF reanalysis datasets, includes the latest systems and features, and was constructed using research and information from ECMWF and ECMWF partners.Reference 2 Compared to ERA-Interim, ERA5 provides higher spatial and temporal resolution and includes advancements such as an improved representation of the troposphere and of tropical cyclones, a better global balance of precipitation and evaporation, better precipitation over land in the deep tropics, better soil moisture, and more consistent sea surface temperature and sea ice.Reference 1
The Canadian Earth System Model version 5 (CanESM5) experiments were prepared as part of CMIP6. CanESM5 is the current version of Canadian Centre for Climate Modelling and Analysis’s (CCCma) earth system model and is an updated version of CanESM2 made available for CMIP5. For additional details on CanESM5, please see Swart et al. (2019).Reference 9
The Max Planck Institute for Meteorology Earth System Model version 1.2 (MPI-ESM1.2) experiments were prepared as part of CMIP6. The MPI-ESM1.2 is the current version of the Max Planck Institute for Meteorology’s GCM and is an updated version of the MPI-ESM prepared for CMIP5. Five coupled model configurations of the MPI-ESM1.2 are available, though only two of these meet the inclusion criteria for the predictors dataset as defined in the Overview section. These versions are the MPI-ESM1.2-LR, a low-resolution version, and the MPI-ESM1.2-HR, a high-resolution version. The atmospheric grid spacing of these models is approximately 200 km and 100 km, respectively. The ECS of both versions of the MPI-ESM1.2 is the same, as the model was tuned explicitly to an ECS of 3°K. Therefore, the version with the higher resolution, MPI-ESM1.2-HR, was included in the ensemble of predictor variables (see section 1.4.). For additional details on the MPI-ESM1.2-HR, please see Mauritsen et al. (2019) and Müller et al. (2018).Reference 3Reference 5
The Norwegian Earth System Model version 2 (NorESM2) experiments were prepared as part of CMIP6. The NorESM2 is the current version of the Norwegian Climate Center’s GCM and is an updated version of the NorESM1 prepared for CMIP5. As with its predecessor, the NorESM1, multiple versions of the NorESM2 were produced, primarily a low-resolution version (NorESM2-LM) and a medium-resolution version (NorESM2-MM). The atmosphere-land resolution of these models is approximately 2° and 1°, respectively. The two resolutions of NorESM2 have very similar ECS values at 2.54°K for NorESM2-LM and 2.50°K for NorESM2-MM. Thus, the higher resolution NorESM2-MM was selected for inclusion into the ensemble of predictor variables (see section 1.4.). For additional details on the NorESM2-MM, please see Seland et al. (2020).Reference 8
3. Preprocessing of predictor variables
3.1. Datasets downloaded: daily time series
CanESM5, NorESM2-MM, MPI-ESM1.2-HR, and NCEP-DOE data are available from online databases in NetCDF format as global daily time series. Since ERA5 data are only available at hourly or monthly time frequencies, daily means were calculated using values from the hours of 00:00, 06:00, 12:00, and 18:00. These time steps were chosen as NCEP-DOE daily means are calculated using the same times. Total precipitation was the only ERA5 variable to be downloaded for all 24 hours as it was the only variable calculated as a sum and not a mean value. The method for calculating daily total precipitation was based on the method provided by the ECMWF.Reference 12 It should be noted that at the time of the calculation of the predictor variables, data prior to 1979 was not yet available for ERA5, thus the sum of daily total precipitation for January 1 1979 only begins at 07:00 hours.
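As an illustration of the two aggregation rules described above, the following sketch computes a daily mean from the four synoptic hours and a daily total from all 24 hours. It is a simplified NumPy example, not the processing code that was used: the function names are hypothetical, the hourly series is assumed to start at 00:00 and contain only complete days, and the daily total is a plain calendar-day sum that does not reproduce the ECMWF accumulation convention.

```python
import numpy as np

# Hours used for the daily mean (same times as the NCEP-DOE daily means)
SYNOPTIC_HOURS = [0, 6, 12, 18]

def daily_mean_from_hourly(hourly):
    """Mean of the 00:00, 06:00, 12:00 and 18:00 values of each day."""
    by_day = np.asarray(hourly, dtype=np.float64).reshape(-1, 24)
    return by_day[:, SYNOPTIC_HOURS].mean(axis=1)

def daily_total_from_hourly(hourly):
    """Sum of all 24 hourly values of each day (simplified calendar-day sum)."""
    by_day = np.asarray(hourly, dtype=np.float64).reshape(-1, 24)
    return by_day.sum(axis=1)

# Two days of toy data: the "temperature" equals the hour of the day
hourly_temp = np.tile(np.arange(24.0), 2)
hourly_precip = np.full(48, 0.5)

daily_temp = daily_mean_from_hourly(hourly_temp)
daily_precip = daily_total_from_hourly(hourly_precip)
print(daily_temp, daily_precip)
```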
ERA5 data was downloaded in NetCDF format utilizing Copernicus's Climate Data Store (CDS) API for the hours 00:00, 06:00, 12:00, and 18:00. Surface level variables were downloaded from the ‘reanalysis-era5-single-levels’ dataset and multiple level variables from the 'reanalysis-era5-pressure-levels' dataset. Variables were downloaded on the native grid (0.25°x0.25°) without altering the grid or interpolating the data in the API request. Interpolation occurred at a later step using the same method as was used for the NCEP-DOE Reanalysis 2 variables to ensure consistency (see section 3.5.).
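A download request of this kind could be assembled roughly as follows. This is a sketch under stated assumptions: the helper `build_era5_request` is hypothetical, the request keys follow the conventions of the legacy CDS API and may differ in newer versions of the service, and the variable names in the usage lines are examples only. The actual retrieval call is shown in a comment because it requires the `cdsapi` package and CDS credentials.

```python
def build_era5_request(year, variables, pressure_levels=None):
    """Return (dataset name, request dict) for one year of ERA5 data."""
    request = {
        "product_type": "reanalysis",
        "variable": list(variables),
        "year": str(year),
        "month": [f"{m:02d}" for m in range(1, 13)],
        "day": [f"{d:02d}" for d in range(1, 32)],
        # The four hours used for the daily means
        "time": ["00:00", "06:00", "12:00", "18:00"],
        "format": "netcdf",
        # No "grid" key: the data stay on the native 0.25 x 0.25 degree grid
    }
    if pressure_levels is not None:
        dataset = "reanalysis-era5-pressure-levels"
        request["pressure_level"] = [str(p) for p in pressure_levels]
    else:
        dataset = "reanalysis-era5-single-levels"
    return dataset, request

dataset, request = build_era5_request(
    1979, ["u_component_of_wind", "v_component_of_wind"],
    pressure_levels=[500, 850, 1000],
)
print(dataset)

# With the cdsapi package installed and credentials configured:
#   import cdsapi
#   cdsapi.Client().retrieve(dataset, request, "era5_1979_winds.nc")
```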
The variables downloaded from each climate dataset are listed in Table 2. Variables that did not require further analyses are listed as ‘raw’ in Table 2. In addition to these variables, four were derived. The four derived variables are wind variables that were manually calculated from U- and V- wind components using NCAR Command Language (NCL) functions. ERA5 datasets provide two of these derived variables (divergence and relative vorticity), thus no calculations were necessary. The final derived variable, specific humidity, only needed to be calculated for NCEP-DOE Reanalysis 2 datasets using air temperature and relative humidity. Other datasets provided specific humidity as a variable and therefore, it is listed as a raw variable.
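For readers who wish to check the derived wind variables, wind speed and wind direction can be recomputed from the U- and V-components. The sketch below uses NumPy rather than the NCL functions that produced the predictors. It assumes the meteorological convention for direction (the direction the wind blows from, measured clockwise from north), which is an assumption about the dataset and not something stated in this documentation.

```python
import numpy as np

def wind_speed(u, v):
    """Wind speed (m/s) from zonal (u) and meridional (v) components."""
    return np.hypot(u, v)

def wind_direction(u, v):
    """Direction the wind blows FROM, in degrees clockwise from north.

    Assumed convention: 0 = northerly, 90 = easterly, 180 = southerly,
    270 = westerly. Calm winds (u = v = 0) have no defined direction and
    are not treated specially here.
    """
    return (270.0 - np.degrees(np.arctan2(v, u))) % 360.0

u = np.array([10.0, 0.0, -10.0])
v = np.array([0.0, 10.0, 0.0])
print(wind_speed(u, v))
print(wind_direction(u, v))
```

Divergence and relative vorticity require horizontal derivatives on the sphere and are not sketched here.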
3.3. Programs used for processing
The ensemble of scripts used to extract and process the datasets as well as formulate predictor files were executed on a Unix system in a Bourne-again shell (bash) environment. Python version 3.7.6 and NCL version 6.6.2 were used to produce the predictors (specific functions named in section 3.5. and in Table 2). The majority of the preprocessing methodology was the same across all datasets with the goal of producing datasets that are comparable.
Main steps of preprocessing:
- Download raw variables
- Convert variables to double precision
- Interpolation (all datasets except CanESM5)
- Calculate derived variables
- Conversion of units (if necessary)
Table 2. Basic description of raw and derived predictor variables. NCL functions used to calculate the derived variables are also listed underneath the data type in the type column.
|Predictor variable||Units||Level||Type|
|Air temperature||°C||2 metres||Raw|
|Mean sea level pressure||Pa||Mean sea level||Raw|
|Specific humidity [1]||kg/kg||Pressure levels||Raw|
|Geopotential height||m||Pressure levels||Raw|
|Zonal wind||m/s||Pressure levels||Raw|
|Meridional wind||m/s||Pressure levels||Raw|
|Relative vorticity [2,3]||s⁻¹||Pressure levels||Derived|
|Wind direction [2,4]||0-360°||Pressure levels||Derived|
|Wind speed [2]||m/s||Pressure levels||Derived|
3.4. Double precision and pressure level isolation
Desired pressure levels (1000, 850, and 500 hPa) were isolated for multiple level atmospheric variables. This step was not necessary for ERA5 data as pressure levels can be selected and downloaded individually. Values were then converted to double precision prior to any calculation to retain as much raw original information as possible.
3.5. Interpolation
Additional preprocessing for all datasets except CanESM5 consisted primarily of interpolation. Both reanalysis datasets and the GCMs MPI-ESM1.2-HR and NorESM2-MM were interpolated to match the T42 global Gaussian grid of the CanESM5 data using the specialized NCL function ‘f2gsh_Wrap’. The function interpolates scalar values on fixed grids onto a Gaussian grid with optional triangular truncation, which, in this case, was set to 42. The resultant grid has 64 latitude points by 128 longitude points, with a uniform longitudinal resolution of 2.8125° and a nearly uniform latitudinal resolution of approximately 2.79°. ERA5 data required the additional step of conversion from hourly to daily datasets. Following conversion to double precision, the NCL function ‘calculate_daily_values’ was used to calculate daily means from hourly ERA5 data. The sum of daily total precipitation for the ERA5 dataset was calculated in Python using the ‘Dataset.resample’ function of the Xarray package.
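The geometry of the target grid can be reproduced with a short sketch. It does not perform the spherical-harmonic interpolation itself. It only generates the grid-box centres of a 64 by 128 Gaussian grid, assuming that the latitudes are the arcsines of the Gauss-Legendre nodes and that the first longitude lies on the Greenwich meridian. The function name is hypothetical.

```python
import numpy as np

def gaussian_grid(nlat=64, nlon=128):
    """Return (latitudes, longitudes) of grid-box centres in degrees."""
    # Gauss-Legendre nodes are the sines of the Gaussian latitudes
    nodes, _weights = np.polynomial.legendre.leggauss(nlat)
    lats = np.degrees(np.arcsin(nodes))       # ordered south to north
    lons = np.arange(nlon) * (360.0 / nlon)   # eastward from 0 degrees
    return lats, lons

lats, lons = gaussian_grid()
print(np.round(lats[:3], 4))   # southernmost latitudes
print(lons[:3])                # spacing of 2.8125 degrees
print(np.diff(lats).min(), np.diff(lats).max())
```

The last line shows why the latitudinal spacing is only nearly uniform: the spacing between adjacent Gaussian latitudes varies slightly with latitude.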
3.6. Unit Conversion
Units were converted for the following variables.
All GCMs and NCEP-DOE Reanalysis 2:
- 2m air temperature (°K converted to °C)
- total precipitation (kg/m²/s converted to mm/day)
ERA5:
- 2m air temperature (°K converted to °C)
- total precipitation (m/day converted to mm/day)
- geopotential height (geopotential (m²/s²) converted to geopotential height (m))
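The conversions listed above amount to simple arithmetic, sketched below. The function names are hypothetical, and the constants are standard values (273.15 for the Kelvin offset and 9.80665 m/s² for standard gravity) rather than values quoted from the processing scripts.

```python
SECONDS_PER_DAY = 86400.0
STANDARD_GRAVITY = 9.80665  # m/s^2, conventional standard value

def kelvin_to_celsius(temp_k):
    """Air temperature: K to degrees Celsius."""
    return temp_k - 273.15

def flux_to_mm_per_day(precip_flux):
    """Precipitation flux: kg/m^2/s to mm/day (1 kg/m^2 of water = 1 mm)."""
    return precip_flux * SECONDS_PER_DAY

def metres_to_mm(precip_m):
    """ERA5 daily total precipitation: m/day to mm/day."""
    return precip_m * 1000.0

def geopotential_to_height(geopotential):
    """ERA5 geopotential (m^2/s^2) to geopotential height (m)."""
    return geopotential / STANDARD_GRAVITY

print(kelvin_to_celsius(288.15))
print(flux_to_mm_per_day(1.0e-5))
print(metres_to_mm(0.0025))
print(geopotential_to_height(54000.0))
```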
The final step for producing the predictor variables was to standardize the values according to the historical reference period, 1981-2010, for each dataset (each GCM, NCEP-DOE, ERA5) while retaining the original values. In this case, standardization is performed relative to a long-term climatic mean and standard deviation over the historical reference period. The 1981-2010 date range was selected as the reference period for standardization of the CMIP6 predictor variables as it is commonly used in climate science.Reference 10Reference 11 All predictor variables were standardized according to the 1981-2010 reference period except for wind direction, for which a standardized value would serve no purpose. As a variable, wind direction is not continuous and is not normal in distribution as it varies drastically in space and time. Additionally, standardizing wind direction would remove all information relating to direction. Standardized values (n) are produced from predictor values (x) utilizing the mean (µ) and standard deviation (σ) over the 1981-2010 reference period for each data source and according to individual grid box using the following expression:
$$n = \frac{x - \mu}{\sigma}$$
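A user who downloads the original values can repeat the standardization for a single grid box and variable along the following lines. This is a minimal sketch, not the code used to produce the dataset. The function name is hypothetical, one mean and one standard deviation are computed over all values in the reference period, and the population form of the standard deviation is an assumption.

```python
import numpy as np

def standardize(values, years, ref_start=1981, ref_end=2010):
    """Standardize a series using the mean and std of a reference period.

    values : predictor values for one grid box and one variable
    years  : calendar year of each value (same length as values)
    """
    values = np.asarray(values, dtype=np.float64)
    years = np.asarray(years)
    in_ref = (years >= ref_start) & (years <= ref_end)
    mu = values[in_ref].mean()
    sigma = values[in_ref].std()  # population std; this choice is an assumption
    return (values - mu) / sigma

# Toy series with one value per year for 1979-2014 (real files are daily)
years = np.arange(1979, 2015)
values = 0.1 * (years - 1979) + np.sin(years)
standardized = standardize(values, years)

ref = (years >= 1981) & (years <= 2010)
print(standardized[ref].mean(), standardized[ref].std())
```

The same function accepts a different `ref_start` and `ref_end` for users who prefer another baseline period.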
4. Format of predictor datasets
4.1. Structure of grid-box directories and predictor files
Each grid cell is numbered according to its indexed latitude and longitude coordinates. For each grid cell, a folder named Box_iiiX_jjY can be downloaded, where iii, the longitudinal index, ranges from 001 to 128 and jj, the latitudinal index, ranges from 01 to 64 (see Table 5 and Table 6). Each grid box contains subfolders identifying the source of the dataset used to calculate the predictor variables and the year range: GCMs have individual folders for each simulation (historical and each SSP), while reanalysis datasets have one subfolder each. Within each subfolder is a second set of subfolders that separate standardized and original values. Folder names are described in detail in Table 3.
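To find the folder for a given location, the nearest grid-box centre can be looked up and formatted into the Box_iiiX_jjY pattern. The sketch below is illustrative only. The function `box_folder` is hypothetical, and it assumes that index 001 corresponds to a longitude of 0° (increasing eastward) and index 01 to the southernmost Gaussian latitude. Users should confirm the indices against Table 5 and Table 6.

```python
import numpy as np

def box_folder(lat, lon, nlat=64, nlon=128):
    """Return the Box_iiiX_jjY folder name nearest to (lat, lon) in degrees."""
    # Assumed grid-box centres: Gaussian latitudes, evenly spaced longitudes
    nodes, _weights = np.polynomial.legendre.leggauss(nlat)
    lats = np.degrees(np.arcsin(nodes))       # south to north
    lons = np.arange(nlon) * (360.0 / nlon)   # eastward from Greenwich

    # Latitude index: nearest centre, 1-based
    j = int(np.argmin(np.abs(lats - lat))) + 1

    # Longitude index: nearest centre on the circle, 1-based
    dlon = np.abs((lons - lon % 360.0 + 180.0) % 360.0 - 180.0)
    i = int(np.argmin(dlon)) + 1

    return f"Box_{i:03d}X_{j:02d}Y"

# Example: a location at 45.4 N, 75.7 W (longitude given as -75.7)
print(box_folder(45.4, -75.7))
```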
Folders of original data (i.e., not standardized) contain 26 predictor variables, while folders of standardized data contain 23 predictor variables as wind direction, at all three pressure levels, was not standardized. Each file contains one column of data in a csv format. The naming structure of the files is derived from the CMIP6 naming template with each file using the format:
variable ID_time frequency_source ID_experiment ID_member ID_grid label_time range_type.csv
Variable IDs, or variable names, are listed below in Table 4, and file formats for each source dataset can be found in Table 3. It should be noted that reanalysis datasets do not possess member IDs and, therefore, the label is omitted from their file names. Grid label is ‘gn’ for CanESM5 predictors as the data are represented on its native grid. For reanalysis predictors, NCEP-DOE and ERA5, as well as all other GCMs, the grid label is ‘gr’ as the data has been regridded. The extra category ‘type’ was added to the naming template to differentiate between files containing standardized, ‘sd,’ and original, ‘og,’ data. It should also be noted that files containing CanESM5 and NorESM2 data have fewer values than those containing NCEP-DOE, ERA5, or MPI-ESM1.2-HR data as CanESM5 and NorESM2 use a 365-day calendar and, therefore, do not include leap years (see Table 1).
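The naming template and the leap-year note can be checked with a small sketch. Both helpers are hypothetical. The first joins the template components and skips any that are absent, as for reanalysis files. The second assumes one row per day and no header row, which should be verified against a downloaded file.

```python
from datetime import date

def predictor_filename(var_id, source_id, grid_label, time_range, data_type,
                       experiment_id=None, member_id=None, frequency="day"):
    """Assemble a predictor file name; missing components are omitted."""
    parts = [var_id, frequency, source_id, experiment_id, member_id,
             grid_label, time_range, data_type]
    return "_".join(p for p in parts if p) + ".csv"

def expected_rows(start_year, end_year, leap_years=True):
    """Number of daily values between 1 January and 31 December inclusive."""
    if leap_years:
        return (date(end_year, 12, 31) - date(start_year, 1, 1)).days + 1
    # 365-day calendar: no 29 February in any year
    return 365 * (end_year - start_year + 1)

print(predictor_filename("temp", "CanESM5", "gn", "20150101-21001231", "sd",
                         experiment_id="ssp245", member_id="r1i1p1f1"))
print(expected_rows(1979, 2014), expected_rows(1979, 2014, leap_years=False))
```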
Table 3. List of dataset subfolders and the template for file formats.
|Subfolders for datasets||Time frame||Structure of file name 1|
|NCEP-DOE2_1979-2014||1979 to 2014||varID_day_NCEP-DOE_RE2_gr_19790101-20141231_type.csv|
|ECMWF_ERA5_1979-2014||1979 to 2014||varID_day_ECMWF_ERA5_gr_19790101-20141231_type.csv|
|CanESM5_historical_1979-2014||1979 to 2014||varID_day_CanESM5_historical_r1i1p1f1_gn_19790101-20141231_type.csv|
|CanESM5_ssp119_2015-2100||2015 to 2100||varID_day_CanESM5_ssp119_r1i1p1f1_gn_20150101-21001231_type.csv|
|CanESM5_ssp126_2015-2100||2015 to 2100||varID_day_CanESM5_ssp126_r1i1p1f1_gn_20150101-21001231_type.csv|
|CanESM5_ssp245_2015-2100||2015 to 2100||varID_day_CanESM5_ssp245_r1i1p1f1_gn_20150101-21001231_type.csv|
|CanESM5_ssp370_2015-2100||2015 to 2100||varID_day_CanESM5_ssp370_r1i1p1f1_gn_20150101-21001231_type.csv|
|CanESM5_ssp585_2015-2100||2015 to 2100||varID_day_CanESM5_ssp585_r1i1p1f1_gn_20150101-21001231_type.csv|
|SourceID_historical_1979-2014||1979 to 2014||varID_day_SourceID_historical_r1i1p1f1_gr_19790101-20141231_type.csv|
|SourceID_ssp126_2015-2100||2015 to 2100||varID_day_SourceID_ssp126_r1i1p1f1_gr_20150101-21001231_type.csv|
|SourceID_ssp245_2015-2100||2015 to 2100||varID_day_SourceID_ssp245_r1i1p1f1_gr_20150101-21001231_type.csv|
|SourceID_ssp370_2015-2100||2015 to 2100||varID_day_SourceID_ssp370_r1i1p1f1_gr_20150101-21001231_type.csv|
|SourceID_ssp585_2015-2100||2015 to 2100||varID_day_SourceID_ssp585_r1i1p1f1_gr_20150101-21001231_type.csv|
Table 4. List of the 26 predictor variable IDs and corresponding variable names.
|No.||Variable ID||Predictor variable|
|1||mslp||Mean sea level pressure|
|2||p1_f||1000 hPa Wind speed|
|3||p1_u||1000 hPa Zonal wind component|
|4||p1_v||1000 hPa Meridional wind component|
|5||p1_z||1000 hPa Relative vorticity of true wind|
|6||p1th||1000 hPa Wind direction|
|7||p1zh||1000 hPa Divergence of true wind|
|8||p5_f||500 hPa Wind speed|
|9||p5_u||500 hPa Zonal wind component|
|10||p5_v||500 hPa Meridional wind component|
|11||p5_z||500 hPa Relative vorticity of true wind|
|12||p5th||500 hPa Wind direction|
|13||p5zh||500 hPa Divergence of true wind|
|14||p8_f||850 hPa Wind speed|
|15||p8_u||850 hPa Zonal wind component|
|16||p8_v||850 hPa Meridional wind component|
|17||p8_z||850 hPa Relative vorticity of true wind|
|18||p8th||850 hPa Wind direction|
|19||p8zh||850 hPa Divergence of true wind|
|20||p500||500 hPa Geopotential|
|21||p850||850 hPa Geopotential|
|23||s500||500 hPa Specific humidity|
|24||s850||850 hPa Specific humidity|
|25||shum||1000 hPa Specific humidity|
|26||temp||Air temperature at 2 m|
Table 5. Latitude coordinates rounded to four decimal places for the 64 by 128 latitude-longitude global Gaussian grid shown with the associated grid box number according to indexed latitude. Latitudes are indexed from south to north and represent the Y index of the grid box numbering system (Box_iiiX_jjY). Note that latitude coordinates correspond to grid box centres.
Table 6. Longitude coordinates for the 64 by 128 latitude-longitude global Gaussian grid shown with the associated grid box number according to indexed longitude. Longitudes are indexed from the Greenwich meridian towards the east and are represented as the X index of the grid box numbering system (Box_iiiX_jjY). Note that longitude coordinates correspond to grid box centres.
|iii (X)||Longitude (°East)|
5. Differences between CanESM5 and CanESM2 predictor variables
The core methodology for calculating CMIP6 (CanESM5) predictor variables was based on the methodology used for calculating CanESM2 predictors. Nonetheless, a few steps differ in the production of the CMIP6 predictors. For the CanESM2 predictors, National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) Reanalysis 1 data was also processed and made available to users, whereas for the CMIP6 predictors, NCEP-DOE Reanalysis 2 as well as ECMWF ERA5 datasets are included. Additionally, the CMIP6 predictors include a number of additional GCMs to cover a range of ECS values. Moreover, while only the CanESM2 dataset was converted to double precision, all datasets were converted to double precision for the CMIP6 predictors (see section 3.4.). Lastly, the naming scheme of folders and file names has been changed to reflect the CMIP6 naming convention; however, variable names remain the same to reduce confusion when comparing the predictor projects (see section 4; see Table 3).
5.1. Availability of standardized datasets
While previous predictor datasets (i.e., CanESM2) consisted of only standardized values, both original (non-standardized) values and standardized values are available to users for all CMIP6 predictor datasets. The decision to make both original and standardized values available was made for three reasons. Firstly, and most importantly, to justify standardization, values must follow a normal distribution; generally, this is not the case for precipitation and wind variables. Secondly, standardizing values removes much of the valuable information contained within the data, such as the mean, standard deviation, and minimum and maximum values. Finally, providing original data gives users the option to standardize the data using a baseline time period of their choosing. Nevertheless, standardized values were provided for comparison purposes with other predictor datasets.
6. Dataset licence
Open Government Licence - Canada (http://open.canada.ca/en/open-government-licence-canada)