
A Comparison of Machine Learning Regression Models for Feature Importance Determination of Climate Forcings in East Antarctic Sea Ice Summer Melt

Ryan Eagan

Submission Date: December 15th, 2023

A Report submitted for MAST638: Machine Learning for Marine Science, Fall 2023

Abstract

In the Adelie Sea region of East Antarctica, atmospheric and oceanographic forcings influence sea ice concentration, thickness, and summer sea ice melt, particularly in the local Dumont d’Urville (DDU) Sea. Previous work (Smith et al., 2011) demonstrated that sea ice coverage in the DDU Sea is out of phase with the overall patterns of sea ice concentration on the East Antarctic coast near Adelie Land. However, the causes of this variation have not yet been fully explored. Using ERA5 climate reanalysis data (2003-2022, gridded at 0.25° x 0.25°, hourly) from the European Centre for Medium-Range Weather Forecasts (ECMWF), we construct a summertime (NDJF) climatology of the surface, atmospheric, and oceanographic conditions at DDU and in the surrounding area (a box bounded by 120°E, 54°S and 160°E, 70°S). To gain insight into the local-to-regional processes that drive sea ice concentration and thickness, we compare eight supervised machine learning regression models to determine feature importance with respect to sea ice ablation. We find agreement in feature importance across five of the models. The best performing model, Random Forest regression, demonstrates a strong bias toward the sea surface temperature feature and minimizes all other features. Comparison of feature importance and model performance suggests that further model tuning can provide conclusive results with respect to the importance and relationship of atmospheric forcings to sea ice concentration.

1 Introduction and Background

Satellite observations over the past 40 years have shown a marginal trend of increase in total Antarctic sea ice (Eayrs et al., 2019), though there is strong temporal and regional variability (Kusahara et al., 2019). The drivers of the variability are still not fully understood, but the role of local atmospheric forcing is considered to be a significant factor along with regional-scale atmospheric and oceanic processes (Eayrs et al., 2019). Though sea ice extent has been consistent, future projections and trend detection rely on a comprehensive understanding of the processes that drive interannual and seasonal sea ice cycles.

The sea ice off the East Antarctic coast, particularly in and around the Dumont d’Urville Sea off the Terre Adelie coastline (between 139 and 146°E and between 60 and 70°S), exemplifies this observed variability. The Dumont d’Urville region, which houses a permanent French research station on Petrel Island (66°39′48″S, 140°0′4″E), provides a unique opportunity to study the sea ice regime with a focus on investigating the interplay of atmospheric and oceanic forcings with the sea ice. Smith et al. (2011) analyzed sea ice concentration from 2003-2009 in the Dumont d’Urville Sea using AMSR-E remotely sensed data and demonstrated that its high seasonal and interannual variability is out of phase with the larger surrounding area.

To better understand the driving mechanisms of sea ice melt in this area, we compare supervised machine learning multivariate regression algorithms for identifying potential relationships between atmospheric variables and sea ice concentration through feature importance. Weekly mean sea ice concentration and atmospheric reanalysis data are regressed under the assumption that the relationship between sea ice concentration and the atmospheric variables is linear and that the variables are independent.

2 Data and Methods

2.1 Study Period

Our analysis considers only the austral summer season, which starts on November 15th and ends on February 15th. During this period, surface melt is the main ablation process.

2.2 Data

The regression analysis is built on a combination of remotely sensed sea ice data and ERA5 climate reanalysis datasets. ERA5 reanalysis data from the European Centre for Medium-Range Weather Forecasts (ECMWF) were obtained from the Copernicus Climate Data Store in NetCDF format, gridded at 0.25° x 0.25° (Hersbach et al., 2020). We obtained hourly values at 00:00, 06:00, 12:00 and 18:00 at single levels from November 15th, 2003 to February 15th, 2022 for the following variables: 2 meter surface temperature (t2m), sea surface temperature (sst), wind direction, wind speed magnitude, low cloud cover (lcc), and downwelling longwave radiation (ssrd). Relative humidity (r) was obtained from the corresponding pressure-levels dataset at the 1000 millibar level. These variables are used as the predictor inputs for our regression models.

             Remotely sensed sea ice data, data set G10033, were retrieved from the National Ice Center for the full study area on a weekly basis from 2003 to 2022 (Fetterer & Stewart, 2020). The raster sea ice concentration is derived from U.S. National Ice Center (USNIC) sea ice charts based largely on satellite observations. The SIGRID-3 dataset has a resolution of 10 km x 10 km on the Equal-Area Scalable Earth (EASE) grid projection. The data provide sea ice concentration per grid cell as a percentage of sea ice content. We use the mid-range total concentration as the predicted value for our regression models.

2.3 Methods

2.3.1 Data Preprocessing and Feature Engineering

Sea ice data are reprojected to the EPSG:4326 (WGS 84) coordinate system to match the ERA5 gridded data and ensure the alignment of coordinates. The grid cells are then weighted as a function of latitude to correct for area. Additionally, we remove data for February 29th in leap years so the data can be aligned by week number. Wind magnitude and direction are computed from the U and V wind vectors provided by ERA5.
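The wind computation can be sketched with NumPy; the names u10 and v10 are illustrative stand-ins for the ERA5 U and V components, not the exact variables used in the analysis.

```python
import numpy as np

# Illustrative 10 m wind components (m/s): U is eastward, V is northward.
u10 = np.array([3.0, -4.0, 0.0])
v10 = np.array([4.0, 3.0, -5.0])

# Wind speed is the magnitude of the (U, V) vector.
speed = np.hypot(u10, v10)

# Meteorological convention: the direction the wind blows FROM, measured
# clockwise from north, so (u, v) = (0, -5) is a northerly at 0 degrees.
direction = np.degrees(np.arctan2(-u10, -v10)) % 360.0
```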

Dimensional reduction of the spatial data is accomplished by first taking daily means over the entire spatial area; that is, we average all grid cells and the respective hourly entries to get one mean for the study area per day. These daily means are then averaged by week and assigned the corresponding ISO week number (1, 2, 3, ..., 52). To capture the austral summer, we start at week 45, carry through the new year, and stop at week 8. This gives us data from the beginning of November through the end of February, which includes the season shoulders.
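A minimal pandas sketch of this reduction, assuming the spatial mean has already been taken; the short date range and the column name t2m are illustrative only.

```python
import numpy as np
import pandas as pd

# Spatially averaged 6-hourly values for one illustrative variable.
times = pd.date_range("2021-11-15", "2021-12-05", freq="6h")
df = pd.DataFrame({"t2m": np.linspace(-5.0, 0.0, len(times))}, index=times)

# Drop February 29th so years align by week number (a no-op for this sample).
df = df[~((df.index.month == 2) & (df.index.day == 29))]

# Daily means over the hourly entries, then weekly means keyed by ISO week.
daily = df.resample("D").mean()
weekly = daily.groupby(daily.index.isocalendar().week).mean()
```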

Outliers outside the 5th to 95th percentile range are removed, along with any weeks with missing data, before the data are split into 70% (N=123) training and 30% (N=54) test sets for the machine learning algorithms. The data are scaled using Scikit-learn’s StandardScaler, which transforms each value to a z-score, z = (x − μ)/σ, where μ is the mean and σ is the standard deviation (Pedregosa et al., 2011). We confirm a close-to-normal distribution for each variable from a generated pair plot (Appendix, Figure A1) to justify the choice of z-score standardization. Pearson correlation coefficients were calculated between every pair of variables, revealing some multicollinearity between surface temperature, relative humidity, and sea surface temperature (Appendix, Figure A2).
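The split and standardization step might look like the following sketch; the random data stand in for the weekly predictors, and only the split sizes (70/30 of 177 weeks) mirror the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in feature matrix: 177 weekly samples by 7 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(177, 7))
y = rng.normal(size=177)

# 70% training / 30% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the z-score transform on the training data only, then apply to both.
scaler = StandardScaler().fit(X_train)
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)
```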

2.3.2 Regression Algorithms

We compare eight different regression algorithms from the Scikit-learn Machine Learning in Python package: Ordinary Least-Squares (OLS), Elastic-Net (Elastic and ElasticCV), Random Forest (RF), Ridge Regression (RidgeCV), Lasso Regression (LassoCV), Bayesian Regression (Bayes), and Stochastic Gradient Descent Regression (SGD) (Pedregosa et al., 2011). Each regressor was run individually in its own machine learning pipeline, which includes the StandardScaler transform and the regression model. No additional preprocessing or regression ensembles were considered. Each model was trained on the training split, and performance was compared using the Mean Squared Error (MSE) and coefficient of determination (R²) calculated between the sea ice concentration predicted from the test data and the respective true sea ice concentration.
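A pipeline of this shape, sketched here with synthetic data in place of the weekly climatology, pairs the StandardScaler with one regressor and scores it with MSE and R²:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weekly predictors and sea ice concentration.
rng = np.random.default_rng(42)
X = rng.normal(size=(177, 7))
y = X @ rng.normal(size=7) + rng.normal(scale=0.1, size=177)
X_train, X_test, y_train, y_test = X[:123], X[123:], y[:123], y[123:]

# One pipeline per regressor: the scaling transform followed by the model.
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=0))
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)

# Score the held-out predictions against the true values.
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
```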

Bayes, OLS, and SGD models were run as-is, with no additional hyperparameters provided, as out-of-the-box baselines for comparison with the RF, Lasso, Ridge, and ElasticNet models. The Random Forest regression model was also run without changing the default hyperparameters. LassoCV, RidgeCV, and ElasticCV all implement regularization with a corresponding cross-validation approach to optimize the regularization parameter alpha (Pedregosa et al., 2011). Each was provided with an array of values from 1.0 to 0.00001 to explore in its selection process. ElasticCV takes an additional parameter, the ratio of the L1 and L2 regularization terms (see the results discussion for further details), for which 0.5 was used.
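The cross-validated regularized models can be sketched as below; the synthetic data, the five folds, and the logarithmic spacing of the candidate alphas are assumptions for illustration, while the 1.0 to 0.00001 range and the 0.5 L1/L2 ratio follow the text.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV

# Toy data with a sparse linear signal in the first two features.
rng = np.random.default_rng(1)
X = rng.normal(size=(123, 7))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=123)

# Candidate regularization strengths from 1.0 down to 0.00001.
alphas = np.logspace(0, -5, 6)

# Each model cross-validates its choice of alpha over the candidates.
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)
# l1_ratio mixes the L1 and L2 penalties (0.5 was used in the report).
enet = ElasticNetCV(alphas=alphas, l1_ratio=0.5, cv=5).fit(X, y)
```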

3 Results and Discussion

3.1 Regression Algorithm Performance

The results for each model are shown in Table 1, ordered from the best performing model, RF, at the top to the worst at the bottom. Feature importance for each model is shown in Figure 1; OLS, Bayes, and SGD highlight nearly identical feature importances, though their performance is poor compared to the Random Forest algorithm.

Model           MAE     MSE      R²      EVS
Random Forest   4.151   22.603   0.834   0.847
LassoCV         4.604   28.785   0.788   0.796
ElasticCV       4.602   30.159   0.778   0.788
RidgeCV         4.620   30.508   0.776   0.783
Bayes           4.623   30.522   0.775   0.785
OLS             4.620   30.530   0.775   0.782
SGD             4.659   31.285   0.770   0.780
Elastic         4.619   31.737   0.766   0.780

Table 1 – Regression model performance from best to worst (MAE: mean absolute error; MSE: mean squared error; EVS: explained variance score).

Figure 1: Model feature importance. Upper right shows RF (top) and LassoCV (below) feature selections.

In all of our models, sea surface temperature dominates the feature importance; in the ERA5 data this variable is a direct representation of the sea ice surface temperature. We see a strong negative relationship for surface temperature, sea surface temperature, wind direction, wind magnitude, and relative humidity in the feature importance for the OLS, Bayes, SGD, RidgeCV, and ElasticCV models. Low cloud cover and downwelling longwave radiation show a small positive relationship, suggesting they have a net effect of attenuating sea ice melt, while the negatively weighted forcing variables appear to help drive it.

3.2 Random Forest Results 

The Random Forest algorithm, without any hyperparameter adjustments, performed well as an overall predictor, with the lowest MSE of all the models and the best R² at 0.834. Despite this performance, the RF model favors sea surface temperature and minimizes all other features. This is most likely a result of sea surface temperature contributing the most at each decision tree split and, in turn, aggregating its weight as the most important feature; this is a known issue with impurity-based feature importances, which are biased toward high-cardinality features (Pedregosa et al., 2011). Further tuning of the RF algorithm and using permutation importance in place of the impurity-based measure may improve the RF feature selection and reduce the bias.
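The permutation alternative can be sketched with scikit-learn's permutation_importance; the data here are synthetic, with only the first feature carrying signal, to show how the shuffled-score drop concentrates on the true driver.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: only feature 0 drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(123, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=123)

rf = RandomForestRegressor(random_state=0).fit(X, y)

# Importance as the drop in score when each feature is shuffled; this is
# less biased toward high-cardinality features than impurity importance.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
```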

3.3 Ridge and Lasso

In addition to the RF algorithm, which merits further tuning, the RidgeCV, LassoCV, and ElasticCV regression models were of the most interest for their use of a regularization penalty. Like OLS, each of these models minimizes the residual sum of squares (RSS) as its cost function. Ridge and Lasso regression add a penalty term, the L2 and L1 norm respectively, to the cost function. These penalties shrink large feature coefficients, which yields a more robust method of feature selection and importance determination.

The L1 regularization term is the 1-norm of the regression coefficient vector, added to the cost function with a coefficient, alpha, that can be selected to weight the contribution of the L1 norm. The L2 term is the squared 2-norm of the regression coefficient vector, that is, the sum of the squared coefficients (Raschka et al., 2022). Equations 1 and 2 show the L1 and L2 norms.

‖w‖₁ = Σⱼ |wⱼ|    (1)

‖w‖₂² = Σⱼ wⱼ²    (2)

The cost or loss function for the regression models is given by equation 3.

J(w) = Σᵢ (yᵢ − ŷᵢ)²    (3)

RidgeCV utilizes the L2 norm as the regularization penalty, shown in equation 4, where α is the tunable weight that determines the contribution of the L2 norm to the loss function (Pedregosa et al., 2011).

J(w) = Σᵢ (yᵢ − ŷᵢ)² + α‖w‖₂²    (4)

Respectively for Lasso, we have the same but with the L1 norm, equation 5,

J(w) = Σᵢ (yᵢ − ŷᵢ)² + α‖w‖₁    (5)

and ElasticCV adds a hyperparameter, the L1/L2 ratio, that combines the two norms (Pedregosa et al., 2011).
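With toy numbers, the penalty terms above reduce to a few NumPy sums; the coefficient vector, residuals, and alpha below are made up purely for illustration.

```python
import numpy as np

# Illustrative regression coefficients, residuals, and penalty weight.
w = np.array([0.5, -1.2, 0.0, 2.0])
residuals = np.array([0.3, -0.1, 0.4])
alpha = 0.1

rss = np.sum(residuals ** 2)       # residual sum of squares, equation (3)
l1 = np.sum(np.abs(w))             # 1-norm of the coefficients, equation (1)
l2_sq = np.sum(w ** 2)             # squared 2-norm, equation (2)

ridge_loss = rss + alpha * l2_sq   # Ridge objective, equation (4)
lasso_loss = rss + alpha * l1      # Lasso objective, equation (5)
```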

Our RidgeCV and ElasticCV models performed better than OLS, Bayes, and SGD, but not as well as the RF and LassoCV algorithms. LassoCV came in second best on performance but eliminated all features except sea surface temperature (sst) and surface air temperature (t2m). Here the L1 penalty zeroed out the remaining features to improve model performance, whereas RidgeCV, using the L2 norm, and ElasticCV, using a 0.5 ratio of both norms, did not eliminate any features. The results of the LassoCV model demonstrate a bias similar to that of the RF algorithm and suggest further hyperparameter tuning before additional conclusions can be made.

Each of these three models performed five-fold cross-validation on the training set and optimized model performance over the range 1.0 to 0.00001 for α; all of them converged on 1.0 as the best value. We also ran ElasticNet as a single model without cross-validation, with an L1/L2 ratio of 0.75, for comparison. The heavy weight of the 0.75 ratio on the L1 norm eliminates all the same features as the Lasso regression except downwelling longwave radiation, which is not the strongest positive feature in the other models.

3.4 Linearity Challenges

            In our consideration of multivariate linear regression models, we assume that the atmospheric variables used have a linear relationship with sea ice concentration. In initial exploration of the data, we see multicollinearity with surface temp (t2m), sea surface temp (sst) and relative humidity (r), which may pose challenges for some linear regression models. Our interest in Ridge and Lasso regression models stems from their ability to handle multicollinearity within the data (Raschka et al., 2022). What we have not considered is the interaction terms between our predictor variables and their relationships. Physical processes and feedback cycles do exist between low cloud cover, downwelling longwave radiation and humidity. These potential interaction terms may not be readily expressed in the linear models as configured and analyzed above and may benefit from exploring polynomial regression models to address this.
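One way to add interaction terms is sketched here with scikit-learn's PolynomialFeatures, using synthetic data in which the target depends on a product of two predictors; the data and pipeline are illustrative, not part of the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic target driven by an interaction between the first two predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(123, 3))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=123)

# interaction_only=True adds pairwise products without pure squared terms.
model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    LinearRegression(),
)
model.fit(X, y)
r2 = model.score(X, y)
```

A purely linear model cannot represent the x₀·x₁ term, so its fit improves sharply once the interaction column is available.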

4 Conclusions

The Random Forest model shows the most potential for best fit and prediction, but the strong bias in its feature selection warrants further investigation and model tuning. The improved performance of the cross-validation models suggests further gains may be achievable with larger cross-validation. Five of the eight models we explored show strong agreement in feature importance, which motivates continuing this approach to determining potential weights for atmospheric forcing variables. Next steps will focus on improving the Random Forest model to minimize the bias in feature selection and on checking whether the refined results agree with the other models.

As an initial exploration of supervised machine learning for identifying strong forcing relationships between atmospheric processes and sea ice ablation, our results are still inconclusive, but they show promise with further model refinement and analysis.

Appendices

Figure A1: Pair plot and distribution of the ERA5 reanalysis data and NSIDC sea ice concentration (tc_mid).
Figure A2: Correlation matrix of ERA5 and sea ice variables. 
Figure A3: Actual vs. predicted sea ice concentration for the Random Forest algorithm. MSE: 22.603, R²: 0.834.

References 

Eayrs, C., Holland, D., Francis, D., Wagner, T., Kumar, R., & Li, X. (2019). Understanding the Seasonal Cycle of Antarctic Sea Ice Extent in the Context of Longer‐Term Variability. Reviews of Geophysics, 57(3), 1037–1064. https://doi.org/10.1029/2018RG000631

Fetterer, F., & Stewart, J. S. (2020). U.S. National Ice Center Arctic and Antarctic Sea Ice Concentration and Climatologies in Gridded Format, Version 1. National Snow and Ice Data Center. https://doi.org/10.7265/46cc-3952

Hersbach, H., Bell, B., Berrisford, P., Hirahara, S., Horányi, A., Muñoz‐Sabater, J., et al. (2020). The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society, 146(730), 1999–2049. https://doi.org/10.1002/qj.3803

Kusahara, K., Williams, G. D., Massom, R., Reid, P., & Hasumi, H. (2019). Spatiotemporal dependence of Antarctic sea ice variability to dynamic and thermodynamic forcing: a coupled ocean–sea ice model study. Climate Dynamics, 52(7–8), 3791–3807. https://doi.org/10.1007/s00382-018-4348-3

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Raschka, S., Liu, Y., Mirjalili, V., & Dzhulgakov, D. (2022). Machine learning with PyTorch and Scikit-Learn: develop machine learning and deep learning models with Python. Birmingham Mumbai: Packt.

Smith, M. B., Labat, J.-P., Fraser, A. D., Massom, R. A., & Koubbi, P. (2011). A GIS approach to estimating interannual variability of sea ice concentration in the Dumont d’Urville Sea near Terre Adélie from 2003 to 2009. Polar Science, 5(2), 104–117. https://doi.org/10.1016/j.polar.2011.04.007