Official Journal of the Asia Oceania Geosciences Society (AOGS)

  • Research Letter
  • Open access

The effectiveness of machine learning methods in the nonlinear coupled data assimilation

Abstract

Implementing strongly coupled data assimilation (SCDA) in coupled earth system models remains highly challenging, primarily because of the difficulty of accurately estimating the coupled cross background-error covariance. In this work, through simplified two-variable one-dimensional assimilation experiments focusing on the air–sea interactions over the tropical Pacific, we aim to clarify that SCDA based on the variance–covariance correlation, such as the ensemble-based SCDA, is limited in handling the inherent nonlinear relations between cross-sphere variables and provides a background matrix containing linear information only. These limitations also cause the analysis distributions to deviate from the truth and the strength of rare extreme events to be miscalculated. In contrast, the data-driven machine learning (ML) method, such as the multilayer perceptron, is free from linear or Gaussian assumptions, and its application to SCDA circumvents the expensive matrix operations by avoiding the explicit calculation of the background matrix. This strategy performs comprehensively better than the conventional ensemble-based assimilation strategy, particularly in representing the strongly nonlinear relationships between cross-sphere variables and in reproducing long-tailed distributions, which helps capture the occurrence of small-probability events. It is also shown to be cost-effective and has great potential to generate more accurate initial conditions for coupled models, especially in facilitating the prediction of extreme events.

Introduction

As more record-breaking weather events occur under global warming, using coupled earth system models to produce reliable seasonal to decadal predictions is increasingly crucial for decision-makers to manage risk (Wang et al. 2017b; Penny and Hamill 2017; Raymond et al. 2020). The accuracy of model initialization is important for predictions on seasonal to decadal timescales (Boer et al. 2016). Coupled data assimilation (CDA) serves as a solution: it combines the prior model predictions and observations from different earth components to obtain the best initial conditions for each component. By maintaining the interaction between different components, CDA mitigates the initial shock and generates physically balanced initial conditions (Zhang 2011; He et al. 2020). Unlike weakly CDA (WCDA), strongly CDA (SCDA) allows observations to directly influence the state estimation of another component through the coupled cross background-error covariance (CCEC). The superior performance of SCDA has been demonstrated with models of various complexities (Zheng and Zhu 2010; Park et al. 2015; Sluka et al. 2016; Penny et al. 2019; Kalnay et al. 2023). While SCDA is theoretically the optimal approach, current operational centers mostly combine existing atmospheric and oceanic assimilation systems to construct WCDA systems (Fujii et al. 2021). This is because an incorrect CCEC in SCDA leads to inferior analysis quality compared to WCDA (Han et al. 2013).

Estimating an accurate CCEC and implementing SCDA remain challenging because of the difference in variability of each component, lead–lag correlation between components, sampling errors, high computational cost, and nonlinear interaction at the interface (Penny et al. 2017; Zhang et al. 2020; Zheng et al. 2022). Attempts have been made to surmount these difficulties. One such effort is the development of a multi-timescale, high-efficiency approximate EnKF (MSHea–EnKF), which aims to enhance the computational efficiency and the accuracy of error statistics for the slow scale (Yu et al. 2019), while the high observation frequency of the fast scale can also help address the multiscale problem (Tondeur et al. 2020). The leading averaged coupled covariance (LACC) method was proposed to alleviate issues arising from lead–lag relationships, and real-world assimilation experiments also show that LACC can produce high-quality analyses (Lu et al. 2015; Sun et al. 2020). Reconditioning, Schur product localization (Smith et al. 2018) and the correlation–cutoff method (Yoshida and Kalnay 2018) are three effective approaches for mitigating the sampling error. Although proven to be powerful in state estimation, the dominant SCDA methods, including ensemble-based, variational and hybrid frameworks, suffer from high computational cost and from assumptions of linearity and Gaussianity that are detrimental to complex high-dimensional coupled models (Zhang and Zhang 2012; He et al. 2017; Evensen et al. 2022). Even the particle filter (PF), which is free from linear or Gaussian assumptions, faces the unavoidable curse of dimensionality and filter degeneracy when applied to high-dimensional geophysical systems. Some PF variants have been proposed to mitigate these problems by introducing localization schemes or giving equal particle weights (Tödter and Ahrens 2015; Poterjoy 2016; Zhu et al. 2016; Skauvold et al. 2019; Feng et al. 2020).

The data-driven machine learning (ML) method has drawn tremendous attention owing to its capability for nonlinear expression and spatiotemporal feature extraction, coupled with the advantages of strong generalization and computational efficiency (Sarker 2021; Xu et al. 2021). The successful applications of ML in assimilation for single models provide a promising approach for addressing the aforementioned challenges (Brajard et al. 2020; Arcucci et al. 2021; Ruckstuhl et al. 2021; Huang et al. 2021; Zhou and Zhang 2023). This study aims to investigate (1) the limitation of the conventional SCDA strategy that is based on the variance–covariance correlation and (2) the potential effectiveness of ML in nonlinear SCDA. This paper is organized as follows: Sect. “Problem definition” elucidates the conflict between the linear update mechanism of conventional assimilation methods and the nonlinear reality. The nonlinear assimilation experiment design is presented in Sect. “Experiment settings for strategy effectiveness evaluation” and the main results are analyzed in Sect. “Performance of different SCDA strategies”. Conclusions are provided in Sect. “Conclusions and discussion”.

Problem definition

Data sets

The present study employs monthly reanalysis data of oceanic and atmospheric components: the sea surface temperature (SST) is obtained from the Hadley Centre Sea Ice and SST data set version 1 (HadISST1) (Rayner et al. 2003); the sea surface salinity (SSS) is from the Hadley Centre’s subsurface temperature and salinity data set EN4.2.2 (Gouretski and Reseghetti 2010); and the sea surface height (SSH) is derived from the Simple Ocean Data Assimilation product (SODA 3.15.2) (Carton et al. 2018). The outgoing longwave radiation (OLR) is from the NOAA interpolated outgoing longwave radiation data set (Liebmann and Smith 1996); the precipitation rate (PRC) is taken from the Global Precipitation Climatology Project (GPCP) (Adler et al. 2003); and the air temperature at 2 m (T2m) is from the fifth generation ECMWF Reanalysis (ERA5) (Hersbach et al. 2020). The SSH data are available from 1980 to 2020, while the others cover the period from 1979 to 2022. These data are used to evaluate the performance of different SCDA strategies in handling relations of various complexity.

Linear characteristic of coupled data assimilation

Ensemble Kalman filter (EnKF) has shown powerful capability in SCDA for coupled ocean–atmosphere models (Liu et al. 2013). The analysis equation of EnKF can be written as (Sakov and Sandery 2015):

$${{\varvec{X}}}^{a}={{\varvec{X}}}^{p}+{\varvec{K}}\left({{\varvec{X}}}^{o}-{\varvec{H}}{{\varvec{X}}}^{p}\right)$$
(1)

where superscripts \(a\), \(p\) and \(o\) represent analysis, prediction and observation, respectively, hereafter; \({{\varvec{X}}}^{a}\) represents the analysis field; \({{\varvec{X}}}^{p}\) denotes the background field (a.k.a. the prior model prediction field); \({{\varvec{X}}}^{o}\) is the observation field; \({\varvec{H}}\) is the linearized observation operator; \({\varvec{K}}\) represents the Kalman gain matrix, which can be written as

$$\begin{array}{c}{\varvec{K}}={\varvec{B}}{{\varvec{H}}}^{T}{\left({\varvec{H}}{\varvec{B}}{{\varvec{H}}}^{T}+{\varvec{R}}\right)}^{-1}\end{array}$$
(2)

where \({\varvec{B}}\) denotes the flow-dependent background error covariance matrix, a statistical variance–covariance matrix estimated from ensemble members to characterize the correlation of variables among the model grid points. \({\varvec{R}}\) is the observation error covariance matrix, which can be derived from the observation error of instruments.

To analyze the increment of the prior prediction induced by observations from another earth component during one assimilation cycle, we consider a two-variable one-dimensional field denoted as \((x,y)\), where \(x\) and \(y\) represent the oceanic and atmospheric components, respectively. To illustrate the adjustment directly and highlight the role of \({\varvec{B}}\) in it (Appendix B), we further simplify the update equation by assuming that only accurate observations of the oceanic variable \({x}_{o}\) are available at the model grid points, so that \({{\varvec{X}}}^{o}={x}_{o}\), \({\varvec{R}}=0\) and \({\varvec{H}}=(1,0)\). Supposing that the prior model prediction is \({{\varvec{X}}}^{p}={({x}_{p},{y}_{p})}^{T}\), the update equation becomes

$$\begin{array}{c}{{\varvec{X}}}^{a}={{\varvec{X}}}^{p}+{\varvec{B}}{{\varvec{H}}}^{T}{\left({\varvec{H}}{\varvec{B}}{{\varvec{H}}}^{T}\right)}^{-1}\delta {\varvec{X}}\end{array}$$
(3)

where \(\delta {\varvec{X}}={{\varvec{X}}}^{o}-{\varvec{H}}{{\varvec{X}}}^{p} ={x}_{o}-{x}_{p}=\delta x\), so that the analyses of \(x\) and \(y\) are computed by

$$\begin{array}{c}{x}_{a}={x}_{p}+\delta x\end{array}$$
(4)
$$\begin{array}{c}{y}_{a}={y}_{p}+\frac{{\sigma }_{yx}}{{\sigma }_{x}^{2}}\delta x\end{array}$$
(5)

where \({\sigma }_{yx}\) represents the covariance between \(y\) and \(x\), and \({\sigma }_{x}^{2}\) represents the variance of \(x\). Both \({\sigma }_{yx}\) and \({\sigma }_{x}^{2}\) are elements of \({\varvec{B}}\). Equation (5) clearly shows that the oceanic observation innovation is projected onto the atmospheric component through a linear coefficient, namely the ratio of \({\sigma }_{yx}\) to \({\sigma }_{x}^{2}\).
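
As a minimal sketch of Eqs. (3)–(5), the following Python snippet evaluates the gain for a 2 × 2 background matrix with \({\varvec{H}}=(1,0)\), for which \({\varvec{B}}{{\varvec{H}}}^{T}\) is the first column of \({\varvec{B}}\) and \({\varvec{H}}{\varvec{B}}{{\varvec{H}}}^{T}\) is the scalar \({\sigma }_{x}^{2}\). The function names and all numerical values are illustrative assumptions and are not taken from the experiments.

```python
def kalman_gain(B, R=0.0):
    """Gain K = B H^T (H B H^T + R)^-1 for H = (1, 0) and a 2x2 matrix B.

    With H = (1, 0), B H^T is the first column of B and H B H^T is B[0][0],
    so the matrix inverse reduces to a scalar division.
    """
    s = B[0][0] + R
    return [B[0][0] / s, B[1][0] / s]


def scda_update(x_p, y_p, x_o, B, R=0.0):
    """Apply X^a = X^p + K (x_o - x_p) to the two-variable state (x, y)."""
    k_x, k_y = kalman_gain(B, R)
    dx = x_o - x_p                      # innovation, delta x
    return x_p + k_x * dx, y_p + k_y * dx


# Hypothetical numbers: sigma_x^2 = 4, sigma_yx = -6, sigma_y^2 = 25.
B = [[4.0, -6.0], [-6.0, 25.0]]
K = kalman_gain(B)                      # [1.0, -1.5], i.e. (1, sigma_yx / sigma_x^2)
x_a, y_a = scda_update(x_p=27.0, y_p=240.0, x_o=28.0, B=B)
print(K, x_a, y_a)
```

With \({\varvec{R}}=0\) the first gain element is exactly 1, so \({x}_{a}\) equals the observation, while the second element is the linear coefficient of Eq. (5).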

If we substitute the variance–covariance correlation with linear regression to describe the relationship between air–sea variables, then the analysis field will be expressed as

$$\begin{array}{c}{x}_{a}={x}_{p}+1\times \delta x+0\end{array}$$
(6)
$$\begin{array}{c}{y}_{a}={y}_{p}+a\times \delta x+b\end{array}$$
(7)

where 0 and \(b\) represent the bias, and the regression coefficient \(a\) is calculated by

$$\begin{array}{c}a=\frac{{\sum }_{i=1}^{N}\left({x}_{i}-\overline{x }\right)\left({y}_{i}-\overline{y }\right)}{{\sum }_{i=1}^{N}{\left({x}_{i}-\overline{x }\right)}^{2}}=\frac{\frac{{\sum }_{i=1}^{N}\left({x}_{i}-\overline{x }\right)\left({y}_{i}-\overline{y }\right)}{N}}{\frac{{\sum }_{i=1}^{N}{\left({x}_{i}-\overline{x }\right)}^{2}}{N}}=\frac{{\sigma }_{yx}}{{\sigma }_{x}^{2}}\end{array}$$
(8)

Therefore, the only difference between the two assimilation strategies is the bias in the update equation of \(y\), which means that SCDA based on the variance–covariance correlation is mathematically analogous to linear regression. These deductions agree with previous works (Anderson 2003; Zhang et al. 2007). Equations (5) and (8) collectively demonstrate that, through conventional SCDA, observations can only linearly affect the adjustment of the state of a different component.
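
The equivalence between the covariance ratio of Eq. (5) and the regression coefficient of Eq. (8) can be checked numerically. The sketch below uses small synthetic, noise-free samples (an assumption for illustration, not reanalysis data) and computes the least-squares slope from the normal equations rather than from Eq. (8) directly. The symmetric quadratic case shows what the linear coefficient misses.

```python
def covariance_ratio(xs, ys):
    """sigma_yx / sigma_x^2 estimated from paired samples, as in Eq. (5)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_yx = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    return cov_yx / var_x


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a x + b from the normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)


# Synthetic samples chosen for illustration only.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys_linear = [3.0 - 1.5 * x for x in xs]   # purely linear relation
ys_curved = [-(x ** 2) for x in xs]       # symmetric quadratic relation

slope_linear = covariance_ratio(xs, ys_linear)
slope_curved = covariance_ratio(xs, ys_curved)
print(slope_linear, ols_slope(xs, ys_linear))
print(slope_curved, ols_slope(xs, ys_curved))
```

In the curved case the covariance vanishes, so Eq. (5) would leave \(y\) unchanged for any innovation \(\delta x\), even though \(y\) depends deterministically on \(x\).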

Correlation between variables from different components

The tropical Pacific is a region characterized by intense air–sea interaction. To investigate the importance of the nonlinear part of the relationships between different earth components, we evaluate the difference in the coefficient of determination (\({R}^{2}\)) between the second-order and first-order Taylor expansions of the function \(y=f(x)\) over \((20^\circ {\text{S}}-20^\circ {\text{N}}, 100^\circ {\text{E}}-80^\circ {\text{W}})\), represented by a quadratic fitting model and a linear regression model, respectively. \({R}^{2}\) is an important metric for quantifying the explanatory power of a model, with larger values indicating a model better suited to the task. It is expressed as

$$\begin{array}{c}{R}^{2}=1-\frac{{\sum }_{i=1}^{n}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}{{\sum }_{i=1}^{n}{\left({y}_{i}-\overline{y }\right)}^{2}}\end{array}$$
(9)

where \(n\) is the number of true atmospheric values \({y}_{i}\), \({\widehat{y}}_{i}\) are the estimates from the fitting models and \(\overline{y }\) is the mean of the truth. Decadal variability is removed from the data by subtracting the monthly mean at each location.

Figure 1 shows that nonlinear correlations are prevalent among variables in the tropical Pacific. Using the quadratic fitting model to characterize these relationships can raise \({R}^{2}\) by more than 10%, and by more than 24% in certain instances. This suggests that the nonlinear component plays a crucial role and that nonlinear approaches are more suitable for describing the relationships between variables from different earth components. However, conventional SCDA captures only the linear correlation between variables when adjusting the initial conditions for coupled models, which conflicts with the realistic nonlinear relationships. Together with the linearized model and measurement operator, this may render the predictions unreliable.
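
The comparison behind Fig. 1 can be sketched as follows: fit first-order and second-order polynomials by least squares and compare their \({R}^{2}\) from Eq. (9). The helper names and the synthetic data are assumptions made for illustration, and the fit uses plain normal equations rather than any particular library routine.

```python
def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1] x + ... via the normal equations."""
    m = degree + 1
    A = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    return solve_linear_system(A, b)


def evaluate_polynomial(coefs, xs):
    return [sum(c * x ** k for k, c in enumerate(coefs)) for x in xs]


def r_squared(ys, y_hat):
    """Coefficient of determination, Eq. (9)."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - h) ** 2 for y, h in zip(ys, y_hat))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic anomalies with a deliberate quadratic component (illustration only).
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.0 + 0.5 * x - 0.8 * x * x for x in xs]

r2_lin = r_squared(ys, evaluate_polynomial(fit_polynomial(xs, ys, 1), xs))
r2_quad = r_squared(ys, evaluate_polynomial(fit_polynomial(xs, ys, 2), xs))
print(r2_lin, r2_quad, r2_quad - r2_lin)
```

The last printed value corresponds to the quantity mapped in Fig. 1, here for a single synthetic series rather than a grid point.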

Fig. 1

Difference in \({R}^{2}\) between the quadratic fitting model and the linear regression model for modeling the correlations between local monthly anomalies. The blue line denotes where \({R}^{2}\) of the quadratic fitting is 10 percent larger than that of the linear regression

Although nonlinear correlation also exists within a single earth component, the fact that all variables are governed by the same dynamic equations helps mitigate the negative impact caused by the linear deficiency of assimilation algorithms. However, it remains challenging for a coupled system, whose components are subject to different dynamic equations, to avoid the incongruity between variables and address the linear deficiency. These inferences suggest that introducing nonlinearity into SCDA methods is necessary for generating more accurate analysis fields and more reliable predictions.

Experiment settings for strategy effectiveness evaluation

Multilayer perceptron (MLP)

MLP is a fully connected feedforward artificial neural network, comprising an input layer, at least one hidden layer and an output layer (Subasi 2020). The number of hidden layers and neurons is task-dependent. Each neuron introduces nonlinearity through a nonlinear activation function, enabling the MLP to approximate complex functions. The weights establish connections between neurons in adjacent layers. The network gradually converges toward an optimal solution by minimizing the loss function. MLP has exhibited remarkable performance in various applications, including image recognition and pattern recognition. Flexibility, a notable strength of MLP, enables the network to tackle a variety of tasks (Hornik et al. 1989). MLP remains effective even with a single input feature, underscoring its ability to perform well in low-dimensional input settings (Taud and Mas 2018).
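
To illustrate how a nonlinear activation lets a single-input MLP represent a relation that no linear coefficient can, the sketch below evaluates a one-hidden-layer network with ReLU activations. The weights are hand-picked for illustration (not trained, and not those of the study) so that the network reproduces \(|x|\).

```python
def relu(z):
    """Rectified linear unit activation."""
    return z if z > 0.0 else 0.0


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """One-input, one-hidden-layer, one-output MLP with ReLU activations."""
    hidden = [relu(w * x + b) for w, b in zip(hidden_w, hidden_b)]
    return sum(w * h for w, h in zip(out_w, hidden)) + out_b


# Hand-picked (not trained) weights: relu(x) + relu(-x) equals |x|.
hidden_w, hidden_b = [1.0, -1.0], [0.0, 0.0]
out_w, out_b = [1.0, 1.0], 0.0

outputs = [mlp_forward(x, hidden_w, hidden_b, out_w, out_b) for x in (-2.0, 0.0, 3.0)]
print(outputs)  # [2.0, 0.0, 3.0]
```

Two hidden neurons already produce a piecewise-linear response with a kink at zero. Wider and deeper networks combine many such pieces, which is what training with a loss function adjusts.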

Experiment design

Under the assumptions in Sect. “Linear characteristic of coupled data assimilation”, three SCDA strategies are tested on relations between SST and OLR with different degrees of nonlinearity: quadratic fitting, a data-informed nonlinear polynomial fitting model; MLP, an effective data-driven and adaptive nonlinear machine learning approach; and EnKF, representing the variance–covariance correlation and serving as a baseline for comparison.

Several temporally corresponding pairs of \(\left({sst}_{o},{olr}_{t}\right)\) points are randomly selected from the SST and OLR data sets in Sect. “Data sets” as the test data set \({{\varvec{X}}}_{test}\), with \({sst}_{o}\) serving as the \({x}_{o}\) in Sect. “Linear characteristic of coupled data assimilation” and \({olr}_{t}\) as the atmospheric truth used to assess the analyses \({olr}_{a}\) generated by different SCDA strategies. The selected points must resemble the real situation to evaluate the strategies effectively, so the number of pairs depends on the value of \(JSD\) between the distribution of the selected points and the real situation (estimated from the entire data set). The remaining part serves as the training data set \({{\varvec{X}}}_{train}\). The mean of the entire data set serves as \({{\varvec{X}}}^{p}\), and \({\varvec{B}}\) is estimated from the anomalies \(\Delta {{\varvec{X}}}_{train}\), where \(\Delta\) denotes \({{\varvec{X}}}_{train}\) minus \({{\varvec{X}}}^{p}\). \(\Delta {{\varvec{X}}}_{train}\) is also used to train the parameters of each strategy to model the relationship (Fig. 2).
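
A minimal sketch of this set-up for the EnKF baseline is given below. It assumes synthetic (SST, OLR)-like pairs, a fixed random seed, and second moments of the training anomalies about \({{\varvec{X}}}^{p}\) normalized by the sample count. These choices and the helper names are illustrative assumptions, and the \(JSD\)-based choice of the test size is not reproduced here.

```python
import random


def split_pairs(pairs, n_test, seed=0):
    """Randomly pick n_test pairs as the test set; the rest form the training set."""
    rng = random.Random(seed)
    test_idx = set(rng.sample(range(len(pairs)), n_test))
    test = [p for i, p in enumerate(pairs) if i in test_idx]
    train = [p for i, p in enumerate(pairs) if i not in test_idx]
    return train, test


def background_statistics(pairs, train):
    """X^p is the mean of the entire data set; B terms come from training anomalies."""
    n = len(pairs)
    x_p = sum(x for x, _ in pairs) / n
    y_p = sum(y for _, y in pairs) / n
    anomalies = [(x - x_p, y - y_p) for x, y in train]      # Delta X_train
    var_x = sum(dx * dx for dx, _ in anomalies) / len(anomalies)
    cov_yx = sum(dx * dy for dx, dy in anomalies) / len(anomalies)
    return x_p, y_p, var_x, cov_yx


# Synthetic pairs with a curved relation plus noise (not reanalysis data).
rng = random.Random(42)
pairs = []
for _ in range(500):
    sst = rng.uniform(24.0, 30.0)
    olr = 280.0 - 2.0 * (sst - 24.0) ** 2 + rng.gauss(0.0, 5.0)
    pairs.append((sst, olr))

train, test = split_pairs(pairs, n_test=100)
sst_p, olr_p, var_x, cov_yx = background_statistics(pairs, train)

# EnKF analysis of Eq. (5) for every test observation sst_o.
olr_a = [olr_p + cov_yx / var_x * (sst_o - sst_p) for sst_o, _ in test]
print(len(train), len(test), cov_yx / var_x)
```

The nonlinear strategies would replace the last step with a model fitted to the same training anomalies.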

Fig. 2

Nonlinear assimilation experimental design

To facilitate training, the data for the MLP are first normalized to the range (−1, 1). The rectified linear unit (ReLU) is chosen as the activation function to process the input nonlinearly. The mean square error (MSE) is employed as the loss function, and the Adaptive Moment Estimation (Adam) algorithm serves as the weight-updating scheme. The learning rate and the depth and width of the MLP are determined empirically. Considering the data volume, K-fold cross-validation is introduced to determine the optimal validation data set and trained model.
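
The two data-handling steps named above, normalization and K-fold splitting, can be sketched as follows. Min-max scaling is an assumed choice (the text does not specify the scaling formula) and it maps the extremes exactly onto −1 and 1. The fold construction uses contiguous, unshuffled blocks for simplicity.

```python
def to_unit_range(values):
    """Min-max scale values to [-1, 1]; also return the (lo, hi) bounds used."""
    lo, hi = min(values), max(values)
    scaled = [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
    return scaled, (lo, hi)


def from_unit_range(scaled, bounds):
    """Invert to_unit_range so that network outputs return to physical units."""
    lo, hi = bounds
    return [(s + 1.0) * (hi - lo) / 2.0 + lo for s in scaled]


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous folds of nearly equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


anomalies = [-3.0, -1.0, 0.5, 2.0, 5.0]      # hypothetical anomaly values
scaled, bounds = to_unit_range(anomalies)
restored = from_unit_range(scaled, bounds)
folds = list(k_fold_indices(n_samples=10, k=3))
print(scaled)
print([val for _, val in folds])
```

Each fold serves once as the validation set, and the model with the best validation loss would be retained.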

Evaluation metrics

To systematically measure the performance of different assimilation strategies, six evaluation metrics are introduced. Besides \({R}^{2}\), the root-mean-square error (RMSE) and mean absolute error (MAE) quantify the precision of the assimilation model. Compared to MAE, RMSE amplifies the contribution of larger errors to the overall score, which helps evaluate the performance of the model for extreme events. The Pearson correlation coefficient (Corr) is used to evaluate the degree of coordinated variation between the analyses and the truth \({olr}_{t}\).
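
The following sketch implements RMSE, MAE and Corr and contrasts two hypothetical error patterns with the same MAE, which makes the sensitivity of RMSE to large errors explicit. All numbers are invented for illustration.

```python
import math


def rmse(truth, est):
    """Root-mean-square error."""
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(truth, est)) / len(truth))


def mae(truth, est):
    """Mean absolute error."""
    return sum(abs(t - e) for t, e in zip(truth, est)) / len(truth)


def pearson(truth, est):
    """Pearson correlation coefficient between two equally long series."""
    n = len(truth)
    t_bar, e_bar = sum(truth) / n, sum(est) / n
    cov = sum((t - t_bar) * (e - e_bar) for t, e in zip(truth, est))
    var_t = sum((t - t_bar) ** 2 for t in truth)
    var_e = sum((e - e_bar) ** 2 for e in est)
    return cov / math.sqrt(var_t * var_e)


truth = [250.0, 250.0, 250.0, 250.0]
spread_errors = [252.0, 248.0, 252.0, 248.0]   # four errors of magnitude 2
single_large = [250.0, 250.0, 250.0, 258.0]    # one error of magnitude 8

print(mae(truth, spread_errors), rmse(truth, spread_errors))   # 2.0 2.0
print(mae(truth, single_large), rmse(truth, single_large))     # 2.0 4.0
print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Both patterns have an MAE of 2, but the single large miss doubles the RMSE.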

Considerable emphasis should also be placed on whether the distribution of the analysis field deviates from the truth. The Kullback–Leibler divergence (\({D}_{KL}\)), also known as relative entropy, is commonly used to evaluate the disparity between two distributions. \({D}_{KL}\) is defined as

$$\begin{array}{c}{D}_{KL}(P|\left|Q\right)={\sum }_{i}P\left(i\right)\text{log}\left(\frac{P\left(i\right)}{Q\left(i\right)}\right)\end{array}$$
(10)

where \(P\) is the baseline distribution and \(Q\) is the sample distribution. \({D}_{KL}\) is a non-negative asymmetric metric, signifying that \({D}_{KL}(P|\left|Q\right)\ne {D}_{KL}(Q|\left|P\right)\). The lower the values of \({D}_{KL}\), the smaller the disparity between \(Q\) and \(P\).

The Jensen–Shannon divergence (\(JSD\)) is a more comprehensive measure of distribution similarity, which inherits the capabilities of \({D}_{KL}\) and addresses its asymmetry deficiency simultaneously. The formula is as follows:

$$\begin{array}{c}JSD\left(P,Q\right)=\frac{1}{2}{D}_{KL}(P|\left|M\right)+\frac{1}{2}{D}_{KL}(Q|\left|M\right)\end{array}$$
(11)

where \(M\) is the average of \(P\) and \(Q\). With a base-2 logarithm, the range of \(JSD\) is 0–1. Here we define the difference between \(P\) and \(Q\) as negligible when \(JSD\le 0.01\).
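
Equations (10) and (11) can be sketched for discrete distributions as below. The base-2 logarithm is an assumption made so that \(JSD\) lies between 0 and 1, and the example distributions are invented for illustration.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in bits; terms with P(i) = 0 contribute nothing.

    Requires Q(i) > 0 wherever P(i) > 0.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, Eq. (11), with M the average of P and Q."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q), kl_divergence(q, p))   # asymmetric
print(js_divergence(p, q), js_divergence(q, p))   # symmetric
print(js_divergence([1.0, 0.0], [0.0, 1.0]))      # 1.0 for disjoint distributions
```

Because \(M\) is positive wherever \(P\) or \(Q\) is, \(JSD\) stays finite even when \({D}_{KL}\) between \(P\) and \(Q\) is undefined.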

Performance of different SCDA strategies

We first evaluate the effectiveness of the strategies for near-linear, weakly nonlinear and strongly nonlinear relations at local grid points within the tropical Pacific, where the data assimilation is performed. For these examples, 100 points are selected to reflect realistic situations according to the \(JSD\) criterion (Fig. 6), and 20 repeated experiments are conducted at different grid points for each condition (Table 1). Consistent with the mathematical deduction, Fig. 3 shows that the EnKF analysis is the result of a linear mapping. Figure 3a demonstrates that the variance–covariance correlation (EnKF) may yield more accurate state estimation than the unsuitable data-informed quadratic fitting strategy in the near-linear situation, although the latter is statistically superior (Table 1). When faced with nonlinear relations (Fig. 3b–d), the linear strategy (i.e., EnKF) sacrifices information, especially in dealing with rare extreme events (OLR smaller than \(240 W/{m}^{2}\)). Besides modeling the ordinary events accurately, the MLP strategy is more reliable in predicting extreme or out-of-sample events, which other strategies are prone to underestimate. In contrast, both the linear and the data-informed nonlinear strategies can generate state estimates that deviate from the truth to different extents, particularly for “low probability, high impact” events. Its flexibility enables the data-driven MLP strategy to adapt to different relations and to achieve statistically superior evaluation metrics (Table 1). The correlation coefficients in Table 1 also indicate that the linear strategy has difficulty capturing the evolving character of complex relationships. Figure 3d shows that the bimodal analysis of EnKF is inconsistent with the unimodal truth: because the variance–covariance correlation is analogous to a linear mapping, EnKF yields a wrong distribution that is consistent with SST but not with OLR (Fig. 6). The nonlinear strategies all successfully reproduce the unimodal distribution and achieve smaller \({D}_{KL}\).

Table 1 Statistical average evaluation metrics of different strategies for relations with different complexity

Fig. 3

Analyses produced by EnKF, quadratic fitting and MLP strategies during one assimilation cycle for relations between SST and OLR: a near-linear relation at \((5^\circ {\text{S}}, 135^\circ {\text{E}})\); b weak nonlinear relation at \((0^\circ , 165^\circ {\text{W}})\); c strong nonlinear relation at \((2.5^\circ {\text{S}}, 105^\circ {\text{W}})\); d analysis distributions for the strong nonlinear relation. In (a–c), the orange points denote the truth, the gray points represent the training data set and the values in the legend are \(RMSE\) (\({R}^{2}\)). In (d), the values in the legend are \({D}_{KL}\), and a smaller value indicates a higher degree of similarity between the analysis distribution and the truth. To show the performance of different strategies clearly, the points in (c) are selected evenly; the PDF-dependent results are shown in Fig. 8c

A further comparison is made for a more strongly nonlinear relationship between SST and OLR at (\(5^\circ {\text{S}}{-}5^\circ {\text{N}}, 130^\circ {\text{E}}{-}100^\circ {\text{W}}\)). For this example, 1500 points are selected to validate the trained assimilation models (Fig. 7). SCDA strategies based on the least-squares criterion are required not only to reduce RMSE but also to closely represent the true evolving relationship between variables. Here we use the evolution of OLR with SST, based on the joint probability distribution, to evaluate the three trained models. The analysis produced by EnKF shows a large margin of error and even falls completely outside the range of OLR when SST is below \(25\,^\circ{\text{C}}\) (Fig. 4). The two nonlinear strategies generate more accurate state estimations without significant deviation. The data-informed quadratic fitting strategy tends to overestimate OLR when SST is below \(22\,^\circ{\text{C}}\) and fails to reproduce the evolution of OLR when SST is above \(28\,^\circ{\text{C}}\). In contrast, the data-driven MLP strategy effectively captures the various evolving characters within this complex correlation (Lau et al. 1997; Jiang and Zhu 2020). The evolution reproduced by MLP closely matches the truth when SST is below \(25\,^\circ{\text{C}}\), whereas the other strategies exhibit substantially increased margins of error.
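
One way to extract the most probable OLR for a given SST from paired samples is a binned conditional mode, sketched below. The \(0.5\,^\circ{\text{C}}\) SST interval follows the caption of Fig. 4, whereas the OLR bin width, the helper name and the sample values are assumptions for illustration. The estimator actually used for the joint probability distribution is not specified in the text.

```python
import math
from collections import Counter, defaultdict


def most_probable_olr(pairs, sst_step=0.5, olr_step=5.0):
    """For each SST bin, return the centre of the most populated OLR bin.

    Ties are resolved in favour of the OLR bin encountered first.
    """
    counts = defaultdict(Counter)
    for sst, olr in pairs:
        sst_bin = math.floor(sst / sst_step)
        olr_bin = math.floor(olr / olr_step)
        counts[sst_bin][olr_bin] += 1
    curve = {}
    for sst_bin, olr_counts in sorted(counts.items()):
        olr_bin, _ = olr_counts.most_common(1)[0]
        curve[(sst_bin + 0.5) * sst_step] = (olr_bin + 0.5) * olr_step
    return curve


# Hypothetical (SST in deg C, OLR in W/m^2) samples.
pairs = [(26.1, 262.0), (26.2, 263.5), (26.4, 241.0),
         (28.6, 221.0), (28.7, 223.0), (28.9, 250.0)]
curve = most_probable_olr(pairs)
print(curve)  # {26.25: 262.5, 28.75: 222.5}
```

Applying the same function to the analyses of each strategy would give curves comparable to the truth in this sense.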

Fig. 4

Evolution of OLR with SST generated by different SCDA models at \((5^\circ {\text{S}}{-}5^\circ {\text{N}}, 130^\circ {\text{E}}{-}100^\circ {\text{W}})\). The blue points denote the OLR that has the maximum probability of occurring for a given SST in the real situation, and the interval between points is \(0.5\,^\circ{\text{C}}\). As SST increases, OLR first increases \((SST\le 25\,^\circ{\text{C}} )\), then remains relatively constant \((25\,^\circ{\text{C}} \le SST\le 28\,^\circ{\text{C}} )\) and finally exhibits a curved relation with SST \((28\,^\circ{\text{C}} \le SST)\). The points in other colors represent the analyses of OLR produced by different strategies for the given SST. The gray and orange points have the same meaning as in Fig. 3

These inferences imply that the linear variance–covariance correlation is not suitable for modeling ubiquitous nonlinear relationships, resulting in substantial errors in the analyses and predictions. Introducing nonlinear strategies will remedy the linear limitation of conventional SCDA and improve the prediction skills for coupled models. Utilizing nonlinear fitting strategies to improve the state estimation proves to be more computationally expensive (Appendix C). In contrast, the ML strategy circumvents the need for constructing \({\varvec{B}}\) explicitly, emerging as a promising approach for implementing SCDA in coupled models. Figure 5 also demonstrates that the loss function will converge stably and quickly as data volume grows, contributing to an improvement in computational efficiency of SCDA.

Fig. 5

Evolution of the loss function with epochs for the training and validation data set of relations between \(\Delta SST\) and \(\Delta OLR\) at (a) \((0^\circ , 165^\circ {\text{W}})\); (b) \((5^\circ {\text{S}}{-}5^\circ {\text{N}}, 130^\circ {\text{E}}{-}100^\circ {\text{W}})\)

Conclusions and discussion

This study aims to clarify that conventional SCDA based on the linear variance–covariance correlation faces limitations in addressing complex relations within coupled systems. Simplified two-variable one-dimensional nonlinear assimilation experiments based on SST and OLR are conducted in the tropical Pacific, a region characterized by intense air–sea interaction (Wang et al. 2017a). Experimental results indicate that the conventional SCDA strategy (i.e., the ensemble-based assimilation method) is suitable for near-linear situations but fails to represent nonlinear relationships. Given the ubiquity of nonlinear relationships in the real world, it is necessary to develop nonlinear SCDA strategies. Its data-driven nature enables the ML strategy, represented by MLP here, to overcome the linear or Gaussian limitations of conventional strategies and adapt to relations of various complexity. This strategy also achieves comprehensively better analysis quality than the linear and data-informed nonlinear strategies, especially for regions with strongly nonlinear interaction between earth components. The superior evaluation metrics and the longer tail of the analysis distributions collectively illustrate the significant potential of ML for generating more accurate analysis fields and enhancing the prediction skill for small-probability events (Frame et al. 2022). In addition, it circumvents the explicit construction of the background matrix and the subsequent costly matrix operations, presenting a cost-effective approach for implementing SCDA. Increasing the data volume and the number of input features, which can be achieved by increasing the ensemble size and integrating heterogeneous data from different sources, could further enhance the computational efficiency of SCDA based on the ML strategy. However, the limitation of ML in handling imbalanced data sets may hinder further improvement, so solutions such as introducing physical mechanisms into the ML strategy become imperative (Xie et al. 2021).

Availability of data and materials

The data sets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Abbreviations

CDA:

Coupled data assimilation

SCDA:

Strongly coupled data assimilation

WCDA:

Weakly coupled data assimilation

CCEC:

Coupled cross background-error covariance

LACC:

Leading averaged coupled covariance

SST:

Sea surface temperature

SSS:

Sea surface salinity

SSH:

Sea surface height

OLR:

Outgoing longwave radiation

PRC:

Precipitation rate

T2m:

The air temperature at 2 m

EnKF:

Ensemble Kalman filter

ML:

Machine learning

MLP:

Multilayer perceptron

\({R}^{2}\) :

Coefficient of determination

RMSE:

Root-mean-square error

MAE:

Mean absolute error

Corr:

Pearson correlation coefficient

\({D}_{KL}\) :

Kullback–Leibler divergence

\(JSD\) :

Jensen–Shannon divergence

Acknowledgements

This work was supported by the National Key R&D Program of China (Grant No. 2023YFF0805202), the National Natural Science Foundation of China (Grant No. 42175045), and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB42000000).

Funding

This work was supported by the National Key R&D Program of China (Grant No. 2023YFF0805202), the National Natural Science Foundation of China (Grant No. 42175045), and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB42000000).

Author information

Contributions

Ziying Xuan performed the data analysis and wrote the original draft. Fei Zheng provided the funding acquisition, and guided the revision of the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Fei Zheng.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

Distribution similarity and loss evolution

See Figs. 6, 7, 8, 9

Fig. 6

\(JSD\) for examples of local relation between SST and OLR, when 100 points are selected. a, b at \((5^\circ {\text{S}}, 135^\circ {\text{E}})\); c, d at \((5^\circ {\text{S}}, 155^\circ {\text{W}})\); e, f at \((2.5^\circ {\text{S}}, 105^\circ {\text{W}})\). The values after the “selected” are \(JSD\)

Fig. 7

Same as Fig. 6 but for the relation between SST and OLR at (\(5^\circ {\text{S}}{-}5^\circ {\text{N}}, 130^\circ {\text{E}}{-}100^\circ {\text{W}}\)), when 1500 points are selected

Fig. 8

Analysis distributions produced by EnKF, quadratic fitting and MLP strategies during one assimilation cycle for the examples of the relation between SST and OLR: a near-linear relation at \((5^\circ {\text{S}}, 135^\circ {\text{E}})\); b weak nonlinear relation at \((0^\circ , 165^\circ {\text{W}})\). In (a, b), the values in the legend are \({D}_{KL}\). c Analyses produced by different strategies during one assimilation cycle for strong nonlinear relation at \((2.5^\circ {\text{S}}, 105^\circ {\text{W}})\)

Fig. 9

Evolution of the loss function with epochs for the training and validation data set for relations between \(\Delta SST\) and \(\Delta OLR\) at (a) \((5^\circ {\text{S}}, 135^\circ {\text{E}})\); the PDF-dependent (b) and the evenly selected example (c) at \((2.5^\circ {\text{S}}, 105^\circ {\text{W}})\)

Appendix B

Influence of observation error covariance in adjustment

In real-world assimilation experiments, the observation error covariance \({\varvec{R}}\) can be derived from the observation error of instruments and is usually non-zero. In this study, when only the oceanic observation \({x}_{o}\) is available, \({\varvec{R}}\) reduces to an adjustable positive variance. If we set \({\varvec{R}}={\varvec{c}}{\sigma }_{x}^{2}\), where \({\varvec{c}}\) is an adjustable positive constant and \({\sigma }_{x}^{2}\) is the background error variance of the oceanic component, then the update Eq. (1) becomes:

$$\varvec{X}^{a} = \varvec{X}^{p} + \left( {\begin{array}{*{20}c} {\sigma _{x}^{2} } & {\sigma _{{xy}} } \\ {\sigma _{{yx}} } & {\sigma _{y}^{2} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} 1 \\ 0 \\ \end{array} } \right)\left( {\left( {\begin{array}{*{20}c} 1 & 0 \\ \end{array} } \right)\left( {\begin{array}{*{20}c} {\sigma _{x}^{2} } & {\sigma _{{xy}} } \\ {\sigma _{{yx}} } & {\sigma _{y}^{2} } \\ \end{array} } \right)\left( {\begin{array}{*{20}c} 1 \\ 0 \\ \end{array} } \right) + \varvec{c*}\sigma _{x}^{2} } \right)^{{ - 1}} \delta \varvec{X}$$
(12)

and the analyses of \(x\) and \(y\) are described as:

$${x}_{a}={x}_{p}+\frac{1}{1+{\varvec{c}}}({x}_{o}-{x}_{p})$$
(13)
$${y}_{a}={y}_{p}+\frac{{\sigma }_{yx}}{{\sigma }_{x}^{2}}\frac{1}{1+{\varvec{c}}}({x}_{o}-{x}_{p})$$
(14)
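
The step from Eq. (12) to Eqs. (13) and (14) can be written out explicitly. The expansion below is a worked sketch that uses only the quantities defined above; the symbol \({\varvec{K}}\) for the gain is introduced here for convenience and does not appear in the original, and \(\delta {\varvec{X}}\) is taken to be the scalar innovation \({x}_{o}-{x}_{p}\), as Eq. (13) implies.

$${\varvec{B}}{\varvec{H}}^{T}=\left(\begin{array}{c}{\sigma }_{x}^{2}\\ {\sigma }_{yx}\end{array}\right),\qquad {\varvec{H}}{\varvec{B}}{\varvec{H}}^{T}+{\varvec{R}}={\sigma }_{x}^{2}+{\varvec{c}}*{\sigma }_{x}^{2}=(1+{\varvec{c}}){\sigma }_{x}^{2}$$

$${\varvec{K}}=\frac{1}{(1+{\varvec{c}}){\sigma }_{x}^{2}}\left(\begin{array}{c}{\sigma }_{x}^{2}\\ {\sigma }_{yx}\end{array}\right)=\frac{1}{1+{\varvec{c}}}\left(\begin{array}{c}1\\ {\sigma }_{yx}/{\sigma }_{x}^{2}\end{array}\right)$$

Multiplying \({\varvec{K}}\) by \({x}_{o}-{x}_{p}\) and adding the result to \({{\varvec{X}}}^{p}\) gives the two component equations above.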

Equations (13) and (14) clearly show that the oceanic observation innovation is projected onto the atmospheric component through a linear coefficient. When \({\sigma }_{x}^{2}\) is fixed, the value of \({\varvec{c}}\) influences the analyses generated by EnKF. To further clarify the influence of \({\varvec{R}}\), Fig. 10 provides an example based on SST and OLR at a local grid point. It shows that when \(0\le {\varvec{c}}<1\) (\(0\le {\varvec{R}}<{\sigma }_{x}^{2}\)), the SCDA model weights the observation \({x}_{o}\) more than the prior prediction \({x}_{p}\), and when \({\varvec{c}}=0\) (\({\varvec{R}}=0\)), the observation is completely trusted to modify the prior predictions \({{\varvec{X}}}^{p}\). When \(1\le {\varvec{c}}\) (\({\sigma }_{x}^{2}\le {\varvec{R}}\)), \({x}_{p}\) plays a more important role in the assimilation, and the analyses \({{\varvec{X}}}^{a}\) gradually approach \({{\varvec{X}}}^{p}\) as \({\varvec{c}}\) increases. In the limit \({\varvec{c}}\to \infty\) (\({\varvec{R}}\to \infty\)), the analyses \({{\varvec{X}}}^{a}\) equal \({{\varvec{X}}}^{p}\). Although \({\varvec{R}}\) derived from instruments will influence the analyses, it will not change the linear characteristic of conventional SCDA strategies. However, the variance–covariance relationship (\({\sigma }_{yx}/{\sigma }_{x}^{2}\)) between cross-sphere variables estimated by ensemble members can be replaced by a nonlinear form to supply nonlinearity to SCDA strategies.
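
The dependence of the analyses on \({\varvec{c}}\) can be illustrated with a short numerical sketch. The Python code below is not the authors' implementation; the function names and all numerical values (prior SST and OLR, the observation, the variances and the covariance) are hypothetical and chosen only for illustration. It evaluates Eqs. (13) and (14) directly and, as a cross-check, the matrix form of Eq. (12).

```python
import numpy as np


def scda_linear_analysis(x_p, y_p, x_o, var_x, cov_yx, c):
    """Eqs. (13)-(14): only x is observed and R = c * var_x."""
    weight = 1.0 / (1.0 + c)          # weight given to the observation
    innovation = x_o - x_p            # oceanic observation innovation
    x_a = x_p + weight * innovation
    y_a = y_p + (cov_yx / var_x) * weight * innovation
    return x_a, y_a


def scda_matrix_analysis(x_p, y_p, x_o, var_x, cov_yx, var_y, c):
    """Eq. (12) evaluated with explicit 2x2 matrices."""
    B = np.array([[var_x, cov_yx], [cov_yx, var_y]])  # background matrix
    H = np.array([[1.0, 0.0]])                        # observe x only
    R = np.array([[c * var_x]])                       # observation error
    K = B @ H.T @ np.linalg.inv(H @ B @ H.T + R)      # gain, shape (2, 1)
    return np.array([x_p, y_p]) + K @ np.array([x_o - x_p])


# Hypothetical values: SST in degC, OLR in W m^-2.
x_p, y_p, x_o = 28.0, 240.0, 29.0
var_x, cov_yx, var_y = 0.25, -2.0, 100.0

for c in (0.0, 0.5, 1.0, 10.0, 1000.0):
    x_a, y_a = scda_linear_analysis(x_p, y_p, x_o, var_x, cov_yx, c)
    print(f"c={c:8.1f}  x_a={x_a:.4f}  y_a={y_a:.4f}")
```

For any value of \({\varvec{c}}\), the increment of \(y\) in this sketch stays proportional to the innovation \({x}_{o}-{x}_{p}\); changing \({\varvec{c}}\) only rescales it.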

See Fig. 10.

Fig. 10

Evolution of the analyses of (a) SST and (b) OLR with \({\varvec{c}}\) at a local grid point, when only the observation of SST \({x}_{o}\) is available. The orange dashed lines denote the prior predictions. The text labels give the values of the analyses (blue crosses) for the given \({\varvec{c}}\)

Appendix C

Construction of nonlinear background matrix

To enhance the accuracy of state estimation, we try to introduce nonlinearity into SCDA by constructing a nonlinear \({\varvec{B}}\) based on quadratic fitting. Quadratic fitting offers the benefits of simplicity, nonlinearity, and a higher level of accuracy. The analysis formulations of \(x\) and \(y\) then become:

$${x}_{a}={x}_{p}+\delta x$$
(15)
$${y}_{a}={y}_{p}+a\delta {x}^{2}+b\delta x+c$$
(16)

where \(\delta x={x}_{o}-{x}_{p}\). The corresponding tangent linear form can be written as:

$${\delta x}_{a}=\delta {x}_{o}$$
(17)
$${\delta y}_{a}=2a\left({x}_{o}-{x}_{p}\right)\delta {x}_{o}+b\delta {x}_{o}$$
(18)

The reconstructed nonlinear \({\varvec{B}}\) can then be obtained through inverse deduction:

$$B=\left(\begin{array}{cc}1& 0\\ 2a\left({x}_{o}-{x}_{p}\right)+b& 0\end{array}\right)$$
(19)

Compared to the original one, the analysis equation has one additional constant vector \({\varvec{C}}={(0,c)}^{T}\):

$${{\varvec{X}}}^{a}={{\varvec{X}}}^{p}+{\varvec{B}}{{\varvec{H}}}^{T}{\left({\varvec{H}}{\varvec{B}}{{\varvec{H}}}^{T}\right)}^{-1}\delta {\varvec{X}}+{\varvec{C}}$$
(20)

This simplest scenario demonstrates that the nonlinear \({\varvec{B}}\) in Eq. (19) is no longer a symmetric constant matrix; its elements are functions of the observations. However, as the polynomial degree and the numbers of variables and observations increase, along with segmented partitioning, the rapidly growing complexity of \({\varvec{B}}\) will cause a surge in computational cost. The adaptability of this matrix requires further examination. Therefore, more practical nonlinear strategies are needed for SCDA.
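
The quadratic-fitting update of Eqs. (15) and (16) and the matrix of Eq. (19) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the training pairs are synthetic and noise-free, and the coefficients are obtained with an ordinary least-squares fit via `numpy.polyfit`, which the source does not specify.

```python
import numpy as np


def fit_quadratic(dx_samples, dy_samples):
    """Least-squares fit dy ~ a*dx**2 + b*dx + c; returns (a, b, c)."""
    a, b, c = np.polyfit(dx_samples, dy_samples, deg=2)
    return a, b, c


def quadratic_analysis(x_p, y_p, x_o, coeffs):
    """Eqs. (15)-(16): x is set to the observation, y gets a quadratic increment."""
    a, b, c = coeffs
    dx = x_o - x_p
    x_a = x_p + dx
    y_a = y_p + a * dx**2 + b * dx + c
    return x_a, y_a


def nonlinear_B(x_p, x_o, coeffs):
    """Eq. (19): B whose lower-left element depends on x_o - x_p."""
    a, b, _ = coeffs
    return np.array([[1.0, 0.0],
                     [2.0 * a * (x_o - x_p) + b, 0.0]])


# Synthetic training pairs from an assumed quadratic relation (illustrative only).
rng = np.random.default_rng(0)
dx_train = rng.uniform(-2.0, 2.0, size=200)
dy_train = -3.0 * dx_train**2 + 1.5 * dx_train + 0.5

coeffs = fit_quadratic(dx_train, dy_train)
x_a, y_a = quadratic_analysis(28.0, 240.0, 29.0, coeffs)
B = nonlinear_B(28.0, 29.0, coeffs)
print("coefficients (a, b, c):", coeffs)
print("analysis (x_a, y_a):", x_a, y_a)
print("B is symmetric:", np.allclose(B, B.T))
```

Because the lower-left element of `B` is recomputed from \({x}_{o}-{x}_{p}\) at every update, the matrix has to be rebuilt for each observation, which hints at the cost growth described above.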

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Xuan, Zy., Zheng, F. & Zhu, J. The effectiveness of machine learning methods in the nonlinear coupled data assimilation. Geosci. Lett. 11, 43 (2024). https://doi.org/10.1186/s40562-024-00347-5

  • DOI: https://doi.org/10.1186/s40562-024-00347-5

Keywords