Potential of deep predictive coding networks for spatiotemporal tsunami wavefield prediction

Data assimilation is a powerful tool for directly forecasting tsunami wavefields from the waveforms recorded at dense observational stations like S-Net without the need to know the earthquake source parameters. However, this method requires a high computational load and a quick warning is essential when a tsunami threat is near. We propose a new approach based on a deep predictive coding network for forecasting spatiotemporal tsunami wavefields. Unlike the previous data assimilation method, which continuously computes the wavefield when observed data are available, we use only a short sequence from previously assimilated wavefields to forecast the future wavefield. Since the predictions are computed through matrix multiplication, the future wavefield can be estimated in seconds. We apply the proposed method to simple bathymetry and the 2011 Tohoku tsunami. The results show that our proposed method is very fast (1.6 s for 32 frames of prediction with 1-min interval) and comparable to the previous data assimilation. Therefore, the proposed method is promising for integration with data assimilation to reduce the computational cost.


Introduction
Tsunami forecasting systems have advanced significantly in recent years, especially since the 2004 Indian Ocean and 2011 Tohoku tsunamis. In most forecasting systems, information on the earthquake source parameters, that is, epicenter, focal mechanism, and depth, are necessary to run the tsunami propagation model (Satake 2014(Satake , 2015. Calculating the earthquake source parameters from seismic observations in real-time is a difficult task because of the complexity of the structure beneath the Earth and the uncertainties in source characterization. Even though the parameters can also be calculated using the waveform inversion method (Satake 1989), it still requires many trials and errors to find the best-fit fault configuration. With the vast deployment of offshore observational stations in recent years, many studies prefer tsunami forecasting systems that estimate sea surface deformation instead of the earthquake source mechanisms (Maeda et al. 2015;Tsushima et al. 2011;Wang et al. 2019a). Several studies have proposed comprehensive forecasting systems that use a tsunami database to forecast tsunami inundation caused by a near-field earthquake (Fauzi and Mizutani 2020; Gusman et al. 2014;Mulia et al. 2018;Setiyono et al. 2017). However, these studies are not free from uncertainty in estimating the tsunami source.
Real-time data transmission through a high density of observational instruments interconnected with a cabled network has been conducted in Japan for more than a decade to provide tsunami predictions. In the Nankai region, the Dense Oceanfloor Network System for Earthquakes and Tsunamis (DONET), which is a cabled network for tsunami and seismic measurement, has been extended to cover a wider region (Baba et al. 2014;Kaneda et al. 2015). A new cabled network, Fauzi and Mizutani Geosci. Lett. (2020) 7:20 Seafloor Observation Network for Earthquakes and Tsunami along the Japan Trench (S-Net) (Yamamoto et al. 2016), is currently being installed in the Japan Trench, and some operations were started in 2016.
With the availability of dense observational stations, Maeda et al. (2015) first proposed a data assimilation (DA) method which was able to forecast tsunami wavefields without the need to know information on the earthquake source parameters by utilizing the observed tsunami amplitude. The technique was introduced to avoid uncertainties in estimating the tsunami source. The method was successfully implemented for the 2011 Tohoku earthquake. However, besides requiring a dense observation network, it also has a high computational cost because of the direct linear long-wave (LLW) numerical simulation. Several examples of the use of a dense observational network for tsunami DA, such as DONET, have been demonstrated in previous studies (Wang et al. 2018(Wang et al. , 2019b. To reduce the computational cost, Wang et al. (2017) used a Green's function database to speed up the assimilation process. However, unlike the previous tsunami DA, the method was not developed to predict the tsunami wavefield, but only to synthesize waveforms at points of interest. An improved DA was presented by Yang et al. (2019) with an even higher computational cost because of the introduction of the ensemble Kalman filter into the model.
Here we conduct a numerical experiment by integrating a deep predictive coding network (Lotter et al. 2017) with the DA. The method was initially used in the computer vision field to predict future video frames. In this study, the predictive coding network is used to predict the next time steps of the tsunami wavefield utilizing a short sequence of the previously assimilated wavefield. In this manner, tsunami propagation after the last assimilated wavefield is estimated by the predictive coding network. In other words, the method we propose is a joint method between DA and predictive coding network, which is not independent from the conventional DA. Like the other typical datadriven methods, the computational time of the predictive coding network is short. The method reduces the computational time for DA; this enables the tsunami wavefield and arrival time at the coastal region to be predicted within a short time (1.6 s). We first developed a tsunami propagation database from predefined scenarios to train the predictive coding network. To evaluate the performance of our proposed method, we conducted experiments with two cases: simple bathymetry, and the 2011 Tohoku earthquake.

Data assimilation
The sequential DA method has been widely used in weather forecasting (Kalnay 2002). The tsunami DA method proposed by Maeda et al. (2015) for simulating tsunami wavefields in real-time is based on the optimal interpolation method, which has a lower computational cost than the more advanced method using the ensemble Kalman filter, under the assumption that the system is linear. Even though the optimal interpolation method is simple, however, the DA approach showed good agreement with real tsunami data (Gusman et al 2016;Heidarzadeh et al. 2019;Wang et al. 2017Wang et al. , 2019a and the synthetic case (Mulia et al. 2017). In the numerical simulation, the tsunami wavefield at the nth time step is represented as x n η n�t, x, y , M n�t, x, y , N n�t, x, y , where η is the tsunami height, M and N are the depthintegrated tsunami fluxes in the x and y directions and t is the time step. The DA method can be expressed as At each time step, the forecasted tsunami wavefield x f n at the nth time step is computed by numerically solving linear long-wave theory using the assimilated wavefield at the previous time step x a n−1 . The vector F is the propagation matrix, which corresponds to the 2-D LLW tsunami propagation model. The residuals between the observed tsunami amplitude and the forward simulation at the observational stations y n is calculated as y n − Hx f n . H is a vector that has a value of 1 at the observational stations and zero at the other elements, and is used to extract the forecasted tsunami height at the observational stations. The residuals are then multiplied by the smoothing matrix W to bring the assimilated wavefield closer to the observed tsunami wavefield. The smoothing matrix is an essential factor in DA as it controls the quality of the assimilated wavefield. We compute the smoothing matrix by solving the following equation: where P = �ε f ε f T � and R = �ε O ε OT � are the error covariance matrices of the forward simulation and the observations, respectively. ε f and ε O are the errors of the forward simulation and observations, respectively, while ε f T and ε OT are the corresponding transpose matrices. We assume the Gaussian-distributed errors to compute both covariance matrices, with a characteristics distance of 20 km (Maeda et al. 2015;Wang et al. 2019a).
By iteratively solving Eqs. (1) and (2), the tsunami wavefield is assimilated, and we can obtain forecasted tsunami waveforms at any location inside the model domain during and after the assimilation process.

Deep predictive coding networks
Traditionally, static images are used to train computer vision models. However, in the real world, the visual world involves spatiotemporal movement. As a complex system, the human brain is continuously making spatiotemporal predictions based on the incoming sensory stimuli, and this is mirrored in the concept of predictive coding (Friston and Kiebel 2009;Rao and Ballard 1999). Predictive coding networks were initially developed by Lotter et al. (2017) based on this concept, but reformulated in modern deep learning techniques and trained using a gradient descent method with an implicitly embedded loss function. A network consists of several repeating stacked layers that make local predictions of the input to the modules. The difference between the actual input and its prediction then proceeds to the higher layer. The architecture of the proposed method is shown in Fig. 1. Each level of the network is composed of four main components: an input convolutional unit ( A l ), a recurrent representation unit ( R l ), a prediction unit ( A l ) and an error representation unit ( E l ). The recurrent representation unit R l estimates the prediction A l of the input A l . The error unit E l computes the difference between A l and A l , and then passes it to the next layer of the network as input A l+1 . The representation unit R l receives a copy of the error matrix E l along with the input from the representation unit of the higher layer R l+1 , which is then used to estimate future predictions.
Predictive coding networks initially focused on image sequences in video data. In this study, we use the assimilated tsunami wavefields from the DA process. The readers are referred to Lotter et al. (2017) for the detailed explanation of the predictive coding networks. Considering a sequence from the assimilated tsunami wavefield x t as the input to the model, the target for the lowest layer of the network is set to the input sequence itself, that is A t 0 = x t ∀t . Except for layer 0, the targets for the higher layers A t l are determined by a convolution of error units from the lower layer E t l−1 , which is followed by an exponential linear unit (ELU) activation function (Clevert et al. 2016) and maxpooling, as described in Eq. (4). After several experiments, we found that the ELU activation function is more suitable for our study than the rectified linear unit function that was used in the previous model (Lotter et al. 2017).
where H is the max-pooling result of f (C) , C is the convolution result of E t l−1 , and f is the ELU activation function. We used the convolutional long short-term memory units (convolutional LSTM) (Hochreiter and Schmidhuber 1997;Shi et al. 2015) as the backbone for the representation units. Convolutional LSTM is the most important part of the model as it keeps the information Fig. 1 Architecture of the deep predictive coding network from the previous time step and passes it to the next time step of the sequence, similar to the human brain's ability in remembering or forgetting information. By using the convolutional LSTM, R t l is then determined based on the representation from the previous time step R t−1 l , E t−1 l , as well as R t l+1 . R t l+1 should go through an upsampling procedure because of the max-pooling in A t l units. The predictions A t l are estimated through a convolution of R t l and followed by an ELU activation function (Eq. (5)): where K is the convolutional results of R t l . Finally, the error units E t l are computed from the difference between A t l and A t l and then divided into ELU-activated positive and negative values [Eq. (6)]. The model is trained by minimizing the sum of the error units using a gradient descent method to fine-tune the network parameters. The training loss which should be minimized is expressed in Eq. (7): where λ t and λ l are the weighting factors by time and layer, respectively, and n l is the number of units in the lth layer. We set λ t = 0 for the first time step and λ t = 1 for the remaining steps, while λ 1 = 1 for the lowest layer and λ 1 = 0 for the upper layers. Those criteria are selected because they produced the best performing model as suggested by Lotter et al. (2017).
First, we need to develop a database from multiple predefined scenarios to train the model. The scenario and experiment settings are explained in the next section. For each scenario, we simulate the tsunami propagation using the LLW tsunami model. The synthesized tsunami heights at every time step ( t = 1 s) are used to estimate the tsunami wavefield through the assimilation process using Eq. (1). Once the assimilation process has started, we use the assimilated wavefield from those scenarios with 1-min intervals to train the model. Since the proposed algorithm learns the pattern of the tsunami propagation during the training process, we would expect the model to behave similarly to the tsunami propagation model. The predictive coding network in this study was composed of four layers of networks with 3 × 3 filter sizes, and channel sizes of 1, 48, 96 and 192 for each layer. The model was trained using a gradient descent algorithm, RMSprop, with a learning rate α of 0.001, decreasing by a factor of 10 halfway through the training process.

Joint DA-predictive coding
In this study, the predictive coding network is integrated with DA to provide quick early tsunami warnings. The DA can represent the tsunami wavefields accurately, but it comes at a relatively high computational cost. Therefore, the predictive coding network is introduced to save computational time. Figure 2 shows a simple schematic of the joint DA-predictive coding method. In a real practice, once the tsunami reaches the offshore observational stations, the DA starts to reconstruct the wavefield. By only using a short sequence of data-assimilated wavefield as the input, the predictive coding will predict the future time steps promptly. The DA process keeps running in the background, so that if a newly assimilated wavefield is available, the predictive coding will continue the prediction process with the new data. Thus, predictions with longer time steps with better accuracy can be obtained.
With this simple approach, if one has already been using DA for its tsunami warning system, there is no need to significantly change the system because the predictive coding will directly use the DA's output to make future predictions. Therefore, with much faster computational time, as discussed in the next section, integrating predictive coding with DA is expected to provide a quicker forecast than solely using DA.
Here, we configured the model using four assimilated wavefields, enabling the model to predict the next four steps in the wavefield. Four frames of the wavefields are selected by considering the fact that the model Fig. 2 Schematic of the joint DA-predictive coding method accumulates the information over time to provide accurate predictions. After several trials, we found that four frames of the input are adequate to provide reasonable predictions. Adding more input frames may provide more accurate predictions; however, it also increases the computational time of DA, which may delay the tsunami warning.

Numerical implementations and results
To quantify the prediction accuracy of the proposed method, the forecasted wavefields are compared to the assimilated wavefields computed by DA. However, quantitative assessment of the generative models is a complex problem (Theis et al. 2016), and we adopted a structural similarity index measurement (SSIM) (Wang et al. 2004) to measure the model performance quantitatively. SSIM is currently widely used in the image vision field to provide a clear judgment of the similarity between two images: where µ x and µ y are the averages of the images x and y, respectively, and σ 2 x and σ 2 y are the variances of x and y, respectively. C 1 = (k 1 L) 2 and C 2 = (k 2 L) 2 are two variables that are designed to avoid a zero denominator, while L is the dynamic range of the pixel values and k 1 = 0.01 and k 2 = 0.03 are constants. SSIM is a signed expression ranging between -1 and 1, with a larger value indicating a greater similarity. We also used a conventional statistical measure, the root mean square error (RMSE), which measures the average magnitude of the prediction error: where η i and η i are the tsunami height from the proposed method and DA at point i , respectively.

Simple bathymetry
To assess the performance of our proposed method, we conducted a numerical experiment with a simple bathymetry profile as the simplest case. The numerical domain is shown in Fig. 3. We set a homogeneous We set an array of 25 virtual stations as a dense observational station is typically required by the conventional DA method to provide a good assimilation result. The distance between observational points in the x-and y-directions was 15 km. We used a two-dimensional cosine basis function (Hossen et al. 2015) for the initial water surface: where L is the characteristics source size, and we chose L = 70 km. The maximum initial height η 0 = 1 at x = x 0 and y = y 0 , which is the center of the source. Three scenarios of the initial tsunami sources were used; two for the training, and one for testing purposes. The location of the initial tsunami sources for training and testing is selected arbitrarily and shown in Fig. 3. Because the wave propagation in a homogeneous bathymetry is not complex, we assumed that two initial sources are enough for model training. For training, we first simulated the tsunami propagation using the LLW tsunami model. Assuming the recorded waveform at observational stations as the observed data, once the tsunami signal had been recorded at the stations, the assimilation process had started, and we concatenated the assimilated wavefield with 1-min intervals from those two scenarios as the training input. We found that training the model with a training epoch of 350, which indicate the number of times the model go through the training dataset, was enough to provide a good representation of the database. Similarly, for the prediction, once the assimilation process had begun, four assimilated tsunami wavefields with an interval of 1 min originating from the testing scenario were used as the input to the model.
A comparison between the predictions, DA, and forward modeling is shown in Fig. 4. We set the initial water surface at t = 0 as the tsunami source. When the tsunami reached the observational stations, in this case, at t = 11 min, the data assimilation started. Here, we used four assimilated wavefields starting from t = 11 min to 14 min (Fig. 4a) to predict the next four frames of the wavefield, i.e., t = 15 to 18 min (Fig. 4b). From the comparison, the estimated wavefields are visually very similar to the DA (Fig. 4c). The model can mimic the characteristics of the DA indicated with SSIM and RMSE values ranging from 0.850 to 0.955 and 0.003 to 0.009 m, respectively. We included a box and whiskers diagram to further analyze the differences between the forecasts and DA (Fig. 5). The box and whiskers plots suggest that the forecasts had a similar distribution to the DA, in which the median of the absolute errors are very close to the zero. We also randomly selected ten observational stations and compared the recorded waveforms produced by the proposed method and DA (Fig. 6). The selected stations are plotted in Fig. 3. Overall, the forecasted waveforms were almost identical to the waveforms estimated by the DA with a mean correlation coefficient of 0.988. The figure shows that the DA (Fig. 4c) provided a good wavefield approximation in the vicinity of the observational points, though a broader coverage of the stations may be necessary to improve the quality of the DA, as exhibited in the forward modeling results (Fig. 4d). However, here we only focused on the performance of the proposed model over the DA.

The 2011 Tohoku tsunami
Next, we applied the proposed method to the tsunami induced by the 2011 Tohoku earthquake. We used bathymetry data with a resolution of 4050 m (Fig. 7a). The bathymetry dataset, which was obtained from the Cabinet Office of Japan, is based on a nautical chart and digital data compiled by the Japan Coast Guard and the Japan Hydrographic Association. For model training, we arranged multiple scenarios for simple rectangular fault models. We set 15 reference points as the top-center of the faults. The placement of the reference points is shown in Fig. 7b. We set an earthquake magnitude ranging from 8.0 to 9.0 (0.2 intervals); in total, there were 90 scenarios. The earthquake depth, strike and dip angles were determined based on SLAB 2.0 (Hayes et al. 2018), while the rake angle was set to 90º for all scenarios, representing the thrust fault mechanism. To calculate the area of the fault, the magnitude scale We used a coseismic deformation in an elastic half-space model (Okada 1985), to compute the initial sea surface elevation for the DA. We set the same number of training epoch as used in the simple case, and the training process was conducted in a similar manner. For model testing, we used a source model of the 2011 Tohoku earthquake, which was calculated by conducting  a joint inversion using a tsunami, GPS and seafloor deformation data (Gusman et al. 2012).
A comparison between the predictions, DA and forward modeling is shown in Fig. 8. In this case, the DA process started at t = 1 min. We used four frames of the assimilated wavefield from t = 1 to 4 min (Fig. 8a) as the input to the model to predict the next four frames of the wavefield (t = 5 to 8 min). The results show that the predictions (Fig. 8b) are very similar to the DA (Fig. 8c) with SSIM and RMSE values ranging from 0.949 to 0.955 and 0.173 to 0.196, respectively. We further explored the capability of the proposed method by recursively feeding back the prediction into the model to provide longer wavefields predictions up to t = 36 min. We show four snapshots of the resulting forecasts at t = 14, 21, 28 and 35 min (Fig. 9). In Fig. 10, we also compared the assimilated and forecasted waveforms from ten randomly selected S-Net stations and with real waveform data from the 2011 Tohoku tsunami recorded at five GPS buoys (GB801, GB802, GB803, GB804, GB806). The location of the GPS buoys and selected observational stations are shown in Fig. 7a. At ten S-Net stations, the forecasted waveforms show reasonably good agreement with the DA, with a mean correlation coefficient of 0.748. However, both the assimilated and forecasted waveforms underestimated the observations at the GPS buoys. Nonetheless, both exhibit similar trends to the observations.

Discussion
Computational speed is one of the most critical factors for real-time tsunami forecasting. In the previous studies (Gusman et al. 2016;Maeda et al. 2015;Yang et al. 2019), tsunami DA successfully provided accurate results at a relatively high computational cost. The predictive coding network learns the pattern of tsunami propagation during the training period. Once the model has been trained, the spatiotemporal tsunami wavefield prediction can be made quickly by only performing a matrix multiplication procedure. We used a personal computer equipped with an Intel i7 processor, an 8-gigabyte graphics processing unit (GPU), and 16 gigabytes of memory for model training and testing. Computationally, generating four frames of future wavefield predictions utilizing four frames of the assimilated wavefield as the input requires a computational time of 0.2 s. In total, it only requires 1.6 s to provide wavefield predictions from t = 5 to 36 min in the case of the 2011 Tohoku earthquake, while the DA requires 360 s. With this quick computation time, we can provide immediate warnings, which is very important for the evacuation process.
In the previous DA, the tsunami model is based on LLW theory. Even though incorporating nonlinearity or a dispersive effect may improve the quality of the DA, it is better to avoid, as it may further increase the computational cost. Since the learning process of the proposed method is based on the database, incorporating those scenarios is more practical, because the tsunami simulation is conducted in advance.
For the case of the 2011 Tohoku tsunami, both the proposed method and DA-generated waveforms at the GPS buoys underestimate the observations (Fig. 10). With the limited memory capacity of our standard GPU, we decided to use a relatively coarse bathymetry resolution. As a result, a high accuracy forecasting result may not be achieved in this study, because, as explained in previous studies (Gusman et al. 2009;Satake and Tanioka 1995), the forecast accuracy is strongly dependent on the topography resolution. A more sophisticated GPU should be used in future studies to accommodate a finer bathymetry resolution. In addition, since the error propagates over time steps, the optimum length of future predictions should be carefully investigated in future work. It is clear from Figs. 4b, 8b, and 9a that the SSIM and RMSE tend to degrade over the time steps. The degradation is also shown in the box and whiskers plots (Fig. 11), where the range and length of the box and whiskers tend to increase at the longer time steps of the predictions. Another problem is that the S-Net stations are located inside the earthquake source region. In such a condition, the sea surface displacement cannot be directly observed by the offshore bottom pressure gauges after an earthquake occurs because the deformation wavelength is much larger than the ocean depth. Therefore, the sea surface elevation shortly after the earthquake is almost the same as that before the earthquake . To solve Fig. 10 Comparisons of the waveforms between the proposed method, DA and observations at GPS buoys and ten randomly selected stations for the 2011 Tohoku tsunami. The location of the selected stations is shown in Fig. 7a this problem, a method which able to generate sea surface deformation inside the source region (e.g., Tanioka 2018) should be integrated with the DA. Overall, based on the results, the proposed network shows promise for integration with DA to reduce the computational cost.

Conclusions
We conducted a study using a deep predictive coding network along with the DA to forecast a tsunami wavefield in real-time. The objective of this research was to assess whether the application of a deep predictive coding network combined with the DA could be implemented for a real-time warning system. We conducted two study cases: simple bathymetry and the 2011 Tohoku tsunami.
The results showed that only utilizing four frames of the assimilated wavefields enabled the model to satisfactorily forecast the next four frames of the wavefield with SSIM values in the range 0.850-0.955, and up to 32 future frames with SSIM values in the range 0.781-0.955 for the simple bathymetry and the 2011 Tohoku tsunami, respectively. With a quick computational time and reasonably accurate results, we conclude that the proposed method is applicable for a real-time warning system. Abbreviations DA: Data assimilation; DONET: Dense Oceanfloor Network System for Earthquakes and Tsunamis; ELU: Exponential linear unit; GPS: Global Positioning System; GPU: Graphics processing unit; LLW: Linear long-wave; RMSE: Root mean square error; S-Net: Seafloor Observation Network for Earthquakes and Tsunami along the Japan Trench; SSIM: Structural similarity index measurement.