Temporal dynamics of streamflow: application of complex networks

Han, Xudong; Sivakumar, Bellie; Woldemeskel, Fitsum M.; Guerra de Aguilar, Milena

doi:10.1186/s40562-018-0109-8

Research Letter
Open access
Published: 22 March 2018

Geoscience Letters volume 5, Article number: 10 (2018)

Abstract

This study employs the concepts of complex networks to study the temporal dynamics of streamflow, with emphasis on the annual scale (i.e., year-to-year connections). The study proposes a new approach to construct the streamflow network at the annual scale: it uses the daily streamflow data, rather than the annual (mean or accumulated) streamflow data, to construct the annual streamflow network. With this approach, each year serves as a node in the network, with each node having a time series of daily streamflow values (not a single streamflow value). Streamflow data observed over a period of 151 years (October 1862–September 2013) in the Mississippi River basin at St. Louis, Missouri, USA are considered for implementation of the approach. The properties of the annual streamflow network are investigated using three complex network-based methods: degree centrality, clustering coefficient, and degree distribution. The sensitivity of the results to the streamflow correlation threshold is also examined. The results suggest that (1) there are only a few very significant nodes (years) in the annual streamflow network (degree centrality method); (2) the annual streamflow network is not a classical random graph, but may be a small-world network or scale-free network (clustering coefficient method); and (3) the network exhibits a combination of exponential and power-law distributions (degree distribution method). Based on the identification of a long period (around the 1950s–1990s) with very weak connections to the rest of the period studied, the results also suggest the influence of dam construction (and other anthropogenic factors) on the evolution of annual streamflow dynamics.

Background

Identification of patterns in data (e.g., streamflow) serves as a fundamental approach towards modeling and prediction of the underlying systems. Numerous methods have been developed for identification of patterns in data (in space, time, and space–time) and possible connections between the components involved. Such methods can be categorized in different ways depending on their concepts and use of data, such as linear and nonlinear, deterministic and stochastic, parametric and non-parametric, supervised and unsupervised, and their combinations. The methods include those that are based on correlation, trend, spectrum, data distribution, data reconstruction, dimension, scaling, regression, clustering, and classification, among others. They have been extensively applied to identify patterns in hydrologic data around the world; see, for example, Labat et al. (2011), Sivakumar and Singh (2012), Özger et al. (2013), Tongal and Berndtsson (2014), and Xu et al. (2015) for some recent studies, and Salas et al. (1995) and Sivakumar and Berndtsson (2010) for compilations.

A key aspect in the identification of patterns in data is the search for “connections.” In this context, the concepts of “complex networks” (e.g., Watts and Strogatz 1998; Barabási and Albert 1999; Girvan and Newman 2002; Estrada 2012) seem to provide new avenues—a network is a set of points called “nodes” connected by a set of connections called “links.” Applications of the concepts of complex networks in hydrology have been gaining momentum in the last few years. Thus far, they have included studies of river networks (Rinaldo et al. 2006; Zaliapin et al. 2010; Czuba and Foufoula-Georgiou 2014, 2015; Rinaldo et al. 2014), rainfall monitoring networks (Malik et al. 2012; Boers et al. 2013; Scarsoglio et al. 2013; Sivakumar and Woldemeskel 2015; Jha et al. 2015; Jha and Sivakumar 2017; Naufan et al. 2017), and streamflow monitoring networks (Tang et al. 2010; Sivakumar and Woldemeskel 2014; Halverson and Fleming 2015; Braga et al. 2016; Serinaldi and Kilsby 2016; Fang et al. 2017). Such studies have employed different methods, including degree centrality, clustering coefficient, degree distribution, closeness centrality, shortest path length, and community structure. The outcomes of such applications are encouraging, as they have important implications for the development of hydrologic models, interpolation/extrapolation of hydrologic data, and classification of catchments. The ability of the concepts of complex networks to represent all types of connections also makes them a potential candidate to serve as a generic theory for hydrology (Sivakumar 2015).

Despite their encouraging outcomes, it is important to recognize that most of the above studies have addressed only the spatial connections in hydrologic networks. Since temporal dynamics are an integral part of hydrologic systems, especially from the perspective of time series analysis for modeling and prediction, studying the suitability of complex networks for temporal connections is crucial. To our knowledge, the only studies that have attempted this, in the context of streamflow analysis, are those conducted by Tang et al. (2010), Braga et al. (2016), and Serinaldi and Kilsby (2016). Tang et al. (2010) employed the visibility graph algorithm (Lacasa et al. 2008) to construct networks for daily streamflow series of three rivers: one in China (the Yangtze River) and two in the United States (the Umpqua River and the Ocmulgee River). They then used degree distribution and accumulative degree distribution to identify the type of such streamflow networks. Using daily streamflow data, Braga et al. (2016) employed the horizontal visibility graph (HVG) to construct streamflow networks from 141 gaging stations that cover 53 Brazilian rivers. They further characterized these 141 networks by examining their degree distributions and clustering coefficients. They reported that the river discharges in several stations had evolved to become more or less correlated over the years and attributed that behavior to changes in the climate system and other man-made phenomena. Serinaldi and Kilsby (2016) used the directed horizontal visibility graph (DHVG) to study the dynamics of daily streamflow fluctuations from 699 stations in the continental United States. They explored irreversibility by mapping the time series into ingoing, outgoing, and undirected graphs and comparing the corresponding degree distributions. They showed that the degree distributions do not decay exponentially, but tend to follow a sub-exponential behavior. The outcomes of these studies have important implications for streamflow modeling, prediction, and catchment classification.

In the present study, we attempt to further advance the applications of the concepts of complex networks for temporal connections in streamflow. Our objective here is to study the year-to-year connections in streamflow, i.e., temporal dynamics at the annual scale. This is motivated by the need to study long-term water management and the influence of large-scale climate patterns as well as anthropogenic effects, including the role of climate change. However, taking advantage of the general availability of daily streamflow time series (for most locations around the world), this study adopts a new approach to construct the streamflow network at the annual scale. The study uses daily streamflow data and constructs the streamflow network corresponding to the annual scale, instead of using the annual (accumulated or average) streamflow and employing the visibility graph. In other words, in this study, each year is considered as a node, with each node consisting of a time series of (365 daily) streamflow values, rather than a single (annual) streamflow value. This approach is different from the one employed in Tang et al. (2010), Braga et al. (2016), and Serinaldi and Kilsby (2016), who considered each day as a node and the entire daily time series/year as a network. The properties of the annual streamflow network are then identified using different methods.

For implementation, streamflow data from the Mississippi River basin in the United States are studied. Specifically, daily streamflow data over a period of as many as 151 years (October 1862–September 2013) observed in the Mississippi River basin at St. Louis, Missouri are used. Considering each year as a node, three different methods are employed to investigate the connections in this annual streamflow network: degree centrality, clustering coefficient, and degree distribution. Different threshold values (i.e., correlations in streamflow between nodes) are also used to study the influence of threshold on the outcomes of degree centrality, clustering coefficient, and degree distribution methods.

The rest of this paper is organized as follows. First, the network construction and the three methods used in this study are described. Next, details of the study area and streamflow data are presented. Then, analysis and results are presented, followed by a discussion. Finally, some closing remarks are made.

Network methodology

Network construction

A network (or a graph) is a set of points joined together by a set of lines, as shown in Fig. 1. The points are referred to as nodes (or vertices) and the lines are referred to as links (or edges). Mathematically, a network can be represented as G = {P,E}, where P is a set of N nodes (P₁, P₂,…, P_N) and E is a set of n links. The network shown in Fig. 1 has N = 7 (nodes) and n = 8 (links), with P = {1, 2, 3, 4, 5, 6, 7} and E = {{1,7}, {2,4}, {2,5}, {2,7}, {3,7}, {4,7}, {5,6}, {6,7}}. Figure 1, consisting of a set of nodes of identical type connected by links of identical type, is perhaps the simplest form of network. This kind of network, however, is rarely seen in nature, since natural (e.g., streamflow) networks are often far more complex. Indeed, there are many ways in which natural networks may be more complex. For instance, networks can (1) have different types of nodes and/or links; (2) contain nodes and links with a variety of properties associated with them (e.g., weights); (3) have links that can be directed; (4) contain multi-links, self-links, and hyperlinks; and (5) contain nodes of two distinct types, with links running only between unlike types (called bipartite). For further details, the interested reader is directed to Estrada (2012), among others.
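The representation G = {P,E} can be illustrated with a minimal Python sketch. It stores the seven nodes and eight links listed for Fig. 1 and derives each node's set of neighbors; the function name `build_neighbors` is chosen here purely for illustration.

```python
# Network of Fig. 1: N = 7 nodes and n = 8 undirected links.
nodes = [1, 2, 3, 4, 5, 6, 7]
links = [(1, 7), (2, 4), (2, 5), (2, 7), (3, 7), (4, 7), (5, 6), (6, 7)]


def build_neighbors(nodes, links):
    """Map each node to the set of nodes it is directly linked to."""
    neighbors = {p: set() for p in nodes}
    for a, b in links:
        # Links are undirected, so each one is recorded at both ends.
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


neighbors = build_neighbors(nodes, links)
degrees = {p: len(neighbors[p]) for p in nodes}
print(degrees)
```

Since every link contributes to the degree of two nodes, the degrees sum to 2n = 16.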

In a network, the existence/non-existence of links is identified based on a measure that represents the strength of the link. The measure used to identify the link and its strength may be different, depending on the network under consideration and the problem of interest. For instance, in the analysis of spatial connections in a streamflow monitoring network (such as the one shown in Fig. 1), a common measure used is the spatial correlation between nodes, and node pairs that have spatial correlation values exceeding a certain threshold value (T) may be assigned links (e.g., Sivakumar and Woldemeskel 2014). However, in the analysis of temporal streamflow connections, the difference in streamflow values between nodes can be used as a measure, and node pairs that have differences below a certain threshold may be assigned links (e.g., Braga et al. 2016). With this basic network concept, construction of the streamflow network, in this study, to represent the temporal dynamics at the annual scale is described next.

Let us assume that we have daily streamflow data observed over a period of N years at a gaging station. If the objective is to study the day-to-day connections in streamflow, then one can construct the network based on the daily streamflow values using, for example, the visibility graph method (e.g., Lacasa et al. 2008), considering each day as a node in itself, with each node having a single streamflow value (see Fig. 2a), as has been done by, for example, Tang et al. (2010), Braga et al. (2016), and Serinaldi and Kilsby (2016). However, if the objective is to identify the year-to-year connections in streamflow (or connections at any scale coarser than daily), then two different approaches may be adopted:

1. Compute a certain statistic (e.g., mean, total) of streamflow for the annual scale, and then use the visibility graph method to construct the network based on such annual streamflow values. In this approach, each year is treated as a node (see Fig. 2b), and a node has only one streamflow value, i.e., the annual streamflow value; and
2. Use the daily streamflow values to construct the streamflow network at the annual scale. In this approach, again each year is treated as a node, but then each node is made up of a time series of (365 or 366) daily streamflow values (see Fig. 2c).
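For contrast with the second approach, the visibility graph idea behind the first approach can be sketched as follows. The sketch assumes the natural visibility criterion of Lacasa et al. (2008) with unit time steps: two samples are linked if every intermediate sample lies strictly below the straight line joining them. The input values are hypothetical and are not data from this study.

```python
def visibility_graph(series):
    """Natural visibility graph of a time series with unit time steps.

    Returns the set of undirected links (a, b), with a < b, between
    sample indices.
    """
    n = len(series)
    links = set()
    for a in range(n - 1):
        for b in range(a + 1, n):
            ya, yb = series[a], series[b]
            # a and b "see" each other if every intermediate sample c lies
            # strictly below the line joining (a, ya) and (b, yb).
            visible = all(
                series[c] < yb + (ya - yb) * (b - c) / (b - a)
                for c in range(a + 1, b)
            )
            if visible:
                links.add((a, b))
    return links


# Hypothetical annual values, for illustration only.
annual_flow = [3.0, 1.0, 2.0, 5.0, 4.0]
vg_links = visibility_graph(annual_flow)
print(sorted(vg_links))
```

Consecutive samples are always linked, because there is no intermediate sample to block the view.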

The present study adopts the latter approach for network construction of streamflow at the annual scale, as it possesses the following advantages over the former: (1) it is simple, as it considers the daily data as they are and eliminates the need for the visibility graph (or other methods) for network construction; (2) the construction takes into consideration the within-year streamflow variability to identify connections, rather than simply considering one annual value; and (3) the resulting network is similar to a network in space (i.e., each station as a node with a time series of streamflow and the connections between them as links), and therefore, the analysis becomes fairly straightforward and generic. For convenience in the present analysis, each year is considered to contain only 365 days (i.e., February 29 in leap years is excluded). The network construction adopted in this study for temporal dynamics is therefore more similar to the construction adopted in Sivakumar and Woldemeskel (2014) and Halverson and Fleming (2015) for spatial dynamics than to the one adopted in Tang et al. (2010), Braga et al. (2016), and Serinaldi and Kilsby (2016) for temporal dynamics.
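A minimal sketch of how a daily record might be arranged into annual nodes is given below. It assumes that years are water years (October to September) labeled by the calendar year in which they end, and that incomplete years are discarded; the function names and the synthetic record are illustrative only and do not reproduce the processing used in the study.

```python
from datetime import date, timedelta


def water_year(d):
    """Water year label: October to September, named after the ending year."""
    return d.year + 1 if d.month >= 10 else d.year


def annual_nodes(dates, flows):
    """Group daily flows into one 365-value series per water year."""
    nodes = {}
    for d, q in zip(dates, flows):
        if d.month == 2 and d.day == 29:
            continue  # February 29 is excluded so that every node has 365 values
        nodes.setdefault(water_year(d), []).append(q)
    # Assumption: years with missing days are dropped rather than filled.
    return {year: series for year, series in nodes.items() if len(series) == 365}


# Synthetic two-year record (1 October 1999 to 30 September 2001).
start = date(1999, 10, 1)
n_days = (date(2001, 9, 30) - start).days + 1
dates = [start + timedelta(days=i) for i in range(n_days)]
flows = [float(i) for i in range(n_days)]

year_nodes = annual_nodes(dates, flows)
print({year: len(series) for year, series in year_nodes.items()})
```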

Network methods

There exist a variety of measures to study the properties of complex networks. These include centrality, clustering, adjacency, distance, community structure, bipartivity, subgraphs, and communicability, among others. Extensive details of these measures are available in Estrada (2012), among others. These measures identify/quantify different properties of networks. For some measures, there are also different definitions, submeasures, and the corresponding methods, as appropriate. In what follows, a brief description of degree centrality (centrality), clustering coefficient (clustering), and degree distribution (adjacency) is provided, as they are employed in this study to examine streamflow connections.

Degree centrality

Centrality is one of the most basic and intuitive measures of a network, as it identifies the significance of the nodes in the network. The concept of centrality goes back to the studies of Bavelas (1948) and Leavitt (1951) for communication networks. However, Jeong et al. (2001) and Newman (2001) were among the first to use the concept in the context of complex networks. A number of centrality-based measures have been proposed in the network literature, such as degree centrality, centrality beyond nearest neighbors (e.g., Katz centrality, eigenvector centrality, subgraph centrality, PageRank centrality, and vibrational centrality), closeness centrality, betweenness centrality, and information centrality; see Estrada (2012) for details. Among these, the degree centrality has been one of the most widely used measures.

The idea behind the use of degree centrality as a network measure is that it identifies whether a given node, say i in a network, is more significant (or central or influential) than another node in the network. For instance, the node with the highest degree centrality value is considered as the most significant in the network, while the node with the lowest degree centrality value is considered as the least significant. The degree centrality of node i in a network of N nodes is defined as the number of first neighbors (or simply neighbors) of node i divided by the total number of possible neighbors (N − 1) in the network. The neighbors of node i are identified through finding the nodes that have links to node i according to an assumed threshold.

Let us consider a selected node i in a network of N nodes. The total number of possible direct neighbors for node i is N − 1, which means that the total number of possible direct links for node i is also N − 1. Let us assume that node i has only k_i neighbors in the network according to an assumed threshold. This means that node i has k_i direct links (that connect it to k_i other nodes in the network). Therefore, the degree centrality of node i is given by the ratio of the number of direct links for node i (i.e., k_i) to the total number of all possible direct links for node i (i.e., N − 1), that is, k_i/(N − 1). The procedure is repeated for each and every node of the network. An example of the calculation of the degree centrality is presented in Sivakumar and Woldemeskel (2014).
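The calculation can be sketched in a few lines of Python. For illustration, the sketch uses the links of the simple network in Fig. 1 rather than the streamflow network, and the function name `degree_centrality` is a label chosen here.

```python
def degree_centrality(neighbors):
    """Degree centrality k_i / (N - 1) for every node i."""
    n_nodes = len(neighbors)
    return {i: len(nb) / (n_nodes - 1) for i, nb in neighbors.items()}


# Links of the Fig. 1 network, used only as an example.
fig1_links = [(1, 7), (2, 4), (2, 5), (2, 7), (3, 7), (4, 7), (5, 6), (6, 7)]
fig1_neighbors = {i: set() for i in range(1, 8)}
for a, b in fig1_links:
    fig1_neighbors[a].add(b)
    fig1_neighbors[b].add(a)

centrality = degree_centrality(fig1_neighbors)
most_significant = max(centrality, key=centrality.get)
print(most_significant, centrality[most_significant])
```

In this example node 7 has five of the six possible neighbors and is therefore the most significant node.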

Clustering coefficient

One of the most basic properties of a network is its tendency to cluster. The concept of clustering has its origin in sociology, under the name fraction of transitive triples (Wasserman and Faust 1994). However, Watts and Strogatz (1998) were the first to use this concept in the context of complex networks. The tendency of a network to cluster is quantified by the clustering coefficient. There exist several definitions of clustering coefficient; see Watts and Strogatz (1998), Barrat and Weigt (2000), and Newman (2001) for details. However, the clustering coefficient method proposed by Watts and Strogatz (1998), which measures the local density, is widely used. A brief description of its calculation is presented here, as this method is used in the present study.

Let us consider first a selected node i in the network, having k_i links which connect it to k_i other nodes (i.e., neighbors) according to an assumed threshold, as mentioned earlier. If the neighbors of the original node i were part of a fully connected cluster, there would be k_i(k_i − 1)/2 links between them. Let us also assume that among the k_i(k_i − 1)/2 possible links, the number of ‘actual links’ that exist (according to the assumed threshold) is only E_i. With these, the clustering coefficient of node i is given by the ratio between the number E_i of links that actually exist between the k_i nodes and the total number of possible links k_i(k_i − 1)/2, i.e.,

$$ C_{i} = \frac{{2E_{i} }}{{k_{i} \left( {k_{i} - 1} \right)}}. $$

(1)

The procedure is repeated for each and every node of the network. The average of the clustering coefficients of all the individual nodes is the clustering coefficient of the whole network C. An example of the clustering coefficient calculation can be found in Sivakumar and Woldemeskel (2014).
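Equation (1) and the network average can be sketched as follows, again using the Fig. 1 links as a stand-in for a thresholded streamflow network. Equation (1) is undefined for nodes with fewer than two neighbors; assigning them a coefficient of zero is an assumption of this sketch, not a choice stated in the text.

```python
def clustering_coefficients(neighbors):
    """Clustering coefficient C_i = 2 E_i / (k_i (k_i - 1)) for every node."""
    coefficients = {}
    for i, nb in neighbors.items():
        k = len(nb)
        if k < 2:
            coefficients[i] = 0.0  # assumption: undefined cases are set to zero
            continue
        ordered = sorted(nb)
        # E_i: number of links that actually exist among the neighbors of i.
        actual = sum(
            1
            for pos, a in enumerate(ordered)
            for b in ordered[pos + 1:]
            if b in neighbors[a]
        )
        coefficients[i] = 2 * actual / (k * (k - 1))
    return coefficients


# Links of the Fig. 1 network, used only as an example.
fig1_links = [(1, 7), (2, 4), (2, 5), (2, 7), (3, 7), (4, 7), (5, 6), (6, 7)]
fig1_neighbors = {i: set() for i in range(1, 8)}
for a, b in fig1_links:
    fig1_neighbors[a].add(b)
    fig1_neighbors[b].add(a)

c_nodes = clustering_coefficients(fig1_neighbors)
c_network = sum(c_nodes.values()) / len(c_nodes)
print(c_nodes, c_network)
```

For node 7, only one link ({2,4}) exists among its five neighbors, so C_7 = 2 × 1/(5 × 4) = 0.1.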

The clustering coefficient of the individual nodes and of the entire network can be used to obtain important information about the type of network, grouping (or classification) of nodes, and identification of the most significant nodes. For instance, a very high clustering coefficient (close to 1.0) indicates a regular network, since in a regular network, every node is connected to every other node in the same manner. A very low clustering coefficient (close to zero), with C = p (where p is the probability of any two nodes in the network being connected), indicates a (classical) random network, since the connections between the nodes are purely random in nature. For a small-world network (e.g., Watts and Strogatz 1998), the clustering coefficient is generally smaller than that of the regular network but also considerably larger than that of a comparable random network (i.e., having the same number of nodes and links). A scale-free network (e.g., Barabási and Albert 1999) may also have such a clustering coefficient value. Therefore, it is often not easy to distinguish between small-world networks and scale-free networks based on the clustering coefficient alone (both small-world networks and scale-free networks essentially belong to the category of random networks, but their properties are different from those of classical random networks). However, other network-based measures, such as the shortest path length (e.g., Watts and Strogatz 1998) and the degree distribution (e.g., Barabási and Albert 1999), can provide reliable information to identify/distinguish between small-world networks and scale-free networks, or even some other type. It is relevant to note, at this point, that for a number of real-world networks studied in the literature, including hydrologic networks, the clustering coefficient is reported to be above 0.5 (e.g., Watts and Strogatz 1998; Jeong et al. 2000; Newman 2001; Newman et al. 2001; Tsonis and Roebber 2004; Suweis et al. 2011; Scarsoglio et al. 2013; Sivakumar and Woldemeskel 2014, 2015; Halverson and Fleming 2015), suggesting that such networks are not classical random networks, but may be small-world networks or scale-free networks or some other types.
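The reference value p for a comparable random network can be estimated from the numbers of nodes and links alone. The sketch below assumes that p is taken as the link density, i.e., the fraction of the N(N − 1)/2 possible links that are present; the helper name `link_density` is illustrative.

```python
def link_density(n_nodes, n_links):
    """Fraction of all possible links present: p = 2n / (N (N - 1))."""
    return 2 * n_links / (n_nodes * (n_nodes - 1))


# Example with the size of the Fig. 1 network (N = 7 nodes, n = 8 links).
p_random = link_density(7, 8)
print(p_random)
```

A network clustering coefficient C that is considerably larger than this p would point away from a classical random network.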

Degree distribution

In a network, different nodes may have different number of links. The number of links (k) of a node is called node degree. The degree is an important characteristic of a node, as it allows one to derive many measurements for the network. The spread in the node degrees is characterized by a distribution function p(k), which expresses the fraction of nodes in a network with degree k. This distribution is called degree distribution (e.g., Barabási and Albert 1999). The degree distribution is often a reliable indicator of the type of network.
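An empirical degree distribution can be obtained by counting how many nodes share each degree. The following sketch uses the node degrees of the Fig. 1 network as example input; the function name is chosen here for illustration.

```python
from collections import Counter


def degree_distribution(degrees):
    """Empirical p(k): fraction of nodes having each observed degree k."""
    counts = Counter(degrees)
    n_nodes = len(degrees)
    return {k: counts[k] / n_nodes for k in sorted(counts)}


# Degrees of nodes 1 to 7 in the Fig. 1 network, used only as an example.
fig1_degrees = [1, 3, 1, 2, 2, 2, 5]
p_k = degree_distribution(fig1_degrees)
print(p_k)
```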

In a random graph, since the links are placed randomly, the majority of nodes have approximately the same degree, close to the average degree $ \overline{k} $ of the network. Therefore, the degree distribution of a completely random graph is a Poisson distribution that peaks near $ \overline{k} $, and is given by

$$ p\left( k \right) = \frac{{e^{{ - \overline{k} }} \overline{k}^{k} }}{k!}. $$

(2)
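Equation (2) can be evaluated directly to obtain the reference distribution against which an observed degree distribution may be compared. In this sketch the average degree is that of the Fig. 1 network (2n/N = 16/7), chosen only as an example.

```python
import math


def poisson_degree_pmf(k, mean_degree):
    """Eq. (2): probability that a node of a random graph has degree k."""
    return math.exp(-mean_degree) * mean_degree ** k / math.factorial(k)


mean_degree = 16 / 7  # average degree of the Fig. 1 network, as an example
poisson = {k: poisson_degree_pmf(k, mean_degree) for k in range(8)}
for k, p in poisson.items():
    print(k, round(p, 4))
```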

Similarly, depending upon the properties of networks, the degree distribution can also be Gaussian, given by

$$ p\left( k \right) = \frac{1}{{\sqrt {2\pi } \sigma_{k} }}e^{{ - \left( {\frac{{\left( {k - \overline{k} } \right)^{2} }}{{2\sigma_{k}^{2} }}} \right)}} , $$

(3)

exponential, given by

$$ p\left( k \right) \sim e^{{ - k/\overline{k} }} , $$

(4)

power-law or scale-free, given by

$$ p\left( k \right) \sim k^{ - \gamma } , $$

(5)

or other, or their combinations.
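A rough way to see whether an observed p(k) is closer to Eq. (4) or Eq. (5) is to check where it plots as a straight line: ln p against k for an exponential form, and ln p against ln k for a power-law form. The sketch below fits both lines by ordinary least squares. It is a crude diagnostic under that assumption, not the procedure used in this study, and the example distribution is synthetic.

```python
import math


def slope_and_r2(x, y):
    """Least-squares slope of y on x and the coefficient of determination."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


def compare_forms(p_k):
    """Fit ln p(k) against k (exponential) and against ln k (power law)."""
    ks = [k for k, p in p_k.items() if k > 0 and p > 0]
    log_p = [math.log(p_k[k]) for k in ks]
    exp_slope, exp_r2 = slope_and_r2(ks, log_p)
    pow_slope, pow_r2 = slope_and_r2([math.log(k) for k in ks], log_p)
    return {
        "exponential": {"slope": exp_slope, "r2": exp_r2},
        "power_law": {"gamma": -pow_slope, "r2": pow_r2},
    }


# Synthetic example: values proportional to k^(-2), not data from the study.
example = {k: k ** -2.0 for k in range(1, 21)}
fits = compare_forms(example)
print(fits)
```

For an exponential form, the fitted slope corresponds to −1/$ \overline{k} $ in Eq. (4); for a power-law form, the negative of the slope is the exponent γ in Eq. (5).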

Among these distributions, the power-law or scale-free distribution (e.g., Barabási and Albert 1999) has attracted the most attention in the literature on complex networks, since such a distribution has been found in a number of natural and social networks (e.g., Barabási and Albert 1999; Kim et al. 2004; Keller 2005; Clauset et al. 2010). The fractal or scale-free nature of numerous natural systems, including hydrologic systems, and their ability to self-organize, already well documented in the literature (e.g., Mandelbrot 1983; Bak 1996; Rodriguez-Iturbe and Rinaldo 1997; Peckham and Gupta 1999; Barnsley 2012), give both credence and motivation to further advance research on scale-free networks. While it is true that some scale-free networks display an exponential tail, the functional form of p(k) still deviates significantly from the Poisson distribution expected for a random graph.

Study area and data

In the present study, streamflow data from the Mississippi River basin are considered to investigate the usefulness of complex networks for temporal streamflow dynamics. The Mississippi River originates at Lake Itasca in northern Minnesota in the United States and flows for about 3770 km (2342 mi) through the mid-continental United States, the Gulf of Mexico Coastal Plain, and its subtropical Louisiana Delta (Fig. 3). The entire river basin measures about 4.76 million km² (1.84 million mi²), of which about 3.22 million km² (1.24 million mi²) is in the continental United States; see Alexander et al. (2012) for further details.

In the Mississippi River basin, streamflow data are measured at thousands of locations. For the present study, daily streamflow data observed in a sub-basin station of the Mississippi River basin at St. Louis, Missouri (USGS station 07010000) are analyzed; see Fig. 3 for the location of St. Louis. The station is situated at latitude 38°37′03″ N and longitude 90°10′47″ W, on the downstream side of the west pier of Eads Bridge at St. Louis, 24.1 km downstream from the Missouri River and 289.6 km above the Ohio River. The drainage area of this sub-basin is 251,230 km² (97,000 mi²). The natural streamflow in this sub-basin is affected by many reservoirs and navigation dams in the upper Mississippi River basin and by many reservoirs and diversions for irrigation in the Missouri River basin (e.g., Alexander et al. 2012).

For the present analysis, daily streamflow data observed over a period of 151 water years (October 1862–September 2013) are considered. The data are obtained from the USGS National Water Information System website; see http://nwis.waterdata.usgs.gov/nwis. Figure 4 shows the variation of this daily streamflow series. It is relevant to mention here that the temporal dynamics of streamflow (and other river-related processes) observed at the St. Louis station have been investigated in many studies in recent years. Among such studies, those that have employed nonlinear dynamic and chaos concepts for system identification, prediction, and catchment classification (e.g., Sivakumar and Jayawardena 2002; Sivakumar and Wallender 2005; Sivakumar et al. 2007) may be of particular interest in the context of complex networks, as there is potential to construct networks based on nonlinear data reconstruction (phase space reconstruction). This will be addressed in a future study.

Analysis and results

Using the daily streamflow data of 151 years (October 1862–September 2013), the annual streamflow network for the Mississippi River basin at St. Louis, Missouri is constructed, following the procedure explained earlier. The annual streamflow network thus constructed has 151 nodes, corresponding to 151 years of daily data. Each node consists of 365 daily streamflow values (excluding the data for February 29 in leap years). This allows calculation of the correlation in streamflow between each of the 151 nodes (years) and every other node in the network. In this study, the Pearson correlation coefficient is used to calculate the correlation. The correlations in flow between nodes, in turn, allow identification of neighbors (i.e., links) for each and every node in the network, which is the key to the implementation of the degree centrality, clustering coefficient, and degree distribution methods. It is important to note that the correlation threshold (T) may significantly influence the identification of the neighbors (i.e., links), and hence, the outcomes of the methods. However, the optimum correlation threshold is not known a priori. To take this issue into account and examine the influence of the threshold, eight different threshold values are considered in the analysis: 0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.75, and 0.8 (see Sivakumar and Woldemeskel (2014) for some details on the selection of the correlation threshold values). The results are presented next, where different threshold values may be considered for different methods to allow better visualization of the differences in results.
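The step from correlations to links can be sketched as follows. The sketch computes the Pearson correlation between every pair of annual nodes and assigns a link when the correlation exceeds the threshold T; whether the comparison is strict is an assumption here. The three 365-value "years" are synthetic and their labels are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def threshold_network(nodes, threshold):
    """Link two years if the correlation of their daily series exceeds T."""
    years = sorted(nodes)
    neighbors = {year: set() for year in years}
    for pos, a in enumerate(years):
        for b in years[pos + 1:]:
            if pearson(nodes[a], nodes[b]) > threshold:  # assumption: strict
                neighbors[a].add(b)
                neighbors[b].add(a)
    return neighbors


# Three synthetic years of 365 values each (hypothetical labels).
days = range(365)
nodes = {
    1900: [math.sin(2 * math.pi * d / 365) for d in days],
    1901: [math.sin(2 * math.pi * d / 365)
           + 0.1 * math.cos(4 * math.pi * d / 365) for d in days],
    1902: [-math.sin(2 * math.pi * d / 365) for d in days],
}

thresholds = [0.3, 0.4, 0.5, 0.6, 0.65, 0.7, 0.75, 0.8]
networks = {t: threshold_network(nodes, t) for t in thresholds}
for t in thresholds:
    n_links = sum(len(nb) for nb in networks[t].values()) // 2
    print(f"T = {t}: {n_links} links")
```

The resulting neighbor sets are the input needed by the degree centrality, clustering coefficient, and degree distribution calculations.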