Building a large dam: identifying the relationship between catchment area and slope using the confidence ellipse approach

With population projections indicating continued growth during this century, construction of large dams can be considered one of the best available options to meet future increases in water, food, and energy demands. While there are reports that thousands of large dams will be built in the near future, a key question is: what are the appropriate conditions for selecting the sites for these dams? The site of a large dam should be carefully evaluated based on many factors, such as socioeconomic development, water resources availability, topographic characteristics, and environmental impacts. This study aims to partly address the above question by identifying the relationship between two topographic characteristics (i.e., catchment area and slope) of a river reach suitable for building a large dam, based on the 30-m-resolution global drainage networks. Information about 2815 existing large dams from the Global Reservoir and Dam (GRanD) database is collected for analysis. The confidence ellipse approach is introduced to establish the quantitative relationship between these two variables, which is then used to evaluate the site selection of a large dam from the perspective of topographic characteristics. The results show that: (1) each large dam can be matched well to the nearest river reach in the global drainage networks and (2) the logarithmic values of catchment area and slope can be well described by a confidence ellipse, which is obtained based on the means, standard deviations, and Pearson correlation coefficient of the two variables. The outcomes of this study will be of great value for policymakers to have a more comprehensive understanding of large dam development in the future.


Introduction
The world's population has been increasing rapidly since the beginning of the twentieth century, and will reach 9.7 billion by 2050 according to the medium-growth projection scenario of the United Nations (United Nations 2015). Along with population growth, global water, food, and energy consumption have all kept increasing (Chen et al. 2016; Shi et al. 2019). Among these, water is the key factor in maintaining sustainable social development due to its use for various purposes, such as drinking, irrigation, and hydropower. Due to the significant variability in the spatial-temporal distribution of the available water resources, there is often a mismatch between water supply and demand (UNDP 2006; Chen et al. 2016). Therefore, water scarcity, defined in terms of access to water, occurs in most areas of the world (Molden et al. 2007). According to some estimates, about 1.6 billion people live in areas where human capacity or financial resources are likely to be insufficient to utilize adequate water resources due to a lack of infrastructure projects and restricted access (Molden et al. 2007). Thus, it is important and necessary to build infrastructure projects, such as dams, to increase water withdrawals from rivers, lakes, groundwater, and other sources (Chen et al. 2016).
It is worth noting that the function of a dam is closely related to its scale. Poff and Hart (2002) suggested that large dams (and related reservoirs and hydropower stations) can play a more important role than small ones; specifically, WWAP (2016) indicated that employment and poverty reduction had been affected directly and indirectly by large dams. Many studies (Chen et al. 2016; Shi et al. 2019) have used a dam height greater than 10 m and a reservoir capacity larger than 0.1 km³ as criteria for large dams. Normally, large dams can serve two or more purposes such as hydropower generation, irrigation, water supply, navigation, recreation, and reduction of risks from natural disasters (Molle et al. 2007; WWAP 2012; Spänhoff 2014; Shi and Wang 2015; Di Baldassarre et al. 2017; Fulazzaky et al. 2017; Shi et al. 2018; Wang et al. 2019; Zhang et al. 2019; Liu et al. 2020, 2021), and thus have greatly enhanced societies' capabilities in water resources management (WCD 2000). In addition, large dams have the power to slow climate change (Muller 2019). Consequently, despite the negative impacts on society (e.g., resettlement, dispossession, and cultural alienation) and the environment (e.g., loss of aquatic biodiversity, changes to the flow regime and morphology of rivers, and impeded movement of organisms) (Nguyen et al. 2017; Wu et al. 2019; Gierszewski et al. 2020), whether the construction of large dams should continue should no longer be a question. The fact that thousands of large dams are either planned or under construction worldwide (including some in developed countries, such as the United States) is the best evidence (Zarfl et al. 2015).
Based on some key global socioeconomic data (e.g., population, consumption of water, food and energy, gross domestic product (GDP)) and information about existing large dams all over the world (e.g., number and storage capacity), studies have revealed the close association between large dams and socioeconomic development (Chen et al. 2016;Shi et al. 2019). However, the question about the appropriate conditions for selecting the sites for building large dams has not been adequately addressed. It is well known that the site of a large dam should be carefully evaluated based on many factors, such as socioeconomic development as the major driving force, water resources availability as the essential support, topographic characteristics as the necessary consideration, and environmental impacts as the vital constraint. The present study focuses on one of these factors, i.e., topographic characteristics. It aims to partly address the above question through identifying the relationship between two topographic characteristics (i.e., catchment area and slope) of a river reach to build a large dam based on the geographic information of the existing large dams and the high-resolution global drainage networks. The information about 2815 existing large dams from the Global Reservoir and Dam (GRanD) database (Lehner et al. 2011) is collected for analysis. The confidence ellipse approach is introduced to establish the quantitative relationship between these two variables, which is then used to evaluate the site selection of a large dam from the perspective of topographic characteristics. It is believed that the outcomes of this study will be helpful for policymakers to have a more comprehensive understanding of large dam development in future.

Research data
The dam data used in this study are derived from the GRanD database (Lehner et al. 2011). For each dam, the basic information includes name, year of completion, longitude, latitude, height, storage capacity, installed capacity, catchment area, main use, and so on. The present study uses the same criteria as those in our previous studies (Chen et al. 2016; Shi et al. 2019) for large dams (i.e., a dam height greater than 10 m and a reservoir capacity larger than 0.1 km³). With these criteria, there is a total of 2815 large dams constructed between 1900 and 2010 (see "Global map of large dams and drainage networks" section for details).
To investigate the topographic characteristics of large dams, the 30-m-resolution global drainage networks (i.e., Tsinghua Hydro30) (Bai et al. 2015a, 2015b) extracted from the ASTER (Advanced Spaceborne Thermal Emission and Reflection Radiometer) Global DEM (Digital Elevation Model) dataset (ASTER GDEM Validation Team 2009, 2011) are adopted. For each river reach, related geographic information includes channel index, Strahler order, coordinates, channel geometry, hillslope geometry, and codification, among others. The coordinates originally comprise the longitudes and latitudes of the start, middle, and end points of a river reach. In this study, only the longitude and latitude of the middle point are used to represent the location of a river reach. The channel geometries comprise several variables, such as channel length, channel slope, elevation of the start and end points, and catchment area upstream of a river reach (Li et al. 2018).

Topological relationships between large dams and river reaches
Since the longitude and latitude are available, the location of each of the 2815 existing large dams can be determined. In this study, the 30-m-resolution global drainage networks (i.e., Tsinghua Hydro30) are used to establish the topological relationships between large dams and river reaches. It is assumed that an existing large dam was built on the river reach nearest to it; therefore, the topological relationship between a large dam and the corresponding river reach can be established as follows:
1. The distances between a large dam and all the river reaches of the drainage networks are calculated using
$$D_j = \sqrt{(X_{dam,i} - X_{river,j})^2 + (Y_{dam,i} - Y_{river,j})^2} \tag{1}$$
where $X_{dam,i}$ and $Y_{dam,i}$ are the longitude and latitude of the ith large dam, respectively, and $X_{river,j}$ and $Y_{river,j}$ are the longitude and latitude of the jth river reach.
2. For the ith large dam, the minimum value among all the $D_j$ values is selected using
$$MinD_i = \min_j D_j \tag{2}$$
where $MinD_i$ is the minimum distance between the ith large dam and the nearest river reach.
3. The river reach with the $MinD_i$ value is selected as the location of the ith large dam in the drainage networks. With this, the channel geometries (i.e., catchment area and slope) of this river reach can be extracted.
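The three matching steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `match_dams_to_reaches` and the sample coordinates are hypothetical, and the distance is taken as a plain Euclidean distance in longitude-latitude space, as the variable definitions above suggest.

```python
import numpy as np

def match_dams_to_reaches(dam_xy, reach_xy):
    """For each dam, return the index of the nearest reach and the distance to it."""
    dam_xy = np.asarray(dam_xy, dtype=float)      # shape (n_dams, 2): lon, lat
    reach_xy = np.asarray(reach_xy, dtype=float)  # shape (n_reaches, 2): lon, lat of reach mid-points
    # Step 1: pairwise coordinate differences, shape (n_dams, n_reaches, 2)
    diff = dam_xy[:, None, :] - reach_xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))       # D_j for every dam i
    # Step 2: minimum distance per dam (MinD_i) and the reach index attaining it
    nearest = dist.argmin(axis=1)
    min_dist = dist[np.arange(len(dam_xy)), nearest]
    # Step 3: 'nearest' can now be used to look up catchment area and slope
    return nearest, min_dist

# Hypothetical coordinates, for illustration only
dams = [(-54.59, -25.41), (-56.00, -27.50)]
reaches = [(-54.60, -25.40), (-57.00, -30.00), (-56.10, -27.40)]
nearest, min_dist = match_dams_to_reaches(dams, reaches)
print(nearest, min_dist)
```

The full pairwise distance matrix is practical only for small inputs; for a global network with a very large number of reaches, a spatial index would probably be needed, which this sketch does not attempt.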
For all selected river reaches, the extracted channel geometries should be validated before they are used for subsequent statistical analyses. However, it is worth noting that, in this study, only the catchment area values are compared against those of the existing large dams, because the slope values of the existing large dams are unavailable. To evaluate the performance of the extracted channel geometries, the Nash-Sutcliffe Coefficient of Efficiency (NSCE) is used as the assessment criterion (Nash and Sutcliffe 1970). The equation to compute the NSCE value is given by
$$NSCE = 1 - \frac{\sum_{i=1}^{N} \left(CA_{exi,i} - CA_{ext,i}\right)^2}{\sum_{i=1}^{N} \left(CA_{exi,i} - \overline{CA}_{exi}\right)^2} \tag{3}$$
where $CA_{exi,i}$ is the catchment area value of the ith existing large dam; $CA_{ext,i}$ is the extracted catchment area value of the river reach corresponding to the ith existing large dam; $\overline{CA}_{exi}$ is the mean of the catchment area values of all the existing large dams; and N is the total number of the existing large dams.
In addition, the performance of the extracted channel geometries is evaluated using the Taylor Diagram, which can summarize multiple aspects of the performance (i.e., correlation coefficient, root mean square error, and standard deviation) in a single diagram (Taylor 2001).
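As a minimal sketch of the NSCE calculation, the function below follows the standard Nash-Sutcliffe definition, with the reported catchment areas as the reference series and the extracted ones as the series being evaluated. The function name `nsce` and the sample values are hypothetical.

```python
import numpy as np

def nsce(reference, extracted):
    """Nash-Sutcliffe Coefficient of Efficiency of 'extracted' against 'reference'."""
    reference = np.asarray(reference, dtype=float)
    extracted = np.asarray(extracted, dtype=float)
    # Sum of squared differences between reported and extracted values
    residual = ((reference - extracted) ** 2).sum()
    # Sum of squared deviations of the reported values from their mean
    spread = ((reference - reference.mean()) ** 2).sum()
    return 1.0 - residual / spread

# Hypothetical catchment areas in km^2, for illustration only
ca_reported = [1200.0, 560.0, 89000.0, 15.0]
ca_extracted = [1100.0, 600.0, 91000.0, 30.0]
score = nsce(ca_reported, ca_extracted)
print(score)
```

A value of 1 indicates a perfect match, while a value of 0 means the extracted values perform no better than simply using the mean of the reported values.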

Statistical analysis of topographic characteristics at large dam sites
In general, topographic characteristics are essential factors that need to be considered when determining the sites of large dams. For example, river reaches with larger catchment areas and slopes, which can impound more water and favor hydropower generation, are more suitable for building dams. There are also several other factors (e.g., valley shape and geological structure of bedrock) that influence the location of large dam construction. However, in this study, catchment area and slope are regarded as the most important factors, because catchment area controls the yield of water for reservoir storage and slope controls the water level difference for hydropower generation (Ijam and Tarawneh 2012; Liu et al. 2018; Adanta and Warjito 2020). After data preprocessing, the channel geometries of the 2815 existing large dams are used for statistical analysis. First, several statistical characteristics (e.g., mean and standard deviation) of the catchment area and slope values are calculated. Second, based on these statistical characteristics, the relationship between catchment area and slope is analyzed, and an equation to quantify this relationship is proposed using the confidence ellipse approach. Third, the topographic characteristics (i.e., catchment area and slope) suitable for building large dams are determined from the quantitative relationship for future reference.
It is well known that a general form of the elliptic equation is
$$\frac{(x - m)^2}{a^2} + \frac{(y - n)^2}{b^2} = 1 \tag{4}$$
where x and y denote the logarithmic catchment area and slope, respectively; a and b denote the lengths of the semi-major axis and semi-minor axis of an ellipse, respectively; and the point with the coordinates (m, n) denotes the center of an ellipse. The confidence ellipse approach is used to obtain the geometry of the ellipse relating two variables based on their means, standard deviations, and Pearson correlation coefficient (Schelp 2018). The procedure is as follows:
1. Calculate the covariance of the two variables x and y.
2. Divide the covariance by the product of the standard deviations of the two variables, which yields the Pearson correlation coefficient (PCC):
$$PCC = \frac{cov_{xy}}{\sigma_x \sigma_y} \tag{5}$$
where $cov_{xy}$ is the covariance of the two variables x and y; $\sigma_x$ is the standard deviation of variable x; and $\sigma_y$ is the standard deviation of variable y.
3. Draw an ellipse that is centered at the origin with the horizontal radius (HR) and vertical radius (VR) calculated as follows:
$$HR = \sqrt{1 + PCC} \tag{6}$$
$$VR = \sqrt{1 - PCC} \tag{7}$$
4. Rotate the ellipse counter-clockwise by 45°, and scale the ellipse horizontally and vertically with the standard deviations of the two variables.
5. Shift the ellipse such that its center is situated at the point $(\mu_x, \mu_y)$, where $\mu_x$ is the mean of variable x and $\mu_y$ is the mean of variable y.
For more detailed explanations, examples, and proof about the confidence ellipse approach, please refer to Schelp (2018).
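The procedure can be sketched as follows. This is an illustration rather than the authors' implementation: the function name `confidence_ellipse`, the use of sample statistics (`ddof=1`), and the synthetic data are all assumptions. The radii follow the √(1 + PCC) and √(1 − PCC) construction described by Schelp (2018).

```python
import numpy as np

def confidence_ellipse(x, y, n_std=1.0, n_points=200):
    """Return boundary points of the n_std confidence ellipse and the PCC."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sd_x, sd_y = x.std(ddof=1), y.std(ddof=1)
    # Steps 1-2: covariance divided by the product of the standard deviations
    pcc = np.cov(x, y)[0, 1] / (sd_x * sd_y)
    # Step 3: ellipse centered at the origin with radii HR and VR
    hr, vr = np.sqrt(1.0 + pcc), np.sqrt(1.0 - pcc)
    t = np.linspace(0.0, 2.0 * np.pi, n_points)
    ex, ey = hr * np.cos(t), vr * np.sin(t)
    # Step 4: rotate counter-clockwise by 45 degrees, then scale by n_std * sigma
    c = s = np.sqrt(0.5)
    rx, ry = c * ex - s * ey, s * ex + c * ey
    # Step 5: shift the center to (mean of x, mean of y)
    bx = x.mean() + n_std * sd_x * rx
    by = y.mean() + n_std * sd_y * ry
    return bx, by, pcc

# Synthetic, negatively correlated data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
x = rng.normal(3.28, 0.89, 500)
y = -0.80 - 0.3 * (x - 3.28) + rng.normal(0.0, 0.5, 500)
bx, by, pcc = confidence_ellipse(x, y, n_std=2.0)
```

Every boundary point produced this way lies at a Mahalanobis distance of `n_std` from the mean, which is a convenient check that the rotation, scaling, and shift were applied in the right order.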

Global map of large dams and drainage networks
As mentioned earlier, along with other infrastructure, dams have enhanced the capability of our society in water resources management. By the year 2010, a total of 2815 large dams had been constructed all over the world (Chen et al. 2016). The spatial distribution of these large dams (as well as related reservoirs and hydropower stations) is shown in Fig. 1. Of these large dams, classified by their catchment area values, only 41 dams (1.5%) had catchment area values over 500,000 km², followed by 41 dams (1.5%) with catchment area values in the range 200,000–500,000 km², 56 dams (2.0%) with catchment area values in the range 100,000–200,000 km², and 543 dams (19.3%) with catchment area values in the range 10,000–100,000 km². The remaining large dams (75.7%) all had catchment area values below 10,000 km². In terms of their locations, most large dams were constructed in the basins of the great rivers, such as the Mississippi River in North America, the La Plata River in South America, the Yangtze River, the Yellow River, and the Ganges River in Asia, the Rhine River and the Danube River in Europe, the Niger River and the Nile River in Africa, and the Murray-Darling River in Oceania. In addition, there were 916, 792, and 586 large dams in North America, Asia, and Europe, accounting for 32.5%, 28.1%, and 20.8% of the world total (i.e., 2815), respectively. Figure 2 shows an example of the topological relationships between large dams and river reaches in the La Plata River basin. The black triangles denote the large dams, while the red lines denote the selected river reaches corresponding to the large dams. Figure 2 also lists, for one example large dam, its attributes (e.g., dam name, longitude, latitude, year of completion, and controlled drainage area) and the topological information (e.g., RegionIndex, BSValue, and BSLength) of the corresponding river reach.
At the global scale, catchment area values of all the selected river reaches are compared against the controlled drainage area values of the corresponding large dams to validate whether the selected river reaches and the existing large dams match well. Figure 3a shows the comparison between the two. The points are distributed roughly evenly on both sides of the 1:1 line, and the NSCE value is found to be 0.61. In addition, the Taylor Diagram (Taylor 2001) is adopted in this study to further evaluate the performance of the selected river reaches (Fig. 3b). According to the multiple aspects of the performance in the Taylor Diagram (i.e., correlation coefficient, centered root mean square error, and normalized standard deviation), catchment area values of all the selected river reaches (represented by the blue point in Fig. 3b) are overall close to the controlled drainage area values of the existing large dams (represented by the black point in Fig. 3b). The correlation coefficient is 0.84 (higher than 0.8), which is considered satisfactory; moreover, the centered root mean square error is 0.61, and the normalized standard deviation is 1.13, indicating that the standard deviation for the selected river reaches is only 13% larger than that for the existing large dams. As a result, the selected river reaches are regarded as applicable to the subsequent statistical analyses.
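The three Taylor Diagram statistics are not independent: they are tied together by the law-of-cosines identity from Taylor (2001), E'² = σ_f² + σ_r² − 2σ_f σ_r R. As a quick consistency check on the reported values, the sketch below assumes that the centered root mean square error is normalized by the reference standard deviation (so that σ_r = 1); the variable names are hypothetical.

```python
import math

r = 0.84            # correlation coefficient reported in the text
sigma_ratio = 1.13  # normalized standard deviation reported in the text

# Taylor (2001) identity with the reference standard deviation set to 1
crmse = math.sqrt(sigma_ratio ** 2 + 1.0 - 2.0 * sigma_ratio * r)
print(round(crmse, 3))
```

The result is roughly 0.62, which agrees with the reported value of 0.61 to within the rounding of the two inputs.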

Statistical characteristics of the extracted channel geometries
In this study, we simply sort all the data by dam name (rather than by region or some other specific property of the dams, for example) to avoid possible biases in selecting the data sets for training and testing. With this, the first 80% of the data are used as the training set to determine the parameters in Eq. (4), and the last 20% are used as the testing set for validation. Based on the channel geometries (i.e., catchment area and slope) of all the selected river reaches in the training set, the mean and standard deviation values are calculated for catchment area and slope, respectively, which are the basis of the confidence ellipse approach. As catchment areas vary greatly, from 1 to millions of km², the logarithmic catchment areas are calculated for easier analysis. For consistency, the logarithmic slopes are also calculated. The mean values of the logarithmic catchment areas and slopes are found to be 3.28 and −0.80, respectively, while the standard deviations of the logarithmic catchment areas and slopes are 0.89 and 0.61, respectively.
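The split and the summary statistics can be sketched as below. Several details are assumptions rather than statements from the text: the logarithm is taken as base 10 (the text says only "logarithmic"), the standard deviation is the sample estimate (`ddof=1`), and the function name `split_and_summarize` and the five sample records are hypothetical.

```python
import numpy as np

def split_and_summarize(names, catchment_km2, slope, train_fraction=0.8):
    """Sort records by dam name, split into train/test, and summarize the training set."""
    order = np.argsort(np.asarray(names))  # alphabetical order of dam names
    # Assumed base-10 logarithms of catchment area and slope (both must be positive)
    log_ca = np.log10(np.asarray(catchment_km2, dtype=float))[order]
    log_s = np.log10(np.asarray(slope, dtype=float))[order]
    n_train = int(round(train_fraction * len(order)))
    train = (log_ca[:n_train], log_s[:n_train])  # first 80% of the sorted records
    test = (log_ca[n_train:], log_s[n_train:])   # last 20% of the sorted records
    stats = {
        "mean_x": train[0].mean(), "std_x": train[0].std(ddof=1),
        "mean_y": train[1].mean(), "std_y": train[1].std(ddof=1),
    }
    return stats, train, test

# Hypothetical records, for illustration only
names = ["Delta", "Alpha", "Echo", "Bravo", "Charlie"]
catchment = [100.0, 1000.0, 10.0, 10000.0, 1.0]
slope = [0.1, 0.01, 1.0, 0.001, 0.5]
stats, train_set, test_set = split_and_summarize(names, catchment, slope)
print(stats)
```

The returned means give the ellipse center (m, n), and the standard deviations feed the scaling step of the confidence ellipse construction.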
Using the logarithmic catchment areas and slopes of the data in the training set, the scatter diagram is obtained first (represented by the red points in Fig. 4). The distribution of these points is clearly shaped like an ellipse. Therefore, the confidence ellipse approach is adopted to establish the quantitative relationship between catchment area and slope. Since the mean values of the logarithmic catchment areas and slopes are 3.28 and −0.80, respectively, i.e., m = 3.28 and n = −0.80, the center of the confidence ellipse is (3.28, −0.80), i.e., the black point in Fig. 4. In this study, three confidence ellipses with different confidence levels (i.e., 68.3%, 95.5%, and 99.7%) are drawn with standard deviations of different multiples (i.e., 1σ, 2σ, and 3σ), respectively. These results are also shown in Fig. 4. The a and b values of these three confidence ellipses are listed in Table 1. Moreover, using the logarithmic catchment areas and slopes of the data in the testing set, the corresponding scatter diagram is also obtained (represented by the green points in Fig. 4). It is observed that the green points are evenly distributed within the confidence ellipses, indicating the effectiveness and superiority of this method.

Implications and limitations
Based on Eq. (4) and the parameters provided in this study, the topographic characteristics that are suitable for building large dams can be identified, which would be of significant practical value. Specifically, for a future planned large dam site with catchment area $CA_0$ and slope $S_0$, the logarithmic catchment area and slope are $x_0 = \log(CA_0)$ and $y_0 = \log(S_0)$, respectively. Then, whether a given site is suitable for building large dams can be judged by checking whether the point $(x_0, y_0)$ falls within the confidence ellipse given by Eq. (4).
Some points fall outside the largest confidence ellipse, and their number is clearly larger than expected under the nominal 99.7% level (i.e., 2815 − 2806 = 9). The reason for this is as follows: as stated by Schelp (2018), the 68.3-95.5-99.7 rule is only valid for a one-dimensional dataset and does not take covariance into account; thus, the ellipse, which describes a two-dimensional dataset, will "produce" more outliers than expected based on the 68.3-95.5-99.7 rule. Moreover, it is possible that other factors, beyond catchment area and slope, influenced the construction of the large dams outside the ellipses. In this study, the basic information of each dam includes name, year of completion, longitude, latitude, height, storage capacity, installed capacity, catchment area, main use, and so on, while Tsinghua Hydro30 can provide other geographic variables, such as length and elevation of a river reach. Since the impacts of length and elevation on dam construction are represented to a certain extent by the selected variable (slope), no other factors can be further analyzed at this stage. We will try to provide a more rigorous interpretation of this issue in future studies once the relevant data are available. Nevertheless, the confidence ellipse approach is still valuable in this study for establishing the quantitative relationship between the topographic characteristics of large dams.
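The site check and the outlier count discussed above can be illustrated with a short sketch. The means and standard deviations are the values reported in this study; everything else is an assumption made for illustration: the Pearson correlation coefficient is a placeholder (its value is not given in the text), the logarithm is taken as base 10, the data are treated as bivariate normal, and the function names are hypothetical. The inside test uses the fact that the n-sigma confidence ellipse is the set of points at Mahalanobis distance n from the mean.

```python
import math

MU_X, MU_Y = 3.28, -0.80  # means of log catchment area and log slope (from the text)
SD_X, SD_Y = 0.89, 0.61   # standard deviations (from the text)
PCC = -0.5                # placeholder only: the actual value is not reported in the text

def inside_confidence_ellipse(ca_km2, slope, n_std=3.0, pcc=PCC):
    """True if (log10 CA, log10 S) lies within the n_std confidence ellipse."""
    u = (math.log10(ca_km2) - MU_X) / SD_X
    v = (math.log10(slope) - MU_Y) / SD_Y
    # Squared Mahalanobis distance for standardized, correlated variables
    d2 = (u * u - 2.0 * pcc * u * v + v * v) / (1.0 - pcc * pcc)
    return d2 <= n_std ** 2

def expected_outside(n_points, n_std):
    """Expected number of bivariate-normal points outside the n_std ellipse."""
    return n_points * math.exp(-0.5 * n_std ** 2)

print(inside_confidence_ellipse(2000.0, 0.15))  # a site near the center of the ellipse
print(expected_outside(2815, 3.0))
```

Under the bivariate-normal assumption, the 3σ ellipse encloses 1 − exp(−4.5) ≈ 98.9% of the points rather than 99.7%, so about 31 of 2815 points would be expected to fall outside it, compared with the 9 implied by the one-dimensional rule. This is consistent with the remark attributed to Schelp (2018).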
In addition, it is observed in Fig. 5 that most large dams were built in developed countries (48.2%) and developing countries (49.4%). In some developed countries, such as the United States, the exploitation rate of suitable sites for large dams is very high (over 80%), and thus dam development will slow down in these countries (Bartle 2021). In contrast, fewer large dams were built in the least developed countries, because such countries had little financial capital and hence lacked investment in large dam construction. It is anticipated that the upsurge of future large dams will occur in the least developed countries, probably with support from developed and developing countries. However, besides the topographic characteristics of large dams that are investigated in this study, site selection of every large dam should undergo a rigorous and transparent cost-benefit analysis, as suggested by the OECD (2016). Although the present study is useful and significant, it also has an important limitation. In this study, only two topographic characteristics (i.e., catchment area and slope) of a river reach suitable for building a large dam were selected for analysis. These two topographic characteristics are crucial, because catchment area controls the yield of water for reservoir storage and slope controls the water level difference for hydropower generation (Ijam and Tarawneh 2012; Liu et al. 2018; Adanta and Warjito 2020). At the same time, however, they are certainly not sufficient, since there are many other important topographic characteristics as well. Nevertheless, the identified quantitative relationship between these two topographic characteristics could still be considered representative because of their importance, which would be valuable for future large dam construction.
Fig. 4 Quantitative relationship between catchment area and slope obtained from the confidence ellipse approach with different standard deviations (i.e., 1σ, 2σ, and 3σ)

Conclusions
Using the confidence ellipse approach, this study identified the relationship between two representative topographic characteristics (i.e., catchment area and slope) of large dams based on the geographic information of existing large dams and the 30-m-resolution global drainage networks. The results show that: (1) over three quarters of the 2815 existing large dams had catchment area values below 10,000 km²; (2) each large dam could correspond to a river reach in the global drainage networks, and the accuracy was overall acceptable; and (3) the logarithmic values of catchment area and slope can be well described by a confidence ellipse, and the obtained quantitative relationship between catchment area and slope can be used to evaluate the site selection of a large dam. Overall, the outcomes of this study would be of great value for providing a reference for future development of large dams.