Observed (blue) and simulated (orange) streamflow for a representative basin within the Río Negro watershed. The top panel shows results with CARAVAN as the sole precipitation input: the model struggles to capture peak flows and tends to overestimate low flows. The bottom panel shows results when CHIRPS, MSWEP, and local gauge measurements are combined as inputs: simulations improve notably in both peak-flow representation and low-flow accuracy.

Enhancing LSTM streamflow modeling in data-scarce rain-dominated basins: the impact of using multiple precipitation products as model inputs

Project Lead: Hernán Querbes Duhart

Data Science Leads: Nicoleta Cristea and Scott Henderson

GitHub Repo

Uruguay relies heavily on hydropower, with three dams along the Río Negro providing nearly half of the country’s electricity. Accurate streamflow modeling is therefore critical for dam operations and energy reliability. Machine learning, particularly Long Short-Term Memory (LSTM) networks, has emerged as a leading approach for streamflow modeling, with proven ability to capture long-term hydrological dependencies.

Training LSTM models requires data from a large number of basins, because greater basin diversity helps a model generalize to a wider range of conditions. Datasets assembled for this purpose, called large-sample datasets, include meteorological forcings (time series of precipitation, temperature, and radiation, among others) as well as static basin attributes. One widely used example is CARAVAN, a global dataset built on reanalysis data, in which variables are derived from numerical models without incorporating satellite or gauge measurements. Relying on these reanalysis-based forcings has been shown to reduce model performance in the US compared to locally derived products. In contrast, blended datasets such as MSWEP and CHIRPS combine reanalysis, satellite, and gauge measurements, offering potentially more accurate precipitation estimates.
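To make the structure of a large-sample dataset concrete, the following is a minimal sketch of how one basin's forcings and static attributes might be turned into LSTM training samples. It is an illustration under stated assumptions, not the project's actual pipeline: the function name `build_sequences`, the array shapes, and the choice to repeat static attributes at every time step are all hypothetical.

```python
import numpy as np


def build_sequences(forcings, static_attrs, streamflow, seq_len):
    """Turn one basin's records into (input window, target) pairs.

    Hypothetical helper for illustration only.
    forcings:     array of shape (T, F), daily meteorological forcings
    static_attrs: array of shape (S,), basin attributes constant in time
    streamflow:   array of shape (T,), observed discharge
    seq_len:      number of past days fed to the LSTM per sample
    """
    forcings = np.asarray(forcings, dtype=float)
    static_attrs = np.asarray(static_attrs, dtype=float)
    streamflow = np.asarray(streamflow, dtype=float)
    n_steps = forcings.shape[0]

    # Repeat the static attributes at every time step so they can be
    # concatenated with the time-varying forcings (an assumed design choice).
    static_tiled = np.tile(static_attrs, (n_steps, 1))
    features = np.concatenate([forcings, static_tiled], axis=1)

    inputs, targets = [], []
    for end in range(seq_len, n_steps + 1):
        # Window covers days [end - seq_len, end); target is the last day's flow.
        inputs.append(features[end - seq_len:end])
        targets.append(streamflow[end - 1])
    return np.stack(inputs), np.asarray(targets)
```

With 10 days of data, 3 forcing variables, 2 static attributes, and `seq_len=4`, this yields 7 samples, each a window of shape `(4, 5)`. Samples built this way for many basins would then be pooled into one training set.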

This study evaluates LSTM-based streamflow modeling in Uruguay, training and testing models across eleven basins within the Río Negro watershed, where limited gauge data makes reliable simulation challenging. First, the study characterizes and compares four precipitation datasets as alternative inputs: CARAVAN, MSWEP, CHIRPS, and CAMELS (a gauge-based product). This comparison is carried out over rain-dominated basins in the southern US, selected for their hydrological similarity to those in Uruguay. The comparison is then extended to Uruguay, where no gauge-based precipitation dataset exists but local rain gauge measurements are available. Finally, the study assesses how the choice of precipitation product influences model performance and how hydrological signatures change when precipitation products are combined as model inputs. The results provide insight into the suitability of global precipitation datasets for data-driven streamflow modeling in data-scarce regions, with implications for water resources planning.
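As a rough illustration of what "combining precipitation products as model inputs" can mean in practice, the sketch below stacks several products as separate input columns and scores a simulation with the Nash-Sutcliffe efficiency (NSE). Both choices are assumptions made for this example: the summary above does not say how the products are combined or which performance metric the study uses, and the helper names `stack_products` and `nse` are hypothetical.

```python
import numpy as np


def stack_products(products):
    """Stack precipitation products as separate input columns.

    Hypothetical helper for illustration only.
    products: dict mapping a product name to a 1-D daily precipitation series;
              all series are assumed to cover the same dates.
    Returns the sorted product names and an array of shape (T, n_products).
    """
    names = sorted(products)
    lengths = {len(products[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all precipitation series must have the same length")
    columns = [np.asarray(products[name], dtype=float) for name in names]
    return names, np.column_stack(columns)


def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    residual = np.sum((observed - simulated) ** 2)
    variance = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - residual / variance
```

For example, passing `{"chirps": ..., "mswep": ..., "gauge": ...}` to `stack_products` gives a three-column input matrix, so the model sees every product at each time step rather than a single pre-merged estimate. Scoring the resulting simulations with the same metric across input configurations is one way to compare them.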

Over the course of this project, I gained hands-on experience with cloud environments, specifically Azure, which was entirely new to me. I also developed a deeper understanding of data-driven streamflow modeling and of the role that precipitation data quality plays in model performance.