The Karun is the longest river in Iran and carries the largest discharge. About
950 km long, it originates in the north of Khuzestan province, forks into two
branches in the city of Khorramshahr, and finally empties into the Persian Gulf
(Fig. 1). Ahvaz is the most populous (1,136,989 people) and most important city
along the river, and it contributes significantly to the decline in the river’s
water quality. Water quality is better in the northern (upstream) reaches and
decreases as the river passes through successive cities. To assess the quality
of the Karun River, data were taken from two monitoring stations, one upstream
and the other downstream of Ahvaz. The data used in this study included
calcium, magnesium, nitrite, and nitrate. Most parameters have been measured
monthly by the Khuzestan Water and Power Authority since 1995. Because the
records from the early years were incomplete, the newer and more complete
records from 2013 to 2015 were used. Even so, some parameters were not measured
in a few months, and the data from those months were removed. As a result, the
data sets of the first and second stations were reduced to 36 and 38 records,
respectively. Three scenarios were considered for predicting water quality: the
first uses the parameters collected at the first station, the second uses those
collected at the second station, and the third uses the data from both stations.
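The data preparation described above can be sketched as follows. This is a minimal illustration, assuming each monthly record is a dictionary that holds `None` for a parameter that was not measured; the record layout, the function names, and the toy values are hypothetical and are not taken from the study.

```python
def drop_incomplete(records):
    """Keep only the monthly records in which every parameter was measured."""
    return [r for r in records if all(v is not None for v in r.values())]


def build_scenarios(station1, station2):
    """Return the three data sets: station 1, station 2, and both combined."""
    s1 = drop_incomplete(station1)
    s2 = drop_incomplete(station2)
    return {"scenario_1": s1, "scenario_2": s2, "scenario_3": s1 + s2}


# Toy records (illustrative values only, not measurements from the study).
station1 = [{"Ca": 80.0, "Mg": 30.0}, {"Ca": None, "Mg": 28.0}]
station2 = [{"Ca": 95.0, "Mg": 35.0}]

scenarios = build_scenarios(station1, station2)
sizes = {name: len(rows) for name, rows in scenarios.items()}
```

Removing a whole month when any parameter is missing mirrors the text; imputing the missing values would be a different design choice and is not what the study describes.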
Fig. 1: Location of
the study area in Khuzestan province of Iran
The minimum, maximum, mean, standard deviation, and skewness coefficient
describe the distribution of each water quality parameter. These statistics
for the collected parameters are presented in Table 1.
2.2. Artificial neural network
A neural network is structured in three kinds of layers: 1) an input layer,
which introduces the data to the model; 2) one or more hidden layers, where the
data are processed; and 3) an output layer, which produces the results. Each
layer comprises one or more elements known as neurons. A schematic view of a
neural network is shown in Fig. 2. The number of neurons in each layer depends
on the type and difficulty of the problem. If too few neurons are selected, the
network may not have enough degrees of freedom to be trained properly. If, on
the other hand, too many neurons are selected for the hidden layer, training
can take a considerably long time. The number of neurons in the input and
output layers is fixed by the number of input and output parameters. The gamma
test can be used to determine the optimal parameters for the input layer.
Although the number of neurons in the hidden layers is determined through trial
and error (Salehnia et al., 2013), it is suggested that the number of neurons
in the hidden layer should be within the range n-m, where n and m are the
numbers of neurons in the input and output layers, respectively.
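The three-layer structure can be illustrated with a small forward pass. This is a minimal sketch, assuming a single hidden layer with sigmoid neurons and linear output neurons; the layer sizes, the random initial weights, and the function names are illustrative assumptions rather than the configuration used in the study.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create an (n_out x n_in) weight matrix and a bias vector with small random values."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden, output):
    """Propagate one input vector through the hidden layer, then the output layer."""
    w_h, b_h = hidden
    w_o, b_o = output
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w_h, b_h)]
    # Linear output neurons (an assumption; common for regression targets).
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w_o, b_o)]


rng = random.Random(0)
n_inputs, n_hidden, n_outputs = 4, 3, 1  # illustrative sizes only
hidden = init_layer(n_inputs, n_hidden, rng)
output = init_layer(n_hidden, n_outputs, rng)
prediction = forward([0.2, 0.5, 0.1, 0.9], hidden, output)
```

The input and output sizes follow directly from the number of input and output parameters, while `n_hidden` is the quantity that would be tuned by trial and error.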
Table 1: Basic statistics of the measured water quality variables in Karun River, Iran
(Table body not recovered; surviving labels: Number for Embedding, Total coliform bacteria)
a Min: minimum.
b Max: maximum.
c SD: standard deviation.
* Unit is count per 100 mL.
Fig. 2: A typical artificial neural network
Artificial neural networks (ANNs) are one of the most common methods for solving
complex, nonlinear mathematical problems, and the multilayer perceptron (MLP) is
the most widely used type of neural network for such problems. To create an MLP
neural network, an appropriate threshold (activation) function, weight, and bias
must be determined for each neuron. During training, the weight and bias of each
neuron are adjusted until suitable values are obtained. The most important
threshold functions used in developing MLP models are the Gaussian, sigmoid, and
tangent sigmoid functions.
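For reference, the three threshold functions named above can be written out as below. This is a sketch under two assumptions: the Gaussian is taken in its simplest unscaled form exp(-z^2), and the tangent sigmoid is taken to be the hyperbolic tangent. The function names are illustrative.

```python
import math


def gaussian(z):
    """Gaussian threshold function; peaks at 1 for z = 0 and decays toward 0."""
    return math.exp(-z * z)


def sigmoid(z):
    """Logistic sigmoid; maps any real input to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tansig(z):
    """Tangent sigmoid (hyperbolic tangent); maps any real input to (-1, 1)."""
    return math.tanh(z)


values_at_zero = (gaussian(0.0), sigmoid(0.0), tansig(0.0))
```

The differing output ranges are one reason the inputs and targets are usually rescaled before training.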
In this study, calcium and magnesium were selected as input parameters, and
nitrate and nitrite were selected as output parameters. Following the
literature, the water quality data were not randomized. Therefore, to predict
the water quality of the Karun River, the data were divided into two categories
according to Basant et al. (2010): training and validation data, comprising
50 items (80 percent) and 24 items (20 percent) of the total data, respectively.
For the first station, the two categories contained 29 and 8 items, and for the
second station, 31 and 8 items.
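Splitting without randomization can be sketched as a simple chronological cut. This is a minimal illustration, assuming the records are already in time order and that the earliest part is used for training; the function name and the default fraction are illustrative assumptions.

```python
def chronological_split(records, train_fraction=0.8):
    """Split time-ordered records into a leading training part and a trailing validation part."""
    n_train = round(len(records) * train_fraction)
    return records[:n_train], records[n_train:]


months = list(range(1, 11))  # ten consecutive monthly records (toy data)
train, valid = chronological_split(months)
```

Because nothing is shuffled, the validation set always consists of the most recent records.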
The performance of a neural network may suffer when the parameters differ
widely in range (the gap between their maximum and minimum values) and in type.
It is therefore necessary to convert the parameters to a dimensionless interval
in order to standardize them. The general formula for standardization within the
interval (a, b) is as follows:

$$x_s = a + \frac{(x_o - x_{min})(b - a)}{x_{max} - x_{min}}$$

where $x_s$ and $x_o$ are the standardized and original observational
parameters, respectively; a and b are the lower and upper limits of
standardization; and $x_{min}$ and $x_{max}$ are the minimum and maximum values
of parameter x, respectively. Since a and b are set to zero and one in the
present study, respectively, the formula simplifies to:

$$x_s = \frac{x_o - x_{min}}{x_{max} - x_{min}}$$
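The standardization step can be sketched as below, assuming the usual linear min-max form in which the smallest value maps to a and the largest to b. The function name is illustrative, and the guard against a constant column is an addition not discussed in the text.

```python
def min_max_scale(values, a=0.0, b=1.0):
    """Linearly rescale values so that min(values) maps to a and max(values) maps to b."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot rescale a constant column")
    return [a + (x - x_min) * (b - a) / (x_max - x_min) for x in values]


scaled = min_max_scale([2.0, 4.0, 6.0])            # default interval (0, 1)
scaled_wide = min_max_scale([2.0, 4.0, 6.0], -1.0, 1.0)
```

In practice the minimum and maximum would be taken from the training data and reused for the validation data, so that both are scaled consistently.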
Moreover, the Marquardt (Levenberg-Marquardt) algorithm was used to train the
neural network because, according to the literature, it is reported to be more
powerful and faster than other training methods. The optimal number of hidden
layers was obtained through trial and error, based on the domain proposed by
Ehteshami (2014) for the Karun River.
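The idea behind Marquardt-style training can be sketched with a generic damped least-squares loop. This is a simplified illustration, assuming a forward-difference Jacobian and a basic halve-or-multiply damping rule; it is not the implementation used in the study, and all function names are hypothetical. For a network, `residuals` would return the prediction errors as a function of the weights and biases; here a straight line is fitted to keep the example small.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def levenberg_marquardt(residuals, params, n_iter=100, damping=1e-2, eps=1e-6):
    """Minimise the sum of squared residuals with damped Gauss-Newton steps."""
    params = list(params)
    r = residuals(params)
    cost = sum(v * v for v in r)
    n = len(params)
    for _ in range(n_iter):
        # Forward-difference Jacobian: jac[i][j] approximates d r_i / d p_j.
        jac = [[0.0] * n for _ in r]
        for j in range(n):
            shifted = params[:]
            shifted[j] += eps
            r_shift = residuals(shifted)
            for i in range(len(r)):
                jac[i][j] = (r_shift[i] - r[i]) / eps
        jtj = [[sum(row[p] * row[q] for row in jac) for q in range(n)] for p in range(n)]
        jtr = [sum(row[p] * ri for row, ri in zip(jac, r)) for p in range(n)]
        # Solve (J^T J + damping * I) step = -J^T r.
        lhs = [[jtj[p][q] + (damping if p == q else 0.0) for q in range(n)] for p in range(n)]
        step = solve_linear(lhs, [-g for g in jtr])
        trial = [p + s for p, s in zip(params, step)]
        r_trial = residuals(trial)
        cost_trial = sum(v * v for v in r_trial)
        if cost_trial < cost:
            params, r, cost = trial, r_trial, cost_trial
            damping *= 0.5   # step accepted: move toward Gauss-Newton
        else:
            damping *= 10.0  # step rejected: move toward gradient descent
    return params, cost


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # toy data generated from y = 2x + 1


def line_residuals(p):
    return [p[0] * x + p[1] - y for x, y in zip(xs, ys)]


fit, final_cost = levenberg_marquardt(line_residuals, [0.0, 0.0])
```

The damping term is what distinguishes the method from plain Gauss-Newton: large damping gives short, gradient-like steps, and small damping gives fast convergence near the minimum.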
2.3. Gamma test
As described in the previous section, the gamma test (GT) is helpful for
determining the optimum neurons of the input layer. It is one of the most
important procedures for selecting useful predictors from a database. Because GT
has been used in many ANN studies (Tian et al., 2016), it was also used in the
present study.
A formal proof of GT was extended by Chang et al. (2010). Suppose a set of
observations of the following form:

$$\{(x_i, y_i),\ 1 \le i \le M\}$$

where the $x_i \in \mathbb{R}^m$ are input vectors confined to some closed
bounded set $C \subset \mathbb{R}^m$ and the $y_i \in \mathbb{R}$ are the
corresponding outputs. The system underlying GT can be expressed in the
following form:

$$y = f(x_1, \ldots, x_m) + r$$

where f is a smooth function and r is a random variable representing noise. In
general, the mean of the distribution of r is assumed to be 0 and the variance
of the noise is bounded (Kim and Kim, 2008). The gamma statistic $\Gamma$ is the
main parameter; it estimates the variance of the model output that cannot be
accounted for by a smooth model.
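For each vector $x_i$ $(1 \le i \le M)$, let $x_{N[i,k]}$ $(1 \le k \le p)$ be
its kth nearest neighbor. The Delta function of the input vectors is the mean
squared kth-nearest-neighbor distance, and the corresponding Gamma function is
defined on the output values:

$$\delta_M(k) = \frac{1}{M}\sum_{i=1}^{M}\left|x_{N[i,k]} - x_i\right|^2, \qquad \gamma_M(k) = \frac{1}{2M}\sum_{i=1}^{M}\left|y_{N[i,k]} - y_i\right|^2 \tag{8}$$

where $|\cdot|$ denotes Euclidean distance and $y_{N[i,k]}$ is the y-value
corresponding to the kth nearest neighbor of $x_i$ in Equation (8). To compute
$\Gamma$, the p points $(\delta_M(k), \gamma_M(k))$ are fitted by a univariate
least-squares linear regression:

$$\gamma = A\delta + \Gamma \tag{9}$$

The value of $\Gamma$ is the intercept of Equation (9). A is the gradient of the
line and describes the complexity of the model: a high value of A indicates more
complexity and a low value indicates less. Another term, which describes the
noise in a scale-invariant way, is $V_{ratio}$:

$$V_{ratio} = \frac{\Gamma}{\sigma^2(y)}$$

where $\sigma^2(y)$ is the variance of the output y. By this definition, a value
of $V_{ratio}$ close to 0 indicates a high degree of predictability of the given
output y.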
In addition, the estimate of the noise variance for the given output is more
reliable if the standard error (SE) is close to 0. GT estimates the mean square
error (MSE) attributable to noise, that is, the part of the output variance that
cannot be modeled even by the smoothest possible model (Goyal et al., 2013).
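The gamma test procedure can be sketched in a few lines. This is a minimal illustration of the standard formulation, assuming squared Euclidean distances, a brute-force nearest-neighbor search, and p = 10 neighbors; the function name and the toy data are hypothetical, and dedicated gamma test software would normally be used for real data.

```python
def gamma_test(xs, ys, p=10):
    """Return (Gamma, A, V_ratio) for input vectors xs and scalar outputs ys."""
    m = len(xs)
    p = min(p, m - 1)
    # For every point, list the other points sorted by squared Euclidean distance.
    neighbours = []
    for i in range(m):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xs[i], xs[j])), j)
            for j in range(m) if j != i
        )
        neighbours.append(dists[:p])
    deltas, gammas = [], []
    for k in range(p):
        # delta(k): mean squared distance to the k-th nearest neighbour.
        deltas.append(sum(neighbours[i][k][0] for i in range(m)) / m)
        # gamma(k): half the mean squared difference of the corresponding outputs.
        gammas.append(
            sum((ys[neighbours[i][k][1]] - ys[i]) ** 2 for i in range(m)) / (2 * m)
        )
    # Least-squares line gamma = A * delta + Gamma through the p points.
    d_mean = sum(deltas) / p
    g_mean = sum(gammas) / p
    sxx = sum((d - d_mean) ** 2 for d in deltas)
    sxy = sum((d - d_mean) * (g - g_mean) for d, g in zip(deltas, gammas))
    slope = sxy / sxx
    intercept = g_mean - slope * d_mean
    y_mean = sum(ys) / m
    var_y = sum((y - y_mean) ** 2 for y in ys) / m
    return intercept, slope, intercept / var_y


# Noise-free toy data from the smooth function y = 2x on an even grid.
xs = [[i / 49.0] for i in range(50)]
ys = [2.0 * x[0] for x in xs]
gamma_stat, gradient, v_ratio = gamma_test(xs, ys)
```

For this noise-free example the intercept and the V_ratio come out close to zero, which is the behavior expected for a fully predictable output; adding noise to `ys` would raise both.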
2.4. Performance evaluation of models
To assess the generated neural networks, three metrics were employed: RMSE,
MAE, and R2. RMSE represents the error of the model and is defined according to
Relation 6. MAE measures the average magnitude of the over- and
underestimation. The coefficient of determination (R2) represents the proportion
of the variance in the observed values that is explained by the model, and is
calculated as follows:
In the above equation, the symbols denote the predicted values, the observed
values, and the number of data points n, respectively.
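The three metrics can be computed as sketched below. The RMSE and MAE definitions are standard; the R2 form used here, one minus the ratio of the residual sum of squares to the total sum of squares, is an assumption, since the exact expression used in the study is not recoverable from the text. The function names and toy values are illustrative.

```python
import math


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SSE / SST (assumed form)."""
    o_mean = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - o_mean) ** 2 for o in observed)
    return 1.0 - sse / sst


observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
scores = (rmse(observed, predicted), mae(observed, predicted), r_squared(observed, predicted))
```

RMSE penalizes the single large error more heavily than MAE does, which is why the two are usually reported together.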
In this study, the gamma test was used to omit the less effective parameters,
which also reduced the number of input parameters of the neural network. Table 2
shows the results of the gamma test for BOD simulation under Scenarios 1, 2, and
3. The “embedding” row of the table identifies the combination of input
parameters for each station: 0 is assigned to a parameter that is not used as an
input of the ANN model, and 1 to a parameter that is used. The ordering of the
0s and 1s given in Tables 2-4 (first row) is the same as the ordering of the
parameters given in Table 1.
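Reading an embedding mask can be sketched as below. This is a minimal illustration, assuming the mask is written as a string of 0s and 1s in the same order as the parameter list; the four parameter names, their order, and the mask are hypothetical and are not values from Tables 1-4.

```python
def apply_mask(mask, names):
    """Return the parameter names whose mask position is '1'."""
    if len(mask) != len(names):
        raise ValueError("mask and parameter list must have the same length")
    return [name for bit, name in zip(mask, names) if bit == "1"]


def select_columns(mask, row):
    """Keep only the values of one data row whose mask position is '1'."""
    return [value for bit, value in zip(mask, row) if bit == "1"]


names = ["Turbidity", "SS", "TA", "Temperature"]  # hypothetical ordering
selected = apply_mask("1011", names)
inputs = select_columns("1011", [12.0, 40.0, 150.0, 22.5])
```

The same mask is applied to every record, so the ANN always receives its inputs in a fixed order.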
As shown in Table 2, the best inputs for developing an ANN model to estimate BOD
at station 1 are Turbidity, SS, TA, Temperature, NO2, Total coliform bacteria,
TDS, EC, pH, SO4, HCO3, and Cl. At this station, the number of neural network
inputs decreased from 19 to 13 parameters. At station 2, the number of input
parameters needed to achieve the best gamma was reduced to 11: Turbidity, TA,
PO4, NO3, NO2, NH4, TDS, EC, pH, SO4, and HCO3 were determined to be the optimal
input parameters for the ANN model to estimate BOD. Furthermore, if the data
from stations 1 and 2 are used together in the gamma test, the number of input
parameters is reduced to 10. Applying the gamma test to the data collected from
station 1, station 2, and both stations showed that TA, NO2, EC, pH, SO4, and
HCO3 must be selected as input parameters in all three scenarios.
and pH were not thus selected as inputs, at all.
Table 2: The best selective masks and their performance criteria for BOD
Table 3: The best selective masks and their performance criteria for COD
Table 4: The best selective masks and their performance criteria for DO
The results of the gamma test for COD estimation are shown in Table 3. Applying
the gamma test to the data from station 1 showed that the best parameters for
COD simulation are SS, TA, PO4, Temperature, NH4, TDS and Mg, whereas the
results for station 2 implied that Turbidity, SS, TA, PO4, Temperature, NO3,
NH4, pH, SO4, and Cl are the best input parameters. Furthermore, applying the
gamma test to the aggregated data of stations 1 and 2 demonstrated that TA, PO4,
Temperature, NO3, Total coliform bacteria, HCO3 and Ca should be selected as
inputs for COD simulation (Table 3). Table 4 shows the inputs selected by the
gamma test for DO simulation under the three scenarios.
Table 5: Results of ANN to predict BOD, COD and DO
a St. 1: station 1.
b St. 2: station 2.
Based on the gamma test analysis, SS, TA, and temperature are the common
parameters for COD simulation in all three scenarios, while phosphate is the
only common parameter for DO simulation under the three scenarios. In a study on
the Karun River, Emamgholizadeh et al. (2014) investigated the sensitivity of an
MLP model to its input parameters by omitting them one by one. Although they
used fewer parameters, they reported similar results regarding the small impact
of parameters such as Ca and Mg on predicting BOD, COD and DO. Phosphate and
turbidity were used in most scenarios to predict BOD, COD and DO, which
indicates their effectiveness in determining these parameters. These results
correspond with the findings of Emamgholizadeh et al. (2014) and Singh et al.
(2009). Phosphate plays an important role in oxidation as well as in the
energy-release process, and an increase in phosphate increases the number of
microorganisms (Singh et al., 2009). Turbidity is an important parameter in
determining self-purification and the amount of dissolved oxygen in the river
(Talib and Amat, 2012). Therefore, it plays an important role in the simulation
of the quality of the Karun River.
Fig. 3: Scatter plots of observed and predicted BOD using the data of station 1
(top panel), station 2 (middle panel) and both stations (bottom panel):
(a) training and (b) testing.
The ANN results for the simulation of BOD, COD and DO are shown in Table 5. The
RMSE, MAE and correlation coefficient statistics were used to compare the
simulation results with the observed data. The RMSE and MAE values show that the
neural network predicted these parameters well. These results correspond to
those of Emamgholizadeh et al. (2014); however, the present study improved on
their RMSE and MAE. In the present study, two phases, “training” and “testing”,
were employed. Comparing the corresponding results of each phase shows that the
networks have sufficient accuracy for simulating the desired parameters. The
values of RMSE and MAE given in Table 5 indicate that the ANNs
for simulation BOD have more appropriate performance than the other ANNs under
However, all the ANNs show acceptable performance in simulating BOD, COD and DO
in the training and testing phases.
Figures 3 to 5 show how well the predicted values of BOD, COD and DO match