Study area

The Karun is the longest and most affluent river in Iran. This 950-km-long river originates in the north of Khuzestan province, forks into two branches in the city of Khorramshahr, and finally empties into the Persian Gulf (Fig. 1). Ahvaz is the most populous (1,136,989 people) and most important city along the river, and it contributes significantly to the decline in the river's quality. Water quality is better in the northern (upstream) reaches and decreases as the river passes through successive cities. To assess the quality of the Karun River, data from two monitoring stations were considered, one located upstream and the other downstream of Ahvaz. The data used in this study included calcium, magnesium, nitrite, and nitrate. Most parameters have been measured monthly by the Khuzestan Water and Power Authority since 1995. Because the records from the early years were incomplete, the more recent and more complete data recorded from 2013 to 2015 were used. Nevertheless, some parameters were not measured in a few months, and the data from those months were removed. As a result, the data sets of the first and second stations were reduced to 36 and 38 records, respectively. Three scenarios were considered for predicting water quality: the first uses the parameters collected at the first station, the second uses those collected at the second station, and the third uses the data from both stations.

Fig. 1: Location of the study area in Khuzestan province of Iran

The minimum, maximum, mean, standard deviation, and skewness coefficient describe the distribution of each water quality parameter. These statistics are presented for the collected parameters in Table 1.

2.2. Artificial neural network

A neural network comprises three kinds of layers: (1) an input layer, which introduces the data to the model; (2) one or more hidden layers, where the data are processed; and (3) an output layer, which produces the results. Each layer comprises one or more elements known as neurons. A schematic view of a neural network is shown in Fig. 2. The number of neurons in the input, hidden, and output layers depends on the type of problem and is determined by its difficulty. If too few neurons are selected, the network may not have enough degrees of freedom to be trained properly. On the other hand, if too many neurons are selected for the hidden layer, the learning process can take a considerably long time. The number of neurons in the input and output layers is fixed by the number of input and output parameters, and the gamma test can be used to determine the optimal parameters for the input layer. Although the number of neurons in the hidden layers is determined through trial and error (Salehnia et al., 2013), it is suggested that it should lie within the range n-m, where n and m are the numbers of neurons in the input and output layers, respectively.

Table 1: Basic statistics of the measured water quality variables in Karun River, Iran

Embedding no.  Variable                 Unit      Min       Max        Mean      SD         Skewness  Kurtosis
1              Turbidity                NTU       2.00      344.00     40.00     45.76      4.077     26.743
2              SS                       mg/L      20.00     420.00     64.00     53.68      4.443     26.496
3              TA                       mg/L      133.00    206.00     168.49    14.50      0.226     0.492
4              PO4                      mg/L      0.01      0.12       0.02      0.02       4.024     8.503
5              Temperature              °C        12.00     35.70      22.41     5.70       0.010     -1.066
6              NO3                      mg/L      1.92      13.53      6.65      1.65       0.074     3.777
7              NO2                      mg/L      0.010     0.05       0.01      0.006      0.398     9.539
8              NH4                      mg/L      0.28      0.85       0.51      0.14       -0.496    -0.641
9              Total coliform bacteria  U/100mL*  2100.00   110000.00  61887.83  46540.76   -0.282    -1.833
10             TDS                      mg/L      856.00    2585.00    1711.51   366.88     2.112     -0.323
11             EC                       S/m?      1300.00   4040.00    2692.04   573.79     0.351     -0.294
12             pH                       –         6.90      8.00       7.60      0.22       -1.631    0.280
13             SO4                      mg/L      2.83      15.51      9.24      2.26       -0.110    0.625
14             HCO3                     mg/L      1.76      4.34       3.24      0.39       -0.240    3.233
15             Cl                       mg/L      7.55      27.00      15.47     4.46       -0.485    -0.616
16             Ca                       mg/L      3.37      15.60      7.71      1.63       0.112     6.986
17             Mg                       mg/L      1.51      7.80       4.64      1.30       -0.472    0.000
18             Na                       mg/L      7.04      26.48      15.81     4.43       0.431     -0.585
19             TH                       mg/L      268.50    905.00     618.05    101.45022  1.669     1.804
–              BOD                      mg/L      1.08      6.22       3.3115    0.96       0.675     0.558
–              COD                      mg/L      8.40      28.40      15.81     5.09       0.078     -0.314
–              DO                       mg/L      3.80      10.00      7.42      1.22       -0.248    0.558

N = 74. Min: minimum; Max: maximum; SD: standard deviation.

* Unit is count per 100 mL.

Fig. 2: A typical artificial neural network

Artificial neural networks (ANNs) are among the most common methods for solving complex, nonlinear mathematical problems, and the multilayer perceptron (MLP) is the most widely used type of neural network for such problems. To create an MLP neural network, an appropriate threshold function, weight, and bias should be determined for each neuron. During training, the weight and bias of each neuron are adjusted until favorable values are obtained. The most important threshold functions used in the development of MLP models are the Gaussian, sigmoid, and tangent sigmoid functions:

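$$f(x) = e^{-x^{2}} \tag{1}$$

$$f(x) = \frac{1}{1 + e^{-x}} \tag{2}$$

$$f(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{3}$$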
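As an illustration of how a single MLP neuron combines its weights, bias, and threshold function, the following minimal sketch uses the standard textbook forms of the three functions named above. The function names (`gaussian`, `sigmoid`, `tansig`, `neuron_output`) are hypothetical and chosen only for this example; the exact forms used in the study are assumed, not confirmed.

```python
import math


def gaussian(x: float) -> float:
    """Gaussian threshold function (assumed standard form exp(-x^2))."""
    return math.exp(-x * x)


def sigmoid(x: float) -> float:
    """Logistic sigmoid; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tansig(x: float) -> float:
    """Tangent sigmoid; output lies in (-1, 1) and equals tanh(x)."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical neuron with two standardized inputs.
out = neuron_output([0.2, 0.8], weights=[0.5, -0.25], bias=0.1)
```

Training adjusts `weights` and `bias` for every neuron; the activation is fixed when the network is designed.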

In this study, calcium and magnesium were selected as the inputs, and nitrate and nitrite as the outputs. Following the literature, the water quality data were not randomized. Therefore, to predict the water quality of the Karun River, the data were divided into two categories, training and validation, according to Basant et al. (2010). These categories comprised 50 items (80 percent) and 24 items (20 percent) of the total data, respectively. For the first station, the two categories included 29 and 8 items, and for the second station, 31 and 8 items.

The performance of the neural network may suffer because the parameters differ in their ranges (minimum to maximum) and in their types. It is therefore necessary to convert the parameters to a dimensionless interval in order to standardize them. The general formula for standardization within the interval (a, b) is:

$$x_s = a + (b - a)\,\frac{x_o - x_{min}}{x_{max} - x_{min}} \tag{4}$$

where $x_o$ and $x_s$ are the original and standardized values of the observed parameter, respectively; a and b are the lower and upper limits of standardization; and $x_{min}$ and $x_{max}$ are the minimum and maximum values of parameter x, respectively. Since a and b are set to zero and one in the present study, respectively, the formula simplifies to:

$$x_s = \frac{x_o - x_{min}}{x_{max} - x_{min}} \tag{5}$$
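The standardization described above can be sketched in a few lines. This is a minimal example, assuming the minimum and maximum are taken from the series being scaled; the function name `standardize` is hypothetical, and the sample values are the minimum, mean, and maximum turbidity from Table 1.

```python
def standardize(values, a=0.0, b=1.0):
    """Min-max standardization of a series onto the interval (a, b).

    With the defaults a=0 and b=1 this reduces to
    (x - x_min) / (x_max - x_min).
    """
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot standardize a constant series")
    return [a + (b - a) * (x - x_min) / (x_max - x_min) for x in values]


# Turbidity (NTU): minimum, mean, and maximum from Table 1.
scaled = standardize([2.0, 40.0, 344.0])
```

Here the minimum maps to 0, the maximum to 1, and the mean to 38/342, which shows how strongly skewed the turbidity series is.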

Moreover, the Marquardt algorithm was used to train the neural network since, according to the literature, it is faster and more powerful than the other existing methods. The optimal number of hidden layers was obtained through trial and error within the range proposed by Ehteshami (2014) for the Karun River.

2.3. Gamma test

As described in the previous section, the gamma test (GT) is helpful for determining the optimum neurons of the input layer. It is one of the most important procedures for selecting useful predictors from a database. Since the GT has been used in many ANN studies (Tian et al., 2016), it was also used in the present study.

A formal proof of the GT was extended by Chang et al. (2010). Suppose a set of observations of the following form:

$$\{(x_i, y_i),\ 1 \le i \le M\} \tag{6}$$

where $x_i \in R^m$ are input vectors confined to some closed bounded set $C \subset R^m$ and $y_i \in R$ are the corresponding outputs. The system underlying the GT can be expressed in the following form:

$$y = f(x_1, \ldots, x_m) + r \tag{7}$$

where f is a smooth function and r is a random variable representing noise. In general, the mean of the distribution of r is assumed to be 0 and the variance of the noise is bounded (Kim and Kim, 2008). The gamma statistic $\Gamma$ is the main parameter; it estimates the variance of the noise on the model output.

For each vector $x_i$, let $x_{N[i,k]}$ denote its kth nearest neighbor $(1 \le k \le p)$. The delta function is:

$$\delta_M(k) = \frac{1}{M}\sum_{i=1}^{M} \left| x_{N[i,k]} - x_i \right|^2, \quad 1 \le k \le p \tag{8}$$

where $|\cdot|$ denotes the Euclidean distance, and the corresponding gamma function of the output values is:

$$\gamma_M(k) = \frac{1}{2M}\sum_{i=1}^{M} \left| y_{N[i,k]} - y_i \right|^2, \quad 1 \le k \le p \tag{9}$$

where $y_{N[i,k]}$ is the y-value corresponding to the kth nearest neighbor of $x_i$ in Eq. (8). To compute $\Gamma$, a univariate least-squares regression line is fitted to the p points $(\delta_M(k), \gamma_M(k))$:

$$\gamma = A\delta + \Gamma \tag{10}$$

The value of $\Gamma$ is the intercept of Eq. (10), and A is the gradient of the line, which describes the complexity of the model: a high value of A indicates more complexity and a low value less. Another term, which describes the noise in a scale-invariant way, is $V_{ratio}$:

$$V_{ratio} = \frac{\Gamma}{\sigma^2(y)} \tag{11}$$

where $\sigma^2(y)$ is the variance of the output y. According to this definition, a $V_{ratio}$ close to 0 indicates a high degree of predictability of the given output y. In addition, the estimate of the noise variance on the given output is more credible if the standard error (SE) is close to 0. The GT thus estimates the mean square error (MSE), i.e., the noise variance, that cannot be modeled by the smoothest possible model (Goyal et al., 2013).
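The procedure can be sketched as follows. This is a minimal, brute-force illustration of the usual gamma test formulation (nearest-neighbor distances in input space regressed against half the mean squared output differences), not the software used in the study. The function name `gamma_test` and the default of p = 10 neighbors are assumptions; the paper does not state the value of p.

```python
def gamma_test(X, y, p=10):
    """Return (Gamma, A, v_ratio) for inputs X (list of vectors) and outputs y.

    Gamma is the intercept and A the gradient of the least-squares line
    through the points (delta(k), gamma(k)), k = 1..p.
    """
    M = len(X)
    if M != len(y) or M <= p:
        raise ValueError("need len(X) == len(y) > p")

    # For every point, the p nearest neighbors as (squared distance, index).
    neighbors = []
    for i in range(M):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
            for j in range(M) if j != i
        )
        neighbors.append(dists[:p])

    # delta(k): mean squared distance to the kth nearest neighbor.
    # gamma(k): half the mean squared difference of the matching outputs.
    deltas, gammas = [], []
    for k in range(p):
        deltas.append(sum(neighbors[i][k][0] for i in range(M)) / M)
        gammas.append(
            sum((y[neighbors[i][k][1]] - y[i]) ** 2 for i in range(M)) / (2 * M)
        )

    # Univariate least-squares fit: gamma = A * delta + Gamma.
    mean_d = sum(deltas) / p
    mean_g = sum(gammas) / p
    sxx = sum((d - mean_d) ** 2 for d in deltas)
    sxy = sum((d - mean_d) * (g - mean_g) for d, g in zip(deltas, gammas))
    A = sxy / sxx
    Gamma = mean_g - A * mean_d

    # V_ratio: Gamma scaled by the variance of the output.
    mean_y = sum(y) / M
    var_y = sum((v - mean_y) ** 2 for v in y) / M
    return Gamma, A, Gamma / var_y


# Noise-free linear example: y = 2x on a regular grid, so Gamma should be ~0.
X = [[i / 50] for i in range(51)]
y = [2 * row[0] for row in X]
Gamma, A, v_ratio = gamma_test(X, y)
```

To compare candidate input masks, the same function would be run on each subset of input columns and the mask with the smallest gamma statistic retained.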

2.4. Performance evaluation of models

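To assess the generated neural networks, three metrics were employed: RMSE, MAE, and R². RMSE represents the error of the model (Eq. 12), MAE measures the average magnitude of the errors regardless of their sign (Eq. 13), and the coefficient of determination (R²) represents the proportion of the variation in the observed values that is explained by the model (Eq. 14):

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (P_i - O_i)^2} \tag{12}$$

$$MAE = \frac{1}{n}\sum_{i=1}^{n} \left| P_i - O_i \right| \tag{13}$$

$$R^2 = \frac{\left[\sum_{i=1}^{n} (O_i - \bar{O})(P_i - \bar{P})\right]^2}{\sum_{i=1}^{n} (O_i - \bar{O})^2 \sum_{i=1}^{n} (P_i - \bar{P})^2} \tag{14}$$

In the above equations, $P_i$, $O_i$, and n are the predicted values, the observed values, and the number of data, respectively, and $\bar{P}$ and $\bar{O}$ are the means of the predicted and observed values.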
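The three metrics can be computed as in the sketch below. The function names are hypothetical, the sample values are invented for illustration, and R² is taken here as the squared Pearson correlation between observed and predicted values, which is an assumption about the form used in the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error: average error magnitude, ignoring sign."""
    n = len(observed)
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    sxy = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((p - mean_p) ** 2 for p in predicted)
    return sxy ** 2 / (sxx * syy)


# Illustrative (made-up) observed and predicted values.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.5, 2.5, 4.5]
scores = (rmse(obs, pred), mae(obs, pred), r_squared(obs, pred))
```

RMSE is never smaller than MAE for the same data, so comparing the two is a quick consistency check on reported results.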

3. Results and discussion

In this study, the gamma test was used to omit the less effective parameters, which also reduced the number of input parameters of the neural network. Table 2 shows the results of the gamma test for BOD simulation under Scenarios 1, 2, and 3. The row “Embedding” in the table identifies the combination of input parameters selected for each station: 0 is assigned to a parameter that is not used as an input to the ANN model, and 1 to a parameter that is. The ordering of the 0s and 1s in Tables 2-4 (first row) is the same as the ordering of the parameters in Table 1.
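Reading an embedding mask is a matter of pairing each digit with the parameter at the same position in Table 1. The sketch below, with the hypothetical helper `decode_mask`, decodes the station 2 mask reported for BOD in Table 2.

```python
# Parameter order follows the embedding numbers 1-19 in Table 1.
PARAMETERS = [
    "Turbidity", "SS", "TA", "PO4", "Temperature", "NO3", "NO2", "NH4",
    "Total coliform bacteria", "TDS", "EC", "pH", "SO4", "HCO3", "Cl",
    "Ca", "Mg", "Na", "TH",
]


def decode_mask(mask, names=PARAMETERS):
    """Return the parameter names whose position in the mask is '1'."""
    if len(mask) != len(names):
        raise ValueError("mask length must match the number of parameters")
    return [name for bit, name in zip(mask, names) if bit == "1"]


# Station 2 mask for BOD, as given in Table 2.
selected = decode_mask("1011011101111100000")
```

For this mask, `selected` contains 11 names, from turbidity through HCO3, which is the set of station 2 inputs for BOD.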

As shown in Table 2, the best inputs for developing an ANN model to estimate BOD at station 1 are turbidity, SS, TA, temperature, NO2, total coliform bacteria, TDS, EC, pH, SO4, HCO3, Cl, and Ca. At this station, the number of neural network inputs decreased from 19 to 13. At station 2, the number of input parameters needed to achieve the best gamma was reduced to 11: turbidity, TA, PO4, NO3, NO2, NH4, TDS, EC, pH, SO4, and HCO3 were the optimal input parameters for the ANN model to estimate BOD. Furthermore, if the data from stations 1 and 2 are used together in the gamma test, the number of input parameters is reduced to 10. Applying the gamma test to the data from station 1, station 2, and both stations showed that turbidity, TA, NO2, EC, pH, SO4, and HCO3 are selected as input parameters in all three scenarios, whereas magnesium, sodium, and TH were not selected as inputs in any scenario.

Table 2: The best selective masks and their performance criteria for BOD

Parameters       Station 1            Station 2            Both stations
Embedding        1110101011111111000  1011011101111100000  1011001110111100000
Gamma statistic  0.0242               0.0001               0.0550
Gradient         0.0700               0.0843               0.0963
Standard error   0.0692               0.0239               0.0359
V ratio          0.0970               0.0005               0.2200

Table 3: The best selective masks and their performance criteria for COD

Parameters       Station 1            Station 2            Both stations
Embedding        0111010010100000100  1111110100011010000  0011110010000101000
Gamma statistic  0.0001               0.0001               0.0379
Gradient         0.1377               0.0902               0.1394
Standard error   0.0264               0.0349               0.0350
V ratio          0.0001               0.0001               0.1518

Table 4: The best selective masks and their performance criteria for DO

Parameters       Station 1            Station 2            Both stations
Embedding        1111100000001001000  1001000111101010000  1101000111101000000
Gamma statistic  0.0001               0.0145               0.0440
Gradient         0.1642               0.1985               0.1782
Standard error   0.0516               0.0609               0.0302
V ratio          0.0001               0.0582               0.1761

The results of the gamma test for COD estimation are shown in Table 3. Applying the gamma test to the data from station 1 showed that the best parameters for COD simulation are SS, TA, PO4, temperature, NH4, TDS, and Mg, whereas for station 2 the best input parameters are turbidity, SS, TA, PO4, temperature, NO3, NH4, pH, SO4, and Cl. Furthermore, aggregating the data from stations 1 and 2 showed that TA, PO4, temperature, NO3, total coliform bacteria, HCO3, and Ca should be selected as inputs for COD simulation (Table 3). Table 4 shows the inputs selected by the gamma test for DO simulation under the three scenarios.

Table 5: Results of ANN to predict BOD, COD and DO

Variable  Station        Stage     RMSE (mg/L)  MAE (mg/L)  Correlation coefficient
BOD       St. 1          Training  0.0090       0.0411      0.89
BOD       St. 1          Testing   0.0112       0.0395      0.91
BOD       St. 2          Training  0.0084       0.0452      0.88
BOD       St. 2          Testing   0.0142       0.0421      0.84
BOD       Both stations  Training  0.0093       0.0357      0.81
BOD       Both stations  Testing   0.0132       0.0574      0.80
COD       St. 1          Training  0.0112       0.0573      0.89
COD       St. 1          Testing   0.0089       0.0596      0.85
COD       St. 2          Training  0.0183       0.0695      0.89
COD       St. 2          Testing   0.0297       0.0609      0.87
COD       Both stations  Training  0.0401       0.0984      0.81
COD       Both stations  Testing   0.0114       0.0594      0.79
DO        St. 1          Training  0.0085       0.0325      0.82
DO        St. 1          Testing   0.0086       0.0338      0.81
DO        St. 2          Training  0.0184       0.0500      0.84
DO        St. 2          Testing   0.0528       0.1606      0.86
DO        Both stations  Training  0.0086       0.0626      0.82
DO        Both stations  Testing   0.0125       0.0770      0.84

St. 1: station 1; St. 2: station 2.

Based on the gamma test analysis, SS, TA, and temperature are the parameters common to COD simulation in all three scenarios, while phosphate was the only parameter common to DO simulation in all three scenarios. In a study on the Karun River, Emamgholizadeh et al. (2014) investigated the sensitivity of an MLP model to its input parameters by omitting them one by one. Although they used fewer parameters, they reported similar results regarding the small impact of parameters such as Ca and Mg on predicting BOD, COD, and DO. Phosphate and turbidity were used in most scenarios to predict BOD, COD, and DO, which indicates their effectiveness in determining these parameters. These results agree with the findings of Emamgholizadeh et al. (2014) and Singh et al. (2009). Phosphate plays an important role in oxidation and in the energy-release process, and an increase in phosphate increases the number of microorganisms (Singh et al., 2009). Turbidity is an important parameter in determining the self-purification capacity and the amount of dissolved oxygen in the river (Talib and Amat, 2012). Therefore, it plays an important role in simulating the water quality of the Karun River.

Fig. 3: Scatter plots of observed and predicted BOD using the data of station 1 (top panel), station 2 (middle panel), and both stations (bottom panel): (a) training and (b) testing.

The ANN results for the simulation of BOD, COD, and DO are shown in Table 5. RMSE, MAE, and the correlation coefficient were used to compare the simulation results with the observed data. The RMSE and MAE values show that the neural networks predicted these parameters well. These results agree with those of Emamgholizadeh et al. (2014); however, the present study improved on their RMSE and MAE. Two phases, “training” and “testing”, were employed in the present study, and comparing the corresponding results of each phase shows that the networks are sufficiently accurate for simulating the desired parameters. The RMSE and MAE values given in Table 5 indicate that the ANNs for BOD simulation perform better than the other ANNs under the three scenarios. Nevertheless, all the ANNs perform acceptably in simulating BOD, COD, and DO in both the training and testing phases. Figures 3 to 5 show how well the predicted values of BOD, COD, and DO match the measured values