Since the COVID-19 pandemic began; each country responded with different measures at different time periods so it was impossible to develop the equation which could globally predict the COVID-19 epidemiology trend using latitude, longitude and outbreak period as an input variable. Instead the global model is made with the number of days since the start of the data collection as an input variable and number of patients in each group as the output variable. In order to develop such model, the data set was modified in terms of summing the number of cases, across locations, for each day. In addition to the global model total of four different models were created and these are China model, Italian model, Spanish model and USA model. These four models are interesting and should be studied further since each of these countries reacted to the outbreak differently. Among them the China model is the most reliable since the number of confirmed and deceased patients is decreasing while the number of recovered patients is growing. However, the Italian, Spanish and USA models are still unpredictable since each day many new instances of confirmed and deceased cases appear while the number of recovered patients remains small when compared to the previous two groups. In each case the produced mathematical equation for prediction of epidemiology trend consists of three equations which were obtained for recorded confirmed, deceased, and recovered cases. The mathematical equation in general form for prediction of epidemiology trend is given in
equation (1).
China model
For China model the province of Hubei was chosen as a training/testing model based on the number of confirmed cases, as well as the number of deceased and recovered cases. The equations for confirmed, deceased, and recovered patients are shown in
Table 3 with
scores, while the performance comparison of each equation with real data is given in
Figure 1.
Since the equations are too long to fit into
Table 3, the coefficients
,
and
were used with the full form of these coefficients is given in Appendix 6.1. The initial population in each GP iteration was 455, 225, and 281. The GP was executed for 188 generations for the model of confirmed and deceased cases while the value for the model of recovered cases randomly set to 120. In each generation total of 21, 28, and 47 members were randomly chosen from population to be used in the tournament selection process. The initial tree depth was set in range from 4 or 6 up to 9 to 12 depending on the case as shown in
Table 3. All mathematical equations were obtained with very high crossover coefficient value which was between 0.91 and 0.92. Low values of mutation coefficients were selected in each case meaning that the mutation of the candidate solutions did not have as large of an influence on the population members in comparison to the crossover operation. Stopping criteria was randomly chosen as 0.469, 0.36, and 0.092. The maximum number of samples were randomly chosen to be 0.81, 0.804, and 0.72. The constant range in GP parameter represents the constant interval (−3221.2, 6386.5). As seen on
Figure 1 the mathematical models obtained for confirmed, deceased, and recovered cases have excellent performance since comparison between estimation produced by these mathematical expressions and real data generates
score higher than 0.998 as it can be seen on
Figure 1(a)–(c). The Epidemiology estimation shown in
Figure 1(d) shows the high-quality estimation of the GP model on real data.
Italian model
In this subsection, results for the obtained Italian model are presented. The best equations obtained for each case are given in
Table 4 alongside GP parameters that were used to obtain each equation and
score of each equation. The comparison of each case with real data and epidemiology trend are shown in
Figure 2.
The coefficients
and
were introduced in order to shorten the notation of mathematical formulas. The full form of these coefficients is given in Appendix 6.2. As seen from
Table 4 the best mathematical expressions obtained using GP are selected based on their
score. The population of mathematical expressions in each iteration for each case was randomly chosen to be 665, 823, and 606. These numbers are much higher to those used to obtain the Chinese model. The GP trained the models for a maximum number of 170, 171, and 135 generations since the stopping criteria of 0.862, 0.675, and 0.498 were not achieved. In each generations total of 35, 21, and 34 depending on the case population members (mathematical expressions) were competing to become part of the next generation. The crossover coefficient in each case when compared to mutation coefficients was much higher meaning that crossover had much higher influence than mutation coefficient. The stopping criteria in each case was randomly chosen and the values are 0.862, 0.675, and 0.498. This criteria was never met since the GP stopped execution after the maximum number of generations is achieved. The maximum number of samples of training data set in each case that was used was 90%, 97%, and 80%, respectively. The constant range used for the construction of each equation was sizable and, for example, for the model of confirmed cases the constants ranged from −8825.9 up to 7587.3. The parsimony coefficients that are responsible for controlling the program growth from generation to generation were randomly chosen for each case and are equal to 0.065, 0.036, and 0.011.
From
Figure 2 it can be seen that mathematical expressions have good approximation on Confirmed, Deceased, and Recovered Cases as well as estimation of epidemiology trend.
Spain model
Spanish epidemiology trend is somewhat similar to the Italian epidemiology trend. Although the number of confirmed cases each day is shown to be rapidly growing and the number of deceased cases is smaller than in Italian epidemiology model while the number of recovered case is much larger. The equations used for estimation in each case with GP parameters and
scores are shown in
Table 5. The comparison of mathematical expressions with real data for each case is shown in
Figure 3.
In
Table 5 the coefficients
,
and
were used to shorten the display of mathematical equations within the table. The full form of these coefficients is given in Appendix 6.3. The numbers of candidate solutions for each presented group are 647, 538, and 623, respectively. The first stopping criteria in GP is the maximum number of generations which is 126, 191, and 198 generations. The other stopping criteria—stopping criteria coefficient, which was in the presented cases set to (0.447, 0.1, and 0.4) was not achieved hence the GP stopped the execution when the first stopping criteria was met that is maximum number of generations. The crossover coefficients are 0.92, 0.92, and 0.92. The next three coefficients in each case represent subtree mutation, hoist mutation and point mutation coefficients and for the confirmed cases shown in
Table 5 are 0.04, 0.001, and 0.005, respectively. Again, the crossover operation on the population members is shown to have a greater influence than the mutation coefficients. The maximum number of samples that were randomly selected in training data set between generations, for each model are 91%, 81%, and 95%. The constant range interval that GP used for random selection of constants in order to construct the mathematical expressions range from −4874.4 up to 2609.3 in confirmed case, from −3120.4 up to 2542.8 in deceased cases and from −1459.5 up to 8805.6 in recovered cases. The parsimony coefficients in each case are 0.032, 0.056, and 0.059, respectively. These are very low values which means that the mathematical expressions became very large due to lack of correlation between input and output variables. Note that the size of mathematical expressions cannot be seen from
Table 5 but from coefficients given in
Appendix.
As seen from
Figure 3 this model is similar to the Italian model which means that the number of confirmed and deceased cases is increasing rapidly while the number of recovered cases is slowly increasing. The mathematical expressions from
Table 5 that are shown in
Figure 3(a)–(c) it can be seen that these mathematical expressions follow the trend confirmed, deceased, and recovered cases really well. In
Figure 4(d) all three mathematical expressions were combined as shown in
equation (1). The daily number of recovered and deceased cases were subtracted from the daily number of confirmed cases in order to determine the number of active cases. This procedure is done for real data and for the obtained mathematical expressions. From
Figure 3 it can be seen that the mathematical expression almost perfectly estimates the real data with only a small deviation.
USA model
The USA is one of the last countries in which COVID-19 epidemic began and it is almost impossible to make predictions since the number of confirmed cases and deceased cases is rapidly growing. However, the data collected so far is enough to obtain mathematical expressions for confirmed, deceased, and recovered cases and possibly make the estimation of epidemiology curve using aforementioned expressions. The mathematical expressions for each case with GP parameters that were used to obtain these mathematical expressions and their performance measured in terms of
score are shown in
Table 6. The comparison of each mathematical expression with real data and the epidemiology trend is shown in
Figure 4.
As seen from the
Table 6 in order to write mathematical expressions in shorter form the coefficients
,
,
and
and the full form of these coefficients is given in Appendix 6.4. The
variable in
Table 6 represents latitude. The larger sizes of these equations can be attributed to the lack of correlation between input and output data, causing the selected parsimony coefficient to be very small. The randomly chosen parsimony coefficient for each model was equal to 0.05, 0.08, and 0.08, respectively. The number of population members in each case is equal to 141, 181, and 196, respectively. The number of generations which represents one of stopping criteria for each case was set to 199, 104, and 159, respectively. The randomly chosen stopping criteria coefficient was set to the values of 0.864, 0.92, and 0.78, respectively. The value of stopping criteria was never reached so the GP algorithm stopped execution after maximum number of generations was reached. The crossover coefficient (0.77, 0.79, and 0.78) although lower than in previous models is still dominating variational operator when compared to other three mutational operators. The maximum number of samples that were randomly chosen in each generation for each case are equal to 99%, 99%, and 98%. The coefficient range that was used in GP to construct mathematical expressions (population members) are from −7050.5 up to 8861.4 for the model of confirmed cases, from −9534.1 up to 9763.5 for the model of deceased cases, and from −2886.8 up to 7821.8 for the model of recovered cases. As seen from
Figure 5, the mathematical expressions for each case perform almost perfectly with smaller deviations when compared to the real data. When the performance of all three mathematical expressions is compared, the mathematical expression for recovered cases shows the largest deviation from the real data. In
Figure 5(d) equation (1) is used on real data in order to obtain the number of active cases. The same procedure is applied to the estimation curve with exception that the number of confirmed, deceased, and recovered cases were generated with mathematical expressions presented in
Table 6. The estimation generated using these mathematical expressions is almost identical to the real data.
Global model
In order to find the global mathematical model that could be utilized for estimation on real data, the data set used for training and testing on confirmed, deceased, and recovered cases must be adjusted. In each data set, the sum of all cases for each day must be calculated in order to determine the total number of confirmed, deceased, and recovered cases. The latitude and longitude as input variable in this model are set to zero since the inclusion of these parameters into the model failed to generate accurate mathematical models. The mathematical models obtained on confirmed, deceased, and recovered cases have one input variable which is number of days since the first data entry. In
Table 7 the mathematical equations obtained using GP for each case are shown with the GP parameter used to obtain these equations and
score that each mathematical equation achieved on the testing data set. The graphical representation of each mathematical equation compared to the real data and estimation epidemiology trend in comparison with real data is shown in
Figure 5.
In
Table 7 the coefficients
,
,
,
and
were introduced since the mathematical expressions are too large to fit in the table. The
represents the number of days from which the data collection for COVID-19 outbreak started. The full form of these coefficients is given in Appendix 6.4. Besides the mathematical expressions the GP parameters which were used to obtain these mathematical expressions were also shown alongside the
score. As in previous subsections, all GP parameters were randomly selected. The number of population members for each case equal 374, 439, and 282, respectively. The first stopping criteria is the maximum number of generations which was set to 159 in confirmed case, 180 in deceased case, and 156 in recovered case. The second stopping criteria was set for each case to 0.63, 0.8, and 0.8, respectively. Since the second stopping criteria represents the minimum
which is the fitness value of the best population member in the given generation and was never achieved the GP algorithm stopped the training process when the maximum number of generations was achieved. In each generation the total of 37, 21, and 20 population members were competing against each other to become the members of the next generation. As in previous cases the half-and-half method was utilized to construct the initial population with tree depth in range from 4 to 12 in confirmed case, 4 to 10 in deceased, and 3 to 8 in recovered case receptively. The crossover coefficient in each case with values of 0.73, 0.72, and 0.73 is still the dominating EC operator when compared to the values of the three mutation coefficients. The maximum number of samples that were randomly picked from training data set in every generation for each case was 99%, 96%, and 92%, respectively. The coefficient range which was used in each case by the GP algorithm to construct population members or to perform crossover or mutation was set from −6337.8 up to 5006.5 for the model of confirmed cases, from −4511.08 up to 2513.46 for the model of deceased cases, and from −7246.9 up to 6715.4 for the model of recovered cases. The parsimony coefficients are set to very low values which means that the population members could grow in length and depth from generation to generation. The results of the low parsimony coefficients are very large mathematical expressions that are able to correlate input with the output variable.
In
Figure 5(a)–(c) the performance of mathematical expressions for confirmed, deceased, and recovered case are compared to real data. The comparison showed that mathematical expressions provide good approximations to the real data for confirmed and deceased models, with the mathematical equation for recovered case showing some deviations from the real data. In
Figure 5 using
equation (1) the epidemiology trend estimation is made. The real data from confirmed, deceased, and recovered cases models was used to determine the number of active cases using
equation (1). Then the data generated by three equations from
Table 7 was used in
equation (1) in order to obtain the number of infected cases. The comparison of these two curves in
Figure 5(d) shows that the epidemiology curve based on these three equations shows a good approximation of real epidemiology data.
Discussion
In
Table 1 the possible ranges of hyperparameters used in GP algorithm are shown. The crossover probabilities for all models tended to higher values, with all mutation probabilities tending to lower values. Because of this, crossover probability had a bigger influence on population of symbolic expressions in each generation than the mutation coefficient probabilities. Stopping criteria coefficient was never achieved, because of a very low upper bound of values from which it could be selected. Due to this, the secondary stopping criteria, number of generations, was used to stop the training process in all models. The number of generations shows tendency towards the upper bound (200) across models. The maximum number of samples used in training shows the tendency to the upper bound of possible values. The resulting range of parsimony coefficients which were used to generate the best symbolic expressions is very low. This happened due to weak correlation between input and output variables, in order to enable growth of symbolic expressions in terms of achieving lower
value. The other hyperparameters show equal distribution amongst possible values, not showing a tendency to either bound.
Today a vast number of AI algorithms exist which can be used to solve specific problems. The most popular AI algorithms are ANNs which are trained in a similar manner to GP in order to solve specific problem. The result of training the ANN to solve a specific problem is the architecture which is capable of solving the problem. The transformation of ANN architecture into mathematical expression is almost impossible due to the large number of neurons and their interconnections. On the other hand the benefit of using GP algorithm is it will, after training, produce the mathematical expression that correlates inputs and the desired output.
In order to obtain epidemiology trend equation for the Chinese, Italian, Spanish and USA models GP was utilized to obtain symbolic expressions for the number of confirmed, deceased, and recovered cases. For each symbolic expression obtained using GP the dataset consisted of latitude, longitude, and a number of confirmed/deceased/recovered cases for each day since the dataset start. Due to a small dataset at the moment of the research the latitude and longitude were fixed for each specific location in each model. The Chinese model is the most unique model when compared to other modes since the outbreak of COVID-19 started earlier than in other models and the number of confirmed and deceased cases is stagnating while the number of recovered cases is slowly growing. The latitude and longitude for the Chinese model was fixed at Hubei province (latitude: 31, longitude: 112) due to a small numbers of other locations for reported confirmed/deceased/recovered cases. The best symbolic expressions for confirmed, deceased, and recovered cases obtained using GP generated
score values of
,
and
, respectively. These results showed that these symbolic expressions estimate the confirmed, deceased, and recovered cases with high accuracy when compared to real data. Combining the best symbolic expressions in terms of
score for confirmed/recovered/deceased cases generated epidemiology trend equation which follows the real epidemiology curve with high accuracy which can be seen from
Figure 1. From that figure, two interesting features can be noticed: after 20 days since the start date number of confirmed cases rapidly grew while the number of deceased and recovered cases had slowly increased, after 25 days since the start date the epidemiology curve achieved maximum value and started to decrease. The rapid increase in number of confirmed cases can be attributed to the fact that the strict measures of social distancing and hygiene were not fully implemented. The decrease in epidemiology curve can be attributed to the fact that the number of confirmed cases started to stagnate while the number of deceased and recovered cases are slowly growing. The epidemiology curve showed that the number of infected cases dropped down below 1000 after 70 days from the start date. The decrease in number of infected cases is a result of strict quarantine measures followed by obligatory protection equipment as well as social distancing and extremely high hygiene and sanitation standards.
For the Italian model, the symbolic expressions for confirmed/deceased/recovered cases were obtained using the same procedure as described in case of the Chinese model. Since at the time the symbolic expressions were obtained there were small numbers of locations in which COVID-19 was confirmed and they were mostly concentrated at hospitals the location in terms of latitude and longitude values was fixed at the city of Milan (latitude: 45.27, longitude: 9.11). When compared to the Chinese model the Italian model had exponential growth of the number of confirmed/deceased/recovered cases. This growing trend can be attributed to the fact that extreme quarantine measures, social distancing and high sanitation standards were not implemented as they should have been. The best symbolic expressions for confirmed/deceased/recovered cases obtained using GP generated
score values of
,
and
, respectively. As seen in
Figure 2 all these symbolic expressions are following the real trend of confirmed, deceased, and recovered cases with high accuracy. Using these equations the epidemiology equation was generated and when compared to the real trend it can be seen that this equation estimates the real trend with high accuracy. At the time of conducted investigation the Italian epidemiology trend was exponentially growing without any indication of decreasing. In order to decrease number of confirmed/deceased cases and increase the number of recovered cases, Italy will have to introduce strict measures of quarantine, social distancing, and high hygiene and sanitation standards.
The epidemiology trend in Spanish model shows the same behavior as the Italian model, with exception of the number of confirmed cases, the growth of which was slowly decreasing. The procedure for obtaining the symbolic expressions for confirmed, deceased, and recovered cases is the same as in previous models. Due to the small number of reported locations the latitude and longitude values were fixed to the city of Madrid (latitude:
, longitude:
). The best symbolic expressions using GP for confirmed, deceased, and recovered cases achieved
values of
,
and
, respectively. These values showed that these symbolic expressions are estimating the number of confirmed, deceased, and recovered cases with high accuracy. The epidemiology equation was formulated using the best symbolic expressions for confirmed, deceased, and recovered cases. From
Figure 3 it can be seen that the epidemiology equation of Spanish model is estimating the number of infected cases with high accuracy when compared to the real data. From the epidemiology curve it can be noticed that the number of infected cases is slowly increasing. This trend can be attributed to the fact that after 70 days since the start date was recorded the number of confirmed and deceased cases showed a slightly slower increasing trend while the number of recovered cases was constantly growing. As mentioned previously, the Spanish model is similar to the Italian model with the exception of the number of infected cases which is slowly increasing after 70 days since the initial data was recorded in the data set. This means that the extreme quarantine measures as well as social distancing, high hygiene and sanitation standards were implemented in full form.
When compared to the other countries in the USA the outbreak has recently started. However, due to high number of violations of extreme quarantine measures as well as social distancing, hygiene, and sanitation measures the number of confirmed and deceased cases exponentially increased while the number of recovered cases showed only a slow increase. Again, the procedure for obtaining symbolic expressions for confirmed, deceased, and recovered cases using GP is the same as in previous models and due to a small number of reported locations the latitude and longitude values were fixed at the city of New York (latitude: 40.716, longitude: −74). The best symbolic expressions obtained for confirmed, deceased, and recovered cases were obtained using GP that achieved
values of
,
and
, respectively. All three equations are estimating the number of aforementioned cases with high accuracy which can be seen in
Figure 4. These three equations were used to formulate epidemiology trend equation and the results showed that this equation is estimating the number of infected cases with high accuracy when compared to the real epidemiology curve. From the real data shown in
Figure 4 it can be seen that the number of infected cases is still rapidly growing which can be attributed to the fact that general population is not following strict measures from World Health Organization (WHO) or Center for Disease Control (CDC). In order to stop this trend, the general population should follow extreme measures of quarantine as well as social distancing and high hygiene and sanitation standards. If not the consequences could be catastrophic.
The epidemiology trend equation on global scale was obtained following the same procedure as in the case of Chinese, Italian, Spanish, and USA model. However, the latitude and longitude were omitted due to small number of outbreak locations, at the time the analysis was conducted. So the only input variable was the number of confirmed/deceased/recovered cases for each day since the start date. The best symbolic expressions obtained for three cases using GP achieved
score values of
,
and
, respectively. These values indicated that the obtained symbolic expressions are estimating the numbers of confirmed, deceased, and recovered cases with very high accuracy when compared to the real data. Using these symbolic expressions the epidemiology equation was formulated for global model and from
Figure 5 it can be seen that the estimated epidemiology trend is following the real epidemiology trend with high accuracy. From estimated and real epidemiology curve two interesting features can be noticed. After 20 days since the data collection started on the global scale the stagnation of the number of infected cases can be noticed. After 50 days since the start date the number of infected cases stared to rapidly increase which can be attributed to the fact that outbreak had started in Italy, Spain, USA and other countries. The epidemiology trend should start to decrease if quarantine measures, social distancing as well as high hygiene and sanitation standards defined by WHO are followed globally.