Modeling of the output current of a photovoltaic grid-connected system using random forests technique

This study presents a technique for predicting the output current of a photovoltaic grid-connected system by using the random forests technique. Experimental data of a photovoltaic grid-connected system are used to train and validate the proposed model. Three statistical error values, namely root mean square error, mean bias error, and mean absolute percentage error, are used to evaluate the developed model. Moreover, the results of the proposed technique are compared with those obtained from an artificial neural network-based model to show the superiority of the proposed method. Results show that the proposed model accurately predicts the output current of the system. The root mean square error, mean absolute percentage error, and mean bias error values of the proposed method are 2.7482, 8.7151, and −2.5772%, respectively. Moreover, the proposed model is faster than the artificial neural network-based model by 0.0801 s.


Introduction
In recent years, the installation of photovoltaic (PV) systems has rapidly expanded (Rajkumar et al., 2011). Three types of PV systems are generally used, namely stand-alone, hybrid, and grid-connected PV systems. Grid-connected PV systems convert solar radiation into electrical power and inject it directly into the grid without storage (Yang et al., 2010). A grid-connected PV system supports the stability of the system voltage, reduces electrical losses, and reduces the loading level of power transformers (Albuquerque et al., 2010). However, connecting a PV system to a conventional electrical power system changes the nature of that system from a passive to an active electrical power system (Masters, 2013). Thus, the new nature of the electrical power system must be considered from many viewpoints, such as the system's protection and management. Accordingly, a grid-connected PV system must be optimally designed, installed, and operated to achieve a secure and stable electrical power system.
Predicting a system's output power or current plays the key role in optimally sizing, installing, and controlling a grid-connected PV system (Milosavljević et al., 2015). The output current of a grid-connected PV system is a function of meteorological variables, such as solar radiation and ambient temperature. Therefore, PV system performance models are usually developed based on these meteorological variables (Sharma and Chandel, 2013).
In general, several methods for predicting the output current of PV systems can be found in the literature (Almonacid et al., 2009, 2014; Bacher et al., 2009; Bahgat et al., 2005). In one of these works, models were developed from historical output power data available for two PV modules; the performance of these models was compared with actual performance, and the best methodology for output power forecasting was selected based on the results. In Almonacid et al. (2014), a methodology for forecasting the PV output power 1 h ahead was developed. Hourly global solar radiation and air temperature data were utilized for developing the model, and two artificial neural network (ANN) models were developed to predict the global solar radiation and air temperature in the next hour.
Accordingly, ANNs have been widely used for predicting PV system output. However, the use of ANNs for this purpose has some limitations and challenges, such as the complexity of the training process, the calculation of the number of hidden-layer neurons, and the difficulty of handling highly uncertain data (Khatib et al., 2012). Meanwhile, some newer methods with high accuracy and the capability of handling highly uncertain data, such as random forests (RFs), can be used for this purpose.
RFs are an ensemble machine learning method that uses many decision trees for classification and regression. RFs are a combination of tree predictors in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest (Breiman, 2001). The construction of a tree in RFs does not depend on the previous trees; the trees are created independently by using bootstrap aggregation of the dataset (Breiman, 1996). RFs do not overfit as a predictor and run fast and efficiently when handling large datasets, which gives them superior predictive performance. Furthermore, RFs do not require assumptions about the data distribution across the trees. Moreover, RFs can handle both continuous and discrete variables (Tooke et al., 2014).
The RFs technique has been successfully used for modeling solar radiation. In Sun et al. (2016), the authors used an RFs model to predict daily solar radiation at three sites in China. The default parameter values were used (500 trees and five leaves per tree). The results of that work show that the RFs model performs better than linear, exponential, and logarithmic models. Meanwhile, in Kratzenberg et al. (2015), the authors used an RFs model for predicting hourly and daily solar radiation; here, the internal RFs parameters were not mentioned. In Gala et al. (2015), an RF-based model with different values of the internal RFs parameters was used to predict hourly solar radiation. The authors set the number of trees to 10, 50, 100, or 300, and the number of leaves per tree to five or 20; the best numbers of trees and leaves per tree were selected based on the best model performance.

This study presents a technique for predicting the output current of a PV grid-connected system by using RFs. Experimental data of a 3 kWp PV grid-connected system installed at the Universiti Kebangsaan Malaysia campus are used. These data contain hourly solar radiation, ambient temperature, and actual system output current. The performance of the developed RFs model is compared with that of an ANN-based model to show the superiority of the proposed method.

Mathematical modeling of PV current
In general, each PV module consists of a number of solar cells connected in series and parallel. In the dark (with no sunlight), the solar cell acts as a diode in reverse mode. Under solar radiation, the solar cell generates DC current. A solar cell can be represented as an electrical circuit, as depicted in Figure 1.

Energy Exploration & Exploitation 36(1)

Equation (1) represents the behavior of the solar cell

I = I_ph − I_0 [exp(q(V + I R_s) / (m k T_c)) − 1] − (V + I R_s) / R_sh    (1)

where R_s is a resistance that represents the semiconductor material losses, R_sh represents the leakage current, I_ph is the photogenerated current, I_0 is the saturation current, T_c is the cell absolute temperature, k is the Boltzmann constant, m is the diode ideality factor, and q is the electrical charge (Masters, 2013). Therefore, R_s is usually estimated by the ratio

R_s = (V_oc / I_sc)(1 − FF / FF_0)    (2)

where STC is the standard test condition, wherein solar radiation (G) is 1000 W/m², ambient temperature (T_A) is 25 °C, and wind speed (v_w) is 1.5 m/s. V_oc is the open-circuit voltage, I_sc is the short-circuit current, FF is the actual fill factor, and FF_0 is the ideal fill factor of the solar cell at R_s = 0. FF_0 can be calculated by the following

FF_0 = (v_oc − ln(v_oc + 0.72)) / (v_oc + 1)    (3)

where v_oc is the ratio between the open-circuit voltage and the thermodynamic voltage (V_T). The output current of a solar cell can be described theoretically as a linear regression

I = I_STC (G / G_STC) [1 + α_sc (T_c − T_c,STC)]    (4)

where I_STC is the PV module current at STC, G is the solar radiation, G_STC is the solar radiation under STC, α_sc is the short-circuit temperature coefficient, and T_c,STC is the cell temperature at STC. The cell temperature (T_c) can be obtained from the ambient temperature (T_A) and the nominal operating temperature of the cell (NOCT), which is usually supplied by the manufacturer. NOCT is defined as the operating temperature of a PV module under solar radiation (G) of 800 W/m², ambient temperature (T_A) of 20 °C, and wind speed (v_w) of 1 m/s (Masters, 2013).
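The NOCT-based cell-temperature relation and the linear current model described above can be combined into a short computational sketch. The function names below are illustrative, and the default datasheet values (I_STC, the temperature coefficient, and NOCT) are placeholder assumptions rather than the parameters of the module used in this study:

```python
def cell_temperature(t_ambient, g, noct):
    """Cell temperature from ambient temperature via the NOCT relation:
    T_c = T_A + G * (NOCT - 20) / 800, with G in W/m^2."""
    return t_ambient + g * (noct - 20.0) / 800.0


def pv_current(g, t_ambient, i_stc=8.0, alpha=0.005, noct=45.0,
               g_stc=1000.0, t_c_stc=25.0):
    """Linear PV current model:
    I = I_STC * (G / G_STC) * (1 + alpha * (T_c - T_c,STC))."""
    t_c = cell_temperature(t_ambient, g, noct)
    return i_stc * (g / g_stc) * (1.0 + alpha * (t_c - t_c_stc))
```

At STC-equivalent conditions the model returns the datasheet current I_STC, and the output scales roughly linearly with irradiance, which is the behavior the prediction model is trained to reproduce.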
T_c can be calculated by the following

T_c = T_A + G (NOCT − 20) / 800    (5)

RFs technique

The RFs model combines bagging and random decision trees. Bagging can be defined as a technique applied to prediction functions so as to reduce their variance; RFs are therefore an extension of bagging that uses decorrelated trees. In general, a simple RFs model is developed by using a small number of input variables at each node, so that the variables are split randomly at each node (Breiman and Cutler, 2014). The development of an RF-based prediction model usually starts by establishing a new set of samples equal in size to the original data. These samples are selected from the original dataset by random bootstrapping. The selected dataset is then formulated as a sequence of binary splits so as to create the desired decision trees. At each node of these trees, the split is chosen by selecting the value of the variable that yields the minimum error rate. Eventually, the average of the aggregated predictors is taken for regression, while the majority vote is taken for classification (Liaw and Wiener, 2002).
Bootstrap aggregation techniques facilitate the determination of the error and of the influence of the input variables on the output variable (variable importance (VI)) (Breiman, 2001). Error rates and variable importance are estimated from the values omitted from each bootstrap sample, the so-called "out-of-bag" (OOB) data (Breiman, 2001; MATLAB, 2016). OOB data play an important role in the tree growth process, as OOB data are compared with the predicted values at each step.

Classification and regression
Classification and regression trees are a recursive partitioning method used for predicting continuous dependent variables (regression) and categorical variables (classification). A decision follows the flow of nodes from the root to the leaf that contains the response, for every tree in the forest. In regression trees, the response is numeric, whereas in classification trees it is nominal (e.g. true or false).
On one hand, classification trees are used where objects must be recognized, understood, and differentiated (Alpaydin, 2010). In general, classification trees group objects into independent categories based on a specific criterion. Their main role is to provide an understanding of the relationships between subjects and objects. Classification trees are used in language, inference, decision making, and many types of environmental interaction, and their principle overlaps with machine learning (Mills, 2011).
On the other hand, regression trees are a statistical method for determining the relationships between variables. Regression encompasses many techniques for modeling and analysing data when the relationship between a dependent variable and one or more independent variables is studied (Armstrong, 2012). Regression trees are widely used in prediction, and, as with classification trees, their role overlaps with the field of machine learning.
The behavior of RFs depends on the constructed regression and classification trees (Liaw and Wiener, 2002). In general, the decision trees are constructed in the following phases. First, all training input data are used to examine all possible binary splits in each predictor or classifier, and the split with the best optimization criterion is chosen. In regression trees, the optimization criterion is the minimum mean square error (MSE), calculated between the predicted and actual data during the training process. In classification trees, one of three measures, namely Gini's diversity index, deviance, or the twoing rule, is used to choose the split. Second, the selected split divides the node into two new child nodes. Finally, the process is repeated for the new child nodes until the trees are fully constructed (i.e., until the minimum MSE is reached in regression trees) (MATLAB, 2016). In RFs, each regression tree is grown depending on a random vector, and the RFs predictor is formed by taking the average over all trees in the forest (Breiman, 2001).
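The split-selection step described above can be sketched for a single predictor as follows. The function names are illustrative, and a full implementation would repeat this search over every candidate predictor at each node:

```python
def node_sse(y):
    """Sum of squared errors when the node predicts its own mean."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def best_split(x, y):
    """Return (threshold, total SSE) of the best binary split on x,
    i.e. the split minimising the summed child-node squared error."""
    pairs = sorted(zip(x, y))
    best = (None, node_sse(y))  # keep the unsplit node if nothing improves
    for i in range(1, len(pairs)):
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        sse = node_sse(left) + node_sse(right)
        if sse < best[1]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best = (threshold, sse)
    return best
```

For example, with responses that jump from 0 to 10 between x = 2 and x = 3, the minimum-MSE split lands at the midpoint 2.5 and reduces the error to zero.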

RFs algorithm
The procedure for predicting the response in the RFs algorithm is a combination of training and testing phases. In the training phase, the RFs algorithm starts by drawing multiple bootstrap samples (N) from the original data and then creates an unpruned classification or regression tree (CART) for each sample. About one-third of the samples are left out of the construction of each tree; these are called the OOB data (Liaw and Wiener, 2002). The OOB data provide a running unbiased estimate of the prediction error as trees are added to the forest. Thus, OOB data play a primary role in tree growth; that is, OOB data are compared with the predicted values at each step, and trees are added to the forest so as to minimize the error rate obtained on the OOB data (Breiman, 2001). The RFs error rate of Breiman (2001) depends on two parameters: the correlation between any two trees and the strength of each individual tree. After the final splits are created, the data are predicted at each bootstrap iteration by using the tree grown with that bootstrap sample (Liaw and Wiener, 2002). The number of trees in the forest is thus a hyper-parameter of the RFs algorithm that should be tuned to ensure accurate prediction results. Breiman (2001) suggests trying the default, half the default, and twice the default, and then taking the best one.
In the testing phase, the testing data are passed through the forest to start the prediction procedure. The flow of the data through the trees follows the constructed splits. The final prediction for the new data is obtained by averaging the aggregated predictors over all trees. Figure 2 shows the main structure of the RFs algorithm.
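A minimal sketch of the bootstrap-plus-OOB procedure described above, using one-split regression stumps as base learners for brevity (a real RF grows deep trees and also subsamples the predictor variables at each node); all names are illustrative:

```python
import random


def fit_stump(x, y):
    """Grow a one-split regression 'tree': best single threshold on x."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        sse = (sum((v - sum(left) / len(left)) ** 2 for v in left)
               + sum((v - sum(right) / len(right)) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2.0,
                    sum(left) / len(left), sum(right) / len(right))
    _, thr, left_mean, right_mean = best
    return lambda v: left_mean if v <= thr else right_mean


def bagged_predict(x, y, n_trees=25, seed=0):
    """Fit n_trees stumps on bootstrap samples; return the averaging
    predictor and the OOB mean-squared error."""
    rng = random.Random(seed)
    n = len(x)
    stumps, oob_sets = [], []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        stumps.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
        oob_sets.append(set(range(n)) - set(idx))         # ~1/3 left out
    predict = lambda v: sum(s(v) for s in stumps) / len(stumps)
    # OOB error: each point is predicted only by stumps that never saw it
    sq_errs = []
    for i in range(n):
        preds = [s(x[i]) for s, oob in zip(stumps, oob_sets) if i in oob]
        if preds:
            sq_errs.append((sum(preds) / len(preds) - y[i]) ** 2)
    oob_mse = sum(sq_errs) / len(sq_errs) if sq_errs else float("nan")
    return predict, oob_mse
```

The OOB error here plays the role described in the text: it is an unbiased running estimate of prediction error obtained without a separate validation set.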

VI measure
The RFs algorithm provides a significant measure of the VI in the dataset. This measure is implemented in the training phase and examines the individual effect of each input variable on the output of the algorithm. The VI measure is computed using the OOB data during the drawing of the bootstrap samples N for each tree (sampling with replacement), and its aim is to identify the variables that contribute most to prediction accuracy (Breiman, 2001).
The VI of any variable f can be obtained by randomly permuting all values of f in the OOB sample of each tree in the forest. The VI measure is calculated as the ratio of the average difference between the prediction accuracy before and after permuting variable f to the total number of trees in the forest (Breiman, 2001). The importance score of each variable is obtained by using the following equation (Guo et al., 2011)

VI(f) = (1/T) Σ_{t=1}^{T} [ Σ_{x_i ∈ Ø(t)} I(c_i^(t) = L_i) − Σ_{x_i ∈ Ø(t)} I(c_{i,f}^(t) = L_i) ] / |Ø(t)|

where Ø(t) corresponds to the OOB samples for a specific tree (the complement of its in-bag samples), t represents the tree number (1, 2, . . . , T), T is the total number of trees, and c_i^(t) and c_{i,f}^(t) are the predicted classes for each sample for a tree before and after permuting the variable. x_i represents the sample value, and L_i is the true label; both are in the training stage.
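The permutation step behind this measure can be sketched for a single fitted predictor as follows. `model`, `rows`, and `feature` are illustrative names; in a forest, this accuracy difference would be computed on each tree's OOB samples and averaged over the T trees:

```python
import random


def permutation_importance(model, rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling one feature column.
    `model` is any callable mapping a feature list to a class label."""
    rng = random.Random(seed)

    def accuracy(data):
        hits = sum(1 for x, l in zip(data, labels) if model(x) == l)
        return hits / len(data)

    baseline = accuracy(rows)
    shuffled_col = [x[feature] for x in rows]
    rng.shuffle(shuffled_col)                     # break the feature/label link
    permuted = [x[:feature] + [v] + x[feature + 1:]
                for x, v in zip(rows, shuffled_col)]
    return baseline - accuracy(permuted)
```

A feature the model ignores yields an importance of zero, since shuffling it cannot change any prediction; an informative feature yields a positive accuracy drop.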
Application of the RFs technique for predicting the PV output current

Figure 3 shows the RFs prediction flowchart. The prediction of a PV system output current starts by setting the input samples and variables for the Bagger algorithm. In this work, the inputs are solar radiation, ambient temperature, day number, hour, latitude, longitude, and number of PV modules. No mathematical formula sets the optimum number of trees (Liaw and Wiener, 2002). In this application, the initial number of trees is set to 250, half the default number of trees, as Breiman (2001) suggested, and the initial number of leaves in each tree is set to five, the default of the Bagger algorithm (MATLAB, 2016). In some studies, such as Tooke et al. (2014), the authors set the number of trees to 500 during the training and testing phases without optimizing the numbers of trees and leaves, which affects accuracy.

The initial numbers used in the first stage of the training process serve only to estimate the VI, outliers, and proximity matrix so as to manage the algorithm parameters. Thereafter, the VI measure is computed. As a result, solar radiation, day number, hour, and ambient temperature are found to be the most important variables. The other variables are neglected because they have constant values; thus, they do not affect the prediction process or the system accuracy.
Second, the outliers in the training dataset are detected by using cluster analysis. Clustering is the task of grouping the training data such that samples in the same group are more similar to one another than to those in other groups. Typical clustering models include subspace models, connectivity models, centroid models, density models, group models, distribution models, and graph-based models. In this study, a density model is used, which defines clusters as connected dense regions in the dataset space. The data points (x-axis) are ordered by the OPTICS algorithm, and the y-axis shows the reachability distance. Figure 4 shows that points belonging to a cluster have a low reachability distance to their nearest neighbor. Therefore, the model can easily detect clusters of points as well as noise points that do not belong to any cluster.
In machine learning techniques, removing outliers from the dataset significantly increases the accuracy of the results. Here, the outliers in the training dataset are detected, removed, and replaced. Outlier detection is the identification of observations that do not conform to an expected pattern in a dataset and is normally accomplished with statistics and thresholds. Cluster analysis is one of the most popular techniques for detecting outliers or noise points that do not belong to a dataset. A normal distribution model is used to analyze the dataset. The outliers expected in the dataset are depicted in Figure 5, which shows the percentage of the training dataset detected as outliers: 54.4% of the dataset in the first pattern, 0.265% in the second, 0% in the third, 0.53% in the fourth, 1.6% in the fifth, 4.51% in the sixth, 11.6% in the seventh, 18.03% in the eighth, and 9.02% in the ninth. The algorithm removes these outliers and replaces them so as to increase the accuracy of the results. Finally, the numbers of trees and leaves per tree are optimized for the modified dataset and the variables with the greatest VI. In this stage, an iterative method runs the training phase for a large number of trials: 50,000 combinations covering up to 500 trees and 100 leaves are evaluated to find the best number of trees and leaves per tree.
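A simplified stand-in for the reachability-based detection described above: points whose distance to their k-th nearest neighbour is large lie in low-density regions and are flagged as outliers. (OPTICS builds a full reachability ordering; this sketch, with illustrative names and one-dimensional points, keeps only the core idea.)

```python
def kth_neighbour_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour."""
    dists = sorted(abs(points[i] - p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]


def flag_outliers(points, k=2, threshold=1.0):
    """Indices of points whose k-th neighbour is farther than `threshold`,
    i.e. points that do not sit inside any dense region."""
    return [i for i in range(len(points))
            if kth_neighbour_distance(points, i, k) > threshold]
```

A point far from the dense cluster is flagged, while points inside the cluster keep small neighbour distances and are retained for training.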

Proposed model evaluation
In this paper, three metrics are used to evaluate the proposed model: root MSE (RMSE), mean absolute percentage error (MAPE), and mean bias error (MBE). RMSE is an efficiency indicator of the prediction process; a large positive RMSE value indicates a large deviation of the predicted values from the target values. MBE, or mean forecast error, is an average deviation indicator; a negative value means that the prediction is underforecasted, and vice versa. MAPE is an accuracy indicator. RMSE, MBE, and MAPE are expressed as follows

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (I_i^p − I_i)² )

MBE = (1/n) Σ_{i=1}^{n} (I_i^p − I_i)

MAPE = (100/n) Σ_{i=1}^{n} |(I_i^p − I_i) / I_i|

where I_i^p represents the predicted value, I_i is the target value, and n is the number of observations.
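The three metrics follow directly from these definitions; `predicted` and `target` are illustrative names for equal-length sequences of currents:

```python
import math


def rmse(predicted, target):
    """Root mean square error: sqrt(mean of squared deviations)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target))
                     / len(target))


def mbe(predicted, target):
    """Mean bias error: negative means the model underforecasts."""
    return sum(p - t for p, t in zip(predicted, target)) / len(target)


def mape(predicted, target):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((p - t) / t)
                       for p, t in zip(predicted, target)) / len(target)
```

Note that MBE can be near zero even when RMSE is large, since positive and negative deviations cancel; this is why the paper reports all three values together.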

Results and discussion
In this research, the output current of a 3 kWp PV system installed at the Faculty of Engineering & Built Environment, UKM, Malaysia (101.7713 E, 2.921065 N) is used (see Figure 6). The system consists of polycrystalline silicon panels (25 modules) tilted at 15°. The specifications of the PV module are listed in Table 1.
The performance data of the system, as well as meteorological data (solar radiation and ambient temperature), are used in this research. Six months of hourly recorded data are utilized. The monitoring system consists of a solar radiation transmitter with a high-stability silicon PV detector (model WE300, accuracy ±1%), a temperature sensor for the surface of the PV panel (model WE710, accuracy ±0.25 °C), an air temperature sensor (model WE700, range −50 °C to +50 °C, accuracy ±0.1 °C), and a current transducer (model CTH-050, input range 0-50 A DC, output 4-20 mA). The dataset used is divided into two parts: 70% for training and 30% for testing and validation. Figure 7 shows the dataset utilized in this research.

Figure 8 shows the VI rates for the seven inputs used in this study. From the figure, the most important variable is solar radiation, with a rate of 2.4 out of 2.5. Day number has a rate of 0.87 out of 2.5, hour has a rate of 0.77 out of 2.5, and ambient temperature has a rate of 0.6 out of 2.5. The other inputs have a rate of 0 out of 2.5; that is, these variables have no effect on the predicted values of the output current. Following these results, the inputs used in this study are solar radiation, ambient temperature, day number, and hour.

Optimizing the parameters of the RFs prediction algorithm requires searching for the numbers of trees and leaves per tree that achieve the best values of RMSE, MAPE, and MBE. Tables 2 to 4 show the trials over up to 500 trees and 100 leaves per tree and the effect of these numbers on prediction accuracy. From the tables, the best number of trees is found to be 65, whereas the best number of leaves is found to be one leaf per tree.
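The exhaustive search over tree and leaf counts can be sketched generically as follows. `evaluate` is a placeholder for training a forest with the given parameters and returning its validation RMSE; the names are illustrative, not the study's actual code:

```python
import itertools


def grid_search(evaluate, tree_counts, leaf_counts):
    """Try every (n_trees, n_leaves) pair and keep the lowest score.
    Returns (best score, best n_trees, best n_leaves)."""
    best = None
    for n_trees, n_leaves in itertools.product(tree_counts, leaf_counts):
        score = evaluate(n_trees, n_leaves)
        if best is None or score < best[0]:
            best = (score, n_trees, n_leaves)
    return best
```

With 500 candidate tree counts and 100 candidate leaf counts, this enumerates the 50,000 trials mentioned earlier; the paper reports 65 trees and one leaf per tree as the resulting optimum.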
After setting the parameters of the algorithm, the training process was conducted by using 70% of the dataset; the remaining 30% was used to validate the proposed model. Actual data, as well as the data predicted by the ANN-based model, are compared with the data predicted by the proposed RFs model to show the superiority of the proposed model. Figure 9 shows the PV system output current based on the developed RFs model, the ANN-based model, and the actual data. From the figure, the developed RFs model shows better results than the ANN-based model. Moreover, the values predicted by RFs do not deviate from the interval of the measured values. Furthermore, the RF-based model is faster than the ANN-based model in both the training and testing processes. Table 5 shows the evaluation of the proposed models. From Table 5, RFs outperform the ANN in predicting the system output current: the RMSE, MAPE, and MBE values for the proposed model are 2.7, 8.7, and −2.58%, respectively. These results demonstrate the superiority of RFs over the ANN-based model in predicting the system output current.

Conclusion
A model for predicting the output current of a PV system by using RFs was presented in this paper. Experimental data of a 3 kWp PV system were used in developing the proposed model. Three statistical error values, namely RMSE, MAPE, and MBE, were employed to evaluate the accuracy of the proposed model. Based on the results, the RFs model was found to model the PV output current accurately and to outperform the ANN-based model. The RMSE, MAPE, and MBE values for the proposed RFs model were 2.7482, 8.7151, and −2.5772%, respectively. The proposed RFs model can therefore be used as an efficient machine learning technique for predicting the hourly output current of PV systems.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors would like to acknowledge the financial support from the Universiti Kebangsaan Malaysia funding under the research projects ETP-2013-044 and DIP-2014-028. Financial support is also received from Alpen-Adria-Universität Klagenfurt Project Number AST4340004 (Smart Grids).