Reconstruction of critical coalbed methane logs with principal component regression model: A case study

Wireline logging plays a critical role in coalbed methane exploration. However, the lack of crucial log data, such as neutron and sonic logs, makes coalbed methane exploration difficult. In this paper, we propose a principal component regression model that combines multiscale wavelet analysis, histogram calibration, principal component analysis, and multivariate regression to reconstruct essential neutron and sonic logs from conventional logs (i.e., density, resistivity, gamma ray, spontaneous potential, and caliper logs). The proposed model does not need core-related correlation and is not subject to local minima. We have applied the model to evaluate coalbed methane content in a real case. First, we use multiscale wavelet analysis and histogram calibration to improve the reliability and lateral comparability of the logs. Then, we apply principal component analysis to transform the well-correlated wireline logs into linearly independent components and fit reconstruction functions for the neutron and sonic logs with multivariate regression. The reconstructed logs are similar to the measured logs in trend, mean, and scale. Finally, we apply the reconstructed neutron logs to predict the coalbed methane-content distribution. The predicted distribution not only follows the regional distribution characteristics of coalbed methane enrichment zones but is also validated by the coalbed methane production data. In summary, the successful applications of wireline-log reconstruction and regional coalbed methane-content prediction demonstrate the reliability of the proposed principal component regression model.


Introduction
With the advancement of coalbed methane (CBM) exploration and development in China, wireline-logging technologies have played a crucial role in CBM reservoir evaluation and development (Chen et al., 2018; Hatherly et al., 2016; Huang et al., 2019; Ren et al., 2018; Teng et al., 2015; Wang et al., 2017; Zhou and Esterle, 2008; Zhou and O'Brien, 2016). However, several factors, such as instrument inconsistency and logging environments, affect log data quality and cause systematic or non-systematic errors (Kang et al., 2018; Koizumi, 2007; Niebuhr et al., 2001). A shortage of wireline-log data is common in the early stage of CBM exploration. In a given CBM block, most wells have conventional wireline logs, such as density, resistivity, gamma ray, caliper, and spontaneous potential (SP) logs, and only a few boreholes have neutron and sonic logs. The lack of neutron and sonic logs makes CBM interpretation and evaluation difficult. Therefore, it is desirable to reconstruct neutron and sonic logs from conventional wireline logs.
In the literature, several methods have been proposed to reconstruct wireline logs or to estimate parameters from multiple geophysical logs. Among them, empirical methods, neural network models, and multivariate regression methods are the most widely used (Goutorbe et al., 2006; Hatherly et al., 2016; Rezaee et al., 2008; Rolon et al., 2009; Zhou and Esterle, 2008; Zhou and O'Brien, 2016). The empirical models represent the correlations among petrophysical parameters obtained by testing sampled cores in the laboratory, for example, Gardner's equation, Faust's equation, and Han's equation (Faust, 1953; Gardner et al., 1974; Mavko et al., 2009). With these equations, we can transform density to P-velocity, resistivity to P-velocity, and P-velocity to S-velocity. However, empirical correlations depend on the tested cores and the experimental conditions. Without appropriate calibration, it is risky to apply them to CBM reservoirs. A neural network is a multi-layer network designed according to the error back-propagation algorithm (Rumelhart and McClelland, 1986). With a suitable network design, neural networks provide a highly nonlinear mapping between inputs and outputs and can handle various kinds of log reconstruction and parameter estimation (Goutorbe et al., 2006; Hatherly et al., 2016; Rolon et al., 2009; Siregar et al., 2017; Zhou and Esterle, 2008; Zhou and O'Brien, 2016). However, their convergence is slow, and they are prone to becoming trapped in a local minimum. Multivariate regression is a statistical method that establishes a linear or nonlinear empirical relationship between multiple independent variables and a dependent variable, and it can be applied easily in practice. However, the inputs are not optimized and probably contain redundant information (Lorenzen, 2018; Wang et al., 2017).
At present, many methods have been proposed to improve the reconstruction reliability and accuracy of wireline logs. One method uses filtering or multiscale wavelet analysis (MWA) to denoise wireline logs (Chen et al., 2018;Honorio et al., 2012;Yu et al., 2010). Others use histogram calibration to improve the lateral comparability of logs since logging instruments and logging environments are not often the same (Quartero et al., 2014;Ren et al., 2016). Traditionally, multivariate regression analysis can be used to uncover the statistical correlation between multiple independent variables and a single dependent variable. However, information on wireline logs is redundant. In geoscience, the principal component analysis (PCA) is very often used for dimensional reduction (Scheevel and Payrazyan, 2001;Wang et al., 2017).
In this paper, we propose a principal component regression (PCR) model incorporating an MWA, a histogram calibration, a PCA, and a multivariate regression to reconstruct key logs for CBM-reservoir evaluations.

PCR model
The PCR workflow is shown in Figure 1. The denoising and histogram standardization steps are expected to improve the reliability and lateral comparability of wireline logs; the PCA is expected to transform the well-correlated wireline logs into linearly independent components; the multivariate regression is expected to generalize a reconstruction function of reference borehole to other boreholes.

Denoising with MWA
The signals in wireline logs include high-frequency, medium-frequency, and low-frequency components. The low-frequency and medium-frequency components are mostly useful signals (Chen et al., 2018; Honorio et al., 2012; Yu et al., 2010). In contrast, the high-frequency component is mostly noise.
MWA is a method to analyze the time-frequency characteristics of a signal (Chen et al., 2018). A given time-domain signal $f(t)$ can be transformed into the time-frequency domain as

$$W_f(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{+\infty} f(t)\, w\!\left(\frac{t - b}{a}\right) dt \qquad (1)$$

where $a$ is the scale factor, $b$ is the shift factor, $t$ is time, and $w(t)$ is a (real-valued) mother wavelet. Normally, we use the orthogonal Daubechies wavelet (e.g., db8) to process log data (Chen et al., 2018). We can decompose the input data into small-scale (high-frequency), medium-scale (medium-frequency), and large-scale (low-frequency) components. We can then attenuate the high-frequency noise by reconstructing the input data from the medium-scale and large-scale components only.
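The decompose, discard, and reconstruct idea can be sketched in a few lines. This is only an illustration under stated assumptions: it swaps in the Haar wavelet for db8 so that it needs no external library, the function names are hypothetical, and the sample values are made up rather than taken from the study.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_decompose(signal, levels):
    """Return (approximation, details); details[0] is the finest (smallest) scale."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details

def haar_reconstruct(approx, details):
    """Invert haar_decompose, going from the coarsest detail band to the finest."""
    approx = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for s, d in zip(approx, detail):
            rebuilt.append((s + d) / SQRT2)
            rebuilt.append((s - d) / SQRT2)
        approx = rebuilt
    return approx

def denoise(signal, levels=3, drop_finest=1):
    """Zero the `drop_finest` smallest-scale detail bands, then rebuild the log."""
    approx, details = haar_decompose(signal, levels)
    for k in range(min(drop_finest, levels)):
        details[k] = [0.0] * len(details[k])
    return haar_reconstruct(approx, details)

# Hypothetical log samples (not data from the study).
log = [2.0, 2.2, 1.9, 2.1, 5.0, 5.3, 4.8, 5.1]
smooth = denoise(log, levels=2, drop_finest=1)
```

With the Haar wavelet, dropping the finest band replaces each pair of neighbouring samples by their average; a smoother wavelet such as db8 would attenuate the same band without the blocky result.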

Standardizing with histogram calibration
After denoising, it is necessary to standardize logs and calibrate the influences of instrumental and environmental factors between different wells (Quartero et al., 2014;Ren et al., 2016).
In most cases, the strata around the coalbed in China are sandstone, siltstone, and mudstone. For a given CBM area, logging responses around reservoirs are believed to be similar from borehole to borehole. Consequently, the histograms of wireline logs are also similar from borehole to borehole. Therefore, we can use histogram calibration to standardize wireline logs.
In general, the standardization procedure includes three steps. First, we choose a reference borehole, which should have a relatively complete logging suite. Second, we compare the log histogram of a borehole with the log histogram of the reference borehole to determine the differences in mean and range between the two histograms. Finally, we calibrate the wireline log of the borehole to the reference borehole by shifting the mean and scaling the range of the input log. After these steps, the influences of borehole environments are removed and the lateral comparability of logs among boreholes is improved.
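A minimal sketch of the shift-and-scale step follows. It assumes that "range" means maximum minus minimum and that the scaling is applied about the mean; a percentile-based range would be an equally plausible reading. The function name and the GR values are hypothetical.

```python
def calibrate_to_reference(log, reference):
    """Shift `log` to the reference mean and scale it to the reference range."""
    mean_log = sum(log) / len(log)
    mean_ref = sum(reference) / len(reference)
    range_log = max(log) - min(log)
    range_ref = max(reference) - min(reference)
    if range_log == 0:
        raise ValueError("input log is constant; its range cannot be scaled")
    scale = range_ref / range_log
    # Scaling about the mean leaves the shifted mean unchanged.
    return [mean_ref + (value - mean_log) * scale for value in log]

# Hypothetical GR readings (API units), not data from the study.
gr_reference = [40.0, 60.0, 80.0, 100.0, 120.0]
gr_other = [55.0, 65.0, 75.0, 85.0, 95.0]
gr_calibrated = calibrate_to_reference(gr_other, gr_reference)
```

After calibration the second log has the same mean and the same maximum-minus-minimum range as the reference, while its shape is preserved.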

Calculating PCs of wireline logs
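The primary purpose of PCA is to express most of the variation in the observed data with fewer independent variables (Barrash and Morin, 1997; Ren et al., 2018). For a well with $n$ types of log data and $m$ samples, we can express the original log data as a matrix

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \qquad (2)$$

Each log is standardized as

$$x^{*}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{S_j}} \qquad (3)$$

where $\bar{x}_j$ is the mean and $S_j$ is the variance of the $j$-th log. Defining the correlation coefficient matrix $R$ as

$$R = (r_{jk})_{n \times n}, \qquad r_{jk} = \frac{\sum_{i=1}^{m} x^{*}_{ij} x^{*}_{ik}}{\sqrt{\sum_{i=1}^{m} (x^{*}_{ij})^{2} \sum_{i=1}^{m} (x^{*}_{ik})^{2}}} \qquad (4)$$

we can calculate the eigenvalues of $R$ and sort them as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. The corresponding eigenvectors are the principal components (PCs). If the sum of the first $p$ eigenvalues is greater than 0.85 times the sum of all eigenvalues, the cumulative contribution of the first $p$ PCs is greater than 85%, and we use the first $p$ PCs to represent the information of all logs.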
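The selection rule can be illustrated with a short sketch. It picks the smallest number of PCs whose cumulative contribution exceeds the threshold, which is one reasonable reading of the rule; the function name and the eigenvalues are hypothetical and are not the values of the study.

```python
def select_components(eigenvalues, threshold=0.85):
    """Return (p, contributions): the smallest p whose cumulative share exceeds `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    contributions = [value / total for value in ordered]
    cumulative = 0.0
    for p, share in enumerate(contributions, start=1):
        cumulative += share
        if cumulative > threshold:
            return p, contributions
    return len(ordered), contributions

# Hypothetical eigenvalues of a 5 x 5 correlation matrix (they sum to 5).
p, shares = select_components([2.5, 1.5, 0.6, 0.3, 0.1])
```

Here the first two PCs explain 80% of the variation, so a third is needed to pass the 85% threshold.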

Multivariate regression of PCs
In the literature, nonlinear regression methods, such as RBF (radial basis function) neural networks, generalized regression neural networks, and fuzzy neural networks, have been used to estimate parameters from multiple geophysical logs (Goutorbe et al., 2006; Rolon et al., 2009; Siregar et al., 2017; Zhou and O'Brien, 2016). Because the PCs are linearly independent, however, we use linear regression and estimate the regression coefficients of the independent variables with the least-squares method (Lorenzen, 2018). Here, we input the selected first $p$ PCs and express the multivariate regression as

$$y_j = b_0 + \sum_{i=1}^{p} b_i C_{ij} \qquad (5)$$

where $y_j$ is the target of the reconstructed log, $j$ is the sample number, $C_{ij}$ is the $i$-th principal component at sample $j$, and $b_i$ is the regression coefficient. After the multivariate regression, we obtain the PCR model to reconstruct critical wireline logs.
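One consequence of regressing on PCs is worth illustrating. If the PC score columns are mean-centred and mutually orthogonal (an assumption that holds for PCs of standardized logs), the least-squares problem decouples: the intercept is the mean of the target and each coefficient comes from a single projection. The sketch below uses hypothetical names and made-up scores.

```python
def fit_pc_regression(pc_columns, target):
    """Least-squares fit of target = b0 + sum_i b_i * C_i.

    Assumes every column in `pc_columns` has zero mean and that the columns are
    mutually orthogonal, so the normal equations are diagonal.
    """
    b0 = sum(target) / len(target)
    coefficients = []
    for column in pc_columns:
        projection = sum(c * y for c, y in zip(column, target))
        energy = sum(c * c for c in column)
        coefficients.append(projection / energy)
    return b0, coefficients

def predict(b0, coefficients, pc_columns):
    """Evaluate the fitted function sample by sample."""
    n_samples = len(pc_columns[0])
    return [
        b0 + sum(b * column[j] for b, column in zip(coefficients, pc_columns))
        for j in range(n_samples)
    ]

# Made-up, zero-mean, orthogonal score columns and a target built from them.
c1 = [1.0, -1.0, 1.0, -1.0]
c2 = [1.0, 1.0, -1.0, -1.0]
y = [10.0 + 2.0 * a - 3.0 * b for a, b in zip(c1, c2)]
b0, b = fit_pc_regression([c1, c2], y)
y_hat = predict(b0, b, [c1, c2])
```

For correlated raw logs this shortcut does not apply, and a general least-squares solver would be needed instead.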

Geological settings
The study area is located in the southeastern Qinshui basin, Shanxi province, which is one of the key CBM blocks in China. The Qinshui basin experienced three main tectonic movements in geological time, i.e., the long-lasting, slight N-S compression in the Indosinian period (250 Ma), the short, strong NW-SE compression in the Yanshanian period (208 Ma), and the short, slight compression in the Himalayan period (65 Ma) (Huang et al., 2019). The strong compression in the Yanshanian period had the most critical influence on the tectonic forms. As a result, the basin formed a large, bilaterally symmetric synclinorium. In general, the area is a NE-SW plunging syncline with gentle and parallel secondary folds, as shown in Figure 2. The strata are relatively flat, with an average dip of 4°. The primary coal-bearing strata are the Carboniferous No. 15 coal (Taiyuan Formation) and the Permian No. 3 coal (Shanxi Formation) (Peng et al., 2017; Ren et al., 2018). The thickness of the No. 3 coal is 1 to 6 m (3 m on average), and the thickness of the No. 15 coal is 3 to 7 m (5 m on average). These two main coal seams are widely distributed over the whole area with favorable CBM reserve conditions, i.e., moderate burial depth, large thickness, and high gas content (Peng et al., 2017; Zhang et al., 2013). Because the No. 15 coal is deep, the current CBM target is the No. 3 coal. Therefore, we use the No. 3 coal as an example in our study.

Log denoising
We apply the MWA to the original log data, as shown in Figure 3. In general, the large-scale component indicates the lithology trend. Following the denoising procedure described above, we combine the medium-scale and large-scale components to form the de-noised log, as shown in Figure 3(a). Compared with the original log (dark curve), the de-noised log (gray dashed curve) has a similar amplitude response but is smoother. Applying the denoising procedure to all wireline logs, we obtain high-quality de-noised log data for the study area.

Log standardization
Since well A is the only well with a complete logging suite in the study area, we treat it as the reference to calibrate the logs of the other boreholes. As an example, we use the gamma ray (GR) log to illustrate the calibration process. We first plot the histograms of the GR logs of well A and well B, as shown in Figure 4(a) and (b). These two histograms differ not only in mean but also in variation range. To remove these differences, we then shift the GR log of well B so that it has the same mean as well A, as shown in Figure 4(c). Finally, we scale the shifted GR log of well B so that it has the same variation range as well A, as shown in Figure 4(d). Consequently, the log data of wells A and B have the same mean and variation range. This standardization process increases the lateral comparability between wells. After the standardization, we compare the standardized logs with the de-noised logs of well B, as shown in Figure 5. In general, the data after standardization have the same shapes and trends as the de-noised logs. For the density (DEN) log and the resistivity (RD) log, the data before and after standardization are almost identical, indicating that the instrumental and environmental influences on these data are negligible. For the GR log, the data before and after standardization differ in scale and mean, indicating that the instrumental and environmental influences on these data are significant. For the SP log, the data before and after standardization differ considerably in mean and slightly in scale, indicating that the instrumental and environmental influences on them are also significant.
For the caliper (CAL) log, the data record the actual size of the borehole, but drilling conditions differ from borehole to borehole, which causes lateral CAL variation. In the PCR model described in this paper, the regression coefficients of the reference borehole are used to reconstruct the compensated neutron log (CNL) and the acoustic, or sonic, log (AC) of the other boreholes. It is therefore necessary to ensure that the data of the other boreholes have a mean and scale similar to those of the reference borehole. Hence, we standardize the CAL log along with the other wireline logs before the CNL and AC reconstructions. Here, the standardized CAL log differs considerably from the de-noised log in mean and scale, as shown in Figure 5.
In summary, the instrumental and environmental influences on most log data are apparent, and the process of standardization is not optional but a necessary step. After the standardization, the logs in the study area can be cross-compared confidently.

PCA analysis
In the study area, only well A has the complete logging suite, i.e., neutron log (CNL), sonic log (AC), density log (DEN), spontaneous potential log (SP), resistivity log (RD), gamma ray log (GR), and caliper log (CAL). In contrast, other wells only have conventional logs, i.e., the DEN, SP, RD, GR, and CAL logs. Since the CNL is a direct indicator of gas content and the AC log can be used to predict porosity and pore pressure, it is essential to reconstruct the CNL and AC log for the other wells.
Since all wells have the DEN, SP, RD, GR, and CAL logs, we first use these logs as input to compute their cross-correlation coefficients. For well A, the correlation coefficients are shown in Table 1. As expected, most logs are well correlated. If we used these correlated logs directly to fit a reconstruction function for the CNL and AC logs, overfitting would be expected. We use the PCR model described in this paper to remove the redundant information between well logs.
First, we normalize the logs to the range of [-1, 1]. Second, we use the Jacobi method to calculate the PCs of the input logs (Ren et al., 2018; Wang et al., 2017). For well A, we list the calculated eigenvalues, the variance contributions, and the cumulative contribution rate in Table 2. The variance contributions differ greatly among the PCs. For the first PC, the variance contribution is 0.75 (accounting for 75% of the variation); for the last PC, the variance contribution is 0.02 (accounting for 2% of the variation). The cumulative variance contribution of the first three PCs is 95%, which exceeds the 85% threshold. Therefore, the first three PCs can represent most of the information in the input logs. Finally, we use the first three PCs to fit an empirical function to reconstruct the CNL and AC logs. Since the standardized logs of GR, DEN, and $\log_{10}$(RD) contribute significantly to the PCs, we cross-plot them (Figure 6(a)) and compare them with the cross-plot of the first three PCs (Figure 6(b)). The cross-plot of the GR, DEN, and $\log_{10}$(RD) logs is poorly separated, but the cross-plot of the first three PCs is well separated. After the PCA, the PCs are linearly independent. Since all logs from different boreholes are de-noised, standardized, and normalized before PCA, the characteristics of the eigenvalues and PCs of all wells in the study area are expected to be similar. We use the calculated PCs to perform the regression and reconstruct the CNL and AC logs.
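For readers unfamiliar with the Jacobi method mentioned above, the following is a minimal sketch of the classical cyclic Jacobi eigenvalue iteration for a symmetric matrix. It is a textbook variant written for illustration, not necessarily the implementation used in the study; the function name and the example matrix are hypothetical.

```python
import math

def jacobi_eigen(matrix, tol=1e-12, max_sweeps=50):
    """Eigen-decomposition of a symmetric matrix by cyclic Jacobi rotations.

    Returns (eigenvalues, eigenvectors), sorted by decreasing eigenvalue;
    eigenvectors[i] is the unit eigenvector belonging to eigenvalues[i].
    """
    n = len(matrix)
    a = [list(row) for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle (|angle| <= pi/4) that zeroes a[p][q].
                tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(tau * tau + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(n):  # A <- A J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # A <- J^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
                for k in range(n):  # V <- V J accumulates the eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p] = c * vkp - s * vkq
                    v[k][q] = s * vkp + c * vkq
    order = sorted(range(n), key=lambda i: a[i][i], reverse=True)
    eigenvalues = [a[i][i] for i in order]
    eigenvectors = [[v[k][i] for k in range(n)] for i in order]
    return eigenvalues, eigenvectors

# Hypothetical 3 x 3 correlation matrix (not Table 1 of the study).
R = [[1.0, 0.8, 0.6], [0.8, 1.0, 0.5], [0.6, 0.5, 1.0]]
values, vectors = jacobi_eigen(R)
```

Because a correlation matrix has ones on its diagonal, the eigenvalues sum to the number of logs, which gives a quick sanity check on the result.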

CNL reconstruction
The lack of CNLs makes CBM interpretation and evaluation difficult. In this case, we use the proposed PCR model to regress a local reconstruction function as follows

$$C = 25.75 - 3.18\,C_1 - 19.27\,C_2 - 1.30\,C_3 \qquad (6)$$

where $C$ is the reconstructed CNL, $C_1$ is the first PC, $C_2$ is the second PC, and $C_3$ is the third PC.
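Applying the fitted function is a direct evaluation at each depth sample. The sketch below uses the coefficients of the local CNL reconstruction function given above; the function name and the PC score values are hypothetical, and the coefficients are local to this study area and should not be transferred elsewhere without refitting.

```python
# Coefficients of the local CNL reconstruction function (intercept, then PC1..PC3).
CNL_COEFFICIENTS = (25.75, -3.18, -19.27, -1.30)

def reconstruct_cnl(c1, c2, c3):
    """Evaluate the reconstruction function for one sample of the first three PCs."""
    b0, b1, b2, b3 = CNL_COEFFICIENTS
    return b0 + b1 * c1 + b2 * c2 + b3 * c3

# Made-up PC scores for three depth samples: (PC1, PC2, PC3).
scores = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, -0.5, 0.2)]
cnl = [reconstruct_cnl(*sample) for sample in scores]
```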
To monitor the reconstruction quality, we plot the logs of well A in Figure 7 and cross-plot the measured and reconstructed CNLs in Figure 8. In Figure 7, the formation between 560 m and 568 m is the No. 3 coal, where the DEN, GR, and SP logs have low responses, but the $\log_{10}$(RD), CAL, and CNL logs have high responses. The measured and reconstructed CNLs are generally similar in trend, mean, and scale. In Figure 8, the reconstructed and measured CNLs are almost linearly correlated, and the correlation coefficient is high (approximately 0.82). These are direct demonstrations of high reconstruction quality. Therefore, the proposed reconstruction method for the CNL is good enough for well A. Because all logs of the other boreholes are standardized to well A, we use the same reconstruction function of well A to reconstruct the CNLs for all boreholes.
For a dry CBM reservoir, the CBM content is proportional to the CNL response and the coalbed thickness. Therefore, we define a relative gas content $V_g$, as shown in equation (7), to express the CBM content of the coalbed: the thickness-weighted sum of the CNL response over the coalbed, $\sum_{i=1}^{N} C_i\,\Delta h$, scaled by a calibration coefficient $b$. Here, $C_i$ is the reconstructed CNL response; $\Delta h$ is the sample interval of the logs, m; $N$ is the number of log samples within the coalbed; and $b$ is the calibration coefficient.
Well A has the complete logging suite and the actual measured CBM content (19.2 m³/t). We compute the gas content of the study area with equation (7). First, we calculate the calibration coefficient $b$ (13.3) by normalizing the sum of the CNL response to the measured CBM content of well A. Second, we use equation (7) to calculate the gas content for the other boreholes. Finally, we interpolate the gas content distribution of the No. 3 coal in the study area, as shown in Figure 9(a).
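The calibration and prediction steps can be sketched as follows. The sketch assumes that the scaling is a division, i.e., that the relative gas content equals the thickness-weighted CNL sum divided by $b$; this reading is an assumption, although it is numerically consistent with the reported $b$ of about 13.3, the measured 19.2 m³/t, and an 8 m coal interval. The function names and all CNL samples below are made up.

```python
def cnl_thickness_sum(cnl_samples, dh):
    """Thickness-weighted sum of the CNL response inside the coalbed."""
    return sum(c * dh for c in cnl_samples)

def calibrate_b(cnl_reference, dh, measured_content):
    """Choose b so that the reference well reproduces its measured gas content.

    Assumes relative gas content = (thickness-weighted CNL sum) / b.
    """
    return cnl_thickness_sum(cnl_reference, dh) / measured_content

def relative_gas_content(cnl_samples, dh, b):
    """Relative gas content of a borehole under the same assumption."""
    return cnl_thickness_sum(cnl_samples, dh) / b

# Made-up CNL samples at a 0.5 m interval: an 8 m coalbed in the reference
# well and a 6 m coalbed in another well.
reference_cnl = [32.0] * 16
other_cnl = [30.0] * 12
b = calibrate_b(reference_cnl, 0.5, 19.2)
reference_content = relative_gas_content(reference_cnl, 0.5, b)
other_content = relative_gas_content(other_cnl, 0.5, b)
```

Because $b$ is fixed from a single reference well, any error in that well's measured content propagates proportionally to every other borehole.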
In general, the gas content in the study area is relatively high (15.5-22.5 m³/t), but the distribution is not uniform. The top-middle and bottom-middle locations have the highest gas content.
Energy Exploration & Exploitation 38(4)
Comparing the gas content map with Figure 2, we find that the locations with a high gas content are located in the synclines, and this feature follows the regional characteristics of CBM enrichment (Peng et al., 2017; Teng et al., 2015; Zhang et al., 2013). Comparing Figure 9(a) with Figure 9(b), we find that the two maps are similar. The top-left corner and the middle area have the highest values; the middle-right area has the lowest values; and the transition between the high-value and low-value areas is sharp. These are additional demonstrations of the high reconstruction quality, although the factors influencing CBM production are complicated (Chen et al., 2019; Huang et al., 2019; Li and Wu, 2016; Song et al., 2005).
In summary, the proposed PCR model for the reconstruction of the CNL is relatively reliable and applicable in the study area.

AC log reconstruction
Similar to the CNL, the AC log is a crucial input for porosity evaluation and pore pressure prediction (Mavko et al., 2009; Zhang, 2011, 2013). Besides well A, no boreholes in the study area have AC logs. We use the measured AC log and the calculated PCs from well A to fit the AC log with multivariate regression. The local reconstruction function is as follows

$$A = 263.64 - 43.10\,C_1 - 100.13\,C_2 - 8.91\,C_3 \qquad (8)$$

where $A$ is the reconstructed AC log, $C_1$ is the first PC, $C_2$ is the second PC, and $C_3$ is the third PC. We plot the logs of well A in Figure 7 and cross-plot the measured and reconstructed AC logs in Figure 10. In Figure 7, the measured and reconstructed AC logs are generally similar in trend, mean, and scale. In the target coalbed (No. 3 coal), the measured and the reconstructed AC logs are almost identical. In Figure 10, the reconstructed and measured AC logs