Estimation of Hazard Functions in the Log-Linear Age-Period-Cohort Model: Application to Lung Cancer Risk Associated with Geographical Area

An efficient computing procedure for estimating the age-specific hazard functions by the log-linear age-period-cohort (LLAPC) model is proposed. This procedure accounts for the influence of time period and birth cohort effects on the distribution of age-specific cancer incidence rates and estimates the hazard function for populations with different exposures to a given categorical risk factor. For these populations, the ratio of the corresponding age-specific hazard functions is proposed for use as a measure of relative hazard. This procedure was used for estimating the risks of lung cancer (LC) for populations living in different geographical areas. For this purpose, the LC incidence rates in white men and women, in three geographical areas (namely: San Francisco-Oakland, Connecticut and Detroit), collected from the SEER 9 database during 1975–2004, were utilized. It was found that in white men the averaged relative hazard (an average of the relative hazards over all ages) of LC in Connecticut vs. San Francisco-Oakland is 1.31 ± 0.02, while in Detroit vs. San Francisco-Oakland this averaged relative hazard is 1.53 ± 0.02. In white women, analogous hazards in Connecticut vs. San Francisco-Oakland and Detroit vs. San Francisco-Oakland are 1.22 ± 0.02 and 1.32 ± 0.02, correspondingly. The proposed computing procedure can be used for assessing hazard functions for other categorical risk factors, such as gender, race, lifestyle, diet, obesity, etc.


Introduction
In cancer epidemiology, the risk of developing cancer at a given age, t, is evaluated by the age-specific incidence rate, I(t), defined as the number of cases of a particular type of cancer per 100,000 population. Along with age, race and gender, as well as time period and birth cohort effects [1-4], incidence rates also depend on other risk factors, such as geographical area, dietary factors, lifestyle habits, etc., which can be viewed as categorical variables.
During the last 50 years, finding a direct relationship between the observed incidence rates and the risk factors determining these rates has been one of the main challenges of cancer epidemiology. Some progress in solving this problem has been achieved by the use of the log-linear model [5,6]. The log-linear age-period-cohort (LLAPC) model is used to account for age, time period and birth cohort effects [7-10]. According to this model, an age-specific incidence rate of a cancer can be presented as a product of the time period and birth cohort coefficients and an unknown age-specific hazard function, i.e. a function describing the risk of developing the cancer at a given age. Recently [11], we extended the LLAPC model to cases in which the mathematical form of the hazard function is unknown and proposed a novel computational procedure that separates the problem of estimating the time period and birth cohort coefficients from the problem of estimating the unknown hazard function.
In the present work, we extend the use of the LLAPC model to characterizing unknown hazard functions for populations with different exposures to categorical risk factors (different categories of a categorical variable). In our model, the dissimilarity in exposure is presented by different descriptive categories of the corresponding categorical variable.
The proposed procedure was used for estimating the age-specific hazard functions of lung cancer (LC) for gender- and race-specific populations living in different geographical areas. For this purpose, we utilized data on LC incidence rates observed in white men and women in three geographical areas (namely: San Francisco-Oakland, Connecticut and Detroit), collected during 1975-2004. The estimates were obtained from the observed cancer incidence rates after these rates had been corrected for time period and birth cohort effects by the approach described in [11]. We have found that the LC hazard functions associated with living in these geographical areas have different amplitudes, but the overall shape of these functions is very similar. We have shown that geographical area risk factors influence the LC age-specific hazard functions in approximately the same manner at all ages.
Thus, in this work we provide a proof-of-concept that the proposed computing procedure can be successfully applied for estimating the influences of categorical risk factors on the hazard functions for a particular type of cancer.

Log-Linear Age-Period-Cohort Model
According to the LLAPC model of cancer presentation in aging, the observed incidence rates can be expressed as the product of unknown coefficients of the time period and birth cohort effects and an unknown hazard function. This function represents the risk of developing cancer at a given age, independent of the time period and birth cohort effects. Until recently, the use of this model in cancer epidemiology was limited to cases in which the mathematical form of the hazard function is known a priori (for instance, the form of the hazard function can be taken from a biological model of cancer development) [8], although the parameters of this function can be unknown. In this case, the time period coefficients, $v_j$, the birth cohort coefficients, $u_l$, and the parameters of the given hazard function, $h(t_i)$, can be derived by solving the following system of conditional equations:

$$I_{i,j}(t_i) = v_j\, u_l\, h(t_i). \tag{1}$$

In (1), $I_{i,j}(t_i)$ is the observed incidence rate in the i-th age interval ($t_i$ denotes the midpoint of this interval) and in the j-th time period interval, while the index l indicates the birth cohort interval (note that l is defined by the indices i and j) [11]. The problem is to derive the time period and birth cohort coefficients, as well as the parameters of the hazard function, from the incidence rates observed during the given set of time periods. The main obstacle in solving this problem is that multiple estimators of the time period and birth cohort coefficients can provide equally good solutions [1-4]. This means that, to determine these coefficients, the identifiability problem has to be overcome.
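The identifiability problem can be illustrated numerically. The following is a minimal sketch, not the procedure of [11]: it builds incidence rates as the product of a period coefficient, a cohort coefficient and a hazard value, and then shows that a second, different set of coefficients reproduces exactly the same rates. All numeric values, the array sizes, and the 0-based cohort indexing `l = (n_age - 1 - i) + j` are assumptions chosen for illustration.

```python
import numpy as np

n_age, n_per = 11, 6                          # assumed numbers of age intervals and time periods
i = np.arange(n_age)[:, None]                 # 0-based age index (column vector)
j = np.arange(n_per)[None, :]                 # 0-based period index (row vector)
l = (n_age - 1 - i) + j                       # 0-based cohort index, fixed by i and j

rng = np.random.default_rng(0)
h = np.exp(np.linspace(2.0, 6.0, n_age))      # made-up hazard values
v = rng.uniform(0.8, 1.2, n_per)              # made-up period coefficients
u = rng.uniform(0.8, 1.2, n_age + n_per - 1)  # made-up cohort coefficients


def rates(h, v, u):
    """Incidence rates I[i, j] = v[j] * u[l(i, j)] * h[i]."""
    return h[:, None] * v[None, :] * u[l]


# Tilt the period and cohort coefficients in opposite directions and
# compensate in the hazard: the three exponents cancel in every cell.
a = 1.07
v2 = v * a ** np.arange(n_per)
u2 = u * a ** (-np.arange(u.size))
h2 = h * a ** (n_age - 1 - np.arange(n_age))

same_rates = np.allclose(rates(h, v, u), rates(h2, v2, u2))
print(same_rates)
```

Because both parameter sets fit the data equally well, some extra constraint (anchoring, or an assumption about neighboring cohorts) is needed before the coefficients can be interpreted.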
In practice, the identifiability problem can be solved by making additional assumptions. For instance, in [8] this problem was solved by assuming that within each age interval the observed cancer cases have a Poisson distribution and that the mathematical form of the hazard function is given a priori. The unknown parameters were fitted within the LLAPC model by the maximum likelihood method, which yielded the birth cohort and time period effect coefficients as well as the parameters of the hazard function. An initial assumption that the cohort effect is absent was used at the beginning of the iteration process to determine the birth cohort and time period effect coefficients. These coefficients were estimated by anchoring one time period coefficient (v = 1) and one birth cohort effect coefficient (u = 1). Thus, the results obtained by this procedure depend on the hazard function used, and on the time period and cohort to which the coefficients are anchored.
Recently, in [11], we extended the LLAPC model of cancer presentation in aging to cases in which the mathematical form of the hazard function is unknown. In contrast to the previously used methods, the simple, computationally efficient method of [11] provides estimates of the time period and birth cohort coefficients without any a priori knowledge of the hazard function. The only assumption used in that method is that the cohort effect coefficients of neighboring cohorts are nearly the same. Thus, the estimates of the birth cohort and time period effect coefficients obtained by the method of [11] depend only on the time period and cohort to which the coefficients are anchored, but not on the unknown hazard function. This allows one to separate the problem of estimating the time period and birth cohort coefficients from the problem of estimating the unknown hazard function. Moreover, as we show below, the procedure of [11] allows one to estimate the age-specific hazard function defined by a given categorical risk factor.

Estimation of Hazard Functions in the LLAPC Model
Let us denote by $I_{i,j,c}(t_i)$ the observed incidence rates of cancer within a population exposed to the given categorical risk factor, presented by a set of descriptive categories (indexes), c, of a given categorical variable. In such cases, the LLAPC model can be presented by the conditional equations:

$$I_{i,j,c}(t_i) = v_{j,c}\, u_{l,c}\, h_c(t_i). \tag{2}$$

Here, $v_{j,c}$ and $u_{l,c}$ are the time period and birth cohort effect coefficients for the population exposed to the given category, c, of the considered risk factor. In practice, the categories might be encoded as 0, 1, 2, etc.
As can be seen from (2), the hazard function depends not only on age but also on the category, c. By using our procedure [11], one can obtain the estimates of the time period and birth cohort coefficients, $v^*_{j,c}$ and $u^*_{l,c}$, and their standard errors, $SE(v^*_{j,c})$ and $SE(u^*_{l,c})$ (here and below the asterisk denotes estimates, as well as estimators). Again, a distinguishing feature of the procedure of [11] is that these estimates are obtained without using any information on the hazard function, $h_c(t_i)$.
Using the obtained estimates of the time period and birth cohort coefficients, $v^*_{j,c}$ and $u^*_{l,c}$, the observed incidence rates can be corrected for these effects in the following way:

$$I^*_{i,j,c}(t_i) = \frac{I_{i,j,c}(t_i)}{v^*_{j,c}\, u^*_{l,c}}. \tag{3}$$

In the calculations we use only the incidence rates for which the number of cases is larger than 15. Therefore, the normal distribution (instead of the commonly used Poisson distribution) can be used to characterize the error distributions of the incidence rates [12]. It can be shown that when the coefficients of variation of $I_{i,j,c}(t_i)$, $v^*_{j,c}$ and $u^*_{l,c}$ are small, the incidence rates $I^*_{i,j,c}(t_i)$, corrected by formula (3), will be normally distributed. This proposition can be proven in a way analogous to the one presented in [11] for analyzing the error distribution of the ratio of two observed incidence rates.
According to the standard rules of error propagation [13], the square of the standard error of $I^*_{i,j,c}(t_i)$, given by (3), can be calculated by the following formula:

$$SE^2[I^*_{i,j,c}(t_i)] = \frac{1}{(v^*_{j,c}\, u^*_{l,c})^2}\, SE^2[I_{i,j,c}(t_i)] + \frac{I_{i,j,c}^2(t_i)}{(v^*_{j,c})^4\, (u^*_{l,c})^2}\, SE^2(v^*_{j,c}) + \frac{I_{i,j,c}^2(t_i)}{(v^*_{j,c})^2\, (u^*_{l,c})^4}\, SE^2(u^*_{l,c}), \tag{4}$$

where the coefficients before the squares of the standard errors are the squares of the partial derivatives of $I^*_c$ with respect to $I_c$, $v^*_c$ and $u^*_c$, respectively. From (2) and (3) one can obtain the following system of conditional equations:

$$I^*_{i,j,c}(t_i) = h_c(t_i), \quad i = 1, \dots, n, \quad j = 1, \dots, m. \tag{5}$$

From (5) it can be seen that for assessing the value of the hazard function, $h_c(t_i)$, in each i-th age interval there are m conditional equations. Therefore, for estimating n values (corresponding to the n age intervals) of the hazard function there are n × m conditional equations (5). To solve the system (5), a least squares method can be used [14]:

$$h^*_c(t_i) = \frac{\sum_{j=1}^{m} w_{i,j}\, I^*_{i,j,c}(t_i)}{\sum_{j=1}^{m} w_{i,j}}. \tag{6}$$

In (6), the weights, $w_{i,j}$, are given as reciprocals of the squares of the standard errors of the $I^*_{i,j,c}(t_i)$ given by formula (4). The squared standard errors of the corresponding estimates, $SE^2[h^*_c(t_i)]$, can be easily obtained from (6):

$$SE^2[h^*_c(t_i)] = \frac{1}{\sum_{j=1}^{m} w_{i,j}}. \tag{7}$$

(Note that when the variables on the left side of the conditional equations (5) are normally distributed with known standard errors, the least squares estimators, $h^*_c(t_i)$, will also be normally distributed.) From (3)-(4) and (6)-(7) it follows that the estimates, $h^*_c(t_i)$, and their SE can be calculated from the observed incidence rates, $I_{i,j,c}(t_i)$, and the estimates of the coefficients, $v^*_{j,c}$ and $u^*_{l,c}$. As noted in [11], the estimates of the coefficients $v^*_{j,c}$ and $u^*_{l,c}$ depend on the time period and cohort to which the coefficients are anchored (i.e. on the time period and birth cohort to which adjustments are made). Therefore, the hazard functions of populations differently exposed to the considered risk factor (see below) can be compared only when the same anchors are used.
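The estimation steps described above (correcting the rates, propagating their errors, and combining the corrected rates over time periods by weighted least squares) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the coefficient arrays are assumed to be already broadcast to the shape of the rate table (one row per age interval, one column per time period), the errors of the three inputs are treated as independent, and the toy numbers are made up.

```python
import numpy as np


def corrected_rates(I, se_I, v, se_v, u, se_u):
    """Divide rates by the period and cohort coefficients and propagate SE.

    All arguments are arrays of shape (n_age, n_period); v and u hold the
    coefficient that applies to each (age, period) cell.
    """
    I_star = I / (v * u)
    # Squared partial derivatives of I/(v*u) times the squared SE of each input.
    var = ((se_I / (v * u)) ** 2
           + (I * se_v / (v ** 2 * u)) ** 2
           + (I * se_u / (v * u ** 2)) ** 2)
    return I_star, np.sqrt(var)


def hazard_estimate(I_star, se_I_star):
    """Weighted least squares estimate of a constant per age interval."""
    w = 1.0 / se_I_star ** 2                       # weights = 1 / SE^2
    h = (w * I_star).sum(axis=1) / w.sum(axis=1)   # weighted mean over periods
    se_h = np.sqrt(1.0 / w.sum(axis=1))
    return h, se_h


# Toy example: one age interval observed in two time periods.
I = np.array([[100.0, 120.0]])
se_I = np.array([[5.0, 6.0]])
v = np.array([[1.0, 1.2]])        # made-up period coefficients
u = np.ones_like(I)               # made-up cohort coefficients
zero = np.zeros_like(I)           # coefficient SE set to zero for simplicity

I_star, se_star = corrected_rates(I, se_I, v, zero, u, zero)
h, se_h = hazard_estimate(I_star, se_star)
print(h, se_h)
```

In real data some cells would be excluded (for example, those with too few cases); one simple way to handle that, not shown here, is to give such cells zero weight.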

Estimation of the Ratio of Hazard Functions
For populations with different exposures to the considered risk factor, the ratios of the corresponding age-specific hazard functions can be used as a measure of relative hazard. Let us denote by $h^*_0(t_i)$ and $h^*_1(t_i)$ the estimates of the hazard functions corresponding to two categories, coded as 0 and 1. Then, at a given age interval, $t_i$, the ratio

$$r^*_{1,0}(t_i) = \frac{h^*_1(t_i)}{h^*_0(t_i)} \tag{8}$$

gives an estimate of the relative hazard. Analogously, taking $h^*_0(t_i)$ as a standard, for multiple categories of a given risk factor (coded as c = 0, 1, 2, 3, …), the ratios

$$r^*_{c,0}(t_i) = \frac{h^*_c(t_i)}{h^*_0(t_i)} \tag{9}$$

will give the corresponding estimates of the relative hazard at a given age interval, $t_i$, for populations exposed to the categories c = 1, 2, 3, …, compared to the hazard for a population exposed to the category c = 0.

Application: Estimation of Relative Risks of Lung Cancer Associated with Geographical Area
Data preparation and processing
As a test-bed for the proposed procedure of evaluating hazard functions, we analyzed the LC risks associated with geographical area. In this work, we used a protocol for data preparation analogous to the one described in [11]. The first primary, microscopically confirmed LC cases for white men and women collected during 1975-2004 were extracted from the SEER 9 registries. Data for three geographical areas were utilized in our study: (i) San Francisco-Oakland, (ii) Connecticut, and (iii) Detroit, coded as c = 0, c = 1, and c = 2, respectively. LC incidence rates, expressed per 100,000 persons, were age-adjusted by the direct method to the 2000 United States standard population [15]. The SE of the age-adjusted incidence rates were calculated as described in [16]. The obtained incidence rates were grouped into six five-year cross-sectional time periods. These periods were indexed by j: 1975-79 (j = 1); 1980-84 (j = 2); 1985-89 (j = 3); 1990-94 (j = 4); 1995-99 (j = 5); and 2000-04 (j = 6). Each of these subsets was grouped into 18 age groups: 17 five-year groups, ranging from 0 to 84 years, and the 18th group that included all cases for ages 85+.
These groups were indexed by i in the following way: 0-4 (i = 1); 5-9 (i = 2), 10-14 (i = 3), …, 80-84 (i = 17), 85+ (i = 18). We only used the data for the groups over age 35 (i = 8, 9, …, 18), because only in these groups were the case counts large enough to yield statistically reliable incidence rates. We considered 16 birth cohorts (l = 1, 2, …, 16), corresponding to birth year ranges of 1890-94, …, 1965-69.
Thus, the age-adjusted incidence rates of LC in white men (as well as in white women) in the three considered geographical areas were presented as the following sets of values: $I_{i,j,0}(t_i)$, $I_{i,j,1}(t_i)$, and $I_{i,j,2}(t_i)$ (i = 8, …, 18; j = 1, …, 6). Analogously, the SE of these incidence rates were presented as: $SE[I_{i,j,0}(t_i)]$, $SE[I_{i,j,1}(t_i)]$, and $SE[I_{i,j,2}(t_i)]$.
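The indexing scheme above implies a fixed relation between the age group i, the time period j and the birth cohort l. The sketch below shows one mapping that is consistent with the stated ranges (11 age groups, 6 periods, 16 cohorts from 1890-94 to 1965-69). The formula `l = 18 - i + j` and the function names are assumptions made for illustration; the source does not state the mapping explicitly.

```python
def cohort_index(i, j, oldest_age_group=18):
    """Birth cohort index l for age group i (8..18) and time period j (1..6).

    Assumed mapping: the oldest age group in the first period is cohort 1.
    """
    return oldest_age_group - i + j


def cohort_birth_years(l, first_start=1890, width=5):
    """Nominal birth-year range of cohort l, assuming five-year cohorts."""
    start = first_start + width * (l - 1)
    return start, start + width - 1


# All cohorts reachable from the age groups and periods used in the study.
cohorts = sorted({cohort_index(i, j) for i in range(8, 19) for j in range(1, 7)})
print(cohorts[0], cohort_birth_years(cohorts[0]))
print(cohorts[-1], cohort_birth_years(cohorts[-1]))
```

The birth-year ranges are nominal: a five-year age group observed over a five-year period actually spans about ten birth years, and the 85+ group is open-ended.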

Results and Discussion
Our procedure described in [11] was used to estimate the time period and birth cohort coefficients (and their SE) for the LC age-adjusted incidence rates in white men and women in each of the three considered geographical areas. Estimates of the time period and birth cohort coefficients, $v^*_{j,c}$ and $u^*_{l,c}$ (c = 0, 1, 2), were obtained using anchored values of $v_c$ and $u_c$. The estimates of the hazard functions and their SE were then calculated by formulas (6) and (7). Figure 1 presents the incidence rates observed in men during the six five-year time periods of 1975-2004 in San Francisco-Oakland (panel A), Connecticut (panel B), and Detroit (panel C). Panels A-C of Figure 2 present the analogous rates observed in women. As can be seen from panels A, B and C, the observed incidence rates differ markedly across the six observed time periods. This significantly complicates studies of the relationship between the observed incidence rates and age.
Tables 1 and 2 present the estimates of the age-specific hazard functions (as well as their SE) of LC for the considered geographical areas in men and women, respectively. These estimates are shown graphically in panels D of Figures 1 and 2. As can be seen from these panels, the estimated values of the hazard functions exhibit definite patterns with common features: an exponential rise (from about age 40 until about age 70), a turnover (in the age interval of 70-80) and a fast fall (at older ages). Interestingly, the absolute values of the hazard functions of LC determined for men in the San Francisco-Oakland area appear to be systematically lower than the corresponding estimates for the Connecticut or Detroit areas. Analogous distributions are observed for the hazard functions of LC determined for women in these areas. Based on these observations, we hypothesized that the risk factors of LC associated with geographical area uniformly influence the values of the age-specific hazard functions.
Outliers (i.e. those points which have a large influence on the resulting fit) were excluded by the standard procedures of linear regression analysis [14]. After omitting these outliers, the estimates of the constants (i.e. of the averaged relative hazards) were recomputed.
Calculations showed that for men living in Connecticut vs. San Francisco-Oakland, the estimate of the averaged relative hazard (±SE) of LC is 1.31 ± 0.02, while for men living in Detroit vs. San Francisco-Oakland this estimate is 1.53 ± 0.02. Analogous calculations suggest that for women living in Connecticut vs. San Francisco-Oakland, the averaged relative hazard is 1.22 ± 0.02, while for women living in Detroit vs. San Francisco-Oakland this hazard is 1.32 ± 0.02. In Figure 3, panel (A) shows the relative hazards, $r^*_{c,0}(t_i) = h^*_c(t_i)/h^*_0(t_i)$, with their 95% CI for white men in Connecticut vs. San Francisco-Oakland, and panel (B) shows them for white men in Detroit vs. San Francisco-Oakland. Panels (A) and (B) of Figure 4 show the analogous relative hazards with 95% CI for white women. On these panels, the horizontal line indicates the average of the relative hazards and the error bars indicate the 95% CI. Assuming that the estimate of the averaged relative hazard is equal to the mathematical expectation of this estimator, the estimates of the relative hazards can be compared with the averaged relative hazard. When the 95% CI of a relative hazard includes the corresponding averaged relative hazard, this relative hazard can be considered statistically indistinguishable from the averaged value.
Analysis of Figures 3 and 4 suggests that the age-specific relative hazards of LC are nearly constant with age and depend on the geographical area and gender. In fact, data presented in Table 3 (after excluding one outlier) show that the risk of LC in Connecticut vs. San Francisco-Oakland is about 1.3 times higher for men, whereas for women it is about 1.2 times higher. Analogously, data in Table 4 (after excluding outliers) show that for men in Detroit vs. San Francisco-Oakland this risk is about 1.5 times higher, while for women it is about 1.3 times higher. It should be mentioned that the trends appearing in Figures 3 and 4 are visually much exaggerated, because the scale of the x axis on these figures is about 100 times smaller than the scale of the y axis. Regression analysis showed that the slopes of the linear regression lines for men in Connecticut vs. San Francisco-Oakland (Fig. 3A) and Detroit vs. San Francisco-Oakland (Fig. 3B) are 0.0023 (SE of 0.0009) and 0.0014 (SE of 0.0020), respectively. The analogous slopes of the linear regression lines for women in Connecticut vs. San Francisco-Oakland (Fig. 4A) and Detroit vs. San Francisco-Oakland (Fig. 4B) are -0.0038 (SE of 0.0012) and -0.0056 (SE of 0.0023), respectively. We also found that even when outliers are not excluded, the absolute values of the slopes for men and women do not exceed 0.008 (i.e. the slopes are always near zero). This suggests that the age-specific relative hazards of LC are nearly constant.
Based on this analysis, we suggest that the risk factors of LC associated with the geographical area uniformly influence the values of the age-specific hazard functions. This can be illustrated by Figures 5 and 6, which show that after adjustment by the corresponding averaged relative hazard, the shapes of the age-specific hazard functions for white men and women living in Connecticut and Detroit are almost identical to the corresponding age-specific hazard functions for white men and women living in the San Francisco-Oakland area. For Connecticut and Detroit, the hazard functions were adjusted to the hazard function of the San Francisco-Oakland area by dividing the hazard function values by the corresponding value of the averaged relative hazard.
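The calculations discussed above (age-specific relative hazards, their average, and the slope of a regression line against age) can be sketched as follows. This is a hedged illustration only: the source does not state how the SE of a ratio, the average, or the regression were computed, so the first-order error propagation with independent numerator and denominator, the inverse-variance weighted average, and the weighted `numpy.polyfit` fit are all assumptions, as are the function names and the made-up hazard values.

```python
import numpy as np


def relative_hazard(h_c, se_c, h_0, se_0):
    """Ratio h_c / h_0 with a first-order SE, assuming independent errors."""
    r = h_c / h_0
    se_r = r * np.sqrt((se_c / h_c) ** 2 + (se_0 / h_0) ** 2)
    return r, se_r


def averaged_relative_hazard(r, se_r):
    """Inverse-variance weighted average of the age-specific ratios."""
    w = 1.0 / se_r ** 2
    return (w * r).sum() / w.sum(), np.sqrt(1.0 / w.sum())


def slope_vs_age(ages, r, se_r):
    """Slope of a weighted straight-line fit of the ratios against age."""
    slope, _intercept = np.polyfit(ages, r, 1, w=1.0 / se_r)
    return slope


# Made-up hazards: category 1 is exactly 1.3 times the reference category 0.
ages = np.array([42.5, 52.5, 62.5, 72.5])
h_0 = np.array([10.0, 30.0, 80.0, 150.0])
se_0 = 0.05 * h_0
h_1 = 1.3 * h_0
se_1 = 0.05 * h_1

r, se_r = relative_hazard(h_1, se_1, h_0, se_0)
avg, se_avg = averaged_relative_hazard(r, se_r)
slope = slope_vs_age(ages, r, se_r)
ci_low, ci_high = r - 1.96 * se_r, r + 1.96 * se_r   # approximate 95% CI
print(avg, se_avg, slope)
```

With these inputs the ratio is the same at every age, so the fitted slope is zero up to rounding; a slope that is small relative to its SE is what "nearly constant relative hazard" means in practice.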

Conclusion
In this work, we proposed an efficient computing procedure for estimating the age-specific hazard functions in the LLAPC model. This procedure is based on the novel approach to the analysis of time period and birth cohort effects on the distribution of the age-specific cancer incidence rates developed in our previous work [11]. The procedure proposed in the present work allows one to estimate the age-specific hazard functions for populations with different exposures to a given categorical risk factor. The ratios of the hazard functions for populations with different exposures to a given categorical risk factor are used for characterizing the relative age-specific hazards of cancers.
As a proof-of-concept that this procedure can be used to evaluate the influence of categorical risk factors on the age-specific hazard functions, we estimated LC risk for populations living in different geographical areas. For this purpose, we utilized data on the LC incidence rates in white men and women, collected in the San Francisco-Oakland, Connecticut and Detroit areas during 1975-2004.
We have found that the hazard functions of LC in white men and women, associated with living in these geographical areas, differ in amplitude, but the overall shape of these functions is similar, i.e. the geographical area risk factors influence the LC age-specific hazard functions in approximately the same manner at all ages. We have shown that in white men the averaged relative hazard of LC in Connecticut vs. San Francisco-Oakland is 1.31 ± 0.02, while in Detroit vs. San Francisco-Oakland this relative hazard is 1.53 ± 0.02. In white women, the analogous relative hazards in Connecticut vs. San Francisco-Oakland and Detroit vs. San Francisco-Oakland are 1.22 ± 0.02 and 1.32 ± 0.02, respectively.
We suggest that the proposed computing procedure can be used for assessing hazard functions for other categorical risk factors, such as gender, race, lifestyle, diet, obesity, etc.

Acknowledgements
This work was partially supported by 5 P30 CA36727 (NIH) grant and LB506 grant (Nebraska Department of Health). Authors acknowledge Dr. Leo Kinarsky for fruitful discussion and helpful comments.