Comparing Methods for Measurement Error Detection in Serial 24-h Hormonal Data

Measurement errors commonly occur in 24-h hormonal data and may affect the outcomes of such studies. Measurement errors often appear as outliers in such data sets; however, no well-established method is available for their automatic detection. In this study, we aimed to compare performances of different methods for outlier detection in hormonal serial data. Hormones (glucose, insulin, thyroid-stimulating hormone, cortisol, and growth hormone) were measured in blood sampled every 10 min for 24 h in 38 participants of the Leiden Longevity Study. Four methods for detecting outliers were compared: (1) eyeballing, (2) Tukey’s fences, (3) stepwise approach, and (4) the expectation-maximization (EM) algorithm. Eyeballing detects outliers based on experts’ knowledge, and the stepwise approach incorporates physiological knowledge with a statistical algorithm. Tukey’s fences and the EM algorithm are data-driven methods, using interquartile range and a mathematical algorithm to identify the underlying distribution, respectively. The performance of the methods was evaluated based on the number of outliers detected and the change in statistical outcomes after removing detected outliers. Eyeballing resulted in the lowest number of outliers detected (1.0% of all data points), followed by Tukey’s fences (2.3%), the stepwise approach (2.7%), and the EM algorithm (11.0%). In all methods, the mean hormone levels did not change materially after removing outliers. However, their minima were affected by outlier removal. Although removing outliers affected the correlation between glucose and insulin on the individual level, when averaged over all participants, none of the 4 methods influenced the correlation. Based on our results, the EM algorithm is not recommended given the high number of outliers detected, even where data points are physiologically plausible. 
Since Tukey’s fences is not suitable for all types of data and eyeballing is time-consuming, we recommend the stepwise approach for outlier detection, which combines physiological knowledge and an automated process.


Data generation
We simulated measurements for five hormones: glucose, insulin, thyroid-stimulating hormone (TSH), cortisol, and growth hormone (GH), according to their physiological characteristics and the laboratory setting in which our sample was drawn. This setting was reproduced in the simulation as follows:
• 24 hours with measurements every 10 minutes, giving 144 measurements per hormone and person.
• 3 meals, at time points 0, 18, and 54.
• Night from time point 84 to time point 138.
For each hormone we generated measurements as follows. The mean hormone value at time t, Y(t), consisted of a constant baseline level and one or more peaks generated with an absorption/elimination model. A peak starting at time ts is described in terms of C0, which determines the minimum hormone value over time; C1, the peak value; and λa and λe, the rates of absorption and elimination of the hormone in blood. The latter is directly related to the half-life of the hormone by λe = ln(2)/half-life. Random between-person and within-person variation was added to the generated mean values. The specific minimum, the location and duration of the peaks, and the random intra- and inter-person variation were based on the patterns observed in our data.
Specific features of each hormone are:
• Glucose: Three clear peaks after meals, where the third is slightly higher than the others. At night, the level is stable and low, and the variation is smaller. Physiologically, glucose levels cannot be below 2.8 mmol/L.
• Insulin: Three clear peaks after meals; insulin is highly correlated with glucose (corr. = 0.75). At night, the hormone level is stable and low, and the variation is smaller.
• TSH: One large peak; the hormone builds up in the evening from 6 pm (t=54), reaching its highest levels at 11 pm (t=84), with large variation.
• Cortisol: Peaks at the end of the night.
• GH: Sharp peaks; the number of peaks varies from 0 to 20 across individuals.
Inter-person variation was generated by varying the highest concentration reached during peaks, following a normal distribution. For TSH, cortisol, and GH, the location of the peaks also varied across individuals. In this way we generated 24-hour hormonal data for 38 simulated subjects. Table A1 shows the specific parameters used for simulating these data.
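The generation scheme described above can be illustrated with a minimal Python sketch for a glucose-like hormone. It is not the authors' implementation. The peak shape is an assumption: a difference of two exponentials, exp(-λe·u) - exp(-λa·u) with u = t - ts, rescaled so that its maximum equals the peak value. All numeric values (baseline, peak heights, half-life, standard deviations) and the function names `peak` and `simulate_glucose_like` are hypothetical placeholders and are not the parameters of Table A1.

```python
import numpy as np


def peak(t, t_start, height, lam_a, lam_e):
    """Absorption/elimination peak starting at t_start with maximum `height`.

    Assumed shape (not taken from the source): exp(-lam_e*u) - exp(-lam_a*u),
    with u = t - t_start >= 0. Requires lam_a > lam_e > 0.
    """
    u = np.clip(t - t_start, 0.0, None)  # u = 0 before the peak starts, so the peak is 0 there
    shape = np.exp(-lam_e * u) - np.exp(-lam_a * u)
    u_max = np.log(lam_a / lam_e) / (lam_a - lam_e)  # time at which the shape is maximal
    shape_max = np.exp(-lam_e * u_max) - np.exp(-lam_a * u_max)
    return height * shape / shape_max


def simulate_glucose_like(n_subjects=38, n_times=144, seed=0):
    """Simulate glucose-like 24-h series: baseline plus three meal peaks plus noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_times, dtype=float)  # time points in 10-minute steps
    meals = [0, 18, 54]                  # meal times from the text
    mean_heights = [2.0, 2.0, 2.4]       # hypothetical; third peak slightly higher
    baseline = 4.5                       # hypothetical C0 (mmol/L)
    lam_e = np.log(2) / 6.0              # hypothetical half-life of 6 time steps
    lam_a = 4.0 * lam_e                  # hypothetical absorption rate
    night = (t >= 84) & (t <= 138)
    sd_within = np.where(night, 0.1, 0.3)  # hypothetical; smaller variation at night

    data = np.empty((n_subjects, n_times))
    for i in range(n_subjects):
        mean = np.full(n_times, baseline)
        for t_meal, h in zip(meals, mean_heights):
            # Inter-person variation: peak height drawn from a normal distribution
            height = rng.normal(h, 0.4)
            mean += peak(t, t_meal, height, lam_a, lam_e)
        # Intra-person variation: independent normal noise at each time point
        data[i] = mean + rng.normal(0.0, sd_within)
    return t, data
```

Calling `simulate_glucose_like()` returns the time grid and a 38 × 144 array. Rescaling the peak so that its maximum equals the requested height keeps the "peak value" parameter directly interpretable, which is one possible reading of C1 in the text.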

Table A1 Parameters for generating 24-hour glucose, insulin, TSH, cortisol and GH data
In each individual, we generated measurement errors at 14 time points for each hormone. To generate random measurement errors, at 7 randomly selected time points per hormone (5% of the 144 points) we replaced the true measurement with an error measurement drawn from a uniform distribution with a wide range (-10 × intra-person SD to 15 × intra-person SD). Furthermore, we generated related dilution errors at 7 time points, which were the same across all hormones within one individual. The dilution errors were generated by dividing the original measurement by 2.

Figure A1 shows the simulated 24-hour hormonal data for glucose, insulin, TSH, cortisol, and GH of the first two generated individuals. The hormone-specific measurement errors are indicated by red dots and the dilution errors by green dots.

Figure A2 displays how many points were indicated as measurement errors by each method, averaged across the 38 simulated subjects. The EM algorithm indicated the highest number of measurement errors, followed by the stepwise approach. The EM algorithm indicated particularly high numbers of measurement errors for the hormones in which the intra-person variation was larger during the day than during the night (glucose and insulin).

Table A2 shows what percentage of true errors (random errors and dilution errors) was detected by each method and how many non-errors were identified as errors by each method. The EM algorithm performed best at detecting true errors. However, it also indicated the most non-errors as measurement errors. For insulin in particular, the number of true measurements indicated as errors was extremely high. This is explained by the fact that the intra-person variation in insulin differed between day and night, and that the insulin residuals were not normally distributed without log transformation. The percentage of non-errors detected as measurement errors was much lower with the stepwise approach and Tukey's fences than with the EM algorithm.
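The error generation and the two performance measures described above can be sketched in Python as follows, using Tukey's fences as the example detector. This is an illustration under stated assumptions, not the authors' code. The uniform draw is treated here as an offset added to the true value (the text can also be read as an outright replacement), the random and dilution time points are assumed not to overlap, and the fence multiplier k = 1.5 is the conventional choice rather than a value given in the text. The function names are hypothetical.

```python
import numpy as np


def inject_errors(series, sd_within, rng, dilution_idx, n_random=7):
    """Return a contaminated copy of `series` and boolean masks for both error types."""
    y = np.asarray(series, dtype=float).copy()
    n = y.size
    # Assumption: random errors are placed at time points not used for dilution errors
    candidates = np.setdiff1d(np.arange(n), dilution_idx)
    random_idx = rng.choice(candidates, size=n_random, replace=False)
    # Assumption: the uniform draw (-10 SD to 15 SD) is an offset around the true value
    y[random_idx] += rng.uniform(-10 * sd_within, 15 * sd_within, size=n_random)
    # Dilution errors: the original measurement divided by 2
    y[dilution_idx] = y[dilution_idx] / 2.0

    is_random = np.zeros(n, dtype=bool)
    is_random[random_idx] = True
    is_dilution = np.zeros(n, dtype=bool)
    is_dilution[dilution_idx] = True
    return y, is_random, is_dilution


def tukey_fences(y, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumed default."""
    y = np.asarray(y, dtype=float)
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    return (y < q1 - k * iqr) | (y > q3 + k * iqr)


def detection_rates(flagged, is_error):
    """Fraction of true errors flagged, and fraction of non-errors flagged."""
    flagged = np.asarray(flagged, dtype=bool)
    is_error = np.asarray(is_error, dtype=bool)
    detected = flagged[is_error].mean()
    false_alarm = flagged[~is_error].mean()
    return detected, false_alarm
```

For one subject, the dilution indices would be drawn once and reused for all five hormones, for example with `rng.choice(144, size=7, replace=False)`, and `detection_rates(tukey_fences(y), is_random | is_dilution)` then gives the two percentages of the kind reported in Table A2.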

Figure A1 Simulated 24-hour glucose, insulin, TSH, cortisol, and GH data of the first two generated individuals
The stepwise approach is to be preferred when detecting dilution errors.