Uncovering the cognitive mechanisms underlying the gaze cueing effect

The gaze cueing effect is the tendency for people to respond faster to targets appearing at locations gazed at by others, compared with locations gazed away from by others. The effect is robust, widely studied, and an influential finding within social cognition. Formal evidence accumulation models provide the dominant theoretical account of the cognitive processes underlying speeded decision-making, but they have rarely been applied to social cognition research. In this study, using a combination of individual-level and hierarchical computational modelling techniques, we applied evidence accumulation models to gaze cueing data (three data sets total, N = 171, 139,001 trials) for the first time to assess the relative capacity of an attentional orienting mechanism and information processing mechanisms to explain the gaze cueing effect. We found that most participants were best described by the attentional orienting mechanism, such that response times were slower at gazed-away-from locations because participants had to reorient to the target before they could process it. However, we found evidence for individual differences, whereby the models suggested that some gaze cueing effects were driven by a short allocation of information processing resources to the gazed-at location, allowing for a brief period where orienting and processing could occur in parallel. There was exceptionally little evidence for any sustained reallocation of information processing resources at either the group or individual level. We discuss how this individual variability might represent credible individual differences in the cognitive mechanisms that subserve behaviourally observed gaze cueing effects.


Supplementary Material 2: DDM Recovery Analysis
We performed a recovery analysis on the Diffusion Decision Model (DDM) that appears in the manuscript to ensure the parameters could be reliably identified. This was achieved by first simulating 100 data sets from generated parameters (sampled using Latin Hypercube Sampling), then fitting the DDM to each simulated data set to see whether the estimated parameters correlated with the generated parameters. Specifically, we generated and estimated a parameter for each gaze cueing condition (i.e., a parameter for cued and miscued trials) for parameters that varied across conditions, except for z, since z in the miscued condition was 1 − z in the cued condition. As can be seen in Table 1, we found high correlations between the estimated and generated parameters, suggesting that the parameters were identifiable.
Table 1: Correlations between estimated and generated parameters for each gaze cue condition with no between trial variability parameters.
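The recovery pipeline can be sketched in a few lines. This is a minimal illustration rather than the routine used for the manuscript: it draws 12 parameter sets (the manuscript used 100) with a small hand-rolled Latin Hypercube sampler, simulates an unbiased DDM by Euler-Maruyama, and, instead of full likelihood-based fitting, recovers v, a, and t0 with the closed-form EZ-diffusion equations (Wagenmakers et al., 2007). All parameter ranges and trial counts below are arbitrary choices for the demonstration.

```python
import numpy as np

def lhs(n, lows, highs, rng):
    """Latin Hypercube Sample: one draw per equal-probability stratum per dimension."""
    d = len(lows)
    strata = np.stack([rng.permutation(n) for _ in range(d)], axis=1)
    u = (rng.random((n, d)) + strata) / n
    return lows + u * (highs - lows)

def simulate_ddm(v, a, t0, n_trials, rng, s=0.1, dt=1e-3):
    """Euler-Maruyama simulation of an unbiased (z = a/2) diffusion process."""
    x = np.full(n_trials, a / 2.0)
    rt = np.zeros(n_trials)
    correct = np.zeros(n_trials, dtype=bool)
    done = np.zeros(n_trials, dtype=bool)
    t = 0.0
    while not done.all() and t < 8.0:
        t += dt
        idx = ~done
        x[idx] += v * dt + s * np.sqrt(dt) * rng.standard_normal(idx.sum())
        hit_up = idx & (x >= a)
        hit_lo = idx & (x <= 0.0)
        rt[hit_up | hit_lo] = t + t0
        correct[hit_up] = True
        done |= hit_up | hit_lo
    rt[~done] = t + t0  # censor any stragglers (practically never reached)
    return rt, correct

def ez_diffusion(pc, vrt, mrt, s=0.1):
    """Closed-form EZ-diffusion estimates of drift, boundary, and non-decision time."""
    pc = np.clip(pc, 0.001, 0.999)
    if pc == 0.5:  # logit undefined at exactly 0.5; apply a small edge correction
        pc += 1e-4
    L = np.log(pc / (1 - pc))
    x = L * (L * pc**2 - L * pc + pc - 0.5) / vrt
    v = np.sign(pc - 0.5) * s * x**0.25
    a = s**2 * L / v
    y = -v * a / s**2
    mdt = (a / (2 * v)) * (1 - np.exp(y)) / (1 + np.exp(y))
    return v, a, mrt - mdt

rng = np.random.default_rng(1)
gen = lhs(12, np.array([0.10, 0.08, 0.20]), np.array([0.30, 0.13, 0.40]), rng)  # v, a, t0
est = np.empty_like(gen)
for i, (v, a, t0) in enumerate(gen):
    rt, correct = simulate_ddm(v, a, t0, n_trials=1000, rng=rng)
    crt = rt[correct]
    est[i] = ez_diffusion(correct.mean(), crt.var(), crt.mean())

for j, name in enumerate(["v", "a", "t0"]):
    print(f"recovery r({name}) = {np.corrcoef(gen[:, j], est[:, j])[0, 1]:.2f}")
```

With enough trials per simulated data set, the generated-estimated correlations from a sketch like this are typically high, mirroring the pattern summarized in Table 1.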

Hierarchical Modelling Analysis
As discussed in the manuscript, we preregistered a slightly different hierarchical-level analysis from the one included in the manuscript. Originally, we had planned to completely constrain the individual-level estimates, but after running the analyses we realized that this did not permit sufficient flexibility, resulting in too much credibility being allocated to z, whereby z appeared to show highly credible, yet extremely small, shifts across cued and miscued trials. We think it is unlikely that these results reflect true shifts in z across trials; rather, they are an artifact of the parameter constraints, given that none of the parameters were permitted to move in a theoretically implausible direction at the individual level. We believe this to be the case because these seemingly credible yet extremely small differences in z were not maintained when we unconstrained the individual-level estimate for z while constraining the other parameters (as shown in the analyses presented in the manuscript).
The estimates of t0, however, did remain fairly stable in comparison.
Further, the results obtained in the original analyses (shown in Figure ??) seem less plausible given their inconsistency with the individual-level modelling, which did not find a high probability that z was subserving cueing effects. Moreover, the results for z in the analyses below seem consistent with extreme hierarchical shrinkage: many individuals with parameter values in one area (i.e., right on 0, as individual parameter values were not allowed to be below zero in this analysis) strongly pull the values of other individuals towards them, resulting in a highly certain group-level estimate due to the tight clustering of the shrunken individual parameter estimates.
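This shrinkage mechanism can be illustrated with a toy normal-normal model. This is not the hierarchical DDM used in our analyses, and all numbers below are made up, but it shows how a tight group-level distribution (small tau) pulls individual estimates strongly toward the group mean, whereas a diffuse one leaves them largely untouched.

```python
import numpy as np

def shrink(y, se, mu, tau):
    """Posterior mean of each individual's parameter in a normal-normal model:
    a precision-weighted average of the individual estimate y (standard error se)
    and the group mean mu (group-level standard deviation tau)."""
    w = (1 / se**2) / (1 / se**2 + 1 / tau**2)  # weight on the individual's own data
    return w * y + (1 - w) * mu

# Hypothetical individual-level estimates of a z-difference: most pile up at 0.
y = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.08, 0.12])
se = np.full_like(y, 0.05)

# Tight group-level distribution: the two non-zero estimates are pulled hard toward 0.
print(shrink(y, se, mu=y.mean(), tau=0.02))
# Diffuse group-level distribution: individual estimates barely move.
print(shrink(y, se, mu=y.mean(), tau=0.50))
```

The tighter the cluster of individuals near the boundary, the smaller the effective tau, and the more strongly the remaining estimates (and hence the group-level estimate) are drawn toward that cluster.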


Supplementary Material 6: AIC Inclusion Probabilities
As can be seen in SM.5, AIC inclusion probabilities lead to similar qualitative conclusions for each data set as the BIC inclusion probabilities presented in the manuscript. As discussed in the manuscript, the main difference is that under AIC all of the models fare better relative to the null models, as BIC penalises flexibility more heavily than AIC.
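For reference, both AIC and BIC inclusion probabilities are built from information-criterion weights of the form w_i = exp(−Δ_i/2) / Σ_j exp(−Δ_j/2), where Δ_i is each model's score minus the best (lowest) score. A minimal sketch with made-up scores (the model labels follow the candidate set discussed in the manuscript, but the values are illustrative only):

```python
import numpy as np

def ic_weights(scores):
    """Convert raw AIC or BIC scores to model weights:
    w_i = exp(-0.5 * delta_i) / sum_j exp(-0.5 * delta_j),
    where delta_i is each score minus the minimum score."""
    delta = scores - scores.min()
    w = np.exp(-0.5 * delta)
    return w / w.sum()

# Illustrative (made-up) BIC scores for one participant's candidate models.
models = ["simple", "t0", "z", "v", "t0-z", "z-v"]
bic = np.array([1012.3, 1009.8, 1013.5, 1016.1, 1014.2, 1017.9])
w = ic_weights(bic)
for m, wi in zip(models, w):
    print(f"{m:>6}: {wi:.3f}")

# A parameter's inclusion probability sums the weights of all models
# in which that parameter is permitted to vary across conditions.
p_t0 = w[models.index("t0")] + w[models.index("t0-z")]
print(f"t0 inclusion probability: {p_t0:.3f}")
```

Because BIC's complexity penalty grows with the number of trials while AIC's does not, the same raw likelihoods generally yield larger weights for the simpler (null) models under BIC than under AIC.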

Moderating Factors in Data Set 3
As mentioned in the Discussion of the main manuscript, a limitation of our design was that we ignored several potential moderating factors in the original data sets. One of the main reasons we did not assess these potential moderating factors was that, in most cases, there were not enough trials to reliably estimate DDM parameters for them. This was, however, possible for Data Set 3. In this supplementary material, we assessed each potential moderating factor present in Data Set 3, SOA (200 ms, 500 ms) and Cue Emotional Expression (Happy, Fearful, Neutral, Angry), as separate data sets to see whether our results changed as a function of these moderating factors.
As shown in Figure SM.7 and Figure SM.8, there did not appear to be any obvious qualitative differences across moderating factors. We see a higher probability for the simple model in most plots compared to those reported in the manuscript, but this is because there were fewer trials, and BIC in particular will favour simpler models when there are fewer trials. Therefore, we think it is reasonable to suggest that the overall conclusions from our paper hold across these SOA and cue emotion manipulations, at least for this data set.

Supplementary Material 8: Individual-Level Analysis with DDM

Including Between Trial Variability Parameters
In this section, we replicated the individual-level analyses presented in the manuscript with a version of the DDM that included between trial variability parameters. There did not appear to be any obvious qualitative differences between the two analyses, although the z model performed better for some participants.

Data Set 1
According to AIC, the t0 model was most likely to be the best-performing candidate model, followed by the simple model. These individual-level results as a whole appear to be qualitatively equivalent to those reported in the manuscript.
Parameter Inclusion Probabilities. As shown in the top panel of Figure SM.10, 14 participants had a greater than 50% chance of being best described by models that permitted t0 to vary across conditions. Two participants had a greater than 50% chance of being best described by models that permitted z to vary across conditions. No participants had a greater than 50% chance of being best described by models that permitted v to vary across conditions.
Collapsing across participants, models that assumed t0 varied across cued and miscued trials were on average 36% likely to be the best candidate model when compared to models that did not make this assumption (z, v, z-v, simple; see Figure SM.10, top panel). Models that assumed z varied across cued and miscued trials were on average 18% likely to be the best candidate model compared to models that did not make this assumption. Models that assumed v varied across cued and miscued trials were on average 10% likely to be the best candidate model compared to models that did not make this assumption.

Data Set 2
Parameter Inclusion Probabilities. As shown in Figure SM.10, 11 participants had a greater than 50% chance of being best described by models that permitted t0 to vary across conditions. Fifteen participants had a greater than 50% chance of being best described by models that permitted z to vary across conditions. Two participants had a greater than 50% chance of being best described by models that permitted v to vary across conditions.
Collapsing across participants, models that assumed t0 varied across cued and miscued trials were on average 17% likely to be the best candidate model when compared to models that did not make this assumption. Models that assumed z varied across cued and miscued trials were on average 30% likely to be the best candidate model compared to models that did not make this assumption. Models that assumed v varied across cued and miscued trials were 3% likely to be the best candidate model compared to models that did not make this assumption. These results largely align with the results reported in the main manuscript, but with slightly more evidence in favour of z.

Data Set 3
Parameter Inclusion Probabilities. As shown in the third panel of Figure SM.10, 30 participants had a greater than 50% chance of being best described by models that permitted t0 to vary across conditions. Twenty-four participants had a greater than 50% chance of being best described by models that permitted z to vary across conditions. No participants had a greater than 50% chance of being best described by models that permitted v to vary across conditions. Collapsing across participants, models that assumed t0 varied across cued and miscued trials were on average 53% likely to be the best candidate model when compared to models that did not make this assumption. Models that assumed z varied across cued and miscued trials were on average 38% likely to be the best candidate model compared to models that did not make this assumption. Models that assumed v varied across cued and miscued trials were 3% likely to be the best candidate model compared to models that did not make this assumption.
Again, there did not appear to be any obvious qualitative differences between the individual-level analyses for the full diffusion model reported here and the diffusion model reported in the manuscript, although there appeared to be more participants best fit by the z model.

Response Time
As shown in Figure SM.14 and Figure SM.15, there did not appear to be any obvious qualitative differences across moderating factors. We see a higher probability for the simple model in most plots, but this is because there were fewer trials, and BIC in particular will favour simpler models when there are fewer trials. Therefore, we think it is reasonable to suggest that the overall conclusions from our paper hold across these SOA and cue emotion manipulations, at least for this data set.

Method
These data were from the same experiment, with the same participants, as Data Set 1 in the main manuscript (Gregory & Jackson, 2020). The exact same exclusion criteria were applied. No participants were excluded, as they all had an accuracy of >80%. However, 21 trials (0.21%) were removed from the arrow cueing data set for response times <100 ms or >5000 ms, leaving a total of 9,819 arrow cueing trials analysed.

Results
Stage 2: Individual-Level Modelling
[…] participants had a greater than 50% chance of being best described by models that permitted t0 to vary across conditions. Two participants had a greater than 50% chance of being best described by models that permitted z to vary across conditions. Five participants had a greater than 50% chance of being best described by models that permitted v to vary across conditions. Collapsing across participants, models that assumed t0 varied across cued and miscued trials were on average 43% likely to be the best candidate model when compared to models that did not make this assumption (z, v, z-v, simple). Models that assumed z varied across cued and miscued trials were on average 16% likely to be the best candidate model compared to models that assumed z did not vary. Models that assumed v varied across cued and miscued trials were on average 20% likely to be the best candidate model compared to models that did not make this assumption.
Therefore, as in Data Set 1a, according to these inclusion probabilities, t0 appeared to be the most likely out of the parameters of interest to be explaining any positive cueing effects in the observed data, although there was reasonable uncertainty.
At first glance, these individual-level results would suggest that the arrow cueing effects and the gaze cueing effects in Data Set 1 were driven by common underlying mechanisms, as both showed the strongest evidence for an influence of t0 at the individual level. Interestingly, however, inspection of Figures 6 and 7 in the main manuscript revealed that many of the people who had a high probability of being best described by t0 in the arrow cueing data set were not the same people who had the highest probability of being best described by t0 in the gaze cueing data (Data Set 1). Indeed, a post-hoc Bayesian correlation revealed moderate evidence for a lack of correlation between participants' t0 parameter inclusion probability in Data Set 1 and the arrow cueing data set (r = 0.04, BF = 0.36). Inspection of the cueing magnitudes in Figure 5 of the main manuscript and Figure SM.17 shows that this lack of consistency within participants maps onto their cueing magnitudes, whereby the same participants who had larger positive arrow cueing magnitudes tended to have smaller or negative gaze cueing magnitudes in Data Set 1, statistically demonstrated by anecdotal evidence in favour of no correlation between cueing magnitudes in Data Set 1 and the arrow cueing data (r = .20, BF = 0.70). This lack of relationship between gaze cueing in Data Set 1 and the arrow cueing data set suggests there may be differences across individuals in their comparative responses to arrow and gaze cues. However, it may also indicate that the data were noisy and/or unreliable. Ultimately, without more data it is difficult to know.

Stage 3: Bayesian Hierarchical Modelling
As shown in the second panel of Figure ??, t0 had the most reliable difference in estimates across cued and miscued trials. Although this difference was not 95% credibly different from zero, the vast majority of the distribution suggested a parameter shift in the theoretically plausible direction, and it was closer to being 95% credibly different from zero than any of the other parameters.
The mean difference in estimates for v was also in the theoretically plausible direction on average; however, there was a lower probability that this estimate was reliably different from zero. The differences in estimates for z were the least likely to be shifting across conditions in the theoretically plausible direction.
Therefore, these results suggest that t0 was the most likely parameter to be driving any positive cueing effects at the group level, but that there is still considerable uncertainty. This uncertainty makes sense, since the individual-level modelling found that the simple model performed quite well for a large proportion of participants. In other words, it is plausible for any group-level effects to be highly uncertain when substantial individual variability is present. These group-level results for arrow cues are similar to those uncovered for the gaze cues in Data Set 1, but with stronger evidence to suggest that t0 is driving the effect, which is likely because the cueing magnitudes tended to be larger for the arrow cues than for Data Set 1 (see Figure SM.17 and Figure 5 of the main manuscript).

Note. Green represents models that assume the respective parameter varies across conditions. A larger proportion of green in these figures indicates broader support for a parameter driving differences across cued and miscued conditions.

Note. The red portions of the plots represent the space of parameter differences that is inconsistent with the theoretically plausible direction for each parameter, assuming a positive cueing effect. Specifically, in order to be considered theoretically plausible, t0 needed to show a negative difference across cued and miscued trials, whereas v and z needed to show positive differences. This plot shows a large proportion of differences close to zero or in the theoretically implausible direction, which we attribute to variability at the individual level.

Figure SM.1. Differences in Group-Level Parameter Estimates Across Cued and Miscued Trials

Figure SM.7. Weighted BIC Model Probabilities as a Function of Moderating Factor.
Individual-Level Modelling. Weighted Model Probabilities. The top panel of Figure SM.9 shows a graphical representation of the probability that each model is the best-performing model for each participant in Data Set 1, according to BIC and AIC. Table 2 shows the raw BIC and AIC values collapsed across participants, as well as the exact weighted BIC and AIC probabilities collapsed across participants. According to BIC, as shown in Figure SM.9, 20 participants were best described by the simple model, 14 participants by the t0 model, and 5 by the z model. When collapsing across participants and taking the mean, as shown in Table 2, according to BIC the simple model had the highest probability of being the best-performing model, followed by the t0 model, although the t0 model still had the best raw BIC score.

Figure SM.9. BIC and AIC Weighted Model Probabilities for Each Participant and Data Set Using the Diffusion Model with Between Trial Variability Parameters

Weighted Model Probabilities. According to BIC, and as shown in Figure SM.9, 18 participants were best described by the simple model, 18 by the z model, 12 by the t0 model, and 3 by the v model. As shown in Table 3, averaged across participants, the z model was the most likely to be the best-performing candidate model, followed by the simple model. According to AIC, 22 participants were best described by the z model, 15 by the t0 model, 3 by the v model, and 1 by the t0-z model. Averaging across participants, the z model was the most likely to be the best-performing candidate model, followed by the t0 model.

Figure SM.10. BIC Parameter Inclusion Probabilities for Each Participant and Data Set Using the Diffusion Model with Between Trial Variability Parameters

Weighted Model Probabilities. According to BIC, as shown in Figure SM.9, 31 participants were best described by the t0 model, 25 by the z model, 12 by the simple model, 3 by the t0-z model, and one by the z-v model. As shown in Table 4, averaged across participants, the t0 model was the most likely to be the best-performing model, followed by the z model. According to AIC, as shown in Figure SM.9, 39 participants were best described by the t0 model, 22 by the z model, 7 by the z-t0 model, and 3 by the simple model. As shown in Table 4, averaged across participants, the t0 model had the highest probability of being the best-performing model, followed by the z model.

Figure SM.15. Weighted BIC Model Probabilities as a Function of Moderating Factor.
Stage 1: Cueing Magnitudes
Thirty-nine out of the 41 participants displayed faster response times for cued trials compared to miscued trials on average, as shown in Figure SM.17. The mean cueing magnitude for the arrow cueing data, including all 41 participants, was 22 ms (SD = 19; Standardized Mean Change = 0.53).
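As a concrete sketch, per-participant cueing magnitudes and a standardized mean change can be computed as below. The data here are simulated, not the values reported above, and we use the convention of dividing the mean change by the standard deviation of the baseline (miscued) mean RTs; the manuscript's exact standardization convention may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 41  # participants, as in the arrow cueing data set

# Hypothetical per-participant mean RTs (seconds) for miscued and cued trials.
rt_miscued = rng.normal(0.50, 0.05, n)
rt_cued = rt_miscued - rng.normal(0.022, 0.019, n)  # simulated cueing benefit

diff_ms = (rt_miscued - rt_cued) * 1000  # cueing magnitude per participant, in ms
mean_cueing = diff_ms.mean()
smc = mean_cueing / (rt_miscued * 1000).std(ddof=1)  # standardized mean change

print(f"mean cueing magnitude: {mean_cueing:.1f} ms")
print(f"SD of magnitudes:      {diff_ms.std(ddof=1):.1f} ms")
print(f"standardized mean change: {smc:.2f}")
print(f"participants faster on cued trials: {(diff_ms > 0).sum()} / {n}")
```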

Figure SM.16. Cueing Magnitudes in Milliseconds for the Arrow Cueing Data Set

Figure SM.17. BIC and AIC Weighted Model Probabilities for Each Participant

Figure SM.18. BIC Parameter Inclusion Probabilities for Each Participant and Data Set.

Figure SM.19. Differences Between Hierarchical Bayesian Parameter Estimate Distributions Across Cued and Miscued Trials

Table 2: Model performance as indicated by the individual-level modelling for Data Set 1, collapsed across participants. "Probability" represents the relative probability of each model being the best candidate model according to BIC and AIC, respectively. "Raw Score" corresponds to the raw BIC and AIC scores. Higher probabilities indicate a higher probability of that model being the best-performing model. Lower raw BIC and AIC scores indicate better model performance. To get the values shown in this table, we calculated a unique raw AIC and BIC and a weighted AIC and BIC for each participant and model, and then averaged these individual scores across participants.

Table 3: Model performance as indicated by the individual-level modelling for Data Set 2, collapsed across participants. "Probability" represents the relative probability of each model being the best candidate model according to BIC and AIC, respectively. "Raw Score" corresponds to the raw BIC and AIC scores. Higher probabilities indicate a higher probability of that model being the best-performing model. Lower raw BIC and AIC scores indicate better model performance.

Table 4: Model performance as indicated by the individual-level modelling for Data Set 3, collapsed across participants. "Probability" represents the relative probability of each model being the best candidate model according to BIC and AIC, respectively. "Raw Score" corresponds to the raw BIC and AIC scores. Higher probabilities indicate a higher probability of that model being the best-performing model. Lower raw BIC and AIC scores indicate better model performance.
Data Set 2 Individual-Level Model Fits With Between Trial Variability Parameters

Table 6: Model performance as indicated by the individual-level modelling for arrow cueing, collapsed across participants. "Probability" represents the relative probability of each model being the best candidate model according to BIC and AIC, respectively. "Raw Score" corresponds to the raw BIC and AIC scores. Higher probabilities indicate a higher probability of that model being the best-performing model. Lower raw BIC and AIC scores indicate better model performance.