Early Experimental Psychology: How Did Replication Work Before P-Hacking?


Introduction
For many researchers, replication is still the "gold standard" that is crucial for verifying scientific findings (see, for example, Frank & Saxe, 2012; Iso-Ahola, 2020; Witte & Zenker, 2017). Indeed, Crandall and Sherman (2016) declared that "[t]here is no controversy over the need for replication; virtually all scientists and philosophers of science endorse the notion that replication of one sort or another is absolutely essential" (p. 94). In recent decades, this has led to widespread concern because few experimental findings are actually being confirmed in this way (see, for example, Pashler & Wagenmakers, 2012; Reproducibility Project: Psychology 1 ; Wiggins & Christopherson, 2019).
Before it is possible to plan how to remedy this situation, the reasons for the lack of replications must be identified. "Questionable research practices" such as p-hacking, post-hoc hypothesizing, and the "file-drawer problem" 2 are often cited as contributing to the problem (Romero, 2019; Wiggins & Christopherson, 2019). These research practices are firmly embedded within a scientific culture that is characterized by a highly competitive academic environment and a reward system that discourages rather than encourages replication (Crandall & Sherman, 2016; Romero, 2019). This setting fosters personal ambition, urging researchers to come up continually with innovative and ambitious projects and to publish as many papers as possible. Meanwhile, most journals only publish reports of original research offering statistically significant results, which has led to a "publication bias" (Romero, 2017). Replicability problems, as Pashler and Wagenmakers (2012) stated, "reflect deep-seated human biases and well entrenched incentives that shape the behavior of individuals and institutions" (p. 529). Fraud cases, such as that involving Diederik Stapel, show just how far a person might be willing to go when succumbing to such pressures (Stroebe, Postmes, & Spears, 2012; Derksen, 2021).
Whether replication is really necessary, and whether the problematic research practices mentioned above are due more to the present reward system, general human biases, or an incorrect statistical or philosophical understanding, are still open questions (Feest, 2019; Flis, 2019; Morawski, 2019). Given such uncertainties, it is worth exploring how research was undertaken in the past, when the current institutional conditions did not pertain, or at least not yet fully. Stated differently: If the current replicability problem is related to recent research practices that have appeared as part of academic life in times of neoliberal capitalism and "big science," then we might assume that replication worked differently in the past. Thus, in the present paper, I adopt a historical stance to reveal characteristics of experimental research practices in nineteenth-century psychology and to describe the way research was replicated.
The original experiments I present in this paper are well known, dating from 1860 to 1900, a period characterized by important changes in Europe, such as industrialization, workers' movements, and the constitution of modern nation states. In this period, the Prussian model at German universities (Charle, 2004) offered a broad humanistic as well as a thorough experimental training in fields such as chemistry, physics, physiology, philosophy, and psychology, which was just emerging as an experimental science with its own scientific community (Ash, 1980; Bringmann & Tweney, 1980; Rieber & Robinson, 2001). When the first replications were performed, neither young researchers nor senior professors suffered the pressure to "publish or perish," and journals were not yet acting as gatekeepers making decisions based on p-values. Often, researchers carried out years of patient experimenting before they published their results. What role, then, did replication play within psychological research?

What Is Scientific Replication and When Did It Start?
Replication is a rather recent term that has been used by scientists since 1914 to refer to the activity of repeating an experiment or a series of observations (Tweney, 2018). 3 In several ways, the practice is connected to broader issues in the history and philosophy of science. The replication of experiments is deemed important by scientists as a way of ensuring stable and generalizable results. Scientists (i.e., "natural philosophers") have been aware of the need to repeat experiments for centuries. Schickore (2011) showed how redoing another investigator's experiments became an issue around 1670 and Grower (1997) pointed to the role this played in early scientific debates concerning Newton's experiments.
Nevertheless, since the 1980s, experimentation and replication have been criticized by sociologists, philosophers, and historians of science and interpreted as being the result of a rather naïve commitment to objectivity as an epistemic ideal (e.g., Daston & Galison, 2007). Shapin and Schaffer (1985) considered replication to be the product of the set of technologies which transforms what counts as belief into what counts as knowledge; it includes physical reiteration of a certain kind of experimentation as well as virtual witnessing through "literary technology." In the following, I will present some examples of tables and graphs used by psychologists.
Although physical or virtual "reiteration," "repetition," or "replication" of an experiment sounds like an easy task to perform, judging whether an experiment is a valid copy is highly problematic because there is no unambiguous set of rules. Moreover, in his book "Changing order," Collins (1985) pointed out the problem of experimenter's regress. He views a scientific fact as an expression of a "form of life" (in Wittgenstein's sense). In a controversy, each group argues from a different stance or "world view"; there is no algorithm that can be used to evaluate different views objectively. "Experimenter's regress" refers to a closed loop: researchers need to accept, as a matter of fact, the phenomena of the experiment they want to replicate because they need them to calibrate their own machines or methods; only when they obtain the same result can they be sure that their machines or methods have worked.
If opinions as to the success or failure of the replication vary, Collins and other sociologists of science argue that there is no way to decide who is right. The epistemic uncertainty in experimentation, involving unexpressed and partially reported skills and assumptions, makes it impossible to fix clear-cut criteria to decide whether an experiment constitutes a "genuine" replication or not. Nevertheless, Radder (1992) argues that in most cases, the regress can be overcome with the help of negotiations about the adequacy of the instruments and the kind of phenomena they produce.
Up until this point, I have cited works referring to replication in scientific practice in general. What about this practice in psychology? Did nineteenth-century psychologists actually consider it important to replicate experiments? The answer is not self-evident, as publications on the replication crisis in psychology focus only on recent decades (see, for example, Wiggins & Christopherson, 2019). Laws (2016) even talked about an "enduring historical abandonment of replication" (p. 2). In order to clarify just when psychologists started to examine replication, Makel, Plucker, and Hegarty (2012) performed a bibliographic search that showed increased use of the term "replication" in scientific publications since the 1950s. Nevertheless, this cannot be taken as an indicator of the real number of replication studies because, in earlier periods, alternative expressions were used instead of "replication" (for example: "repetition," "reproduction," "redoing," and "testing").
Precisely dating the historical origin of replication in psychology is not easy. Pettit (2018) has developed a timeline which chronologically situates a variety of milestone experiments and their successful or unsuccessful replications, starting with the examination of Anton Mesmer's magnetic (hypnotic) therapy at the French Academy of Sciences in 1784. Danziger and Shermer (1994) offer the only historical analysis dealing with practices of replication among psychologists of the past. They found that in the late nineteenth and early twentieth centuries, investigators "(…) regarded each additional subject exposed to constant experimental procedures [in their own research,] as a replication of the original experiment" (Danziger & Shermer, 1994, p. 21). Psychologists generally disagreed on replications and the replicability of each other's experiments. Danziger and Shermer (1994) demonstrate this via the Wundt-Bühler controversy (1907/08) concerning the value of experiments on thinking and the Baldwin-Titchener controversy (1895/96) over the criteria for the selection of experimental subjects. The two historians conclude that the issue of what constitutes a valid object of study (in the first case) and a valid experimental subject (in the second) lay at the heart of these controversies.
Following Collins and Hacking, Danziger and Shermer (1994) attributed the clashes over replication to divergent styles of scientific practice. Wilhelm Wundt (1832-1920) and Edward Bradford Titchener (1867-1927) drew their models for proper research from the natural sciences: physics and physiology. For them, experimentation involved the creation of artificial laboratory conditions offering a high degree of "purity," which could only be achieved by carefully selecting experimental subjects and isolating certain features or variables. The price of this "purity" was severe restrictions on the objects and scope of research. Bühler and Baldwin, in contrast, adopted a looser and more open-minded approach. Bühler accepted a mental state triggered by an intellectual task or puzzle as a valid object of study. Baldwin argued for using a sample of experimental subjects taken "from the given, everyday world," instead of following Wundt's rule of carefully selecting trusted insiders (who might be biased).
Nevertheless, there are several problems relating to the historical analysis undertaken by Danziger and Shermer (1994), as I will show in the present paper. To begin with, their study does not at all warrant their general concluding claims about psychologists assuming that their research methods are independent of the object of study. Moreover, the fact that human beings are historical entities, changing over time, does not, in principle, preclude investigation of some relatively stable, shared, or regular psychological features or processes that can be identified and studied via (replicated) experimentation. 4 My aim is to examine in more detail the way replication worked in early psychological experimentation. On a methodological level, my approach differs from previous research in several ways. I start my narrative earlier, focusing on some classical psychological experiments from the period 1860-1900, a key period in the emergence of psychology as an experimental science. Moreover, and contrary to Collins' claim that replication rarely worked in science and to Danziger and Shermer's focus on controversies and failed replications, my historical examples include some successful repetitions. Nevertheless, in line with Collins (1985) and Shapin and Schaffer (1985), my examples also show that agreement on the criteria that separate a "genuine" or "correct" replication from a "mistaken copy" becomes problematic as soon as we move into a controversial setting. This is the case even when both researchers share their model for proper research and many salient features of their laboratory routine. Furthermore, the present analysis of historical case studies shows that replication served a rich variety of functions.
I argue that replication constituted a tool not an end; it was a means of acquiring expertise and group identity, providing knowledge of and acquaintance with specific individual practical features, as well as a way to discuss and even attack theories held by well-known authorities.

Fechner's Experimental Praxis
Psychophysics was developed by Gustav Theodor Fechner (1801-1887), a physicist who worked on electricity and optics. In the 19th century, in fields such as optics, a new kind of experimentation became common, centered increasingly on quantification and measurement (Buchwald, 1989). Hacking (2014) showed that this was a general trend that reached the social sciences and psychology, stating: "the world was becoming numerical." Statistics were increasingly used to record and treat census figures (Porter, 1995), and toward the end of the century became a way to model variation and to infer general trends from numerical data (Gigerenzer et al., 1989). Fechner was aware of the work of Laplace, Bernoulli, and Gauss and contributed substantially to the development of statistical methods. Keen on quantification, he used the error law in his optics work and later extended its use to his psychophysical experiments. Furthermore, in the debate about determinism, he adopted an indeterminist stance, which he later tried to support mathematically (Hacking, 2014; Heidelberger, 2004). 5 Within his broader philosophical (panpsychist) project 6 and inspired by the work of the physiologist Ernst Weber, Fechner defined psychophysics as an "exact doctrine [dealing with] the functional, or interdependent, relation between body and soul (…)" 7 (Fechner, 1860, p. 8). In order to explore this relation, he experimented on himself using three main methods, 8 one of which was the method of "adjustment," used to determine the just noticeable difference (JND). In this method, a subject is asked to select the level of intensity of a stimulus (a weight, sound, or light) that is just barely detectable or at the same level as another stimulus. Fechner took as his starting point the "principle of insufficient reason" 9 which assumes that random variations can be compensated through numerous repetitions of experiments.
Thus, he stated that irregular "casualties [must be] compensated through frequent repetitions in a way that if the variation and sensibility stay the same, one obtains coincident results in the measurements taken at different times; this way the individual casualty loses weight and the final results are in so far independent from chance" (Fechner, 1860, p. 79; the emphasis is Fechner's).
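The reasoning behind compensating chance variation through repetition can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a reconstruction of Fechner's own calculations: the Gaussian noise model, the "true" value, the spread, and the function name `simulate_series` are all hypothetical, and the two trial counts are merely of the order Fechner reports for his weight series.

```python
import random
import statistics


def simulate_series(true_value, noise_sd, n_trials, seed):
    """Return n_trials noisy judgments scattered around true_value.

    Gaussian noise and every parameter value are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [rng.gauss(true_value, noise_sd) for _ in range(n_trials)]


TRUE_VALUE = 10.0  # hypothetical constant quantity being estimated
NOISE_SD = 1.0     # hypothetical spread of the chance variations

# Two independent long series, standing in for measurements taken at different times.
first_series = simulate_series(TRUE_VALUE, NOISE_SD, 24_576, seed=1)
second_series = simulate_series(TRUE_VALUE, NOISE_SD, 16_384, seed=2)

mean_first = statistics.fmean(first_series)
mean_second = statistics.fmean(second_series)

# Any single judgment may be far off, but the long-run means nearly coincide.
largest_single_error = max(abs(x - TRUE_VALUE) for x in first_series)
print(f"largest single deviation: {largest_single_error:.2f}")
print(f"mean of first series:     {mean_first:.3f}")
print(f"mean of second series:    {mean_second:.3f}")
```

In this toy model the individual "casualty" loses weight in exactly the sense of the quotation: the standard error of the mean shrinks with the square root of the number of trials, so two long series agree closely even though single judgments scatter widely.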
To gather the high number of empirical observations needed for his calculations, Fechner experimented regularly at his home in Leipzig. "For several years [since 1855]," he explains, "I considered it as part of my daily work to undertake experiments during 1 hour (…)" (Fechner, 1860, p. 93). Psychophysical methods require working with formulas and measurement, and much time and patience in collecting data. For example, in the years 1856 and 1857, he explored his perception of a series of weights using the method of adjustment. He needed 24,576 experiments in which he compared his subjective judgments of the weights with the "real" weight of the stimuli (measured in grams). On the whole, his measurements led him to conclude that the magnitude of sensation corresponds to the logarithm of the magnitude of the physical stimulus. 10 In the following 2 years (1858-1859), he completed another 16,384 experiments. After comparing both series, he concluded with satisfaction: "(…) the main result of this [later] series constitutes a complete confirmation of the previous results" (Fechner, 1860, p. 196).
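The logarithmic relation mentioned above is usually written in the following textbook form. The symbols are assumptions introduced here for illustration and are not taken from Fechner's text: I is the physical stimulus magnitude, I_0 the threshold magnitude at which sensation is zero, S the magnitude of sensation, and c and k are constants.

```latex
% Weber's observation: the just noticeable increment is a constant fraction of the stimulus
\frac{\Delta I}{I} = c
% Assumption: every just noticeable difference adds the same increment of sensation
\mathrm{d}S = k \, \frac{\mathrm{d}I}{I}
% Integrating from the threshold I_0 (where S = 0) gives the logarithmic relation
S = k \ln \frac{I}{I_0}
```

Read this way, equal ratios of stimulus magnitude correspond to equal differences in sensation, which is what the confirmation across the two series of weight experiments was taken to support.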

Criticisms and Appropriations
Although Fechner did most of the experiments by himself at home, he did not work in isolation. Because of his bad eyesight, he needed others to do the experiments with lights. Thus, Fechner (1860) explained how some of the experiments were performed and repeated by his brother-in-law, the physiologist Alfred Volkmann, as well as by other colleagues (physicists and physiologists) who obtained similar results. 11 According to Heidelberger (2004), "Fechner's formulas and methods had unleashed enough 'paradigmatic energy' to start-off a new, normal, scientific tradition" (p. 212). Whether we can speak here of a specific paradigm or not is questionable, but it was clearly a challenging project which in the 1870s triggered strong reactions from philosophers, physiologists, and psychologists. Fechner's idealist approach clashed with the then growing materialism (Heidelberger, 2004). The philosophers Bergson and Mach questioned Fechner's attempts to measure sensations, arguing that the intensity of a sensation is not graded, that is, not composed of a sum of psychological (JND) units.
Psychologists such as Wundt and Titchener, together with Hermann Ebbinghaus, Marcel Foucault, and Georg Elias Müller, also criticized Fechner's work 12 and rejected his spiritualist stance. At the same time, however, by replicating his experiments they adopted and tried to improve his methods. They viewed psychophysics as an innovative method for psychology. It required the use of some physiological instruments, as well as mathematical (statistical) skills, a mastery that was not acquired in philosophy courses. For psychologists, it became a way to obtain experimental know-how, as well as a means of distinguishing themselves professionally from (non-psychologist) philosophers and of aligning their research methods with those of natural scientists.
Psychophysics became disconnected from Fechner's philosophical project and turned into an empirical science, attempting to reveal replicable trends in human perception. 13 Viewed as a scientific (objective) method, psychophysics was attractive to scholars from very different cultural, political and religious backgrounds who replicated Fechner's experiments. Two examples follow. Apart from the well-known psychologists cited above, most of whom had Protestant backgrounds, the German Catholic priest Constantin Gutberlet (1837-1928) 14 also became interested. Eager to understand the working of the new "science of the soul," he started to replicate psychophysical experiments. His aim was to discuss its findings and judge the worth of its theories. This was rare among Catholic priests; the conflict (Kulturkampf) of the 1860s, when Bismarck's policy clashed with the Vatican's interests, had placed them in a difficult position. After 1870, the situation improved with the papal publication of Aeterni Patris in 1879, encouraging Catholic thinkers to embrace modern science. This was taken up by the Görres-Gesellschaft, whose influential journal ("Philosophisches Jahrbuch der Görres-Gesellschaft") was edited by Gutberlet between 1888 and 1924. Despite the fact that neo-scholasticism became an international movement, it was still controversial for a priest to be actively involved in an experimental science such as psychology. Gutberlet (1905) prepared a critical and comprehensive exposition of psychology's main findings and principles. On the one hand, he presented the empirical results together with current psychological theories and debates. On the other, he added corollaries to the theoretical discussions, arguing from a neo-scholastic point of view against the prevalent "materialism" in psychology as well as Fechner's panpsychism (see Gutberlet, 1905, pp. 191-193).
Another promoter of psychophysics was the younger psychologist Martín Navarro Flores, who worked in Catholic Spain. He was a member of the pedagogical institution "Institución Libre de Enseñanza," which endorsed an idealist-positivist philosophy (called Krausism) and launched a progressive, freethinking, educational reform movement. 15 In his textbook on experimental psychology, published in 1915, he dealt in depth with sensation and perception, frequently citing Fechner's work (Carpintero, 2004). After lamenting Spain's backwardness, Navarro Flores (1915) invited his colleagues to engage in psychological experimentation by replicating psychophysical experiments (Lafuente, 1988). The assumption that cultural and national differences influence the outcome of psychological experiments and mental testing was widespread at that time in Spain (see, for example). We need to explore how the Spanish people will react to the tasks, he argued (Navarro Flores, 1915). Thus, instead of reproducing Fechner's results, Navarro expected psychophysical experiments to reveal the idiosyncratic way "the Spanish mind" perceives the world. He did this despite criticizing Fechner for not having considered further the role of physiological (bodily) processes as a bridge between physical and psychological phenomena (Lafuente, 1988). The strong neurophysiological tradition of Cajal's school in Spain at that time placed much emphasis on brain functions, triggering his interest in what Fechner called "inner psychophysics."
In short, Fechner's panpsychic theory was mostly rejected by psychologists. His psychophysical experiments were also controversial and bombarded with criticism. At the same time, psychologists such as Wundt, Titchener, and Ebbinghaus welcomed psychophysics as a new way to experiment with psychological processes.
They replicated Fechner's experiments, motivated by the hope of finding a stable trend in the repeated measurements that would lead to knowledge about the functional relation between the outer world and human perception (i.e., the inner world). While the praxis of psychophysical experimentation acquired significance as a marker of the identity of an emerging scientific community, later called "experimental psychology," 16 replication of Fechner's methods was also undertaken and promoted for different purposes outside the inner circle of German experimentalists.

Ebbinghaus' Memorizing Routine
The innovative memory experiments conducted with great care in the 1880s in Berlin by Hermann Ebbinghaus (1850-1909) were celebrated by Georg Elias Müller, William James, and Edward B. Titchener as a great advancement because for the first time a "central [higher] psychological function" had been experimentally investigated 17 (Müller & Schumann, 1894; Shakow, 1930). Ebbinghaus promoted psychology as a natural science and was, therefore, considered to be a "materialist." 18 He was a rather independent researcher, strongly influenced by Fechner's psychophysics. 19 Ebbinghaus' aim was to study the effect of time on rudimentary memorizing. In 1883, he started exploring the time conditions under which he could learn and reproduce series of syllables without error (Ebbinghaus, 1885). 20 His contribution is widely known; 21 thus, I will summarize it only very briefly: Using himself as the experimental subject, he read aloud lists containing varied numbers (12, 24, 36, etc.) of "nonsense syllables" at a rapid rate of 150 syllables per minute. When he finally had the impression that he had managed to memorize them, he tried to repeat the series by heart at the same pace. Whenever he noticed a mistake or a hesitation, he resumed reading until the learning was successful.
Ebbinghaus evaluated the effort ("Arbeit") needed for the learning indirectly, by measuring the number of readings and the time required to learn each list (Ebbinghaus, 1885, p. 41). To do this, he did not use any sophisticated instruments: only the lists of syllables, a metronome, a watch (or a chronoscope), and a rope with wooden balls for keeping count. His principal results were that learning a longer list requires more repetitions and that relearning a list "costs" less time than learning it for the first time; the more time passes, the smaller the advantage (see Table 1).
The second column shows the time that has passed since the first learning of the syllable list. The percentages in the third column show that relearning after 20 min was the most efficient, requiring the least time (in comparison to the time needed to learn the list of syllables for the first time). This advantage decreases as the time interval between learning and relearning increases. After 1 hour, for example, the percentage of time saved decreases from 58.2% to 44.2%, and it steadily decreases until reaching 21.1% after 31 days (see last percentage in the third column). The last column shows the complementary percentage of forgetting. Like Fechner, Ebbinghaus argued that these measurements conform to a logarithmic formula which can be represented as a curve.
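The arithmetic behind the savings and forgetting percentages can be sketched as follows. This is a minimal illustration, not Ebbinghaus' actual procedure (which is not spelled out in this section): the function names and the example timings are hypothetical, while the three savings percentages are the ones quoted in the text above.

```python
def savings_percent(first_learning_time, relearning_time):
    """Share of the original learning time that is saved when relearning, in percent."""
    if first_learning_time <= 0:
        raise ValueError("first_learning_time must be positive")
    saved = first_learning_time - relearning_time
    return 100.0 * saved / first_learning_time


def forgetting_percent(savings):
    """Complement of the savings percentage, as in the last column of the table."""
    return 100.0 - savings


# Hypothetical timings in seconds, chosen only so that the result matches
# the 58.2% figure quoted for the 20-minute interval.
example_savings = savings_percent(first_learning_time=1000.0, relearning_time=418.0)
print(f"example savings: {example_savings:.1f}%")

# Savings percentages quoted in the text, with forgetting as their complement.
reported_savings = {"20 minutes": 58.2, "1 hour": 44.2, "31 days": 21.1}
for interval, savings in reported_savings.items():
    print(f"{interval}: saved {savings:.1f}%, forgotten {forgetting_percent(savings):.1f}%")
```

Expressing retention as time saved rather than as items recalled is what allowed partial retention to be quantified even when a list could no longer be recited without error.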
Ebbinghaus seems to have had economic concerns in mind, referring in his report to "mental work" and "time savings," concepts that reflect a period of industrialization in which concerns about work (force) and economic efficiency, and soon also scientific management (Taylorism), would impose minute regulations to increase productivity rates. Danziger (1987) interpreted his terminology as reflecting an "energetic model," in which the level of psychological performance is viewed as a consequence of the mental energy accumulated and the amount of work invested in the memory task.
Even more striking is the fact that Ebbinghaus' report also examined in detail limitations and possible sources of error. He justified the fact that his experiments did not reflect the complexity of everyday life by reference to physicists who also work with abstractions. He was concerned, nevertheless, about the way in which changes in real life distorted the outcome of his measurements, such as those resulting from uneven material, changes in attention or mood, etc. Thus, Ebbinghaus adopted two strategies to ensure the objectivity of his data. Like Fechner, he expected the functional relation (Abhängigkeitsverhältnis) to be relatively constant whenever a large number of nearly mechanical repetitions were performed, in which oscillations would be compensated. Thus, he repeated his experiments over and over again, imposing a tough experimental routine on himself and a regular lifestyle. 22 When he undertook a self-examination of his own mind, as a researcher, he noticed a dangerous source of error: his own expectations, or, as he put it, the "secret influence of theories and points of view" (Ebbinghaus, 1885, p. 41). Even if he consciously tried to counteract such a trend, this would alter the natural working of the mind and thereby the outcome of the experiments. As a remedy, he made an effort to ignore the results as much as possible while doing the experiments, and to examine them carefully and critically once experimentation was over. Such reflections might explain the rather curious strategy he adopted to add further proof to his results, citing previous "control-experiments" (from 1883/84, see Ebbinghaus, 1885, p. 107). He probably felt that these offered better empirical support because, when conducting them, he was not yet aware of the results of his main study. Furthermore, he suggested using another person (who is unaware of the aim of the study) as the experimental subject in future research.
Ebbinghaus' (1885) examination of consciousness not only identified a dangerous source of bias, but also detected a will to learn about the real outcome of the experiment. He assumed that, in the long run, these contrary trends would probably cancel each other out: whenever performance might become distorted by some involuntary desire to enhance a certain effect, it is probably balanced out by other trials in which the opposing desire to discover "factual truths" prevailed. One would not want to put so much effort into one's work, he reasoned, just to base it on the weak finding of "one's own phantasy" (Ebbinghaus, 1885, p. 41).
Ebbinghaus' self-analysis is interesting. We can see that, despite his commitment to positivist epistemology, he was aware that desires and expectations might condition the working of his mind. Ebbinghaus optimistically hoped that, with the help of some cognitive strategies and numerous replications, he could rule out the two opposing biases (desires). Thus, he applied to the working of his own mind exactly the same statistical reasoning that he used in his experiments, in perfect analogy to the ruling out of confounding variables.

Early Reactions and Müller and Schumann's Replication
Ebbinghaus' (1885) work received praise as well as criticism in Germany and abroad. William James was impressed, and a book review in the British journal Mind underlined Ebbinghaus' rigor and patience (Jacobs, 1885). 23 One of the most detailed examinations and repetitions of Ebbinghaus' research on memory was performed by Müller and Schumann (1894). 24 Their aim was "to get acquainted with Ebbinghaus' methods" and "to contribute to their improvement, accurateness and extension (…)" (Müller & Schumann, 1894, p. 81). At the same time, they hoped to gather new insight into the workings of memory. Over the 4 years from 1888 to 1892, Müller and Schumann organized 13 series of experiments (each running for approximately 100 days) in which they mechanically controlled the time of exposure of each syllable with a rotation apparatus (Rotationsapparat) and adopted several strategies to make the series of syllables more homogeneous. Furthermore, in some experiments, they systematically varied a variable. For example, in one series, they changed the rhythm of intonation to study its effect on the learning process. Thus, the experimental subjects used either a trochaic or an iambic rhythm, a variation which did indeed alter the results, as can be seen in Table 2.
The list of averages shows that the time required for learning is similar with either rhythm, being slightly lower for the trochaic type (see first two lines in Table 2). Moreover, when the syllables were learned using one rhythm and afterward had to be re-learned with another, the time savings were smaller (see lines 4 (jamb-troch) and 5 (troch-jamb)) than in the cases in which the rhythm had been maintained (see lines 3 (troch-troch) and 6 (jamb-jamb)).
As in the original experiments, Müller and Schumann's replications demanded much patience to complete the many hours of experimentation. Aware of the confounding variables that could distort objective results, such as changes in mood, bias due to expectations, and the like, they followed Fechner's idea of balancing out errors by repeating a large number of trials. Müller was known as a drillmaster, imposing a tight regime of rules and checks on the students working at his laboratory. Kusch (1999) described his memory experiments as a parade-ground drill in which "(…) the actions of the subjects were highly constrained, repetitious, and somewhat mindless. Usually, the subject learned nonsense syllables without knowing the purpose of the experiment. The parallel between the parade ground and the memory experiment is strengthened further by the vocabulary used in the context of the latter: 'full hits', 'partial hits', 'drum', and 'sacrifice'" (p. 106). Here, it might be relevant to point out that Ebbinghaus and Müller had military experience in the Franco-Prussian War (1870/71).
Müller and Schumann's major innovation, though, was to separate the role of the experimenter from that of the experimental subject. They carefully checked subject reliability, selecting only those "for whom we could presuppose full reliability and a love for the truth" (Müller & Schumann, 1894, p. 264). But finding experimental subjects was not easy. They explained: "(…) memory experiments, as we have undertaken them, demand a great amount of patience and sacrifice of time and freedom in lifestyle, which is not agreeable to everyone" (Müller & Schumann, 1894, p. 264). 25 Thus, control was not only exercised within the experimental setting: as in the work of Ebbinghaus, Müller tried to control external conditions as much as possible, asking the subjects to maintain a constant lifestyle.

Table 2. "W a" refers to the arithmetic average, "W c" to the median ("Centralwert").
Condition | W a | W c | n
(…) | 20.3 | 20.0 | 48
trochaic re-learning after trochaic learning | 8.7 | 7.8 | 24
jambic re-learning after trochaic learning | 10.5 | 10.0 | 24
trochaic re-learning after jambic learning | 10.4 | 9.0 | 24
jambic re-learning after jambic learning | 8.9 | 7.4 | 24
Despite ensuring uniformity through rigorous repetitions of the procedures, Müller and Schumann were also aware of the existence of individual differences. Influenced by the work of Binet, they recognized the specific "sensorial character" of a person's memory, referring to the fact that some people seemed to remember images better, while others reacted more to sound. They argued that it is specifically the repetition of a standardized task over a long period of time that brings the idiosyncrasies of the subject's mind to light. Nevertheless, after a quick comparison, Müller and Schumann (1894) arrived at the conclusion that individual differences are less relevant than other variables. 26

Replications a Century Later
Despite Müller and Schumann's successful replication of Ebbinghaus' experiments, critical voices, past and present, have questioned the value and the validity of Ebbinghaus' legacy. Reductionistic mechanization and mathematization for experimental purposes is considered highly problematic when it comes to such a complex and meaningful process as memory. Ebbinghaus and the psychologists working on memory after him were conscious of the limitations of his insights, which referred only to a very rudimentary, highly trained kind of memorizing process. 27 Moreover, mechanical learning soon became demonized by pedagogues and psychologists; thus, Bartlett (1932) famously referred to Ebbinghaus' learning experiment as being based on irrelevant "repetition habit." Notwithstanding such criticisms, Ebbinghaus' memory research had a long-lasting impact on psychology, appearing regularly in psychological textbooks. In 1985, when the centennial of his publication was celebrated in Passau (Traxel, 1987), historians such as Danziger described "Ebbinghaus' pioneering work" as a "fundamental contribution to the development of modern psychology" (Danziger, 1987, p. 217). He saw Ebbinghaus' success as lying in the applicability of his energetic approach, which fit in well with educational concerns. Other authors such as Van Rappard (1987) stated that, although memory research had changed over time, "it can safely be said that he set the tone for what may be called in terms of Larry Laudan (1977) a 'research tradition'" (pp. 43-44).
But not all historians value Ebbinghaus' contribution. Draaisma (1995), for example, instead of considering it the beginning of a successful experimental tradition in psychology, laments the experiments as marking the end of an era of Romantic literary and neurological traditions. At the negative extreme of the spectrum of opinions, we find Smedslund, who pictured Ebbinghaus as a clever "illusionist," disqualifying his experimentation with the trivial accusation of not having kept track of the infinite number of different sources that might potentially have influenced the findings (Smedslund, 1987). 28 Psychologists, on the other hand, often took Ebbinghaus' forgetting curve as a reference point despite its shortcomings. Tulving (1985) claimed that replicating Ebbinghaus' work had become increasingly difficult over time because "most other people, especially in today's world, are probably incapable of mastering any longer lists of nonsense syllables under [Ebbinghaus' experimental] conditions" (p. 486). Whether unwilling or incapable, patient experimental subjects prepared to submit themselves to such a mechanical learning drill were never easy to find, and even when they could be found, replication was cumbersome and problematic. 29 Nevertheless, a century later, some researchers managed to successfully replicate Ebbinghaus' main experiments, following his instructions very closely (Heller, Mack, & Seitz, 1991; Murre & Dros, 2015). The aim of the replications was to "verify the reliability" of the original results and to "uncover" how the experiment was conducted (Murre & Dros, 2015, p. 2).
The graph in Figure 1 below shows Ebbinghaus' results as well as the averages obtained in the replication with regard to the percentage of time savings in the relearning process after different time intervals. Despite slight differences between the two curves, the same trend can be detected: the more time has passed since the moment a list of syllables was learned, the more time is needed to re-learn it. Thus, when such a strategy is used to visualize replication, each attempt is represented as an additional layer on the original graph (see also the three-layer graph in Murre & Dros, 2015).
While previous attempts had not managed to obtain similar data and had struggled with a problem of interference, Heller, Mack, and Seitz (1990) noticed that Ebbinghaus' original learning speed (reading 150 syllables per minute) efficiently inhibits any mental search for associations or interference between them. Thus, to be able to read the material at the prescribed pace, thorough preparation was required. Only after hours of training is it possible to obtain a regular level of learning and relearning. Similarly, Shebilske and Ebenholtz (1971) had observed that problems in replicating the original results could be attributed to the fact that "Ebbinghaus was a highly trained learner whereas most modern experiments have used naïve subject" (p. 555). Given the pre-trial training and the way Ebbinghaus set up the experiments, he was certainly not studying memory as it is generally understood or used in everyday life. Despite the shortcomings in scope and external validity, it seems clear that his work offered some new and interesting insights into the working of the mind, inspiring many psychologists after him.

Replication and Self-Perception of Emotions in Wundt's Laboratory
To facilitate the setting up of psychological experiments such as Fechner's and Ebbinghaus', in 1879, a psychological laboratory was established by Wilhelm Wundt in Leipzig and, soon after, other psychological laboratories were founded all over the world. These were places where students could specialize in psychology. Thus, in the last two decades of the 19th century, a generation of psychologists received systematic training, giving rise to a new community of experts (Ash & Geuter, 1985; Danziger, 1990). In this setting, we find another use of replication: experimental psychology became a collaborative enterprise, and the repetition of experiments was part of student training and of learning laboratory routines in hierarchically organized laboratories in which teaching and research went hand in hand. 30 Experimental psychology constituted a productive, collective, and increasingly technical undertaking. 31 Research became more limited in time, as students usually finished their PhD in one year. Although not all PhD research was experimental, replicating psychophysical experiments became a way to assert one's professional identity, demonstrating know-how that distinguished a psychologist from a (traditional) philosopher. In Leipzig, several methods were employed, such as psychophysics, mental chronometry, and introspection. 32 Introspection was a problematic method, though in the form of self-perception, it was viewed as a necessary tool because it provided access to psychology's object of study: human consciousness. Thus, Wundt declared: experimentation "enables us to repeat the subjective sensations and emotions, which come along with the process as often as we wish (…)" (Wundt, 1888, p. 433). 33 While psychologists started to have extensive experience measuring sensations, identifying and measuring emotions 34 was a more difficult task. Also, the cognitive status of emotions and their relation to bodily functions was a controversial topic.
Wundt's physiological measurements of emotions, together with the James-Lange theory (1884), underscored the relevance or even priority of physiological changes in emotional reactions. According to Dror (1998), the appeal of physiological approaches would reach its peak soon after, in 1906.

Titchener's Critique
Replication became even more difficult when experiments were repeated in order to criticize or disqualify some theory, method, or findings, leading to a scientific controversy. In the introduction, I mentioned the study by Danziger and Shermer (1994) on replications. They identified controversies between two groups of scholars: on the one hand, Wundt and Titchener, grouped together as "purists" who aimed to emulate the natural sciences, and, on the other hand, Bühler and Baldwin, who represented a looser conception of psychology and whose way of practicing experimentation clashed with that of the "purists." I will show in the following section that this description is problematic because controversies occurred even among the "purists", that is, researchers working within the Wundtian line of experimental psychology. Again, replication constituted a core element within the debate.
In 1899, Titchener, Wundt's most faithful follower, published a critique of Wundt's theory of feeling. He offered "empirical facts" (Tatsachenmaterial) contradicting the master's three-dimensional theory, which had already become highly contested among psychologists. Titchener's new facts had been gathered by one of his students ("Herr W."). 37 The result of his replication was negative: the student could only find his feelings varying from pleasure to displeasure; neither of the other two dimensions of Wundt's theory appeared in his introspective reports.
The failed replication by Titchener's student constituted a public disqualification of Wundt's research. As one contemporary observed: "Thus the patient thinking of the expert Wundt is brought to zero with the greatest dispatch (…)!" (Buchner, 1900, p. 96). Wundt (1900) was offended and immediately questioned the validity of Titchener's critique and his student's replication. He denounced several methodological errors as reasons why the results differed. First, he criticized the fact that Titchener had left it up to a student to undertake the self-perceptions. Nevertheless, this was not a mistake: it indicates differences in research practices between the two laboratories. Whereas in Leipzig, professors and expert psychologists (for example, Wundt himself or one of his assistants) acted as experimental subjects, at the Cornell laboratory, trained students were used (not the author whose theory was being tested). 38 Second, Wundt criticized the omission of the records of the underlying bodily reactions. This again was not neglect: such physiological measurements were deemed useful in Leipzig, where Wundt claimed that bodily symptoms were the parallel expressions of feelings; but at Cornell, this theory had been rejected two years earlier by another of Titchener's students, David Irons (1897). Irons had argued that bodily responses were only incidental to emotions, not intrinsic features. Thus, for Titchener and his students, such records were not at all informative.
In 1908, Titchener confirmed his critical stance once more. Visibly disturbed by the confrontation with his former teacher, Titchener ended with the following ethical rules, phrased in a prayer-like rhetoric: "(…) we must not be dogmatic, we must not be [sic!] too impatient for results, we must not set theory above observed facts: (…) we must use all the weapons in our critical armory against ourselves as against others, and against others as against ourselves" (Titchener, 1908, p. 231). This was, again, a resounding criticism of Wundt's research method and dogmatic attitude. The main "weapon" of that "critical armory" Titchener had in mind was precisely replication in the form of empirical results that would speak for themselves.
Toward the end of the 19th century, Wundt found himself surrounded by emancipated former students. Tensions between competing researchers and laboratories polarized the variety of psychologies practiced at the time. Thus, this episode can be viewed as a single chapter in a more extensive history of repudiations of Wundt (Danziger, 1979; Mülberger, 2012). At the same time, it also shows that Danziger and Shermer's grouping together of Wundt and Titchener as "purists" is problematic. It demonstrates that their categories of analysis are not adequate when it comes to explaining the underlying split leading to controversies and failed replications. If, as they state on page 22, both Wundt's and Titchener's model of proper research was derived from the experimental sciences (namely physics and physiology), then why should they clash over a failed replication of each other's experiments?
The replication of Wundt's research on emotions by Titchener's student evidences disagreement that existed within the Wundtian tradition, involving two levels: theory and research practices. While Titchener adopted many salient features of Wundt's laboratory practices (Boring, 1927, 1950), there were also some striking differences between their philosophies of science, theories, and working styles. Titchener rejected Wundt's (parallelist) body-mind theory, as we have seen, and his epistemological stance was closer to that of the British empiricists as well as Mach (Araujo & Marcellos, 2017; Leahey, 1981).
Thus, in the historical context of competing psychological laboratories, replication acquired yet another social function: challenging claims made by established authorities. The empiricist, Titchener, was one of Wundt's most faithful students and was certainly not keen on having a personal confrontation with his former mentor. His strategy was to let the "objective" (empirical) results of this student's research "speak for themselves." Nevertheless, Wundt was well known for reacting with fiercely personal attacks whenever his work was criticized: here he would make no exception.
Titchener's strategy was not uncommon. In the 1890s, Mary Calkins and her PhD student Cornelia Nevers challenged Jastrow's study of word associations, in which he had examined group differences (Mülberger, 2017). Jastrow had recognized distinctive trends in the responses when comparing men's reactions to women's, attributing some differences to women's "household instincts." With the help of several repetitions, the Wellesley psychologists offered different data, insisting that even if there were minor differences, these could not be attributed to any biological differences between the two groups (see also García Dauber, 2005).

Conclusions
In the present paper, which deals with the praxis of replication, replication is broadly understood as the repetition of one's own or others' experiments. I have reviewed replications in three types of classical experiments that were undertaken during the last decades of the 19th century. They do not represent every kind of psychological research performed at that time, but they do represent a kind of research in which replication played a prominent role.
Overall, the early experimentalists' psychological research can be viewed as a kind of small scale and slow science, if compared with current practice, in the sense that it implied a reduced number of experimental subjects and years of repetitive experimentation. My analysis suffices to show that replication was an issue in the work of these experimentalists and, thus, has a longer tradition in psychology than current publications seem to imply.
The core idea when seeking confirmation was that psychological measurements under experimental conditions could indicate some stable, functional relations between two variables. In order to rule out other (undesired) influences, Fechner and Ebbinghaus repeated their experiments thousands of times and invited other researchers to do further replications. Thus, the first role of replication was to balance out error: to counteract the disturbing effects of confounding variables. Moreover, the findings, patiently acquired with the help of a few experimental subjects, were expected to help understand the workings of the human mind. Any additional subject going through the same experimental conditions represented a replication.
In the psychological publications of the 19th century, findings (including formulas and data) were generally explained in the text. Additionally, some main results (usually averages for each experimental subject or experiment) were listed and presented in tables. Graphs were more rarely employed, posing difficulties regarding the representation of exact amounts. Replications were often represented within tables and graphs as an additional layer, adding a column or curve for each attempt. Such a "literary device" facilitated comparison.
No historian doubts that human beings and society at large have changed since the 19th century. At first sight, such an awareness seems to deny any useful role for replications across time and history. Moreover, there is no standard human mind but only very different and unique people. Given these assumptions, how can we expect psychological findings ever to be stable? Historians and philosophers of psychology nowadays generally adopt a negative position and often reject the value of experimentation and replication altogether. But in the light of the present research, this seems to be an overstatement, as there are psychological findings that have been successfully replicated. An experiment such as Ebbinghaus' can be reproduced yielding similar results, even after a century. This seems to indicate either a certain level of generality in the workings of certain psychological processes across individuals and time, or a certain stability of social conventions. In any case, the fact that humans and their worlds change over time does not per se preclude historical-psychological continuity and the possibility of replication fulfilling different roles.
The historical cases of my research demonstrate further that replication was not an end but a tool: a tool that could be employed for various purposes. This does not mean that the outcome was necessarily made up. Repetitions were done with a certain aim in mind, be this to confirm the original outcome or to reject it, offering different data whenever the original study was not deemed convincing. We have reason to assume that when an experiment did not fulfill its aim, the researcher did not feel inclined to publish the results (probably distrusting his method, subjects, and/or the workings of his instruments). At a later point, he might, nevertheless, suddenly use them to support his claim and findings.
To sum up, taking the historical replications together, we arrive at the following list of social functions and methodological purposes:
1. balancing out undesired variations (errors/confounding variables) in the measurements,
2. testing the stability of a finding (sometimes expressed in the form of a law or a statistical trend) using different experimental subjects, or repeating the experiment at a different place or time,
3. becoming acquainted with certain methods and gaining expertise within a scientific area,
4. training students who could then demonstrate their distinctive expertise and professional identity.
In these four cases, the replications aimed at the reproduction of exactly the same results as the previous (original) experiment, which means that the first results were taken as authoritative guidance and calibration. Any variation could be worrisome, leading the experimenter to revise the apparatus and the way the experiment was done.
In other situations in which an experiment was repeated, a difference with regard to the former results was acceptable and to a certain extent even expected. This would be the case for repetitions (replications) employed for the following purposes:
5. exploring the effect of some new variables (not varied or controlled in the original study),
6. obtaining, through standardized repetition, hints as to the particularities and types of an experimental subject's mind,
7. cross-cultural appropriation: foreign scholars sought to learn how to do psychological experiments and to align their work with the scientific community, while pursuing their own religious and political agendas.
In the two cases of cross-cultural appropriation presented, Gutberlet attempted to gain expertise by obtaining the same results, while Navarro expected the Spanish mind to react differently to the stimuli.
Finally, yet another purpose for which an experiment was repeated was: 8. to challenge a psychological theory or empirical findings published by an authority in an impersonal way (letting the empirical findings, i.e., "nature" talk).
In the last case, the experimenter expected to arrive at different data or findings. I have argued that such a strategy was used by Titchener to reject Wundt's theory while avoiding a personal confrontation. It was a strategy that could empower researchers from a weaker position or social status to question a well-established authority.
Finally, my historical examples can hardly give a definitive answer with regard to the broad question about the link between replicability and the wider context of academic life and modern society. Nevertheless, the present research seems to indicate some striking differences in the pace, time, and effort dedicated to replication. The reports of Fechner, Ebbinghaus, and some of their followers capture their (and their experimental subjects') personal devotion and deep moral commitment to science as a way to "truth" that seemed to outweigh, at least to a certain extent, personal ambition. This can be seen in the numerous repetitions of their experiments as well as the careful selection of experimental subjects. While there could have been a file-drawer problem because researchers did not always immediately publish the results of all their experiments and replications, it is difficult to imagine early experimentalists such as Fechner, Ebbinghaus, Müller, Wundt, Titchener, or Calkins engaging in questionable practices such as p-hacking, even if they had had such a tool at hand, and even if it would have been beneficial for their academic careers. But whether this speculation is correct is difficult to say.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

He was also influenced by the work of the British associationists and the physiologists Hermann von Helmholtz and Ewald Hering. His approach is difficult to classify because he did not follow any specific psychological school or system, receiving vague descriptions such as "methodology first" (van Rappard, 1987), "eclectic" (Boring, 1950), or "functionalist" (Caparrós & Anguera, 1986).
20. With regard to the origin of this study in Ebbinghaus' dissertation, habilitation and the "Urmanuskript" of 1880, see Gundlach (1985).
21. For some recent examples of studies referring to Ebbinghaus' forgetting curve and classical monograph, see Cyr and Hirst (2019), Huang et al. (2021), and Otani et al. (2018).
22. Lifestyle refers to keeping a regular schedule for working, eating, and sleeping, avoiding, as much as possible, any disturbances such as travels or extreme emotional situations. See also Schaffer (1988) on the importance of lifestyle for astronomical observations in the 18th century.
23. The reviewer was the psychologist Joseph Jacobs, who published a short article on memory himself 2 years later; see Jacobs (1887).
24. Georg Elias Müller (1850-1934) had been a student of Lotze and became his successor at the University of Göttingen (Haupt, 2001). According to Boring (1935), Lotze's support was due to his conviction that Müller possessed excellent qualities for doing experimental work in psychophysics. His co-experimenter, Friedrich Schumann, was at that time a doctoral candidate in physics who had chosen philosophy as his second subject. For more information on Schumann, see Lüer (2007) and Metzger (1940).
25. Nevertheless, in the end, they managed to finish all the series with five subjects: the experimental subjects were G.E. Müller and Schumann for each other's experiments, together with the students Pilzecker, Hoffmann, and Höltzcke.
26. In the following years, Müller continued to work in the field of memory research, making significant contributions. For example, with Pilzecker, he studied associations, showing that the speed with which they are reproduced reflects their strength (Müller & Pilzecker, 1900); and he later developed the Treffermethode (Müller, 1911, 1913, 1917).
27. See, for example, Anderson's (1985) and Kintsch's (1985) critical appraisals.
28. For a more thorough response to Smedslund, see Teigen (1999); for a recent appraisal of Smedslund's legacy in psychology, see Lindstad, Stänicke, and Valsiner (2020).
29. My aim here is not to present all the replications and the reactions toward Ebbinghaus' research. Hakes et al. (1964) and Young et al. (1965) criticize some methodological shortcomings; for references including unsuccessful replications, see Murre and Dros (2015).
30. The human relations within these laboratories were not the same (Kusch, 1999), but they all offered such practical laboratory training. At Leipzig, these were called "Psychologische Übungen," and at the University of Cornell, students referred to them (though informally) as "Laboratory drill courses" (Boring, 1927, p. 497).
31. For information on the instruments used, see Wontorra (2013) and Haupt (2001). See also Gundlach (1996) on the symbolic role of the chronoscope.
32. Bringmann & Tweney, 1980; Danziger, 1990; Kusch, 1999. For a complete presentation of Wundt's psychology, see Araujo, 2016. On the use of chronometry, see Schmidgen, 2004, 2014, and on introspection, see Danziger, 1980, and Feest, 2012.
33. For more detailed instructions on how introspective experiments work, he referred the reader in later editions of the Grundriss (Wundt, 1896/1913) to the textbook of his former student, Titchener (1900/1916).
34. The original term used at that time was "feeling" (Gefühl). It corresponds more to what we now understand as emotions, so this latter term is usually used in the secondary literature.
35. Sometimes the first pair is referred to as pleasantness-unpleasantness, and the second as excitement-tranquillization. For a discussion on Wundt's terminology and the difficulty of English translation, see Titchener, 1908.
36. Called "objective symptoms" by Wundt; the method used to register them was called "Ausdrucksmethode."
37. Given that the observations were made in the academic year 1897/98, Herr Watt might refer to the American educational psychologist Guy Montrose Whipple (1876-1941), who joined Cornell University in 1898 and worked as an assistant in psychology until 1902. He received his PhD in 1900 under the supervision of Edward B. Titchener (Ruckmick, 1942).
38. In the eyes of other American psychologists such as Baldwin, this would still be considered too restricted (Danziger & Shermer, 1994).