Introduction
Revolutions in science have often been preceded by revolutions in measurement.
Sinan Aral (cited in Cukier, 2010)
Big Data creates a radical shift in how we think about research …. [It offers] a profound change at the levels of epistemology and ethics. Big Data reframes key questions about the constitution of knowledge, the processes of research, how we should engage with information, and the nature and the categorization of reality … Big Data stakes out new terrains of objects, methods of knowing, and definitions of social life. (boyd and Crawford, 2012)
As with many rapidly emerging concepts, Big Data has been variously defined and operationalized, ranging from trite proclamations that Big Data consists of datasets too large to fit in an Excel spreadsheet or be stored on a single machine (Strom, 2012) to more sophisticated ontological assessments that tease out its inherent characteristics (boyd and Crawford, 2012; Mayer-Schonberger and Cukier, 2013). Drawing on an extensive engagement with the literature, Kitchin (2013) details that Big Data is:
• huge in volume, consisting of terabytes or petabytes of data;
• high in velocity, being created in or near real-time;
• diverse in variety, being structured and unstructured in nature;
• exhaustive in scope, striving to capture entire populations or systems (n = all);
• fine-grained in resolution and uniquely indexical in identification;
• relational in nature, containing common fields that enable the conjoining of different data sets.
In other words, Big Data is not simply denoted by volume. Indeed, industry, government and academia have long produced massive data sets – for example, national censuses. However, given the costs and difficulties of generating, processing, analysing and storing such datasets, these data have been produced in tightly controlled ways using sampling techniques that limit their scope, temporality and size (Miller, 2010). To make the exercise of compiling census data manageable, censuses have been produced once every five or ten years, asking just 30 to 40 questions, and their outputs are usually quite coarse in resolution (e.g. local areas or counties rather than individuals and households). Moreover, the methods used to generate them are quite inflexible (for example, once a census is set and is being administered it is impossible to tweak or add/remove questions). Whereas the census seeks to be exhaustive, enumerating all people living in a country, most surveys and other forms of data generation are samples, seeking to be representative of a population.
In contrast, Big Data is characterized by being generated continuously, seeking to be exhaustive and fine-grained in scope, and flexible and scalable in its production. Examples of the production of such data include: digital CCTV; the recording of retail purchases; digital devices that record and communicate the history of their own use (e.g. mobile phones); the logging of transactions and interactions across digital networks (e.g. email or online banking); clickstream data that record navigation through a website or app; measurements from sensors embedded into objects or environments; the scanning of machine-readable objects such as travel passes or barcodes; and social media postings (Kitchin, 2014). These are producing massive, dynamic flows of diverse, fine-grained, relational data. For example, in 2012 Wal-Mart was generating more than 2.5 petabytes (2^50 bytes) of data relating to more than 1 million customer transactions every hour (Open Data Center Alliance, 2012) and Facebook reported that it was processing 2.5 billion pieces of content (links, comments, etc.), 2.7 billion ‘Like’ actions and 300 million photo uploads per day (Constine, 2012). Handling and analysing such data is a very different proposition to dealing with a census every 10 years or a survey of a few hundred respondents.
Whilst the production of such Big Data has existed in some domains, such as remote sensing, weather prediction, and financial markets, for some time, a number of technological developments, such as ubiquitous computing, widespread internetworking, and new database designs and storage solutions, have created a tipping point for their routine generation and analysis, not least of which are new forms of data analytics designed to cope with data abundance (Kitchin, 2014). Traditionally, data analysis techniques have been designed to extract insights from scarce, static, clean and poorly relational data sets, scientifically sampled and adhering to strict assumptions (such as independence, stationarity, and normality), and generated and analysed with a specific question in mind (Miller, 2010). The challenge of analysing Big Data is coping with abundance, exhaustivity and variety, timeliness and dynamism, messiness and uncertainty, high relationality, and the fact that much of what is generated has no specific question in mind or is a by-product of another activity. Meeting such a challenge was until recently too complex and difficult, but it has become feasible due to high-powered computation and new analytical techniques. These new techniques are rooted in research concerning artificial intelligence and expert systems that has sought to produce machine learning that can computationally and automatically mine and detect patterns, build predictive models and optimize outcomes (Han et al., 2011; Hastie et al., 2009). Moreover, since different models have their strengths and weaknesses, and it is often difficult to prejudge which type of model and its various versions will perform best on any given data set, an ensemble approach can be employed to build multiple solutions (Seni and Elder, 2010). Here, literally hundreds of different algorithms can be applied to a dataset to determine the best or a composite model or explanation (Siegel, 2013), a radically different approach to that traditionally used, wherein the analyst selects an appropriate method based on their knowledge of techniques and the data. In other words, Big Data analytics enables an entirely new epistemological approach for making sense of the world; rather than testing a theory by analysing relevant data, new data analytics seek to gain insights ‘born from the data’.
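As an illustration of the ensemble idea described above, the sketch below fits several candidate models to the same data, compares them by cross-validation, and combines them into a composite model. It is a minimal sketch only: the synthetic dataset, the particular scikit-learn estimators, and the accuracy metric are illustrative assumptions, not the configurations used in the studies cited.

```python
# Minimal sketch of the ensemble approach: fit several candidate models,
# compare them by cross-validation, and combine them into a composite
# (voting) model. The dataset and model choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

candidates = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5),
    "forest": RandomForestClassifier(n_estimators=200),
}

# Score each candidate rather than pre-selecting one model by hand.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")

# A composite (ensemble) model built from all candidates.
ensemble = VotingClassifier(estimators=list(candidates.items()), voting="hard")
print("ensemble:", cross_val_score(ensemble, X, y, cv=5).mean())
```

In practice the candidate pool would be far larger and the winner (or the composite) would be chosen on held-out performance rather than the analyst's prior preference for a technique.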
The explosion in the production of Big Data, along with the development of new epistemologies, is leading many to argue that a data revolution is under way that has far-reaching consequences for how knowledge is produced, business conducted, and governance enacted (Anderson, 2008; Bollier, 2010; Floridi, 2012; Mayer-Schonberger and Cukier, 2013). With respect to knowledge production, it is contended that Big Data presents the possibility of a new research paradigm across multiple disciplines. As set out by Kuhn (1962), a paradigm constitutes an accepted way of interrogating the world and synthesizing knowledge common to a substantial proportion of researchers in a discipline at any one moment in time. Periodically, Kuhn argues, a new way of thinking emerges that challenges accepted theories and approaches. For example, Darwin’s theory of evolution radically altered conceptual thought within the biological sciences, as well as challenging the religious doctrine of creationism. Jim Gray (as detailed in Hey et al., 2009) charts the evolution of science through four broad paradigms (see Table 1). Unlike Kuhn’s proposition that paradigm shifts occur because the dominant mode of science cannot account for particular phenomena or answer key questions, thus demanding the formulation of new ideas, Gray’s transitions are founded on advances in forms of data and the development of new analytical methods. He thus proposes that science is entering a fourth paradigm based on the growing availability of Big Data and new analytics.
Kuhn’s argument has been subject to much critique, not least because within some academic domains there is little evidence of paradigms operating, notably in some social sciences where a diverse set of philosophical approaches is employed (e.g. human geography, sociology). In other domains, such as the sciences, there has been more epistemological unity around how science is conducted, using a well-defined scientific method underpinned by hypothesis testing to verify or falsify theories. Moreover, paradigmatic accounts produce overly sanitized and linear stories of how disciplines evolve, smoothing over the messy, contested and plural ways in which science unfolds in practice. Nevertheless, whilst the notion of paradigms is problematic, it has utility in framing the current debates concerning the development of Big Data and its consequences, because many of the claims being made with respect to knowledge production contend that a fundamentally different epistemology is being created; that a transition to a new paradigm is under way. However, the form that this new epistemology is taking is contested. The rest of this paper critically examines the development of an emerging fourth paradigm in science and its form, and explores to what extent the data revolution is leading to alternative epistemologies in the humanities and social sciences and changing research practices.
Computational social sciences and digital humanities
Whilst the epistemologies of Big Data empiricism and data-driven science seem set to transform the approach to research taken in the natural, life, physical and engineering sciences, their trajectory in the humanities and social sciences is less certain. These areas of scholarship are highly diverse in their philosophical underpinnings, with only some scholars employing the epistemology common in the sciences. Those using the scientific method in order to explain and model social phenomena, in general terms, draw on the ideas of positivism (though they might not adopt such a label; Kitchin, 2006). Such work tends to focus on factual, quantified information – empirically observable phenomena that can be robustly measured (such as counts, distance, cost, and time), as opposed to more intangible aspects of human life such as beliefs or ideology – using statistical testing to establish causal relationships and to build theories and predictive models and simulations. Positivistic approaches are well established in economics, political science, human geography and sociology, but are rare in the humanities. However, within those disciplines mentioned, there has been a strong move over the past half century towards post-positivist approaches, especially in human geography and sociology.
For positivistic scholars in the social sciences, Big Data offers a significant opportunity to develop more sophisticated, wider-scale, finer-grained models of human life. Notwithstanding concerns over access to social and economic Big Data (much of which is generated by private interests) and issues such as data quality, Big Data offers the possibility of shifting ‘from data-scarce to data-rich studies of societies; from static snapshots to dynamic unfoldings; from coarse aggregations to high resolutions; from relatively simple models to more complex, sophisticated simulations’ (Kitchin, 2014: 3). The potential exists for a new era of computational social science that produces studies with much greater breadth, depth, scale, and timeliness, and that are inherently longitudinal, in contrast to existing social sciences research (Lazer et al., 2009; Batty et al., 2012). Moreover, the variety, exhaustivity, resolution, and relationality of data, plus the growing power of computation and new data analytics, address some of the critiques of positivistic scholarship to date, especially those of reductionism and universalism, by providing more finely-grained, sensitive, and nuanced analysis that can take account of context and contingency, and can be used to refine and extend theoretical understandings of the social and spatial world (Kitchin, 2013). Further, given the extensiveness of data, it is possible to test the veracity of such theory across a variety of settings and situations. In such circumstances, it is argued that knowledge about individuals, communities, societies and environments will become more insightful and useful with respect to formulating policy and addressing the various issues facing humankind.
For post-positivist scholars, Big Data offers both opportunities and challenges. The opportunities are a proliferation, digitization and interlinking of a diverse set of analogue and unstructured data, much of it new (e.g. social media) and much of which has heretofore been difficult to access (e.g. millions of books, documents, newspapers, photographs, art works, material objects, etc., from across history that have been rendered into digital form over the past couple of decades by a range of organizations; Cohen, 2008), and also the provision of new tools of data curation, management and analysis that can handle massive numbers of data objects. Consequently, rather than concentrating on a handful of novels or photographs, or a couple of artists and their work, it becomes possible to search and connect across a large number of related works; rather than focus on a handful of websites or chat rooms or videos or online newspapers, it becomes possible to examine hundreds of thousands of such media (Manovich, 2011). These opportunities are most widely being examined through the emerging field of digital humanities.
Initially, the digital humanities consisted of the curation and analysis of data that are born digital and the digitization and archiving projects that sought to render analogue texts and material objects into digital forms that could be organized and searched and be subjected to basic forms of overarching, automated or guided analysis such as summary visualizations of content (Schnapp and Presner, 2009). Subsequently, its advocates have been divided into two camps. The first group believes that new digital humanities techniques – counting, graphing, mapping and distant reading – bring methodological rigour and objectivity to disciplines that heretofore have been unsystematic and random in their focus and approach (Moretti, 2005; Ramsay, 2010). In contrast, the second group argues that, rather than replacing traditional methods or providing an empiricist or positivistic approach to humanities scholarship, new techniques complement and augment existing humanities methods and facilitate traditional forms of interpretation and theory-building, enabling studies of much wider scope to answer questions that would be all but unanswerable without computation (Berry, 2011; Manovich, 2011).
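To make the ‘counting’ and ‘distant reading’ techniques referred to above concrete, the sketch below tallies a handful of terms across a whole directory of plain-text files rather than reading any single text closely. It is only a rough illustration: the corpus path and the query terms are hypothetical placeholders, and real distant-reading studies involve far more elaborate preprocessing and visualization.

```python
# Sketch of 'distant reading' as counting: tally how often a set of terms
# appears across a whole corpus of plain-text files, rather than closely
# reading any single text. Paths and terms are hypothetical placeholders.
import re
from collections import Counter
from pathlib import Path

CORPUS_DIR = Path("corpus/")             # e.g. thousands of digitized novels
TERMS = {"empire", "railway", "letter"}  # illustrative query terms

counts = Counter()
for path in CORPUS_DIR.glob("*.txt"):
    tokens = re.findall(r"[a-z']+", path.read_text(encoding="utf-8").lower())
    counts.update(t for t in tokens if t in TERMS)

for term in sorted(TERMS):
    print(term, counts[term])
```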
The digital humanities has not been universally welcomed, with detractors contending that using computers as ‘reading machines’ (Ramsay, 2010) to undertake ‘distant reading’ (Moretti, 2005) runs counter to and undermines traditional methods of close reading. Culler (2010: 22) notes that close reading involves paying ‘attention to how meaning is produced or conveyed, to what sorts of literary and rhetorical strategies and techniques are deployed to achieve what the reader takes to be the effects of the work or passage’ – something that a distant reading is unable to perform. His worry is that a digital humanities approach promotes literary scholarship that involves no actual reading. Similarly, Trumpener (2009: 164) argues that a ‘statistically driven model of literary history … seems to necessitate an impersonal invisible hand’, continuing: ‘any attempt to see the big picture needs to be informed by broad knowledge, an astute, historicized sense of how genres and literary institutions work, and incisive interpretive tools’ (pp. 170–171). Likewise, Marche (2012) contends that cultural artefacts, such as literature, cannot be treated as mere data. A piece of writing is not simply an order of letters and words; it is contextual and conveys meaning and has qualities that are ineffable. Algorithms are very poor at capturing and deciphering meaning or context and, Marche argues, treat ‘all literature as if it were the same’. He continues:
[t]he algorithmic analysis of novels and of newspaper articles is necessarily at the limit of reductivism. The process of turning literature into data removes distinction itself. It removes taste. It removes all the refinement from criticism. It removes the history of the reception of works.
Jenkins (2013) thus concludes:
the value of the arts, the quality of a play or a painting, is not measurable. You could put all sorts of data into a machine: dates, colours, images, box office receipts, and none of it could explain what the artwork is, what it means, and why it is powerful. That requires man [sic], not machine.
For many, then, the digital humanities is fostering weak, surface analysis, rather than deep, penetrating insight. It is overly reductionist and crude in its techniques, sacrificing complexity, specificity, context, depth and critique for scale, breadth, automation, descriptive patterns and the impression that interpretation does not require deep contextual knowledge.
The same kinds of argument can be levelled at computational social science. For example, a map of the language of tweets in a city might reveal patterns of geographic concentration of different ethnic communities (Rogers, 2013), but the important questions are who constitutes such concentrations, why they exist, what processes of formation and reproduction are at work, and what their social and economic consequences are. It is one thing to identify patterns; it is another to explain them. This requires social theory and deep contextual knowledge. As such, the pattern is not the end-point but rather a starting point for additional analysis, which almost certainly is going to require other data sets.
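The descriptive step behind such a map is simple to sketch, which underlines how much of the work lies in the explanation rather than the pattern. The toy example below bins a handful of hypothetical geotagged tweets into grid cells and reports the dominant language per cell; the records, field layout and cell size are all assumptions for illustration, and nothing in the output says who tweets, why the concentrations exist, or what follows from them.

```python
# Sketch of the descriptive step behind a tweet-language map: bin geotagged
# tweets into grid cells and report the most common language in each cell.
# The records are hypothetical; the aggregation reveals a pattern but says
# nothing about the processes that produce it.
from collections import Counter, defaultdict

tweets = [  # hypothetical records: (latitude, longitude, detected language)
    (53.349, -6.260, "en"), (53.351, -6.262, "pl"),
    (53.348, -6.261, "pl"), (53.360, -6.250, "en"),
]

CELL = 0.01  # grid resolution in degrees (arbitrary choice)

cells = defaultdict(Counter)
for lat, lon, lang in tweets:
    key = (round(lat / CELL), round(lon / CELL))
    cells[key][lang] += 1

for key, langs in cells.items():
    lang, n = langs.most_common(1)[0]
    print(f"cell {key}: dominant language {lang} ({n}/{sum(langs.values())} tweets)")
```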
As with earlier critiques of quantitative and positivist social sciences, computational social sciences are taken to task by post-positivists as being mechanistic, atomizing, and parochial, reducing diverse individuals and complex, multidimensional social structures to mere data points (Wyly, in press). Moreover, the analysis is riddled with assumptions of social determinism, as exemplified by Pentland (2012): ‘the sort of person you are is largely determined by your social context, so if I can see some of your behaviors, I can infer the rest, just by comparing you to the people in your crowd’. In contrast, human societies, it is argued, are too complex, contingent and messy to be reduced to formulae and laws, with quantitative models providing little insight into phenomena such as wars, genocide, domestic violence and racism, and only circumscribed insight into other human systems such as the economy, inadequately accounting for the role of politics, ideology, social structures, and culture (Harvey, 1972). People do not act in rational, pre-determined ways, but rather live lives full of contradictions, paradoxes, and unpredictable occurrences. How societies are organized and operate varies across time and space and there is no optimal or ideal form, or universal traits. Indeed, there is an incredible diversity of individuals, cultures and modes of living across the planet. Reducing this complexity to the abstract subjects that populate universal models does symbolic violence to how we create knowledge. Further, positivistic approaches wilfully ignore the metaphysical aspects of human life (concerned with meanings, beliefs, experiences) and normative questions (ethical and moral dilemmas about how things should be as opposed to how they are) (Kitchin, 2006). In other words, positivistic approaches only focus on certain kinds of questions, which they seek to answer in a reductionist way that seemingly ignores what it means to be human and to live in richly diverse societies and places. This is not to say that quantitative approaches are not useful – they quite patently are – but that their limitations in understanding human life should be recognized and complemented with other approaches.
Brooks (2013) thus contends that Big Data analytics struggles with the social (people are not rational and do not behave in predictable ways; human systems are incredibly complex, having contradictory and paradoxical relations); struggles with context (data are largely shorn of their social, political, economic and historical context); creates bigger haystacks (consisting of many more spurious correlations, making it difficult to identify needles); has trouble addressing big problems (especially social and economic ones); favours memes over masterpieces (identifies trends but not necessarily significant features that may become a trend); and obscures values (of the data producers and those that analyse them and their objectives). In other words, whilst Big Data analytics might provide some insights, it needs to be recognized that they are limited in scope, produce particular kinds of knowledge, and still need contextualization with respect to other information, whether that be existing theory, policy documents, small data studies, or historical records, that can help to make sense of the patterns evident (Crampton et al., 2012).
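Brooks’s ‘bigger haystacks’ point can be illustrated numerically: as the number of candidate variables grows, some will correlate with any target purely by chance. The simulation below is a minimal sketch using entirely synthetic random data; the sample size, variable count and correlation threshold are arbitrary choices for illustration.

```python
# Illustration of the 'bigger haystacks' point: among many random, unrelated
# variables, some correlate noticeably with a target purely by chance, and the
# number of such spurious 'needles' grows with the number of variables tested.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 100, 10_000

target = rng.normal(size=n_obs)                # outcome with no real predictors
candidates = rng.normal(size=(n_vars, n_obs))  # unrelated candidate variables

# Pearson correlation of each candidate variable with the target.
corrs = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])

print("strongest spurious correlation:", np.abs(corrs).max().round(3))
print("variables with |r| > 0.3:", int((np.abs(corrs) > 0.3).sum()))
```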
Beyond the epistemological and methodological approach, part of the issue is that much Big Data and analysis seem to be generated with no specific questions in mind, or the focus is driven by the application of a method or the content of the data set rather than a particular question, or the data set is being used to seek an answer to a question that it was never designed to answer in the first place. With respect to the latter, geotagged Twitter data have not been produced to provide answers with respect to the geographical concentration of language groups in a city and the processes driving such spatial autocorrelation. We should perhaps not be surprised, then, that they only provide a surface snapshot, albeit an interesting snapshot, rather than deep penetrating insights into the geographies of race, language, agglomeration and segregation in particular locales.
Whereas most digital humanists recognize the value of close readings, and stress how distant readings complement them by providing depth and contextualization, positivistic forms of social science are oppositional to post-positivist approaches. The difference between the humanities and social sciences in this respect arises because the statistics used in the digital humanities are largely descriptive – identifying and plotting patterns. In contrast, the computational social sciences employ the scientific method, complementing descriptive statistics with inferential statistics that seek to identify associations and causality. In other words, they are underpinned by an epistemology wherein the aim is to produce sophisticated statistical models that explain, simulate and predict human life. This is much more difficult to reconcile with post-positivist approaches. Advocacy then rests on the utility and value of the method and models, not on providing complementary analysis of a more expansive set of data.
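A toy example may help pin down the contrast drawn here. In the sketch below, the descriptive step simply summarizes a pattern in two synthetic variables, while the inferential step fits a model and attaches a significance test to the association; all data and variable names are invented for illustration.

```python
# Toy contrast between descriptive and inferential statistics on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)                # e.g. some quantified social measure
y = 3.0 + 0.5 * x + rng.normal(0, 2, size=200)  # a second, related measure

# Descriptive (the digital humanities register): identify and report a pattern.
low, high = y[x < 5].mean(), y[x >= 5].mean()
print(f"mean of y when x is low: {low:.2f}; when x is high: {high:.2f}")

# Inferential (the computational social science register): model the association
# and test whether it is statistically distinguishable from chance.
result = stats.linregress(x, y)
print(f"estimated slope: {result.slope:.2f}, p-value: {result.pvalue:.3g}")
```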
There is a potentially fruitful alternative to this position that adopts and extends the epistemologies employed in critical GIS and radical statistics. These approaches employ quantitative techniques, inferential statistics, modelling and simulation whilst being mindful and open with respect to their epistemological shortcomings, drawing on critical social theory to frame how the research is conducted, how sense is made of the findings, and how the knowledge is employed. Here, there is recognition that research is not a neutral, objective activity that produces a view from nowhere, and that there is an inherent politics pervading the datasets analysed, the research conducted, and the interpretations made (Haraway, 1991; Rose, 1997). As such, it is acknowledged that the researcher possesses a certain positionality (with respect to their knowledge, experience, beliefs, aspirations, etc.), that the research is situated (within disciplinary debates, the funding landscape, wider societal politics, etc.), that the data are reflective of the technique used to generate them and hold certain characteristics (relating to sampling and ontological frames, data cleanliness, completeness, consistency, veracity and fidelity), and that the methods of analysis utilized produce particular effects with respect to the results produced and interpretations made. Moreover, it is recognized that how the research is employed is not ideologically neutral but is framed in subtle and explicit ways by the aspirations and intentions of the researchers and funders/sponsors, and those that translate such research into various forms of policy, instruments, and action. In other words, within such an epistemology the research conducted is reflexive and open with respect to the research process, acknowledging the contingencies and relationalities of the approach employed, thus producing nuanced and contextualized accounts and conclusions. Such an epistemology also does not foreclose complementing situated computational social science with small data studies that provide additional and amplifying insights (Crampton et al., 2012). In short, it is possible to think of new epistemologies that do not dismiss or reject Big Data analytics, but rather employ the methodological approach of data-driven science within a different epistemological framing that enables social scientists to draw valuable insights from Big Data that are situated and reflexive.
Conclusion
There is little doubt that the development of Big Data and new data analytics offers the possibility of reframing the epistemology of science, social science and humanities, and such a reframing is already actively taking place across disciplines. Big Data and new data analytics enable new approaches to data generation and analyses to be implemented that make it possible to ask and answer questions in new ways. Rather than seeking to extract insights from datasets limited by scope, temporality and size, Big Data provides the counter problem of handling and analysing enormous, dynamic, and varied datasets. The solution has been the development of new forms of data management and analytical techniques that rely on machine learning and new modes of visualization.
With respect to the sciences, access to Big Data and new research praxes has led some to proclaim the emergence of a new fourth paradigm, one rooted in data-intensive exploration that challenges the established scientific deductive approach. At present, whilst it is clear that Big Data is a disruptive innovation, presenting the possibility of a new approach to science, the form of this approach is not set, with two potential paths proposed that have divergent epistemologies – empiricism, wherein the data can speak for themselves free of theory, and data-driven science that radically modifies the existing scientific method by blending aspects of abduction, induction and deduction. Given the weaknesses in the empiricist arguments, it seems likely that the data-driven approach will eventually win out and, over time, as Big Data becomes more common and new data analytics are advanced, will present a strong challenge to the established knowledge-driven scientific method. To accompany such a transformation the philosophical underpinnings of data-driven science, with respect to its epistemological tenets, principles and methodology, need to be worked through and debated to provide a robust theoretical framework for the new paradigm.
The situation in the humanities and social sciences is somewhat more complex given the diversity of their philosophical underpinnings, with Big Data and new analytics being unlikely to lead to the establishment of new disciplinary paradigms. Instead, Big Data will enhance the suite of data available for analysis and enable new approaches and techniques, but will not fully replace traditional small data studies. This is partly due to philosophical positions, but also because it is unlikely that suitable Big Data will be produced that can be utilized to answer particular questions, thus necessitating more targeted studies. Nonetheless, as Kitchin (2013) and Ruppert (2013) argue, Big Data presents a number of opportunities for social scientists and humanities scholars, not least of which are massive quantities of very rich social, cultural, economic, political and historical data. It also poses a number of challenges, including a skills deficit for analysing and making sense of such data, and the creation of an epistemological approach that enables post-positivist forms of computational social science. One potential path forward is an epistemology that draws inspiration from critical GIS and radical statistics in which quantitative methods and models are employed within a framework that is reflexive and acknowledges the situatedness, positionality and politics of the social science being conducted, rather than rejecting such an approach out of hand. Such an epistemology also has potential utility in the sciences for recognizing and accounting for the use of abduction and creating a more reflexive data-driven science. As this tentative discussion illustrates, there is an urgent need for wider critical reflection on the epistemological implications of Big Data and data analytics, a task that has barely begun despite the speed of change in the data landscape.