Automating Data Extraction from Chinese Texts
(Principal Investigators: Peter K. Bol, Harvard University, US; Hilde De Weerdt, King's College London, UK)
Abstract: The Automating Data Extraction from Chinese Texts Project aims to provide humanists and social scientists with a means of transforming 2200 years of Chinese texts into structured data. The project will fully develop an open-source platform that allows its users to apply sophisticated text-mining techniques, hitherto the domain of information scientists, to a wide variety of historical and literary texts. Users interested in biographical data, for example, will be able to tag and extract personal names, dates, place names, official titles and postings, kinship ties, and other social relationships. The platform will be tested against 2000 local histories spanning an 800-year period and 19,000 letters and 500 notebooks dating from the seventh through the thirteenth century. Data extracted from the sample repositories will be used to enrich text-mining applications and will also be made available in English and Chinese for research through open-access online databases and data archives.
Cleaning, Organizing, and Uniting Linguistic Databases (the COULD project)
(Principal Investigators: Maria Polinsky, Harvard University, US; Alan Bale, Concordia University, CAN)
Abstract: The COULD project has five goals. (1) It seeks to transfer existing linguistic data from a variety of different formats into a universal format that will allow linguists to combine and share information, not only with other linguists but also with the public at large. (2) The project will build applications that automatically correct errors, draw attention to inconsistencies, and fill gaps in the data. (3) These automated mechanisms will provide new tools to detect patterns that are not obvious when looking at smaller databases. (4) The project seeks to make the vast amounts of linguistic data, currently used only by researchers, available to second language learners by developing search algorithms that facilitate lesson creation. (5) The project will make data collection easier and thus make language preservation and documentation less dependent on experts. Communities trying to revive endangered languages will benefit directly from this project.
Commonplace Cultures: Mining Shared Passages in the 18th Century using Sequence Alignment and Visual Analytics
(Principal Investigators: Robert Morrissey, University of Chicago, US; Min Chen, University of Oxford, UK)
Abstract: Recent scholarship has demonstrated that the various practices associated with Early Modern “commonplacing” -- the extraction and organization of quotations and other passages for later recall and reuse -- were highly effective strategies for dealing with the perceived “information overload” of the period. But the 18th century was also a crucial moment in the modern construction of a new sense of self-identity. Our goal is to examine this paradigm shift in 18th-century culture from the perspective of commonplaces and their textual and historical deployment in the contexts of collecting, reading, writing, classifying, and learning. These practices allowed individuals to master a collective literary culture through the art of commonplacing, a nexus of intertextual activities that we aim to explore through the concerted application of sequence alignment algorithms for shared passage detection and large-scale visual analytics on the largest collection of 18th-century works ever assembled.
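As a rough illustration of what shared passage detection involves, the sketch below finds word sequences that two texts have in common. It is not the project's method: it substitutes Python's standard `difflib.SequenceMatcher` for a dedicated sequence alignment algorithm, and the function name `shared_passages` and the `min_words` threshold are assumptions made for this example only.

```python
from difflib import SequenceMatcher


def shared_passages(text_a, text_b, min_words=5):
    """Return word sequences of at least `min_words` found in both texts.

    Hypothetical helper: difflib matching stands in for a real
    sequence alignment algorithm, and matching is exact (no fuzzy variants).
    """
    # Compare at the word level so that a match is a run of whole words.
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False keeps frequent words from being discarded in long texts.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        # Keep only runs long enough to count as a reused passage.
        if block.size >= min_words:
            passages.append(" ".join(words_a[block.a:block.a + block.size]))
    return passages


# Invented sample sentences, not drawn from any 18th-century corpus.
source = "he wrote that all men are created equal and free in this age"
reuse = "the pamphlet says all men are created equal and free indeed"
print(shared_passages(source, reuse))
```

A production pipeline would also need to tolerate spelling variation and small insertions, which exact matching of this kind does not handle.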
Digging Archaeology Data: Image Search and Markup (DADAISM)
(Principal Investigators: Maarten de Rijke, University of Amsterdam, NL; Helen Petrie, University of York, UK; Mark Eramian, University of Saskatchewan, CAN)
Abstract: Teams from the UK, Canada and the Netherlands will investigate how we can use interactive systems design in conjunction with image processing and text mining techniques to help archaeologists find, organise and analyse the thousands of image and document resources available to them for answering archaeology research questions.
Digging into Linked Parliamentary Data
(Principal Investigators: Maarten Marx, University of Amsterdam, NL; Jane Winters, University of London, UK; Christopher Cochrane, University of Toronto Scarborough, CAN)
Abstract: This project brings together political scientists, historians and computational linguists from Canada, the Netherlands and the UK to enable large-scale analysis of the proceedings of three parliaments, from c.1800 to the present day. This data reflects any event of significance over the past 200 years, and will be enhanced during the course of the project to shed light on developments across different nations, cultures and systems of political representation. The project will deliver a common and extensible format for encoding parliamentary proceedings; a joint, linked dataset covering all three jurisdictions; a range of tools to facilitate the longitudinal study of parliamentary data; and a series of case studies to test and inform the chosen methodology.
Digging into signs: Developing standard annotation practices for cross-linguistic quantitative analysis of sign language data
(Principal Investigators: Onno Crasborn, Radboud University Nijmegen, NL; Kearsy Cormier, University College London, UK)
Abstract: This project will develop cross-linguistic annotation protocols for exploring the content of sign language video datasets. The key progress lies in a) standardised lemmatisation protocols for lexicalised signs, and b) protocols for annotating partly-lexical and non-lexical (including gestural) elements. The project will demonstrate its approach using corpora of British Sign Language (BSL) and Sign Language of the Netherlands (NGT). Linguistic corpora – i.e. large, representative samples of naturalistic language use – are one of the richest types of resource for studying language structure and use. The new annotation protocols and resulting corpora will enable users to dig into the content of the existing video data and to conduct cross-linguistic research with sign language corpora. The project thus goes far beyond the current state of the art in online sign language corpus data, which restricts searches to a few key background details about participants via metadata.
Field Mapping: An Archival Protocol for Social Science Research Findings
(Principal Investigators: Frank Bosco, Virginia Commonwealth University, US; Piers Steel, University of Calgary, CAN)
Abstract: In this project, psychology and management scholars from the United States and Canada will collaborate with an expert in online research and classification methods to devise a web application that will (i) enable the encoding of millions of individual findings in a multidisciplinary social science research domain, (ii) facilitate complex analyses, and (iii) provide open access to members of the scholarly community and the general public. Our project provides protocols for the extraction and classification of research findings into a semantic taxonomy. The foundation of this taxonomy will change how researchers search for and analyze findings from big data. We will develop efficient algorithms to access and analyze research findings. This will lead us to our eventual goal -- a comprehensive repository of findings from social science research that is updated continuously and responds to dynamic queries.
Global Currents: Cultures of Literary Networks, 1050-1900
(Principal Investigators: Elaine Treharne, Stanford University, US; Lambert Schomaker, Groningen University, NL; Andrew Piper, McGill University, CAN)
Abstract: This project undertakes the cross-cultural study of literary networks in a global context, ranging from post-classical Islamic philosophy to the European Enlightenment. Integrating new image-processing techniques with social network analysis, we examine how different cultural epochs are characterized by unique networks of intellectual exchange. Research on "world literature" has become a central area of inquiry today within the humanities, and yet so far data-driven approaches have largely been absent from the field. Our combined approach of visual language processing and network modeling allows us to study the non-western and pre-print textual heritages so far resistant to large-scale data analysis as well as develop a new model of global comparative literature that preserves a sense of the world’s cultural differences.
(Principal Investigators: Adam Badawi, Washington University School of Law, US; Rens Bod, University of Amsterdam, NL)
Abstract: This project takes a radically novel approach to the problem of measuring and visualizing differences among legal systems: it focuses on machine coding of internal references in codes and laws. Internal referencing is an inherent characteristic of codes. The Code of Hammurabi, almost 3800 years ago, was already structured as a numbered list of laws with at least one cross-reference. The intuition behind this approach is that fundamental differences among legal systems manifest themselves in the structure of the texts and can be detected, parameterized, and visualized using computerized algorithms. For instance, the French Civil Code, based on a deductive ideal of legal thought, has fewer internal references than the German Civil Code, which is a century younger and was influenced by the idea that law finds its legitimacy in the history of a country rather than in natural principles, and hence is less organically structured. We will use this procedure to analyze the world’s codes.
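To make the idea of machine coding internal references concrete, the following minimal sketch extracts cross-references from a toy code and computes how many references each article makes on average. Everything here is an assumption for illustration: the citation pattern ("Article N" or "Art. N"), the function names, the density measure, and the three invented articles, which are not taken from any real legal code or from the project's own procedure.

```python
import re

# Assumed citation style: "Article 12" or "Art. 12". Real codes vary widely.
REF_PATTERN = re.compile(r"\b(?:Article|Art\.)\s+(\d+)")


def internal_references(code):
    """Return (source, target) pairs for references between articles.

    `code` is a dict mapping an article number to that article's text.
    """
    edges = []
    for source, text in code.items():
        for match in REF_PATTERN.finditer(text):
            target = int(match.group(1))
            # Count only references to other articles within the same code.
            if target in code and target != source:
                edges.append((source, target))
    return edges


def reference_density(code):
    """Average number of internal references per article (assumed metric)."""
    return len(internal_references(code)) / len(code) if code else 0.0


# Invented toy code for demonstration purposes only.
toy_code = {
    1: "Ownership is the right to enjoy a thing.",
    2: "Subject to Article 1, the owner may sell the thing.",
    3: "Sales under Article 2 and Art. 1 are binding.",
}
print(internal_references(toy_code))
print(reference_density(toy_code))
```

The resulting pairs form a directed graph, so the same output could feed a network visualization or a comparison of densities across codes.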
(Principal Investigators: William Ulate Rodriguez, Missouri Botanical Garden, US; Sophia Ananiadou, University of Manchester, UK; Anatoliy Gruzd, Dalhousie University, CAN)
Abstract: The Mining Biodiversity project aims to transform the Biodiversity Heritage Library into a next-generation social digital library resource to facilitate the study and discussion (via social media integration) of legacy science documents on biodiversity by a worldwide community and to raise awareness of the changes in biodiversity over time in the general public. The project will integrate novel text mining methods, visualisation, crowdsourcing and social media into the BHL to provide a semantic search system.
MIning Relationships Among variables in large datasets from CompLEx systems (MIRACLE)
(Principal Investigators: C. Michael Barton, Arizona State University, US; Tatiana Filatova, University of Twente, NL; Terence P. Dawson, University of Dundee, UK; Dawn Cassandra Parker, University of Waterloo, CAN)
Abstract: Social scientists have used agent-based models (ABMs) to explore the interactions and feedbacks among social agents and their environments. The bottom-up structure of ABMs enables simulation and investigation of complex systems and their emergent behavior with a high level of detail; however, the stochastic nature and potential combinations of parameters of such models create large non-linear multidimensional “big data,” which are difficult to analyze using traditional statistical methods. Our proposed project seeks to address this challenge by developing algorithms and web-based analysis and visualization tools that provide automated means of discovering complex relationships among variables. The tools will enable modelers to easily manage, analyze, visualize, and compare their output data, and will provide stakeholders, policy makers and the general public with intuitive web interfaces to explore, interact with and provide feedback on otherwise difficult-to-understand models.
Project Arclight: Analytics for the Study of 20th Century Media
(Principal Investigators: Eric Hoyt, University of Wisconsin-Madison, US; Charles Acland, Concordia University, CAN)
Resurrecting Early Christian Lives: Digging in Papyri in a Digital Age
(Principal Investigators: Philip Sellew, University of Minnesota, US; Dirk Obbink, Oxford University, UK)
Abstract: Our team proposes to study papyrus documents from Egypt found in trash heaps: scraps giving us rich evidence of human activity in the ancient Mediterranean. They allow us to retrieve lost poetry, new gospels, and everyday writings: letters, contracts, census returns, homilies, recipes. Half a million fragments await study in the Oxyrhynchus collection alone. Building on data from our crowd-sourced transcriptions of this material in Greek, we will study a range of papyri relevant to early Christianity. We will develop a transcription tool for Coptic, the late version of Egyptian used by Christians. We will complete a web-based interface to allow scholars to edit the results of the transcriptions; these tools will allow us to look in detail at complex networks of identity and authority and examine how Christians saw their new religion as part of their other identities (Greek, Egyptian, Roman, merchant, monk). Our tools and our results will be made available to other developers and scholars.
Trees and Tweets: Mining Billions to Understand Human Migration and Regional Linguistic Variation
(Principal Investigators: Diansheng Guo, University of South Carolina, US; Jack Grieve, Aston University, UK)
Abstract: The proposed research aims to analyze contemporary Twitter data for the UK and USA for regional variation in linguistic forms and link the patterns of variation with migration in both countries. Our goal is to understand how linguistic variation is shaped by migration in both the past and present. Two sorts of “big data” will be collected, cleaned, and analyzed for spatial patterns: tweets will be used to document regional linguistic variation, and family trees to describe the large-scale migration patterns that might explain this variation. By analyzing successive tweets by the same individuals, we will also have a record of their mobility, which we will relate to linguistic variation in the tweets.