
List of Data Repositories

Last Updated:  13 April, 2014

Below is a list of digital libraries, data archives, and data repositories that are inviting Digging into Data researchers to use their collections.  For each repository, you'll find a description of their contents, contact information, and other details.

This list is updated frequently, so check back often!  If you are a digital repository and would like to be included on this list, please get in touch with us.


The Archaeology Data Service (ADS)
archaeologydataservice.ac.uk

About: The ADS catalogue holds the digital archives of a huge number of archaeological interventions from the UK and beyond in around 400 collections. These range from the outputs of single excavations to large-scale developer-funded projects encompassing hundreds of individual archaeological interventions. As well as digital archives and fieldwork outputs, the catalogue contains a number of scholarly resources intended specifically as reference sources for further research on topics such as lithics, ceramics and animal bone. The catalogue also contains digitised (or born-digital) versions of various significant journals and series running to many thousands of individual articles. These include scholarly journals such as the Proceedings of the Society of Antiquaries of Scotland, Research Reports from the CBA and around 8000 'grey literature' fieldwork reports. In addition to the catalogue holdings, the ADS provides access to over 1,000,000 aggregated resource discovery metadata records for monument inventories from around the UK (including data from the RCAHMS, English Heritage and numerous local authorities). The ADS also hosts both the Archaeological Records of Europe Networked Access (ARENA2) service, which aggregates monument inventory data from a number of European partners, and the Transatlantic Archaeology Gateway portal, which allows cross-searching with archives held in tDAR at Digital Antiquity, Arizona State University. ARENA is currently being both geographically expanded and technically enhanced.

Contact: If your research team is interested in using the ADS catalogue, please contact the ADS User Services Manager at help@archaeologydataservice.ac.uk or on +44(0)1904323954.

Links: The ADS has an OAI-PMH service and a Z39.50 Target Specification (http://archaeologydataservice.ac.uk/advice/toolsAndServices).

Terms of service: Everything hosted by the ADS is freely available for teaching, learning and research purposes, subject to our Access Agreement.

This agreement and our Copyright and Liability statement are available at the following URL: http://archaeologydataservice.ac.uk/advice/termsOfUseAndAccess

 


ARTstor
www.artstor.org

About:  ARTstor is a digital library of nearly one million images in the areas of art, architecture, the humanities, and social sciences with a set of tools to view, present, and manage images for research and pedagogical purposes.

Contact:  Bill Ying at WWY@artstor.org.

1) ARTstor can provide researchers with access to the ARTstor Library (provided they have signed individual user agreements) through our XML gateway, allowing federated or cross-database searches (for more information on the XML gateway, please see http://www.artstor.org/what-is-artstor/w-html/features-and-tools-metasearch.shtml).

2) ARTstor can also provide special access to a set of large collections to researchers.  For more information, please contact Bill Ying at WWY@artstor.org.


Biodiversity Heritage Library
biodiversitylibrary.org

About: The Biodiversity Heritage Library (BHL) is a consortium of natural history and botanical libraries that cooperate to digitize and make accessible the legacy literature of biodiversity held in their collections and to make that literature available for open access and responsible use as a part of a global “biodiversity commons.” The BHL consortium works with the international taxonomic community, rights holders, and other interested parties to ensure that this biodiversity heritage is made available to a global audience through open access principles. In partnership with the Internet Archive and through local digitization efforts, the BHL has digitized more than 40 million pages of taxonomic literature, making nearly 111,000 volumes available for unrestricted scholarly use.

Contact: William Ulate, BHL Technical Director, william.ulate@mobot.org

APIs:  http://biodivlib.wikispaces.com/Developer+Tools+and+API

The BHL Application Programming Interface (API) is a set of web services that can be invoked via HTTP queries (GET/POST requests) or SOAP. Responses can be received in one of three formats: JSON, XML, or XML wrapped in a SOAP envelope. The documentation for the latest version of the API, v2, can be found at http://www.biodiversitylibrary.org/api2/docs/docs.html. The first version of the API was limited to data related to scientific names found in the BHL collection; version 2 adds access to title, author, volume, and page information. Please note that users are required to obtain an API Key from http://www.biodiversitylibrary.org/getapikey.aspx in order to use version 2 of the API. This is the preferred version of the API.
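As a rough illustration of the HTTP query style described above, the sketch below only builds a version 2 request URL and does not contact BHL. The endpoint path (api2/httpquery.ashx), the operation name NameSearch, and the parameter names op, name, format and apikey are assumptions that should be checked against the v2 documentation; YOUR_API_KEY is a placeholder for the key obtained from BHL.

```python
from urllib.parse import urlencode

# Assumed base URL for BHL API v2 HTTP queries; confirm it in the v2 docs.
BHL_API_BASE = "http://www.biodiversitylibrary.org/api2/httpquery.ashx"


def build_bhl_query(op, api_key, response_format="json", **params):
    """Return a GET URL for one API operation (no request is sent)."""
    if response_format not in ("json", "xml"):
        raise ValueError("response_format must be 'json' or 'xml'")
    # Parameter names here are assumptions, not taken from the text above.
    query = {"op": op, "format": response_format, "apikey": api_key}
    query.update(params)
    return BHL_API_BASE + "?" + urlencode(query)


# Hypothetical example: look up a scientific name.
url = build_bhl_query("NameSearch", "YOUR_API_KEY", name="Poa annua")
print(url)
```

The resulting URL could then be fetched with any HTTP client; keeping URL construction separate from the request makes it easy to log or review queries before sending them.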

OpenURL

BHL’s OpenURL resolver provides an API for finding articles, chapters, and other pages within BHL.  This API can be used to match bibliographic citations in scientific databases to the open access literature in BHL. http://www.biodiversitylibrary.org/openurlhelp.aspx

OAI-PMH

Metadata about the books in the BHL collection is published via OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting). Descriptive metadata is provided in the form of either Dublin Core or MODS. OAI-PMH is a protocol used for publishing and harvesting metadata descriptions of records in an archive. The OAI-PMH endpoint for BHL is http://www.biodiversitylibrary.org/oai
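A minimal sketch of how a paged OAI-PMH harvest is usually driven, assuming the standard ListRecords verb, the oai_dc (Dublin Core) metadata prefix, and resumption tokens as defined by the OAI-PMH 2.0 protocol. The sample response is hand-written for illustration and is not real BHL output; no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
BHL_OAI_ENDPOINT = "http://www.biodiversitylibrary.org/oai"  # from the text above


def list_records_url(endpoint, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords URL; a resumption token replaces the other arguments."""
    if resumption_token:
        args = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return endpoint + "?" + urlencode(args)


def next_token(response_xml):
    """Return the resumption token in a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


# Hand-written sample response (an assumption, not captured from BHL).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url(BHL_OAI_ENDPOINT)
second_url = list_records_url(BHL_OAI_ENDPOINT, resumption_token=next_token(SAMPLE))
print(first_url)
print(second_url)
```

A harvester would repeat the second step until next_token returns None. The same pattern should apply to any other OAI-PMH endpoint, since the verbs and token mechanism belong to the protocol and not to a particular repository.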

Data Exports: A series of files is now available for download that will enable libraries and other data providers to identify digitized titles available within BHL. These files also include metadata about each volume scanned, as well as information about the millions of scientific names that have been identified throughout the BHL corpus and the pages on which those names occur.  Files are available in EndNote XML, BibTeX, and custom tab-delimited text formats at: http://biodivlib.wikispaces.com/Data+Exports

Privacy Policy & Terms of Service: http://biodivlib.wikispaces.com/Terms+of+Service 


The CCCA Canadian Art Database  

www.ccca.ca

Centre for Contemporary Canadian Art, Winnipeg

Gail and Stephen A. Jarislowsky Institute for Studies in Canadian Art, Montreal

About: The Canadian Art Database Project is now permanently housed at Concordia University in Montreal, under the auspices of the Gail and Stephen A. Jarislowsky Institute for Studies in Canadian Art, which is a Research Centre within the Faculty of Fine Arts. Created by the Centre for Contemporary Canadian Art in 1996, the project continues to be a work in progress. It is the prime web resource that focuses on contemporary Canadian art production and the recent history of Canadian art, assembling a growing collection of previously inaccessible or hard-to-find information on Canadian art in all media [images, texts, media works, and related ephemera] from a variety of sources across Canada into a searchable, bilingual database. The ongoing project is documenting some important Canadian art institutions and organizations that have helped shape the Canadian art scene since the 1960s, along with the careers of some of Canada's leading professional artists, designers, art writers and curators. The Canadian Art Database Project currently holds:

  • more than 62,000 images, 800 media clips and 3,000 texts by over 850 prominent Canadian visual, media and performance artists, graphic designers, writers and curators;
  • a searchable Canadian Art Bibliography [currently holding 9,000+ references]; 
  • a searchable Canadian Art Chronology [currently holding 6,000+ references]; 
  • a developing series of Video Portraits profiling artists, graphic designers and art personalities; 
  • several related projects that complement the core archive. 

Contact: Bill Kirby
Director, Centre for Contemporary Canadian Art 
Research Affiliate, Gail and Stephen A. Jarislowsky Institute for Studies in Canadian Art, Concordia University. Telephone: 204.421.7100; E-mail: kirby@ccca.ca

The CCCA Canadian Art Database has become an essential interactive teaching resource about Canadian visual culture in secondary and post-secondary classrooms across Canada and abroad. It is attracting a large and varied international audience, receiving daily averages of some 2,300 visits [60,000+ per month] and 100,000 hits [3 million+ per month]. There are more than 30,000 unique visitors per month from more than 100 countries. The CCCA Database employs a unique ‘artist-empowered’ copyright model in which the copyright on all materials included in the project is retained by the individual creators and authors. Additional materials have all been cleared by the respective copyright holders. The content is housed in the MIMSY Information Management System, which has been specially customized for the CCCA Database. Users are able to freely view and use [but not alter] the material presented on the CCCA website solely for educational and research purposes. We welcome enquiries from any other researchers who might wish to work with the CCCA Canadian Art Database Project.

 

Chronicling America
Library of Congress
National Digital Newspaper Program

www.loc.gov/chroniclingamerica

About:  As of March 2011, Chronicling America provides free and searchable access to more than 3.3 million pages of historic newspapers, published between 1860 and 1922. These newspapers are selected and digitized by NEH awardees through the National Digital Newspaper Program (http://www.neh.gov/projects/ndnp.html), per Library of Congress technical guidelines (see http://www.loc.gov/ndnp/guidelines/). Page-level data presented through Chronicling America include JPEG2000 and PDF images and searchable page text. To date, twenty-two state awardees and the Library of Congress have contributed content to the site from newspapers published in Arizona, California, the District of Columbia, Florida, Hawaii, Illinois, Kansas, Kentucky, Louisiana, Minnesota, Missouri, Montana, Nebraska, New York, Ohio, Oklahoma, Oregon, Pennsylvania, South Carolina, Texas, Utah, Virginia and Washington. Three additional states (New Mexico, Tennessee and Vermont) will be adding content in Spring 2011. The site will continue to expand over time (potentially, twenty years) to eventually include all 54 states and territories with newspapers published between 1836 and 1922.

Contact: For technical support, contact David Brunton (dbrun@loc.gov) or Nathan Yarasavage (nyarasavage@loc.gov), National Digital Newspaper Program, Library of Congress.

Links to APIs and documentation:  The Library makes available the digitized text (created through Optical Character Recognition) of more than three million newspaper pages in the METS/ALTO XML format (see http://www.loc.gov/standards/alto/).  For each page of OCR text, the library includes a permanent link to an image of the page, from which additional metadata can be derived.

The Library provides an OpenSearch API [1], with results returned in HTML, JSON, or Atom, at the researcher's discretion.  From the search results, the Library provides pointers to additional information for each result based upon a URI Template. [2]

[1] http://www.opensearch.org/Home

[2] http://bitworking.org/projects/URI-Templates/spec/draft-gregorio-uritemplate-03.txt
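To make the choice of result format concrete, here is a small sketch that builds a search URL for each of the three formats named above. The host, the path search/pages/results/, and the parameter names andtext and format are assumptions, since the text does not give them; check the Library's API documentation before relying on them. No request is sent.

```python
from urllib.parse import urlencode

# Assumed search endpoint; host, path, and parameter names are not given above.
SEARCH_URL = "http://chroniclingamerica.loc.gov/search/pages/results/"
FORMATS = ("html", "json", "atom")  # result formats named in the text


def build_search_url(terms, result_format="json"):
    """Return a page-search URL that asks for one of the supported formats."""
    if result_format not in FORMATS:
        raise ValueError(f"unsupported format: {result_format}")
    return SEARCH_URL + "?" + urlencode({"andtext": terms, "format": result_format})


for fmt in FORMATS:
    print(build_search_url("suffrage parade", fmt))
```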

Special terms of service:

Data provided by the Library is for the sole use of the awardee in support of research as described in the Digging into Data proposal and should not be re-used or re-distributed for any other purpose without permission.

The Library reserves the right to block IP addresses that fail to honor the Library's robots.txt files or submit requests at a rate that negatively impacts service delivery to all Library patrons. Current guidelines recommend that software programs submit a total of no more than 10 requests per minute to Library applications, regardless of the number of machines used to submit requests. The Library also reserves the right to terminate programs that require more than 24 hours to complete. (See http://www.loc.gov/homepage/legal.html for more information).
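One way a harvesting script might stay within the guideline of 10 requests per minute is a sliding-window throttle, sketched below. The class name RequestThrottle and its parameters are hypothetical, and this is only one possible approach. It limits a single process; the guideline counts requests across all machines, so a multi-machine job would need shared coordination that this sketch does not provide.

```python
import time


class RequestThrottle:
    """Allow at most `max_requests` calls in any sliding `window` of seconds."""

    def __init__(self, max_requests=10, window=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock    # injectable so the throttle can be tested
        self.sleep = sleep
        self.sent = []        # timestamps of requests still inside the window

    def wait(self):
        """Block until one more request is allowed, then record it."""
        while True:
            now = self.clock()
            # Forget requests that have aged out of the window.
            self.sent = [t for t in self.sent if now - t < self.window]
            if len(self.sent) < self.max_requests:
                break
            # Sleep until the oldest recorded request leaves the window.
            self.sleep(self.window - (now - self.sent[0]))
        self.sent.append(now)


throttle = RequestThrottle()  # defaults follow the 10-per-minute guideline
throttle.wait()               # call once before each request
```

Calling wait() immediately before every request keeps the robots.txt check and the request logic elsewhere in the script unchanged.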


Connecting Repositories (CORE)
 
About. As of January 2013, the COnnecting REpositories (CORE) system, hosted at The Open University, provides access to millions of open access research articles aggregated from more than 250 repositories worldwide. The collection contains about 10 million metadata records and more than 1 million full-text open access articles. The CORE system continuously updates the aggregated information, synchronising the metadata and content held in the provider repositories and including new repositories. The content of the collection can be accessed via the CORE API.
 
API access. The CORE API (http://core.kmi.open.ac.uk/api/doc) allows searching the collection, downloading full-text content (in plain text or PDF), finding related documents, accessing citation information and retrieving content statistics. The API is free to use and requires only an API key, which is provided by the CORE team. There are two versions of the API: a RESTful API (typically the choice for most applications), providing data in XML or JSON, and a Linked Open Data SPARQL endpoint (which can be practical for systems that need to connect information from multiple sources).
 
Contact. For more information, please check the CORE website (http://core.kmi.open.ac.uk), see the information about the CORE family of projects (CORE, ServiceCORE and DiggiCORE), or contact the CORE team (Petr Knoth, petr.knoth@open.ac.uk).

 


Data-PASS
http://www.icpsr.umich.edu/DATAPASS/

About: The Data Preservation Alliance for the Social Sciences (Data-PASS) is a partnership of social science data archives devoted to identifying, acquiring and preserving primary source data at risk of being lost to researchers. Examples of at-risk data include opinion polls, voting records, large-scale surveys on family growth and income, and many other social science studies.

Contact: If your research team is interested in using the Data-PASS shared catalog, please see http://dvn.iq.harvard.edu/dvn/dv/datapass. For general questions, please contact data-pass@icpsr.umich.edu. For technical support, please contact Dr. Micah Altman (Micah_Altman@harvard.edu).

Data collections preserved by the Data-PASS partnership are described on the Data-PASS shared catalog (http://dvn.iq.harvard.edu/dvn/dv/datapass) in descriptive metadata records. Some of the descriptions link directly to data files and technical documentation. Many Data-PASS data holdings are freely available to anyone, while some content is restricted. 

Data-PASS is led by the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan; the Roper Center for Public Opinion Research at the University of Connecticut; the Howard W. Odum Institute at the University of North Carolina-Chapel Hill; the Henry A. Murray Research Archive, a member of the Institute for Quantitative Social Science at Harvard University; the custodial electronic records division of the National Archives and Records Administration; and the Harvard-MIT Data Center, also a member of the Institute for Quantitative Social Science at Harvard University.
 


The Digital Archaeological Record (tDAR)
www.tdar.org

About:  The Digital Archaeological Record (tDAR) is an international digital repository for the digital records of archaeological investigations. The tDAR repository encompasses digital data, documents, and images derived from ongoing archaeological research, as well as legacy data derived from more than a century of archaeological research. The repository provides for long-term preservation and easy discovery and access to databases, spreadsheets, documents, and images in a variety of common formats. tDAR operates under the organizational umbrella of Digital Antiquity, a multi-institutional organization that has been explicitly designed to ensure the long-term financial, technical, and social sustainability of the tDAR cyberinfrastructure.  

Contact: comments@tdar.org

Links: http://www.tdar.org/  http://www.digitalantiquity.org

Accession Policy:  Digital Antiquity will accept a wide range of digital archaeological data for the tDAR repository. Data and documents contributed to the tDAR repository may include digital files related to a wide range of archaeological investigations and topics, e.g., archives and collections, field studies of various scales and intensities, and historical, methodological, synthetic, or theoretical studies. For detailed information on what can be contributed to tDAR and how to contribute, click here.

Terms of Use:  Knowledge gained through the efforts of many researchers is shared through tDAR in order to encourage and facilitate archaeological and related research.  Read the tDAR Terms of Use Agreement here.

 


Digital Library for Earth System Education (DLESE)
www.dlese.org

About: Since 1999, the Digital Library for Earth System Education (DLESE) has provided searchable access to high-quality, online educational resources for K-12 and undergraduate Earth system science education.   These resources include maps, lesson plans, lab exercises, data sets, virtual field trips, and interactive demonstrations. The holdings of DLESE are created by a wide variety of individual faculty members, federal and state agencies, and cultural institutions.  These resources are held (stored) on local servers and are accessed through the library via searchable metadata records in the ADN format, which extends the IEEE-LOM format to include rich educational metadata, such as educational standards, as well as geospatial and temporal descriptions. Additionally, the library contains user-contributed content in the form of teaching tips and resource reviews. 

DLESE resources and collections can be accessed by the Digital Discovery System (DDS) services and APIs, which can be used to search over DLESE content or flexibly configured to search over any XML schema structure, including user-contributed content. The APIs come in two flavors: a RESTful web service and a JavaScript API. A range of information retrieval features are available including textual and field-based searches such as audience, subject, resource type or educational standard. DDS also supports geospatial search and can be integrated with Web 2.0 applications such as Google Maps. The DLESE repository metadata can also be harvested using OAI-PMH. DLESE collections have been used for a rich variety of computational linguistics, natural language processing, and machine learning research and we welcome the opportunity to work with Digging Into Data researchers to further extend DLESE’s utility as a research test bed. 

Contact: John Weatherley (jweather@ucar.edu) for information on the DDS API. Tamara Sumner (sumner@colorado.edu) for recent publications using DLESE as a research testbed. You can also reach the team by emailing support@dlese.org

Links / Services:  

Digital Discovery System (DDS) Web Services and APIs are described at:

http://www.dlese.org/dds/services/index.jsp

OAI Data Provider:

http://www.dlese.org/dds/services/oaiDataProvider/index.jsp

Open Source Software:

http://www.dlese.org/dds/dds_overview.jsp

Terms of Use/Service:  http://www.dlese.org/documents/policy/terms_use_full.php 


Early Canadiana Online
www.canadiana.ca

About: Early Canadiana Online (ECO) is a digital library providing access to over 4 million pages of Canada's printed heritage. It features works published from the time of the first European settlers up to the early 20th Century.

Contact: William Wueppelmann, Manager, Information Systems, william.wueppelmann@canadiana.ca

613-235-2628 ext. 226

1) Pilot Project: Approximately 500,000 pages from a variety of subjects, including Canadian women’s history, English Canadian literature, the history of French Canada, and native studies.

2) Early Governors General of Canada: 60 publications focusing on the early Governors General of Canada.

3) Hudson’s Bay Company publications: 160 titles

4) Jesuit Relations: 73-volume set of books in the original Italian, Latin and French, with commentary and translation into English.

5) Official publications: 1.5 million pages of pre-1900 Canadian official publications, including journals of legislative assemblies of the various colonies of pre-Confederation Canada, Journal of the House of Commons, Debates of the House of Commons, Debates of the Senate, Reconstituted Debates of Canada, Sessional papers, Journals of the Senate, Statutes of Canada, Statutes of the pre-Confederation colonial legislative councils, and debates of the pre-Confederation colonial houses of assembly.

6) Periodicals: Over 2 million pages of pre-1920 Canadian periodicals. 


FLOSSmole

http://flossmole.org 

Data about free, libre and open source software (FLOSS) development

About: Since 2004, FLOSSmole has collected and redistributed public data about free, libre, and open source software (FLOSS) projects in multiple formats for anyone to download. We also integrate donated data from other research teams and provide a community for researchers to discuss public data about FLOSS development. FLOSSmole contains multiple terabytes of data collected from dozens of online code forges from 2004 to the present. This includes data about more than a million different open source projects and their developers. The web site includes some of the sample graphical visualizations that have been made with the data. The web site also includes database schema models, a bug database, and a wiki of changes to the database and bugs as they have been fixed.

Data Exports: All data is available on our Google Code downloads page (over 2000 data files) located at http://code.google.com/p/flossmole/downloads/list. We also offer direct, live database access (details here: http://flossmole.org/content/direct-db-access-flossmole-collection-available)

Contact: Megan Squire (msquire@elon.edu), Elon University. We also have a user and developer mailing list (located here: http://sourceforge.net/mail/?group_id=119453) and a blog (located at http://flossmole.org). To see how FLOSS data has been used in the literature from a variety of disciplines, we host over 1200 papers on FLOSShub in our bibliography database (http://flosshub.org/biblio).

Citation: Howison, J., Conklin, M., & Crowston, K. (2006). FLOSSmole: A collaborative repository for FLOSS research data and analyses. International Journal of Information Technology and Web Engineering, 1(3), 17–26.


English Broadside Ballad Archive (EBBA) 
ebba.english.ucsb.edu

About: EBBA mounts online surviving but difficult-to-access early ballads printed in English, with priority given to black-letter broadsides of the seventeenth century, the heyday of the printed broadside ballad. The database currently holds over 1,800 ballads from the Samuel Pepys collection at Magdalene College, Cambridge, and over 1,500 ballads from the Roxburghe collection at the British Library, London, and we are in the process of adding the early ballads from the University of Glasgow and the Huntington libraries. EBBA makes these ballads fully accessible as texts, art, music, and cultural records of the period. We provide online images of each ballad in high-quality facsimiles as well as "facsimile transcriptions" (which preserve the original ballad ornament while transcribing the black-letter font into easily readable white-letter or roman print). In addition, we provide sung versions of the ballads, background essays that culturally place the ballads, TEI/XML encoding of the ballads, and search functions that allow readers to easily find ballads as well as their constituent parts and makers.

Contact: Patricia Fumerton, Director of the English Broadside Ballad Archive, Department of English, University of California, Santa Barbara, CA. 93106.  FAX: 805-893-4622.  Email: pfumer@english.ucsb.edu  


Great War Primary Documents Archive
www.gwpda.org

About: The Great War Primary Documents Archive is dedicated to the collection, preservation, and development in electronic form of materials relating to the First World War. It is a resource for scholars and students, and is a perpetual memorial to the heroism and sacrifice of those who participated in the war. Hosted since 1995, first at the University of Kansas, then at BYU, and now on private servers, it is the first and largest online full-text international collection of documents and images related to the Great War period, 1880-1926. The Archive provides free and universal public access to the full-text records of the history of World War I and the twentieth century's attempts to deal with the conflict, the collapse of national and international political agreement into active warfare, and the post-war effort to create a world without war. Thus far, the site has been hit more than 15,500,000 times, and innumerable students, scholars and researchers have examined, analyzed and incorporated these documents into their work. The Great War Primary Documents Archive now holds some 15,000 fully searchable pages of these significant official and public documents. We are the Web's non-partisan source of this material, a primary resource for isolated and under-funded repositories worldwide, and provide these documents and historical information without charge.

Contact: AJ Plotke, Executive Director, GWPDA.  Telephone: 602.297.1914; E-mail: cd078@gwpda.org

GWPDA is entirely accessible online, using any browser or system and does not require any special access permissions. All documents and images are either ex-copyright or their copyright is held by GWPDA. It may be easier to download material from the site through the administration’s back channel - please contact us before starting a complete document harvest so that we can determine the most efficient method of transferring data. 

 


Harvard Time Series Center (TSC)
timemachine.iic.harvard.edu/search/

About: Harvard Time Series Center (TSC) is an interdisciplinary effort dedicated to creating the world's largest data center for time series and to developing algorithms to understand and analyze various aspects of these time series. The partnership of the data center and the analysis effort makes both discoveries of new and rare phenomena, and large scale studies of known phenomena possible.

The TSC hosts close to 1 billion time series, mainly from the field of astronomy but expanding to economics, health data, real estate data, etc.

Each time series typically consists of 100-100,000 measurements, making the total number of measurements greater than a trillion! We have both time series that go back 100 years with measurements every few days and time series that were taken at 200Hz for short periods of time.

Our collection represents one of the largest and most interesting datasets in the world for time series and gives analysts a unique opportunity to test their algorithms at large scales.

This is an unprecedented opportunity to be part of the development of computational algorithms, to make scientific discoveries across multiple fields, and to answer some of the most fundamental questions.

For each time series  in the database, the TSC maintains a list of links to relevant resources, including:

  • Links to the original images via URLs.
  • A list of metadata about the object (position on the sky, time of observation, wavelength of observation, etc.)
  • A list of provenance information regarding the process between the original images and time series.

The TSC also maintains a full set of web services that can be accessed from any programming language, such as Python or Perl.  Using those web services, one can query the database and retrieve a subset of the dataset. These capabilities have been used to create a web interface (http://timemachine.iic.harvard.edu/search/) for the astronomy data sets.

Contact: If your research team is interested in using the content maintained by our project, please contact the TSC lead investigator, Pavlos Protopapas at pprotopapas@cfa.harvard.edu

Links: The TSC supports export of its metadata records via a RESTful interface and a highly structured JSON format.

Terms of service: Access to the TSC is freely available to the general public for personal use.  The relevant terms of use are detailed in the document at http://timemachine.iic.harvard.edu/tos/


HathiTrust
www.hathitrust.org

About: HathiTrust is an international partnership of more than fifty research institutions and libraries that are working together to ensure the long-term preservation and accessibility of the cultural record. The partnership launched a digital repository in 2008 that currently contains more than 8 million volumes, digitized from the partnering library collections. More than 2 million of these volumes are in the public domain and freely viewable on the Web.

Texts of approximately 120,000 public domain volumes in HathiTrust are available immediately to interested researchers. Up to 2 million more may be available through an agreement with Google that must be signed by an institutional sponsor. More information about obtaining the texts, including the agreement with Google, is available at http://www.hathitrust.org/datasets.

Contact: Please contact hathitrust-datasets@umich.edu with any questions.

Terms of Service: In most cases, because of constraints placed on materials that have been deposited, texts from HathiTrust may not be redistributed, rehosted, or used commercially. While HathiTrust has verified all texts it makes available to be in the public domain, researchers must agree to delete data from their researcher sets if it is determined at a later time to be in copyright.

 

The History Data Service (HDS)
hds.essex.ac.uk

About: The HDS collection, which is part of the UK Data Archive (UKDA), brings together over 650 separate data collections transcribed, scanned or compiled from historical sources. The studies cover a wide range of topics from the seventh century to the twentieth century. Although the primary focus of the collection is on the United Kingdom, it also includes a significant body of cross-national and international data collections. In addition, the HDS also enriches and enhances selected data collections by developing thematic special collections where there is a critical mass of related data collections. Current special collections include: Census enumerators’ books, including the entire 1881 Census for England, Wales and Scotland; Poll Books, including the Westminster Historical Database, 1749-1820; British and Irish nineteenth and twentieth century statistics, including histpop – the online historical populations reports website; wage and price time series including the European State Finance Database; local history including the Digital Library of Historical Directories, 1750-1919; and prosopography including the COEL Database: Continental Origins of English Landowners, 1066-1166.

Contact: If your research team is interested in using the HDS collection please contact Richard Deswarte, the head of HDS, at richardd@essex.ac.uk or +44(0)1206873226.

Links: The HDS does not have any specific APIs, although general advice to data creators is available at: http://hds.essex.ac.uk/history/create/create-advice.asp

Terms of service: Most of the collection is available to any user free of charge, upon registration with the UKDA, for the purposes of not-for-profit research. However, some datasets may have restrictions on access. For example, commercial usage may be restricted or permission for usage may be required from the depositor. Details are available in the ‘Access’ section of each HDS catalogue record.  General information on accessing data is available at: http://www.data-archive.ac.uk/aandp/access/login.asp#terms

 


Infochimps.org
infochimps.org


About: Infochimps.org is a website where anyone can upload a dataset, adding it to a large library with good metadata tagging and descriptions. Much of the data will be free and come with an open license.
 
Infochimps.org has thousands of datasets, including the Freebase data dump, Wikipedia extractions, stock data, and all sorts of text corpora.
 
Contact: help@infochimps.org 


Internet Archive
www.archive.org

About:  The Internet Archive is a 501(c)(3) non-profit that was founded to build an Internet library, with the purpose of offering permanent access for researchers, historians, and scholars to historical collections that exist in digital format. Founded in 1996 and located in the Presidio of San Francisco, the Archive has been receiving data donations from Alexa Internet and others. In late 1999, the organization started to grow to include more well-rounded collections. Now the Internet Archive includes texts, audio, moving images, and software as well as archived web pages in our collections.

Contact: For technical questions (e.g., bulk downloads), please contact Hank Bromley or Alexis Rossi. For questions about usage rights, please contact info@archive.org.

IA Resources of Particular Interest to the DID Challenge:

1) The Prelinger Film Archives

Public Collection of 2,139 films
 

This free public archive is a subset of the approximately 60,000 item collection of ephemeral films assembled by Rick Prelinger, a filmmaker and historian. 

Ephemeral films are defined as advertising, educational, industrial, and amateur films including the famous “Duck and Cover” nuclear safety education film and other extraordinary items of social, artistic and historical significance.  As a whole, the Prelinger collection currently contains over 10% of the total production of ephemeral films between 1927 and 1987, and it may be the most complete and varied collection in existence of films from these poorly preserved genres.
 

A tag cloud of the public collection to which the Archive will provide researcher access is viewable at the following URL: http://www.archive.org/browse.php?field=/metadata/subject&collection=prelinger&view=cloud
 

The Archive believes that this collection may be of interest to sociologists, filmmakers, scholars of marketing and advertising and of course, historians for study of themes in public opinion, politics and popular culture.  Moreover, it offers a substantial set of digital video imagery that can serve as a testbed for tools development in video search, face recognition and other image-based technologies.
 

Rights:  This collection is being made available for study and reuse under a Creative Commons License.  Details of the rights granted for this collection are available at the following URL:  http://www.archive.org/details/prelinger
 

2) Canadian Libraries Collection

161,732 Digitized Books

This collection contains digitized books, the vast majority of which were contributed by the University of Toronto Libraries.
 

Major themes include Canadian history/regional history, medical history, religion, military history, Greek Classics (Tufts University/Perseus Digital Library), government documents and certain special collections (e.g., the Cardinal Newman collection.)  The collection can be viewed and searched at the following URL: http://www.archive.org/details/toronto.
 

The Archive believes this collection may be of interest to scholars of history and sociology among other disciplines, as well as tool development for OCR, multi-lingual translation, name and place recognition and natural language processing.
 

Both collections are fully searchable using text boxes on the home pages. 


Inter-university Consortium for Political and Social Research
www.icpsr.umich.edu

About: The Inter-university Consortium for Political and Social Research (ICPSR), the world's largest repository of digital social science data, provides leadership and training in data access, curation, and methods of analysis for a diverse and expanding social science research community.

Contact: If your research team is interested in using the ICPSR collection and/or its metadata records, please see this Web page for more information: www.icpsr.umich.edu/DID.  For technical support, please contact ICPSR at this e-mail address: netmail@icpsr.umich.edu.

The ICPSR repository spans the behavioral and social sciences and includes data on sociology, political science, demography, economics, history, criminal justice, gerontology, public health, education, substance abuse, international relations, and much more. ICPSR's Summer Program in Quantitative Methods is internationally recognized as the premier program for training in the methodology of social science research.

ICPSR's membership includes over 650 educational and research institutions around the world. ICPSR member institutions pay annual dues that entitle faculty, staff, and students to the full range of data resources and services provided by ICPSR, but many of the ICPSR data holdings are freely available to anyone.

ICPSR content takes the form of numeric data files and associated PDF technical documentation (over 500,000 discrete files comprising over 7000 studies) that may be analyzed using statistical software packages or, in some cases, online analysis software. Each study in the holdings has a corresponding descriptive metadata record.

Data in the repository are linked to related publications in the research literature via a Bibliography of over 45,000 entries. Similarly, citations in the Bibliography are linked to the data that generated the research findings. Full text is available for many publications in the Bibliography. ICPSR content provides a rich resource for data mining.

 


JISC MediaHub

http://jiscmediahub.ac.uk

About:

JISC MediaHub, part of the JISC eCollections service, provides a single point of access to three major multimedia archives purchased on behalf of members. It enables cross searching and exploration of over 3,500 hours of film and 50,000 images from the following archives:

NewsFilm Online

Over 3,000 hours of digitised news stories from the ITN/Reuters archives, comprising some 60,000 stories. The sources include the complete Gaumont and Paramount newsreels, from 1910 and 1934 respectively. Many ITN broadcasts also include scripts and rushes, enabling comparison of the raw material and the edited footage that was broadcast.

Film & Sound Online

Over 2,000 items, comprising 17 separate collections of digitised film and sound, including Imperial War Museum film footage, the Royal Mail Film collection, and scientific content from the Wellcome Library and the Biochemical Society. The archive also includes over 50 hours of classical music recordings from the Culverhouse collection.

Digital Images for Education

The result of a £2.75 million procurement during 2009-10, this archive comprises over 56,000 images and 600 hours of film selected by the education community to capture local, UK and world events during the last 25 years. Sources include the AP Archive, Getty Images, ITN, Design Council Archives, Imperial War Museum, Royal Geographical Society, Fitzwilliam Museum and PYMCA.

The service is in continuous development until July 2011 and users can search a growing number of third-party collections such as the British Library Archival Sound Recordings, VADS, ARKive and Scran.

JISC MediaHub is developed and hosted by EDINA. The service can be freely searched by anyone, but some of the collections are available only by subscription to UK Further and Higher Education institutions.

JISC eCollections also comprises:

JISC Historic Books

The full text or page images of over 300,000 books published in England before 1800 and, uniquely and never before available online, over 65,000 19th-century books from the British Library.

JISC Journal Archives

More than 3.75 million articles from the archives of over 450 journals of major publishers and societies – Brill, Institution of Civil Engineers, Institute of Physics, ProQuest, Oxford University Press, The Royal Society of Chemistry

Contact: Technical support information: Research teams can contact edina@ed.ac.uk for questions regarding access to the data

Terms of Service: http://jiscmediahub.ac.uk/terms


JSTOR

www.jstor.org
 

About: With participation and support from the international scholarly community, JSTOR has created a high-quality, interdisciplinary archive of scholarship, is actively preserving over one thousand academic journals in both digital and print formats, and continues to greatly expand access to scholarly works and other materials needed for research and teaching globally. We are investing in new initiatives to increase the productivity of researchers and to facilitate new forms of scholarship.

Contact: http://www.jstor.org/action/showContactSupportForm

We are pleased to confirm JSTOR's readiness to participate in the 'Digging into Data Initiative'. JSTOR is a scholarly archive of the full runs of approximately 1000 leading academic journals and covers approximately fifty disciplines, with a strong presence in the humanities, social and field sciences, business and economics.  A full list of the included journals is available at http://www.jstor.org/action/showJournals?browseType=titleInfoPage

JSTOR is prepared to provide access at two levels.

1.1 Potential participants in Digging into Data can apply to JSTOR for an account to our “Data for Research” service.  This allows users to create datasets of word frequency against article for any subset of the articles in the JSTOR archive.  An open, but size-limited service will be generally available from mid-January 2009 at http://dfr.jstor.org.  The Digging into Data accounts will remove the limits on the size of the datasets.

1.2 Any accepted participant in the Digging into Data Program can gain access to the full text of the JSTOR collections as XML data in a (slightly extended) NLM format. The dataset will include OAI-ORE resource maps.  The full text includes OCR’d text of the articles, and bibliographic metadata.
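
As a rough illustration of what a "word frequency against article" dataset (1.1) might look like, the Python sketch below counts words per article. The article identifiers, the sample texts, and the word_frequencies helper are all hypothetical; they are not part of the Data for Research service, whose actual output format is not described here.

```python
import re
from collections import Counter


def word_frequencies(articles):
    """Map each article id to a Counter of lowercase word counts.

    `articles` is a dict of {article_id: plain_text}. Both the ids and
    the texts are made-up stand-ins for OCR'd article text.
    """
    token = re.compile(r"[a-z]+")
    return {
        article_id: Counter(token.findall(text.lower()))
        for article_id, text in articles.items()
    }


# Hypothetical sample data, for illustration only.
sample = {
    "article-1": "Data mining of journal data.",
    "article-2": "Journals and the humanities.",
}

freqs = word_frequencies(sample)
for article_id, counts in sorted(freqs.items()):
    # Show the two most frequent words for each article.
    print(article_id, counts.most_common(2))
```

A dictionary of counters keeps the result sparse: only words that actually occur in an article are stored for it, which matters once the subset of articles grows large.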

Notes:

1. The data referred to in (1.2) above will be a standard corpus and will be distributed “as-is”.  JSTOR will not filter, sort or in any way process it to individual participants' requirements.  Participants are expected to be competent in the processing of XML data and no technical support is offered by JSTOR in the filtering or processing of the XML.  The agreement will be for a limited time, after which the data should be destroyed or returned.

2. Samples of the data in (1.2) and the XML schema will be available on request to potential participants for the purpose of assessing the suitability of the full collection for their proposal.  Final participants should expect to provide media such as USB disk drives for the delivery of the data.

3. All participants will be required to sign a standard license and non-disclosure agreement for the use of the data referred to in (1.2) above.

4. Any potential participant must accept a “click-through” limited use agreement for the service and data mentioned in (1.1) above, and must provide contact details and a bona fide email address at the participating institution.

5. Participants should allow sufficient time for the processing of the agreement mentioned in 1.2.  We have found that 6-8 weeks is typical for legal departments of academic institutions to process such agreements.


Koninklijke Bibliotheek

About: The Koninklijke Bibliotheek (KB) is the national library of the Netherlands: we bring people and information together. We offer access to everything published in and about the Netherlands, play a central role in the (scientific) information infrastructure of the Netherlands and promote permanent access to digital information nationally and internationally. We digitise all the books, periodicals and newspapers that have been published in the Netherlands (http://www.delpher.nl/). Most of the resulting data is available for researchers via an API.

 
 
Terms of Service: Most datasets are available for academic research purposes. Please contact us for more detail. 

The Legacy Tobacco Documents Library
University of California, San Francisco
legacy.library.ucsf.edu

About:  The Legacy Tobacco Documents Library (LTDL) contains more than 14 million previously-secret documents (80+ million pages) created by major tobacco companies related to their advertising, manufacturing, marketing, sales, and scientific research activities.  This free and publicly accessible site contains:

• Documents from major U.S. tobacco companies and their public relations organizations (Philip Morris, RJ Reynolds, Lorillard, the Tobacco Institute…)
• Documents from international tobacco companies such as British American Tobacco (BAT) and Gallaher
• Complete transcripts and depositions from US tobacco litigation as well as Canadian tobacco trials
• Thousands of videos and advertising images in our Multimedia and Pollay Collections
(See http://legacy.library.ucsf.edu/about/about_collections.jsp for a complete list)

Each document has a detailed index record and is full-text searchable. 

The documents in LTDL range in date from the late 19th century up through the present, with the bulk of the collections dated 1950 through 2003.  Similar to any large corporation’s archives, the tobacco company collections include emails, memos, letters, scientific reports, meeting minutes and administrative documents.  LTDL has been used extensively by researchers and journalists to investigate the activities of the tobacco industry; to date over 800 papers, reports, book chapters and articles have been written using tobacco documents as a primary source.  See http://www.library.ucsf.edu/tobacco/docsbiblio for a full listing of all publications.

Contact: Kim Klausner, Manager, Industry Documents Digital Libraries (kim.klausner@ucsf.edu) or Rachel Taketa, Library Specialist, Industry Documents Digital Libraries (rachel.taketa@ucsf.edu).

Links to APIs and documentation:  The Library makes available the metadata records for every document through an XML API.  Please contact the Library for more information and instructions.

Terms of service:  http://legacy.library.ucsf.edu/legal.jsp


Marriott Library
University of Utah
www.lib.utah.edu/portal/site/marriottlibrary/

About: The J. Willard Marriott Library at the University of Utah hosts more than 160 outstanding digital collections, containing 420,000 digital photographs, maps, books, audio recordings, and other items.  They can all be viewed by clicking on the “Digital Collections” link on the Library’s main web page: http://www.lib.utah.edu/portal/site/marriottlibrary/

Contact: John Herbert, Head - Digital Technologies, john.herbert@utah.edu

Among the many fine collections we have are:

  • The Harmonia Macrocosmica, by Andreas Cellarius, printed in 1661, is an atlas of the heavens as seen by the astronomers of the time: Copernicus, Ptolemy, Brahe, and Aratus. Our collection has 30 hand-painted color plates plus an accompanying text in Latin.  
  • Western Soundscape Archive – thousands of animal and other natural sounds from the western U.S.  
  • Karl Bodmer created these 165 aquatints during the 1832-1834 expedition by Prince Maximilian zu Wied through the American west. Since then, these watercolors have been a major historical resource for Plains Indian culture. They were instrumental in creating the romantic perceptions of these peoples, which endure to this day in art, film, and literature.
  • Dard Hunter Books – his writings on papermaking, presented in elegant hand-made books.
  • Sanborn Fire Insurance Maps - large-scale, detailed maps from 1867-1969 depicting the commercial, industrial, and residential sections of Utah cities.
  • Arabic Papyrus – the world’s third-largest such collection, with over 1,500 Arabic documents on papyrus and paper.
  • Aztec Codices – 4 Mesoamerican manuscripts describing wars, famine, pestilence, religious events, and other elements of ancient Mesoamerican culture.

We host nationally renowned collaborations such as the Utah Digital Newspapers (http://digitalnewspapers.org), the Mountain West Digital Library (http://mwdl.org), and the Western Waters Digital Library (http://westernwaters.org).  

We partner with several other UofU and State institutions, including the Spencer S. Eccles Health Sciences Library, the S. J. Quinney College of Law Library, the Utah State Historical Society, and the Utah State Library. We also work closely with many public libraries across Utah, such as the Uintah County Library, the Park City Library and Historical Society, the Delta City Library, the Topaz Library, to name only a few.

UTAH DIGITAL NEWSPAPERS

The J. Willard Marriott Library at the University of Utah has launched a pioneering program that is changing the face of newspaper research. Our program, the Utah Digital Newspapers (UDN), makes historic Utah newspapers available to the general public over the Internet. We create a database of digital images and searchable text from old newspapers and make it accessible from our website. The result is that these newspapers can be searched by keyword, title and date from the comfort of a PC. For anyone interested in their family history, Utah or national history, it's a marvelous, easy-to-use improvement over reading microfilm.   

Since its inception in 2002, UDN quickly became a leader in newspaper digitization within the public sector and a model for other academic libraries across the country. Our success has led the National Endowment for the Humanities and the Library of Congress to launch a national digital newspapers program, in which the Marriott Library and 20+ other institutions participate.

As of early 2011, we have digitized over 1 million pages from more than sixty newspapers, covering 27 of the 29 Utah counties. All this is available on our website: http://digitalnewspapers.org.

Our collection includes the first issue of the Deseret News in 1850, which is the first newspaper issue of any kind published in the Utah Territory. We have the first issue of the Salt Lake Tribune in 1871 and the early years of the Salt Lake Herald. Among other papers are the Topaz Times, the newsletter from the Japanese internment camp during World War II, and the Broad Ax, an early African American paper from Salt Lake City.

 


NASA ADS
Smithsonian/NASA Astrophysics Data System (ADS)
ads.harvard.edu

About: The Smithsonian/NASA Astrophysics Data System (ADS) is a Digital Library portal for researchers in Astronomy and Physics, currently being developed by the Smithsonian Astrophysical Observatory under a NASA grant. The ADS maintains three bibliographic databases containing more than 7.5 million bibliographic records covering the scholarly literature in Astronomy and Physics.

For each bibliographic record in its database, the ADS maintains a list of links to relevant resources, including:


    * Location of the fulltext article via DOI and/or OpenURL links
    * List of works referenced in the original article (references), given in the form of either a list of ADS records or as a link to the publisher’s reference list
    * List of works that cite the record in question (citations or “forward links”), given in the form of a list of records that ADS was able to successfully identify
    * Readership statistics and co-readership-based user recommendations

The ADS also maintains an archive of the historical content of all the astronomical publications.  The contents of the archive consist of 500,000 articles, corresponding to over 3.5 million scanned pages, which have been OCRed and which are searchable through the ADS fulltext query search form.

Contact: If your research team is interested in using the content maintained by our project, please contact the ADS project manager, Alberto Accomazzi at aaccomazzi@cfa.harvard.edu

Links: The ADS supports export of its metadata records via a RESTful interface and a highly structured XML format (see http://doc.adsabs.harvard.edu/abs_doc/help_pages/linking.html). Fulltext content can be provided to collaborators via a data dump.

Terms of service: access to the ADS is freely available to the general public for personal use.  The relevant terms of use are detailed in the document http://doc.adsabs.harvard.edu/abs_doc/help_pages/overview.html#use

 


National Archives, London
nationalarchives.gov.uk

About: The National Archives has a range of different data sources and is keen to support initiatives that can broaden access to its data. We would be happy to discuss any project further.

Contact: Dr David Thomas, Director of Technology and Chief Information Officer, david.thomas@nationalarchives.gov.uk

Description: Much of the National Archives' digitised material is available via its catalogue, but discussion will be required to allow for in-depth analysis. It may be preferable for scholars to work with specific collections that have full catalogue entries or text that is searchable via OCR. This could include


National Library of Medicine (NLM)

http://www.nlm.nih.gov/   

About: A component of the US National Institutes of Health (NIH), the National Library of Medicine is the world's largest biomedical library, which traces its roots to 1836 and the commitment of the second US Army Surgeon General to purchase books and journals for active-duty medical officers. Today, the NLM maintains and makes available a vast collection of over twelve million books, journals, manuscripts, audiovisuals, and other forms of medical information. It also produces electronic information resources on a wide range of topics that are searched billions of times each year by millions of people around the globe. As a public institution with over one hundred and seventy-five years of experience in collecting materials and providing information and research services in all areas of biomedicine and health care, the NLM is committed to introducing more audiences to its unique holdings and rich sets of data. The NLM is also committed to developing new and innovative collaborations that engage its data and, in doing so, advance research, teaching, and public understanding of the past, present, and future of medical science and public health.

A complete listing of all NLM databases and related resources & APIs is available at: http://wwwcf2.nlm.nih.gov/nlm_eresources/eresources/search_database.cfm

 

Of particular interest to digital humanists will be XML datasets and associated DTDs from:

  • The NLM’s IndexCat™ database, which encompasses more than 3.7 million history of medicine bibliographic items spanning five centuries, covering a wide range of subjects such as the basic sciences, scientific research, civilian and military medicine, public health, and hospital administration
  • Two unique collections, encompassing over 42,000 records of incipits, or the beginning words of a medieval manuscript or early printed book, covering various medical and scientific writings on topics as diverse as astronomy, astrology, geometry, agriculture, household skills, book production, occult science, natural science, and mathematics, as these disciplines and others were largely intermingled in the medieval period of European history.
  • MEDLINE®/PubMed® data, which includes over 22 million references to biomedical and life sciences journal articles back to 1946, and, for some journals, much earlier.

 

Contact/Technical Support:  See: http://apps2.nlm.nih.gov/mainweb/siebel/nlm/index.cfm   

Copyright and Related Terms and Conditions: See: http://www.nlm.nih.gov/copyright.html  


The National Library of Wales
www.llgc.org.uk

About: The National Library of Wales has been a pioneer in developing large-scale digital collections based around its holdings of texts (books, manuscripts and maps), images (photographs, artworks) and audio visual material related to Welsh history, language and culture. The National Library houses the de facto national archive of Wales, its national photographic collection, its screen and sound archive, and its second largest art collection. Digital content is available in English and Welsh.

Contact: lorna.hughes@llgc.org.uk Lorna Hughes, University of Wales Chair in Digital Collections, responsible for establishing a research programme based around the Library’s digital collections.

Resources of particular interest to the Digging into Data Challenge:

1) Historic Newspapers and Journals aims to digitise 2 million pages of historical newspapers and journals published in Wales and to provide new opportunities for existing and new audiences to research and exploit this record of everyday knowledge online – completely free of charge. This 3-year project began in 2009 and the National Library of Wales aims to launch the new digital service on its website from 2012. This will be the largest body of searchable text relating to Wales.  The project aims to digitise all of the National Library’s paper holdings of out-of-copyright newspapers and journals - generally those published in Wales up to 1911 and comprising more than 700 different titles touching all corners of Wales. Researchers worldwide will be able to search for words, phrases and dates across 2 million pages.  

2) Welsh Journals Online provides free access to scholarship from Wales. The back-numbers of up to 50 titles will be available, ranging from academic and scientific publications to literary and popular magazines. Complete runs of each title have been included, and occasional papers, Index volumes and Monographs are also available. 

3) Welsh Wills Online has made freely available digital images of over 190,000 Welsh wills (some 800,000 pages). Wills which were proved in the Welsh ecclesiastical courts before the introduction of Civil Probate on 11 January 1858 have long been deposited at The National Library of Wales. 

4) Geoff Charles Collection. Charles (1909 – 2002) was a newspaper photographer who over 50 years produced a vivid and distinctive portrait of Welsh life, and his archive of 120,000 photographs is now one of the National Library’s treasures. In the late 1990s a programme to digitise the negatives was established so that the original negatives could be frozen to stabilise their condition. Over 14,000 images from the Geoff Charles archive were digitised, producing 2,294 images and over 16,000 Fedora objects.

5) National Screen and Sound Archive of Wales. The National Screen and Sound Archive of Wales is home to a comprehensive and unequalled collection of films, television programmes, videos, sound recordings and music relating to Wales and the Welsh. The Archive is developing digital content in collaboration with major English and Welsh language broadcasters.

 


National Science Digital Library (NSDL)

www.nsdl.org

About: In 2000, the National Science Foundation created the National Science Digital Library (NSDL) to provide organized access to high quality resources and tools that support innovations in teaching and learning at all levels of science, technology, engineering, and mathematics (STEM) education. In addition to providing an organized point of access to high-quality STEM content, NSDL also provides open-access, non-proprietary tools to stimulate new ways to access and use science education information in an easily accessible online environment. NSDL currently catalogs over 60,000 resources from 57 digital collection providers and thousands of web sites. Individual resources in the library are characterized using qualified Dublin Core metadata.

NSDL makes an ideal testbed for researchers interested in exploring innovative approaches to “digging into data” over rich collections of web-based educational resources. NSDL uses a Fedora-based open-source digital library platform of technology and standards (NCore), creating a dynamic information layer on top of library resources. Collections, resources, and this information layer are easily accessible to researchers via a number of application programming interfaces. The repository contents can be accessed directly using the Digital Repository API. Additionally, a Search API is available for searching directly over NSDL collections. Finally, there is also the Strand Map Service API for searching and visualizing NSDL collections according to K-12 learning goals. This web service protocol supports the construction of interactive knowledge map interfaces based on the learning goals articulated in the American Association for the Advancement of Science (AAAS) Benchmarks for Science Literacy and the learning progressions and strand maps published in the AAAS Atlas of Science Literacy. The library’s Dublin Core metadata descriptions can also be harvested using OAI-PMH.
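
For readers unfamiliar with OAI-PMH harvesting, the Python sketch below shows the general shape of a ListRecords request and how Dublin Core titles might be pulled from a response. It relies only on the OAI-PMH 2.0 protocol itself (the verb, metadataPrefix and resumptionToken arguments and the standard namespaces). The base URL, the sample response, and the helper names are hypothetical; NSDL's actual endpoint should be taken from its own documentation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build an OAI-PMH ListRecords request for unqualified Dublin Core."""
    if resumption_token:
        # A resumptionToken is an exclusive argument: it replaces the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (titles, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    titles = [t.text for t in root.iterfind(".//dc:title", NS)]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return titles, (token.strip() or None)


# Hand-written sample response, for illustration only (not real NSDL data).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Plate Tectonics Lesson</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

# Hypothetical endpoint; substitute the repository's documented base URL.
BASE_URL = "https://repository.example.org/oai"

first_url = list_records_url(BASE_URL)
titles, token = parse_page(SAMPLE)
next_url = list_records_url(BASE_URL, token) if token else None
print(first_url)
print(titles, next_url)
```

A real harvester would fetch each URL, parse the page, and repeat with the returned resumptionToken until none is returned; the network step is left out here so the sketch runs offline.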

Documentation:

Digital Repository API documentation is available here: http://wiki.nsdl.org/index.php/Community:NDR

Search API documentation is available here: http://wiki.nsdl.org/index.php/Community:Search 

Strand Map Service API documentation is available here: http://wiki.nsdl.org/index.php/Community:StrandMaps

Contact: For more information on these APIs, contact the NSDL Technical Network Services team by sending a request via: http://nsdl.org/about/contactus/

For information on upcoming API training opportunities, please contact Karon Kelly at 303-497-2652 or kkelly@ucar.edu.

Terms of Use/Service:  http://nsdl.org/help/?pager=termsofuse


National Technical Information Service (NTIS)
 
About: The National Technical Information Service (NTIS) is the nation's largest and most comprehensive source of government-funded scientific, technical, engineering, and business information produced or sponsored by U.S. and international government sources. NTIS is a federal agency within the U.S. Department of Commerce.
 
Since 1945 the NTIS mission has been to operate a central U.S. government access point for scientific and technical information useful to American industry and government. NTIS maintains a permanent archive of this declassified information for researchers, businesses, and the public to access quickly and easily. Release of the information is intended to promote U.S. economic growth and development and to increase U.S. competitiveness in the world market.
 
The NTIS collection of more than 2 million titles contains products available in various formats. Such information includes reports describing research conducted or sponsored by federal agencies and their contractors; statistical and business information; U.S. military publications; multimedia training programs; databases developed by federal agencies; and technical reports prepared by research organizations worldwide. NTIS maintains a permanent repository of its information products.
 
More than 200 U.S. government agencies contribute to the NTIS collection, including the National Aeronautics and Space Administration; Environmental Protection Agency; the departments of Agriculture, Commerce, Defense, Energy, Health and Human Services, Homeland Security, Interior, Labor, Treasury, Veterans Affairs, Housing and Urban Development, Education, and Transportation; and numerous other agencies. International contributors include Canada, Japan, Britain, and several European countries.
 
NTIS offers Web-based access to the latest government scientific and technical research information products. Visitors to http://www.ntis.gov can search the entire collection dating back to 1964 free of charge. NTIS also provides downloading capability for many technical reports, and purchase of the publications on CD as well as paper copies.
 
National Technical Reports Library:  New at NTIS, the National Technical Reports Library (NTRL) provides a more comprehensive offering that delivers high-quality government technical content in all subject areas directly and seamlessly to the user’s desktop. The NTRL service gives access to more than 2 million NTIS bibliographic records and more than 500,000 full-text documents in PDF format. For more information, see http://www.ntis.gov/products/ntrl.aspx.
 
Contact/Technical Support:  See: http://www.ntis.gov/help/overview.aspx
 
Terms and Conditions:  See: http://www.ntis.gov/pdf/ntrl-terms-cond.pdf
 
 
 

Nebraska Digital Newspaper Project
nebnewspapers.unl.edu

About: The Nebraska Digital Newspaper Project and its contractor, iArchives, have created 300,000 full-text digitized pages of 19th and early 20th Century newspapers from selected communities in Nebraska that can be used for text mining by DID research teams.  The number of pages will grow over time.  Files have been created in three forms:  TIFF images, JPEG2000, and PDFs with hidden text.  Optical character recognition has been performed on the scanned images, resulting in dirty OCR.   Metadata associated with the project is TEI, XML, and METS/ALTO, following the guidelines provided by the Library of Congress for the National Digital Newspaper Program. Newspaper languages include English and Czech.
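
For teams working directly from the METS/ALTO files, the following Python sketch shows one way to pull words out of an ALTO page and drop low-confidence OCR tokens. It assumes each recognized word is a String element with a CONTENT attribute and an optional WC (word confidence) attribute, as in the ALTO schema. The sample XML, the function names, and the confidence threshold are illustrative and are not taken from the project's data.

```python
import xml.etree.ElementTree as ET

# Illustrative ALTO fragment (invented for this sketch, not project data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="The" WC="0.95"/>
      <SP/>
      <String CONTENT="Dai1y" WC="0.41"/>
      <SP/>
      <String CONTENT="News" WC="0.90"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Nebraska"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the XML namespace so the code tolerates different ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text, min_confidence=0.0):
    """Return one list of words per TextLine, skipping low-confidence words."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = []
        for child in elem:
            if local_name(child.tag) != "String":
                continue
            # Words without a WC attribute are kept (treated as confidence 1.0).
            confidence = float(child.get("WC", "1.0"))
            if confidence >= min_confidence:
                words.append(child.get("CONTENT", ""))
        lines.append(words)
    return lines


if __name__ == "__main__":
    for words in alto_lines(SAMPLE_ALTO, min_confidence=0.5):
        print(" ".join(words))
```

Filtering on word confidence is only one possible way to cope with dirty OCR; whether the WC attribute is populated in these particular files would need to be checked against the data.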

We also have 118,000 full text pages of Nebraska Public Documents, http://cdrh.unl.edu/nebpubdocs, that may be useful for DID.  These are XML files, TEI2 headers, with METS/ALTO and dirty OCR.

Contact: Technical support information.  The research teams can contact Jason Bougger, Systems Administrator, UNL Libraries, jbougger1@unl.edu, (402) 472-0856, for questions regarding access to the data.

Documentation.   Detailed descriptions of the NDNP requirements are found in the Library of Congress website at http://www.loc.gov/ndnp.  No APIs have been developed for the Nebraska Digital Newspaper Project.

Terms of service.  The Nebraska Digital Newspaper Project should be cited in any acknowledgements associated with the DID Challenge.   Any uses of the data that are outside of the DID Challenge should be cleared through Katherine Walter, Project Director of the Nebraska Digital Newspaper Project, kwalter1@unl.edu, (402) 472-3939.

 


New York Public Library
NYPL.org

About: Libraries are the memory of humankind, irreplaceable repositories of documents of human thought and action. The New York Public Library is such a memory bank par excellence, one of the great knowledge institutions of the world, its myriad collections ranking with those of the British Library, the Library of Congress, and the Bibliothèque nationale de France.  

Contact:  Joe Dalton jdalton@nypl.org

NYPL has several collections that may be of interest to Digging into Data participants.  For example:

The NYPL Digital Gallery contains roughly 700,000 images that could be used for data analysis.  The images are available via our RESTful API.  For more information about this API, please see: http://digitalgallery.nypl.org/feeds/dev/atom/docs/.  For an ATOM feed, please see: http://digitalgallery.nypl.org/feeds/dev/atom/.  
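
As a rough illustration of consuming an ATOM feed such as the one mentioned above, the Python sketch below parses feed entries into simple records. It assumes the feed follows the standard Atom 1.0 format; the sample feed content and the function name are invented for illustration and do not reflect the actual structure of the Digital Gallery feed.

```python
import xml.etree.ElementTree as ET

# Standard Atom 1.0 namespace.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Invented sample feed; real entries would come from the feed URL.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example image feed</title>
  <entry>
    <id>urn:example:image:1</id>
    <title>Street scene</title>
    <link rel="alternate" href="http://example.org/images/1"/>
  </entry>
  <entry>
    <id>urn:example:image:2</id>
    <title>Harbor view</title>
    <link rel="alternate" href="http://example.org/images/2"/>
  </entry>
</feed>"""


def parse_entries(feed_xml):
    """Return a list of dicts with the id, title, and link of each entry."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        entries.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "href": link.get("href") if link is not None else None,
        })
    return entries


if __name__ == "__main__":
    for item in parse_entries(SAMPLE_FEED):
        print(item["title"], item["href"])
```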

NYPL has been a lead participant in the National Digital Newspaper Program, and has additionally contracted for article-level data/coordinates for those newspapers it has provided. Available runs include the New York World (1890-1910) and the New York Daily Sun (1890-1910). There is currently no online access, but to obtain more information about using this newspaper data for analysis, please contact Barbara Taranto (btaranto@nypl.org). 


The New York Times Article Search API
http://developer.nytimes.com/

About: The NYT Article Search API allows you to search more than 2.8 million New York Times articles from 1981 to today, retrieving headlines, abstracts, lead paragraphs and links to associated multimedia.

The API supports the following types of searching:
    * Standard keyword searching
    * Date range: all articles from X date to Y date
    * Field search: search within any number of given fields, e.g., title:obama byline:dowd
    * Conjunction and negation (AND and NOT) operations, e.g., baseball yankees -"red sox"
    * Ordering by closest (variable ranking algorithms), newest and oldest
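
The query syntax listed above can be assembled programmatically. The Python sketch below builds query strings in the style of the examples (keywords, field:value pairs, and excluded terms prefixed with a minus sign). The function name is hypothetical, and the "query" parameter name used when encoding is an assumption for illustration; consult the Times Developer Network documentation for the actual request format.

```python
from urllib.parse import urlencode


def build_query(keywords=(), fields=None, exclude=()):
    """Build a search string such as: baseball yankees -"red sox"."""
    parts = list(keywords)
    # Field search, e.g. title:obama byline:dowd
    for name, value in (fields or {}).items():
        parts.append(f"{name}:{value}")
    # Excluded terms (NOT); multi-word phrases are wrapped in quotes.
    for term in exclude:
        quoted = f'"{term}"' if " " in term else term
        parts.append(f"-{quoted}")
    return " ".join(parts)


if __name__ == "__main__":
    query = build_query(["baseball", "yankees"], exclude=["red sox"])
    print(query)
    # "query" is an assumed parameter name, used here only to show URL encoding.
    print(urlencode({"query": query}))
```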

The Article Search API also offers faceted searching. The available facets include Times-specific fields such as sections, taxonomic classifiers and controlled vocabulary terms (names of people, organizations and geographic locations).

Contact: If your research team is interested in using the Times' Article Search API, please visit http://developer.nytimes.com to register for an API key.

Links:
Times Developer Network: http://developer.nytimes.com
Open blog (A blog about open-source technology at The New York Times, written by and primarily for developers. This includes our own projects, our work with open-source technologies at NYTimes.com, and other interesting topics in the open-source and Web 2.0 worlds.): http://open.nytimes.com

Terms of Use:  The New York Times Article Search API is for noncommercial use only. Please see the FAQ for more detail about commercial and noncommercial use: http://developer.nytimes.com/docs/faq

The Terms of Use are available at http://developer.nytimes.com/Api_terms_of_use, and our Attribution Requirements are available at http://developer.nytimes.com/attribution
 


 Open Images

 
About:
Open Images is an open media platform that offers online access to audiovisual archive material to stimulate creative reuse. Footage from audiovisual collections can be downloaded and remixed into new works. Users of Open Images also have the opportunity to add their own material to the platform and thus expand the collection. Open Images also provides an API, making it easy to develop mashups.
 
Access to the material on Open Images is provided under the Creative Commons licensing model. Creative Commons gives authors, artists, scientists and teachers the freedom to approach their copyright in a more flexible manner and make their work available in a way they can choose themselves.
 
The ‘open’ nature of the platform is underscored by the use of open video formats (Ogg Theora), open standards (HTML5, OAI-PMH) and open source software components. Furthermore, all software that is developed within the scope of Open Images will also be released under the GNU General Public License.
 
Open Images is an initiative of the Netherlands Institute for Sound and Vision in collaboration with Knowledgeland. By the end of 2012 Open Images offered access to over 1800 Polygoon items from the Sound and Vision archives. The collection will grow substantially over the coming years, as new items are uploaded continuously.
 
Everybody is more than welcome to add material to the platform – not only collection institutes and producers, but all netizens creating new materials based on Open Images fragments and items from other open repositories.
 
Open Images has been developed as part of Images for the Future.

Contact: Maarten Brinkerink, Netherlands Institute for Sound and Vision mbrinkerink@beeldengeluid.nl

Links:
Terms of Service:  www.openimages.eu/terms 
 

PhilPapers
http://philpapers.org

About:  PhilPapers' purpose is to facilitate the exchange and development of philosophical research through the Internet. It gathers and organizes philosophical research on the Internet, and provides tools for philosophers to access, organize, and discuss this research. PhilPapers aggregates article-level metadata from professional journals, open access archives, personal web sites, library catalogues and other sources of professional publications in philosophy. It is one of the most comprehensive literature indexes for philosophical research in English.

Contact: http://philpapers.org  or David Bourget, Research Fellow, Institute of Philosophy, University of London, and General Editor, PhilPapers.org, root@dbourget.com, phone +44 20 7862 8678  


Project MUSE
http://muse.jhu.edu/

About: Project MUSE is an online collection of over 400 journals from approximately 100 not-for-profit publishers.  MUSE sells 6 specific collections to institutional libraries worldwide.  More information is available at http://muse.jhu.edu/about/muse/index.html

Contact: Wendy Queen, Manager of Electronic Publishing Technologies.  Phone: 410-516-3845.  Email: wendy@muse.jhu.edu  Please also copy Mary Rose Muccie (mrm@press.jhu.edu) on any emails.

A representative from each participating team using Project MUSE is required to sign a Memorandum of Understanding that outlines the terms of use for Project MUSE content. The MOU is available at http://muse.jhu.edu/about/docs/did_mou.pdf.

MUSE does not have an API available online. Please call Wendy Queen (contact details above), who can explain our database structure and provide any other information you need.  We have several access methods available to MUSE subscribers and need to know how you plan to access the content (basic IP authentication, Athens, Shibboleth, etc.).  Please get in touch with Wendy before any downloading begins, so that we are prepared, do not shut off your access for an apparent license violation, and can exclude your hits from our usage statistics, which are a component of both our pricing and our publisher royalty payments.


PSLC DataShop
pslcdatashop.web.cmu.edu/

About: The PSLC (Pittsburgh Science of Learning Center) DataShop is a data repository and web application for learning science researchers. It provides secure data storage as well as an array of analysis and visualization tools available through a web-based interface. PSLC DataShop (https://pslcdatashop.web.cmu.edu/) currently houses over 230,000 hours of fine-grained student course data from a variety of math, science, and language courses. PSLC DataShop can serve as a learning science project’s data repository in order to satisfy NSF Data Management requirements. This allows the data to be available for sharing, re-use, and re-distribution, as well as for the production of derivatives. At all times the availability of the data is at the discretion of the data owner. For more information about the Pittsburgh Science of Learning Center, please see http://LearnLab.org.

Contact: If you need more information, please contact datashop-help@lists.andrew.cmu.edu


Research Data Australia (RDA)
http://researchdata.ands.org.au

About: Research Data Australia is a flagship service of the Australian National Data Service (ANDS) and provides a national and comprehensive window into the Australian Research Data Commons.

This discovery service provides rich connections between data, projects, researchers and institutions, and promotes visibility of Australian research data collections. 70 Australian universities, research institutions and government data producing agencies have contributed over 55,000 data collections to Research Data Australia (mid 2013). 

Users can search by subject area, geographic location, or topic, or browse to find: research data collections; researchers or organisations; data created by projects; and services that support the creation or use of research datasets or collections.  Searching by subject area uses the Australian and New Zealand Standard Research Classification (ANZSRC), which was developed by the Australian Government to meet the dual needs for a comprehensive description of today's research environment and the ability to compare R&D statistics internationally. A specific strength of RDA is the Tropical Research Data topic page, which brings together collections, contributors, and related services, activities and additional links.

ANDS also has reciprocal publishing arrangements with other data collection registries and discovery portals in Australia and internationally.

Contact:  Please contact us for information about RDA - contact@ands.org.au

Links:   http://researchdata.ands.org.au (RDA) and http://ands.org.au (ANDS)

ANDS is funded by the Australian Government through the National Collaborative Research Infrastructure Strategy (NCRIS) and the Education Investment Fund (EIF) Super Science Initiative. 


Scholarly Database at the Cyberinfrastructure for Network Science Center, Indiana University
sdb.slis.indiana.edu
 
About:  The Scholarly Database (SDB) at Indiana University aims to serve researchers and practitioners interested in the analysis, modeling, and visualization of large-scale scholarly datasets.  The online interface provides access to four datasets: Medline papers, U.S. Patent and Trademark Office patents (USPTO), National Science Foundation (NSF) funding, and National Institutes of Health (NIH) funding – over 22 million records in total.  Users can register for free at http://sdb.slis.indiana.edu to cross-search these datasets and to download result sets as dumps for scientometrics research and science policy practice.

Contact:  Nianli Ma, SDB Team Lead, nianma@indiana.edu, (812) 856-3465. 


ScholarSpace at the University of Hawai'i at Manoa
scholarspace.manoa.hawaii.edu/community-list

About: Most of the material consists of previously published scientific journals, pamphlets and similar items, or open access journal issues, along with what will soon be over 2,000 dissertations published at the University of Hawaii.

Contact: If you are interested in this material, the IR project manager is Beth Tillinghast (betht@hawaii.edu), and the technical contacts are Daniel Ishimitsu (daniel20@hawaii.edu) and Wing Leung (leungwin@hawaii.edu).


Statistical Accounts of Scotland
http://edina.ac.uk/stat-acc-scot/

About:  The two Statistical Accounts of Scotland, covering the 1790s and the 1830s, are among the best contemporary reports of life during the agricultural and industrial revolutions in Europe. Learn more about the area in which you or your ancestors have lived, or use this key source to study the emergence of the modern British State and the economic and social impact of the world's first industrial nation.

Based largely on information supplied by each parish [church] minister and other compilers, the Old (First) Statistical Account (1791-99) and the New (Second) Statistical Account (1834-1845) provide a rich record of a wide variety of topics: wealth, social structures and poverty; climate, agriculture, fishing and wildlife; religion; population, schools, and the social habits of the people.

The online service features include: 

  • scanned images for browsing
  • transcribed text enabling copy and paste
  • page and volume search
  • the accounts presented in published order
  • key word searching and display of results in order
  • bookmarking of parishes or page citations
  • PDF download for parish reports
  • selected original manuscript parish reports 
  • resources related to the preparation and publication of the Accounts
  • parish links to the Gazetteer for Scotland
  • index to compilers of parish reports
  • index of maps within the Accounts
  • contemporaries and successors

 
Contact: All visitors can freely browse, view and print the scanned original pages from the two Accounts by clicking on the "Browse scanned pages" link on the Statistical Accounts of Scotland login page. Subscriptions to the Statistical Accounts of Scotland service may be taken out by individuals, educational institutions and other organisations in the UK and overseas: http://edina.ac.uk/stat-acc-scot/access/prices.html

Links/APIs: There are links to parishes or specific pages (via Persistent URLs, PURLs), from which additional metadata can be derived, but there is no public API.

Terms of Service: The Statistical Accounts of Scotland Online Service is provided by EDINA. The design of the interface and database are property of EDINA. Copyright in page images belongs jointly to the Universities of Glasgow and Edinburgh.  Subscribers to the service may print or download, in full, the page images from up to 10 parishes for use in teaching, research or personal educational development. These page images may not be republished in full in any format including, inter alia, on the world-wide-web, in print or on CD-ROM. A single page image from a parish may be republished on condition that copyright is properly acknowledged (in the form 'Images from Statistical Accounts Online Service © University of Glasgow and University of Edinburgh') and a link is provided alongside to the front page of the Statistical Accounts Online Service (in the form 'The Statistical Accounts of Scotland is available online at http://edina.ac.uk/stat-acc-scot/'). Use of images for consultancy or for services leading to their commercial exploitation is prohibited without the explicit permission of the Copyright holders. A request for such permission should be addressed to EDINA in the first instance: Email edina@ed.ac.uk or tel: 0131-650 3302.


University of Florida Digital Library Center
www.uflib.ufl.edu/ufdc

 About:  The University of Florida Digital Collections (UFDC) hosts local and international collections, housing over 8 million pages of all material types (books, archival documents, newspapers, photographs, audio, video, museum objects, data sets, maps, etc.) in many languages. A full list of collections that are clickable to descriptions, with statistics for each, is available here: http://ufdc.ufl.edu/stats/usage/history 

Contact

Laurie Taylor, UF Digital Humanities Librarian, Laurien@ufl.edu and 352.273.2902

Mark Sullivan, UF Head of Digital Development & Web Services, marsull@uflib.ufl.edu

Selected, large collections are:

  • Digital Library of the Caribbean: 79,313 items and 1,793,735 pages
  • Florida Digital Newspaper Library: 88,614 issues and 1,386,668 pages
  • Baldwin Library of Historical Children's Literature: 6,316 items and 941,350 pages

The Digital Library of the Caribbean contains historic through current materials in multiple languages (primarily in English, Spanish, and French). The Florida Digital Newspaper Library includes historic through current newspapers. The Baldwin collection contains 19th century children's literature. All collections and items are openly accessible for use and for datamining.

The SobekCM system powering the UF Digital Collections supports OAI-PMH, searches and browses as XML, and a JSON interface to images and raw text (in use for several iPhone Apps). Extensive documentation is available here: http://ufdc.ufl.edu/sobekcm/harvesting and on the main SobekCM pages: http://ufdc.ufl.edu/sobekcm/   
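
As a rough sketch of OAI-PMH harvesting, the Python example below builds ListRecords request URLs and parses a response into identifiers and Dublin Core titles. The verb, metadataPrefix, and resumptionToken arguments and the XML namespaces come from the OAI-PMH 2.0 protocol; the base URL, the sample response, and the function names are placeholders invented for illustration. The actual endpoint is described in the SobekCM harvesting documentation linked above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented sample response; a real one would be fetched over HTTP.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example newspaper issue</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords URL; a resumption token replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, [titles])], next_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS)
        titles = [t.text for t in record.iterfind(".//dc:title", NS)]
        records.append((identifier, titles))
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)


if __name__ == "__main__":
    base = "http://example.org/oai"  # placeholder, not the UFDC endpoint
    print(list_records_url(base))
    records, token = parse_list_records(SAMPLE_RESPONSE)
    print(records, token)
    print(list_records_url(base, resumption_token=token))
```

A full harvester would loop, requesting the next page with each returned resumption token until none is returned.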


University of North Texas
digital.library.unt.edu/browse/?browseby=collection

About:  Collections may be viewed at http://digital.library.unt.edu/browse/?browseby=collection.  The collections are heterogeneous in nature, including books, posters, photographs, born digital reports, musical scores, newspapers, letters, maps, etc.  The Portal to Texas History is the largest collection listed.

Contact: Mark Phillips, mark.phillips@unt.edu

While we currently are not able to supply active API interfaces to our system, here are some of the services that will be supported in our initial release of our new infrastructure. 

The digital library infrastructure in development at UNT consists of a METS based delivery system. We use a locally developed METS profile for storing file and structural metadata, which is used in delivering content to our users.  Descriptive metadata is stored in a locally qualified Dublin Core metadata scheme which has been developed over the past five years and is documented on our libraries website <http://www.library.unt.edu/digitalprojects/metadata>. 

Services built on our digital objects revolve around existing protocols for sharing and reusing data in libraries.  These include support for OAI-PMH to provide access to both Dublin Core and MODS representations of each metadata record in the system.  The system supports queries in both SRU and OpenSearch. 
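
To illustrate what an SRU query might look like once such a service is available, the Python sketch below builds a searchRetrieve request URL. The parameter names (operation, version, query, startRecord, maximumRecords) come from the SRU standard; the base URL, the CQL query, and the function name are placeholders for illustration and do not describe UNT's actual endpoint.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sru_search_url(base_url, cql_query, start_record=1,
                   maximum_records=10, version="1.2"):
    """Build an SRU searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder base URL and query, shown only to demonstrate the shape.
    url = sru_search_url("http://example.org/sru", 'title any "texas history"')
    print(url)
    # Decoding the URL again shows that the query round-trips intact.
    print(parse_qs(urlsplit(url).query)["query"][0])
```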

Each digital object has descriptive metadata available in a variety of  simple formats for reuse.  Examples include XML, JSON and TXT.  Automatically generated citations are also available in a variety of formats.  We are using the ARK identifier scheme and providing ERC records containing simple metadata for each object in an easy to use format. Additionally it is planned to support COinS and unAPI as well as other Microformats where appropriate. 

Additional planned services include the use of Open Text Mining Interfaces for supplying other organizations the full-text of our digital objects without compromising the terms of our agreements with content providers.   In addition to the full text, this will provide word count vectors for use in visualization tools such as word clouds and other data graphs. 
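
A word count vector of the kind mentioned above can be thought of as a mapping from each word to the number of times it occurs. The Python sketch below computes one from plain text. The tokenization rule (lowercase runs of letters, digits, and apostrophes) and the function name are assumptions made for illustration, not a description of how the planned service will compute its vectors.

```python
import re
from collections import Counter


def word_count_vector(text):
    """Map each lowercase token in the text to its number of occurrences."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


if __name__ == "__main__":
    sample = "The Portal to Texas History is the largest collection."
    vector = word_count_vector(sample)
    # The most frequent words could feed a word cloud or similar display.
    print(vector.most_common(3))
```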

Special terms of services – The data may be used freely for research purposes. 

 

 

 

Koninklijke Bibliotheek (KB)

About: The Koninklijke Bibliotheek (KB) is the national library of the Netherlands: we bring people and information together. We offer access to everything published in and about the Netherlands, play a central role in the (scientific) information infrastructure of the Netherlands and promote permanent access to digital information nationally and internationally. We digitise all the books, periodicals and newspapers that have been published in the Netherlands (http://www.delpher.nl/). Most of the resulting data is available for researchers via an API.
 
 
 
Terms of Service: Most datasets are available for academic research purposes. Please contact us for more detail.  
