
List of Data Repositories

Last Updated:  16 June, 2009

Below is a list of digital libraries, data archives, and data repositories that are inviting Digging into Data researchers to use their collections.  For each repository, you'll find a description of their contents, contact information, and other details.

This list is updated frequently, so check back often!  If you represent a digital repository and would like to be included on this list, please get in touch with us.

 


The Archaeology Data Service (ADS)
ads.ahds.ac.uk

About: The ADS catalogue holds the digital archives of around 400 archaeological projects from the UK and beyond. These range from the outputs of single excavations to large-scale, developer-funded projects encompassing hundreds of individual archaeological interventions. As well as digital archives and fieldwork outputs, the catalogue contains a number of scholarly resources intended specifically as reference sources for further research on topics such as lithics, ceramics and animal bone. The catalogue also contains digitised (or born-digital) versions of various significant journals and series running to many thousands of individual articles. These include scholarly journals such as the Proceedings of the Society of Antiquaries of Scotland, Research Reports from the CBA and around 3000 'grey literature' fieldwork reports. In addition to the catalogue holdings, the ADS provides access to over 1,000,000 aggregated resource discovery metadata records for monument inventories from around the UK (including data from the RCAHMS, Northern Ireland Department of the Environment, English Heritage and numerous local authorities). The ADS also hosts the Archaeological Records of Europe Networked Access (ARENA) service, which aggregates monument inventory data from six European partners. This service is currently being both geographically expanded and technically enhanced.

Contact: If your research team is interested in using the ADS catalogue please contact the ADS User Services Manager, Dr Stuart Jeffrey, at sj523@york.ac.uk or on +44(0)1904433954

Links: The ADS has an OAI-PMH service and a Z39.50 Target Specification (http://ads.ahds.ac.uk/project/target_spec.html). A WSDL specification for the ADS resource discovery metadata service will be published mid 2009.

Terms of service: Everything hosted by the ADS is freely available for teaching, learning, and research purposes, subject to our Access Agreement, which is available at http://ads.ahds.ac.uk/cap.html. Our Copyright and Liability statement is available at http://ads.ahds.ac.uk/copy.html.

 


ARTstor
www.artstor.org

About:  ARTstor is a digital library of nearly one million images in the areas of art, architecture, the humanities, and social sciences with a set of tools to view, present, and manage images for research and pedagogical purposes.

Contact:  Bill Ying at WWY@artstor.org.

1) ARTstor can provide researchers with access to the ARTstor Library (provided they have signed individual user agreements) through our XML gateway, allowing federated or cross-database searches (for more information on the XML gateway, please see http://www.artstor.org/what-is-artstor/w-html/features-and-tools-metasearch.shtml).

2) ARTstor can also provide researchers with special access to a set of large collections.  For more information, please contact Bill Ying at WWY@artstor.org.

 


The Centre for Contemporary Canadian Art
Canadian Art Database Project
www.ccca.ca

About: The Canadian Art Database Project, housed in the Faculty of Fine Arts at York University in Toronto, is a work in progress. It is the prime web resource that focuses on contemporary Canadian art production and the recent history of Canadian art. For the past several years, The Centre for Contemporary Canadian Art has been assembling a growing collection of previously inaccessible or hard-to-find information on Canadian art in all media [images, texts, media works, and related ephemera] from a variety of sources across Canada into a fully searchable, bilingual database. The ongoing project is documenting some important Canadian art institutions and organizations that have helped shape the Canadian art scene since the 1960s, along with the careers of some of Canada's leading professional artists, designers, art writers and curators.

The Canadian Art Database Project currently holds: 

  • more than 54,000 images and 650 media clips by over 600 prominent Canadian visual, media and performance artists; 
  • more than 1,200 images by 40 leading Canadian graphic designers;
  • more than 3,000 texts on art by 190 Canadian writers;
  • a searchable Canadian Art Bibliography [currently holding  9,000+ references];
  • a searchable Canadian Art Chronology [currently holding  6,000+ references];
  • a developing series of Video Portraits profiling artists, graphic designers and art personalities;
  • several related projects that complement the core archive.

Contact: Bill Kirby, Director, Centre for Contemporary Canadian Art; Visiting Assistant Professor, Department of Visual Arts, York University. Telephone: 416.533.4810; E-mail: kirby@ccca.ca

The Canadian Art Database has become an essential interactive teaching resource about Canadian visual culture in secondary and post-secondary classrooms across Canada and abroad. It is attracting a large and varied international audience, receiving daily averages of some 2,300 visits [60,000+ per month] and 100,000 hits [3 million+ per month]. There are more than 30,000 unique visitors per month from more than 100 countries.

The CCCA employs a unique ‘artist-empowered’ copyright model in which the copyright on all materials included in the Canadian Art Database project is retained by the individual creators and authors. Additional materials have all been cleared by the respective copyright holders. The Database is housed in the MIMSY Information Management System, which has been specially customized for the CCCA.

Users of the Canadian Art Database are able to freely view and use [but not alter] the material housed in the Canadian Art Database solely for educational and research purposes.

The Centre for Contemporary Canadian Art would welcome enquiries from any other researchers who might wish to work with The Canadian Art Database.

   


Chronicling America
Library of Congress
National Digital Newspaper Program
www.loc.gov/chroniclingamerica

About:  As of December 2008, Chronicling America provides free and searchable access to more than 850,000 pages of historic newspapers published between 1880 and 1910. These newspapers are selected and digitized by NEH awardees through the National Digital Newspaper Program (http://www.neh.gov/projects/ndnp.html), per Library of Congress technical guidelines (see http://www.loc.gov/ndnp/techspecs.html). Page-level data presented through Chronicling America include JPEG2000, PDF, and searchable page text. To date, nine state awardees and the Library of Congress have contributed content to the site from newspapers published in California, the District of Columbia, Florida, Kentucky, Minnesota, Nebraska, New York, Texas, Utah, and Virginia. In Spring 2009, six additional states (Arizona, Hawaii, Missouri, Ohio, Pennsylvania, and Washington) will add content published between 1880 and 1922. The site will continue to expand over time (potentially twenty years) to eventually include newspapers published between 1836 and 1922 from all 54 states and territories.

Contact: Tech support contact: David Brunton (dbrun@loc.gov), Technical Coordinator, National Digital Newspaper Program, Repository Development, Office of Strategic Initiatives, Library of Congress.

Links to APIs and documentation:  The Library will make available the digitized text (created through Optical Character Recognition) of approximately one million newspapers in the METS/ALTO XML format (see http://www.ccs-gmbh.com/alto/ ).  For each page of OCR text, the library will include a permanent link to an image of the page, from which additional metadata can be derived.
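As a rough illustration of working with METS/ALTO OCR output, the sketch below pulls the recognized words out of a single ALTO page. In ALTO, each recognized word is a String element whose CONTENT attribute holds the text, and words are grouped under TextLine elements. The sample document, its namespace URI, and the function name alto_lines are made up for illustration; real files from the Library will carry many more elements and attributes, so the code matches on local tag names only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed ALTO fragment (illustration only; the
# namespace URI is just an example and is ignored by the code below).
SAMPLE_ALTO = """<alto xmlns="http://schema.ccs-gmbh.com/ALTO">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="THE"/><SP/><String CONTENT="DAILY"/><SP/><String CONTENT="NEWS"/>
    </TextLine>
    <TextLine>
      <String CONTENT="Washington,"/><SP/><String CONTENT="D.C."/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, joining the CONTENT of its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [
                child.attrib.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


page_lines = alto_lines(SAMPLE_ALTO)
```

Matching on local names keeps the sketch independent of which ALTO schema version a given file declares.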

In Spring 2009, the Library will provide an OpenSearch API [1], with results returned in HTML, JSON, or Atom, at the researcher's discretion.  From the search results, the Library will provide pointers to additional information for each result based upon a URI Template. [2]

[1.] http://www.opensearch.org/Home 

[2.] http://bitworking.org/projects/URI-Templates/spec/draft-gregorio-uritemplate-03.txt
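Since the Library's search endpoint is not given here, the following is only a sketch of how a client might fill in an OpenSearch-style URL template. The template string and the example.org host are placeholders; searchTerms and startPage are parameter names from the OpenSearch 1.1 specification, and a trailing "?" marks a parameter as optional. The sketch handles only this simple placeholder form, not the full URI Template draft.

```python
import re
from urllib.parse import quote

# Hypothetical OpenSearch URL template; the real endpoint and parameter
# layout are not given in this document.
TEMPLATE = "http://example.org/search?q={searchTerms}&page={startPage?}&format=json"

# Matches {name} and {name?} placeholders.
_PARAM = re.compile(r"\{([A-Za-z:]+)(\?)?\}")


def expand(template, **values):
    """Fill OpenSearch-style {name} and optional {name?} placeholders."""

    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            # Percent-encode everything, including spaces and slashes.
            return quote(str(values[name]), safe="")
        if optional:
            return ""  # optional parameters may be left empty
        raise KeyError(f"missing required template parameter: {name}")

    return _PARAM.sub(substitute, template)


url = expand(TEMPLATE, searchTerms="railroad strike", startPage=2)
```

Leaving an optional parameter out yields an empty value in the query string, while omitting a required one raises an error before any request is sent.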

Special terms of service:

Data provided by the Library is for the sole use of the awardee in support of research as described in the Digging into Data proposal and should not be re-used or re-distributed for any other purpose without permission.

The Library reserves the right to block IP addresses that fail to honor the Library's robots.txt files or submit requests at a rate that negatively impacts service delivery to all Library patrons. Current guidelines recommend that software programs submit a total of no more than 10 requests per minute to Library applications, regardless of the number of machines used to submit requests. The Library also reserves the right to terminate programs that require more than 24 hours to complete. (See http://www.loc.gov/homepage/legal.html for more information).
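One way a harvesting script could stay within the recommended 10 requests per minute is a small sliding-window throttle, sketched below. The class name Throttle and its parameters are illustrative, not part of any Library software. Note that the guideline applies across all machines combined, so a per-process limiter like this one is sufficient only when a single process issues all of the requests.

```python
import time
from collections import deque


class Throttle:
    """Allow at most `limit` calls in any sliding window of `period` seconds."""

    def __init__(self, limit=10, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.period = period
        self.clock = clock    # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.sent = deque()   # timestamps of requests still inside the window

    def wait(self):
        """Block until another request may be sent, then record it."""
        while True:
            now = self.clock()
            # Forget requests that have left the window.
            while self.sent and now - self.sent[0] >= self.period:
                self.sent.popleft()
            if len(self.sent) < self.limit:
                self.sent.append(now)
                return
            # Window is full: sleep until the oldest request expires.
            self.sleep(self.period - (now - self.sent[0]))


# Usage sketch: call throttle.wait() immediately before each HTTP request.
throttle = Throttle(limit=10, period=60.0)
```

A script would also be expected to honor the Library's robots.txt; Python's standard urllib.robotparser module can be used for that check.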

 


Data-PASS
http://www.icpsr.umich.edu/DATAPASS/

About: The Data Preservation Alliance for the Social Sciences (Data-PASS) is a partnership of social science data archives devoted to identifying, acquiring and preserving primary source data at risk of being lost to researchers. Examples of at-risk data include opinion polls, voting records, large-scale surveys on family growth and income, and many other social science studies.

Contact: If your research team is interested in using the Data-PASS shared catalog, please see http://dvn.iq.harvard.edu/dvn/dv/datapass. For general questions, please contact data-pass@icpsr.umich.edu. For technical support, please contact Dr. Micah Altman (Micah_Altman@harvard.edu).

Data collections preserved by the Data-PASS partnership are described on the Data-PASS shared catalog (http://dvn.iq.harvard.edu/dvn/dv/datapass) in descriptive metadata records. Some of the descriptions link directly to data files and technical documentation. Many Data-PASS data holdings are freely available to anyone, while some content is restricted. 

Data-PASS is led by the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan, the Roper Center for Public Opinion Research at the University of Connecticut, the Howard W. Odum Institute at the University of North Carolina-Chapel Hill, the Henry A. Murray Research Archive, a member of the Institute for Quantitative Social Science at Harvard University, the custodial electronic records division of the National Archives and Records Administration, and the Harvard-MIT Data Center, also a member of the Institute for Quantitative Social Science at Harvard University.
 


Digital Library for Earth System Education (DLESE)

www.dlese.org

About: Since 1999, the Digital Library for Earth System Education (DLESE) has provided searchable access to high-quality, online educational resources for K-12 and undergraduate Earth system science education.   These resources include maps, lesson plans, lab exercises, data sets, virtual field trips, and interactive demonstrations. The holdings of DLESE are created by a wide variety of individual faculty members, federal and state agencies, and cultural institutions.  These resources are held (stored) on local servers and are accessed through the library via searchable metadata records in the ADN format, which extends the IEEE-LOM format to include rich educational metadata, such as educational standards, as well as geospatial and temporal descriptions. Additionally, the library contains user-contributed content in the form of teaching tips and resource reviews. 

DLESE resources and collections can be accessed by the Digital Discovery System (DDS) services and APIs, which can be used to search over DLESE content or flexibly configured to search over any XML schema structure, including user-contributed content. The APIs come in two flavors: a RESTful web service and a JavaScript API. A range of information retrieval features are available including textual and field-based searches such as audience, subject, resource type or educational standard. DDS also supports geospatial search and can be integrated with Web 2.0 applications such as Google Maps. The DLESE repository metadata can also be harvested using OAI-PMH. DLESE collections have been used for a rich variety of computational linguistics, natural language processing, and machine learning research and we welcome the opportunity to work with Digging Into Data researchers to further extend DLESE’s utility as a research test bed. 

Contact: John Weatherley (jweather@ucar.edu) for information on the DDS API. Tamara Sumner (sumner@colorado.edu) for recent publications using DLESE as a research testbed. You can also reach the team by emailing support@dlese.org

Links / Services:  

Digital Discovery System (DDS) Web Services and APIs are described at:

http://www.dlese.org/dds/services/index.jsp

OAI Data Provider:

http://www.dlese.org/dds/services/oaiDataProvider/index.jsp

Open Source Software:

http://www.dlese.org/dds/dds_overview.jsp

Terms of Use/Service:  http://www.dlese.org/documents/policy/terms_use_full.php
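Several repositories on this list, DLESE and the ADS among them, expose their metadata through OAI-PMH. As a minimal sketch of what a harvesting request looks like, the code below builds request URLs for the standard Identify and ListRecords verbs. The base URL is a placeholder, not a real endpoint, and the helper name oai_request is illustrative; consult each repository's own documentation for its actual base URL and supported metadata formats.

```python
from urllib.parse import urlencode

# Placeholder base URL: substitute the endpoint documented by the repository.
BASE_URL = "http://example.org/oai"


def oai_request(base_url, verb, **args):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb}
    params.update(args)
    return base_url + "?" + urlencode(params)


# Ask the repository to describe itself.
identify_url = oai_request(BASE_URL, "Identify")

# Start harvesting records in unqualified Dublin Core (oai_dc).
records_url = oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

# Continue a partial list; a resumption token is sent on its own,
# without metadataPrefix. "abc123" is a made-up token value.
resume_url = oai_request(BASE_URL, "ListRecords", resumptionToken="abc123")
```

Large result sets are returned in chunks, so a harvester typically repeats the last form of request until the response no longer carries a resumption token.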

 


Early Canadiana Online
www.canadiana.org

About: Early Canadiana Online (ECO) is a digital library providing access to 2,838,778 pages of Canada's printed heritage. It features works published from the time of the first European settlers up to the early 20th Century.

Contact: William Wueppelmann, Systems Librarian, william.wueppelmann@canadiana.org

613-235-2628 ext. 226

1) Pilot Project: Approximately 500,000 pages from a variety of subjects, including Canadian women’s history, English Canadian literature, the history of French Canada, and native studies.

2) Early Governors General of Canada: 60 publications focusing on the early Governors General of Canada.

3) Hudson’s Bay Company publications: 160 titles

4) Jesuit Relations: 73-volume set of books in the original Italian, Latin and French, with commentary and translation into English.

5) Official publications: 1.5 million pages of pre-1900 Canadian official publications, including journals of legislative assemblies of the various colonies of pre-Confederation Canada, Journal of the House of Commons, Debates of the House of Commons, Debates of the Senate, Reconstituted Debates of Canada, Sessional papers, Journals of the Senate, Statutes of Canada, Statutes of the pre-Confederation colonial legislative councils, and debates of the pre-Confederation colonial houses of assembly.

6) Periodicals: Approximately 700,000 pages of pre-1920 Canadian periodicals.

 


English Broadside Ballad Archive (EBBA) 
ebba.english.ucsb.edu

About: EBBA mounts online surviving but difficult-to-access early ballads printed in English, with priority given to black-letter broadsides of the seventeenth century--the heyday of the printed broadside ballad.  The database currently holds over 1,800 ballads from the Samuel Pepys collection at Magdalene College, Cambridge, and is in the process of adding the approximately 1,300 ballads in the Roxburghe collection at the British Library.  EBBA makes these ballads fully accessible as texts, art, music, and cultural records of the period.  We provide online images of each ballad in high-quality facsimiles as well as "facsimile transcriptions" (which preserve the original ballad ornament while transcribing the black-letter font into easily readable white-letter or roman print). In addition, we provide sung versions of the ballads, background essays that culturally place the ballads, TEI/XML encoding of the ballads, and search functions that allow readers easily to find ballads as well as their constituent parts/makers.

Contact: Patricia Fumerton, Director of the English Broadside Ballad Archive, Department of English, University of California, Santa Barbara, CA. 93106.  FAX: 805-893-4622.  Email: pfumer@english.ucsb.edu 

 


Great War Primary Documents Archive
www.gwpda.org

About: The Great War Primary Documents Archive is dedicated to the collection, preservation, and development in electronic form of materials relating to the First World War. It is a resource for scholars and students, and is a perpetual memorial to the heroism and sacrifice of those who participated in the war. Hosted since 1995, first at the University of Kansas, then at BYU, and now on private servers, it is the first and largest online full-text international collection of documents and images related to the Great War period, 1880-1926. The Archive provides free and universal public access to the full-text records of the history of World War I and the twentieth century's attempts to deal with the conflict, the collapse of national and international political agreement into active warfare, and the post-war effort to create a world without war. Thus far, the site has been hit more than 15,500,000 times, and innumerable students, scholars and researchers have examined, analyzed and incorporated these documents into their work. The Great War Primary Documents Archive now holds some 15,000 fully searchable pages of these significant official and public documents. We are the Web's non-partisan source of this material and a primary resource for isolated and under-funded repositories worldwide, and we provide these documents and historical information without charge.

Contact: AJ Plotke, Executive Director, GWPDA.  Telephone: 602.297.1914; E-mail: GWPDA-DID@gwpda.org or cd078@faradic.net.

GWPDA is entirely accessible online, using any browser or system and does not require any special access permissions. All documents and images are either ex-copyright or their copyright is held by GWPDA. It may be easier to download material from the site through the administration’s back channel - please contact us before starting a complete document harvest so that we can determine the most efficient method of transferring data. 

 


Harvard Time Series Center (TSC)
timemachine.iic.harvard.edu/search/

About: Harvard Time Series Center (TSC) is an interdisciplinary effort dedicated to creating the world's largest data center for time series and to developing algorithms to understand and analyze various aspects of these time series. The partnership of the data center and the analysis effort makes possible both the discovery of new and rare phenomena and large-scale studies of known phenomena.

The TSC hosts close to 1 billion time series, mainly from the field of astronomy but expanding to economics, health data, real estate data, etc.

Each time series typically consists of 100 to 100,000 measurements, making the total number of measurements greater than a trillion. We have time series that go back 100 years with measurements every few days, as well as time series that were taken at 200 Hz for a short period of time.
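As a rough sanity check on these figures, the short calculation below works out the range the total could fall in and the average series length needed to pass one trillion measurements. The round number of one billion series is taken from the text above; everything else follows from arithmetic, and the variable names are only illustrative.

```python
SERIES = 1_000_000_000            # "close to 1 billion" time series
MIN_POINTS, MAX_POINTS = 100, 100_000

low = SERIES * MIN_POINTS         # if every series sat at the short end
high = SERIES * MAX_POINTS        # if every series sat at the long end

# Average series length needed for the total to reach one trillion.
needed_average = 1_000_000_000_000 / SERIES

# Time span covered by the longest typical series when sampled at 200 Hz.
seconds_at_200hz = MAX_POINTS / 200
```

The total therefore lies somewhere between 10^11 and 10^14 measurements, and it exceeds a trillion whenever the average series holds more than 1,000 points. A 100,000-point series sampled at 200 Hz covers 500 seconds.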

Our collection represents one of the largest and most interesting datasets in the world for time series and gives analysts a unique opportunity to test their algorithms at large scales.

This is an unprecedented opportunity to take part in developing computational algorithms, making scientific discoveries across multiple fields, and answering some of the most fundamental questions.

For each time series  in the database, the TSC maintains a list of links to relevant resources, including:

  • Links to the original images via URLs.
  • A list of metadata about the object (position on the sky, time of observation, wavelength of observation, etc.).
  • A list of provenance information regarding the process between the original images and time series.

The TSC also maintains a full set of web services that can be accessed using any programming language such as Python, Perl etc.  Using those web services one can query the database and retrieve a subset of the dataset. These capabilities have been used to create a web interface http://timemachine.iic.harvard.edu/search/ for the astronomy data sets.

Contact: If your research team is interested in using the content maintained by our project, please contact the TSC lead investigator, Pavlos Protopapas at pprotopapas@cfa.harvard.edu

Links: The TSC supports export of its metadata records via a RESTful interface and a highly structured JSON format.

Terms of service: access to the TSC is freely available to the general public for personal use.  The relevant terms of use are detailed in the document at http://timemachine.iic.harvard.edu/tos/


 


Hathi Trust
www.hathitrust.org

About: HathiTrust was conceived as a collaboration of the thirteen universities of the Committee on Institutional Cooperation and the University of California system to establish a repository for these universities to archive and share their digitized collections. Partnership is open to all who share this grand vision.

Contact: The mechanisms HathiTrust will use to transfer the data remain to be determined.  DID-approved researchers interested in acquiring the 50,000-document sample should contact hathitrust-datasets@umich.edu and should specify the Digging into Data (DID) dataset.

HathiTrust will make available a corpus of 50,000 volumes representing a mix of dates (all pre-1923), countries of origin, language, character sets, and formats (i.e., some serial literature in a body of mostly monographic literature). The vast majority of these texts were digitized by Google and are made available with Google’s permission.  Though delivered in a single package, each volume will be in a separate directory associated with a METS file, which will in turn be documented on the HathiTrust website.  Descriptive metadata such as authors and titles will not be provided along with the data, but the unique identifier for each volume can be used to gather bibliographic data from an API, described below. 
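To give a feel for what reading a per-volume METS file might involve, here is a minimal sketch that lists the files a METS document points to. The METS and XLink namespaces are the standard ones, but the sample document, its identifiers, and its file names are invented for illustration; the actual layout of the HathiTrust package is whatever HathiTrust documents on its website.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical, trimmed METS document (identifiers and file names are made up).
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink" OBJID="example.0001">
  <mets:fileSec>
    <mets:fileGrp USE="ocr">
      <mets:file ID="OCR1" MIMETYPE="text/plain">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM" xlink:href="00000001.txt"/>
      </mets:file>
      <mets:file ID="OCR2" MIMETYPE="text/plain">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM" xlink:href="00000002.txt"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def list_files(mets_xml):
    """Return (object id, [(file id, href), ...]) for a METS document."""
    root = ET.fromstring(mets_xml)
    href_attr = "{%s}href" % NS["xlink"]   # namespaced attribute name
    files = []
    for file_elem in root.findall(".//mets:file", NS):
        locat = file_elem.find("mets:FLocat", NS)
        if locat is not None:
            files.append((file_elem.get("ID"), locat.get(href_attr)))
    return root.get("OBJID"), files


volume_id, volume_files = list_files(SAMPLE_METS)
```

The object identifier recovered this way is the kind of value that could then be passed to a bibliographic API to look up authors and titles.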

The HathiTrust website will include a DID webpage documenting the relevant bibliographic API (which can be used to gather bibliographic information about the texts) and documentation for the METS file for each object.  No other relevant APIs currently exist.  However, a data-oriented API may be deployed before the competition, and researchers will be able to use that API to display any of these texts from the HathiTrust repository.  If that API has been completed prior to the competition, it too will be documented on the website.

Although individual texts in HathiTrust have different conditions of use, in general, HathiTrust texts may not be redistributed or used in commercial applications.  Researchers will be asked to submit a brief form statement (to be provided by HathiTrust) confirming their intention to use the dataset for research purposes and their commitment not to distribute the texts further, in whole or in part.
 


The History Data Service (HDS)
hds.essex.ac.uk

About: The HDS collection, which is part of the UK Data Archive (UKDA), brings together over 650 separate data collections transcribed, scanned or compiled from historical sources. The studies cover a wide range of topics from the seventh century to the twentieth century. Although the primary focus of the collection is on the United Kingdom, it also includes a significant body of cross-national and international data collections. In addition, the HDS also enriches and enhances selected data collections by developing thematic special collections where there is a critical mass of related data collections. Current special collections include: Census enumerators’ books, including the entire 1881 Census for England, Wales and Scotland; Poll Books, including the Westminster Historical Database, 1749-1820; British and Irish nineteenth and twentieth century statistics, including histpop – the online historical populations reports website; wage and price time series including the European State Finance Database; local history including the Digital Library of Historical Directories, 1750-1919; and prosopography including the COEL Database: Continental Origins of English Landowners, 1066-1166.

Contact: If your research team is interested in using the HDS collection please contact Richard Deswarte, the head of HDS, at richardd@essex.ac.uk or +44(0)1206873226.

Links: The HDS does not have any specific APIs, although general advice to data creators is available at: http://hds.essex.ac.uk/history/create/create-advice.asp

Terms of service: Most of the collection is available to any user free of charge, upon registration with the UKDA, for the purposes of not-for-profit research. However, some datasets may have restrictions on access. For example, commercial usage may be restricted or permission for usage may be required from the depositor. Details are available in the ‘Access’ section of each HDS catalogue record.  General information on accessing data is available at: http://www.data-archive.ac.uk/aandp/access/login.asp#terms

 


Infochimps.org
 infochimps.org


About: Infochimps.org is a website where anyone can upload a dataset and, with good metadata tagging and descriptions, make it a rich addition to a large library. Much of the data will be free and come with an open license.
 
Infochimps.org has thousands of datasets, including the Freebase data dump, Wikipedia extractions, stock data, and all sorts of text corpora.
 
Contact: help@infochimps.org

 


Internet Archive
www.archive.org

About:  The Internet Archive is a 501(c)(3) non-profit that was founded to build an Internet library, with the purpose of offering permanent access for researchers, historians, and scholars to historical collections that exist in digital format. Founded in 1996 and located in the Presidio of San Francisco, the Archive has been receiving data donations from Alexa Internet and others. In late 1999, the organization started to grow to include more well-rounded collections. Now the Internet Archive includes texts, audio, moving images, and software as well as archived web pages in our collections.

Contact: For assistance in downloading the collections or selected subsets in bulk, please contact Julie@archive.org during Pacific Time business hours.

IA resources of particular interest to the DID Challenge:

1) The Prelinger Film Archives

Public Collection of 2,139 films

This free public archive is a subset of the approximately 60,000 item collection of ephemeral films assembled by Rick Prelinger, a filmmaker and historian. 

Ephemeral films are defined as advertising, educational, industrial, and amateur films including the famous “Duck and Cover” nuclear safety education film and other extraordinary items of social, artistic and historical significance.  As a whole, the Prelinger collection currently contains over 10% of the total production of ephemeral films between 1927 and 1987, and it may be the most complete and varied collection in existence of films from these poorly preserved genres.

A tag cloud of the public collection to which the Archive will provide researcher access is viewable at the following URL:

http://www.archive.org/browse.php?field=/metadata/subject&collection=prelinger&view=cloud

The Archive believes that this collection may be of interest to sociologists, filmmakers, scholars of marketing and advertising and of course, historians for study of themes in public opinion, politics and popular culture.  Moreover, it offers a substantial set of digital video imagery that can serve as a testbed for tools development in video search, face recognition and other image-based technologies.

Rights:  This collection is being made available for study and reuse under a Creative Commons License.  Details of the rights granted for this collection are available at the following URL:  http://www.archive.org/details/prelinger

2) Canadian Libraries Collection

161,732 Digitized Books

This collection contains digitized books, the vast majority of which were contributed by the University of Toronto Libraries.

Major themes include Canadian history/regional history, medical history, religion, military history, Greek Classics (Tufts University/Perseus Digital Library), government documents and certain special collections (e.g., the Cardinal Newman collection).  The collection can be viewed and searched at the following URL: http://www.archive.org/details/toronto.

The Archive believes this collection may be of interest to scholars of history and sociology among other disciplines, as well as tool development for OCR, multi-lingual translation, name and place recognition and natural language processing.

Both collections are fully searchable using text boxes on the home pages.

 


Inter-university Consortium for Political and Social Research
www.icpsr.umich.edu

About: The Inter-university Consortium for Political and Social Research (ICPSR), the world's largest repository of digital social science data, provides leadership and training in data access, curation, and methods of analysis for a diverse and expanding social science research community.

Contact: If your research team is interested in using the ICPSR collection and/or its metadata records, please see this Web page for more information: www.icpsr.umich.edu/DID.  For technical support, please contact ICPSR at this e-mail address: netmail@icpsr.umich.edu.

The ICPSR repository spans the behavioral and social sciences and includes data on sociology, political science, demography, economics, history, criminal justice, gerontology, public health, education, substance abuse, international relations, and much more. ICPSR's Summer Program in Quantitative Methods is internationally recognized as the premier program for training in the methodology of social science research.

ICPSR's membership includes over 650 educational and research institutions around the world. ICPSR member institutions pay annual dues that entitle faculty, staff, and students to the full range of data resources and services provided by ICPSR, but many of the ICPSR data holdings are freely available to anyone.

ICPSR content takes the form of numeric data files and associated PDF technical documentation (over 500,000 discrete files comprising over 7000 studies) that may be analyzed using statistical software packages or, in some cases, online analysis software. Each study in the holdings has a corresponding descriptive metadata record.

Data in the repository are linked to related publications in the research literature via a Bibliography of over 45,000 entries. Similarly, citations in the Bibliography are linked to the data that generated the research findings. Full text is available for many publications in the Bibliography. ICPSR content provides a rich resource for data mining.

 


JSTOR
www.jstor.org

About: With participation and support from the international scholarly community, JSTOR has created a high-quality, interdisciplinary archive of scholarship, is actively preserving over one thousand academic journals in both digital and print formats, and continues to greatly expand access to scholarly works and other materials needed for research and teaching globally. We are investing in new initiatives to increase the productivity of researchers and to facilitate new forms of scholarship.

Contact: http://www.jstor.org/action/showContactSupportForm

We are pleased to confirm JSTOR's readiness to participate in the 'Digging into Data Initiative'. JSTOR is a scholarly archive of the full runs of approximately 1000 leading academic journals and covers approximately fifty disciplines, with a strong presence in the humanities, social and field sciences, business and economics.  A full list of the included journals is available at http://www.jstor.org/action/showJournals?browseType=titleInfoPage

JSTOR is prepared to provide access at two levels.

1.1 Potential participants in Digging into Data can apply to JSTOR for an account on our “Data for Research” service.  This allows users to create datasets of word frequency against article for any subset of the articles in the JSTOR archive.  An open but size-limited service will be generally available from mid-January 2009 at http://dfr.jstor.org.  The Digging into Data accounts will remove the limits on the size of the datasets.

1.2 Any accepted participant in the Digging into Data Program can gain access to the full text of the JSTOR collections as XML data in a (slightly extended) NLM format. The dataset will include OAI-ORE resource maps.  The full text includes OCR’d text of the articles and bibliographic metadata.
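As a rough illustration of what working with the XML in (1.2) might involve, the sketch below reads a title, year, and body text from a single article record. The sample record is invented for this example and uses element names from the public NLM tag set; the actual JSTOR schema is described only as "slightly extended" NLM, so the real paths may differ and should be checked against the schema available on request.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record, not taken from the JSTOR corpus.
SAMPLE = """<article>
  <front>
    <journal-meta>
      <journal-title>Example Journal</journal-title>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Example Article</article-title>
      </title-group>
      <pub-date><year>1923</year></pub-date>
    </article-meta>
  </front>
  <body>
    <p>First paragraph of OCR text.</p>
    <p>Second paragraph.</p>
  </body>
</article>"""


def summarize_article(xml_text):
    """Return the title, year, and paragraph text of one article record."""
    root = ET.fromstring(xml_text)
    # Element paths are assumptions based on the NLM tag set.
    title = root.findtext("front/article-meta/title-group/article-title", default="")
    year = root.findtext("front/article-meta/pub-date/year", default="")
    # itertext() flattens any inline markup inside a paragraph.
    paragraphs = ["".join(p.itertext()).strip() for p in root.iter("p")]
    return {"title": title, "year": year, "text": "\n".join(paragraphs)}


record = summarize_article(SAMPLE)
```

For a full corpus delivered on disk, the same function could be applied file by file rather than loading everything into memory at once.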

Notes:

1. The data referred to in (1.2) above will be a standard corpus and will be distributed “as-is”.  JSTOR will not filter, sort or in any way process it to individual participants' requirements.  Participants are expected to be competent in the processing of XML data, and no technical support is offered by JSTOR in the filtering or processing of the XML.  The agreement will be for a limited time, after which the data should be destroyed or returned.

2. Samples of the data in (1.2) and the XML schema will be available on request to potential participants for the purpose of assessing the suitability of the full collection for their proposal.  Final participants should expect to provide media such as USB disk drives for the delivery of the data.

3. All participants will be required to sign a standard license and non-disclosure agreement for the use of the data referred to in (1.2) above.

4. Any potential participant must accept a “click-through” limited use agreement for the service and data mentioned in (1.1) above, and must provide contact details and a bona fide email address at the participating institution.

5. Participants should allow sufficient time for the processing of the agreement mentioned in 1.2.  We have found that 6-8 weeks is typical for legal departments of academic institutions to process such agreements.

 


Marriott Library
University of Utah
www.lib.utah.edu/portal/site/marriottlibrary/


About: The J. Willard Marriott Library at the University of Utah hosts more than 100 outstanding digital collections, containing 315,000 digital photographs, maps, books, audio recordings, and other items.  They can all be viewed by clicking on the “Digital Collections” link on the Library’s main web page: http://www.lib.utah.edu/portal/site/marriottlibrary/

Contact: John Herbert, Head - Digital Technologies, john.herbert@utah.edu

Among the many fine collections we have are:


o    The Harmonia Macrocosmica, by Andreas Cellarius, printed in 1661, is an atlas of the heavens as seen by the astronomers of the time: Copernicus, Ptolemy, Brahe, and Aratus. Our collection has 30 hand-painted color plates plus an accompanying text in Latin.  
o    Western Soundscape Archive – hundreds of animal and other natural sounds from the western U.S.  
o    Karl Bodmer created these 165 aquatints during the 1832-1834 expedition by Prince Maximilian zu Wied through the American west. Since then, these watercolors have been a major historical resource for Plains Indian culture. They were instrumental in creating the romantic perceptions of these peoples, which endure to this day in art, film, and literature.
o    Dard Hunter Books – his writings on papermaking, presented in elegant hand-made books.
o    Sanborn Fire Insurance Maps - large-scale, detailed maps from 1867-1969 depicting the commercial, industrial, and residential sections of Utah cities.
o    Arabic Papyrus – the world’s third-largest such collection, with 700 Arabic documents on papyrus and 1300 on paper.
o    Aztec Codices – 4 Mesoamerican manuscripts describing wars, famine, pestilence, religious events, and other elements of ancient Mesoamerican culture.

We host nationally renowned collaborations such as the Utah Digital Newspapers (http://digitalnewspapers.org), the Mountain West Digital Library (http://mwdl.org), and the Western Waters Digital Library (http://westernwaters.org).  

We partner with several other UofU and State institutions, including the Spencer S. Eccles Health Sciences Library, the S. J. Quinney College of Law Library, the Utah State Historical Society, and the Utah State Library. We also work closely with many public libraries across Utah, such as the Uintah County Library, the Park City Library and Historical Society, the Delta City Library, and the Topaz Library, to name only a few.

UTAH DIGITAL NEWSPAPERS
The J. Willard Marriott Library at the University of Utah has launched a pioneering program that is changing the face of newspaper research. Our program, the Utah Digital Newspapers (UDN), makes historic Utah newspapers available to the general public over the Internet. We create a database of digital images and searchable text from old newspapers and make it accessible from our website. The result is that these newspapers can be searched by keyword, title and date from the comfort of a PC. For anyone interested in their family history, Utah or national history, it's a marvelous, easy-to-use improvement over reading microfilm.   

Since its inception in 2002, UDN quickly became a leader in newspaper digitization within the public sector and a model for other academic libraries across the country. Our success has led the National Endowment for the Humanities and the Library of Congress to launch a national digital newspapers program, in which the Marriott Library and seven other institutions participate.

As of early 2009, we have digitized nearly 650,000 pages from more than fifty newspapers, covering 27 of the 29 Utah counties. All this is available on our website: http://digitalnewspapers.org.

Our collection includes the first issue of the Deseret News in 1850, which is the first newspaper issue of any kind published in the Utah Territory. We have the first issue of the Salt Lake Tribune in 1871 and the early years of the Salt Lake Herald. Among other papers are the Topaz Times, the newsletter from the Japanese internment camp during World War II, and the Broad Ax, an early African American paper from Salt Lake City.

 


NASA ADS
Smithsonian/NASA Astrophysics Data System (ADS)
ads.harvard.edu

About: The Smithsonian/NASA Astrophysics Data System (ADS) is a Digital Library portal for researchers in Astronomy and Physics, currently being developed by the Smithsonian Astrophysical Observatory under a NASA grant. The ADS maintains three bibliographic databases containing more than 7.5 million bibliographic records covering the scholarly literature in Astronomy and Physics.

For each bibliographic record in its database, the ADS maintains a list of links to relevant resources, including:


    * Location of the fulltext article via DOI and/or OpenURL links
    * List of works referenced in the original article (references), given in the form of either a list of ADS records or as a link to the publisher’s reference list
    * List of works that cite the record in question (citations or “forward links”), given in the form of a list of records that ADS was able to successfully identify
    * Readership statistics and co-readership-based user recommendations

The ADS also maintains an archive of the historical content of all the astronomical publications.  The contents of the archive consist of 500,000 articles, corresponding to over 3.5 million scanned pages, which have been OCRed and which are searchable through the ADS fulltext query search form.

Contact: If your research team is interested in using the content maintained by our project, please contact the ADS project manager, Alberto Accomazzi at aaccomazzi@cfa.harvard.edu

Links: The ADS supports export of its metadata records via a RESTful interface and a highly structured XML format (see http://doc.adsabs.harvard.edu/abs_doc/help_pages/linking.html). Fulltext content can be provided to collaborators via a data dump.

Terms of service: access to the ADS is freely available to the general public for personal use.  The relevant terms of use are detailed in the document http://doc.adsabs.harvard.edu/abs_doc/help_pages/overview.html#use

 


National Archives, London
nationalarchives.gov.uk

About: The National Archives has a range of different data sources and is keen to support initiatives that can broaden access to our data. We would be happy to discuss any project further.

Contact: Dr David Thomas, Director of Technology and Chief Information Officer, david.thomas@nationalarchives.gov.uk

Description: Much of the National Archives' digitised material is available via its catalogue, but discussion will be required to allow for in-depth analysis. It may be preferable for scholars to work with specific collections that have full catalogue entries or text that is searchable via OCR. This could include

 


National Science Digital Library (NSDL)

www.nsdl.org

About: In 2000, the National Science Foundation created the National Science Digital Library (NSDL) to provide organized access to high quality resources and tools that support innovations in teaching and learning at all levels of science, technology, engineering, and mathematics (STEM) education. In addition to providing an organized point of access to high-quality STEM content, NSDL also provides open-access, non-proprietary tools to stimulate new ways to access and use science education information in an easily accessible online environment. NSDL currently catalogs over 60,000 resources from 57 digital collection providers and thousands of web sites. Individual resources in the library are characterized using qualified Dublin core metadata. 

NSDL makes an ideal testbed for researchers interested in exploring innovative approaches to “digging into data” over rich collections of web-based educational resources. NSDL uses a Fedora-based open-source digital library platform of technology and standards (NCore), creating a dynamic information layer on top of library resources. Collections, resources, and this information layer are easily accessible to researchers via a number of application programming interfaces. The repository contents can be accessed directly using the Digital Repository API. Additionally, a Search API is available for searching directly over NSDL collections. Finally, there is also the Strand Map Service API for searching and visualizing NSDL collections according to K-12 learning goals. This web service protocol supports the construction of interactive knowledge map interfaces based on the learning goals articulated in the American Association for the Advancement of Science (AAAS) Benchmarks for Science Literacy and the learning progressions and strand maps published in the AAAS Atlas of Science Literacy. The library’s Dublin Core metadata descriptions can also be harvested using OAI-PMH. 

Documentation:

Digital Repository API documentation is available here: http://wiki.nsdl.org/index.php/Community:NDR

Search API documentation is available here: http://wiki.nsdl.org/index.php/Community:Search 

Strand Map Service API documentation is available here: http://wiki.nsdl.org/index.php/Community:StrandMaps

Contact: For more information on these APIs, contact the NSDL Technical Network Services team by sending a request via: http://nsdl.org/about/contactus/

For information on upcoming API training opportunities, please contact Karon Kelly at 303-497-2652 or kkelly@ucar.edu.

Terms of Use/Service:  http://nsdl.org/help/?pager=termsofuse


National Technical Information Service (NTIS)
 
About: The National Technical Information Service (NTIS) is the nation's largest and most comprehensive source of government-funded scientific, technical, engineering, and business information produced or sponsored by U.S. and international government sources. NTIS is a federal agency within the U.S. Department of Commerce.
 
Since 1945 the NTIS mission has been to operate a central U.S. government access point for scientific and technical information useful to American industry and government. NTIS maintains a permanent archive of this declassified information for researchers, businesses, and the public to access quickly and easily. Release of the information is intended to promote U.S. economic growth and development and to increase U.S. competitiveness in the world market.
 
The NTIS collection of more than 2 million titles contains products available in various formats. Such information includes reports describing research conducted or sponsored by federal agencies and their contractors; statistical and business information; U.S. military publications; multimedia training programs; databases developed by federal agencies; and technical reports prepared by research organizations worldwide. NTIS maintains a permanent repository of its information products.
 
More than 200 U.S. government agencies contribute to the NTIS collection, including the National Aeronautics and Space Administration; Environmental Protection Agency; the departments of Agriculture, Commerce, Defense, Energy, Health and Human Services, Homeland Security, Interior, Labor, Treasury, Veterans Affairs, Housing and Urban Development, Education, and Transportation; and numerous other agencies. International contributors include Canada, Japan, Britain, and several European countries.
 
NTIS offers Web-based access to the latest government scientific and technical research information products. Visitors to http://www.ntis.gov can search the entire collection dating back to 1964 free of charge. NTIS also provides downloading capability for many technical reports, and purchase of the publications on CD as well as paper copies.
 
National Technical Reports Library:  New at NTIS, the National Technical Reports Library (NTRL) provides a more comprehensive offering that delivers high-quality government technical content in all subject areas directly and seamlessly to the user’s desktop. The NTRL service gives access to more than 2 million NTIS bibliographic records and more than 500,000 full-text documents in PDF format. For more information, see http://www.ntis.gov/products/ntrl.aspx.
 
Contact/Technical Support:  See: http://www.ntis.gov/help/overview.aspx
 
Terms and Conditions:  See: http://www.ntis.gov/pdf/ntrl-terms-cond.pdf
 
 
 

Nebraska Digital Newspaper Project
cdrh.unl.edu/nebnewspapers/

About: The Nebraska Digital Newspaper Project and its contractor, iArchives, have created 100,000 full-text digitized pages of 19th and early 20th Century newspapers from selected communities in Nebraska that can be used for text mining by DID research teams.  The number of pages will grow over time.  Files have been created in three forms:  TIFF images, JPEG2000, and PDFs with hidden text.  Optical character recognition has been performed on the scanned images, resulting in dirty OCR.   Metadata associated with the project is TEI, XML, and METS/ALTO, following the guidelines provided by the Library of Congress for the National Digital Newspaper Program.

We also have 118,000 full text pages of Nebraska Public Documents, http://cdrh.unl.edu/nebpubdocs, that may be useful for DID.  These are XML files, TEI2 headers, with METS/ALTO and dirty OCR.
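Because the OCR is described as dirty, teams may want to pull the recognized words out of the ALTO files and filter on word confidence before mining. The sketch below is one way to do that, assuming the files follow the standard ALTO layout in which each recognized word is a String element with a CONTENT attribute and an optional WC (word confidence) attribute. The sample page is invented for illustration, and the namespace in the delivered files may differ, which is why the code matches on local element names only.

```python
import xml.etree.ElementTree as ET

# Invented sample page; "Tlie" imitates a typical OCR misreading of "The".
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Tlie" WC="0.41"/><SP/><String CONTENT="Weekly" WC="0.97"/>
    </TextLine>
    <TextLine>
      <String CONTENT="News" WC="0.93"/><SP/><String CONTENT="Report" WC="0.88"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text, min_confidence=0.0):
    """Return one string per TextLine, keeping words at or above min_confidence."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        words = []
        for child in elem:
            if local_name(child.tag) != "String":
                continue
            # Words without a WC attribute are treated as fully confident.
            if float(child.get("WC", "1.0")) >= min_confidence:
                words.append(child.get("CONTENT", ""))
        lines.append(" ".join(words))
    return lines


all_lines = alto_lines(SAMPLE_ALTO)
confident_lines = alto_lines(SAMPLE_ALTO, min_confidence=0.5)
```

Dropping low-confidence words trades recall for cleaner input; whether that is worthwhile depends on the analysis, so the threshold is left as a parameter.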

Contact: Technical support information.  The research teams can contact Jason Bougger, Systems Administrator, UNL Libraries, jbougger1@unl.edu, (402) 472-0856, for questions regarding access to the data.

Documentation.   Detailed descriptions of the NDNP requirements are found in the Library of Congress website at http://www.loc.gov/ndnp.  No APIs have been developed for the Nebraska Digital Newspaper Project.

Terms of service.  The Nebraska Digital Newspaper Project should be cited in any acknowledgements associated with the DID Challenge.   Any uses of the data that are outside of the DID Challenge should be cleared through Katherine Walter, Project Director of the Nebraska Digital Newspaper Project, kwalter1@unl.edu, (402) 472-3939.

 


New York Public Library
NYPL.org

About: Libraries are the memory of humankind, irreplaceable repositories of documents of human thought and action. The New York Public Library is such a memory bank par excellence, one of the great knowledge institutions of the world, its myriad collections ranking with those of the British Library, the Library of Congress, and the Bibliothèque nationale de France.  

Contact:  Joe Dalton jdalton@nypl.org

NYPL has several collections that may be of interest to Digging into Data participants.  For example:

The NYPL Digital Gallery contains roughly 700,000 images that could be used for data analysis.  The images are available via our RESTful API.  For more information about this API, please see: http://digitalgallery.nypl.org/feeds/dev/atom/docs/.  For an ATOM feed, please see: http://digitalgallery.nypl.org/feeds/dev/atom/.  
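For teams unfamiliar with ATOM, the following sketch shows how entries in a feed of this kind could be read once the XML has been downloaded. The sample feed is invented and contains only standard Atom elements; the Digital Gallery feed may carry additional fields, so the API documentation linked above remains the authority on its actual structure.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # standard Atom namespace

# Invented sample feed, not actual Digital Gallery output.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example image feed</title>
  <entry>
    <title>Street scene</title>
    <link rel="alternate" href="http://example.org/images/1"/>
  </entry>
  <entry>
    <title>Harbor view</title>
    <link rel="alternate" href="http://example.org/images/2"/>
  </entry>
</feed>"""


def atom_entries(feed_xml):
    """Return the title and first link of every entry in an Atom feed."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall(f"{ATOM}entry"):
        link = entry.find(f"{ATOM}link")
        entries.append({
            "title": entry.findtext(f"{ATOM}title", default=""),
            "href": link.get("href") if link is not None else None,
        })
    return entries


entries = atom_entries(SAMPLE_FEED)
```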

NYPL has been a lead participant in the National Digital Newspaper Program, and has additionally contracted for article-level data/coordinates for those newspapers it has provided. Available runs include the New York World (1890-1910) and the New York Daily Sun (1890-1910). There is currently no online access, but to obtain more information about using this newspaper data for analysis, please contact Barbara Taranto (btaranto@nypl.org).

 


The New York Times Article Search API
http://developer.nytimes.com/

About: The NYT Article Search API allows you to search more than 2.8 million New York Times articles from 1981 to today, retrieving headlines, abstracts, lead paragraphs and links to associated multimedia.

The API supports the following types of searching: 
    * Standard keyword searching
    * Date range: all articles from X date to Y date
    * Field search: search within any number of given fields, e.g., title:obama byline:dowd
    * Conjunction and negation (AND and NOT) operations, e.g., baseball yankees -"red sox"
    * Ordering by closest (variable ranking algorithms), newest and oldest
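The query syntax in the list above can be assembled programmatically. The sketch below builds query strings of the form shown in the examples and URL-encodes them. The helper name build_query is hypothetical, and the parameter names "query" and "api-key" are assumptions made for illustration; the endpoint and its parameters should be taken from the documentation at http://developer.nytimes.com.

```python
from urllib.parse import urlencode


def build_query(keywords=(), fields=None, exclude=()):
    """Compose a query string in the syntax shown in the list above."""
    parts = list(keywords)
    # Field search, e.g. title:obama byline:dowd
    for name, value in (fields or {}).items():
        parts.append(f"{name}:{value}")
    # NOT operation, e.g. -"red sox"; multi-word terms are quoted as a phrase
    for term in exclude:
        parts.append(f'-"{term}"' if " " in term else f"-{term}")
    return " ".join(parts)


query = build_query(["baseball", "yankees"], exclude=["red sox"])
field_query = build_query(fields={"title": "obama", "byline": "dowd"})

# "query" and "api-key" are assumed parameter names; YOUR_API_KEY is a placeholder.
params = urlencode({"query": query, "api-key": "YOUR_API_KEY"})
```

Keeping query construction separate from the HTTP request makes it easy to log and review the exact queries a research team has issued.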

The Article Search API also offers faceted searching. The available facets include Times-specific fields such as sections, taxonomic classifiers and controlled vocabulary terms (names of people, organizations and geographic locations).

Contact: If your research team is interested in using the Times' Article Search API, please visit http://developer.nytimes.com to register for an API key.

Links:
Times Developer Network: http://developer.nytimes.com
Open blog (A blog about open-source technology at The New York Times, written by and primarily for developers. This includes our own projects, our work with open-source technologies at NYTimes.com, and other interesting topics in the open-source and Web 2.0 worlds.): http://open.nytimes.com

Terms of Use:  The New York Times Article Search API is for noncommercial use only. Please see the FAQ for more detail about commercial and noncommercial use: http://developer.nytimes.com/docs/faq

The Terms of Use are available at http://developer.nytimes.com/Api_terms_of_use, and our Attribution Requirements are available at http://developer.nytimes.com/attribution
 


Opening History
imlsdcc.grainger.uiuc.edu/history/

About:  Opening History (OH) (http://imlsdcc.grainger.uiuc.edu/history/) contains metadata describing digital collections specializing in United States history and culture and item-level metadata describing resources contained within these collections. As of May, 2009, the portal contains 553 collection records and over 918,000 item records. Opening History’s mission is to provide organized access to digital resources of value for research on United States history and culture. OH aggregates a range of distributed and complementary cultural heritage collections from libraries, museums, and archives to increase their visibility and to enhance the value and usefulness of individual collections by integrating them with related collections, providing search and browse functions across collections, and linking to related resources outside of OH. The OH initiative encourages sharing of digital resources in open access formats and promotes coordinated access to regional, state, and local collections to support the creation of a digital cultural heritage aggregation of national scope.

Contact:  Amy Jackson (amyjacks@illinois.edu), Project Coordinator, IMLS Digital Collections and Content

Links:
•    Public interface: http://imlsdcc.grainger.uiuc.edu/history
•     Individual item-level metadata records are available through OAI-PMH at http://imlsdcc.grainger.uiuc.edu/history/oai/oai.aspx
•    Collection-level metadata records are also available through OAI-PMH at http://imlsdcc.grainger.uiuc.edu/registry/oai/oai.aspx  (setSpec=ISHIST)
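Harvesting over OAI-PMH follows the same pattern for both endpoints: issue a ListRecords request, then keep requesting with the resumptionToken returned in each response until none is given. The sketch below builds the request URLs and extracts the token. It uses only arguments defined by the OAI-PMH 2.0 protocol (verb, metadataPrefix, set, resumptionToken); the function names are hypothetical, the sample response is invented, and the actual download step is left out so the example runs offline.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
ITEMS_BASE = "http://imlsdcc.grainger.uiuc.edu/history/oai/oai.aspx"
COLLECTIONS_BASE = "http://imlsdcc.grainger.uiuc.edu/registry/oai/oai.aspx"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, token=None):
    """Build a ListRecords URL; a resumptionToken replaces all other arguments."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken from a response, or None when harvesting is done."""
    root = ET.fromstring(response_xml)
    elem = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    if elem is None or not (elem.text or "").strip():
        return None
    return elem.text.strip()


# Invented response fragment used to exercise next_token().
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<ListRecords><resumptionToken>abc123</resumptionToken></ListRecords>"
    "</OAI-PMH>"
)

first_url = list_records_url(COLLECTIONS_BASE, set_spec="ISHIST")
follow_up_url = list_records_url(COLLECTIONS_BASE, token=next_token(SAMPLE_RESPONSE))
```

The oai_dc prefix is used here because every OAI-PMH repository is required to support it; other metadata formats, if offered, can be discovered with the ListMetadataFormats verb.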

Terms of service: 
Data may be freely used for research purposes. IMLS Digital Collections and Content/Opening History should be mentioned in any publication of findings.

 


Project MUSE
http://muse.jhu.edu/

About: Project MUSE is an online collection of over 400 journals from approximately 100 not-for-profit publishers.  MUSE sells 6 specific collections to institutional libraries worldwide.  More information is available at http://muse.jhu.edu/about/muse/index.html

Contact: Wendy Queen, Manager of Electronic Publishing Technologies.  Phone: 410-516-3845.  Email: wendy@muse.jhu.edu  Please also copy Mary Rose Muccie (mrm@press.jhu.edu) on any emails.

A representative from each participating team using Project MUSE is required to sign a Memorandum of Understanding that outlines the terms of use for Project MUSE content. The MOU is available at http://muse.jhu.edu/about/docs/did_mou.pdf.

MUSE does not have an API available online. Please call Wendy Queen (info above) and she will explain our database structure and any other information you need.  We have several access methods available to MUSE subscribers and need to know how you plan to access the content (basic IP authentication, via Athens, via Shibboleth, etc).  Please get in touch with Wendy before any downloading begins, so that we are prepared, do not shut you down for violating our license, and can exclude your hits from usage statistics, which are a component of both our pricing and our publisher royalty payments.

 


Scholarly Database at the Cyberinfrastructure for Network Science Center, Indiana University
sdb.slis.indiana.edu
 
About:  The Scholarly Database (SDB) at Indiana University aims to serve researchers and practitioners interested in the analysis, modeling, and visualization of large-scale scholarly datasets.  The online interface provides access to four datasets: Medline papers, U.S. Patent and Trademark Office patents (USPTO), National Science Foundation (NSF) funding, and National Institutes of Health (NIH) funding – over 22 million records in total.  Users can register for free at http://sdb.slis.indiana.edu to cross-search these datasets and to download result sets as dumps for scientometrics research and science policy practice.

Contact:  Nianli Ma, SDB Team Lead, nianma@indiana.edu, (812) 856-3465.

 


ScholarSpace at the University of Hawai'i at Manoa
scholarspace.manoa.hawaii.edu/community-list

About: Most of the material is previously published scientific journals, pamphlets and such like, or open access journal issues, along with what will soon be over 2,000 dissertations published at the University of Hawaii.

Contact: If you are interested in this material, the IR project manager is Beth Tillinghast (betht@hawaii.edu), and the technical contacts are Daniel Ishimitsu (daniel20@hawaii.edu) and Wing Leung (leungwin@hawaii.edu).

 


University of Florida Digital Library Center
www.uflib.ufl.edu/ufdc

About:  The University of Florida Digital Collections (UFDC) provides the overall infrastructure for many collections which can be accessed and cross-searched through the main UFDC interface (www.uflib.ufl.edu/ufdc) or individually. A full list of collections, with statistics for each, is available here: http://www.uflib.ufl.edu/ufdc/?m=hai.

Contact:  ufdc@uflib.ufl.edu and 352.273.2900

Laurie Taylor, UF Digital Library Center Interim Director, Laurien@ufl.edu and 352.273.2902

Mark Sullivan, UF Digital Library Center Programmer, marsull@uflib.ufl.edu

The largest collections are the Florida Digital Newspaper Library (45,418 items with 504,773 pages as of December 30, 2008), the Baldwin Library of Historical Children's Literature Digital Collection (5,292 items with 768,119 pages), and the Digital Library of the Caribbean (22,613 items with 577,315 pages). The Florida Digital Newspaper Library includes historic through current newspapers, the Baldwin collection contains 19th century children's literature, and the Digital Library of the Caribbean contains historic through current materials in multiple languages (primarily in English, Spanish, and French). The Digital Library of the Caribbean also includes the Caribbean Newspaper subcollection, which currently has 4,167 items with 35,447 pages. The newspaper and children's literature collections make UFDC strong in caricature and illustration, a strength further supported by items like the 51 volumes of Fun (a magazine contemporary to Punch). The collections as a whole are also strong in legal documents for and related to Florida and the Caribbean. All of the collections and all items in UFDC are openly accessible for regular use and for datamining.

We do not have an API for datamining, but we do have extensive documentation here: http://www.uflib.ufl.edu/ufdc2/technical/. Also, UFDC includes static pages for all collection items and www.uflib.ufl.edu/ufdc2 lists all of the collections codes, each of which links to its own page with a list of all of the static pages for all items in the collection. The static pages are built during the regular load cycle and new items loaded are added to the static pages and added to the new item RSS feed, accessible here:  http://www.uflib.ufl.edu/ufdc2/rss/

Special terms of service:


Many of the materials contributed to UFDC are under copyright and UFDC has been granted permissions for Internet distribution and use within UFDC. If a researcher wants to re-publish entire source documents, please contact us through the technical support email so that we can request any additional permissions from the copyright owners if needed.

 


University of North Texas
digital.library.unt.edu/browse/?browseby=collection

About:  Brief Description of the contents of our collections – Collections may be viewed at http://digital.library.unt.edu/browse/?browseby=collection.  The collections are heterogeneous in nature including books, posters, photographs, born digital reports, musical scores, newspapers, letters, maps, etc.  The Portal to Texas History is the largest collection listed.

Contact: Mark Phillips, mark.phillips@unt.edu

While we currently are not able to supply active API interfaces to our system, here are some of the services that will be supported in our initial release of our new infrastructure. 

The digital library infrastructure in development at UNT consists of a METS based delivery system. We use a locally developed METS profile for storing file and structural metadata, which is used in delivering content to our users.  Descriptive metadata is stored in a locally qualified Dublin Core metadata scheme which has been developed over the past five years and is documented on our libraries website <http://www.library.unt.edu/digitalprojects/metadata>. 

Services built on our digital objects revolve around existing protocols for sharing and reusing data in libraries.  These include support for OAI-PMH to provide access to both Dublin Core and MODS representations of each metadata record in the system.  The system supports queries in both SRU and OpenSearch. 

Each digital object has descriptive metadata available in a variety of  simple formats for reuse.  Examples include XML, JSON and TXT.  Automatically generated citations are also available in a variety of formats.  We are using the ARK identifier scheme and providing ERC records containing simple metadata for each object in an easy to use format. Additionally it is planned to support COinS and unAPI as well as other Microformats where appropriate. 

Additional planned services include the use of Open Text Mining Interfaces for supplying other organizations the full-text of our digital objects without compromising the terms of our agreements with content providers.   In addition to the full text, this will provide word count vectors for use in visualization tools such as word clouds and other data graphs. 
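To make the idea of a word count vector concrete, here is a minimal sketch of how one could be computed from a document's full text. This illustrates the general technique only and is not the Open Text Mining Interface implementation planned at UNT; the tokenization rule and the stopword handling are assumptions chosen for brevity.

```python
import re
from collections import Counter


def word_count_vector(text, stopwords=frozenset()):
    """Count lowercase word occurrences, skipping any words in stopwords."""
    # Simple tokenizer: runs of letters and apostrophes. Real OCR text
    # may need more careful handling of hyphenation and punctuation.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


vector = word_count_vector("The map of Texas. The Texas map!", stopwords={"the", "of"})
top_terms = vector.most_common(2)
```

A vector like this carries enough information to drive a word cloud or a frequency graph without exposing the running text of the source document.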

Special terms of services – The data may be used freely for research purposes.

