Harvesting Speech Datasets for Linguistic Research on the Web


This project will harvest audio and transcribed data from podcasts, news broadcasts, public and educational lectures, and other sources to create a massive corpus of speech. Tools will then be developed to analyze the different uses of prosody (rhythm, stress, and intonation) in spoken communication.

Principal Investigators

Mats Rooth, Cornell University, US, NSF
Michael Wagner, McGill University, CAN, SSHRC