Spoken language corpora: Approaches for facilitating linguistic research

The use of spoken corpora in linguistic research and natural language processing has increased in recent years, but spoken corpora remain scarce compared to text corpora. The main reason for this scarcity is that gathering spoken material and building high-quality corpora from it is very demanding. Since conventions and practices for spoken resources vary widely, corpus creators must decide which linguistic and paralinguistic units to represent, which metadata to store, and how to segment the corpus. Presenting spoken corpus resources adequately through web platforms is also challenging, because users of spoken corpora differ in their technological affinity and research interests. My cumulative dissertation aims to address these challenges from two perspectives: corpus creation and the presentation of corpus data through spoken corpus platforms. Each article will describe a prototype for creating a corpus or corpus tool that can serve as an example for similar corpus projects, addressing both language-independent and language-specific challenges in constructing German and Bosnian/Croatian/Montenegrin/Serbian corpora.

Doctoral candidate: Dolores Lemmenmeier-Batinić
First supervisor: Prof. Dr. Barbara Sonnenhauser
Second supervisor: Prof. Dr. Noah Bubenhofer
