Overview

The Multilingual Hispanic Speech in California (MuHSiC) corpus is a collection of audio recordings of Spanish-English bilinguals living in California. It is a robust and linguistically rich oral corpus of bilingual Spanish-English speech samples drawn from naturalistic conversations among speakers of diverse ages, social profiles, and regional origins. The conversations cover a variety of topics, from (im)migration stories and traditional folktales to culinary recipes and instructions for speaking Spanglish. Speech samples were audio-recorded in high-quality, uncompressed digital formats, which allows the corpus to be used for the analysis of a wide range of linguistic features. Data collection was carried out throughout California and was coordinated from three hubs: Berkeley, Los Angeles, and Santa Cruz.

The MuHSiC corpus was created with the following objectives in mind:

The Corpus documents Spanish and English as spoken by Spanish-English bilinguals throughout California, thus giving a clearer picture of the state’s multilingual reality. The main method of data collection is the sociolinguistic interview, consisting of audio-recorded sessions of roughly 90 minutes with adult (over 18 years of age) bilingual speakers of Spanish and English in California. Notably, the interviews elicit speech in both Spanish and English, permitting Spanish speech patterns to be analyzed in relation to English ones from the same interviews. This crucially obviates the need for monolingual benchmark comparisons (e.g. the typical and problematic comparison between a bilingual’s Spanish and that of a monolingual Spanish speaker from a different country), and instead more appropriately grounds the empirical study of bilingualism within a bilingual context.

After the interviews, the participants completed written socio-demographic and language background questionnaires. The speech samples for each speaker were recorded using a high-quality head-mounted microphone and a portable audio recorder. The audio files were edited, coded for speaker information, and annotated with suppressible linguistic transcriptions that allow users to quickly search the corpus by discourse topic, grammatical feature, or speaker demographics.

This is an ongoing project: the Corpus will be expanded as more data is collected and additional supplemental features are developed. We invite collaborations with other institutions in California, which would allow for the collection of a broader, system-wide sampling of bilingual speech in the state.