What the corpus looks like | English Teacher Corpus

In creating the corpus, we were initially motivated by the need to find meaningful research avenues for our teacher-trainees: their BA theses have to meet the linguistic requirements of a predominantly philologically based accreditation, yet should also deal with content related to the pedagogical aspects of their future profession. The English Teacher Corpus was thus created with the intention of enriching our English teacher training programme by involving our teacher-trainees in all stages of its design and construction. The aim was for students to learn the key principles of corpus linguistics, the design and management of research projects, data collection and data processing. Students also had to engage actively with practitioners, whether in approaching them, recording them, processing the recordings or analysing the data. Working with actual examples of teachers’ language also allowed them to better understand the concept of teacher language proficiency and to realize the importance of taking proactive steps against second language attrition. Furthermore, the interview topics allowed them to learn what experienced teachers think about different aspects of the profession, which might contribute to the development of their own beliefs.

In this, we were guided by our extensive experience with compiling and analysing spoken learner corpora. The principal author of the project had previously completed the Czech subcorpus of the international LINDSEI corpus (organized by the Centre for English Corpus Linguistics at Université catholique de Louvain) and the Czerasmus corpus (a spoken study-abroad corpus recording the English of Czech students of English studies before and after an Erasmus study-abroad period).

The pilot version of the corpus, comprising interviews with 25 Czech non-native-speaker teachers of English and 15 native-speaker teachers of English, was recorded and transcribed in 2023.

Structure of the corpus

The corpus is built around 5 tasks. Tasks 1, 2 and 5 are designed to elicit as much authentic spontaneous language as possible whilst the teachers express their views on a range of themes directly related to the profession of EFL teaching.

Task 1 – monologue on a given topic

The first task is a monologue intended to last for 3–5 minutes during which the teachers speak on the following topics:

Topic 1:

Why did you decide to become an English teacher? Are you still happy with the choice? Have you ever thought about doing something else?

Topic 2:

Have you changed as a teacher in the course of your career? How? And why?

Topic 3:

Do you remember any critical incidents in your pedagogical career which had an impact on you as a teacher? What happened and how did it affect you? What did you learn?

Task 2 – dialogue developing ELT-related content

The second task is a dialogue in which a range of ELT-related topics are explored. The interviewers may ask questions in reaction to ideas mentioned in Task 1 but may also introduce new ideas related to the same topic. The intended duration of the dialogue is 7–10 minutes.

Task 3 – narrative

The third task is a narrative based on a sequence of 6 pictures. The interviewees’ task was to reconstruct the story.

The goal of the task is to elicit unplanned speech while the speakers process fairly complex visual input.

Task 4 – reading out loud

The fourth task requires the interviewees to read a short text (215 words) out loud in a natural style. Previous research (e.g. Gráf et al., 2023) shows that performance in reading out loud correlates with L2 proficiency. The text was intentionally selected to present a dual challenge: while its sentence structure is clear and simple, the passage contains numerous less common words, proper names and numerals that pose potential pronunciation difficulties. This deliberate combination of complex and easy elements was chosen to provide suitable material for a complex analysis of the participants’ reading proficiency.

Task 5 – L1 monologue

The fifth task is a monologue in the interviewees’ mother tongue (Czech). It focuses once more on topics pertinent to the teaching profession. To guide the teachers in their monologue, the interlocutors could use targeted prompts related to the topic. The task was introduced with its possible research potential in mind, allowing for the analysis of the relationship between teachers’ first and second language.

Transcription style and alignment

The initial transcription was carried out by an AI speech recognition model, Whisper. Subsequently, these transcriptions were checked and adapted to align with a simple orthographic transcription system based on the LINDSEI transcription guidelines, taking into account non-linguistic features such as filled pauses and backchannelling. The transcription includes standard forms, contracted forms, and nonstandard forms such as gonna, dunno and cos, spelled as they appear in dictionaries. Each file underwent a thorough review process to maintain consistency and accuracy in the final transcriptions. All personal data (personal names etc.) was anonymized.

Whisper generated JSON files, which include timestamps for each orthographic word and larger text chunks. These files facilitated the alignment process, as they could be directly imported into EXMARaLDA Partitur Editor, the chosen time-alignment software.
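To illustrate what such word-level timestamps make possible, the following Python sketch flattens a Whisper-style JSON file into two lists of time-stamped events, one for words and one for the larger chunks. It is only an illustration: the sample data are invented, the field names ("segments", "words", "start", "end") are assumed to follow the layout written by the open-source whisper package when word timestamps are enabled, and the helper functions word_events and segment_events are hypothetical names. It does not reproduce the import routine of EXMARaLDA itself.

```python
import json

# Invented sample, assumed to mimic the JSON layout written by the
# open-source `whisper` package with word timestamps enabled.
# The actual corpus files may differ in detail.
SAMPLE = """
{
  "text": " well I always wanted to teach",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 2.4,
      "text": " well I always wanted to teach",
      "words": [
        {"word": " well",   "start": 0.0,  "end": 0.4},
        {"word": " I",      "start": 0.4,  "end": 0.55},
        {"word": " always", "start": 0.55, "end": 1.0},
        {"word": " wanted", "start": 1.0,  "end": 1.45},
        {"word": " to",     "start": 1.45, "end": 1.6},
        {"word": " teach",  "start": 1.6,  "end": 2.4}
      ]
    }
  ]
}
"""


def word_events(whisper_json):
    """Return (start, end, word) tuples for every word, sorted by start time."""
    events = []
    for segment in whisper_json.get("segments", []):
        # Segments without word-level timing simply contribute nothing.
        for w in segment.get("words", []):
            events.append((float(w["start"]), float(w["end"]), w["word"].strip()))
    return sorted(events)


def segment_events(whisper_json):
    """Return (start, end, text) tuples for the larger text chunks."""
    return [
        (float(s["start"]), float(s["end"]), s["text"].strip())
        for s in whisper_json.get("segments", [])
    ]


data = json.loads(SAMPLE)

# Print one line per word: start time, end time, orthographic form.
for start, end, word in word_events(data):
    print(f"{start:6.2f} {end:6.2f}  {word}")
```

Keeping words and larger chunks as separate lists mirrors the two levels of timing described above, so that either level can be used as the basis for alignment.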

References

Gráf, T., Huang, L., & Cilibrasi, L. (2023). Oral reading tasks as proficiency indicators: Insights from a learner corpus study. International Journal of Learner Corpus Research, 9(2), 155–179.