Corpora for language learning: part 1
As I mentioned before, the workshop and presentations on corpora were the highlight of the EuroCALL 2005 conference for me. Other than a few sessions using a concordancer, I went into the corpora activities as a beginner, and, thanks to some great teachers, I came away with a fascination for the potential that corpora and data-driven learning have for fostering learner autonomy and improving materials design. Over the next few months I’ll be testing some ideas to see how they work for our learners.
Corpora: the short and sweet explanation (for beginners like me)
Basically, to make a corpus you collect a number of real, authentic texts (written or spoken) that are representative of a language as a whole, or of a specific genre of that language (say, business email). A smaller, specialized corpus might have 1,000,000 words (equivalent to about ten 300-page books), while the British National Corpus (BNC) has over 100,000,000 words (depending on your purpose, smaller corpora can be fine). By aggregating many, many examples of real language (from newspapers, novels, magazines, interviews, debates, talk shows, gossip, small talk, email, etc.) you have a nice sample of what really happens when people use the language every day.
Then, you label each word according to the part of speech it represents (called POS tagging). This process can be automated with software.
Now you can use other software to do all sorts of cool statistical stuff, such as analyzing the most significant words in Business English (BE), and concordancing. Here’s part of a KWIC (key word in context) concordance I made for the phrase “moving on” using Mark Davies’s web-based concordancer, which draws on the BNC. I limited the register to “spoken” to try to extract “moving on” used as a discourse marker showing a transition to a new topic (typical BE target language). Each line is a separate “hit” from a different text:
we ought to be thinking about er Yeah. er moving on. Do you think that is about right?
say, an hour and three quarters. Erm moving on a l a little bit, what was your
in the surrounding streets. Mhm. Moving on, erm in te you know obviously you must
That’s how they were dealt with. Erm moving on a bit now, er er I mean,
crimes to prostitution? Well I think you’re moving on now to a sphere where perhaps
does that. That’s right. Er anyway moving on, just a couple more things were on this
when they empty it now? Yeah. Yes Moving on from the, the dredger back to when you
shift again. Pump that one in and kept moving on. Mm. They got the roof secure,
four one is the number to call. Erm moving on and talking about the subject of the
anyway, I’d like to consult you about moving on and getting in the next four motions
sensible way forward. Moving on, another suggestion is that we should ballot our
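For the curious, here is a minimal sketch in Python of what a KWIC concordancer does under the hood. It is not how the BNC interface works internally; the function name `kwic`, the `width` parameter and the sample sentences are all my own inventions for illustration (the sentences are made up, not BNC data).

```python
import re

def kwic(text, phrase, width=30):
    """Return one aligned key-word-in-context line per hit of `phrase`."""
    # \b keeps us from matching inside longer words; the search ignores case.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        start, end = match.span()
        left = text[max(0, start - width):start]   # context before the hit
        right = text[end:end + width]              # context after the hit
        # Right-justify the left context so every hit starts in the same column.
        lines.append(f"{left:>{width}} {text[start:end]} {right}")
    return lines

# Invented sample text (an assumption for the demo, not corpus data).
sample = (
    "Right, that covers the budget. Erm, moving on, the next item is staffing. "
    "They kept moving on until dark. Moving on to sales, the figures look good."
)

for line in kwic(sample, "moving on"):
    print(line)
```

Each printed line puts the search phrase in the same column, which is what makes patterns to its left and right easy to spot by eye.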
One implication for teachers: teaching authentic language
So you can see from this abbreviated example that, with the register limitations I used, “moving on” is used quite frequently in authentic speech to signal a change of topic. This isn’t surprising: most BE texts teach this phrase. But let’s look at some of the other phrases that BE texts teach as discourse markers for changing topic, and note how often each is used within the BNC spoken sub-corpus. (The example phrases come from a popular BE coursebook on presentation skills from a major publisher, released in 1995.)
“That brings me to…” or “That brings us to…”
Total examples: 7
Used as discourse marker: 7
“Now we come to…”
Total examples: 7
Used as discourse marker: 6
“Let’s go on to…”
Total examples: 2
Used as discourse marker: 2
“…move on to…”
Total examples: 2
Used as discourse marker: 0
“Moving on…”
Total examples: 59
Used as discourse marker: 38
So if we are teaching BE learners how to produce (or understand) a transition in a presentation, which of these phrases would be most useful? That the answer to this question is now obvious points to the promise of corpora and data-driven learning, and to how they can impact materials design. (Disclaimer: corpus analysis can be tricky, and I’m far from being qualified to do it. So this little example above may be riddled with mistakes and the conclusion therefore wrong. My first question would be the appropriateness of the BNC for identifying authentic business language…it might be interesting to use a more genre-specific corpus like Mike Nelson’s. Any readers who can point out other problems get a post of their own.)
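To make the comparison concrete, here is a small Python sketch that ranks the phrases using the counts listed above. The counts are copied from this post; the `summarize` function and the layout of the `counts` table are just my assumptions about a convenient way to hold them.

```python
# phrase -> (total examples, examples used as a discourse marker),
# copied from the counts listed in this post.
counts = {
    "That brings me/us to...": (7, 7),
    "Now we come to...": (7, 6),
    "Let's go on to...": (2, 2),
    "...move on to...": (2, 0),
    "Moving on...": (59, 38),
}

def summarize(counts):
    """Return (phrase, total, as_marker, share) rows, most marker uses first."""
    rows = []
    for phrase, (total, as_marker) in counts.items():
        # Share of this phrase's hits that act as a discourse marker.
        share = as_marker / total if total else 0.0
        rows.append((phrase, total, as_marker, share))
    return sorted(rows, key=lambda row: row[2], reverse=True)

for phrase, total, as_marker, share in summarize(counts):
    print(f"{phrase:<26} {as_marker:>2}/{total:<2} ({share:.0%})")
```

Two things stand out when the numbers are laid side by side. “Moving on…” accounts for 38 of the 53 discourse-marker uses across all five phrases, yet only about 64% of its own hits are discourse markers, so learners will also meet it with its literal meaning.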

Cleve,
Thanks for a good summary of the usefulness of corpora in the learning and teaching context. I’m glad we were able to share some of our experience in this field at the workshop. I think the examples you have given are spot on.
But, yes, integrating corpora successfully into a learning and teaching scenario can be tricky, and a lot of methodological work is still lying ahead here. So far, the use of corpora has more often been approached with linguistic research questions in mind than with pedagogical goals.
The wealth (and sometimes ‘messiness’) of examples from a large corpus, which is desirable for descriptive linguistics, can be overwhelming for a learner (and teacher!). What is necessary is careful planning of where corpus materials can provide added value in relation to your learning goals, careful selection of the corpus (or subsection) which is appropriate in your context, a precise formulation of the ‘research’ question, the devising of a corpus query, and last but not least some practice in observing and interpreting the query results. Whether, as a teacher, you decide to hand out pre-selected material to your learners or to let them get their hands on the corpus, the use of corpus data always involves a training phase and much guidance of the learners on all of the above aspects.
A related crucial point is that learners are not necessarily able to create a ‘relationship’ to the texts or concordance lines they find in a corpus (especially in a large corpus, covering different genres etc). In other words, learners may have difficulty contextualising the data … with many implications attached. To get a – not too serious – illustration of the importance of contextualisation and of knowing where your data come from, just go to the online concordancer at the Edict website (http://www.edict.com.hk/concordance/WWWConcappE.htm) and get a concordance for the word ‘hell’ in the King James Bible and in The Hitchhiker’s Guide to the Galaxy (both texts available there) …
One – more serious – approach to this problem is to use smaller corpora in the learning context, and corpora which contain texts that are interesting and relevant for the learners. Such corpora give the learners a better chance to actually read at least some of the corpus texts in their entirety, in addition to analysing these texts with the help of corpus techniques (such as frequency lists or concordances). Familiarity with the texts in the corpus will help the learners to get a much better grasp of what’s going on in the area of language they are interested in.
In addition, I believe we need to think about appropriate types of learning activities, tasks and exercises which we can develop around (appropriate) corpora – to exploit corpora more fully than by just using concordances and concordance-based exercises.
Given all these prerequisites, I’m convinced that corpora and corpus analysis techniques can be a fascinating, complementary learning and teaching resource.
Best regards
Sabine
Comment by Sabine — September 7, 2005 @ 4:54 pm
Thanks for these really valuable observations, Sabine. Given this context, it seems that for us the next step is to look into the smaller, more specific BE corpora. Your point promoting corpora that are “interesting and relevant” for learners - that learners can create a relationship with - is especially useful. I’ll keep you updated on this.
Comment by Cleve — September 7, 2005 @ 7:09 pm
Cleve,
I think you really have managed to capture some very central aspects of using corpora for language learning. I also think that Sabine raises some interesting and important points. It is very important to know what it is you are looking at (cf. Sabine’s example of ‘hell’ in different kinds of texts). At the same time - if your corpus is too small or too specialised, you may not find the kind of information you need. What is the ultimate balance? So far, I don’t think anyone has been able to give one and only one answer to that question.
On a slightly different note: I like your example of ‘moving on’. In your concordances I noticed a pattern which I checked in some spoken corpora (BNC and MICASE). It does seem to be the case that ‘moving on’ very often is preceded by other discourse markers (fillers or whatever you want to call them): erm, eh, uh and also ‘anyway’, ‘well’. Although this may not necessarily be something you want to teach your students to do, it can be interesting to note and something that can give a further understanding of how and when ‘moving on’ is used. It also illustrates how using corpora and concordances can help you discover features of language that you did not think of looking for. And that goes for both learner and teacher.
Comment by Ylva — September 9, 2005 @ 12:41 pm
Nice catch on the preceding discourse markers, Ylva. As you say, it would certainly be good to show Ss the tendency so as to support comprehension.
Comment by Cleve — September 9, 2005 @ 8:40 pm
A different kind of note this time. I’d like to share a link to a site I and my students find rewarding when teaching/learning business correspondence. It’s the Business Letter Corpus at http://ysomeya.hp.infoseek.co.jp/. It’s a great resource also when you design your materials. Cheers.
Comment by Sarolta — September 11, 2005 @ 11:45 am
Hi Sarolta - thanks for the resource. It looks quite useful - I tried it out a few times and it’s easy and fast. Do you happen to know the origin of the corpus? I searched around briefly on the site for some information on that, but couldn’t find anything…it may have been in Japanese. I was interested in the corpus origin in view of Sabine’s comments above on Ss’ relationship with the text and/or concordance lines, and the possibility of viewing entire letters to further that relationship.
Also, please keep us updated on Scripta Manent.
Comment by Cleve — September 11, 2005 @ 12:42 pm