As I mentioned before, the workshop and presentations on corpora were the highlight of the EuroCALL 2005 conference for me. Other than a few sessions using a concordancer, I went into the corpora activities as a beginner, and, thanks to some great teachers, I came away with a fascination for the potential that corpora and data-driven learning have for fostering learner autonomy and improving materials design. Over the next few months I’ll be testing some ideas to see how they work for our learners.
Corpora: the short and sweet explanation (for beginners like me)
Basically, to make a corpus you collect a number of real, authentic texts (written or spoken) that are representative of a language as a whole, or of a specific genre of that language (say, business email). A smaller, specialized corpus might have 1,000,000 words (equivalent to about ten 300-page books), while the British National Corpus (BNC) has over 100,000,000 words (depending on your purpose, smaller corpora can be fine). By aggregating many, many examples of real language (from newspapers, novels, magazines, interviews, debates, talk shows, gossip, small talk, email, etc.) you get a nice sample of what really happens when people use the language every day.
Then, you label each word according to the part of speech it represents (called POS tagging). This process can be automated with software.
Now you can use other software to do all sorts of cool statistical stuff, such as analyzing the most significant words in Business English (BE), and concordancing. Here’s part of a KWIC (key word in context) concordance I made with the phrase “moving on” using the web-based concordancer created by Mark Davies that uses the BNC. I limited the register to “spoken” to try to extract “moving on” used as a discourse marker showing a transition to a new topic (typical BE target language). Each line is a separate “hit” from a different text:
we ought to be thinking about er Yeah. er moving on. Do you think that is about right?
say, an hour and three quarters. Erm moving on a l a little bit, what was your
in the surrounding streets. Mhm. Moving on, erm in te you know obviously you must
That’s how they were dealt with. Erm moving on a bit now, er er I mean,
crimes to prostitution? Well I think you’re moving on now to a sphere where perhaps
does that. That’s right. Er anyway moving on, just a couple more things were on this
when they empty it now? Yeah. Yes Moving on from the, the dredger back to when you
shift again. Pump that one in and kept moving on. Mm. They got the roof secure,
four one is the number to call. Erm moving on and talking about the subject of the
anyway, I’d like to consult you about moving on and getting in the next four motions
sensible way forward. Moving on, another suggestion is that we should ballot our
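For anyone curious how a concordancer lines up hits like the ones above, here is a minimal sketch in Python. It is not how Mark Davies' tool works (I don't know its internals); the function name `kwic`, the `width` parameter, and the toy `sample` text (stitched together from a few of the hits above, not taken from the BNC itself) are all my own assumptions for illustration.

```python
import re

def kwic(text, phrase, width=40):
    """Return one formatted line per occurrence of `phrase` in `text`.

    Hypothetical helper for illustration only: each line shows up to
    `width` characters of context on either side of the search phrase.
    """
    lines = []
    # Case-insensitive, whole-word match on the search phrase.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every hit starts in the same column.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines

# Toy sample, NOT real corpus data: fragments borrowed from the hits above.
sample = (
    "That's how they were dealt with. Erm moving on a bit now, er I mean, "
    "we kept moving on. Moving on, another suggestion is a ballot."
)

hits = kwic(sample, "moving on", width=30)
for line in hits:
    print(line)
```

A real concordancer also handles tokenization, register filters, and sorting by the words to the left or right of the key word, none of which this sketch attempts.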
One implication for teachers: teaching authentic language
So you can see from this abbreviated example that, with the register limitations I used, “moving on” is used quite frequently in authentic speech to signal a change of topic. This isn’t surprising…most BE texts teach this phrase. But let’s look at some of the other phrases that BE texts teach as discourse markers for changing topic, and note how often each one appears in the BNC spoken sub-corpus. (The example phrases come from a popular BE coursebook on presentation skills from a major publisher, released in 1995.)
“That brings me to…” or “That brings us to…”
Total examples: 7
Used as discourse marker: 7
“Now we come to…”
Total examples: 7
Used as discourse marker: 6
“Let’s go on to…”
Total examples: 2
Used as discourse marker: 2
“…move on to…”
Total examples: 2
Used as discourse marker: 0
“Moving on…”
Total examples: 59
Used as discourse marker: 38
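To compare the phrases at a glance, the counts above can be tabulated and ranked with a few lines of Python. This is only a sketch of the arithmetic: the numbers are copied from my lists above, and the names `counts` and `marker_share` are made up for this example.

```python
# (total examples, examples used as a discourse marker), copied from the lists above.
counts = {
    "That brings me/us to...": (7, 7),
    "Now we come to...": (7, 6),
    "Let's go on to...": (2, 2),
    "...move on to...": (2, 0),
    "Moving on...": (59, 38),
}

def marker_share(total, as_marker):
    """Fraction of all hits in which the phrase acts as a discourse marker."""
    return as_marker / total if total else 0.0

# Rank phrases by how often they actually occur as a discourse marker.
ranked = sorted(counts.items(), key=lambda item: item[1][1], reverse=True)

for phrase, (total, as_marker) in ranked:
    share = marker_share(total, as_marker)
    print(f"{phrase:<26} {as_marker:>3}/{total:<3} {share:.0%}")
```

Ranking by the raw number of discourse-marker uses, rather than by the percentage, is a deliberate choice here: a phrase with two hits that are both discourse markers scores 100% but is still rare in the corpus.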
So if we are teaching BE learners how to produce (or understand) a transition in a presentation, which of these phrases would be most useful? That the answer to this question is now obvious points to the promise of corpora and data-driven learning, and to how they can impact materials design. (Disclaimer: corpus analysis can be tricky, and I’m far from being qualified to do it. So the little example above may be riddled with mistakes and the conclusion therefore wrong. My first question would be whether the BNC is appropriate for identifying authentic business language…it might be interesting to use a more genre-specific corpus like Mike Nelson’s. Any readers who can point out other problems get a post of their own.)