Applying Corpus Tools to Academic English Instruction

Note: this is a copy of the post I wrote for CELFS at the University of Bristol, updated and edited to make the information available for my students in China.

My research involved a very useful tool for teaching and learning academic vocabulary. In this blog post I will provide a ‘walkthrough’ example of how this tool can be used by both English for Academic Purposes (EAP) teachers and, more importantly, students (this version assumes quite advanced students).


What is a corpus?
A corpus (literally ‘body’) is a collection of texts – a body of texts. There are big corpora (with the web as searched through Google arguably being the biggest) and smaller corpora that comprise collections of specific kinds of text (e.g. a collection of the works of one author). A reference corpus is an existing corpus against which a researcher (or EAP teacher or student) can check how particular words or sequences of words are used.

Why use a corpus?
Before demonstrating how a reference corpus can be used, it would be useful to consider why an EAP teacher or student might want to use one. Essentially, it allows us to test our own intuition about how lexical items are used and provides a tool for students to be able to check their own vocabulary use. For example, as one of my former EAP colleagues in China explained:

When [students] have written something that just is not an English expression, I say to them, ‘have you tried putting this through Google…to see what comes up?’ And either it just doesn’t come up or it comes up all Chinese language websites and they realise they’re not using it appropriately…very quick – just using the web as your very quick reference corpus…You have to be careful with that obviously because there are a lot of people using wrong, horrible grammar and vocabulary.

(see also Robb, 2003)

Wouldn’t it be useful if there were specific corpora against which we could reference academic English vocabulary? Well, the remainder of this blog post will look at one such corpus – the British Academic Written English (BAWE) corpus which is accessible via the following link: https://the.sketchengine.eu/#open

Why BAWE?

BAWE is particularly useful for EAP teachers and students because it represents target written language, especially for those studying or aiming to study in the UK. The corpus available on Sketch Engine consists of 6,968,089 words distributed across 2,761 texts. These are all student assignments that were positively assessed. They range from first year undergraduate to Masters level across four broad disciplinary groupings (Nesi & Gardner, 2012:8). Thus, the BAWE corpus gives students the opportunity to test their vocabulary choices against an authentic target corpus – successful student papers submitted to UK universities.

This also gives students a way to resolve mixed messages from different tutors about vocabulary: they can check the item in question against BAWE as the reference corpus.

 

Demonstration

For the purpose of this demonstrative ‘walkthrough’, I have chosen to offer an answer to a question raised in training sessions on the pre-sessional course at the University of Bristol in 2016. This is a question that often gets raised in discussions between EAP tutors – whether we can use first person pronouns in academic writing.

For clarity I have used italics for lexical items explored through the corpus tools and bold to signify words with operational functions on the Sketch Engine website.

Conducting a Simple query

First navigate to Sketch Engine and select BAWE from the list of options (Figure 1).

Figure 1 (click on image to enlarge)

Then select concordance from the list of options (Figure 2).

Figure 2 (click image to enlarge)

Next, simply enter the word or words that you want to investigate and click Search (Figure 3).

Figure 3 (click image to enlarge)

After entering the word we and running this search, students who have been told that academic writing does not use we are in for a bit of a shock. The simple query search includes instances of us as well as we and combined they occur 15,718 times (or 1,885.50 per million words) in the corpus (Figure 4). These words are in fact used at a much greater frequency than some words EAP teachers often actively encourage students to use (try searching furthermore and you’ll find it occurs 1,319 times or 158.22 per million words, and in conclusion occurs 428 times or 51.34 per million words). This indicates quite clearly that the instruction not to use first person pronouns is overly simplistic.

Figure 4 (click on image to enlarge)
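The per-million figures quoted above are normalised frequencies: the raw count divided by the size of the corpus, scaled up to one million. The following is a minimal Python sketch of that arithmetic, not a description of how Sketch Engine computes it internally. The function names are illustrative, and the idea that the counting unit is tokens (which would include punctuation) rather than words is an assumption inferred from the numbers.

```python
def per_million(count, corpus_size):
    """Normalised frequency: hits per million units of corpus."""
    return count / corpus_size * 1_000_000


def implied_corpus_size(count, freq_per_million):
    """Back-calculate the corpus size that a quoted pair of figures implies."""
    return count / freq_per_million * 1_000_000


# (raw count, per-million figure) pairs quoted in the paragraph above
quoted = {
    "we/us": (15_718, 1_885.50),
    "furthermore": (1_319, 158.22),
    "in conclusion": (428, 51.34),
}

implied_sizes = {
    item: implied_corpus_size(count, freq)
    for item, (count, freq) in quoted.items()
}

for item, size in implied_sizes.items():
    print(f"{item}: implied corpus size of about {size:,.0f}")
```

All three pairs imply a corpus of roughly 8.3 million units, which is larger than the 6,968,089 words quoted for BAWE. That would be consistent with the tool normalising by tokens rather than words. Either way, the denominator is the same for every search, so the per-million figures remain directly comparable with each other.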

KWIC and Concordancing

Having discovered that we is frequently used in successful academic assignments, students could usefully explore how we and us are used in context. To do this, look at the key word in context (KWIC), which appears in red in the middle of each of the listed concordance lines. These concordance lines can be studied and patterns of usage observed (Figure 5).

Figure 5 (click on image to enlarge)
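For readers who like to see the idea behind the display, a KWIC concordance can be sketched in a few lines of Python. This is only an illustration of the concept: the function name kwic, the width parameter and the sample sentences are all invented here, and none of it reflects how Sketch Engine is implemented.

```python
import re


def kwic(text, keyword, width=30):
    """Return concordance lines with the keyword centred and bracketed."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take a fixed number of characters either side of the hit
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines


# Made-up sample text, purely for illustration
sample = (
    "In this essay we argue that the policy failed. "
    "We can see from Table 2 that costs rose. "
    "As the data show, we cannot rule out bias."
)

lines = kwic(sample, "we", width=20)
for line in lines:
    print(line)
```

Because every hit is lined up in the same column, recurring patterns to the left and right of the key word become easy to spot, which is exactly what makes concordance lines useful for studying usage.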

Note that clicking on the blue words to the left of the concordance lines opens a pop-up box that gives you more information about the text (Figure 6). This is how I discovered that the only instances of pay attention on in the corpus were all written by Chinese students, so the phrase can be fairly safely interpreted as a Chinglish grammar error. You can also see a bigger sample of the writing in the pop-up box by clicking on the red key word in the middle of the concordance lines.

Figure 6 (click on image to enlarge)

 

To start a new search, click on this icon on the menu at the left of the screen:

The material below needs updating but please do go ahead and explore the functions the software has to offer 🙂

Returning to the case of we, for some direction on interpreting its functions I recommend Tang and John (1999), who categorise ‘the writer identity in student academic writing through the first person pronoun’. They identify six different functions, which position the writer as: representative, guide through the essay, architect of the essay, recounter of the research process, opinion-holder and originator (with representative and guide being the most frequent). If you are a student who is now puzzled about why your respected English teacher told you not to use we in an academic essay, try a new search using this string of words that is frequently used in Chinese student writing: as we all know. The result will show you that there are ways of using we that are not acceptable in academic writing. So, if you cannot understand the difference between as we all know and the examples of we in the concordance lines generated by the first search, it would be advisable to follow your teacher’s advice and avoid writing we.

To search for a specific word form (e.g. exclude us) click on query type – word and enter your search item in the word form field (Figure 5). This returns 13,222 instances of we (1,586.08 per million). So we is clearly a lot more commonly used than us.

Figure 5 (click image to enlarge)

It may also be useful to know that you can control the word form of a search item to some extent just by being aware of what you type into the simple query. For example, a search of maintains will give you only instances of maintains, whereas searching maintain will return instances of multiple word forms (maintains, maintained, maintaining). You can also use * to indicate missing letters which allows you to enter the basic stem of a given word and broaden your search for different word forms (try it by running a search for analyse and another for analys*). The * can also be used to indicate a missing word so, for example, you could check what adverbs might be appropriate to put between is and argued by running a search for is * argued (compare the results with a search for is argued).
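The two uses of * described above (missing letters inside a word, and a missing word inside a phrase) can be mimicked on a small word list with Python's standard fnmatch module. This is a rough sketch under stated assumptions: the helper names match_words and match_phrases and the sample data are invented for illustration, and the real query engine may treat * differently in edge cases.

```python
from fnmatch import fnmatchcase


def match_words(pattern, words):
    """Return the words that fit a wildcard pattern such as 'analys*'."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]


def match_phrases(pattern, tokens):
    """Return token sequences that fit a phrase pattern such as 'is * argued'.

    In this sketch each part of the pattern, including a lone *, stands for
    exactly one token.
    """
    parts = pattern.lower().split()
    hits = []
    for i in range(len(tokens) - len(parts) + 1):
        window = tokens[i:i + len(parts)]
        if all(fnmatchcase(tok.lower(), part) for tok, part in zip(window, parts)):
            hits.append(" ".join(window))
    return hits


# Made-up data, purely for illustration
words = ["analyse", "analysis", "analysed", "analysing", "analyst", "annual"]
tokens = "it is often argued that this is argued".split()

word_hits = match_words("analys*", words)
phrase_hits = match_phrases("is * argued", tokens)
print(word_hits)
print(phrase_hits)
```

Note that the stem analys* also catches analyst, which is a reminder that a broad wildcard can pull in word forms you did not intend, so it is worth scanning the results before drawing conclusions.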

Collocations Tool

To further analyse the use of the word we, after running the simple query search, use the options down the left-hand side of the Sketch Engine screen. A useful tool is Collocations (near the bottom of the options) and if the default settings are used (just click Make candidate list when some complex looking options appear), it is very clear that we strongly collocates with can and see (Figure 6), which suggests the collocation we can see is frequently used (this could be tested by entering we can see into the simple query and running another search as in Figure 7).

Figure 6 (click image to enlarge)
Figure 7 (click image to enlarge)
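The basic idea behind a collocation candidate list can be sketched as counting which words turn up within a few positions of the search word. The Python below is a deliberately simplified illustration: the function name collocates, the window parameter and the sample sentence are invented here, and it ranks by raw counts, whereas a corpus tool would normally rank candidates using statistical association scores.

```python
from collections import Counter


def collocates(tokens, node, window=3):
    """Count the words occurring within `window` positions of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            # Neighbours to the left and right, excluding the node itself
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


# Made-up sample, purely for illustration
tokens = "we can see that we can argue and we see".split()

counts = collocates(tokens, "we", window=2)
for word, freq in counts.most_common(3):
    print(word, freq)
```

Raw counts favour very common words, which is why real collocation tools adjust for how frequent each candidate is in the corpus overall. The sketch is only meant to show what "occurring near the search word" means in practice.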

Filters

The texts that comprise the corpus can be filtered in various ways. For example, listed under Frequency (in the list of options on the left-hand side of the screen), the Text types option will give you graphical data of the breakdown of usage across different types of discipline, text genre and author so you can, for example, discover that:

  • we is used more often by 1st and 2nd year undergraduates than 3rd years
  • we is used most often in Philosophy and Mathematics, but very rarely in Planning
  • we is used most often in the ‘Methodology recount + Narrative recount’ genres
  • we is used in greater relative frequency by L1 Welsh and Mongolian speakers

This provides plenty of data about different specific contexts in which we is used and by whom in academic writing but is probably of limited interest to most students.

Any of the filters can be used to narrow the range of the corpus in simple query searches (scroll down and below the simple query input field you will see lists of check box options). Again, this is probably of limited interest to most students but it does allow you to discover, for example, that we occurs 623 times (or 74.73 per million words) in Social Sciences – Economics, and 98 times (11.76 per million words) in Physical Sciences – Chemistry. When comparing the statistics between sub-corpora like this, it is important to use the per million words figure rather than the number of instances, so as to take account of the fact that employing different filters could create sub-corpora of significantly different sizes (convert per million into per thousand or a percentage if it makes it easier to conceptualise).
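The conversions suggested above are simple divisions, sketched here in Python using the two pairs of figures just quoted. The function names are illustrative. The final comparison of ratios is an extra sanity check added for this sketch, not something the tool reports.

```python
def per_million_to_per_thousand(freq_per_million):
    """1,000 per million is the same rate as 1 per thousand."""
    return freq_per_million / 1_000


def per_million_to_percent(freq_per_million):
    """10,000 per million is the same rate as 1 per cent."""
    return freq_per_million / 10_000


# Figures quoted in the paragraph above
economics = {"count": 623, "per_million": 74.73}
chemistry = {"count": 98, "per_million": 11.76}

for name, data in (("Economics", economics), ("Chemistry", chemistry)):
    per_thousand = per_million_to_per_thousand(data["per_million"])
    percent = per_million_to_percent(data["per_million"])
    print(f"{name}: {per_thousand:.5f} per thousand, {percent:.6f}%")

# Sanity check: compare the ratio of normalised figures with the ratio of raw counts
ratio_normalised = economics["per_million"] / chemistry["per_million"]
ratio_raw = economics["count"] / chemistry["count"]
print(f"normalised ratio: {ratio_normalised:.2f}, raw ratio: {ratio_raw:.2f}")
```

For these two figures the normalised ratio and the raw ratio are almost identical (about 6.35 and 6.36). That suggests the quoted per-million values may have been calculated against the whole corpus rather than against each sub-corpus, so it is worth checking which denominator the tool is using before reading too much into a comparison between disciplines.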

A Caveat
Whilst BAWE provides a very useful reference corpus for exploring the use of lexis in successful academic student writing, it is important to remember that it is still non-expert writing and may contain grammar, spelling and other language errors. It cannot be assumed that the assignments are models of excellent writing, only that they are of sufficiently good quality to have been deemed successful. Furthermore, BAWE is clearly not exhaustive so if a particular search item does not return many hits, this does not necessarily mean that it is not academic. There could be other explanations such as a lack of papers on a particular topic area or insufficient instances of a particular word to reveal a comprehensive list of collocates. Nevertheless, if a search item turns up few or no instances and no other reason is identifiable, it certainly does suggest that it might be sensible to avoid using it.

Taking it further
There ends my demonstration of how the BAWE reference corpus can be used by EAP teachers and students. Given the high numbers of Chinese students who currently make up the student body on EAP pre-sessionals, teachers and students might find it helpful to run searches on common clichés like every coin has two sides, what’s more and with the development of the society/technology/economy. I have developed an editable web page for this at: http://mushroom-scholars.org/learning-community/docs/exploring-academic-vocabulary-2/ (a version is also available for those who can access Google Docs).

For teachers and students who are really interested in exploring corpora further, whole texts can be compared to each other using tools available on Compleat Lexical Tutor. This website also allows you to conduct Key Word Analyses against a selection of reference corpora. There is also, of course, the mighty Google Book Corpus but this isn’t readily accessible in China.

 

Additional Useful Resources
Sinclair (2004) Corpus and Text - Basic Principles, in Wynne (Ed.) Developing Linguistic Corpora: a Guide to Good Practice

Tom Cobb’s Compleat Lexical Tutor

Mark Davies’ (formerly) Word and Phrase (requires registration)

 

References

Nesi, H. & Gardner, S. (2012) Genres across the Disciplines: Student Writing in Higher Education, Cambridge University Press.

Robb, T. (2003) ‘Google as a Quick ‘n Dirty Corpus Tool’, The Electronic Journal for English as a Second Language, 7:2, http://www.tesl-ej.org/wordpress/issues/volume7/ej26/ej26int/

Tang, R. and John, S. (1999) ‘The ‘I’ in identity: Exploring writer identity in student academic writing through the first person pronoun’, English for Specific Purposes 18, pp.S23-S39.
