Notes on a Language Database
Fred Curtis - November 2004
These notes are some thoughts on a database to capture information on words from many languages, with an aim to provide resources for learning languages.
[Introduction] [Limitations of dictionaries] [What is a word?] [Written vs. Spoken Language] [Modelling Variations] [References]
Ultimately, one hopes, this information will reside in one's head. Until that happens it is very convenient to be able to track down notes by keyword searches etc. For centuries people far cleverer than I have been using carefully crafted paper systems to organise information. For better or worse, I am the product of a period and place where computers are in common use, and lean toward computerised solutions for organising and obtaining near-instant access to information.
I'm not a linguist (someone with a deep knowledge of the structure of languages and how they change) nor someone with a knowledge of how languages should be taught. I'm a student trying to learn several languages and I'd like reference materials that let me merely dip into the meanings of a word or wade in amongst detail as I need.
The word appears to be a widespread concept. Even in primitive cultures, informants are often able to identify words. This is somewhat surprising, because nobody has yet proposed a satisfactory definition of the notion "word", or provided a foolproof means of identification. [...]
[...] People sometimes wrongly assume that a word is recognizable because it represents a "single piece of meaning". But it can be easily shown that this view is wrong by looking at the lack of correspondence between words from different languages. In English, the three words "cycle repair outfit" correspond to one in German, "Fahrradreparaturwerkzeuge". Or the six words "He used to live in Rome" are translated by two in Latin, "Romae habitabat". And even in English, a word such as "walked" includes at least two pieces of meaning, "walk" and "past tense".
[...] A word that shows just how widespread these changes can be is nice, which is first recorded in 1290 with the meaning of stupid and foolish. Seventy-five years later Chaucer was using it to mean lascivious and wanton. Then at various times over the next 400 years it came to mean extravagant, elegant, strange, slothful, unmanly, luxurious, modest, slight, precise, thin, shy, discriminating, dainty, and -- by 1769 -- pleasant and agreeable. The meaning shifted so frequently and radically that it is now often impossible to tell in what sense it was intended, as when Jane Austen wrote to a friend, "You scold me so much in a nice long letter ... which I have received from you".
Silly people grumble about English being corrupted by Americanisms. [...] before grumbling about a new American vogue word, it is worth inspecting it carefully to see how new it really is. Many such words were exported across the Atlantic by our common ancestors. They have died or become archaic here; flourished over there; and are now coming home like boomerangs.
[...] The suffix -wise, as, rebarbatively, in situationwise, is not an odious American neologism but a respectable revenant from antiquity. Chaucer had doublewise; Bunyan dialoguewise; and Coleridge maidenwise.
[...] The converse of a boomerang word is a word inherited from our common vocabulary of the eighteenth century that has survived in BritEnglish, but become obsolete in AmerEnglish. For example [...] "I am too mean to go to the seaside for a fortnight, so I reckon I will fetch my bathing costume and paddle in the bath". The American translation of that is: "I am too cheap to go to the ocean for two weeks, so I guess I will get my swimsuit and wade in the tub". [...]
A third group of words are those that have survived in common usage in both BritEnglish and AmerEnglish, but have developed divergent meanings on either side of the Atlantic. Homely is the classic example of this group. It means "cosy" in Britain, but "plain" in the United States, and can be a cause of misunderstanding and offence. [...]
[Boomerang Words, pp. 1-3]
And yet for all its grammatical complexity Old English is not quite as remote from modern English as it sometimes appears. Scip, bæð, bricg, and þæt might look wholly foreign but their pronunciations -- respectively "ship", "bath", "bridge", and "that" -- have not altered in a thousand years. [...] You also find that in terms of sound values Old English is a much simpler and more reliable language with every letter distinctly and invariably related to a single sound. There were none of the silent letters or phonetic inconsistencies that bedevil modern English spelling.
Umpire, incidentally, is one of those many words in which an initial n became attached, like a charged particle, to the preceding indefinite article. In Middle English, one was "a noumpere", just as an apron was at first "a napron".
To be fair, English is full of booby traps for the unwary foreigner. Any language where the unassuming word fly signifies an annoying insect, a means of travel, and a critical part of a gentleman's apparel is clearly asking to be mangled. [p. 1]
And from Aitchison1999, p. 51:
All Chinese dialects are monosyllabic -- which can itself be almost absurdly limiting -- but the Pekingese dialect goes a step further and demands that all words end in an "n" or "ng" sound. As a result, there are so few phonetic possibilities in Pekingese that each sound must represent on average seventy words. Just one sound, "yi", can stand for 215 separate words. Partly the Chinese get around this by using rising and falling pitches to vary the sounds fractionally, but even so in some dialects a falling "i" sound can still represent almost forty unrelated words. [p. 79]
Should "fly" (noun) and "fly" (verb) be counted as the same, since they sound the same, or as different, because they have different meanings? Should "fly" and "flew" be regarded as one word because they belong to the same verb, or as different because they have different forms? [...] Most dictionaries have separate entries for "fly" (noun) and "fly" (verb) [...] However, both these lexical items have various syntactic forms associated with them. The insect could occur as "fly" (singular) or "flies" (plural), and the verb could occur as "fly", "flying", "flies", "flew", "flown".
In Welsh, the initial consonant of each word varies systematically, depending mainly on the preceding sound: the word for "father" could be "tad", "dad", "thad", or "nhad".
[...] Most importantly, spoken language is primary, not written language. Indeed, only spoken language can be truly considered "language." Writing is a collection of symbols meant to represent spoken language. It is not language in and of itself. Many written languages (Spanish, Dutch, etc.), will regularly undergo orthographic reforms to reflect changes in the spoken language. This has never been done for English (the spelling of which has never been regularized in the first place), so what we use for written language is actually largely based on the spoken language of several centuries ago.
[Merriam-Webster web page on pronunciation, sampled 2005-12-11]
In this view (ignoring, e.g., conjugated forms), a word is a grouping, fluid over time and geography, of meanings, orthographic forms and phonetic forms.
By tying each written example to a spelling/word pair, and each spoken example to a sound/word pair, we can record particular instances of a word (through the time and location details of the example text or speech), and effectively record shifts in meaning, spelling and pronunciation.
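The grouping described above might be sketched as follows. This is only a minimal illustration, not a settled schema: the class names (`Word`, `SpellingLink`, `SoundLink`, `Example`) and the helper `meaning_timeline` are invented for this sketch, and the sample rows reuse the dates given for "nice" in the quotation earlier in these notes, with placeholder source descriptions rather than real citations.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class Word:
    """A handle for the grouping of meanings, spellings and sounds."""
    language: str
    label: str


@dataclass(frozen=True)
class SpellingLink:
    """A spelling/word pair."""
    word: Word
    spelling: str


@dataclass(frozen=True)
class SoundLink:
    """A sound/word pair (the transcription scheme is left open here)."""
    word: Word
    sound: str


@dataclass(frozen=True)
class Example:
    """One written or spoken instance of a word, anchored in time and place."""
    link: Union[SpellingLink, SoundLink]
    meaning: str
    year: Optional[int]      # None when the date is unknown
    place: Optional[str]     # None when the location is unknown
    source: str              # placeholder description, not a real citation


def meaning_timeline(examples: List[Example], word: Word) -> List[Tuple[int, str]]:
    """Return (year, meaning) pairs for a word, oldest first, skipping undated examples."""
    dated = [e for e in examples if e.link.word == word and e.year is not None]
    return [(e.year, e.meaning) for e in sorted(dated, key=lambda e: e.year)]


nice = Word("English", "nice")
nice_spelling = SpellingLink(nice, "nice")

examples = [
    Example(nice_spelling, "pleasant, agreeable", 1769, None, "(placeholder) later sense"),
    Example(nice_spelling, "stupid, foolish", 1290, None, "(placeholder) first recorded use"),
    Example(nice_spelling, "(sense unclear)", None, None, "(placeholder) undated letter"),
]

timeline = meaning_timeline(examples, nice)
```

Because each example carries its own date and place, a shift in meaning falls out of an ordinary sort over the examples; nothing about the word itself has to be versioned.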
Refinements: sounds and spellings may be more finely divided by adding context. For example, a particular archaic saying may provide the only extant example of a particular meaning, spelling or pronunciation, and a word may take on a particular meaning only in certain contexts (cf. slang vs. formal usages).
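One way the context refinement might look in practice is to attach a set of context tags to each sense and filter on them. The sketch below is an assumption-laden illustration: the `Sense` class, the `meanings_in` helper and the tag strings such as `"region:GB"` are all made up here, and the sample data is the "homely" contrast from the quotation earlier in these notes.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Sense:
    """A meaning of a spelling, optionally restricted to certain contexts."""
    spelling: str
    meaning: str
    contexts: FrozenSet[str] = frozenset()   # empty set means "no restriction"


def meanings_in(senses: Iterable[Sense], spelling: str, context: FrozenSet[str]) -> List[str]:
    """Return the meanings whose required context tags are all present in `context`."""
    return [
        s.meaning
        for s in senses
        if s.spelling == spelling and s.contexts <= context
    ]


# Hypothetical tag vocabulary; a real database would need an agreed scheme.
SENSES = [
    Sense("homely", "cosy", frozenset({"region:GB"})),
    Sense("homely", "plain", frozenset({"region:US"})),
]

british = meanings_in(SENSES, "homely", frozenset({"region:GB"}))
american = meanings_in(SENSES, "homely", frozenset({"region:US", "register:informal"}))
```

The same tag set could carry register (slang vs. formal) or a pointer to a specific saying, so a sense attested only in one archaic phrase would simply have a very narrow context.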
This model doesn't capture the strong associations between related verb and noun forms (as occur, e.g., in English and Japanese). Nor is there any explicit representation of etymological relationships, word transfer between languages, or forms unrelated by sound or orthography (e.g. humble/honorific verb forms in Japanese). These would seem to require additional relationships and possibly anchors to particular contexts.
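The "additional relationships" could perhaps be modelled as typed links between words. The following is a rough sketch only: `RelationKind`, `WordRef`, `Relation` and the two lookup helpers are hypothetical names, the set of kinds is certainly incomplete, and the Japanese forms are given in romanised dictionary form purely for illustration. The umpire example comes from the quotation earlier in these notes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class RelationKind(Enum):
    """Read as: source <kind> target."""
    ETYMOLOGY = "derives from"
    BORROWING = "borrowed from"
    HUMBLE_FORM = "humble form of"
    HONORIFIC_FORM = "honorific form of"


@dataclass(frozen=True)
class WordRef:
    language: str
    form: str


@dataclass(frozen=True)
class Relation:
    source: WordRef
    kind: RelationKind
    target: WordRef
    context: str = ""   # optional anchor to a particular context


def targets_of(relations: Iterable[Relation], word: WordRef, kind: RelationKind) -> List[WordRef]:
    """Words that `word` points to via `kind`, e.g. what a word derives from."""
    return [r.target for r in relations if r.source == word and r.kind == kind]


def sources_of(relations: Iterable[Relation], word: WordRef, kind: RelationKind) -> List[WordRef]:
    """Words that point to `word` via `kind`, e.g. the humble forms of a plain verb."""
    return [r.source for r in relations if r.target == word and r.kind == kind]


umpire = WordRef("English", "umpire")
noumpere = WordRef("Middle English", "noumpere")
iu = WordRef("Japanese", "iu")            # plain verb "to say"
mousu = WordRef("Japanese", "mousu")      # humble counterpart
ossharu = WordRef("Japanese", "ossharu")  # honorific counterpart

RELATIONS = [
    Relation(umpire, RelationKind.ETYMOLOGY, noumpere),
    Relation(mousu, RelationKind.HUMBLE_FORM, iu),
    Relation(ossharu, RelationKind.HONORIFIC_FORM, iu),
]

ancestors = targets_of(RELATIONS, umpire, RelationKind.ETYMOLOGY)
humble = sources_of(RELATIONS, iu, RelationKind.HUMBLE_FORM)
```

Keeping the relation separate from the word records means forms with no shared sound or spelling can still be linked, which the spelling/sound grouping alone cannot express.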
... to chase up: Defining Polysemous Words, by Peter Norvig