This article was originally posted on WomenLearnThai.com.
Creating a Thai dictionary…
Like most students of the Thai language, I keep adding to a list of Thai words I must know. It started out as a simple spreadsheet with just the Thai word and the meaning(s). Then I added whether each word was a noun, a verb, or whatever. When I discovered classifiers, they were added too. Oh, and polite particles, ending particles, colloquial particles, auxiliary verbs… there seemed to be no end to what I needed to clarify in order to understand Thai. My growing spreadsheet gave me an appreciation for creators of real dictionaries.
When it comes to producing Thai dictionaries, Benjawan Poomsan Becker of Paiboon Publishing and Chris Pirazzi of Word in the Hand have had a successful working relationship. Back in 2003, they created a Palm OS version of Benjawan’s first Thai-English English-Thai dictionary. In 2009, they worked together on the paper version of the improved Three-way Thai-English English-Thai Dictionary, and the software version is just now out. Coming next will be the same for the iPhone.
Early this year, Chris Pirazzi asked if I could please help beta test their software dictionary. Time didn’t allow for me to participate properly, but I was able to poke around each pre-released version sent out. Doing so raised my curiosity over the makings of a real dictionary. When I approached Chris with the idea of an interview, he was happy to oblige.
Chris, what possessed you to write a dictionary?
Boy, is that the right question! In his pioneering 1755 Dictionary of the English Language, Samuel Johnson famously defined “lexicographer” as “A writer of dictionaries; a harmless drudge that busies himself in tracing the original, and detailing the signification of words.” Kun Benjawan began her first dictionary in 2001, and both of us began our expanded dictionary project in 2007, with a strong passion to create the first Thai–English–Thai dictionaries that are really useful for non-Thai-natives who are learning the Thai language. During this process, we learned how incredibly difficult and labor-intensive it is to produce a good dictionary, and we gained great respect for pioneers like Johnson and the late Mary Haas, but thanks to our strong desire to advance the field, we were able to complete both paper and software versions, with more to come!
What makes it so hard?
Creating a dictionary is such a daunting task, then and now, simply because it defies almost any kind of automation. For our new dictionary project that began with the 2009 paper dictionary, we used databases, home-grown software, and other technologies to streamline as many potentially repetitive tasks as possible, but at the heart of it is something that even the most powerful supercomputers of today can’t touch: meaning.
To see what I mean, write down any five common English words, and then try to think of all the meanings of those words that you know. Then, look those words up in a big dictionary such as dictionary.com, and you’ll be surprised how many extra meanings there are—simple, everyday meanings that you know and use often—that you forgot to list. As you read through the different meanings from dictionary.com, at first you’re likely to say “Hmm, those two are the same meaning,” but when you read them again you realize the meanings are entirely different, and you just lumped those meanings together in your head because they happen to map to the same English word.
As humans, we’re used to having a thought and then looking up the word for that thought in our brain so we can speak or write it, but not the other way around. Rarely in our everyday activities do we need to find all meanings for a given word. And I can tell you from experience that if you exercise this mental skill for more than an hour or so, your brain starts to overheat. If you do it for days, weeks, and months on end, plowing your way through the seemingly endless list of words that makes up even the most basic dictionary, you start to get an inkling of why Johnson’s English–English dictionary took nine years to complete, why the first OED took almost 50 years to complete, and why many lexicographers become increasingly disconnected from reality 🙂
How is creating a bilingual dictionary different?
The second language adds a whole new dimension of complexity. Every English word (e.g. “glass”) has a certain set of meanings (e.g. “glass (drinking),” “glass (pane)”), each of which may (or may not) translate to a set of different Thai words, and each of those Thai words, in turn, has a certain set of meanings, each of which could translate to a set of different English words! In this sense, a bilingual dictionary is like a tangled web of links back and forth, and our job is to reveal that web for each and every word that the reader might look up.
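To make the “tangled web” concrete, here is a minimal sketch of how sense-level links between the two languages could be stored and followed. It is only an illustration: the three sample links, the sense glosses, and the names `LINKS`, `build_index`, and `round_trip` are assumptions made for this example, not data or code from the dictionary project.

```python
from collections import defaultdict

# Illustrative links only: (English word, sense gloss, Thai word).
# These three rows are assumptions for the example, not dictionary entries.
LINKS = [
    ("glass", "drinking", "แก้ว"),
    ("glass", "pane", "กระจก"),
    ("mirror", "for reflection", "กระจก"),
]


def build_index(links):
    """Index the same links in both directions, keeping the sense gloss."""
    en_to_th = defaultdict(list)
    th_to_en = defaultdict(list)
    for english, sense, thai in links:
        en_to_th[english].append((sense, thai))
        th_to_en[thai].append((sense, english))
    return en_to_th, th_to_en


def round_trip(english, en_to_th, th_to_en):
    """Follow English -> Thai -> English and collect every English word reached."""
    reached = set()
    for _sense, thai in en_to_th.get(english, []):
        for _back_sense, back in th_to_en.get(thai, []):
            reached.add(back)
    return reached


en_to_th, th_to_en = build_index(LINKS)
print(en_to_th["glass"])
print(round_trip("glass", en_to_th, th_to_en))
```

Even with three links, starting from “glass” leads to “mirror” by way of a shared Thai word, which is why each link has to carry its sense gloss rather than pairing bare words.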
Languages like Thai add additional complexity because there are often multiple different words one must choose based on the social context (similar to “eat” vs. “chow down” vs. “dine” vs. “consume sustenance” in English, but this phenomenon occurs much more commonly in Thai than in English); our dictionaries tell the reader when a Thai word is loaded in such a way. Many Thai–English dictionaries ignore this critical reality, and so their users end up saying things like “Hey buddy, how’s it goin’? Let’s go consume sustenance at the burger joint!”
The only luxury we have that Johnson did not is that we can assume the reader is already an expert in one of the two languages. But the result we produce is therefore only useful for readers who are skilled in that language: it’s a fallacy that one bilingual dictionary can be equally useful for both English-native and Thai-native readers.
Classifiers are another biggie. In Thai you can’t say “two cars,” “this car,” or “that car” without knowing the special Thai classifier for “car,” and each noun that you might want to use in this way has one or more different classifiers that you have to learn. Dictionaries meant for Thai people typically skip the classifier for most words, because they are “obvious” to the Thai reader. But we Thai learners need to know the classifier for every Thai noun that has one, and so that’s what we provided in our dictionaries. We probably have the largest list of Thai classifiers ever assembled!
Finally, pronunciation guides and sound recordings round out a bilingual dictionary. There are so many “talking dicts” on sale at malls in Thailand, but nearly all of them only talk in English. Often the salesman will try to trick you by highlighting the pronunciation guide for a Thai word (e.g. “sanuk”) and pushing the “talk” button. But this just makes the little unit try to use its gravelly robot English voice to pronounce the Thai word as if it were a real English word, and the resulting tone-less mumble you get is typically unrecognizable to any Thai listener.
A real bilingual software dictionary has to have sound recordings from a native speaker of the language being learned. And a real bilingual dictionary of any kind (software or otherwise) has to have a system of written pronunciation guides that is complete enough that we, the Thai learners, even have a chance of being understood. That means the pronunciation guide system must include the Thai tones, and it must have a unique way of writing each vowel and consonant sound of Thai that can differentiate words. Most pronunciation guide systems (such as those seen on Thai road signs and in karaoke videos, but even some found in Thai learning materials) immediately fail this test because they drop the tones, drop vowel lengths, and map many common vowels to the same written symbol.
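The “differentiate words” test can be sketched as a simple collision check: simplify each pronunciation guide the way a lossy system would, and see which distinct Thai words end up written identically. The four sample words and their romanizations below are illustrative assumptions in a tone-marked style, and `strip_tones` and `collisions` are hypothetical helper names, not part of any dictionary software.

```python
import unicodedata
from collections import defaultdict

# Illustrative sample: Thai word -> (meaning, tone-marked pronunciation guide).
# The romanizations are assumptions for this sketch, not quoted dictionary data.
GUIDES = {
    "ใกล้": ("near", "glâi"),
    "ไกล": ("far", "glai"),
    "ข้าว": ("rice", "kâao"),
    "ขาว": ("white", "kǎao"),
}


def strip_tones(guide):
    """Mimic a system that drops tones by removing combining accent marks."""
    decomposed = unicodedata.normalize("NFD", guide)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def collisions(guides, simplify):
    """Group Thai words whose simplified guides become identical."""
    groups = defaultdict(list)
    for thai, (_meaning, guide) in guides.items():
        groups[simplify(guide)].append(thai)
    return {guide: words for guide, words in groups.items() if len(words) > 1}


print(collisions(GUIDES, strip_tones))    # tone-less guides: words collide
print(collisions(GUIDES, lambda g: g))    # full guides: no collisions
```

In this small sample, dropping the tone marks is enough to make “near” and “far” indistinguishable on paper. The same check could be run with a function that also shortens long vowels.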
How do others deal with these challenges of creating bilingual datasets for Thai learners?
Mostly they don’t. The vast majority of bilingual printed and software dictionaries, particularly in Thailand, are direct copies of other works (in most cases, scanned and pirated outright, without licensing or giving credit, and rarely with any editing). Nearly all of the web and software dictionaries currently out there use the same circa-1995 LEXiTRON data freely released to the public by the Thai-Government-funded agency known as NECTEC. The LEXiTRON data, while it is an amazing resource, is worth every baht: it has a very large word count, but it contains an enormous number of errors in both languages, and, unfortunately for us Thai learners, it was designed with the needs of a Thai person learning English in mind. So all the explanatory texts (e.g. the words “drinking” and “pane” in “glass (drinking)” vs. “glass (pane)”) are in Thai, not in English. When you look up “glass,” or almost any other word, you can never be sure which meaning(s) you’re getting. “Can I have another pane of beer please?”
That explains why there are so many software dictionaries, for, say, iPhone, but why the available dictionaries are so uniformly awful. The authors try to take a shortcut to avoid literally years of hard editing work, but the usefulness of their work is ultimately limited by the error-ridden, native-Thai-focused nature of the underlying data.
The currently available free (or pirated) datasets also do not include pronunciation guides that are useful to Thai learners like us (they often provide pronunciation guides for the English words only), so many of the “software repackagers” use a computer program to generate their Thai pronunciation guides directly from the Thai script. Unfortunately, written Thai is sufficiently irregular that an automated approach is extremely inaccurate, and as many as 30–40% of the resulting pronunciation guides are wrong (often so far off that you have no chance of being understood). There is no substitute for having a Thai native expert manually edit all the pronunciation guides.
So we decided to set out on the nearly insane task of creating a completely new Thai–English dictionary dataset from scratch. The last time this was attempted, aside from the incredible work in the 1960s by Mary Haas, was probably in the 1930s when political prisoner Sor Settabut completed his dataset while trapped on Ko Tarutao and in various other Thai jails, and this is probably the only reason he was able to finish it! Since our target audience is people who are learning Thai, we set out to include classifiers, formality levels, and complete, Thai-native-edited pronunciation guides with every Thai word.
This new effort requires us to make an enormous, ongoing investment of time, labor, and money, but we think the result is so much better than anything else out there that it will pay off. Like all dictionary makers, our editors have giant stacks of existing reference dictionaries spread across our tables, and we even found that Google Search makes a fantastic corpus tool for finding monolingual usages of any English or Thai word “in the wild” (as Rikker Dockum of Thai 101 has often pointed out) but the key labor-intensive element that produces so much value is the human touch: critically evaluating and synthesizing the available research data to create a set of useful dictionary entries.
Didn’t Paiboon Publishing already have a dictionary before 2009?
Yes. Kun Benjawan released her first paper Thai–English–Thai dictionary in 2001, complete with the innovative “Thai Sound” section where you can look up a word by its pronunciation guide without having to know Thai script. This was the first opportunity to go through the whole process. I used this same dataset to produce the 2003 Word in the Hand Thai–English–Thai dictionary software for Palm OS PDAs. Around 2007, we began a new, much wider-ranging dictionary project, the first results of which are the new 2009 Thai–English–Thai compact paper dictionary and the recently released Thai–English–Thai Talking dictionary for Windows PCs.
What did you learn after the first process?
Quite a lot. The first time around, Kun Benjawan did a lot of the data storage and editing in a manual fashion. The second time around, we learned to use databases to store all the words in a form that could easily be repurposed to both a paper and software result. I also wrote quite a bit of custom software that our editing team uses to check each entry in detail at the moment it is written, which helps us to avoid all kinds of formatting problems and omissions (such as pronunciation guides which do not match the corresponding Thai word, missing classifiers, etc.). Thanks to the database, we are also able to have large numbers of people work on the dataset at the same time without stepping on each other’s changes, and even more usefully, we are able to spread our team geographically across the globe. At the moment, we have editors working in California and in Thailand. The database also lets us easily keep track of the editing status of each word, since our policy is to run each word by both native Thai- and native English-speaking editors.
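As a rough illustration of this kind of entry checking, the sketch below flags an entry with a missing pronunciation guide, a noun with no classifier, or a missing editor sign-off. Everything in it is hypothetical: the `Entry` fields, the reviewer labels, and the `check_entry` function are assumptions made for the example and say nothing about how the project’s actual database or tools are built.

```python
from dataclasses import dataclass, field

# Hypothetical reviewer labels: each entry should be seen by both kinds of editor.
REQUIRED_REVIEWERS = {"thai_native", "english_native"}


@dataclass
class Entry:
    """A simplified, assumed shape for one Thai dictionary entry."""
    thai: str
    pronunciation: str
    part_of_speech: str
    classifiers: list = field(default_factory=list)
    reviewed_by: set = field(default_factory=set)


def check_entry(entry):
    """Return a list of problems for a human editor to look at."""
    problems = []
    if not entry.pronunciation.strip():
        problems.append("missing pronunciation guide")
    # Flag nouns without a classifier so an editor can confirm or fill it in.
    if entry.part_of_speech == "noun" and not entry.classifiers:
        problems.append("noun has no classifier")
    missing = REQUIRED_REVIEWERS - entry.reviewed_by
    if missing:
        problems.append("awaiting review: " + ", ".join(sorted(missing)))
    return problems


draft = Entry("รถ", "rót", "noun", reviewed_by={"thai_native"})
print(check_entry(draft))
```

Running the check at the moment an entry is saved, rather than in a later batch pass, is what lets omissions get caught while the editor still has the word in mind.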
On the second pass, we also learned a lot about what information to include in each entry so that it would be useful to the Thai learner. We added the formality indicators, syllable stress, and classifiers, we refined the list of parts of speech and added placeholders in certain verb expressions so you know where to put the object (as in “ao ___ maa” for “bring ___”), and we now have a much better system for ensuring that we provide the meanings of each English word that correspond to the Thai translations given.
So is the dataset done?
Far from it. The dataset contained in the 2010 software dictionary is about 40% larger than that contained in the 2009 paper dictionary, and it contains a healthy set of useful words, but even before the software dictionary came out, we had already begun work on a much larger dataset. We expect to at least double the size of our dataset by the end of 2010, and we will roll this expanded data out to those who purchase the dictionary now as a free upgrade. We plan to keep working on the data for several years, until we have a large dataset of the type appropriate for library reference volumes.
Was it fun?
Yes. A key difference between your typical stodgy, corporate linguistic production and our dictionary effort is the Thai element of fun, as evidenced by this video that Kun Benjawan and some of our editors put together:
How do you choose what vocabulary to include and what to leave out?
This was super-difficult for the compact paper dictionary, especially given that having large, readable Thai text was a very high priority for us. We had to make some hard compromises when deciding which words to leave out so as to keep the dictionary “compact.”
For the software, of course, printed space is not an issue. Disk space is somewhat of an issue, because every Thai word in our dictionary includes its own high-quality sound recording of a native Thai speaker, but the constraint is not as great as in the printed case. The main constraint becomes development time, and the trade-off we must decide on in order to ship the project in this millennium is: “do we focus on quantity or quality?”
The answer is clear. There are already plenty of dictionaries out there whose marketing materials quote enormous Thai and English word counts, but which contain huge numbers of errors and/or unusable pronunciation guides. We decided to spend a lot more time on each entry, running each entry, including its pronunciation guides, through both Thai-native and English-native editors. At each editing stage, we focus on defining the most useful words well, rather than cranking out huge lists of rare words without human intervention. We have a very useful set of words now, and we believe by the end of 2010 we will have covered 95% of the words that people are searching for.
Strangely, market forces tend to push all dictionaries away from quality. When people are shopping around for a dictionary, they tend to give disproportionate weight to the published word count (easy to do, since it’s printed on the outside) and they don’t find out until after purchase that the dictionary is useless for them, because it is full of errors, because its entries are not designed for their needs, or because the words themselves turn out not to be very useful.
It’s actually pretty shocking what some vendors have done to reach the astronomical word counts they quote. The typical trick is to find huge, freely-available lists of (usually rare) words on the internet, and import those lists mechanically without any human editing whatsoever. The all-electronic importing process may take the author only a few minutes to complete, and it results in a big spike in word count that grabs the attention of potential buyers, but it does not add to the usefulness of the dictionary in any appreciable way. For example, many dictionaries have mechanically imported lists of tens of thousands of plant species, animal species, chemicals, etc., for which the mechanically-generated definition is simply an italicized Latin name or a chemical equation (which most readers will not find useful), with no mention of the common English name, if there even is one. As another example, one dictionary I looked at included several hundred names of historical Roman Catholic Popes and Cardinals, transliterated from English into Thai, without any further definition! In some cases, the “padding words” can actually make the dictionary less usable, because the sound or spelling of the useless, definition-less words is sometimes very similar to common, useful words that people are trying to look up.
Are these valid “words?” Yes.
Are they useful, and should they be given the same “credit” towards useful word count as the core words? You be the judge.
Does the word count tell you whether or not a dictionary covers the useful core words? Definitely not.
Thanks to Thai language learning forums like this, though, people are becoming smarter buyers who will demand good coverage of useful core words, definitions that always clarify which sense of an English word is being translated into Thai, complete and accurate pronunciation guides for every Thai word, classifiers, and high-quality sound recordings of every word with native speakers.
Chris and Benjawan on WLT…
Chris and Benjawan are not strangers on WLT. So if you have the time, please read more: