FLTR: The Foreign Language Text Reader


This article was originally posted on WomenLearnThai.com.


Intensive Reading vs. Extensive Reading…

Extensive reading is a language learning technique characterized by reading a lot, at or slightly below your current level of proficiency, without looking up unknown words. If the level of books or texts chosen is appropriate, unknown words or grammatical structures can be inferred from context. Extensive reading is basically reading for pleasure, but it is very beneficial in terms of solidifying existing knowledge, acquiring new vocabulary, increasing reading speed, and (depending on what you read) expanding cultural understanding. The nice thing about extensive reading is that it is fun (if you like reading, of course), with language learning being just a by-product. Focus is on meaning, not on language. Extensive reading is often neglected in language schools because it has to be done alone and can’t be assessed or tested.

Intensive reading, on the other hand, is slow, careful reading of a short text. Here, the focus is on understanding (almost) every word, every sentence. Often the text is beyond your current reading ability, but because you go slowly, you can tackle it. Intensive reading can be used to familiarize yourself with new vocabulary, to study vocabulary related to a specific topic, or to find information. It is certainly less fun than extensive reading, but it can have an important role in language learning. As a matter of fact, intensive reading is often the only reading activity used in classroom settings, and it is heavily used by self-learners as well.

I have seen recommendations to balance the time spent on extensive vs. intensive reading at a ratio of about 4:1, which seems quite reasonable to me. In this blog post, however, instead of championing the extensive reading cause, I want to talk about intensive reading assisted by freely available open-source software.

Intensive reading is quite time-consuming, and most of that time is spent looking up vocabulary, taking notes, searching for notes, and looking up the same words again. Unless you’re extremely well organized, you will find that you look up many words more than once when you encounter them again in a new text. Some time is also often spent on highlighting new words and expressions, or otherwise visually structuring the text. This has inspired some people to write software that handles those more tedious tasks in order to make intensive reading easier. One of those software projects is the Foreign Language Text Reader (FLTR), which is open-source and can be installed and configured quite easily.

Foreign Language Text Reader…

FLTR basically works as follows: You load a text. The text is then displayed for reading, but words come color-coded. Words never seen before are blue, unknown words take shades between red and yellow/green, and known words are a pale green. While going through the text, you will mark new words as either known or unknown. If they are unknown, you can look them up in up to three online dictionaries with a single mouse click. Then you annotate the words (translations, explanations, pronunciation etc.), and this information is stored. When you encounter the word again, it will show up in its color code (there are five or six of them, from unknown to well known), and hovering over the word will reveal the notes you typed (or rather copied) in earlier. As time progresses, FLTR will learn which words you know and which you don’t, and will help you to focus on new and unknown words.


FLTR: The Foreign Language Text Reader

In this picture, the mouse is hovering over เครื่องกล.

What’s cool about this? Firstly, you look up words only once, and then you can review them by just hovering over those words. Secondly, instead of leafing through paper dictionaries, or typing words into an online search mask, you look them up with a single mouse click. Thirdly, the color coding helps you to identify what’s new, what you’ve seen before but is still unfamiliar, etc. Instead of being read over, those words stand out a bit and remind you of their existence. The color coding is also a good visualization of how difficult the text is going to be. Lots of blue and red words means work ahead.

There are also testing options as well as the possibility to export terms to Anki, but I haven’t used those features and can’t comment on them.

Setting up FLTR is pretty straightforward, with simple and clear instructions. Language configuration is also simple: options include setting font and font size and specifying up to three dictionaries for automated look-up (if the website allows that). Below you’ll find a screen-shot of my settings. I link to the monolingual Royal Institute Dictionary (doesn’t support automated look-up), Google image search and a longdo dictionary containing many Thai-Thai definitions. (I don’t use translations, but if you do, you’ve got many more choices.)

The only problem with Thai is the following: Thai doesn’t use spaces to separate words. FLTR, however, relies on spaces to identify words. So, unlike with languages like French or Indonesian that use spaces to indicate word boundaries, we need to prepare (‘parse’) the text before uploading it to FLTR.

A Thai Parser…

I haven’t been able to find a Thai parser on the web. It wouldn’t even have occurred to me to write my own parser, but a visitor to my website Thai Recordings told me that he wrote one, and that gave me the idea (thanks! :)). Coming up with a basic parser is actually quite simple – if you have some programming skills, you can do it yourself within a few hours. The parser requires a list of words (I use the FLTR vocabulary file for that), and inserts zero-width spaces into the Thai text. Zero-width spaces are invisible, but are recognized by FLTR. It was very important to me to find a space character that is invisible, because I’m so used to reading Thai without spaces that I get confused when I have to read spaced out Thai.
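To illustrate the zero-width space idea, here is a minimal Python sketch. It assumes the character in question is U+200B (ZERO WIDTH SPACE); the helper names `join_words` and `visible` are made up for this example and are not part of FLTR.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE (assumed to be the separator FLTR accepts)


def join_words(words):
    """Join already-segmented words with invisible zero-width spaces."""
    return ZWSP.join(words)


def visible(text, marker="|"):
    """Replace zero-width spaces with a visible marker, for debugging."""
    return text.replace(ZWSP, marker)


segmented = join_words(["ฉัน", "อ่าน", "หนังสือ"])
print(segmented)           # looks like unspaced Thai on screen
print(visible(segmented))  # shows where the invisible separators are
```

The `visible` helper is handy for checking parser output, since the separators cannot be seen in a normal editor.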

I use Python, which comes with my Mac, and have a terminal open to process texts:

terminal

Here’s what the parser does:

  1. Read in dictionary D (uses the FLTR vocabulary file, which is a tab separated text file)
  2. Read in the text
  3. For every ‘sentence’ S (set of Thai characters between two spaces) of the text, set i = j = 1 and do until i reaches the end of S:
  4. Define the snippet X = S(i, j), i.e., the characters in S between positions i and j
  5. If X is a word in D, note down this particular snippet
  6. If j has reached the end of S, go to 7, otherwise set j = j+1 and go to 4
  7. If snippets have been identified as words: choose the longest of those, insert zero-width spaces accordingly, set i = j = the index of the character right after that word, clear the noted snippets, and start over at 4
  8. If no snippets have been identified as words, set i = j = i+1 and start over at 4

The parser finds the longest word, and then restarts on the remainder. If no words have been found, it starts with the second, then third, etc., character, and finds the first word in the middle of the ‘sentence’. The more words the parser has in its dictionary, the more likely it is that new words are isolated between known words. Those words then will show up in blue in FLTR and can be marked according to whether they are already known or still unknown. Once they have been marked, they’re in the database and increase parsing accuracy.
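The steps above can be sketched in Python roughly as follows. This is a minimal illustration of greedy longest-match segmentation, not the actual parser described in this post. The function names are invented for the example, and `load_words` assumes that the term sits in the first tab-separated column of the vocabulary file, which may not match the real FLTR layout.

```python
ZWSP = "\u200b"  # zero-width space, assumed to be U+200B


def load_words(path):
    """Read a tab-separated vocabulary file (assumption: term in column 1)."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            term = line.split("\t", 1)[0].strip()
            if term:
                words.add(term)
    return words


def longest_word_at(chunk, i, words, max_len):
    """Return the longest dictionary word starting at position i, or None."""
    end = min(len(chunk), i + max_len)
    for j in range(end, i, -1):  # try the longest snippet first
        if chunk[i:j] in words:
            return chunk[i:j]
    return None


def segment_chunk(chunk, words):
    """Split one space-free chunk into known words and runs of unknown text."""
    max_len = max((len(w) for w in words), default=0)
    tokens, unknown, i = [], "", 0
    while i < len(chunk):
        word = longest_word_at(chunk, i, words, max_len)
        if word is None:
            unknown += chunk[i]  # no word starts here: move on one character
            i += 1
        else:
            if unknown:          # close the run of unknown characters
                tokens.append(unknown)
                unknown = ""
            tokens.append(word)
            i += len(word)
    if unknown:
        tokens.append(unknown)
    return tokens


def parse(text, words):
    """Insert zero-width spaces between tokens; existing spaces are kept."""
    return " ".join(ZWSP.join(segment_chunk(c, words)) for c in text.split(" "))


words = {"มา", "มาก", "กลับ", "ลับ", "บ้าน"}
print(parse("มากลับบ้าน", words).replace(ZWSP, "|"))
```

Runs of characters that match no dictionary word end up as their own token between known words, which is how new words get isolated.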

This parser is not perfect. It doesn’t work very well in the beginning: If new words come in chunks, a manual update of the database might be required to resolve that. It also can’t distinguish between มา-กลับ and มาก-ลับ. The first issue disappears over time, but the second stays (and would require semantic parsing to be resolved). If you have ideas on how to deal with those issues, please let me know in the comments!
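To make the second issue concrete, here is a small sketch that lists every way a chunk can be split into dictionary words. The function name and the four-word dictionary are made up for the illustration. It only enumerates the candidates; choosing between them would still require context.

```python
def all_segmentations(chunk, words):
    """Return every split of chunk into dictionary words.

    Plain recursion, exponential in the worst case, so only meant
    for short chunks.
    """
    if not chunk:
        return [[]]
    results = []
    for j in range(1, len(chunk) + 1):
        head = chunk[:j]
        if head in words:
            for rest in all_segmentations(chunk[j:], words):
                results.append([head] + rest)
    return results


words = {"มา", "มาก", "กลับ", "ลับ"}
for seg in all_segmentations("มากลับ", words):
    print("-".join(seg))
```

Both splits consist purely of dictionary words, so a longest-match rule will always pick the one starting with the longer word, whether or not it is the intended reading.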

Wrap up…

FLTR is a great little piece of software. It supports intensive reading and facilitates vocabulary work (whether monolingual or using translation). Look-ups are one click away, notes (or translations) are stored and show up when hovering over the word, and the color coding can be a useful visual aid. The only inconvenience is the necessity to have a parser, but a basic parser is not too difficult to write yourself.

Andrej,
Thai Recordings

24 thoughts on “FLTR: The Foreign Language Text Reader”

  1. It looks like the Thai Text Reader project is dead then? What a shame, I really like this method and have used it for other languages. If it worked for Thai, that would be perfect. Oh well, there don’t seem to be any other options either.

  2. We are now looking for volunteers to help fine-tune a Thai parser to use with FLTR. You don’t have to be a programmer. You just need to get Thai script, run it through the parser, then send the results to Rick.

    The details are here:

  3. “you remember words better when you have figured out their meaning yourself from context rather than from a dictionary”

    That’s a really great point. Dictionary checking whilst keeping my fingers jammed in several different books is not fun. Maybe passive learning is stronger in the long term, but fighting the desire to immediately know the definition of a word is difficult!

  4. Oh. I thought we were talking about Learning With Texts. Works fine there. Apologies – I should read more carefully.

  5. Chris, thanks for posting. Unfortunately, this doesn’t help with using the Foreign Language Text Reader…

    Python is a programming language. It’s open-source, I believe, and comes with my Mac, and probably with other operating systems as well. But, as others have pointed out as well, many programming languages will be suitable for writing a parser. Lexitron is a dictionary, just type the phrase into Google and you will see what it is. Loops etc. are basic programming concepts. Any programming tutorial for beginners will cover that, and there’s not much more needed for writing a parser. There’s also massive amounts of online help and example code for Python on the web.

  6. Since I can’t be your only reader who doesn’t know what Python, Lexitron, and i,j loops are, let me tell you how I manage the parsing:

    Go to the ever-awesome thai-language.com dictionary. Go to bulk look up. Paste your text. Select simple. Uncheck gather phrases. Increase columns from 8 to 20. Click.

    It is not perfect and can only handle 2500 characters at a time, so if you want to read books, this would be a very tedious approach. Since I can’t read more than a paragraph at a time, it works for me.

  7. I’ve been using Lexitron with my Thai parser for a number of months now, but I’d like to begin moving away from relying so heavily on translation from Thai to English. One of the things that I’m looking for is a good, simple Thai-Thai dictionary that’s available in electronic form.

  8. Rick,

    that’s great! :))
    Same here, it never occurred to me that writing a parser would be so easy, until somebody told me he had written one and gave me a brief outline…

    Have fun using FLTR!

  9. Andrej,

    It never occurred to me that writing a parser would be so easy, but given your lead, I tackled it today, and was pleased to knock it over in about 3 hours of on-and-off work.

    I use a language called Haskell, which is ideal for this kind of task, and even though I’m not an expert, the parser is only 10 lines of code all up (Haskell uses recursion instead of those pesky loops).

    Thanks for giving me the idea, now I’ll see about using FLTR itself.

  10. Catherine and Andrej – Extensive or intensive reading? All intensive and no extensive reading will make Johnny a dull boy one day. I like the 4:1 reading ratio in favour of extensive, that makes sense, a little like a keep-fit fanatic doing four days of quick-burn routine exercise and one day of hard, heavy weights. The ratio reads right.

    The Foreign Language Text Reader is a great idea, whatever will they come up with next. The FLTR is a simple but most useful tool. Unfortunately my computer screen would be a sea of colour.

  11. Peter, your comment must have been stuck in moderation, I just saw it a minute ago. There is no link. I’m not a programming guy, and the parser is not packaged nicely to work on other people’s machines. I’m willing to share my parser on a private basis but I can’t give any support. You would certainly need to adapt it before using it, in particular setting paths etc. The parser currently operates on the FLTR word file; if you want to make it work without installing FLTR, you need to make more changes. Please contact me through my website if you’re still interested.

  12. Rick, that’s a good suggestion. I haven’t implemented this because the parsing is still extremely fast. See the numbers given in my comment preceding this one: 0.01s for the actual parsing once the dictionary is read in.

    What I’ve however implemented is the following: in step 5 of the pseudo-code given above, I don’t allow snippets for which the next character is one no Thai word can start with (some vowels and tone marks). This avoids parsing “มากับ” into “มาก” and ” ับ”.

    There are more speed-ups possible, but I wanted to keep the pseudo-code simple.
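A minimal Python sketch of that extra check, for illustration only. The set `NON_INITIAL` is an assumption about which characters can never start a Thai word (following vowels, above/below vowels, tone marks and similar signs); the exact set used in the real parser is not given here, and the function name is made up.

```python
# Assumed set of characters that cannot begin a Thai word:
# U+0E30..U+0E3A (ะ ั า ำ ิ ี ึ ื ุ ู ฺ), U+0E45 (ๅ), U+0E47..U+0E4E (็ tone marks ์ ํ ๎)
NON_INITIAL = {
    chr(c)
    for c in list(range(0x0E30, 0x0E3B)) + [0x0E45] + list(range(0x0E47, 0x0E4F))
}


def longest_valid_word_at(chunk, i, words, max_len):
    """Longest dictionary word at i whose following character may start a word."""
    end = min(len(chunk), i + max_len)
    for j in range(end, i, -1):
        if chunk[i:j] not in words:
            continue
        if j == len(chunk) or chunk[j] not in NON_INITIAL:
            return chunk[i:j]
    return None


words = {"มา", "มาก", "กับ"}
print(longest_valid_word_at("มากับ", 0, words, 3))  # "มาก" is rejected here
```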

  13. Over the past two days, I’ve had a private conversation on this post that I would like to summarize here.

    There is a free Thai-English dictionary available from . You have to open an account with them, but then you can download the Lexitron 2.0 data .zip file; it is encoded in TIS-620 (ISO-8859-11). This gives you around 32’000 unique Thai headwords. Reading in this dictionary resolves the first problem with the parser I mentioned in my post: The parser now parses basically any text without stumbling over clusters of previously unseen words; manual intervention is no longer necessary.

    I’ve downloaded the Lexitron dictionary, but I’ve thrown out the English translations and kept only the Thai headwords plus the synonyms, if given. Then I’ve merged that with my FLTR database, setting the new words to unknown and keeping the synonyms in the translation field. I’m now at around 32’600 headwords in my parser (and FLTR). Processing times for the parser on a 901-word text are as follows: 0.21s for reading in the word list and doing clean-up work, 0.73s for writing the cleaned-up word list back to file, 0.12s for converting the list into a format that is suitable for fast inclusion checks (remember that the parser has to check whether a given character sequence is in the word list or not), and 0.01s for actually parsing the text. After parsing, the text can be uploaded to FLTR. FLTR takes another second or two of processing time before it displays the text, but then you’re ready to go. (Times are given for a MacBook Air 2012.)
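As a rough illustration of the decoding and the conversion for fast inclusion checks, here is a short Python sketch. It uses Python's built-in `tis_620` codec and a `frozenset`; the sample bytes are a stand-in, since the layout of the downloaded data file is not described here, and the helper names are made up.

```python
def decode_tis620(raw):
    """Decode bytes stored as TIS-620 into a Python string."""
    return raw.decode("tis_620")


def build_lookup(headwords):
    """Strip whitespace, drop empty entries and duplicates.

    A frozenset gives constant-time membership tests on average,
    which is what the parser needs for its inclusion checks.
    """
    return frozenset(w.strip() for w in headwords if w.strip())


# Stand-in for bytes read from a file, one headword per line (assumed layout)
raw = "มา\nมาก\nกลับ\nมา\n".encode("tis_620")
lookup = build_lookup(decode_tis620(raw).splitlines())
print(len(lookup), "มาก" in lookup)
```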

    Once you have a parser, it is possible to build up a corpus of Thai texts. Such a corpus can be used to find collocations, to find examples of how certain words are used etc., in a well-controlled set of texts. This can be particularly useful for beginner and intermediate learners. The real challenge here is to find texts that use only basic vocabulary; most texts, even primary school books, already use quite a large amount of vocabulary. Advanced learners can, of course, use the ultimate corpus that’s out there, Google.

  14. You could probably speed up the parser by making it 2-pass, well, by extending the conditions for creating a sentence S from the whole text.

    The first pass, you would also stick a flag before any occurrence of ‘sara ai’ (both variants ไ and ใ), ‘sara oh’ (โ) and after ‘sara am’ (ทำ) and ‘sara a’ (ะ) plus around any punctuation marks such as quotes of all types, digits and other extraneous stuff.

    These could then be regarded as several separate smaller texts to put through your main parser.

    This would cut down the time spent in your ‘i,j’ loop quite a bit (and you could throw in some parallel processing if needed).
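A minimal Python sketch of such a first pass, under some assumptions: a newline serves as the flag, the punctuation set is only an example, and the function name is made up. One caveat: the flagged positions are syllable boundaries, so a dictionary word such as ประเทศ would be cut in two, and the main parser could then no longer match it as a whole.

```python
import re


def presplit(text):
    """Cut text into smaller pieces at positions that are cheap to detect."""
    text = re.sub(r"(?=[ไใโ])", "\n", text)   # flag before sara ai / sara o
    text = re.sub(r"(?<=[ำะ])", "\n", text)    # flag after sara am / sara a
    # flag around runs of digits and (example) punctuation marks
    text = re.sub(r"([0-9\"'“”‘’(),.!?]+)", r"\n\1\n", text)
    return text.split()                         # split on flags and spaces


print(presplit("ไปทำงาน"))
```

Each returned piece can then be handed to the main parser separately.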

  15. Sure, extensive reading is vastly underestimated and should be the primary focus of our reading activity. It is also my experience that words I’ve figured out myself stick better, and, actually, I haven’t used bilingual dictionaries in my Thai learning at all. Not using a dictionary doesn’t hold you back learning the language, rather the opposite.

    Still there is sometimes the need to do intensive reading. I primarily do it as a preparation for tutoring sessions, in which we talk about a certain topic based on an article (usually a newspaper article).

  16. Very interesting article, I would like to give it a try. Andrej, are you willing to share your parser (unless I overlooked the link..)?

  17. “you remember words better when you have figured out their meaning yourself from context rather than from a dictionary”

    I’ve heard that before but haven’t put it to test in a large sample (I’m too impatient to know the answer).

    Have you tried Andrej’s suggestions with FLTR yet? The Thai parser? I’m rushing around this week so it’ll have to be later. And… I’m thinking I just might need a coding person to walk me through the parser (I already have FLTR set up).

  18. Another fan of extensive reading was Kato Lomb, a Hungarian lady who spoke 16 languages, 10 of them well enough to be a translator in.

    Her view was that intensive reading is so slow for a newcomer that it quickly becomes boring. Instead, choose some text which is likely to be of interest to you, abandon looking up words in a dictionary, she said, and keep on going through the text as quickly as possible so you can get involved in the story.

    Besides, she said, you remember words better when you have figured out their meaning yourself from context rather than from a dictionary.

    This software looks like an interesting tool to help with reading.

    BTW: Lomb wrote a 216-page book on how she learnt languages, which can be downloaded as a free PDF from http://www.cc.kyoto-su.ac.jp/information/tesl-ej/ej45/tesl-ej.ej45.fr1.pdf

