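    from collections import Counter, defaultdict

    # `sentences` (a list of strings) and `words(s)` (a tokenizer) are left for
    # the reader to define.
    word_count = Counter(w for s in sentences for w in words(s))
    sentences_by_word = defaultdict(list)
    for s in sentences:
        for w in set(words(s)):  # index each sentence once per distinct word
            sentences_by_word[w].append(s)

    def sentence_sort_key(s):
        # Ascending word counts, so the sentence's rarest word comes first.
        return sorted(word_count[w] for w in set(words(s)))

    # From the most common word down, show up to five sentences whose rarest
    # words are as common as possible.
    for w, _ in word_count.most_common():
        candidates = sorted(sentences_by_word[w], key=sentence_sort_key, reverse=True)[:5]
        for c in candidates:
            print(w, ':', c)
        input()  # press Enter to move on to the next word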
This reminds me of the "Thousand Character Text" (千字文), a Chinese poem that has been used as a primer for teaching Chinese characters to children from the sixth century onward. It contains exactly one thousand characters, each used only once, arranged into 250 lines of four characters apiece and grouped into four-line rhyming stanzas to make memorization easier.
See Also: https://en.wikipedia.org/wiki/Thousand_Character_Classic
The language learning premise in this post is a bit ridiculous - if I started with the goal of learning a language and ended up worrying about the asymptotic complexity of my automated k-book recommendation algorithm for arbitrary values of k, then I think I should worry about a serious case of procrastination.
But the algorithms are interesting, so I think a better title would have been "why submodular NP-hard problems are cool" or something similar.
Agreed - it's a bit of a ridiculous premise. Honestly you'd be better served picking up some proper Graded Readers [1] in the foreign language.
[1] https://tadoku.org/japanese/en/graded-readers-en
How would one go about dealing with that kind of procrastination? Or is it a matter of not handling distraction?
My language learning problem is slightly different and quite under-served because my motivation is not the most common one.
I want to learn so that I can read and understand mathematics publications in a foreign language, mostly Swedish, French and German. (*) For this purpose, the typical apps do not help much.
(*) I would have liked to add Latin and Greek too but that's mostly a pipe dream.
Reading old mathematicians and scholars, I have realized something that runs quite counter to the common perception we have, especially in my country.
The common perception is that school kids, especially in middle and high school, are overwhelmingly burdened by the sheer volume of subject matter they have to learn at a very young age. But then I look at educated teenagers from the 17th and 18th centuries who went on to become mathematicians or scholars: they were immensely well read at a very young age. I understand this is a biased sample, but many of these people, Newton for example, were ordinary folks (socio-economically speaking).
Hamilton (I concede that one cannot compare Hamilton with a typical modern teenager) was already quite fluent in thirteen languages in his pre-teens. Apart from the usual suspects, he knew Arabic, Hebrew, Farsi, Sanskrit, Hindi, Marathi.
This might sound atypical but this was not unheard of. One of the poets in my language was fluent in Hebrew, Greek, Italian, French, Latin, Sanskrit, Telugu, Tamil, Bengali, English.
That's quite different from what most folks are looking for when learning a new language, but I guess some techniques can apply just fine. You would have to take the lead and find and prepare your own study material.
Something like collecting phrases from these books, loading them into an SRS, collecting YouTube videos of native speakers discussing the material you are into, extracting the audio and listening to several hours of it for immersion... That is basically the way I learn, but focusing on different material.
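For the "loading them into an SRS" step, here is a minimal sketch, assuming the SRS is Anki and relying on its plain-text import (one note per line, fields separated by tabs). The function name `to_anki_tsv` and the example cards are made up for illustration.

```python
import csv
import io

def to_anki_tsv(cards):
    """Render (front, back) pairs as tab-separated text, one note per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for front, back in cards:
        writer.writerow([front, back])
    return buf.getvalue()

if __name__ == "__main__":
    # Hypothetical phrases collected from a German maths text.
    cards = [("Satz", "theorem"), ("Beweis", "proof")]
    print(to_anki_tsv(cards), end="")
```

Save the output to a `.txt` file and import it from Anki's File > Import dialog, choosing tab as the field separator if it is not detected.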
With LLMs, it is much easier to create your own study material nowadays, as you can ask them to translate, break down and explain things as you go.
Good luck! This kind of reminds me of how Bobby Fischer at a relatively young age learned Russian for the explicit purpose of being able to keep up with the best Chess manuals and periodicals - a great deal of which were coming out of the Soviet Union.
Interesting, you are learning languages for math publications but you didn't include Russian? Unless of course you are a native speaker (or you are from the former Soviet Union).
That's a great point because a lot of the maths literature I am interested in is actually in Russian (optimization, probability).
Thankfully there is the "Translations of Mathematical Monographs" book series
https://bookstore.ams.org/mmono
I had just resigned myself to the fact that I will probably never be able to learn Russian. At an optimistic best, perhaps French and Swedish only, if at all.
I understand both your historical question and the more concrete practical one. Separating reading comprehension as a skill from all the other discrete functions of a language is very straightforward [1] and, in fact, there are some good analog resources for this.
For French: Sandberg and Tatham, French for Reading
For German: Jannach, German for Reading Knowledge
I've used both and swear they're magic, especially if you're trying to learn to read in a scientific domain that you're already a specialist in (versus literature).
Once you've sort of "learned the game" it isn't very hard to do a similar process for other languages on your own. Then, my main recommendation is to take a text you're deeply familiar with in your native language or English that exists in X other language and just go ahead and start reading it with a dictionary. It starts slow, but progress is very very fast if you stick with it, especially compared to learning to speak or even just listen to a language.
For life reasons, I've found myself having to learn Danish, so I'll let you know if I figure out any good resources for Scandinavian languages.
[1] The only downside I've encountered came when I later tried to learn to speak a language I had been reading for a while: overcoming the sort of "fictitious phonetics" that existed in my head proved problematic.
In my last year of German, I brought it up a grade by reading Faust in parallel in German and a century-old Danish translation... I'm Norwegian, and the old Danish translation was a decent mid-point between Norwegian and German to let me get through the German without having to resort to the dictionary very often.
I think, for Danish, if your German is decent, look to older, more formal Danish books you can also find in German, or maybe try to find work in both Danish and Low German / Plattdeutsch and see if it forms a good midpoint for you.
Dutch might possibly also form a decent parallel - the combination of my Norwegian, German and English means I can slog my way through more formal Dutch reasonably well without ever having tried to learn it.
That's very encouraging, thanks. I hear that Norwegian and Swedish aren't very hard for an English speaker. All the best for your next language.
Apparently I was good at picking up languages other than my mother tongue, as a child (4yrs). But now those same languages that I apparently was fluent in appear quite incomprehensible, like first contact incomprehensible.
Are there other books in other languages with the same idea (reading comprehension)? Do you think they are worth reading even for readers not interested in those specific languages but in learning techniques to apply?
What makes Latin difficult in your context? My focus isn't math, and FWIW I found many very good, free, entry-level[1] self-study[2] books (Hans Orberg and others), and even Latin podcasts. There is even a fun Latin track on some of the popular language learning apps.
[1] https://archive.org/details/conspectus-grammaticus-familia-r...
[2] https://latinitium.com/best-books-for-learning-latin/
Thanks for the links.
As for difficulty, well, even English is not my first language. So Latin would be quite a stretch for me.
What makes things more difficult (this is not specific to Latin) is that maths and physics have their own language. Domain-specific words such as curvature, torsion, divergence, curl, force, power, action, moment, momentum do not translate in a way that is linguistically obvious.
I fully agree with this approach! 5 years ago I built a prototype to execute that same concept of language covering. But instead of just using words, I used n-grams. It is trained on subtitles to model spoken language, combined with SQLite in the browser to get the next sentence with the most impact.
github here: https://github.com/fdietze/ravioli
prototype deployed here: https://raviolio.web.app/
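A rough sketch of what "next sentence with the most impact" could mean with n-grams. This is an assumption for illustration, not how ravioli is actually implemented: impact is taken to be the summed corpus frequency of the n-grams in a sentence that the learner has not seen yet. The names `ngrams` and `next_sentence` and the whitespace tokenizer are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n_max=3):
    """All contiguous n-grams of length 1..n_max, as a set of tuples."""
    return {tuple(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)}

def next_sentence(sentences, seen, n_max=3):
    """Return (sentence, its n-grams) for the sentence with the highest impact.

    seen: set of n-gram tuples the learner has already met.
    """
    tokenized = [s.lower().split() for s in sentences]
    # Corpus frequency: in how many sentences each n-gram occurs.
    freq = Counter(g for toks in tokenized for g in ngrams(toks, n_max))

    def impact(toks):
        return sum(freq[g] for g in ngrams(toks, n_max) - seen)

    best = max(range(len(sentences)), key=lambda i: impact(tokenized[i]))
    return sentences[best], ngrams(tokenized[best], n_max)

if __name__ == "__main__":
    corpus = ["i am here", "i am", "you are here"]
    seen = set()
    for _ in range(2):
        sentence, grams = next_sentence(corpus, seen, n_max=2)
        print(sentence)
        seen |= grams
```

Counting n-grams rather than single words rewards sentences that reuse common word sequences, which is presumably the point of training on subtitles.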
I am fiddling with some language learning utilities myself. Can anyone recommend some relatively simple ways of tracking users' knowledge of a given language? Something like having a sorted frequency list of words/phrases/concepts, and tracking how many times each word has been seen vs used correctly vs used incorrectly, etc.?
I believe this does not have to be perfect; simplicity is preferred. But it should be just enough for an LLM to take a glimpse and estimate the user's level in a given language.
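One simple shape this could take, as a sketch under assumptions rather than a recommendation: per-word counters for seen / correct / incorrect, plus a per-frequency-band summary short enough to paste into an LLM prompt. The class names, the "known" rule (at least two correct uses and more correct than incorrect) and the band size are all arbitrary choices made for illustration.

```python
from dataclasses import dataclass

@dataclass
class WordStats:
    seen: int = 0
    correct: int = 0
    incorrect: int = 0

class KnowledgeTracker:
    def __init__(self, frequency_list):
        # frequency_list: words ordered from most to least common
        self.words = list(frequency_list)
        self.stats = {w: WordStats() for w in self.words}

    def record(self, word, event):
        """event is one of 'seen', 'correct', 'incorrect'."""
        if event not in ("seen", "correct", "incorrect"):
            raise ValueError(f"unknown event: {event}")
        # Words outside the frequency list are tracked but not summarized.
        stats = self.stats.setdefault(word, WordStats())
        setattr(stats, event, getattr(stats, event) + 1)

    def is_known(self, word, min_correct=2):
        # Arbitrary rule of thumb: enough correct uses, more right than wrong.
        s = self.stats[word]
        return s.correct >= min_correct and s.correct > s.incorrect

    def summary(self, band_size=1000):
        """One line per frequency band of the list."""
        lines = []
        for start in range(0, len(self.words), band_size):
            band = self.words[start:start + band_size]
            known = sum(self.is_known(w) for w in band)
            lines.append(f"ranks {start + 1}-{start + len(band)}: "
                         f"{known}/{len(band)} known")
        return "\n".join(lines)

if __name__ == "__main__":
    tracker = KnowledgeTracker(["the", "cat", "sat", "mat"])
    for word, event in [("the", "correct"), ("the", "correct"),
                        ("cat", "correct"), ("cat", "incorrect")]:
        tracker.record(word, event)
    print(tracker.summary(band_size=2))
```

The band summary keeps the prompt small: the LLM sees how coverage drops off along the frequency list instead of thousands of individual counters.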
I developed a few utilities to help me track the words and expressions I know in a language (and also see which words in other languages are missing). Tried to port it to an app [0], but it's not perfect yet.
[0] https://apps.apple.com/us/app/ai-anki-learning-fluentread/id...
You can give them a series of questions from hardest to easiest and place them based on where they fail according to your metric.
> People have many ways to learn a language, different for each person. Suppose you wanted to improve your vocabulary by reading books in that language. To get the most impact, you’d like to pick books that cover as many common words in the language as possible.
I think the article is just using this as a hook to introduce the submodularity of the maximum weighted cover problem. But I'll talk about a different way of using the same collection of books to learn a language that I think is better.
First of all, you'll probably want to take into account which words you already know, instead of just removing stopwords. If a book uses lots of common words, but you already know them, you're not learning much.
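For reference, the article's framing with this first adjustment applied can be sketched as a greedy pick for weighted coverage: repeatedly take the book that adds the most weight in words not yet covered, with already-known words counted as covered from the start. For monotone submodular objectives like this one, the greedy choice is the standard approximation. The function name, the input format (title mapped to a set of words) and the toy weights below are assumptions for illustration.

```python
def pick_books(books, word_weight, k, known=frozenset()):
    """Greedily choose up to k books by marginal weighted word coverage.

    books: dict mapping title -> set of words in that book
    word_weight: dict mapping word -> weight (e.g. corpus frequency)
    known: words the learner already knows; they add no value
    """
    covered = set(known)
    remaining = dict(books)
    chosen = []

    def gain(title):
        # Marginal gain: total weight of this book's words not covered yet.
        return sum(word_weight.get(w, 0) for w in remaining[title] - covered)

    while remaining and len(chosen) < k:
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break  # no remaining book teaches anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen

if __name__ == "__main__":
    books = {"A": {"the", "cat", "sat"},
             "B": {"the", "dog"},
             "C": {"cat", "dog", "ran"}}
    weights = {"the": 10, "cat": 5, "dog": 5, "sat": 1, "ran": 1}
    print(pick_books(books, weights, k=2))
    print(pick_books(books, weights, k=2, known={"the"}))
```

Note how knowing a single very common word changes which book comes first: a book that scores well mostly on words you already know drops down the ranking.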
Secondly, no matter how much or how little you already know, you're unlikely to find a book that fits your level well. If you're just beginning to learn the language, no matter which book you pick, the very first sentence will be full of new words, but most of those will be rare ones that you won't encounter again until much later. If on the other hand you already have a very good command of the language, you might be able to breeze through entire chapters and only pick up a handful of new words. (If your primary goal is to enjoy books rather than achieving mastery of the language, this is of course perfectly fine.)
So what I do is split the entire collection into sentences, and for each word from most common to least, pick a small number of sentences using this word, ideally without also having much rarer words, try to read and understand them all, and then use the most suitable sentence to make an Anki flashcard. It's much easier to find a sentence at the right level than an entire book.
It can be a bit weird to learn about the plot of a book piecemeal out of order, especially if multiple books are mixed together, but I think it's an interesting experience.
The same principle can also be applied to recordings from Mozilla Common Voice: https://commonvoice.mozilla.org/en/datasets I like to use them for dictation exercises in Anki, where the card plays a recording and I type in what I thought I heard to check whether I got it right.
Do you have an automated method of doing the filtering, or is this all manual?
The sorting is automated.
(Add epicycles for defining what a word is, what a sentence is, ensure the candidate sentences have varying lengths, keep track of which words and sentences were already seen...)
The final step of choosing one sentence and turning it into an Anki flashcard is manual.
This is a bad way to go about it. You want to consume more material, and you want each piece of it to have the least impact on your vocabulary.
So maybe looking for high frequency words is good, but only high frequency words that you know. So the most coverage of the most high frequency words would be very bad. To get the most coverage of the most high frequency words, they'd have to be used at a lower frequency than they normally are, with less repetition in the natural contexts that enable the learner to build meaning. Unless the books were longer, which makes the concept of concentrating common vocabulary in very few books degenerate (just read two 2000-page books!).
Reading a bunch of stuff with a concentrated dose of tons of words you don't know will leave you with absolutely no retention. If you know every word but one in e.g. a chapter, you'll probably remember that word forever. The concept is called comprehensible input - you set unfamiliar things in a background of familiar things.
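To make the "every word but one" idea concrete, here is a small sketch that filters passages down to those with at least one and at most `max_unknown` unfamiliar words. The tokenizer is deliberately naive (runs of letters, lowercased), and the function names and the threshold are assumptions for illustration.

```python
import re

def tokenize(text):
    # Naive tokenizer: lowercase runs of letters (splits on apostrophes too).
    return re.findall(r"[^\W\d_]+", text.lower())

def unknown_words(text, known):
    return {w for w in tokenize(text) if w not in known}

def comprehensible(passages, known, max_unknown=1):
    """Return (passage, unknown words) pairs with 1..max_unknown new words."""
    result = []
    for passage in passages:
        unknown = unknown_words(passage, known)
        if 1 <= len(unknown) <= max_unknown:
            result.append((passage, unknown))
    return result

if __name__ == "__main__":
    known = {"the", "cat", "sat", "on", "mat"}
    passages = ["The cat sat on the mat.",
                "The cat sat on the sofa.",
                "A dog ran quickly."]
    for passage, unknown in comprehensible(passages, known):
        print(passage, unknown)
```

Passages with no new words are skipped because they teach nothing, and passages with too many are skipped because there is not enough familiar background to carry them.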
If you want a book with the most unfamiliar vocabulary, it's called a dictionary. It contains all of the most commonly used words, and the least commonly used, too.
In fact, maybe this makes sense if you're going to be locked in a cell for 10 years, you want to learn a language starting from zero[*], and you only get to have a pocket dictionary and two other books (with a size limit). You might want to have sample natural sentences for as many of the best words to know as you could.
The real algorithmic language learning trick is to write books that are interesting that use the fewest words (which would inevitably be the most important words to use to communicate but not necessarily the most common words that natives use to communicate), and introduce new, useful words at a steady rate. That seems like how Capretz put together French in Action. It's also graded readers: I still remember the moment I realized that I could not only understand what was happening in the basic graded reader I'd accidentally picked up on a whim, but also I was interested in finding out what was going to happen next. It's been downhill from there.
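One way to check the "steady rate" property on an existing text is to count how many first-time words each chapter introduces. This is only a sketch: the function name is made up, and chapters are assumed to be pre-tokenized lists of words.

```python
def new_words_per_chapter(chapters, known=frozenset()):
    """Count the words each chapter introduces for the first time.

    chapters: list of token lists, in reading order
    known: words the reader already knows before chapter one
    """
    seen = set(known)
    counts = []
    for tokens in chapters:
        new = set(tokens) - seen
        counts.append(len(new))
        seen |= new
    return counts

if __name__ == "__main__":
    chapters = [["a", "b", "a"], ["b", "c"], ["c", "d", "e"]]
    print(new_words_per_chapter(chapters))
```

A graded reader should show a fairly flat profile, while an ordinary novel read from zero will show a huge first chapter and a long tail.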
-----
[*] or maybe from one? You would have to have some familiarity with the script, and it had better be a phonetic one. Otherwise, this would be just learning how to read a language. No English, no French, no Portuguese, no Chinese... although having poetry books might help, because you can be surer of vowel similarities and syllable breaks. Poetry books are not dense, however, and might bump against any size limit. And the vocabulary would be weird and not representative.
There are many apps that have utilized formal methods in an attempt to teach languages as optimally as possible. But Duolingo is still the leader in language learning. Why? Language learning is an emotional process. Every word you can bring to mind likely has some specific memories tied to it, from another time and place. So even though Duolingo is far from optimal in terms of how and when to present new items to learn, it is close to optimal in vibes, and apparently in the market of language learning this is what consumers prioritize over all else. I believe it is for good reason. Whoever displaces Duolingo will do so not because they teach more efficiently, but because they improve on embedding particular emotions and sentiments into the lessons.
Duolingo isn't even language learning. It's closer to TikTok: it produces dopamine without actually teaching very much at all.
Turns out that most consumers just want to feel like they're learning a language instead of doing the actual work, or in extreme cases, literally only care about maintaining their streak or leaderboard score.
Agreeing with you that Duolingo seems more like a nudging/psychological manipulation testbed with a thin veneer of language learning on top to provide legitimacy.
But what makes you think that this is because "most consumers just want" it that way? The whole effect of dopamine hits is to manipulate what users believe they "want". But you cannot claim to be working in the interests of your users after you manipulated them.
I.e. if a user installed Duolingo because they genuinely wanted to learn the language and then got sidetracked by all the gamification stuff, I don't think you can say they "really" just wanted to play games the whole time.
(Duolingo is walking a fine line here, which was probably the reason they picked language learning in the first place: Because in that field, users really do want a certain degree of nudging and manipulation, to help them keep up with the tedious process of frequent repetition.
That was sort of the official value proposition of Duolingo and I think the reason why many users installed it. It's also why many of the nudging strategies work at all, because they can assume a cooperating user.
But if you use the app, you can see that it frequently tries to push beyond that mutually agreed purpose: Trying to upsell you to the paid version, invite friends, take part in global leaderboard challenges, etc - all of which has very little to do with language learning)
You are making the mistake of assuming that the largest market / largest user base app is also doing the most language instruction.
Duolingo is one of the worst apps out there for language learning, and its users are not practicing useful language skills. It’s a gamified system that feels like language learning, without actually having any substance.
Or it just requires the lowest effort. Or it is the most gamified language learning app.
Or ...
Nonsense. By what metrics do you consider it “the leader”? Popularity forced by marketing? I don’t know a single serious language learner that swears by Duolingo. My gf, who spent at least 100 days on Duolingo, migrated to Babbel.