Building a French verb conjugator
Many Indo-European languages conjugate verbs for tense, aspect, person, and so on. Students of these languages need help remembering how each verb is conjugated. There are many places online where French conjugation tables can be looked up, but they all have problems! Either they're unergonomic, or slow, or don't support reverse conjugation (searching via a conjugated form rather than the infinitive), or they simply don't cover many verbs.
This stab at a French verb conjugator hopefully doesn't make those mistakes. It's designed to be fast, ergonomic, and usable offline (everything runs in the client), with strong verb coverage. Building something that does all this means solving a few challenges; none is difficult, but together they provided a decent amount of variety, so I document how it was built here.
Obtaining the conjugations
Around 90% of French verbs are regular (follow one conjugation pattern), and there are only a few dozen other conjugation patterns in total, so in theory the conjugation tables should be easy to come by. One approach to generating the tables would be to describe the underlying rules programmatically. For example, to conjugate a regular group 1 verb like aimer (to love), the algorithm would look something like:
- Chop off the ‘er’ to find the root (aimer → aim)
- Append characters to the root to form the conjugation:
  - For the indicative present first person singular: add an e to the root (j'aime)
  - For the ... second person singular: add an es to the root (tu aimes)
  - For the ... third person singular: add an e to the root (il/elle aime)
  - …
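The steps above can be sketched in Python. This is only a rough illustration, not the code behind the conjugator: the names `PRESENT_ENDINGS` and `conjugate_present` are hypothetical, it covers just the present indicative, and it assumes a fully regular group 1 verb with no respelling or elision.

```python
# Present indicative endings for a fully regular group 1 (-er) verb,
# in the order: je, tu, il/elle, nous, vous, ils/elles.
PRESENT_ENDINGS = ["e", "es", "e", "ons", "ez", "ent"]


def conjugate_present(infinitive: str) -> list[str]:
    """Naively conjugate a regular -er verb in the present indicative."""
    if not infinitive.endswith("er"):
        raise ValueError(f"{infinitive!r} is not an -er verb")
    root = infinitive[:-2]  # chop off the 'er': aimer -> aim
    return [root + ending for ending in PRESENT_ENDINGS]


print(conjugate_present("aimer"))
# ['aime', 'aimes', 'aime', 'aimons', 'aimez', 'aiment']
```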
In practice it's not so simple. Even the group 1 verbs are less regular than they initially appear; for example certain root endings require respelling to maintain pronunciation regularity, and some words just have multiple valid conjugations (e.g. je paie vs je paye for the verb payer, both of which are common). Other group 1 verbs are defective, meaning that they lack certain conjugations (like neiger), something which the conjugator should make clear to the user.
Nonetheless the programmatic approach clearly should work given a hearty amount of effort. Fortunately the contributors to Wiktionary have expended that effort: the conjugation tables there are powered by some ropey Lua code which is invoked with customisations where necessary by each verb page.
I'm not interested in trying to write the conjugation code – it's not the part of the problem that interests me and anyway I'd like to extend the conjugator to support other languages – so I stand on their shoulders by taking a dump of Wiktionary, fetching the list of all French verbs, and then scraping each verb page to extract the conjugation tables. There are around 6000 verbs in total.
Compression
The downside of having the conjugation tables and not the underlying rules used to generate them is the space taken up. Concatenating all verb conjugations results in a 3.2 MB string – not dreadful but a bit bulky for what should be a snappy web app. Compressing that string with gzip takes it to 600 KB, but we'd still have to work with the uncompressed data in memory, and surely we can do better than gzip by knowing a bit about how conjugations are generally formed.
The idea therefore is to associate each verb with an inferred conjugation table template, which is made up of pattern entries of the form ‘chop off the last n characters from the infinitive and then append the string x’. Using the example of aimer again:
conjugation | pattern |
---|---|
j'aime | n = 1, x = '' |
tu aimes | n = 1, x = 's' |
… | … |
ils aimeraient | n = 0, x = 'aient' |
j'aimai | n = 2, x = 'ai' |
… | … |
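One plausible way to derive and apply such a pattern is via the longest common prefix of the infinitive and the conjugated form. The sketch below makes that assumption; the function names `infer_pattern` and `apply_pattern` are hypothetical, and forms are assumed to be stored without their pronouns.

```python
def infer_pattern(infinitive: str, form: str) -> tuple[int, str]:
    """Return (n, x): chop n characters off the infinitive, then append x."""
    # Length of the longest common prefix of the two strings.
    shared = 0
    for a, b in zip(infinitive, form):
        if a != b:
            break
        shared += 1
    return len(infinitive) - shared, form[shared:]


def apply_pattern(infinitive: str, pattern: tuple[int, str]) -> str:
    """Rebuild a conjugated form from the infinitive and an (n, x) pattern."""
    n, x = pattern
    # Slice by length rather than [:-n], which would break for n = 0.
    return infinitive[: len(infinitive) - n] + x


for form in ["aime", "aimes", "aimeraient", "aimai"]:
    print(form, infer_pattern("aimer", form))
# aime (1, '')
# aimes (1, 's')
# aimeraient (0, 'aient')
# aimai (2, 'ai')
```

A fully irregular form still fits the scheme, just less compactly: going from être to suis shares no prefix, so the pattern chops the whole infinitive and appends the whole form.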
Given the regularity of the language there are vastly fewer of these templates than there are verbs – around 100, it transpires. The patterns, full list of verbs, and some other metadata discussed below can therefore be stored in a JavaScript file of less than 300 KB, which can be gzipped down to ~50 KB.
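To see why so few templates are needed, here is a small sketch of deduplicating templates across verbs. The data is a toy subset (three forms per verb, pronouns stripped), and `infer_pattern` and `build_templates` are hypothetical names rather than the real build step.

```python
import os


def infer_pattern(infinitive: str, form: str) -> tuple[int, str]:
    """Return (n, x): chop n characters off the infinitive, then append x."""
    shared = len(os.path.commonprefix([infinitive, form]))
    return len(infinitive) - shared, form[shared:]


# Toy data: je, tu and nous forms of the present indicative only.
TABLES = {
    "aimer": ["aime", "aimes", "aimons"],
    "parler": ["parle", "parles", "parlons"],
    "finir": ["finis", "finis", "finissons"],
}


def build_templates(tables: dict[str, list[str]]):
    """Return (templates, verb_to_template), storing each distinct template once."""
    template_ids: dict[tuple, int] = {}
    verb_to_template: dict[str, int] = {}
    for infinitive, forms in tables.items():
        template = tuple(infer_pattern(infinitive, form) for form in forms)
        # Reuse the existing id if this template has been seen before.
        verb_to_template[infinitive] = template_ids.setdefault(
            template, len(template_ids)
        )
    return list(template_ids), verb_to_template


templates, verb_to_template = build_templates(TABLES)
print(len(templates))      # 2
print(verb_to_template)    # {'aimer': 0, 'parler': 0, 'finir': 1}
```

Each verb then costs only its infinitive plus a small integer, while the per-form strings live in the shared templates.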
Default conjugation
Which conjugation should be displayed by default for verbs that have multiple valid ones? For example, when the user asks to conjugate essayer, should she see j'essaie or j'essaye by default? I set the default conjugation to be the one which is more commonly used in recent years in France, which I infer from Google search popularity using Google Trends via pytrends. Back to the essayer example: with the –ayer verbs it's the –ie– forms which are more often used, so those are what you'll see by default.
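The selection step itself is small once popularity scores are available. The sketch below assumes the scores have already been fetched (for example with pytrends) into a plain dictionary; `choose_default` is a hypothetical name and the numbers are made-up placeholders, not real Google Trends values.

```python
def choose_default(variants: list[str], popularity: dict[str, float]) -> str:
    """Pick the variant with the highest popularity score.

    Variants missing from `popularity` score 0; ties keep the earlier variant.
    """
    return max(variants, key=lambda variant: popularity.get(variant, 0.0))


# Placeholder scores for illustration only.
scores = {"j'essaie": 80.0, "j'essaye": 20.0}
print(choose_default(["j'essaye", "j'essaie"], scores))  # j'essaie
```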
Autocompletion
Fast and comprehensive search is an essential feature. To save memory, I don't use a full search trie built from all conjugations. Instead I build a map which is conceptually just a one-level trie: it maps each two-character prefix to all verbs whose infinitive, or one of whose conjugated forms, starts with those characters. For example the mapping contains an entry from ‘su’ to verbs like suivre, supposer, and suffire, but also être (because one of its conjugations is suis). When a word is typed in the verb search box I look up the candidates for its first two characters, then conjugate each candidate on the fly and check its forms against the rest of the search string. This is fast (< 2 ms on my computer) because there's always a manageable number of candidates which match the first two characters.
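A minimal sketch of this two-character index follows. The verb data is a toy subset, `conjugate` is a stand-in for expanding a verb's template, and the names `build_index` and `search` are hypothetical; the real app runs in the browser, so this only illustrates the idea.

```python
from collections import defaultdict

# Toy conjugation data (a real table would hold every form of every verb).
FORMS = {
    "suivre": ["suis", "suit", "suivons"],
    "supposer": ["suppose", "supposes", "supposons"],
    "être": ["suis", "es", "est", "sommes"],
}


def conjugate(infinitive: str) -> list[str]:
    """Stand-in for expanding a verb's template into its conjugated forms."""
    return FORMS[infinitive]


def build_index(verbs) -> dict[str, set[str]]:
    """Map each two-character prefix to the verbs with a form starting with it."""
    index = defaultdict(set)
    for verb in verbs:
        for word in [verb, *conjugate(verb)]:
            if len(word) >= 2:
                index[word[:2]].add(verb)
    return index


def search(query: str, index) -> list[tuple[str, str]]:
    """Return (infinitive, matching form) pairs for a query of 2+ characters."""
    matches = []
    for verb in sorted(index.get(query[:2], ())):
        # Conjugate on the fly and test the full query against each form.
        for word in [verb, *conjugate(verb)]:
            if word.startswith(query):
                matches.append((verb, word))
                break  # one hit per verb is enough
    return matches


index = build_index(FORMS)
print(sorted(index["su"]))    # ['suivre', 'supposer', 'être']
print(search("suis", index))  # [('suivre', 'suis'), ('être', 'suis')]
```

Only the infinitives sit in the index values, so memory grows with the number of verbs per prefix rather than with the number of conjugated forms.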
Once all matches have been found there's the question of ordering them. I sort based on which verbs are most common in spoken French as inferred from film subtitles. That data comes from Boris New and Christophe Pallier's database Lexique.
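The ordering step could look like the sketch below. It assumes the subtitle frequencies have been loaded into a dictionary keyed by infinitive; `rank_matches` is a hypothetical name and the scores are arbitrary placeholders, not values taken from Lexique.

```python
def rank_matches(matches: list[str], frequency: dict[str, float]) -> list[str]:
    """Order verbs from most to least frequent; unknown verbs go last."""
    return sorted(matches, key=lambda verb: -frequency.get(verb, 0.0))


# Placeholder scores for illustration only.
frequency = {"être": 3.0, "suivre": 2.0, "suffire": 1.0}
print(rank_matches(["suffire", "suivre", "être", "supposer"], frequency))
# ['être', 'suivre', 'suffire', 'supposer']
```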
Fin.
I think those are the interesting parts. Get in touch if you're interested in extending it to other languages or want to know more about anything I've skipped over.