This is the second in a series of posts about Isosceles, a corpus of 19th-century French and English literature. The first post described the corpus itself. This post covers the annotation methodology: how the French texts are converted to CoNLL-U format, what goes wrong, and how cross-pipeline comparison and targeted LLM review bring the annotations to training-data quality. Readers are assumed to be familiar with Universal Dependencies and CoNLL-U format.
Toward a better model for linguistic annotation of 19th-century French literature
Overview
It's well known that NLP tools trained on modern corpora degrade in practice when used on historical literary text (Gladstone, Fang, & Stewart, 2025). In my experience attempting to annotate this corpus of ~7 million tokens of French literary text from 1840 to 1920, I found that Stanza, spaCy, and CoreNLP all exhibit systematic errors on passé simple constructions, rare subjunctive forms, and irregular verb morphology, among other things. Manual correction would be impractical for a corpus of this size, and an expensive round of LLM annotation would tend to introduce error patterns of its own1 Gladstone et al. describe a related approach using full-sentence LLM annotation, but acknowledge that their synthetic ground truth likely still contains systematic errors. .
This post describes an alternative: a layered correction pipeline that combines rule-based checks, cross-pipeline comparison, and targeted local LLM classifiers to flag likely errors, then escalates only the flagged tokens to a stronger model for final adjudication. This approach is cheaper and more accurate than sending raw parser output through a single, more powerful LLM, and can be run with on-device models on consumer hardware2 A MacBook M4 Pro with 24 GB RAM, in my case. .
I've applied this pipeline to a subset of the ELTeC-fra corpus — about 600,000 tokens from 80 novels — to produce high-quality CoNLL-U that will eventually serve as training data for a domain-adapted Stanza model for 19th-century literary French. If the resulting model generalizes well, I'll use it to annotate the remaining 6 million of tokens in the Isosceles corpus. I plan to release the model publicly when training completes, around mid-March 2026.
Choosing an annotation stack
My goal is to get high-quality, UD v2-compliant CoNLL-U with reliable tokenization, lemmas, UPOS tags, and morphological features. I evaluated three tools for the purpose: spaCy (fr_dep_news_trf), Stanza (GSD treebank model), and Stanford CoreNLP.
For sentence segmentation, I chose CoreNLP for its consistency across French and English texts. Stanza had two problems that made it less than ideal in this task for my use case: it splits sentences at semicolons in French but not in English, and, in French only, occasionally stumbles on paragraph-final quotation marks. spaCy had pervasive segmentation problems in French that made it entirely unusable for this purpose.
For annotation, Stanza's GSD model produces the most consistent UD v2 output. I initially tried the default "combined" model, which was trained on multiple treebanks, but the mixture of annotation conventions there tended to produce inconsistent results, as Stanza's own documentation warns.
spaCy was initially attractive for its stronger dependency parsing. But, in the end, I found that none of the available parsers could produce satisfactory structural parses for my purposes, and I decided to leave dependency parsing out of scope. With dependency relations off the table, then, Stanza became the better choice given its richer set of morphological features and native UD v2 compliance.
CoreNLP's French lemmatizer turned out not to be competitive with these more recent tools; for most tokens, it simply returns the surface form or its lowercased variant.
What Stanza gets wrong
Passé simple tagged as present tense
This is the largest single error class. In much first-person narration in French, passé simple is the primary narrative tense. But Stanza frequently cannot distinguish this from present tense.
Fabricated infinitives
Stanza's lemmatizer sometimes invents infinitives that don't exist in French:
trouvâmes→trouvâmer(should betrouver)parurent→pararer(should beparaître)haïssez→haïsser(should behaïr)buvaient→buver(should beboire)entrâmes→entrâmer(should beentrer).
Cross-verb confusion
Sometimes Stanza assigns forms to the wrong real verb, attracted by character-level prefix overlap. A particularly systematic case is prier/prendre: I found that every short form of prier (prie, pria, priez) is lemmatized as prendre. The pri- prefix is shared with high-frequency prendre forms (pris, prit, prend), and presumably the character-level lemmatizer model can't distinguish them. Longer forms where the -ier conjugation pattern is visible (priait, prié, priant) are correctly assigned.
A few other confirmed cross-verb confusions:
durerait→devoir(should bedurer)dirent→devoir(should bedire)pourrions→pourrir(should bepouvoir)
Conditional and subjunctive misanalysis
Some verb forms are being labeled as the wrong tense. For example, conditional forms are marked as present tense, and literary past subjunctive forms are marked as simple indicative forms. But these forms have distinct endings in French. A word like retournerait, for example, cannot be present tense — its ending marks it as conditional (Grevisse & Goosse, 2016, §809).
The layered correction pipeline
The pipeline combines three layers of detection. Tokens flagged by any layer are escalated to a stronger model (Claude Sonnet) for final adjudication, described in a later section.
1. Rule-based filters
The first layer handles cases where errors can be detected deterministically with high likelihood:
-
Each lemma is checked against a period-appropriate dictionary, Littré's Dictionnaire de la langue française (1872–1877).
-
Some verb forms are unambiguous on their own: the passé simple endings
-âmes,-âtes,-èrentcan only belong to one tense (Grevisse & Goosse, 2016, §803). A small set of heuristics detects violations of these rules. -
A simple subject-verb agreement checker flags mismatches between unambiguous subject pronouns and the verb's
Personfeature.
2. Cross-pipeline comparison
Although I found Stanza more reliable overall, it proved useful to also have outputs from spaCy on hand for comparison. In particular, cases where Stanza and spaCy disagree have turned out to provide a valuable error-detection signal.
To leverage this signal, I built a comparison tool that aligns Stanza and spaCy outputs by token (handling MWT expansion differences) and checks whether the two parsers converge on the same annotation or not. Across the 600k tokens of my training set (i.e. chunks from 80 ELTeC-Fra novels):
| Category | Count |
|---|---|
| Substantive lemma disagreements | 41,092 |
| Form-as-lemma (Stanza uses surface form) | 7,202 |
| Case-only differences | 3,471 |
| Total Stanza lemma outliers | 51,765 |
Not all of these are errors. Many are convention differences: Stanza follows GSD conventions for pronoun lemmas (lui for 3sg subject, moi for 1sg), while spaCy normalizes differently (il, je). Gendered noun lemmas, ligature normalization (sœur/soeur), and the participle-adjective boundary account for much of the rest.
After applying some heuristics to filter out these merely conventional disagreements, roughly 20,000 substantive disagreements remain.
Tense and mood
The comparison extends to morphological features. Across the 600k tokens of the test set, almost 17k tokens show tense or mood disagreements between the two pipelines. The most frequent patterns:
| Count | Stanza | spaCy |
|---|---|---|
| 976 | Ind Pres | Ind Past |
| 728 | Ind Pres | Ind Fut |
| 649 | Ind Pres | Ind Imp |
| 481 | Ind Pres | Cnd Pres |
The top three patterns alone flag 2,353 tokens — the same passé simple, future, and imparfait errors found during manual review. SpaCy is not always right (it over-corrects toward passé simple on stative verbs like s'agir, and mishandles imperfect subjunctive), but it often catches tense errors that were missed entirely by the local LLM models used later in the pipeline.
3. Targeted LLM classifiers
For situations requiring contextual judgment that neither above approach can easily provide, I used a set of local LLM classifiers, each tuned to a single error type.
I evaluated four models — Mistral-Nemo (12B), Qwen 3 (8B), Gemma 3 (12B), and Mistral-Small 3.2 (24B, Q4 quantized to fit 24 GB RAM) — across five classifier tasks on test chunks from several ELTeC authors (Audoux, Allais, Balzac, Flaubert, Montagne, Rolland). In general, the Mistral models performed best at the lemma task, while Qwen and Gemma had strengths in grammatical judgments. The task and the selected model are described below:
- Lemma correctness (Mistral-Small 3.2). Where both the current lemma and a candidate replacement are real words (e.g., prier vs prendre), a classifier asks whether the current lemma is correct in context given GSD treebank conventions. This complements the Littré dictionary check, which only catches non-words.
- Tense/mood tagging (Qwen3 + Gemma3 (union)). A classifier flags verb forms whose tense or mood tag may be wrong, targeting the passé simple/present and conditional/indicative confusions described above.
- AUX/VERB distinction (Qwen3). French auxiliary verbs (avoir, être) are tagged AUX in some syntactic contexts and VERB in others. A classifier checks whether the current tag is correct.
quedisambiguation (Gemma3). The word que can be a relative pronoun (PRON), a complementizer (SCONJ), or a restrictive adverb (ADV). Stanza sometimes assigns the wrong category, and the correct choice depends on syntactic context.- ADJ↔ADV misclassification (Qwen3). A closed list of 28 lemmas that can function as either ADJ or ADV depending on context (23 from the GSD treebank, 5 common 19th-century additions). A classifier checks each occurrence against per-word usage guidance.
As a post-processing step, I discarded any cases of model self-contradiction: any flag where the model's proposed correction is the same as the current value. This addresses a frequent deficiency across all LLM models tested.
Summary
The resulting pipeline architecture, after three rounds of evaluation and refinement:
| Classifier | Best model | Notes |
|---|---|---|
| Lemma | Littré + Mistral-Small 3.2 24B | Dictionary for non-words, LLM for real-word confusions |
| Tense/mood | Qwen 3 8B + Gemma 3 12B (union) | Qwen for precision, Gemma for recall backup |
| AUX/VERB | Qwen 3 8B | Only model to resist the "eut" trap |
que | Gemma 3 12B | Only model with zero false positives |
| ADJ↔ADV | Qwen 3 8B | Closed word list + syntactic judgment |
Final review with Sonnet
The flags from the local classifiers are submitted to Claude Sonnet (claude-sonnet-4-5). Each prompt contains the sentence, the flagged token with its current annotation, a targeted set of instructions about GSD treebank conventions, and the proposed correction (which Sonnet is free to change if necessary). Sonnet then returns a structured JSON response: either "action": "correct" with updated fields, or "action": "no_change" with a reason.
Sonnet's review is consistently very good but not perfect. It tends to fail, for example, on forms that are genuinely ambiguous between two tenses. The 3sg -it ending of second-group verbs is identical in present and passé simple (envahit, grandit, finit). When a classifier or spaCy flags one of these, Sonnet sometimes accepts the change even when context clearly favors present tense. This failure is consistent across models (Opus produces the same error) and across temperature settings. For these ambiguous forms, a contextual heuristic upstream may be needed: if all other finite verbs in the sentence are present tense.
Still, the resulting patched CoNLL-U is considerably more accurate than the raw outputs from Stanza. Soon I'll follow up with some specific numbers.
What's next
Once processing completes in early March, 2026, my training set of ELTeC French (~600,000 words across 80 novels) will be used to train a domain-adapted Stanza model for literary French. The question is how much the corrected training data improves tagging accuracy on held-out literary text compared to the stock GSD model. I'll report on that when the time comes.
The corpus, annotation scripts, and comparison tools are available at github.com/myersm0/isosceles.
References
- Gladstone, C., Fang, Z., & Stewart, S. D. (2025). Ground truth generation for multilingual historical NLP using LLMs. Proceedings of the Conference of the Alliance of Digital Humanities Organizations.
- Grevisse, M., & Goosse, A. (2016). Le Bon Usage (16th ed., §§803, 809). De Boeck Supérieur.