running piccl to correct words in a simple wordlist #58

Irishx · 2020-08-18T15:18:11Z

This is mostly a remark on how you can use ticcl.nf to correct a lexicon-style list of words. PICCL is intended for spelling correction at document level. However, it can also be applied to a wordlist and still gives reasonable results.

The input is a list of words, one word per line, so you run ticcl.nf with --inputtype text.
The official output of ticcl.nf is an XML file, *.ticcl.folia.xml, but ticcl.nf also produces several intermediate files. For a quick look at the corrections without the superfluous XML, you can use the information in the *tsv.clean.ldcalc.ranked file, which is a list of: orig-word, edit-distance-corrected-word, technicalcode-word1, technicalcode-word-2, certainty of algorithm.

martinreynaert · 2020-08-19T19:50:46Z

Thank you Irishx for the above elucidations!

The column contents of the output as described above are not quite correct, however. I explain below.

PICCL is a workflow system (based on NextFlow components such as ticcl.nf). TICCL is a (digitized) text correction and normalization system consisting of several modules.

Let us take a look at some actual TICCL system output. What I present next is not TICCL-rank output, but output from the subsequent module, TICCL-chain.

This output is very similar to TICCL-rank's except for the last column. TICCL-chain takes TICCL-rank's output as its input.

Both outputs consist of 7 columns separated by '#' (hash) characters.

I present an extract from a *chained file (*chained being the file extension for TICCL-chain output). It is based on a corpus of Dutch National Archives' Notarial Deeds from the 'Golden Century', Haarlem region. Handwritten Text Recognition courtesy of Transkribus.

We present six HTR (or, possibly, regional diachronic) variants corrected by TICCL to 'schilderijtjes', i.e. small paintings:

schildenijtjes#1#schilderijtjes#100000057#596286601#1#C
schildereitjes#1#schilderijtjes#100000057#15434340889#2#C
schildereytjes#2#schilderijtjes#100000057#1630347719#2#C
schildergties#1#schilderijtjes#100000057#35629811471#3#C
schildergtjes#8#schilderijtjes#100000057#23296010607#2#C
schilderij_tjes#1#schilderijtjes#100000057#11040808032#1#C

Column 1: word variant
Column 2: observed corpus frequency (corpus here was 100K pages of HTR)
Column 3: best-first ranked TICCL correction candidate (CC)
Column 4: TICCL 'artificial' frequency (here: 100,000,000) augmented with the observed corpus frequency (57)
Column 5: Anagram value (AV) difference between the variant and its CC. Denotes a particular character confusion between variant and CC.
Column 6: Levenshtein Distance (LD)
Column 7: C for 'chained'

Note: underscores in either Column 1 or 3 denote spaces: in the HTR of these Notarial Deeds the last example given above is a bigram, i.e. a split word. We effected word bigram (and trigram) correction on this corpus.
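As an illustration of the column layout just described, here is a minimal Python sketch that splits one *chained line into named fields. The class name, the field names and the helper function are hypothetical (they are not part of ticcltools), and the artificial frequency of 100,000,000 is simply taken from the Column 4 note above.

```python
from dataclasses import dataclass

# Assumption taken from the column notes above: the artificial frequency
# used for this corpus was 100,000,000.
ARTIFICIAL_FREQ = 100_000_000


@dataclass
class ChainedRecord:
    variant: str         # column 1, underscores restored to spaces
    variant_freq: int    # column 2
    candidate: str       # column 3, underscores restored to spaces
    candidate_freq: int  # column 4, exactly as written in the file
    av_difference: int   # column 5
    ld: int              # column 6
    flag: str            # column 7 ('C' in *chained files)

    @property
    def candidate_corpus_freq(self) -> int:
        """Column 4 with the artificial frequency removed, if it was added."""
        if self.candidate_freq >= ARTIFICIAL_FREQ:
            return self.candidate_freq - ARTIFICIAL_FREQ
        return self.candidate_freq


def parse_chained_line(line: str) -> ChainedRecord:
    """Split one '#'-separated line into its 7 columns."""
    fields = line.strip().split("#")
    if len(fields) != 7:
        raise ValueError(f"expected 7 '#'-separated columns, got {len(fields)}")
    variant, vfreq, candidate, cfreq, av, ld, flag = fields
    return ChainedRecord(
        variant=variant.replace("_", " "),
        variant_freq=int(vfreq),
        candidate=candidate.replace("_", " "),
        candidate_freq=int(cfreq),
        av_difference=int(av),
        ld=int(ld),
        flag=flag,
    )


record = parse_chained_line(
    "schilderij_tjes#1#schilderijtjes#100000057#11040808032#1#C"
)
print(record.variant, "->", record.candidate, record.candidate_corpus_freq)
# prints: schilderij tjes -> schilderijtjes 57
```

The sketch assumes that '#' never occurs inside a word form; a file where it does would need a more careful split.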

Main differences with TICCL-rank output:
a/ TICCL is usually set to work with an LD limit of two edits. TICCL-rank cannot have higher values in Column 6 than the actual limit that was set. TICCL-chain collects variants and can go way higher, here: 3.
b/ TICCL-rank in Column 7 gives a kind of confidence measure derived from TICCL's ranking features used in TICCL-LDcalc and TICCL-rank. This is lost during chaining since the original word pairs are often discarded.
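To make the Column 6 values concrete, the following Python sketch recomputes the LD for the pairs in the extract using the textbook Levenshtein distance (insertions, deletions and substitutions each cost 1). This is an assumption about how the value can be reproduced, not a description of TICCL's internal implementation, and the function name is my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


# (variant, candidate, Column 6 value from the extract above)
pairs = [
    ("schildenijtjes", "schilderijtjes", 1),
    ("schildereitjes", "schilderijtjes", 2),
    ("schildereytjes", "schilderijtjes", 2),
    ("schildergties", "schilderijtjes", 3),
    ("schildergtjes", "schilderijtjes", 2),
    ("schilderij_tjes", "schilderijtjes", 1),
]

for variant, candidate, listed_ld in pairs:
    print(variant, levenshtein(variant, candidate), listed_ld)
```

For these six pairs the computed distance agrees with the listed one, including the LD of 3 for 'schildergties', which lies beyond the usual limit of two edits and is only reachable through chaining. The underscore is counted as an ordinary character here.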

Purpose of some columns:
Column 4: The artificial frequency is (or can be) assigned by TICCL to word forms that one is certain or confident are (or were at some point in time) 'correct' or 'canonical' (whatever your definition of either). In this work we assigned it to all word forms and names we had gathered for Dutch for which we were confident they had at some time, by some authority, been 'humanly-attested'. For more about this, see our work on 'TICCLAT'.
Column 5: TICCL is based on what we call 'numerical anagram hashing', more prosaically: counting with words. Each bag of characters is ultimately expressed as a single large numerical value, regardless of the order of the characters. The numerical difference between the anagram values of two words denotes a specific difference in particular characters between them, which we call a 'character confusion'. In the second-to-last example above, the character confusion is that the HTR recognized the character bigram 'ij' as a single 'g'. The observed corpus frequency for this word variant is quite high: 8. So one might wonder whether this happened a lot in this particular digitization batch. To find out, one might 'grep', i.e. search for, all occurrences of the AV '23296010607' in the full *chained list in order to obtain the stats on this.
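The 'grep' suggested here can also be sketched in a few lines of Python. The function name is hypothetical and the sample lines are copied from this thread; in practice one would pass the lines of the full *chained file instead.

```python
def rows_with_av(lines, av):
    """Return (variant, frequency, candidate) for every line whose
    Column 5 equals the given anagram value, most frequent first."""
    hits = []
    for line in lines:
        fields = line.strip().split("#")
        if len(fields) != 7:
            continue  # skip anything that is not a 7-column record
        if fields[4] == str(av):
            hits.append((fields[0], int(fields[1]), fields[2]))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Sample lines taken from this thread. For a real run one could read a file
# instead, e.g. open("corpus.chained") (hypothetical file name).
sample = [
    "schildenijtjes#1#schilderijtjes#100000057#596286601#1#C",
    "schildergtjes#8#schilderijtjes#100000057#23296010607#2#C",
    "kwgting#15#kwijting#100001360#23296010607#2#C",
    "bladzgde#67#bladzijde#100004393#23296010607#2#C",
    "vrgwaring#39#vrijwaring#100002135#23296010607#2#C",
]

for variant, freq, candidate in rows_with_av(sample, 23296010607):
    print(freq, variant, "->", candidate)
```

Unlike a plain text search, this compares only Column 5, so the number cannot accidentally match part of a frequency or of another anagram value.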
(The answer is that this substitution occurred quite often in this corpus, and that at least the top three variants have elevated corpus frequencies. An extract:
bladzgde#67#bladzijde#100004393#23296010607#2#C (CC = page)
vrgwaring#39#vrijwaring#100002135#23296010607#2#C (CC = exemption)
kwgting#15#kwijting#100001360#23296010607#2#C (CC = acquittance)
)
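Why different word pairs in the extract above share one and the same AV can be illustrated with a small Python sketch. It rests on two loudly stated assumptions: that an anagram value is the sum of per-character codes raised to the fifth power, and that Python code points can stand in for those character codes. The function names are my own, and the numbers this sketch produces are not expected to reproduce the values in Column 5; only the principle carries over.

```python
def anagram_value(word: str, power: int = 5) -> int:
    """Illustrative anagram value: sum of character codes raised to a power.
    ASSUMPTION: Python code points stand in for the tool's character codes."""
    return sum(ord(ch) ** power for ch in word)


def av_difference(variant: str, candidate: str) -> int:
    """Absolute difference between the anagram values of two word forms."""
    return abs(anagram_value(candidate) - anagram_value(variant))


# Character order does not matter: anagrams share one value.
print(anagram_value("abc") == anagram_value("cab"))  # prints: True

# The shared characters cancel out, so only the confused ones remain.
# Every pair below differs by 'ij' versus 'g' and yields the same difference.
pairs = [
    ("schildergtjes", "schilderijtjes"),
    ("bladzgde", "bladzijde"),
    ("vrgwaring", "vrijwaring"),
    ("kwgting", "kwijting"),
]
differences = {av_difference(variant, candidate) for variant, candidate in pairs}
print(len(differences))  # prints: 1
```

Because the difference depends only on the characters that were confused, a single number identifies the 'ij' to 'g' confusion wherever it occurs, which is what makes searching the list for one AV meaningful.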

More info on TICCL's modules is to be found on https://github.com/LanguageMachines/ticcltools, as well as a diagrammatic overview of their interactions.

proycon added a commit that referenced this issue Aug 18, 2020

publish more intermediate output #58

26b46e5

proycon closed this as completed Sep 10, 2020
