running piccl to correct words in a simple wordlist #58

Closed
Irishx opened this issue Aug 18, 2020 · 1 comment

Irishx commented Aug 18, 2020

This is mostly a remark on how you can use ticcl.nf to correct a lexicon list of words. PICCL is intended for spelling correction at document level. However, it can also be applied to a wordlist and gives reasonable results.

The input is a list of words, one word per line, so you run ticcl.nf --inputtype text
The official output of ticcl.nf is an XML file, *.ticcl.folia.xml, but ticcl.nf also writes several intermediate files. For a quick look at the corrections without the superfluous XML, you can use the information in the *tsv.clean.ldcalc.ranked file, which is a list of: orig-word, edit-distance-corrected-word, technicalcode-word1, technicalcode-word-2, certainty of algorithm

proycon added a commit that referenced this issue Aug 18, 2020
martinreynaert (Collaborator) commented Aug 19, 2020

Thank you Irishx for the above elucidations!

However, the column contents of the output as described above are not quite correct. I explain below.

PICCL is a workflow system (based on NextFlow components such as ticcl.nf). TICCL is a (digitized) text correction and normalization system consisting of several modules.

Let us take a look at some actual TICCL system output. What I present next is not TICCL-rank output, but output from the subsequent module, TICCL-chain.

This output is very similar to TICCL-rank's except for the last column. TICCL-chain takes TICCL-rank's output as its input.

Both outputs consist of 7 columns separated by '#' (hash).

I present an extract from a *chained (the file extension for TICCL-chain output) file. This was based on a corpus of Dutch National Archives' Notarial Deeds from the 'Golden Century', Haarlem region. Handwritten Text Recognition courtesy of Transkribus.

We present six HTR (or, possibly, regional diachronic) variants corrected by TICCL to 'schilderijtjes', i.e. small paintings:

schildenijtjes#1#schilderijtjes#100000057#596286601#1#C
schildereitjes#1#schilderijtjes#100000057#15434340889#2#C
schildereytjes#2#schilderijtjes#100000057#1630347719#2#C
schildergties#1#schilderijtjes#100000057#35629811471#3#C
schildergtjes#8#schilderijtjes#100000057#23296010607#2#C
schilderij_tjes#1#schilderijtjes#100000057#11040808032#1#C

Column 1: word variant
Column 2: observed corpus frequency (corpus here was 100K pages of HTR)
Column 3: best-first ranked TICCL correction candidate (CC)
Column 4: TICCL 'artificial' frequency (here: 100,000,000) augmented with the observed corpus frequency of the CC (here: 57)
Column 5: Anagram value (AV) difference between the variant and its CC. Denotes a particular character confusion between variant and CC.
Column 6: Levenshtein Distance (LD)
Column 7: C for 'chained'

Note: underscores in either Column 1 or 3 denote spaces: in the HTR of these Notarial Deeds the last example given above is a bigram, i.e. a split word. We effected word bigram (and trigram) correction on this corpus.
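
For anyone who wants to load these lines programmatically, here is a minimal Python sketch of a parser for the 7-column layout listed above. The class and field names (`ChainedRecord`, `parse_line`, etc.) are my own illustrative choices, not part of TICCL, and the sketch assumes every line has exactly 7 hash-separated fields.

```python
from dataclasses import dataclass


@dataclass
class ChainedRecord:
    """One line of hash-separated output (field names are illustrative)."""
    variant: str         # Column 1: word variant
    variant_freq: int    # Column 2: observed corpus frequency
    candidate: str       # Column 3: best-first ranked correction candidate (CC)
    candidate_freq: int  # Column 4: artificial frequency plus observed frequency
    av_difference: int   # Column 5: anagram value (AV) difference
    ld: int              # Column 6: Levenshtein distance
    last: str            # Column 7: 'C' in *chained files


def parse_line(line: str) -> ChainedRecord:
    fields = line.rstrip("\n").split("#")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
    variant, vfreq, cand, cfreq, av, ld, last = fields
    return ChainedRecord(
        # Underscores in Columns 1 and 3 denote spaces (split words).
        variant=variant.replace("_", " "),
        variant_freq=int(vfreq),
        candidate=cand.replace("_", " "),
        candidate_freq=int(cfreq),
        av_difference=int(av),
        ld=int(ld),
        last=last,
    )


record = parse_line("schilderij_tjes#1#schilderijtjes#100000057#11040808032#1#C")
print(record)
```

Keeping Column 7 as a plain string means the same sketch should also accept lines where that column holds something other than 'C'.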

Main differences with TICCL-rank output:
a/ TICCL is usually set to work with an LD limit of two edits. TICCL-rank cannot have higher values in Column 6 than the actual limit that was set. TICCL-chain collects variants and can go way higher, here: 3.
b/ TICCL-rank in Column 7 gives a kind of confidence measure derived from TICCL's ranking features used in TICCL-LDcalc and TICCL-rank. This is lost during chaining since the original word pairs are often discarded.

Purpose of some columns:
Column 4: The artificial frequency is (or can be) assigned by TICCL to word forms of which one is certain or confident they are (or were at some point in time) 'correct' or 'canonical' (whatever your definition of both). In this work we assigned it to all word forms and names we had gathered for Dutch for which we were confident they had at some time, by some instance, been 'humanly-attested'. For more about this, see our work on 'TICCLAT'.
Column 5: TICCL is based on what we call 'numerical anagram hashing', or more prosaically: counting with words. Each bag of characters is ultimately expressed as a single large numerical value. The numerical difference between the anagram values of two words denotes a specific difference in particular characters between them, which we call a 'character confusion'. In the last-but-one example above, the character confusion is that the HTR recognized the character bigram 'ij' as a single 'g'. The observed corpus frequency for this word variant is quite high: 8. So one might wonder whether this happened a lot in this particular digitization batch. To find out, one might 'grep', i.e. search for, all occurrences of the AV '23296010607' in the full *chained list to obtain the stats on this.
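
As an alternative to grep, the same lookup can be sketched in a few lines of Python. This is only an illustration: the function name `variants_with_av` is hypothetical, and the sample data are the six lines quoted earlier in this comment rather than a full *chained file.

```python
def variants_with_av(lines, av):
    """Collect (variant, frequency, candidate) for lines whose Column 5
    equals `av`, sorted by descending observed corpus frequency."""
    hits = []
    for line in lines:
        fields = line.strip().split("#")
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        if fields[4] == av:
            hits.append((fields[0], int(fields[1]), fields[2]))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


sample = """\
schildenijtjes#1#schilderijtjes#100000057#596286601#1#C
schildereitjes#1#schilderijtjes#100000057#15434340889#2#C
schildereytjes#2#schilderijtjes#100000057#1630347719#2#C
schildergties#1#schilderijtjes#100000057#35629811471#3#C
schildergtjes#8#schilderijtjes#100000057#23296010607#2#C
schilderij_tjes#1#schilderijtjes#100000057#11040808032#1#C
""".splitlines()

for variant, freq, candidate in variants_with_av(sample, "23296010607"):
    print(variant, freq, candidate)
```

On a real file you would pass an open file object instead of `sample`, since the function accepts any iterable of lines.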
The answer is that this substitution occurred quite a lot in this corpus and that the top three (at least) have elevated corpus frequencies. Extract:
bladzgde#67#bladzijde#100004393#23296010607#2#C (CC = page)
vrgwaring#39#vrijwaring#100002135#23296010607#2#C (CC = exemption)
kwgting#15#kwijting#100001360#23296010607#2#C (CC = acquittance)
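
To make the anagram hashing idea a bit more concrete, here is a toy Python sketch. It assumes each character's value is its Unicode code point raised to the fifth power, and that the AV difference is taken as an absolute value; both are illustrative assumptions on my part. TICCL assigns its character values differently, so the numbers this sketch produces will not match Column 5 in the output shown above. What it does illustrate is the property described for Column 5: the same character confusion gives the same numerical difference, whatever the words involved.

```python
def anagram_value(word: str, power: int = 5) -> int:
    """Toy anagram value: sum of per-character values.

    Assumption: a character's value is its code point raised to `power`.
    Because the sum ignores character order, all anagrams of a word
    share one value.
    """
    return sum(ord(ch) ** power for ch in word)


def av_difference(variant: str, candidate: str) -> int:
    """Absolute difference between the two toy anagram values
    (taking the absolute value is an assumption of this sketch)."""
    return abs(anagram_value(candidate) - anagram_value(variant))


# Three word pairs that share the 'ij' -> 'g' confusion.
pairs = [
    ("schildergtjes", "schilderijtjes"),
    ("bladzgde", "bladzijde"),
    ("vrgwaring", "vrijwaring"),
]
for variant, candidate in pairs:
    print(variant, candidate, av_difference(variant, candidate))
```

All three pairs print the same difference, because the characters the two words have in common cancel out and only the confused characters remain.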

More info on TICCL's modules, as well as a diagrammatic overview of their interactions, can be found at https://github.com/LanguageMachines/ticcltools.

proycon closed this as completed Sep 10, 2020