running piccl to correct words in a simple wordlist #58
Comments
Thank you Irishx for the above elucidations! However, the column contents of the output as explained above are not in fact correct. I explain below.

PICCL is a workflow system (based on Nextflow components such as ticcl.nf). TICCL is a (digitized) text correction and normalization system consisting of several modules.

Let us take a look at some actual TICCL system output. What I present next is not TICCL-rank output, but output from the subsequent module, TICCL-chain, which takes TICCL-rank's output as input. This output is very similar to TICCL-rank's except for the last column. Both outputs consist of '#' (hash) separated columns, 7 in all.

I present an extract from a *chained file (the file extension for TICCL-chain output). It is based on a corpus of Dutch National Archives' Notarial Deeds from the 'Golden Century', Haarlem region. Handwritten Text Recognition courtesy of Transkribus. We present six HTR (or, possibly, regional diachronic) variants corrected by TICCL to 'schilderijtjes', i.e. small paintings:

schildenijtjes#1#schilderijtjes#100000057#596286601#1#C

Column 1: word variant

Note: underscores in either Column 1 or 3 denote spaces: in the HTR of these Notarial Deeds the last example given above is a bigram, i.e. a split word. We effected word bigram (and trigram) correction on this corpus.

Main differences with TICCL-rank output:

Purpose of some columns:

More info on TICCL's modules, as well as a diagrammatic overview of their interactions, is to be found on https://github.com/LanguageMachines/ticcltools.
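For a quick look at such a file without any XML, a line can be split on '#'. The following is only a minimal sketch, not part of TICCL itself: the function name `parse_chained_line` and the dictionary keys are hypothetical, and it relies only on what is stated above (7 hash-separated columns, the word variant in Column 1, the correction in Column 3, underscores denoting spaces). The remaining columns are kept as raw strings because their meaning is not spelled out here.

```python
def parse_chained_line(line: str) -> dict:
    """Split one '#'-separated line into its 7 columns.

    Hypothetical helper for illustration; only Columns 1 and 3 are
    interpreted, the other five are returned untouched.
    """
    fields = line.rstrip("\n").split("#")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
    return {
        # Underscores in Column 1 or 3 denote spaces (bigrams/trigrams).
        "variant": fields[0].replace("_", " "),
        "correction": fields[2].replace("_", " "),
        # Columns 2, 4, 5, 6 and 7, left uninterpreted.
        "other_columns": fields[1:2] + fields[3:],
    }


example = "schildenijtjes#1#schilderijtjes#100000057#596286601#1#C"
record = parse_chained_line(example)
print(record["variant"], "->", record["correction"])
```

Applied to every line of a *chained file, this yields a simple variant-to-correction list that can be compared against the original wordlist.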
This is mostly a remark on how you can use ticcl.nf to correct a lexicon list of words. PICCL is intended for spelling correction at document level. However, it can also be applied to a wordlist and give reasonable results.
The input is a list of words, one word per line. In that case you run ticcl.nf --inputtype text
The official output of ticcl.nf is an XML file, *.ticcl.folia.xml, but ticcl.nf also produces several intermediate files. For a quick look at the corrections without the superfluous XML, you can use the information in the *tsv.clean.ldcalc.ranked file, which is a list of: orig-word, edit-distance-corrected-word, technicalcode-word1, technicalcode-word-2, certainty of algorithm.