When I receive a non-English document, I scan it and run OCR on it with Tesseract. Then I use pdftotext to dump the text to a file and run that through Argos Translate (a locally installed translation app). That gives me the text in English without a cloud dependency. What next?
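Concretely, the pipeline so far looks roughly like this (file names, the source language, and the exact argos-translate flags are assumptions to check against the tools' --help output):

# Hypothetical sketch of the current pipeline.
tesseract scan.tif letter -l deu pdf           # OCR; writes letter.pdf with a text layer
pdftotext letter.pdf letter.txt                # dump that text layer to letter.txt
argos-translate --from-lang de --to-lang en \
  "$(cat letter.txt)" > letter_en.txt          # local translation, no cloud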

Up until now I've saved the result as (original basename)_en.txt. When I want to read the document later, I open that text file in emacs. But that's not enough: I still want to see the original letter, so I end up opening the PDF (or DjVu) file anyway.

That workflow is a bit cumbersome. So another option: use pdfjam --no-tidy to wrap the PDF in a LaTeX skeleton, then edit that LaTeX to add a \pdfcomment that embeds the English text as an annotation. Then I only need to open the PDF, and mousing over the annotation icon shows the English. It's labor-intensive up front, but it can be scripted.
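A minimal, hypothetical sketch of that scripting (pdfjam's real --no-tidy output looks different; the file names, the pdfpages/pdfcomment usage, and the placeholder text are all assumptions to adapt):

cat > letter_annotated.tex <<'EOF'
\documentclass{article}
\usepackage{pdfpages}
\usepackage{pdfcomment}
\begin{document}
% pagecommand runs on every imported page, so each page gets a note icon.
% Any LaTeX specials in the translation (%, &, #, _) would need escaping.
\includepdf[pages=-,
  pagecommand={\pdfcomment[icon=Note]{English translation goes here}}]{letter.pdf}
\end{document}
EOF
pdflatex letter_annotated.tex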

This works great until pdf2djvu runs on it. Both evince and djview render the document with the annotation icons showing, but there is no way to open an annotation to read its text.
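For reference, the conversion step is just something like this (file names are examples):

pdf2djvu -o letter_annotated.djvu letter_annotated.pdf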

Okular supports adding new annotations to DjVu files, but Okular is also apparently incapable of opening the text associated with pre-existing annotations. This command seems to prove the annotation icons are fake props:

djvused annotatedpdf.djvu -e 'select 1; print-ant'

No output.

When Okular creates a new annotation, it is not stored in the DjVu file itself (according to a comment from 10 years ago). WTF? #DjVu’s man page says the format includes “annotation chunks”, so why wouldn’t Okular use that construct?

It’s theoretically possible to add an annotation to a DjVu file using this command:

djvused -e 'set-ant annotation-file.txt' -s book.djvu

But the format of the annotation input file is undocumented. Anyone have the secret recipe?