02-09-2022, 08:40 AM | #1 |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
[SOLVED] (Open-source) application to extract text layer?
Hello,
If a PDF consists of a scanned (picture) layer plus an OCRed text layer that lets the user copy text, I assume it's possible either to extract the text layer for use elsewhere, or to leave it in the PDF and remove the picture layer, keeping just the text layer so as to get a much smaller file. Here's an example. I can't find an application to do this, preferably open-source. Thank you.
---
Edit: Done.
Code:
gs -sDEVICE=txtwrite -o output.txt input.pdf
Last edited by Shohreh; 02-09-2022 at 09:57 AM. |
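For anyone with more than one file to process, here is a minimal sketch of the same gs call wrapped in Python so it can run over a whole folder. The function names (txtwrite_command, extract_all) and the folder layout are made up for illustration; only the gs options come from the command above. It assumes gs is on the PATH (on Windows the console executable is usually gswin64c, which can be passed as gs_exe).

```python
import subprocess
from pathlib import Path


def txtwrite_command(pdf: Path, out_dir: Path, gs_exe: str = "gs") -> list[str]:
    """Build the gs call from the post above for a single PDF.

    The output file gets the PDF's base name with a .txt extension
    (a naming choice made for this sketch, not something gs requires).
    """
    out_txt = out_dir / (pdf.stem + ".txt")
    return [gs_exe, "-sDEVICE=txtwrite", "-o", str(out_txt), str(pdf)]


def extract_all(pdf_dir: Path, out_dir: Path, gs_exe: str = "gs") -> list[Path]:
    """Run gs on every *.pdf in pdf_dir and return the text files written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        cmd = txtwrite_command(pdf, out_dir, gs_exe)
        # check=True raises CalledProcessError if gs exits with an error.
        subprocess.run(cmd, check=True)
        written.append(out_dir / (pdf.stem + ".txt"))
    return written


# Example call (hypothetical folder names):
# extract_all(Path("scans"), Path("text"))
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.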
02-09-2022, 10:40 AM | #2 | |
the rook, bossing Never.
Posts: 11,171
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
Mint (Debian/Ubuntu based), like Linux distributions generally, comes with Ghostscript (gs). I remember installing it and Ghostview (a GUI that uses gs) on Windows XP.
|
02-09-2022, 02:31 PM | #3 |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
Yes, it does the job. There are probably other tools as well.
The next task is finding a way to massage that raw text, with its carriage returns at the end of every line, into an EPUB that looks good. |
02-09-2022, 05:04 PM | #4 | |
the rook, bossing Never.
Posts: 11,171
Karma: 85874891
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
Or write a Perl script (a "write-only" language, but good for automating file I/O and regex). I've converted old-style CP/M and DOS WordStar files to proper plain text (returns only at paragraph ends) with regex in Perl. Then open the result in LO Writer (Mac, Windows, Linux) and edit headings and paragraph styles, working in the native odt format. Do an extra Save As to docx, then convert docx to epub with Calibre (no editing needed), or else use Sigil. |
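A minimal sketch of the regex clean-up described above, written in Python rather than Perl. It assumes the extracted text separates paragraphs with at least one blank line and hard-wraps every other line; if the gs output for a given PDF doesn't follow that pattern, the paragraph split will need adjusting. The function name unwrap_paragraphs is made up for this example.

```python
import re


def unwrap_paragraphs(raw: str) -> str:
    """Join hard-wrapped lines so returns remain only at paragraph ends."""
    # Normalise Windows (CRLF) and old Mac (CR) line endings to LF.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Drop trailing spaces and tabs so whitespace-only lines count as blank.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Rejoin words hyphenated across a line break ("exam-\nple" -> "example").
    # Assumption: this also merges genuine compounds split at the hyphen,
    # so proofread the result.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # One or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r"\n{2,}", text.strip())
    # Collapse remaining line breaks and runs of spaces inside each paragraph.
    return "\n".join(" ".join(p.split()) for p in paragraphs if p.strip())
```

The result has one line per paragraph, which LO Writer imports as one paragraph each, ready for styling.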
|
02-10-2022, 07:01 PM | #5 |
Fuzzball, the purple cat
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
You can extract text with k2pdfopt as well:
k2pdfopt -ocrout %s_textout.txt *.pdf -mode copy -dpi 50
This will extract text from all .pdf files in the folder. E.g. the text from myfile.pdf will be written to myfile_textout.txt. Might want to view this page as well.
Last edited by willus; 02-10-2022 at 07:06 PM. |
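After a batch run like this, it can be handy to check which PDFs did not produce a text file. The sketch below only mirrors the naming pattern from the example above (myfile.pdf to myfile_textout.txt); it does not call k2pdfopt itself, and it assumes the text files end up in the same folder as the PDFs. The helper names are made up for illustration.

```python
from pathlib import Path


def expected_textout(pdf: Path) -> Path:
    """Name the text file expected for a PDF under the %s_textout.txt pattern."""
    return pdf.with_name(pdf.stem + "_textout.txt")


def missing_textouts(folder: Path) -> list[Path]:
    """List the PDFs in folder that have no matching *_textout.txt file yet."""
    return [
        pdf
        for pdf in sorted(folder.glob("*.pdf"))
        if not expected_textout(pdf).exists()
    ]
```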
|
02-11-2022, 08:00 AM | #6 |
Zealot
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
|
Thanks much, I'll experiment.
|
|