#1 |
Zealot
Posts: 102
Karma: 137284
Join Date: Jan 2016
Device: none
|
Hello,
If a PDF consists of a scanned (picture) layer plus an OCRed text layer that lets the user copy text, I assume it's possible either to extract the text layer for use elsewhere, or to remove the picture layer and leave just the text layer in the PDF, so as to get a much smaller file. Here's an example. I can't find an application to do this, preferably open-source. Thank you.
---
Edit: Done. Code:
gs -sDEVICE=txtwrite -o output.txt input.pdf
Last edited by Shohreh; 02-09-2022 at 09:57 AM.
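For a whole folder of scans, the same Ghostscript call can be wrapped in a few lines of Python. This is only a sketch: the function names `gs_txtwrite_command` and `extract_all` are made up for illustration, and it assumes `gs` is on the PATH and that each text file should sit next to its PDF with a `.txt` extension.

```python
import subprocess
from pathlib import Path
from typing import List


def gs_txtwrite_command(pdf: Path) -> List[str]:
    """Build the Ghostscript call shown above for one PDF (hypothetical helper)."""
    # Assumption: write the text next to the PDF, same name with a .txt suffix.
    out = pdf.with_suffix(".txt")
    return ["gs", "-sDEVICE=txtwrite", "-o", str(out), str(pdf)]


def extract_all(folder: Path) -> None:
    """Run the command for every .pdf directly inside the folder (not recursive)."""
    for pdf in sorted(folder.glob("*.pdf")):
        # check=True raises CalledProcessError if gs exits with an error.
        subprocess.run(gs_txtwrite_command(pdf), check=True)
```

Calling `extract_all(Path("."))` would process the current folder. Nothing runs until you call it.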
#2 | |
the rook, bossing Never.
Posts: 7,172
Karma: 62500001
Join Date: Jun 2017
Location: Ireland
Device: Both Kinds: epub based makes and Kindle
|
Mint (Debian/Ubuntu based) and Linux generally come with Ghostscript (gs). I remember installing it and Ghostview (a GUI that uses gs) on Windows XP.
Quote:
#3 |
Zealot
Posts: 102
Karma: 137284
Join Date: Jan 2016
Device: none
|
Yes, it does the job. There are probably other tools as well.
The next task is finding a way to massage that raw text with carriage returns into an EPUB that looks good.
#4 | |
the rook, bossing Never.
Posts: 7,172
Karma: 62500001
Join Date: Jun 2017
Location: Ireland
Device: Both Kinds: epub based makes and Kindle
|
Quote:
Or write a Perl script (a write-only language for automating file I/O and regex). I've converted old-style CP/M and DOS WordStar files to proper plain text (returns only at paragraph ends) with regex in Perl.
Then use LibreOffice Writer (Mac, Windows, Linux): edit headings and paragraph styles in the native ODT format, then do an extra Save As to DOCX. Finally, convert DOCX to EPUB with Calibre (no editing needed), or else use Sigil.
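The "returns only at paragraphs" cleanup could look roughly like this. It's a minimal sketch in Python rather than Perl, not the script mentioned above. The function name `unwrap_paragraphs` is made up, and it assumes paragraphs in the extracted text are separated by at least one blank line, which may not hold for every OCR output.

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines so that returns remain only between paragraphs."""
    # Normalise Windows (CR LF) and bare CR line endings to LF first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines mark a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for para in paragraphs:
        # Re-join words hyphenated across a line break ("para-\ngraph").
        # Caveat: this also joins genuine hyphens that happen to end a line.
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Collapse the remaining newlines and runs of spaces into single spaces.
        para = re.sub(r"\s+", " ", para).strip()
        if para:
            cleaned.append(para)
    # One blank line between paragraphs, no returns inside them.
    return "\n\n".join(cleaned)
```

The result can be pasted or opened in LibreOffice Writer for the heading and style pass. If the text has no blank lines between paragraphs, the split rule would need to change, for example to key on indentation or on short last lines.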
#5 |
Fuzzball, the purple cat
Posts: 1,256
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
|
You can extract text with k2pdfopt as well:
k2pdfopt -ocrout %s_textout.txt *.pdf -mode copy -dpi 50
This will extract text from all .pdf files in the folder. E.g. the text from myfile.pdf will be extracted to myfile_textout.txt. You might want to view this page as well.
Last edited by willus; 02-10-2022 at 07:06 PM.
#6 |
Zealot
Posts: 102
Karma: 137284
Join Date: Jan 2016
Device: none
|
Thanks much, I'll experiment.
Thread | Thread Starter | Forum | Replies | Last Post |
Can Koreader display text layer from PDF as if it was an epub? | dtc87 | KOReader | 2 | 11-20-2021 10:21 AM |
Is the PDF experience better with a text layer? | El Duderino | KOReader | 16 | 08-04-2017 08:25 PM |
Scanned text pdf with OCR but graphical layer instead vectorial | whopper | 2 | 09-10-2011 06:32 PM | |
Announcing Janus - A Calibre Web Application [Open Source] | cruffalo | Related Tools | 5 | 09-07-2011 05:36 PM |
Open Source Text Books for California Schools | shrimphead | News | 6 | 05-09-2009 03:37 AM |