Old 02-09-2022, 08:40 AM   #1
Shohreh
Question [SOLVED] (Open-source) application to extract text layer?

Hello,

If a PDF consists of a scanned (picture) layer plus an OCRed text layer that lets the user copy text, I assume it's possible either to extract the text layer for use elsewhere, or to remove the picture layer and keep just the text layer in the PDF, so as to get a much smaller file.

Here's an example.

I can't find an application that does this; an open-source one would be preferable.

Thank you.

---
Edit: Done.

Code:
gs -sDEVICE=txtwrite -o output.txt input.pdf
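For a whole folder of PDFs, the same call can be wrapped in a few lines of Python. This is only a rough sketch: the helper names `txtwrite_command` and `extract_folder` are made up for illustration, and it assumes the executable is called `gs` and is on the PATH (on Windows the console binary usually has a different name, e.g. `gswin64c`).

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def txtwrite_command(pdf: Path) -> List[str]:
    """Build the gs call shown above for one PDF.

    The text is written next to the PDF with a .txt extension
    (an assumption for this sketch, not something gs requires).
    """
    out = pdf.with_suffix(".txt")
    return ["gs", "-sDEVICE=txtwrite", "-o", str(out), str(pdf)]


def extract_folder(folder: str) -> None:
    """Run the command on every *.pdf in `folder` (hypothetical helper)."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) was not found on the PATH")
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if gs exits with an error
        subprocess.run(txtwrite_command(pdf), check=True)
```

Calling `extract_folder(".")` would then process every PDF in the current directory, one gs run per file.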

Last edited by Shohreh; 02-09-2022 at 09:57 AM.
Old 02-09-2022, 10:40 AM   #2
Quoth
Mint (Debian/Ubuntu-based), and Linux generally, come with Ghostscript (gs). I remember installing it and Ghostview (a GUI that uses gs) on Windows XP.

Quote:
GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
On a Debian system you may need to install the ghostscript-doc package.
Old 02-09-2022, 02:31 PM   #3
Shohreh
Yes, it does the job. There are probably other tools as well.

The next task is finding a way to massage that raw text with carriage-returns into an Epub that looks good.
Old 02-09-2022, 05:04 PM   #4
Quoth
Quote:
Originally Posted by Shohreh
The next task is finding a way to massage that raw text with carriage-returns into an Epub that looks good.
On Linux, Kate; on Windows, Notepad++. Both have powerful regex and other programming tools, and both are project-based and multi-tab. (Or, for real old school, Emacs or vi.)
Or write a Perl script (a write-only language for automating file I/O and regex). I've converted old-style CP/M and DOS WordStar files to proper plain text (returns only at paragraph ends) with regex in Perl.
Then LibreOffice Writer (Mac, Windows, Linux): edit headings and paragraph styles, working in the native ODT format, and do an extra Save As to DOCX. Then use Calibre to convert DOCX to EPUB (no editing needed), or else use Sigil.
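To give an idea of the regex step, here is a minimal sketch in Python rather than Perl. It assumes the extracted text separates paragraphs with blank lines, which may not hold for every PDF, and the function name `unwrap` is just an illustration. Note that the hyphen rule also merges genuine hyphenated compounds that happen to break at a line end, so the result needs a proofread.

```python
import re


def unwrap(text: str) -> str:
    """Join hard-wrapped lines so returns remain only between paragraphs."""
    # Normalise Windows (CR LF) and old Mac (CR) line endings to LF
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines mark a paragraph break
    paragraphs = re.split(r"\n\s*\n", text.strip())
    cleaned = []
    for para in paragraphs:
        # Rejoin words hyphenated across a line break: "para-\ngraph" -> "paragraph"
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Replace each remaining line break (plus surrounding spaces) with one space
        para = re.sub(r"\s*\n\s*", " ", para)
        # Collapse runs of spaces and tabs
        para = re.sub(r"[ \t]+", " ", para).strip()
        cleaned.append(para)
    return "\n\n".join(cleaned)
```

The output keeps one blank line between paragraphs, which Writer or Calibre can then treat as paragraph breaks.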
Old 02-10-2022, 07:01 PM   #5
willus
You can extract text with k2pdfopt as well:

k2pdfopt -ocrout %s_textout.txt *.pdf -mode copy -dpi 50

This will extract text from all .pdf files in the folder. E.g. the text from myfile.pdf goes to myfile_textout.txt.

Might want to view this page as well.

Last edited by willus; 02-10-2022 at 07:06 PM.
Old 02-11-2022, 08:00 AM   #6
Shohreh
Zealot
Shohreh can program the VCR without an owner's manual.Shohreh can program the VCR without an owner's manual.Shohreh can program the VCR without an owner's manual.Shohreh can program the VCR without an owner's manual.Shohreh can program the VCR without an owner's manual.Shohreh can program the VCR without an owner's manual.Shohreh can program the VCR without an owner's manual.Shohreh can program the VCR without an owner's manual.Shohreh can program the VCR without an owner's manual.Shohreh can program the VCR without an owner's manual.Shohreh can program the VCR without an owner's manual.
 
Posts: 148
Karma: 192898
Join Date: Jan 2016
Device: none
Thanks much, I'll experiment.