[SOLVED] (Open-source) application to extract text layer?

Shohreh · 02-09-2022, 08:40 AM

Hello,

If a PDF consists of a scanned (picture) layer plus an OCRed text layer that lets the user copy text, I assume it's possible either to extract the text layer for use elsewhere, or to remove the picture layer and keep just the text layer in the PDF, so as to get a much smaller file.

Here's an example.

I can't find an application to do this, preferably open-source.

Thank you.

---
Edit: Done.

Code:

gs -sDEVICE=txtwrite -o output.txt input.pdf
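
For the other option in the question (keep the text layer in the PDF, drop the scanned picture), here is a rough Python sketch built around the same gs call. It assumes a Ghostscript build that supports the -dFILTERIMAGE switch with the pdfwrite device. The helper names and the _textonly.pdf suffix are made up for illustration. Since OCR text layers are usually invisible, the resulting pages may look blank even though the text can still be selected and copied.

```python
import subprocess
import sys


def text_extract_cmd(pdf, txt):
    """Build the command from the post above: dump the text layer to a text file."""
    return ["gs", "-sDEVICE=txtwrite", "-o", txt, pdf]


def strip_images_cmd(pdf, out_pdf):
    """Build a command that rewrites the PDF without its raster images.

    Assumption: the installed gs accepts -dFILTERIMAGE with pdfwrite.
    """
    return ["gs", "-sDEVICE=pdfwrite", "-dFILTERIMAGE", "-o", out_pdf, pdf]


if __name__ == "__main__":
    if len(sys.argv) == 2:
        src = sys.argv[1]
        # Hypothetical naming scheme: book.pdf -> book.txt and book_textonly.pdf
        stem = src[:-4] if src.lower().endswith(".pdf") else src
        subprocess.run(text_extract_cmd(src, stem + ".txt"), check=True)
        subprocess.run(strip_images_cmd(src, stem + "_textonly.pdf"), check=True)
    else:
        print("usage: python textlayer.py input.pdf")
```

Compare the file sizes afterwards; how much is saved depends on how the scan was stored in the original PDF.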

Quoth · 02-09-2022, 10:40 AM

Mint (Debian/Ubuntu-based) and Linux distributions generally come with Ghostscript (gs). I remember installing it and Ghostview (a GUI that uses gs) on Windows XP.

Quote:

GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.

On a Debian system you may need to install the ghostscript-doc package.

Shohreh · 02-09-2022, 02:31 PM

Yes, it does the job. There are probably other tools as well.

The next task is finding a way to massage that raw text with carriage-returns into an Epub that looks good.

Quoth · 02-09-2022, 05:04 PM

Quote:

Originally Posted by Shohreh

The next task is finding a way to massage that raw text with carriage-returns into an Epub that looks good.

On Linux, Kate; on Windows, Notepad++. Both offer powerful regex search-and-replace and other programming tools, and both are project-based and multi-tab. (Or, for real old school, emacs or vi.)
Or write a Perl script (a write-only language for automating file I/O and regex). I've converted old-style CP/M and DOS WordStar files to proper plain text (returns only at paragraph ends) with regex in Perl.
Then LibreOffice Writer (Mac, Windows, Linux): edit headings and paragraph styles, working in the native odt format, and do an extra Save As to docx. Then convert docx to epub in Calibre (no editing needed), or else use Sigil.
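
The "returns only at paragraphs" clean-up can be sketched in a few lines. This is a Python stand-in for the Perl approach, not the script mentioned above, and it assumes the extracted text separates paragraphs with a blank line, which may not hold for every PDF. The function name unwrap is made up for illustration.

```python
import re


def unwrap(text):
    """Join hard-wrapped lines so that only paragraph breaks remain."""
    # Normalise Windows and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split across a line break: "exam-\nple" -> "example".
    # Caveat: this also removes a genuine hyphen that ends a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Assumption: one or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, collapse every run of whitespace to a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

Typical use would be to read output.txt, pass its contents through unwrap(), write the result to a new file, and open that in Writer for styling.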

willus · 02-10-2022, 07:01 PM

You can extract text with k2pdfopt as well:

k2pdfopt -ocrout %s_textout.txt *.pdf -mode copy -dpi 50

This will extract text from all .pdf files in the folder. E.g. the text from myfile.pdf is written to myfile_textout.txt.

Might want to view this page as well.

Shohreh · 02-11-2022, 08:00 AM

Thanks much, I'll experiment.
