#1 |
Connoisseur
Posts: 65
Karma: 10
Join Date: May 2011
Device: Samsung Tablet
|
Best tools for editing PDFs prior to conversion?
I'm ridding myself of physical books. This has involved having a few books scanned that I'm unable to find in electronic form.
My scanned books are supplied as PDFs. The visible pages are images, but they have backing text. When I view the PDFs in Moon+ Reader Pro the Text to Speech works, though with page numbers and titles included. Visually, though, they're a mess, as PDFs always are. If I use Calibre to convert to EPUBs the result is visually much cleaner, but Text to Speech doesn't work at all. If I unzip the EPUB file to look inside, it's clear why - there is no text, only a collection of images. Most EPUBs seem to contain a collection of HTML files, and I've used ordinary text editors to clean them up, on occasion.
I'm wondering if there are any tools that would allow me to extract the text from a PDF file in a usable format. If I just had a text file containing the text, which I could clean up prior to conversion, that would be ideal. Thoughts? |
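One possible way to get that text file, sketched in Python. This assumes Poppler's `pdftotext` command-line tool is installed (nobody in this thread has named it; it is just one common extractor) and relies on the fact that it separates pages with a form-feed character. The function names, the `min_repeats` threshold, and the rule that a running title is "the same first line on several pages" are all illustrative assumptions, so expect to tune them per book.

```python
import re
import subprocess
from collections import Counter

# A line holding only a number (optionally prefixed by "Page") is treated as a page number.
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?\d+\s*$", re.IGNORECASE)


def pdf_to_text(pdf_path, txt_path):
    """Dump the PDF's backing text layer to a UTF-8 text file (requires pdftotext)."""
    subprocess.run(["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)], check=True)


def strip_page_furniture(text, min_repeats=3):
    """Drop bare page numbers and repeated running titles from form-feed separated pages."""
    pages = [page.splitlines() for page in text.split("\f")]

    # Heuristic: a line that opens at least `min_repeats` pages is a running title.
    openers = Counter()
    for lines in pages:
        for line in lines:
            stripped = line.strip()
            if stripped and not PAGE_NUMBER.match(stripped):
                openers[stripped] += 1
                break
    running_titles = {line for line, count in openers.items() if count >= min_repeats}

    kept = []
    for lines in pages:
        for line in lines:
            stripped = line.strip()
            # Note: this also drops body lines identical to a running title.
            if PAGE_NUMBER.match(stripped) or stripped in running_titles:
                continue
            kept.append(line)
    return "\n".join(kept)
```

Typical use would be `pdf_to_text("book.pdf", "book.txt")`, then reading `book.txt`, passing it through `strip_page_furniture`, and finishing the cleanup in a text editor before conversion.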
#2 | |
Still reading
Posts: 14,010
Karma: 105092227
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
|
PDFs are really meant to be an end product. Some can't be edited at all; then you have to treat them as images and use OCR.
Quote:
The only foolproof method is to edit the source used to build a PDF! |
#3 |
Book E d i t o r
Posts: 432
Karma: 288184
Join Date: May 2015
Device: Laptop
|
Before converting, check the Enable Heuristic Processing box and enter .2 in the Line Un-Wrap Factor box. This should help connect more of the sentences and result in less editing.
Last edited by deback; 07-29-2023 at 10:23 PM. |
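For anyone who prefers the command line, the same two settings can probably be passed to calibre's `ebook-convert` tool. The sketch below only builds the command; it assumes that the Heuristic Processing "Line Un-Wrap Factor" box corresponds to the `--html-unwrap-factor` option (PDF input also has a separate `--unwrap-factor` option, so check which one you actually want). The function name is made up for illustration.

```python
import subprocess


def heuristic_convert_cmd(src, dst, unwrap_factor=0.2):
    """Build an ebook-convert command with heuristic processing enabled."""
    # calibre documents the unwrap factor as a decimal between 0 and 1.
    if not 0 < unwrap_factor < 1:
        raise ValueError("unwrap_factor must be between 0 and 1")
    return [
        "ebook-convert", str(src), str(dst),
        "--enable-heuristics",
        # Assumption: this is the option behind the GUI's Line Un-Wrap Factor box.
        "--html-unwrap-factor", str(unwrap_factor),
    ]


def convert(src, dst, unwrap_factor=0.2):
    """Run the conversion (requires calibre's ebook-convert on the PATH)."""
    subprocess.run(heuristic_convert_cmd(src, dst, unwrap_factor), check=True)
```

Calling `convert("book.pdf", "book.epub")` would run the conversion with the .2 value suggested above.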
#4 |
Bibliophagist
Posts: 46,155
Karma: 168983734
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
One item to double check is that all too often, the OCR text layer used for search has not been proofed and the quality can be absolutely atrocious. What I tend to start with now is extracting the images from the PDF, cleaning them up and then OCRring them. OTOH, this is often a case of the game not being worth the candle. Too much effort for too little return.
On a brighter note, if you look for messages by Tex2002ans, you will find much help. See this recent thread for instance: From print to ePub - how I did it. |
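The extract, clean up, then OCR workflow described above could be scripted roughly like this. It is only a sketch: it assumes Poppler's `pdfimages` and the `tesseract` command-line program are installed (neither is named in the post), and the function names and the `page` file prefix are hypothetical. The image clean-up step in the middle is manual and not shown, so in practice you would run the extraction first, fix the images, and then run the OCR loop.

```python
import subprocess
from pathlib import Path


def pdfimages_cmd(pdf_path, out_dir, prefix="page"):
    """Command that writes each embedded image as <out_dir>/<prefix>-NNN.png."""
    return ["pdfimages", "-png", str(pdf_path), str(Path(out_dir) / prefix)]


def tesseract_cmd(image_path, lang="eng"):
    """Command that OCRs one image into a .txt file next to it."""
    image_path = Path(image_path)
    out_base = image_path.with_suffix("")  # tesseract appends ".txt" itself
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_pdf(pdf_path, out_dir, lang="eng"):
    """Extract page images, OCR each one, and return the combined text."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_cmd(pdf_path, out_dir), check=True)
    # Clean up the images (deskew, crop, despeckle) before running this loop.
    for image in sorted(out_dir.glob("page-*.png")):
        subprocess.run(tesseract_cmd(image, lang), check=True)
    texts = sorted(out_dir.glob("page-*.txt"))
    return "\n".join(t.read_text(encoding="utf-8") for t in texts)
```

The resulting text still needs proofing; OCR on cleaned images is usually better than an unproofed text layer, but it is not error-free.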
Thread | Thread Starter | Forum | Replies | Last Post |
.MOBI file not readying properly in Calibre (prior to conversion) | Saylan | Library Management | 2 | 02-01-2021 08:25 PM |
Tools for Editing Kindle .mobi Files? | GJN | Kindle Formats | 33 | 12-26-2013 02:05 PM |
Tools for reading adacemic paper PDFs? | saigafreak | Sony Reader Dev Corner | 2 | 04-23-2011 02:37 PM |
Editing PDFs in library | Ryan_Phx | Calibre | 3 | 10-07-2010 06:03 PM |
Looking for Linux PDF editing tools for DX format | tobor | Kindle Developer's Corner | 1 | 06-19-2009 07:37 PM |