Old 07-29-2023, 12:37 PM   #1
jdege
Connoisseur
Posts: 65
Karma: 10
Join Date: May 2011
Device: Samsung Tablet
Best tools for editing PDFs prior to conversion?

I'm ridding myself of physical books. This has involved having a few books scanned that I'm unable to find in electronic form.

My scanned books are supplied as PDFs. The visible pages are images, but they have backing text. When I view the PDFs in Moon+ Reader Pro the Text to Speech works, though with page numbers and titles included. Visually, though, they're a mess, as PDFs always are.

If I use Calibre to convert to EPUBs the result is visually much cleaner, but Text to Speech doesn't work at all. If I unzip the EPUB file to look inside, it's clear why: there is no text, only a collection of images.

Most EPUBs seem to contain a collection of HTML files, and I've used ordinary text editors to clean them up, on occasion.

I'm wondering if there are any tools that would allow me to extract the text from a PDF file in a usable format. If I just had a text file containing the text that I could clean up prior to conversion that would be ideal.

Thoughts?
Old 07-29-2023, 01:02 PM   #2
Quoth
Still reading
Posts: 14,010
Karma: 105092227
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
PDFs are really meant to be an end product. Some can't be edited at all; in that case you have to treat them as images and use OCR.
Quote:
My scanned books are supplied as PDFs. The visible pages are images, but they have backing text.
That's unproofed OCR. It can be extracted (tools depend on OS) and then edited in LibreOffice (LO) Writer.

The only foolproof method is to edit the source used to build a PDF!
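As one possible illustration of extracting the OCR text layer, here is a minimal Python sketch. It assumes Poppler's `pdftotext` command-line tool is installed, which nobody in this thread names; it is just one commonly available option. The helper names (`pdftotext_command`, `extract_text_layer`, `drop_page_furniture`) and the file names are hypothetical. The clean-up step targets the page numbers and running titles the first post mentions, and it is deliberately crude.

```python
import re
import shutil
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build a Poppler `pdftotext` call; -layout tries to keep the page's line layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def extract_text_layer(pdf_path, txt_path):
    """Run pdftotext (assumed to be on PATH) and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) was not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")


def drop_page_furniture(text, running_titles=()):
    """Drop lines that are only a page number or a known running title.

    Caveat: a genuine line consisting solely of 1-4 digits is dropped too,
    so the result still needs proofreading.
    """
    titles = {t.strip().lower() for t in running_titles}
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"\d{1,4}", stripped):
            continue  # bare page number
        if stripped and stripped.lower() in titles:
            continue  # running header or footer supplied by the caller
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    # Hypothetical file names and title, for illustration only.
    raw = extract_text_layer("scanned_book.pdf", "scanned_book.txt")
    cleaned = drop_page_furniture(raw, running_titles=["Some Book Title"])
    Path("scanned_book_clean.txt").write_text(cleaned, encoding="utf-8")
```

The cleaned text file can then be proofread in a word processor or text editor before conversion.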
Old 07-29-2023, 10:20 PM   #3
deback
Book E d i t o r
Posts: 432
Karma: 288184
Join Date: May 2015
Device: Laptop
Before converting, check the Enable Heuristic Processing box and enter .2 in the Line Un-Wrap Factor box. This should help connect more of the sentences and result in less editing.
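For anyone who prefers the command line, the same settings can probably be passed to calibre's `ebook-convert` tool. The sketch below assumes that `--enable-heuristics` and `--html-unwrap-factor` are the command-line counterparts of the two GUI boxes named above; that mapping is an assumption, not something stated in this thread. The function names and file names are hypothetical.

```python
import shutil
import subprocess


def heuristics_command(src, dst, unwrap_factor=0.2):
    """Build an ebook-convert call with heuristic processing enabled.

    Assumption: --html-unwrap-factor corresponds to the "Line Un-Wrap Factor"
    box on the Heuristic Processing page of the conversion dialog.
    """
    if not 0 <= unwrap_factor <= 1:
        raise ValueError("unwrap factor must be between 0 and 1")
    return [
        "ebook-convert", str(src), str(dst),
        "--enable-heuristics",
        "--html-unwrap-factor", str(unwrap_factor),
    ]


def convert_with_heuristics(src, dst, unwrap_factor=0.2):
    """Run the conversion if calibre's command-line tools are on PATH."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert (calibre) was not found on PATH")
    subprocess.run(heuristics_command(src, dst, unwrap_factor), check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    convert_with_heuristics("scanned_book.pdf", "scanned_book.epub")
```

Building the argument list in a separate function makes it easy to print and inspect the exact command before running it.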

Last edited by deback; 07-29-2023 at 10:23 PM.
Old 07-29-2023, 11:10 PM   #4
DNSB
Bibliophagist
Posts: 46,155
Karma: 168983734
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
One item to double-check: all too often, the OCR text layer used for search has not been proofed, and the quality can be absolutely atrocious. What I tend to start with now is extracting the images from the PDF, cleaning them up and then OCRing them. OTOH, this is often a case of the game not being worth the candle. Too much effort for too little return.

On a brighter note, if you look for messages by Tex2002ans, you will find much help.

See this recent thread for instance: From print to ePub - how I did it.
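The extract, clean up, re-OCR workflow described in this post could be scripted roughly as follows. This is only a sketch under stated assumptions: it uses Poppler's `pdfimages` and the `tesseract` command-line program, neither of which is named in this thread, and the function names and paths are hypothetical. The image clean-up step is left as a placeholder because it depends entirely on the scans.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, out_prefix):
    """Extract embedded page images as PNG files named <prefix>-NNN.png."""
    return ["pdfimages", "-png", str(pdf_path), str(out_prefix)]


def tesseract_command(image_path, out_base, lang="eng"):
    """OCR one image; tesseract writes its result to <out_base>.txt."""
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir, lang="eng"):
    """Extract page images, OCR each one, and return the joined text."""
    for tool in ("pdfimages", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} was not found on PATH")
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, work / "page"), check=True)

    # Sort by the numeric suffix so page order survives past 999 images.
    images = sorted(work.glob("page-*.png"),
                    key=lambda p: int(p.stem.rsplit("-", 1)[1]))
    pages = []
    for image in images:
        # Placeholder: image clean-up (deskew, crop, despeckle) would go here.
        base = image.with_suffix("")
        subprocess.run(tesseract_command(image, base, lang), check=True)
        pages.append(base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(pages)


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    text = ocr_pdf("scanned_book.pdf", "ocr_work")
    Path("scanned_book_ocr.txt").write_text(text, encoding="utf-8")
```

The resulting text still needs proofreading, which is where most of the effort mentioned above goes.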
All times are GMT -4.