Old 09-20-2018, 04:07 AM   #1
icefusion
Junior Member
icefusion began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Feb 2014
Device: Kindle Paperwhite 3
extra spaces in Kindle (e.g. Ganzhe i t swor t e) but not DC (e.g. Ganzheitsworte)

I scanned a German document at 600dpi. Then I used Briss to split each scanned page into two PDF pages. Then I ran Acrobat DC's OCR for 600dpi output. It worked, as can be verified by copying and pasting the text.

When I send the PDF to my Kindle, however, virtually every word has spaces inside it. For example, what Acrobat DC correctly gives as "Ganzheitsworte" becomes "Ganzhe i t swor t e" when selected on the Kindle. This renders the Kindle's integrated dictionary useless. Any ideas?
Old 09-20-2018, 04:32 AM   #2
pdurrant
The Grand Mouse 高貴的老鼠
pdurrant ought to be getting tired of karma fortunes by now.
Posts: 71,504
Karma: 306214458
Join Date: Jul 2007
Location: Norfolk, England
Device: Kindle Voyage
Use the text from Acrobat DC's OCR to create a Kindle book instead. You shouldn't expect the same results from two different OCR systems.
Old 09-21-2018, 12:25 AM   #3
willus
Fuzzball, the purple cat
willus ought to be getting tired of karma fortunes by now.
Posts: 1,273
Karma: 11087488
Join Date: Jun 2011
Location: California
Device: iPad
I don't understand. Did you use Acrobat to create a new PDF with an OCR layer in it and send that PDF to the Kindle? Or did you send the scanned PDF (after cropping with Briss) to the Kindle without performing any OCR beforehand? I don't know enough about Acrobat DC to know whether it will create a PDF with an OCR layer.

Old 09-21-2018, 01:40 AM   #4
icefusion
willus: I scanned the book as a PDF, ran it through Briss, then used Acrobat DC to add an OCR layer.

pdurrant: Exporting the text from the PDF is not an option. The document has too many quotes in foreign languages, including Greek, using the Greek alphabet. Also, the OCR made quite a few mistakes on the footnotes. I don't think that Kindle runs its own OCR but rather processes the OCR layer in the PDF, adding spaces.
Old 09-21-2018, 03:03 AM   #5
pdurrant
Quote:
Originally Posted by icefusion
I don't think that Kindle runs its own OCR but rather processes the OCR layer in the PDF, adding spaces.
That sounds unlikely. What happens if you send the original PDF?
Old 09-24-2018, 04:40 AM   #6
icefusion
Whenever I've sent non-OCR'ed PDFs to my Kindle they lack a text layer. The same goes for this document when I use a version without the text layer.
Old 09-24-2018, 06:07 AM   #7
pdurrant
Oh, how interesting. Could it be that the spaces are there in the text layer already?

What happens if you try to convert the PDF with text layer in calibre?
Old 09-24-2018, 10:26 AM   #8
icefusion
When I copy text within Acrobat the spaces are absent.
I just used Calibre to export to TXT and RTF. The former only produces the document outline (but none of the document proper), which lacks the extra spaces. The latter produces the image layer, not the text.
I have posted my quandary on the Kindle forum (https://www.mobileread.com/forums/sh...d.php?t=310958), hoping that someone over there has had the same issue.
Old 10-19-2018, 04:15 PM   #9
willus
Typically double-posting is frowned upon at MR, though they definitely need a way to cross-post questions like this to multiple forums. I downloaded the PDF sample you posted in the other thread and looked at it. There are definitely no spaces in the OCR layer (see excerpt from decompressed PDF stream below), so it's a mystery as to why they are put in by Amazon's conversion.


Code:
...
0.05 Tc 9.4807 0 0 9.1 63.27 418.57 Tm
(der )Tj
9.2469 0 0 9.1 79.35 418.57 Tm
(Ganzheitsworte )Tj
9.65 0 0 9.1 146.38 418.57 Tm
(mag )Tj
/Suspect <</Conf 0 >>BDC 
9.1849 0 0 9.1 167.15 418.57 Tm
(salom )Tj
...
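
For anyone who wants to repeat this check, here is a minimal Python sketch of how the strings shown by the Tj (show text) operator could be pulled out of a decompressed stream like the excerpt above. The helper names `shown_strings` and `strings_with_inner_spaces` are made up for illustration, and the regex is a simplification: it assumes plain literal strings with no escaped or nested parentheses, and it ignores hex strings and TJ arrays, so it is not a general PDF parser.

```python
import re

# Excerpt of the decompressed content stream quoted above (elided parts omitted).
STREAM = r"""
0.05 Tc 9.4807 0 0 9.1 63.27 418.57 Tm
(der )Tj
9.2469 0 0 9.1 79.35 418.57 Tm
(Ganzheitsworte )Tj
9.65 0 0 9.1 146.38 418.57 Tm
(mag )Tj
/Suspect <</Conf 0 >>BDC
9.1849 0 0 9.1 167.15 418.57 Tm
(salom )Tj
"""

# A literal string operand "( ... )" followed by the Tj operator.
# Simplifying assumption: no backslash escapes or nested parentheses.
TJ_PATTERN = re.compile(r"\(([^()\\]*)\)\s*Tj")


def shown_strings(stream):
    """Return the literal strings passed to Tj, in stream order."""
    return TJ_PATTERN.findall(stream)


def strings_with_inner_spaces(strings):
    """Return strings that still contain a space after trimming the ends."""
    return [s for s in strings if " " in s.strip()]


if __name__ == "__main__":
    strings = shown_strings(STREAM)
    print(strings)                             # strings as stored in the text layer
    print(strings_with_inner_spaces(strings))  # empty if no word is split by spaces
```

On this excerpt each Tj call carries one whole word plus a trailing space, and the second list comes back empty, which matches what I saw by eye: the spaces inside words are not stored in the text layer.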
Tags
german language ebook, ocr

