How to force TOC generation out of scanned PDF

magphil · 09-30-2009, 07:51 AM

Hi,

I am trying to convert a scanned pdf document to mobi.

I cannot get any TOC, even though there are 133 chapters labeled

CHAPTER NNN

The default expression used for chapter detection is:

//*[((name()='h1' or name()='h2') and re:test(.,'chapter|book|section|part\s+', 'i')) or @class = 'chapter']

I guess it expects tags named h1 or h2, or elements with class "chapter".

How can we get a TOC when there are no such tags, but the chapter keyword is part of the text?

Thanks for any hint

magphil
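
A rough illustration of why the default expression finds nothing here. This is a plain-Python reading of the expression using only the standard library, not calibre's own code. The sample markup is hypothetical, on the assumption that the scanned PDF's chapter labels end up in ordinary paragraphs rather than headings:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: only the first label is a real heading.
SAMPLE = """<html><body>
<h2>Chapter 1</h2>
<p>CHAPTER 2</p>
<p class="chapter">Prologue</p>
<p>Some body text.</p>
</body></html>"""

# Same pattern and case-insensitive flag as the default expression.
PATTERN = re.compile(r"chapter|book|section|part\s+", re.IGNORECASE)

def string_value(el):
    """XPath string value of an element: all descendant text, concatenated."""
    return "".join(el.itertext())

def default_rule_matches(el):
    """(h1 or h2, and text matches the pattern) or class == 'chapter'."""
    is_heading = el.tag in ("h1", "h2")
    heading_hit = is_heading and PATTERN.search(string_value(el)) is not None
    return heading_hit or el.get("class") == "chapter"

root = ET.fromstring(SAMPLE)
matched = [(el.tag, string_value(el)) for el in root.iter() if default_rule_matches(el)]
print(matched)  # [('h2', 'Chapter 1'), ('p', 'Prologue')]
```

The paragraph holding "CHAPTER 2" is skipped because it is neither an h1/h2 nor marked with class "chapter", even though its text contains the keyword.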

gwynevans · 09-30-2009, 08:54 AM

Try something like
//*[re:test(.,'CHAPTER','')]
and see if that gets them. (It might get too much, of course...)
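
A sketch of why this can match too much. In XPath, `.` applied to an element is its string value, meaning the concatenated text of all its descendants, so every ancestor of a chapter label passes the same test. The snippet below emulates that with the standard library only; the markup is made up for illustration:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup with two chapter labels in plain paragraphs.
SAMPLE = "<html><body><p>CHAPTER 1</p><p>Text.</p><p>CHAPTER 2</p></body></html>"

root = ET.fromstring(SAMPLE)

# Emulates //*[re:test(., 'CHAPTER', '')]: test every element's string value.
hits = [el.tag for el in root.iter() if re.search("CHAPTER", "".join(el.itertext()))]
print(hits)  # ['html', 'body', 'p', 'p']
```

Both chapter paragraphs are found, but so are `body` and the root `html` element, since each of them contains the word somewhere below it.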

magphil · 10-01-2009, 08:33 AM

Thanks for the hint, but this failed with:

ERROR: Conversion Error: <b>Failed</b>: Convert book 1 of 1 (The Lost Symbol)

Convert book 1 of 1 (The Lost Symbol)
InputFormatPlugin: PDF Input running on C:\Philippe\Books Calibre\Dan Brown\The Lost Symbol (15509)\The Lost Symbol - Dan Brown.pdf
Converting file to html...
Retrieving document metadata...
Generating manifest...
Rendering manifest...
Parsing all content...
Parsing index.html ...
Parsing file 'index.html' as HTML
Generating default TOC from spine...
Merging user specified metadata...
Detecting structure...
Detected chapter: Brown Dan - The Lost Symbol FOR BLYTHE
Traceback (most recent call last):
File "worker.py", line 103, in <module>
File "worker.py", line 90, in main
File "calibre\gui2\convert\gui_conversion.pyo", line 19, in gui_convert
File "calibre\ebooks\conversion\plumber.pyo", line 751, in run
File "calibre\ebooks\oeb\transforms\structure.pyo", line 32, in __call__
File "calibre\ebooks\oeb\transforms\structure.pyo", line 93, in detect_chapters
File "lxml.etree.pyx", line 685, in lxml.etree._Element.addprevious (src/lxml/lxml.etree.c:9834)
TypeError: Only processing instructions and comments can be siblings of the root element

I had the same failure on all my previous attempts at changing the regexp.
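
The traceback ends in lxml's `addprevious` complaining about siblings of the root element, which suggests (an inference from the log, not something confirmed in this thread) that the expression matched the root element itself and the chapter detector then tried to insert something before it. One way to avoid that is to keep only the innermost matches. Below is a stdlib-only sketch of the idea; the helper names and the markup are hypothetical:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup; one label is nested one level deeper than the other.
SAMPLE = "<html><body><div><p>CHAPTER 1</p><p>Text.</p></div><p>CHAPTER 2</p></body></html>"
LABEL = re.compile(r"CHAPTER")

def matches(el):
    """True if the element's string value (all descendant text) contains a label."""
    return LABEL.search("".join(el.itertext())) is not None

def innermost_matches(root):
    """Elements that match while none of their descendants do."""
    result = []
    for el in root.iter():
        if matches(el) and not any(matches(d) for d in el.iter() if d is not el):
            result.append(el)
    return result

root = ET.fromstring(SAMPLE)
innermost = [(el.tag, el.text) for el in innermost_matches(root)]
print(innermost)  # [('p', 'CHAPTER 1'), ('p', 'CHAPTER 2')]
```

An XPath form of the same filter would be something like `//*[re:test(., 'CHAPTER') and not(.//*[re:test(., 'CHAPTER')])]`. That expression is untested in calibre and is offered only as a starting point.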

JSWolf · 10-01-2009, 08:36 AM

Why not purchase a copy that's in the format you want and not PDF? Or is this one of those darknet downloads?

banjomike · 10-01-2009, 03:29 PM

Obviously the Masons have rigged the PDF.

veysey · 10-01-2009, 04:53 PM

Very darknet-y. But an interesting question.

I'm trying to learn Python and XPath so that I can answer it. But right now I know very, very little ...

If the chapter labels were bold-faced, you could get partial success with a line like:

//*[((name()='h1' or name()='h2' or name()='b') and re:test(.,'chapter|book|section|part\s+', 'i')) or @class = 'chapter']

But if the word CHAPTER were bolded and the number were not, this would get you a TOC without numbers, with repeated (but properly linked) entries: "CHAPTER, CHAPTER, CHAPTER ..."
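
A small sketch of that bold-only case, assuming the TOC entry text is taken from the matched element (that is how the post describes it; it was not checked against calibre's source). The markup is hypothetical and the matching is emulated with the standard library:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: only the word CHAPTER is bold, the number is not.
BOLD_SAMPLE = "<html><body><p><b>CHAPTER</b> 1</p><p>Text.</p><p><b>CHAPTER</b> 2</p></body></html>"
KEYWORDS = re.compile(r"chapter|book|section|part\s+", re.IGNORECASE)

bold_root = ET.fromstring(BOLD_SAMPLE)
toc_titles = []
for el in bold_root.iter("b"):
    # String value of <b> only; the number that follows sits outside the tag.
    title = "".join(el.itertext())
    if KEYWORDS.search(title):
        toc_titles.append(title)
print(toc_titles)  # ['CHAPTER', 'CHAPTER']
```

Matching the enclosing paragraph instead of the `b` tag would keep the number, because the paragraph's string value is "CHAPTER 1".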

I think the underlying reason for the previous problem is that XPath works on a tree of tags, and that raw text at the root level isn't searchable. Can one only search for tags, and then within the structure of tags?
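
On that question: XPath can select text nodes directly with `text()`, so raw text is searchable. The catch is that a text node still belongs to some element, and if several labels are bare text inside the same element there is no separate tag per chapter to select. The stdlib sketch below shows this for hypothetical converter output where lines are separated only by `<br/>`; whether the PDF in question converts this way is an assumption:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical converter output: labels are bare text separated by <br/>.
SAMPLE = "<html><body>CHAPTER 1<br/>First line.<br/>CHAPTER 2<br/>More text.</body></html>"
LABEL = re.compile(r"^CHAPTER\s+\d+$")

def own_text_nodes(el):
    """Yield the text nodes that are direct children of el (XPath: text())."""
    if el.text:
        yield el.text
    for child in el:
        if child.tail:
            yield child.tail

root = ET.fromstring(SAMPLE)
found = {}
for el in root.iter():
    labels = [t.strip() for t in own_text_nodes(el) if LABEL.match(t.strip())]
    if labels:
        found[el.tag] = labels
print(found)  # {'body': ['CHAPTER 1', 'CHAPTER 2']}
```

Both labels are found, but they share a single parent (`body`), so a tag-based expression has only one element to point at. In a case like this the markup probably needs editing first, for example by wrapping each label in its own heading.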

Dunno. I may figure it out eventually, and will post back.

Here's one hint: in the page structure tab of calibre's conversion window, you can click on the "gui" button for "header" or "footer", and it will show you the markup that calibre is seeing (I think it's XHTML ... maybe another markup language). Then you can see which tags are being used around the text of interest.

From what I've read here, BookDesigner might be a good way to do manual corrections. You might also have luck with an intermediate conversion to HTML and then working from that ...
