|09-30-2009, 08:51 AM||#1|
Join Date: Jun 2009
How to force TOC generation out of scanned PDF
I am trying to convert a scanned PDF document to MOBI.
I cannot get any TOC, even though there are 133 labeled chapters.
The default XPath expression used for chapter detection is:
//*[((name()='h1' or name()='h2') and re:test(.,'chapter|book|section|part\s+', 'i')) or @class = 'chapter']
I guess it expects h1 or h2 tags whose text contains one of those keywords, or any tag with class "chapter".
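To make concrete what that expression accepts, here is a rough sketch in plain Python that emulates its logic with the standard library rather than calibre's own lxml/EXSLT machinery. The function name `default_chapter_matches` and the sample markup are made up for illustration; only the regular expression and the tag and class tests come from the expression above.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as in the default expression; the 'i' flag becomes re.IGNORECASE.
CHAPTER_RE = re.compile(r"chapter|book|section|part\s+", re.IGNORECASE)

def default_chapter_matches(root):
    """Return the elements the default expression would select (emulation only)."""
    hits = []
    for elem in root.iter():
        # XPath string value of an element: all text beneath it, concatenated.
        text = "".join(elem.itertext())
        is_heading = elem.tag in ("h1", "h2") and CHAPTER_RE.search(text)
        if is_heading or elem.get("class") == "chapter":
            hits.append(elem)
    return hits

# Hypothetical markup: a real heading, a classed paragraph, a plain paragraph
# that only contains the keyword, and a heading without any keyword.
SAMPLE = """<html><body>
<h1>Chapter 1</h1>
<p class="chapter">Prologue</p>
<p>CHAPTER 2</p>
<h2>Acknowledgements</h2>
</body></html>"""

matched = default_chapter_matches(ET.fromstring(SAMPLE))
matched_texts = ["".join(e.itertext()).strip() for e in matched]
print(matched_texts)
```

In this sketch the plain `<p>CHAPTER 2</p>` paragraph is not selected, which is the situation a scanned PDF with no heading tags ends up in.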
How can we get a TOC when there are no such tags, but the chapter keyword is part of the text?
Thanks for any hint
|10-01-2009, 09:33 AM||#3|
Join Date: Jun 2009
Thanks for the hint, but this failed with:
ERROR: Conversion Error: <b>Failed</b>: Convert book 1 of 1 (The Lost Symbol)
Convert book 1 of 1 (The Lost Symbol)
InputFormatPlugin: PDF Input running on C:\Philippe\Books Calibre\Dan Brown\The Lost Symbol (15509)\The Lost Symbol - Dan Brown.pdf
Converting file to html...
Retrieving document metadata...
Parsing all content...
Parsing index.html ...
Parsing file 'index.html' as HTML
Generating default TOC from spine...
Merging user specified metadata...
Detected chapter: Brown Dan - The Lost Symbol FOR BLYTHE
Traceback (most recent call last):
File "worker.py", line 103, in <module>
File "worker.py", line 90, in main
File "calibre\gui2\convert\gui_conversion.pyo", line 19, in gui_convert
File "calibre\ebooks\conversion\plumber.pyo", line 751, in run
File "calibre\ebooks\oeb\transforms\structure.pyo", line 32, in __call__
File "calibre\ebooks\oeb\transforms\structure.pyo", line 93, in detect_chapters
File "lxml.etree.pyx", line 685, in lxml.etree._Element.addprevious (src/lxml/lxml.etree.c:9834)
TypeError: Only processing instructions and comments can be siblings of the root element
I had the same failure in all my previous attempts at changing the expression.
|10-01-2009, 09:36 AM||#4|
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Sony Reader PRS-650, iPad, nook STR
Why not purchase a copy that's in the format you want and not PDF? Or is this one of those darknet downloads?
|10-01-2009, 04:29 PM||#5|
Join Date: Aug 2009
Device: PRS-505 Red, DS Lite+DSlibris, nook Glow, nook Simple rooted, nook HD+
Obviously the Masons have rigged the PDF.
|10-01-2009, 05:53 PM||#6|
Join Date: Aug 2009
Location: Washington DC
Device: Sony ?
Very darknet-y. But an interesting question.
I'm trying to learn Python and XPath so that I can answer it. But right now I know very, very little ...
If the chapter headings were bold-faced, you could get partial success with a line like:
//*[((name()='h1' or name()='h2' or name()='b') and re:test(.,'chapter|book|section|part\s+', 'i')) or @class = 'chapter']
But if the word CHAPTER were bolded and the number were not, this would get you a TOC without numbers: repeated (and properly linked) entries reading "CHAPTER, CHAPTER, CHAPTER ...".
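A small sketch of that effect, using only the Python standard library to imitate the expression with `b` added. It assumes the TOC entry is taken from the text of the matched element; `toc_titles` and both sample strings are invented for the illustration and are not calibre code.

```python
import re
import xml.etree.ElementTree as ET

CHAPTER_RE = re.compile(r"chapter|book|section|part\s+", re.IGNORECASE)

def toc_titles(root, tags=("h1", "h2", "b")):
    """Collect the text of every matching element, as a stand-in for TOC entries."""
    titles = []
    for elem in root.iter():
        # Only text inside the element counts; text after </b> is not included.
        text = "".join(elem.itertext())
        if elem.tag in tags and CHAPTER_RE.search(text):
            titles.append(text.strip())
    return titles

# Hypothetical markup: whole heading in bold versus only the keyword in bold.
FULLY_BOLD = "<body><p><b>CHAPTER 1</b></p><p><b>CHAPTER 2</b></p></body>"
PARTLY_BOLD = "<body><p><b>CHAPTER</b> 1</p><p><b>CHAPTER</b> 2</p></body>"

full_titles = toc_titles(ET.fromstring(FULLY_BOLD))
partial_titles = toc_titles(ET.fromstring(PARTLY_BOLD))
print(full_titles)
print(partial_titles)
```

When only the keyword sits inside `<b>`, the number lives outside the matched element, so every entry collapses to the bare word.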
I think the underlying reason for the previous problem is that XPath works on a tree of tags, so raw text at the root level isn't searchable on its own. One can only select tags, and then search within the structure of tags?
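Another possible reading of the traceback, offered as a guess rather than a confirmed diagnosis: in XPath, the `.` inside `re:test(., ...)` stands for the element's string value, which is all the text beneath it. An expression that drops the tag-name test would then also match `html` and `body`, and a match on the root element would be consistent with the `addprevious` error, since no element can be inserted before the root. The sketch below imitates this with the Python standard library; the pattern `chapter\s+\d+`, the helper names, and the sample markup are all assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: the keyword followed by a number.
CHAPTER_RE = re.compile(r"chapter\s+\d+", re.IGNORECASE)

def string_value(elem):
    """All text under the element, like the XPath string value."""
    return "".join(elem.itertext())

def own_text_nodes(elem):
    """Text that belongs directly to the element, excluding nested elements."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]

def match_any_element(root):
    # Test every element's full string value: ancestors match too.
    return [e for e in root.iter() if CHAPTER_RE.search(string_value(e))]

def match_own_text(root):
    # Test only the element's direct text: just the innermost holders match.
    return [e for e in root.iter()
            if any(CHAPTER_RE.search(t) for t in own_text_nodes(e))]

SAMPLE = ("<html><body><p>CHAPTER 1</p><p>Some text.</p>"
          "<p>CHAPTER 2</p></body></html>")
root = ET.fromstring(SAMPLE)

broad_tags = [e.tag for e in match_any_element(root)]
narrow_tags = [e.tag for e in match_own_text(root)]
print(broad_tags)
print(narrow_tags)
```

If this guess is right, the thing to aim for is an expression that tests an element's own text, or that names a specific tag such as `p`, so the root never qualifies.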
Dunno. I may figure it out eventually, and will post back.
Here's one hint: in the Page Structure tab of calibre's conversion window, you can click the "gui" button next to "header" or "footer", and it will show you the XHTML representation (I think it's XHTML ... maybe another markup language) that calibre is seeing. Then you can see which tags are being used around the text of interest.
From what I've read here, BookDesigner might be a good way to do manual corrections. You might also have luck with an intermediate conversion to HTML and then working from that ...