09-30-2009, 07:51 AM | #1 |
Connoisseur
Posts: 60
Karma: 5090
Join Date: Jun 2009
Device: Gen3, Kobo glow
|
How to force TOC generation out of scanned PDF
Hi,
I am trying to convert a scanned PDF document to MOBI. I cannot get any TOC, even though there are 133 chapters labelled CHAPTER NNN.
The default regexp used for chapter detection is:
//*[((name()='h1' or name()='h2') and re:test(.,'chapter|book|section|part\s+', 'i')) or @class = 'chapter']
I guess it expects tags named h1 or h2, or elements whose class is chapter. How can we get a TOC when there are no tags, but the chapter keyword is part of the text?
Thanks for any hint,
magphil |
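A quick way to check that the regexp itself is not the obstacle is to try it on a sample label in plain Python. This is only a sketch: the sample strings are made up, and Python's `re` module stands in for the `re:test` XPath extension, which may differ in details.

```python
import re

# Pattern copied from the default chapter-detection expression;
# the 'i' flag passed to re:test corresponds to re.IGNORECASE here.
CHAPTER_RE = re.compile(r'chapter|book|section|part\s+', re.IGNORECASE)

def looks_like_chapter(text):
    """Return True if the pattern is found anywhere in the text."""
    return CHAPTER_RE.search(text) is not None

# Hypothetical sample lines, not taken from the actual book.
samples = ["CHAPTER 12", "Prologue", "Part  One"]
results = {s: looks_like_chapter(s) for s in samples}
print(results)
```

If the label text matches, as it does for "CHAPTER 12" here, then the part of the expression that fails is presumably the tag test (`h1`/`h2`) or the class test, not the regexp.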
09-30-2009, 08:54 AM | #2 |
Wizzard
Posts: 1,402
Karma: 2000000
Join Date: Nov 2007
Location: UK
Device: iPad 2, iPhone 6s, Kindle Voyage & Kindle PaperWhite
|
Try something like
//*[re:test(.,'CHAPTER','')] and see if that gets them. (It might get too much, of course...) |
10-01-2009, 08:33 AM | #3 |
Connoisseur
Posts: 60
Karma: 5090
Join Date: Jun 2009
Device: Gen3, Kobo glow
|
Thanks for the hint, but this failed with :
ERROR: Conversion Error: <b>Failed</b>: Convert book 1 of 1 (The Lost Symbol)
Convert book 1 of 1 (The Lost Symbol)
InputFormatPlugin: PDF Input running on C:\Philippe\Books Calibre\Dan Brown\The Lost Symbol (15509)\The Lost Symbol - Dan Brown.pdf
Converting file to html...
Retrieving document metadata...
Generating manifest...
Rendering manifest...
Parsing all content...
Parsing index.html ...
Parsing file 'index.html' as HTML
Generating default TOC from spine...
Merging user specified metadata...
Detecting structure...
Detected chapter: Brown Dan - The Lost Symbol FOR BLYTHE
Traceback (most recent call last):
  File "worker.py", line 103, in <module>
  File "worker.py", line 90, in main
  File "calibre\gui2\convert\gui_conversion.pyo", line 19, in gui_convert
  File "calibre\ebooks\conversion\plumber.pyo", line 751, in run
  File "calibre\ebooks\oeb\transforms\structure.pyo", line 32, in __call__
  File "calibre\ebooks\oeb\transforms\structure.pyo", line 93, in detect_chapters
  File "lxml.etree.pyx", line 685, in lxml.etree._Element.addprevious (src/lxml/lxml.etree.c:9834)
TypeError: Only processing instructions and comments can be siblings of the root element
I had the same failure in all my previous attempts at changing the regexp. |
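The log line "Detected chapter: Brown Dan - The Lost Symbol FOR BLYTHE" suggests that the first element matched was one containing the start of the whole book, which would fit the TypeError about siblings of the root element. In XPath, the `.` inside `re:test` is the element's string value, that is, its own text plus the text of all its descendants, so `//*[re:test(.,'CHAPTER','')]` also matches every ancestor of a chapter label, up to the root. The sketch below imitates this with the standard library only (ElementTree has no `re:test`, so the string value is rebuilt with `itertext()`); the tiny document and the helper names are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Invented stand-in for the converted book; the real markup may differ.
DOC = """<html><body>
<p>FOR BLYTHE</p>
<p>CHAPTER 1</p>
<p>Some text.</p>
</body></html>"""

def string_value(elem):
    """Approximate the XPath string value: own text plus all descendants."""
    return "".join(elem.itertext())

def matching_tags(root, pattern, tags=None):
    """Tags of elements whose string value matches, optionally limited by tag name."""
    rx = re.compile(pattern)
    return [e.tag for e in root.iter()
            if (tags is None or e.tag in tags) and rx.search(string_value(e))]

root = ET.fromstring(DOC)
unrestricted = matching_tags(root, r"CHAPTER")            # like //*[...]
restricted = matching_tags(root, r"CHAPTER", tags={"p"})  # like //*[name()='p' and ...]
print(unrestricted)
print(restricted)
```

If the labels do sit in elements of their own, restricting the expression by tag name, for example `//*[name()='p' and re:test(., 'CHAPTER\s+\d+', '')]`, might avoid matching the root. Whether the PDF conversion actually produces `p` elements around the labels is an assumption that needs checking against the converted markup.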
10-01-2009, 08:36 AM | #4 |
Resident Curmudgeon
Posts: 75,860
Karma: 134368292
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Why not purchase a copy that's in the format you want and not PDF? Or is this one of those darknet downloads?
|
10-01-2009, 03:29 PM | #5 |
Addict
Posts: 319
Karma: 397404
Join Date: Aug 2009
Location: UK
Device: PRS-505,DSlibris,nook Glow & HD+,Tab S2,Moon+,Clara,Clara Colour
|
Obviously the Masons have rigged the PDF.
|
10-01-2009, 04:53 PM | #6 |
Enthusiast
Posts: 31
Karma: 144
Join Date: Aug 2009
Location: Washington DC
Device: Sony ?
|
Very darknet-y. But an interesting question.
I'm trying to learn Python and XPath so that I can answer it, but right now I know very, very little ...
If the chapter labels were bold-faced, you could get partial success with a line like:
//*[((name()='h1' or name()='h2' or name()='b') and re:test(.,'chapter|book|section|part\s+', 'i')) or @class = 'chapter']
But if the word CHAPTER were bolded and the number were not, this would get you a TOC without numbers, just repeated (and properly linked) entries: "CHAPTER, CHAPTER, CHAPTER ...".
I think the underlying reason for the previous problem is that XPath works on a tree of tags, and that raw text isn't searchable at the root level. Perhaps one can only search for tags, and then within the structure of tags? Dunno. I may figure it out eventually, and will post back.
Here's one hint: in the Page Structure tab of calibre's conversion window, you can click the "gui" buttons next to "header" or "footer", and it will show you the markup that calibre is seeing (I think it's XHTML ... maybe another markup language). Then you can see which tags are being used with the text of interest.
From what I've read here, BookDesigner might be a good way to do manual corrections. You might also have luck with an intermediate conversion to HTML and then working from that ... |
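The last suggestion (convert to HTML first, then fix it up) could be automated along these lines. This is a rough sketch, assuming that each chapter label sits alone on its own line of the intermediate HTML as `CHAPTER` followed by a number, optionally followed by a `<br>` tag; the function name `tag_chapters` and the sample input are made up. It wraps each label in `<h2 class="chapter">`, which the default detection expression quoted at the start of the thread should then match.

```python
import re

# A line consisting only of "CHAPTER <number>", optionally followed by a <br> tag.
LABEL_RE = re.compile(
    r'^[ \t]*(CHAPTER[ \t]+\d+)[ \t]*(?:<br\s*/?>)?[ \t]*$',
    re.MULTILINE,
)

def tag_chapters(html):
    """Wrap bare chapter labels in <h2 class="chapter">; return (new_html, count)."""
    return LABEL_RE.subn(r'<h2 class="chapter">\1</h2>', html)

# Hypothetical fragment of the intermediate HTML.
sample = (
    "FOR BLYTHE<br>\n"
    "CHAPTER 1<br>\n"
    "The story mentions CHAPTER 9 here.<br>\n"
    "CHAPTER 2\n"
)
fixed, count = tag_chapters(sample)
print(count)
print(fixed)
```

Anchoring the pattern to a whole line keeps mentions of the word inside running text (such as "CHAPTER 9" in the sample) from being turned into headings. With 133 chapters expected, comparing `count` against that number would be a cheap sanity check before converting the edited HTML.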