Old 09-30-2009, 07:51 AM   #1
magphil
How to force TOC generation out of scanned PDF

Hi,

I am trying to convert a scanned pdf document to mobi.

I cannot get any TOC, even though there are 133 chapters labeled

CHAPTER NNN

The default expression used for chapter detection is:

//*[((name()='h1' or name()='h2') and re:test(.,'chapter|book|section|part\s+', 'i')) or @class = 'chapter']

I guess it expects tags named h1 or h2, or elements with the class "chapter".

How can we get a TOC when there are no such tags, but the chapter keyword is part of the text?

Thanks for any hint

magphil
Old 09-30-2009, 08:54 AM   #2
gwynevans
Try something like
//*[re:test(.,'CHAPTER','')]
and see if that gets them. (It might get too much, of course...)
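To see why it might get too much, here is a minimal sketch using lxml, which provides the EXSLT `re:test` function. The sample markup is made up for illustration and is not taken from the actual book. Inside `re:test`, `.` stands for the element's whole string value, including the text of all its descendants, so every ancestor of a chapter line matches as well.

```python
from lxml import etree

# Hypothetical stand-in for the HTML that a PDF conversion might produce.
SAMPLE = """<html><body>
<p>CHAPTER 1</p>
<p>Some body text.</p>
<p>CHAPTER 2</p>
</body></html>"""

# EXSLT regular-expression namespace, needed for re:test in lxml.
NS = {"re": "http://exslt.org/regular-expressions"}

root = etree.fromstring(SAMPLE)

# '.' is the full text of each element, descendants included.
matches = root.xpath("//*[re:test(., 'CHAPTER', '')]", namespaces=NS)
matched_tags = [el.tag for el in matches]
print(matched_tags)
```

In this sample the `html` and `body` elements are matched along with the two chapter paragraphs, because their string values also contain the word CHAPTER.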
Old 10-01-2009, 08:33 AM   #3
magphil
Thanks for the hint, but this failed with:

ERROR: Conversion Error: Failed: Convert book 1 of 1 (The Lost Symbol)

Convert book 1 of 1 (The Lost Symbol)
InputFormatPlugin: PDF Input running on C:\Philippe\Books Calibre\Dan Brown\The Lost Symbol (15509)\The Lost Symbol - Dan Brown.pdf
Converting file to html...
Retrieving document metadata...
Generating manifest...
Rendering manifest...
Parsing all content...
Parsing index.html ...
Parsing file 'index.html' as HTML
Generating default TOC from spine...
Merging user specified metadata...
Detecting structure...
Detected chapter: Brown Dan - The Lost Symbol FOR BLYTHE
Traceback (most recent call last):
File "worker.py", line 103, in <module>
File "worker.py", line 90, in main
File "calibre\gui2\convert\gui_conversion.pyo", line 19, in gui_convert
File "calibre\ebooks\conversion\plumber.pyo", line 751, in run
File "calibre\ebooks\oeb\transforms\structure.pyo", line 32, in __call__
File "calibre\ebooks\oeb\transforms\structure.pyo", line 93, in detect_chapters
File "lxml.etree.pyx", line 685, in lxml.etree._Element.addprevious (src/lxml/lxml.etree.c:9834)
TypeError: Only processing instructions and comments can be siblings of the root element


I had the same failure in all my previous attempts at changing the regexp.
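The last two lines of the traceback can be reproduced outside calibre. This is a minimal sketch under two assumptions: that the expression matched the root `<html>` element (the "Detected chapter" line, which starts with the book's opening text, suggests this), and that an element was then inserted before the match. The `div` marker below is illustrative only, not calibre's actual code.

```python
from lxml import etree

root = etree.fromstring("<html><body><p>CHAPTER 1</p></body></html>")
body = root[0]

# Inserting an element before a non-root element works fine.
body.addprevious(etree.Element("head"))

# Inserting an element before the root element is rejected by lxml,
# since a document can only have one root element.
try:
    root.addprevious(etree.Element("div"))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If that reading is right, any expression that also matches the root element would fail the same way, which would fit the repeated failures described above.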
Old 10-01-2009, 08:36 AM   #4
JSWolf
Why not purchase a copy that's in the format you want instead of PDF? Or is this one of those darknet downloads?
Old 10-01-2009, 03:29 PM   #5
banjomike
Obviously the Masons have rigged the PDF.
Old 10-01-2009, 04:53 PM   #6
veysey
Very darknet-y. But an interesting question.

I'm trying to learn Python and XPath so that I can answer it. But right now I know very, very little ...

If the chapter headings were bold-faced, you could get partial success with a line like:

//*[((name()='h1' or name()='h2' or name()='b') and re:test(.,'chapter|book|section|part\s+', 'i')) or @class = 'chapter']

But if the word CHAPTER were bolded and the number were not, this would get you a TOC without numbers, with repeated (but properly linked) entries: "CHAPTER, CHAPTER, CHAPTER ..."
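A small sketch of why the numbers would go missing, using only Python's standard library rather than calibre's own XPath machinery. The markup is a made-up example, and `string_value` is a hypothetical helper that approximates what `.` means inside the XPath test.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: the keyword is bold, the number is not.
SAMPLE = "<html><body><p><b>CHAPTER</b> 12</p></body></html>"
root = ET.fromstring(SAMPLE)

# Same pattern and case-insensitive flag as in the expression above.
pattern = re.compile(r"chapter|book|section|part\s+", re.IGNORECASE)


def string_value(el):
    """All text inside the element, roughly XPath's string value of '.'."""
    return "".join(el.itertext())


bold = root.find(".//b")
para = root.find(".//p")

bold_title = string_value(bold)   # text inside <b> only
para_title = string_value(para)   # text of the whole paragraph

print(repr(bold_title), bool(pattern.search(bold_title)))
print(repr(para_title), bool(pattern.search(para_title)))
```

The `b` element matches the pattern, but the number sits outside it in the enclosing paragraph, so a TOC entry built from the `b` element alone would carry only the word CHAPTER.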

I think the underlying reason for the previous problem is that XPath works on a tree of tags, and raw text isn't searchable at the root level. Perhaps one can only search for tags, and then within the structure of tags?
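For what it's worth, XPath can select text nodes as well as tags, so one way to narrow the match is to test each element's own text nodes instead of its whole subtree. A hedged sketch with lxml follows; the sample markup and the stricter `CHAPTER <number>` pattern are assumptions for illustration, and the expression has not been tried inside calibre itself.

```python
from lxml import etree

# Hypothetical stand-in for converted PDF output with untagged chapter lines.
SAMPLE = """<html><body>
<p>CHAPTER 1</p>
<p>This chapter opens slowly.</p>
<p>CHAPTER 2</p>
</body></html>"""

NS = {"re": "http://exslt.org/regular-expressions"}
root = etree.fromstring(SAMPLE)

# Select elements that directly contain a text node consisting only of
# "CHAPTER <number>", rather than testing each element's whole subtree.
expr = r"//*[text()[re:test(., '^\s*CHAPTER\s+\d+\s*$', '')]]"
headings = root.xpath(expr, namespaces=NS)
heading_texts = [el.text for el in headings]
print(heading_texts)
```

Here only the two chapter paragraphs are selected. The `html` and `body` elements are skipped because none of their own text nodes match the pattern.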

Dunno. I may figure it out eventually, and will post back.

Here's one hint: in the Page Structure tab of calibre's conversion window, you can click the "header" or "footer" GUI buttons, and it will show you the markup (XHTML, I think, or maybe another markup language) that calibre is seeing. Then you can see which tags are being used around the text of interest.

From what I've read here, BookDesigner might be a good way to do manual corrections. You might also have luck with intermediate conversion to html and then working from that ...