MobileRead Forums - View Single Post - How to force TOC generation out of scanned PDF

veysey · 10-01-2009, 04:53 PM

Very darknet-y. But an interesting question.

I'm trying to learn Python and XPath so that I can answer it. But right now I know very little ...

If the chapter were bold-faced, you could get partial success with a line like:

//*[((name()='h1' or name()='h2' or name()='b') and re:test(.,'chapter|book|section|part\s+', 'i')) or @class = 'chapter']

But if only the word CHAPTER were bold and the number were not, this would get you a TOC without numbers: repeated (and properly linked) entries reading "CHAPTER, CHAPTER, CHAPTER ...".
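The repeated-entry behaviour can be reproduced outside calibre. What follows is only a rough sketch in plain Python (standard library, no calibre code): it emulates the logic of the expression above rather than evaluating the XPath itself. The sample markup and the helper name toc_entries are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample of converter output: only the word CHAPTER is bold.
SAMPLE = """<html><body>
<p><b>CHAPTER</b> 1</p><p>Some text.</p>
<p><b>CHAPTER</b> 2</p><p>More text.</p>
<p class="chapter">Epilogue</p>
</body></html>"""

# Same pattern and case-insensitive flag as in the expression above.
PATTERN = re.compile(r'chapter|book|section|part\s+', re.IGNORECASE)


def toc_entries(root):
    """Emulate: (h1, h2 or b whose text matches) or class == 'chapter'."""
    entries = []
    for el in root.iter():
        # Text inside the element only, roughly the string value of '.'
        text = ''.join(el.itertext())
        tagged = el.tag in ('h1', 'h2', 'b') and PATTERN.search(text)
        if tagged or el.get('class') == 'chapter':
            entries.append(text.strip())
    return entries


entries = toc_entries(ET.fromstring(SAMPLE))
print(entries)  # ['CHAPTER', 'CHAPTER', 'Epilogue']
```

The numbers never show up because they sit outside the b tag, so they are not part of the text that gets tested or used as the entry title.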

I think the underlying reason for this problem is that XPath works on a tree of tags, and raw text isn't searchable at the root level. One can only search for tags, and then within the structure of tags?
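One way to picture the tree idea, again as a sketch with Python's standard xml.etree.ElementTree and not calibre's own machinery: the number is not lost, it just hangs off the enclosing element rather than the b tag. The one-line sample paragraph is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: bold word followed by a plain number.
para = ET.fromstring('<p><b>CHAPTER</b> 1</p>')
bold = para.find('b')

print(repr(bold.text))            # 'CHAPTER'  (text inside <b>)
print(repr(bold.tail))            # ' 1'       (text after </b>, still inside <p>)
print(''.join(para.itertext()))   # CHAPTER 1  (full text of the enclosing <p>)
```

So an expression that selected the enclosing element instead of the b tag might recover the whole title, though that is untested against calibre itself.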

Dunno. I may figure it out eventually, and will post back.

Here's one hint: in the page structure tab of calibre's conversion window, you can click on the "header" or "footer" "gui" buttons, and it will show you the markup that calibre is actually seeing (I think it's XHTML ... maybe another markup language). Then you can see which tags are being used around the text of interest.

From what I've read here, BookDesigner might be a good way to do manual corrections. You might also have luck with an intermediate conversion to HTML and then working from that ...
