Structure Detection Problems - Page 2

ldolse · 05-12-2011, 06:14 AM

Quote:

Originally Posted by Jonnster

Would it be worth me taking the HTML from the input folder of the Calibre debug, converting it to CHM and then converting to mobi? If so how do I create the CHM?

This is what I was trying to tell you before, but you don't need to convert it to CHM. Take the input.html file from the input directory, open it in a text/HTML editor, and massage the HTML as you see fit. Then import the HTML file back into Calibre - no need to convert it to CHM - and convert from HTML to Mobi.

Note that fixing the html input document up will require a major amount of effort though. You'll need to unwrap lines yourself, and it would be a good idea to manually put the code block in <pre> tags as itimpi just described.
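For a document this size, part of the `<pre>` tagging could perhaps be scripted rather than done entirely by hand. The following is only a rough sketch, not anything calibre provides: `wrap_code_runs` and the `CODE_LINE` pattern are hypothetical names, and the rule "a line ending in `;`, `{` or `}` is code" is an assumption that would need tuning for the actual book. It also assumes the lines are plain text with the markup already stripped.

```python
import html
import re

# Hypothetical heuristic (an assumption, not a calibre rule):
# a line "looks like code" if it ends in ';', '{' or '}'.
CODE_LINE = re.compile(r".*[;{}]\s*$")


def wrap_code_runs(lines):
    """Wrap each run of consecutive code-looking lines in one <pre> block."""
    out, run = [], []
    for line in lines:
        if CODE_LINE.match(line):
            # Escape <, > and & so the code survives inside HTML.
            run.append(html.escape(line))
        else:
            if run:
                out.append("<pre>" + "\n".join(run) + "</pre>")
                run = []
            out.append(line)
    if run:
        out.append("<pre>" + "\n".join(run) + "</pre>")
    return out


sample = ["Some prose.", "int x = 1;", "if (x < 2) {", "}", "More prose."]
for item in wrap_code_runs(sample):
    print(item)
```

A pass like this will miss code lines that do not match the pattern and may catch prose that does, so the result still needs a manual review.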

Jonnster · 05-12-2011, 06:15 AM

The document is 1700 pages long. "Massaging" it by hand is just not an option.

user_none · 05-12-2011, 11:31 AM

There really isn't any other option. The CHM author has gone through and marked all code sections in pre tags. That's what you need to do with the PDF output to make it look right on your Kindle.

Due to how PDFs are made there is no good / easy way to detect and add pre tags when converting. PDF files don't even differentiate paragraphs. It's all fixed lines. Many hours of work have been put into calibre's PDF conversion to determine which lines belong in the previous paragraph or in a new one.

kiwidude · 05-12-2011, 12:06 PM

@user_none - if I could ask a question here. I've never looked at what a PDF structure looks like internally so have no full appreciation of the difficulties it causes. However one thing I have noted that the conversion *always* gets wrong is when a sentence in an indented paragraph starts at the leftmost column.

Code:

    Some first line.
My second line.

Will always become two paragraphs when converted.

Out of technical curiosity and ignorance what is the issue with detecting this? And does the new PDF engine (which I know is on hold) address this?

user_none · 05-12-2011, 01:00 PM

The issue is that paragraphs in a novel typically start with an indent. The massaging is around 100 heuristics that are applied to the text.

I don't know much about the new engine. Kovid started it and is pretty much the only one working on it. I gave up on PDF a long time ago.

ldolse · 05-12-2011, 02:07 PM

I've looked at the new engine - it's got a lot of potential. Vertical and horizontal positional information is retained so paragraphs can be detected through indents and other tests (though none of those tests are done now). Header and footer removal will also become trivial as it can be done based on position on the page. Last time I looked at it though I couldn't quite figure out the logic as the reflow function covers single column and two column unwrapping in the same function.
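To illustrate how retained horizontal positions could make indent-based paragraph detection possible, here is a minimal sketch. It is not the new engine's code: the `(x, text)` line representation, the `group_by_indent` name, and the `margin` and `min_indent` parameters are all assumptions made up for the example.

```python
def group_by_indent(lines, margin=0.0, min_indent=1.0):
    """Group (x, text) lines into paragraphs.

    A line whose left edge sits at least `min_indent` units to the right
    of `margin` is treated as the start of a new paragraph; any other
    line is appended to the current paragraph.
    """
    paragraphs = []
    for x, text in lines:
        if not paragraphs or x - margin >= min_indent:
            paragraphs.append(text)
        else:
            paragraphs[-1] += " " + text
    return paragraphs


page = [
    (4.0, "Some first line."),
    (0.0, "My second line."),
    (4.0, "Next paragraph starts"),
    (0.0, "and continues."),
]
print(group_by_indent(page))
```

With positions available, the full stop at the end of the first line no longer matters: the second line starts at the margin, so it stays in the same paragraph.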

@kiwidude, the specific problem in your example is that the punctuation at the end of the first line is a full stop. Since the current engine loses all positional information, including indents, punctuation is all we've got. If a line in the middle of a paragraph ends with a full stop, the paragraph will be split there.
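The effect can be reproduced with a toy version of a punctuation-only rule. This is a simplified sketch, not calibre's actual code (which, as noted earlier in the thread, applies around 100 heuristics); the `unwrap` function and its single rule are assumptions for illustration.

```python
def unwrap(lines):
    """Join wrapped lines into paragraphs using punctuation only.

    Indentation is discarded, mimicking an engine that has lost all
    positional information. A line ending in '.', '!' or '?' closes
    the current paragraph.
    """
    paragraphs, current = [], []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        current.append(text)
        if text.endswith((".", "!", "?")):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# The example from the thread: the first line ends in a full stop,
# so the two lines are split into separate paragraphs.
print(unwrap(["    Some first line.", "My second line."]))

# A line that wraps mid-sentence is joined correctly.
print(unwrap(["    Some first line that", "wraps here."]))
```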

kiwidude · 05-12-2011, 02:12 PM

@ldolse - ahhh, thanks for the info, now I understand. It is the full stop at the end of the previous line that is "significant" in this case.

Having spent many hours resurrecting some PDF conversions in Sigil on a page by page basis, this is one particular limitation I am looking forward to the new engine solving one day...
