#16 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
Note that fixing up the HTML input document will require a major amount of effort, though. You'll need to unwrap lines yourself, and it would be a good idea to manually put the code block in <pre> tags as itimpi just described.
|
#17 |
Member
Posts: 16
Karma: 10
Join Date: May 2011
Device: Kindle 3
|
The document is 1700 pages long. "Massaging" it by hand is just not an option.
|
#18 |
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
There really isn't any other option. The CHM author has gone through and marked all code sections in pre tags. That's what you need to do with the PDF output to make it look right on your Kindle.
Due to how PDFs are made, there is no good or easy way to detect code and add pre tags when converting. PDF files don't even differentiate paragraphs. Everything is laid out as fixed lines. Many hours of work have been put into calibre's PDF conversion to determine which lines belong to the previous paragraph and which start a new one.
#19 |
Calibre Plugins Developer
Posts: 4,720
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@user_none - if I could ask a question here. I've never looked at what a PDF structure looks like internally so have no full appreciation of the difficulties it causes. However one thing I have noted that the conversion *always* gets wrong is when a sentence in an indented paragraph starts at the leftmost column.
Code:
    Some first line.
My second line.

Out of technical curiosity and ignorance, what is the issue with detecting this? And does the new PDF engine (which I know is on hold) address this?
#20 |
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
The issue is that paragraphs in a novel typically start with an indent. The massaging is around 100 heuristics that are applied to the text.
I don't know much about the new engine. Kovid started it and is pretty much the only one working on it. I gave up on PDF a long time ago.
#21 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I've looked at the new engine - it's got a lot of potential. Vertical and horizontal positional information is retained so paragraphs can be detected through indents and other tests (though none of those tests are done now). Header and footer removal will also become trivial as it can be done based on position on the page. Last time I looked at it though I couldn't quite figure out the logic as the reflow function covers single column and two column unwrapping in the same function.
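For illustration only, here is a rough Python sketch of what indent-based paragraph detection could look like once the left position of each line is available. This is not the new engine's code: the `Line` record, the `split_by_indent` function, and the `tolerance` value are all made-up names and numbers for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical line record: the text plus its left edge on the page."""
    text: str
    left: float  # x position of the first character, in points (assumed unit)


def split_by_indent(lines, tolerance=2.0):
    """Start a new paragraph at every line indented past the left margin.

    The left margin is taken to be the smallest left position seen.
    End-of-line punctuation is deliberately ignored.
    """
    if not lines:
        return []
    margin = min(line.left for line in lines)
    paragraphs = []
    current = []
    for line in lines:
        indented = line.left - margin > tolerance
        if indented and current:
            # An indent after collected text closes the previous paragraph.
            paragraphs.append(' '.join(current))
            current = []
        current.append(line.text)
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs


page = [
    Line("Some first line.", 90.0),
    Line("My second line.", 72.0),
    Line("A new paragraph starts", 90.0),
    Line("with an indent.", 72.0),
]
paras = split_by_indent(page)
print(paras)
```

With positions available, the full stop after "Some first line." no longer forces a split, because only the indent decides where a paragraph begins. A real implementation would need more tests than this (block quotes, lists, unindented first paragraphs), which is presumably why it takes a pile of heuristics.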
@kiwidude, the specific problem in your example is that the punctuation at the end of the first line is a full stop. Since the current engine loses all positional information, including indents, punctuation is all we've got. If a line in the middle of a paragraph ends with a full stop, the paragraph will be split there.
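To make that concrete, here is a minimal Python sketch of punctuation-only unwrapping. It is not calibre's actual code: `unwrap_by_punctuation` is a hypothetical name, and the single rule it applies (a line ending in `.`, `!` or `?` closes the paragraph) is a heavily simplified stand-in for the real heuristics. The sample text is an extended version of the example above.

```python
def unwrap_by_punctuation(lines):
    """Join lines into paragraphs using only end-of-line punctuation.

    Hypothetical, simplified rule: a line that ends in '.', '!' or '?'
    is assumed to end its paragraph. Indentation is stripped and ignored,
    mimicking an engine that has lost positional information.
    """
    paragraphs = []
    current = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip blank lines
        current.append(text)
        if text.endswith(('.', '!', '?')):
            paragraphs.append(' '.join(current))
            current = []
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs


sample = [
    "    Some first line.",
    "My second line continues the same",
    "paragraph and ends here.",
]
result = unwrap_by_punctuation(sample)
print(result)
```

The three lines belong to one paragraph, but the sketch returns two, because the first line happens to end in a full stop.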
#22 |
Calibre Plugins Developer
Posts: 4,720
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@Idolse - ahhh, thanks for the info, now I understand. It is the full stop at the end of the previous line that is "significant" in this case.
Having spent many hours resurrecting some PDF conversions in Sigil on a page by page basis, this is one particular limitation I am looking forward to the new engine solving one day...