Unwanted UnWrapping

MacEvansCB · 09-18-2011, 08:07 PM

I do a lot of conversion from PDF to editable text and there is one thing that drives me up the wall. Anytime there is punctuation (or a number or a capitalized letter) at the right margin, Calibre ALWAYS inserts a hard line break. It doesn't matter what I'm converting to... I've tried EPub, HTMLZ, RTF, TXT and others. The result is always the same.

Now I've gone thru piles of posts on this forum...
I've read the sticky for paragraphs being broken up...
I've gone thru the manual for unwrapping text...
I've turned on Heuristic Processing, enabled only Unwrap Lines and used piles of values between 0.00 and 1.00. While other paragraph breaks come and go, those I'm testing for NEVER stay wrapped as they should. And I hate to waste so much time scrubbing thru documents cleaning up these extra hard breaks.

This has me really really lost. Obviously every PDF reader app I've used, including Acrobat and Apple Preview, knows where these hard line breaks should be and should NOT be. Yet everybody says there is no such thing as paragraphs in a PDF.

Are there secret hidden characters or what???
How the heck does a PDF reader app handle hard and soft breaks correctly???
And why can't the Calibre PDF converter do the same thing????

user_none · 09-18-2011, 08:54 PM

PDF does not define that a block of text is a paragraph. In HTML, for instance, you would put the text inside <p> </p> tags to denote that the block is a paragraph.

In a PDF, it essentially says draw black lines in this shape at these points on the page. Each line is drawn independently of the next, like in a print book. The tab indent (if there is one) is your visual indicator that you have started a new paragraph. However, that tab character isn't a character in the PDF. The instructions for drawing the text just start a bit further to the right than the lines above and below.

Now we get into the question of what is a paragraph? Does it always start with a tab indent? How large of an indent? Is a paragraph separated by blank lines? Is a 10 character line alone that says Chapter 10 a paragraph or something else?

Do you see the issue? With a PDF (much like a TXT file) the information that tells you what you're looking at is limited at best. Mostly all you have is "at this point on the page, draw this."
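
To make that concrete, here is a minimal sketch of what a converter gets to work with. The `DrawnLine` type, the coordinates, and the 2-point tolerance are all made up for illustration; the only point is that an indent shows up as a slightly larger x value, not as a tab character or a paragraph marker.

```python
from dataclasses import dataclass


@dataclass
class DrawnLine:
    """One line of text as a PDF effectively stores it: a position plus glyphs."""
    x: float  # distance from the left edge of the page (hypothetical units)
    y: float  # distance from the top of the page
    text: str


# Hypothetical page content: nothing here says "paragraph".
page = [
    DrawnLine(72.0, 100.0, "end of the previous paragraph."),
    DrawnLine(90.0, 114.0, "A new paragraph starts here and"),
    DrawnLine(72.0, 128.0, "continues on the next line."),
]


def left_margin(lines):
    """Guess the left margin as the smallest x seen on the page."""
    return min(line.x for line in lines)


def looks_indented(line, margin, tolerance=2.0):
    """A line counts as indented if it starts noticeably right of the margin."""
    return line.x - margin > tolerance


margin = left_margin(page)
flags = [looks_indented(line, margin) for line in page]
```

Only the second line is flagged, and only because someone decided that 18 units right of the margin means "indent". How large that tolerance should be is exactly the kind of question raised above.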

DoctorOhh · 09-18-2011, 09:00 PM

Quote:

Originally Posted by MacEvansCB

How the heck does a PDF reader app handle hard and soft breaks correctly???
And why can't the Calibre PDF converter do the same thing????

You read the sticky about the limitations of converting PDF files. The PDF viewer is displaying the PDF, which is not a simple document with hard and soft line breaks and nice clean paragraphs. It is much more complicated than that. Even when using Adobe Professional to export a PDF to HTML, the resulting HTML isn't perfect: it has some sentence breaks where they don't belong and requires a lot of editing to turn it into a passable EPUB.

Your best bet might be to cite specific examples with a PDF file to go with it and see if that example can be corrected or handled correctly.

ldolse · 09-19-2011, 05:56 AM

Not to discourage you if you come up with examples (and if they're good ones they can be acted upon), but Calibre also leans towards false negatives rather than false positives in questionable situations. That is, if it's debatable whether a sentence should be unwrapped or not, it will leave the hard break.

A common example is:
"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor."
Proper Name said.

It annoys some users that Calibre doesn't unwrap this, but it's extremely difficult to tell from an algorithmic standpoint whether the above is one sentence or two.

Leaving the hard break in place when it's really one sentence is annoying, but as a human you always recognize when it happens and can fix it manually. However, if you remove the hard break and it should have been two sentences, the dialogue can be fundamentally changed, and it's not so easy for a human to detect whether the author really meant both those items to be in a single paragraph. If you even notice the oddity, you'll need to dig out the original file/book to check.
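
For illustration only, here is a rough sketch of a punctuation-only unwrapper that leans the same way: when in doubt, keep the break. This is not Calibre's actual heuristic code; the function names and the set of "sentence ending" characters are assumptions chosen to keep the example small.

```python
SENTENCE_ENDERS = (".", "!", "?")
CLOSERS = "\"')]"  # quotes and brackets that may follow the final punctuation


def ends_sentence(line):
    """True if the line ends in sentence punctuation, ignoring closing quotes."""
    core = line.rstrip().rstrip(CLOSERS)
    return core.endswith(SENTENCE_ENDERS)


def unwrap(lines):
    """Join lines into paragraphs, keeping any break that might be real."""
    paragraphs = []
    current = ""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line is always treated as a real break.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current and not ends_sentence(current):
            # The previous line stops mid-sentence, so joining is safe.
            current = current + " " + stripped
        else:
            # Debatable case: leave the hard break in place.
            if current:
                paragraphs.append(current)
            current = stripped
    if current:
        paragraphs.append(current)
    return paragraphs


example = [
    '"Lorem ipsum dolor sit amet, consectetur',
    'adipisicing elit, sed do eiusmod tempor."',
    "Proper Name said.",
]
result = unwrap(example)
```

On the example above, the first two lines are joined because the first one stops mid-sentence, but "Proper Name said." stays on its own line. With only punctuation to go on, the sketch cannot tell whether that break was intended.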

MacEvansCB · 09-20-2011, 08:37 AM

Ahhhh ... It was user_none that gave me the kick in the head and got me realigned ... I used to dabble in PostScript in the eighties and now I DO have at least a clue to what's going on. I've been thinking too much about text files and not about how PostScript works.

After wandering thru piles and piles of PDF files over the last couple of years, I feel that they fall into three cases:

1. The "perfect" PDF file:
These files have both paragraph indents and paragraph spacing. It ought to be simple to analyze the PostScript code for text positioning, and, given numbers for normal line spacing, paragraph spacing and indent spacing, it ought to be a piece of cake to properly format text from these files with absolutely no wrap/unwrap errors.

2. The "nice" PDF file:
These files have either paragraph indents or paragraph spacing but not both. It still ought to be simple to properly format text from these files without wrapping errors.

3. The "bad" PDF file:
These files have neither paragraph indents nor paragraph spacing, and one is stuck with only looking at punctuation and end-of-line position to find where paragraphs break.

I have seen a "4th" case, where the file was one complete glob of text, with no breaks whatsoever .... I just throw those files away.

Please note that I got all the way to case 3 before even mentioning punctuation or end-of-line position. Unfortunately, this seems to be the only way that Calibre's PDF converter formats text, without even considering the first two cases.

I went to a folder with almost 200 PDFs in it and tallied up the first 60 files (and then gave up!). I found that 29 files matched case 1, 25 files matched case 2, and only 6 files matched case 3. I could probably go for a larger statistical base, but this still argues for a better way to analyze PDF files.

It would be really nice if the PDF converter first looked for paragraph indents and paragraph spacing and used those for controlling wrapping when possible, falling back to the worst case of punctuation and end-of-line position only when the other two failed.

ldolse: I understand the point you're making with your example. But after scrubbing thru piles of converted files, I would have to say that in 99.99% of the cases that match your example, the lines should be wrapped without a hard line break.
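
The "layout first, punctuation last" order proposed in this post could be sketched roughly as follows. It assumes each line's position is already available as (x, y, text) for a single page in reading order, which the replies below note is not the case for the existing engine. The `Line` type, the tolerance, and the gap factor are invented for the example, and a centered heading would be mistaken for an indent.

```python
from statistics import median
from typing import NamedTuple


class Line(NamedTuple):
    x: float  # distance from the left edge of the page
    y: float  # distance from the top of the page
    text: str


def paragraphs(lines, indent_tol=2.0, gap_factor=1.4):
    """Group lines into paragraphs using indents/spacing, else punctuation."""
    if len(lines) < 2:
        return [line.text for line in lines]
    margin = min(line.x for line in lines)
    gaps = [b.y - a.y for a, b in zip(lines, lines[1:])]
    normal_gap = median(gaps)
    has_indents = any(line.x - margin > indent_tol for line in lines)
    has_spacing = any(gap > gap_factor * normal_gap for gap in gaps)
    use_layout = has_indents or has_spacing  # cases 1 and 2

    result = [lines[0].text]
    for prev, cur, gap in zip(lines, lines[1:], gaps):
        if use_layout:
            new_paragraph = (cur.x - margin > indent_tol
                             or gap > gap_factor * normal_gap)
        else:
            # Case 3 fallback: only punctuation is left to go on.
            new_paragraph = prev.text.rstrip().endswith((".", "!", "?"))
        if new_paragraph:
            result.append(cur.text)
        else:
            result[-1] += " " + cur.text
    return result


nice_page = [
    Line(90, 100, '"Lorem ipsum dolor sit amet."'),
    Line(72, 114, "Proper Name said. And then"),
    Line(72, 128, "something else happened."),
    Line(90, 142, "A new paragraph begins."),
]
bad_page = [
    Line(72, 100, "First sentence ends here."),
    Line(72, 114, "Second line continues"),
    Line(72, 128, "to the end."),
]
nice_result = paragraphs(nice_page)
bad_result = paragraphs(bad_page)
```

On the "nice" page the quoted sentence and "Proper Name said." end up in one paragraph because the second line is not indented. On the "bad" page the sketch has to fall back to punctuation and splits after the first full stop.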

kiwidude · 09-20-2011, 10:49 AM

@MacEvansCB - take a look at this thread:
https://www.mobileread.com/forums/sho...d.php?t=132726

In particular my posts and the responses towards the end. To summarise as I understand it, the problem is that the existing PDF engine does not retain the indent information, so punctuation is the *only* option it has. The new PDF engine will retain this information, but it is not yet finished (and seems perpetually on hold in favour of other priorities).

jackie_w · 09-20-2011, 03:24 PM

@MacEvansCB,

My experience with PDF to HTML conversion may be of limited use to you, but I'll offer it anyway.

You could try using one of the utilities

pdftohtml.exe (freeware, also used by Calibre, I believe)
pdf2xml.exe (freeware used by mobipocket creator)

to convert your "nice PDFs" to XML format. Personally, I prefer the latter option as pdftohtml sometimes loses italics.

The output XML does contain positional (x, y) info for each line, namely distance from Left edge of page and distance from Top of page, so detecting paragraph indents is possible.

If you have some programming ability, with work (quite a lot of work) you can write something to parse the XML and reconstruct chapter headings, paragraphs, scene-breaks, italics, bold, smallcaps, images and hyperlinks as you convert the XML to HTML.

Even so, I have not found it to be a "single magic button" conversion process. Every PDF is different and supplying a little specific knowledge about a particular PDF can make a big difference to the quality of the resultant HTML. Also, I haven't even attempted to try and convert PDFs of technical manuals in this way, only novels.
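
As a starting point for that kind of parser, here is a small sketch that reads line positions from XML and flags indented lines. The element and attribute names (`page`, `text`, `left`, `top`) are an assumption modeled on what pdftohtml's XML output looks like as far as I know; check them against the output of whichever tool you actually run. The sample document is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample; real output carries more attributes and font information.
SAMPLE = """<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="90" width="300" height="14" font="0">First line of a paragraph,</text>
<text top="114" left="72" width="318" height="14" font="0">with an <i>italic</i> word.</text>
<text top="128" left="90" width="200" height="14" font="0">Second paragraph.</text>
</page>
</pdf2xml>"""


def read_lines(xml_text):
    """Return (left, top, text) for every <text> element, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            # itertext() also collects text inside nested tags such as <i>.
            text = "".join(el.itertext())
            lines.append((float(el.get("left")), float(el.get("top")), text))
    return lines


def indented_lines(lines, tolerance=2.0):
    """Texts of lines that start noticeably to the right of the left margin."""
    margin = min(left for left, _, _ in lines)
    return [text for left, _, text in lines if left - margin > tolerance]


lines = read_lines(SAMPLE)
starts = indented_lines(lines)
```

This version throws the italics away by flattening each line to plain text. Keeping them, along with headings, scene breaks, and images, is where the "quite a lot of work" comes in.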

MacEvansCB · 09-20-2011, 05:05 PM

Thanks kiwi-dude ... I had started reading that post, but didn't finish it as it appeared not to apply to what I was after.

OK Kovid ... I respectfully request that you please give the new PDF engine a higher priority. I certainly could use it!!!!!! While I do have a lot of nice HTMLs, RTFs, and ePUBs, most of the files I'm forced to work on are grungy PDFs.

And thanks for the recommendations, jackie_w ... unfortunately I'm on a Mac ... but fortunately I'm also running Windows 7 as a virtual machine on my Mac! I'll look at anything that might make my conversions simpler and less labor intensive.

I understand how much fun illustrated documents and technical manuals are. You can probably guess how much time and fun I had getting my copy of Piers Anthony's "Visual Guide to Xanth" online and properly formatted with all the illustrations. I started over from scratch three different times as I learned new ways to do things, and when I changed eReaders, which changed my eBook formats.



