Old 09-18-2011, 08:07 PM   #1
MacEvansCB
Enthusiast
MacEvansCB began at the beginning.
 
Posts: 25
Karma: 10
Join Date: Nov 2010
Location: Somewhere in Iowa
Device: Nook Color
Unwanted UnWrapping

I do a lot of conversion from PDF to editable text and there is one thing that drives me up the wall. Anytime there is punctuation (or a number or a capitalized letter) at the right margin, Calibre ALWAYS inserts a hard line break. It doesn't matter what I'm converting to... I've tried EPub, HTMLZ, RTF, TXT and others. The result is always the same.

Now I've gone thru piles of posts on this forum...
I've read the sticky about paragraphs being broken up...
I've gone thru the manual for unwrapping text...
I've turned on Heuristic Processing, enabled only Unwrap Lines and used piles of values between 0.00 and 1.00. While other paragraph breaks come and go, those I'm testing for NEVER stay wrapped as they should. And I hate to waste so much time scrubbing thru documents cleaning up these extra hard breaks.

This has me really really lost. Obviously every PDF reader app I've used, including Acrobat and Apple Preview, knows where these hard line breaks should be and should NOT be. Yet everybody says there is no such thing as paragraphs in a PDF.

Are there secret hidden characters or what???
How the heck does a PDF reader app handle hard and soft breaks correctly???
And why can't the Calibre PDF converter do the same thing????
Old 09-18-2011, 08:54 PM   #2
user_none
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
PDF does not define that a block of text is a paragraph. In HTML, for instance, you would put the text inside <p> </p> tags to denote that the block is a paragraph.

In a PDF, it essentially says draw black lines in this shape at these points on the page. Each line is drawn independently of the next, like in a print book. The tab indent (if there is one) is your visual indicator that you have started a new paragraph. However, that tab isn't a character in the PDF. The instructions for drawing the text just start a bit further to the right than the lines above and below.

Now we get into the question of what is a paragraph? Does it always start with a tab indent? How large of an indent? Is a paragraph separated by blank lines? Is a 10 character line alone that says Chapter 10 a paragraph or something else?

Do you see the issue? With a PDF (much like a TXT file), the information that tells you what you're looking at is limited at best. Mostly all you have is "at this point on the page, draw this".
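To make the point above concrete, here is a rough Python sketch of what a page looks like when all you have is "draw this text at this position". The `DrawnLine` class, the coordinates, and the sample text are made up for illustration; they are not taken from any real PDF library or from calibre.

```python
from dataclasses import dataclass


@dataclass
class DrawnLine:
    """One line of text as a PDF might place it (hypothetical model)."""
    x: float   # where drawing starts, measured from the left edge
    y: float   # distance from the top of the page
    text: str


# Invented sample page: an indented paragraph start, a continuation line,
# and a short heading that happens to start at the same x as the indent.
page = [
    DrawnLine(x=90.0, y=100.0, text="It was a dark and stormy"),
    DrawnLine(x=72.0, y=114.0, text="night. The rain fell."),
    DrawnLine(x=90.0, y=128.0, text="Chapter 10"),
]


def naive_text(lines):
    """Plain text extraction: keeps the words, throws the geometry away."""
    ordered = sorted(lines, key=lambda ln: (ln.y, ln.x))
    return "\n".join(ln.text for ln in ordered)


def indented_lines(lines, tolerance=1.0):
    """Texts of lines that start to the right of the leftmost line."""
    margin = min(ln.x for ln in lines)
    return [ln.text for ln in lines if ln.x - margin > tolerance]


print(naive_text(page))
print(indented_lines(page))
```

Note that the paragraph start and the "Chapter 10" line both come back as "indented", which is exactly the "what is a paragraph?" ambiguity: the geometry alone does not say which one is a heading.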
Old 09-18-2011, 09:00 PM   #3
DoctorOhh
US Navy, Retired
Posts: 9,896
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
Quote:
Originally Posted by MacEvansCB View Post
How the heck does a PDF reader app handle hard and soft breaks correctly???
And why can't the Calibre PDF converter do the same thing????
You read the sticky about the limitations of converting PDF files. The PDF viewer is viewing the PDF, which is not a simple document with hard and soft line breaks and nice clean paragraphs; it is much more complicated than that. Even when using Adobe Professional to export a PDF to HTML, the resulting HTML isn't perfect: it has some sentence breaks where they don't belong and requires a lot of editing to turn it into a passable EPUB.

Your best bet might be to cite specific examples with a PDF file to go with it and see if that example can be corrected or handled correctly.
Old 09-19-2011, 05:56 AM   #4
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Not to discourage you if you come up with examples (and if they're good ones they can be acted upon), but in questionable situations Calibre leans towards false negatives rather than false positives, i.e. if it's debatable whether a line should be unwrapped or not, it will leave the hard break.

A common example is:
"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor."
Proper Name said.

It annoys some users that Calibre doesn't unwrap this, but it's extremely difficult to tell whether the above is one sentence or two sentences from an algorithmic standpoint.

Leaving the hard break in place when it's one sentence is annoying, but as a human you always recognize when it happens and can fix it manually. However, if you remove the hard break and it should have been two sentences, the dialogue can be fundamentally changed, and it's not so easy for a human to detect whether the author really meant both items to be in a single paragraph. If you even notice the oddity, you'll need to dig out the original file/book to check.
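The trade-off described above can be sketched in a few lines of Python. This is only an illustration of a "when in doubt, keep the break" rule, not calibre's actual heuristic code; the function names and the choice of punctuation characters are assumptions.

```python
CLOSERS = "\"')"   # closing quotes/brackets that may follow the punctuation
ENDERS = ".!?:"    # characters that might end a paragraph


def ends_sentence(line):
    """True if the line ends with sentence punctuation, ignoring closers."""
    core = line.rstrip().rstrip(CLOSERS)
    return bool(core) and core[-1] in ENDERS


def unwrap(lines):
    """Join wrapped lines, keeping the hard break in every debatable case."""
    paragraphs = []
    for raw in lines:
        line = raw.strip()
        if not line:
            paragraphs.append("")          # blank line: always a break
            continue
        previous = paragraphs[-1] if paragraphs else ""
        if previous and not ends_sentence(previous):
            # Previous line clearly stops mid-sentence: safe to unwrap.
            paragraphs[-1] = previous + " " + line
        else:
            # Previous line ends with punctuation: could be a paragraph
            # end, so lean towards a false negative and keep the break.
            paragraphs.append(line)
    return [p for p in paragraphs if p]


sample = [
    '"Lorem ipsum dolor sit amet, consectetur',
    'adipisicing elit, sed do eiusmod tempor."',
    "Proper Name said.",
]
print(unwrap(sample))
```

With the sample above, the first two lines are joined, but "Proper Name said." stays on its own line, which is the annoying-but-safe outcome.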
Old 09-20-2011, 08:37 AM   #5
MacEvansCB
Ahhhh ... It was user_none that gave me the kick in the head and got me realigned ... I used to dabble in PostScript in the eighties and now I DO have at least a clue to what's going on. I've been thinking too much about text files and not about how PostScript works.

After wandering thru piles and piles of PDF files over the last couple of years, I feel that they fall into three cases:

1. The "perfect" PDF file:
These files have both paragraph indents and paragraph spacing. It ought to be simple to analyze the PostScript code for text positioning, and, given numbers for normal line spacing, paragraph spacing and indent spacing, it ought to be a piece of cake to properly format text from these files with absolutely no wrap/unwrap errors.

2. The "nice" PDF file:
These files have either paragraph indents or paragraph spacing but not both. It still ought to be simple to properly format text from these files without wrapping errors.

3. The "bad" PDF file:
These files have neither paragraph indents nor paragraph spacing, and one is stuck with only looking at punctuation and end-of-line position to find where paragraphs break.

I have seen a "4th" case, where the file was one complete glob of text, with no breaks whatsoever .... I just throw those files away.

Please note that I got all the way to case 3 before even mentioning punctuation or end-of-line position. Unfortunately, this seems to be the only way that Calibre's PDF converter formats text, without even considering the first two cases.

I went to a folder with almost 200 PDFs in it and tallied up the first 60 files (and then gave up!). I found that 29 files matched case 1, 25 files matched case 2, and only 6 files matched case 3. I could probably go for a larger statistical base, but this still argues for a better way to analyze PDF files.
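For what it's worth, the tally above works out as follows. This just redoes the arithmetic on the 60 files counted; the dictionary keys are shorthand labels, not anything from calibre.

```python
# Counts from the 60 files tallied above.
tally = {"case 1": 29, "case 2": 25, "case 3": 6}

total = sum(tally.values())
shares = {case: round(100 * n / total, 1) for case, n in tally.items()}

# Files where indents and/or spacing could drive the unwrapping.
layout_usable = round(100 * (tally["case 1"] + tally["case 2"]) / total, 1)

print(total)          # 60
print(shares)         # {'case 1': 48.3, 'case 2': 41.7, 'case 3': 10.0}
print(layout_usable)  # 90.0
```

So in this small sample, roughly nine files out of ten carry some layout cue beyond punctuation.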

It would be really nice if the PDF converter first looked for paragraph indents and paragraph spacing and used those for controlling wrapping when possible, falling back to the worst case of punctuation and end-of-line position only when the other two failed.
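Something like the following is what I have in mind, as a rough Python sketch. It assumes each line is already available as an `(x, y, text)` tuple in reading order, which by all accounts the current converter does not provide; the function names, the tolerance, and the gap factor are invented for illustration.

```python
from statistics import median

ENDERS = (".", "!", "?", ":")


def paragraph_starts(lines, indent_tol=2.0, gap_factor=1.3):
    """Flag which lines start a paragraph; lines are (x, y, text) tuples."""
    if not lines:
        return []
    margin = min(x for x, _, _ in lines)
    gaps = [b[1] - a[1] for a, b in zip(lines, lines[1:])]
    normal_gap = median(gaps) if gaps else 0.0

    indented = [x - margin > indent_tol for x, _, _ in lines]
    spaced = [False] + [g > normal_gap * gap_factor for g in gaps]
    # Cases 1 and 2: the page has indents and/or extra spacing to go on.
    has_layout = any(indented) or any(spaced)

    starts = []
    for i in range(len(lines)):
        if i == 0:
            starts.append(True)
        elif has_layout:
            starts.append(indented[i] or spaced[i])
        else:
            # Case 3 fallback: only punctuation on the previous line is left.
            starts.append(lines[i - 1][2].rstrip().endswith(ENDERS))
    return starts


def join_paragraphs(lines):
    """Merge lines into paragraphs using the flags from paragraph_starts."""
    paragraphs = []
    for (_, _, text), is_start in zip(lines, paragraph_starts(lines)):
        if is_start or not paragraphs:
            paragraphs.append(text.strip())
        else:
            paragraphs[-1] += " " + text.strip()
    return paragraphs


# Case 2 sample: indents only. The first line ends with a period at the
# right margin, yet it is not a paragraph end.
indent_only = [
    (90, 100, "The ship left port at dawn."),
    (72, 114, "Nobody noticed until noon,"),
    (72, 128, "when Mr."),
    (72, 142, "Smith raised the alarm."),
    (90, 156, "By then it was too late."),
]
# Case 3 sample: no indents, no extra spacing.
no_layout = [
    (72, 100, "The ship left port"),
    (72, 114, "at dawn."),
    (72, 128, "Nobody noticed."),
]
print(join_paragraphs(indent_only))
print(join_paragraphs(no_layout))
```

In the indent-only sample, the lines ending in "dawn." and "Mr." are unwrapped correctly because the indent decides, not the punctuation. A real version would still have to cope with headings, centered text, and multi-column pages, none of which this sketch handles.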

ldolse: I understand the point you're making with your example. But after scrubbing thru piles of converted files, I would have to say that in 99.99% of the cases that match your example, the lines should be wrapped without a hard line break.
Old 09-20-2011, 10:49 AM   #6
kiwidude
Calibre Plugins Developer
Posts: 4,729
Karma: 2197770
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
@MacEvansCB - take a look at this thread:
https://www.mobileread.com/forums/sho...d.php?t=132726

In particular my posts and the responses towards the end. To summarise as I understand it, the problem is that the existing PDF engine does not retain the indent information, so punctuation is the *only* option it has. The new PDF engine will retain this information, but it is not yet finished (and seems perpetually on hold in favour of other priorities).
Old 09-20-2011, 03:24 PM   #7
jackie_w
Grand Sorcerer
Posts: 6,251
Karma: 16539642
Join Date: Sep 2009
Location: UK
Device: ClaraHD, Forma, Libra2, Clara2E, LibraCol, PBTouchHD3
@MacEvansCB,

My experience with PDF to HTML conversion may be of limited use to you, but I'll offer it anyway.

You could try using one of the utilities
  • pdftohtml.exe (freeware, also used by Calibre, I believe)
  • pdf2xml.exe (freeware used by mobipocket creator)
to convert your "nice PDFs" to XML format. Personally, I prefer the latter option as pdftohtml sometimes loses italics.

The output XML does contain positional (x, y) info for each line, namely distance from Left edge of page and distance from Top of page, so detecting paragraph indents is possible.

If you have some programming ability, with work (quite a lot of work) you can write something to parse the XML and reconstruct chapter headings, paragraphs, scene-breaks, italics, bold, smallcaps, images and hyperlinks as you convert the XML to HTML.

Even so, I have not found it to be a "single magic button" conversion process. Every PDF is different and supplying a little specific knowledge about a particular PDF can make a big difference to the quality of the resultant HTML. Also, I haven't even attempted to try and convert PDFs of technical manuals in this way, only novels.
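As a starting point for the parsing approach described above, here is a minimal Python sketch. It assumes XML shaped like the output of `pdftohtml -xml`, with `<page>` elements containing `<text top="..." left="...">` lines; the sample document is invented, and the element and attribute names should be checked against what your own tool actually writes (pdf2xml output may well differ).

```python
import html
import xml.etree.ElementTree as ET

# Invented sample, modelled on the assumed pdftohtml -xml layout.
SAMPLE_XML = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <text top="100" left="126" width="400" height="16" font="0">The ship left port at dawn.</text>
    <text top="118" left="108" width="420" height="16" font="0">Nobody noticed <i>anything</i> until</text>
    <text top="136" left="108" width="200" height="16" font="0">noon.</text>
    <text top="154" left="126" width="380" height="16" font="0">By then it was too late.</text>
  </page>
</pdf2xml>"""


def read_lines(xml_text):
    """Return (left, top, text) for every non-empty <text> element."""
    root = ET.fromstring(xml_text)
    lines = []
    for page in root.iter("page"):
        for el in page.iter("text"):
            # itertext() flattens <i>/<b> children, so italics are dropped
            # here; a fuller converter would carry them through.
            text = "".join(el.itertext()).strip()
            if text:
                lines.append((float(el.get("left")), float(el.get("top")), text))
    return lines


def to_html(lines, indent_tol=2.0):
    """Start a new <p> at each indented line, otherwise unwrap."""
    margin = min(left for left, _, _ in lines)
    paragraphs = []
    for left, _, text in lines:
        if left - margin > indent_tol or not paragraphs:
            paragraphs.append(text)
        else:
            paragraphs[-1] += " " + text
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)


print(to_html(read_lines(SAMPLE_XML)))
```

This only handles the paragraph-indent cue on a single page. Chapter headings, scene breaks, images, and hyperlinks are where the "quite a lot of work" comes in.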
Old 09-20-2011, 05:05 PM   #8
MacEvansCB
Thanks kiwidude ... I had started reading that thread, but didn't finish it as it appeared not to apply to what I was after.

OK Kovid ... I respectfully request that you please give the new PDF engine a higher priority. I certainly could use it!!!!!! While I do have a lot of nice HTMLs, RTFs, and ePUBs, most of the files I'm forced to work on are grungy PDFs.

And thanks for the recommendations, jackie_w ... unfortunately I'm on a Mac ... but fortunately I'm also running Windows 7 as a virtual machine on my Mac! I'll look at anything that might make my conversions simpler and less labor intensive.

I understand how much fun illustrated documents and technical manuals are. You can probably guess how much time and fun I had getting my copy of Piers Anthony's "Visual Guide to Xanth" online and properly formatted with all the illustrations. I started over from scratch three different times as I learned new ways to do things, and when I changed eReaders, which changed my eBook formats.