|12-17-2012, 11:23 PM||#1|
Join Date: Dec 2012
Device: Kindle Paperwhite
Extra "<p>" tags when converting to AZW3 from pdf
It's a simple question, but there are hours and hours of mind-numbing work on the line for me if I can't solve it. I'm getting extra "<p>" tags (HTML paragraph tags) when I convert from PDF to AZW3. I'm sure the extra p tags would be there in the other formats too; it seems to be something in how Calibre is written to handle converting PDF formatting to HTML.
Is there a way to modify Calibre to only use p tags when there is an actual paragraph? Simple line wrapping should be handled automatically by the reader, as per usual.
As a matter of interest, Adobe Acrobat Pro handles this properly when you ask it to save a PDF as an HTML file: it only uses paragraph tags when there is an actual paragraph, and lets the reader handle line wrapping within each paragraph.
Your help is greatly appreciated!
Last edited by MrTanquery; 12-17-2012 at 11:25 PM.
|12-18-2012, 03:12 PM||#2|
Join Date: Jun 2012
You've just discovered the frustration of trying to convert from PDFs. I feel your pain.
First, read the sticky, especially the section titled "Some of my paragraphs are split into multiple paragraphs".
Short answer: PDFs don't have paragraphs; they have lines of text. The information about where one paragraph ends and another begins gets lost in the conversion to PDF, so it's not available for Calibre or any other conversion program to use. Some PDFs use workarounds to preserve that information (e.g. by putting blank lines between paragraphs), and in those cases Calibre is able to guess where to break paragraphs. The one you're working with apparently does not.
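To make the blank-line workaround concrete, here is a minimal Python sketch of how a converter could rebuild paragraphs when the extracted text does keep blank lines between them. This is only an illustration, not Calibre's code; the function names `paragraphs_from_blank_lines` and `to_html` are made up for this example.

```python
import html


def paragraphs_from_blank_lines(lines):
    """Group extracted text lines into paragraphs, splitting on blank lines."""
    paragraphs, current = [], []
    for line in lines:
        stripped = line.strip()
        if stripped:
            # Non-blank line: it belongs to the paragraph being collected.
            current.append(stripped)
        elif current:
            # Blank line: close the current paragraph, if there is one.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


def to_html(paragraphs):
    """Wrap each paragraph in a single <p> tag, escaping special characters."""
    return "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
```

If the PDF has no blank lines between paragraphs, this approach has nothing to go on and everything comes out as one paragraph; a converter that instead emits one tag per line produces the extra "<p>" tags described in the first post.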
Possible solutions include converting and cleaning up manually afterward (a lot of work), using Calibre's heuristic processing to try to guess where the paragraph breaks are (good, but not perfect), or trying to obtain the original in a different format, like EPUB, MOBI, or HTML. If the last option is possible, I recommend it as the best solution.
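For a rough idea of what "guessing" paragraph breaks can look like, here is a small Python sketch of one common heuristic: treat a line as the end of a paragraph only if it ends a sentence and is noticeably shorter than the longest line. This is an assumption-laden illustration, not Calibre's actual algorithm; `unwrap_lines` and its `unwrap_factor` parameter are hypothetical names chosen for this example.

```python
def unwrap_lines(lines, unwrap_factor=0.8):
    """Merge hard-wrapped lines into paragraphs using a length heuristic.

    A line closes a paragraph when it ends with sentence punctuation AND is
    shorter than unwrap_factor times the longest line. Everything else is
    assumed to be a mid-paragraph line wrap.
    """
    stripped = [ln.strip() for ln in lines if ln.strip()]
    if not stripped:
        return []
    threshold = unwrap_factor * max(len(ln) for ln in stripped)
    sentence_endings = (".", "!", "?", '."', '!"', '?"')

    paragraphs, current = [], []
    for ln in stripped:
        current.append(ln)
        if ln.endswith(sentence_endings) and len(ln) < threshold:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Leftover lines with no detected ending form a final paragraph.
        paragraphs.append(" ".join(current))
    return paragraphs
```

This also shows why the result is good but not perfect: a paragraph whose last line happens to run the full width gets merged into the next one, and a short line ending in an abbreviation can cause a false break.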
|calibre, extra <p> tags, pdf to html|