06-14-2012, 09:38 PM | #1 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2012
Device: Kindle 4NT
|
PDF pagebreaks turn to blank lines in mobi
Hi, there. I am new to the forum, but a bit experienced in converting things in Calibre.
I know that PDF is a bad format and yadda yadda, but sometimes the only source one has is in PDF, so... I've noticed that when converting from PDF to mobi, the PDF's page breaks become blank lines in the mobi, even if the option to "delete blank lines between paragraphs" is turned on. So even if the text is otherwise fine and flows well, there are blank lines that coincide with the PDF's page breaks. Is this a feature or a bug? Or am I missing something? Note: The PDF document I am talking about is a pure text document, without fancy formatting or images. Thanks a lot! |
06-14-2012, 09:51 PM | #2 |
Resident Curmudgeon
Posts: 73,668
Karma: 127838212
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Here is how to do it so it comes out correctly...
1. Use Calibre to convert the PDF to ePub (do not use the delete blank lines option ever).
2. Use Sigil to edit the resulting ePub.
3. A/B compare the PDF to the ePub (every letter, every space, every punctuation, everything).
4. Edit the ePub to fix the errors.
5. Edit the ePub to fix the formatting.
6. Validate the ePub using FlightCrew.
7. Convert to AZW3 using Calibre.
To do this right takes a lot of work. |
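For anyone who prefers to script the two conversion steps (1 and 7) rather than click through the GUI, here is a minimal sketch. It assumes Calibre's `ebook-convert` command-line tool is installed and on the PATH; the file name `book.pdf` and both helper functions are made up for illustration. The manual editing and validation in steps 2 to 6 still happen in between.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source: str, target_ext: str) -> list[str]:
    """Build an ebook-convert call that writes next to the source file."""
    src = Path(source)
    dst = src.with_suffix("." + target_ext.lstrip("."))
    return ["ebook-convert", str(src), str(dst)]


def run_if_available(command: list[str]) -> bool:
    """Run the command only if the tool is on PATH; report whether it ran."""
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True


# Step 1: PDF -> ePub (then edit and validate the ePub by hand).
to_epub = build_convert_command("book.pdf", "epub")
# Step 7: corrected ePub -> AZW3.
to_azw3 = build_convert_command("book.epub", "azw3")
print(to_epub)
print(to_azw3)
```

Calling `run_if_available(to_epub)` would perform the actual conversion; no extra options are passed here, so Calibre's defaults apply.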
06-14-2012, 10:46 PM | #3 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2012
Device: Kindle 4NT
|
Thanks.
What I do is convert the PDF to RTF, edit it, and then convert the edited RTF to mobi. Since there are no images or such, it works well. My question is: is the "insert blank lines in mobi when a page break in PDF is found" behavior a bug or a feature? Should I file a bug report? It seems like a bug to me. Thanks! |
06-14-2012, 11:24 PM | #4 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
It's not a feature or a bug - it's just a limitation/expected behavior. Did you read the sticky?
https://www.mobileread.com/forums/sho...d.php?t=118605 Most pdfs don't exhibit the specific problem you're mentioning - odds are there is some sort of header/footer in your pdf that's tripping up the normal pdf conversion - use search/replace to delete it. |
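As a rough illustration of the search/replace idea, here is a small Python sketch that drops running headers and bare page numbers from extracted text. The header string "Pride and Prejudice" and the "Page N" footer shape are assumptions chosen for the example; a real PDF needs its own pattern, and a title line that matches the header exactly would be dropped too.

```python
import re

# Assumed shapes: a fixed running header, or a footer like "12" or "Page 12".
HEADER_FOOTER = re.compile(r"\s*(?:Pride and Prejudice|(?:Page\s+)?\d+)\s*")


def strip_headers_footers(text: str) -> str:
    """Drop every line that consists only of the header or a page number."""
    kept = [line for line in text.split("\n") if not HEADER_FOOTER.fullmatch(line)]
    return "\n".join(kept)


sample = (
    "It is a truth universally acknowledged,\n"
    "Pride and Prejudice\n"
    "12\n"
    "that a single man in possession\n"
)
print(strip_headers_footers(sample))
```

The same expression could likely be adapted for the search and replace step of the conversion dialog, though it would need anchoring to line boundaries there, since this sketch relies on `fullmatch` against individual lines.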
06-15-2012, 12:56 PM | #5 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2012
Device: Kindle 4NT
|
Thanks for your answer. I've read the sticky many times in search of a solution. And every PDF I've tried to convert exhibits this problem, or I wouldn't be crying for help.
I think you can reproduce the problem. Here are the steps I followed just now, to be sure:
1. Download a .pdf book from Project Gutenberg (I got this one: http://www.gutenberg.org/ebooks/1342 ). There is no header or footer in it. You can also get the .txt version and make a PDF from it; the results are the same.
2. Convert it with Calibre to MOBI, with all the Heuristics options checked, and "Remove spacing between paragraphs" under "Look and Feel" checked too.
You can see in the resulting MOBI (using the internal viewer or even the Kindle itself) that there are blank lines corresponding to every page break in the PDF file. Playing around a bit, I've just found that these blank lines are soft scene breaks inserted by Calibre (if I use the option to "replace soft scene breaks" it becomes obvious). However, if a paragraph is split across two pages in the PDF, no soft scene break is inserted; instead, a new paragraph begins at the point of the page break. I can certainly use regex to fix these paragraph breaks, but I think Calibre could handle this "PDF page break -> MOBI soft scene break" problem itself. The problem becomes annoying in a text that already contains some real soft scene breaks, as you can imagine, since the resulting MOBI will have a lot of fake soft scene breaks. Cheers! |
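The regex fix for paragraphs split at a page break could look roughly like the sketch below. It is a heuristic, not what Calibre does internally: a paragraph is assumed to continue when the previous one does not end in sentence punctuation and the next one starts with a lowercase letter. The function name and sample text are invented for the example.

```python
import re

# A paragraph ending like this is treated as complete (closing quotes allowed).
SENTENCE_END = re.compile(r"[.!?:;][\"')\u201d\u2019]*\s*$")


def rejoin_split_paragraphs(paragraphs: list[str]) -> list[str]:
    """Merge a paragraph into the previous one when it looks like a continuation."""
    merged: list[str] = []
    for para in paragraphs:
        text = para.strip()
        # The previous paragraph is "open" if it has text but no sentence ending.
        previous_is_open = bool(merged and merged[-1]) and not SENTENCE_END.search(merged[-1])
        if text and previous_is_open and text[0].islower():
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


pages = [
    "It is a truth universally acknowledged, that a single man in",
    "possession of a good fortune, must be in want of a wife.",
    "However little known the feelings of such a man may be.",
]
for paragraph in rejoin_split_paragraphs(pages):
    print(paragraph)
```

Empty paragraphs are kept as they are, so real scene breaks survive; the trade-off is that a page break falling right after a full stop cannot be told apart from a genuine paragraph break.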
06-15-2012, 01:58 PM | #6 | |
Resident Curmudgeon
Posts: 73,668
Karma: 127838212
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
As for whether the page break issue is a bug or not, it's probably not. No two PDFs will convert the same, so it is what it is. You'll have to fix it afterward. So if you convert the PDF to ePub and then use Sigil to correct the output and fix the formatting, you can then convert it for your Kindle and get a better result than going through RTF. |
|
06-15-2012, 11:38 PM | #7 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I understand what you're saying then - I misunderstood and thought you meant that broken paragraphs weren't being connected across page breaks - blank lines getting inserted and appearing to be scene breaks is a little bit different.
I've submitted a patch which fixes it, will probably be in the next release. |
06-16-2012, 10:49 AM | #8 | |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2012
Device: Kindle 4NT
|
Quote:
|
|
06-16-2012, 09:56 PM | #9 |
Resident Curmudgeon
Posts: 73,668
Karma: 127838212
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Regardless of this patch, my directions are still valid and still correct and this still requires a lot of work to do it right and it does not involve RTF and/or Word in any way at all.
|
08-23-2012, 04:58 PM | #10 | |
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2012
Device: da5id403
|
A Lot Of Work – Understatement
Quote:
Bottom line – software should be able to do this. That's what PC's are for. There is a $.10 alternative. |
|
08-23-2012, 05:02 PM | #11 | |
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2012
Device: da5id403
|
Yes, Do That
Quote:
|
|
08-23-2012, 05:14 PM | #12 | |
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2012
Device: da5id403
|
Nothing This Simple Is Inevitable
Quote:
"How can I help make pdf conversion better?" "Improving pdf conversion is on the to-do list of the Calibre developers, but any help would be greatly appreciated. There is a new pdf engine that is currently in progress, and fixes many of the issues described above, like multi-column pdfs, ligatures, line wrapping, etc. Development is presently stalled, and there is no ETA for this being released. ... " PS: Most do. Last edited by da5id403; 08-23-2012 at 05:25 PM. Reason: To edit. |
|
08-23-2012, 06:59 PM | #13 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
This particular issue was resolved back in June - refer to post number 7. Please make sure you're using the latest version of Calibre.
|
08-24-2012, 06:11 AM | #14 |
null operator (he/him)
Posts: 20,460
Karma: 26645808
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
@daniel3ub - I've found that the mobicreator tool usually does a better job of converting PDFs than Calibre does, as is suggested in the "PDF Conversion - Read This First" sticky.
|
08-28-2012, 10:31 PM | #15 | |
Resident Curmudgeon
Posts: 73,668
Karma: 127838212
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
If the line spaces are created in CSS, then you can easily remove them. If they are in the XML code, it may not be as easy, but a search/replace (maybe with regex) may work. No software will do the conversion perfectly. It isn't possible. So my directions are all you have to get it right. |
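For the case where the blank lines sit in the XHTML itself, a search/replace along these lines may work. This is only a sketch: it assumes the blank lines are paragraphs holding nothing but whitespace, a non-breaking space, or a `<br/>`, and the class name `calibre1` in the sample is just illustrative. It would also remove any real scene breaks marked the same way, so check the result.

```python
import re

# Matches <p ...> elements whose only content is whitespace, nbsp, or <br/>.
EMPTY_PARAGRAPH = re.compile(
    r"<p\b[^>]*>(?:\s|&nbsp;|&#160;|<br\s*/?>)*</p>\s*",
    re.IGNORECASE,
)


def drop_empty_paragraphs(xhtml: str) -> str:
    """Remove empty paragraphs and the whitespace that follows them."""
    return EMPTY_PARAGRAPH.sub("", xhtml)


sample_xhtml = (
    '<p class="calibre1">First.</p>\n'
    '<p class="calibre1">&nbsp;</p>\n'
    '<p class="calibre1">Second.</p>\n'
)
print(drop_empty_paragraphs(sample_xhtml))
```

The regular expression part should be usable in any editor with regex search; the `\b` after `<p` keeps it from touching tags such as `<pre>`.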
|