10-13-2009, 10:34 PM | #1 |
Member
Posts: 10
Karma: 10
Join Date: Sep 2009
Device: Kindle 2
|
PDF Conversion
For some reason I cannot seem to get a successful PDF -> mobi conversion (or to any other format, for that matter). I either get a clump of text (with no spacing or formatting at all) or, more commonly, a bunch of line breaks in the middle of paragraphs.
Ex: This is a sentence and then halfway through the sentence or the para graph it would have a space that is weird. And it seems like the majority of the formatting is wrong. If you have any directions or methods that have worked for you - even if it requires multiple conversions that would be very helpful. Thanks! |
10-14-2009, 05:59 AM | #2 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
Thems the breaks
When I do a conversion from HTML to LRF, these kinds of weird breaks are signs of non-breaking spaces. A non-breaking space binds the text before it to the text after it, so the line doesn't break normally.
I don't know if this is what is happening to you...just a thought. You might try conversion to epub, then you could look at the text with sigil and see what is there. |
10-14-2009, 06:01 AM | #3 |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
You need to adjust the unwrap factor. If the text comes out as a clump, increase it; if it has broken paragraphs, decrease it.
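To make the idea of an unwrap factor concrete, here is a minimal Python sketch. It is only an illustration and is not calibre's code: the function name `unwrap_lines` and the percentile-based threshold are assumptions. The factor picks a line length from the sorted line lengths, and any line at least that long that does not end a sentence is joined to the next one.

```python
def unwrap_lines(text: str, factor: float) -> str:
    """Join hard-wrapped lines into paragraphs.

    Hypothetical heuristic for illustration only, not calibre's code:
    the line length found at the `factor` percentile of all line lengths
    is the threshold. A line at least that long which does not end in
    sentence-ending punctuation is treated as wrapped and is joined to
    the line that follows it.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ""

    # Pick the threshold length at the requested percentile (0.0 to 1.0).
    lengths = sorted(len(ln) for ln in lines)
    index = min(int(len(lengths) * factor), len(lengths) - 1)
    threshold = lengths[index]

    paragraphs = []
    current = ""
    for ln in lines:
        current = f"{current} {ln}" if current else ln
        # Ignore closing quotes or brackets when looking for end punctuation.
        ends_sentence = ln.rstrip("\"')").endswith((".", "!", "?"))
        wrapped = len(ln) >= threshold and not ends_sentence
        if not wrapped:
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)
```

In this sketch a low factor joins almost everything (the clump), while a high factor leaves more lines as separate paragraphs (the broken paragraphs), so a value somewhere in the middle is the usual starting point.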
|
10-14-2009, 10:56 AM | #4 |
Member
Posts: 10
Karma: 10
Join Date: Sep 2009
Device: Kindle 2
|
The unwrap factor under PDF input is at 0.00 already. I tried changing it to 0.5, which actually corrected the problem. Thanks. I'll try it again with some other files and see if it helps. I don't know why it was defaulting to 0 instead of 0.5. I'll keep you posted on what I find. Thanks.
|
10-15-2009, 01:17 PM | #5 |
Connoisseur
Posts: 61
Karma: 7104
Join Date: Jul 2009
Device: Hanlin V3, PB360
|
I have just been through this process.
Here is a procedure (Linux):
a) Put the PDF in a viewer and do a "select all".
b) Put the text into OpenOffice and produce an .odt file.
c) Adjust sentence length to 50% so as to join short bits with new line chars.
d) Run replace for "" to "\n" and find for [a-z] - this gets rid of paras that begin with a small letter and dialogue that has been run together.
e) Split the .odt file into a separate file for each chapter or section you want in the TOC.
f) Clean up the .odt files against the original, checking sentences and paras, and put (i) and (ii), for example, around italic text.
g) Convert these files to UTF-8 encoded text.
h) Start ecub and put in these files for immediate conversion to HTML.
i) Clean these HTML files, substituting <i> for (i) and </i> for (ii) etc., and putting in images etc.
j) Compile to an epub file and check in azardi that all the changes are OK.
k) Copy the resulting build folder to e.g. finalbuild.
l) Correct the cover page and place a reference to the TOC in content.opf.
m) Place a reference to the TOC in the title page.
n) Run "zip -Xr9D $1.epub mimetype * -x .DS_Store" in FinalBuild to produce a new epub. Check in azardi.
o) Run mobigen against content.opf with wine. |
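For anyone without the zip command to hand, the packaging step of the procedure above can be sketched in Python with the standard zipfile module. This is a rough equivalent under stated assumptions, not a drop-in replacement: the function name `build_epub` and its arguments are made up for illustration, and it skips every .DS_Store file rather than only the top-level one. The point it shows is that the mimetype file goes into the archive first and uncompressed, with everything else compressed after it.

```python
import zipfile
from pathlib import Path


def build_epub(build_dir: str, epub_path: str) -> None:
    """Zip an unpacked epub folder, writing `mimetype` first and uncompressed.

    Hypothetical helper for illustration; names are not from any tool.
    """
    root = Path(build_dir)
    target = Path(epub_path).resolve()

    with zipfile.ZipFile(epub_path, "w") as zf:
        # The mimetype entry must come first and must be stored, not deflated.
        zf.write(root / "mimetype", "mimetype", compress_type=zipfile.ZIP_STORED)

        for path in sorted(root.rglob("*")):
            if path.is_dir():
                continue  # no directory entries, like zip -D
            rel = path.relative_to(root).as_posix()
            if rel == "mimetype" or path.name == ".DS_Store":
                continue  # already written, or Finder debris
            if path.resolve() == target:
                continue  # do not pack the archive into itself
            zf.write(
                path,
                rel,
                compress_type=zipfile.ZIP_DEFLATED,
                compresslevel=9,  # maximum compression, like zip -9
            )
```

Calling `build_epub("FinalBuild", "mybook.epub")` (both names are placeholders) would produce an archive that can then be checked in a reader as in the step that follows the zip command.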
10-15-2009, 06:57 PM | #6 |
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
The addition of the unwrap factor should make this unnecessary. However, I do realize it is not a perfect solution. Kovid is doing some work on PDF input right now that will make it even better.
|
10-15-2009, 10:26 PM | #7 |
Member
Posts: 10
Karma: 10
Join Date: Sep 2009
Device: Kindle 2
|
So far Calibre is doing a great job. I just needed to change that unwrap to 0.5. I went to preferences and changed my default to be 0.5 and it has been great.
|
10-16-2009, 08:13 AM | #8 |
Connoisseur
Posts: 61
Karma: 7104
Join Date: Jul 2009
Device: Hanlin V3, PB360
|
I think the problem is the style of the original and how faithfully one wishes to follow it.
I found the unwrap factor could handle about 50% of the problem, and recourse to regular expressions probably got it up to 80-90%. However, if you need to be really faithful to the original then unfortunately it needs to be checked by hand, and this is where it takes most time. The advantage of ecub is that it is very flexible and produces good, simple XHTML files which can be edited with ease. Also, the problems surrounding the TOC disappear, and mobi, epub and voice can all be produced at the same time. I think that calibre does a very good job but is limited in the degree of accuracy of the output. |
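As an example of the kind of regular-expression clean-up meant here, the following Python sketch merges paragraph breaks that are followed by a lowercase letter, on the assumption that a paragraph starting with a small letter is really the continuation of the previous one. The pattern and the name `merge_false_breaks` are illustrative guesses, not the exact expressions used on the book, and the result still needs checking by hand.

```python
import re

# Assumption: a blank-line paragraph break followed by a lowercase letter
# is a false break left over from the PDF layout.
FALSE_BREAK = re.compile(r"\s*\n\s*\n\s*(?=[a-z])")


def merge_false_breaks(text: str) -> str:
    """Replace false paragraph breaks with a single space."""
    return FALSE_BREAK.sub(" ", text)


if __name__ == "__main__":
    sample = "The rain fell\n\nsteadily all night.\n\nMorning came."
    print(merge_false_breaks(sample))
```

Breaks before a capital letter are left alone, so real paragraph boundaries survive, but so do false breaks that happen to fall before a name or the word "I".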
Tags |
conversion, pdf |
|