PDF Conversion

wamblej · 10-13-2009, 10:34 PM

For some reason I cannot seem to get a successful PDF -> mobi conversion (or any other format, for that matter). I either get a clump of text (with no spacing or formatting at all) or else, more commonly, a bunch of line breaks in the middle of paragraphs.
Ex: This is a sentence and then halfway through the sentence or the para

graph it would have a space that is weird. And it seems like the majority of the formatting is wrong.

If you have any directions or methods that have worked for you, even if they require multiple conversions, that would be very helpful. Thanks!

mrmikel · 10-14-2009, 05:59 AM

When I do a conversion from HTML to LRF, these kinds of weird breaks are signs of non-breaking spaces. The non-breaking space binds the text before it to the text after it, so the line doesn't break normally.

I don't know if this is what is happening to you...just a thought.

You might try converting to epub; then you could look at the text with Sigil and see what is there.

user_none · 10-14-2009, 06:01 AM

You need to adjust the unwrap factor. If the text comes out as a clump, increase it; if it has broken paragraphs, decrease it.
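For anyone wondering what a setting like this does, here is a rough Python sketch of a length-based unwrap heuristic. It is not calibre's actual code. The function name `unwrap_lines`, the way the threshold is picked from the sorted line lengths, and the treatment of 0 as "off" are all assumptions made for illustration.

```python
def unwrap_lines(text: str, factor: float = 0.5) -> str:
    """Join hard-wrapped lines into paragraphs (illustrative heuristic only)."""
    lines = text.splitlines()
    lengths = sorted(len(l.strip()) for l in lines if l.strip())
    # Assumption: a factor of 0 (or no measurable lines) means "do nothing".
    if not lengths or factor <= 0:
        return text

    # Assumption: the threshold is the line length found at the given
    # fraction of the sorted lengths (0.5 is roughly the median).
    idx = min(int(len(lengths) * factor), len(lengths) - 1)
    threshold = lengths[idx]

    paragraphs = []
    buf = ""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line always ends the current paragraph.
            if buf:
                paragraphs.append(buf)
                buf = ""
            continue
        buf = f"{buf} {stripped}" if buf else stripped
        ends_sentence = stripped.endswith((".", "!", "?", '"'))
        # Short lines and lines ending a sentence close the paragraph;
        # long lines without end punctuation are joined to the next line.
        if len(stripped) < threshold or ends_sentence:
            paragraphs.append(buf)
            buf = ""
    if buf:
        paragraphs.append(buf)
    return "\n\n".join(paragraphs)


sample = (
    "This is a long line of wrapped text that\n"
    "continues on the next line and ends here.\n"
    "\n"
    "Short para."
)
print(unwrap_lines(sample, 0.5))
```

In this sketch a higher factor raises the threshold, so fewer lines are joined and more breaks survive; a lower factor joins more aggressively, which is the direction of adjustment described above.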

wamblej · 10-14-2009, 10:56 AM

The unwrap factor under PDF input is at 0.00 already. I tried changing it to 0.5, which actually corrected the problem. Thanks. I'll try it again with some other files and see if it helps. I don't know why it was defaulting to 0 instead of 0.5. I'll keep you posted on what I find.

mysweety · 10-15-2009, 01:17 PM

I have just been through this process.
Here is a procedure (linux):
a) Put the pdf in a viewer and do a "select all"
b) Put the text into openoffice and produce an .odt file
c) Adjust sentence length to 50% so as to join short bits with new line chars
d) Run replace for "" to "\n" and find for [a-z] - this gets rid of paras that begin with a small letter and dialogue that has been run together.
e) Split odt file into a separate file for each chapter or section you want in the TOC
f) Clean up the odt files against the original, checking sentences and paras, and put (i) and (ii) for example around italic text.
g) Convert these files to encoded text utf-8
h) Start ecub and put in these files for immediate conversion to html.
i) Clean these html files substituting <i> for (i) and </i> for (ii) etc and putting in images etc.
j) Compile to epub file and check in azardi that all the changes are ok.
k) Copy the resulting build folder to e.g. finalbuild
l) Correct the cover page and place a reference to the TOC in content.opf
m) Place a reference to the TOC in the title page
n) Run "zip -Xr9D $1.epub mimetype * -x .DS_Store" in FinalBuild to produce a new epub. Check in azardi
o) Run mobigen against content.opf with wine
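The zip step in the procedure above can also be done from Python, which avoids shell quoting issues. This is a minimal sketch, assuming the build folder already contains a valid `mimetype` file; the function name `build_epub` and the example paths are hypothetical. It writes `mimetype` first and uncompressed, as the EPUB container format expects, and skips `.DS_Store` files and directory entries.

```python
import zipfile
from pathlib import Path


def build_epub(build_dir: str, epub_path: str) -> None:
    """Zip a build folder into an epub (hypothetical helper, illustration only)."""
    root = Path(build_dir)
    with zipfile.ZipFile(epub_path, "w") as zf:
        # The mimetype entry goes first and is stored without compression.
        zf.write(root / "mimetype", "mimetype", compress_type=zipfile.ZIP_STORED)
        for path in sorted(root.rglob("*")):
            if path.is_dir():
                continue  # no directory entries, like zip -D
            rel = path.relative_to(root).as_posix()
            if rel == "mimetype" or path.name == ".DS_Store":
                continue  # already written, or Finder junk
            zf.write(path, rel, compress_type=zipfile.ZIP_DEFLATED)


# Example call (paths are placeholders):
# build_epub("FinalBuild", "book.epub")
```

Write the output file outside the build folder, otherwise the half-written epub would be picked up by the directory walk.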

user_none · 10-15-2009, 06:57 PM

Quote:

Originally Posted by mysweety

I have just been through this process.
Here is a procedure (linux):
...

The addition of the unwrap factor should make this unnecessary. However, I do realize it is not a perfect solution. Kovid is doing some work on PDF input right now that will make it even better.

wamblej · 10-15-2009, 10:26 PM

So far Calibre is doing a great job. I just needed to change that unwrap to 0.5. I went to preferences and changed my default to be 0.5 and it has been great.

mysweety · 10-16-2009, 08:13 AM

I think the problem is the style of the original and how faithfully one wishes to follow it.
I found the unwrap factor could handle about 50% of the problem, and recourse to regular expressions probably got it up to 80-90%. However, if you need to be really faithful to the original, then unfortunately it needs to be checked by hand, and this is where most of the time goes.
The advantage of ecub is that it is very flexible and produces good, simple xhtml files which can be edited with ease. Also, the problems surrounding the TOC disappear, and mobi, epub and voice can all be produced at the same time.
I think that calibre does a very good job but is limited in the degree of accuracy of the output.
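As an illustration of the regular-expression cleanup mentioned above, here is a small Python sketch that removes a break whenever the next line starts with a lowercase letter. It is only one possible rule, not the exact search-and-replace used in the procedure; the function name `join_lowercase_continuations` is hypothetical, and the rule will wrongly join genuine paragraphs that start in lowercase, so the result still needs checking by hand.

```python
import re


def join_lowercase_continuations(text: str) -> str:
    """Replace breaks that precede a lowercase letter with a single space."""
    # One or more newlines (with optional surrounding spaces or tabs),
    # followed by a-z, is treated as a spurious mid-sentence break.
    return re.sub(r"[ \t]*\n+[ \t]*(?=[a-z])", " ", text)


broken = "then halfway through the\n\nsentence it breaks.\n\nNew paragraph."
print(join_lowercase_continuations(broken))
```

Breaks before a capital letter or a quotation mark are left alone, which keeps real paragraph and dialogue boundaries intact in the common case.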



