MobileRead Forums > E-Book Formats > Workshop

Old 08-27-2008, 06:52 AM   #1
Pulp
Palm Addict
Posts: 477
Karma: 1001951
Join Date: Aug 2008
Device: Cybook Gen3 [512mb, FW: 1.5]
PDF->mobi "my way"

After some experimenting I finally found a way to convert my pdfs.

It's a multi-step process, but in the end I get a mobi file that satisfies my needs.
  • First I run the PDF through ABBYY PDF Transformer with the layout set to "text flow" and create an rtf file (I suppose any other tool that does a good job of pdf->rtf is fine as well).
  • The rtf file is then opened in MS Word, where I do a spellcheck (it doesn't take more than a few minutes per book and makes sure there are no hyphe-nations left from the original text).
  • I save the file as filtered html in Word.
  • Finally, I run the file through a php script that does the following:
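    Code:
    // $text holds the filtered html saved from Word.
    // Mark <body> and <p> with placeholders so they can be restored after strip_tags().
    $text = str_replace(array("<body","</body>","<p","</p>"), array("{body}<body","</body>{/body}","{p}<p","</p>{/p}"), $text);
    // Remove every tag except basic formatting, headings and the document skeleton.
    $text = strip_tags($text, "<b><i><u><html><head><title><h1><h2><h3><h4><h5><h6>");
    // Restore attribute-free <body> and <p> tags; also close the gap before a detached "n".
    $text = str_replace(array("{body}","{/body}","{p}","{/p}"," n "," n."," n,"), array("<body>","</body>","<p>","</p>","n ","n.","n,"), $text);
    // Drop whitespace between tags, then collapse runs of whitespace into one space.
    $text = preg_replace('/>\s*</','><',$text);
    $text = preg_replace('/\s{2,}/',' ',$text);
    // Merge a paragraph that ends in a letter, digit or comma with the next one (page-break splits).
    $text = preg_replace('/([a-zA-Z\d,])<\/p><p>/','$1 ',$text);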

The html file I end up with still has bold, italic and underlined text as well as headings.
Paragraph breaks that were created only by page breaks are removed.
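
To make the effect of the script easier to see, here is a minimal sketch that wraps the same steps in a function and runs them on a small sample. The function name `clean_word_html` and the sample markup are made up for illustration; real Word output is far more verbose and may contain cases this does not cover.

```php
<?php
// Hypothetical wrapper around the steps from the script above (the name is made up).
function clean_word_html(string $text): string
{
    // Mark <body> and <p> tags with placeholders so they can be restored later.
    $text = str_replace(
        array("<body", "</body>", "<p", "</p>"),
        array("{body}<body", "</body>{/body}", "{p}<p", "</p>{/p}"),
        $text
    );
    // Keep only basic formatting, headings and the document skeleton.
    $text = strip_tags($text, "<b><i><u><html><head><title><h1><h2><h3><h4><h5><h6>");
    // Put back attribute-free <body> and <p> tags and close the gap before a detached "n".
    $text = str_replace(
        array("{body}", "{/body}", "{p}", "{/p}", " n ", " n.", " n,"),
        array("<body>", "</body>", "<p>", "</p>", "n ", "n.", "n,"),
        $text
    );
    // Drop whitespace between tags, then collapse runs of whitespace.
    $text = preg_replace('/>\s*</', '><', $text);
    $text = preg_replace('/\s{2,}/', ' ', $text);
    // Join a paragraph that ends in a letter, digit or comma with the next one.
    return preg_replace('/([a-zA-Z\d,])<\/p><p>/', '$1 ', $text);
}

// Made-up sample: the first sentence was cut in two by a page break.
$sample = "<body lang=DE>\n"
    . "<p class=MsoNormal><span style='font-size:12.0pt'>The rain had stopped when</span></p>\n"
    . "<p class=MsoNormal>she reached the <b>old</b> harbour.</p>\n"
    . "<p class=MsoNormal>Nobody was there.</p>\n"
    . "</body>";

$clean = clean_word_html($sample);
echo $clean, "\n";
// Prints:
// <body><p>The rain had stopped when she reached the <b>old</b> harbour.</p><p>Nobody was there.</p></body>
```

The first two paragraphs are merged because the first one ends in a letter, while the paragraph ending in a full stop is left alone. A paragraph that legitimately ends without punctuation would be merged too, so the result is still worth a quick read-through.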

Importing this html file with Mobipocket Creator gives a great result.

It definitely takes more time than a fully automated conversion, but the result is also a lot better.
Old 08-27-2008, 08:34 AM   #2
zelda_pinwheel
zeldinha zippy zeldissima
Posts: 27,827
Karma: 921169
Join Date: Dec 2007
Location: Paris, France
Device: eb1150 & is that a nook in her pocket, or she just happy to see you?
thanks very much for this explanation, pulp. it's always very helpful to see the workflow of others when you are learning.

just a question, what do you mean when you say you save the word file as "filtered" html ?
Old 08-27-2008, 09:34 AM   #3
Pulp
Palm Addict
MS Word can save a document as a web page in html (or what they think it is) or in a reduced version of html (they call it 'filtered') that does not use xml.
Old 08-27-2008, 09:38 AM   #4
zelda_pinwheel
zeldinha zippy zeldissima
ha ok. i believe i have heard of this before, but it is only available in more recent versions of word. thanks for the reply.

i have seen the "html" code generated by word and it's pretty appalling. it's good that now you can make a "filtered" version which (from what i hear) is a bit less catastrophic, since not everyone wants (or knows how) to write their html code by hand. i imagine that your php script also helps a lot to clean up the code. since i have an old version of word i can't use this feature so instead i prefer to do my html by hand most of the time.

feedbooks will soon have a wysiwyg editor which will allow you to make good formatting with clean code much more easily and will be a great help especially for people who are not as comfortable with code as you.
Old 08-27-2008, 11:25 AM   #5
Pulp
Palm Addict
Don't let yourself be fooled, the filtered version is still catastrophic.

This is why I run the php script to remove all but the necessary html tags.
You wouldn't believe how much smaller the file is after that.
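
As a rough illustration of the size difference, the sketch below applies the same placeholder trick to a single paragraph and compares the byte counts. The markup is a made-up imitation of Word's inline styling, not output copied from Word, so actual savings will differ.

```php
<?php
// Made-up fragment imitating Word's inline styling; real exports vary.
$before = "<p class=MsoNormal style='margin-bottom:0cm;line-height:normal'>"
        . "<span lang=EN-GB style='font-size:12.0pt;font-family:Arial'>"
        . "Call me Ishmael.</span></p>";

// Same placeholder trick as in the script from post #1, limited to <p> here.
$after = str_replace(array("<p", "</p>"), array("{p}<p", "</p>{/p}"), $before);
$after = strip_tags($after, "<b><i><u>");
$after = str_replace(array("{p}", "{/p}"), array("<p>", "</p>"), $after);

echo $after, "\n"; // prints: <p>Call me Ishmael.</p>
printf("%d bytes before, %d bytes after\n", strlen($before), strlen($after));
```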
Old 08-27-2008, 11:25 AM   #6
Hadrien
Feedbooks.com Co-Founder
Posts: 2,263
Karma: 145123
Join Date: Nov 2006
Location: Paris, France
Device: Sony PRS-t-1/350/300/500/505/600/700, Nexus S, iPad
Quote:
Originally Posted by zelda_pinwheel
feedbooks will soon have a wysiwyg editor which will allow you to make good formatting with clean code much more easily and will be a great help especially for people who are not as comfortable with code as you.
Clean code and standard are 2 different things though. We already use TidyHTML when we generate a book to make sure that it is XHTML-compliant. For our WYSIWYG editor: we'll have a special "paste from word" button, to clean things up a bit if you're pasting files from Word. With both the paste from word button and TidyHTML, the output should be pretty good. We also do some processing: for example, we add non-blank spaces around punctuation to avoid ugly line breaking.
Old 08-27-2008, 11:38 AM   #7
zelda_pinwheel
zeldinha zippy zeldissima
Quote:
Originally Posted by Hadrien
Clean code and standard are 2 different things though. We already use TidyHTML when we generate a book to make sure that it is XHTML-compliant. For our WYSIWYG editor: we'll have a special "paste from word" button, to clean things up a bit if you're pasting files from Word. With both the paste from word button and TidyHTML, the output should be pretty good. We also do some processing: for example, we add non-blank spaces around punctuation to avoid ugly line breaking.
you are right, they are two different things. but i am pretty confident that you will find a way to make feedbooks code both clean *and* standard.

what do you mean by non-blank space ? non-breaking space ? &nbsp; ?
Old 08-27-2008, 12:02 PM   #8
Hadrien
Feedbooks.com Co-Founder
Quote:
Originally Posted by zelda_pinwheel
what do you mean by non-blank space ? non-breaking space ? &nbsp; ?
Yeah, &nbsp; entities are part of what we automatically add.
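
For readers who want to try something similar in their own cleanup script, here is a minimal sketch of the idea. It is not Feedbooks' actual processing: the helper name `add_nbsp_before_punctuation` and the choice of punctuation marks (; : ! ?) are assumptions made for illustration.

```php
<?php
// Hypothetical helper: replace the space in front of ; : ! ? with &nbsp;
// so the punctuation mark cannot wrap onto the next line on its own.
// The set of punctuation marks is an assumption, not Feedbooks' actual rule.
function add_nbsp_before_punctuation(string $html): string
{
    // The lookahead matches a space only when one of the marks follows it.
    return preg_replace('/ (?=[;:!?])/', '&nbsp;', $html);
}

echo add_nbsp_before_punctuation("<p>non-blank space ? non-breaking space ?</p>"), "\n";
// Prints: <p>non-blank space&nbsp;? non-breaking space&nbsp;?</p>
```

The pattern treats the whole string as text, so it is best run after the tags have been stripped of attributes; otherwise a space inside an attribute value could be replaced as well.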
All times are GMT -4.

