View Full Version : Using perl scripts to produce .IMP ebooks and more...


nrapallo
02-03-2008, 08:29 PM
I wanted command-line tools to convert .html directly into .IMP format, bypassing the eBook Publisher GUI. Don't get me wrong, I think the eBook Publisher is a very powerful tool. It is the most effective way to deal with multiple input files, especially if they do not 'lend' themselves to being used in an ebook.

To achieve the best results, the .html should first be cleaned up by 'Tidy'. This removes those annoying '?' characters, corrects an ill-formed TOC and cleans up the .html.

I needed this primarily as I was converting single .html files from various sources (expanded .PRC/.PDB, exploded .LIT, Blackmask/Project Gutenberg .html, etc.). I found it cumbersome to use the eBook Publisher for just one file, especially if the .html filename was in the format 'authorname - title'.html. For these .html files, I had all the info I needed to properly create a .IMP ebook in the filename; all I had to do was choose the category and I would be finished!
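As a rough illustration of the 'authorname - title'.html convention described above, here is a minimal Python sketch (not part of the perl scripts) that splits such a filename into its two parts. The function name `split_author_title` is made up for this example, and it assumes the first ' - ' in the name separates author from title.

```python
from pathlib import PureWindowsPath


def split_author_title(filename: str) -> tuple[str, str]:
    """Split 'Author - Title.html' into (author, title).

    Assumption: the first ' - ' separates the author from the title,
    so a title may itself contain ' - '.
    """
    stem = PureWindowsPath(filename).stem  # drop directory and extension
    author, sep, title = stem.partition(" - ")
    if not sep:
        raise ValueError(f"no ' - ' separator in {filename!r}")
    return author.strip(), title.strip()


print(split_author_title("Charles Dickens - Oliver Twist.html"))
```

The category is not in the filename, so it still has to be supplied by hand.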

Enter the perl scripts...

To use these perl scripts, it is required that:
1. You have previously installed the eBook Publisher software from http://www.ebooktechnologies.com/support_publisher_download.htm . The perl scripts use 'SBPubX' interface calls to create, view and manipulate .opf and .IMP files.
2. Perl scripts can be executed on your computer. For Windows, I had to install ActivePerl from ActiveState (http://www.activestate.com/store/activeperl/).

BUILDIMP

A simple batch file called buildIMP.bat demonstrates how .IMP ebooks can be created using the workhorse routine 'Html2imp.pl'. The 'Html2imp.pl' perl script takes as input four parameters: 'Authorname' 'Title' 'Category' and 'htmlfilename'. If any of the parameters contain spaces, then quotes need to surround that parameter!

After executing the sample batch file, the .IMP ebook is produced along with the .opf project file used internally. This file can later be loaded into eBook Publisher for further processing, if necessary.
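The quoting rule above (any parameter containing spaces needs quotes) can be sketched in Python as follows. This is only an illustration of how the four parameters line up, in the order given above; the helper name `html2imp_command` is hypothetical, and the actual buildIMP.bat may look different.

```python
import subprocess


def html2imp_command(author: str, title: str, category: str, html_file: str) -> str:
    """Build a Windows command line for Html2imp.pl.

    Parameter order follows the description above:
    Authorname, Title, Category, htmlfilename.
    """
    args = ["perl", "Html2imp.pl", author, title, category, html_file]
    # list2cmdline adds double quotes around any argument containing spaces
    return subprocess.list2cmdline(args)


print(html2imp_command("Charles Dickens", "Oliver Twist", "Fiction",
                       "Charles Dickens - Oliver Twist.html"))
```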

EXAMINEOPF

This perl script is invoked by 'examineOPF.pl project.opf'. It prints some information about the .opf file to stdout. If warranted, this output can be redirected to a log file.

INFOIMP

This perl script is invoked by 'infoIMP.pl ebook.imp'. It prints some information about the .IMP file to stdout. If warranted, this output can be redirected to a log file.

A variant of this script is 'infoIMPcsv.pl' which will 'dump' the .IMP details to stdout in 'comma separated values' format. You should redirect the output to a file so it can be opened in Microsoft Excel or similar for further exploration.

Another variant is 'infoIMPtab.pl' which will 'dump' the .IMP details to stdout in 'tabbed text' format. Again, you should redirect the output to a file so it can be opened in Microsoft Excel or similar for further exploration.

Try these on a directory full of .IMP files and you will get a mini-database of .IMP details!
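One way to picture the 'mini-database' idea is the Python sketch below: run the CSV variant once per .IMP file and concatenate whatever each run prints. The exact columns printed by 'infoIMPcsv.pl' are not described here, so the sketch treats the output as opaque text; `collect_imp_details` and `run_info_csv` are names invented for this example.

```python
import subprocess


def run_info_csv(imp_name: str) -> str:
    """Run infoIMPcsv.pl on one file and return what it printed to stdout."""
    result = subprocess.run(
        ["perl", "infoIMPcsv.pl", imp_name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def collect_imp_details(filenames, runner=run_info_csv) -> str:
    """Concatenate the CSV output for every .imp file in `filenames`."""
    chunks = []
    for name in sorted(filenames):
        if not name.lower().endswith(".imp"):
            continue  # skip anything that is not an .IMP ebook
        out = runner(name)
        if out and not out.endswith("\n"):
            out += "\n"  # keep one record per line
        chunks.append(out)
    return "".join(chunks)


# Typical use (assumes perl and infoIMPcsv.pl are available):
#   import os
#   with open("imp_details.csv", "w") as fh:
#       fh.write(collect_imp_details(os.listdir(".")))
```

The `runner` argument is there only so the loop can be tried without perl installed.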

VALIDATEOPF

This perl script is invoked by 'validateOPF.pl project.opf'. It validates the .opf and prints all errors/warnings to stdout. Redirect to a log file, if warranted. This can be used to extract the error log of a complex .opf build for future study.

Please feel free to modify these to suit your needs and consider sharing your achievements for others to benefit.

-Nick

EDIT: 18-May-2008 added windows executables (see IMP_OPF_windows-executables.zip) of each perl script for those that can't/won't work with perl scripts directly.

nrapallo
02-07-2008, 12:03 PM
New 'infoIMPdir.bat' showing how to get 'mini-database' of .IMP details. Just open the resulting .csv in Microsoft Excel or similar to further explore your .IMPs!

Just place the 'infoIMP*' files (from the first posting above) in any .IMP directory and execute the 'infoIMPdir.bat' provided below!

I know this is 'crude', but the results are worthwhile!

-Nick

nrapallo
02-12-2008, 12:47 AM
In the Content Forum under the (sticky) Mobiperl thread started by tompe, you will find post #219 (http://www.mobileread.com/forums/showpost.php?p=148514&postcount=219) that enables you to directly convert from mobipocket .mobi/.prc to .IMP formats via a perl script based on tompe's 'mobi2html'.

This new perl script is named 'mobi2imp.pl' and is available as a windows executable, 'mobi2imp.exe'.

MOBI2IMP

A simple batch file called mobi2IMP.bat demonstrates how .IMP ebooks can be converted directly from mobipocket .mobi/.prc using the workhorse routine 'mobi2imp'. 'Mobi2imp' takes as input two mandatory parameters: 'MobiSource' and 'ExplodeDir' and three optional parameters: 'Category' 'Authorname' and 'Title'. If any of the parameters contain spaces, then quotes need to surround that parameter!

To run this manually, just:
perl mobi2imp.pl --verbose "Oliver Twist.prc" Oliver
or
c:\> mobi2imp.exe --verbose "Oliver Twist.prc" Oliver

After executing the sample batch file, the .IMP ebook is produced along with the .opf project file used internally. This file can later be loaded into eBook Publisher for further processing, if necessary.

Attached below is the 'mobi2imp.pl' code, 'mobi2imp.exe' as well as two sample conversions in the .zip file for anyone who wants to test it out.

You must have the eBook Publisher software previously installed as well as the proper perl lib setup**. This will allow those with many mobipocket .mobi/.prc files to migrate them to their ebookwise 1150 easily.

Note: ** using 'mobi2imp.pl' requires a tricky setup as I used, as a base, the 'Mobiperl' package prepared by tompe (see his website http://www.ida.liu.se/~tompe/mobiperl/ for detailed setup instructions).

While getting all the right libs is daunting, it is very rewarding once it is set up properly. After all this SETUP, it is easy. I promise!

This all started out at post #197 (http://www.mobileread.com/forums/showpost.php?p=148120&postcount=197) in the Mobiperl thread and has evolved into a functional perl script.

For a MINI-TUTORIAL, check here (http://www.mobileread.com/forums/showpost.php?p=153083&postcount=21).

For the Mobi2imp Wiki, check here (http://wiki.mobileread.com/wiki/Mobi2imp).

Enjoy!

-Nick

Previous changes...

version 2 - Now 'Category Author Title' are optional and don't need to be provided (if the mobipocket ebook was 'well' composed).

version 3 - Now more forgiving of poorly constructed anchors (seen in feedbooks.com .prc's) and will insert the '<a name' tag as long as the 'filepos' points to the start of a tag, i.e. "<". This will help retain most, if not all, hyperlinks!
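The version 3 rule can be illustrated with a small Python sketch (the real logic lives in the perl source and may differ in detail). It inserts an anchor only where a 'filepos' offset lands on "<"; the anchor name format `filepos<N>` and the function name are assumptions made for this example.

```python
def insert_filepos_anchors(html: bytes, positions) -> bytes:
    """Insert an '<a name=...>' anchor at each offset that starts a tag.

    Offsets are processed from highest to lowest, so earlier insertions
    never shift the offsets that are still to be handled.
    """
    out = html
    for pos in sorted(set(positions), reverse=True):
        # only accept offsets that point at the '<' of a tag
        if 0 <= pos < len(html) and html[pos:pos + 1] == b"<":
            anchor = b'<a name="filepos%d"></a>' % pos
            out = out[:pos] + anchor + out[pos:]
    return out


sample = b"<p>One</p><p>Two</p>"
print(insert_filepos_anchors(sample, [10, 12]))
```

In the sample, offset 10 points at the second "<p>" and gets an anchor, while offset 12 points inside the tag and is skipped.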

version 4 - Things that changed:
- Now better warns that eBook Publisher must be installed first.
- now takes switches '--1200' and '--1100' to allow for the simultaneous creation of the REB 1200 and REB 1100 versions along with the EBW 1150 .IMP version.
- conversely, if the switch '--1150' is specified, then the EBW 1150 .IMP version is NOT created.

version 5 - Things that are allowed now:
- now allows you to change the text one font size larger ('medium') and one font size smaller (back to 'x-small') by using '--largerfont' and '--smallerfont' respectively.
- per JSWolf's request, you can now change margins from the default (2%) to '--nomargins' (0%), '--largemargins' (5%) and even '--hugemargins' (8%)
- you can change the default text-align from justify to '--nojustify' (i.e. left aligned).
- further to Kovidgoyal's recent 'mobi2oeb' post, can now produce OEBFF (.oeb) output with '--oeb'.
As a result, the output can be any and all at once of: '--1150' .IMP, '--1200' .IMP, '--1100' .rb and '--oeb' OEBFF!

version 6 - Changes:
- per DaleDe's request, you can now change margins from the default (2%) to '--tinymargins' (2px).
- no longer requires external program (nconvert.exe); all image 'fixing' done internally by GD.pm (thanks to tompe for this suggestion)!

version 7 - Changes:
- per DaleDe's suggestion, you can now add small indent with '--indent'.
- per JSWolf's request, you can now eliminate (blank line) paragraph separation with '--nopara' (may also need to indent para with '--indent').
- per DaleDe's suggestion, you can now get more info with '--verbose' or '--debug'.
- first attempt at a 'readme.txt' - you get this also by executing 'mobi2imp' without any parameters.

version 8 - Changes:
- can now override default .IMP naming of 'Author - Title'.ext, by using '--out MYIMPBOOKNAME' to specify .IMP filename produced (omit .ext)
- BUGFIX: now strips the <body> tag of any BD/mobi specific in-line styles before starting to 'fix' things.

EDIT 21 Feb 2008: version 9 - Changes:
- mobi2imp.exe (version 9) - windows executable (very stable now!)
- can now handle (text) .pdb files properly i.e. ereader 'TEXt'/'REAd' type
- now makes the BookDesigner notice at the end 'small print' by default
- can make that BD notice 'big print' with '--BDbig' (case sensitive)
- can make that BD notice start on a new page using '--BDnewpage'
- can even remove that BD notice at the end with '--BDremove'
- to add flair, can use '--bgcolor #FF80FF' to set background color for every page
- BUGFIX: Only when using '--nopara' option, some <br />'s get ignored so another <br /> is added; if this creates issues, then '--noBRfix' will not add the second <br />.
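To summarise how the layout switches from versions 5 to 7 relate to each other, here is a small Python lookup sketch. The values are the ones quoted in the change lists in this thread (e.g. 2% default margin, 'small' default font, 1em indent); how mobi2imp applies them internally is not shown here, so the dictionary keys and the function name `style_settings` are purely illustrative.

```python
# Switch -> value tables, taken from the change lists in this thread
MARGINS = {
    "--nomargins": "0%",
    "--tinymargins": "2px",
    "--largemargins": "5%",
    "--hugemargins": "8%",
}
FONT_SIZES = {"--largerfont": "medium", "--smallerfont": "x-small"}


def style_settings(options):
    """Return the layout settings implied by a list of switches."""
    settings = {
        "margin": "2%",           # default margin
        "font-size": "small",     # default body font size
        "text-align": "justify",  # default alignment
        "text-indent": "0em",     # no indent by default
    }
    for opt in options:
        if opt in MARGINS:
            settings["margin"] = MARGINS[opt]
        elif opt in FONT_SIZES:
            settings["font-size"] = FONT_SIZES[opt]
        elif opt == "--nojustify":
            settings["text-align"] = "left"
        elif opt == "--indent":
            settings["text-indent"] = "1em"
    return settings


print(style_settings(["--hugemargins", "--nojustify", "--indent"]))
```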

TO DO:
- better documentation and even a tutorial would be nice
- ability to add a (default) 'cover' image to every conversion from .mobi to .imp exists, but not yet ready for the consequences
- ability to add running headers (a la GEBLibrarian) exists, but not yet fully implemented
- add more user defined settings along with some 'Mobiperl' fixes like TOC first, cover link, prefix title...
- add Windows GUI ala PDFRead 1.8



EDIT: For a new GUI based Mobi2IMP with many improvements, see Mobi2IMP 9.4 with new Windows GUI & UTF-8 (http://www.mobileread.com/forums/showthread.php?t=22178)

DaleDe
02-18-2008, 01:52 PM
I have added a description in the wiki for this tool. It is very simple so far and needs additional data but it is a start.

http://wiki.mobileread.com/wiki/Mobi2imp

Can the version be added somewhere in the pl file please? I am starting to get confused as to what I have downloaded and what the latest is. (Maybe a --v option also to print this out.)

Over time the other perl scripts can be added to the wiki also but I just want to get something down today.

Dale

nrapallo
02-18-2008, 02:06 PM
To do in 'mobi2imp' version 7 (started but not yet ready for release):

- add '--TOC switch' to add that TOC entry to the beginning of the file. abandoned.
- perl script/source code had version number, but now it is printed out. done.
- more documentation/tutorial in the works (thanks for the wiki entry)

This program is a testament to the solid foundation provided by tompe's 'mobi2html'. It made the .IMP specific changes so easy to merge from my original 'html2imp.pl'. I never thought it would take off this much, so fast.

As more users use it, I will make any 'necessary' corrections/modifications to aid in the direct conversion of .prc to .imp.

-Nick

JSWolf
02-18-2008, 02:31 PM
To do in 'mobi2imp' version 7 (started but not yet ready for release):

- add '--TOC switch' to add that TOC entry to the beginning of the file.
- perl script/source code had version number, but now it is printed out.
- more documentation/tutorial in the works (thanks for the wiki entry)

This program is a testament to the solid foundation provided by tompe's 'mobi2html'. It made the .IMP specific changes so easy to merge from my original 'html2imp.pl'. I never thought it would take off this much, so fast.

As more users use it, I will make any 'necessary' corrections/modifications to aid in the direct conversion of .prc to .imp.

-Nick
What about the paragraph spacing? is that going to be fixed in the next version? I don't read IMP books, but I personally consider making eBooks with line spaces at every paragraph to be substandard and I won't do that to the readers.

DaleDe
02-18-2008, 03:53 PM
I just built a book using the latest version 6 mobi2imp and it said it built a 1150 but it really built a 1200.

Dale

DaleDe
02-18-2008, 04:03 PM
What about the paragraph spacing? is that going to be fixed in the next version? I don't read IMP books, but I personally consider making eBooks with line spaces at every paragraph to be substandard and I won't do that to the readers.

I do not think this will be too hard to correct, probably as a new option. As Nick said, the eBook Publisher default style is to add a space between paragraphs, but this can be overridden with a style change of the <p> element. He thought at first it was the <div>, but that is because he mixed up html0 (BD) with html as used in this script.

Dale

JSWolf
02-18-2008, 06:04 PM
I do not think this will be too hard to correct, probably as a new option. As Nick said, the eBook Publisher default style is to add a space between paragraphs, but this can be overridden with a style change of the <p> element. He thought at first it was the <div>, but that is because he mixed up html0 (BD) with html as used in this script.

Dale
The eBook I asked him to convert to test was one I made into a proper PRC using HTML exported from BD. I used Harry's directions to make it a proper PRC. I'm hoping that once mobi2imp is done and ready, I can use that to make better eBooks based on my PRC editions than from BD. I've already implemented the larger font fix for BD. Now all I need to do is wait for the script to be fixed.

DaleDe
02-18-2008, 07:06 PM
The eBook I asked him to convert to test was one I made into a proper PRC using HTML exported from BD. I used Harry's directions to make it a proper PRC. I'm hoping that once mobi2imp is done and ready, I can use that to make better eBooks based on my PRC editions than from BD. I've already implemented the larger font fix for BD. Now all I need to do is wait for the script to be fixed.

Hmm, is it <p> or <div>? If you run Nick's program, the html file is left behind. Could you take a look please?

Dale

nrapallo
02-19-2008, 12:31 AM
I just built a book using the latest version 6 mobi2imp and it said it built a 1150 but it really built a 1200.

Dale

Dale:

I've had this happen once or twice before.

I re-installed the eBook Publisher software version 2.2.5 and it fixed the issue.

I think the libraries might have gotten 'unstable' by some other .IMP making programs (GEBLibrarian, BD, Softbook Word macro, my 'mobi2imp' perl script)

-Nick

nrapallo
02-19-2008, 12:33 AM
The eBook I asked him to convert to test was one I made into a proper PRC using HTML exported from BD. I used Harry's directions to make it a proper PRC. I'm hoping that once mobi2imp is done and ready, I can use that to make better eBooks based on my PRC editions than from BD. I've already implemented the larger font fix for BD. Now all I need to do is wait for the script to be fixed.

JSWolf, wait for 'mobi2imp' version 7 and you will be satisfied!

-Nick

nrapallo
02-19-2008, 11:05 AM
JSWolf, wait for 'mobi2imp' version 7 and you will be satisfied!

-Nick

Mobi2imp (version 7) with windows executable now out! (See post #3 above)

Version 7 - Changes:
- mobi2imp.exe (version 7) - windows executable
- per DaleDe's suggestion, you can now add small indent with '--indent'.
- per JSWolf's request, you can now eliminate (blank line) paragraph separation with '--nopara' (this sets '--indent' automatically).
- per DaleDe's suggestion, you can now get more info with '--verbose' or '--debug'.
- first attempt at a 'readme.txt' - you get this also by executing 'mobi2imp' without any parameters.

To follow soon, a tutorial, once I gather enough user feedback. :eek:

Enjoy!

-Nick

nrapallo
02-20-2008, 09:50 AM
Mobi2imp (version 8) with windows executable now out! (See post #3 above)

VERSION 8 - Changes:
- mobi2imp.exe (version 8) - windows executable (very stable now!)
- now allows you to specify the .IMP filename produced, overriding the default naming of 'Author - Title'.ext
- BUGFIX: now strips the <body> tag of any BD/mobi specific in-line styles before starting to 'fix' things.

TO DO:
- better documentation and even a tutorial would be nice
- ability to add a (default) 'cover' image to every conversion from .mobi to .imp exists, but not yet ready for the consequences
- add more user defined settings along with some 'Mobiperl' fixes like TOC first, cover link, prefix title...

-Nick

JSWolf
02-20-2008, 11:49 AM
nrapallo, I need to get version 7 back to test something. Can you please post it again? Thanks!

Version 8 has a bug in it that strips out blank lines that are supposed to be there. And I think version 7 kept them. That's why I want to test version 7.

JSWolf
02-20-2008, 12:59 PM
I've figured out what is going on and why the bug exists.

the blank lines are <br /> and they are not being picked up and converted to a blank line. If you fix that, you'll be good to go. I looked at the expanded HTML and yes, it had <br /> for the blank lines.

So no, I do not need to test version 7. Just get a fixed version 8 or a version 9.

nrapallo
02-20-2008, 02:00 PM
I've figured out what is going on and why the bug exists.

the blank lines are <br /> and they are not being picked up and converted to a blank line. If you fix that, you'll be good to go. I looked at the expanded HTML and yes, it had <br /> for the blank lines.

So no, I do not need to test version 7. Just get a fixed version 8 or a version 9.

You are right about the <br /> issue. It surfaces when using '--nopara', as the eBook Publisher doesn't seem to respond to a <br /> that sits right beside the <div> construct. I have seen this in past conversions I did before the 'mobi2imp' days.

I have a work-around fix that could be inserted after line 289 in 'mobi2imp.pl' (just after the <body> tag substitution):

if (defined $opt_nopara) {
    # force <br /> to work better in eBook Publisher
    $html =~ s/<br([^>])*><div/<BR \/><BR \/><div/g;
}

This is better than just forcing two <br />'s everywhere (what I tried first and didn't like!) Further testing is required to ensure this doesn't 'break' something else...
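For readers who do not read perl, the substitution above behaves roughly like this Python sketch (an equivalent written for illustration, not code taken from mobi2imp). The function name `fix_br_before_div` is made up here.

```python
import re

# A <br ...> tag immediately followed by "<div", with nothing in between.
# The match is case-sensitive, like the perl original.
BR_BEFORE_DIV = re.compile(r"<br[^>]*><div")


def fix_br_before_div(html: str) -> str:
    """Double up a <br /> that sits directly in front of a <div>."""
    return BR_BEFORE_DIV.sub("<BR /><BR /><div", html)


print(fix_br_before_div('Scene one.<br /><div>Scene two.</div>'))
```

Because the replacement uses upper-case <BR />, running the fix twice does not add further breaks, and a <br /> that is not directly followed by <div is left alone.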

I used this 'fix' to produce the attached .IMP version of 'The Heretic.prc'

Is this better?

-Nick

p.s. the links to Chapter 9, Chapter 10 and Chapter 22 don't get fixed by the mobi code in 'mobi2imp' so you may want to check the original .prc to see if it is working properly. Also, in Chapter 40, I noticed an extra line para break ('<br /><br /><div') where the original .prc has a '<br /><div' which probably shouldn't be there

nrapallo
02-20-2008, 03:38 PM
I've figured out what is going on and why the bug exists.

the blank lines are <br /> and they are not being picked up and converted to a blank line. If you fix that, you'll be good to go. I looked at the expanded HTML and yes, it had <br /> for the blank lines.

So no, I do not need to test version 7. Just get a fixed version 8 or a version 9.

Ok, I just produced mobi2imp.exe (version 8b) for now to test this <br /> issue.

I've tried it on some files with no unexpected results, so maybe it will work.

By the way, can you just put the below .bat file in a directory full of .prc/.mobi that you want to convert and check for any other problems?

Just make sure this new 'mobi2imp.exe' is in your 'path'.

Have fun!

JSWolf
02-20-2008, 05:09 PM
I've tried your new version 8b and it seems to work. Check out the new version of The Heretic (http://www.mobileread.com/forums/showthread.php?t=17580&highlight=chapman) to see. I would like to know what you think of it on an actual EB1150.

nrapallo
02-22-2008, 12:30 AM
Mobi2imp (version 9) with windows executable now out! (See post #3 above)

VERSION 9 - Changes:
- mobi2imp.exe (version 9) - windows executable
- can now handle (text) .pdb files properly i.e. ereader 'TEXt'/'REAd' type
- now makes the BookDesigner notice at the end 'small print' by default
- can make that BD notice 'big print' with '--BDbig' (case sensitive)
- can make that BD notice start on a newpage using '--BDnewpage'
- can even remove that BD notice at the end with '--BDremove'
- to add flair, can use '--bgcolor #FF80FF' to set background color for every page
- BUGFIX: Only when using '--nopara' option, some <br />'s get ignored so another <br /> is added; if this creates issues, then '--noBRfix' will not add the second <br />.

The 'mobi2imp' program is now stable and mature enough to be used effectively in re-conversion efforts, starting from a .prc copy of an ebook whose .IMP was previously made with BD.

Enjoy!

nrapallo
02-24-2008, 11:29 PM
Mobi2IMP (version 9.4) with windows executable now out! (See post here (http://www.mobileread.com/forums/showthread.php?t=22178))


Mini-tutorial follows:

After installing Mobi2IMP 9.4 using the Windows installer, you can use the new Windows GUI instead of using the dos/command prompt or perl script.

REQUIRED: You must have the eBook Publisher software previously installed to facilitate the conversions.

The mobi2imp.exe can be run from within an already opened DOS box (i.e. command prompt) and then only needs one argument, "My Source.prc". If the filename contains any spaces, you need to surround it with quotes. Don't double-click the .exe or the .pl directly; it is best run from within a batch file (see below).

You can also specify 'Category' (like Fiction) 'Author' or 'Title'.

Try this:
c:\> mobi2imp.exe --verbose "My Source.prc" Fiction

If you want to automate this, try running a batch file (just copy and paste this into a file called 'prc2imp.bat'):
@echo off
rem Convert .mobi/.prc to .imp process devised by Nick Rapallo (Jan. 2008)
rem =============================================
rem Start the conversion of all .prc files in this directory to .imp format
rem For GEB 1150/EBW 1150 only output; add switch '--1200' for REB 1200 .IMP

for %%i in (*.prc) do mobi2imp.exe --verbose "%%i" "%%~ni"
for %%i in (*.mobi) do mobi2imp.exe --verbose "%%i" "%%~ni"
for %%i in (*.pdb) do mobi2imp.exe --verbose "%%i" "%%~ni"

rem That's it! We are now finished the conversion of all .prc files
echo WoW! All done.
pause

This will allow those with many mobipocket .prc/.mobi/.pdb files to migrate them to their ebookwise 1150 easily. For recursive batch processing, see post#11 below (http://www.mobileread.com/forums/showpost.php?p=163046&postcount=11)

Then all you have to do is put mobi2imp.exe in your path (or current directory), your 'prc2imp.bat' into the current directory containing your .prc, and then double-click 'prc2imp.bat' (just ensure you don't have too many .prc as ALL of them will be converted!)
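The batch file above boils down to "one mobi2imp call per source file, using the filename without extension as the second argument". A Python sketch of that same loop is shown below for comparison; `conversion_jobs` is a name invented for this example, and unlike the batch file it processes files in plain name order rather than all .prc first.

```python
from pathlib import PureWindowsPath

# Extensions handled by the three 'for' loops in prc2imp.bat
SOURCE_EXTENSIONS = (".prc", ".mobi", ".pdb")


def conversion_jobs(filenames, options=("--verbose",)):
    """Return one mobi2imp.exe argument list per convertible file."""
    jobs = []
    for name in sorted(filenames):
        path = PureWindowsPath(name)
        if path.suffix.lower() in SOURCE_EXTENSIONS:
            # path.stem plays the role of %%~ni in the batch file
            jobs.append(["mobi2imp.exe", *options, name, path.stem])
    return jobs


for job in conversion_jobs(["Oliver Twist.prc", "notes.txt", "Emma.mobi"]):
    print(job)
```

Each list could then be handed to `subprocess.run`, assuming mobi2imp.exe is on the path.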

Also, options like 'margins' and 'text-justification' can be better controlled in mobi2imp via command-line '--options'. Popular options are:
'--out IMPFILENAME' set .IMP filename to use (overrides default naming)
'--smallerfont' use 'x-small' font size for body text like pre-fix BD not default 'small'
'--nojustify' no full justification (i.e. left-aligned) not 'justify'
'--nopara' use no paragraph separation not 'blank line' (1em) separation
'--indent' use small (1em) indent instead of no (0em) indent

These options (sometimes called switches) go just after mobi2imp.exe and ALWAYS start with two dashes (e.g. '--verbose'). Just forget about getting the .pl and ActiveState Perl setup working. With the .exe, you don't even need ActiveState Perl!

With mobi2imp, just beware that you're stuck with any inconsistencies (if any) introduced by the .prc/.mobi original when converting over. However, 'mobi2imp' also creates a .opf that can be loaded into eBook Publisher and from there you can further edit/build it.

All in all, I like the output of mobi2imp.

I have been converting Madam Broshkina's .prc posts (with her permission) using mobi2imp.exe. For one I did recently, I used this command-line (in a batch file):
mobi2imp.exe --1200 "Authors Various_The Worlds Greatest Books Volume V.prc" "AUTHOR5" Fiction "Authors, Various" "The Worlds Greatest Books Vol 5"
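The argument order used in that command (switches first, then source file, explode directory, and the optional category, author and title) can be sketched in Python like this. The helper `mobi2imp_command` is hypothetical and only handles flag-only switches; value-taking options such as '--out NAME' are left out of this sketch.

```python
import subprocess


def mobi2imp_command(source, explode_dir, category=None, author=None,
                     title=None, options=()):
    """Build a mobi2imp.exe command line with the switches placed first."""
    for opt in options:
        if not opt.startswith("--"):
            raise ValueError(f"switches must start with two dashes: {opt!r}")
    args = ["mobi2imp.exe", *options, source, explode_dir]
    # Optional parameters are positional, so stop at the first missing one
    for optional in (category, author, title):
        if optional is None:
            break
        args.append(optional)
    # Quotes are added only around arguments that contain spaces
    return subprocess.list2cmdline(args)


print(mobi2imp_command(
    "Authors Various_The Worlds Greatest Books Volume V.prc", "AUTHOR5",
    "Fiction", "Authors, Various", "The Worlds Greatest Books Vol 5",
    options=["--1200"],
))
```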


Download "Authors Various_The Worlds Greatest Books Volume V.prc (http://www.mobileread.com/forums/showpost.php?p=152908&postcount=1)" and see if you can duplicate the .IMP posted for the same ebook with the above command.

p.s. Thanks to DaleDe there is now a wiki entry for mobi2imp here (http://wiki.mobileread.com/wiki/Mobi2imp)

Moonraker
03-24-2008, 10:59 AM
I have used ebook Publisher (v2.2.3) for many years to create .imp files from my own XHTML code.
I could not get Mobi2Imp working at first because it could not find Publisher but after I updated it to v2.2.5 all worked well.

I was surprised that Publisher had been updated because I understood it was not being supported any more.

Mobi2Imp is a great and useful tool.

I have only one gripe - why does it replace a closing </p> with <div height="0em"></div> <div height="0em"></div>?

This makes for a larger than necessary file and also it is difficult to edit the html code to correct it.

nrapallo
03-24-2008, 11:33 AM
I have used ebook Publisher (v2.2.3) for many years to create .imp files from my own XHTML code.
I could not get Mobi2Imp working at first because it could not find Publisher but after I updated it to v2.2.5 all worked well.

I was surprised that Publisher had been updated because I understood it was not being supported any more.

Mobi2Imp is a great and useful tool.

I have only one gripe - why does it replace a closing </p> with <div height="0em"></div> <div height="0em"></div>?

This makes for a larger than necessary file and also it is difficult to edit the html code to correct it.

Mobi2IMP does not parse the HTML code; it just does some (global) search and replaces to "fix" things that are "broken" when used with eBook Publisher.

The </p> endings are being stripped/altered BEFORE the .prc is used by Mobi2IMP. I'm not sure if it is tompe's Mobiperl code or the use of BookDesigner. I think BookDesigner may be the culprit of the '<div height="0em"></div> <div height="0em"></div>' construct. Either way it's not Mobi2IMP's doing.

BTW, if you can read perl, check the source, mobi2imp.pl, for items replaced by Mobi2IMP.

I've used eBook Publisher for years and have come to respect its power and usefulness. It also doubles as a great "validator" of HTML v3.2 code since it gives detailed error messages and points to the error in the source file.

Sure, it has a few shortcomings (image re-sizing fails to honour bottom margin; missing fraction HTML numeric codes; ...) but it has seen some improvements over the years (margin indents work better now; new default 'small' font size for eBookwise 1150; ...).

If you have any more specific Mobi2IMP questions/comments, please be sure to post in the thread
Mobi2IMP 9.4 with new Windows GUI released! (http://www.mobileread.com/forums/showthread.php?t=22178) Here, we discuss the current (and future) version of Mobi2IMP.

p.s. drool... How do you like the iLiad vs. the eBookwise 1150?

Moonraker
03-24-2008, 04:21 PM
Mobi2IMP does not parse the HTML code; it just does some (global) search and replaces to "fix" things that are "broken" when used with eBook Publisher.

The </p> endings are being stripped/altered BEFORE the .prc is used by Mobi2IMP. I'm not sure if it is tompe's Mobiperl code or the use of BookDesigner. I think BookDesigner may be the culprit of the '<div height="0em"></div> <div height="0em"></div>' construct. Either way it's not Mobi2IMP's doing.

Regarding the stripping of the </p> endings: as I don't use BookDesigner or Perl, I don't suppose it is either of them. I always create my books in html. Then, using the html file, I create an imp file using eBook Publisher and then a prc file using Mobipocket Creator. I think the culprit must be Mobipocket Creator.

I've used eBook Publisher for years and have come to respect its power and usefulness. It also doubles as great "validator" of HTML v3.2 code since it gives detailed error messages and points to the error in the source file.

I agree 100%. Another good validator I use is Amaya.

p.s. drool... How do you like the iLiad vs. the eBookwise 1150?

I love all my ebook readers and wouldn't part with any of them.
I prefer my Cybook Gen 3 to the iLiad because of its longer lasting battery and faster boot-up time.
But the eBookwise 1150 beats them all in terms of ergonomics and ease of use. This would be my ideal if it had an e-ink screen.

nrapallo
03-24-2008, 04:52 PM
Regarding the stripping of the </p> endings. As I don't use book designer or Perl then I don't suppose it is either of them. I always create my books in html. Then using the html file I create an imp file using eBook Publisher and then a prc file using Mobipocket Creator. I think the culprit must be Mobipocket Creator.

To investigate this further, you may want to try to convert the .prc directly to html using tompe's windows binaries here (https://dev.mobileread.com/dist/tompe/mobiperl/mobiperl-win-0.0.37.zip) (use mobi2html.exe in the .zip).

Just issue the command:
mobi2html "Your.prc" TempDir
Then examine the resulting .html in the TempDir directory. If you want to see just the "raw html" before Mobiperl manipulates it, try:
mobi2html --rawhtml "Your.prc" Temp >My.html

BTW, I used 'mobi2html' as the base code for Mobi2IMP (at least the .prc to .html part)

Hope this helps!

Moonraker
03-24-2008, 06:37 PM
Deleted post

nrapallo
03-24-2008, 07:01 PM
Thank you very much for the link and for the instructions.

This is all very interesting to me because I have never before seen the html code behind a prc file.

This is a recent ability perfected by tompe with his Mobiperl code. I had "hacked" makedoc9 (popular .pdb to .txt converter) years ago to strip out the images and fix the <img...> to substitute the 'filenames' for the 'reindex' tag. It allowed me to see the .html code behind the .prc for the first time. Sadly, I had no idea what a 'filepos' was and the href links were all broken. That's why I was so taken by tompe's efforts and wanted to combine the two worlds (.prc to .imp)!

For the test I used the same prc file in two different folders, giving the prc files different names.

The result, as far as I can see, is that the two files are identical using either:

mobi2html "Your.prc" TempDir

or

mobi2html --rawhtml "Your.prc" Temp >My.html

Both files are the same size and have the same number of lines and both end with �</body></html>

Both files have all the closing </p>'s stripped and replaced by <div height="0em"></div>
<div height="0em"></div>.

All my curly quotes i.e. &8220; and &8221; have been changed to &quot; (straight quotes).
My em-dash codes &8212; have all been changed to &mdash; etc.
Note: I had to omit the # sign from the above numerics in order to get this posted.

And my HTML(XML) header is completely changed although the charset=UTF-8 has been kept.

It appears to be Mobipocket Creator that changes the code, don't you think?

BTW, those endings may be easy to strip out as they don't mean anything and aren't needed. I haven't come across these issues with .prc's built by BookDesigner or HarryT. Any other quirks to watch out for?

Moonraker
03-24-2008, 07:53 PM
Sorry for my previous post - I had missed the my.html file, thinking it would be in another folder. I have retested the files and the following are my findings:

Thank you for the link and for the instructions:

This is all very interesting to me because I have never before seen the html code behind a prc file.

For the test I used the same prc file but gave it two different names.

First file test (mobi2html "Your.prc" TempDir):

Size: 1152 KB

� 250 occurrences appeared throughout the document. These would have to be removed.
i.e. adhering changed to adherin�g

</p> stripped and replaced by <div height="0em"></div> <div height="0em"></div>

&8220; changed to &ldquo;
&8221; changed to &rdquo;
&8217; changed to &rsquo;
&8212; changed to &mdash;
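
The numeric-to-named mapping listed above can be reproduced with a short Python sketch. This is not Mobiperl's code: the function name numeric_to_named is made up for illustration, and the only thing assumed is Python's standard html.entities table. The # that had to be dropped from the post is present in the pattern.

```python
import re
from html.entities import codepoint2name

def numeric_to_named(html_text):
    """Replace decimal references such as &#8220; with named entities such as &ldquo;."""
    def swap(match):
        name = codepoint2name.get(int(match.group(1)))
        # Leave the reference untouched when HTML 4 defines no name for it.
        return "&%s;" % name if name else match.group(0)
    return re.sub(r"&#(\d+);", swap, html_text)

print(numeric_to_named("&#8220;Yes&#8221; &#8212; it&#8217;s fine"))
```

References with no HTML 4 name are passed through unchanged, so the sketch never invents an entity.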

Headings - i.e. <h4>Chapter 10</h4> Changed to = <h4 align="center"><font size="+1"><b>Chapter 10</b></font></h4>

<b></b> added to headings but where <strong></strong> were in the original file they have been left unchanged.

<br style="page-break-after:always" /> inserted at end of file.




Second file test (mobi2html --rawhtml "Your.prc" Temp >My.html)

Size: 1155 KB

All numeric code unchanged.

<b></b> Added to Headings but <strong></strong> in original file left unchanged.

<font size="+1"> added to Headings

</p> left unchanged but <div height="0em"></div> <div height="0em"></div> added between paragraphs. This seems superfluous to me.

<mbpagebreak/> added to end of file.


When I put the file through Tidy.exe I got 8833 warnings that <div> attribute "height" has invalid value "0em"

NOTE: # omitted from numeric codes to get this posted.

nrapallo
03-24-2008, 09:01 PM
That � entry is weird. I wonder what the rationale behind it was. I know this will sound like you are chasing your tail, but if you make an .imp with this .html using eBook Publisher, does it bomb? It does if the HTML char &# 20; exists in the ebook, i.e. <p>Html documents with this entity &# 20; bomb! No output produced by eBook Publisher v2.2.5</p>

Note for display purposes, I put a space between '#' and '2' that shouldn't be there!
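
One way to check a file for both suspects before handing it to eBook Publisher is sketched below in Python. It is only an illustration: the function name find_suspects and the choice of which control characters to allow (tab, line feed, carriage return) are assumptions, and whether every flagged reference really makes eBook Publisher fail has only been observed here for entity 20.

```python
import re

ALLOWED_CONTROLS = {9, 10, 13}  # tab, line feed, carriage return

def find_suspects(html_text):
    """Return (count of U+FFFD characters, sorted list of control-character references)."""
    replacement_count = html_text.count("\ufffd")
    bad_refs = sorted({
        m.group(0)
        for m in re.finditer(r"&#(\d+);", html_text)
        if int(m.group(1)) < 32 and int(m.group(1)) not in ALLOWED_CONTROLS
    })
    return replacement_count, bad_refs

sample = "<p>adherin\ufffdg &#20; &#8220;ok&#8221;</p>"
print(find_suspects(sample))
```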

I think we can conclude that the Mobiperl code strips the </p> and just leaves behind the Mobipocket empty <div>'s. I have seen this behaviour with the .pdb to .imp routine in Mobi2IMP. BTW, you can take a PalmDOC .pdb (TEXt/REAd) document and have Mobi2IMP create a .imp version.

tompe
03-24-2008, 11:10 PM
I think we can conclude that the Mobiperl code strips the </p> and just leaves behind the Mobipocket empty <div>'s. I have seen this behaviour with the .pdb to .imp routine in Mobi2IMP. BTW, you can take a PalmDOC .pdb (TEXt/REAd) document and have Mobi2IMP create a .imp version.

It should not modify the html in this way if you do not use a fixhtml flag. I will look at this when I am back from Eastercon (British SF con) in a couple of days.

JSWolf
03-25-2008, 12:41 AM
The other Mobipocket to html converter I have actually leaves characters such as the curly quotes and em dashes as the actual characters and not the HTML #s. Cannot mobi2html do the same thing?

nrapallo
03-25-2008, 01:28 AM
The other Mobipocket to html converter I have actually leaves characters such as the curly quotes and em dashes as the actual characters and not the HTML #s. Cannot mobi2html do the same thing?

Want to try a new (actually very old) .prc-->.html converter? It's called 'makedocN' and was hacked together by me almost four years ago.

It's crude, but should yield the same results as your java code on the iLiad. No filepos links are fixed, nor <mbp: pbreak>, etc., but I did strip out the images (without file ext). I used other batch files to convert all the .prc to .rb (Rocket eBooks / REB 1100 formats). The leftover .html was used to generate .imps, by hand, one at a time. It was tedious and I only converted the .prc I was going to read instead of everything in sight. Sigh, Mobi2IMP had not yet been born!!!

The hack was based on 'makedoc9', but I called it 'makedocN' (N as in Nick!).

The attached .zip includes everything you need (I hope) to run makedocN. It was compiled with cygwin and requires its .dll to execute. Just unzip, place your .prc's in the directory, double-click 'doprc.bat' and wait for it to finish.

What do you think, could I give tompe a run for his money? :rofl: or should I just stick with what I am good at i.e. .imp! :thumbsup:

HarryT
03-25-2008, 06:05 AM
Book Designer uses "<DIV>" for everything, rather than "<P>". All the Mobi books I've posted in the last few months are the result of saving HTML from BD, doing some minor edits (e.g. replacing "<HR>" with "<mbp : pagebreak/>") and then using Mobi Creator to create the PRC.

Moonraker
03-25-2008, 10:11 AM
Want to try a new (actually very old) .prc-->.html converter? It's called 'makedocN' and was hacked together by me almost four years ago.

I tried this on one of my own .prc creations.

I thought your 'makedocN' converter was excellent. It preserved the layout, retained all # numeric punctuation, and retained all the original tags. It is very fast and extremely easy to use.

I would still remove the following code inserted by Mobipocket Creator:

<div height="0em"></div> <div height="0em"></div> but because the closing </p> is preserved this would be an easy find and replace task.

The Headings I would change to my preferred simple and clean:

<h4>Chapter Number</h4>
instead of
<h4 align="center"><font size="+1"><b>Chapter Number</b></font></h4>
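
The find-and-replace described above could look something like the following Python sketch. The two patterns are assumptions taken from the exact markup quoted in this post, so they only match that precise attribute order and spacing; this is plain regular-expression substitution, not an HTML parser, and the name clean_mobi_html is made up for illustration.

```python
import re

# The empty spacer div that appears between paragraphs.
EMPTY_DIV = re.compile(r'\s*<div height="0em">\s*</div>')

# The decorated heading, with the heading text captured.
HEADING = re.compile(
    r'<h4 align="center"><font size="\+1"><b>(.*?)</b></font></h4>',
    re.DOTALL,
)

def clean_mobi_html(html_text):
    """Drop the empty divs and reduce decorated <h4> headings to plain ones."""
    html_text = EMPTY_DIV.sub("", html_text)
    return HEADING.sub(r"<h4>\1</h4>", html_text)

sample = ('<h4 align="center"><font size="+1"><b>Chapter 10</b></font></h4>'
          '<p>Text.</p> <div height="0em"></div> <div height="0em"></div>')
print(clean_mobi_html(sample))
```

Because the closing </p> is preserved in the makedocN output, removing the divs does not merge paragraphs.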

I don't care that it doesn't batch convert.


I then tried it on Harry's Lorna Doone Vol 1.prc.

I wanted the code to have some white space so I could read it easily so I ran the file through Tidy.exe:

Tidy reported:
1777 warnings, 52 errors were found! Not all warnings/errors were shown.
This document has errors that must be fixed before using HTML Tidy to generate a tidied up version.

So Tidy could not work on the file until the errors had been fixed.

I changed the html header to my own one and removed every <mbp:pagebreak/> and put it through Tidy again.

This time Tidy reported:
1454 warnings, 0 errors were found!

and the tidied code was easier for me to read.

Tidy.exe had corrected all out-of-date upper case tags to lower case, changed <b> to <strong>, and changed all &nbsp; to & #160;. All en or em dashes were shown as ? because the original file did not contain the correct html code for these. <i> tags were corrected to <em>, <br/> tags were corrected to <br />, and all <font> tags were removed.

But most disconcerting of all, all the paragraphs started and ended with <div></div> respectively. I hate this because it makes for a horrendous task to clean up the html code because of all the other <divs this and <divs that.

I use Textpipe frequently to clean up bad html code and I doubt that even this fine programme could easily sort out all these divs.

So, if the original .prc file contains good clean code, it is a quick and easy task to clean up the resultant html. But if it contains bad outdated code then it ain't so easy.

I understand that the code generated by, say, BookDesigner is adequate for creating good-looking ebooks, but a peek under the hood reveals out-of-date, bloated code. My aim is to future-proof my html files with good clean code that will render faster, reduce size and convert to any format for most reading devices.

Thank you and well done nrapallo for your makedocN converter. I shall be using it frequently.

nrapallo
03-25-2008, 10:55 AM
So, if the original .prc file contains good clean code, it is a quick and easy task to clean up the resultant html. But if it contains bad outdated code then it ain't so easy.

They have a great acronym for this: GIGO! (Garbage in Garbage out)

I understand that the code generated by say, BookDesigner is adequate in creating good looking ebooks but a peek under the hood reveals out of date bloated code. My aim is to future proof my html files with good clean code that will render faster, reduce size and convert to any format for most reading devices.

Thank you and well done nrapallo for your makedocN converter. I shall be using it frequently.

Thank you for the compliments. I abandoned use of 'makedocN' when I stumbled upon 'mobi2html' since I do like to have working links with anchors, advanced .prc parsing for say Title, Author, Category, Cover Image, Thumbnail Image, etc...

Never in my wildest dreams did I think someone else would find it useful and only posted it here, as an alternative, to view the resulting unaltered .html inside a .prc. You, have made my day! :happydance:

tompe
03-26-2008, 01:12 PM
The other Mobipocket to html converter I have actually leaves characters such as the curly quotes and em dashes as the actual characters and not the HTML #s. Cannot mobi2html do the same thing?

Why do you need this?

It might be possible but it seems much more robust to use entities.

nrapallo
03-26-2008, 01:43 PM
It might be possible but it seems much more robust to use entities.

I agree; original characters may break eBook Publisher and result in those dreaded '?' appearing in their place. Then you're forced to pass the HTML through Tidy to "clean up" those '?' and get entities (e.g. &copy; for ©) anyway!

Entities, be they numeric or named, are more useful, albeit harder to read.
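
For anyone who ends up with literal characters and wants entities instead, here is a minimal Python sketch. It is not what mobi2html or Tidy does internally; it simply relies on Python's built-in "xmlcharrefreplace" error handler, and the function name to_numeric_entities is made up for illustration.

```python
def to_numeric_entities(text):
    """Encode every non-ASCII character as a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_numeric_entities("\u201cHello\u201d \u2014 \u00a9 2008"))
```

Plain ASCII text passes through unchanged, so the function is safe to run on HTML that already uses entities.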

nrapallo
05-28-2008, 12:31 AM
I added to post #1 (http://www.mobileread.com/forums/showthread.php?p=146088#post146088) above, windows executables (see IMP_OPF_windows-executables.zip (http://www.mobileread.com/forums/attachment.php?attachmentid=13068&d=1211944852)) of each perl script in post #1 (http://www.mobileread.com/forums/showthread.php?p=146088#post146088) and post #2 (http://www.mobileread.com/forums/showthread.php?p=147170#post147170) in this thread for those that can't/won't work with perl scripts directly.

Enjoy!

nrapallo
01-03-2009, 05:37 PM
I finally figured out (through trial & error) how to properly use BuildFromWordDoc via the Builder interface of the PubX.dll OLE library, as explained here (http://www.mobileread.com/forums/showthread.php?p=292974#post292974). Even though you can convert .doc directly into .imp (even in batches) using the BulkConvert program that also "ships" with the eBook Publisher software, this perl script (Word2imp.pl) now allows pre- & post-processing changes to be made using perl scripts! :thumbsup:

WORD2IMP
A simple batch file called Word2IMP.bat demonstrates how .IMP ebooks can be created using the workhorse routine 'Word2imp.pl'. The 'Word2imp.pl' perl script takes as input a single filename of the .rtf, .doc or .html to convert to .imp. If the filename contains spaces, then quotes need to surround the parameter!
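
To illustrate the quoting rule, the Python sketch below builds the Windows command line for a given filename. The invocation form "perl Word2imp.pl <file>" is an assumption (the batch file itself is not reproduced here), and build_command is a hypothetical helper; subprocess.list2cmdline is the standard-library routine that applies the Windows quoting rules.

```python
import subprocess

def build_command(script, filename):
    """Return the Windows command line, quoting arguments that contain spaces."""
    # Assumed invocation: perl <script> <filename>
    return subprocess.list2cmdline(["perl", script, filename])

print(build_command("Word2imp.pl", "My Book.doc"))
print(build_command("Word2imp.pl", "MyBook.doc"))
```

Only the filename containing a space gets wrapped in quotes; the second command is left bare.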

After executing the sample batch file, the .IMP ebook is produced along with the .opf project file used internally. However, since temporary files are used to build the .imp, the source .html file created gets deleted at the end and is no longer available. :( A work-around is to create a .oeb ebook and then 'unpack' it to see that intermediary .html file!

Then and only then can the ebook be loaded into eBook Publisher for further processing, if necessary.

EDIT: For a revised Word2imp.pl Perl script that produces a better .doc to .imp conversion (set CSS=1 or, better still, CSS=2), see post #41 (http://www.mobileread.com/forums/showthread.php?p=324309#post324309) below.

=X=
01-10-2009, 05:18 PM
Nick many thanks for the Perl Script. You did a fantastic job. With a little elbow grease I was able to port your PERL script to VBA and integrate it with the BookCreator tool. The IMP files created from BookCreator are excellent! Many thanks!

Also, one recommendation: change the setting to CSS=1. Even though the documentation says the CSS feature is obsolete and not used, I found this to be incorrect. CSS=1 is required to preserve the MS Word formatting.

=X=

nrapallo
01-10-2009, 05:49 PM
Nick many thanks for the Perl Script. You did a fantastic job. With a little elbow grease I was able to port your PERL script to VBA and integrate it with the BookCreator tool. The IMP files created from BookCreator are excellent! Many thanks!

Also, one recommendation: change the setting to CSS=1. Even though the documentation says the CSS feature is obsolete and not used, I found this to be incorrect. CSS=1 is required to preserve the MS Word formatting.

=X=

Wow, great discovery!!!

I just tried it and the .doc to .imp sample conversions look absolutely marvellous!

I wasn't too happy with my sample conversions when I had $project->{CSS} = 0;

and I was thinking about how best to add some internal manipulations to get it to look better, but now I don't have to because of your great find! I will revise the script (see below .pl) to now use: $project->{CSS} = 1;


Check out the samples I reconverted (using this CSS=1) into the 1150 .imp, 1200 .imp and .oeb formats.

EDIT: I was able to unpack the .oeb version and display, with the Preview button, different .imp and _1200.imp showing two columns with nice margins! They are not in the .zip files above, but listed below. Looks nice indeed! (Thanks =X=)

The attached source html from the .oeb version did not create the same nice margins. I am trying to figure out why the Preview shows the nice margins but the Build Edition... doesn't. I think it has to do with the "oeb-column" settings.

nrapallo
01-10-2009, 10:34 PM
I'm testing the style="oeb-column-number:auto" and it appears that a too large margin-left and margin-right was the culprit in not allowing those nice margins displayed in the previous post's .imp test ebooks.

I now attach a .html/.opf to create a better looking TWO-COLUMN ebook!

All you need is the <div style="oeb-column-number:auto"> html style. I don't know exactly what it does, but it appears to allow text to be split over two columns.

This may have some practical uses, once it's better understood! See this thread (http://www.mobileread.com/forums/showthread.php?t=36136) entitled "Easily create two column (newspaper-style) ebooks" for an example of how to use this style in ebooks!
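
As a rough illustration, this Python sketch wraps the contents of <body> in that div. It is not taken from the attached .html; the function name wrap_in_columns is made up, and the sketch assumes the file has exactly one <body>...</body> pair. Margins are left alone, since overly large margin-left and margin-right values were what spoiled the layout earlier.

```python
import re

OPENING_DIV = '<div style="oeb-column-number:auto">'

def wrap_in_columns(html_text):
    """Wrap everything inside <body> in a div carrying the oeb-column-number style."""
    pattern = re.compile(r"(<body[^>]*>)(.*)(</body>)", re.IGNORECASE | re.DOTALL)
    # A single substitution keeps the added <div> and </div> balanced.
    return pattern.sub(
        lambda m: m.group(1) + OPENING_DIV + m.group(2) + "</div>" + m.group(3),
        html_text,
        count=1,
    )

print(wrap_in_columns("<html><body><p>Hi</p></body></html>"))
```

Input without a <body> pair is returned unchanged.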

Oh, by the way, the attached .zip contains the 'images' that were converted by word2imp.pl to .wmf format and referenced in the .html. The only problem is that .wmf is NOT supported by eBook Publisher, so I took the generated fallback .jpg files and copied them over the .wmf pictures (I had to change the .jpg extension to .wmf!)
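
The copy-over step can be sketched in Python as below. This is an illustration only: it assumes each fallback .jpg sits in the same folder and shares its base name with the .wmf it replaces, which is a guess about how the generated files are named, and overwrite_wmf_with_jpg is a hypothetical helper.

```python
import shutil
from pathlib import Path

def overwrite_wmf_with_jpg(image_dir):
    """Copy each fallback .jpg over the same-named .wmf; return the names replaced."""
    replaced = []
    for jpg in sorted(Path(image_dir).glob("*.jpg")):
        wmf = jpg.with_suffix(".wmf")
        # Only touch .wmf files that already exist and are referenced as such.
        if wmf.exists():
            shutil.copyfile(jpg, wmf)
            replaced.append(wmf.name)
    return replaced
```

The .html keeps referring to the .wmf names, so no links need editing; the files simply contain JPEG data afterwards.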

nrapallo
02-12-2009, 05:24 PM
Even though you can convert MS PowerPoint .ppt directly into .imp (even in batches) using the BulkConvert program that also "ships" with the eBook Publisher software, this perl script (PPT2imp.pl) now allows pre- & post-processing changes to be made using perl scripts!

PPT2IMP
A simple batch file called PPT2IMP.bat demonstrates how .IMP ebooks can be created using the workhorse routine 'PPT2imp.pl'. The 'PPT2imp.pl' perl script takes as input a single filename of the PowerPoint .ppt to convert to .imp based on one (rotated) or two 'slides' to a page (not rotated). If the filename contains spaces, then quotes need to surround the parameter!

After executing the sample batch file, the .IMP ebook is produced using temporary files, but the source .html file created gets deleted at the end and is no longer available. :( A work-around is to create a .oeb ebook and then 'unpack' it to see that intermediary .html file! However, this doesn't work with the latest eBook Publisher v2.3.8 (which understands .epubs), as this part seems broken; no image files are stored in the resulting .epub. :(

nrapallo
04-22-2009, 01:05 AM
Nick many thanks for the Perl Script. You did a fantastic job. With a little elbow grease I was able to port your PERL script to VBA and integrate it with the BookCreator tool. The IMP files created from BookCreator are excellent! Many thanks!

Also, one recommendation: change the setting to CSS=1. Even though the documentation says the CSS feature is obsolete and not used, I found this to be incorrect. CSS=1 is required to preserve the MS Word formatting.

=X=

Actually, I've noticed that using CSS = 2 improves the overall look by respecting margins better. At least now, we have a choice!