11-15-2008, 11:03 PM   #1
theguru
Member
Posts: 19
Karma: 139
Join Date: Nov 2008
Device: Sony PRS-505
soPdf - Better than Yet another PDF to LRF converter

I really liked the pdflrf tool from the "Yet another PDF to LRF converter" thread, but the moderator took it down for a GPL violation, and it has stayed down for quite some time because the author does not seem interested in providing the source. pdflrf also has some issues of its own:
  1. pdflrf renders each PDF page into an image and then creates the LRF file.
    This makes a 4 MB PDF grow into a file of more than 40 MB.
  2. No text information is preserved because of the image conversion
  3. Very slow
  4. No source for the tool <-- biggest disadvantage
So I decided to write a tool for myself. soPdf is a PDF formatter for the Sony Reader. It is based on SumatraPDF's version of mupdf and fitz.

The advantages of soPdf over pdflrf
  1. Pdf to Pdf conversion
  2. Text and other contents of pdf are preserved
  3. Size of the output file is very close to size of input file
    and in some cases smaller than input file.
  4. Super fast conversion compared to pdflrf.
  5. Source available to make further changes !!!!!! <-- biggest advantage
The disadvantages compared to pdflrf
  1. Cannot yet convert comic books. It can still split image-based PDF pages into two.
  2. soPdf is in the alpha stage (ver 0.1). There may be lots of bugs yet to be found, including at least all of the mupdf bugs.
  3. ???
soPdf command line options
Code:
about: soPdf
   author: Navin Pai, soPdf ver 0.1 alpha
usage:
   soPdf -i file_name [options]
   -i file_name   input file name
   -p password    password for input file
   -o file_name   output file name
   -w             turn off white space cropping
                     default is on
   -m nn          mode of operation
                     0 = fit 2xWidth *
                     1 = fit 2xHeight
                     2 = fit Width
                     3 = fit Height
                     4 = smart fit Width (not yet implemented)
                     5 = smart fit Height (not yet implemented)
   -v nn          overlap percentage
                     nn = 2 percent overlap *
   -t title       set the file title
   -a author      set the file author
   -b publisher   set the publisher
   -c category    set the category
   -s subject     set the subject
   -e             proceed with errors
   -r             reverse landscape

   * = default values
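For anyone who wants to batch-convert files, here is a minimal Python sketch that builds a soPdf command line from the options listed above. Only the flags come from the usage text; the helper names (`build_sopdf_command`, `run_sopdf`), the mode labels in `MODES`, and the assumption that the executable is reachable as `soPdf` are mine for illustration and are not part of the tool.

```python
import subprocess

# Labels are made up for this sketch; the numbers are the -m values from the usage text.
MODES = {"fit2xwidth": 0, "fit2xheight": 1, "fitwidth": 2, "fitheight": 3}


def build_sopdf_command(input_pdf, output_pdf=None, mode="fit2xwidth",
                        overlap=None, title=None, author=None, crop=True,
                        proceed_with_errors=False, reverse_landscape=False,
                        exe="soPdf"):
    """Return the argument list for one soPdf run (hypothetical helper)."""
    cmd = [exe, "-i", input_pdf]
    if output_pdf:
        cmd += ["-o", output_pdf]
    cmd += ["-m", str(MODES[mode])]
    if overlap is not None:
        cmd += ["-v", str(overlap)]      # overlap percentage
    if title:
        cmd += ["-t", title]
    if author:
        cmd += ["-a", author]
    if not crop:
        cmd.append("-w")                 # -w turns white space cropping OFF
    if proceed_with_errors:
        cmd.append("-e")
    if reverse_landscape:
        cmd.append("-r")
    return cmd


def run_sopdf(*args, **kwargs):
    """Run soPdf; assumes the binary is on PATH (or pass exe=...)."""
    return subprocess.run(build_sopdf_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this works without soPdf installed.
    print(" ".join(build_sopdf_command("ebooktestin.pdf", "ebooktestout.pdf",
                                       mode="fitwidth", overlap=5)))
```

Passing the arguments as a list avoids quoting problems with titles or file names that contain spaces.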
The conversion algorithm is as follows:
  1. If the user specified Fit2xWidth or Fit2xHeight, simply make two copies of each PDF page from the source in the destination PDF file.
  2. Render the page and get the actual bounding box that encompasses all of the content on the page. This step removes the white space border of the page.
  3. If the page cannot be rendered by mupdf and the proceed-with-errors option (-e) is specified, split the page without rendering by setting the MediaBox of the page.
  4. Try to split the page first by iterating over all the elements that can fit in half a page; if that does not work, split the page halfway with a 2% overlap (this can be changed with -v).
  5. If FitWidth or Fit2xWidth is specified, rotate the page by -90 degrees.
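To make steps 2 and 4 more concrete, here is a rough Python sketch of the geometry only. It is not the soPdf source (soPdf is built on mupdf). It assumes the page has already been rendered to a grayscale raster with y growing downward, and it assumes the overlap percentage is measured against the cropped content height and shared equally by both halves. The element-by-element split from step 4 is not shown, only the halfway fallback. The function names are made up for illustration.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def content_bbox(pixels: List[List[int]], white: int = 255) -> Optional[Rect]:
    """Bounding box of all non-white pixels, or None for a blank page."""
    rows = [y for y, row in enumerate(pixels) if any(p < white for p in row)]
    if not rows:
        return None
    cols = [x for row in pixels for x, p in enumerate(row) if p < white]
    # +1 so the box is half-open and includes the last dark pixel
    return (min(cols), rows[0], max(cols) + 1, rows[-1] + 1)


def split_with_overlap(box: Rect, overlap_pct: float = 2.0) -> Tuple[Rect, Rect]:
    """Split a box into top and bottom halves that share an overlap band.

    Assumption: the shared band is overlap_pct of the box height,
    centred on the midline.
    """
    x0, y0, x1, y1 = box
    mid = (y0 + y1) / 2
    half_band = (y1 - y0) * overlap_pct / 100 / 2
    top = (x0, y0, x1, mid + half_band)
    bottom = (x0, mid - half_band, x1, y1)
    return top, bottom


if __name__ == "__main__":
    # A 10x6 "page" that is white except for two dark pixels.
    page = [[255] * 6 for _ in range(10)]
    page[2][1] = 0
    page[7][4] = 0
    box = content_bbox(page)
    print("content box:", box)
    print("halves:", split_with_overlap(box))
```

The overlap band means the last line or two of the top half is repeated at the start of the bottom half, so a line of text cut by the split is still readable in one of the two halves.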
Source code for soPdf is available from Google Code.
http://sopdf.googlecode.com

To compile the source code you will need Visual Studio 8.0 (even the free edition will work). Visual Studio is not required if you just want to run the soPdf tool. If you have trouble running the binary, make sure the VC runtime library is installed; you can download it from the Microsoft website.

Coming soon
  • Output to image PDF, for complex PDFs that render slowly on reader devices.
Update 0.1 Rev 12
  • Added reverse landscape mode. Ever wished that you could hold your reader the other way around in landscape mode and scroll through the pages using your right thumb? Use reverse landscape mode and start reading from the last page onwards.
Update 0.1 Rev 10
  • Proceed with errors option. With this option, soPdf can now process any PDF file, even the ones mupdf cannot handle. If mupdf cannot load the contents of a page, soPdf simply splits the page into two without any processing. The disadvantage is that the white space border is not removed in this case, but you still get a PDF output file.
  • Option to set the subject of the PDF file
  • Fixed a stack overflow when processing complex PDF files
  • Better clipping algorithm
Update 0.1 Rev 7
  • Worked around a mupdf bug where it is not able to allocate oid and gid numbers. This prevented some files from being split properly.
Attached Files
File Type: pdf ebooktestin.pdf (867.6 KB, 7806 views)
File Type: pdf ebooktestout.pdf (904.2 KB, 7911 views)
File Type: pdf ebooktestreverseout.pdf (904.2 KB, 5066 views)
File Type: zip soPdf.zip (895.1 KB, 19246 views)

Last edited by theguru; 11-23-2008 at 10:07 PM. Reason: Bug fixes