[Tool] Multi-column PDF files on 6 inch display.


Taesoo Kwon
11-08-2008, 05:58 AM
I developed a program to convert PDF documents such as articles and technical papers into a GIF sequence so that they are readable on the small screens of e-book devices. The program automatically detects contiguous, non-empty regions in a page and, based on that information, splits the page into multiple low-res pages. Unnecessary margins are also removed automatically.
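The region detection described above can be illustrated with a minimal Python sketch. It is not PaperCrop's actual code: it assumes the page is already a grid of 0 (white) and 1 (ink) values, and it swaps in a simple projection-profile approach (find inked row bands, then inked column runs inside each band). The names `find_runs` and `detect_regions` are hypothetical.

```python
def find_runs(has_ink, min_gap=1):
    """Return (start, end) pairs (end exclusive) of contiguous inked runs.

    Blank stretches shorter than min_gap are bridged, so small breaks
    (e.g. the space between two text lines) do not split a run.
    """
    runs = []
    start = None
    gap = 0
    for i, ink in enumerate(has_ink):
        if ink:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        runs.append((start, len(has_ink) - gap))
    return runs


def detect_regions(page, min_gap=1):
    """Return (left, top, right, bottom) boxes of non-empty regions.

    page is a list of rows, each a list of 0 (white) or 1 (ink).
    Margins disappear automatically because boxes hug the ink.
    """
    row_ink = [any(row) for row in page]
    regions = []
    for top, bottom in find_runs(row_ink, min_gap):
        band = page[top:bottom]
        col_ink = [any(row[x] for row in band) for x in range(len(page[0]))]
        for left, right in find_runs(col_ink, min_gap):
            regions.append((left, top, right, bottom))
    return regions


# A tiny page: two columns on top, one full-width block (e.g. a table) below.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
regions = detect_regions(page)
```

On a real scan, `min_gap` would have to be larger than the line spacing but smaller than the gutter between columns; choosing it is the hard part.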

Download: PaperCrop (http://code.google.com/p/papercrop/)

Screenshots:
Input pdf:
http://jupiter.kaist.ac.kr/~taesoo/projects/paperCrop/out1/0.jpg

Output pdf:
http://jupiter.kaist.ac.kr/~taesoo/projects/paperCrop/out2/0_0.jpg

Currently, only Windows is supported. (It works on other platforms through Wine, though.)
There may be some bugs.
Thanks,

- Version 0.24 uploaded. Source code is available too.
- Version 0.3 uploaded. (All 0.24 users should upgrade to this version. Sorry for the crash problem. Version 0.3 outputs to a PDF file. Could anybody please test the output pdf file on a Sony Reader?)
- Version 0.4 uploaded.

=X=
11-09-2008, 09:10 PM
Nice tool. It would be great if this code could be added to PDFRead, which needs a feature like this.

=X=

Taesoo Kwon
11-10-2008, 05:52 AM
Yes, I guess so. PDFRead supports a command-line mode, and PaperCrop uses Lua scripts for generating output. So anyone who is familiar with Lua can modify the .lua files in the scripts folder so that the output images from PaperCrop are automatically converted to e-book files using PDFRead. But at the moment, I don't want to do the work myself due to my laziness.
I can provide the code of PaperCrop to anyone who is interested.

Pulp
11-10-2008, 10:20 AM
Thank you, this is a great tool!

nrapallo
11-10-2008, 11:52 AM
Yes, I guess so. PDFRead supports a command-line mode, and PaperCrop uses Lua scripts for generating output. So anyone who is familiar with Lua can modify the .lua files in the scripts folder so that the output images from PaperCrop are automatically converted to e-book files using PDFRead. But at the moment, I don't want to do the work myself due to my laziness.
I can provide the code of PaperCrop to anyone who is interested.

PDFRead v1.8.2 (http://www.mobileread.com/forums/showthread.php?t=21906) already supports converting two-column layouts using the layout modes:
'landscape-2col' (with four quadrants/pages);
'portrait-2col' (with four quadrants/pages).

However, PDFRead is not too "smart" in how it determines where the columns start/end; it just picks the midpoint of the page and splits it there! Of course, this will be wrong if the column widths (and side margins) are not equal.
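To make the difference concrete, here is a small Python sketch contrasting a fixed midpoint split with a split placed in the widest blank gutter. It is only an illustration under assumptions: neither function is taken from PDFRead or PaperCrop, and `col_ink` (one boolean per pixel column, true when that column contains any ink) is a hypothetical input.

```python
def midpoint_split(width):
    """Split position used by a naive approach: the middle of the page."""
    return width // 2


def gutter_split(col_ink):
    """Return the centre of the widest blank run between the first and
    last inked columns, or the page midpoint if no such run exists."""
    inked = [x for x, ink in enumerate(col_ink) if ink]
    if not inked:
        return len(col_ink) // 2
    first, last = inked[0], inked[-1]
    best_start, best_len = None, 0
    x = first
    while x <= last:
        if not col_ink[x]:
            start = x
            # col_ink[last] is inked, so this inner loop always stops.
            while not col_ink[x]:
                x += 1
            if x - start > best_len:
                best_start, best_len = start, x - start
        else:
            x += 1
    if best_start is None:
        return len(col_ink) // 2
    return best_start + best_len // 2


# 20-pixel-wide page: narrow left column, wide right column.
# margin(1) + column(5) + gutter(2) + column(11) + margin(1)
col_ink = [False] + [True] * 5 + [False] * 2 + [True] * 11 + [False]

naive = midpoint_split(len(col_ink))   # lands inside the right column
smart = gutter_split(col_ink)          # lands inside the gutter
```

With unequal columns the midpoint falls on inked pixels, which is exactly the failure described above, while the gutter-based position falls on blank ones.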

I had looked into programming using LUA when I was porting/tweaking some PSP homebrew programs, so it would be easy to re-use your programming logic.

However, PDFRead is due for a major overhaul, so I will hold off doing this just now. I'll wait to see what ashkulz (original author of PDFRead) does with any update to PDFRead and then go from there.

Nice effort though!

soilwork
11-12-2008, 04:02 AM
I tried the program and it looks and works great. In particular, the program detects content well even when a wide table spans the whole page width while the rest of the content is arranged in two columns.

However, I have a few suggestions.

1) Pre-trim option
In most articles, headers/footers are not necessary, especially on a small-screen reading device. It would be great if you could implement a pre-trim option (like the one in PDFLRF) that is applied before detecting the content.

2) Preventing from cutting the text in the middle
I noticed that, in some cases, a line of text is cut in the middle. Since the program already does a great job of detecting content, can you apply a similar logic/process to prevent this from happening when cutting the detected content into smaller gif/jpg/pngs?

3) Easier way to enter precise segmentation parameter.
To make fine changes in the segmentation, I noticed that I have to use 'Tab' to highlight the slider and then use the left/right cursor keys to change the number in the smallest increment. It would be easier if
A. double-clicking on the bar highlighted it, and/or
B. double-clicking on the displayed number let users enter the number directly.

BTW, thanks for providing an excellent program.

=X=
11-12-2008, 12:47 PM
I've just used the software on a very complicated layout and it worked quite nicely. I'm quite impressed.

Suggestions
* Add a feature in the UI to choose JPG/GIF/PNG output. Having to change the code is a bit cumbersome.
* Add a crop box that applies to all windows.
* It would be nice if customized crop settings were saved per page, so that all the adjustments can be made before the bulk conversion is executed.
* It would be nice if the final product was an eBook. If not, maybe write a short tutorial here on how a user can create an eBook using calibre or comic2LRF. An LRF can be created by zipping up the files with a CBZ extension and running these tools on the zip file.


Bugs/Issues
* Adjust crop settings with font size. On some pages the column spacing was quite tight, so I had to decrease the column width. However, the titles of the stories had larger fonts where the spacing of the words equaled
* In some cases the text was cut in half.
* Crop overlap shows up on some pages, where part of the second column appears on the first-column screen. (The second column shows up fine on the following screens, but this is a bit distracting.)

=X=

Taesoo Kwon
11-12-2008, 05:40 PM
Thank you for the suggestions, =X= and Soilwork.
I will implement several of the suggestions and bug fixes in the next version,
e.g. a crop box that applies to all windows, a pre-trim option, and font-size dependent processing. (The last one is very difficult for me to implement. Currently all the processing is done at the pixel level, without using any PDF information such as font sizes, PDF crop boxes, and so on.)

I will also release the source code under the GPL license. (Actually, this is mandatory, since I used some GPL libraries.)

At the moment, supporting ebook formats is not what I want to spend much time on. (simply because I started this project for my own needs, and I don't need such a functionality.) Sorry.

ProDigit
11-16-2008, 09:10 AM
I'm sorry for the double post, but you convert pdf to jpg 800x600 pix.
The screen itself has a small bar on the bottom.
Isn't it better to convert to something like 790x600 pix?
just a question.

=X=
11-16-2008, 03:48 PM
Moderators, can you please make this thread a sticky?

zelda_pinwheel
11-16-2008, 04:01 PM
Moderators, can you please make this thread a sticky?

stuck. :)

Taesoo Kwon
11-16-2008, 05:27 PM
Isn't it better to convert to something like 790x600 pix?
just a question.


I don't know exactly what the best resolution for the Sony Reader is.
Cybook and nuutbook (a Netronix variant which I use) display images at full screen, so 600*800 was the best for those devices.
The resolution can be changed by editing config.lua.

soilwork
11-17-2008, 12:53 AM
Hello, I found some bugs in version 0.24

- Every time I try 'process all pages', at the end, I got a dialog box saying
"Do you want to overwrite file \...\00000_000.jpg?"
If I answer yes, the last image overwrites the first file (00000_000.jpg).

- In some cases, the original PDF uses one font throughout the page, but the conversion results look different.
Please refer to the pictures attached. In the original PDF, all of them have the same font.
This bug does not affect usability but it is a bit strange. :)

- The program crashes with some files. If you would like, I will email PDF files causing the problem.

Thanks again for the great program.

ProDigit
11-17-2008, 11:20 AM
I don't know exactly what the best resolution for the Sony Reader is.
Cybook and nuutbook (a Netronix variant which I use) display images at full screen, so 600*800 was the best for those devices.
The resolution can be changed by editing config.lua.

Yes, I understand that, since the readers use an 800x600 screen.
However, there is a small bar at the bottom of the reader's screen, about 5mm in size (0,2"). I imagine every reader has this bar, with battery life info, page info, or whatever...
Is this bar hovering over the picture, or taking some space away from the picture?
I ask because for a book (manga, for instance) with 200 pages of compressed JPEGs, choosing, say, 790x600 instead of 800x600 could save anywhere from some hundreds of kB to a few MB in size.
You wouldn't notice the difference anyway, and on top of that the reader could put the image straight on the screen without internal resizing and rendering.

Yesterday I thought my reader froze when I plugged in an SD card with a photo my wife took with her Nikon D40 camera: 6MPix, 4,6MB in size. It took about 5 to 7 minutes to render the picture the first time.
After editing (resizing to 800x600 and saving as B&W), the image took up only 96kB (i.e. below 100kB).
You couldn't see the difference between the two in full-screen view, I saved more than 4MB in size, and the reader rendered my 96kB image within 2 seconds.
For books that need no zooming in or tiling to landscape, it is much better to save them this way.
To take this further, I wanted to know if someone knows something about the bar displayed:
whether it hovers over the screen or takes space off it.
I believe that if the reader needs neither to render nor to resize the image, some battery and loading time could be saved.

nrapallo
11-17-2008, 11:46 AM
I thought the max dimensions on the Sony ebook readers were 584 width x 754 height.

I got this from page 4 of the Sony recommended pdf size (http://www.sonystyle.com/wcsstore/SonyStyleStorefrontAssetStore/pdf/reader_createPDF.pdf).

ProDigit
11-17-2008, 12:32 PM
I thought the max dimensions on the Sony ebook readers were 584 width x 754 height.

I got this from page 4 of the Sony recommended pdf size (http://www.sonystyle.com/wcsstore/SonyStyleStorefrontAssetStore/pdf/reader_createPDF.pdf).

Those could be the recommended screensize for readable text, with a small border of 2x8 pix horizontal and a 2x23pix vertical border.

Though I prefer text without border, since it takes up screen space, and you already have a border around the screen. (I mean it's not like we can make annotations on these devices or so).

When I load a PDF with these dimensions and zoom to text width, I notice that the left, top, and right borders are the same size.
So let's assume you have a border of 8 pix to the left, top, and right of the text.
This leaves a border of 38 pix on the bottom.
Take 13 pix off the bottom border, since it seems a little larger than the others, and you'll have a bar on the bottom of about 25 pix.

In other words, it could be that images with a size of 775x600 are displayed exactly the same as 800x600.
To figure that out, I can check whether I see any difference in quality between 800x600 and 775x600.
Obviously 775x600 will look more stretched.
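The border arithmetic in this post can be checked step by step. The Python sketch below only reproduces the numbers given above; the 13-pixel bottom margin is the post's own estimate, and the assumption that the top border equals the side border is also taken from the post, not from any Sony documentation.

```python
# Physical screen and Sony's recommended page size, in pixels.
SCREEN_W, SCREEN_H = 600, 800
RECOMMENDED_W, RECOMMENDED_H = 584, 754

# Horizontal slack is shared by the left and right borders.
side_border = (SCREEN_W - RECOMMENDED_W) // 2

# Total vertical slack, before deciding how it is distributed.
vertical_slack = SCREEN_H - RECOMMENDED_H

# Assumption from the post: the top border equals the side borders.
top_border = side_border
bottom_space = vertical_slack - top_border

# Estimate from the post: 13 px of the bottom space is ordinary margin,
# the rest would be the status bar.
bottom_margin_estimate = 13
status_bar = bottom_space - bottom_margin_estimate

# Height left for the image if the bar takes space away from it.
usable_height = SCREEN_H - status_bar
```

The result, 775 pixels of usable height, matches the 775x600 figure quoted in the post, but it rests entirely on the two assumptions named in the comments.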

However, I would prefer it if someone who has played around with the internal software knew what the exact resolutions are.

soilwork
11-17-2008, 12:37 PM
As TaeSoo and nrapallo mentioned, you can change the output resolution by modifying two lines of 'config.lua'. You can change the first two lines to these.
device_width=584
device_height=754

Better yet, you can convert the set of image files into an LRF using Calibre's comic2lrf. First, make one zip file from the pictures, then use a simple command such as
comic2lrf -c 8 -r 'file.name.here.zip'
I think it will be easier to manage one LRF file instead of hundreds of images.
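The zip step can also be scripted. Below is a minimal Python sketch that bundles a folder of page images into one archive using only the standard library. The function name `bundle_images` and the paths in the usage comment are hypothetical, and the sketch relies on the zero-padded file names mentioned in this thread (such as 00000_000.jpg) sorting into page order alphabetically. It does not run comic2lrf itself.

```python
import zipfile
from pathlib import Path


def bundle_images(image_dir, archive_path, extensions=(".jpg", ".png", ".gif")):
    """Write every image in image_dir into one zip archive, in name order.

    Returns the list of file names that were added. Images are stored
    without recompression because JPG/PNG/GIF are already compressed.
    """
    files = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_STORED) as zf:
        for p in files:
            # arcname drops the directory so pages sit at the archive root.
            zf.write(p, arcname=p.name)
    return [p.name for p in files]


# Example (hypothetical paths):
# bundle_images("paper_pages", "paper.zip")
```

Naming the archive with a .cbz extension instead of .zip is only a rename; the contents are the same.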

nrapallo
11-17-2008, 12:47 PM
In order to figure that out I can try to see if I see any difference between the quality of 800x600 and 775x600.
Obviously 775x600 will be more stretched looking.

It shouldn't look stretched if the original image's aspect ratio is maintained; there will just be a bit more white margin space instead.

Now if PaperCrop does in fact stretch it to fit, then you would have to override this behaviour to get resulting images that are not stretched or squished.
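Keeping the aspect ratio comes down to scaling both sides by the same factor and letting the tighter side decide. Here is a minimal Python sketch of that idea. It is not PaperCrop's or PDFRead's code, `fit_within` is a hypothetical name, and it deliberately uses integer floor division so that neither side can end up one pixel over the limit through rounding.

```python
def fit_within(src_w, src_h, max_w, max_h):
    """Scale (src_w, src_h) to fit inside (max_w, max_h), keeping the
    aspect ratio. The side that hits its limit first is set exactly to
    that limit; the other side is rounded down."""
    if src_w * max_h >= src_h * max_w:
        # Source is relatively wider than the target: width is the limit.
        out_w = max_w
        out_h = max(1, src_h * max_w // src_w)
    else:
        # Source is relatively taller than the target: height is the limit.
        out_h = max_h
        out_w = max(1, src_w * max_h // src_h)
    return out_w, out_h


# A 600x800 page shown in a 600x775 area: the height is the limit, so
# the width shrinks a little and white margin appears left and right.
size = fit_within(600, 800, 600, 775)
```

For the 600x800 case this gives 581x775, so the picture is slightly smaller rather than stretched, with about 19 pixels of spare width.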

In the end, there is no right/wrong way to do this; it's a matter of personal taste/preference, i.e. what you like to use!

However, I would prefer it if someone who has played around with the internal software knew what the exact resolutions are.

Trial and error and a bit of sweat equity is always needed when charting unknown territory... :)

ProDigit
11-17-2008, 12:48 PM
As TaeSoo and nrapallo mentioned, you can change the output resolution by modifying two lines of 'config.lua'. You can change the first two lines to these.
device_width=584
device_height=754

Better yet, you can convert set of image files into LRF using comic2lrf of Calibre. First, make one zip file from the pictures then use a simple script such as
comic2lrf -c 8 -r 'file.name.here.zip'
I think it will be easier to manage one LRF file instead of hundreds of images.

I think the problem is that we don't know what the best size is.
584x754 will be the best format for text margins in PDF.
jpg's need to use the full screen.
Making them smaller than the viewable area will result in loss of quality.
Making them larger results in loss of battery life, and display time.
Some pages I could format perfectly in 320x240 and they still will be viewable, but the quality of the page will suffer...

nrapallo
11-17-2008, 12:53 PM
Oh, another bug: after editing config.lua, the resulting .png images are 473 x 595 even though I set 472 x 595 as the max. dimensions my reader expects. This is probably a rounding problem...

JSWolf
11-17-2008, 01:16 PM
I thought the max dimensions on the Sony ebook readers were 584 width x 754 height.

I got this from page 4 of the Sony recommended pdf size (http://www.sonystyle.com/wcsstore/SonyStyleStorefrontAssetStore/pdf/reader_createPDF.pdf).
Actually, I think the dimensions are 600x775.

soilwork
11-17-2008, 01:34 PM
I think the problem is that we don't know what the best size is.
584x754 will be the best format for text margins in PDF.
jpg's need to use the full screen.
Making them smaller than the viewable area will result in loss of quality.
Making them larger results in loss of battery life, and display time.
Some pages I could format perfectly in 320x240 and they still will be viewable, but the quality of the page will suffer...


That is why I suggested the second option of converting into LRF. First, you can make the images with arbitrarily large resolution and then use PDFLRF or Comic2lrf. Then we don't need to worry about the correct resolution at all.

Of course, if you would like to read a PDF as thousands of split jpg/png/gif files, then you have to figure out the correct resolution. However, since LRF is much faster and easier to manage, I cannot think of any good reason to use image files instead of LRF.

soilwork
11-17-2008, 02:01 PM
Those could be the recommended screensize for readable text, with a small border of 2x8 pix horizontal and a 2x23pix vertical border.

PDFLRF makes LRF based on images and it also uses 584 width x 754 height as the final output. So I don't think the resolution is only for text file. Of course, I am assuming that image based LRFs are displayed in the same resolution as image files.

ProDigit
11-17-2008, 03:05 PM
So far I did the test of the picture Sony provided on the PRS-505 of the baby.
Resizing it would give you either 800x625 or 768x600pix.
On the baby picture I could not see any difference between the two.

I will test it out with a high-quality picture (with lots of detail, and probably a resolution higher than 1MPix, e.g. of grass, trees, and leaves) to compare each picture's size to the quality displayed.

ProDigit
11-17-2008, 04:52 PM
here are a few pictures resized all with the same settings.
I hope they work, as I'll be testing them as soon as I post this post.

Try out for yourself to see where on the reader quality decreases.
The picture is a regular picture found on google images.

Nature2 is the original.
The other formats are 8bit greyscale (256 color) versions that have lower than the original resolution.

ProDigit
11-17-2008, 05:17 PM
OK, I'm back with some results.
Viewing the pictures full screen without zoom, it is hard to distinguish between files with a resolution of 754x566 pixels or greater.
When you zoom in to the maximum zoom factor, you can see that certain artifacts (pixels) are not visible at the lower qualities, but they are at 800x600 resolution.

On review, I'll have to re-encode the images, seeing that the 779x548 pix. image is way blurrier than the 800x600 image.
This may have to do with the encoding being set to retain 84% of the original quality.
Tomorrow I'll re-post everything at 100% quality.

nrapallo
11-17-2008, 05:21 PM
Nature2 is the original.

Just for comparison, this is what that image looks like using 584 x 754 as the max dimensions with PDFRead (default settings) and outputting to .lrf format in portrait and landscape modes.

BTW, this .lrf was prepared using the Sony bug-fixed pdfread.exe command line program (v1.8.2.1).

Ulysses
11-18-2008, 03:40 PM
Very good idea but it crashes all the time, maybe an option to start from a certain page would be good "work around" so we can continue where we stopped..

ProDigit
11-19-2008, 08:43 AM
Just for comparison, this is what that image looks like using 584 x 754 as the max dimensions with PDFRead (default settings) and outputting to .lrf format in portrait and landscape modes.

BTW, this .lrf was prepared using the Sony bug-fixed pdfread.exe command line program (v1.8.2.1).

I apologise for not replying earlier, but I've been a bit busy lately.
I'll give those documents a look, and will let you know.
Obviously reducing the resolution will reduce the quality.
But perhaps in the document you posted the quality loss is unnoticeable.

So far on full screen (not zoomed) I could not detect any difference between JPEGs with a resolution greater than 754x566 pixels.
But it probably also has to do with file size. A 754x566 pixel 128kB JPEG file might look exactly the same as an 800x600 pixel 128kB file...

I'm a newbie here, and know very little about all of this, but perhaps I can help to figure out the 'best' pixel width and length for comic books.
For pure B&W (no gray tints) comic books, GIF is the best format. But the Sony Reader does not support that format.

I'll do further testing later on.

Taesoo Kwon
11-19-2008, 10:57 PM
Very good idea but it crashes all the time, maybe an option to start from a certain page would be good "work around" so we can continue where we stopped..

I hope the problem will disappear after downloading version 0.3.
If it continues to crash, please send me the PDF file by e-mail.

inew
11-20-2008, 02:27 AM
Very impressive!

This kind of design style is just what I dreamed of: I press one button and you do all the work!!!!!

From an algorithmic point of view, I think the most interesting part is the algorithm that detects blocks of content in the PDF layout. If the program could incorporate an algorithm that breaks lines, it would be able to break 1-page-wide blocks into 1/2-page-wide blocks. The program would then be a star at handling academic papers (what a desperate target?). There may be many small things that need to be tweaked, but the major function (handling single-column and double-column layouts) is there.

And, please, please, please keep the current operation style. I understand that you experts like the command line a lot. But this type of clean and intuitive UI is very important to newbies (do I need some HCI references to support this argument?).

Lastly, a small thing that I think may make the output look better: when segmenting a small block to fit into the 6'' pages, you may want to detect whether a cut would cross some content or not. If it does, we may need to move the cut a little bit higher. This is only a small problem. The current tool is fantastic at this stage.
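The "move the cut a little bit higher" idea can be sketched in a few lines of Python. This is an illustration only, not something taken from PaperCrop: `row_ink` (one boolean per pixel row, true when the row contains ink), `snap_cut`, and `max_shift` are hypothetical names, and a real implementation would also need to tolerate scanner noise.

```python
def snap_cut(row_ink, desired_cut, max_shift):
    """Return a cut row at or above desired_cut that falls on a blank row.

    A cut at row y means the current page ends just before row y. The
    search moves upward at most max_shift rows; if every candidate row
    contains ink, the original cut is returned unchanged.
    """
    lowest = max(desired_cut - max_shift, 0)
    for y in range(desired_cut, lowest - 1, -1):
        if y >= len(row_ink) or not row_ink[y]:
            return y
    return desired_cut


# Three text lines of 4 rows each, separated by 2 blank rows.
rows = ([True] * 4 + [False] * 2) * 2 + [True] * 4

cut_in_text = snap_cut(rows, 8, 5)    # row 8 is inside the second line
cut_in_gap = snap_cut(rows, 10, 5)    # row 10 is already blank
cut_stuck = snap_cut(rows, 8, 1)      # not allowed to move far enough
```

Moving the cut upward makes the page slightly shorter than the screen, which is usually preferable to slicing a line of text in half.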

Ulysses
11-20-2008, 02:28 AM
I hope the problem will disappear after downloading version 0.3.
If it continues to crash, please send me the PDF file by e-mail.

I tried 0.3 and it's perfect, thank you.

nrapallo
11-20-2008, 07:35 AM
I tried 0.3 and it's perfect, thank you.

In a word: AWESOME!

In two words: VERY FAST!!

In three: Caritas' Reflow Works!!!

In four: Great Job Taesoo Kwon!!!!

'nuff said; try it, you'll like it!

nrapallo
11-20-2008, 08:47 AM
Taesoo Kwon:

As a challenge, this is the first .pdf (see attachment below) I tried to convert to ebook (.html and images) that I was not successful in getting anywhere near a satisfactory result.

I tried converting it with early version of PDFRead and it did work quite well for me. That convinced me that .pdf to images was the only solution for complex .pdfs. It made me then request that color support be added to PDFRead v1.7; and that's why I'm involved with PDFRead v1.8!

I just tried to get two column output using PaperCrop 0.3 and this .pdf, but wasn't able to.

What algorithm would you suggest in getting this .pdf converted using a combination of reflow and/or two-column support?

Like I said before, a challenge!

Taesoo Kwon
11-20-2008, 10:01 PM
I just tried to get two column output using PaperCrop 0.3 and this .pdf, but wasn't able to.
What algorithm would you suggest in getting this .pdf converted using a combination of reflow and/or two-column support?


That PDF file is really challenging. The current algorithm cannot correctly separate non-convex regions that overlap both horizontally and vertically. There may be good algorithms that can handle this case, but I don't know any of them at the moment. (I am not an expert in the field of document segmentation.) Also, such an algorithm, if one exists, would probably increase processing time.

Of course, it is easy to segment the pdf file in a PDFRead-way (simply dividing a page into two regions.) I would include this over-simplified but robust segmentation method into PaperCrop as an option someday.

nrapallo
11-20-2008, 10:15 PM
That pdf file is really challenging.

Yep, it is! Given that that .pdf is difficult to segment, I hope it doesn't discourage you. I was only trying to see if you had any insights on how best to tackle its conversion.

Your program couldn't figure it out because it's not two-column but behaves like it is, and reflow gets confused by large blocks of center inset images.

I guess we'll leave it until version 0.4 ... :rofl: ;)

Thanks for trying!

Ulysses
11-23-2008, 03:06 PM
Where are the pictures stored? I mean I don't see anything in a folder which is created together with PDF. Is there a way to get those pictures?

soilwork
11-24-2008, 12:37 PM
Where are the pictures stored? I mean I don't see anything in a folder which is created together with PDF. Is there a way to get those pictures?

You can edit 'config.lua' to get image files instead of one PDF.
The option 'output_to_pdf' should be changed to 'false' as follows.
output_to_pdf=false

hansl
12-01-2008, 04:22 AM
withdrawn

hansl
12-08-2008, 05:10 AM
withdrawn

Taesoo Kwon
12-13-2008, 12:52 PM
Sorry for a late response.
Hansl, could you please send me the PDF file? (taesoobear@gmail.com)
I may not be able to fix the bug soon but I will try when I have time.

hansl
12-18-2008, 05:48 AM
Sorry, I removed the generated pdf.
I'm too busy currently to deal with the issue, but thanks for your reply.

hansl

=X=
12-22-2008, 03:08 PM
I have a PDF with 300DPI images. I'd like to crop the PDF but keep the 300DPI. However, PaperCrop is converting the images to 72DPI. Is there a setting that keeps the DPI at the same resolution as the original PDF (e.g. 300DPI)?

Thank you,
=X=

Taesoo Kwon
12-26-2008, 09:24 AM
can I keep the DPI the same resolution as the original PDF (e.g. 300DPI) ?

No, you cannot. Sorry.

=X=
01-08-2009, 01:47 AM
I'm embarrassed to ask, but how do I get Caritas' Reflow to work? And how can I validate that it works?
=X=

nrapallo
01-09-2009, 01:02 AM
I'm embarrassed to ask, but how do I get Caritas' Reflow to work? And how can I validate that it works?
=X=

Did you get Caritas' reflow working yet? Was it easier than you thought? ;)

=X=
01-10-2009, 03:01 AM
Did you get Caritas' reflow working yet? Was it easier than you thought? ;)

Yes, thank you Nick, it was very easy... once you pointed it out. I even tested changing the resolution to 480x300 for my phone. It worked great.
=X=

wiffel
01-17-2009, 11:07 PM
First of all ... Thanks for this wonderful application. I've used it many times now. And it works great. I use it for most of the technical papers I need to read for work.

As Ulysses posted before, it does crash sometimes. For me, that happens most of the time with very big documents. (I do use PaperCrop regularly on documents with 500 or more pages) PaperCrop seems to keep everything in memory before creating the PDF output. It basically runs out of memory before it can finish. I didn't have the time to look at the source code yet, but it could be an idea to store the page images while processing and get them from disk while producing the PDF?

Since the conversion of such big PDF books results in PDF books with 1000 pages or more, this also becomes a problem for me and my poor BeBook with its limited memory.

To be able to split these big books into multiple smaller ones, I altered the 'config.lua' file. (which - if everything went fine - should be attached) to make PaperCrop output a new PDF after every 100 pages. You can change this number at the top of the file in the line that reads nr_of_pages_per_pdf_book = 100;

For me it fixes the big PDF books, and it didn't crash on me anymore since I'm using this 'config.lua'.
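The splitting idea itself is simple enough to show in a short Python sketch. This is not the attached config.lua, which is written in Lua and is not reproduced in this thread. `split_into_books` is a hypothetical name; only the default of 100 pages per book comes from the post above.

```python
def split_into_books(pages, pages_per_book=100):
    """Group a sequence of pages into consecutive books of at most
    pages_per_book pages each. The last book may be shorter."""
    if pages_per_book < 1:
        raise ValueError("pages_per_book must be at least 1")
    return [
        pages[i:i + pages_per_book]
        for i in range(0, len(pages), pages_per_book)
    ]


# 250 output pages become three books: 100, 100 and 50 pages.
books = split_into_books(list(range(250)), pages_per_book=100)
book_sizes = [len(b) for b in books]
```

Writing each book out as soon as it is full, instead of collecting all pages first, is what keeps memory use bounded for very large documents.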

Maybe it can help you too Ulysses ?

I hope the problem will disappear after downloading version 0.3.
If it continues to crash, please send me the PDF file by e-mail.

Originally Posted by Ulysses
Very good idea but it crashes all the time, maybe an option to start from a certain page would be good "work around" so we can continue where we stopped..

wiffel
01-18-2009, 04:47 PM
If anybody is interested ...

I did implement what I proposed in my previous post. (Save the images and load them later to create the PDF file(s) ). This makes the usage of memory for the conversion of large files a lot less.

The config.lua to do this is attached.

PS: Taesoo Kwon
While implementing this, I did add the LUA collectgarbage a couple of times to make sure that images are garbage collected. I noticed that I did have a lot of crashes while generating the PDF. This makes me think that the PDF structure (used by outpdf:addPage(image)) does not keep a reference to the images that have been added. That could account for the 'random' crashes. Just to be sure, I did add a little list that keeps a reference to the images until the PDF has been created. That way they can not be garbage collected by LUA. That seemed to fix the crashes for me.

=X=
01-21-2009, 11:44 AM
Great idea thank you. My BB Storm also has a heck of a time loading large PDF.

I tried to run your script and got this error.

"lua error config.lua 86: attempt to compare nil with number"

=X=

wiffel
01-22-2009, 07:40 AM
Hi =X=,

Did you cut-copy-paste part of the code into your config.lua? My latest version does not have any code on line 86 (except for the END statement). So, it's hard for me to see what is going wrong.

Wiffel

=X=
01-22-2009, 01:25 PM
Hi =X=,

Did you cut-copy-paste part of the code into your config.lua? My latest version does not have any code on line 86 (except for the END statement). So, it's hard for me to see what is going wrong.

Wiffel

Sorry, my mistake. I had added 5 lines with different screen sizes, since I have multiple reading devices. They were commented out, however.

I've re-run the tool with your unmodified config.lua.
The problem code is line 81. Color is red.
Msg: "lua error config.lua:81:attempting to compare nil with a number"

=X=

Here is the error


function outputImage(image, outdir, pageNo, rectNo)

    if output_to_pdf then ----if output_to_pdf and outpdf:isValid() then
        --vv--outpdf:addPage(image)
        if (book_pages.nr_of_pages < nr_of_pages_per_pdf_book) then
            book_pages:add_page(image, outdir);
        else
            book_pages:writeToFile(outdir);
            book_pages:init_for_next_part();
        end
        --^^--
    else
        image:Save(string.format("%s/%05d_%03d%s",outdir,pageNo,rectNo,output_format))
    end
end

wiffel
01-22-2009, 06:12 PM
=X=,

Are you sure that you still have a line like
nr_of_pages_per_pdf_book = 100;
somewhere ?

If that is the case, it can't hurt to put a line like
book_pages:init(1);
just before the line
function outputImage(image, outdir, pageNo, rectNo)

Anyway it's strange. The file works fine for me.

Good luck,

Wilfried

roger64
01-23-2009, 10:50 AM
Hi,

I converted a two-column PDF book of 144 pages (that is really a 288-page book) using PaperCrop. It went up from one meg to 18 megs.

On Linux, I used ImageMagick to batch-reduce the size of the images by 50 percent, to get a file of about 6 or 7 megs, but divided into 144 numbered fragments. I declared it to be a png file and it processed the whole lot: mogrify -resize 50% *.jpg

After that, I grouped all of this into a zip file and converted it to LRF with comic2lrf. This is a lot of work, on two platforms... but the result is readable and not too heavy.

Taesoo Kwon
01-23-2009, 12:00 PM
While implementing this, I did add the LUA collectgarbage a couple of times to make sure that images are garbage collected. I noticed that I did have a lot of crashes while generating the PDF. This makes me think that the PDF structure (used by outpdf:addPage(image)) does not keep a reference to the images that have been added. That could account for the 'random' crashes. Just to be sure, I did add a little list that keeps a reference to the images until the PDF has been created. That way they can not be garbage collected by LUA. That seemed to fix the crashes for me.

Actually, I haven't tested the PDF output functionality of PaperCrop intensively, because my e-book device doesn't support PDF files. I downloaded your config.lua and it seems to work very well for me.
Thank you.
By the way, if you want, I can make you (or anybody who wants) a member of the Google Code project page, so that you can update the source code/binary directly, or upload your versions of the PaperCrop binary there so that people can choose which version to use. (Whenever I update a version, it seems that a new bug is always introduced...)

P.S. As far as I understand the libharu library and my PDFWriter class, outpdf:addPage doesn't keep a pointer to the image, and you can discard the image as soon as you call the addPage function. But I cannot figure out why this kind of problem happened.

=X=
01-23-2009, 02:34 PM
nr_of_pages_per_pdf_book = 100;

Yes, I do have that line.


book_pages:init(1);

Adding this line did the trick

Thank you
=X=

mazzeltjes
01-23-2009, 05:16 PM
Hi

Nice little program
I'm having a problem with the bottom two lines repeating on
the following page
Any ideas what causes this?
And how do I change font size?

mazzeltjes
01-23-2009, 06:01 PM
OK, I changed the scroll overlap to 0:

scroll_overlap_pixels=0

and that got rid of most of the problem
the only thing is that the top quarter of
the letters on the following page are at the
bottom of the previous one.
this progresses through the document
so that after a few pages the letters are cut in half
is there a way to fix this
like give the scroll overlap a - value ?
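How an overlap setting interacts with fixed-height slicing can be shown with a short Python sketch. This is an assumption about what a setting like scroll_overlap_pixels means (each slice starts that many pixels before the previous slice ended), not a description of PaperCrop's actual code. `slice_offsets` is a hypothetical name.

```python
def slice_offsets(total_height, page_height, overlap):
    """Return the top row of each slice when a tall image is cut into
    pages of page_height rows, repeating `overlap` rows between pages.

    A negative overlap is rejected because it would skip rows, i.e.
    lose content, rather than fix a cut that lands inside a text line.
    """
    if not 0 <= overlap < page_height:
        raise ValueError("overlap must be in the range [0, page_height)")
    step = page_height - overlap
    offsets = []
    top = 0
    while True:
        offsets.append(top)
        if top + page_height >= total_height:
            break
        top += step
    return offsets


no_overlap = slice_offsets(2000, 800, 0)     # pages meet edge to edge
with_overlap = slice_offsets(2000, 800, 40)  # 40 rows repeat per page
```

Under this reading, an overlap of 0 removes the repeated lines but does nothing to stop a cut from landing inside a line of text; that needs the cut position itself to be adjusted.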

soilwork
01-27-2009, 03:10 PM
the only thing is that the top quarter of
the letters on the following page are at the
bottom of the previous one.
this progresses through the document
so that after a few pages the letters are cut in half
is there a way to fix this


It would be great if this problem can be fixed but, for now, I am using Papercrop and PDFLRF to avoid this problem. I do the following.

1. Convert each page into one long image
1a. First, you need to edit 'config.lua' to increase the 'device_height' option. You need to do this only once. For example, I use the following.

device_width=700
device_height=30000
scroll_overlap_pixels=0

1b. Load a PDF file
1c. Press "Process current page" in each page :eek:

2. Archive all images into one ZIP file

3. Process by PDFLRF using Portrait or Comic-Portrait with Smart-cut option on. I use a simple batch file such as

pdflrf --erode=2 --nocrop -rs -c 8 --rotation="0" --pad=10 --overlap=0 -i %1 -o "%~n1.lrf" -t %2 -a %3

Note that I am using a DOS version of PDFLRF instead of windows version to avoid changing the option each time, but you can also use PDFLRFwin with proper option.

If this batch file is named as 'plb.bat', then you can use the following command to convert the zip file from step 2.

plb 'filename.zip' 'Title of the book' 'Author of the book'


As you can see, this may work with rather short PDFs (e.g. journal articles) but is probably too labor-intensive for longer books (step 1c). :(
It is too bad that PDFLRF is not open-sourced. If it were, smart crop algorithm could have been incorporated in programs like PaperCrop rather easily.
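
As a rough idea of what a cut that avoids slicing through letters could look like, here is a minimal sketch. It is not PDFLRF's Smart-cut algorithm (which is closed source) and not PaperCrop code; it simply assumes an 8-bit grayscale image stored row by row and moves each page break up to the nearest fully white row. All names (isBlankRow, splitAtBlankRows, whitePoint) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: true when no pixel in the row is darker than whitePoint.
// `gray` holds one byte per pixel, row by row.
bool isBlankRow(const std::vector<unsigned char>& gray, std::size_t width,
                std::size_t row, unsigned char whitePoint) {
    for (std::size_t x = 0; x < width; ++x) {
        if (gray[row * width + x] < whitePoint) return false;
    }
    return true;
}

// Returns the row at which each output page ends (exclusive).
// A page is at most pageHeight rows tall. Each cut is moved up to the nearest
// blank row if one exists; otherwise it stays a hard cut at pageHeight.
std::vector<std::size_t> splitAtBlankRows(const std::vector<unsigned char>& gray,
                                          std::size_t width, std::size_t height,
                                          std::size_t pageHeight,
                                          unsigned char whitePoint) {
    assert(pageHeight > 0 && gray.size() == width * height);
    std::vector<std::size_t> cuts;
    std::size_t top = 0;
    while (top < height) {
        std::size_t cut = std::min(top + pageHeight, height);
        if (cut < height) {
            // Walk upwards from the hard cut looking for a blank row.
            for (std::size_t row = cut; row > top; --row) {
                if (isBlankRow(gray, width, row, whitePoint)) {
                    cut = row;
                    break;
                }
            }
        }
        cuts.push_back(cut);
        top = cut;
    }
    return cuts;
}
```

Pages come out with slightly different heights, which is the price for never cutting through a line of text as long as a blank row exists within the page.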

mazzeltjes
01-28-2009, 10:14 AM
Thanx Soilwork
I'll give that a shot
Looks like a lot of work but might be worth it
:thanks::thanks:

murraypaul
02-10-2009, 06:00 PM
(Deleted)

murraypaul
02-11-2009, 03:48 PM
PaperCrop is giving incorrect output on some sections on some pdfs.
An example is attached.
Has anyone worked out what causes this?

A trace of the PDF decoding:
Op = Tf:'F8', 1,
Op = TD:8.473000, 0,
Op = Tj:'publishes articles with a political or social',
Op = TD:-12.350000, -1.198000,
Op = Tc:-0.017000,
Op = Tw:-0.022000,
Op = TJ:['edge. Where’', 75, 's the science? The three of them knew', 93, '.', ],
Op = Tm:26.060000, 0, 0, 26.060000, 62.730000, 251.500000,
Op = g:0.610000,
Op = Tj:'M',
Op = Tm:9.600000, 0, 0, 9.600000, 89.450000, 263,
Op = g:0,
Op = Tc:-0.013000,
Op = Tw:-0.020000,
Op = Tj:'arch had the world biting its nails that asteroid 1997 XF-11 might',
Op = T*:
Op = Tc:-0.012000,
Op = Tw:0.058000,
Op = TJ:['pass close enough to the ear', -8, 'th in 30 years to collide. (Reanalysis', ],
Op = TD:-2.743000, -1.198000,
Op = Tw:0.126000,
Op = TJ:['promised a comfor', -6, 'table margin for safety', 87, '.) Then Hollywood staged a', ],
Op = T*:
Op = Tw:0.032000,
Op = TJ:['summertime double featur', -6, 'e, with ', ],
Op = Tf:'F10', 1,
Op = TD:14.372000, 0,
Op = Tj:'Deep Impact ',
Op = Tf:'F8', 1,
Op = TD:5.814000, 0,
Op = Tj:'destroying the world by',
Op = TD:-20.185000, -1.198000,
Op = Tj:'comet in May and ',

It is as though it has lost or corrupted the font info, and when the font gets reset everything looks ok again.

murraypaul
02-11-2009, 06:08 PM
I've tracked down a bit more on this.
What is happening is that the Tm command triggers a font update. Part of this update checks that the substituted font is roughly the same width as the intended font, and scales it if not. For the paragraphs in question the getGlyphAdvance call is returning a value greater than 1 (which shouldn't be possible), causing the font to be dramatically over-shrunk.
This can be worked around by changing SplashOutputDev.cc, line 1177, to add a check for w2 <= 1.
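
To make the workaround concrete, here is a standalone sketch of the kind of guard described above. It is not the actual code from SplashOutputDev.cc: the function name substituteFontScale and the thresholds 0.01 and 0.9 are illustrative assumptions, and only the w2 <= 1 check comes from this post.

```cpp
#include <cassert>

// Hypothetical stand-in for the width check described above.
// w1: width of a reference glyph in the intended font,
// w2: advance of the same glyph in the substituted font,
// both in units where 1.0 is one em.
// Returns the horizontal scale factor to apply to the substituted font.
double substituteFontScale(double w1, double w2) {
    assert(w1 >= 0.0);  // glyph widths are not expected to be negative
    // An advance outside (0, 1] is treated as bogus, so no scaling is applied.
    if (w2 <= 0.0 || w2 > 1.0) return 1.0;
    // Shrink only when the intended font is clearly narrower than the substitute.
    if (w1 > 0.01 && w1 < 0.9 * w2) return w1 / w2;
    return 1.0;
}
```

Without the upper bound, an advance of, say, 40 would produce a scale of w1 / 40 and shrink the text to near invisibility, which matches the symptom in the attached example.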

murraypaul
02-12-2009, 05:56 AM
Quick-and-dirty svn diff to enable four-sided cropping, with crop figures entered as percentages of width/height:

Index: papercrop-read-only.psm/PDFwin.cpp
===================================================================
--- papercrop-read-only.psm/PDFwin.cpp (revision 2)
+++ papercrop-read-only.psm/PDFwin.cpp (working copy)
@@ -823,17 +823,17 @@

SummedAreaTable t(*bmp);//bmp->getWidth(), bmp->getHeight(), bmp->getDataPtr(), bmp->getRowSize());

- TRect domain(0,0, bmp->GetWidth(), bmp->GetHeight());
-
-
-
-
FlLayout* layout=mLayout->findLayout("Automatic segmentation");
double min_gap_percentage=layout->findSlider("MIN gap")->value();
double margin_percentage=layout->findSlider("Margin")->value();
int thr_white=layout->findSlider("white point")->value();
+ double cropT=layout->findSlider("Crop T")->value()/100.0;
+ double cropB=layout->findSlider("Crop B")->value()/100.0;
+ double cropL=layout->findSlider("Crop L")->value()/100.0;
+ double cropR=layout->findSlider("Crop R")->value()/100.0;
double max_width=1.0/layout->findSlider("N columns")->value();

+ TRect domain(cropL*bmp->GetWidth(),cropT*bmp->GetHeight(), (1-cropR)*bmp->GetWidth(), (1-cropB)*bmp->GetHeight());

ImageSegmentation s(t, true, domain, 0, min_gap_percentage, thr_white);
s.segment();
@@ -994,4 +994,20 @@
temp.format("selected %d", mSelectedRect);
fl_draw(temp, 0,0,w(), h(), FL_ALIGN_CENTER);
*/
+
+ FlLayout* layout=mLayout->findLayout("Automatic segmentation");
+ double cropT=layout->findSlider("Crop T")->value()/100.0;
+ double cropB=layout->findSlider("Crop B")->value()/100.0;
+ double cropL=layout->findSlider("Crop L")->value()/100.0;
+ double cropR=layout->findSlider("Crop R")->value()/100.0;
+
+ int wCropL=toWindowCoord(cropL,cropT).x;
+ int wCropT=toWindowCoord(cropL,cropT).y;
+ int wCropR=toWindowCoord(cropR,cropB).x;
+ int wCropB=toWindowCoord(cropR,cropB).y;
+
+ fl_draw_box( FL_BORDER_FRAME, 0, 0, ww, wCropT, FL_BLACK);
+ fl_draw_box( FL_BORDER_FRAME, 0, hh-wCropB, ww, wCropB, FL_BLACK);
+ fl_draw_box( FL_BORDER_FRAME, 0, 0, wCropL, hh, FL_BLACK);
+ fl_draw_box( FL_BORDER_FRAME, ww-wCropR, 0, wCropR, hh, FL_BLACK);
}


Index: papercrop-read-only.psm/RightPanel.cpp
===================================================================
--- papercrop-read-only.psm/RightPanel.cpp (revision 2)
+++ papercrop-read-only.psm/RightPanel.cpp (working copy)
@@ -61,6 +61,10 @@
double margin=L.getValue<double>("margin");
int nColumns=L.getValue<int>("N_columns");
int white_point=L.getValue<int>("white_point");
+ double cropT=L.getValue<double>("crop_T");
+ double cropB=L.getValue<double>("crop_B");
+ double cropL=L.getValue<double>("crop_L");
+ double cropR=L.getValue<double>("crop_R");
std::string option=L.getValue<std::string>("option");

FlLayout* layout=findLayout("Automatic segmentation");
@@ -68,6 +72,10 @@
layout->findSlider("Margin")->value(margin);
layout->findSlider("N columns")->value(nColumns);
layout->findSlider("white point")->value(white_point);
+ layout->findSlider("Crop T")->value(cropT);
+ layout->findSlider("Crop B")->value(cropB);
+ layout->findSlider("Crop L")->value(cropL);
+ layout->findSlider("Crop R")->value(cropR);
find<Fl_Input>("Option_Input")->value(processOption(option.c_str()));

redraw();
@@ -102,6 +110,18 @@
layout(0)->create("Value_Slider", "white point","white point");
layout(0)->slider(0)->range(230, 255);
layout(0)->slider(0)->step(1);
+ layout(0)->create("Value_Slider", "Crop T","Crop T");
+ layout(0)->slider(0)->range(0, 20);
+ layout(0)->slider(0)->step(0.1);
+ layout(0)->create("Value_Slider", "Crop B","Crop B");
+ layout(0)->slider(0)->range(0, 20);
+ layout(0)->slider(0)->step(0.1);
+ layout(0)->create("Value_Slider", "Crop L","Crop L");
+ layout(0)->slider(0)->range(0, 20);
+ layout(0)->slider(0)->step(0.1);
+ layout(0)->create("Value_Slider", "Crop R","Crop R");
+ layout(0)->slider(0)->range(0, 20);
+ layout(0)->slider(0)->step(0.1);
layout(0)->create("Button", "update","update");
layout(0)->updateLayout();

Index: papercrop-read-only.psm/presets/two-column papers (portrait).lua
===================================================================
--- papercrop-read-only.psm/presets/two-column papers (portrait).lua (revision 2)
+++ papercrop-read-only.psm/presets/two-column papers (portrait).lua (working copy)
@@ -2,4 +2,8 @@
margin=1.45
N_columns=2
white_point=255
-option="(portrait) vertical scroll (outputs multiple images)"
\ No newline at end of file
+crop_T = 0
+crop_B = 0
+crop_L = 0
+crop_R = 0
+option="(portrait) vertical scroll (outputs multiple images)"
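
The arithmetic behind the new TRect domain line can be tried in isolation. The sketch below is a self-contained approximation, not part of the patch: CropRect and cropDomain are hypothetical names, the crop values are percentages as entered on the sliders, and rounding to the nearest pixel is a choice made here.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical rectangle type: left/top inclusive, right/bottom exclusive.
struct CropRect {
    int left;
    int top;
    int right;
    int bottom;
};

// Converts four crop percentages (0..100) into a pixel rectangle inside a
// width x height bitmap, mirroring the domain computation in the patch.
CropRect cropDomain(int width, int height,
                    double cropT, double cropB, double cropL, double cropR) {
    assert(cropT >= 0 && cropB >= 0 && cropL >= 0 && cropR >= 0);
    assert(cropT + cropB < 100.0 && cropL + cropR < 100.0);
    CropRect r;
    r.left = static_cast<int>(std::lround(cropL / 100.0 * width));
    r.top = static_cast<int>(std::lround(cropT / 100.0 * height));
    r.right = static_cast<int>(std::lround((1.0 - cropR / 100.0) * width));
    r.bottom = static_cast<int>(std::lround((1.0 - cropB / 100.0) * height));
    return r;
}
```

For a 1000 x 2000 bitmap with Crop T = 10, Crop B = 5, Crop L = 2 and Crop R = 3, the segmentation domain becomes the rectangle from (20, 200) to (970, 1900). With all four sliders at 0 it is the whole page, which is why the preset defaults to zeros.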

Taesoo Kwon
02-13-2009, 09:30 AM
This can be worked around by changing SplashOutputDev.cc, line 1177, to add a check for w2 <= 1.

Could you e-mail me the corrected source files for this and the following patch?

Taesoo Kwon.

Mach1.9pants
04-06-2009, 04:07 AM
Hi
Thanks for this great tool, it is just what I needed. I have (pre-) ordered an eSlick reader so hopefully it will do some of this since it just uses PDFs. It should work for reflowing my University papers however it won't work (I reckon) for the second reason I am getting an eSlick; my Dungeons and Dragons PDFs. These are all online PDFs now in a landscape format in 3 columns. I have trialled your tool and it works great, although I have to do some manual editing, to get a PDF of just one column... perfect for my eSlick (once again, I hope!).

I have one (rather stupid) question when I manually select the areas to crop I have the first page in your program, uuummm how do I then move onto see the second page of the document so I can set the crop on that. :blush: Auto segmentation makes too many errors?

Oh the only bug I have found is that if you right click in the area of the PDF (like, for example, you are maybe looking for a next page function ;)) it crashes the programme. I am using XP Pro SP3 all updated etc.

Thanks again
M1.9P

Taesoo Kwon
04-10-2009, 09:23 AM
Hi
I have one (rather stupid) question when I manually select the areas to crop I have the first page in your program, uuummm how do I then move onto see the second page of the document so I can set the crop on that. :blush:

Hi,
Try page up and down keys if that's what you are asking.
BTW, currently PaperCrop applies the same areas to all pages you are converting.
Thanks,

Taesoo.

Mach1.9pants
04-10-2009, 03:59 PM
Thanks I have (now) figured that it applies the same cut to all. Works (mostly) OK for me so still a great tool, often the 'title' page has a picture so being able to go down to the text is great, thanks for your reply

suecsi
05-19-2009, 10:36 AM
This looks great and with my pdf definitely selected things right, but I would actually like to apply the same crop to all pages, I can't seem to find that setting?

Taesoo Kwon
05-22-2009, 10:43 AM
Suecsi:
Manual segmentation mode applies the same crop to all pages. This might be what you need.

crAss
05-24-2009, 04:59 AM
A very BIG thank you for this piece of software. I was looking for something like that ever since I got my cybook gen 3. Now I can say that I am not in need of any firmware update or whatever because I can now read everything I download from the Internet easily. Thanks again!

namiamy
05-31-2009, 04:08 AM
thanks, wonderful tool

ss1997
09-13-2009, 08:11 PM
really good tool that i am looking for, but, i can't figure out how to create a script that, simply crop a "two pages scan" into 2 pages in portrait. :p

can anyone give me some idea? (is there any reference about the script functions of this wonderful tool) :)

ab7vf
09-15-2009, 03:10 PM
re 2 page scan

manual segmentation , one image per page ...

in other news, papercrop runs under wine .. Slackware 13.0 linux for those non-windows users

zlds
10-13-2009, 06:33 PM
I used this software. I found it no good for CROP PDF files. Big problem is it made font to become so big. I could read it by my sony prs 600.

quarkonics
10-23-2009, 08:56 AM
I used this software. I found it no good for CROP PDF files. Big problem is it made font to become so big. I could read it by my sony prs 600.

I'd like to share my script for papercrop that is supposed to resolve this issue.
It tries to resolve 4 issues in all:
1. Sometimes PaperCrop crashes. In my experiments this happens when postprocessimage is called on some small rects, so I work around it by always calling postprocessimage on the whole page.
2. The script finds the largest rect on the page, scales it to match your device width, and scales the other rects by a similar ratio. (It seems the image width is not exported, so I take a not-so-accurate approach.)
3. After step 2, all rects are concatenated vertically and then split into pages that match your device size.
4. Rects whose height-to-width ratio is larger than 5 are dropped, since no reading blocks have this shape. Of course, you can remove this part if you don't like it.
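
The steps above can be sketched as follows. This is not the attached Lua script, only an illustration in C++ under a few assumptions: "largest rect" is taken to mean the widest one, tall thin rects are dropped before scaling, and Block, scaleBlocks and pageCount are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical description of one segmented region, in pixels.
struct Block {
    int width;
    int height;
};

// Drops blocks taller than maxAspect times their width, then scales the rest
// by one common ratio so that the widest block exactly fills deviceWidth.
std::vector<Block> scaleBlocks(const std::vector<Block>& rects,
                               int deviceWidth, double maxAspect) {
    assert(deviceWidth > 0);
    std::vector<Block> kept;
    int widest = 0;
    for (const Block& b : rects) {
        if (b.width <= 0 || b.height <= 0) continue;
        if (static_cast<double>(b.height) / b.width > maxAspect) continue;
        kept.push_back(b);
        if (b.width > widest) widest = b.width;
    }
    if (kept.empty()) return kept;
    const double ratio = static_cast<double>(deviceWidth) / widest;
    for (Block& b : kept) {
        b.width = static_cast<int>(std::lround(b.width * ratio));
        b.height = static_cast<int>(std::lround(b.height * ratio));
    }
    return kept;
}

// Number of device pages needed once the blocks are stacked vertically.
int pageCount(const std::vector<Block>& blocks, int deviceHeight) {
    assert(deviceHeight > 0);
    long total = 0;
    for (const Block& b : blocks) total += b.height;
    return static_cast<int>((total + deviceHeight - 1) / deviceHeight);
}
```

Because every block shares one ratio, the text size stays consistent across the page instead of each block being stretched to the full device width, which is what makes fonts look oversized.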

sankar10
11-04-2009, 02:59 PM
Thank You for this wonderful software . It makes previously unreadable pdf readable in bebook . good work:2thumbsup

Nathan Campos
12-27-2009, 09:16 AM
Thanks very much for this very useful tool! :)

panou
01-04-2010, 12:21 PM
hi,
thanx for the tool. I have a problem with an approx. 2000-page PDF book. The cropping starts, and at around page 600 a window shows up with an error: "malloc failed" or something. I have tried to split the file with Nitro, but PaperCrop doesn't want to crop the extracted pages. Do you have any suggestions for this problem? Thank you in advance.

ziegl027
01-20-2010, 09:44 AM
So, after I process the file, how do I save it somewhere so I can get it on the reader?? I see the page with the crop boxes outlined, and it looks fine, but can't seem to figure out how to get the final rearranged document?

Taesoo Kwon
03-20-2010, 02:55 PM
So, after I process the file, how do I save it somewhere so I can get it on the reader??

Click process all pages button. Output will be in the same folder as the input file.

bthoven
03-28-2010, 04:42 AM
Hi,

I have this document which has two pages per PDF page. How do I specify the right option to make it one page per PDF page?

Thanks in advance.

TGS
03-28-2010, 09:55 AM
Hi,

I have this document which has two pages per PDF page. How do I specify the right option to make it one page per PDF page?

Thanks in advance.

Can't at the moment help you split the pages - but isn't that Ajahn Mun?

bthoven
03-28-2010, 10:34 AM
Yes, the book is about Ajarn Mun. I've tried "Portrait - Fit image to screen", and it did split pages for me; but it also produces a number of blank pages in between.

I then use pdfmanipulate.exe in Calibre to split it into many files by skipping those blank pages; to my surprise, when I merge those files with pdfmanipulate merge, those blank pages came back!

Taesoo Kwon
04-02-2010, 08:30 AM
I've tried "Portrait - Fit image to screen", and it did split pages for me; but it also produces a number of blank pages in between.

If you use the latest version on the googlecode page, you can pretrim the borders (using crop T, crop B, crop L, crop R). Then choose "Portrait - Fit image to screen" should not produce blank pages.

nelson7lim
08-22-2010, 05:10 PM
how do i manually crop the pages? i get an error saying page compare to nil or something. error 87.

thrawn_aj
08-22-2010, 07:30 PM
Old post I know, but I just found this and it's just perfect for reading research papers on my nook. Excellent piece of software! Thanks a million!

louislaolu
09-09-2010, 08:46 AM
Thanks for this fantastic programme!
Could anyone please tell me how I can edit the config.lua to achieve a higher resolution?
The default resolution is not eye-friendly enough!

oldmankit
09-19-2010, 06:23 AM
=X=,

Sure that you still have a line like
nr_of_pages_per_pdf_book = 100;
somewhere ?

If that is the case, it can't hurt to put a line like
book_pages:init(1);
just before the line
function outputImage(image, outdir, pageNo, rectNo)

Anyway it's strange. The file works fine for me.

Good luck,

Wilfried

I had the same problem, and it went away by following this advice.

However, the output directory is empty. I can't find any output pdfs or jpgs anywhere? A folder is created with the same name as the original pdf, but after having pressed 'process', it remains empty.

Interestingly I'm trying this under windows and linux (with wine), and had the exact same original problem (lua error), fixed it, but have the remaining no output problem!

oldmankit
09-22-2010, 07:12 AM
Still patiently(?) waiting for some help on this...

If there's no solution to my problem, does anyone know of any software that does something similar? I'm guessing the answer is 'no', in which case getting it to work is even more important!

frabjous
09-22-2010, 05:06 PM
Still patiently(?) waiting for some help on this...

If there's no solution to my problem, does anyone know of any software that does something similar? I'm guessing the answer is 'no', in which case getting it to work is even more important!

Try BRISS (http://www.mobileread.com/forums/showthread.php?t=83053). In fact, I personally like it better, since it doesn't rasterize everything, and is platform neutral. It isn't specifically tailored for multiple columns, but since you can hand-define as many crop regions as you want, it does the same thing.

oldmankit
09-26-2010, 10:53 AM
Try BRISS (http://www.mobileread.com/forums/showthread.php?t=83053). In fact, I personally like it better, since it doesn't rasterize everything, and is platform neutral. It isn't specifically tailored for multiple columns, but since you can hand-define as many crop regions as you want, it does the same thing.

Hi,

That's great. At first I thought the tool wasn't working, when I saw all the similar pages blurred together. Then I realised it was saving me a heck of a lot of time.

I couldn't ask for anything more than this tool. Thanks for the pointer!

Kit

vjdew
12-05-2010, 06:00 AM
Hi Taesoo Kwon,
I downloaded the tools and ran this on one of the technical paper with 2 columns.
While running a process all pages, I chose ( portrait) vertical scroll (outputs a single image if not too big).
But it did not take the crops in sequence. The first page was fine, but from the second page on it took the left column to be the first crop and col1 as the second crop. Can you please help me resolve this?

Thanks
VJ

civiliza
01-13-2011, 11:11 AM
Having just tried PaperCrop 0.43, I found I was getting the same error (lua error config.lua:104: attempt to compare nil with number) as per Issue 6 on your web page.

Unlike the raiser(s) I encountered the problem with Windows XP.

You might be interested to know that the problem only occurred when the "Process current page" button was pressed; it did not happen when the "Process all pages" button was used.

--

This in turn raises a different issue - when a pdf contains different page orientations / margins (in my case a landscape cover page followed by portrait contents) is there any way to process a range of pages with one set of parameters then another range with other parameters.

Even if single page processing was working, I could not face processing 159 pages one at a time.

Taesoo Kwon
02-08-2011, 03:49 PM
Having just tried PaperCrop 0.43, I found I was getting the same error (lua error config.lua:104: attempt to compare nil with number) as per Issue 6 on your web page.

Unlike the raiser(s) I encountered the problem with Windows XP.

You might be interested to know that the problem only occurred when the "Process current page" button was pressed; it did not happen when the "Process all pages" button was used.

--

This in turn raises a different issue - when a pdf contains different page orientations / margins (in my case a landscape cover page followed by portrait contents) is there any way to process a range of pages with one set of parameters then another range with other parameters.

Even if single page processing was working, I could not face processing 159 pages one at a time.

I will add this functionality in the next version.
By the way, it is difficult for me to answer questions here.
Please use the issue tracker on the google papercrop project page.
Thank you.

Taesoo.

mahatmanto
08-07-2011, 02:47 AM
good idea,
but does it work in apple too?

thomass
08-13-2011, 01:18 AM
Still patiently(?) waiting for some help on this...

If there's no solution to my problem, does anyone know of any software that does something similar? I'm guessing the answer is 'no', in which case getting it to work is even more important!
Try Willus.com's K2pdfopt (http://www.willus.com/archive/#kindle)

Taesoo Kwon
08-19-2011, 07:28 AM
Hello, I am the programmer of papercrop. This thread is too old and I no longer regularly visit it. If you have any problems or questions about papercrop, please use the issue tracker on the code.google.com/p/papercrop page instead. Thank you.

Taesoo.

jakoh
10-13-2011, 01:00 AM
I had the same problem, and it went away by following this advice.

However, the output directory is empty. I can't find any output pdfs or jpgs anywhere? A folder is created with the same name as the original pdf, but after having pressed 'process', it remains empty.

Interestingly I'm trying this under windows and linux (with wine), and had the exact same original problem (lua error), fixed it, but have the remaining no output problem!

Ya i have the same problem of no output. except when i press process current page.

user743
04-09-2014, 06:16 AM
I'm sorry for the double post, but you convert pdf to jpg 800x600 pix.
The screen itself has a small bar on the bottom.
Isn't it better to convert to something like 790x600 pix?
just a question.

calibre already has all the ereader screen dimensions. just chose your ereader and it'll tell you.

(maybe I'll make a list and post it for convenience.)