PDFRead 1.7 released


ashkulz
04-25-2007, 09:02 AM
UPDATE: PDFRead 1.7 has been released. The changes for 1.7 and the batch conversion instructions come first, followed by the initial release announcement for 1.6.

I've released PDFRead 1.7, which has minor bug fixes and enhancements. Changes in this release:

add a "landscape-half" mode which splits a page into two even halves (gdxf's suggestion)
if the output document does not have the proper file extension, then append it automatically.
remove imagemagick and use pngnq for color reduction.
fix the problems if the PDF has an incorrect TOC referring to an invalid page. Also added option --no-toc to disable TOC generation.


Also, batch conversion can now be done on Windows for all PDFs in a folder.


Download the file attached to linked post and rename it as pdfread-batch.bat
Open up the renamed file, and change the set OPT= line to use the appropriate profile. In case you have installed in a non-default location, change the set LOC= line too.
Copy the batch file into the directory whose PDFs you want to convert, and double-click on it. Please do not put the directory anywhere on the Desktop or in My Documents, as it can cause problems. Put it somewhere in the root of your drive (C:, D:).
The filename will be used as the book title, so be sure to name files properly. Please ensure that the filename does not contain special characters that are not valid UTF-8. An ebook will be created with the same name (but with the given extension, i.e. sample.pdf => sample.lrf).

In case you want to customize further:

Do a normal conversion with your custom params for a single file and copy the command line options to a text file. Some advice on how to copy the options from the window: To copy text from a CMD window, right-click on the title bar (the bar that has the X and minimize buttons), choose Properties, and then enable QuickEdit mode. This lets you highlight text and copy it by right-clicking on it. Copy everything, even if you have to scroll up.
Copy the command line parameters and replace the set OPT= mentioned above. Do NOT include the input filename, the title (-t option) or the call to pdfread, just the options. The value should be valid command line options (http://pdfread.sourceforge.net/#options).


People on OS X/Linux can hack together a similar script very easily, so I won't bother to post it. If you do want such a script, let me know.
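For OS X/Linux, the batch steps above could be sketched roughly as follows. This is not the script shipped with PDFRead: it only mirrors what the instructions describe (one conversion per PDF, filename used as the title, output with the profile's extension). The function name build_commands and the default "python pdfread.py" launcher are assumptions for illustration; the -p, -t and -o options are the ones that appear in command lines quoted in this thread.

```python
import os
import shlex


def build_commands(folder, options, ext="lrf", launcher=("python", "pdfread.py")):
    """Return one argument list per PDF found in `folder`.

    Nothing is executed here; the caller decides how to run the commands.
    `options` plays the role of the `set OPT=` line in the Windows batch file.
    """
    commands = []
    for name in sorted(os.listdir(folder)):
        base, suffix = os.path.splitext(name)
        if suffix.lower() != ".pdf":
            continue  # skip anything that is not a PDF
        output = os.path.join(folder, base + "." + ext)
        commands.append(
            list(launcher)
            + shlex.split(options)          # user-chosen options, e.g. "-p prs500"
            + ["-t", base, "-o", output]    # title taken from the filename
            + [os.path.join(folder, name)]  # input file goes last
        )
    return commands


# Possible usage (assumes pdfread.py is reachable from the current directory):
#   import subprocess
#   for cmd in build_commands(".", "-p prs500"):
#       subprocess.call(cmd)
```

Passing each command as a list rather than one long string keeps paths with spaces intact, since no shell has to re-split them.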


Original announcement follows


After a long wait, PDFRead (http://pdfread.sourceforge.net/) 1.6 has been released. You can download from PDFRead @ SourceForge (https://sourceforge.net/project/showfiles.php?group_id=87679&package_id=91453).

The focus on this release has been to rewrite the code for better maintainability. It can now be easily integrated into other tools. PDFRead now has a plugin based architecture, which will allow new features to be added easily -- which I've already done for this release.

Lots of new image processing options have been added to PDFRead. unpaper integration (http://unpaper.berlios.de/) ensures that bad scans will be cleaned up properly. The new cropping algorithm removes whitespace very aggressively, even from the middle of the page, without any loss of content. All images are now run through an edge-enhancement filter, which is the same one used by both rbmake and RasterFarian.

Support for the TIFF and IMGLIST input formats has been added. The IMGLIST format is a simple text file containing a list of images which are to be considered as a single document.
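A small sketch of how such an image list might be generated. The exact IMGLIST layout is not spelled out here, so this assumes one image path per line, which is only a guess at the format; the helper names natural_key and make_imglist are made up for the example. The natural sort is there so that page10 does not land before page2.

```python
import re


def natural_key(name):
    """Sort key that compares embedded numbers numerically (page2 < page10)."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def make_imglist(image_paths):
    """Return the text of an image list, one path per line (assumed format)."""
    ordered = sorted(image_paths, key=natural_key)
    return "\n".join(ordered) + "\n"
```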

Batch support is not directly present for Windows, but can be achieved via a batch file. The command line used to convert each book (using the current settings) is printed before conversion. You can then copy this to tweak your conversion settings. Users of Linux/OS X are assumed to be familiar with the command-line, and the batch support can be achieved by scripting.

You can also specify a range of pages for conversion. This has the side-effect of giving a preview feature, as specifying the same page as the start and end page will run the processing only for that page.

The Windows GUI has been revamped: there are now tooltips everywhere, and there is no "advanced" page anymore. If you do want to control those parameters, please use the command line directly.

Lots of other minor tweaks have gone into this release.

The detailed changelog for this release:

revamped the Windows GUI: added tooltips, preview feature and show the command line options when executed (useful for batch execution).
add support for TIFF and a list of page images for input.
add unpaper support for image cleanup.
add extremely aggressive whitespace detection, even in the middle of the page text.
added an edge-enhancement filter, similar to rbmake and RasterFarian.
allow all processing stages to be selectively disabled.
allow a page range to be specified for conversion.
tweak the prs-500 profile to rotate right instead of left (thanks gdxf)
add an optional step to optimize generated PNG images via OptiPNG.
removed the dependency on xpdf.
removed the autocontrast and ghostscript cropping features (no longer useful).
fix problem where the IMP file was not created if the latest eBook Publisher was not installed.
complete overhaul of the code for better maintainability.

Some screenshots of the effect of the various image processing options are also attached.

Azayzel
04-25-2007, 10:27 AM
Wow, you've been quite busy! Have to give this one a whirl and see how things turn out. I have a watermarked PDF I hope PDFRead works well on, we'll see. Thanks!

kovidgoyal
04-25-2007, 01:11 PM
Cool now that I've finished the HTML,TXT -> LRF converters, I can look into integrating PDFRead into libprs500. There is one concern: http://www.py2exe.org/index.cgi/Py2ExeSubprocessInteractions

Would you be willing to fix that in your code?

EDIT: More information http://sourceforge.net/tracker/index.php?func=detail&aid=1124861&group_id=5470&atid=105470

ashkulz
04-25-2007, 01:21 PM
Cool now that I've finished the HTML,TXT -> LRF converters, I can look into integrating PDFRead into libprs500. There is one concern: http://www.py2exe.org/index.cgi/Py2ExeSubprocessInteractions

Would you be willing to fix that in your code?

I don't see how it affects pdfread. I use explicit pipes when I'm calling other executables (gs, convert, etc), so there shouldn't be a problem in pdfread. If you mean that you may have a problem when calling pdfread as a console application, I'd suggest not doing it that way. Just import pdfread and call the convert() function -- and you're set to go. You should also replace the variable P_STREAM in common.py with any valid stream -- it currently points to sys.stdout, so that's the only place where you need to replace streams.

kovidgoyal
04-25-2007, 01:39 PM
OK...anyway I just realized that this bug has been squashed in python 2.5.1

ashkulz
04-26-2007, 10:04 AM
For those on Windows, there's a quick way to convert all PDFs in a folder.


Download the attached file and rename it as pdfread-batch.bat
Open up the renamed file, and change the set OPT= line to use the appropriate profile. You may also have to change the EXT= line if you are using a different profile. In case you have installed in a non-default location, change the set LOC= line too.
Copy the batch file into a directory where you want to convert, and double click on it. The filename will be used as the book title, so be sure to name files properly. An ebook will be created with the same name (but with the given extension, i.e. sample.pdf => sample.lrf).

In case you want to customize further:

Do a normal conversion with your custom params for a single file and copy the command line options to a text file. Some advice on how to copy the options from the window: To copy text from a CMD window, right-click on the title bar (the bar that has the X and minimize buttons), choose Properties, and then enable QuickEdit mode. This lets you highlight text and copy it by right-clicking on it. Copy everything, even if you have to scroll up.
Copy the command line parameters and replace the set OPT= mentioned above. Do NOT include the input filename, the title (-t option) or the call to pdfread, just the options. The value should be valid command line options (http://pdfread.sourceforge.net/#options).


People on OS X/Linux can hack together a similar script very easily, so I won't bother to post it. If you do want such a script, let me know.

Gravitas
04-26-2007, 10:43 AM
I'm being such a muppet (so much so that I changed my title and avatar to match), but I was trying to use this software last night and couldn't get any lrf files from it. I also couldn't get the files it did produce (png) into the folder I specified in my output path - they all went into a temp folder. Even when I used the .lrf file extension in the name of the book.

I'm not usually such a muppet IT-wise (thank god, as I'm an IT Manager looking after a MPLS Citrix network over 36 sites with 600 users) so I reckon it's the excitement of finally getting my hands on my Reader tomorrow, that is shortcircuiting my brain.

Any idea what I'm doing wrong? - I have every confidence that you guys will get me using this stuff properly, as you all sorted me out using BD :o

oh, and I'm using Windows

ashkulz
04-26-2007, 12:15 PM
Gravitas: did you use the prs500 or the prs500-l profile? The LRF is produced only if the profile is one of the above two. Otherwise, depending on the profile it will produce output targeted for another device.

If it still doesn't work, can you post a screenshot of the settings before pressing Convert and the explorer view of the output folder?

Don't worry, we all have those days every now and then ;)

Gravitas
04-26-2007, 12:19 PM
I was using the prs500 profile. I'll have another go when I get home and post some screenies.

EDIT

Ok here are my screenies, I'm sure I've done something blindly obviously wrong :o

http://www.empiresfinest.com/settings.bmp
http://www.empiresfinest.com/settings2.bmp
http://www.empiresfinest.com/output.png

kovidgoyal
04-26-2007, 06:15 PM
Works pretty well for me. Minor point:
the spelling of portrait is 'portrait', not 'potrait' (-m option)

kovidgoyal
04-26-2007, 07:22 PM
Hmm problems
The following cmdline causes an exception

python pdfread.py -p prs500 -o /home/kovid/temp/test.lrf -t 'Guide to NumPy' -a 'Travis Oliphant' -f lrf -i pdf -m potrait /home/kovid/documents/text/notes/NumPy/numpybook.pdf --last-page=2

Creating BBeB file ... Traceback (most recent call last):
File "/home/kovid/build/pdfread-1.6/pdfread.py", line 204, in <module>
main()
File "/home/kovid/build/pdfread-1.6/pdfread.py", line 90, in main
delete = output.generate(input.toc)
File "/home/kovid/build/pdfread-1.6/output.py", line 211, in generate
imagenum = toc_map[int(page_)]
KeyError: 12


Probably because the TOC refers to pages not included.

Also, this is my first time rasterizing a PDF (I usually have access to the LaTeX sources). Is the font rasterization always so bad? I've attached samples to show you what I mean.

gdxf
04-27-2007, 12:43 AM
I followed the batch mode instructions to run batch conversion in windows, but had encountered this notice in the command line:

"Unable to determine total number of pages in document
Please enter number of pages: "

When I put in a page number, it results in a blank lrf file.

Here is what the screen says:

"Unable to determine total number of pages in document
Please enter number of pages: 1

Temporary directory: c:\docume~1........

Page 1/1: EXTRACT RASTERIZE BLANK

Creating BBeB file ... done.
Unable to determine total number of pages in document
Please enter number of pages: 1

Temporary directory: c:\docume~1\.........

Page 1/1: EXTRACT RASTERIZE BLANK

Creating BBeB file ... done.
Press any key to continue . . ."




ashkulz
04-27-2007, 05:40 AM
Okay, I've discovered the problem that bit Gravitas and kovidgoyal. The PDF file is incorrect, as it contains a TOC reference for a page that doesn't exist. I've fixed that, and will be making another release tomorrow.
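For the curious, the failure mode can be sketched like this. It is not the actual fix in output.py, only an illustration under assumptions: toc is taken to be a list of (title, page) pairs and toc_map a dict from page number to generated image number, as the toc_map[int(page_)] line in the traceback suggests. The function name resolve_toc is hypothetical.

```python
def resolve_toc(toc, toc_map):
    """Keep only TOC entries whose target page was actually converted.

    A plain toc_map[int(page)] lookup raises KeyError when the TOC points
    at a page that is missing (or outside the requested page range).
    """
    resolved = []
    for title, page in toc:
        imagenum = toc_map.get(int(page))
        if imagenum is None:
            continue  # stale or invalid TOC entry: skip it instead of crashing
        resolved.append((title, imagenum))
    return resolved
```

With only pages 1 and 2 converted, an entry pointing at page 12 is dropped instead of raising KeyError: 12.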

Gravitas
04-27-2007, 06:07 AM
Okay, I've discovered the problem that bit Gravitas and kovidgoyal. The PDF file is incorrect, as it contains a TOC reference for a page that doesn't exist. I've fixed that, and will be making another release tomorrow.

What a star :)

ashkulz
04-27-2007, 12:06 PM
Okay, I've released 1.7. Changes in this release:


add a "landscape-half" mode which splits a page into two even halves (gdxf's suggestion)
if the output document does not have the proper file extension, then append it automatically.
remove imagemagick and use pngnq for color reduction.
fix the problems if the PDF has an incorrect TOC referring to an invalid page. Also added option --no-toc to disable TOC generation.


If you are on OS X or Linux, please recheck the installation instructions -- there have been changes since the last release.

EDIT: I'm going away for the weekend (it's a long weekend), so I may not respond quickly for a few days :)

ashkulz
04-27-2007, 12:16 PM
I followed the batch mode instructions to run batch conversion in windows, but had encountered this notice in the command line:

"Unable to determine total number of pages in document
Please enter number of pages: "

When I put in a page number, it results in a blank lrf file.

Here is what the screen says:

"Unable to determine total number of pages in document
Please enter number of pages: 1

Temporary directory: c:\docume~1........

Page 1/1: EXTRACT RASTERIZE BLANK

Creating BBeB file ... done.

That's a very weird error; it usually results when your installation has not been set up correctly. Can you check the following:


Check whether you can convert PDF files normally via the GUI
Try the attached script with same instructions
Check that the PDFRead location is set correctly (set LOC=)
Uncomment the commented call in the file and try it again and send me the output.
zip up the directory and attach it here or send it to me

ashkulz
04-27-2007, 12:25 PM
Also, this is my first time rasterizing a PDF (I usually have access to the LaTeX sources). Is the font rasterization always so bad? I've attached samples to show you what I mean.

I don't have a Sony Reader, so I can't really see how the generated LRF looks. On the other hand, the converted PDF did look decent when I looked at the PNG. Do you have any particular points that felt really bad? I'm always interested in knowing where I can improve things...

kovidgoyal
04-27-2007, 01:32 PM
You can install the Connect Reader software and use that to see how the files look. Basically the fonts look like they've been rasterized without any antialiasing.

ashkulz
04-27-2007, 01:42 PM
Uhm, I don't have access to a Windows PC at home ... so if you could post some screenshots I'd be grateful. But yes, the fonts do look a bit ragged ... what happens is that I render at 300dpi (anti-aliased), perform dilation at that resolution and then reduce the size. As a result, the anti-aliasing ends up in the reduced image, which is bad because when you downsample it to 4 colors you can get "gaps" where the color information is lost due to the 2-bit grayscale limitation. As far as I know, even RasterFarian has pretty much the same output. Can you try with that and see how good the result is?
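To illustrate the "gaps" mentioned above, here is a toy sketch of 2-bit quantization. It is not what PDFRead or pngnq does internally; it simply snaps 8-bit gray values to four evenly spaced levels (0, 85, 170, 255), which is an assumed palette, and the name quantize_2bit is made up for the example.

```python
def quantize_2bit(pixels):
    """Snap 8-bit gray values (0-255) to the nearest of four gray levels."""
    levels = [0, 85, 170, 255]
    return [min(levels, key=lambda level: abs(level - p)) for p in pixels]


# A thin anti-aliased stroke after downsizing: white, light gray edge,
# mid-gray core, light gray edge, white.
stroke = [255, 220, 120, 220, 255]
flattened = quantize_2bit(stroke)  # [255, 255, 85, 255, 255]
```

The light gray edge pixels (220) are closer to white than to the next gray level, so they vanish and the stroke gets thinner; a stroke made only of such light pixels would disappear entirely.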

BTW, can you try again with 1.7? I replaced imagemagick with pngnq, this may give better output...

Gravitas
04-28-2007, 11:58 AM
1.7 fixed the problems I was having, now works like a dream. Thanks :D

ashkulz
04-28-2007, 12:13 PM
The text is not as clear as non-pdf converted documents, but is perfectly readable so long as I up the font size to medium. I may try the same document again with the pngs optimized to see if that improves the text any, but I'm happy with how it is at the moment.

Well, that's a side-effect of having native font rendering, and putting up with something that is rasterized from PDFs which target a much higher DPI. Also, PNG optimization will try to reduce the file size, not any of the display parameters! You may want to experiment with the DPI and/or edge enhancement level to find what looks best. I don't have a reader, so I don't know whether the default settings I've chosen are equally good for the reader.

ashkulz
04-28-2007, 12:21 PM
Okay, I'm planning to release 1.8 in a day or two. The major feature planned would be an all-color pipeline (with option to downsample to grayscale, of course). This won't be of much use to anyone except people who own the REB 1200 (ie. me ;)) and those who get those newfangled color e-ink readers.

Some previews of how things look in color: raw page (http://puggy.symonds.net/~ashish/downloads/0.png), dilated page (http://puggy.symonds.net/~ashish/downloads/0-dilated.png), and after color reduction (http://puggy.symonds.net/~ashish/downloads/0-dilated-nq8.png). Regular text pages also work as they used to: raw text page (http://puggy.symonds.net/~ashish/downloads/3.png) and the dilated text (http://puggy.symonds.net/~ashish/downloads/3-dilated.png).

Do any of you have any feature requests for 1.8? I don't feel comfortable with such short releases where only a few new things are added ...

gdxf
04-28-2007, 07:41 PM
I used your batch file and changed the batch file conversion directory from "My Desktop" to another drive on my computer. It works! I guess there might be some restriction of user access issue involved, but I am not sure about that.

Some files are converted with no problem, others are still with this annoying "unable to determine total number of pages" problem. I later find that those files that cannot be converted include: 1. pdf files with OCR text underneath the image, 2. pdf files with non-alphabet file names. Hope it can be dealt with in later releases.

kovidgoyal
04-28-2007, 09:26 PM
Uhm, I don't have access to a Windows PC at home ... so if you could post some screenshots I'd be grateful. But yes, the fonts do look a bit ragged ... what happens is that I render at 300dpi (anti-aliased), perform dilation at that resolution and then reduce the size. As a result, the anti-aliasing ends up in the reduced image, which is bad because when you downsample it to 4 colors you can get "gaps" where the color information is lost due to the 2-bit grayscale limitation. As far as I know, even RasterFarian has pretty much the same output. Can you try with that and see how good the result is?

BTW, can you try again with 1.7? I replaced imagemagick with pngnq, this may give better output...

I'm travelling but I'll do some experimentation when I return. I highly recommend vmware and an old windows installation disk.

ashkulz
04-29-2007, 01:25 AM
I used your batch file and changed the batch file conversion directory from "My Desktop" to another drive on my computer. It works! I guess there might be some restriction of user access issue involved, but I am not sure about that.

Did you use the new batch file and if so, did you run from both Desktop and some other place? There's no logical reason I can think of why it shouldn't run from Desktop -- did you get the same error as before or something else when you ran from there?

Some files are converted with no problem, others are still with this annoying "unable to determine total number of pages" problem. I later find that those files that cannot be converted include: 1. pdf files with OCR text underneath the image, 2. pdf files with non-alphabet file names. Hope it can be dealt with in later releases.

That happens when pdftk (http://www.pdfhacks.com/pdftk/) cannot report how many pages there are in a document. You'll have to manually open each such document and find out how many pages there are and enter it. Can you link/post a sample file? I'll have to see how to detect the page count for those files -- they look like their information dictionary is corrupt or something.

gdxf
04-29-2007, 05:23 AM
Did you use the new batch file and if so, did you run from both Desktop and some other place? There's no logical reason I can think of why it shouldn't run from Desktop -- did you get the same error as before or something else when you ran from there?

That happens when pdftk (http://www.pdfhacks.com/pdftk/) cannot report how many pages there are in a document. You'll have to manually open each such document and find out how many pages there are and enter it. Can you link/post a sample file? I'll have to see how to detect the page count for those files -- they look like their information dictionary is corrupt or something.

Yes, I did use the new batch file. It worked well in any other places except on desktop directories. But that doesn't matter very much for me, the point is it at least worked elsewhere.

I manually put in the page number and it encountered the decoding error. I've posted the command line error info below and also attached the zipped directory and problematic file. I think it is because the filename is non-unicode...

---------------------------------------------

Unable to determine total number of pages in document
Please enter number of pages: 2

Page 1/2: EXTRACT RASTERIZE CROP DILATE SPLIT SAVE DONE
Page 2/2: EXTRACT RASTERIZE CROP DILATE SPLIT SAVE DONE
Creating BBeB file ... Traceback (most recent call last):
File "pdfread.py", line 201, in <module>
File "pdfread.py", line 86, in main
File "output.pyo", line 212, in generate
File "pylrs\pylrs.pyo", line 472, in renderLrf
File "pylrs\pylrs.pyo", line 250, in toLrf
File "pylrs\pylrs.pyo", line 246, in toLrfDelegates
File "pylrs\pylrs.pyo", line 250, in toLrf
File "pylrs\pylrs.pyo", line 246, in toLrfDelegates
File "pylrs\pylrs.pyo", line 561, in toLrf
File "pylrs\elements.pyo", line 68, in toString
File "pylrs\elements.pyo", line 76, in write
File "pylrs\elements.pyo", line 51, in _write
File "pylrs\elements.pyo", line 51, in _write
File "pylrs\elements.pyo", line 42, in _write
File "pylrs\elements.pyo", line 25, in _writeAttribute
File "pylrs\elements.pyo", line 13, in _encodeCdata
File "encodings\utf_8.pyo", line 16, in decode
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb1 in position 0: unexpected code byte
Press any key to continue . . .

ashkulz
04-29-2007, 12:32 PM
I manually put in the page number and it encountered the decoding error. I've posted the command line error info below and also attached the zipped directory and problematic file. I think it is because the filename is non-unicode...

Yes, you're right -- it did fail because of the non-unicode filename (I think you have some kind of Chinese/Japanese encoding). That's a limitation of pylrs: you have to use the utf8 encoding (this can be overridden, but it is way too much trouble to implement and get right).

If you ensure that fonts are embedded in the PDF and the filename doesn't have special characters, it should convert properly.
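A quick way to see which filenames would trip this up is to try decoding their raw bytes as UTF-8, which is what fails in the traceback above. This is a minimal sketch; is_utf8 is a helper name invented for the example and is not part of PDFRead or pylrs.

```python
def is_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xb1 cannot start a UTF-8 sequence, matching the error in the traceback.
bad_name = b"\xb1\xe0.pdf"
good_name = "sample.pdf".encode("utf-8")
```

Files for which is_utf8 returns False would need to be renamed before batch conversion.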

Jary
04-29-2007, 06:55 PM
Hi people.

ashkulz, you did a great job ! I've been using 1.6 and I quite like it.

The install is just perfect.
The prs-500 mode is good, and prs500-l is very nice too. Maybe GUI isn't totally clear the first time, and it misses the .lrf extension on my files, but otherwise it rocks :)

One thing: why add "title" when there is "output" field ? When you fill output name, shouldn't it be auto copied in title ?

Good job !

Please keep it up.

gdxf
04-29-2007, 07:09 PM
Thanks ashkulz! I'll convert the filenames to fit utf8 encoding. I've batch converted a dozen files overnight and it turned out quite well.

ashkulz
04-30-2007, 02:06 AM
The prs-500 mode is good, and prs500-l is very nice too. Maybe GUI isn't totally clear the first time, and it misses the .lrf extension on my files, but otherwise it rocks :)

One thing: why add "title" when there is "output" field ? When you fill output name, shouldn't it be auto copied in title ?

You might want to upgrade to 1.7; the extension is now automatically added after processing. The output field is for the output filename which can be anything -- I might want to store books with filename "Author - Title" or any other scheme. That's why I have a separate title field. But yes, you can copy the basic filename as the title (which I do in the batch conversion script) but it's currently not very easy to implement in the GUI (which is actually based on the NSIS installer).

ashkulz
04-30-2007, 02:07 AM
Thanks ashkulz! I'll convert the filenames to fit utf8 encoding. I've batch converted a dozen files overnight and it turned out quite well.

It's good to see that it worked without problems :) I will update the instructions to not put the folder on the desktop.

gdxf
04-30-2007, 02:42 AM
It's good to see that it worked without problems :) I will update the instructions to not put the folder on the desktop.

I just found out that the real cause of that problem is perhaps not with "Desktop" user restrictions. The problem seems to be related to the folder names. To put it simply, the folder names cannot contain spaces. For instance, if the batch file is placed in "C:\My Folder", it won't work. However, it will work with "C:\MyFolder".

ashkulz
04-30-2007, 02:44 AM
BTW, in case anyone wants to monitor new releases, you can subscribe to the RSS feed of the PDFRead file releases (http://sourceforge.net/export/rss2_projfiles.php?group_id=87679) on Sourceforge.

gdxf
04-30-2007, 02:46 AM
ashkulz, any way to add another option for converting pdf to lrf? What I mean is splitting one page into three even parts in landscape mode. I find that in some larger pages, the legibility of two halves is still not good enough.

ashkulz
04-30-2007, 02:59 AM
I think that using the landscape mode (make as many pages as needed) should work for you. It doesn't help if you make it 3 even pages, as ultimately the aspect ratio will determine the width (because making it 3 pages increases the height, not the width).

If you have any suggestions, please post them (with some example PDFs, if you can).

gdxf
04-30-2007, 03:00 AM
I like the batch conversion very much. The computer just continues running to turn pdf files into lrf files without human attention and without error disruption. The next concern I have is about the aspect ratio: how to enhance legibility while still efficiently utilizing reader screen space? i.e., adjusting the aspect ratio to split pages to meet the demand.
The screen display dimensions of the reader:
portrait: 4.54 x 3.47 in (115.4 x 88.2 mm), 754 x 584 pixels
landscape: 6.09 x 4.41 in (154.8 x 112 mm), 1012 x 784 pixels

ashkulz
04-30-2007, 03:05 AM
I just found out that the real cause of that problem is perhaps not with "Desktop" user restrictions. The problem seems to be related to the folder names. To put it simply, the folder names cannot contain spaces. For instance, if the batch file is placed in "C:\My Folder", it won't work. However, it will work with "C:\MyFolder".

Hmm, that makes much more sense. I thought I had guarded against that in my batch file, but I'll look it up later (I don't have Windows access at the moment).
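The spaces problem is the usual quoting issue: an unquoted path is split at the space into two arguments. A small sketch of the effect, not taken from the batch file itself; the command name "pdfread" and the path are placeholders, and subprocess.list2cmdline is an undocumented helper in Python's standard library that applies Windows-style quoting.

```python
import subprocess

args = ["pdfread", "-p", "prs500", r"C:\My Folder\sample.pdf"]

# Naive joining and re-splitting breaks the path into two pieces.
naive = " ".join(args).split()

# Windows-style quoting wraps the argument containing a space in quotes.
quoted = subprocess.list2cmdline(args)
```

Here naive ends up with five arguments instead of four, while quoted keeps the path together as "C:\My Folder\sample.pdf" in double quotes.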

gdxf
04-30-2007, 03:10 AM
I think that using the landscape mode (make as many pages as needed) should work for you. It doesn't help if you make it 3 even pages, as ultimately the aspect ratio will determine the width (because making it 3 pages increases the height, not the width).

If you have any suggestions, please post them (with some example PDFs, if you can).

I am aware that 3 even pages with the same width probably won't help. I am thinking of adjusting the width and length at the same time. I am wondering if there is a way to let the program do it automatically instead of manually fill in the dimension figures after experimentations.

ashkulz
04-30-2007, 03:13 AM
I like the batch conversion very much. The computer just continues running to turn pdf files into lrf files without human attention and without error disruption.

Heh, if you really see the batch file it doesn't have much in it. That's the power of the command-line automation -- which I use in PDFRead internally, while doing the conversion.

The next concern I have is about the aspect ratio: how to enhance legibility while still efficiently utilizing reader screen space? i.e., adjusting the aspect ratio to split pages to meet the demand.
The screen display dimensions of the reader:
portrait: 4.54 x 3.47 in (115.4 x 88.2 mm), 754 x 584 pixels
landscape: 6.09 x 4.41 in (154.8 x 112 mm), 1012 x 784 pixels

That's something that I need ideas from people who actually use the reader -- I don't have one. Also, for the moment, I don't use the reader's landscape mode at all -- do you think I should add a profile for it? You can try how it looks by making a copy of the batch file you have, and modifying the set OPT= line to

set OPT=-p prs500-l --rotate none --hres 784 --vres 1012

If that looks good, then I'll add that as a new profile.

ashkulz
04-30-2007, 03:15 AM
I am aware that 3 even pages with the same width probably won't help. I am thinking of adjusting the width and length at the same time. I am wondering if there is a way to let the program do it automatically instead of manually fill in the dimension figures after experimentations.

Well, you can't adjust the width and height of the reader at all -- all you can do is split it up. If you have any idea of how to go about it, I can automate the calculation.

(heh, we are almost chatting over the forum right now just like IM)

gdxf
04-30-2007, 03:28 AM
ashkulz, yeah, very glad to chat with you here. You can download a Sony Connect Reader Software here to test for PDFRead. It works exactly the same as the Sony Reader Screen. I myself test on this too.
http://ebooks.connect.com/downloadclient.html

gdxf
04-30-2007, 03:43 AM
do you think I should add a profile for it? You can try how it looks by making a copy of the batch file you have, and modifying the set OPT= line to set OPT=-p prs500-l --rotate none --hres 784 --vres 1012 If that looks good, then I'll add that as a new profile.

I tested this mode and found it good at first and then later changed my mind. The texts look too crowded to be sharp enough for reading.

Jary
04-30-2007, 05:27 AM
ashkulz I did find a bug. When I put pages 10 to 15, it starts at 10 and goes up to the end of the book (did that for a preview).

Btw, a preview button, where you could choose a page to see how it looks before converting in .lrf could be a good thing ;)

ashkulz
04-30-2007, 08:15 AM
ashkulz I did find a bug. When I put pages 10 to 15, it starts at 10 and goes up to the end of the book (did that for a preview).

Btw, a preview button, where you could choose a page to see how it looks before converting in .lrf could be a good thing ;)

Hmm, the page range specification works quite well for me. Could you post the command line options used when the command window is started by the GUI? You can look at the first post on how to copy it (or just post a screenshot).

Also, entering the same page as start and end will effectively give you a preview. That's why I didn't add a separate preview option...

ashkulz
04-30-2007, 08:19 AM
I tested this mode and found it good at first and then later changed my mind. The texts look too crowded to be sharp enough for reading.

I guess we'll have to live with the current quality then. Beyond a certain point, the quality of the tool matters less than the quality of the input document ... The only thing I can think of is to add 1/4 mode, but that would work only for documents which have 2 column text.

gdxf
04-30-2007, 07:53 PM
I guess we'll have to live with the current quality then. Beyond a certain point, the quality of the tool matters less than the quality of the input document ... The only thing I can think of is to add 1/4 mode, but that would work only for documents which have 2 column text.
ashkulz, please consider adding a 1/3 mode too. From my experiments, I find that if you reduce the width to 2/3 of its original 784 pixels (which is 523, perhaps the overlap should be counted here, then another 15 pixels added to 523 =538) and keep the length at the same 1012, the result would be 3 evenly split parts, very legible because it is magnified in each part as a result of adjusted aspect ratio, actually much more readable than two halves.

ashkulz
05-01-2007, 11:54 AM
ashkulz, please consider adding a 1/3 mode too. From my experiments, I find that if you reduce the width to 2/3 of its original 784 pixels (which is 523, perhaps the overlap should be counted here, then another 15 pixels added to 523 =538) and keep the length at the same 1012, the result would be 3 evenly split parts, very legible because it is magnified in each part as a result of adjusted aspect ratio, actually much more readable than two halves.

That would mean in the Reader landscape mode, right? I'll be busy for a few days, so probably will work on it this weekend...

gdxf
05-01-2007, 06:18 PM
That would mean in the Reader landscape mode, right? I'll be busy for a few days, so probably will work on it this weekend...
It should work the same way as when you did landscape-half mode, where you use "-p prs500-l -m landscape-half". You can simply modify it a little bit to become "-p prs500-l -m landscape-one third". In my view, whether the width/length is 1012/784 or 754/584 does not matter much, since the aspect ratio is about the same 1.29. In one third mode, the aspect ratio may be approximately 1.88. You may test it on the Connect Software yourself on these figures. Thanks.
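
The ratios quoted above can be checked with a few lines of Python. This is only illustrative arithmetic on the figures from the posts (784, 1012, 754, 584 and the 15 extra pixels); the names are made up for the example and nothing here is PDFRead code.

```python
# Illustrative arithmetic only; the pixel figures come from the posts above.
FULL_W, FULL_H = 784, 1012   # width and length quoted for the current output
OVERLAP = 15                 # extra pixels suggested to account for overlap


def aspect(long_side, short_side):
    """Ratio of the long side to the short side."""
    return long_side / short_side


# Two thirds of the width, rounded to whole pixels, plus the overlap.
third_w = round(FULL_W * 2 / 3) + OVERLAP

print(third_w)                              # 538
print(round(aspect(1012, 784), 2))          # 1.29
print(round(aspect(754, 584), 2))           # 1.29
print(round(aspect(FULL_H, third_w), 2))    # 1.88
```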

alex_d
05-04-2007, 06:26 AM
ashkulz, the samples page for pngnq shows amazing rgb->256 quantizations. However, I think its magic is about finding the right palette and it doesn't let you specify your own. If so, then it's the wrong tool for the job, because the reader cannot display its optimized palette and quality would degrade. The only tool I've found so far that does color reduction properly is pnmquant.

Anyway. It looks like you are creating an extremely powerful backend and pdfread may soon obviate rasterfarian. Before that happens, though, you need to add a few more screens to your windows gui. One screen has to let you set up how the page looks. This means rotation, splitting, etc. Optionally in addition, you should have a page that shows an in-window preview (which is important with all your options for autocropping, etc.) Another page, in the beginning, should let you collect multiple sources (e.g. multiple images or chapters from a book). Some form of GUI batching is also needed (maybe it can go on that first page). What I did in rasterfarian was to use the filesystem to effect semaphores. Very dirty, but better than nothing. I don't know too much about the limitations of the way you do your gui, but at least some things should be implemented.

But anyway, now that we have a good backend I really hope someone who knows C# or something can step up and make an excellent front end too. It upsets me how few people know of these rasterization tools (e.g. on sites that talk about the reader or in reviews). That means something very big is still missing.

kovidgoyal
05-04-2007, 11:45 AM
I am working on a GUI right now. Just have to make my current GUI multi-threaded.

ashkulz
05-04-2007, 01:26 PM
I am working on a GUI right now. Just have to make my current GUI multi-threaded.

That's very good news! So libprs500 will hopefully be the official GUI going forward :)

BTW, I think the implementation of the Device class for REB1100 should be done in a day or two, so you'll also have to refactor the GUI to take that into account too...

ashkulz
05-04-2007, 01:46 PM
ashkulz, the samples page for pngnq shows amazing rgb->256 quantizations. However, I think its magic is about finding the right palette and it doesn't let you specify your own. If so, then it's the wrong tool for the job, because the reader cannot display its optimized palette and quality would degrade. The only tool I've found so far that does color reduction properly is pnmquant.

Well, I've done conversions using both pngnq and pnmquant and I've found that the pngnq conversions look best on the desktop. All the other devices I have access to either support monochrome (so it doesn't matter) or at least 16 colors (where the palette difference would be small). Also, most of the PDFs I tested were mostly text where there aren't too many colors involved -- results may be different if images are involved. If you (or someone else) can verify that there is significant degradation then I'll implement pnmquant side by side with pngnq and use pnmquant for the reader profiles.

Anyway. It looks like you are creating an extremely powerful backend and pdfread may soon obviate rasterfarian. Before that happens, though, you need to add a few more screens to your windows gui. One screen has to let you set up how the page looks. This means rotation, splitting, etc. Optionally in addition, you should have a page that shows an in-window preview (which is important with all your options for autocropping, etc.) Another page, in the beginning, should let you collect multiple source (e.g. multiple images or chapters from a book). Also, some form of GUI batching is also needed (maybe it can go on that first page). What i did in rasterfarian was to use the filesystem to effect semaphores. Very dirty, but better than nothing. I don't know too much about the limitations of the way you do your gui, but at least some things should be implemented.

But anyway, now that we have a good backend I really hope someone who knows C# or something can step up and make an excellent front end too. It upsets me how few people know of these rasterization tools (eg on sites that talk about the reader or in reviews). That means something very big is still missing.

Thanks! I still have plans to improve the PDFRead backend; I'd prefer to leave the frontend to kovidgoyal :) Some things I am planning to add:

the 1/3 mode requested by gdxf
1/4 mode, similar to what you have in RasterFarian
color support throughout (only useful for the 1200 atm)
"reflow" mode
The "reflow" mode is where I am going to put in the most effort. That will take the dilated image at 300dpi, use page segmentation techniques for figuring out the individual lines and words (taking inspiration from OCR) and then cutting/pasting image segments so that the aspect ratio of the image is changed, while keeping the same text. This is important as sometimes even when using landscape mode the text is too small because of the page size and/or the aspect ratio. This will effectively reflow documents having mainly text content. I'll be using the ideas/code I've already implemented for aggressive cropping and taking it quite a bit further :)

gdxf
05-04-2007, 04:13 PM
the 1/3 mode requested by gdxf
1/4 mode, similar to what you have in RasterFarian
color support throughout (only useful for the 1200 atm)
"reflow" mode
The "reflow" mode is where I am going to put in the most effort. That will take the dilated image at 300dpi, use page segmentation techniques for figuring out the individual lines and words (taking inspiration from OCR) and then cutting/pasting image segments so that the aspect ratio of the image is changed, while keeping the same text. This is important as sometimes even when using landscape mode the text is too small because of the page size and/or the aspect ratio. This will effectively reflow documents having mainly text content. I'll be using the ideas/code I've already implemented for aggressive cropping and taking it quite a bit further :)

Those improvements are all great news! I am eagerly looking forward to the new release. Ever since the release of the Sony Reader last year, I think one of the most crucial issues for effective reading on it is converting large-sized pdf pages into multiple smaller pages without losing much legibility.

alex_d
05-04-2007, 10:12 PM
ashkulz, that page reflowing sounds pretty cool. will it only take out big spaces, or could you do something subtle like decrease line spacing? Actually, for that, you'd probably be better off trying something drastic like modifying the pdf itself. (It's nearly impossible to increase boldness that way... believe me, I've tried... but font kerning should be easier.)

Another thing you should put in the backend is the ability to render any subsection of a page. JAP used this feature as the basis of the tool, and for a suitably powerful gui it would allow the greatest flexibility.

My next project I think will be to start writing a new gui for the Sony Reader. (The knowledge for doing this is a bit scattered right now, but I've talked with some people who collectively seem to have all the pieces.) At first I'll focus on the core rendering to speed up page-turns (very important for nonfiction/textbooks that I read) and improve quality. I'll do this by implementing page caching and changing the refresh policy in a way that lowers battery life (i.e. no more black-white flashing but rather multiple refreshes). Cold page turns should go from the 2s they are now (for rastered .lrf) to 0.25s or less. Also I'll implement page-number entry and folder-based navigation (and maybe panning, zooming, and even searching... although searching won't be straightforward seeing as I'll still be using an image-based book format). I know none of the above things are an issue for you, ashkulz, with your fancy 1100, but hopefully it'll be the start of a good project.

I've also given some thought to a better font-enhancement algorithm vs edge enhancement that would essentially do autohinting on bitmaps. I'll try experimenting a bit with that (but I have a feeling it'll be incredibly processor-intensive).

alex_d
05-04-2007, 10:51 PM
P.S. For 16-color devices, the dithering won't make a big difference. For the 4-color Sony, it does. White backgrounds get a light gray snow and black text gets a dark gray one. The end result is decreased contrast on a device which has poor contrast to begin with. I made a mistake, however. The right tool is NOT pnmquant (which does the whole find-the-best-palette bs). Rather, it is pnmremap, which lets you specify a custom palette. The command line for it is 'pnmremap -mapfile=palette.pnm -floyd', where palette.pnm is a file that contains:
P2
4 1
255
0 85 170 255

or

P2
16 1
255
0 17 34 51 68 85 102 119 136 153 170 187 204 221 238 255



Also, in regards to further tuning, I suggest you tweak your default settings a little. This may vary by device, but on my reader the best edge enhancement seems to be 7, not 5, and you may want to change your default dilation factor. In fact, how exactly are you doing it right now? By setting a dpi and then multiplying by page size? That would give very inconsistent results if the size of the page is different from A4 (a good example is if one does manual cropping as one should, but also most pdf books are not A4).

kovidgoyal
05-05-2007, 01:47 AM
That's very good news! So libprs500 will hopefully be the official GUI going forward :)

BTW, I think the implementation of the Device class for REB1100 should be done in a day or two, so you'll also have to refactor the GUI to take that into account too...


That's great. I'm going to refactor the GUI anyway to support multithreaded device operations, as I don't like having the GUI freeze while transferring stuff to the device. There may be a couple of small additions to the device interface to support multithreading...nothing that would require any major rewrites though.

ashkulz
05-05-2007, 06:19 AM
ashkulz, that page reflowing sounds pretty cool. will it only take out big spaces, or could you do something subtle like decrease line spacing? Actually, for that, you'd probably be better off trying something drastic like modifying the pdf itself. (It's nearly impossible to increase boldness that way... believe me, I've tried... but font kerning should be easier.)

alex_d, PDFRead already supports "taking out big spaces" and "decreasing line spacing" -- it's done by the cropping mechanism. If you reduce it from 2.0 to something like 1.0 or 0.5, it will decrease the inter-line spacing drastically. An example of removing big spaces is present in the demos at the start of the thread (the "romans" example).

What I want to do with reflow is to break up individual lines, and move part of the words to the next line (and so on for the rest of the lines). Something like what happens when you reduce the browser window size, the text automatically reflows. I've got a very simplistic algorithm in mind, but I'll document it after I've actually tested it out :)
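
The browser-style reflow described above can be pictured as greedy line packing. The sketch below is not the algorithm ashkulz has in mind (he has not documented it yet); it assumes page segmentation has already produced one pixel width per word, and `reflow`, `word_widths`, `line_width` and `gap` are names invented for the illustration.

```python
def reflow(word_widths, line_width, gap=0):
    """Greedily pack word widths into lines no wider than line_width.

    Returns a list of lines, each a list of indexes into word_widths.
    A word wider than line_width gets a line to itself.
    """
    lines, current, used = [], [], 0
    for i, w in enumerate(word_widths):
        # The first word on a line needs no leading gap.
        extra = w if not current else gap + w
        if current and used + extra > line_width:
            lines.append(current)      # line is full, start a new one
            current, used = [i], w
        else:
            current.append(i)
            used += extra
    if current:
        lines.append(current)
    return lines


# Five word images re-wrapped for a 100-pixel-wide line with 10-pixel gaps.
print(reflow([30, 40, 50, 20, 60], 100, gap=10))   # [[0, 1], [2, 3], [4]]
```

A real implementation would still have to cut the word images out of the page and paste them at the new positions; this only decides which words share a line.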

Another thing you should put in the backend is the ability to render any subsection of a page. JAP used this feature as the basis of the tool, and for a suitably powerful gui it would allow the greatest flexibility.

I can't think how to implement this properly, as specifying "which" subsection of the page is quite tricky (considering I have to support multiple devices and dilation at variable DPI). It'd be better to crop it properly in Acrobat or whatever, and then render it through PDFRead (as it respects the CropBox).

My next project I think will be to start writing a new gui for the Sony Reader. (The knowledge for doing this is a bit scattered right now, but I've talked with some people who collectively seem to have all the pieces.) At first I'll focus on the core rendering to speed up page-turns (very important for nonfiction/textbooks that I read) and improve quality. I'll do this by implementing page caching and changing the refresh policy in a way that lowers battery life (i.e. no more black-white flashing but rather multiple refreshes). Cold page turns should go from the 2s they are now (for rastered .lrf) to 0.25s or less. Also I'll implement page-number entry and folder-based navigation (and maybe panning, zooming, and even searching... although searching won't be straightforward seeing as I'll still be using an image-based book format). I know none of the above things are an issue for you, ashkulz, with your fancy 1100, but hopefully it'll be the start of a good project.

Do you mean something that will actually run on the Sony Reader itself? That'd be absolutely great, as I don't think anyone has gotten/wanted to run apps on the Sony yet (as compared to the Librie). Wish you the best of luck, and hats off to you :)

I've also given some thought to a better font-enhancement algorithm vs edge enhancement that would essentially do autohinting on bitmaps. I'll try experimenting a bit with that (but I have a feeling it'll be incredibly processor-intensive).

It's done by all desktop PDF Readers I've seen, and I think it would be rather processor intensive, even if you do it in C.

ashkulz
05-05-2007, 06:38 AM
P.S. For 16-color devices, the dithering won't make a big difference. For the 4-color sony, it does. White backgrounds get a light gray snow and black text gets a dark gray one. The end result is decreased contrast on a device which has poor contrast to begin with.

Hmm, I think that I'll go the image -> pngnq -> pnmremap route. What I'll do is add another command-line option, --grayscale-remap, which will take a comma-separated list of values for the allowed pixel intensities. Then, during conversion I'll create the map dynamically and call pnmremap with it. I'll set it up so that the reader profiles have the proper values set. Do you think that's a good approach?
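
A minimal sketch of how such an option could build the map on the fly, assuming the value arrives as a string like "0,85,170,255". The helper names (`parse_levels`, `write_map`, `remap_command`) are hypothetical and not taken from PDFRead; the only external facts relied on are the plain PGM layout shown earlier in the thread and pnmremap's -mapfile and -floyd options.

```python
def parse_levels(text, maxval=255):
    """Parse a comma-separated list such as '0,85,170,255' into sorted ints."""
    levels = sorted({int(part) for part in text.split(",") if part.strip()})
    if not levels:
        raise ValueError("no gray levels given")
    if levels[0] < 0 or levels[-1] > maxval:
        raise ValueError("levels must lie between 0 and %d" % maxval)
    return levels


def write_map(levels, path, maxval=255):
    """Write the levels as a one-row plain PGM (P2) usable as a pnmremap map."""
    with open(path, "w") as f:
        f.write("P2\n%d 1\n%d\n" % (len(levels), maxval))
        f.write(" ".join(str(v) for v in levels) + "\n")


def remap_command(map_path, image_path):
    """Argument list for pnmremap with Floyd-Steinberg dithering.

    pnmremap writes the remapped image to standard output, so the caller
    has to redirect or capture it.
    """
    return ["pnmremap", "-mapfile=" + map_path, "-floyd", image_path]


print(parse_levels("0,85,170,255"))
print(remap_command("palette.pnm", "page.pgm"))
```

Generating the map from the profile keeps one code path for the 4-level and 16-level devices instead of shipping a palette file per device.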

Also, in regards to further tuning, I suggest you tweak your default settings a little. This may vary by device, but on my reader the best edge enhancement seems to be 7 not five, and you may want to change your default dilation factor. In fact, how exactly are you doing it right now? By setting a dpi and then multiplying by page size? That would give very inconsistent results if the size of the page is different from A4 (a good example is if one does manual cropping as one should, but also most pdf books are not a4).

Hmm, I think that 7 does a little too much edge-enhancement, but I'm willing to change it. Actually, what would be nice is if someone could volunteer to test various parameters and report what looks the best. Any volunteers? ;)

I also do dilation and resizing differently from RasterFarian. I render the PDF/DJVU at the dilation DPI, without specifying a page size. So Ghostscript or DJVU automatically creates an image with a size appropriate for that resolution. I don't know what the size is up front at all, and in fact it varies from book to book. I perform cropping/dilation at that DPI, and then depending on the mode I resize it down and split it up if necessary while maintaining the aspect ratio. This ensures that whatever the input page size may be, it really doesn't matter to PDFRead -- what matters is the DPI, the aspect ratio of the cropped content and the mode.

BTW, I've already implemented everything except the "reflow" mode in SVN. I hope to have at least a rudimentary implementation done by tomorrow (I'll optimize for speed later).

ashkulz
05-05-2007, 06:53 AM
That's great. I'm going to refactor the GUI anyway to support multithreaded device operations, as I don't like having the GUI freeze while transferring stuff to the device. There may be a couple of small additions to the device interface to support multithreading...nothing that would require any major rewrites though.

Instead of trying to make the GUI multithreaded, wouldn't it be a bit simpler to add a callback to display the progress during bulk writes, which is where it would freeze most of the time? So you could display a progress bar showing [[[ xyz bytes transferred ]]]. If you don't specify the callback, then it simply won't be called. I don't know whether calling it directly and updating the Qt control will work: as I recall, on Win32 you have to change control values only from the event thread, so there must be some way of enqueuing things to run in the event thread. I assume you know what'll work in Qt :)
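
The optional-callback idea can be sketched in a few lines. This is not libprs500's device interface; `bulk_write`, `chunk_size` and `progress` are names assumed for the illustration, and the destination is any object with a `write` method.

```python
import io


def bulk_write(dest, data, chunk_size=4096, progress=None):
    """Write data to dest in chunks, reporting bytes written so far.

    progress, if given, is called as progress(bytes_done, bytes_total)
    after every chunk; if it is None, nothing is reported.
    """
    total = len(data)
    done = 0
    while done < total:
        chunk = data[done:done + chunk_size]
        dest.write(chunk)
        done += len(chunk)
        if progress is not None:
            progress(done, total)
    return done


calls = []
bulk_write(io.BytesIO(), b"x" * 10000,
           progress=lambda done, total: calls.append((done, total)))
print(calls)   # [(4096, 10000), (8192, 10000), (10000, 10000)]
```

The callback still runs on whatever thread performs the write, so a GUI would have to hand the numbers over to its event thread rather than touch widgets directly.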

kovidgoyal
05-05-2007, 10:38 AM
The callback way is how it's done now. When I said the GUI freezes, I meant only that you can't perform other tasks while waiting for bulk writes to finish.
The reason I'm switching to threads is that I will want to make a bunch of processes multithreaded...converting from TXT,HTML,PDF->LRF as well as actually copying files to the device. In the future there will probably be more functionality as well. Making it multi-threaded will make implementing a queueing system easy. Besides I don't have much experience with multi-threaded programming and I'm looking at this as a good way to learn ;-)
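
For illustration, a queueing system of this kind can be as small as one worker thread draining a job queue. The sketch uses Python 3's standard `queue` and `threading` modules and is not libprs500 code; `start_worker` and the lambda jobs are placeholders for real conversions or device transfers.

```python
import queue
import threading


def start_worker(jobs, results):
    """Start a daemon thread that runs queued callables one at a time."""
    def run():
        while True:
            job = jobs.get()
            if job is None:              # sentinel: stop the worker
                jobs.task_done()
                break
            try:
                results.put(("ok", job()))
            except Exception as exc:     # report failures instead of dying
                results.put(("error", exc))
            finally:
                jobs.task_done()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


jobs, results = queue.Queue(), queue.Queue()
worker = start_worker(jobs, results)
jobs.put(lambda: "converted sample.pdf")   # placeholder for a real conversion
jobs.put(lambda: 1 / 0)                    # a job that fails
jobs.put(None)
jobs.join()
worker.join()
print(results.get())
print(results.get()[0])
```

Because a failed job is reported through the results queue, the worker keeps going and the caller never blocks waiting on a thread that has died.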

Azayzel
05-05-2007, 01:11 PM
I agree with the whole multi-threading schema, it's more efficient (or should be) and will more fully realize the use of the system on which it runs. I guess the only bottleneck will be if something gets stuck in a deadlock waiting for a thread to finish if it stalls or fails. Guess that's where the learning part will really come into play with your scheduling system. I think Alex_D used this with his program, so if you need pointers he might be of help.

ashkulz
05-05-2007, 02:30 PM
Besides I don't have much experience with multi-threaded programming and I'm looking at this as a good way to learn ;-)

You must really, really want to punish yourself if you want to learn multithreading ... I try to avoid it whenever I can get away with it ;)

kovidgoyal
05-05-2007, 05:44 PM
Hey I'm a theoretical physicist...punishing myself is pretty much a given :)

alex_d
05-06-2007, 03:28 AM
I also do dilation and resizing differently from RasterFarian. I render the PDF/DJVU at the dilation DPI, without specifying a page size. So Ghostscript or DJVU automatically create an image with size appropriate for that resolution. I don't know what the size is up front at all, and in fact it varies from book to book.

That's what I was saying. The boldness intensification changes from book to book.

Hmm, I think that I'll go the image -> pngnq -> pnmremap route.

Use pnmremap only. Pngnq is designed to try to find the best 16 colors out of your monitor's 16M. It then sets up a 16-entry mapping table whose elements are 24-bit rgb. This will of course look good on your PC, but those colors it will find (e.g. a gray that's 250,250,250) just don't exist on the 16-color Iliad. The image's mapping table can't be optimized, it must simply be the one that the Iliad/Reader/etc can natively support. Pngnq might be useful if you want a 4-color bitmap for displaying on a 16-color device, but I don't think pngnq lets you input the right settings for that (I think it always picks out of 16M rgb). Lastly, the "dithering" step (the mixing of pixels after you figure out which pixels to mix) is done just as well by pnmremap as by pngnq. Running pnmremap and pngnq will perform dithering twice, actually reducing quality. (Running pngnq and then displaying on a Reader also does double dithering).

Hmm, I think that 7 does a little too much edge-enhancement, but I'm willing to change it. Actually, what would be nice if someone could volunteer to test various parameters and report what looks the best. Any volunteers?

Yes, we need a poll. I've done extensive testing, but I'm only one opinion. Also, the Sony Reader seems to best like settings which look too harsh on an LCD. But on your 1100, different settings will probably look better (but dude, quit doing image quality testing on your pc!)

[Bitmap autohinting is] done by all desktop PDF Readers I've seen, and I think it would be rather processor intensive, even if you do it in C.

No, I don't think anyone has tried bitmap autohinting before. PDF viewers (and operating systems) do regular autohinting where they look at the vector information itself. That approach can't be applied because by the time I increase the boldness of the font via dilation, I've lost all the vector info. I'll have to shift the pixels themselves.

It might be easier to just go back to the font vector-changing efforts and see if I can get better results out of that. Thing is that it's easy to take embedded fonts out of a PDF, a bit hard to deal with the various formats, and nearly impossible to modify the vectors (there is one commercial program that can do it, and this program is scriptable, but it obviously can't be distributed). What I haven't done, though, is research how to write a program that does the vector modifying itself. However, I think the math of scaling Bezier curves etc. is beyond me. The worst part, though, is I'm not sure how it'll all turn out after rastering. Ghostscript's autohinting engine, for example, isn't optimized to antialias. It treats it like it's higher-res and then downsamples. OpenType autohinting isn't supported. In the end, the letters seem to turn out blurry and require edge-enhancement anyway. Also, unfortunately, it probably won't be able to make any use of the font's internal hinting information (since it'll likely become meaningless after the vectors change).

Azayzel
05-26-2007, 12:51 PM
I was just curious... since this extracts each page of the PDF as a rasterized image, is there any way you can make it use already rasterized images; e.g., PNG, JPG, GIF, etc. That way if we already have the images, only the crop, dilate, save functions need to be run. It's not often that this is the case, but I have a few ebooks that are already in image format.

Thanks!

ashkulz
05-26-2007, 01:58 PM
Well, there is already support for such a scenario with the IMGLIST format. Create a simple text file containing the list of images in the order you want them, and select the input format as imglist (in GUI or via the command-line option -i). Almost all common image formats will be supported, see this page (http://www.pythonware.com/library/pil/handbook/#appendixes) for all supported formats.

gdxf
05-26-2007, 05:28 PM
ashkulz, can you give more details on how to convert multi-page tif files into IMGLIST format? I find PDFRead does not support multi-page tif(f) files. Also, any new development on PDFRead 1.8? Thanks.

Well, there is already support for such a scenario with the IMGLIST format. Create a simple text file containing the list of images in the order you want them, and select the input format as imglist (in GUI or via the command-line option -i). Almost all common image formats will be supported, see this page (http://www.pythonware.com/library/pil/handbook/#appendixes) for all supported formats.

gdxf
05-26-2007, 10:11 PM
I guess if the image file (tiff, tif) does not have to be rasterized, the final result would be much better, as the content quality won't be degraded too much. So even if the screen still displays the same size content, the content would be much more legible. The results in "Just Another Printer" testified to this. The only issue with Just Another Printer is that it does not have batch processing capability. If we can integrate some advantages of JAP into PDFRead, I believe we are going to find a final solution for reading A5 sized tif(f) image files on the Sony Reader.

ashkulz
05-27-2007, 12:09 AM
ashkulz, can you give more details on how to convert multi-page tif files into IMGLIST format? I find PDFRead does not support multi-page tif(f) files. Also, any new development on PDFRead 1.8? Thanks.

Do not use multi-page TIFFs in the IMGLIST format, use the TIFF support directly (Input Type TIFF in the GUI or command line option -i tiff). This will explode the multi-page TIFF and convert it directly. Note that it assumes the TIFF is at 300dpi, or else you may want to turn off dilation (as dilation is good at higher DPIs).

I guess if the image file (tiff, tif) does not have to be rasterized, the final result would be much better, as the content quality won't be degraded too much. So even if the screen still displays the same size content, the content would be much more legible. The results in "Just Another Printer" testified to this. The only issue with Just Another Printer is that it does not have batch processing capability. If we can integrate some advantages of JAP into PDFRead, I believe we are going to find a final solution for reading A5 sized tif(f) image files on the Sony Reader.

The solution already exists! I must not have advertised the features enough, because both you and Azayzel were asking me about things already added in 1.6.

About 1.8, I may not be able to work on it for a week or two as I am currently travelling out of the country. The features I've already added are:
- add the landscape-third and portrait-2col modes (actually can now support NxN splitting)
- add support for color processing

But I haven't had time to release it yet. I'm delaying my image reflowing to 2.0, as it will take quite a bit of time. There's a commercial implementation of it called UbiText (http://www2.parc.com/istl/projects/ubitext/), I am studying it and trying to come up with something.

gdxf
05-27-2007, 12:52 AM
ashkulz, this is all good news! I'm very much looking forward to it. Is the 1/3 mode any good?

As to the multipage tiff issue, when I use the tiff input mode, it always encounters such an error below. Any explanation? I am pretty sure I have the correct file type. Or is it because I've used .tiff files converted from .tif files?
Thanks.

Command Line
============
"C:\Program Files\PDFRead\bin\pdfread" -p prs500-l -i tiff -t "pages.tiff" -o "C:\Tests\page.lrf" --no-crop --no-dilate --no-enhance -m "landscape-half" "C:\Tests\saved\Page.tiff"

Extracting TIFF pages ... done.

Temporary directory: c:\...\temp\pdfread-lsqvn8

Page 1/2: EXTRACT Traceback (most recent call last):
  File "pdfread.py", line 201, in <module>
  File "pdfread.py", line 84, in main
  File "pdfread.py", line 43, in convert
  File "input.pyo", line 160, in get_page
  File "Image.pyo", line 1916, in open
IOError: cannot identify image file
Press any key to continue . . .

Azayzel
05-27-2007, 11:01 AM
Well, there is already support for such a scenario with the IMGLIST format. Create a simple text file containing the list of images in the order you want them, and select the input format as imglist (in GUI or via the command-line option -i). Almost all common image formats will be supported, see this page (http://www.pythonware.com/library/pil/handbook/#appendixes) for all supported formats.

Thanks for the response, I'll give it a whirl once I find a quick method of creating a list with 250+ images (probably just redirect a dir to a text file, now that I think about it).

The reason I had asked this was that the initial result with an older version of JEC gave some pretty buggered results; i.e., really fuzzy text with pieces missing. After reading a few of the latest responses, I think it might be the dilation filter over fuzzing the text too much. I'll play around a bit more.

igorsk
05-27-2007, 02:08 PM
dir /b >filelist.txt
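
Note that the one-liner also lists filelist.txt itself and any non-image files in the folder. A rough Python alternative that keeps only images is sketched below; it assumes the IMGLIST file takes one filename per line (the thread only says it is a simple text file listing the images in order), and `write_image_list` and `IMAGE_EXTS` are names made up for the example.

```python
import os

# Extensions treated as images here; adjust to whatever your book uses.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".tif", ".tiff", ".bmp"}


def write_image_list(folder, list_name="filelist.txt"):
    """Write the image files in folder, one per line, sorted by name."""
    names = sorted(
        name for name in os.listdir(folder)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTS
    )
    with open(os.path.join(folder, list_name), "w") as f:
        for name in names:
            f.write(name + "\n")
    return names
```

Sorting is by plain name, so page10.png comes before page2.png; zero-padded names (page002.png) keep the pages in reading order.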

Bob Russell
06-15-2007, 01:32 AM
Finally got this installed. (Needed to read a pdf on it!)
Works great. I am using either:

1) Layout mode = Default
Profile = prs-500

or

2) Layout mode = Landscape-half
Profile = prs-500

The only problem I have run into is that when I read the resulting .lrf book, the half page seems to sometimes cut right in the middle of a line of text, and I can't read it. Is there a way to get some overlap?

Also, I'm really not sure what profile does vs layout. Especially prs-500-l versus prs-500.

Can anyone clarify a bit, or point me to a post I may have missed with the info?

Thanks!

Bob Russell
06-15-2007, 01:51 AM
Maybe I can sort of guess at the answer to my questions, but I'm not sure, and am also unsure about the optimal settings.

First the profiles:

When you choose prs-500, you get portrait. To see landscape, you need to hold down the size button on the Reader until it switches to landscape mode.

When choosing prs-500-l, you get landscape orientation even when the Reader is set to portrait. It's rotated to the right, presumably to allow the right thumb to change pages.

Next the Layout mode:

It seems that portrait will set the dimensions of the output to show the whole page on a single screen.

Setting it to landscape will cause it to be "wide and short", i.e. landscape dimensions that only look good when you switch the Reader to landscape mode so you can see all of it.

Landscape-half appears to cut the page in half and do landscape output for each half of the page (which the Reader then displays one-half at a time also, making for 4 screens per page on the original).

The disadvantages of landscape-half appear to be the following:
* Lines can be cut in the middle. There is no overlap at the cut, so it can be hard or impossible to read the line that was split.
* When mixed with the Reader's choice to do some overlap automatically in landscape mode, it can be confusing to read because it's not obvious what has been repeated and what is missing (e.g. cut off in the middle of the line and not repeated).

My tentative conclusion:

1) First try a few pages with Landscape/Prs-500
If you can read it at that size (with the Sony Reader in landscape mode), stick with it because that's the more natural version.

2) If you need a larger size, then use Landscape-half/Prs-500
You will have odd page breaks, but at least you can read it unless a line got split in a bad way in the half-page split that PDFRead made.

3) If you have something like a presentation (e.g. two slides per page, one over the other), then just use Portrait/Prs-500 because the slides are probably very large lettering, so you can shrink it a lot. At least that worked in the document I used it for. Actually, I didn't try it, but that sort of document is probably even readable by moving it directly to Connect from the original .pdf also.

Please take the above as the naive descriptions of someone that doesn't know what he's doing yet. Feel free to correct me and add other helpful info, or confirm parts that you folks agree with. I would really appreciate input on a better way to do this!

ashkulz
06-15-2007, 08:41 AM
A "profile" is a collection of settings for the various command line options, one of which is the layout-mode. When you choose "Default" layout in the GUI, you are using the layout defined in the profile.

I have set it up to always use the reader's portrait mode: the reader's landscape mode is never used. If you choose to switch to that, it will not look good as the resolution targeted is for the portrait version. So avoid the reader's landscape mode in general.

As you correctly found, the prs500 profile is for portrait and prs500-l for landscape (holding the reader sideways). There is always some amount of overlap between pages in landscape mode (20 is default), so I'm surprised that you got no overlap. Can you just try using the default settings, just changing the profile to prs500-l and seeing the output?

There is also a major difference between landscape and landscape-half layout: landscape will take as many pages as necessary to show the page in correct aspect ratio (it may be anything from 2-4 pages) while landscape-half will resize the image to fit two pages then chop it up.
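
The difference between the two layouts can be illustrated with a rough sketch. This is not PDFRead's actual code: the function name, the pixel-based overlap, and the example dimensions in the usage note are assumptions made only for illustration.

```python
import math


def landscape_page_count(page_w, page_h, screen_w, screen_h, overlap=20):
    """Estimate how many screens a page needs in 'landscape' layout.

    The page is scaled to the screen width with its aspect ratio kept,
    then cut into slices of screen height. Assumption: `overlap` is the
    number of pixels repeated between consecutive slices.
    """
    scaled_h = page_h * screen_w / page_w  # height after aspect-preserving scale
    if scaled_h <= screen_h:
        return 1
    step = screen_h - overlap  # new content revealed by each extra slice
    return 1 + math.ceil((scaled_h - screen_h) / step)


# 'landscape-half' does not adapt: it resizes the page to fit exactly
# two screens and cuts it in the middle, whatever the aspect ratio is.
LANDSCAPE_HALF_PAGE_COUNT = 2
```

With an assumed 800x600 sideways screen, `landscape_page_count(600, 800, 800, 600)` gives 2 and `landscape_page_count(600, 1000, 800, 600)` gives 3, while landscape-half would produce 2 pages in both cases.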

I've been meaning to release 1.8 for a long time now, but am travelling at the moment so no chance... will probably resume development from next weekend onwards :-)

Bob Russell
06-15-2007, 10:12 AM
Aha! That helps a lot. Thanks!

Actually, it was in landscape & landscape-half with Prs500 (not prs500-l) that I didn't see overlap (in the middle where the page was cut).

I'll try default settings with prs500-l tonight!

Btw, is there an option in the GUI to choose the rotate direction for prs500-l? While I like it rotating to the right because the button is on my right thumb, when I am using a cover that doesn't fold back underneath, it's slightly awkward to have it coming towards my body between me and the Reader as I read. Not a big deal, though.

I'm quite excited at the prospect of being able to read my PDFs on the Reader with very little effort in the conversion!

Bob Russell
06-15-2007, 08:35 PM
Okay, here's what I learned...

There are two good options that I see for setting up the pdf document that I was using (a tech book).

1) Portrait/PRS500
This works well, and you can read itsy bitsy print in portrait, or more likely you will hold down the size button until the Reader goes into landscape mode.

Advantage: Very straightforward and simple, and you use the Reader modes the way you are accustomed to.
Disadvantage: There was some extra margin on my document on the right side of pages, and you do have to put the Reader in landscape mode, which takes a little time to get out of.

2) Default/PRS500-L

Advantages: Easiest and best option, in my opinion, based on the one book. There really is some overlap when you do landscape this way, so you don't miss anything. It works great. It rotates the page to the right, so you can turn pages with your right thumb, and it really isn't bad having the cover opening in towards you.

Disadvantages: I think the pages may not be as evenly split as when you let the Reader break pages in half by itself.

But I don't think the page split really matters. If you want the prettiest document, you probably want to read it on a laptop, not the Reader. What matters here is readability, and I think #2 is actually the best option. At least for my test document.

Come on - surely there are others out there who are using PDFRead and can give us some tips on how to best make use of the program.

The best news is that you can likely just follow #1 and #2 and get great results for many pdf documents. And there are probably more ways to improve it also, but I like fast and easy! Thanks ashkulz! :)

ashkulz
06-16-2007, 02:45 AM
Advantage: Very straightforward and simple, and you use the Reader modes the way you are accustomed to.
Disadvantage: There was some extra margin on my document on the right side of pages, and you do have to put the Reader in landscape mode, which takes a little time to get out of.

The margin on the right side is inevitable, as PDFRead resizes the image down to the reader dimensions while maintaining the aspect ratio of the cropped page, and very few pages have an aspect ratio that matches the screen's. I had thought of doing a resize that does not preserve the aspect ratio, but the output tends to be bad on pages having only a few lines: they expand to fill the whole page, which makes them VERY tall and looks really weird.
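
A small sketch of the arithmetic behind that margin may help. It is an illustration under assumptions, not PDFRead's implementation: the function names and the example sizes are hypothetical.

```python
def fit_preserving_aspect(page_w, page_h, screen_w, screen_h):
    """Scale a cropped page to fit the screen without distorting it.

    Returns (out_w, out_h, margin_w, margin_h), where the margins are the
    unused screen pixels left over in each direction.
    """
    scale = min(screen_w / page_w, screen_h / page_h)
    out_w = round(page_w * scale)
    out_h = round(page_h * scale)
    return out_w, out_h, screen_w - out_w, screen_h - out_h


def stretch_factors(page_w, page_h, screen_w, screen_h):
    """Scale factors a non-aspect-preserving resize would apply (x, y)."""
    return screen_w / page_w, screen_h / page_h
```

For a 500x800 crop on an assumed 600x800 screen, `fit_preserving_aspect` leaves 100 pixels of horizontal margin. For a page with only a few lines, say a 600x200 crop, `stretch_factors` returns (1.0, 4.0): the text would be stretched to four times its height, which is the distortion described above.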

2) Default/PRS500-L

Advantages: Easiest and best option, in my opinion, based on the one book. There really is some overlap when you do landscape this way, so you don't miss anything. It works great. It rotates the page to the right, so you can turn pages with your right thumb, and it really isn't bad having the cover opening in towards you.

Disadvantages: I think the pages may not be as evenly split as when you let the Reader break pages in half by itself.

As I said earlier, in this mode it will not split evenly at all: taking into consideration the screen dimensions and the aspect ratio of the page content, it will make 2, 3 or 4 pages as needed. You can also rotate it left instead of right, but that option is not in the GUI. You have two choices: copy and modify the command line shown when the conversion process starts, or use the batch conversion script and add it to the options. The option to use is "--rotate left".
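
If you keep the copied command-line options in a script, adding the rotation could look like the sketch below. The helper name is hypothetical, and accepting "right" as a value is an assumption; only "--rotate left" and "--no-toc" are options actually mentioned in this thread.

```python
def with_rotation(options, direction):
    """Return a copy of `options` with '--rotate <direction>' appended.

    Any earlier '--rotate <value>' pair is dropped first, so the option
    is never passed twice.
    """
    if direction not in ("left", "right"):  # "right" is an assumed value
        raise ValueError("direction must be 'left' or 'right'")
    cleaned = []
    skip_next = False
    for opt in options:
        if skip_next:  # this is the value that followed a '--rotate'
            skip_next = False
            continue
        if opt == "--rotate":
            skip_next = True
            continue
        cleaned.append(opt)
    return cleaned + ["--rotate", direction]
```

For example, `with_rotation(["--no-toc"], "left")` returns `["--no-toc", "--rotate", "left"]`, which can then be joined onto the rest of the copied command line.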

Come on - surely there are others out there who are using PDFRead and can give us some tips on how to best make use of the program.

The best news is that you can likely just follow #1 and #2 and get great results for many pdf documents. And there are probably more ways to improve it also, but I like fast and easy! Thanks ashkulz! :)

I think that for most users, the default profile settings work very well and they rarely bother to do the extra tuning. I do the same myself: I rarely bother to change the default settings -- in fact, I'm so lazy that I've written a script which will extract all archives in a directory, convert any LIT, TXT, RTF or PDF file to HTML, index all of the HTML files, make an RB ebook and send it across to my ebook. So typically all I do is copy some files to a folder and run the script -- and that's it ;) If you want to see the (linux-specific) script, you can access it here (http://puggy.symonds.net/~ashish/downloads/build-dir.py).

Bob Russell
06-16-2007, 09:24 AM
So one can almost always just do the following...

1) Set profile = "prs500-l" for the Sony Reader
2) Set the input and output filenames, document title & author
3) Click convert!

Now that's simple!!!!

And for those of you that haven't tried it yet, you should know that installation is also very simple. There is a standard installer to guide you through the process, and the new program is then available in the start menu!

I think that for most users, the default profile settings work very well and they rarely bother to do the extra tuning. That's something that I do myself, I rarely bother to change the default settings

harpum
07-14-2007, 05:22 PM
Thank you for making your program.
Bob Russell already mentioned the margin problem.
Can you add an option that does not resize and leaves the original page as it is?

I use pdflatex to split and resize letter paper, which is easier for me to handle. But after using your program, I got a lot of margin.

When I use rasterfarian, there is no margin, but it does not support TOC.

I attached the original pdf file and the results from your program and rasterfarian.
Thanks again.

sputnik
08-02-2007, 03:14 AM
I scanned a book and I got - as one usually gets when scanning standard format books - landscape images (one page to the left, the other to the right). When I use the landscape-half feature of PDFRead, it splits the big image into two halves (corresponding to each page). The problem is that these pages are not readable on the ebook, the text being too small. So I'm wondering if it is possible to use PDFRead to produce readable pages from scanned books without having to manually cut the images first. I'm suggesting a feature that would split a two-page landscape image resulting from scanning a book into 4 smaller images, so that they can be read on an EBW-1150.

Or, I could use PDFRead twice: the first time using landscape-half to split the scan into 2 pages, then using the resulting PNGs to obtain half-pages that are readable on the EBW-1150.
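
The four-way split suggested above comes down to simple box arithmetic. The sketch below is only an illustration of that idea and is not a PDFRead feature: the function name and the optional overlap parameter are assumptions.

```python
def quarter_boxes(scan_w, scan_h, overlap=0):
    """Crop boxes for a two-page scan split into four reading-order pieces.

    Each box is (left, top, right, bottom). The order is: left page top,
    left page bottom, right page top, right page bottom. `overlap` repeats
    that many pixels around the horizontal cut so a split line stays readable.
    """
    half_w = scan_w // 2
    half_h = scan_h // 2
    boxes = []
    for left, right in ((0, half_w), (half_w, scan_w)):
        boxes.append((left, 0, right, min(scan_h, half_h + overlap)))
        boxes.append((left, max(0, half_h - overlap), right, scan_h))
    return boxes
```

For a 2000x1400 scan, `quarter_boxes(2000, 1400, overlap=20)` yields top pieces ending at row 720 and bottom pieces starting at row 680. The boxes could then be passed to an image library's crop function.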

Kraebber
08-06-2007, 11:48 AM
Just tried PDFRead 1.7 and want to convert a PDF to a Sony LRF file. But in the output boxes the ebook always reverts to Rocketbook. How can I make sure I create a Sony Ebook? Thanks

test-subjacked
11-05-2007, 01:08 AM
Are there any pdf converters for the Creative Zen Vision :M 30 gig?

mvoosten
11-30-2007, 05:12 AM
Hi,

After struggling with various formats on my Hanlin V3 and especially with PDF files, I decided to give PDFRead a try.
While there is no profile for the Hanlin V3, I used the Sony 500-l profile and looked at the resulting png images in the temp folder to see how the result would be....
WOW.. How useful!!! This makes PDF files that were previously impossible to read on a small screen a blessing.
However... since the output of PDFRead is limited to some closed formats not supported by the Hanlin V3, I took a look at some alternatives.
Using the loose PNG files only is not an option, as a lot of the reference PDFs I use produce too many files, and I like having a good overview of my folders ;)

So I decided to take a look at some PNG to PDF converters. I found one working quite well, but for bigger files with hundreds of PNG images the resulting PDF gets corrupted. However, the smaller ones I did manage to create give me good hope.

Now, my question is.. since PDFRead converts from PDF and DJVU to PNG (and does a very good job of this!), would it be possible to add an option to convert the PNG files back to PDF and/or DJVU? This way we would have a true PDFRead program that converts PDFs to a format readable by ebook readers, with the advantage that PDF and DJVU files are read by a lot more readers than just the ones in the profiles, making the PDFRead program truly generic and useful (at least for me :D)

mvoosten
11-30-2007, 10:19 AM
A Bug with PDFRead 1.7 and Vista (not sure if that has to do with it but mention it anyway).
When using the output as HTML on the command line, the output location is ignored. Instead it will pop up the temp directory with the output!!

Talking about temp files, PDFRead is not cleaning up the temp files after conversion. The temp folders for creating the PNG output are there after the resulting file is written.

mvoosten
11-30-2007, 10:24 AM
Just tried PDFRead 1.7 and want to convert a PDF to a Sony LRF file. But in the output boxes the ebook always reverts to Rocketbook. How can I make sure I create a Sony Ebook? Thanks

Noticed that too.. I think that there should be no extension option in the output section. Whatever you choose seems to be overwritten by the selected profile anyway.
See my logged bug, for example, on HTML.. There is no way to output to HTML in the interface from what I've seen so far.. Everything is saved as LRF or IMP depending on the profile selection.

My suggestion: Get rid of the forced file extension in the profiles and have a separate selection of output formats, so that the two are independent.
For me, for example, the Hanlin V3 has the same specs as the Sony 500 but it cannot read the LRF format. So all the settings are fine, I just need another output format.
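
As a rough sketch of that suggestion (purely hypothetical, not how PDFRead is structured): the profile would only supply a default output format, and an explicit choice would override it. All names below are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical profile: layout settings plus a *default* format only."""
    name: str
    default_format: str


def choose_format(profile: Profile, requested: Optional[str] = None) -> str:
    """Use the explicitly requested format if given, else the profile default."""
    return requested if requested else profile.default_format
```

With this separation, `choose_format(Profile("prs500", "lrf"), "html")` returns "html", so a device with the same screen as the Sony could reuse the profile's layout settings with a different output format.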

Antonieb
01-04-2008, 11:44 AM
Support for prc or any other format readable by my Bookeen Gen 3 would be nice.

Keep up the good work.. it's a nice tool !!

nrapallo
03-12-2008, 11:29 AM
See continuation of this thread in PDFRead 1.8 Released! (http://www.mobileread.com/forums/showthread.php?p=159387)

PDFRead 1.8 has many enhancements and minor bug fixes. Attached are screenshots of the new version 1.8 GUI.