View Full Version : pythonized PDFrasterFarian
curiouser 03-29-2007, 04:48 PM In the interest of advancing the cross-platform development of alex_d's work, I present a reasonable translation over to python. It's not feature complete with the current PDFrasterFarian, but as is, it should give you nice full page PDF conversions. Plus, I've tossed in some new features. I might have waited a little longer to toss this out, but the resulting output from Google Books PDFs is a big improvement, which I thought would interest some.
Also, the Windows-specific dependencies are pretty much gone. It should not be too hard to go from this point to Linux and Mac versions, as well as automate its use as CGI, etc.
Improvements:
- properly centers content
- does nice trimming of PDFs produced from scans (like ones from Google Books - try it out!)
- monochrome output option for smaller (though a bit less legible) PDFs
- eliminates use of AutoImager (using PIL - no speed penalty)
Missing:
- various orientation settings (splitting into 1/2 and 1/4 pages)
- not using temp dir for scratch work
- changing title & author fields
To Do:
- add proper hookup of command line args
- eliminate dependency on pdftk - should be able to digest the PDF in python
- add a little more control for cropping
Installation:
1) install PDFrasterFarian
2) install python
3) install PIL (http://www.pythonware.com/products/pil/)
4) unzip the attached pyprf.zip into same directory
Usage:
python pyprf.py foo.pdf
(sorry, but if you want to tweak processing options, you'll have to tweak the python code)
Enjoy!
P.S. I just noticed the title and author are hardcoded - sorry!
ashkulz 03-29-2007, 10:06 PM I have also developed a Python based version using PIL and a Windows installer, you can see it at
http://www.mobileread.com/forums/showthread.php?p=63175
Maybe we can merge our efforts so we get a cross-platform, cross-ebook version?
Shake 03-30-2007, 05:14 PM Do I understand correctly that PDFrasterFarian should now work under Linux with your work? That would really be great news.
curiouser 03-30-2007, 05:58 PM The python script no longer needs anything from Windows. However, to get it working under Linux or OS X, a little bit of work needs to be put in. First of all, your Linux system must have the following installed:
- pdftk
- ImageMagick
- ghostscript
- pdftops (which is part of xpdf)
Luckily, all of the above are either part of a standard Linux install, or there should be packages available to install them.
Then, the python script needs to be changed to fix the relevant paths. You're looking at stripping out the paths from the following lines (since the execs should be in your path):
gs_exec = "software\\gs\\gs8.54\\bin\\gswin32c.exe"
im = "software\\ImageMagick-6.3.1-Q8\\convert.exe"
(pin, pout) = os.popen2('software\\pdftk.exe "%s" dump_data' % (fil))
os.popen2('software\\pdftops.exe -f %d -l %d -eps -pagecrop "%s" prv_1.eps' % (pageNum, pageNum, fil))
(pin, pout) = os.popen2("software\\pdftk.exe %s dump_data" % (fil))
Plus, you'll have to make some change to the following line (I suspect you can set it equal to "", since ghostscript will be fully installed, and the items will be found):
gs_includes = '-I.\\software\\gs\\gs8.54\\lib -I.\\software\\gs\\gs8.54\\Resource -I.\\software\\gs\\fonts'
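For anyone attempting the port, here is a minimal sketch of looking the tools up on the PATH instead of hardcoding the Windows locations above. It is written for a current Python 3 (the original script uses os.popen2), and the TOOL_CANDIDATES table and the function names are made up for illustration, not taken from pyprf.py:

```python
import shutil

# Hypothetical table: logical tool name -> executable names to try, in order.
# "gswin32c" is the Windows console Ghostscript binary named in the lines above.
TOOL_CANDIDATES = {
    "ghostscript": ["gs", "gswin32c"],
    "imagemagick": ["convert"],
    "pdftk": ["pdftk"],
    "pdftops": ["pdftops"],
}


def find_tool(name, which=shutil.which):
    """Return the full path of the first candidate found on PATH, or None."""
    for candidate in TOOL_CANDIDATES[name]:
        path = which(candidate)
        if path:
            return path
    return None


def missing_tools(which=shutil.which):
    """List the logical tool names for which no executable was found."""
    return [name for name in TOOL_CANDIDATES if find_tool(name, which) is None]
```

Calling missing_tools() at startup gives a readable error before any page is processed. The `which` parameter exists only so the lookup can be faked in a test.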
The last big thing you'll have to deal with is replacing the following:
os.popen2('software\\lrs2lrf\\lrs2lrf.exe "%s" "%s"' % (fil[:-4] + ".lrs", fil[:-4] + ".lrf"))
I'm not sure what the equivalent is on Linux, but I assume something is out there. Otherwise, you're probably looking at a conversion over to using the pylrs stuff put out by Falstaff (I've started playing with this, but there is a lot about LRS/LRF I don't understand yet).
Finally, you'll need a copy of modules/book_thumb.gif, which you can get from PDFrasterFarian.
I don't have a Linux system going right now, so I can't do this myself. Hopefully this provides enough of a roadmap.
ashkulz 03-30-2007, 09:49 PM "Do I understand correctly that PDFrasterFarian should now work under Linux with your work? That would really be great news."
The script I mentioned above (PDFRead) does work on Linux, I developed it primarily on Ubuntu and did the Windows port later on.
Shake 03-31-2007, 10:24 AM I have tried the thing ashkulz has built.
I strongly suggest that you merge your work.
alex_d 04-01-2007, 01:42 AM yeah, i'd love it if we could all collaborate.
anyone reading this also good at GUIs? maybe .net or java that'll work cross-platform (eg using mono)? Besides getting the nuts and bolts working smoothly across platforms, what i'd really like to see is a good UI that allows you to collate various sources (e.g. folders of images, djvu, etc) and then manually crop the pages properly. That would really make for a great, all-in-one tool.
ashkulz 04-01-2007, 01:54 AM Well, I was talking to kovidgoyal (author of libprs500) and he was thinking of calling PDFRead in its GUI so that there'd be an end-to-end solution for the Sony Reader. I was also planning to contribute to make a REB/EBW backend, but I don't know how feasible that would be -- the device capabilities are completely different.
Maybe we should all get together on chat and have a brainstorming?
kovidgoyal 04-01-2007, 09:44 AM I vote for a PyQt GUI. I'm willing to do the work of making the GUI. I can write it so that it can be run both as a standalone as well as from within libprs500. We can then use py2exe and py2app to make standalone executables for the windows and osx users who don't want to download the full libprs500.
alex_d 04-02-2007, 01:10 AM PyQt will work on windows, and without [significant] dependencies?
Anyway, what's certain is that we need to split the project up into a backend and frontend. The backend could then be used by other people for their front-ends. It should also be modular, so that, e.g., people could pass to it documents like pdfs or they could pass raw images and have them be post-processed and collated. Or they might pass a pdf and the output resolution, and then receive a folder of images they could collate themselves. Thus, the back end could be extended for use with unforeseen input and output formats.
Thus the backend is in three parts: a rasterizing stage (this stage is capable of autocropping, it can also produce output directly for the frontend for showing previews and for setting manual cropping), a processing stage (which takes lists of images, cropping parameters, processing parameters, etc.), and a collating stage.
One fun question is how can the processing stage be made better. Right now it's a dilate and a sharpening. The dilate is straightforward but the sharpening has many parameters. I used defaults for PDFR 2.1 but tweaked them for 2.2 for more effect. I don't think anyone actually got to see the 2.2 changes, but Ashkulz currently feels the benefits of sharpening are negligible (at least on his monochrome LCD device).
I disagree, but what I think is certain is that in the vast world of photoshop filters, there's gotta be something impressive. Come on... who here's been C&Ping beavers on monkeys and passing them off as paris hilton pics? Speak up!
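To make the "dilate" step concrete, here is a rough sketch of what dilating dark text means: every pixel takes the darkest value in its neighborhood, so strokes get thicker. This is only an illustration on plain lists of grayscale values, not PDFrasterFarian's actual code, and the name dilate_dark is an assumption:

```python
def dilate_dark(pixels, radius=1):
    """Thicken dark strokes on a light page.

    `pixels` is a list of rows of grayscale values (0 = black, 255 = white).
    Each output pixel is the darkest value found within `radius` pixels.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            darkest = 255
            # Scan the (2*radius+1)-square window, clipped at the page edges.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    darkest = min(darkest, pixels[ny][nx])
            row.append(darkest)
        out.append(row)
    return out
```

With PIL, ImageFilter.MinFilter(3) should give the same effect on a grayscale image much faster; the loop version is only here to show the idea.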
kovidgoyal 04-02-2007, 09:51 AM Yup you can use py2exe to embed all the dependencies into a single executable.
ashkulz 04-02-2007, 12:19 PM "Anyway, what's certain is that we need to split the project up into a backend and frontend. The backend could then be used by other people for their front-ends. It should also be modular, so that, e.g., people could pass to it documents like pdfs or they could pass raw images and have them be post-processed and collated. Or they might pass a pdf and the output resolution, and then receive a folder of images they could collate themselves. Thus, the back end could be extended for use with unforeseen input and output formats.
Thus the backend is in three parts: a rasterizing stage (this stage is capable of autocropping, it can also produce output directly for the frontend for showing previews and for setting manual cropping), a processing stage (which takes lists of images, cropping parameters, processing parameters, etc.), and a collating stage."
I think we should keep things as simple as they can be, as it makes for easier maintainability. Let's just create one (or maybe two) command line tools, and that's it. The GUI simply calls these and gets its job done. It will also enforce clean separation and make sure that additional frontends can be added.
What we really need to think about is packaging. On Windows, it is quite easy to package all the required stuff; it will be much more difficult to do so for Linux or OS X.
"One fun question is how can the processing stage be made better. Right now it's a dilate and a sharpening. The dilate is straightforward but the sharpening has many parameters. I used defaults for PDFR 2.1 but tweaked them for 2.2 for more effect. I don't think anyone actually got to see the 2.2 changes, but Ashkulz currently feels the benefits of sharpening are negligible (at least on his monochrome LCD device).
I disagree, but what I think is certain is that in the vast world of photoshop filters, there's gotta be something impressive. Come on... who here's been C&Ping beavers on monkeys and passing them off as paris hilton pics? Speak up!"
Well, I was checking the effects of sharpening on my desktop, not on the reader -- I couldn't see any noticeable difference in the images.
Either way, I see that a standalone GUI has very little utility value -- the rudimentary GUI on Windows is good enough for most people, and on Linux/OSX people are OK with the command line. The real benefit as I see it is integration with a device communication app like libprs500/rebcomm, integrated into a rudimentary library management system (like libprs500-gui). That would offer a 1-stop solution for ebook creation, management and communication with the device.
Ideally, we should develop the GUI on a plugin-style architecture (http://trac.edgewall.org/wiki/TracDev/ComponentArchitecture), so that people can easily integrate various apps (rss/word/whatever) without touching the main code.
kovidgoyal 04-02-2007, 02:03 PM I agree with ashkulz. As I said before I'm willing to do the work necessary to
a) Make the libprs500 GUI device independent by defining a set of functions that any device communication software will be expected to provide. I will then refactor the GUI code to achieve as clean a separation between the device communication backend and the GUI as possible. That way, any future device driver writers can use the library management features of the GUI easily.
b) Integrate a conversion GUI into the libprs500-gui. I'm working on a fully open source HTML->LRF converter and I'm happy to add the PDF->LRF/ebook optimized PDF converter as well
Shake 04-03-2007, 04:08 PM That is great news :-). Keep up the good work!
And (but I am sure this is not an important point for you) - I am willing to donate some dollars for the project if it will run on Linux...
kovidgoyal 04-03-2007, 06:57 PM $$ are always welcome ;-)
"That is great news :-). Keep up the good work!
And (but I am sure this is not an important point for you) - I am willing to donate some dollars for the project if it will run on Linux..."
alex_d 04-04-2007, 01:09 AM Ashkulz, it is most important right now to create or plan for a framework that can be easily extended or repurposed for new things and by new applications. Anyway, all it'd be is just three executables. And they'll be doing the same things the monolithic thing is doing now. there'll just be a bit more work talking through a defined interface, but it'll pay off through much greater flexibility. Flexibility to change pieces
how about we discuss a spec?
ok, so, the rasterizer exe would expose some of the things ghostscript, etc should be able to do:
-- input - pdf file or list of files
-- output - output folder and filename
-- output size in pixels and format (8bit, gray, color)
-- autocropping, explicit cropbox
-- (opt) output file type (png, jpg, bmp, raw)
-- (opt) rotation
-- (opt) device-specific features (eg ghostscript's font-rendering modes)
this exe prints out the names of the files it processes so that these could be piped or saved to a variable (or to a file). The other exes should be able to accept input filenames piped in (and maybe from a file).
the processing exe would be:
-- input/output filenames
-- output resolution, format
-- (opt) fit (centered, upper-left, stretched)
-- (impl-specific, opt) dilate factor
-- (impl-specific, opt) eg sharpen or other filter parameters
collating exe would just take a list of files and bind them into a format for some specific device. it would also accept a TOC as a file or something. (people could write new .exe's to add support for new/old devices and file formats)
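The "print the filenames so they can be piped" idea can be sketched in a few lines. This assumes one filename per line on stdin/stdout; process_stage and process_one are hypothetical names, and the real per-image work (dilate, sharpen, fit) is passed in as a callback:

```python
import io
import os


def read_filenames(stream):
    """Yield one filename per non-empty line of `stream`."""
    for line in stream:
        name = line.strip()
        if name:
            yield name


def process_stage(in_stream, out_stream, out_dir, process_one):
    """Read input filenames, process each, and print each output filename.

    `process_one(src, dst)` is a stand-in for whatever the stage really does.
    Output names are written one per line so the next stage can read them.
    """
    for src in read_filenames(in_stream):
        dst = os.path.join(out_dir, os.path.basename(src))
        process_one(src, dst)
        out_stream.write(dst + "\n")
```

Hooked up to sys.stdin and sys.stdout, this would let the three exes be chained as something like `rasterize book.pdf | process | collate` (those command names are placeholders).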
misc ideas-
overcropping... option to crop not at the first black pixel but only after, say, a few dozen (so dust, dots, or lines don't mess up autocropping)
output filenames... ImageMagick etc can take output filename as eg "fileA%02d.png" and produce fileA01.png, fileA02.png
I think a standalone app would be used more than an integrated one. Personally, i just use sd cards and never sony connect. Also, a standalone app can focus better on adding support to do all the things that could give the best results. Maybe doing it in qt will make it more difficult to do something fancy that lets you preview, crop, rotate, etc. I don't know, but i know that manually cropping in acrobat is very, very helpful. However, I've never found a free alternative to do manual cropping.
curiouser 04-04-2007, 11:25 AM Sounds like good stuff is happening. I'm swamped closing out my last semester of school, so I won't be able to contribute for a bit.
Just wanted to point out two bits of code from my work that may be the most useful:
1) overcropping is already implemented - check the trimNoise function. Big help for scanned PDFs (such as Google Books).
2) proper centering of images. Related code is found in trimNoise as well as the main processing function.
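For readers who don't have the script handy, the overcropping idea can be sketched roughly like this. It is NOT the trimNoise function from pyprf.py (which isn't reproduced in this thread), just one way to do it: a row or column only counts as content if it has at least a few dark pixels, so isolated dust is ignored. The name content_bounds and both thresholds are assumptions:

```python
def content_bounds(pixels, dark_below=128, min_dark=3):
    """Return (left, top, right, bottom) of the content, ignoring sparse specks.

    `pixels` is a list of rows of grayscale values (0 = black, 255 = white).
    A row or column counts as content only if it holds at least `min_dark`
    pixels darker than `dark_below`. Right and bottom are exclusive.
    Returns None when nothing on the page qualifies.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    row_counts = [sum(1 for v in row if v < dark_below) for row in pixels]
    col_counts = [sum(1 for y in range(height) if pixels[y][x] < dark_below)
                  for x in range(width)]
    rows = [i for i, count in enumerate(row_counts) if count >= min_dark]
    cols = [i for i, count in enumerate(col_counts) if count >= min_dark]
    if not rows or not cols:
        return None
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)
```

The returned tuple uses the same (left, upper, right, lower) order as the box PIL's Image.crop expects. On real scans min_dark would need to be larger (the "few dozen" mentioned earlier in the thread).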
ashkulz 04-04-2007, 02:42 PM "Ashkulz, it is most important right now to create or plan for a framework that can be easily extended or repurposed for new things and by new applications. Anyway, all it'd be is just three executables. And they'll be doing the same things the monolithic thing is doing now. there'll just be a bit more work talking through a defined interface, but it'll pay off through much greater flexibility. Flexibility to change pieces
how about we discuss a spec?"
Agreed. I've been thinking about this too (couldn't post as I was too busy yesterday). The most sensible way to design the system is to think of it in terms of a pipeline, exactly the way GStreamer is designed (http://en.wikipedia.org/wiki/GStreamer#Internals_technical_overview). So effectively, you write a lot of plugins which implement discrete actions in the whole process (rasterizing, cropping, dilation, etc -- all that you mentioned). Each plugin declares some input and output "pads". We can create different types of "pads", so that you can't accidentally connect an incompatible set of inputs/outputs. I will assume that we write all of this in Python, which I think is best as it is cross-platform. So now we have the following components in the system:
1) the base framework, which defines interactions and the various types of pads
2) the actual plugins, which use the framework and define various types of input/output pads
3) a low-level command line interface which will allow one to create a pipeline and execute it, similar to gst-launch. See this as an example (http://www-128.ibm.com/developerworks/aix/library/au-gstreamer.html?ca=dgr-lnxw07GStreamer#listing2).
4) a command line app, which will parse command-line parameters, do some validations and then finally create a pipeline and execute it via the plugins/framework.
So #4 will be essentially a replacement for what we currently have. Apps who want to use a part (or any combination of the pipeline) will essentially use #3 directly. So ideally, we should have very "thin" glue code in #3 and #4, with most of the logic being in #1 and #2. Also, trying out new approaches is very painless, as it is easy to add a new plugin and introduce in the pipeline via #3. As an example, the current process can be represented as filesrc location=input.pdf ! pdftops ! gsrasterize dpi=300 ! autocrop ! dilate ! resize width=565 height=784 ! makelrf author=XYZ title=foo | libprs500-send.
I don't know if you're familiar with electronics/IC design, but that's essentially what you do there. It would make development MUCH easier and make the whole process much easier to tweak for everyone (once the initial bump is past, of course). So let's say I want to use xpdf for rasterizing (it's much smaller than gs on win32), I replace gsrasterize with xpdfrasterize (which is the only thing I need to write) and then recreate/rerun the pipeline.
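A gst-launch-style description like the one above is easy to parse. The following is only a sketch of the idea, not the eventual PDFRead code: parse_pipeline and replace_stage are invented names, and it assumes that "!" never appears inside a parameter value:

```python
import shlex


def parse_pipeline(text):
    """Split 'a k=v ! b ! c k=v' into [(name, {k: v}), ...].

    Assumes '!' only ever separates stages and never occurs inside a value.
    """
    stages = []
    for chunk in text.split("!"):
        tokens = shlex.split(chunk)
        if not tokens:
            raise ValueError("empty stage in pipeline")
        name, params = tokens[0], {}
        for token in tokens[1:]:
            key, sep, value = token.partition("=")
            if not sep:
                raise ValueError("expected key=value, got %r" % token)
            params[key] = value
        stages.append((name, params))
    return stages


def replace_stage(stages, old, new):
    """Return a copy of `stages` with stage `old` renamed to `new`, parameters kept."""
    return [(new if name == old else name, params) for name, params in stages]
```

Swapping the rasterizer then becomes replace_stage(stages, "gsrasterize", "xpdfrasterize"), with everything else in the pipeline untouched. Parameter values stay strings here; converting dpi or width to numbers would be up to each plugin.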
"ok, so, the rasterizer exe would expose some of the things ghostscript, etc should be able to do:
-- input - pdf file or list of files
-- output - output folder and filename
-- output size in pixels and format (8bit, gray, color)
-- autocropping, explicit cropbox
-- (opt) output file type (png, jpg, bmp, raw)
-- (opt) rotation
-- (opt) device-specific features (eg ghostscript's font-rendering modes)
this exe prints out the names of the files it processes so that these could be piped or saved to a variable (or to a file). The other exes should be able to accept input filenames piped in (and maybe from a file).
the processing exe would be:
-- input/output filenames
-- output resolution, format
-- (opt) fit (centered, upper-left, stretched)
-- (impl-specific, opt) dilate factor
-- (impl-specific, opt) eg sharpen or other filter parameters
collating exe would just take a list of files and bind them into a format for some specific device. it would also accept a TOC as a file or something. (people could write new .exe's to add support for new/old devices and file formats)
misc ideas-
overcropping... option to crop not at the first black pixel but only after, say, a few dozen (so dust, dots, or lines don't mess up autocropping)
output filenames... ImageMagick etc can take output filename as eg "fileA%02d.png" and produce fileA01.png, fileA02.png"
All the features you mentioned above should be implemented as plugins, with the necessary parameters.
"I think a standalone app would be used more than an integrated one. Personally, i just use sd cards and never sony connect. Also, a standalone app can focus better on adding support to do all the things that could give the best results. Maybe doing it in qt will make it more difficult to do something fancy that lets you preview, crop, rotate, etc. I don't know, but i know that manually cropping in acrobat is very, very helpful. However, I've never found a free alternative to do manual cropping."
To each his choice. I mean, PDFRead is working for most people and that's the way it should be for them. If someone finds they need to do something custom, then with this approach they have a gradual path for delving in deeper and deeper. With the above approach, whether you use the command-line app (#4) or just use the pipeline directly from a GUI (#3) becomes irrelevant: both are equally easy to use for different sets of people, and it allows other developers to leverage PDFRead as they see fit.
As an aside, we should call it something other than PDFRead or PDFRasterFarian: the above is not merely a tool, it is an ebook conversion framework. I mean, I can imagine html being a source plugin sometime in the future, so this could be a standard way of interacting with ebook formats, devices and whatnot.
alex_d 04-04-2007, 10:17 PM what exactly do you mean by plugins? Do you mean the "rasterizer" and "post-processing" components that i'm talking about would themselves be composed of smaller pieces?
"the above is not merely a tool, it is a ebook conversion framework. I mean, I can imagine that html being a source plugin sometime in the future, so this could be a standard way of interacting with ebook formats, devices and whatnot."
Right now I was just thinking about a framework that handled image-based ebooks. For html, and indeed for a larger audience, you would need to support native-text formats (although i dunno.. native text would never look as good as dilated and processed images). To handle native-text you would need to create an intermediary text format with formatting and embedded links that could carry HTML, pdf, rtf, etc and then be reprocessed into lrf, pdf, starebook, etc. That is... ambitious. And it'd have to work perfectly (ie just as well as a direct html->lrf conversion).
If we just stick to working with images (and even claim that's the superior way to do things) I think it makes things much simpler (and much easier to get right). We can omit things like sophisticated pads that keep track of their own dependencies. Simply moving images from one folder to another would be fine and would even make it easier for other developers to hook in. (It's still the same spirit as the pads, but just a simpler implementation.)
however, let's ask the question: if say we only work with images, what things could/would/would-want-to be done by others? Are there things that can't be done by a 3-layer framework of Create images, Reprocess images, Bind images (provided each layer exposes enough features)? What are the usage scenarios?
ashkulz 04-04-2007, 11:28 PM "what exactly do you mean by plugins? Do you mean the "rasterizer" and "post-processing" components that i'm talking about would themselves be composed of smaller pieces?"
Yep, very much. The current script is getting too big, and not so easy to understand at first glance. The "plugins" would allow one to abstract out the steps to take in the pipeline, and then to weave the individual steps in any manner that the calling tool/app chooses.
From the point of view of the calling tool, there would be only one executable which would allow one to choose and setup the pipeline. All the plugins and other low-level details will be in code, and not exposed to the user.
"the above is not merely a tool, it is an ebook conversion framework. I mean, I can imagine html being a source plugin sometime in the future, so this could be a standard way of interacting with ebook formats, devices and whatnot."
"Right now I was just thinking about a framework that handled image-based ebooks. For html, and indeed for a larger audience, you would need to support native-text formats (although i dunno.. native text would never look as good as dilated and processed images). To handle native-text you would need to create an intermediary text format with formatting and embedded links that could carry HTML, pdf, rtf, etc and then be reprocessed into lrf, pdf, starebook, etc. That is... ambitious. And it'd have to work perfectly (ie just as well as a direct html->lrf conversion). If we just stick to working with images (and even claim that's the superior way to do things) I think it makes things much simpler (and much easier to get right)."
Agreed, but what I meant was that it is easy to imagine some of the things we develop here being integrated as different types of "pads" or whatever. I'm not proposing to do anything on this at all, just that it leaves future scope for expansion -- the framework would already be there, and reuse would be dead simple.
"We can omit things like sophisticated pads that keep track of their own dependencies. Simply moving images from one folder to another would be fine and would even make it easier for other developers to hook in. (It's still the same spirit as the pads, but just a simpler implementation.)"
I never said anything about pads keeping track of their own dependencies. All I meant is, if a particular stage expects input as an image, then we shouldn't be able to pass in a PDF there (or vice versa). The stage should validate these things and then move on.
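The kind of validation meant here needs very little machinery. As a sketch only (the Stage class, the kind names "pdf", "postscript" and "image", and the stage list are all made up for illustration), each stage declares what it accepts and produces, and the pipeline is checked before anything runs:

```python
class Stage:
    """A pipeline step with a declared input and output kind (its "pads")."""

    def __init__(self, name, accepts, produces):
        self.name = name
        self.accepts = accepts    # kind of data consumed, None for a source
        self.produces = produces  # kind of data emitted


def validate(stages):
    """Raise TypeError if any stage's output kind differs from the next stage's input."""
    for upstream, downstream in zip(stages, stages[1:]):
        if upstream.produces != downstream.accepts:
            raise TypeError("%s produces %s but %s expects %s" % (
                upstream.name, upstream.produces,
                downstream.name, downstream.accepts))


# Hypothetical default chain, loosely following the stage names used earlier.
DEFAULT = [
    Stage("filesrc", None, "pdf"),
    Stage("pdftops", "pdf", "postscript"),
    Stage("gsrasterize", "postscript", "image"),
    Stage("autocrop", "image", "image"),
    Stage("dilate", "image", "image"),
]
```

validate(DEFAULT) passes silently, while feeding filesrc straight into autocrop fails with a message naming both stages, before any file is touched.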
I disagree about the folder-to-folder thing -- that's a poor solution, as that means we have to create and maintain that many folders. Why communicate over the filesystem when you can communicate much more clearly via code? Also, you get around that in PDFRasterFarian by fixing the stages upfront and pre-creating folders in the installation directory. That is not feasible on other platforms, plus it implicitly means you can run only 1 instance of PDFRasterFarian at 1 time. PDFRead has no such limitation, and I think that supporting (simultaneous) batch processing is very important.
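One way to avoid fixed, pre-created folders and still allow several conversions at once is to give every run its own scratch directory. A minimal sketch using only the standard library follows; the scratch_dir name and the "pdfconv-" prefix are assumptions, not PDFRead's actual code:

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_dir(prefix="pdfconv-"):
    """Create a private working directory for one conversion and remove it afterwards."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        # Clean up even if the conversion raised partway through.
        shutil.rmtree(path, ignore_errors=True)
```

Each `with scratch_dir() as work:` block gets a distinct directory under the system temp location, so two instances never write to the same intermediate files.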
"however, let's ask the question: if say we only work with images, what things could/would/would-want-to be done by others? Are there things that can't be done by a 3-layer framework of Create images, Reprocess images, Bind images (provided each layer exposes enough features)? What are the usage scenarios?"
If each layer exposes enough options to turn features on/off individually, the command line options for it will grow quite a bit (see PDFRead). It is much better to approach it conceptually as a pipeline than as passing one set of parameters to stage1, another set to stage2 and so on.
Usage scenarios are simple:
User A wants to use the framework "as is" in one of the default profiles
User B wants to customize one of the stages in the pipeline. He/she runs a tool that will print the default pipeline for a profile, customizes it and then runs it directly (or saves it directly as a new profile).
User C wants to add or drop stages in the pipeline (e.g. remove dilation for comics, add a manual cropping stage, etc)
User D is a tool writer that wants to integrate the entire conversion process (with preview). This would be easy, as one would run a shorter pipeline or one customized only to process a few pages.
On the whole, I think the most compelling argument would be the transparency and simplicity from the user/tool writer point of view. It will also make the code much more modular and easier to maintain.
alex_d 04-09-2007, 05:09 AM "Why communicate over the filesystem when you can communicate much more clearly via code?"
Using folders as pads is a bit dirty (especially for concurrent conversions... although those should really be batched and run sequentially anyway) but it is _somewhat_ elegant and, above all, _very_ easy to hook into and extend.
Say I have a program that can be told from the command-line to accept some input files and create some output files. How would I integrate it into your framework?
"PDFRead has no such limitation, and I think that supporting (simultaneous) batch processing is very important."
Actually, I think batching serially rather than concurrently makes more sense. You get your first output quicker and there is no problem if you want to convert an obscene number of files. (Even a few dozen concurrent conversions would kill the ram).
"If each layer exposes enough features to turn on/off features individually, the command line options for it will grow quite a bit (see PDFRead). "
Well, the command line options wouldn't be for the user to use but for the developer writing a wrapper. Surely it'll be much easier on (and give more freedom to) a developer to code a long command line in his script than to output a custom pipeline file?
In the end, though, there are two questions: Can a sophisticated framework of which you speak be implemented in theory (ie is the concept compatible with being very flexible and easy to extend)? And: Will such a framework be actually implemented by us (ie will it be too much work)? The folders approach, I think, has both points going for it.
I must say, however, I like the cut of your jib.
ashkulz 04-25-2007, 07:08 AM Okay, I've implemented the ideas which I mentioned here in the 1.6 release. You can look at the code at
http://pdfread.svn.sourceforge.net/viewvc/pdfread/trunk/
Please see the PDFRead 1.6 thread (http://www.mobileread.com/forums/showthread.php?t=10558) for other features added in this release.
Shake 06-02-2007, 09:40 AM Any progress? Can I try something?
ashkulz 06-03-2007, 10:39 PM "Any progress? Can I try something?"
I'm not clear on what you mean. PDFRead (http://pdfread.sourceforge.net) is already available, so do you mean progress on PDFRead or something else?