View Full Version : How to Do Everything with PDF Files
The following article gives a good overview of what you can do with PDF files (without using the expensive Adobe Acrobat):
http://www.labnol.org/software/adobe-pdf-guide-tutorial/6296/
Nate the great 01-01-2009, 10:24 AM good find
xianfox 01-01-2009, 10:30 AM Thanks, some of that will come in handy at work.
JSWolf 01-01-2009, 10:45 AM But I notice no good way to convert from PDF.
smithno 01-02-2009, 06:10 PM But I notice no good way to convert from PDF.
PDF was designed as an output format. It will probably never be easy to manipulate.
RWood 01-02-2009, 07:57 PM But I notice no good way to convert from PDF.
"Good" is a matter of conjecture Jon, the article suggests that "You can upload the PDF document to Zamzar and convert it any formats like doc, html, png, txt or rtf (rich text format). Alternatively, you can convert PDF to HTML using Gmail."
I have used ABBYY PDF Transformer 2.0, ABC Amber PDF Converter, Paperport, and several other packages over the years. There is no single solution for all cases. The correct choice depends on the specific PDF in question, the tools on your computer, which tools are currently available for free, which tools you can get as a functioning trial copy, and how much money you are willing to spend on new tools.
While I am not the biggest fan of PDF for ebooks, PDFs have their place and I have created PDF files for the Sony Reader where I felt they were the best option.
alexxx 01-03-2009, 03:57 AM Too many of the options proposed in the article involve uploading your document to some server.
Call me paranoid, but I don't like this kind of "service" at all - I want my documents to stay on <my> server.
Apart from that, under Linux (which is not mentioned at all in the article) software exists to do practically any kind of conversion you need.
alessandro
Flinx 01-03-2009, 05:52 AM Apart from that, under linux (which is not mentioned at all in the article) software exists to do practically any kind of conversion you need.
alessandro
Really? I searched for one and found no Linux program at all that tries to convert PDF to flowing text with attributes and paragraph recognition. The only program I could find that generates useful output is PdfGrabber, but I am still interested in a better solution.
bookbinder 01-03-2009, 05:24 PM I have a few scanned google books in pdf that I'm having a hard time converting to text, even following advice from the article. Has anyone done this successfully? I've tried:
-Zamzar (returns an unopenable doc file)
-Google mail (doesn't display pdf as html)
-Pdf2Word program
labnol 01-04-2009, 02:46 AM I have a few scanned google books in pdf that I'm having a hard time converting to text, even following advice from the article. Has anyone done this successfully?
You can upload the scanned PDF files to a public web server, link to those files from a web page, and then wait for Google's bots to index those PDFs. See complete instructions (http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/).
Flinx 01-04-2009, 07:52 AM ...wait for google bots to index those PDF.
The linked example shows why this way is essentially useless. The resulting text has line breaks on each line. A good converter for books has to try to set a line break only at the end of a paragraph.
tompe 01-04-2009, 09:59 AM The linked example shows why this way is essentially useless. The resulting text has line breaks on each line. A good converter for books has to try to set a line break only at the end of a paragraph.
Really not true at all. You can also use the convention that two line breaks in a row indicate a new paragraph, like TeX and LaTeX do. It is trivial to convert between the two conventions using some simple program or a one-line script.
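As a rough illustration of one direction of that conversion, here is a minimal Python sketch that takes text with one paragraph per line and rewraps it so that a blank line separates paragraphs. The function name `wrap_paragraphs` and the default width are made up for this example and do not come from the thread.

```python
import textwrap


def wrap_paragraphs(text, width=40):
    """Turn 'one paragraph per line' text into wrapped lines,
    with a blank line between paragraphs (the TeX-style convention).

    Hypothetical helper for illustration; `width` is an arbitrary default.
    """
    # Every non-empty input line is treated as one complete paragraph.
    paragraphs = [line for line in text.splitlines() if line.strip()]
    # Wrap each paragraph, then separate paragraphs with two line breaks.
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs)


if __name__ == "__main__":
    print(wrap_paragraphs("aaa bbb ccc\nddd", width=7))
```

The paragraph boundaries survive the round trip because they are stored as blank lines, independent of where the wrapped lines happen to end.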
Flinx 01-04-2009, 02:24 PM Really not true at all. You can also use the convention that two line breaks in a row indicates a new paragraph
No, that is not really useful for most standard PDFs. A text object in a PDF file does not contain a real line break. It contains the position on the page where it has to be drawn and a number of characters. The result is a line of text.
The program that does the conversion has to estimate from the positions of the text objects in which order the lines come. Simple converters, like most of those available (including Acrobat), take one text object, convert it to text and set a line break at the end, resulting in one line of output text. The better converters can try to join separate text objects if their horizontal start positions are identical and the line is long enough. But this is a difficult job, and I have not yet found a program that works well enough for me.
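The joining heuristic described above might be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not how any of the converters mentioned actually work: the `Line` record, the tolerance `x_tol`, and the fill ratio `min_fill` are all invented for the example, and line length is approximated by character count rather than by real glyph widths.

```python
from typing import List, NamedTuple


class Line(NamedTuple):
    """Hypothetical stand-in for one extracted PDF text object."""
    x: float   # horizontal start position on the page
    y: float   # vertical position (larger values are higher on the page)
    text: str


def join_lines(lines: List[Line], x_tol: float = 1.0,
               min_fill: float = 0.8) -> List[str]:
    """Group positioned lines into paragraphs.

    A line is appended to the current paragraph when it starts at the same
    horizontal position as the previous line AND the previous line was
    'long enough' (at least `min_fill` of the longest line, by characters).
    """
    if not lines:
        return []
    # Reading order: top of the page first.
    ordered = sorted(lines, key=lambda ln: -ln.y)
    longest = max(len(ln.text) for ln in ordered)
    paragraphs = [ordered[0].text.strip()]
    prev = ordered[0]
    for line in ordered[1:]:
        same_start = abs(line.x - prev.x) <= x_tol
        prev_is_full = len(prev.text) >= min_fill * longest
        if same_start and prev_is_full:
            paragraphs[-1] += " " + line.text.strip()
        else:
            # A short previous line or a shifted start suggests a new paragraph.
            paragraphs.append(line.text.strip())
        prev = line
    return paragraphs


if __name__ == "__main__":
    page = [
        Line(72, 700, "The quick brown fox jumps over"),
        Line(72, 688, "the lazy dog and keeps running"),
        Line(72, 676, "home."),
        Line(72, 664, "A second paragraph starts here"),
    ]
    for paragraph in join_lines(page):
        print(paragraph)
```

A sketch like this breaks down on indented first lines, multi-column pages, and headings, which is part of why the job is difficult in practice.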
tompe 01-04-2009, 02:51 PM No, that is not really useful for most standard PDFs. A text object in a PDF file does not contain a real line break. It contains the position on the page where it has to be drawn and a number of characters. The result is a line of text.
The program that does the conversion has to estimate from the positions of the text objects in which order the lines come. Simple converters, like most of those available (including Acrobat), take one text object, convert it to text and set a line break at the end, resulting in one line of output text. The better converters can try to join separate text objects if their horizontal start positions are identical and the line is long enough. But this is a difficult job, and I have not yet found a program that works well enough for me.
That might be the case, but there is no functional difference between encoding paragraphs with two line breaks or one. What you are talking about is how good a converter is at detecting a paragraph break, but that has no necessary connection to how the encoding is done. You can argue that you lose information if you do not keep the line breaks within a paragraph, since they are impossible to recreate, but it is trivial to take a paragraph delimited by double line breaks and convert it to one line.
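For illustration, that last conversion could look something like this in Python. It is a minimal sketch that assumes paragraphs are separated by one or more blank lines; the function name `unwrap_paragraphs` is made up for the example.

```python
import re


def unwrap_paragraphs(text):
    """Collapse each blank-line-separated paragraph onto a single line.

    Hypothetical helper for illustration only.
    """
    # Split wherever a blank (or whitespace-only) line separates paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside a paragraph, replace line breaks and runs of spaces by one space.
    return "\n".join(" ".join(p.split()) for p in paragraphs)


if __name__ == "__main__":
    sample = "line one\nline two\n\nsecond para\ncontinues\n"
    print(unwrap_paragraphs(sample))
```

The original positions of the line breaks inside each paragraph are discarded, which is the information loss mentioned above.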
stonehat 01-05-2009, 05:28 AM From TFA:
"Most mobile phones can read PDF files."
I stopped reading after that.
millerjpmd 01-07-2009, 06:11 PM Thanks for the find. I started a thread concerning a similar issue with PDFs. This is what I found related to converting from a PDF.
Programs that allow you to manipulate and extract info from PDF:
File Juicer ($17, http://echoone.com/filejuicer/)
deskUNPDF ($100, http://www.docudesk.com/deskUNPDF_product_home.shtml)
PDFpen and PDFpenPro ($50-100, http://www.smileonmymac.com/index.html)
Program that allows you to join multiple pdfs into single file with Table of Contents:
PDF Lab (free, http://www.iconus.ch/fabien/products/pleng/pleng.html)
With regard to just getting the PDF onto a PRS-505, calibre, for the most part, worked as well as any of these programs.
Hope this helps.
jpm
BlackVoid 04-16-2009, 08:11 AM When converting a PDF with pictures for an ebook device, I found a good method with minimal fuss. It is a bit time consuming and you need a 3rd party product.
Use ABBYY FineReader to convert to LIT format, then convert the LIT to the ebook format of your choice. Pictures will be preserved. ABBYY FineReader takes a while to convert for its own format, but it will also handle scanned books. I have not tried two-column PDFs, but an average PDF with pictures is OK.
I then use BookDesigner to convert from LIT to LRF and the result is quite good.
namiamy 05-31-2009, 04:53 AM good find. thx.
i got more knowledge about adobe...
stranjer 07-25-2009, 04:59 PM thanks for the trick BlackVoid, I'm gonna try this myself...