|07-19-2011, 02:13 PM||#1|
Join Date: Jul 2011
Is this true?
Apparently, any PDF document page begins as a new paragraph, independently of whether its first line is part of the sentence that ends the previous page. Assuming this is true, is anybody aware of a program that exports a PDF document to an editable format (html, doc) with the ability to override this PDF limitation?
Another question in my first post: when I save a DOC document with images as RTF, its size increases dramatically. Does anybody know why?
Thank you for your time and attention.
|07-19-2011, 02:57 PM||#2|
Join Date: Feb 2008
Device: Sony PRS-600, Fujitsu Stylistic ST-4121
Unfortunately, PDF was designed before re-flowable electronic documents were common, so yes, each page starts over, and most text-oriented software treats the text on it as a new paragraph. Marcel Weiher's TextLightning.app for Mac OS X does attempt to recognize paragraphs based on text formatting, but it's better to go back to the original source document.
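As for the size: DOC keeps embedded images as binary data, while RTF is a plain-text format and normally writes image data as hexadecimal text (two characters per byte), hence the file-size change.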
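To put a rough number on the file-size effect: RTF is plain text and normally writes embedded picture data as hexadecimal text, two characters for every byte. Here is a minimal sketch of that arithmetic in Python. The function name `hex_encoded_size` is made up for illustration, and random bytes stand in for a real image.

```python
import os


def hex_encoded_size(data: bytes) -> int:
    """Return how many characters the data occupies when written as hex text."""
    # bytes.hex() emits exactly two hex digits per input byte.
    return len(data.hex())


# Stand-in for an embedded image; any binary data behaves the same way.
image_bytes = os.urandom(10_000)

print("binary size:", len(image_bytes))
print("hex text size:", hex_encoded_size(image_bytes))
```

So the image portion of the file alone roughly doubles, before any other overhead the word processor adds.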
|07-19-2011, 09:23 PM||#3|
Join Date: Feb 2009
Location: Amherst, Massachusetts, USA
Device: Sony PRS-505
There's PDF reflow, here, but don't expect perfection.
The original purpose of a PDF was to make sure a document looked exactly the same on every medium and printer; it's meant to emulate paper. Really it's just a map of the exact location of each character. Not only does it not understand that a sentence continues from one page to the next, it doesn't even have the concept of a sentence or a paragraph. Those exist only in the source document.
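Since the PDF itself won't tell you that a sentence carries over, a converter has to guess. Here is a rough sketch of one such guess in Python, assuming you already have the text of each page split into paragraphs (how you get that is up to your extraction tool). The name `merge_across_pages` and the punctuation rule are my own assumptions, not how any particular program does it.

```python
# Characters that usually mean the last paragraph on a page is finished.
SENTENCE_END = (".", "!", "?", ":", '."', '!"', '?"')


def merge_across_pages(pages):
    """pages: list of pages, each page a list of paragraph strings.

    Joins the first paragraph of a page onto the last paragraph of the
    previous page when that one does not end with closing punctuation.
    """
    merged = []
    for paragraphs in pages:
        first_on_page = True
        for para in paragraphs:
            text = para.strip()
            if not text:
                continue  # ignore empty paragraphs
            continues = (
                first_on_page
                and bool(merged)
                and not merged[-1].endswith(SENTENCE_END)
            )
            if continues:
                merged[-1] += " " + text
            else:
                merged.append(text)
            first_on_page = False
    return merged


pages = [
    ["Intro paragraph.", "This sentence runs over the"],
    ["page break and ends here.", "New paragraph."],
]
for paragraph in merge_across_pages(pages):
    print(paragraph)
```

It will guess wrong on headings, captions, and anything else that ends a page without punctuation, so don't expect perfection from this either.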
|07-20-2011, 07:50 AM||#4|
Join Date: Nov 2009
Device: PW2 2014
Yes, from my experience I think it's true. For instance, if you start a paragraph at the end of a page, and it ends on the next, the close tag will be on the second page.
But you see, most PDF files are saved as regular PDF, with objects that float around (including text, parts of text, or sometimes even individual letters). You can usually spot them right away: select the text and you'll see all sorts of spaces between characters or entire groups of characters. These are very difficult to convert because each object (or group of objects) has its own coordinates on the page.
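To give an idea of what a converter has to do with those floating objects, here is a small Python sketch that rebuilds lines of text from individually positioned characters. The `(x, y, width, char)` tuples are a hypothetical input format, not the output of any specific PDF library, and the two thresholds are guesses you would have to tune per document.

```python
def glyphs_to_lines(glyphs, y_tolerance=2.0, gap=4.0):
    """Rebuild text lines from (x, y, width, char) tuples.

    Assumes y grows down the page. Glyphs whose y is within y_tolerance
    of the row's first glyph share a line; a horizontal gap wider than
    `gap` between neighbours is treated as a word space.
    """
    rows = []  # each entry: [row_y, [glyphs on that row]]
    for g in sorted(glyphs, key=lambda g: (g[1], g[0])):
        if rows and abs(g[1] - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append(g)
        else:
            rows.append([g[1], [g]])

    lines = []
    for _, row in rows:
        row.sort(key=lambda g: g[0])  # left to right
        text = ""
        prev_end = None
        for x, _y, width, ch in row:
            if prev_end is not None and x - prev_end > gap:
                text += " "
            text += ch
            prev_end = x + width
        lines.append(text)
    return lines


glyphs = [
    (0, 10, 6, "H"), (6, 10, 6, "i"),
    (20, 10.5, 6, "y"), (26, 10, 6, "o"),
    (0, 24, 6, "O"), (6, 24, 6, "K"),
]
print(glyphs_to_lines(glyphs))
```

Even this toy version needs two tuning knobs, which is why the same converter can do fine on one PDF and scatter spaces all over another.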
On the other hand, there are tagged PDF files, which have open and close tags for objects, especially for text segments, making them (relatively) easier to convert - but not perfect. Keep in mind that PDF is considered an output format and as Will suggested, it's always better to go back to the source of the document.
Oh, and when you say size, do you mean file size or display size? If it's display size, the difference is probably because the margins weren't converted properly. Find out the .doc page size and margins and apply them to the .rtf.