PDF is flowable?

FinancialWar · 03-14-2012, 10:21 AM

I always thought pdf is displayed as is, like a photo. I didn't know they are flowable like epub.

Are all pdfs flowable? Some pdfs are obviously just photocopies of textbooks; are they flowable?

Really good surprise

fjtorres · 03-14-2012, 10:29 AM

As a rule PDFs are best assumed *not* flowable.
*Some* PDFs can be *created* to reflow but it is hit or miss and usually not particularly effective.
PDF reflow is not something to count on.

Penforhire · 03-14-2012, 02:22 PM

PDF is sort of a wrapper for different types of data. You already noticed that some data is just a graphic image (photocopy). Those cannot be intelligently reflowed because it is just a bunch of dots. If the data is a text file then it becomes possible to reflow and, for example, to print it in a different font.

dwig · 03-14-2012, 03:05 PM

Quote:

Originally Posted by Penforhire

PDF is sort of a wrapper for different types of data. You already noticed that some data is just a graphic image (photocopy). Those cannot be intelligently reflowed because it is just a bunch of dots. If the data is a text file then it becomes possible to reflow and, for example, to print it in a different font.

Not quite correct. Just because something in a PDF is text doesn't mean that it is reflowable. The text objects in most PDFs are not reflowable.

rkomar · 03-14-2012, 05:08 PM

Quote:

Originally Posted by dwig

Not quite correct. Just because something in a PDF is text doesn't mean that it is reflowable. The text objects in most PDFs are not reflowable.

I'm not an expert, but I believe there are tags that can be added to text to make it reflowable. As far as I know, there is no such stuff for tables, equations, code, line drawings,... So even if the text tags have been added, reflow doesn't work with technical documents with the above elements. It really only works for paragraphs of text with the odd image embedded between them (and as dwig implies, only if the tags have been added).

fjtorres · 03-14-2012, 05:26 PM

Quote:

Originally Posted by rkomar

I'm not an expert, but I believe there are tags that can be added to text to make it reflowable. As far as I know, there is no such stuff for tables, equations, code, line drawings,... So even if the text tags have been added, reflow doesn't work with technical documents with the above elements. It really only works for paragraphs of text with the odd image embedded between them (and as dwig implies, only if the tags have been added).

Pretty much.
If you pay for the full Acrobat application you can open pdfs (as long as they're not locked), embed the reflow tags, and hope to see some reflow.
But the results tend to be mostly underwhelming, even for all-text documents.

dwig · 03-14-2012, 05:55 PM

Quote:

Originally Posted by rkomar

I'm not an expert, but I believe there are tags that can be added to text to make it reflowable. ...

Correct, but only if such Tagged sections are created.

In the common basic PDF, text exists as a block only so long as there is no deviation from the default linear flow. Any font change (reg to italic, ...) breaks the block and starts a new separate block. Any alteration in the letterspacing/kerning also ends a block and begins another. As a result, what appears to be a paragraph when the document is displayed is actually, at the least, one separate text block for each line and often several blocks per line.

The presence of Tagged text blocks allows readers that are aware of them to skip the fixed layout version of the text, with all its separate pieces, and replace it with the flowable block. With these viewers and with tagged PDFs you have the option to turn on the reflow at the sacrifice of the carefully designed layout.

Penforhire · 03-14-2012, 07:17 PM

Well, using 3rd party software (or full Acrobat) any time there is text in a PDF I can extract it. If I can extract it, as text, then it has to be reflowable in certain applications.

rkomar · 03-14-2012, 09:47 PM

Quote:

Originally Posted by Penforhire

Well, using 3rd party software (or full Acrobat) any time there is text in a PDF I can extract it. If I can extract it, as text, then it has to be reflowable in certain applications.

PDF files are programs that execute inside of a state engine. They combine data with instructions, and what is done with either depends on the state of the engine at that time. For example, the exact position at which one character is rendered may depend on where the preceding character was located. Modifying what happens at some stage of rendering could have drastic effects on what comes after, since the engine state could be different than what was expected when the PDF file/program was written. I think these reflow tags work to alleviate some of that problem, breaking the text into smaller independent objects that can be relocated as a group as far as the engine is concerned. The main point is that you can't think of a PDF file as content and metadata in arbitrary arrangements; it is really a set of precise instructions that have to be followed consecutively according to strict rules. Applying reflow to, say, a mathematical paper will show you what happens when you start messing with the engine in arbitrary ways while it's working (i.e. you get gibberish).
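To make the state-engine point concrete, here is a toy sketch in Python. It is not real PDF parsing: the `move` and `show` operators and the one-unit glyph width are invented for illustration. It only shows how changing one piece of text shifts everything drawn after it, because later positions depend on the engine state left behind.

```python
# Toy model only: real PDF content streams have far more operators and state.
def render(ops):
    """Run (operator, operand) pairs and return [(x, y, text), ...]."""
    x, y = 0, 0                  # current text position: the "engine state"
    placed = []
    for op, arg in ops:
        if op == "move":         # hypothetical relative move
            dx, dy = arg
            x, y = x + dx, y + dy
        elif op == "show":       # hypothetical "draw text here"
            placed.append((x, y, arg))
            x += len(arg)        # pretend every glyph is 1 unit wide
    return placed

original = [("move", (10, 100)), ("show", "Hello"),
            ("move", (1, 0)), ("show", "world")]
# Swap in a longer word without touching the later instructions.
edited = [("move", (10, 100)), ("show", "Greetings"),
          ("move", (1, 0)), ("show", "world")]

print(render(original))
print(render(edited))
```

In the edited run, "world" lands four units further right even though its own instructions never changed, which is the kind of knock-on effect a reflow engine has to cope with.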

ignacio ferrer · 03-15-2012, 03:29 AM

Quote:

Originally Posted by fjtorres

As a rule PDFs are best assumed *not* flowable.
*Some* PDFs can be *created* to reflow but it is hit or miss and usually not particularly effective.
PDF reflow is not something to count on.

That is my experience too. With programs like ABBYY you can transform any PDF into PDF/A, so it is readable like text: theoretically, that is. If there are no pictures, just text, you get pretty good results. The resulting file can be reflowed, often very well. If there are many pictures, unusual fonts etc. you will not be very happy ...

Alfy · 03-15-2012, 08:10 AM

I've played for absolute ages with PDF documents to get them on my reader, and have very seldom succeeded. Even simple .txt documents embedded in PDFs lose their end-of-line properties, and it is an absolute mess to recreate a proper flow of text. Sure, you can add the tags, even try to automate them, but it takes a lot of work and too much time. My advice to FinancialWar: just don't think of these documents as flowable. Technical issues aside, they're really not.

Joykins · 03-15-2012, 10:18 AM

I have a nook that reflows pdfs sold as ebooks. I have read a lot of them, too, because initially many of our Overdrive library books were only available as pdf.

It's clunky. You can get the text larger (generally you get to choose between microscopic, small, and HUGE), but there are always hyphenation* and page break problems, and sometimes the header/page number gets folded into the text. Images are an issue. If you're reading a pdf intended to be an ebook without much reliance on images, you can get reflow.

However, if your pdf has no underlying text or formatting, fuhgeddaboutit.

* the choice in hyphenation is between dropped hyphens and retained hyphens. Dropped hyphens are IMO infinitely preferable. Retained end of line hy-phens can make the text near-ly unreadable.
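The two hyphenation choices can be sketched in a few lines of Python. This is only an illustration of the trade-off, not how the nook or any particular reader does it; `join_lines` and its `drop_hyphens` flag are made-up names.

```python
import re

def join_lines(text, drop_hyphens=True):
    """Join hard-wrapped lines, either dropping or keeping end-of-line hyphens."""
    # A hyphen directly before a line break, between two word characters.
    pattern = r"(\w)-\n(\w)"
    replacement = r"\1\2" if drop_hyphens else r"\1-\2"
    text = re.sub(pattern, replacement, text)
    return text.replace("\n", " ")

sample = "Retained end of line hy-\nphens can make the text near-\nly unreadable."
print(join_lines(sample, drop_hyphens=True))
print(join_lines(sample, drop_hyphens=False))
```

Dropping is not free either: a genuinely hyphenated word that happens to break at its hyphen (say "well-" / "known") would come out as "wellknown", and the file gives no way to tell the two cases apart.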

murraypaul · 03-15-2012, 11:37 AM

Quote:

Originally Posted by Penforhire

Well, using 3rd party software (or full Acrobat) any time there is text in a PDF I can extract it. If I can extract it, as text, then it has to be reflowable in certain applications.

You can extract text, yes.
You cannot 100% reliably extract text in the correct order.
A two column PDF might be laid out as all of column one, then all of column two, or as the first line of both columns, then the second line of both columns...
You could have a perfectly valid PDF, which displayed fine on the screen, which printed all the letter 'a's, then all the letter 'b's, and so on.
You cannot reliably extract sentence and paragraph endings.
You cannot reliably tell whether a new page should or should not start a new paragraph.
In short, PDF is excellent at being a final display format, and poor at being a transitional format.
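A rough Python sketch of the two-column problem, assuming an extractor has already produced text fragments with coordinates. The fragment tuples, the function names and the `split_x` value are all hypothetical; the point is only that the right reading order depends on a layout guess the file itself does not supply.

```python
# Each fragment: (x, y, text); y grows downward here for simplicity.
fragments = [
    (0, 0, "Left 1"), (50, 0, "Right 1"),
    (0, 1, "Left 2"), (50, 1, "Right 2"),
]

def naive_order(frags):
    # Top-to-bottom, then left-to-right: interleaves the two columns.
    return [text for _, _, text in sorted(frags, key=lambda f: (f[1], f[0]))]

def column_order(frags, split_x):
    # Read the whole left column, then the right; split_x is a guess.
    left = sorted((f for f in frags if f[0] < split_x), key=lambda f: f[1])
    right = sorted((f for f in frags if f[0] >= split_x), key=lambda f: f[1])
    return [text for _, _, text in left + right]

print(naive_order(fragments))
print(column_order(fragments, split_x=25))
```

Both orderings are equally "valid" readings of the same fragments, so a converter has to guess which one the author meant.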

Elfwreck · 03-15-2012, 11:51 AM

FWIW, extracting text *mostly* works. I'd say 85% or more of text-based PDFs (not scans) convert fairly well to Word or HTML formats... and then need cleanup. Remove the headers & page #'s, which extract as just text. Get rid of the forced paragraph breaks at the ends of pages. Find the chapter headers and fix them. (They might be fine. They might be converted to plain text, depending on various font issues.) Look for sets of short lines of text--dialogue especially--that were all crammed into one paragraph.

The text itself tends to extract fine (if there weren't columns or magazine layouts to deal with), but the formatting needs a thorough touchup to be useful.
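For the first two cleanup steps, a minimal Python sketch, assuming the extracted text is already split into pages and lines. `clean_pages` is a made-up helper and the rules are deliberately crude: a line that is only digits is treated as a page number, and a line that appears on every page is treated as a running header.

```python
from collections import Counter

def clean_pages(pages):
    """pages: list of pages, each a list of text lines."""
    # Count on how many pages each distinct line appears.
    counts = Counter(line for page in pages for line in set(page))
    kept = []
    for page in pages:
        for line in page:
            stripped = line.strip()
            if stripped.isdigit():
                continue  # bare page number
            if len(pages) > 1 and counts[line] == len(pages):
                continue  # repeated on every page: likely a running header
            kept.append(stripped)
    return kept

pages = [
    ["My Book", "It was a dark and", "1"],
    ["My Book", "stormy night.", "2"],
]
print(" ".join(clean_pages(pages)))
```

Joining the surviving lines with spaces also removes the forced break at the page boundary. Chapter headings and crammed dialogue still need a human eye.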

Penforhire · 03-15-2012, 12:41 PM

Murray, good point about the order (or possible lack thereof). I have seen that myself.
