MobileRead Forums - View Single Post - PDF extraction

Elfwreck · 09-27-2009, 12:33 AM

Quote:

Originally Posted by orion2001

If you don't mind, could you explain this to me? I'm not sure what the ^p and the ^pqqq refer to. I'm a bit of a formatting noob.

Not knowing those doesn't mean you're a formatting noob; it means you don't use Microsoft Word for formatting. Word's find-and-replace functions use ^ to indicate a non-keyboard character. So ^p is "paragraph break;" ^t is "tab;" ^$ is "any letter;" ^? is "any character;" ^b is "section break;" ^m is "manual page break." (There are more, but there's no need for anyone to learn them; they're part of Word's dropdown menus in the find-and-replace dialog box.)

I use "qqq" as a substitute sequence for multi-stage find-and-replace functions, because Word's abilities are limited. It can find "[any letter][paragraph break]" but doesn't allow "replace the paragraph part of that with a space."

It can format or replace the entire search string, or add something to the beginning or end of it. So I add qqq to the end of it, and then search for "[paragraph break]qqq" and replace *that* with a space.

I use it because qqq is exceedingly unlikely to be repeated anywhere in the body of the book, and I won't accidentally replace real text that way.
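For anyone who would rather see the two-stage trick outside of Word, here is a rough Python sketch of the same idea. It is only an illustration, not what Word does internally: the function name is made up, "any letter" is approximated as plain ASCII letters, and a paragraph break is assumed to be a single newline.

```python
import re

MARKER = "qqq"  # placeholder sequence; assumed not to occur in the real text


def join_broken_lines(text: str) -> str:
    """Two-pass marker replace, mimicking the Word workflow described above."""
    if MARKER in text:
        # The trick only works if the marker is unique to our own edits.
        raise ValueError("marker already present in text; pick another one")
    # Pass 1: find "[any letter][paragraph break]" and add the marker to the
    # end of the match, leaving the matched text itself untouched.
    marked = re.sub(r"([A-Za-z]\n)", r"\1" + MARKER, text)
    # Pass 2: find "[paragraph break]qqq" and replace *that* with a space.
    return marked.replace("\n" + MARKER, " ")


sample = "The quick brown\nfox jumps.\nNext paragraph."
print(join_broken_lines(sample))
```

With that sample, the break after "brown" is joined with a space, while the break after "jumps." is left alone because a period, not a letter, comes before it.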

I am almost entirely clueless about HTML. I gather the principles are about the same as what I usually do in Word, but I'd have to learn a whole new set of keywords and search options. (Which I should do.) I have Kompozer, and occasionally have tried to work with it. It's confusing, and Word is not, because I have lots of practice with Word and none with HTML editors. (I suspect that Semagic doesn't count as an HTML editor. Most of what I know about HTML, I learned by posting at LiveJournal.)

Quote:

What the regex above does is find only those paragraph breaks that do not have a (. , !, ), ?, : ) character just preceding the paragraph break (since those would indicate complete sentences and probably the end of a legit paragraph).

I'd add mdashes to that list. And quotation marks.
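As a sketch of what that could look like with the extra characters added, here is one possible Python version. This is not the regex from the quoted post (which isn't reproduced here); the character list is my reading of the quoted one (period, exclamation mark, closing parenthesis, question mark, colon) plus an em dash and straight and curly closing quotes, and the names are hypothetical. It also assumes one newline per paragraph break.

```python
import re

# Characters taken to mark the end of a real paragraph (an assumption):
# . ! ) ? :  plus em dash, straight quotes, and curly closing quotes.
ENDERS = ".!)?:" + "\u2014" + "\"'" + "\u201d\u2019"

# Match a newline only when the character before it is neither one of the
# ENDERS nor another newline (so blank lines are left alone).
SOFT_BREAK = re.compile(r"(?<=[^" + re.escape(ENDERS) + r"\n])\n")


def join_soft_breaks(text: str) -> str:
    """Replace paragraph breaks that interrupt a sentence with a space."""
    return SOFT_BREAK.sub(" ", text)


sample = 'It was a dark\nand stormy night.\n"Run," she said\u2014\nand they ran'
print(join_soft_breaks(sample))
```

In the sample, only the break after "dark" is joined; the ones after the period and the em dash stay put.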

Same basic principle I use, except Word doesn't have a way to "find all X that don't match Trait Y," nor a way to "find all X with trait A, or B, or C." Much less "find all X that don't match trait A, B, or C." However, it does have "find any letter" separate from "any character" or "any digit." (Does not have "any punctuation.")

The biggest problem working with Word is that the HTML output is atrocious; it has to be ported into something else and converted to be useful for anything other than FrontPage websites. Word 97 had okay HTML output, but you lose a lot of features using the old versions of Word.