PDF to Epub Workshop

Faster · 04-28-2011, 03:55 PM

PDF to Epub workshop

The process consists of
transferring
editing
styling
assembling

In this first post I'll deal with transferring and editing.

MS Word is used because it can reveal hidden characters, retains some of the PDF's useful formatting and it has powerful styling capabilities.
Final assembly is in Sigil.
This post deals only with converting novels, ie all text apart from a cover, maybe a map and possibly a little chapter decoration.

----------------------------
TRANSFERRING
----------------------------
Open the PDF in Adobe Reader
Take a careful look at it. If it is riddled with errors it may not be worth bothering with. Is there a better format or better version available [edited]? Realize that the intensive proof reading of a novel will often spoil the later reading experience for you.
Open MS Word (mine is Office 2003 because I hate the ribbon in Office 2007. It sits unused on another computer!)
In Adobe Reader from the Edit Menu select 'Copy File to Clipboard' (Alt-E, Alt-B)
Note: nothing may happen at this stage!
Switch focus to MS Word. Have a new document open in Web Layout. From Edit Menu select 'Paste' (CTRL-V)
Note: depending on the speed of your computer, you should see a progress bar in the bottom right of Adobe Reader and screen refresh may be slow.

Eventually you should get a copy of the PDF in document format in Word.
Some formatting will have carried over. In a moment we'll use variations in format to delete bits we don't want - such as page titles and numbers which are senseless junk on an ebook reader.

If all you see are a whole bunch of squares you won't be able to copy the PDF (without some stuff I'm not dealing with) so you'll need to either read the PDF on your computer or find an alternative format of the book.
---------------------------------------
DELETE UNWANTED PAGES
---------------------------------------
First delete any sections you don't want such as 'Also by This Author', TOC, Acknowledgements, Dedications, teasers, 'About the Author', 'About the Publisher' - in fact any parts you don't want to read later. You'll still have the PDF to read these.
----------------------------
FIND AND REPLACE
----------------------------
*Until you trust the expressions presented here use a copy of your file.
*Don't jump straight into 'Replace All', try a few single Replaces first
*If a dialogue says '0' replacements it's probably because you haven't checked or unchecked 'Use wildcards' when required, or you haven't cleared the formatting from a previous search. (Click 'No formatting' box under 'More'.)
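*Get into the habit of using ^13 in the Find box and ^p in the Replace box. With 'Use wildcards' checked, Word will not accept ^p in the Find box, and ^13 in the Replace box inserts a bare character rather than a true paragraph mark.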
*Get used to working with hidden characters showing.
--------------
THE JUNK
--------------
Removing Amber/Nova/file junk:

Use F/R and leave Replace blank
Check 'Use wildcards'

Code:

Find:	Generated by ABC *^13
Find:	ABC Amber *^13
Find:	Create PDF *^13
Find:	file:*^13
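
If you'd like to try the idea outside Word, here is a rough Python sketch of the same junk removal. It is not part of the Word workflow: it assumes the document has been saved as plain text so that each paragraph character is a "\n", and the names JUNK_PREFIXES and remove_junk are made up for illustration.

```python
import re

# Assumption: plain-text export, so Word's ^13 (paragraph character) is "\n".
JUNK_PREFIXES = ["Generated by ABC ", "ABC Amber ", "Create PDF ", "file:"]

def remove_junk(text: str) -> str:
    """Delete from each junk marker to the end of its line, newline included."""
    for prefix in JUNK_PREFIXES:
        # [^\n]* stands in for Word's * but cannot run past the end of the line.
        text = re.sub(re.escape(prefix) + r"[^\n]*\n", "", text)
    return text
```

As with the Word version, Replace is left blank, so the whole junk line disappears.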

-------------------------------------------------------------------------
HIDDEN CHARACTERS THAT CAN CAUSE PROBLEMS
-------------------------------------------------------------------------
Click the pilcrow (reversed P shape in toolbar) to see all the hidden characters.
At this stage, if paragraph characters occur at the end of each line of text instead of only at the end of paragraphs, IGNORE them. If you join the lines now, dealing with page numbers will be difficult: they will become embedded inside paragraphs and you don't want that.

First:
a) Look for optional hyphens (little lines inside words)
b) Tabs (little arrows)
c) A space (small dot) followed by a paragraph character (pilcrow)
d) Manual line breaks (bent arrows) (Inserted with SHIFT-ENTER)
e) Non-breaking space (small circle) (Inserted with CTRL-SHIFT-SPACE)

Uncheck 'Use wildcards'
Removal

Code:

(a) Find: ^-		Replace: blank
(b) Find: ^t		Replace: blank
(c) Find: <space>^13	Replace: ^p	(for <space> hit the spacebar)
(d) Find: ^l		Replace: ^p	(the char after ^ is a lowercase 'L', not the 'pipe'; 'More' - 'Special' - 'Manual Line Break' inserts it for you)
(e) Find: ^s		Replace: <space>	(<space> means press space bar once)

Codes for these hidden characters can be inserted in the Find/Replace dialogue by clicking 'More' - 'Special' and selecting what you need.
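
For comparison, the five removals (a) to (e) can be sketched in Python. This is only an illustration under stated assumptions: the text is plain text with "\n" as the paragraph character, the optional hyphen is U+00AD, the non-breaking space is U+00A0, and the manual line break is assumed to still be present as a vertical tab (character 11), which a plain-text export may already have converted. The name clean_hidden is hypothetical.

```python
def clean_hidden(text: str) -> str:
    """Apply removals (a)-(e) in the same order as the Word F/Rs."""
    text = text.replace("\u00ad", "")    # (a) optional (soft) hyphen -> nothing
    text = text.replace("\t", "")        # (b) tab -> nothing
    text = text.replace(" \n", "\n")     # (c) space before paragraph end -> paragraph end
    text = text.replace("\x0b", "\n")    # (d) manual line break (assumed char 11) -> paragraph end
    text = text.replace("\u00a0", " ")   # (e) non-breaking space -> ordinary space
    return text
```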
-----------------------
PAGE NUMBERS
-----------------------
Now use the formatting to remove repetitive items like Author, Book title, Chapter title, Page number.
Here's how:
Click on a sample piece of the text/number you want to remove. Check the formatting toolbar to see if the text has some distinguishing characteristic such as size or font.
It may be necessary to click the AA icon on the left of this toolbar to open a panel where there's more information. The format of the text where the cursor is will be identified by having a box around it and hovering the arrow over it brings up format info.
If you find something unique then put the cursor in the Find box in the Find/Replace dialogue and click More - Format. Set the format to be found.
Leave the Find and Replace boxes empty and click Replace All.
(Remember to click 'No Formatting' before doing further searches!)

If there was no unique format for the numbers/authors/title we need plan B!
Here are some common page numberings:- (where x = page number)

(a) Author-x and x-Title OR Title-x and x-Author.
(b) Page x of y where y is the total pages
(c) Page x
(d) {x} where {} is some simple form of decoration
(e) x

(a) Author-x and x-Title OR Title-x and x-Author
You won't be able to deal with all of them in one scan because of Word's limited F/R compared to Regex.

Find a line that has the author's name and page number. Copy and paste this line into the Find box. Change the digit to [0-9]@
and modify the line to look like (A) or (B) according to whether the number is before or after the text.
Click 'More' and check 'Use wildcards'.

Code:

(A)
Find:	(^13)the title or author*[0-9]{1,3}*^13
Repl:	\1
(B)
Find:	(^13)[0-9]{1,3}*the title or author*^13
Repl:	\1

Notes:
If you've copied and pasted the title or author, deselect this before searching.
[0-9] means any digit in the range 0-9 and @ means 1 or more.
* means zero or more of anything (always use with care. Have a back-up.)
^13 means a paragraph character (use only in Find box, never in Replace box)
\1 means the item captured by the expression in the first set of round brackets.
We are restoring the first found paragraph character because this contains formatting information about its preceding paragraph.

Repeat for the way the other page is done.
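
Here is a rough Python equivalent of (A) and (B), for anyone who wants to see the logic in ordinary regex. It is a sketch, not what Word does internally: it assumes plain text with "\n" as the paragraph character, and remove_running_heads and its label parameter are names invented for this example.

```python
import re

def remove_running_heads(text: str, label: str) -> str:
    """Remove whole lines made of `label` plus a 1-3 digit page number.

    `label` is the title or author text as it appears in the document.
    """
    lab = re.escape(label)
    # (A) label first, number after it; (B) number first, label after it.
    pattern_a = rf"(\n){lab}[^\n]*?[0-9]{{1,3}}[^\n]*\n"
    pattern_b = rf"(\n)[0-9]{{1,3}}[^\n]*?{lab}[^\n]*\n"
    # \1 puts back the paragraph end that preceded the header line.
    text = re.sub(pattern_a, r"\1", text)
    text = re.sub(pattern_b, r"\1", text)
    return text
```

The same warning applies as in Word: a genuine paragraph that starts with the author's name and contains a number would also be caught, so check a few matches before trusting it.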

Sometimes title/author, page number sneaks onto the end of a text paragraph. It's a good idea to search (Find Next) for both the author name and the title to check this out.
If the title/author page number is not in its own paragraph, use the following F/Rs:

Notes:
IT IS IMPORTANT TO DO THESE IN THE ORDER SHOWN
Use REPLACE not REPLACE ALL as use of * CAN SELECT TOO MUCH.
Where possible replace *s with actual text.
If there is a character that won't paste into the Find box use the '?' for any character.
In the expressions below paste the actual title or author.

FIRST
Check 'Use wildcards'

Code:

Find:	the title/author[!^13]@[0-9]{1,3}*^13
Repl:	blank or space	Check with Find Next to see what is required
			- it depends whether you've selected 'title/author' with a leading space

THEN

Code:

Find:	[0-9]{1,3}[!^13]@the title/author*^13
Repl:	blank or space	Check with Find Next to see what is required

(b) Page x of y

Check 'Use wildcards'

Code:

Find:	(^13)[Pp][Aa][Gg][Ee] [0-9]@ of [0-9]@^13
Repl:	\1

Notes:
Finds all forms of capitalisation of 'page'
There are three spaces in the expression - one after [Ee] and two around " of ".

(c) Page x

Check 'Use wildcards'

Code:

Find:	(^13)[Pp][Aa][Gg][Ee] [0-9]@^13
Repl:	\1
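
The 'Page x of y' and 'Page x' searches translate fairly directly into Python regex. This sketch assumes plain text with "\n" as the paragraph character; re.IGNORECASE does the job of [Pp][Aa][Gg][Ee], and the names below are made up for illustration.

```python
import re

# (b) "Page x of y" and (c) "Page x", each alone on its own line.
PAGE_X_OF_Y = re.compile(r"(\n)page [0-9]+ of [0-9]+\n", re.IGNORECASE)
PAGE_X = re.compile(r"(\n)page [0-9]+\n", re.IGNORECASE)

def remove_page_lines(text: str) -> str:
    """Drop page-number lines, keeping the paragraph end before them."""
    text = PAGE_X_OF_Y.sub(r"\1", text)
    return PAGE_X.sub(r"\1", text)
```

Because the pattern must start right after a paragraph end, a sentence such as "On page 4 of the book" inside running text is left alone.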

(d) {x}
Check 'Use wildcards'

Code:

Find:	(^13)[\decorative character ]@[0-9]@[\decorative character ]@^13
Repl:	\1

Example: (^13)[\{\} ]@[0-9]@[\{\} ]@^13
Notes:
The backslash is to 'escape' the next character which otherwise could be interpreted as a wildcard.
The expression [\{\} ]@ means one or more of the characters '{' '}' and <space> in any order.
Replace {} with the decorative character you find in your document.
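
A Python sketch of the decorated-number search, with the decoration passed in rather than typed into the expression. It assumes plain text with "\n" as the paragraph character; remove_decorated_numbers and its decoration parameter are hypothetical names. re.escape plays the part of the backslashes in the Word expression.

```python
import re

def remove_decorated_numbers(text: str, decoration: str = "{}") -> str:
    """Remove lines like '{ 12 }' built from decoration chars, spaces and digits."""
    chars = re.escape(decoration) + " "   # decoration characters plus <space>
    pattern = rf"(\n)[{chars}]+[0-9]+[{chars}]+\n"
    return re.sub(pattern, r"\1", text)
```

For a book that uses asterisks you would call it as remove_decorated_numbers(text, "*").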

(e) x USE WITH CARE ***

Check 'Use wildcards'

Code:

Find:	(^13)[0-9 ]@^13
Repl:	\1

Notes: A space before or after the number or between digits is allowed for here by using [0-9<space>].

*** If you're retaining the TOC this Find/Replace could play havoc with it.
The solution is to get the Find/Replace ready then select the TOC and Cut it using Ctrl X.
Do the F/R then Paste the TOC back. Or select a Find format that excludes the TOC's format.
*** If chapters are headed only by digits you will need to avoid deleting those.
Often you can do this by specifying a format in the Find criteria (eg Size 10 or Not Bold) which will exclude the chapter numbers but include the page numbers.
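
The 'cut the TOC, run the F/R, paste it back' trick can be sketched in Python as well. Assumptions: plain text with "\n" as the paragraph character, and you already know the character offsets of the block to protect. The name remove_bare_numbers and the keep parameter are invented for this example.

```python
import re

BARE_NUMBER = re.compile(r"(\n)[0-9 ]+\n")   # a line of digits and spaces only

def remove_bare_numbers(text: str, keep=None) -> str:
    """Remove digit-only lines; `keep` is an optional (start, end) block to skip."""
    if keep is None:
        return BARE_NUMBER.sub(r"\1", text)
    start, end = keep
    head = BARE_NUMBER.sub(r"\1", text[:start])
    tail = BARE_NUMBER.sub(r"\1", text[end:])
    # The protected block (e.g. the TOC) goes back in untouched.
    return head + text[start:end] + tail
```

Plain text carries no font information, so the Size 10 / Not Bold trick for sparing chapter numbers has no counterpart here.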
----------------------------
UNHAPPY RETURNS
----------------------------
Empty and Broken paragraphs
Before going any further it would be a good idea to replace straight quotes with curly quotes. They may be present but not obvious. So select a 'straight' quote and change the font to Times New Roman which displays them clearly. If it looks curly then you don't need to do anything; otherwise type "curly quotes" into the help box and follow the instructions to change straight quotes into curly quotes. Use CTRL Z to revert the quote back to its original font.

Some PDFs reflow the text and when you paste into Word a line of text will fill the available width before wrapping around to the next line. If this is the case you can disregard this section.

After you've pasted into Word and clicked the pilcrow you may see a paragraph character at the end of each and every line. You'll want to remove all of these except those marking the end of a true paragraph.

Firstly, there should not be any empty paragraphs creating blank lines. Spacing should be accomplished by using format/style.
Remove empty paragraphs:

Check 'Use wildcards'

Code:

Find:	^13{2,10}
Repl:	^p
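
In Python regex terms the empty-paragraph search looks like this (a sketch assuming plain text with "\n" as the paragraph character; the function name is made up):

```python
import re

def collapse_empty_paragraphs(text: str) -> str:
    """Replace runs of 2 to 10 paragraph ends with a single one."""
    return re.sub(r"\n{2,10}", "\n", text)
```

As with the Word expression, a run longer than ten is only partly collapsed in one pass, so run it again if any blank lines survive.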

Broken paragraphs:

First method -
When not following certain punctuation marks such as full stop, question mark and so on, the paragraph character is removed.

(A)

Check 'Use wildcards'
With cursor in Find box click More, Font, Not Bold

Code:

Find:	([!.\?:"\!”'’\)0-9])^13	Note both straight and closing curly double quotes are required
Repl:	\1		That's \1 followed by a <space>

*** Afterwards with cursor in Find box click 'No formatting' ***

The main problem with this is some of the punctuation may not actually end a paragraph - yet coincidentally it's at the end of a line followed by a paragraph character.
Here's an example created by the above F/R (</p> indicates a paragraph character):

Even Palmer couldn’t ignore something like that. “Mwhuh?”</p>
she replied as she chewed her doughnut.</p>

'she replied ...' should be on the same line as “Mwhuh?”

So a different approach is to look for lines starting with lowercase letters:

(B)

Check 'Use wildcards'

Code:

Find:	^13([a-z])		Looks for lines starting with a lowercase letter after the paragraph character
Repl:	 \1		That's a <space> followed by \1

So the answer (still imperfect) is to use (A) then use (B)
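
To show how (A) followed by (B) behaves, here is a small Python sketch. It assumes plain text with "\n" as the paragraph character and that empty paragraphs have already been removed; join_broken is a hypothetical name. The Not Bold restriction cannot be reproduced in plain text, so headings would need protecting some other way.

```python
import re

# (A) a paragraph end NOT preceded by closing punctuation or a digit
RULE_A = re.compile(r'''([^.?:"!”'’)0-9\n])\n''')
# (B) a paragraph end followed by a lowercase letter
RULE_B = re.compile(r"\n([a-z])")

def join_broken(text: str) -> str:
    """Join lines that were split mid-paragraph: rule (A), then rule (B)."""
    text = RULE_A.sub(r"\1 ", text)   # \1 followed by a space
    text = RULE_B.sub(r" \1", text)   # a space followed by \1
    return text
```

On the Palmer example, (A) leaves the break after “Mwhuh?” alone because it follows a closing quote, and (B) then joins it because the next line starts with a lowercase 's'.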

Still there are problems:
(a) Chapter headings and poetry bits (epigraphs) don't have end punctuation so lose their line/paragraph characters.
Solution: Parts not to be scanned should be made bold then during F/R the Find Criteria should include Format, Font - Not Bold. Change back from Bold afterwards. Click 'No formatting' in F/R dialogue.

(b) In USA it seems the practice is to put a full stop (period) after Mr. Mrs. Ms. Dr. A coincidental paragraph character after the dot would survive and we'd get:
Mr.</p>		(</p> represents a paragraph character)
Smith....
(C)

Solution: After the above F/Rs, (A) and (B)

Check 'Use wildcards'

Code:

Find:	([DM][rs]{1,2}.)^13([A-Z])
Repl:	\1 \2			Optional space between \1 and \2 for Dr.Smith or Dr. Smith
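
The same fix in Python regex, as a sketch (plain text with "\n" as the paragraph character is assumed; the function name is invented). In Python the dot has to be escaped, whereas in a Word wildcard search it is an ordinary character.

```python
import re

def join_after_titles(text: str) -> str:
    """Rejoin 'Mr.', 'Mrs.', 'Ms.' or 'Dr.' with a capitalised name on the next line."""
    return re.sub(r"([DM][rs]{1,2}\.)\n([A-Z])", r"\1 \2", text)
```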

(c) Similarly, if by coincidence the para-character is just after a full stop then it's left there even though it's mid paragraph.
Here's an example pre F/R (</p> represents a paragraph character):

“Okay, now you just sound like a scary boyfriend,” May</p>
said, reaching under her T-shirt to unhook her bra. “Explain.</p>
Why am I doing this?”</p>

Following a F/R the paragraph character after 'May' is removed but the one after 'Explain.' isn't because it just happens to come after a full stop.

(D)

Solution: This will only work for curly quotes, because of a need to distinguish between opening and closing quotes.

Check 'Use wildcards'

Code:

Find:	(“[!^13”]@)^13([!”]@”)	
Repl:	\1 \2			Space between \1 and \2
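
A Python sketch of (D), assuming plain text with "\n" as the paragraph character and curly quotes throughout; join_inside_quotes is a made-up name. Like the Word search, it joins only the first break inside each quotation per pass, so speech broken over three or more lines needs repeated runs.

```python
import re

# Opening quote, text with no paragraph end or closing quote, then a
# paragraph end, then everything up to and including the closing quote.
INSIDE_QUOTE = re.compile(r"(“[^\n”]+)\n([^”]+”)")

def join_inside_quotes(text: str) -> str:
    """Replace a paragraph end that falls inside “...” with a space."""
    return INSIDE_QUOTE.sub(r"\1 \2", text)
```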

--------------------------------
MINOR CORRECTIONS
--------------------------------
You may have gone ahead with this conversion knowing that there were errors in the original PDF. Common errors are missing spaces and wrongly scanned or OCRed letters, eg 'r n' becomes 'm'. There's also sometimes a problem with a PDF that uses special characters, such as stylistic ligatures.

Missing Spaces
a) Missing space after a punctuation mark

Code:

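Find:	([!A-Z][.,:;”\!])([A-z])
Repl:	\1 \2	That's \1<space>\2

Notes: [!A-Z] is to avoid titles such as A.B.C. becoming A. B. C.; however Ph.D. will be split into Ph. D. The [!A-Z] sits inside the first set of round brackets so that the character before the punctuation mark is put back by \1 instead of being deleted.
Unfortunately this also will put a space between punctuation and a closing quote.
Example: .” becomes . ” (that's .<space>”). To undo it: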

Code:

 
Find:	 ”	That's 	<space>”
Repl:	”	That's	” only
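
Here is a Python sketch of the missing-space fix, assuming plain text; add_missing_spaces is a hypothetical name. The character before the punctuation mark is kept inside the first group so that it is put back by \1. The letter test is written as [A-Za-z] here, so the stray space before a closing quote should not arise, but the clean-up step is kept to mirror the two F/Rs.

```python
import re

def add_missing_spaces(text: str) -> str:
    """Insert a space after punctuation that runs straight into a letter."""
    # Group 1: a non-capital character plus the punctuation mark.
    text = re.sub(r"([^A-Z][.,:;”!])([A-Za-z])", r"\1 \2", text)
    # Clean-up: remove any space left in front of a closing curly quote.
    return text.replace(" ”", "”")
```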

b) Missing space between a word in italic and a word non-italic.
You may see this: firstsecond (with 'first' in italic and 'second' not)

(This is a two part F/R)

UNCHECK 'Use wildcards'
(i)
With the cursor in the Find box, click Format, Font and select Italic.

Code:

Find:	leave blank
Repl:	 ^& 		That's <space>^&<space>

If the Find button is dimmed, you haven't unchecked 'Use wildcards'.

(ii)
CHECK 'Use wildcards'
With the cursor in the Find box, click No Formatting

Code:

Find:	 {2}		That's <space>{2}
Repl:	 		That's <space>

Explanation: First we add a space on each side of every block of word(s) in italic.
Then we search for any double spaces and replace with a single space.

Problems: If there is a blockquote in italic (eg a poetic verse) or if a paragraph ends in italic a space will be added to the start of the following line.

Solution:

CHECK 'Use wildcards'

Code:

Find:	(^13) 		That's (^13)<space>
Repl:	\1

Spelling
Mistakes and missing spaces between regular style words
There's no easy solution to this, but remember you can right-click on a word underlined with a wavy red line and Word will suggest corrections including inserting a missing space (sometimes).
-------------------
VBA MACROS
-------------------
If you're happy with using macros most of these will be suitable. (The exceptions are where you need to enter specific text such as Author and Title. I may show you how to deal with this using an Input box in a later post if requested, but I'm not sure if this is the correct forum for that sort of thing).

In the sequence presented here put the Find and Replace data into the F/R dialogue in advance.
With the Visual Basic toolbar showing click the round red button. Accept the name. Click OK.
In the F/R dialogue click Replace All.
On the floating Visual Basic toolbar click the square button to stop recording.

Repeat for each F/R you want to use.

Go into the Visual Basic Editor. (ALT F11 or find the icon on the VB toolbar)
Now I'm not sure how this will open for you, so assuming you cannot see your macros this is what you do:
Go to View, click on Project Explorer.
In the Project Explorer open Modules by clicking the plus sign in a small box.
Right click New Macros and select View Code.

Each of your macros begins Sub Macro_whatever() and ends with End Sub
Leave the first Sub Macro_whatever() and leave the very last End Sub.
Remove all the other Sub Macro_ titles and End Subs in between to make one big sub-routine.
You can change the name by editing it -
example to Sub PdfToEpubEdit () note no spaces in name and empty ( )
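
The idea of merging the recorded macros into one sub-routine is simply 'run every F/R in a fixed order'. The sketch below shows that idea in Python rather than VBA, so it is not what the macro recorder produces. It assumes plain text with "\n" as the paragraph character, and the rule list and the name run_all are invented for illustration.

```python
import re

# Each rule is (find pattern, replacement), applied top to bottom.
RULES = [
    (r" \n", "\n"),                                 # space before paragraph end
    (r"\n{2,10}", "\n"),                            # empty paragraphs
    (r'''([^.?:"!”'’)0-9\n])\n''', r"\1 "),         # broken paragraphs, rule (A)
    (r"\n([a-z])", r" \1"),                         # broken paragraphs, rule (B)
]

def run_all(text: str, rules=RULES) -> str:
    """Apply every find/replace in sequence, like one merged macro."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text
```

The order matters here just as it does in the merged macro: the empty paragraphs have to go before the broken-paragraph rules run.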

When you close Word, the macro will be saved (in Normal.dot)
To run your macro in future click on the Run button (a triangle) on the VB toolbar and select your macro.

If you don't like it you can select it all in the VB Editor and delete.
Because you have recorded your macro(s) the VB uses Select. More efficient (faster) macros use Range but they have to be hand written.

To assign a keyboard shortcut: in Word, Tools > Customize > Commands tab > Keyboard
Left panel find and select 'Macros'
Right panel shows macros:
Find your macro
Click on it.
Click in the box 'Press new shortcut key:'
Press a key combination, example ALT SHIFT P
This will be stored in the Normal.dot template along with your macro.
Click Assign, and close the dialogues. Try it out on a copy of a document.

-------------------------------------------------------
Still to come: Styles and the CSS in MS Word.
-------------------------------------------------------

Adjust · 04-28-2011, 07:56 PM

Brilliant, Thanks for the excellent write up.

Faster · 04-30-2011, 03:30 PM

Thanks for your comment Adjust; however, the lack of more responses suggests that there is little interest in the topic so I won't bother to continue with it.

DaleDe · 04-30-2011, 04:15 PM

Actually it could be a great page in our wiki

Dale

Pablo · 04-30-2011, 04:39 PM

Thanks for sharing your expertise!!!

GeoffC · 05-01-2011, 05:52 AM

A masterful piece of work - and I agree - it should be in the Wiki .....

rakulos · 05-01-2011, 08:46 AM

Brilliant summary - and this...

"Realize that the intensive proof reading of a novel will often spoil the later reading experience for you."

is all too true

JSWolf · 05-01-2011, 08:56 AM

The OP is describing how to take a PDF downloaded illegally from the net and convert it into some other format. Do we want to really give credence to this illegal activity?

Pablo · 05-01-2011, 09:04 AM

Quote:

Originally Posted by JSWolf

The OP is describing how to take a PDF downloaded illegally from the net and convert it into some other format. Do we want to really give credence to this illegal activity?

It's a technical post, I don't see why you should say this.

GeoffC · 05-01-2011, 10:08 AM

Quote:

Originally Posted by JSWolf

The OP is describing how to take a PDF downloaded illegally from the net and convert it into some other format. Do we want to really give credence to this illegal activity?

Where, Jon, does it say that ?

JSWolf · 05-01-2011, 08:28 PM

Quote:

Open the PDF in Adobe Reader
Take a careful look at it. If it is so riddled with errors it may not be worth bothering with. Is there a better format or better version available (v5.0 is best)?

Riddled with errors and v5.0 give it away. I've not seen publisher PDF riddled with errors. Also, v5.0 is a version number for publisher copy. It's used by the people pirating eBooks. So yes, this is an article describing how to convert a pirated PDF. It was a dead giveaway.

DaleDe · 05-01-2011, 08:52 PM

Quote:

Originally Posted by JSWolf

The OP is describing how to take a PDF downloaded illegally from the net and convert it into some other format. Do we want to really give credence to this illegal activity?

I saw no DRM breaking in the description or necessarily anything other than format shifting for personal use. This is legal in the US but may be illegal other places unless the document is not copyrighted. A copyright caution can easily be added to the wiki entry to tell the user check the laws of their country regarding format shifting. Of course redistributing any copyrighted document without permission is illegal whether the format is changed or not. Did you see something I missed?

Dale

JSWolf · 05-01-2011, 09:04 PM

Quote:

Originally Posted by DaleDe

I saw no DRM breaking in the description or necessarily anything other than format shifting for personal use. This is legal in the US but may be illegal other places unless the document is not copyrighted. A copyright caution can easily be added to the wiki entry to tell the user check the laws of their country regarding format shifting. Of course redistributing any copyrighted document without permission is illegal whether the format is changed or not. Did you see something I missed?

Dale

The fact that the OP started off by telling us to try to get a v5.0 PDF means a pirated PDF.

Adjust · 05-02-2011, 12:38 AM

Quote:

Originally Posted by JSWolf

The fact that the OP started off by telling us to try to get a v5.0 PDF means a pirated PDF.

No I read that to be open in Acrobat V5.0 reader.
(in my industry we refer to PDF from whatever version they were made from, v5.0 being one)

I have no idea where you are getting your assumption that it's pirated.

I have every edition of Acrobat going back to V4.0 and now CS5 (v9.4.4)

CS3 Acrobat convert PDFs to text like vomit. CS5 does a good job.
CS does a better job.

I am constantly converting PDFs to text (Word files) and found the write up excellent.

And it has already helped me fast track my workflow

And I'm looking forward to his next one

DaleDe · 05-02-2011, 01:05 AM

I would agree. Many people have reasons to convert PDFs, even PDF files from scans they made themselves. There is no reason to believe this process condones copyright violation or stealing. It has legitimate purpose.

Dale

04-28-2011, 03:55 PM	#1
Faster Connoisseur Posts: 61 Karma: 12096 Join Date: Sep 2010 Location: Tasmania Device: Sony PRS 650	PDF to Epub Workshop PDF to Epub workshop The process consists of transferring editing styling assembling In this first post I'll deal with transferring and editing. MS Word is used because it can reveal hidden characters, retains some of the PDF's useful formatting and it has powerful styling capabilities. Final assembly is in Sigil. This post deals only with converting novels, ie all text apart from a cover, maybe a map and possibly a little chapter decoration. ---------------------------- TRANSFERRING ---------------------------- Open the PDF in Adobe Reader Take a careful look at it. If it is so riddled with errors it may not be worth bothering with. Is there a better format or better version available [edited]? Realize that the intensive proof reading of a novel will often spoil the later reading experience for you. Open MS Word (mine is Office 2003 because I hate the ribbon in Office 2007. It sits unused on another computer!) In Adobe Reader from the Edit Menu select 'Copy File to Clipboard' (Alt-E, Alt-B) Note: nothing may happen at this stage! Switch focus to MS Word. Have a new document open in Web Layout. From Edit Menu select 'Paste' (CTRL-V) Note: depending on the speed of your computer, you should see a progress bar in the bottom right of Adobe Reader and screen refresh may be slow. Eventually you should get a copy of the PDF in document format in Word. Some formatting will have carried over. In a moment we'll use variations in format to delete bits we don't want - such as page titles and numbers which are senseless junk on an ebook reader. If all you see are a whole bunch of squares you won't be able to copy the PDF (without some stuff I'm not dealing with) so you'll need to either read the PDF on your computer or find an alternative format of the book. 
--------------------------------------- DELETE UNWANTED PAGES --------------------------------------- First delete any sections you don't want such as 'Also by This Author', TOC, Acknowledgements, Dedications, teasers, 'About the Author', 'About the Publisher' - in fact any parts you don't want to read later. You'll still have the PDF to read these. ---------------------------- FIND AND REPLACE ---------------------------- Until you trust the expressions presented here use a copy of your file. Don't jump straight into 'Replace All', try a few single Replaces first If a dialogue says '0' replacements it's probably because you haven't checked or unchecked 'Use wildcards' when required, or you haven't cleared the formatting from a previous search. (Click 'No formatting' box under 'More'.) Get into the habit of using ^13 in the Find box and ^p in the Replace box. There are important reasons for this. Get used to working with hidden characters showing. -------------- THE JUNK -------------- Removing Amber/Nova/file junk: Use F/R and leave Replace blank Check 'Use wildcards' Code: Find: Generated by ABC ^13 Find: ABC Amber ^13 Find: Create PDF ^13 Find: file:^13 ------------------------------------------------------------------------- HIDDEN CHARACTERS THAT CAN CAUSE PROBLEMS ------------------------------------------------------------------------- Click the pilcrow (reversed P shape in toolbar) to see all the hidden characters. At this stage, if paragraph characters occur at the end of each line of text instead of only at the end of paragraphs IGNORE them. If you make corrections now dealing with page numbers will be difficult. They will become embedded inside paragraphs and you don't want that. 
First: a) Look for optional hyphens (little lines inside words) b) Tabs (little arrows) c) A space (small dot) followed by a paragraph character (pilcrow) d) Manual line breaks (bent arrows) (Inserted with SHIFT-ENTER) e) Non-breaking space (small circle) (Inserted with CTRL-SHIFT SPACE Uncheck 'Use wildcards' Removal Code: (a) Find: ^- Replace: blank (b) Find: ^t Replace: blank (c) Find: <space>^13 Replace: ^p (for <space> hit the spacebar) (d) Find: ^\| Replace: ^p (the char after ^ is the 'pipe' found with shift backslash '\'. (e) Find: ^s Replace: <space> (<space> means press space bar once) Codes for these hidden characters can be inserted in the Find/Replace dialogue by clicking 'More' - 'Special' and selecting what you need. ----------------------- PAGE NUMBERS ----------------------- Now use the formatting to remove repetitive items like Author, Book title, Chapter title, Page number. Here's how: Click on a sample piece of the text/number you want to remove. Check the formatting toolbar to see if the text has some distinguishing characteristic such as size or font. It may be necessary to click the AA icon on the left of this toolbar to open a panel where there's more information. The format of the text where the cursor is will be identified by having a box around it and hovering the arrow over it brings up format info. If you find something unique then put the cursor in the Find box in the Find/Replace dialogue and click More - Format. Set the format to be found. Leave the Find and Replace box empty and Replace All. (Remember to click 'No Formatting' before doing further searches!) If there was no unique format for the numbers/authors/title we need plan B! Here are some common page numberings:- (where x = page number) (a) Author-x and x-Title OR Title-x and x-Author. 
(b) Page x of y where y is the total pages (c) Page x (d) {x} where {} is some simple form of decoration (e) x (a) Author-x and x-Title OR Title-x and x-Author* You won't be able to deal with all of them in one scan because of Word's limited F/R compared to Regex. Find a line that that has the authors name and page number. Copy and paste this line into the Find box. Change the digit to [0-9]@ and modify the line to look like (A) or (B) according to whether the number is before or after the text. Click 'More' and check 'Use wildcards'. Code: (A) Find: (^13)the title or author[0-9]{1,3}^13 Repl: \1 (B) Find: (^13)[0-9]{1,3}the title or author^13 Repl: \1 Notes: If you've copied and pasted the title or author, deselect this before searching. [0-9] means any digit in the range 0-9 and @ means 1 or more. * means zero or more of anything (always use with care. Have a back-up.) ^13 means a paragraph character (use only in Find box, never in Replace box) \1 means the item captured by the expression in the first set of round brackets. We are the restoring the first found paragraph character because this contains formatting information about its preceding paragraph. Repeat for the way the other page is done. Sometimes title/author, page number sneaks onto the end of a text paragraph. It's a good idea to search (Find Next) for both the author name and the title to check this out. If the title/author page number is not in it's own paragraph, use the following F/Rs: Notes: IT IS IMPORTANT TO DO THESE IN THE ORDER SHOWN Use REPLACE not REPLACE ALL as use of * CAN SELECT TOO MUCH. Where possible replace s with actual text. If there is a character that won't paste into the Find box use the '?' for any character. In the expressions below paste the actual title or author. 
FIRST Check 'Use wildcards' Code: Find: the title/author[!^13]@[0-9]{1,3}^13 Repl: blank or space Check with Find Next to see what is required - it depends whether you've selected 'title/author' with a leading space THEN Code: Find: [0-9]{1,3}[!^13]@the title/author^13 Repl: blank or space Check with Find Next to see what is required (b) Page x of y* Check 'Use wildcards' Code: Find: (^13)[Pp][Aa][Gg][Ee] [0-9]@ of [0-9]@^13 Repl: \1 Notes: Finds all forms of capitalisation of 'page' There are three spaces in the expression - one after [Ee] and two around " of ". (c) Page x Check 'Use wildcards' Code: Find: (^13)[Pp][Aa][Gg][Ee] [0-9]@^13 Repl: \1 (d) {x} Check 'Use wildcards' Code: Find: (^13)[\decorative character ]@[0-9]@[\decorative character ]@^13 Repl: \1 Example: (^13)[\{\} ]@[0-9]@[\{\} ]@^13 Notes: The backslash is to 'escape' the next character which otherwise could be interpreted as a wildcard. The expression [\{\} ]@ means one or more of the characters '{' '}' and <space> in any order. Replace {} with the decorative character you find in your document. (e) x USE WITH CARE * Check 'Use wildcards' Code: Find: (^13)[0-9 ]@^13 Repl: \1 Notes: A space before or after the number or between digits is allowed for here by using [0-9<space>]. * If you're retaining the TOC this Find/Replace could play havoc with it. The solution is to get the Find/Replace ready then select the TOC and Cut it using Ctrl X. Do the F/R then Paste the TOC back. Or selecta Find format that excludes the TOC's format. * If chapters are headed only by digits you will need to avoid deleting those. Often you can do this by specifying a format in the Find criteria (eg Size 10 or Not Bold) which will exclude the chapter numbers but include the page numbers. ---------------------------- UNHAPPY RETURNS ---------------------------- Empty and Broken paragraphs Before going any further it would be a good idea to replace straight quotes with curly quotes. They may be present but not obvious. 
So select a 'straight' quote and change the font to Times New Roman which displays them clearly. If it looks curly then you don't need to do anything; otherwise type "curly quotes" into the help box and follow the instructions to change straight quotes into curly quotes. Use CTRL Z to revert the quote back to its original font. Some PDFs reflow the text and when you paste into Word a line of text will fill the available width before wrapping around to the next line. If this is the case you can disregard this section. After you've pasted into Word and clicked the pilcrow you may see a paragraph character at the end of each and every line. You'll want to remove all of these except those marking the end of a true paragraph. Firstly, there should not be any empty paragraphs creating blank lines. Spacing should be accomplished by using format/style. Remove empty paragraphs: Check 'Use wildcards' Code: Find: ^13{2,10} Repl: ^p Broken paragraphs: First method - When not following certain punctuation marks such as full stop, question mark and so on, the paragraph character is removed. (A) Check 'Use wildcards' With cursor in Find box click More, Font, Not Bold Code: Find: ([!.\?:"\!”'’\)0-9])^13 Note both straight and closing curly double quotes are required Repl: \1 That's \1 followed by a <space> * Afterwards with cursor in Find box click 'No formatting' * The main problem with this is some of the punctuation may not actually end a paragraph - yet coincidently it's at the end of a line followed by a paragraph character. Here's an example created by the above F/R: </p> indicates a paragraph character Even Palmer couldn’t ignore something like that. “Mwhuh?”</p> she replied as she chewed her doughnut.</p> 'she replied ...' 
should be on the same line as “Mwhuh?”

So a different approach is to look for lines starting with lowercase letters:
(B)
Check 'Use wildcards'
Code:
Find: ^13([a-z])      Looks for lines starting with a lowercase letter after the paragraph character
Repl:  \1      That's a <space> followed by \1

So the answer (still imperfect) is to use (A) then use (B).

Still there are problems:
(a) Chapter headings and poetry bits (epigraphs) don't have end punctuation so lose their line/paragraph characters.
Solution: Parts not to be scanned should be made bold, then during F/R the Find criteria should include Format, Font - Not Bold. Change back from Bold afterwards. Click 'No formatting' in the F/R dialogue.

(b) In USA it seems the practice is to put a full stop (period) after Mr. Mrs. Ms. Dr. A coincidental paragraph character after the dot would survive and we'd get:
</p> represents a paragraph character
Mr.</p>
Smith....
(C)
Solution: After the above F/Rs, (A) and (B)
Check 'Use wildcards'
Code:
Find: ([DM][rs]{1,2}.)^13([A-Z])
Repl: \1 \2      Optional space between \1 and \2 for Dr.Smith or Dr. Smith

(c) Similarly, if by coincidence the para-character is just after a full stop then it's left there even though it's mid paragraph.
Here's an example pre F/R:
</p> represents paragraph character
“Okay, now you just sound like a scary boyfriend,” May</p>
said, reaching under her T-shirt to unhook her bra. “Explain.</p>
Why am I doing this?”</p>
Following a F/R the paragraph character after 'May' is removed but the one after 'Explain.' isn't because it just happens to come after a full stop.
(D)
Solution: This will only work for curly quotes, because of a need to distinguish between opening and closing quotes.
Check 'Use wildcards'
Code:
Find: (“[!^13”]@)^13([!”]@”)
Repl: \1 \2      Space between \1 and \2

--------------------------------
MINOR CORRECTIONS
--------------------------------
You may have gone ahead with this conversion knowing that there were errors in the original PDF.
Common errors are missing spaces and wrongly scanned or OCRed letters eg 'r n' becomes 'm'. There's also sometimes a problem with a PDF that uses special characters, such as stylistic ligatures.

Missing Spaces**
a) Missing space after a punctuation mark
Check 'Use wildcards'
Code:
Find: ([!A-Z][.,:;”\!])([A-z])
Repl: \1 \2      That's \1<space>\2
Notes:
The character before the punctuation mark is kept inside the first group so that it is not deleted by the replace.
[!A-Z] is to avoid titles such as A.B.C. becoming A. B. C.; however Ph.D. will be split into Ph. D.
Unfortunately this also will put a space between punctuation and a closing quote;
Example: .” becomes . ”      That's .<space>”
Code:
Find:  ”      That's <space>”
Repl: ”      That's ” only

b) Missing space between a word in italic and a word non-italic.
You may see this: firstsecond
(This is a two part F/R)
UNCHECK 'Use wildcards'
(i) With the cursor in the Find box, click Format, Font and select Italic.
Code:
Find: leave blank
Repl:  ^&       That's <space>^&<space>
If the Find button is dimmed, you haven't unchecked 'Use wildcards'.
(ii) CHECK 'Use wildcards'
With the cursor in the Find box, click No Formatting
Code:
Find:  {2}      That's <space>{2}
Repl:        That's <space>
Explanation: First we add a space on each side of every block of word(s) in italic. Then we search for any double spaces and replace with a single space.
Problems: If there is a blockquote in italic (eg a poetic verse) or if a paragraph ends in italic, a space will be added to the start of the following line.
Solution:
CHECK 'Use wildcards'
Code:
Find: (^13)       That's (^13)<space>
Repl: \1

Spelling Mistakes and missing spaces between regular style words
There's no easy solution to this, but remember you can right-click on a word underlined with a wavy red line and Word will suggest corrections including inserting a missing space (sometimes).

-------------------
VBA MACROS
-------------------
If you're happy with using macros most of these will be suitable. (The exceptions are where you need to enter specific text such as Author and Title.
I may show you how to deal with this using an Input box in a later post if requested, but I'm not sure if this is the correct forum for that sort of thing).

In the sequence presented here put the Find and Replace data into the F/R dialogue in advance.
With the Visual Basic toolbar showing click the round red button.
Accept the name. Click OK.
In the F/R dialogue click Replace All.
On the floating Visual Basic toolbar click the square button to stop recording.
Repeat for each F/R you want to use.

Go into the Visual Basic Editor. (ALT F11 or find the icon on the VB toolbar)
Now I'm not sure how this will open for you, so assuming you cannot see your macros this is what you do:
Go to View, click on Project Explorer.
In the Project Explorer open Modules by clicking the plus sign in a small box.
Right click New Macros and select View Code.

Each of your macros begins Sub Macro_whatever() and ends with End Sub
Leave the first Sub Macro_whatever() and leave the very last End Sub.
Remove all the other Sub Macro_ titles and End Subs in between to make one big sub-routine.
You can change the name by editing it - example to Sub PdfToEpubEdit() - note no spaces in the name and empty ()
When you close Word, the macro will be saved (in Normal.dot)

To run your macro in future click on the Run button (a triangle) on the VB toolbar and select your macro.
If you don't like it you can select it all in the VB Editor and delete.
Because you have recorded your macro(s) the VB uses Select. More efficient (faster) macros use Range but they have to be hand written.

To assign a keyboard shortcut to your macro:
In Word, Tools > Customize > Commands tab > Keyboard
Left panel: find and select 'Macros'
Right panel shows macros: find your macro and click on it.
Click in the box 'Press new shortcut key:'
Press a key combination, example ALT SHIFT P
This will be stored in the Normal.dot template along with your macro.
Click Assign, and close the dialogues.
Try it out on a copy of a document.
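For anyone who wants to try the chained F/R idea outside Word first, here is a rough Python sketch of the paragraph-repair sequence (remove empty paragraphs, then (A), (B), (C), (D)) run as one routine, much like the single big macro above. It is only an approximation under some assumptions: the text has been saved as plain text with "\n" standing in for Word's ^13, the Word wildcards have been translated by hand into Python regular expressions, and the names STEPS and repair_paragraphs are made up for this illustration. The 'Not Bold' format filter cannot be reproduced on plain text, so headings and epigraphs would still need protecting some other way.

```python
import re

# Each step is (label, pattern, replacement), applied in the order given in the post.
# Assumption: "\n" stands in for Word's ^13 paragraph character.
STEPS = [
    # Remove empty paragraphs (Word: ^13{2,10} -> ^p).
    ("empty paragraphs", r"\n{2,}", "\n"),
    # (A) Join a line that does not end in paragraph-ending punctuation.
    ("A: no end punctuation", r"""([^.?:"!”'’)0-9\n])\n""", r"\1 "),
    # (B) Join a line that starts with a lowercase letter onto the previous one.
    ("B: lowercase start", r"\n([a-z])", r" \1"),
    # (C) Rejoin a title such as Mr. or Dr. with the name on the next line.
    ("C: Mr./Mrs./Ms./Dr.", r"([DM][rs]{1,2}\.)\n([A-Z])", r"\1 \2"),
    # (D) Remove a break that falls between opening and closing curly quotes.
    ("D: break inside curly quotes", r"(“[^\n”]+)\n([^”]+”)", r"\1 \2"),
]


def repair_paragraphs(text: str) -> str:
    """Apply every step in order and return the repaired text."""
    for _label, pattern, replacement in STEPS:
        text = re.sub(pattern, replacement, text)
    return text


if __name__ == "__main__":
    sample = (
        "Even Palmer couldn’t ignore something like that. “Mwhuh?”\n"
        "she replied as she chewed her doughnut.\n"
    )
    print(repair_paragraphs(sample))
```

Run on the “Mwhuh?” example from earlier, step (A) leaves the break after the closing quote alone and step (B) then joins 'she replied ...' onto the same line. Keeping the steps in a list makes it easy to reorder or drop one while experimenting, before recording the equivalent F/Rs in Word.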
-------------------------------------------------------
Still to come: Styles and the CSS in MS Word.
-------------------------------------------------------

Last edited by netseeker; 05-05-2011 at 02:08 AM. Reason: moderation edit


04-28-2011, 07:56 PM	#2
Adjust Addict Posts: 351 Karma: 70000 Join Date: Jul 2010 Location: Australia Device: ADE, iPad	Brilliant, Thanks for the excellent write up.

04-30-2011, 03:30 PM	#3
Faster Connoisseur Posts: 61 Karma: 12096 Join Date: Sep 2010 Location: Tasmania Device: Sony PRS 650	Thanks for your comment Adjust; however, the lack of more responses suggests that there is little interest in the topic so I won't bother to continue with it.

04-30-2011, 04:15 PM	#4
DaleDe Grand Sorcerer Posts: 11,470 Karma: 13095790 Join Date: Aug 2007 Location: Grass Valley, CA Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7	Actually it could be a great page in our wiki Dale

04-30-2011, 04:39 PM	#5
Pablo Guru Posts: 970 Karma: 4999999 Join Date: Mar 2009 Location: Rosario, Argentina Device: SONY PRS-505, PRS-T2	Thanks for sharing your expertise!!!

05-01-2011, 05:52 AM	#6
GeoffC Chocolate Grasshopper ... Posts: 27,600 Karma: 20821184 Join Date: Mar 2008 Location: Scotland Device: Muse HD , Cybook Gen3 , Pocketbook 302 (Black) , Nexus 10: wife has PW	A masterful piece of work - and I agree - it should be in the Wiki .....

05-01-2011, 08:46 AM	#7
rakulos Connoisseur Posts: 57 Karma: 36 Join Date: Aug 2009 Device: ipad, K3, acer aspire switch 10	Brilliant summary - and this... "Realize that the intensive proof reading of a novel will often spoil the later reading experience for you." is all too true

05-01-2011, 08:56 AM	#8
JSWolf Resident Curmudgeon Posts: 73,975 Karma: 128903378 Join Date: Nov 2006 Location: Roslindale, Massachusetts Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3	The OP is describing how to take a PDF downloaded illegally from the net and convert it into some other format. Do we want to really give credence to this illegal activity?

05-02-2011, 01:05 AM	#15
DaleDe Grand Sorcerer Posts: 11,470 Karma: 13095790 Join Date: Aug 2007 Location: Grass Valley, CA Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7	I would agree. Many people have reasons to convert PDFs, even PDF files from scans they made themselves. There is no reason to believe this process condones copyright violation or stealing. It has a legitimate purpose. Dale
