04-28-2011, 03:55 PM | #1 |
Connoisseur
Posts: 61
Karma: 12096
Join Date: Sep 2010
Location: Tasmania
Device: Sony PRS 650
|
PDF to Epub Workshop
The process consists of transferring, editing, styling and assembling.

In this first post I'll deal with transferring and editing. MS Word is used because it can reveal hidden characters, retains some of the PDF's useful formatting, and has powerful styling capabilities. Final assembly is in Sigil.

This post deals only with converting novels, i.e. all text apart from a cover, maybe a map and possibly a little chapter decoration.

----------------------------
TRANSFERRING
----------------------------
Open the PDF in Adobe Reader.
Take a careful look at it. If it is riddled with errors it may not be worth bothering with. Is there a better format or better version available [edited]? Realize that the intensive proof reading of a novel will often spoil the later reading experience for you.

Open MS Word (mine is Office 2003 because I hate the ribbon in Office 2007. It sits unused on another computer!)

In Adobe Reader, from the Edit menu select 'Copy File to Clipboard' (Alt-E, Alt-B).
Note: nothing may appear to happen at this stage!

Switch focus to MS Word. Have a new document open in Web Layout.
From the Edit menu select 'Paste' (CTRL-V).
Note: depending on the speed of your computer, you should see a progress bar in the bottom right of Adobe Reader, and screen refresh may be slow.

Eventually you should get a copy of the PDF in document format in Word. Some formatting will have carried over. In a moment we'll use variations in format to delete bits we don't want - such as page titles and numbers, which are senseless junk on an ebook reader.

If all you see is a whole bunch of squares, you won't be able to copy the PDF (without some stuff I'm not dealing with), so you'll need to either read the PDF on your computer or find an alternative format of the book.
---------------------------------------
DELETE UNWANTED PAGES
---------------------------------------
First delete any sections you don't want, such as 'Also by This Author', TOC, Acknowledgements, Dedications, teasers, 'About the Author', 'About the Publisher' - in fact any parts you don't want to read later. You'll still have the PDF to read these.

----------------------------
FIND AND REPLACE
----------------------------
*Until you trust the expressions presented here, use a copy of your file.
*Don't jump straight into 'Replace All'; try a few single Replaces first.
*If a dialogue says '0' replacements it's probably because you haven't checked or unchecked 'Use wildcards' when required, or you haven't cleared the formatting from a previous search. (Click the 'No Formatting' button under 'More'.)
*Get into the habit of using ^13 in the Find box and ^p in the Replace box. There are important reasons for this: with 'Use wildcards' checked Word rejects ^p in the Find box, and ^13 in the Replace box does not produce a proper paragraph mark.
*Get used to working with hidden characters showing.

--------------
THE JUNK
--------------
Removing Amber/Nova/file junk:
Use F/R and leave Replace blank.
Check 'Use wildcards'
Code:
Find: Generated by ABC *^13
Find: ABC Amber *^13
Find: Create PDF *^13
Find: file:*^13

-------------------------------------------------------------------------
HIDDEN CHARACTERS THAT CAN CAUSE PROBLEMS
-------------------------------------------------------------------------
Click the pilcrow (reversed P shape in the toolbar) to see all the hidden characters.

At this stage, if paragraph characters occur at the end of each line of text instead of only at the end of paragraphs, IGNORE them. If you correct them now, dealing with page numbers will be difficult: the numbers will become embedded inside paragraphs and you don't want that.

First look for:
a) Optional hyphens (little lines inside words)
b) Tabs (little arrows)
c) A space (small dot) followed by a paragraph character (pilcrow)
d) Manual line breaks (bent arrows) (inserted with SHIFT-ENTER)
e) Non-breaking spaces (small circle) (inserted with CTRL-SHIFT-SPACE)

Uncheck 'Use wildcards'
Removal Code:
(a) Find: ^-     Replace: blank
(b) Find: ^t     Replace: blank
(c) Find: <space>^13     Replace: ^p     (for <space> hit the spacebar)
(d) Find: ^l     Replace: ^p     (the char after ^ is a lowercase 'L', Word's code for a manual line break)
(e) Find: ^s     Replace: <space>     (<space> means press the space bar once)

-----------------------
PAGE NUMBERS
-----------------------
Now use the formatting to remove repetitive items like Author, Book title, Chapter title, Page number. Here's how:

Click on a sample piece of the text/number you want to remove. Check the formatting toolbar to see if the text has some distinguishing characteristic such as size or font. It may be necessary to click the AA icon on the left of this toolbar to open a panel where there's more information. The format of the text where the cursor is will be identified by having a box around it, and hovering the arrow over it brings up format info.

If you find something unique then put the cursor in the Find box in the Find/Replace dialogue and click More - Format. Set the format to be found. Leave the Find and Replace boxes empty and Replace All. (Remember to click 'No Formatting' before doing further searches!)

If there was no unique format for the numbers/authors/title we need plan B!

Here are some common page numberings (where x = page number):
(a) Author-x and x-Title OR Title-x and x-Author
(b) Page x of y, where y is the total pages
(c) Page x
(d) {x} where {} is some simple form of decoration
(e) x

(a) Author-x and x-Title OR Title-x and x-Author
You won't be able to deal with all of them in one scan because Word's F/R is limited compared to Regex. Find a line that has the author's name and page number. Copy and paste this line into the Find box. Change the digit to [0-9]@ (or [0-9]{1,3} for up to three digits) and modify the line to look like (A) or (B) according to whether the number is after or before the text.
Click 'More' and check 'Use wildcards'.
Code:
(A) Find: (^13)the title or author*[0-9]{1,3}*^13     Repl: \1
(B) Find: (^13)[0-9]{1,3}*the title or author*^13     Repl: \1

If you've copied and pasted the title or author, deselect this before searching.

[0-9] means any digit in the range 0-9; @ means 1 or more of the preceding item and {1,3} means 1 to 3 of it.
* means zero or more of any character (always use with care. Have a back-up.)
^13 means a paragraph character (use only in the Find box, never in the Replace box)
\1 means the item captured by the expression in the first set of round brackets. We are restoring the first found paragraph character because this contains formatting information about its preceding paragraph.

Repeat for the way the other page is done.

Sometimes the title/author and page number sneak onto the end of a text paragraph. It's a good idea to search (Find Next) for both the author name and the title to check this out.

If the title/author page number is not in its own paragraph, use the following F/Rs:
Notes:
IT IS IMPORTANT TO DO THESE IN THE ORDER SHOWN
Use REPLACE not REPLACE ALL, as use of * CAN SELECT TOO MUCH.
Where possible replace *s with actual text.
If there is a character that won't paste into the Find box use '?' for any character.
In the expressions below paste the actual title or author.

FIRST
Check 'Use wildcards'
Code:
Find: the title/author[!^13]@[0-9]{1,3}*^13
Repl: blank or space
Check with Find Next to see what is required - it depends whether you've selected 'title/author' with a leading space.

SECOND
Code:
Find: [0-9]{1,3}[!^13]@the title/author*^13
Repl: blank or space
Check with Find Next to see what is required.

(b) Page x of y
Check 'Use wildcards'
Code:
Find: (^13)[Pp][Aa][Gg][Ee] [0-9]@ of [0-9]@^13
Repl: \1
Finds all forms of capitalisation of 'page'. There are three spaces in the expression - one after [Ee] and two around " of ".

(c) Page x
Check 'Use wildcards'
Code:
Find: (^13)[Pp][Aa][Gg][Ee] [0-9]@^13
Repl: \1

(d) {x}
Check 'Use wildcards'
Code:
Find: (^13)[\decorative character ]@[0-9]@[\decorative character ]@^13
Repl: \1
Notes:
The backslash is to 'escape' the next character, which otherwise could be interpreted as a wildcard.
The expression [\{\} ]@ means one or more of the characters '{' '}' and <space> in any order. Replace {} with the decorative character you find in your document.

(e) x
USE WITH CARE ***
Check 'Use wildcards'
Code:
Find: (^13)[0-9 ]@^13
Repl: \1

*** If you're retaining the TOC this Find/Replace could play havoc with it. The solution is to get the Find/Replace ready, then select the TOC and Cut it using Ctrl X. Do the F/R then Paste the TOC back. Or select a Find format that excludes the TOC's format.
*** If chapters are headed only by digits you will need to avoid deleting those. Often you can do this by specifying a format in the Find criteria (e.g. Size 10 or Not Bold) which will exclude the chapter numbers but include the page numbers.

----------------------------
UNHAPPY RETURNS
----------------------------
Empty and Broken paragraphs

Before going any further it would be a good idea to replace straight quotes with curly quotes. Straight quotes may be present but not obvious. So select a 'straight' quote and change the font to Times New Roman, which displays them clearly. If it looks curly then you don't need to do anything; otherwise type "curly quotes" into the help box and follow the instructions to change straight quotes into curly quotes. Use CTRL Z to revert the quote back to its original font.

Some PDFs reflow the text, and when you paste into Word a line of text will fill the available width before wrapping around to the next line. If this is the case you can disregard this section.

After you've pasted into Word and clicked the pilcrow you may see a paragraph character at the end of each and every line. You'll want to remove all of these except those marking the end of a true paragraph.

Firstly, there should not be any empty paragraphs creating blank lines. Spacing should be accomplished by using format/style.

Remove empty paragraphs:
Check 'Use wildcards'
Code:
Find: ^13{2,10}
Repl: ^p

First method - a paragraph character is removed when it does not follow certain punctuation marks such as full stop, question mark and so on.

(A)
Check 'Use wildcards'
With cursor in Find box click More, Font, Not Bold
Code:
Find: ([!.\?:"\!”'’\)0-9])^13     Note both straight and closing curly double quotes are required
Repl: \1<space>     That's \1 followed by a <space>

The main problem with this is that some of the punctuation may not actually end a paragraph - yet coincidentally it's at the end of a line followed by a paragraph character. Here's an example left behind by the above F/R:
</p> indicates a paragraph character

Even Palmer couldn’t ignore something like that. “Mwhuh?”</p>
she replied as she chewed her doughnut.</p>

'she replied ...' should be on the same line as “Mwhuh?” So a different approach is to look for lines starting with lowercase letters:

(B)
Check 'Use wildcards'
Code:
Find: ^13([a-z])     Looks for lines starting with a lowercase letter after the paragraph character
Repl: <space>\1     That's a <space> followed by \1

Still there are problems:
(a) Chapter headings and poetry bits (epigraphs) don't have end punctuation so lose their line/paragraph characters.
Solution: Parts not to be scanned should be made bold, then during F/R the Find criteria should include Format, Font - Not Bold. Change back from Bold afterwards. Click 'No Formatting' in the F/R dialogue.

(b) In the USA it seems the practice is to put a full stop (period) after Mr. Mrs. Ms. Dr. A coincidental paragraph character after the dot would survive and we'd get:
</p> represents a paragraph character

Mr.</p>
Smith....

(C) Solution: After the above F/Rs, (A) and (B)
Check 'Use wildcards'
Code:
Find: ([DM][rs]{1,2}.)^13([A-Z])
Repl: \1 \2     Optional space between \1 and \2 for Dr.Smith or Dr. Smith

Here's an example pre F/R:
</p> represents a paragraph character

“Okay, now you just sound like a scary boyfriend,” May</p>
said, reaching under her T-shirt to unhook her bra. “Explain.</p>
Why am I doing this?”</p>

Following a F/R the paragraph character after 'May' is removed but the one after 'Explain.' isn't, because it just happens to come after a full stop.

(D) Solution: This will only work for curly quotes, because of the need to distinguish between opening and closing quotes.
Check 'Use wildcards'
Code:
Find: (“[!^13”]@)^13([!”]@”)
Repl: \1 \2     Space between \1 and \2

--------------------------------
MINOR CORRECTIONS
--------------------------------
You may have gone ahead with this conversion knowing that there were errors in the original PDF. Common errors are missing spaces and wrongly scanned or OCRed letters, e.g. 'rn' becomes 'm'. There's also sometimes a problem with a PDF that uses characters such as stylistic ligatures.

Missing Spaces
a) Missing space after a punctuation mark
Code:
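Find: ([!A-Z])([.,:;”\!])([A-z])
Repl: \1\2 \3     That's \1\2<space>\3
The character matched by [!A-Z] is captured as \1 so that it is put back rather than deleted by the replace.

Unfortunately this will also put a space between punctuation and a closing quote.
Example: .” becomes . ”     That's .<space>”
Code: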
Find: <space>”     That's <space>”
Repl: ”     That's ” only

b) Missing spaces around italic words
You may see this: firstsecond (where one of the two words is in italic)
(This is a two part F/R)
UNCHECK 'Use wildcards'
(i) With the cursor in the Find box, click Format, Font and select Italic.
Code:
Find: leave blank
Repl: <space>^&<space>     That's <space>^&<space>

(ii) CHECK 'Use wildcards'
With the cursor in the Find box, click No Formatting
Code:
Find: <space>{2}     That's <space>{2}
Repl: <space>     That's <space>
This searches for any double spaces and replaces them with a single space.

Problems:
If there is a blockquote in italic (e.g. a poetic verse) or if a paragraph ends in italic, a space will be added to the start of the following line.
Solution:
CHECK 'Use wildcards'
Code:
Find: (^13)<space>     That's (^13)<space>
Repl: \1

Mistakes and missing spaces between regular style words
There's no easy solution to this, but remember you can right-click on a word underlined with a wavy red line and Word will suggest corrections, including inserting a missing space (sometimes).

-------------------
VBA MACROS
-------------------
If you're happy with using macros, most of these F/Rs will be suitable. (The exceptions are where you need to enter specific text such as Author and Title. I may show you how to deal with this using an Input box in a later post if requested, but I'm not sure if this is the correct forum for that sort of thing.)

In the sequence presented here, put the Find and Replace data into the F/R dialogue in advance.
With the Visual Basic toolbar showing, click the round red button.
Accept the name. Click OK.
In the F/R dialogue click Replace All.
On the floating Visual Basic toolbar click the square button to stop recording.
Repeat for each F/R you want to use.

Go into the Visual Basic Editor (ALT F11, or find the icon on the VB toolbar).
Now I'm not sure how this will open for you, so assuming you cannot see your macros this is what you do:
Go to View, click on Project Explorer.
In the Project Explorer open Modules by clicking the plus sign in a small box.
Right click New Macros and select View Code.

Each of your macros begins Sub Macro_whatever() and ends with End Sub.
Leave the first Sub Macro_whatever() and leave the very last End Sub.
Remove all the other Sub Macro_ titles and End Subs in between to make one big sub-routine.
You can change the name by editing it - for example to Sub PdfToEpubEdit() - note no spaces in the name, and empty ( ).
When you close Word, the macro will be saved (in Normal.dot).

To run your macro in future click on the Run button (a triangle) on the VB toolbar and select your macro.
If you don't like it you can select it all in the VB Editor and delete.

Because you have recorded your macro(s) the VB uses Select.
More efficient (faster) macros use Range but they have to be hand written.

To give your macro a keyboard shortcut:
In Word, Tools > Customize > Commands tab > Keyboard
In the left panel find and select 'Macros'.
The right panel shows macros: find your macro and click on it.
Click in the box 'Press new shortcut key:'
Press a key combination, for example ALT SHIFT P
This will be stored in the Normal.dot template along with your macro.
Click Assign, and close the dialogues.
Try it out on a copy of a document.

-------------------------------------------------------
Still to come: Styles and the CSS in MS Word.
-------------------------------------------------------
Last edited by netseeker; 05-05-2011 at 02:08 AM. Reason: moderation edit |
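For anyone who would rather script the clean-up outside Word, three of the passes from the post above (bare page numbers, empty paragraphs, broken lines) can be sketched with Python's re module. This is only an illustration under stated assumptions, not part of the original workflow: it assumes the text has been saved as plain text with '\n' standing in for Word's paragraph character, the function names are made up for this example, and format-based tricks such as 'Not Bold' have no equivalent here.

```python
import re


def strip_page_numbers(text: str) -> str:
    """Drop lines holding only digits and spaces (compare F/R (e) x)."""
    # Word: Find (^13)[0-9 ]@^13   Repl \1
    return re.sub(r"(?m)^[0-9 ]+\n", "", text)


def collapse_empty_paragraphs(text: str) -> str:
    """Squeeze runs of paragraph characters down to one."""
    # Word: Find ^13{2,10}   Repl ^p
    return re.sub(r"\n{2,}", "\n", text)


def join_broken_lines(text: str) -> str:
    """Rejoin lines that were split in mid-paragraph."""
    # (A) a line break that does not follow end punctuation or a digit
    #     becomes a space. '\n' is added to the excluded set as a safety
    #     measure (an assumption of this sketch, not in the Word version).
    text = re.sub(r"([^.?:\"!”'’)0-9\n])\n", r"\1 ", text)
    # (B) a line break followed by a lowercase letter becomes a space.
    text = re.sub(r"\n([a-z])", r" \1", text)
    return text


def clean(text: str) -> str:
    """Run the passes in the order the post recommends:
    page numbers first, then empty paragraphs, then line joining."""
    text = strip_page_numbers(text)
    text = collapse_empty_paragraphs(text)
    return join_broken_lines(text)


sample = (
    "Even Palmer couldn’t ignore\n"
    "something like that. “Mwhuh?”\n"
    "she replied as she chewed her doughnut.\n"
    "\n"
    "42\n"
    "The next paragraph starts here.\n"
)
cleaned = clean(sample)
print(cleaned)
```

As in Word, the page numbers are removed before the lines are joined so that they do not end up embedded inside paragraphs. Passes (C) and (D) are left out to keep the sketch short.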
04-28-2011, 07:56 PM | #2 |
Addict
Posts: 351
Karma: 70000
Join Date: Jul 2010
Location: Australia
Device: ADE, iPad
|
Brilliant, Thanks for the excellent write up.
|
04-30-2011, 03:30 PM | #3 |
Connoisseur
Posts: 61
Karma: 12096
Join Date: Sep 2010
Location: Tasmania
Device: Sony PRS 650
|
Thanks for your comment Adjust; however, the lack of more responses suggests that there is little interest in the topic so I won't bother to continue with it.
|
04-30-2011, 04:15 PM | #4 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
Actually it could be a great page in our wiki
Dale |
04-30-2011, 04:39 PM | #5 |
Guru
Posts: 970
Karma: 4999999
Join Date: Mar 2009
Location: Rosario, Argentina
Device: SONY PRS-505, PRS-T2
|
Thanks for sharing your expertise!!!
|
05-01-2011, 05:52 AM | #6 |
Chocolate Grasshopper ...
Posts: 27,600
Karma: 20821184
Join Date: Mar 2008
Location: Scotland
Device: Muse HD , Cybook Gen3 , Pocketbook 302 (Black) , Nexus 10: wife has PW
|
A masterful piece of work - and I agree - it should be in the Wiki .....
|
05-01-2011, 08:46 AM | #7 |
Connoisseur
Posts: 57
Karma: 36
Join Date: Aug 2009
Device: ipad, K3, acer aspire switch 10
|
Brilliant summary - and this...
"Realize that the intensive proof reading of a novel will often spoil the later reading experience for you." is all too true |
05-01-2011, 08:56 AM | #8 |
Resident Curmudgeon
Posts: 73,975
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
The OP is describing how to take a PDF downloaded illegally from the net and convert it into some other format. Do we really want to give credence to this illegal activity?
|
05-01-2011, 09:04 AM | #9 |
Guru
Posts: 970
Karma: 4999999
Join Date: Mar 2009
Location: Rosario, Argentina
Device: SONY PRS-505, PRS-T2
|
|
05-01-2011, 10:08 AM | #10 |
Chocolate Grasshopper ...
Posts: 27,600
Karma: 20821184
Join Date: Mar 2008
Location: Scotland
Device: Muse HD , Cybook Gen3 , Pocketbook 302 (Black) , Nexus 10: wife has PW
|
|
05-01-2011, 08:28 PM | #11 | |
Resident Curmudgeon
Posts: 73,975
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
05-01-2011, 08:52 PM | #12 | |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
Quote:
Dale |
|
05-01-2011, 09:04 PM | #13 | |
Resident Curmudgeon
Posts: 73,975
Karma: 128903378
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
05-02-2011, 12:38 AM | #14 | |
Addict
Posts: 351
Karma: 70000
Join Date: Jul 2010
Location: Australia
Device: ADE, iPad
|
Quote:
(In my industry we refer to PDFs by whatever version they were made from, v5.0 being one.)
I have no idea where you are getting your assumption that it's pirated.
I have every edition of Acrobat going back to V4.0 and now CS5 (v9.4.4).
CS3 Acrobat converts PDFs to text like vomit. CS5 does a good job. CS does a better job.
I am constantly converting PDFs to text (Word files) and found the write up excellent. It has already helped me fast track my workflow, and I'm looking forward to his next one.
Last edited by Adjust; 05-02-2011 at 12:40 AM. |
|
05-02-2011, 01:05 AM | #15 |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
I would agree. Many people have reasons to convert PDFs, even PDF files from scans they made themselves. There is no reason to believe this process condones copyright violation or stealing. It has a legitimate purpose.
Dale |
Tags |
conversion, edit, epub, pdf, word |
|