| 
			
			 | 
		#1 | 
| 
			
			
			
			 Connoisseur 
			
			Posts: 81 
				Karma: 10 
				Join Date: Aug 2010 
				Location: Murcia/Spain 
				
				
				Device: Android 12 
				
				
				 | 
	
	
	
		
		
			
			 
				
				converting PDF to ePub tips
			 
			
			
			I understand that PDF is not very friendly to convert to ePub. I've tried a few online converters (most seem to use Calibre) and the result requires a lot of work. A browser can easily open a PDF file and keep the formatting, so I'm wondering if anyone has tried converting PDF to ePub using the following method:

	1 - open the PDF file in a browser (Brave in my case)
	2 - copy the text from the browser and paste it into an ePub editing app (Sigil)
	3 - copy/paste links/photos manually

	Step 3 requires a lot of work if there are a lot of links/photos. Does anyone know any tools/tips to make it automatic, or at least easier than doing it manually?

	PS: OS is MX Linux  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#2 | 
| 
			
			
			
			 Bibliophagist 
			
			Posts: 48,175 
				Karma: 174315444 
				Join Date: Jul 2010 
				Location: Vancouver 
				
				
				Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Michael, please note that posting multiple copies of the same message is not permitted on MobileRead. Posting the same message in this forum and the Sigil forum is unlikely to get you any extra help and is annoying to those who find themselves starting to reply before realizing that they already replied.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#3 | 
| 
			
			
			
			 Fanatic 
			
			Posts: 531 
				Karma: 2268308 
				Join Date: Nov 2015 
				
				
				
				Device: none 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			What I found very useful was using LibreOffice Draw to convert PDF to a flat ODG document, which can be easily converted to a markup language like XML. 
		
	
		
		
		
		
		
		
		
		
		
		
	
	This is the only sure way to work around paragraphs, fonts, and page breaks. Automatic tools often fail with those.  | 
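If the drawing is saved from LibreOffice Draw as a flat XML file, the text can be pulled out with a few lines of Python. This is only a rough sketch: it assumes the file uses the standard OpenDocument `text:p` elements, and the `SAMPLE` fragment and the function name are made up for illustration. A real export will be much larger and will need per-document handling of fonts and positions.

```python
import xml.etree.ElementTree as ET

# OpenDocument namespace for text elements such as text:p and text:span.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical, heavily trimmed fragment of a flat ODG file (illustration only).
SAMPLE = """<office:document
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:body><office:drawing><draw:page>
    <draw:frame><draw:text-box>
      <text:p>Chapter <text:span>One</text:span></text:p>
    </draw:text-box></draw:frame>
    <draw:frame><draw:text-box>
      <text:p>It was a dark night.</text:p>
    </draw:text-box></draw:frame>
  </draw:page></office:drawing></office:body>
</office:document>"""


def paragraphs_from_flat_odg(xml_text):
    """Return the non-empty text of every text:p element, in document order."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.iter(f"{{{TEXT_NS}}}p"):
        # itertext() also collects text inside nested text:span elements.
        text = "".join(p.itertext()).strip()
        if text:
            paragraphs.append(text)
    return paragraphs


if __name__ == "__main__":
    for line in paragraphs_from_flat_odg(SAMPLE):
        print(line)
```

Each text box in Draw usually holds one line or one block, so the output still needs the paragraphs joined up by hand or by script.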
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#4 | |
| 
			
			
			
			 Connoisseur 
			
			Posts: 81 
				Karma: 10 
				Join Date: Aug 2010 
				Location: Murcia/Spain 
				
				
				Device: Android 12 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#5 | |
| 
			
			
			
			 Connoisseur 
			
			Posts: 81 
				Karma: 10 
				Join Date: Aug 2010 
				Location: Murcia/Spain 
				
				
				Device: Android 12 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#6 | 
| 
			
			
			
			 Wizard 
			
			Posts: 2,874 
				Karma: 10700629 
				Join Date: May 2016 
				Location: Canada 
				
				
				Device: Onyx Nova 
				
				
				 | 
	
	|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#7 | 
| 
			
			
			
			 Guru 
			
			Posts: 681 
				Karma: 929286 
				Join Date: Apr 2014 
				
				
				
				Device: PW-3, iPad, Android phone 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			You need to be aware that there are several different kinds of PDFs.

	One kind is made by scanning pages, so these are sets of images of text. Many apps, like Adobe Acrobat, will try to OCR this and embed an invisible text layer, which allows you to search for text and copy chunks of it.

	The other kind is created by a DTP app, like InDesign. The text is actually text, not images of text; if you zoom in to the maximum it will still be smooth.

	With either of these, you can get the text cursor, select all, copy, and paste into Word (or equivalent). From there, save as "Web Page, Filtered" and you get HTML, which you can import into Sigil or Calibre.

	You will then have a lot more work to clean it up. If it was OCR text, there will be lots of errors; spellcheck will help find many. You also need to review the page breaks, which insert spurious paragraph breaks. Then rationalise the CSS.

	If it's a scanned document, you can also try a full OCR app like ABBYY, which can export directly to HTML, and some versions do epub.

	And there is a command-line app, "pdftohtml", which does as it says. See https://poppler.freedesktop.org/

	No matter how you do it, you need to invest hours at least to clean up and check. Otherwise you will get garbage; see the awful epubs automatically created by OCR at Internet Archive.  | 
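The spurious paragraph breaks at page boundaries can be partly repaired with a script before the manual review. The sketch below is one possible heuristic, not something from any of the tools named above: it joins a paragraph to the previous one when the previous one does not end in closing punctuation and the new one starts with a lowercase letter. The function name and the punctuation list are assumptions to adjust per book.

```python
# Characters that plausibly end a real paragraph (an assumption; tune per book).
PARAGRAPH_END = (".", "!", "?", ":", ";", '"', "'", "”", "’", ")")


def merge_broken_paragraphs(paragraphs):
    """Join paragraphs that look like they were split by a page break."""
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # drop empty paragraphs left over from the conversion
        looks_unfinished = bool(merged) and not merged[-1].endswith(PARAGRAPH_END)
        if looks_unfinished and para[0].islower():
            # Previous paragraph stops mid-sentence and this one continues it.
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged


if __name__ == "__main__":
    pages = ["The ship left the", "harbour at dawn.", "Nobody waved."]
    print(merge_broken_paragraphs(pages))
```

It will miss breaks that fall right after a full stop and may wrongly join headings, so the result still needs to be read through.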
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#8 | 
| 
			
			
			
			 Resident Curmudgeon 
			
			Posts: 80,782 
				Karma: 150249619 
				Join Date: Nov 2006 
				Location: Roslindale, Massachusetts 
				
				
				Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			The only way you can 100% be sure that the converted PDF is error free is to do a 100% A/B comparison of the PDF and the resulting ePub. 
		
	
		
		
		
		
		
		
		
		
		
		
	
	Do you want to do this? Do you have to convert PDF > ePub?  | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#9 | 
| 
			
			
			
			 Connoisseur 
			
			Posts: 81 
				Karma: 10 
				Join Date: Aug 2010 
				Location: Murcia/Spain 
				
				
				Device: Android 12 
				
				
				 | 
	
	|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#10 | |
| 
			
			
			
			 Connoisseur 
			
			Posts: 81 
				Karma: 10 
				Join Date: Aug 2010 
				Location: Murcia/Spain 
				
				
				Device: Android 12 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
  | 
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#11 | 
| 
			
			
			
			 Connoisseur 
			
			Posts: 81 
				Karma: 10 
				Join Date: Aug 2010 
				Location: Murcia/Spain 
				
				
				Device: Android 12 
				
				
				 | 
	
	|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#12 | 
| 
			
			
			
			 Still reading 
			
			Posts: 15,004 
				Karma: 111111255 
				Join Date: Jun 2017 
				Location: Ireland 
				
				
				Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			Probably neither.
		 
		
	
		
		
		
		
		
		
		
		
		
		
	
	 | 
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#13 | 
| 
			
			
			
			 Bibliophagist 
			
			Posts: 48,175 
				Karma: 174315444 
				Join Date: Jul 2010 
				Location: Vancouver 
				
				
				Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos 
				
				
				 | 
	
	
	
		
		
		
		
		 
			
			The tool used to do that is your Mark 1 eyeball. I suspect that if you want someone else to use their eyeballs on your behalf, it will get expensive. 
		
	
		
		
		
		
		
		
		
		
		
		
	
	You will have to bring up both the ePub and PDF on your screen and check line by line for differences.  | 
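A script cannot replace that read-through, but it can point the eyeball at the places most likely to be wrong. The sketch below assumes you have already extracted plain text from both the PDF and the ePub by whatever means you prefer; it uses Python's standard `difflib` to list word-level differences and ignores line breaks. The function name is made up for illustration, and formatting errors (italics, headings, images) are not detected at all.

```python
import difflib


def word_differences(pdf_text, epub_text):
    """List (kind, pdf_words, epub_words) for every place the two texts differ."""
    # split() ignores line breaks and repeated spaces, so only wording is compared.
    pdf_words = pdf_text.split()
    epub_words = epub_text.split()
    matcher = difflib.SequenceMatcher(a=pdf_words, b=epub_words, autojunk=False)
    differences = []
    for kind, i1, i2, j1, j2 in matcher.get_opcodes():
        if kind != "equal":
            differences.append(
                (kind, " ".join(pdf_words[i1:i2]), " ".join(epub_words[j1:j2]))
            )
    return differences


if __name__ == "__main__":
    pdf = "The modern reader\nwants clean text."
    epub = "The modem reader wants clean text."
    for kind, old, new in word_differences(pdf, epub):
        print(f"{kind}: PDF has {old!r}, ePub has {new!r}")
```

If the PDF text itself came from OCR, both sides may share the same error and nothing will be flagged, so this only narrows the search.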
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#14 | |
| 
			
			
			
			 Evangelist 
			
			Posts: 454 
				Karma: 3886916 
				Join Date: May 2013 
				Location: Ontario, Canada 
				
				
				Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 For OCR, try OCRFeeder as a front end to Tesseract. Tesseract is very accurate given a good image. I do a page at a time, defining the text area manually. It can then handle multiple columns, advertisements, "continued on page 99" and so on. OCRFeeder is very good at connecting the lines into correct paragraphs, dealing with end-of-line hyphens, and so on.

 This might seem slow, but it is actually the quick part of the process... you will have to proofread and correct no matter what.

 pdftopng will get images out of PDFs; that works better than OCRing the PDF itself. ImageMagick can tame image files that are too large and slow down Tesseract. Scan Tailor Advanced and Unpaper may be useful; I find them black magic, but I use them if needed.

 If you want to try to use the existing text, pdftohtml will sometimes fail while pdftotext will work. No idea why. If you use pdftotext, try the -layout option and get ready for a lot of regex to tame the spacing.  | 
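As a rough idea of what that regex work can look like, here is a small Python sketch for text already produced by pdftotext in layout mode. The function name and the exact patterns are assumptions, not a recipe from the post: it rejoins words split by an end-of-line hyphen and collapses the padding spaces that layout mode adds.

```python
import re


def tidy_layout_text(text):
    """Undo end-of-line hyphenation and collapse alignment spaces."""
    # "conver-\n   sion" -> "conversion". Note this also removes the hyphen
    # from real compounds such as "well-\nknown", so review the result.
    text = re.sub(r"(\w)-\n[ \t]*(\w)", r"\1\2", text)
    # Layout mode pads with runs of spaces to mimic columns; keep a single one.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Remove spaces left at the end and start of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n[ \t]+", "\n", text)
    return text.strip()


if __name__ == "__main__":
    raw = "   The conver-\n   sion    went   well.  \n"
    print(tidy_layout_text(raw))
```

Multi-column pages usually need more than this, since layout mode keeps the columns side by side on the same line.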
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
| 
			
			 | 
		#15 | |
| 
			
			
			
			 Evangelist 
			
			Posts: 454 
				Karma: 3886916 
				Join Date: May 2013 
				Location: Ontario, Canada 
				
				
				Device: Kindle KB, Oasis, Pop_Os!, Kobo Forma 
				
				
				 | 
	
	
	
		
		
		
		
		 Quote: 
	
 Dealing with epub images is a pain and depends on your audience... what do they use for readers?

 The most basic advice for physical ereaders like Kindle or Kobo is to declare the width as a percentage and the height as auto in the CSS... don't use absolute units. Then surround the <img...> line with a <p> or <div> to make it center/right/left. Like this:

    <p class="center">
      <img alt="" class="widepic" src="../Images/c02.jpg"/>
    </p>

 where "widepic" says in the CSS:

    .widepic {
      height: auto;
      width: 98%;
    }

 And "center" is just that:

    .center {
      text-align: center;
      text-indent: 0;
      margin: .5em 0 .5em 0;
    }

 So you can pull the images into the Writer version, and apply something like this at the epub editor stage. I can usually replace what the conversion does with this using some simple regex. This is almost always readable on various readers or apps.  | 
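A regex replacement of that kind might look roughly like the Python sketch below. It is an illustration under assumptions, not the poster's actual regex: it only touches images that sit alone inside a `<p>` or `<div>`, which is a common shape for converter output, and it keeps just the `src` while discarding fixed widths, heights and any existing alt text. The pattern and function name are made up for the example.

```python
import re

# An <img> that is the only content of a <p> or <div>, with any attributes.
LONE_IMAGE = re.compile(
    r'<(p|div)\b[^>]*>\s*<img\b[^>]*?\bsrc="([^"]+)"[^>]*?/?>\s*</\1>'
)

# Wrapper using the "center" and "widepic" classes described in the post.
WRAPPED = r'<p class="center"><img alt="" class="widepic" src="\2"/></p>'


def wrap_images(xhtml):
    """Replace converter-generated image blocks with the percent-width markup."""
    return LONE_IMAGE.sub(WRAPPED, xhtml)


if __name__ == "__main__":
    before = (
        '<div style="width:600px">'
        '<img src="../Images/c02.jpg" width="600" height="400"/></div>'
    )
    print(wrap_images(before))
```

Images mixed in with text in the same paragraph are left alone on purpose, because wrapping them blindly could produce nested `<p>` elements. The same pattern can be adapted for Sigil's or Calibre's regex search and replace, though their replacement syntax may differ.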
|
| 
		 | 
	
	
	
		
		
		
		
			 
		
		
		
		
		
		
		
			
		
		
		
	 | 
 | 
            
        
            
| Tags | 
| epub, pdf conversion, tip, tool tips | 
            
  | 
    
			 
			Similar Threads
		 | 
	||||
| Thread | Thread Starter | Forum | Replies | Last Post | 
| Large book with a 2-column structure, PDF -> EPUB .. any tips on this one? | cow | Conversion | 4 | 03-09-2022 05:29 PM | 
| A little help converting ePUB to PDF | GuilleCrK | Conversion | 8 | 01-07-2019 12:05 PM | 
| Ultimate PDF to Epub/Mobi conversion tips | sinan | Workshop | 43 | 08-01-2017 01:46 AM | 
| converting pdf to epub | Gagan | ePub | 65 | 06-29-2017 12:57 AM | 
| Converting PDF Tips | baker2gs | Amazon Kindle | 4 | 03-10-2010 11:53 PM |