Old 02-11-2016, 06:58 PM   #1
leito360
Help converting a file from HTML to EPUB: the file is split into several pages that I want to merge

Hello.

The problem is as follows:

I have a PDF book that I converted to HTML using pdftohtml (from the same tool family as pdftotext).
The HTML files look good: most of the PDF's formatting has been preserved, and even the indentation is intact.
The problem is that pdftohtml split the book into 239 HTML files, one file per page.

I lightly edited the HTML files to delete the page number at the bottom of each, then converted them to EPUB and later to MOBI, all with calibre. When I read the file on my Kindle, I noticed that the device keeps the per-file layout of the HTML. For example, if page3.html has 5 lines, the Kindle shows those 5 lines and nothing else; when you move on to page4.html, it shows only the lines in that file. It does not merge the lines of page3.html with those of page4.html, even when both belong to the same chapter.

I thought about opening every HTML file and merging them all into one big DOC file while correcting the stray page breaks, but I can't find a way to make Word or a similar program preserve all the indentation the book has, and that's my problem.

Just to be clear, I want to find a way to remove all the page breaks (manually if necessary) while keeping the formatting as clean as possible, especially the indentation, which is my biggest concern.
Is there a way to copy and paste text while keeping the original indentation? If I could do that, I could merge the text of all 239 pages and then create a new ebook file.

Is there a program or way to do this?
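
For reference, here is a rough sketch of the kind of merge I have in mind, written in Python. It is untested against the real files and rests on several assumptions: that the pages are named like page3.html and page4.html, that each one has a single <body> element, and that the indentation lives in the body markup itself (inline styles or non-breaking spaces) rather than in a stylesheet in <head>, which this sketch throws away. The function names (natural_key, extract_body, merge_pages, merge_directory) are made up for illustration.

```python
import re
from pathlib import Path

# Captures everything between <body ...> and </body>, case-insensitively.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def natural_key(name):
    """Sort key that orders page3.html before page10.html."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def extract_body(html):
    """Return the inner markup of <body>, or the whole text if no body is found."""
    match = BODY_RE.search(html)
    return match.group(1).strip() if match else html.strip()


def merge_pages(pages):
    """Join the body content of every page into one HTML document."""
    bodies = [extract_body(page) for page in pages]
    return (
        "<html>\n"
        '<head><meta charset="utf-8"><title>Merged book</title></head>\n'
        "<body>\n"
        + "\n".join(bodies)
        + "\n</body>\n</html>\n"
    )


def merge_directory(src_dir, out_file):
    """Merge every page*.html in src_dir (hypothetical naming) into out_file."""
    files = sorted(Path(src_dir).glob("page*.html"),
                   key=lambda p: natural_key(p.name))
    pages = [f.read_text(encoding="utf-8", errors="replace") for f in files]
    Path(out_file).write_text(merge_pages(pages), encoding="utf-8")
    return len(files)
```

Calling merge_directory("book_pages", "book.html") (both names are placeholders) would write one file that could then go through calibre as before. This only joins the files; a paragraph that was cut in half at a page boundary would still need to be rejoined by hand.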