learning to convert docs

mike_bike_kite · 04-27-2010, 07:11 AM

I'm a bit new at this and have been trying to convert a few PDF books to EPUB to read in FBReader on my old Nokia N800. The PDF looks fine on my computer and on my N800, but I wanted to learn to convert using Calibre. I know regexps etc., but I don't understand these XPATH lines and can't see how they apply to non-HTML files.

Problems I'm having:

Partial lines seem to get treated as new paragraphs, so I get this:

Quote:

the passengers rushed to view the cosmic visitor that had fallen from

the sky. But it was impossible to examine the burning hot meteorite

in any detail. Later, when the meteorite cooled, it was trenched

Chapters aren't recognised. Obviously I need to enter some kind of format, but I'm not sure what. Chapters look like this:

Quote:

7. This is the new chapter

I tried copying an image from the web and pasting it into Calibre on the meta info page. It showed the image there but didn't seem to display it later. Perhaps I'm missing something?

I'm using the defaults for both input and output devices. Is this a good idea for a non-ereader device like an N800? It has a wide 800*600 screen (I think).

Thanks for any advice or useful links

Mike
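For text that has already been extracted from the PDF, a heading such as "7. This is the new chapter" could be picked out with an ordinary regular expression. The following is only a minimal sketch in Python under that assumption; it is not how Calibre detects chapters (Calibre applies XPath expressions to its intermediate HTML), and the names CHAPTER_HEADING and find_chapters are hypothetical.

```python
import re

# Assumption: a chapter heading is a line of the form "<number>. <title>",
# e.g. "7. This is the new chapter". Numbered list items in the body text
# would match too, so treat the result as candidates to review by hand.
CHAPTER_HEADING = re.compile(r"^\s*(\d{1,3})\.\s+(\S.*?)\s*$")


def find_chapters(text):
    """Return (line_number, chapter_number, title) for each candidate heading."""
    chapters = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        match = CHAPTER_HEADING.match(line)
        if match:
            chapters.append((line_no, int(match.group(1)), match.group(2)))
    return chapters


if __name__ == "__main__":
    sample = "some body text\n7. This is the new chapter\nmore body text\n"
    for line_no, number, title in find_chapters(sample):
        print(line_no, number, title)
```

Requiring whitespace after the dot keeps a line that starts with a decimal such as "7.5" from being mistaken for a heading.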

mike_bike_kite · 04-29-2010, 08:52 PM

Interestingly, I still get all sorts of issues when converting from PDF to TXT. My aim was to just grab the text and then do the formatting with an editor like vi. Strangely, the TXT has many odd artefacts, like double L's appearing as one L followed by a few strange graphic characters.

I do understand that PDFs are very poor as a container of text but I thought I might be able to convert my pdf files to epub (or even just txt) with the intention of picking a suitable ereader - I guess I'm stuck on getting one that can display the pdfs well.

kovidgoyal · 04-29-2010, 09:42 PM

http://calibre-ebook.com/user_manual...ture-detection

http://calibre-ebook.com/user_manual...-pdf-documents

As for the double ll glyph, that's a bug, which won't be fixed until calibre's new PDF engine is done.

mike_bike_kite · 04-30-2010, 05:09 AM

Yep - I'd read those pages. I also understand HTML and, to a lesser extent, XML. The problem is that I'm trying to write small bits of code in Calibre using a language I don't know (XPATH) to process the contents of a file I can't see the contents of (PDF), and for some strange reason I seem to be having problems.

If I could just view the text then I could write a little program to stitch things back together. Are there converters that perhaps perform OCR on the PDF and just output the text?

Mike
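Once the text is available as a plain file, the "little program to stitch things back together" could look something like the sketch below. It is a heuristic, not a Calibre feature: it assumes that blank lines are unreliable (as in the quoted sample, where every line is followed by one) and that a paragraph only ends on a line finishing with sentence punctuation. The name stitch_lines is hypothetical.

```python
import re

# Assumption: a paragraph ends on a line whose last character is ., ?, ! or :
# (optionally followed by a closing quote or bracket). Lines that end
# mid-sentence are joined to the line that follows.
_PARAGRAPH_END = re.compile(r"[.?!:][\"')\]]*$")


def stitch_lines(text):
    """Join hard-wrapped lines into paragraphs separated by one blank line."""
    paragraphs = []
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information here, so skip them
        current.append(line)
        if _PARAGRAPH_END.search(line):
            paragraphs.append(" ".join(current))
            current = []
    if current:  # keep any trailing, unterminated text
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = (
        "the passengers rushed to view the cosmic visitor that had fallen from\n\n"
        "the sky. But it was impossible to examine the burning hot meteorite\n\n"
        "in any detail. Later, when the meteorite cooled, it was trenched\n"
    )
    print(stitch_lines(sample))
```

A paragraph whose last line happens to wrap exactly at a sentence end will be split early, so the output still needs a read-through.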

kovidgoyal · 04-30-2010, 07:44 AM

Read this: http://calibre-ebook.com/user_manual...rsion.html#id7

In particular, see the section on the debug option, which will give you access to the text at the intermediate stages of conversion.
