Thread: PDF to text
Old 07-28-2010, 12:26 PM   #14
mike_bike_kite
I can't understand why there aren't simple post-processors for the extracted text. Take the text output and join the lines together unless they end in a full stop, a question mark or a double quote.
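
Here is a rough sketch of that joining rule in Python, just to show the idea. The function name `join_lines` and the choice to treat a curly closing quote the same as a straight one are my own assumptions, and a line that happens to end in a full stop mid-paragraph will still be split in the wrong place.

```python
# Line endings assumed to close a paragraph: full stop, question mark,
# straight double quote, and (an assumption) the curly closing quote.
SENTENCE_END = ('.', '?', '"', '\u201d')

def join_lines(text):
    """Join hard-wrapped lines unless a line ends with one of SENTENCE_END."""
    paragraphs = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line also closes the paragraph collected so far.
            if current:
                paragraphs.append(' '.join(current))
                current = []
            continue
        current.append(line)
        if line.endswith(SENTENCE_END):
            paragraphs.append(' '.join(current))
            current = []
    if current:
        # Keep any trailing text that never reached a sentence end.
        paragraphs.append(' '.join(current))
    return '\n'.join(paragraphs)
```

Feeding it the raw text from the PDF converter should give one paragraph per output line, which is a much easier starting point for any later clean-up.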

You may also need to remove page numbers, if present, and any chapter titles that are repeated at the top of each page. I managed to get this far, but then found various odd characters in the text standing in for letter pairs such as a double l, and these need to be converted back to plain letters.
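
A minimal sketch of that clean-up step, under a couple of assumptions: page numbers sit alone on their own line, the repeated page-top titles are known in advance (the `running_titles` argument is hypothetical), and the odd characters are the standard Unicode ligature code points. The characters a given PDF actually produces may be different, so the table below would need adjusting. Note that dropping every copy of a title also drops the real chapter heading, so run it only on the repeats if you want the headings later.

```python
import re

# Assumed mapping: the Unicode Latin ligature code points and their plain spellings.
LIGATURES = str.maketrans({
    '\ufb00': 'ff',
    '\ufb01': 'fi',
    '\ufb02': 'fl',
    '\ufb03': 'ffi',
    '\ufb04': 'ffl',
})

def strip_page_furniture(lines, running_titles=()):
    """Drop lines that are only a page number or a known repeated title."""
    titles = {t.strip().lower() for t in running_titles}
    kept = []
    for line in lines:
        stripped = line.strip()
        if re.fullmatch(r'\d+', stripped):
            continue  # a bare number on its own line is treated as a page number
        if stripped.lower() in titles:
            continue  # repeated chapter title from the top of a page
        kept.append(line)
    return kept

def expand_ligatures(text):
    """Replace single-glyph ligatures with their plain-letter equivalents."""
    return text.translate(LIGATURES)
```

Running `strip_page_furniture` before the lines are joined matters, because once a page number has been merged into a paragraph it is much harder to spot.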

My aim was to finally generate HTML and then use the chapter titles to create a TOC. I got halfway there but considering how clever tools like Calibre are, it surprised me that this wasn't done automatically.
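
For what it's worth, the HTML and TOC step could look something like the sketch below. It assumes the text is already split into paragraphs and that chapter headings look like "Chapter 1" or "Chapter 2: Title"; the `CHAPTER_RE` pattern and the `to_html` name are made up for illustration, and a real book would probably need a different pattern.

```python
import html
import re

# Assumption: a chapter heading is a paragraph starting with "Chapter <number>".
CHAPTER_RE = re.compile(r'^Chapter\s+\d+.*$', re.IGNORECASE)

def to_html(paragraphs):
    """Return an HTML fragment: a linked TOC list followed by headings and paragraphs."""
    toc = []
    body = []
    for para in paragraphs:
        stripped = para.strip()
        if not stripped:
            continue
        text = html.escape(stripped)  # escape &, <, > and quotes
        if CHAPTER_RE.match(stripped):
            anchor = f'ch{len(toc) + 1}'
            toc.append(f'<li><a href="#{anchor}">{text}</a></li>')
            body.append(f'<h2 id="{anchor}">{text}</h2>')
        else:
            body.append(f'<p>{text}</p>')
    return '\n'.join(['<ul>'] + toc + ['</ul>'] + body)
```

The result is only a fragment, so it would still need wrapping in a full page before being handed to something like Calibre.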