01-21-2011, 11:31 AM   #3
theducks
Well trained by Cats
Posts: 29,803
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by Archon View Post
I would like to learn more about cleaning up text files that need editing i.e. removing page numbers, adding indents to paragraphs, and creating a template for all text files to be converted to.

What should I begin learning about to accomplish this?

I have started to read a little about regular expressions and perl and I can use other programs or the Terminal (Mac OSX) to convert files. I am a somewhat experienced computer user but am not a programmer by any means.

I was hoping a Guru could tell me what to focus on to get Calibre to clean these files as they are imported.
OR
Give me a shortlist of what to learn to create scripts or small programs (Applescript or perl?) that I could drop a txt file or rtf file on and have it cleaned up and converted.


Taking all helpful advice.
Archon
I take an alternate approach (total EPUB bias here).

I import format X into Calibre.
I fix my metadata first.
Then I convert to EPUB, getting the paragraphs detected properly, and I don't spend a lot of time fine-tuning the regex for that 'perfect' conversion.
(My experience: each document needs a slightly different approach. One size does not fit all.)

Then I use Sigil for the rest.
A "clean" book takes less than 5 minutes in Sigil/FlightCrew.
A messy one (Word-sourced?) can take 30 minutes to trim the gross cruft.
A really bad one (UC OCR?) might go to an hour-plus or get tossed.

Note: I run multiple monitors, so I can have both versions displayed at once for visual comparison. I also use a programmable keypad with frequently used keystroke patterns (Del Del space), so my right hand controls the mouse and the left punches a macro button, which cuts down on moving my hands back and forth.
YMMV
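
For the regex side of the question, here is a minimal sketch in Python of the kind of cleanup being asked about (stripping page numbers and rejoining hard-wrapped paragraphs). It is only an illustration, not what Calibre does internally. It assumes a page number is a line holding nothing but digits and that blank lines separate paragraphs; the function names are made up for the example.

```python
import re

def strip_page_numbers(text: str) -> str:
    """Drop lines holding only digits (ASSUMPTION: those are page numbers)."""
    kept = [line for line in text.splitlines()
            if not re.fullmatch(r"\s*\d+\s*", line)]
    return "\n".join(kept)

def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines (ASSUMPTION: blank lines separate paragraphs)."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace inside a paragraph to a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

def clean(text: str) -> str:
    """Hypothetical helper: remove page numbers, then rebuild paragraphs."""
    return unwrap_paragraphs(strip_page_numbers(text))

sample = "It was a dark\nand stormy night.\n\n12\n\nThe rain fell\nin torrents.\n"
cleaned = clean(sample)
```

Be aware that a rule this blunt also eats any other digits-only line (a year used as a chapter heading, say), which is one reason each document ends up needing its own tweaks.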