Old 01-21-2011, 11:02 AM   #1
Archon
Zealot
Posts: 110
Karma: 5176
Join Date: Dec 2010
Device: Mac OSX, iPad, iPod, & Nook
Learning More About Cleaning Up Documents

I would like to learn more about cleaning up text files that need editing, e.g. removing page numbers, adding indents to paragraphs, and creating a template that all text files get converted to.

What should I begin learning about to accomplish this?

I have started to read a little about regular expressions and Perl, and I can use other programs or the Terminal (Mac OS X) to convert files. I am a somewhat experienced computer user but am not a programmer by any means.

I was hoping a Guru could tell me what to focus on to get Calibre to clean these files as they are imported.
OR
Give me a shortlist of what to learn to create scripts or small programs (AppleScript or Perl?) that I could drop a txt or rtf file on and have it cleaned up and converted.


Taking all helpful advice.
Archon
Old 01-21-2011, 11:25 AM   #2
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
I recommend learning Python, mainly because calibre is written in Python and you will be able to use calibre's code as examples to help you. You can then also write on-import plugins that modify the files as you import them.
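
Before tackling a full calibre plugin, a stand-alone script can already handle the two cleanups asked about in the first post. This is only a minimal sketch, assuming page numbers sit alone on their own line and paragraphs are separated by blank lines. The function name clean_text and the four-space indent are illustrative choices, not anything taken from calibre.

```python
import re

# Assumption: a page number is a line holding only digits,
# optionally written as "Page 12" or "- 12 -".
PAGE_NUMBER = re.compile(r"^\s*(?:page\s+)?[-\s]*\d+[-\s]*$", re.IGNORECASE)


def clean_text(text, indent="    "):
    """Drop page-number lines and indent the first line of each paragraph."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER.match(line)]
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", "\n".join(kept))
    cleaned = [indent + p.strip() for p in paragraphs if p.strip()]
    return "\n\n".join(cleaned) + "\n"
```

Note that a legitimate line consisting only of a number (a chapter titled "1984", say) would also be dropped, so it is worth checking the output against the source.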
Old 01-21-2011, 11:31 AM   #3
theducks
Well trained by Cats
Posts: 29,809
Karma: 54830978
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by Archon View Post
I would like to learn more about cleaning up text files that need editing, e.g. removing page numbers, adding indents to paragraphs, and creating a template that all text files get converted to.

What should I begin learning about to accomplish this?

I have started to read a little about regular expressions and Perl, and I can use other programs or the Terminal (Mac OS X) to convert files. I am a somewhat experienced computer user but am not a programmer by any means.

I was hoping a Guru could tell me what to focus on to get Calibre to clean these files as they are imported.
OR
Give me a shortlist of what to learn to create scripts or small programs (AppleScript or Perl?) that I could drop a txt or rtf file on and have it cleaned up and converted.


Taking all helpful advice.
Archon
I take an alternate approach (total EPUB bias here).

I import format x into Calibre.
I fix my metadata first.
Then I convert to EPUB, getting the paragraphs detected properly, and I don't spend a lot of time fine-tuning the regex for that 'perfect' convert.
(My experience: each document needs a slightly different approach. One size does not fit all.)

Then I use Sigil for the rest.
A "clean" book takes less than 5 minutes in Sigil/FlightCrew.
A messy one (Word sourced?) can take 30 minutes to trim the gross cruft.
A really bad one (UC OCR?) might go to an hour-plus or get tossed.

Note: I run multiple monitors, so I can have both versions displayed at once for visual comparison. I also use a programmable keypad with frequently used keystroke patterns (Del Del Space), so my right hand controls the mouse and my left punches a macro button, which reduces the hand motion back and forth.
YMMV
Old 01-21-2011, 11:36 AM   #4
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
I forgot to mention: if you use Python and leverage calibre, you can also use the helper functions in the TXT input (conversion) code. A number of them could be helpful, such as formatting and paragraph type detection.
Old 01-23-2011, 10:41 AM   #5
Archon
Zealot
Posts: 110
Karma: 5176
Join Date: Dec 2010
Device: Mac OSX, iPad, iPod, & Nook
Just wanted to thank those who responded for their input.

I am pursuing regular expressions first, to learn how to clean up documents with poor OCR, page numbers left in, etc.

Thanks again
Archon
Old 01-23-2011, 10:57 AM   #6
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Learning enough regex to get by is the key. There is sufficient material in the forums & online, or in a simple book like "Sams Teach Yourself Regular Expressions in 10 Minutes".

I think the issues break down into:

1. Formatting: mostly line break issues, but there are also malfunctioning special characters like accents & dashes.

2. Chapter detection & headings: once you understand what e-reader software looks for & how calibre handles heuristics, they are not so bad.

3. Typos / scan errors.

Two of the three can be fixed well with calibre & Sigil.

Item 3 needs the most manual input, but if a word is mis-scanned once it's likely to be consistently mis-scanned throughout, so I use find / replace to look for other instances, e.g. find all I'11, replace with I'll.
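
The find / replace idea can be scripted once the same mis-scans keep turning up. A rough sketch in Python, assuming straight apostrophes in the text; the table of fixes and the function name fix_scan_errors are hypothetical, and only the I'11 entry comes from the example above.

```python
import re

# Hypothetical table of recurring OCR mistakes: pattern -> correction.
# "I'11" (the digit one, twice) is the example from the post.
OCR_FIXES = {
    r"\bI'11\b": "I'll",
    r"\bwe'11\b": "we'll",
}


def fix_scan_errors(text, fixes=OCR_FIXES):
    """Apply each fix and report how many replacements it made."""
    counts = {}
    for pattern, replacement in fixes.items():
        # re.subn returns the new text together with the replacement count.
        text, counts[pattern] = re.subn(pattern, replacement, text)
    return text, counts
```

The per-pattern counts give a quick sanity check: a fix that fires thousands of times in one book is probably matching something it should not.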

There is a free spell-check tool (microspell) that will work with Sigil and/or scan your txt files, but I've found it too troublesome for regular use. It has to be taught all the proper nouns anyway, then it develops a tendency to accept some scan errors as uncommon words & has to be retrained. If you have the patience, it does have lots of options & can learn as it goes. http://www.microspell.com/ seems to work OK in Windows 7.

Stripping headers & footers (& page numbers) is only an issue with .pdf sources, which are best avoided anyway.

When it comes to fine-tuning book appearance, a little understanding of HTML styles & the stylesheet.css file (editable in Sigil) goes a long way. I keep googling & asking whenever I bump into something that I don't fully understand.

Last edited by cybmole; 01-23-2011 at 11:01 AM.
Old 01-23-2011, 11:19 AM   #7
user_none
Sigil & calibre developer
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
To learn regular expressions, start with the exact text you want to replace and swap parts of it for regular expression tokens, one step at a time. Say you have:

Code:
<div><h4>Book Author</h4><h6>Page 1 of 365</h6></div>
Start by replacing the parts that change. Here the page number becomes \d{1,3}, meaning one to three digits:

Code:
<div><h4>Book Author</h4><h6>Page \d{1,3} of 365</h6></div>
Then allow for optional whitespace between the pieces with \s*:

Code:
<div>\s*<h4>\s*Book\s*Author\s*</h4>\s*<h6>\s*Page\s*\d{1,3}\s*of\s*365</h6>\s*</div>
And so forth.

Also, the trickiest part of regular expressions is a pattern matching things you didn't expect it to match. calibre has a regular expression wizard in the search and replace section of the conversion options that shows you all matches for your regex in the document. Use it, or use a text editor that can highlight all matches. It will save you a lot of time and trouble.
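
The same check-before-replace habit can be sketched in Python. This is only an illustration: the sample HTML is made up, and the pattern is the third one from this post, split over two lines for readability.

```python
import re

# Third pattern from the post, split over two lines for readability.
HEADER = re.compile(
    r"<div>\s*<h4>\s*Book\s*Author\s*</h4>\s*"
    r"<h6>\s*Page\s*\d{1,3}\s*of\s*365</h6>\s*</div>"
)

# Made-up sample input; real converter output will differ.
sample = (
    "<p>End of one page.</p>\n"
    "<div><h4>Book Author</h4><h6>Page 1 of 365</h6></div>\n"
    "<p>Start of the next.</p>\n"
    "<div>\n  <h4>Book Author</h4>\n  <h6>Page 42 of 365</h6>\n</div>\n"
)

# Step 1: list every match so surprises show up before anything is deleted.
matches = [m.group(0) for m in HEADER.finditer(sample)]
for text in matches:
    print(repr(text))

# Step 2: only then remove them.
cleaned = HEADER.sub("", sample)
```

Both header variants match here, the compact one and the one spread over several lines, which is what the added \s* tokens are for.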
Old 01-23-2011, 05:02 PM   #8
Archon
Zealot
Posts: 110
Karma: 5176
Join Date: Dec 2010
Device: Mac OSX, iPad, iPod, & Nook
Hmm, yeah, I am not going to get to that level that quickly.

I had a book that someone OCRd and it had replaced all the italics with _foo_.

I was quite pleased with myself for using:

Find
_(.+)_

and replacing with

\\i \1 \\i0

With some tweaking I was able to replace a couple hundred occurrences, including those followed by spaces, commas and periods.

Now the book is back closer to its original with italics in it.

Not bad for my first foray, but I am sure I will learn how to search and replace with just one search instead of repeating it with different variations.
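
One likely reason several variations were needed: .+ is greedy, so on a line with two italic runs it grabs everything from the first underscore to the last. A sketch of a single-pass alternative in Python, using a character class that cannot cross an underscore. The sample sentence is made up, and wrapping the text in an RTF group, {\i ...}, is a swapped-in choice rather than the \i ... \i0 pair used above.

```python
import re

line = "He read _Moby Dick_, then _Emma_."

# Greedy: one match spanning both italic runs and the text between them.
greedy = re.sub(r"_(.+)_", r"{\\i \1}", line)

# [^_\n]+ stops at the next underscore (or line end), so each run
# is matched separately in a single pass.
single_pass = re.sub(r"_([^_\n]+)_", r"{\\i \1}", line)

print(greedy)
print(single_pass)
```

Text that legitimately contains underscores (file names, for instance) would still be caught, so it pays to look over the matches first.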

For anyone interested, Bare Bones Software has a free program called TextWrangler for Mac OS X that does regular expression search and replace.
http://www.barebones.com/products/te.../download.html

I will keep at it though until I understand everything you typed.
Thanks for the help
Archon
Old 01-23-2011, 05:20 PM   #9
Archon
Zealot
Posts: 110
Karma: 5176
Join Date: Dec 2010
Device: Mac OSX, iPad, iPod, & Nook
Hmm, yeah, I couldn't find where Sigil does regex search and replace.

Calibre might be a little unwieldy for editing files, since I would like to work on a copy before adding it to Calibre; that way, if it gets too screwed up, I can trash it.

So I am left with TextWrangler or BBEdit at the moment. They can work on just about any file, including txt, rtf, html, Java, C++, etc.

I will also try to learn some of the HTML and RTF tagging so I can understand how to make some basic format changes.

If you have ever seen a Feedbooks EPUB, that is kind of what I would be going for.

Happy Sunday
Archon
Old 01-23-2011, 05:55 PM   #10
Manichean
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
I don't know if you've found that already, but there's a tutorial available in the manual that tries to explain regexes and their use in Calibre.
Old 01-23-2011, 08:56 PM   #11
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by Archon View Post
Hmm, yeah, I am not going to get to that level that quickly.

I had a book that someone OCRd and it had replaced all the italics with _foo_.

I was quite pleased with myself for using:

Find
_(.+)_

and replacing with

\\i \1 \\i0

With some tweaking I was able to replace a couple hundred occurrences, including those followed by spaces, commas and periods.

Now the book is back closer to its original with italics in it.

Not bad for my first foray, but I am sure I will learn how to search and replace with just one search instead of repeating it with different variations.

For anyone interested, Bare Bones Software has a free program called TextWrangler for Mac OS X that does regular expression search and replace.
http://www.barebones.com/products/te.../download.html

I will keep at it though until I understand everything you typed.
Thanks for the help
Archon
You should also check out the 'Heuristics' section of Calibre's conversion settings. A lot of the things you want to do may be covered there already. For example, the _foo_ case for italics is already covered there under the 'italicize common cases' function. It was still a good learning exercise for you, so not all was lost, but in many common cases the work has been done.

TextWrangler is a good app. It supports most of Python's regex syntax, and I use it for testing most of my expressions when debugging.

Last edited by ldolse; 01-23-2011 at 10:57 PM.
Old 01-23-2011, 10:54 PM   #12
DoctorOhh
US Navy, Retired
Posts: 9,864
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by Archon View Post
Hmm, yeah, I couldn't find where Sigil does regex search and replace.
Right where you would expect it: in Code View, open the Find and Replace window and check Regular Expression under search mode. But this is a topic for the Sigil forum.
Old 01-24-2011, 02:55 AM   #13
Agama
Guru
Posts: 776
Karma: 2751519
Join Date: Jul 2010
Location: UK
Device: PW2, Nexus7
Quote:
Originally Posted by user_none View Post
I recommend learning Python, mainly because calibre is written in Python and you will be able to use calibre's code as examples to help you. You can then also write on-import plugins that modify the files as you import them.
I have written some ePub cleanup tools which hook into calibre's Tweak ePub, but they are all in VBScript. Can you recommend any resources (books/websites) for learning Python so that I can rewrite them as calibre plugins?
Old 01-24-2011, 03:37 AM   #14
Manichean
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
Quote:
Originally Posted by Agama View Post
I have written some ePub cleanup tools which hook into calibre's Tweak ePub, but they are all in VBScript. Can you recommend any resources (books/websites) for learning Python so that I can rewrite them as calibre plugins?
Python comes with an excellent tutorial. I've found that quite sufficient when you have prior programming knowledge.
Old 01-24-2011, 03:42 AM   #15
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by Agama View Post
I have written some ePub cleanup tools which hook into calibre's Tweak ePub, but they are all in VBScript. Can you recommend any resources (books/websites) for learning Python so that I can rewrite them as calibre plugins?
What do the epub cleanup tools do, and do you have them linked anywhere? One of my near-future projects was to look into extending Tweak ePub to apply various tweaks that exist in the conversion pipeline but don't necessarily require the full conversion process: smarten punctuation, dehyphenate, etc. It sounds like some of these may apply.