Old 10-01-2009, 11:14 PM   #1
Phil_C
Zealot
Posts: 100
Karma: 27924
Join Date: Oct 2009
Location: Chicago, IL USA
Device: Sony PRS-300Black/350Silver/350Blue/T3Black x2
Regular Expressions help needed

Hi All --- first post here ---

I've been playing with my Sony Reader for three weeks now and trying to format various books found around the Web. Researching these forums has led me to calibre and Book Designer, both excellent programs.

I'm having a problem cleaning up some .lit files for conversion to .lrf in BD because of page numbers and the breaks caused by them. I have been able to get rid of the numbers by use of simple Regular Expressions (which I had never heard of until now) in Book Cleaner.

But I can't figure out how to deal with the empty spaces and lines left by page breaks after the page numbers are removed. What remains is the broken end of a sentence, two blank lines after that, then the broken sentence continuing on what previously was the next page. So I need to get rid of all the space and make the sentence whole again.

I fixed one book by visually scanning several hundred pages and deleting the offending spaces manually, but don't want to do that again!

This must be a common problem, so I hope someone here can give me a clue. I think I do not understand exactly what is in all that blank space, or how to tell the Reg Exp where to begin and end.

Thanks for any help.

Phil
Old 10-02-2009, 08:04 AM   #2
jheaney
Member
Posts: 11
Karma: 10
Join Date: Oct 2009
Location: Sutton, Surrey, England
Device: PRS-505
Not sure if this uses the standard Unix reg exp, but if it does then ^$ should match a blank line
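
A minimal sketch of that idea in Python, purely as an illustration (Python is an assumption here; Book Cleaner's dialect may behave differently, and the sample string is made up). In Python, `^` and `$` only anchor at every line when `re.MULTILINE` is set, and `^$` does not match a line that holds only spaces, so the second pattern allows optional spaces and tabs:

```python
import re

# Made-up sample: one truly empty line and one line holding three spaces.
sample = "end of page\n\n   \nstart of next page"

# With re.MULTILINE, ^ and $ anchor at every line, so ^$ finds empty lines.
empty_lines = re.findall(r"^$", sample, flags=re.MULTILINE)

# Lines that hold only spaces or tabs look blank but are not matched by ^$;
# allowing [ \t]* between the anchors catches them as well.
blank_lines = re.findall(r"^[ \t]*$", sample, flags=re.MULTILINE)

print(len(empty_lines), len(blank_lines))
```

If a "blank" line refuses to match `^$`, invisible spaces or tabs on that line are a likely cause.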
Old 10-02-2009, 08:45 AM   #3
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
Copy/paste a couple-three examples here.

m a r
Old 10-02-2009, 10:00 AM   #4
Phil_C
Zealot
Posts: 100
Karma: 27924
Join Date: Oct 2009
Location: Chicago, IL USA
Device: Sony PRS-300Black/350Silver/350Blue/T3Black x2
Here is an example of the book I'm trying to clean now, except that in Book Designer the page number is centered and the first line of each section is indented 4 spaces:

*****

Twice he'd sued the company, and twice he'd won. And once the boys upstairs realized he was determined to join them, and that he had the brains to do so, they accepted him as a person. It still wasn't easy, but he had their respect. Teaker, now on his third scotch, leaned in and offered, confidentially of course, that Peel was being groomed

166

for the big job. "You could be talking to a future CEO," he said to Lonnie.

*****

I can remove the page numbers with:

\d+

but can't figure out how to join the broken sentence.

Phil
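
One thing worth noting about a bare `\d+`: it matches every run of digits, including numbers that are part of the story. A hedged sketch in Python (not Book Cleaner syntax; the sample text with the year in it is invented for illustration) of restricting the match to lines that contain nothing but a number:

```python
import re

# Invented sample: a year inside the text plus an indented page number.
page = (
    "In 1995 Peel was being groomed\n"
    "\n"
    "    166\n"
    "\n"
    "for the big job.\n"
)

# A bare \d+ also strips digits that belong to the text (the year here).
too_greedy = re.sub(r"\d+", "", page)

# Anchoring to a whole line only removes lines made of digits and padding.
PAGE_NUMBER_LINE = re.compile(r"^[ \t]*\d+[ \t]*$", flags=re.MULTILINE)
numbers_only = PAGE_NUMBER_LINE.sub("", page)

print(repr(too_greedy))
print(repr(numbers_only))
```

The anchored version leaves the blank lines behind, which is exactly the gap the rest of this thread deals with.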
Old 10-02-2009, 10:42 AM   #5
Jellby
frumious Bandersnatch
Posts: 5,901
Karma: 4269879
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Maybe something like:

Search for any number of line breaks, followed by any number of digits, followed by any number of line breaks, followed by a lower case letter (and save this letter). (In vim: '\n\+\d\+\n\+\(\l\)')

Replace it with a space and the lowercase letter: (in vim: ' \1')

Then manually check the instances of something else instead of the lowercase letter to see whether or not they are broken sentences/paragraphs.
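
For readers without vim, the same search and replace can be sketched in Python's dialect. This is a translation under assumptions, not something Book Cleaner is known to accept: vim's `\l` becomes `[a-z]`, and the quantifier `\+` becomes a bare `+`. The sample is the excerpt posted earlier in the thread.

```python
import re

text = (
    "that Peel was being groomed\n"
    "\n"
    "166\n"
    "\n"
    'for the big job. "You could be talking to a future CEO," he said to Lonnie.\n'
)

# \n+     one or more line breaks
# \d+     the page number
# \n+     more line breaks
# ([a-z]) the first letter of the continued sentence, saved as group 1
joined = re.sub(r"\n+\d+\n+([a-z])", r" \1", text)

print(joined)
```

Page breaks followed by a capital letter or a quote are left untouched, so those are the ones to check by hand.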
Old 10-02-2009, 11:57 AM   #6
Phil_C
Zealot
Posts: 100
Karma: 27924
Join Date: Oct 2009
Location: Chicago, IL USA
Device: Sony PRS-300Black/350Silver/350Blue/T3Black x2
Quote:
Originally Posted by Jellby View Post
Maybe something like:

Search for any number of line breaks, followed by any number of digits, followed by any number of line breaks, followed by a lower case letter (and save this letter). (In vim: '\n\+\d\+\n\+\(\l\)')

Replace it with a space and the lowercase letter: (in vim: ' \1')

Then manually check the instances of something else instead of the lowercase letter to see whether or not they are broken sentences/paragraphs.
I can't get that or variations to do anything in Book Cleaner. Is "vim" different from Regular Expressions? The only reference I have is the list of Reg Exp in the Book Designer help section.
Old 10-02-2009, 12:47 PM   #7
Jellby
frumious Bandersnatch
Posts: 5,901
Karma: 4269879
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by Phil_C View Post
I can't get that or variations to do anything in Book Cleaner. Is "vim" different from Regular Expressions? The only reference I have is the list of Reg Exp in the Book Designer help section.
Vim is a text editor, which can use regular expressions, but each editor, processor, language, etc. uses a different flavour of regular expressions, with varying syntax, escaped characters, etc. I don't know what would be the exact dialect used by Book Cleaner. Also, if you are on Windows, you may have to search for \r\n instead of \n...
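
To illustrate the `\r\n` point, here is a small Python sketch (again an assumption; the tool in question may handle line endings differently). It shows two ways of coping with Windows line endings: normalise them first, or write the pattern so it accepts either kind.

```python
import re

# The same broken sentence, saved with Windows (\r\n) line endings.
windows_text = "being groomed\r\n\r\n166\r\n\r\nfor the big job.\r\n"

# Option 1: normalise Windows (\r\n) and lone \r endings to \n first.
normalised = windows_text.replace("\r\n", "\n").replace("\r", "\n")
joined_a = re.sub(r"\n+\d+\n+([a-z])", r" \1", normalised)

# Option 2: leave the text alone and let the pattern accept either ending.
joined_b = re.sub(r"(?:\r?\n)+\d+(?:\r?\n)+([a-z])", r" \1", windows_text)

print(repr(joined_a))
print(repr(joined_b))
```

A plain `\n+` pattern finds nothing in `windows_text`, because every `\n` there is preceded by a `\r` that the pattern does not allow for.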
Old 10-02-2009, 01:05 PM   #8
=X=
Wizard
Posts: 3,672
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
Quote:
Originally Posted by Phil_C View Post
I can't get that or variations to do anything in Book Cleaner. Is "vim" different from Regular Expressions? The only reference I have is the list of Reg Exp in the Book Designer help section.
No, Vim is a "vi" clone text editor that has regular expressions built into it. It's my favorite text editor; it is very hard to learn, but once you learn it, it is fantastic. (Also, you might have to remove the escape character "\" in front of the "+": Vim's default syntax writes the "one or more" quantifier as "\+", while most other regular expression flavours use a bare "+".)

I'd try
'\s*\n+\s*\d{1,4}\s*\n+([a-z'"]{1,2})/ \1/'

>> Ignore the single quotes

Note that I added "\s*" because there can be hidden spaces; the \s* will match any such whitespace, and is simply skipped if there is none.
Also, I limited the page number to one to four digits (up to 9999) so you are less likely to replace text that is a genuine number. Lastly, only concatenate the lines if the next paragraph starts with a lowercase letter or a double or single quote. Note that the \1 in the replacement text inserts whatever the () captured. Some regex flavours use $1 instead, so you'll have to play around. Vim uses \1.

Then run
'\s*\n+\s*\d{1,4}\s*\n+([A-Z'"])/\n\1/'

To fix paragraphs that start with capital letters.

=X=
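
The two passes above can be sketched in Python as follows. This is an illustrative port, not tested against Book Cleaner: the function name `strip_page_numbers` and the sample strings are made up, the second pattern is given parentheses so that `\1` has something to refer to, and Python's replacement syntax is `\1` (or `\g<1>`), not `$1`.

```python
import re

# Pass 1: the text after the page number starts with a lowercase letter or
# a quote, so treat it as a continuation and join with a single space.
CONTINUES = re.compile(r"""\s*\n+\s*\d{1,4}\s*\n+([a-z'"]{1,2})""")

# Pass 2: the text after the page number starts with a capital letter,
# so treat it as a new paragraph and keep one line break.
NEW_PARAGRAPH = re.compile(r"""\s*\n+\s*\d{1,4}\s*\n+([A-Z'"])""")


def strip_page_numbers(text: str) -> str:
    """Run both passes in order (hypothetical helper for this sketch)."""
    text = CONTINUES.sub(r" \1", text)
    return NEW_PARAGRAPH.sub(r"\n\1", text)


broken = "was being groomed \n\n  166  \n\nfor the big job."
paragraphs = "he said to Lonnie.\n\n167\n\nThe next morning"

print(strip_page_numbers(broken))
print(strip_page_numbers(paragraphs))
```

Because quotes appear in both character classes and the first pass runs first, a page that begins with a quotation mark gets joined to the previous line; whether that is the right call has to be checked by eye.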
Old 10-02-2009, 01:14 PM   #9
orion2001
Groupie
Posts: 162
Karma: 24658
Join Date: Sep 2009
Device: PRS-505
You might find this post of mine useful:

http://www.mobileread.com/forums/sho...2&postcount=16

Komodo Edit seems to be the best at handling multiline regular expressions among the editors I have tested so far. My post is more for cleaning up the relevant HTML file, but you should be able to adapt it for your regular text file without much hassle.

I'd suggest removing the page numbers first, as you did, and then using that type of expression to match the run of line breaks left behind and join the sentences where necessary. The expression only joins sentences that are incomplete; if the text on the next page starts a new sentence, it keeps the line break. It might result in a couple of spurious paragraphs, but that is something you can edit manually. The other option is to delete all such line breaks and fuse the text across pages, although in that case you might end up joining two separate paragraphs if the second one started on the first line of a page.
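
The linked post is not reproduced here, so the following is not that expression, only a rough Python sketch of the behaviour described. It assumes the page numbers are already gone and that a sentence counts as "incomplete" when the text before the gap does not end in terminal punctuation or a closing quote or bracket. The name `join_incomplete` is made up for the example.

```python
import re

# Group 1 is the last character before the gap; it must not be whitespace,
# terminal punctuation, or a closing quote/bracket (an assumption about
# what "incomplete" means). \s*\n\s*\n\s* is the gap itself (at least two
# line breaks), and (?=\S) only checks that some text follows.
GAP_AFTER_INCOMPLETE = re.compile(r"""([^\s.!?:"'\)\]])\s*\n\s*\n\s*(?=\S)""")


def join_incomplete(text: str) -> str:
    """Close the gap with a single space where the sentence was cut off."""
    return GAP_AFTER_INCOMPLETE.sub(r"\1 ", text)


cut_off = "being groomed\n\n\n\nfor the big job."
finished = "he said to Lonnie.\n\n\n\nThe next morning"

print(repr(join_incomplete(cut_off)))
print(repr(join_incomplete(finished)))
```

Headings and other lines that end without punctuation would also be joined to what follows, so the result still needs a manual pass.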
Old 10-02-2009, 01:17 PM   #10
ahi
Wizard
Posts: 1,792
Karma: 507333
Join Date: May 2009
Device: none
I can (somewhat) program, make websites, build computers, install and reinstall OSes... but I haven't yet figured out how to either edit text files in Vi/VIM... or for that matter how to even simply quit the program.

Is it wise to suggest to a person of moderate computing know-how to use Vi or Vim?

- Ahi
Old 10-02-2009, 02:16 PM   #11
Jellby
frumious Bandersnatch
Posts: 5,901
Karma: 4269879
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by ahi View Post
Is it wise to suggest to a person of moderate computing know-how to use Vi or Vim?
I'm aware vi(m) is weird, I couldn't understand it when I was first introduced to it, now I can't live without it. But I was not suggesting anyone should learn vi(m), I was just providing an example regexp with the "warning" that it's in the vim regexp dialect.

(To exit vim, press ESC (to make sure you are not in edit mode) then type ":q!" or "ZQ", without quotes.)

EDIT: Maybe you were asking whether it is worth it for you to learn vi(m)? Well, that depends. Are you comfortable with some other advanced text editor like emacs? If you are, you don't need to learn vi(m). But if you are not and you want to learn some editor, vi(m) is as good as the alternatives. Try running "vimtutor" to start with.

Last edited by Jellby; 10-02-2009 at 02:21 PM.
Old 10-02-2009, 02:32 PM   #12
ahi
Wizard
Posts: 1,792
Karma: 507333
Join Date: May 2009
Device: none
Quote:
Originally Posted by Jellby View Post
I'm aware vi(m) is weird, I couldn't understand it when I was first introduced to it, now I can't live without it. But I was not suggesting anyone should learn vi(m), I was just providing an example regexp with the "warning" that it's in the vim regexp dialect.

(To exit vim, press ESC (to make sure you are not in edit mode) then type ":q!" or "ZQ", without quotes.)

EDIT: Maybe you were asking whether it is worth it for you to learn vi(m)? Well, that depends. Are you comfortable with some other advanced text editor like emacs? If you are, you don't need to learn vi(m). But if you are not and you want to learn some editor, vi(m) is as good as the alternatives. Try running "vimtutor" to start with.
Thanks. If I remember this, next time I won't have to kill my konsole tab when I have the misfortune of ending up in vim... for some reason in the event of a latex error, some (I do not definitively know what) key combination mysteriously fires it up...

And yeah, I'm aware of the potential benefits... but because I type fairly fast, I have doubts whether or not I would see any massive benefits as a result of learning either VIM or Emacs. If I need something complex, I usually throw a python script at the problem.

Thanks again though!

- Ahi
Old 10-02-2009, 03:57 PM   #13
=X=
Wizard
Posts: 3,672
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
Quote:
Originally Posted by ahi View Post
I can (somewhat) program, make websites, build computers, install and reinstall OSes... but I haven't yet figured out how to either edit text files in Vi/VIM... or for that matter how to even simply quit the program.

Is it wise to suggest to a person of moderate computing know-how to use Vi or Vim?

- Ahi
It depends on their age and background. When I first started working in the Computing field, everybody that used UNIX knew VI well, tech savvy or not.

Now there are so many alternatives that this is no longer a safe assumption.

VI is by no means a developer/coder's tool. It is just a very quick and powerful text editor, power in the form of efficiency, where using a mouse just slows you down.

=X=
Old 10-02-2009, 03:58 PM   #14
ahi
Wizard
Posts: 1,792
Karma: 507333
Join Date: May 2009
Device: none
Quote:
Originally Posted by =X= View Post
It depends on their age and background. When I first started working in the Computing field, everybody that used UNIX knew VI well, tech savvy or not.

Now there are so many alternatives that this is no longer a safe assumption.

VI is by no means a developer/coder's tool. It is just a very quick and powerful text editor, power in the form of efficiency, where using a mouse just slows you down.

=X=
I don't use a mouse for editing with gEdit or Kate... so however good it may be, VI is not a necessary savior from mouse-use.

- Ahi
Old 10-02-2009, 04:06 PM   #15
=X=
Wizard
Posts: 3,672
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
Quote:
Originally Posted by Jellby View Post

(To exit vim, press ESC (to make sure you are not in edit mode) then type ":q!" or "ZQ", without quotes.)
Just to clarify, the "!" is only needed if you want to quit without saving.

Quote:
Originally Posted by ahi View Post
... but because I type fairly fast, I have doubts whether or not I would see any massive benefits as a result of learning either VIM or Emacs.
Actually, that is an even bigger argument for learning VI. If you type fast and are often on the keyboard, you will be even more efficient because you don't have to move your hand from the keyboard. BUT there are also health reasons: heavy mouse use is often cited as a contributor to carpal tunnel problems, and using VIM means less of it.

The learning curve for VI is steep, but once you learn it you will be amazed that you no longer have to think about or look for any menu or tool bar. Your fingers will know the key sequence and be done with it before you realize it.


=X=


All times are GMT -4.

