10-01-2009, 11:14 PM | #1 |
Addict
Posts: 254
Karma: 391602
Join Date: Oct 2009
Location: Chicago, IL USA
Device: Sony PRS-350; Kobo Clara HD; Kobo Clara 2E; Kobo Clara BW
|
Regular Expressions help needed
Hi All --- first post here ---
I've been playing with my Sony Reader for three weeks now and trying to format various books found around the Web. Researching these forums has led me to calibre and Book Designer, both excellent programs. I'm having a problem cleaning up some .lit files for conversion to .lrf in BD because of page numbers and the breaks caused by them. I have been able to get rid of the numbers by use of simple Regular Expressions (which I had never heard of until now) in Book Cleaner. But I can't figure out how to deal with the empty spaces and lines left by page breaks after the page numbers are removed. What remains is the broken end of a sentence, two blank lines after that, then the broken sentence continuing on what previously was the next page. So I need to get rid of all the space and make the sentence whole again. I fixed one book by visually scanning several hundred pages and deleting the offending spaces manually, but don't want to do that again! This must be a common problem, so I hope someone here can give me a clue. I think I do not understand exactly what is in all that blank space, or how to tell the Reg Exp where to begin and end. Thanks for any help. Phil |
10-02-2009, 08:04 AM | #2 |
Member
Posts: 11
Karma: 10
Join Date: Oct 2009
Location: Sutton, Surrey, England
Device: PRS-505
|
Not sure if this uses the standard Unix reg exp, but if it does then ^$ should match a blank line
|
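For anyone following along in a script rather than in Book Cleaner, here is a minimal sketch of the `^$` idea in Python. It assumes Python's `re` module (not whatever engine Book Cleaner uses), and `count_blank_lines` and `sample` are made-up names for illustration. Note that `^$` only matches lines that are completely empty; a line holding nothing but spaces needs something like `^[ \t]*$`.

```python
import re

def count_blank_lines(text, allow_spaces=False):
    """Count blank lines; optionally treat whitespace-only lines as blank."""
    # With re.MULTILINE, ^ and $ match at the start and end of every line,
    # so ^$ matches a line with nothing on it at all.
    pattern = r"^[ \t]*$" if allow_spaces else r"^$"
    return len(re.findall(pattern, text, flags=re.MULTILINE))

# Hypothetical sample: one truly empty line, then one line of three spaces.
sample = "end of page\n\n   \nstart of next page"

print(count_blank_lines(sample))                     # strictly empty lines
print(count_blank_lines(sample, allow_spaces=True))  # also whitespace-only
```

The second call is probably closer to what is needed here, since the "blank" area left behind by a centered page number may still contain spaces.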
10-02-2009, 08:45 AM | #3 |
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
|
Copy/paste a couple-three examples here.
m a r |
10-02-2009, 10:00 AM | #4 |
Addict
Posts: 254
Karma: 391602
Join Date: Oct 2009
Location: Chicago, IL USA
Device: Sony PRS-350; Kobo Clara HD; Kobo Clara 2E; Kobo Clara BW
|
Here is an example of the book I'm trying to clean now, except that in Book Designer the page number is centered and the first line of each section is indented 4 spaces:
*****
Twice he'd sued the company, and twice he'd won. And once the boys upstairs realized he was determined to join them, and that he had the brains to do so, they accepted him as a person. It still wasn't easy, but he had their respect. Teaker, now on his third scotch, leaned in and offered, confidentially of course, that Peel was being groomed 166 for the big job. "You could be talking to a future CEO," he said to Lonnie.
*****
I can remove the page numbers with: \d+ but can't figure out how to join the broken sentence. Phil |
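One thing worth noting about a bare `\d+` is that it matches every run of digits, not only page numbers. A rough sketch in Python of the difference, assuming the page number sits alone on its own line (possibly centered with spaces) as described above. The sample text is a made-up variation on the excerpt, with "1984" added purely to show the problem, and the 1-4 digit limit is an assumption.

```python
import re

# Hypothetical text: a year in the prose plus a centered page number.
sample = (
    "In 1984 he was being groomed\n"
    "\n"
    "        166\n"
    "\n"
    "for the big job."
)

# Naive approach: strips every digit run, including the year.
naive = re.sub(r"\d+", "", sample)

# More careful: only lines consisting of optional spaces/tabs and 1-4 digits.
page_number_line = re.compile(r"^[ \t]*\d{1,4}[ \t]*$", flags=re.MULTILINE)
careful = page_number_line.sub("", sample)

print(repr(naive))
print(repr(careful))
```

The careful version still leaves the blank lines behind, which is the problem the rest of the thread deals with.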
10-02-2009, 10:42 AM | #5 |
frumious Bandersnatch
Posts: 7,533
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Maybe something like:
Search for any number of line breaks, followed by any number of digits, followed by any number of line breaks, followed by a lowercase letter (and save this letter). (In vim: '\n\+\d\+\n\+\(\l\)') Replace it with a space and the lowercase letter: (in vim: ' \1') Then manually check the instances with something else instead of the lowercase letter to see whether or not they are broken sentences/paragraphs. |
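The same search and replace can be sketched in Python, assuming Python's `re` syntax, where `+` and the parentheses are not escaped and `[a-z]` stands in for vim's `\l`. The function name and sample are made up for illustration.

```python
import re

def join_across_page_numbers(text):
    """Join a sentence that was split by a page number on its own line."""
    # \n+      one or more line breaks
    # \d+      the page number
    # \n+      one or more line breaks
    # ([a-z])  a lowercase letter, saved as group 1
    # The replacement is a space followed by the saved letter.
    return re.sub(r"\n+\d+\n+([a-z])", r" \1", text)

sample = "he was being groomed\n\n166\n\nfor the big job."
print(join_across_page_numbers(sample))
```

A break followed by a capital letter is left alone, so those cases still need the manual check mentioned above.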
10-02-2009, 11:57 AM | #6 | |
Addict
Posts: 254
Karma: 391602
Join Date: Oct 2009
Location: Chicago, IL USA
Device: Sony PRS-350; Kobo Clara HD; Kobo Clara 2E; Kobo Clara BW
|
Quote:
|
|
10-02-2009, 12:47 PM | #7 |
frumious Bandersnatch
Posts: 7,533
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Vim is a text editor, which can use regular expressions, but each editor, processor, language, etc. uses a different flavour of regular expressions, with varying syntax, escaped characters, etc. I don't know what would be the exact dialect used by Book Cleaner. Also, if you are on Windows, you may have to search for \r\n instead of \n...
|
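To illustrate the \r\n point: a hedged sketch in Python where every line break is written as `\r?\n`, so the same pattern works on both Windows and Unix line endings. This is only an illustration using Python's `re` module; the function name and samples are hypothetical.

```python
import re

def join_any_newline(text):
    """Same join as before, but accepting either \\n or \\r\\n line breaks."""
    # (?:\r?\n)+ matches one or more line breaks, each with an optional \r.
    return re.sub(r"(?:\r?\n)+\d+(?:\r?\n)+([a-z])", r" \1", text)

unix_sample = "he was being groomed\n\n166\n\nfor the big job."
windows_sample = "he was being groomed\r\n\r\n166\r\n\r\nfor the big job."

print(join_any_newline(unix_sample))
print(join_any_newline(windows_sample))
```

Another option would be to convert all \r\n to \n first and then use the simpler pattern.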
10-02-2009, 01:05 PM | #8 | |
Wizard
Posts: 3,671
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
|
Quote:
I'd try '\s*\n+\s*\d{1,4}\s*\n+([a-z'"]{1,2})/ \1/' >> Ignore the outer single quotes.
Note that I added "\s*" because there can be hidden spaces; the \s* will remove any spaces, or match nothing if there are none. I also limited the page number to 1-4 digits (up to 9999) so you don't replace longer numbers that are valid text. Last, it only concatenates the lines if the next paragraph starts with a lowercase letter or a double or single quote. Note that \1 inserts whatever the () matched into the replacement text. Some RegEx engines use $1 instead, so you'll have to play around. VIM uses \1.
Then run '\s*\n+\s*\d{1,4}\s*\n+([A-Z'"])/\n\1/' to fix paragraphs that start with capital letters. =X= |
|
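A sketch of this two-pass approach in Python, assuming Python's `re` module (which uses `\1` in the replacement, like VIM). The function name and sample text are made up, and the second pattern here wraps the capital letter in parentheses so that `\1` has a group to refer to.

```python
import re

# Pass 1: page number followed by a lowercase letter or a quote -> join with a space.
LOWER = re.compile(r"""\s*\n+\s*\d{1,4}\s*\n+([a-z'"]{1,2})""")
# Pass 2: page number followed by a capital letter or a quote -> keep one line break.
UPPER = re.compile(r"""\s*\n+\s*\d{1,4}\s*\n+([A-Z'"])""")

def remove_page_breaks(text):
    """Run both passes in order: join broken sentences, then tidy new paragraphs."""
    text = LOWER.sub(r" \1", text)
    return UPPER.sub(r"\n\1", text)

# Hypothetical sample with two page breaks: one mid-sentence, one between paragraphs.
sample = (
    "he was being groomed\n\n        166\n\nfor the big job."
    "\n\n 167 \n\nTeaker left."
)
print(remove_page_breaks(sample))
```

Because both character classes include the quote marks, a paragraph that opens with a quote is joined by the first pass, so those cases may deserve a manual look afterwards.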
10-02-2009, 01:14 PM | #9 |
Groupie
Posts: 162
Karma: 24658
Join Date: Sep 2009
Device: PRS-505
|
You might find this post of mine useful:
https://www.mobileread.com/forums/sho...2&postcount=16 Komodo Edit seems to be the best at handling multiline regular expressions amongst the editors I have tested so far. My post is more for cleaning up the relevant HTML file, but you should be able to adapt it for your regular text file without much hassle. I'd suggest removing the page numbers first, as you did, and then using that type of expression to match the required number of line breaks and join the sentences if necessary. The expression makes sure to only join sentences that are incomplete. If, however, the sentence on the next page is a new one, it will retain the line break. It might result in a couple of spurious paragraphs, but that is something you can edit manually. The other option is to just delete all such line breaks and fuse the sentences across pages, although in that case you might end up joining two separate paragraphs if the second one started on the first line of a page. |
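The expression from the linked post is not reproduced in this thread, so the following is only a guess at the general idea, sketched in Python: once the page numbers are gone, join across a run of blank lines only when the text before it does not end a sentence. The rule for "complete" used here (ends in . ! ? or a quote mark) and all names are assumptions.

```python
import re

# Assumption: a sentence is complete if it ends in . ! ? or a quote mark.
# Group 1 is the last character before the gap, which must not be one of those.
# The gap is two or more line breaks (with optional spaces and \r) before more text.
INCOMPLETE_BREAK = re.compile(
    r"""([^.!?"'\s])[ \t]*(?:\r?\n[ \t]*){2,}(?=\S)"""
)

def join_incomplete_sentences(text):
    """Fuse text across a blank gap only when the sentence was cut off."""
    return INCOMPLETE_BREAK.sub(r"\1 ", text)

broken = "he was being groomed\n\n\nfor the big job."
complete = "he said to Lonnie.\n\n\nTeaker left."

print(join_incomplete_sentences(broken))
print(join_incomplete_sentences(complete))
```

A paragraph that genuinely ends without punctuation, such as a heading, would be joined too, which is the kind of leftover that needs manual editing.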
10-02-2009, 01:17 PM | #10 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
I can (somewhat) program, make websites, build computers, install and reinstall OSes... but I haven't yet figured out how to either edit text files in Vi/VIM... or for that matter how to even simply quit the program.
Is it wise to suggest to a person of moderate computing know-how to use Vi or Vim? - Ahi |
10-02-2009, 02:16 PM | #11 | |
frumious Bandersnatch
Posts: 7,533
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Quote:
(To exit vim, press ESC (to make sure you are not in edit mode) then type ":q!" or "ZQ", without quotes.) EDIT: Maybe you were asking whether it is worth it for you to learn vi(m)? Well, that depends. Are you comfortable with some other advanced text editor like emacs? If you are, you don't need to learn vi(m). But if you are not and you want to learn some editor, vi(m) is as good as the alternatives. Try running "vimtutor" to start with. Last edited by Jellby; 10-02-2009 at 02:21 PM. |
|
10-02-2009, 02:32 PM | #12 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
And yeah, I'm aware of the potential benefits... but because I type fairly fast, I have doubts whether or not I would see any massive benefits as a result of learning either VIM or Emacs. If I need something complex, I usually throw a python script at the problem. Thanks again though! - Ahi |
|
10-02-2009, 03:57 PM | #13 | |
Wizard
Posts: 3,671
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
|
Quote:
Now there are so many alternatives that that is no longer a safe assumption. VI is by no means a developer/coder's tool. It is just a very quick and powerful text editor, powerful in the sense of efficiency, where using a mouse just slows you down. =X= |
|
10-02-2009, 03:58 PM | #14 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
- Ahi |
|
10-02-2009, 04:06 PM | #15 | ||
Wizard
Posts: 3,671
Karma: 12205348
Join Date: Mar 2008
Device: Galaxy S, Nook w/CM7
|
Quote:
Quote:
Learning VI is steep but once you learn it you will be amazed that you no longer have to think or look for any menu/tool bar. Your fingers will know the key sequence and be done with it before you realize it. =X= |
||
|