08-15-2010, 10:00 PM | #1 |
Enthusiast
Posts: 29
Karma: 10
Join Date: Aug 2010
Device: ipod touch
|
How to join broken paragraphs?
After scanning a book and exporting it to HTML, I frequently get paragraphs split in two where the pages break in the document. I have to go through and join them by hand with the delete key, which takes a lot of time. Is there a trick to speed this up?
thanks. |
08-16-2010, 04:58 AM | #2 |
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
Search for paragraphs that end in a character other than . ! ? :, possibly followed by "
Search for paragraphs starting with a lowercase letter, possibly preceded by "
Those searches are simple with regex (regular expressions), but in order to give more help we'd have to know the particular dialect of regex your software uses (if any).
Grab the paper book or the scans, and search every page, looking for pages that start with an uppercase letter that is not the beginning of a paragraph. |
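For readers who would rather script the two searches above than depend on one editor's regex dialect, here is a minimal Python sketch. It assumes the exported HTML wraps each paragraph in a plain <p>...</p> pair; the function name suspect_paragraphs and the sample text are made up for illustration, and it only reports candidates without changing anything.

```python
import re

# Assumption: every paragraph is a <p>...</p> element with no nested <p>.
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL | re.IGNORECASE)
# Straight and curly quotes that may surround the paragraph text.
QUOTES = "\"'\u201c\u201d\u2018\u2019"


def suspect_paragraphs(html):
    """Return (index, reason, text) for paragraphs that look split."""
    hits = []
    for index, match in enumerate(PARAGRAPH.finditer(html)):
        # Drop inline tags such as <i> or <span> and surrounding whitespace.
        text = re.sub(r"<[^>]+>", "", match.group(1)).strip()
        core = text.strip(QUOTES)
        if not core:
            continue
        # Search 1: last character is not . ! ? : (ignoring a closing quote).
        if core[-1] not in ".!?:":
            hits.append((index, "no closing punctuation", text))
        # Search 2: first character is lowercase (ignoring an opening quote).
        if core[0].islower():
            hits.append((index, "starts lowercase", text))
    return hits


SAMPLE = (
    "<p>He opened the door and</p>\n"
    "<p>walked into the hall.</p>\n"
    "<p>\"Who is there?\"</p>\n"
)

for index, reason, text in suspect_paragraphs(SAMPLE):
    print(index, reason, repr(text))
```

Pages that happen to break before a capitalized word in mid-paragraph pass both checks, which is why the manual comparison against the scans is still needed.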
08-16-2010, 06:04 AM | #3 | |
Wizard
Posts: 3,450
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
Quote:
We just need to know what software you are using and what your skills are. Do you use OpenOffice.org Writer, or MS Office, or something else? Do you know what a regular expression is? As the previous poster said, looking for paragraphs that begin with a lowercase letter would find the vast majority of such paragraphs. You can also start looking for paragraphs that do not end with . ? ! ." ?" !" .' ?' !' ... you get the idea. |
|
08-16-2010, 10:38 AM | #4 |
Enthusiast
Posts: 29
Karma: 10
Join Date: Aug 2010
Device: ipod touch
|
Thanks. I will bone up on regular expressions. The software I am using to edit the HTML after exporting it is Dreamweaver CS5 and Sigil.
|
08-18-2010, 05:54 PM | #5 | |
Samurai Lizard
Posts: 14,251
Karma: 66666666
Join Date: Nov 2009
Device: NookColor
|
Quote:
I hope it helps. |
|
08-18-2010, 06:06 PM | #6 |
Addict
Posts: 248
Karma: 100148
Join Date: Jul 2010
Location: Germany, Munich
Device: Kindle 3 & DX Graphite, PocketBook 302 & Pro 603
|
I'm not that good with RegExes; it takes some trial and error and googling to work out how to use them.
I would appreciate it if someone could tell me the RegEx I can use with Notepad++ to find those paragraphs that do not end with .?!... What I do now is scroll through Firefox and the PDF side by side to compare them. That catches most of them, but I overlook some... |
08-19-2010, 04:05 AM | #7 | |
Wizard
Posts: 3,450
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
Quote:
I have downloaded Notepad++. Its regular expressions are practically undocumented and they behave in a really weird way. It doesn't, for example, recognize \n as an end of line. Notepad++ can also do only a very limited range of operations on bookmarked lines. I suggest downloading TextPad for this operation (or finding out why Notepad++ does not recognize \n as an "end of line" metacharacter).
Open the document and go to menu Search -> Replace...
To find all lines ending with a literal dot, write the search expression [.]$
To find all lines ending with a literal "?", search for [?]$
[] is a "set": it matches one character out of all the characters listed inside, so [abc] would find either a, b or c. The previous two searches can be combined as [.?]$
To find lines whose last character is NOT . or ? or !, use the negation operator ^ so [^.?!]$ would find all lines ending with a character that is not ., ? or !
Now you want to remember the last character found. You do that with \( and \) as grouping operators. In the replace string you then refer to the expression marked by \( and \) as \1 for the first group, \2 for the second, \9 for the ninth. Please note that various implementations of regular expressions use either \( and \) or plain ( and ) as grouping operators. TextPad can use both, depending on preferences (the "use POSIX Regular Expressions" setting).
Let us put that together. Look for \([^.?!]\)\n and replace with "\1 "
There is a space after \1, so that the last word of the line and the first word of the next line are not run together. You might now end up with two spaces between words if there *was* a space at the end of the line. To get rid of this, simply replace two spaces with one space.
In the Vim text editor I would simply issue the command :global/[^.?!]$/ join or, using the short versions of the commands, :g/[^.?!]$/ j
It means: find all lines not ending with .?! and join them with the next line. The join command inserts a space in place of the end of line if there wasn't a space at the end of the joined line. It also removes the leading spaces if the next line was indented with spaces.
Vim is difficult to learn, but it is one of THE most powerful text editors, and is also one of THE most completely documented editors. Just check its on-line manual for RE
http://vimdoc.sourceforge.net/htmldoc/usr_toc.html
http://vimdoc.sourceforge.net/htmldo...n.html#pattern
(I am using the diplomatic "one of" because I do not want to pick a fight with our resident Emacs users ;-) )
Last edited by kacir; 08-19-2010 at 04:22 AM. |
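The same search-and-replace can be tried outside any particular editor. The sketch below is one possible Python rendering of the TextPad replacement described in this post, not a tested equivalent of TextPad or Vim behaviour. It assumes the text uses plain \n line endings; the function name join_broken_lines is made up for illustration. Python's re module writes groups with plain ( and ), and \n is added to the negated set so that empty lines are left alone.

```python
import re


def join_broken_lines(text):
    """Join each line that does not end in . ? or ! with the next line."""
    # ([^.?!\n])\n : a last character that is not . ? ! followed by the
    # line end. The character is kept as group 1 and the newline becomes
    # a space, so the two words are not run together.
    joined = re.sub(r"([^.?!\n])\n", r"\1 ", text)
    # If the line already ended in a space there are now two spaces;
    # collapse any run of spaces to a single one.
    return re.sub(r" {2,}", " ", joined)


sample = "He opened the door and\nwalked into the hall.\nIt was dark.\n"
print(join_broken_lines(sample))
```

As in the original expression, a line that ends in a closing quote or a colon is treated as unfinished, so the set inside [^...] may need extending for dialogue-heavy books.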
|
08-19-2010, 02:23 PM | #8 |
Addict
Posts: 248
Karma: 100148
Join Date: Jul 2010
Location: Germany, Munich
Device: Kindle 3 & DX Graphite, PocketBook 302 & Pro 603
|
Thanks a lot, I'll take a look at TextPad this weekend, if I find the time
I might experiment with these RegExes or variations in Notepad++ as well, as I see they are quite like the ones I know already:
Code:
\r\n
finds line breaks
Code:
(space)class=".*"
and replace it with (nothing) |
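One caveat with the class pattern in the post above: .* is greedy, so on a line that holds more than one quoted attribute it may run on to the last " on that line. A hedged Python sketch of a narrower variant follows; it assumes attribute values are always double-quoted and contain no " themselves, and the function name strip_class_attributes is made up for illustration.

```python
import re

# [^"]* stops at the first closing quote, so only the class value is matched.
CLASS_ATTR = re.compile(r'\sclass="[^"]*"')


def strip_class_attributes(html):
    """Remove every class="..." attribute together with its leading space."""
    return CLASS_ATTR.sub("", html)


line = '<p class="calibre1">Text <span class="italic">here</span></p>'
print(strip_class_attributes(line))
```

In most regex dialects the same [^"]* trick can be typed straight into the editor's search box in place of .* as well.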
08-19-2010, 03:21 PM | #9 | |
Wizard
Posts: 3,450
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
Quote:
http://www.vim.org/
Vim has excellent documentation. It has two parts: a user manual and a reference manual. The user manual was actually written by a very good professional author of technical books.
Yes, it does (for files with DOS-type end-of-line, see http://en.wikipedia.org/wiki/End-of-line ), but not in Regular Expression mode, only in Extended mode. I tried all combinations of \r\n \n\r \r \n
You might try $
OpenOffice.org Writer uses $ as a metacharacter for end of line and \n for a manual line break. Strange.
EVERY SINGLE implementation of regular expressions I have seen has some strange incompatibility with all the other versions. Some programs even provide several syntaxes you can use (TextPad has two, Vim four: very magic, magic, nomagic and very nomagic). I am not kidding ;-) Have a look: http://vimdoc.sourceforge.net/htmldo...rn.html#/magic
I strongly recommend the book Mastering Regular Expressions http://oreilly.com/catalog/9780596528126/ |
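Since each tool treats \r\n, \r and \n differently, one way around the problem is to normalize the line endings first and only then run the searches. A minimal Python sketch, assuming the whole file fits in memory as a string (the function name normalize_newlines is made up for illustration):

```python
def normalize_newlines(text):
    """Convert DOS (\\r\\n) and old Mac (\\r) line endings to plain \\n."""
    # Replace the two-character DOS ending first, so that its \r is not
    # turned into a second newline by the following replace.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_newlines("first line\r\nsecond line\rthird line\n")))
```

After this step a pattern that ends in \n behaves the same on files that came from Windows, Unix or old Mac tools.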
|