How to join broken paragraphs?

purcelljf · 08-15-2010, 10:00 PM

After scanning a book and exporting it to HTML, I frequently end up with paragraphs split in two wherever a page break fell in the document. I have to go through and use the delete key every so often to join them back together. I thought maybe there is a trick to this, so it doesn't take so much time?

thanks.

Jellby · 08-16-2010, 04:58 AM

Search for paragraphs that end in a character other than . ! ? :, possibly followed by "

Search for paragraphs starting with a lowercase letter, possibly preceded by "

Those searches are simple with regex (regular expressions), but in order to give more help we'd have to know the particular dialect of regex your software uses (if any).

Grab the paper book or the scans, and search every page, looking for pages that start with an uppercase letter that is not the beginning of a paragraph.
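A minimal sketch of the first two searches, written in Python only because no particular tool has been named yet. The sample paragraphs and the names `ENDS_OPEN`, `STARTS_LOWER` and `suspicious` are made up for illustration, and the paragraphs are assumed to be already extracted as plain strings. It flags candidates for a manual look rather than fixing anything.

```python
import re

# Hypothetical sample data: one string per paragraph.
paragraphs = [
    "He walked to the door and",
    "opened it slowly.",
    '"Who is there?"',
    "Nobody answered.",
]

# Last character is not . ! ? : (a closing " may follow it).
ENDS_OPEN = re.compile(r'[^.!?:"]"?\s*$')
# First character is a lowercase letter (an opening " may precede it).
STARTS_LOWER = re.compile(r'^\s*"?[a-z]')


def suspicious(paragraphs):
    """Return (index, reason) pairs for paragraphs worth checking by hand."""
    hits = []
    for i, p in enumerate(paragraphs):
        if ENDS_OPEN.search(p):
            hits.append((i, "no closing punctuation"))
        if STARTS_LOWER.match(p):
            hits.append((i, "starts lowercase"))
    return hits


for index, reason in suspicious(paragraphs):
    print(index, reason, repr(paragraphs[index]))
```

Neither test catches a page that breaks right before a capitalised word, which is why the third suggestion (checking against the scans) is still needed.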

kacir · 08-16-2010, 06:04 AM

Quote:

Originally Posted by purcelljf

After scanning a book and exporting it to HTML, I frequently end up with paragraphs split in two wherever a page break fell in the document. ... I thought maybe there is a trick to this, so it doesn't take so much time?

There are lots of tricks.
We just need to know what software you are using and what your skills are.
Do you use OpenOffice.org Writer, or MS Office, or something else?
Do you know what a Regular Expression is?

As the previous poster said, looking for paragraphs that begin with a lowercase letter would find the vast majority of such paragraphs.
You can also look for paragraphs that do not end with . ? ! ." ?" !" .' ?' !' ... you get the idea.
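Since the original question was about HTML, here is a rough Python sketch that applies both hints at once: it merges two `<p>` elements only when the first does not end in . ? ! (or a quote) and the second starts with a lowercase letter. The markup is an assumption (plain `<p>...</p>` pairs with no inline tag right before the `</p>`), and `BROKEN` and `join_broken` are names invented for this example.

```python
import re

# Group 1: last character of the first paragraph; it must not be
# terminal punctuation, a quote, whitespace or the end of an inline tag.
# The lookahead requires the next paragraph to start with a lowercase letter.
BROKEN = re.compile(r"""([^.?!"'>\s])\s*</p>\s*<p(?:\s[^>]*)?>\s*(?=[a-z])""")


def join_broken(html):
    """Merge paragraph pairs that look like one paragraph split by a page break."""
    return BROKEN.sub(r"\1 ", html)


sample = (
    "<p>He walked to the door and</p>\n"
    "<p>opened it.</p>\n"
    "<p>Then he left.</p>"
)
print(join_broken(sample))
```

Requiring both conditions is a deliberately cautious choice: it misses some breaks, but it should rarely glue together paragraphs that really are separate.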

purcelljf · 08-16-2010, 10:38 AM

Thanks. I will bone up on regular expressions. The software I am using to edit the HTML after exporting it is Dreamweaver CS5 and Sigil.

Solitaire1 · 08-18-2010, 05:54 PM

Quote:

Originally Posted by kacir

There are lots of tricks.
We just need to know what software you are using and what your skills are.
Do you use OpenOffice.org Writer, or MS Office, or something else?
Do you know what a Regular Expression is?

As the previous poster said, looking for paragraphs that begin with a lowercase letter would find the vast majority of such paragraphs.
You can also look for paragraphs that do not end with . ? ! ." ?" !" .' ?' !' ... you get the idea.

When it comes to joining paragraphs in plain text documents, I use OpenOffice.org. If you check my post in this thread (https://www.mobileread.com/forums/showthread.php?t=52709), it contains step-by-step instructions on how to join the paragraphs using OpenOffice.org.

I hope it helps.

bear4hunter · 08-18-2010, 06:06 PM

I'm not that good with RegExes; it takes some trial and error and googling to find out how to use them.

I would appreciate it if someone could tell me the RegEx I can use with Notepad++ to find those paragraphs that do not end with .?!...

What I do is scroll through Firefox and the PDF side by side to "view-compare". That catches most of them, but I overlook some...

kacir · 08-19-2010, 04:05 AM

Quote:

Originally Posted by bear4hunter

I would appreciate it if someone could tell me the RegEx I can use with Notepad++ to find those paragraphs that do not end with .?!...

I have been using TextPad for many, many years, and I still use it when I need to demonstrate Regular Expressions to casual users. I do not want to scare them away with Vim ;-)
I have downloaded Notepad++. Its regular expressions are practically undocumented and they behave in a really weird way. For example, it doesn't recognize \n as an end of line. Notepad++ can also do only a very limited range of operations on bookmarked lines.

I suggest downloading TextPad for this operation (or finding out why Notepad++ does not recognize \n as an "end of line" metacharacter).

Open document.
Go to menu Search -> Replace...
To find all lines ending with a literal dot, use the search expression [.]$
To find all lines ending with a literal "?", search for [?]$
[] is a "set": it matches one character out of all the characters listed inside, so [abc] would find either a, b or c. The previous two searches could be combined as [.?]$.
To find lines whose last character is NOT one of the listed characters, use the negation operator ^ as the first character inside the set,
so [^.?!]$ would find all lines ending with a character that is NOT ., ? or !
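The same three patterns can be tried outside an editor. This is a small Python sketch with made-up sample lines; Python's `re` module is a different regex dialect from TextPad's, but sets, negated sets and `$` behave the same way for these simple cases.

```python
import re

# Hypothetical sample lines, one string per line, without newline characters.
lines = ["It was late.", "Was it late?", "It was late and", "Stop!"]

# [.]$    : the line ends with a literal dot
ends_with_dot = [ln for ln in lines if re.search(r"[.]$", ln)]
# [.?]$   : the line ends with either a dot or a question mark
ends_with_dot_or_q = [ln for ln in lines if re.search(r"[.?]$", ln)]
# [^.?!]$ : the line ends with anything except . ? !
unterminated = [ln for ln in lines if re.search(r"[^.?!]$", ln)]

print(ends_with_dot)
print(ends_with_dot_or_q)
print(unterminated)
```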

Now you want to remember the last character found. You do that with \( and \) as grouping operators. In the replace string you then refer to the expression marked by \( and \) as \1 for the first group, \2 for the second, \9 for the ninth.
Please note that various implementations of Regular Expressions use either \( and \) or plain ( and ) as grouping operators. TextPad can use both, depending on preferences (set up as "use POSIX Regular Expressions").

Let us put that together.
Look for \([^.?!]\)\n
replace with "\1 "
There is a space after \1, so that the last word of the line and the first word of the next line are not run together.

Now you might end up with two spaces between words, if there *was* a space at the end of the line.
To get rid of this you simply replace two spaces by one space.
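For anyone who would rather script it, the search, the replace and the double-space cleanup could look roughly like this in Python. Python's `re` uses plain ( and ) for grouping. Two things here are my own adjustments and not part of the TextPad recipe: `\n` is added to the negated set so that blank lines between paragraphs are left alone, and runs of more than two spaces are collapsed as well. The function name `join_unterminated` is invented for the example.

```python
import re


def join_unterminated(text):
    """Join each line that does not end in . ? ! with the line after it."""
    # Group 1 remembers the last character of the line; the newline after it
    # is replaced by a space, like the replacement string "\1 " in an editor.
    joined = re.sub(r"([^.?!\n])\n", r"\1 ", text)
    # Clean up doubled spaces left where the line already ended in a space.
    return re.sub(r"  +", " ", joined)


sample = "The cat sat on \nthe mat.\nIt slept and\ndreamed.\n"
print(join_unterminated(sample))
```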

In Vim text editor I would simply issue command
:global/[^.?!]$/ join
or, using short versions of commands
:g/[^.?!]$/ j
It means: find all lines not ending with .?! and join them with the next line. The join command inserts a space in place of the end of line if there wasn't a space at the end of the joined line. It also reduces the number of spaces if the next line was indented with spaces.
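For readers without Vim, the idea of joining line by line can be imitated in a few lines of Python. This is only a sketch in the same spirit, not a reimplementation of Vim's :global or :join. It keeps joining through several unterminated lines in a row and stops at a blank line, and both of those choices are my assumptions. `join_lines` is a made-up name.

```python
def join_lines(lines, terminators=".?!"):
    """Join every line that does not end in a terminator with the next line."""
    out = []
    pending = ""  # unterminated text waiting for its continuation
    for line in lines:
        if pending and line.strip():
            # One space at the seam; indentation of the next line is dropped.
            line = pending.rstrip() + " " + line.lstrip()
            pending = ""
        elif pending:
            # A blank line follows: give up joining and keep the text as it is.
            out.append(pending)
            pending = ""
        if line.strip() and line.rstrip()[-1] not in terminators:
            pending = line
        else:
            out.append(line)
    if pending:
        out.append(pending)
    return out


sample = [
    "He walked to the door and",
    "   opened it slowly,",
    "then stopped.",
    "Silence.",
]
print(join_lines(sample))
```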

Vim is difficult to learn, but it is one of THE most powerful text editors, and is also one of THE most completely documented editors. Just check its on-line manual for RE
http://vimdoc.sourceforge.net/htmldoc/usr_toc.html
http://vimdoc.sourceforge.net/htmldo...n.html#pattern
(I am using "one of" diplomatic language, because I do not want to pick fight with our resident Emacs users ;-) )

bear4hunter · 08-19-2010, 02:23 PM

Thanks a lot, I'll take a look at TextPad this weekend, if I find the time

I might experiment with these RegExes or variations in notepad++ as well, as I see they are quite like the ones I know already:

Code:

\r\n finds line breaks

and if my PDF is messy, ending up with lots of styles like p1, p2,... I use

Code:

(space)class=".*" and replace it with (nothing)

Actually, both come from or are based on tips from Joshua Tallent's excellent book on Kindle Formatting - I found it worth every penny.
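One caution about the class="..." pattern, shown as a small Python sketch with a made-up line of HTML. In Python's `re`, `.*` is greedy, so if a line contains two class attributes the match runs from the first opening quote to the last closing quote and takes the text in between with it. Other engines may behave differently, and with one attribute per line the two patterns give the same result. Using `[^"]*` in place of `.*` is the usual way to stay inside one attribute.

```python
import re

# Hypothetical line with two class attributes.
html = '<p class="p1">First <span class="s2">word</span></p>'

# Greedy: .* runs to the LAST double quote on the line.
greedy = re.sub(r' class=".*"', "", html)

# Targeted: [^"]* cannot cross a double quote, so each attribute is removed on its own.
targeted = re.sub(r' class="[^"]*"', "", html)

print(greedy)
print(targeted)
```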

kacir · 08-19-2010, 03:21 PM

Quote:

Originally Posted by bear4hunter

Thanks a lot, I'll take a look at TextPad this weekend, if I find the time

If you are not afraid, do try [G]Vim text editor.
http://www.vim.org/
Vim has excellent documentation. It has two parts - a user manual and a reference manual. The user manual was actually written by a very good professional author of technical books.

Quote:

Originally Posted by bear4hunter

Code:

\r\n finds line breaks

Yes, it does (for files with DOS type end-of-line, see http://en.wikipedia.org/wiki/End-of-line ), but not in Regular Expression mode, only in Extended mode.
I tried all combinations of
\r\n
\n\r
\r
\n

You might try $
OpenOffice.org writer uses $ as a metacharacter for end of line and \n for manual pagebreak. Strange.
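Because tools disagree about what counts as an end of line, one way around the problem is to normalise the line endings first and only then apply any pattern that contains \n. This is a minimal Python sketch of that idea. `normalize_newlines` is a name invented here, and it assumes the text has already been read without any newline translation.

```python
def normalize_newlines(text):
    """Convert DOS (\\r\\n) and old Mac (\\r) line endings to Unix (\\n)."""
    # Replace the two-character sequence first, so that a \r\n pair
    # becomes one \n and not two.
    return text.replace("\r\n", "\n").replace("\r", "\n")


sample = "one\r\ntwo\rthree\n"
print(repr(normalize_newlines(sample)))
```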

EVERY SINGLE implementation of Regular Expressions I have seen has some strange incompatibility with all the other versions.
Some programs even provide several syntaxes you can use (TextPad has two, Vim four (very magic, magic, nonmagic and very nonmagic) I am not kidding ;-) have a look : http://vimdoc.sourceforge.net/htmldo...rn.html#/magic )

I strongly recommend the book Mastering Regular Expressions http://oreilly.com/catalog/9780596528126/
