Old 08-15-2010, 10:00 PM   #1
purcelljf
Enthusiast
Posts: 29
Karma: 10
Join Date: Aug 2010
Device: ipod touch
How to join broken paragraphs?

After scanning a book and exporting it to HTML, I often end up with a paragraph split in two wherever a page break fell in the original. I have to go through the file and join each pair by hand with the Delete key. Is there a trick that would make this less time-consuming?

Thanks.
Old 08-16-2010, 04:58 AM   #2
Jellby
frumious Bandersnatch
Posts: 7,516
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Search for paragraphs that end in a character other than . ! ? :, possibly followed by "

Search for paragraphs starting with a lowercase letter, possibly preceded by "

Those searches are simple with regex (regular expressions), but in order to give more help we'd have to know the particular dialect of regex your software uses (if any).

Grab the paper book or the scans, and search every page, looking for pages that start with an uppercase letter that is not the beginning of a paragraph.
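
The two searches at the top of this post can be sketched in Python's `re` module. This is only an illustration under stated assumptions: the paragraphs are already available as a list of plain strings, only the straight double quote is treated as a quote character, and the names `ENDS_OPEN`, `STARTS_LOWER` and `suspicious_breaks` are made up for the example.

```python
import re

# Assumption: a paragraph "ends properly" if its last character, ignoring one
# optional closing quote, is one of . ! ? :
# The quote is excluded from the set so that ." is not reported as open.
ENDS_OPEN = re.compile(r'[^.!?:"]"?$')

# Assumption: a paragraph "starts mid-sentence" if its first character, after
# one optional opening quote, is a lowercase ASCII letter.
STARTS_LOWER = re.compile(r'^"?[a-z]')


def suspicious_breaks(paragraphs):
    """Return each index i where paragraphs i and i+1 may be one split paragraph."""
    hits = []
    for i in range(len(paragraphs) - 1):
        ends_open = ENDS_OPEN.search(paragraphs[i].rstrip())
        starts_lower = STARTS_LOWER.match(paragraphs[i + 1].lstrip())
        if ends_open or starts_lower:
            hits.append(i)
    return hits


sample = [
    "It was a dark night. The rain fell on the",
    "roof of the old house.",
    '"Who is there?"',
    "Nobody answered.",
]
print(suspicious_breaks(sample))  # prints [0]
```

The function only reports candidates. Headings, list items and lines of verse also end without punctuation, so each hit still needs a look before joining.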
Old 08-16-2010, 06:04 AM   #3
kacir
Wizard
Posts: 3,450
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Quote:
Originally Posted by purcelljf View Post
After scanning a book and exporting it to html, I frequently have separate paragraphs where the pages break in the document. ... I thought maybe there is a trick to this, so it doesn't take so much time?
There are lots of tricks.
We just need to know what software you are using and how comfortable you are with it.
Do you use OpenOffice.org Writer, MS Office, or something else?
Do you know what a regular expression is?

As the previous poster said, looking for paragraphs that begin with a lowercase letter would find the vast majority of such paragraphs.
You can also look for paragraphs that do not end with . ? ! ." ?" !" .' ?' !' ... you get the idea.
Old 08-16-2010, 10:38 AM   #4
purcelljf
Enthusiast
Posts: 29
Karma: 10
Join Date: Aug 2010
Device: ipod touch
Thanks. I will bone up on regular expressions. The software I am using to edit the HTML after exporting it is Dreamweaver CS5 and Sigil.
Old 08-18-2010, 05:54 PM   #5
Solitaire1
Samurai Lizard
Posts: 14,251
Karma: 66666666
Join Date: Nov 2009
Device: NookColor
Quote:
Originally Posted by kacir View Post
There are lots of tricks.
We just need to know what software you are using and how comfortable you are with it.
Do you use OpenOffice.org Writer, MS Office, or something else?
Do you know what a regular expression is?

As the previous poster said, looking for paragraphs that begin with a lowercase letter would find the vast majority of such paragraphs.
You can also look for paragraphs that do not end with . ? ! ." ?" !" .' ?' !' ... you get the idea.
When it comes to joining paragraphs in plain text documents, I use OpenOffice.org. If you check my post in this thread (https://www.mobileread.com/forums/showthread.php?t=52709), it contains step-by-step instructions on how to join the paragraphs using OpenOffice.org.

I hope it helps.
Old 08-18-2010, 06:06 PM   #6
bear4hunter
Addict
Posts: 248
Karma: 100148
Join Date: Jul 2010
Location: Germany, Munich
Device: Kindle 3 & DX Graphite, PocketBook 302 & Pro 603
I'm not that good with regexes; it takes some trial and error and googling to work out how to use them.

I would appreciate it if someone could tell me the regex I can use with Notepad++ to find paragraphs that do not end with . ? ! ...

What I do now is scroll through Firefox and the PDF side by side and compare them by eye. That catches most of them, but I overlook some.
Old 08-19-2010, 04:05 AM   #7
kacir
Wizard
Posts: 3,450
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Quote:
Originally Posted by bear4hunter View Post
I would appreciate it if someone could tell me the regex I can use with Notepad++ to find paragraphs that do not end with . ? ! ...
I have been using TextPad for many, many years, and I still use it when I need to demonstrate Regular Expressions to casual users. I do not want to scare them away with Vim ;-)
I have downloaded Notepad++. Its regular expressions are practically undocumented and behave in a really weird way. For example, it doesn't recognize \n as an end of line. Notepad++ can also do only a very limited range of operations on bookmarked lines.

I suggest downloading TextPad for this operation (or finding out why Notepad++ does not recognize \n as an "end of line" metacharacter).

Open the document.
Go to the menu Search -> Replace...
To find all lines ending with a literal dot, use the search expression [.]$
To find all lines ending with a literal "?", search for [?]$
[] is a "set": it matches one character out of all the characters listed inside, so [abc] would find either a, b or c. The previous two searches can therefore be combined as [.?]$
To find lines whose last character is NOT . or ?, use the negation operator ^ as the first character inside the set,
so [^.?!]$ would find all lines ending with a character that is NOT ., ? or !
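
For readers who want to try these sets outside TextPad, here is a small sketch in Python's `re` module. The sample text and variable names are invented for the illustration. One difference worth noting: in Python a negated set also matches a newline, so `\n` is added to the set explicitly.

```python
import re

text = (
    "First line ends with a dot.\n"
    "Is this a question?\n"
    "This line is cut off\n"
    "Stop!\n"
)

# re.MULTILINE makes ^ and $ match at the start and end of every line,
# not only at the start and end of the whole text.
ends_with_dot_or_question = re.findall(r"^.*[.?]$", text, flags=re.MULTILINE)

# \n is listed in the negated set because [^...] would otherwise be allowed
# to match the newline itself.
ends_without_terminator = re.findall(r"^.*[^.?!\n]$", text, flags=re.MULTILINE)

print(ends_with_dot_or_question)
print(ends_without_terminator)
```

The first list holds the two lines ending in `.` or `?`; the second holds only the line that was cut off.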

Now you want to remember the last character found. You do that by \( and \) as a grouping operator. In the replace string you then refer to expression marked by \( and \) as \1 for the first group, \2 for second, \9 for ninth.
Please note that different implementations of regular expressions use either \( and \) or plain ( and ) as grouping operators. TextPad can use both, depending on preferences (the setting is called "use POSIX Regular Expressions").

Let us put that together.
Look for \([^.?!]\)\n
Replace with "\1 "
There is a space after \1, so that the last word of the line and the first word of the next line are not run together.

You might now end up with two spaces between words if there *was* already a space at the end of the line.
To get rid of these, simply replace two spaces with one space.
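
The same search and replace, followed by the clean-up of doubled spaces, might look like this in Python. It is a sketch, not TextPad's behaviour: Python writes groups as plain ( and ), the function name `join_broken_lines` is made up, and `\n` is added to the set so that empty lines between paragraphs are left alone.

```python
import re


def join_broken_lines(text):
    # Group 1 remembers the last character of the line; the replacement puts
    # it back and substitutes one space for the line break.
    joined = re.sub(r"([^.?!\n])\n", r"\1 ", text)
    # Collapse the double spaces left behind when a line already ended in a space.
    return re.sub(r" {2,}", " ", joined)


broken = "The quick brown fox \njumps over\nthe lazy dog.\nNext paragraph.\n"
print(join_broken_lines(broken))
```

With the sample above, the first three lines become "The quick brown fox jumps over the lazy dog." and "Next paragraph." stays on its own line.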


In Vim text editor I would simply issue command
:global/[^.?!]$/ join
or, using short versions of commands
:g/[^.?!]$/ j
It means: find all lines not ending with . ? or ! and join each with the next line. The join command inserts a space in place of the end of line if there wasn't already a space at the end of the joined line. It also removes the leading whitespace if the next line was indented.

Vim is difficult to learn, but it is one of THE most powerful text editors, and also one of THE most completely documented. Just check its online manual for regular expressions:
http://vimdoc.sourceforge.net/htmldoc/usr_toc.html
http://vimdoc.sourceforge.net/htmldo...n.html#pattern
(I am using the diplomatic "one of" because I do not want to pick a fight with our resident Emacs users ;-) )

Last edited by kacir; 08-19-2010 at 04:22 AM.
Old 08-19-2010, 02:23 PM   #8
bear4hunter
Addict
Posts: 248
Karma: 100148
Join Date: Jul 2010
Location: Germany, Munich
Device: Kindle 3 & DX Graphite, PocketBook 302 & Pro 603
Thanks a lot, I'll take a look at TextPad this weekend, if I find the time

I might experiment with these regexes or variations of them in Notepad++ as well, since they look quite like the ones I already know:

Code:
\r\n finds line breaks
and if my PDF is messy, ending up with lots of styles like p1, p2,... I use

Code:
(space)class=".*" and replace it with (nothing)
Actually, both come from or are based on tips in Joshua Tallent's excellent book on Kindle formatting. I found it worth every penny.
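
One caveat with the class="..." search, sketched here in Python's `re` module rather than Notepad++ (the sample markup is invented): `.*` is greedy in most regex engines, so on a line with more than one attribute it can run to the last quote and remove markup in between. Restricting the match to non-quote characters avoids that.

```python
import re

html = '<p class="p1">One <span class="s2">two</span></p>'

# Greedy: .* extends to the LAST double quote on the line.
greedy = re.sub(r' class=".*"', "", html)

# Narrower: [^"]* stops at the first closing quote of each attribute.
narrow = re.sub(r' class="[^"]*"', "", html)

print(greedy)  # prints <p>two</span></p>
print(narrow)  # prints <p>One <span>two</span></p>
```

The greedy version has deleted the word "One" and the opening span tag along with the attributes.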
Old 08-19-2010, 03:21 PM   #9
kacir
Wizard
Posts: 3,450
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
Quote:
Originally Posted by bear4hunter View Post
Thanks a lot, I'll take a look at TextPad this weekend, if I find the time
If you are not afraid, do try the [G]Vim text editor.
http://www.vim.org/
Vim has excellent documentation. It has two parts: a user manual and a reference manual. The user manual was actually written by a very good professional author of technical books.

Quote:
Originally Posted by bear4hunter View Post
Code:
\r\n finds line breaks
Yes, it does (for files with DOS-type line endings, see http://en.wikipedia.org/wiki/End-of-line ), but only in Extended mode, not in Regular Expression mode.
In Regular Expression mode I tried all of the following, and none of them matched:
\r\n
\n\r
\r
\n

You might try $
OpenOffice.org Writer uses $ as the metacharacter for the end of a paragraph and \n for a manual line break. Strange.
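
To show why the line-ending type matters for a `$` search, here is a small sketch in Python's `re` module, with invented sample text. In a file with DOS line endings, the character just before each `\n` is `\r`, so a test for "last character is not . ? !" fires on every line unless the endings are normalised first.

```python
import re

dos_text = "cut off here\r\nand continued.\r\nNext paragraph.\r\n"

# Naive search on DOS text: $ matches just before \n, where the previous
# character is always \r, so every line is reported.
naive = re.findall(r"[^.?!\n]$", dos_text, flags=re.MULTILINE)

# After converting \r\n to \n, $ lines up with the visible end of the line.
unix_text = dos_text.replace("\r\n", "\n")
fixed = re.findall(r"[^.?!\n]$", unix_text, flags=re.MULTILINE)

print(naive)  # prints ['\r', '\r', '\r']
print(fixed)  # prints ['e']
```

Only the first line, which really is cut off, is reported once the line endings are normalised.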

EVERY SINGLE implementation of Regular Expressions I have seen has some strange incompatibility with all the other versions.
Some programs even provide several syntaxes you can use: TextPad has two, and Vim has four (very magic, magic, nomagic and very nomagic). I am not kidding ;-) Have a look: http://vimdoc.sourceforge.net/htmldo...rn.html#/magic

I strongly recommend the book Mastering Regular Expressions: http://oreilly.com/catalog/9780596528126/
All times are GMT -4. The time now is 03:08 AM.