Old 12-22-2010, 06:13 PM   #16
tscamera
Enthusiast
tscamera began at the beginning.
 
Posts: 30
Karma: 10
Join Date: Dec 2010
Device: PRS-650 ... ipad
unwanted linebreaks

Quote:
Originally Posted by kiwidude
I've done this a lot to "repair" the results of PDF conversions.

What you want to do is something like this:
Find: ([a-z])</p>\s+<p class="calibre2">
Replace: \1
In the replace expression, it is \1 followed by a single space.

That will find any paragraph ending with a lowercase a-z, strip the paragraph end/beginning tags, and replace them with that same last character plus a space. Putting the () brackets around the expression in the Find puts it into a group, which you then access in the replace with \1
Oops!

If you live in a non-English-speaking country and are cleaning up text in your own language, don't forget to add the extra characters that words may end with to the character class, e.g. German ß:

([a-zß])</p>\s+<p class="calibre2">

But if you understood what happens here, you already know that, don't you?
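To make the point concrete, here is a minimal Python sketch of the same Find/Replace, with and without ß in the character class. The German sample text is made up for illustration, and Python's `re` module stands in for Sigil's regex engine, which may behave slightly differently.

```python
import re

# kiwidude's Find expression, and the same expression with ß added to the class.
ascii_only = re.compile(r'([a-z])</p>\s+<p class="calibre2">')
with_eszett = re.compile(r'([a-zß])</p>\s+<p class="calibre2">')

# Made-up sample: a paragraph wrongly split after a word ending in ß.
html = ('<p class="calibre2">Das war ein großer Spaß</p>\n'
        '<p class="calibre2">für alle.</p>')

# ß lies outside a-z, so the plain expression finds nothing to join.
missed = ascii_only.search(html) is None

# Replacement is \1 followed by a single space.
joined = with_eszett.sub(r"\1 ", html)

print(missed)
print(joined)
```

The same idea applies to any other letter your language allows at the end of a word: if it is not in the class, the broken paragraph is silently skipped.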
Old 12-22-2010, 08:14 PM   #17
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by kiwidude
I know I am at risk of going O/T with Calibre discussion in a Sigil forum here but this is all related to recommended ways of conversion to make the Sigil editing work easier...

One thing that I have found with Calibre is due to the way it stores the conversion metadata I have to be careful to "unselect" stuff when doing different conversions. i.e. I always want EPUB to be my "master copy" since it converts so easily to other formats. So the first conversion will be from something else to EPUB for tidying up in Sigil. After that I then need to convert to MOBI for use on my Kindle. However I found I need to make sure I deselect any Calibre conversion options before I do the EPUB->MOBI conversion or else some of my careful Sigil work gets undone.

Is this what you would expect or am I doing something wrong? Because of this I don't really set much in the way of "global defaults" for conversions since so many settings are common to all formats but you actually only want them to be applied to the first conversion. The "re-run" factor to other formats becomes an issue when you turn these things on. Maybe I just got unlucky or imagined it...
Might be worth opening a thread in the Calibre forum with more details about what it changes. Calibre shouldn't do much, but there are things under the 'look and feel' options which may change things - in particular font size rescaling, line spacing handling, and margins may get added based on your output profile. The other thing is that when you go from epub to mobi you're downgrading from html 4 to html 3.2 - there are a lot of things you can do with epub that aren't supported in mobi, and Calibre needs to change the content to support that.



Quote:
Originally Posted by tscamera
If you live in a non-English-speaking country and are cleaning up text in your own language, don't forget to add the extra characters that words may end with to the character class, e.g. German ß.
The full list of characters I've put together so far would be:
Code:
([a-zäëïöüàèìòùáćéíóńśúâêîôûçąężı,:)\IA\u00DF]|(?<!\&\w{4});)
This is the full regex that I use for unwrapping:
Code:
(?<=.{85}([a-zäëïöüàèìòùáćéíóńśúâêîôûçąężı,:)\IA\u00DF]|(?<!\&\w{4});))\s*</(span|p|div)>\s*(</(p|span|div)>)?\s*(?P<up2threeblanks><(p|span|div)[^>]*>\s*(<(p|span|div)[^>]*>\s*</(span|p|div)>\s*)</(span|p|div)>\s*){0,3}\s*<(span|div|p)[^>]*>\s*(<(span|div|p)[^>]*>)?\s*
The number 85 at the beginning should be changed to the median line length for your document. This doesn't require any \1 \2 replacement, as it uses a zero-width lookbehind on the last character of the first line and doesn't bother matching the first letter of the second line. I use this one with Python; it may need a few tweaks to work with Sigil's regex engine.
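A much-reduced sketch of the lookbehind idea in Python, for illustration only: it keeps the minimum-line-length check and the "last character" test but drops the span/div handling, the blank-paragraph group and most of the character list. The function name, the sample text and the threshold of 40 are all assumptions made for the example.

```python
import re

def build_unwrap_pattern(min_length: int) -> re.Pattern:
    # Lookbehind: at least `min_length` characters on the same line, followed
    # by a lowercase letter or comma. Nothing inside (?<=...) is consumed, so
    # the match itself is only the closing/opening tags and whitespace.
    return re.compile(r"(?<=.{%d}[a-zß,])\s*</p>\s*<p[^>]*>\s*" % min_length)

html = (
    '<p class="calibre2">This line was wrapped by the PDF converter in the</p>\n'
    '<p class="calibre2">middle of a sentence.</p>\n'
    '<p class="calibre2">Short line</p>\n'
    '<p class="calibre2">stays separate.</p>'
)

pattern = build_unwrap_pattern(40)
# Because the last letter is not part of the match, the replacement is just
# a space; no \1 is needed.
unwrapped = pattern.sub(" ", html)
print(unwrapped)
```

Note that the length check counts the markup on the line as well as the text, which is one reason the threshold has to be tuned per document.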

Last edited by ldolse; 12-22-2010 at 08:17 PM.
Old 12-23-2010, 04:21 AM   #18
Jellby
frumious Bandersnatch
Jellby ought to be getting tired of karma fortunes by now.
Posts: 7,515
Karma: 18512745
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by ldolse
Code:
([a-zäëïöüàèìòùáćéíóńśúâêîôûçąężı,:)\IA\u00DF]|(?<!\&\w{4});)
You're missing at least ãõñæøþð.
Old 12-23-2010, 06:13 AM   #19
Valloric
Created Sigil, FlightCrew
Valloric ought to be getting tired of karma fortunes by now.
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
Instead of building a character class yourself, how about using "\w"? That will match any unicode letter, number and underscore.
Old 12-23-2010, 07:14 AM   #20
kiwidude
Calibre Plugins Developer
kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,636
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Quote:
Originally Posted by Valloric
Instead of building a character class yourself, how about using "\w"? That will match any unicode letter, number and underscore.
I'm not a regex guru by any means but a number of the expressions we have been looking at are intentionally excluding the uppercase versions of characters which \w would include.
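A quick Python check of that difference, offered as a sketch rather than a statement about Sigil's engine: `\w` accepts uppercase letters, digits and the underscore as well as lowercase letters, whereas a lowercase-only test rejects them. The helper `is_lowercase_letter` is a hypothetical name introduced for the example; it relies on `str.islower()` because Python's built-in `re` module has no lowercase-letter class.

```python
import re

word_char = re.compile(r"\w")

def is_lowercase_letter(ch: str) -> bool:
    """True only for a single cased, lowercase character (hypothetical helper)."""
    return len(ch) == 1 and ch.islower()

samples = ["a", "ß", "ñ", "A", "Ñ", "7", "_"]

# Every sample is a "word character" as far as \w is concerned...
matches_w = [ch for ch in samples if word_char.fullmatch(ch)]

# ...but only the first three are lowercase letters.
lowercase_only = [ch for ch in samples if is_lowercase_letter(ch)]

print(matches_w)
print(lowercase_only)
```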
Old 12-23-2010, 07:35 AM   #21
Valloric
Created Sigil, FlightCrew
Valloric ought to be getting tired of karma fortunes by now.
Posts: 1,982
Karma: 350515
Join Date: Feb 2008
Device: Kobo Clara HD
Quote:
Originally Posted by kiwidude
I'm not a regex guru by any means but a number of the expressions we have been looking at are intentionally excluding the uppercase versions of characters which \w would include.
Haven't been paying close enough attention to notice that requirement.
Old 12-23-2010, 09:47 AM   #22
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by Jellby
You're missing at least ãõñæøþð.
Cool, I'll add those - a lot of those I only get when a foreign language user opens a Calibre pdf bug.
Old 12-23-2010, 09:50 AM   #23
ldolse
Wizard
ldolse is an accomplished Snipe hunter.
 
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Quote:
Originally Posted by Valloric
Haven't been paying close enough attention to notice that requirement.
Requirement might be a bit strong; it's just a balance between false positives and false negatives. I generally try to err on the side of false negatives, since they're easier to detect later. Sometimes I think it might be easier to use \w, since line length is in there as an extra check, but I haven't made that jump.
Old 12-26-2010, 11:04 AM   #24
cybmole
Wizard
cybmole ought to be getting tired of karma fortunes by now.
 
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
this thread is very useful. I'd been fixing up line breaks manually in Word, which took ages.
I am amazed how easy it can be via regex.

I use variations of the code given earlier in this thread i.e.
What you want to do is something like this:
Find: ([a-z])</p>\s+<p class="calibre2">
Replace: \1


changing calibre2 as needed on a per-book basis - sometimes it needs a different one- or two-digit number, like calibre13
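Instead of editing the class name for each book, the number can be matched generically. A minimal Python sketch, assuming the class is always "calibre" followed by one or two digits; the function name `unwrap` and the sample text are made up for the example.

```python
import re

def unwrap(html: str) -> str:
    # \d{1,2} accepts calibre2, calibre13, ... instead of one fixed class name.
    pattern = re.compile(r'([a-z])</p>\s+<p class="calibre\d{1,2}">')
    # \1 followed by a single space, as in the original Replace expression.
    return pattern.sub(r"\1 ", html)

sample = '<p class="calibre13">first half</p>\n<p class="calibre13">second half.</p>'
print(unwrap(sample))
```

The trade-off is that this joins onto any calibreN paragraph, including classes a book may use for headings or captions, so sticking to the one class that carries the body text may still be the safer choice.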

one that still slips through the test, though, is when a sentence has been split such that the new line starts with the one-letter word capital I. can that also be caught via regex ?
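One possible approach to the capital-I case, sketched in Python. It assumes the variant in use also checks how the following line starts (the plain expression quoted above only looks at the end of the first line), and adds the standalone word "I" as an allowed start via a lookahead. The pattern and sample text are illustrative, not taken from the thread.

```python
import re

# Hypothetical variant: join when the next paragraph starts with a lowercase
# letter or with the standalone word "I". The lookahead (?=...) only peeks at
# the next line; it does not consume it, so nothing there is replaced.
pattern = re.compile(r'([a-z,])</p>\s+<p class="calibre2">(?=[a-z]|I\b)')

html = ('<p class="calibre2">She asked whether</p>\n'
        '<p class="calibre2">I had seen it.</p>\n'
        '<p class="calibre2">It was gone</p>\n'
        '<p class="calibre2">In the morning.</p>')

fixed = pattern.sub(r"\1 ", html)
print(fixed)
```

The \b after I keeps words such as "It" or "In" from qualifying. A genuine new paragraph that begins with "I" after a line lacking final punctuation would still be joined by mistake, so the results are worth a manual check.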
Old 01-05-2011, 06:39 AM   #25
cybmole
Wizard
cybmole ought to be getting tired of karma fortunes by now.
 
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Quote:
Originally Posted by kiwidude

Replace: \1 (\1 followed by a space)
I looked at the regex syntax reference page here
http://www.regular-expressions.info/reference.html

but there's no definition of the \1 operation ?

do I need a better reference page or book ???


what's a good book to learn from ( intending to use more regex in both sigil and calibre ). I'd prefer an ebook that I can put onto my kindle.
Old 01-05-2011, 07:34 AM   #26
kiwidude
Calibre Plugins Developer
kiwidude ought to be getting tired of karma fortunes by now.
 
Posts: 4,636
Karma: 2162064
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Quote:
Originally Posted by cybmole
I looked at the regex syntax reference page here
http://www.regular-expressions.info/reference.html

but there's no definition of the \1 operation ?

do I need a better reference page or book ???
You just needed to click on the "Advanced syntax" page link (that is the "Basic syntax" page).
http://www.regular-expressions.info/refadv.html
Quote:
Originally Posted by cybmole
what's a good book to learn from ( intending to use more regex in both sigil and calibre ). I'd prefer an ebook that I can put onto my kindle.
I can't name any books, perhaps someone else can. That website is very good, and between the basic and advanced page you have the "cheat sheet" for most of what you need to know to refresh your memory.
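For what the \1 actually refers to, a small Python sketch may help: it is a backreference to whatever the first pair of () brackets in the Find expression captured. The sample text is made up, and Python's `re` module is used here only as a stand-in for the engines in Sigil and Calibre.

```python
import re

find = re.compile(r'([a-z])</p>\s+<p class="calibre2">')
text = '<p class="calibre2">end of line</p>\n<p class="calibre2">next line.</p>'

match = find.search(text)
whole = match.group(0)   # everything the Find expression matched
first = match.group(1)   # only what the first (...) group captured

# In the replacement, \1 puts the captured letter back; the tags are dropped.
result = find.sub(r"\1 ", text)

print(repr(whole))
print(repr(first))
print(result)
```

Without the group and the \1, the last letter of the first line would be deleted along with the tags, because it is part of the match.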
Old 01-07-2011, 03:48 AM   #27
cybmole
Wizard
cybmole ought to be getting tired of karma fortunes by now.
 
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
i found this - printable cheat sheets

http://www.addedbytes.com/cheat-shee...eet-version-1/