09-03-2011, 01:33 PM   #1
therealjoeblow
Zealot
Posts: 106
Karma: 52102
Join Date: Jun 2010
Device: Samsung Android Tablet w/Moon+ Pro Reader
Looking for a tool to find/fix mis-matched quotes...

In a number of poorly OCR'd documents, paragraphs of quoted speech are often either split into two paragraphs or missing a quote mark, either the opening one or the closing one.

I'm looking for a plugin or tool of any kind that can identify these, to make fixing them easier than reading through the entire book and fixing them as I go.

When I have to edit/fix documents, I normally convert to htmlz, then edit the html file with notepad++.

If the tool did exist, here's what I believe it should do:

Paragraphs are generally (not always though, but in most cases) formatted like:

<p class="calibre#">This is some text</p>

What it should do is look at each <p></p> grouping (ignoring the class), and count the " characters it finds within. In cases where proper slanted open/close quotes have been used, count each occurrence of these.

Then identify the paragraphs where there are an odd number of each of these so the user can review/fix it (I realize that part would have to be manual since no tool can tell where to put the missing quote properly, but at the very least, it would make finding these *much* easier).
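
The check described above can be sketched in a few lines of Python. This is only a rough illustration, not an existing calibre or Notepad++ feature. The names quote_problems, ParagraphQuoteChecker and find_suspect_paragraphs are made up for this example, and comparing the number of opening and closing curly quotes (rather than just testing for an odd count) is an assumption about what "mismatched" should mean. It uses the standard library's html.parser, which passes only text to handle_data, so quotes inside class="..." attributes are never counted.

```python
from html.parser import HTMLParser


def quote_problems(text):
    """Return a list of reasons why the quotes in `text` look unbalanced."""
    problems = []
    if text.count('"') % 2 == 1:
        problems.append("odd number of straight quotes")
    # Assumption: curly quotes are balanced when opens equal closes.
    if text.count("\u201c") != text.count("\u201d"):
        problems.append("unequal curly open/close quotes")
    return problems


class ParagraphQuoteChecker(HTMLParser):
    """Gather the text of each <p> element and record suspect paragraphs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_paragraph = False
        self.chunks = []      # text pieces of the current paragraph
        self.start_line = 0   # source line of the opening <p> tag
        self.suspects = []    # (line number, reasons, paragraph text)

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive in `attrs`, never in handle_data,
        # so class="calibre#" does not affect the quote count.
        if tag == "p":
            self.in_paragraph = True
            self.chunks = []
            self.start_line = self.getpos()[0]

    def handle_data(self, data):
        if self.in_paragraph:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            text = "".join(self.chunks)
            reasons = quote_problems(text)
            if reasons:
                self.suspects.append((self.start_line, reasons, text))


def find_suspect_paragraphs(html):
    """Return (line, reasons, text) for every paragraph that needs review."""
    checker = ParagraphQuoteChecker()
    checker.feed(html)
    checker.close()
    return checker.suspects
```

To try it on a file extracted from an HTMLZ, read the file as UTF-8 text, pass it to find_suspect_paragraphs, and print each tuple. The line numbers can then be used to jump to the paragraph in an editor. The fixing itself stays manual.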

...kind of like an extended search-and-replace - "Find mismatched quotes"

I'm not particularly hung up about whether it would work in notepad++ or not, any editor would work, and an extension/plugin for calibre would be even better if it could be integrated into a semi-automated conversion process where it prompts for user input on each mismatched set of quotes.

Anyone know if something like this exists?

If not - Kovid, is there any chance you could add this to calibre? It would save a *huge* amount of time and frustration!

Cheers
The REAL Joe
09-03-2011, 08:20 PM   #2
jackie_w
Grand Sorcerer
Posts: 6,234
Karma: 16537336
Join Date: Sep 2009
Location: UK
Device: ClaraHD, Forma, Libra2, Clara2E, LibraCol, PBTouchHD3
As long as your book uses double-quotes, rather than single, you might find some of them with a simple Notepad++ search:

Using regular expression search mode
Code:
“[^”]+</p>
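
For anyone who prefers a script to an editor search, the same pattern can be tried with Python's re module. This is a small sketch, assuming one paragraph per line in the HTML file. The function name unclosed_quote_lines is invented for the example. As with the Notepad++ search, it only catches an opening curly quote that is never closed before the end of the paragraph, and it will not notice a missing opening quote.

```python
import re

# The pattern from the post: an opening curly quote, then one or more
# characters that are not a closing curly quote, then the end of the paragraph.
UNCLOSED = re.compile("\u201c[^\u201d]+</p>")


def unclosed_quote_lines(html):
    """Return (line number, line) pairs where the pattern matches.

    Assumption: each paragraph sits on its own line, so the search is
    applied line by line and cannot run across paragraph boundaries.
    """
    return [
        (number, line)
        for number, line in enumerate(html.splitlines(), 1)
        if UNCLOSED.search(line)
    ]
```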
09-03-2011, 09:06 PM   #3
therealjoeblow
Zealot
Quote:
Originally Posted by jackie_w
As long as your book uses double-quotes, rather than single, you might find some of them with a simple Notepad++ search:

Using regular expression search mode
Code:
“[^”]+</p>
Thanks. That works with proper slanted quotes, but it doesn't really help with plain " style quotes, which appear in a lot of the files I'm having issues with. When I replace “[^”] in the regex with "[^"], it ends up catching every class="calibre#" in the <p> tags.

The only way I can think of for it to work on that is if some intelligent search routine keeps track of how many occurrences of " it finds while ignoring the class attributes.

Maybe there's a way with regex, but I'm really not that great at deciphering its syntax for something this complex (I use it all the time to find broken paragraphs that start with lower case letters using <p class="calibre#">[a-z] )
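
One regex that might do this is sketched below in Python. It consumes the whole opening <p ...> tag first, so the quotes in class="calibre#" are skipped, and then matches only if the paragraph text contains an odd number of straight quotes. Treat it as a starting point rather than a tested solution: the name ODD_STRAIGHT_QUOTES is invented here, and the pattern assumes the paragraph contains no inline tags such as <i> or <span>, because it stops at the first "<" after the opening tag. The same pattern may work in Notepad++'s regular expression mode, but that has not been checked.

```python
import re

ODD_STRAIGHT_QUOTES = re.compile(
    r'<p\b[^>]*>'            # opening tag; quotes in attributes are consumed here
    r'(?:[^"<]*"[^"<]*")*'   # any number of complete pairs of straight quotes
    r'[^"<]*"[^"<]*'         # followed by exactly one leftover quote
    r'</p>'                  # and then the end of the paragraph
)


def odd_quote_paragraphs(html):
    """Return the paragraphs whose text holds an odd number of straight quotes.

    Limitation: paragraphs containing inline tags are never reported.
    """
    return [match.group(0) for match in ODD_STRAIGHT_QUOTES.finditer(html)]
```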

The REAL Joe
09-04-2011, 12:51 AM   #4
jackie_w
Grand Sorcerer
Another possibility: run the Calibre conversion (to any format you prefer, HTMLZ possibly?) with the 'Smarten Punctuation' box checked in Look & Feel. That would convert all the straight quotes in the book text (but not those in the Calibre class attributes) to curly quotes. Then you can do some more simple searching.

Last edited by jackie_w; 09-04-2011 at 12:54 AM.
09-04-2011, 02:25 AM   #5
itimpi
Wizard
Posts: 4,553
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
It might be worth converting to something like text with markdown or textile active. That would eliminate all the HTML tags so it might be easier to manipulate the file. I guess it depends on the complexity of the formatting you need to preserve as to whether that would be a viable route?
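
Once the tags are gone, the check becomes a simple count per paragraph. The sketch below is illustrative only. It assumes the text output separates paragraphs with blank lines, which depends on the conversion settings, and the function name suspect_text_paragraphs is made up for the example.

```python
import re


def suspect_text_paragraphs(text):
    """Return (paragraph number, paragraph) pairs with unbalanced double quotes.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    suspects = []
    for number, paragraph in enumerate(paragraphs, 1):
        straight_odd = paragraph.count('"') % 2 == 1
        curly_unequal = paragraph.count("\u201c") != paragraph.count("\u201d")
        if straight_odd or curly_unequal:
            suspects.append((number, paragraph))
    return suspects
```

Paragraph numbers refer to the plain-text file, so the suspect text still has to be located in the original HTML by searching for a few words from it.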