Old 07-19-2011, 08:57 AM   #76
burbleburble
Connoisseur
burbleburble began at the beginning.
 
Posts: 52
Karma: 38
Join Date: Jun 2011
Device: Kindle 3
To those who have posted in the last few days: Thanks for the feedback and interest. Though I have not responded to all the specifics, I am keeping them in mind. Please bug me again if I don't manage to deal with them in the next functional update.

@davidfor
I worked out a 'QSplitter' arrangement that should allow easy manual resizing of the display objects in the next update, so I removed the setMinimum line of code. It will also have a maximize/resize button in the upper right corner.

@Kovid, anyone
I have begun working on a search and replace / punctuation checker, but I can't figure out how to easily match across paragraphs, etc. Does JavaScript regex provide a way to easily match across tags, especially paragraph tags, and to include them in an expression? I googled it but came up with nothing helpful.

Last edited by burbleburble; 07-19-2011 at 09:34 AM.
Old 07-19-2011, 10:01 AM   #77
burbleburble
Connoisseur
burbleburble began at the beginning.
 
Posts: 52
Karma: 38
Join Date: Jun 2011
Device: Kindle 3

@Kovid
(New post in case you already looked at the last one, as I see you are logged on)
I am having strange issues when running the plugin in calibre (it works fine in PyScripter). I recoded the delete, backspace, and enter key responses for WebKit. In calibre, delete won't remove the next character. Not only that: for all of these keys, in a fairly long paragraph, it will insert some very weird characters elsewhere in the paragraph. If you have any suggestions as to the cause, or could take a look, I would much appreciate it.
I attached the current version (the one that is having the issues) below. The actual code is in main.Text.Browser under keypress() etc.
Attached Files
File Type: zip plugin.zip (165.6 KB, 232 views)
Old 07-19-2011, 10:12 AM   #78
kovidgoyal
creator of calibre
kovidgoyal ought to be getting tired of karma fortunes by now.
 
 
Posts: 43,860
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Sorry, I am swamped at the moment.
Old 07-19-2011, 02:49 PM   #79
user_none
Sigil & calibre developer
user_none ought to be getting tired of karma fortunes by now.
 
 
Posts: 2,488
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
Quote:
Originally Posted by burbleburble
I realize that calibre's HTMLZ doesn't support all tags/css.
Admittedly I did not read through this entire thread, but this is news to me. The non-CSS translation doesn't support all styles, but all of the other output versions should support all HTML tags and styles. All HTMLZ does is translate links and put multiple HTML files together. The only things I can think of that would be lost are per-page background colors and images. Which ones are you having issues with?
Old 07-19-2011, 03:04 PM   #80
burbleburble
Connoisseur
burbleburble began at the beginning.
 
Posts: 52
Karma: 38
Join Date: Jun 2011
Device: Kindle 3
@user_none

I converted 'The Princess and the Goblin' (by George MacDonald, from gutenberg.org, for copyright-free testing purposes) from EPUB to HTMLZ. If I recall correctly, the EPUB used list tags for the table of contents, and these were converted to spans alone. I will try to double check this tomorrow.

@anyone, everyone
If anyone can answer the question posted in #76, please do; it is rather important for implementing a good search and replace.

Last edited by burbleburble; 07-19-2011 at 03:07 PM.
Old 07-19-2011, 09:19 PM   #81
snarkophilus
Wannabe Connoisseur
snarkophilus ought to be getting tired of karma fortunes by now.
 
Posts: 425
Karma: 2516674
Join Date: Apr 2011
Location: Geelong, Australia
Device: Kobo Libra 2, Kobo Aura 2, Sony PRS-T1, Sony PRS-350, Palm TX
Quote:
Originally Posted by burbleburble
I have begun working on a search and replace / punctuation checker, but I can't figure out how to easily match across paragraphs, etc. Does JavaScript regex provide a way to easily match across tags, especially paragraph tags, and to include them in an expression? I googled it but came up with nothing helpful.
What exactly do you want to match? I'm not sure what you mean by "across paragraphs". A regex shouldn't care about tags. Can you provide an example of what you want to match (and what you might not want to match)?

Note that I know regexes in general, but absolutely nothing about JavaScript.

Cheers,
Simon.
Old 07-20-2011, 03:47 AM   #82
burbleburble
Connoisseur
burbleburble began at the beginning.
 
Posts: 52
Karma: 38
Join Date: Jun 2011
Device: Kindle 3
Examples:

Code:
<p>This is an example</p><p>of a broken paragraph</p>
- I need to be able to recognize with regex in JS that there is a paragraph break in the middle of this sentence (i.e. the word 'example' ends a paragraph with no punctuation such as a period).

Code:
<p><span>Bob <span> went to</span> the <span> market</span>. </span></p>
- I need to be able to match such a sentence, without the tags getting in the way. Unfortunately, regex that I know of doesn't support repeating strings, only characters: I can't write (<[^>]>)*.. and then there would still be the problem of gathering all the variable amount of matches (\1=Bob , \2= went to, etc... in this case)

I've gotta assume someone's figured these issues out. I can't imagine how it hasn't come up before in the world of programming! But I googled and searched and couldn't find answers... Thanks for any help!
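
For the first example above, a plain regex run directly over the markup may already be enough. The following is only a sketch in Python's `re` module (the thread asks about JavaScript, whose regex syntax should be able to express the same pattern). The terminator set and the names `BROKEN_PARA`, `find_broken_paragraphs` and `join_broken_paragraphs` are illustrative assumptions, and a paragraph that ends in a closing inline tag such as `</span></p>` is deliberately not handled.

```python
import re

# Assumption: characters that legitimately end a paragraph. Illustrative only.
TERMINATORS = '.?!"\u201d'

# One character that is neither whitespace, '>' nor a terminator, followed by
# </p>, optional whitespace and the next opening <p ...> tag.
BROKEN_PARA = re.compile(
    r'([^\s>' + re.escape(TERMINATORS) + r'])\s*</p>\s*<p\b[^>]*>'
)

def find_broken_paragraphs(html):
    """Return the offsets of the last character before each mid-sentence break."""
    return [m.start() for m in BROKEN_PARA.finditer(html)]

def join_broken_paragraphs(html):
    """Replace each mid-sentence paragraph break with a single space."""
    return BROKEN_PARA.sub(r'\1 ', html)

sample = '<p>This is an example</p><p>of a broken paragraph.</p><p>Fine.</p>'
print(find_broken_paragraphs(sample))
print(join_broken_paragraphs(sample))
```

Reporting the offsets first, rather than joining straight away, leaves room for a human to confirm each hit, which matters for headings and verse that end without punctuation on purpose.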
Old 07-20-2011, 04:23 AM   #83
chaley
Grand Sorcerer
chaley ought to be getting tired of karma fortunes by now.
 
Posts: 11,742
Karma: 6997045
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by burbleburble
- I need to be able to match such a sentence, without the tags getting in the way. Unfortunately, regex that I know of doesn't support repeating strings, only characters: I can't write (<[^>]>)*.. and then there would still be the problem of gathering all the variable amount of matches (\1=Bob , \2= went to, etc... in this case)
Regular expressions match sequences of characters. There is no standard way to match repeating character sequences. A favorite assignment to give students is to write a regular expression that matches arbitrary palindromes (a string that matches itself from the outside in, such as ABCDCBA). The correct answer is "You can't." What you want to do is similar, with the same answer.

In some regexp systems you can use back references to match what you matched before, but in limited ways such as fixed counts. In others you must use programmatic matching (in effect, recursive regexps) to dynamically modify the regexp. This latter scheme can be used to solve the palindrome problem. It is, however, very system dependent and rather complicated.

If you have a limited number of cases, it would probably be easier to code these using string functions instead of regexps. That way you can handle both arbitrary numbers of matches and the necessary recursion.

What does javascript have to do with this? Are you really running javascript inside a calibre (python) plugin?
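
One way to read the "code it programmatically" suggestion for the second example is to search a tag-free copy of the text while remembering where each character came from. The sketch below is an assumption-laden illustration, not anything from the plugin: `strip_tags_with_map` and `find_in_markup` are hypothetical helper names, and the simple `<[^>]*>` tag pattern assumes no `>` inside attribute values and no comments or CDATA (a real parser such as lxml would be more robust).

```python
import re

TAG = re.compile(r'<[^>]*>')  # naive tag matcher; see the assumptions above

def strip_tags_with_map(html):
    """Return (text, offsets) where offsets[i] is the index in html of text[i]."""
    text_chars = []
    offsets = []
    pos = 0
    for tag in TAG.finditer(html):
        # Copy the text that sits between the previous tag and this one.
        for i in range(pos, tag.start()):
            text_chars.append(html[i])
            offsets.append(i)
        pos = tag.end()
    # Copy whatever follows the last tag.
    for i in range(pos, len(html)):
        text_chars.append(html[i])
        offsets.append(i)
    return ''.join(text_chars), offsets

def find_in_markup(html, pattern):
    """Search the tag-free text; report each hit as a (start, end) span in html."""
    text, offsets = strip_tags_with_map(html)
    spans = []
    for m in re.finditer(pattern, text):
        if m.end() > m.start():  # skip empty matches
            spans.append((offsets[m.start()], offsets[m.end() - 1] + 1))
    return spans

html = '<p><span>Bob <span> went to</span> the <span> market</span>. </span></p>'
text, _ = strip_tags_with_map(html)
print(repr(text))
for start, end in find_in_markup(html, r'Bob\s+went\s+to\s+the\s+market\.'):
    print(html[start:end])
```

The returned span covers the inline tags that fall inside the sentence, so a caller can highlight or rewrite the original markup without having to describe those tags in the search pattern.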
Old 07-20-2011, 05:52 AM   #84
burbleburble
Connoisseur
burbleburble began at the beginning.
 
Posts: 52
Karma: 38
Join Date: Jun 2011
Device: Kindle 3
Well, for general regex I figured as much, but thanks for the confirmation.

As for why I am using JavaScript: the ebook editor uses WebKit, which has limited Python bindings. JavaScript handles most tasks involving the internal DOM, cursor, and editing much more easily (if not more clearly) and faster. I hoped that since JS is dedicated to the HTML DOM, it must have some regex search method that can take the tags/structure into account.
Old 07-20-2011, 05:47 PM   #85
Ortep
Fanatic
Ortep has a complete set of Star Wars action figures.
 
Posts: 527
Karma: 470
Join Date: Sep 2007
Location: The Netherlands
Device: Kindle Oasis
Hi, I'm not sure if it is helpful to you, but when I use, for example, Word to find 'extra' paragraph breaks, I look for a ^p without one of the following characters in front of it:

Quote:
. ? ! "
This is because the end of a paragraph is always at the end of a sentence, and these are the characters you will find at the end of a sentence. Well, at least in 99.9% of the cases.

Of course you can't find characters that are not there, so I turn it around. I look for a ^p with one of those characters in front of it and change it to that character followed by <<PAR>>, a string you probably won't find in a text. This marks the 'real' paragraphs.

I'm not sure if you can do that in a one-step regex, but you can always do it in four separate ones.


Then I change all the ^p that are left to a space. Those are the ones that aren't at the end of a sentence. In the next step I replace all the <<PAR>> with ^p.

This process effectively removes all paragraph breaks that do not fall at the end of a sentence and leaves the ones that do.

You probably want to first replace every ^p that has a space in front of it with a single ^p, because sometimes there is a space between the end-of-sentence character and the ^p.


It is not a perfect process, but it will catch at least 95% of your problems.
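
The placeholder procedure described in this post can be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than the Word workflow itself: the paragraph mark ^p is represented by `'\n'`, and `rejoin_paragraphs` and `rejoin_one_step` are names made up for the sketch. The second function suggests that a one-step version is possible where the regex engine supports negative lookbehind.

```python
import re

PARA = '\n'        # stands in for Word's ^p (an assumption of this sketch)
MARK = '<<PAR>>'   # placeholder that is unlikely to occur in real text

def rejoin_paragraphs(text):
    # Preliminary step: drop spaces between the sentence end and the break.
    text = re.sub(r' +' + PARA, PARA, text)
    # Step 1: protect breaks preceded by . ? ! or " with the placeholder.
    text = re.sub(r'([.?!"])' + PARA, r'\1' + MARK, text)
    # Step 2: every break still present is mid-sentence; make it a space.
    text = text.replace(PARA, ' ')
    # Step 3: restore the protected breaks.
    return text.replace(MARK, PARA)

def rejoin_one_step(text):
    # Same result using a negative lookbehind instead of a placeholder.
    text = re.sub(r' +' + PARA, PARA, text)
    return re.sub(r'(?<![.?!"])' + PARA, ' ', text)

sample = 'It was a dark\nand stormy night. \n"Help!"\nsaid no\none.\n'
print(repr(rejoin_paragraphs(sample)))
print(rejoin_paragraphs(sample) == rejoin_one_step(sample))
```

In the sample, the break after `"Help!"` survives even though the sentence continues with `said no one.`, which is the kind of case that keeps the method short of perfect.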

Last edited by Ortep; 07-20-2011 at 06:00 PM.
Old 07-21-2011, 10:35 AM   #86
burbleburble
Connoisseur
burbleburble began at the beginning.
 
Posts: 52
Karma: 38
Join Date: Jun 2011
Device: Kindle 3
Thanks. That's a neat approach.

After giving it some thought, I decided it might be more robust to first replace the 'p' tags with say, a null byte or '\n', and record the original tags and their position in a list of tuples. Then - search, mark the results with some tags, replace the p tags, and proceed from there.
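
A minimal sketch of that plan, assuming a null byte as the placeholder and that the source contains none of its own. The names `flatten`, `restore`, `P_TAG` and `HOLE` are hypothetical, and the "mark the results" step is shown with an arbitrary `<span class="hit">` wrapper. Because marking changes the length of the text, the recorded positions go stale after editing, so this version restores the tags by order and keeps the positions only for reporting.

```python
import re

P_TAG = re.compile(r'</?p\b[^>]*>')  # opening and closing p tags only
HOLE = '\x00'                        # placeholder; assumed absent from the source

def flatten(html):
    """Swap every <p ...> and </p> for HOLE; return (flat, tags).

    tags is a list of (position_in_flat, original_tag) tuples.
    """
    tags = []
    parts = []
    flat_len = 0
    pos = 0
    for m in P_TAG.finditer(html):
        chunk = html[pos:m.start()]
        parts.append(chunk)
        flat_len += len(chunk)
        tags.append((flat_len, m.group(0)))
        parts.append(HOLE)
        flat_len += 1
        pos = m.end()
    parts.append(html[pos:])
    return ''.join(parts), tags

def restore(flat, tags):
    """Put the recorded tags back, in order, one per placeholder."""
    pieces = flat.split(HOLE)
    if len(pieces) != len(tags) + 1:
        raise ValueError('placeholders were added or removed during editing')
    out = [pieces[0]]
    for (_, tag), piece in zip(tags, pieces[1:]):
        out.append(tag)
        out.append(piece)
    return ''.join(out)

html = '<p class="a">This is an example</p><p>of a broken paragraph.</p>'
flat, tags = flatten(html)
# Mark every word that sits directly before a close-tag/open-tag pair.
marked = re.sub(r'(\w+)(?=\x00\x00)', r'<span class="hit">\1</span>', flat)
print(restore(marked, tags))
```

Keeping the original tag text in the tuples means attributes such as `class="a"` come back untouched when the placeholders are swapped out again.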
Old 08-26-2011, 02:44 AM   #87
burbleburble
Connoisseur
burbleburble began at the beginning.
 
Posts: 52
Karma: 38
Join Date: Jun 2011
Device: Kindle 3
It's been a while, but I'm back.

Um, I've written a fully functional program, and have used it myself for more than 100 books. Problem is, I gave up on working out the kinks when interfacing with Calibre's python 2.7 (mainly unicode vs ascii issues), since I wrote it originally in 3.2. So, it's currently a standalone program.
  1. But I can't figure out where on the forum to post one's own standalone program. Any suggestions?
  2. Also, if people are interested in doing so, I don't see why it can't be ported to Py 2.7 and returned to plugin status. I just gave up doing it myself.
Old 09-02-2011, 11:58 AM   #88
Ortep
Fanatic
Ortep has a complete set of Star Wars action figures.
 
Posts: 527
Karma: 470
Join Date: Sep 2007
Location: The Netherlands
Device: Kindle Oasis
Maybe just another stupid idea, but can you use the 'open with' plugin to open a file with your program? Not completely integrated, but you can launch from within Calibre.

I'd like to test it.
Old 09-14-2011, 07:15 AM   #89
burbleburble
Connoisseur
burbleburble began at the beginning.
 
Posts: 52
Karma: 38
Join Date: Jun 2011
Device: Kindle 3
Sorry about the delay.

Ortep: Sounds okay. What with school and SATs, I don't have time just now to learn how to integrate it with the 'Open With' plugin, and it still wouldn't open ebooks directly from calibre without some work. (I really would like to port it to Python 2.7 to reintegrate with calibre, but testing all the unicode conversions will take a lot of time: it primarily manipulates unicode text, and py2.7 and py3+ differ greatly in this area.)

Meanwhile [I hope this is okay to post here, as I do look forward to re-integrating it, as it was before, and you asked to test its current state]: Because it's an independent package running off Python 3 + PyQt4 + lxml, it's a 13 MB RAR package. I have uploaded it to megaupload:

Note: This program is currently optimized/designed for a large screen! It will appear cluttered and probably be awkward to use on a small screen!

ECleaner v1.0.6 Program


ECleaner v1.0.6 Instructions


Before you start:
  • I am aware that it will be confusing at first, and the help is rather limited. I welcome questions and suggestions for making it more intuitive; however, due to a busy schedule, I may not respond right away (give me a week+).
  • There is a known 'memory leak'. This sounds scary but it's not. It just means that every few ebooks you will have to restart the program; otherwise it will slow down.
  • Use at your own risk. I have never experienced any issues though...
Installation:
  • Unpack the RAR
  • Inside the unpacked folder is a shortcut 'ECleaner'. (It has no logo; it is just a white box. If for some reason it can't find the program, it is the file 'main.exe' in the subfolder ECleaner).
First steps ['Raw EBooks']:
  • I assume you are familiar with basic HTML, namely the tags p, span, i, b, a and the attributes class, id.
  • Create an HTMLZ file with calibre. You MUST first set these conversion options: In 'HTMLZ Output'-- set to 'inline' for both selections. In calibre 'Look & Feel'-- check 'Smarten punctuation'.
  • Copy the 'raw' (uncleaned) htmlz file into the ECleaner-subfolder named 'RawFiles'. (This is for your ease, not required. ECleaner looks in this folder first. It will save you time from browsing to the file.)
  • Run ECleaner.
  • Since this is a 'raw' ebook, press the button named 'Open and Clean Htmlz'. This will tidy it, compressing nested divs, p's, spans, and creating classes based on patterns in the ebook. This allows one to easily and quickly restructure the ebook.
    • For example, all chapter headers will generally fall under the same class, let us say 5. One can then rename class="5" to class="Chapter", either to a 'single' element or 'forward' to all following elements with class="5".
  • Update the basic metadata. A new cover can be dragged & dropped over the old one.
Cleaning 'em up:
  • Go to tab #2 'Content', and explore your options! It usually takes me no more than 15 minutes an ebook, but of course I already know my way around.
  • There are options for ()Checking for punctuation issues, ()Renaming classes, ()Creating id's, ()Auto generating css formatting, ()Auto generating a titlepage and toc, ()Changing to titlecase, ()Search and replace, and so on.
  • On the left is a 'navigator' which can help navigate: ()By punctuation, ()By class, ()By image. It also provides many useful pieces of information.
    • For example, one might want to check that all chapter titles are followed by a paragraph with class='FirstScene'. Well, this info is listed.
    • For example, one might want to check that all chapter headers were found. All one needs to do is check the numbering: if it lists 1-35, and the last chapter header is 'Chapter 36', you know you're missing one!
  • The center pane is the main editor. On the right is a previewer. For purposes of speed, it usually displays only a 'range' of the ebook, relative to the cursor. Sometimes it updates itself, sometimes you have to click it.
Tips, Tricks, and Warnings:
  • Of course you'll figure them out with time, but some basics:
  • This program is optimized for the big screen! But, all tools are resizeable and hideable!
  • The search and replace via regex has no undo option! Save your htmlz first!
  • The auto formatting 'Use class templates' button:
    • It is based on my personal taste in formatting!
    • All supported classes are in the 'choose class' drop down list.
    • Some classes span only a single paragraph: For example, Chapter, Part, Title.
    • Some classes may span multiple paragraphs: For example, a 'Letter', a 'Verse', a 'Quotation'. These should be used as follows, for example: Define the first paragraph as class="Letter". Leave the following paragraphs blank. Define the paragraph FOLLOWING the last paragraph in the letter as class="Regular". This way, the program knows where the letter ends. It also works to define the following paragraph as any other class, such as 'Letter' (a new letter), or Scene, etc.; just don't leave it class=""!
  • The create 'Titlepage and Toc' button:
    • The auto generated titlepage is based on my personal taste in formatting! You can always adjust it after creating it.
    • To create a TOC (Table of Contents), you must first add the 'id' attribute to the relevant points in the ebook.
    • You must format the id's as follows (there are buttons for helping to add them...): 'Epilogue', 'Prologue', 'Quotation#' (for an Epigraph), 'Part#', or 'Chapter#'. Obviously, no 'id' may be repeated.
    • In the special case that an ebook contains sections in both the form 'Part' and 'Chapter', a special multi-level toc is created. For this, you must create id's for chapters in the form 'Part#Chapter#'
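
To make the two conventions in this section concrete, here is a small Python sketch. It is only one reading of the rules as worded above and is not ECleaner's own code: the regex `TOC_ID`, the set `MULTI` and the functions `check_toc_ids` and `fill_classes` are all hypothetical names introduced for illustration.

```python
import re

# Accepted id shapes, as read from the instructions; the regex is illustrative.
TOC_ID = re.compile(
    r'^(?:Prologue|Epilogue|Quotation\d+|Part\d+|Chapter\d+|Part\d+Chapter\d+)$'
)

# Classes assumed to span several paragraphs (taken from the examples given).
MULTI = {'Letter', 'Verse', 'Quotation'}

def check_toc_ids(ids):
    """Return (malformed, duplicated) lists for a sequence of id strings."""
    malformed = [i for i in ids if not TOC_ID.match(i)]
    seen, duplicated = set(), []
    for i in ids:
        if i in seen and i not in duplicated:
            duplicated.append(i)
        seen.add(i)
    return malformed, duplicated

def fill_classes(classes):
    """Carry a multi-paragraph class forward over paragraphs left blank."""
    filled, current = [], ''
    for c in classes:
        if c:
            # A new explicit class ends any span that was running.
            current = c if c in MULTI else ''
            filled.append(c)
        else:
            filled.append(current)
    return filled

print(check_toc_ids(['Prologue', 'Part1Chapter1', 'Chapter 3', 'Part1Chapter1']))
print(fill_classes(['Chapter', '', 'Letter', '', '', 'Regular', '']))
```

The second call shows why the paragraph after a letter needs an explicit class: without the 'Regular' entry, the blank paragraphs that follow would keep inheriting 'Letter'.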
Saving and Reopening Ebooks:
  • You may save in HTMLZ, EPUB, and MOBI formats.
  • You may reopen any htmlz ebook that was already run through the tidy button, by simply clicking 'Open Htmlz'.
  • You cannot reopen epub or mobi. So save an htmlz backup.
  • NOTE: Epubs are saved with the following settings: Justification is not forced; this is left to the ebook reader's/user's discretion. Indents are automatically added where not otherwise specified to be 0. The html file is not split into multiple parts. (This may cause some ebook readers to open the ebook more slowly...)
  • NOTE: All saves to MOBI or EPUB are first run through Epubcheck. It always pays to check the details box to make sure nothing went wrong when tidying, saving, or opening a file.
Final notes:
  • OK. I'm more than aware these instructions probably won't suffice. I really am not good at this sort of thing. I welcome anyone who wishes to better document this (he/she will get the credits of course). Either way, I welcome questions... see the first paragraph. Some of the buttons have popup tooltips too, I hope to add more when I find the time.
  • So, play around, mess with it. It does work!!! (See the books I've cleaned up on my thread of cleaned up ebooks.)
Old 09-24-2011, 10:40 AM   #90
Ortep
Fanatic
Ortep has a complete set of Star Wars action figures.
 
Posts: 527
Karma: 470
Join Date: Sep 2007
Location: The Netherlands
Device: Kindle Oasis
Hi, I was busy for a while and had no time to check everything. Last night I started playing with the cleaner. Your program looks great. And I am able to start it from within Calibre using the plugin 'Open With'. The only thing I could not do was open the HTMLZ automatically. I'll keep on playing.