Gui Plugin for Cleaning Ebooks, Fast - Page 6

burbleburble · 07-19-2011, 08:57 AM

To those who have posted in the last few days: Thanks for the feedback and interest. Though I have not responded to all the specifics, I am keeping them in mind. Please bug me again if I don't manage to deal with them in the next functional update.

@davidfor
I worked out a 'QSplitter' which should allow for easy manual resizing of the display objects in the next update, so I removed the setMinimum line of code. Also, it will have a maximize/resize button in the upper right corner.

@Kovid, anyone
I have begun working on a search and replace / punctuation checker. But I can't figure out how to easily match across paragraphs, etc. Does javascript regex provide a way to easily match across tags, especially paragraph tags, and to include them in an expression? I googled it but came up with nothing helpful.

burbleburble · 07-19-2011, 10:01 AM

@Kovid
(New post in case you already looked at the last one, as I see you are logged on)
I am having strange issues when running the plugin in calibre (it works fine in pyscripter). I recoded the delete, backspace, and enter key responses for webkit. In calibre it won't delete the next character. Not only that, for all these keys, when dealing with a fairly long paragraph it will insert some very weird characters elsewhere in the paragraph. Please, if you have any suggestions as to the issue, or could take a look, I would much appreciate it.
I attached the current version (the one that is having the issues) below. The actual code is in main.Text.Browser under keypress() etc.

kovidgoyal · 07-19-2011, 10:12 AM

Sorry, I am swamped at the moment.

user_none · 07-19-2011, 02:49 PM

Quote:

Originally Posted by burbleburble

I realize that calibre's HTMLZ doesn't support all tags/css.

Admittedly I did not read through this entire thread, but this is news to me. The non-css translation doesn't support all styles, but all of the other output versions should support all HTML tags and styles. All HTMLZ does is translate links and put multiple HTML files together. The only thing that I can think of that would be lost are per page background colors / images. Which ones are you having issues with?

burbleburble · 07-19-2011, 03:04 PM

@user_none

I converted 'The Princess and the Goblin' (by George Macdonald, from gutenberg.org, for copyright free testing purposes) epub to htmlz. If I recall correctly, the epub had a list tag system for the table of contents, and this was converted to spans alone. I will try to double check this tomorrow.

@anyone, everyone
If anyone can please answer the question posted in #76 (my regex question above), it is rather important for implementing a good search and replace.

snarkophilus · 07-19-2011, 09:19 PM

Quote:

Originally Posted by burbleburble

I have begun working on a search and replace / punctuation checker. But I can't figure out how to easily match across paragraphs, etc. Does javascript regex provide a way to easily match across tags, especially paragraph tags, and to include them in an expression? I googled it but came up with nothing helpful.

What exactly do you want to match? I'm not sure what you mean by "across paragraphs". Any regex shouldn't care about tags. Can you provide an example of what you want to match (and might not want to match)?

Note that I know regexes in general, but absolutely nothing about javascript.

Cheers,
Simon.

burbleburble · 07-20-2011, 03:47 AM

Examples:

Code:

<p>This is an example</p><p>of a broken paragraph</p>

- I need to be able to recognize with regex in js that there is a paragraph break in the middle of this sentence (i.e. the word 'example' ends a paragraph with no punctuation such as a period, etc.).

Code:

<p><span>Bob <span> went to</span> the <span> market</span>. </span></p>

- I need to be able to match such a sentence, without the tags getting in the way. Unfortunately, regex that I know of doesn't support repeating strings, only characters: I can't write (<[^>]>)*.. and then there would still be the problem of gathering all the variable amount of matches (\1=Bob , \2= went to, etc... in this case)

I've gotta assume someone's figured these issues out. I can't imagine how it hasn't come up before in the world of programming! But I googled and searched and couldn't find answers... Thanks for any help!
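A minimal sketch of the first case, written with Python's `re` module rather than javascript (the pattern uses only basic syntax, so it should carry over). It assumes the paragraph text ends directly before `</p>` with no other closing tag in between; the set of sentence-ending characters and the function names are illustrative assumptions, not code from the plugin.

```python
import re

# Assumption: a paragraph "ends properly" if its last visible character is
# one of these. The set is illustrative only.
SENTENCE_END = '.?!"\u201d'

# One character that is not whitespace, not '>' and not sentence-ending
# punctuation, then </p>, optional whitespace, and the next <p ...> tag.
BROKEN_BREAK = re.compile(
    r'([^\s>' + re.escape(SENTENCE_END) + r'])\s*</p>\s*<p\b[^>]*>'
)

def find_broken_breaks(html):
    """Return the offsets of the characters that precede a mid-sentence break."""
    return [m.start() for m in BROKEN_BREAK.finditer(html)]

def join_broken_paragraphs(html):
    """Replace each mid-sentence paragraph break with a single space."""
    return BROKEN_BREAK.sub(r'\1 ', html)

broken = '<p>This is an example</p><p>of a broken paragraph</p>'
joined = join_broken_paragraphs(broken)
```

Because `>` is excluded from the character class, a paragraph that ends in `</span></p>` is simply skipped rather than misreported; handling that case needs tag-aware processing.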

chaley · 07-20-2011, 04:23 AM

Quote:

Originally Posted by burbleburble

- I need to be able to match such a sentence, without the tags getting in the way. Unfortunately, regex that I know of doesn't support repeating strings, only characters: I can't write (<[^>]>)*.. and then there would still be the problem of gathering all the variable amount of matches (\1=Bob , \2= went to, etc... in this case)

Regular expressions match sequences of characters. There is no standard way to match repeating character sequences. A favorite assignment to give students is to write a regular expression that matches arbitrary palindromes (a string that matches itself from the outside in, such as ABCDCBA). The correct answer is "You can't." What you want to do is similar, with the same answer.

In some regexp systems you can use back references to match what you matched before, but in limited ways such as fixed counts. In others you must use programmatic matching (in effect, recursive regexps) to dynamically modify the regexp. This latter scheme can be used to solve the palindrome problem. It is, however, very system dependent and rather complicated.

If you have a limited number of cases, it would probably be easier to code these using string functions instead of regexps. That way you can handle both arbitrary numbers of matches and the necessary recursion.

What does javascript have to do with this? Are you really running javascript inside a calibre (python) plugin?
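One way to follow the "string functions instead of regexps" suggestion for the nested-span example is to let a parser drop the tags first and only then search the remaining text. The sketch below uses Python's standard `html.parser`; the class and function names are made up for illustration, and it deliberately ignores the harder problem of mapping a match back to its position in the original markup.

```python
from html.parser import HTMLParser
import re

class TextOnly(HTMLParser):
    """Collect character data and ignore every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def visible_text(html):
    """Return the text content of html with all tags removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)

sample = '<p><span>Bob <span> went to</span> the <span> market</span>. </span></p>'

# Collapse the doubled spaces left behind where tags used to be.
normalized = re.sub(r'\s+', ' ', visible_text(sample)).strip()

# With the tags gone, an ordinary regex matches the whole sentence.
match = re.search(r'Bob went to the market\.', normalized)
```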

burbleburble · 07-20-2011, 05:52 AM

Well, by general regex I figured as much, but thanks for the confirmation.

As for why I am using javascript: the ebook editor uses webkit, which has limited python bindings. Javascript handles most tasks having to do with the internal dom/cursor/editing much more easily (if not more clearly) and faster. I hoped that since js is dedicated to the html dom, it must have some regex search method that can take the tags/structure into account.

Ortep · 07-20-2011, 05:47 PM

Hi, I'm not sure if it is helpful to you, but when I use, for example, Word to find 'extra' paragraph breaks, I look for a ^p without one of the following characters in front of it:

Quote:

. ? ! "

This is because the end of a paragraph is always at the end of a sentence, and these are the characters you will find at the end of a sentence. Well, at least in 99.9% of the cases.

Of course you can't find characters that are not there, so I turn it around. I look for a ^p with one of those characters in front of it and I change it to that character followed by <<PAR>>, a string you probably won't find in a text. This marks the 'real' paragraphs.

I'm not sure if you can do that in a one-step regex, but you can always do it in four separate ones.

Then I change all the ^p that are left to a space. Those are the ones that aren't at the end of a sentence. In the next step I replace all the <<PAR>> with ^p.

This process effectively removes all paragraph breaks that do not fall at the end of a sentence and leaves the ones that do.

You probably want to first replace all ^p with a space in front of it with a single ^p, because sometimes there is a space between the end-of-sentence character and the ^p.

It is not a perfect process, but it will catch at least 95% of your problems.
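The steps above can be sketched in Python as follows. This is only an illustration under two assumptions: the text is plain (no tags) with `\n` standing in for Word's ^p, and a character class replaces the four separate searches. The function name is made up.

```python
import re

MARK = '<<PAR>>'  # placeholder string that should not occur in the text

def remove_extra_breaks(text):
    """Join paragraphs that were split in the middle of a sentence."""
    # Step 0: drop spaces sitting between the last character and the break.
    text = re.sub(r' +\n', '\n', text)
    # Step 1: mark breaks that follow sentence-ending punctuation as 'real'.
    text = re.sub(r'([.?!"])\n', r'\1' + MARK, text)
    # Step 2: every break still left is mid-sentence, so turn it into a space.
    text = text.replace('\n', ' ')
    # Step 3: restore the real paragraph breaks.
    return text.replace(MARK, '\n')

raw = 'This is an example\nof a broken paragraph. \nNext one.\n'
cleaned = remove_extra_breaks(raw)
```

As noted in the post, headings and other paragraphs that legitimately end without punctuation would be merged too, so the result still needs a manual check.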

burbleburble · 07-21-2011, 10:35 AM

Thanks. That's a neat approach.

After giving it some thought, I decided it might be more robust to first replace the 'p' tags with, say, a null byte or '\n', and record the original tags and their positions in a list of tuples. Then: search, mark the results with some tags, put the p tags back, and proceed from there.
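A rough sketch of that idea, assuming '\n' as the placeholder and a list of `(position, tag)` tuples. The function names are hypothetical and this is not the code ECleaner actually uses. Note that the restore step only works while the recorded positions are still valid, so any marks added during the search would have to be inserted from the end backwards or the offsets adjusted.

```python
import re

P_TAG = re.compile(r'</?p\b[^>]*>')

def strip_p_tags(html):
    """Swap every <p ...> and </p> for '\n'.

    Returns (plain, tags), where tags is a list of
    (position_in_plain, original_tag) tuples.
    """
    tags, out = [], []
    pos = 0   # length of the output built so far
    last = 0  # end of the previous tag in the input
    for m in P_TAG.finditer(html):
        segment = html[last:m.start()]
        out.append(segment)
        pos += len(segment)
        tags.append((pos, m.group(0)))
        out.append('\n')
        pos += 1
        last = m.end()
    out.append(html[last:])
    return ''.join(out), tags

def restore_p_tags(plain, tags):
    """Put each recorded tag back in place of its one-character placeholder."""
    out, last = [], 0
    for pos, tag in tags:
        out.append(plain[last:pos])
        out.append(tag)
        last = pos + 1  # skip the placeholder itself
    out.append(plain[last:])
    return ''.join(out)

html = '<p>This is an example</p><p>of a broken paragraph</p>'
plain, tags = strip_p_tags(html)
# A paragraph boundary is now '\n\n'; flag it when no punctuation precedes it.
broken = re.search(r'[^.?!"\n]\n\n', plain)
```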

burbleburble · 08-26-2011, 02:44 AM

It's been a while, but I'm back.

Um, I've written a fully functional program, and have used it myself for more than 100 books. Problem is, I gave up on working out the kinks when interfacing with Calibre's python 2.7 (mainly unicode vs ascii issues), since I wrote it originally in 3.2. So, it's currently a standalone program.

But I can't figure out where on the forum to post one's own standalone program. Any suggestions?
Also, if people are interested in doing so, I don't see why it can't be ported to Py 2.7 and returned to plugin status. I just gave up doing it myself.

Ortep · 09-02-2011, 11:58 AM

Maybe just another stupid idea, but can you use the 'open with' plugin to open a file with your program? Not completely integrated, but you can launch from within Calibre.

I'd like to test it

burbleburble · 09-14-2011, 07:15 AM

Sorry about the delay.

Ortep: Sounds okay. What with school and SATs, I don't have time just now to learn how to integrate it with the 'Open With' plugin, plus it still wouldn't open ebooks directly from calibre without some work. (I really would like to port it to python 2.7 to reintegrate it with calibre, but testing all the unicode conversions will take a lot of time; it primarily manipulates unicode text, and py2.7 and py3+ differ greatly in this area.)

Meanwhile [I hope this is okay to post here, as I do look forward to re-integrating it, as it was before, and you asked to test its current state]: Because it's an independent package running off Python 3 + PyQt4 + lxml, it's a 13 MB RAR package. I have uploaded it to megaupload:

Note: This program is currently optimized/designed for a large screen! It will appear cluttered and probably be awkward to use on a small screen!

ECleaner v1.0.6 Program

Spoiler: http://www.megaupload.com/?d=V10H1Q8E

ECleaner v1.0.6 Instructions

Spoiler:

Before you start:

I am aware that it will be confusing at first, and the help is rather limited. I welcome questions and suggestions for making it more intuitive; however, due to a busy schedule, I may not respond right away (give me a week+).
There is a known 'memory leak'. This sounds scary but it's not. It just means that every few ebooks you will have to restart the program; otherwise it will slow down.
Use at your own risk. I have never experienced any issues though...

Installation:

Unpack the RAR
Inside the unpacked folder is a shortcut 'ECleaner'. (It has no logo; it is just a white box. If for some reason it can't find the program, it is the file 'main.exe' in the subfolder ECleaner).

First steps ['Raw EBooks']:

I assume you are familiar with basic html, namely the tags p, span, i, b, a and the attributes class, id.
Create an HTMLZ file with calibre. You MUST first set these conversion options: In 'HTMLZ Output'-- set to 'inline' for both selections. In calibre 'Look & Feel'-- check 'Smarten punctuation'.
Copy the 'raw' (uncleaned) htmlz file into the ECleaner-subfolder named 'RawFiles'. (This is for your ease, not required. ECleaner looks in this folder first. It will save you time from browsing to the file.)
Run ECleaner.
Since this is a 'raw' ebook, press the button named 'Open and Clean Htmlz'. This will tidy it, compressing nested divs, p's, spans, and creating classes based on patterns in the ebook. This allows one to easily and quickly restructure the ebook.
- For example, all chapter headers will generally fall under the same class, let us say 5. One can then rename class="5" to class="Chapter", either to a 'single' element or 'forward' to all following elements with class="5".
Update the basic metadata. A new cover can be dragged & dropped over the old one.

Cleaning 'em up:

Go to tab #2 'Content', and explore your options! It usually takes me no more than 15 minutes per ebook, but of course I already know my way around.
There are options for ()Checking for punctuation issues, ()Renaming classes, ()Creating id's, ()Auto generating css formatting, ()Auto generating a titlepage and toc, ()Changing to titlecase, ()Search and replace, and so on.
On the left is a 'navigator' which can help navigate: ()By punctuation, ()By class, ()By image. It also provides many useful pieces of information.
- For example, one might want to check that all chapter titles are followed by a paragraph with class='FirstScene'. Well, this info is listed.
- For example, one might want to check that all chapter headers were found. All one needs to do is check the numbering: if it lists 1-35, and the last chapter header is 'Chapter 36', you know you're missing one!
The center pane is the main editor. On the right is a previewer. For purposes of speed, it usually displays only a 'range' of the ebook, relative to the cursor. Sometimes it updates itself, sometimes you have to click it.

Tips, Tricks, and Warnings:

Of course you'll figure them out with time, but some basics:
This program is optimized for the big screen! But, all tools are resizeable and hideable!
The search and replace via regex has no undo option! Save your htmlz first!
The auto formatting 'Use class templates' button:
- It is based on my personal taste in formatting!
- All supported classes are in the 'choose class' drop down list.
- Some classes span only a single paragraph: For example, Chapter, Part, Title.
- Some classes may span multiple paragraphs: For example, a 'Letter', a 'Verse', a 'Quotation'. These should be used as follows, for example: Define the first paragraph as class="Letter". Leave the following paragraphs blank. Define the paragraph FOLLOWING the last paragraph in the letter as class="Regular". This way, the program knows where the letter ends. It also works to define the following paragraph as any other class, such as 'Letter' (a new letter), or Scene... etc., just don't leave it class=""!
The create 'Titlepage and Toc' button:
- The auto generated titlepage is based on my personal taste in formatting! You can always adjust it after creating it.
- To create a TOC (Table of Contents), you must first add the 'id' attribute to the relevant points in the ebook.
- You must format the id's as follows (there are buttons for helping to add them...): Either - 'Epilogue', 'Prologue', 'Quotation#' (for an Epigraph), 'Part#', 'Chapter#'. Obviously, no 'id' may be repeated twice.
- In the special case that an ebook contains sections in both the form 'Part' and 'Chapter', a special multi-level toc is created. For this, you must create id's for chapters in the form 'Part#Chapter#'
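To illustrate the id format described above, here is a small checker sketch. It assumes that '#' stands for a plain number (e.g. 'Chapter12', 'Part2Chapter3'); that reading, the pattern, and the function name are assumptions for illustration, not ECleaner's own validation.

```python
import re

# Assumption: '#' in the instructions means one or more digits.
ID_PATTERN = re.compile(
    r'^(?:Epilogue|Prologue|Quotation\d+|Part\d+|Chapter\d+|Part\d+Chapter\d+)$'
)

def check_toc_ids(ids):
    """Return (malformed, duplicated) lists for a sequence of id strings."""
    malformed = [i for i in ids if not ID_PATTERN.match(i)]
    seen = set()
    duplicated = []
    for i in ids:
        if i in seen and i not in duplicated:
            duplicated.append(i)
        seen.add(i)
    return malformed, duplicated

ids = ['Prologue', 'Part1', 'Part1Chapter1', 'Part1Chapter2', 'Chapter 3', 'Part1']
malformed, duplicated = check_toc_ids(ids)
```

Here 'Chapter 3' would be reported as malformed (it contains a space) and 'Part1' as repeated.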

Saving and Reopening Ebooks:

You may save in HTMLZ, EPUB, and MOBI formats.
You may reopen any htmlz ebook that was already run through the tidy button, by simply clicking 'Open Htmlz'.
You cannot reopen epub or mobi. So save an htmlz backup.
NOTE: Epubs are saved with the following settings: Justification is not forced; this is left to the ebook reader's/user's discretion. Indents are automatically added where not otherwise specified to be 0. The html file is not split into multiple parts. (This may cause some ebook readers to open the ebook more slowly...)
NOTE: All saves to MOBI or EPUB are first run through Epubcheck. It always pays to check the details box to make sure nothing went wrong when tidying, saving, or opening a file.

Final notes:

OK. I'm more than aware these instructions probably won't suffice. I really am not good at this sort of thing. I welcome anyone who wishes to better document this (he/she will get the credits of course). Either way, I welcome questions... see the first paragraph. Some of the buttons have popup tooltips too, I hope to add more when I find the time.
So, play around, mess with it. It does work!!! (See the books I've cleaned up on my thread of cleaned up ebooks.)

Ortep · 09-24-2011, 10:40 AM

Hi, I was busy for a while and had no time to check everything. Last night I started playing with the cleaner. Your program looks great. And I am able to start it from within Calibre using the plugin 'Open With'. The only thing I could not do was to open the HTMLZ automatically. I'll keep on playing.
