Gui Plugin for Cleaning Ebooks, Fast - Page 4

kovidgoyal · 06-28-2011, 12:21 PM

open will not create directories for you; you have to use os.makedirs first.
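A minimal sketch of that advice, assuming the plugin writes into a nested output folder. The helper name open_for_writing and the 'cleaned/OEBPS/index.html' sub-path are made up for illustration; they are not taken from the plugin.

```python
import os
import tempfile

def open_for_writing(path):
    """Create the parent directory if it is missing, then open path for writing."""
    parent = os.path.dirname(path)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)  # also creates any missing intermediate directories
    return open(path, 'w')

# Demo inside a throwaway temp directory (hypothetical sub-path).
base = tempfile.mkdtemp()
target = os.path.join(base, 'cleaned', 'OEBPS', 'index.html')
with open_for_writing(target) as f:
    f.write('<html/>')
```

On Python 3.2 and later, os.makedirs(parent, exist_ok=True) does the check and the creation in one call.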

burbleburble · 06-29-2011, 06:28 AM

@Kovid
Thanks. Solved the problem by checking/creating missing directory.

Another question: For some reason, a lot of things that work with my edition of python/pyqt/lxml don't work in calibre (v0.8.6). I keep on coming across the following when running the plugin in calibre:

Code:

Traceback (most recent call last):
  File "calibre_plugins.ebook_cleaner.main", line 1479, in slotCleanAndOpenEpub
  File "calibre_plugins.ebook_cleaner.main", line 513, in clean
  File "lxml.etree.pyx", line 2762, in lxml.etree.fromstringlist (src/lxml/lxml.etree.c:52933)
  File "parser.pxi", line 1134, in lxml.etree._FeedParser.feed (src/lxml/lxml.etree.c:76722)
  File "parser.pxi", line 556, in lxml.etree._ParserContext._handleParseResult (src/lxml/lxml.etree.c:71680)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 585, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71955)
XMLSyntaxError: Char 0x0 out of allowed range, line 2, column 1

where the stringlist being input into etree.fromstringlist() is a perfectly normal list of strings (the first three being '<html>', '<head>', '<meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/>' ; these first few strings are written in the plugin, not read from somewhere else; I'm guessing 'line 2' refers to the third one?)

How can I solve this problem?

kovidgoyal · 06-29-2011, 10:20 AM

You've got null bytes in your strings. stringvar.replace('\0', '')
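A rough sketch of that fix applied to a whole list of fragments before parsing. The helper strip_nulls and the sample fragments are hypothetical, and the standard library's xml.etree.ElementTree stands in for lxml here so the snippet runs without extra packages; with lxml the cleaned list would go to etree.fromstringlist in the same way.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

def strip_nulls(parts):
    """Return a copy of the fragment list with every NUL character removed."""
    return [p.replace('\0', '') for p in parts]

# Hypothetical fragments; the second one carries a stray null byte.
parts = ['<html>', '<head>\0', '<title>t</title>', '</head>', '</html>']
clean = strip_nulls(parts)

# Parse from the cleaned list of strings.
root = ET.fromstringlist(clean)
print(root.tag)
```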

burbleburble · 06-29-2011, 11:54 AM

@Kovid
That solved one issue. But then it found another. So I just did ''.join(list) first, then parsed from a string instead of a list. For some strange reason it no longer has a problem, even without replacing null bytes.

But it is rather time-consuming to perform this operation first. Oh well. Still, is calibre's version of lxml not up to date? Because mine works fine parsing from a list!

Another question: I'm having trouble saving a page from webkit. I tried both mainFrame().toHtml() and documentElement.toOuterXml() and either way it won't save valid xhtml. It always leaves out the '/' on single-tag elements (like 'img', 'br', 'meta'). (Is it even valid in an epub?) This generates serious problems when trying to parse it again with lxml. So, do you know of a way around this issue?

Thanks for all the help

kovidgoyal · 06-29-2011, 12:01 PM

calibre-debug -c "from lxml import etree; print etree.LXML_VERSION"

I've never tried saving from webkit so I don't have any advice for you on that.

kovidgoyal · 06-29-2011, 12:04 PM

http://bugreports.qt.nokia.com/browse/QTBUG-2787

burbleburble · 06-30-2011, 11:01 AM

@Kovid: Thanks. I guess I'll just have to parse webkit's output with lxml.html, and resave as xhtml.

@Anyone: I've been working on a better structure view, and a better way of editing it. I've come up with a method, and I think it's kind of simple; however, the javascript for implementing all its facets is rather frustrating to write. So, I have written rudimentary code for it, and attached a test version below.

I really need some suggestions about the following issues in this test version:

Currently the method for editing the class structure is the following:
- Every class entered in the replacements text edit is defined in relation to the root (i.e. the book). For example: 'title of chapter' or 'line of verse of chapter'; the 'of root' need not be specified. In this test version you must write ' of ' between any two classes in order to create a hierarchy.
- Every class can be defined as 'new'. For example 'title of newchapter'. This is because you may replace a 'class13' with 'scene of chapter', and you really don't want to start a new chapter there. So the plugin takes that into account, and merges it to an existing chapter, or later when you create a ' ... newchapter' previous to it.
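The two rules above could be parsed along these lines. This is only a sketch of the syntax as described, not the plugin's actual code: the function name parse_class_path is made up, and treating any name that starts with 'new' as a "new" marker is an assumption (it would misread a class literally called 'newspaper').

```python
def parse_class_path(spec):
    """Split 'line of verse of newchapter' into (name, is_new) pairs, outermost first."""
    levels = []
    # The rightmost class is the one closest to the root, so walk the parts backwards.
    for part in reversed(spec.split(' of ')):
        name = part.strip()
        # Assumption: a leading 'new' marks a class that starts a fresh container.
        is_new = name.startswith('new') and len(name) > 3
        if is_new:
            name = name[3:]
        levels.append((name, is_new))
    return levels

print(parse_class_path('title of newchapter'))
print(parse_class_path('line of verse of chapter'))
```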
So please, anyone: do you have a simpler/clearer approach or a clearer explanation (I'm awful at explaining), and especially a suggestion for how to present a GUI for creating/utilizing such syntaxes? The test version below can be used to see how it works right now. Below is also a picture of it being put to use in this way....
Included in this test version is an epub writer. I just don't have an approach for where to save it, rename it... currently it just saves to the temp folder, and opens it for you to copy it out of...

So, please, anyone: Should it overwrite the original? Should it add to the calibre library with an extension, 'Harry Potter 1' + 'CLEANED'... should you save it somewhere on the computer? Ideas please!!!

(I enjoy the programming, I can communicate with the computer; but how to clearly and intuitively communicate with the user is quite often beyond me.)

kovidgoyal · 06-30-2011, 11:29 AM

As per the bug report, if you use setContent with an xhtml mimetype, it should generate valid xhtml on output. IIRC, the calibre viewer uses setContent, not setHtml, when viewing EPUB.

burbleburble · 06-30-2011, 11:33 AM

@Kovid

Once again, thank you. I didn't understand that's what it meant when I read it.

burbleburble · 07-04-2011, 12:46 PM

Updated to version 0.0.5

Due to a lack of user feedback (interest?) when I posted requests for suggestions in certain areas, I have reverted to designing this plugin as per my own needs; I have no interest in brainstorming how to develop a fully featured plugin, as per everyone's needs, if the user community won't participate.
However, I am more than happy to incorporate ideas and modify this plugin, if given some concrete, well-defined proposal. I don't mind helping out and making it intuitive, accessible, and useful for others - YOU JUST HAVE TO EXPRESS WHAT YOU WOULD FIND INTUITIVE, ACCESSIBLE, AND USEFUL!

The new plugin, therefore, utilizes simple html with syntax highlighting; though it comes with a host of tools to help automate as much of the cleaning process as possible.

Under the Covers · 07-04-2011, 05:21 PM

I'm very interested but am not even close to being a programmer, so have been following this thread to see if your end-product might help me.

I'd like to see what would seem to me to be an almost magical ability to take the calibre conversions I've made from pdf to epub and easily eliminate the page headers and footers that end up mixed in with the text. I managed to do it once, but it took me so long to study the various "expressions" and figure out how to input the various search parameters (which I've since forgotten) that I haven't done it again.

This may be well outside what you are trying to accomplish, but as one who has trouble remembering all about "expressions" for the search/replace function (which is really cool), I have been following this thread hoping your project might include some sort of easy interface for input of the appropriate expressions. I'd guess many other non-technical people would like additional interface assistance with expressions.

In any case, I'd encourage you to keep on truckin' -- even if this is NOT where you were headed with this -- because many of us who have NO programming expertise are looking for various interfaces that accomplish what is apparently so easy for programmers but so befuddling for the rest of us.

burbleburble · 07-05-2011, 02:43 AM

@Under the Covers

Your feature sounds like a sensible addition. To save me the need to brainstorm the various ways such footers and headers might appear and be identified, please can you list several examples of the following:

The headers as they appear, with some context (surrounding text/lines). Please list several so I can see how page# or odd and even pages might change the header. It of course would also help to provide such examples from different books, as they may change appearance from book to book.
The same for footers.

To address your need of it being intuitive for non-programmers (and even for programmers, to avoid the need to write complex expressions), I think I will make it attempt to automatically match general cases; then provide a list of matches where the user can choose which to replace/remove. Sound good?
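One way such automatic matching might work, as a sketch only: mask the digits in every line so that running headers differing only by page number collapse to the same text, then offer whatever repeats often enough as a candidate for removal. The function name candidate_headers, the min_repeats threshold, and the sample lines are all invented for illustration.

```python
import re
from collections import Counter

def candidate_headers(lines, min_repeats=3):
    """Return (masked_text, count) for lines that recur once digits are masked."""
    def normalise(line):
        # '12 A Tale of Two Cities' and '14 A Tale of Two Cities' become identical.
        return re.sub(r'\d+', '#', line.strip())
    counts = Counter(normalise(l) for l in lines if l.strip())
    return [(text, n) for text, n in counts.most_common() if n >= min_repeats]

# Made-up extract of a pdf-to-epub conversion with a running header mixed in.
sample = [
    "12 A Tale of Two Cities",
    "It was the best of times,",
    "14 A Tale of Two Cities",
    "it was the worst of times,",
    "16 A Tale of Two Cities",
]
print(candidate_headers(sample))
```

The returned list is what the user would pick from; nothing is removed until they confirm a candidate.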

I can't know for certain when I will implement it, but I will try within the next several weeks.

@Kovid, anyone.

How do you do regex searches in webkit? (javascript? is there a way?)
How do you do general search and replace in webkit, especially considering the fact that the text may be spread across several elements (ex. italic, bold and p)?
For some reason, I can't get images to show in webkit. I am using, for example, webkit.setContent(data, baseUrl=QtCore.QUrl('D:\\TestA')) where the baseUrl is the folder containing the original html (of course, it was converted to data, but the image src attrib remains the same). I also tried using baseUrl=QtCore.QUrl('D:\\TestA\\index.html'), where I used the name of the html as part of the baseUrl. What am I doing wrong?

kovidgoyal · 07-05-2011, 11:31 AM

Google "javascript regex", and use QUrl.fromLocalFile.
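To illustrate the second half of that answer: a bare Windows path such as 'D:\\TestA' is not a file: URL, which is the form a base URL needs in order to resolve relative image paths. The sketch below uses only the standard library as a stand-in (so it runs without PyQt) to show what the file: URLs for the paths in the question look like and how a relative src resolves against them. The image path 'images/cover.jpg' is hypothetical.

```python
from pathlib import PureWindowsPath
from urllib.parse import urljoin

# File URLs for the two paths tried in the question.
folder_url = PureWindowsPath('D:\\TestA').as_uri()
page_url = PureWindowsPath('D:\\TestA\\index.html').as_uri()
print(folder_url)
print(page_url)

# A relative image src resolves against the page's URL.
print(urljoin(page_url, 'images/cover.jpg'))

# In the plugin itself the equivalent would presumably be:
#   webkit.setContent(data, baseUrl=QtCore.QUrl.fromLocalFile('D:\\TestA\\index.html'))
```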

burbleburble · 07-05-2011, 12:48 PM

Updated to 0.0.6:
-reverted back to webkit
-major improvements in interface and coding
-stable, if lacking in features

@Kovid
Thanks. QUrl.fromLocalFile worked great.

jackie_w · 07-05-2011, 07:47 PM

Hi burble,

I tried to use your utility but I have to admit to being unclear how to achieve the clean-up. I was able to create the initial htmlz file and to load it in the plugin and produce the patterns but I couldn't figure out how to proceed.

I have attached 2 htmlz files. The input one is a tiny extract (to avoid copyright problems) from a mobi-to-htmlz conversion. The output one is what I ideally would have liked as the cleaned-up simplified version. I would then be able to add my own standard external css file to match the tags (<h2>, <h3>, <p>, <i>) and classes ("ctr", "noind", "txt") in the cleaned-up index.html file.

Please could you tell me whether this is achievable with the current plugin, or even whether I could get somewhere close if I knew what I was doing.


