04-28-2010, 03:26 PM | #1 |
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
|
LRF to ePUB -- Remove Repeating Text
Hello.

A friend has shared with me a small library of ebooks in the LRF format. I am not absolutely clear on their history, although I believe she originally purchased them in a DRMed format and then converted them to non-DRMed LRF format for viewing on a Sony reader. At any rate, I would like to read these books on my iPhone, using Stanza. Thus, I have converted a couple of these books to ePUB format, using default settings in Calibre. (I am running Calibre 0.6.49 on 64-bit Windows 7, which works great, BTW.)

The books I have converted so far contain a couple of annoying elements, which I would like to eliminate. I suspect that doing so is possible using Calibre's rather extensive editing/formatting capabilities, but I am not experienced in using them. I am hopeful that a technically well-versed and well-meaning soul can find it within him/herself to provide me some guidance.

Specifically, each of the ebooks I have converted contains repeating text, preceded by a page break. When I look at the resulting ePUB files in Sigil, the books are broken into numerous XHTML blocks, so I am guessing that during the conversion from LRF to ePUB, Calibre is interpreting the repeating text as following a "Chapter" break, or something similar. (In Stanza or the Calibre ebook reader, the text simply shows as repeating throughout the book.) The repeating text looks like this:

Book Title (the book title as appearing in the ebook metadata)
Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html

I suspect that the first line of repeating text (the book title) is being created automatically because, during conversion, the ebook is being interpreted as having a chapter break just before this information is repeated. I imagine that this chapter data (which appears every few pages or so) is being misinterpreted, but as it is a systematic problem, the situation can be remedied by appropriate coding of Calibre's ebook processing engine.

The second line of repeating text seems clearly to have been added to the ebook prior to Calibre's conversion from LRF to ePUB format. It may very well appear due to the prior use of ABC Amber's ebook conversion software in the creation of the LRF-formatted ebooks themselves. I suspect that ABC Amber inserts this "advertisement" text when it interprets that a new chapter is occurring in the ebook. Which, of course, raises the question as to how this chapter information made its way into the ebook in the first place. At any rate, I imagine that to remove this line of repeating text, I need to invoke something like a "Search & Replace" function which, as opposed to referencing metadata as in the book title line of text, will require that this extraneous text be referenced exactly as it appears, so that the Calibre conversion engine can remove it throughout each of my ebooks during conversion.

So, if I understand the challenge correctly, I need to invoke Calibre's "intelligence" in two ways: first, with regard to removing "wild card" text referenced to metadata, and second, with regard to removal of specific text in the manner of a traditional "search & replace." On a related note, I need some way of better interpreting, and correctly processing, chapter breaks (if that is the nature of the page breaks which precede each instance of the repeating lines of text), although I am not clear as to the theory of how that task would be accomplished.

If anyone is able to provide me with guidance here (ideally, in the form of "This is a step-by-step procedure for you to follow, dummy"), that would be much appreciated. I look forward to the courtesy of a reply.

Many thanks.
Mark Shneour
mark@dotmom.com
Last edited by mshneour; 04-28-2010 at 03:31 PM. |
04-28-2010, 06:26 PM | #2 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
The relevant option in calibre is Remove headers and footers under Structure Detection in the conversion settings.
Under header regular expression put something like
Generated by.*abclit.html
and click the wizard button next to it to see how it would affect the source file. |
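For readers new to regular expressions, here is a minimal sketch of what that pattern does, run in plain Python rather than inside calibre. The sample string is a hypothetical stand-in for a few lines of a converted book; only the pattern itself comes from the suggestion above.

```python
import re

# Hypothetical sample standing in for text from the converted book.
sample = (
    "Some Book Title\n"
    "Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html\n"
    "The first paragraph of real text."
)

# The suggested pattern: "Generated by", then any run of characters on the
# same line, then "abclit.html" (the unescaped "." matches any character).
pattern = re.compile(r"Generated by.*abclit.html")

# Replace every match with nothing, leaving the rest of the text intact.
cleaned = pattern.sub("", sample)
print(cleaned)
```

The advert line is emptied while the title and body text are untouched, which is why the title needs a separate expression.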
|
04-29-2010, 01:39 AM | #3 | |
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
|
Thank you for your reply, Kovid.
Forgive my ignorance, but I do not know how to implement your suggestion here: Quote:
Many thanks. Mark |
|
04-29-2010, 08:36 AM | #4 | |
Wizard
Posts: 4,812
Karma: 26912940
Join Date: Apr 2010
Device: sony PRS-T1 and T3, Kobo Mini and Aura HD, Tablet
|
select book
click convert e-books
click structure detection on side panel
check box beside remove headers
click on magic wand by remove headers (top one)
pick Lrf if it asks choose format

Quote:
<b>Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html">erter,............................... ........

header regular expression means line with green text beside Regex:
Click test to see what is selected
works for me
second one shouldn't have space among .'s that appears in message
Last edited by speakingtohe; 04-29-2010 at 08:40 AM. |
|
04-29-2010, 08:20 PM | #5 |
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
|
Thank you, "speakingtohe."
I will give it another try. I didn't understand the screen which appeared after I initiated the wizard. Do you know what language and syntax are being used in the program for setting parameters? How might I gain some education in working with it? Mark Last edited by mshneour; 04-29-2010 at 08:23 PM. |
|
05-03-2010, 01:19 PM | #6 |
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
|
OK. I think I now understand how the Regex Builder works. I educated myself a bit regarding the writing of Regex ("Regular Expressions"). I'm still having some trouble, though...
As I mentioned in my original post, I have repeating text which appears in my ePUB documents after conversion by Calibre from LRF format. The repeating text is preceded by a page break. Here is the repeating text, again, for reference:

Book Title
Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html

In attempting to follow the instructions above to remove this repeating text, I noticed the following coded sequence (XML?) appearing throughout the document when viewing it in Regex Preview in Calibre:

---
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Book Title</title>
<link rel="stylesheet" type="text/css" href="styles.css"/>
<meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/></head>
<body class="body">
<div class="bs0 ts0" id="1935">
<span><span class="ts1">Generated by ABC Amber LIT Conv<span class="ts1">erter, http://www.processtext.com/abclit.html</span></span></span><p><span>

[Note: The four-digit number in the expression id="xxxx" changes with each occurrence. I don't know to what the expression refers.]

The above coded sequence appears to be the source of the repeating text. So, I first tried selecting and pasting the entire sequence into the green Regex Builder line. (To address the changing four-digit number in the id="xxxx" expression, I tried replacing the xxxx with .... wildcard characters.) When I hit "Test," though, nothing was highlighted in yellow.

Ultimately, I wrote this Regex expression...

Generated by ABC Amber LIT Conv<span class="ts1">erter, http://www.processtext.com/abclit.html

...and it became highlighted in yellow in the document text. When I then ran the LRF-ePUB conversion, the second line of my repeating text was deleted. The first line, however, and the preceding page break, remained. So, then I wrote the following Regex expression...

.*Book Ti.*

...and it became highlighted in yellow.
I then ran the conversion on the first converted file (now an ePUB file), and it removed the first line of text (the "Book Title" text). So, after running two sets of Regex expressions with two associated conversion passes, I was able to remove both lines of text, but not the page break.

Which leaves me with two questions:

(1) What syntax is required to combine the two Regex expressions I used to remove the two lines of repeating text, so that only one conversion pass is necessary? (I presume that Calibre is written in something like Python. It would be helpful to confirm this, so that I can determine precisely which Regex library applies to Calibre searches.)

(2) How do I remove the preceding page break associated with these two repeating lines of text (without interfering with the proper page break associated with a chapter break)?

Any further guidance here would be greatly appreciated. Thank you.
Last edited by mshneour; 05-03-2010 at 05:50 PM. |
05-03-2010, 07:10 PM | #7 |
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
|
OK. I'm getting warmer...
After doing some more digging, I came across a Regex editor/builder called "RegexBuddy." With the help of this editor, I wrote a Regex expression which -- at least within RegexBuddy -- tested correctly for the entire string set forth in my prior post. The expression was:

[-].*<[?]xml version=.*abclit.html</span></span></span><p/><p><span><br/>

Unfortunately, while this expression tested properly within RegexBuddy, it did not highlight the subject search string within Calibre. I suspect that the problem is due to a difference in the Regex library and syntax being used, respectively, by RegexBuddy and Calibre. Again, can someone enlighten me as to the Regex library/syntax applicable to Calibre? (A rewrite of the above expression would also be helpful.) Many thanks. |
05-03-2010, 07:18 PM | #8 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
05-03-2010, 07:35 PM | #9 |
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
|
Thank you, Kovid, for the clarification.
Would you mind taking a look at my last post? To help me get started, I would be appreciative if you could let me know what is wrong with the Regex expression I wrote. Many thanks again. Mark |
05-03-2010, 09:28 PM | #10 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
(?s)<\?xml.*?abclit.html</span></span></span><p><span>
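The two pieces of that expression that differ from the earlier attempts are `(?s)` and `.*?`. A minimal sketch in plain Python, assuming a hypothetical excerpt modelled on the markup quoted earlier in the thread, shows why both matter: without `(?s)` the dot does not cross line breaks, and the `?` after `.*` makes the match stop at the first occurrence of the closing text.

```python
import re

# Hypothetical excerpt modelled on the markup quoted earlier in the thread.
source = (
    "<?xml version='1.0' encoding='utf-8'?>\n"
    '<html xmlns="http://www.w3.org/1999/xhtml">\n'
    "<head><title>Book Title</title></head>\n"
    '<body class="body">\n'
    '<div class="bs0 ts0" id="1935">\n'
    '<span><span class="ts1">Generated by ABC Amber LIT Conv'
    '<span class="ts1">erter, http://www.processtext.com/abclit.html'
    "</span></span></span><p><span>Real text starts here."
)

tail = r"abclit.html</span></span></span><p><span>"

# Without (?s), "." stops at each newline, so the pattern cannot span lines.
without_dotall = re.search(r"<\?xml.*?" + tail, source)

# With (?s), "." also matches newlines; ".*?" is non-greedy, so the match
# ends at the first occurrence of the tail rather than the last one.
with_dotall = re.search(r"(?s)<\?xml.*?" + tail, source)

print(without_dotall)           # None
print(with_dotall is not None)  # True
```

Note that the question mark in `<?xml` is escaped as `\?` because a bare `?` is a quantifier in this syntax.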
|
05-03-2010, 10:31 PM | #11 |
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
|
Thank you, Kovid.
I did not come across the "(?s)" parameter in my brief education on Regex under Python. One further problem, however... The expression you wrote worked perfectly under the Regex "Test" procedure in Calibre. Oddly, though, the selected text was not deleted. And yes, I had selected "Remove Header" under Structure Detection. Do you know what is going on, and how I might fix it? Thanks again. Mark |
05-03-2010, 10:34 PM | #12 |
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Ah you can't remove the page breaks, they're present in the original LRF. You can remove the title and the abc text, using the regex you discovered yourself.
|
05-03-2010, 10:44 PM | #13 |
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
|
I see.
How, then, am I able to combine two expressions to select two different text strings in a single conversion pass? Is there an "AND" type Regex statement? Also, is there any other way to remove the page breaks preceding this repeating text, while keeping the page breaks associated with each proper chapter break? Thank you again. Mark |
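In Python's regex syntax, "match either of these" is usually written as alternation with `|` rather than an "AND", since each occurrence only needs to match one of the alternatives. The following is a minimal sketch in plain Python using the two expressions reported as working earlier in the thread; the sample text is hypothetical, and whether calibre's header field in this version behaves identically with the combined form is an assumption best checked with the Test button.

```python
import re

# The two expressions from earlier in the thread (dots in the URL escaped).
title_expr = r".*Book Ti.*"
advert_expr = (
    r'Generated by ABC Amber LIT Conv<span class="ts1">erter, '
    r"http://www\.processtext\.com/abclit\.html"
)

# Alternation: a match is found wherever either sub-expression matches.
combined = re.compile("(?:" + title_expr + ")|(?:" + advert_expr + ")")

# Hypothetical sample resembling the converted markup.
sample = (
    "<p>Book Title</p>\n"
    '<span class="ts1">Generated by ABC Amber LIT Conv<span class="ts1">erter, '
    "http://www.processtext.com/abclit.html</span></span>\n"
    "<p>Chapter text continues.</p>"
)

# Note: ".*Book Ti.*" removes the whole line containing the title fragment,
# tags included, so it would also hit a <title> element on its own line.
cleaned = combined.sub("", sample)
print(cleaned)
```

Wrapping each alternative in `(?:...)` keeps the `|` from binding to only part of either expression.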
05-03-2010, 10:56 PM | #14 |
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
|
One other thing...
Is there a way I can select the Book Title text string as metadata, as opposed to searching for it by the actual text? That way, I can write a Regex expression to use on all titles in the Library. Thanks again. Mark |
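Nothing in this thread shows a calibre option for referencing metadata inside the header regex. Outside calibre, one way to get a per-book pattern is to build it from the title string, escaping any regex metacharacters first. The sketch below is plain Python; the helper name `title_pattern` and the sample title are hypothetical, and it only matches lines that consist of the title alone.

```python
import re

def title_pattern(title: str) -> re.Pattern:
    """Build a pattern matching a line that consists only of the given title."""
    # re.escape neutralises characters such as "?" or "(" in the title so
    # they are matched literally instead of being read as regex syntax.
    return re.compile(r"^[ \t]*" + re.escape(title) + r"[ \t]*$", re.MULTILINE)

# Hypothetical book text: the title alone on a line, and again mid-sentence.
text = (
    "What Next? (Part 1)\n"
    "She asked, What Next? (Part 1) indeed.\n"
)

pat = title_pattern("What Next? (Part 1)")
cleaned = pat.sub("", text)
print(cleaned)
```

Anchoring with `^` and `$` under `re.MULTILINE` is a deliberate choice here: it avoids deleting the title where it legitimately appears inside a sentence.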
05-03-2010, 11:00 PM | #15 |
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
|
Come to think of it, where is the page break located in the text string I set forth in one of my earlier posts? Is it defined by HTML parameters? If so, I can't find them.
Mark |