Old 04-28-2010, 03:26 PM   #1
mshneour
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
LRF to ePUB -- Remove Repeating Text

Hello.

A friend has shared with me a small library of ebooks in the LRF format. I am not absolutely clear on their history, although I believe she originally purchased them in a DRMed format, and then converted them to non-DRMed LRF format for viewing on a Sony reader.

At any rate, I would like to read these books on my iPhone, using Stanza. Thus, I have converted a couple of these books to ePUB format, using default settings on Calibre. (I am running Calibre 0.6.49 on 64-bit Windows 7, which works great, BTW.)

The books I have thus far converted contain a couple of annoying elements, which I would like to eliminate. I suspect that doing so is possible using Calibre's rather extensive editing/formatting capabilities, but I am not experienced in using them. I am hopeful that a technically well-versed and well meaning soul can find it within him/herself to provide me some guidance.

Specifically, each of the ebooks I have converted contains repeating text, preceded by a page break. When I look at the resulting ePUB files in Sigil, the books are broken into numerous XHTML blocks, so I am guessing that during the conversion from LRF to ePUB, Calibre is interpreting the repeating text as following a "Chapter" break, or something similar. (In Stanza or the Calibre ebook reader, the text simply shows as repeating throughout the book.)

The repeating text looks like this:

Book Title (the book title as appearing in the ebook metadata)

Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html

I suspect that the first line of repeating text (the book title) is being created automatically as, during conversion, the ebook is being interpreted as having a chapter break appearing just before this information is repeated. I imagine that this chapter data (which appears every few pages or so) is being misinterpreted, but as it is a systematic problem, the situation can be remedied by appropriate coding of Calibre's ebook processing engine.

The second line of repeating text seems clearly to have been added to the ebook prior to Calibre's conversion from LRF to ePUB format. It may very well appear due to the prior use of ABC Amber's ebook conversion software in the creation of the LRF-formatted ebooks themselves. I suspect that ABC Amber inserts this "advertisement" text wherever it detects a new chapter in the ebook, which, of course, raises the question of how this chapter information made its way into the ebook in the first place. At any rate, I imagine that to remove this line of repeating text, I need to invoke something like a "Search & Replace" function. Unlike the book title line, which would be matched against metadata, this extraneous text would have to be referenced exactly as it appears, so that the Calibre conversion engine can remove it throughout each of my ebooks during conversion.

So, if I understand the challenge correctly, I need to invoke Calibre's "intelligence" in two ways: first, with regard to removing "wild card" text referenced to metadata, and second, with regard to removal of specific text in the manner of a traditional "search & replace." On a related note, I need some way of better interpreting, and correctly processing, chapter breaks (if that is the nature of the page breaks which precede each instance of the repeating lines of text) although I am not clear as to the theory of how that task would be accomplished.

If anyone can provide me with guidance here (ideally in the form of "This is a step-by-step procedure for you to follow, dummy"), that would be much appreciated.

I look forward to the courtesy of a reply. Many thanks.

Mark Shneour
mark@dotmom.com

Last edited by mshneour; 04-28-2010 at 03:31 PM.
Old 04-28-2010, 06:26 PM   #2
kovidgoyal
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The relevant option in calibre is Remove headers and footers under Structure Detection in the conversion settings.


Under header regular expression put something like

Generated by.*abclit.html

And click the wizard button next to it to see how it would affect the source file
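
For readers unfamiliar with regular expressions, here is a minimal sketch of what that pattern does, run in plain Python rather than inside calibre. The sample text is made up for illustration, and the dots in the URL are escaped here so that they only match literal dots; the unescaped form suggested above also matches.

```python
import re

# Hypothetical sample text; the real input is the HTML that calibre
# shows in the regex wizard.
sample = (
    "Book Title\n"
    "Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html\n"
    "Chapter text continues here.\n"
)

# "Generated by", then any run of characters on the same line, then "abclit.html".
header_pattern = re.compile(r"Generated by.*abclit\.html")

# Replace every match with nothing, leaving the rest of the text alone.
cleaned = header_pattern.sub("", sample)
print(cleaned)
```

The title line is left in place because this pattern only targets the ABC Amber line.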
Old 04-29-2010, 01:39 AM   #3
mshneour
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
Thank you for your reply, Kovid.

Forgive my ignorance, but I do not know how to implement your suggestion here:

Quote:
Originally Posted by kovidgoyal View Post
The relevant option in calibre is Remove headers and footers under Structure Detection in the conversion settings.


Under header regular expression put something like

Generated by.*abclit.html

And click the wizard button next to it to see how it would affect the source file
I don't know how or where to put your suggested expression in the regex string for testing purposes. Is there a reference source somewhere which could give me some understanding of the language (presumably XML) and syntax used here?

Many thanks.

Mark
Old 04-29-2010, 08:36 AM   #4
speakingtohe
Wizard
Posts: 4,812
Karma: 26912940
Join Date: Apr 2010
Device: sony PRS-T1 and T3, Kobo Mini and Aura HD, Tablet
select book
click convert e-books
click structure detection on side panel
check box beside remove headers
click on magic wand by remove headers (top one)
pick Lrf if it asks choose format
Quote:
Under header regular expression put something like

Generated by.*abclit.html
-or-

<b>Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html">erter,............................... ........

"Header regular expression" means the line with green text beside "Regex:".
Click Test to see what is selected.
This works for me.
The second one shouldn't have the space among the dots that appears in this message.

Last edited by speakingtohe; 04-29-2010 at 08:40 AM.
Old 04-29-2010, 08:20 PM   #5
mshneour
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
Thank you, "speakingtohe."

I will give it another try. I didn't understand the screen which appeared after I initiated the wizard.

Do you know what language and syntax are being used in the program for setting parameters? How might I gain some education in working with it?

Mark

Last edited by mshneour; 04-29-2010 at 08:23 PM.
Old 05-03-2010, 01:19 PM   #6
mshneour
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
OK. I think I now understand how the Regex Builder works. I educated myself a bit regarding the writing of Regex ("Regular Expressions"). I'm still having some trouble, though...

As I mentioned in my original post, I have repeating text which appears in my ePUB documents after conversion by Calibre from LRF format. The repeating text is preceded by a page break. Here is the repeating text, again, for reference:

Book Title

Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html


In attempting to follow the instructions above to remove this repeating text, I noticed the following coded sequence (XML?) appearing throughout the document when viewing it in Regex Preview in Calibre:

---
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Book Title</title>
<link rel="stylesheet" type="text/css" href="styles.css"/>

<meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/></head>
<body class="body">

<div class="bs0 ts0" id="1935">
<span><span class="ts1">Generated by ABC Amber LIT Conv<span class="ts1">erter, http://www.processtext.com/abclit.html</span></span></span><p><span>


[Note: The four-digit number in the expression id="xxxx" changes with each occurrence. I don't know to what the expression refers.]

The above coded sequence appears to be the source of the repeating text. So, I first tried selecting and pasting the entire sequence into the green Regex Builder line. (To address the changing four-digit number in the id="xxxx" expression, I tried replacing the xxxx with .... wildcard characters.) When I hit "Test," though, nothing was highlighted in yellow.
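
One likely reason the pasted sequence matched nothing is that characters such as "?" and "." have special meaning in a regular expression. The sketch below, in plain Python outside calibre, uses a shortened copy of the markup above as hypothetical test input. It shows the verbatim text failing to match and an escaped version, with \d+ standing in for the changing id number, succeeding.

```python
import re

# Shortened copy of the markup quoted above (hypothetical test input).
fragment = (
    "<?xml version='1.0' encoding='utf-8'?>\n"
    '<div class="bs0 ts0" id="1935">\n'
)

# Pasted verbatim: "<?" means "an optional <" and "'?" means "an optional
# quote", so the literal "?" characters in the markup are never matched.
pasted = "<?xml version='1.0' encoding='utf-8'?>"
pasted_match = re.search(pasted, fragment)
print(pasted_match)

# re.escape() makes the fixed text literal; \s* allows the line break and
# \d+ matches whatever digits appear in the id attribute.
escaped = (
    re.escape("<?xml version='1.0' encoding='utf-8'?>")
    + r'\s*<div class="bs0 ts0" id="\d+">'
)
escaped_match = re.search(escaped, fragment)
print(escaped_match is not None)
```

Differences in whitespace or blank lines between the pasted text and the actual file could also prevent a match, which is why \s* is used between the two lines here.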

Ultimately, I wrote this Regex expression...

Generated by ABC Amber LIT Conv<span class="ts1">erter, http://www.processtext.com/abclit.html

...and it became highlighted in yellow in the document text. When I then ran the LRF-ePUB conversion, the second line of my repeating text was deleted. The first line, however, and the preceding page break, remained.

So, then I wrote the following Regex expression...

.*Book Ti.*

...and it became highlighted in yellow. I then ran the conversion on the first converted file (now an ePUB file), and it removed the first line of text (the "Book Title" text).

So, after running two sets of Regex expressions with two associated conversion passes, I was able to remove both lines of text, but not the page break. Which leaves me with two questions:

(1) What syntax is required to combine the two Regex expressions I used to remove the two lines of repeating text so that only one conversion pass is necessary? (I presume that Calibre is written in something like Python -- it would be helpful to confirm this fact, so that I might confirm precisely which Regex library is applicable to Calibre searches.)

(2) How do I remove the preceding page break associated with these two repeating lines of text (and not interfere with the proper page break associated with a chapter break)?


Any further guidance here would be greatly appreciated. Thank you.

Last edited by mshneour; 05-03-2010 at 05:50 PM.
Old 05-03-2010, 07:10 PM   #7
mshneour
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
OK. I'm getting warmer...

After doing some more digging, I came across a Regex editor/builder called "RegexBuddy." With the help of this editor, I wrote a Regex expression which -- at least within RegexBuddy -- tested correctly for the entire string set forth in my prior post. The expression was:

[-].*<[?]xml version=.*abclit.html</span></span></span><p/><p><span><br/>

Unfortunately, while this expression tested properly within RegexBuddy, it did not highlight the subject search string within Calibre.

I suspect that the problem is due to a difference in the Regex library and syntax being used, respectively, by RegexBuddy and Calibre.

Again, can someone enlighten me as to the Regex library/syntax applicable to Calibre? (A rewrite of the above expression would also be helpful.)

Many thanks.
Old 05-03-2010, 07:18 PM   #8
kovidgoyal
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
http://docs.python.org/library/re.html
Old 05-03-2010, 07:35 PM   #9
mshneour
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
Thank you, Kovid, for the clarification.

Would you mind taking a look at my last post? To help me get started, I would be appreciative if you could let me know what is wrong with the Regex expression I wrote.

Many thanks again.

Mark
Old 05-03-2010, 09:28 PM   #10
kovidgoyal
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
(?s)<\?xml.*?abclit.html</span></span></span><p><span>
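
A rough illustration of the two features that expression relies on, run in plain Python on a made-up two-section sample: (?s) turns on "dot matches newline" mode (re.DOTALL), so .* can span several lines, and the ? after .* makes the match non-greedy, so it stops at the nearest abclit.html marker rather than the last one.

```python
import re

# Hypothetical sample: two sections, each starting with the XML header
# and the ABC Amber line, followed by real text.
block = (
    "<?xml version='1.0' encoding='utf-8'?>\n"
    "<html>\n<body>\n"
    "<span><span><span>Generated by ABC Amber, abclit.html</span></span></span><p><span>"
)
doc = block + "First passage." + block + "Second passage."

# The expression from the post above: DOTALL plus a non-greedy ".*?".
pattern = r"(?s)<\?xml.*?abclit.html</span></span></span><p><span>"
non_greedy_result = re.sub(pattern, "", doc)
print(non_greedy_result)

# Without (?s), "." stops at the first line break, so nothing matches.
no_dotall_match = re.search(pattern.replace("(?s)", ""), doc)
print(no_dotall_match)

# With a greedy ".*", one match runs from the first header to the last
# marker and swallows the real text in between.
greedy_result = re.sub(pattern.replace(".*?", ".*"), "", doc)
print(greedy_result)
```

The greedy variant is the one to watch out for, since it silently deletes book content along with the unwanted header.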
Old 05-03-2010, 10:31 PM   #11
mshneour
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
Thank you, Kovid.

I did not come across the "(?s)" parameter in my brief education on Regex under Python.

One further problem, however...

The expression you wrote worked perfectly under the Regex "Test" procedure in Calibre. Oddly, though, the selected text was not deleted. And yes, I had selected "Remove Header" under Structure Detection.

Do you know what is going on, and how I might fix it?

Thanks again.

Mark
Old 05-03-2010, 10:34 PM   #12
kovidgoyal
creator of calibre
Posts: 43,776
Karma: 22666666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
Ah you can't remove the page breaks, they're present in the original LRF. You can remove the title and the abc text, using the regex you discovered yourself.
Old 05-03-2010, 10:44 PM   #13
mshneour
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
I see.

How, then, am I able to combine two expressions to select two different text strings in a single conversion pass? Is there an "AND" type Regex statement?

Also, is there any other way to remove the page breaks preceding this repeating text, while keeping the page breaks associated with each proper chapter break?

Thank you again.

Mark
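
On combining the two expressions: Python's re syntax has no "AND" operator, but matching either of two strings is what the alternation operator | ("OR") does. The following is a minimal sketch in plain Python using the two expressions from the earlier post; the sample markup is a made-up excerpt, and whether calibre's header field accepts the combined pattern unchanged is an assumption that has not been tested here.

```python
import re

# Hypothetical excerpt resembling the markup from the earlier post.
sample = (
    "<title>Book Title</title>\n"
    '<span class="ts1">Generated by ABC Amber LIT Conv<span class="ts1">erter, '
    "http://www.processtext.com/abclit.html</span>\n"
    "<p>Real chapter text.</p>\n"
)

# The two expressions that each worked on their own.
title_line = r".*Book Ti.*"
abc_line = (
    r'Generated by ABC Amber LIT Conv<span class="ts1">erter, '
    r"http://www\.processtext\.com/abclit\.html"
)

# (?:...) groups each alternative; "|" matches whichever one is found.
combined = re.compile(f"(?:{title_line})|(?:{abc_line})")
cleaned = combined.sub("", sample)
print(cleaned)
```

Note that .*Book Ti.* matches the entire line containing the title, including the surrounding title tags in this sample, so a tighter pattern may be preferable if those tags need to survive.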
Old 05-03-2010, 10:56 PM   #14
mshneour
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
One other thing...

Is there a way I can select the Book Title text string as metadata, as opposed to searching for it by the actual text? That way, I can write a Regex expression to use on all titles in the Library.

Thanks again.

Mark
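
This thread does not say whether calibre can substitute metadata into the expression. One workaround, sketched below in plain Python, is to generate a pattern per book from its title. The helper name header_regex_for is hypothetical, and the titles are made up. re.escape is used so that punctuation in a title, such as "?" or parentheses, is treated as literal text.

```python
import re

def header_regex_for(title: str) -> str:
    """Build a pattern matching the literal title or the ABC Amber line.

    Hypothetical helper for illustration; not part of calibre.
    """
    return "(?:" + re.escape(title) + r")|(?:Generated by.*?abclit\.html)"

# Made-up titles; the second one contains regex special characters.
titles = ["Book Title", "Who Goes There? (Vol. 1)"]
for title in titles:
    pattern = header_regex_for(title)
    text = (
        f"{title}\n"
        "Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html\n"
        "Body."
    )
    print(repr(re.sub(pattern, "", text)))
```

Each generated pattern would still have to be pasted into the conversion settings for its book, so this saves typing and escaping mistakes rather than automating the whole library.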
Old 05-03-2010, 11:00 PM   #15
mshneour
Member
Posts: 10
Karma: 10
Join Date: Apr 2010
Device: iPhone 3G
Come to think of it, where is the page break located in the text string I set forth in one of my earlier posts? Is it defined by HTML parameters? If so, I can't find them.

Mark