MobileRead Forums - View Single Post - LRF to ePUB -- Remove Repeating Text

mshneour · 05-03-2010, 02:19 PM

OK. I think I now understand how the Regex Builder works. I educated myself a bit regarding the writing of Regex ("Regular Expressions"). I'm still having some trouble, though...

As I mentioned in my original post, I have repeating text which appears in my ePUB documents after conversion by Calibre from LRF format. The repeating text is preceded by a page break. Here is the repeating text, again, for reference:

Book Title

Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html

In attempting to follow the instructions above to remove this repeating text, I noticed the following coded sequence (XML?) appearing throughout the document when viewing it in Regex Preview in Calibre:

---
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Book Title</title>
<link rel="stylesheet" type="text/css" href="styles.css"/>

<meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/></head>
<body class="body">

<div class="bs0 ts0" id="1935">
<span><span class="ts1">Generated by ABC Amber LIT Conv<span class="ts1">erter, http://www.processtext.com/abclit.html</span></span></span><p><span>

[Note: The four-digit number in the expression id="xxxx" changes with each occurrence. I don't know to what the expression refers.]

The above coded sequence appears to be the source of the repeating text. So, I first tried selecting and pasting the entire sequence into the green Regex Builder line. (To address the changing four-digit number in the id="xxxx" expression, I tried replacing the xxxx with .... wildcard characters.) When I hit "Test," though, nothing was highlighted in yellow.
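One possible reason the pasted sequence matched nothing is that it contains characters such as "?" and "." which a regex treats as operators rather than literal text. The sketch below uses Python's re module as a stand-in; whether the Regex Builder behaves identically is an assumption, and the variable names are only illustrative.

```python
import re

# First line of the markup quoted above, used as the text to search.
text = "<?xml version='1.0' encoding='utf-8'?>"

# Pasted unchanged, "<?" means "an optional <" and "'?" means "an optional quote",
# so the pattern no longer describes the literal characters in the text.
pasted = "<?xml version='1.0' encoding='utf-8'?>"
print(re.search(pasted, text))  # None

# re.escape() backslash-escapes every metacharacter so the text is matched literally.
literal = re.escape(pasted)
print(re.search(literal, text) is not None)  # True

# The changing id can be matched with \d+ (one or more digits)
# instead of four "." wildcards, which would also match non-digits.
div_pattern = r'<div class="bs0 ts0" id="\d+">'
print(re.search(div_pattern, '<div class="bs0 ts0" id="1935">') is not None)  # True
```

Line breaks inside a pasted multi-line sequence may also need to be written explicitly (for example as \s+), depending on how the tool treats them.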

Ultimately, I wrote this Regex expression...

Generated by ABC Amber LIT Conv<span class="ts1">erter, http://www.processtext.com/abclit.html

...and it became highlighted in yellow in the document text. When I then ran the LRF-ePUB conversion, the second line of my repeating text was deleted. The first line, however, and the preceding page break, remained.

So, then I wrote the following Regex expression...

.*Book Ti.*

...and it became highlighted in yellow. I then ran the conversion on the first converted file (now an ePUB file), and it removed the first line of text (the "Book Title" text).

So, after running two sets of Regex expressions with two associated conversion passes, I was able to remove both lines of text, but not the page break. Which leaves me with two questions:

(1) What syntax is required to combine the two Regex expressions I used to remove the two lines of repeating text so that only one conversion pass is necessary? (I presume that Calibre is written in something like Python -- it would be helpful to confirm this fact, so that I might confirm precisely which Regex library is applicable to Calibre searches.)

(2) How do I remove the preceding page break associated with these two repeating lines of text (without interfering with the proper page break associated with a chapter break)?
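For question (1), a minimal sketch of one way to join two expressions is alternation with "|". Calibre is written largely in Python, so Python's re module is used here; that the Regex Builder applies the same flags is an assumption. The sample text is hypothetical and plain; in the real markup the text may be interrupted by tags, in which case the pattern has to include them.

```python
import re

# Hypothetical plain-text sample standing in for one converted page.
sample = (
    "Chapter text before.\n"
    "Book Title\n"
    "Generated by ABC Amber LIT Converter, http://www.processtext.com/abclit.html\n"
    "Chapter text after.\n"
)

# The two expressions from the post; dots in the URL are escaped to match literally.
first = r".*Book Ti.*"
second = r"Generated by ABC Amber LIT Converter, http://www\.processtext\.com/abclit\.html"

# "|" means "match either side"; (?:...) groups the alternatives without capturing.
combined = re.compile(r"(?:" + first + r"|" + second + r")")

# Replace every match with nothing, in a single pass.
cleaned = combined.sub("", sample)
print(cleaned)
```

Note that ".*Book Ti.*" removes the whole line containing "Book Ti", so in real markup it would also remove a line such as the title element in the header. The empty lines left behind are not touched by this sketch.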

Any further guidance here would be greatly appreciated. Thank you.

Last edited by mshneour; 05-03-2010 at 06:50 PM.