Old 01-07-2015, 01:49 PM   #1
Johann Cat
Member
Posts: 21
Karma: 10
Join Date: Nov 2014
Device: Kobo Aura HD; Kindle III; Kindle PWII; Boyue T62D; Onyx Boox i86
Removing page numbers within text?

I think there is a way to do this using "search and replace" within Calibre conversion, but I do not know the shorthand or code for indicating these characters.

I have a simple text-block book that has within it, as some gutenberg.org books do, page numbers within the text block (not coded footers, etc.).

So some lines are interrupted like this:
In chemistry, we find such assertions as that hydrogen being univalent while oxygen is bivalent, "makes it plain that we must expect to find no more than three compounds of those elements." It did not make the matter plain to those who held to the strict univalence of chlorine; and Dr. Williams says nothing about variable
*
*― 281 ―
*
*valencies, but rather implies their fixity. The history of opinion concerning Mendeléef's law is inexcusably inaccurate after the admirable history of the matter by Venable.

I.e., I want to remove the space between "variable" and "valencies" and the page number and remove all similar page numbers throughout the text.

I am not a code writer: can someone walk me through this or point me to some correct, precise (I need to know how to indicate the dash as a running character and those page numbers, especially) instructions?

Last edited by Johann Cat; 01-09-2015 at 02:54 AM.
Old 01-07-2015, 03:22 PM   #2
theducks
Well trained by Cats
Posts: 31,048
Karma: 60358908
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
You are probably going to need to learn some regex editing skills.
Among other things, * is a special character in regex (a quantifier, often loosely called a wildcard) and will need to be escaped.

Conversion search and replace is meant for simpler tasks.

BTW, there are a few regex tutorials and regex help threads here at MR for when you do get stuck. IMHO, work in code view for this one, 100%: there are at least 4 lines of code involved in your example.
Old 01-08-2015, 09:28 PM   #3
Tex2002ans
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
Quote:
Originally Posted by theducks View Post
Among other things, * is a wildcard and will need to be escaped.
I think he was just trying to emphasize the portion he was speaking of with asterisks, not that the actual source document had them!

Quote:
Originally Posted by Johann Cat View Post
I have a simple text-block book that has within it, as some gutenberg.org books do, page numbers within the text block (not coded footers, etc.).
Mind just linking to the specific Gutenberg example?

If I am understanding correctly, I am thinking it might just be using a Regex as simple as this:

Search: \s+― [0-9]+ ―\s+
Replace: (insert a single space here)

What this says in English is: "look for one or more whitespace characters (spaces or line breaks)" + "look for the long dash followed by a space" + "look for one or more digits" + "look for a space followed by the long dash" + "look for one or more whitespace characters". Replace all of that with "a single space".

What I would then do is just clean up the file in a Text Editor using the above Regex, and then feed that document through Calibre for conversion.
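For anyone who would rather script the cleanup than use an editor, here is a minimal Python sketch of the same search and replace. It assumes the long dash in the sample is U+2015 (HORIZONTAL BAR); if the source file uses a different dash, the escape needs to change. The names PAGE_MARKER and strip_page_markers are made up for this illustration.

```python
import re

# The dash is written as an escape so the pattern does not depend on how this
# script is saved. U+2015 is an assumption; adjust it if the book uses another dash.
PAGE_MARKER = re.compile(r"\s+\u2015 [0-9]+ \u2015\s+")


def strip_page_markers(text: str) -> str:
    """Replace each in-text page marker, plus the whitespace around it, with one space."""
    return PAGE_MARKER.sub(" ", text)


sample = (
    "says nothing about variable\n\n\u2015 281 \u2015\n\n"
    "valencies, but rather implies their fixity."
)
print(strip_page_markers(sample))
# prints: says nothing about variable valencies, but rather implies their fixity.
```

Because \s also matches line breaks, the blank lines around the page number are swallowed together with the number, which is what rejoins "variable" and "valencies".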

Last edited by Tex2002ans; 01-08-2015 at 09:34 PM.
Old 01-09-2015, 02:50 AM   #4
Johann Cat
Thanks for the suggestions. The asterisks that show in the left margin did not appear in the original text, but somehow appeared when I pasted the text into this editor, so I should have deleted them for accuracy's sake. I think the asterisks may indicate paragraph or line-break symbols, but, again, were not apparent in the original. I will try the regex suggestion; that code makes sense. Do you know if I can do this using openoffice's text editor? I have the "alternative" find-replace app installed. Or should I use calibre? If neither of those, what is the regex editor of choice?
Old 01-09-2015, 07:40 AM   #5
drjenkins
Addict
Posts: 250
Karma: 1702156
Join Date: Nov 2010
Device: Kindle Voyage
Quote:
Originally Posted by Johann Cat View Post
I think the asterisks may indicate paragraph or line-break symbols, but, again, were not apparent in the original.
That is why it is so important to use code view. Don't search for what you see in the File Preview pane.
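If code view is not handy, one rough way to see what is really in the text is to print the Unicode name of each character in a pasted snippet. This is only a sketch: describe_chars is a hypothetical helper, and the snippet below assumes the dash is U+2015, which is exactly the kind of thing the output lets you check.

```python
import unicodedata


def describe_chars(snippet: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for every character in the snippet."""
    return [
        # Control characters such as line feeds have no name, so a fallback is supplied.
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed control character>"))
        for ch in snippet
    ]


# Paste a few characters from around the page number in place of this string.
for code_point, name in describe_chars("\n\u2015 281"):
    print(code_point, name)
```

Whatever name comes back for the dash (HORIZONTAL BAR, EM DASH, HYPHEN-MINUS and so on) tells you which character the search pattern has to contain.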
Old 01-09-2015, 09:02 AM   #6
theducks
Be aware: there are various dialects of regex. The basic code above usually works in all of them, but there can be differences.

The nice thing is that OO, Calibre and Sigil each have an interactive editor where you can try out and hone your regex.

Avoid Replace All. Step through a few dozen perfect finds before even thinking of "lettin' 'er rip".

Remember: File: DISCARD for when things go deep South
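The "step through the finds first" advice can be imitated in a script by listing every match with some surrounding text before anything is changed. A minimal sketch, assuming Python's re dialect; preview_matches and its context and limit parameters are hypothetical names for this example.

```python
import re


def preview_matches(text: str, pattern: str, context: int = 25, limit: int = 24) -> list[str]:
    """Return up to `limit` snippets, each showing one match with its surrounding text."""
    snippets = []
    for match in re.finditer(pattern, text):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        # repr() makes line breaks and other invisible characters visible.
        snippets.append(repr(text[start:end]))
        if len(snippets) >= limit:
            break
    return snippets


sample = "says nothing about variable\n\n\u2015 281 \u2015\n\nvalencies, but rather implies"
for snippet in preview_matches(sample, r"\s+\u2015 [0-9]+ \u2015\s+"):
    print(snippet)
```

If every snippet shows a page number sitting between two halves of a sentence, the pattern is probably safe to apply; if anything else turns up, tighten the pattern first.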
Old 01-09-2015, 03:45 PM   #7
Tex2002ans
Quote:
Originally Posted by Johann Cat View Post
Do you know if I can do this using openoffice's text editor?
Hmmm, from what I was quickly able to find, OpenOffice Writer only allows you to search PER PARAGRAPH. It doesn't let you search across paragraphs.

I would also avoid one of those fancy GUI Word Processors if you can, because they add A TON of cruft on top of the text (fonts, font sizes, spacing, etc. etc.).

Quote:
Originally Posted by Johann Cat View Post
I have the "alternative" find-replace app installed.
Hmmm, what is this "alternative" you speak of? Is this an addon for OpenOffice? I am not familiar at all... perhaps it allows that functionality.

Quote:
Originally Posted by Johann Cat View Post
Or should I use calibre? If neither of those, what is the regex editor of choice?
I personally use Notepad++ for some basic editing of TXT/HTML files.

For EPUB, you can also use Sigil or Calibre's "Edit Book" feature.

It all depends on what the source format is of this document you are getting. If you got it from Gutenberg, I assume it is TXT or EPUB?

If you link right over to the Gutenberg copy you are working on, perhaps we could figure out an even more specific answer to remove these page numbers.

Quote:
Originally Posted by theducks View Post
Be aware: There are various dialects of REGEX. The Basic code above usually works in all, but be aware, there can be differences.

[...]
Yep yep, I definitely know that Microsoft Word + Open/LibreOffice have some differences with their version of Regex.

The Regex I listed above should work in: Notepad++, Sigil, and Calibre (and whatever other program uses that same Regex engine).

And good list of warnings, it always should be stressed that you should SAVE BACKUP COPIES BEFORE YOU DO HUGE REGEX CHANGES.
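The backup step is easy to build into a script so it cannot be forgotten. Here is a minimal sketch, assuming the book is a UTF-8 text file; the function name clean_file_with_backup and the ".bak" suffix are just choices made for this example.

```python
import re
import shutil
from pathlib import Path


def clean_file_with_backup(path: str, pattern: str, replacement: str = " ") -> int:
    """Copy the file to '<name>.bak', then rewrite it with the pattern replaced.

    Returns the number of replacements made.
    """
    source = Path(path)
    backup = source.with_name(source.name + ".bak")
    shutil.copy2(source, backup)  # the backup is written before anything is changed
    text = source.read_text(encoding="utf-8")  # UTF-8 is an assumption
    cleaned, count = re.subn(pattern, replacement, text)
    source.write_text(cleaned, encoding="utf-8")
    return count
```

Calling it as clean_file_with_backup("book.txt", r"\s+\u2015 [0-9]+ \u2015\s+") would leave the untouched original in book.txt.bak, where "book.txt" stands in for whatever the file is actually called.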