Removing page numbers within text?

Johann Cat · 01-07-2015, 02:49 PM

I think there is a way to do this using "search and replace" within Calibre conversion, but I do not know the shorthand or code for indicating these characters.

I have a simple text-block book that has within it, as some gutenberg.org books do, page numbers within the text block (not coded footers, etc.).

So some lines are interrupted like this:

In chemistry, we find such assertions as that hydrogen being univalent while oxygen is bivalent, "makes it plain that we must expect to find no more than three compounds of those elements." It did not make the matter plain to those who held to the strict univalence of chlorine; and Dr. Williams says nothing about variable
*
*― 281 ―
*
*valencies, but rather implies their fixity. The history of opinion concerning Mendeléef's law is inexcusably inaccurate after the admirable history of the matter by Venable.

I.e., I want to remove the space between "variable" and "valencies" and the page number and remove all similar page numbers throughout the text.

I am not a code writer: can someone walk me through this or point me to some correct, precise (I need to know how to indicate the dash as a running character and those page numbers, especially) instructions?

theducks · 01-07-2015, 04:22 PM

Quote:

Originally Posted by Johann Cat

I think there is a way to do this using "search and replace" within Calibre conversion, but I do not know the shorthand or code for indicating these characters.

I have a simple text-block book that has within it, as some gutenberg.org books do, page numbers within the text block (not coded footers, etc.).

So some lines are interrupted like this:

In chemistry, we find such assertions as that hydrogen being univalent while oxygen is bivalent, "makes it plain that we must expect to find no more than three compounds of those elements." It did not make the matter plain to those who held to the strict univalence of chlorine; and Dr. Williams says nothing about variable
*
*― 281 ―
*
*valencies, but rather implies their fixity. The history of opinion concerning Mendeléef's law is inexcusably inaccurate after the admirable history of the matter by Venable.

I.e., I want to remove the space between "variable" and "valencies" and the page number and remove all similar page numbers throughout the text.

I am not a code writer: can someone walk me through this or point me to some correct, precise (I need to know how to indicate the slash as a running character and those page numbers, especially) instructions page?

You are probably going to need to learn some (REGEX) editing skills.
Among other things, * is a special character in REGEX (a quantifier) and will need to be escaped.

Conversion search and replace is meant for simpler tasks.

BTW, there are a few REGEX tutorials and REGEX help threads here at MR for when you do get stuck.

IMHO, 100%: work in code view for this one. There are at least 4 lines of code involved in your example.

Tex2002ans · 01-08-2015, 10:28 PM

Quote:

Originally Posted by theducks

Among other things, * is a wildcard and will need to be escaped.

I think he was just trying to emphasize the portion he was speaking of with asterisks, not that the actual source document had them!

Quote:

Originally Posted by Johann Cat

I have a simple text-block book that has within it, as some gutenberg.org books do, page numbers within the text block (not coded footers, etc.).

Mind just linking to the specific Gutenberg example?

If I am understanding correctly, I am thinking it might just be using a Regex as simple as this:

Search: \s+― [0-9]+ ―\s+
Replace: (insert a single space here)

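What this says in English is "look for one or more whitespace characters (spaces, tabs, or line breaks)" + "look for the long dash followed by a space" + "look for one or more digits" + "look for a space followed by the long dash" + "look for one or more whitespace characters". Replace with "a single space". Copy the dash straight out of your document rather than typing it, since there are several look-alike dash characters and the Regex only matches the exact one.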

What I would then do is just clean up the file in a Text Editor using the above Regex, and then feed that document through Calibre for conversion.
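
If you would rather script the cleanup than do it in a text editor, here is a minimal Python sketch of the same search and replace. It assumes the page markers look exactly like the sample in the first post (long dash, space, digits, space, long dash). The name strip_page_markers is just made up for illustration; it is not part of Calibre or any other tool.

```python
import re

# Same pattern as the Search line above:
# whitespace, dash, space, digits, space, dash, whitespace.
PAGE_MARKER = re.compile(r"\s+― [0-9]+ ―\s+")

def strip_page_markers(text):
    """Replace each in-text page marker (and the whitespace around it) with one space."""
    return PAGE_MARKER.sub(" ", text)

# Sample modeled on the excerpt in the first post.
sample = "says nothing about variable\n\n― 281 ―\n\nvalencies, but rather implies their fixity."
print(strip_page_markers(sample))
```

The dash inside the pattern was copied from the sample text, so if your book uses a different dash character you would need to paste that one in instead.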

Johann Cat · 01-09-2015, 03:50 AM

Thanks for the suggestions. The asterisks that show in the left margin did not appear in the original text, but somehow appeared when I pasted the text into this editor, so I should have deleted them for accuracy's sake. I think the asterisks may indicate paragraph or line-break symbols, but, again, were not apparent in the original. I will try the regex suggestion; that code makes sense. Do you know if I can do this using openoffice's text editor? I have the "alternative" find-replace app installed. Or should I use calibre? If neither of those, what is the regex editor of choice?

drjenkins · 01-09-2015, 08:40 AM

Quote:

Originally Posted by Johann Cat

I think the asterisks may indicate paragraph or line-break symbols, but, again, were not apparent in the original.

That is why it is so important to use code view. Don't search for what you see in the File Preview pane.

theducks · 01-09-2015, 10:02 AM

Quote:

Originally Posted by Johann Cat

Thanks for the suggestions. The asterisks that show in the left margin did not appear in the original text, but somehow appeared when I pasted the text into this editor, so I should have deleted them for accuracy's sake. I think the asterisks may indicate paragraph or line-break symbols, but, again, were not apparent in the original. I will try the regex suggestion; that code makes sense. Do you know if I can do this using openoffice's text editor? I have the "alternative" find-replace app installed. Or should I use calibre? If neither of those, what is the regex editor of choice?

Be aware: there are various dialects of REGEX. The basic code above usually works in all of them, but there can be differences.

The nice thing is that OO, Calibre, and Sigil each have an interactive editor where you can try out and hone your REGEX.

Avoid Replace All. Step through a few dozen perfect finds before even thinking of "lettin' 'er rip".

Remember: File: DISCARD for when things go deep South.
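
The "step through the finds first" advice can also be followed in a script. This is only a rough Python sketch, assuming the same page marker pattern posted earlier in the thread; preview_matches and its limit and context parameters are made-up names for illustration. It lists each match with a bit of surrounding text and changes nothing.

```python
import re

# Pattern from earlier in the thread (dash copied from the sample text).
PAGE_MARKER = re.compile(r"\s+― [0-9]+ ―\s+")

def preview_matches(text, limit=24, context=30):
    """Return up to `limit` snippets showing each match with surrounding text."""
    snippets = []
    for match in PAGE_MARKER.finditer(text):
        if len(snippets) >= limit:
            break
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        # repr() makes line breaks visible as \n so you can see what is matched.
        snippets.append(repr(text[start:end]))
    return snippets

sample = "about variable\n\n― 281 ―\n\nvalencies, and later\n― 282 ―\nmore text"
for snippet in preview_matches(sample):
    print(snippet)
```

If every snippet shows a page marker and nothing else, the replace is probably safe to run on the whole file.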

Tex2002ans · 01-09-2015, 04:45 PM

Quote:

Originally Posted by Johann Cat

Do you know if I can do this using openoffice's text editor?

Hmmm, from what I was quickly able to find, OpenOffice Writer only allows you to search PER PARAGRAPH. It doesn't let you search across paragraphs.

I would also avoid one of those fancy GUI Word Processors if you can, because they add A TON of cruft on top of the text (fonts, font sizes, spacing, etc. etc.).

Quote:

Originally Posted by Johann Cat

I have the "alternative" find-replace app installed.

Hmmm, what is this "alternative" you speak of? Is this an addon for OpenOffice? I am not familiar at all... perhaps it allows that functionality.

Quote:

Originally Posted by Johann Cat

Or should I use calibre? If neither of those, what is the regex editor of choice?

I personally use Notepad++ for some basic editing of TXT/HTML files.

For EPUB, you can also use Sigil or Calibre's "Edit Book" feature.

It all depends on what the source format is of this document you are getting. If you got it from Gutenberg, I assume it is TXT or EPUB?

If you link right over to the Gutenberg copy you are working on, perhaps we could figure out an even more specific answer to remove these page numbers.

Quote:

Originally Posted by theducks

Be aware: There are various dialects of REGEX. The Basic code above usually works in all, but be aware, there can be differences.

[...]

Yep yep, I definitely know that Microsoft Word + Open/LibreOffice have some differences with their version of Regex.

The Regex I listed above should work in: Notepad++, Sigil, and Calibre (and whatever other program uses that same Regex engine).

And good list of warnings. It should always be stressed that you should SAVE BACKUP COPIES BEFORE YOU MAKE HUGE REGEX CHANGES.
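
For anyone scripting this, the backup step can be built right in. A minimal Python sketch, assuming the book is a UTF-8 plain text file; clean_with_backup and the ".bak" suffix are my own choices for illustration, not anything Calibre or Sigil does for you.

```python
import re
import shutil
from pathlib import Path

# Pattern from earlier in the thread (dash copied from the sample text).
PAGE_MARKER = re.compile(r"\s+― [0-9]+ ―\s+")

def clean_with_backup(path):
    """Copy the file to <name>.bak, then rewrite the original without page markers."""
    source = Path(path)
    backup = source.with_name(source.name + ".bak")
    shutil.copy2(source, backup)  # keep an untouched copy before changing anything
    text = source.read_text(encoding="utf-8")  # assumption: the file is UTF-8
    source.write_text(PAGE_MARKER.sub(" ", text), encoding="utf-8")
    return backup
```

Calling clean_with_backup("book.txt") (a hypothetical file name) would leave the original text in book.txt.bak, so a bad Regex can always be undone by copying the backup over the cleaned file.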
