How to remove unnecessary items in a text?

Darkitow · 05-14-2011, 12:09 PM

I've been converting a couple of books I own to e-book format. I can work a bit with BD and other tools to adapt books to my Kindle, but I've run into a problem that I can't get rid of no matter what I try.

The text I wanna adapt has a header on every page and a page number at the bottom. These things are "normal text": no matter how I convert the file or what format I use, they always appear as body text instead of the typical .pdf/.doc header, which is much easier to clean.

The only text editor that seems to make this a bit friendlier is, amazingly, Microsoft Word, as it seems the text was made in that program, and at least there the headers and page numbers are "ordered" on every page. Apart from that, I have no clue how to do this. When I open the document directly in BD, it takes the page numbers as titles and ignores the real chapter titles, even though they are very noticeable in the .pdf or .doc (they appear in bold italic and about 10 sizes bigger than the rest of the text, but BD apparently ignores this).

I've tried everything I know: converting the text to about 8 formats, copying it all into BD directly, and other stuff. Is there a way to do something like deleting stuff selectively in Word, or something in BD to do this?

Soxendom · 05-15-2011, 03:05 PM

Is the header text always the same, i.e. the name of the book or something like that? If it is, could you not use search and replace? Similarly for the page numbers, by searching for any digit. Admittedly you couldn't use "replace all" there, as you'd possibly replace something you needed.
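If the text can be exported to a plain text file, the same search-and-replace idea can be sketched in a few lines of Python. This is only a rough illustration, not something BD or Word does for you: the function name and the sample header "MY BOOK TITLE" are made up, and it assumes each header and each page number sits alone on its own line. Matching whole lines rather than "any digit" avoids the replace-all problem, since numbers inside a sentence are left alone.

```python
import re

def strip_headers_and_page_numbers(text, header):
    """Drop lines that are exactly the running header or only a page number."""
    # Assumption: a page number is a line holding nothing but 1 to 4 digits.
    page_number = re.compile(r"^\s*\d{1,4}\s*$")
    kept = []
    for line in text.splitlines():
        if line.strip() == header.strip():
            continue  # running header repeated on every page
        if page_number.match(line):
            continue  # line containing only digits
        kept.append(line)  # digits inside normal sentences are untouched
    return "\n".join(kept)

# Hypothetical sample: two "pages" with a header on top and a number below.
sample = (
    "MY BOOK TITLE\n"
    "It was 1984 and cold.\n"
    "12\n"
    "MY BOOK TITLE\n"
    "He had 3 cats.\n"
    "13"
)
cleaned = strip_headers_and_page_numbers(sample, "MY BOOK TITLE")
print(cleaned)
```

A short line of real text that happens to be only a number (a year on its own line, say) would also be dropped, so it is worth checking the result before saving.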

DDHarriman · 05-16-2011, 04:57 AM

Hello

You can, for example, “cut out” the headers and the page numbers in the original file(s) used to do the OCR (if this is the way you are doing it).

Let’s imagine you scan your book and create an image PDF (a single file) with all the pages in the correct order.
Let’s imagine you use FineReader Pro to do the OCR…

Do this:

1 - make a copy of your PDF with another name (protecting the original file if something goes wrong);

2 - open the new file in FineReader and use the “crop” option in the “edit page image” part to mark a rectangular selection on the page that leaves the headers and page numbers outside it, then apply the cut (to that page or to all of them) - be careful, this cannot be undone;

3 - OCR the result - presto, no headers or page numbers.

Alternative - if you have, for example, Acrobat Pro, go to the margins configuration and redefine the top and bottom ones so the headers and page numbers fall outside the new margins, then save the file under a new name. Open it in your OCR program and apply step (3) above.

You can do all of the above with other programs too; just look for functions similar to the ones described above.

Best regards,

Darkitow · 05-16-2011, 09:57 AM

The PDF is a text one, not scans. I'm not too sure how it was made, because it looks pretty much like the "physical" book; it has all the edition and publisher stuff too.

I own the paperback book already and this one seems like a different edition, because my book is 900+ pages while this one is around 450, so I guess the one I have on my computer is the hardcover edition or something. My book doesn't have a header either, just page numbers. I've tried to scan my book, but I don't really wanna tear it apart when there are copies on the internet already that would save me that work and the murder of my poor book, lol.

So, as I was saying, I have no idea how the PDF was created. The header and the page numbers are "normal text", I mean that they don't appear in any program as "header", so I can't use any kind of command to "remove headers". I guess the book was scanned and OCR'd, but it's really well done and I haven't found any typos yet...

[EDIT] Well it seems I managed to do it, thankfully the page numbers were all labelled as "titles" or "subtitles", and the headers were all the same text so I could get rid of everything with a bit of Search/Replace and another bit of Element Browser. I also fixed some thingies like empty paragraphs between pages and moved all the translation notes to the end of the text.

My only problem now is that the book is full of "bad ends" and "broken sentences". I've been searching around in the forums but I couldn't find a way to fix it without moving the text to a different program (and honestly I wouldn't like that, as I have the book almost fixed). I've tried some regexp checks but I don't know exactly what to do. There are more than two thousand of those, so doing it manually could end with me throwing the computer out of the window, and I wouldn't like that. <_<
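For what it's worth, the usual rule of thumb for "broken sentences" can be sketched outside BD, assuming the text can be exported as a list of paragraphs: if a paragraph does not end with closing punctuation and the next one starts with a lowercase letter, the two most likely belong together. The Python below is a hypothetical sketch of that heuristic only (the function name and sample text are invented); it is not a BD feature, and it will miss breaks that fall right before a capitalised word such as a name.

```python
import re

def join_broken_paragraphs(paragraphs):
    """Merge a paragraph into the previous one when the break looks accidental."""
    # Assumption: a finished paragraph ends with one of these characters.
    terminal = re.compile(r'[.!?:;"”)\]]\s*$')
    merged = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # drop empty paragraphs left between pages
        previous_unfinished = bool(merged) and not terminal.search(merged[-1])
        if previous_unfinished and not para[0].isupper():
            # No closing punctuation before, lowercase start now: join them.
            merged[-1] = merged[-1] + " " + para
        else:
            merged.append(para)
    return merged

# Hypothetical sample with one sentence split across two paragraphs.
broken = ["He walked to the", "door and opened it.", "", "Then he left."]
fixed = join_broken_paragraphs(broken)
print(fixed)
```

Chapter titles usually survive because the paragraph after them starts with a capital letter, but with a couple of thousand joins it is still worth skimming the result.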

wannabee · 05-26-2011, 03:07 AM

I had the same problem once. Every end of line had a paragraph return on it when exporting to RTF or DOC. The only way I found to not have the end of paragraphs all over the place was to export the PDF to HTML with CSS from Adobe Acrobat Pro. Opened it up in the browser and copied it from the browser back to the program I was using to edit it.
There was an option to not export headers and page numbers. Though a few crept in.
