MobileRead Forums > E-Book Formats > Workshop

05-14-2011, 12:09 PM   #1
Darkitow
Device: Kindle 3
How to remove unnecessary items in a text?

I've been converting a couple of books I own to e-book format. I can work a bit with BD and other tools to adapt books to my Kindle, but I've come across a problem that I can't get rid of, no matter what I try.

The text I want to adapt has a header on every page and a page number at the bottom. These are "normal text": no matter how I convert the file or what format I use, they always appear as body text instead of as a typical .pdf/.doc header, which is much easier to clean.

The only text editor that seems to make this a bit friendlier is, amazingly, Microsoft Word. The text seems to have been made in that program, so at least there the headers and page numbers are "ordered" on every page, but apart from that I have no clue how to do this. When I open the document directly in BD, it takes the page numbers as titles and ignores the real chapter titles, even though they are very noticeable in the .pdf or .doc (they appear in bold italic and about 10 sizes bigger than the rest of the text, but BD apparently ignores this).

I've tried everything I know: converting the text to about 8 formats, copying it all into BD directly, and other things. Is there a way to do this, such as deleting things selectively in Word, or something in BD?
05-15-2011, 03:05 PM   #2
Soxendom
Device: Sony PRS T1
Is the text always the same, i.e. the name of the book or something like that? If so, could you not use search and replace? Similarly for the page numbers, by searching for any digit. Admittedly you couldn't use Replace All, as you might replace something you needed.
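
The same idea can be sketched in a few lines of Python. This is only an illustration, assuming the book has been exported to plain text and that the header and page numbers each sit on a line of their own; the function name `strip_running_text`, the header string, and the sample lines are made up for the example and do not come from the book in question.

```python
import re

def strip_running_text(lines, header):
    """Drop lines that are only the running header or only a page number."""
    # Assumption: a page number is a line holding 1 to 4 digits and nothing else.
    page_number = re.compile(r"^\s*\d{1,4}\s*$")
    kept = []
    for line in lines:
        if line.strip() == header:
            continue  # running header repeated on every page
        if page_number.match(line):
            continue  # a line with nothing but digits
        kept.append(line)
    return kept

# Hypothetical sample: digits inside a sentence must survive.
sample = [
    "The Long Voyage",
    "It was 1805 when the ship left port.",
    "12",
    "The Long Voyage",
    "She carried 40 guns.",
    "13",
]
cleaned = strip_running_text(sample, header="The Long Voyage")
print(cleaned)
```

Matching whole lines rather than any digit avoids the Replace All problem mentioned above, since numbers inside sentences are left alone.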
05-16-2011, 04:57 AM   #3
DDHarriman
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
Hello,

You can, for example, “cut out” the headers and the page numbers in the original file(s) used for the OCR (if that is how you are doing it).

Let’s imagine you scan your book and create an image PDF (a single file) with all the pages in the correct order.
Let’s imagine you use FineReader Pro to do the OCR…

Do this:

1 - make a copy of your PDF under another name (to protect the original file if something goes wrong);

2 - open the new file in FineReader and use the “crop” option in the “edit page image” section to mark a rectangular selection on the page that leaves the headers and page numbers outside it, then apply the cut (to that page or to all of them). Be careful: this cannot be undone;

3 - OCR the result. Presto, no headers or page numbers.

Alternative: if you have, for example, Acrobat Pro, go to the margins configuration and redefine the top and bottom margins so that the headers and page numbers fall outside the new margins, then save the file under a new name. Open it in your OCR program and apply step (3) above.

You can do all of the above with other programs too; just look for functions similar to the ones described above.
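
To make the cropping step concrete, here is a rough Python sketch of the arithmetic only: given a page size and the height of the header and footer bands, it works out the rectangle to keep. It does not call FineReader, Acrobat, or any PDF library; the function name `crop_box` and the page and margin sizes are assumptions chosen for illustration.

```python
def crop_box(page_width, page_height, top_margin, bottom_margin):
    """Return (left, bottom, right, top) of the area to keep, in points.

    The origin is the bottom-left corner of the page, as in PDF coordinates.
    """
    if top_margin < 0 or bottom_margin < 0:
        raise ValueError("margins must be non-negative")
    if top_margin + bottom_margin >= page_height:
        raise ValueError("margins leave nothing to keep")
    # Keep the full width; cut the header band at the top and the
    # page-number band at the bottom.
    return (0, bottom_margin, page_width, page_height - top_margin)

# Hypothetical US Letter page (612 x 792 pt) with a header in the top
# 60 pt and a page number in the bottom 50 pt.
box = crop_box(612, 792, top_margin=60, bottom_margin=50)
print(box)  # (0, 50, 612, 732)
```

The same rectangle can then be applied to every page, provided the header and page number sit at the same height throughout the book.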

Best regards,
05-16-2011, 09:57 AM   #4
Darkitow
Device: Kindle 3
The PDF is a text one, not scans. I'm not too sure how it was made, because it looks pretty much like the "physical" book; it has all the edition and publisher information too.

I already own the paperback, and this one seems to be a different edition, because my book is 900+ pages while this one is around 450, so I guess the one on my computer is the hardcover edition or something. My book doesn't have a header either, just page numbers. I've tried to scan my book, but I don't really want to tear it apart when there are already copies on the internet that would save me that work and the murder of my poor book, lol.

So, as I was saying, I have no idea how the PDF was created. The header and the page numbers are "normal text": they don't appear in any program as a "header", so I can't use any kind of "remove headers" command. I guess the book was scanned and OCR'd, but it's really well done, and I haven't found any typos yet...

[EDIT] Well, it seems I managed to do it. Thankfully the page numbers were all labelled as "titles" or "subtitles", and the headers were all the same text, so I could get rid of everything with a bit of Search/Replace and a bit of Element Browser. I also fixed a few things like empty paragraphs between pages, and moved all the translation notes to the end of the text.

My only problem now is that the book is full of "bad ends" and "broken sentences". I've been searching the forums, but I couldn't find a way to fix it without moving the text to a different program (and honestly I'd rather not, as I have the book almost fixed). I've tried some regexp checks, but I don't know exactly what to do. There are more than two thousand of them, so doing it manually could end with me throwing the computer out of the window, and I wouldn't like that. <_<
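
One common regexp-style approach to broken sentences is to treat any paragraph that does not end in sentence-final punctuation as unfinished and glue it to the next one. Below is a minimal Python sketch of that heuristic, assuming the text is available as a list of paragraph strings; the name `join_broken_paragraphs`, the punctuation set, and the sample text are assumptions for illustration, not something taken from BD or from the book in question.

```python
import re

# Assumption: a paragraph is complete if it ends with sentence-final
# punctuation, optionally followed by closing quotes or brackets.
SENTENCE_END = re.compile(r"""[.!?…:]["'”’»)\]]*$""")

def join_broken_paragraphs(paragraphs):
    """Merge each unfinished paragraph with the paragraphs that follow it."""
    merged = []
    pending = ""
    for para in paragraphs:
        text = para.strip()
        if not text:
            continue  # skip empty paragraphs left between pages
        pending = f"{pending} {text}" if pending else text
        if SENTENCE_END.search(pending):
            merged.append(pending)
            pending = ""
    if pending:
        merged.append(pending)  # keep a trailing fragment rather than lose it
    return merged

# Hypothetical sample with two sentences broken across paragraphs.
sample = [
    "He walked to the harbour and",
    "looked at the ships.",
    "",
    '"Are we leaving?"',
    "Nobody answered him until the",
    "captain arrived.",
]
fixed = join_broken_paragraphs(sample)
for para in fixed:
    print(para)
```

Chapter titles and lines of verse usually have no final punctuation either, so a heuristic like this would merge them into the following paragraph; the result still needs a quick read-through.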

Last edited by Darkitow; 05-16-2011 at 11:31 AM.
05-26-2011, 03:07 AM   #5
wannabee
Device: iOS
I had the same problem once. Every line ended with a paragraph return when exporting to RTF or DOC. The only way I found to avoid paragraph breaks all over the place was to export the PDF to HTML with CSS from Adobe Acrobat Pro. I opened it in the browser and copied it from the browser back into the program I was using to edit it.
There was an option to not export headers and page numbers, though a few crept in.
All times are GMT -4.

