MobileRead Forums - View Single Post - How to remove unnecessary items in a text?

DDHarriman · 05-16-2011, 04:57 AM

Hello

You can, for example, “cut out” the headers and the page numbers in the original file(s) used for the OCR (if that is the way you are doing it).

Let’s imagine you scan your book and create an image PDF (a single file) with all the pages in the correct order.
Let’s also imagine you use FineReader Pro to do the OCR…

Do this:

1 - make a copy of your PDF under another name (this protects the original file if something goes wrong);

2 - open the new file in FineReader and use the “crop” option in the “edit page image” section to draw a rectangular selection that leaves the headers and page numbers outside it, then apply the crop (to that page or to all of them). Be careful: this cannot be undone;

3 - OCR the result. Presto, no headers and no page numbers.
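
If you prefer to script the cropping instead of doing it by hand, the idea in step 2 can be sketched roughly like this. This is only an illustration, not something FineReader itself does: the function name `crop_box` and the 8% default margins are my own assumptions, and you would need to measure your own scans to pick the right values.

```python
def crop_box(width, height, top_frac=0.08, bottom_frac=0.08):
    """Return (left, upper, right, lower) for the part of a page image
    that remains after cutting off the header and footer strips.

    top_frac / bottom_frac are the share of the page height taken by the
    header and by the page-number area (hypothetical defaults).
    """
    if not (0 <= top_frac < 1 and 0 <= bottom_frac < 1):
        raise ValueError("fractions must be in the range [0, 1)")
    if top_frac + bottom_frac >= 1:
        raise ValueError("margins leave no page area to OCR")
    upper = round(height * top_frac)              # first pixel row to keep
    lower = height - round(height * bottom_frac)  # first pixel row to drop
    return (0, upper, width, lower)


# Example: a 1000 x 2000 pixel scan, header in the top 10%,
# page number in the bottom 5%.
box = crop_box(1000, 2000, top_frac=0.10, bottom_frac=0.05)
print(box)
```

The resulting tuple uses the (left, upper, right, lower) order that image libraries such as Pillow accept for cropping, so it could be applied to every page image before running the OCR. As in step 1, work on copies only.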

Alternative - if you have, for example, Acrobat Pro, go to the margin settings and redefine the top and bottom margins so that the headers and page numbers fall outside them, then save the file under a new name. Open it in your OCR program and apply step (3) above.

You can do all of the above with other programs too; just look for functions similar to the ones described here.

Best regards,
