Old 12-14-2011, 10:46 AM   #1
minorum
Junior Member
Posts: 9
Karma: 10
Join Date: Dec 2011
Device: Sony PRS-T1, Kobo-Glo
Hyphens are not deleted

Hi everyone,

I am trying to convert a PDF file to EPUB. I used the heuristic processing option, but all the hyphens are still there.

So I tried search and replace. I learned a little about regular expressions and then searched for hyphens before a line break with -<br> and replaced them with nothing. That deletes the hyphen, but when I look at the EPUB there is still a space where the hyphen was.

Any idea how I get it to work?

minorum
Old 12-15-2011, 09:39 AM   #2
Serpentine
Evangelist
Posts: 416
Karma: 1045911
Join Date: Sep 2011
Location: Cape Town, South Africa
Device: Kindle 3
The best advice I can give you is to do the PDF-to-EPUB conversion as cleanly as possible, preserving the text. Then open the EPUB in Sigil to do the regex work. You can use the latest Sigil beta, which has a nice new regex engine.

There should not be a space left after that replacement. However, if you are replacing the match with a space, or the following line starts with a space character, something like -<br(\s*/?)>\s* may match better. In either case I would suggest doing work like this outside of Calibre itself. While it may seem like a bit of extra work, it often saves a lot of time in the long run and will get you the results you're looking for.
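To show why the extra \s* matters, here is a minimal sketch of the same replacement in Python rather than in Sigil or Calibre. The function name and the sample strings are made up for illustration, and Python's regex engine is not the one Sigil uses, although this particular pattern is plain enough that it should behave the same way.

```python
import re

# A hyphen directly before <br>, <br/> or <br />, plus any whitespace
# (including the newline) that follows the tag.
HYPHEN_BREAK = re.compile(r"-<br(\s*/?)>\s*")


def join_hyphenated(html: str) -> str:
    """Remove end-of-line hyphens so the two word halves are joined."""
    return HYPHEN_BREAK.sub("", html)


# Hypothetical sample: the next line starts with a space after the tag.
sample = "The conver-<br />\n sion kept every hy-<br>phen."

# Replacing only the literal "-<br>" misses "<br />" and leaves
# any whitespace after the tag in place.
naive = "hy-<br> phen".replace("-<br>", "")

print(naive)                    # hy phen
print(join_hyphenated(sample))  # The conversion kept every hyphen.
```

The naive replacement is the case from the first post: the hyphen goes, but the space that followed the tag stays behind.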
Old 12-15-2011, 09:44 AM   #3
theducks
Well trained by Cats
Posts: 29,912
Karma: 55267620
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
With Sigil, you get to see the results of your mis-steps.

Those hyphens could be en dashes or minus signs, and different search terms are needed for each. In Sigil, you can copy and paste the character and never worry about which flavor it really was.
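If you would rather find out which flavor you have, a small Python sketch along these lines can list the dash-like characters in the text and then match any of them before a line break. The set of characters below is my own assumption about what a PDF might contain, not an exhaustive list, and the names DASHES, dash_flavours and join_any_dash are just for illustration.

```python
import re
import unicodedata

# Characters that can look like a hyphen in text extracted from a PDF.
# This selection is an assumption; extend it if your file uses others.
DASHES = (
    "\u002d"  # hyphen-minus (the key on the keyboard)
    "\u2010"  # hyphen
    "\u2011"  # non-breaking hyphen
    "\u2013"  # en dash
    "\u2212"  # minus sign
)

# Any one of those characters, then a <br> tag, then optional whitespace.
DASH_BREAK = re.compile("[" + re.escape(DASHES) + r"]<br(\s*/?)>\s*")


def dash_flavours(text: str) -> list:
    """Return the Unicode names of the dash-like characters in the text."""
    return sorted({unicodedata.name(ch) for ch in text if ch in DASHES})


def join_any_dash(html: str) -> str:
    """Remove a dash-like character that sits directly before a line break."""
    return DASH_BREAK.sub("", html)


# Hypothetical sample with an en dash and a minus sign.
sample = "hy\u2013<br>phen and mi\u2212<br/>nus"
print(dash_flavours(sample))  # ['EN DASH', 'MINUS SIGN']
print(join_any_dash(sample))  # hyphen and minus
```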
Old 12-15-2011, 10:58 AM   #4
ldolse
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
Was the behavior that some were removed and some weren't? Some hyphens are intentionally preserved unless Calibre can determine definitively that they should be removed.

If, on the other hand, every single hyphen from the source document is still in the converted document, this sounds like a bug, and you can open a bug report with the PDF attached.
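To give a rough idea of what "determine definitively" can mean, here is a small Python sketch of one conservative rule: join a hyphenated word only if the joined form also appears unhyphenated somewhere else in the same text. This is not Calibre's actual code, only an illustration of the general idea, and the function name and sample sentence are made up.

```python
import re

# A run of letters (Unicode-aware, so umlauts count as letters too).
WORD = re.compile(r"[^\W\d_]+")
# Two runs of letters joined by a single hyphen-minus.
HYPHENATED = re.compile(r"\b([^\W\d_]+)-([^\W\d_]+)\b")


def dehyphenate(text: str) -> str:
    """Join 'li-ke' into 'like' only when 'like' occurs elsewhere in the text."""
    known = {w.lower() for w in WORD.findall(text)}

    def fix(match):
        joined = match.group(1) + match.group(2)
        # Keep the hyphen unless the joined word is seen in the document.
        return joined if joined.lower() in known else match.group(0)

    return HYPHENATED.sub(fix, text)


sample = "I like books. They look li-ke well-known classics."
print(dehyphenate(sample))  # I like books. They look like well-known classics.
```

A rule like this leaves "well-known" alone because "wellknown" never appears in the text, which is why some hyphens survive on purpose.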
Old 12-16-2011, 04:03 AM   #5
minorum
Thanks for your answers.

@ldolse
Calibre's automatic processing removed none of the hyphens. The line break was removed, so the hyphen ends up inside the word, li-ke this. Overall, though, I think these PDFs are non-standard in some way, because the automatic processing worked with others. When I tried the expression -<br>.* and tested it (great praise for the regular expression assistant), nothing was marked.

@all
I tried it with Sigil, but it is a lot of work. So I went back to the source, ran OCR on the document, and then used Sigil, which was a bit easier. I also tried some of the commercial PDF-to-EPUB converters, but most of the time the result is no better than Calibre's and often worse.
But then I found that the newest version of FineReader converts scans and PDFs directly to EPUB. I got the trial and tested it. It worked marvellously! All hyphens were removed, page numbers were hidden, and even the large initials at the beginning of a chapter were recognised and put into the flowing text.
If you are willing to pay for EPUB conversion, FineReader 11 is definitely worth it, and you get one of the best OCR programs.


I would also like to thank the developers of Calibre. It is definitely the best program for e-books that is available. I got a reader only a short time ago and am still working on getting all my digital assets onto it. Calibre helps a lot, so I can leave my laptop at home more often.
A donation for this great project will follow.

Greetings from Germany

minorum
All times are GMT -4.