Old 06-02-2009, 05:09 PM   #16
Argel
Opinionated [but right]
Posts: 281
Karma: 1412
Join Date: Apr 2008
Location: UK
Device: Cybook Gen3, PRS 505, Kindle Int, Oasis, Paperwhite, Scribe
I've had to do this many hundreds of times over the years and never found it necessary to use specialist tools. The sequence with any editor is basically as ahi describes above.

First identify whatever character(s) mark a paragraph end and globally replace them with a unique marker - '|' will do the job but so will "&&&&" or any character combination that doesn't otherwise occur in the text.

You can now remove all the remaining return/newline characters BUT FIRST globally replace them with the same character plus a space. This is because some lines may end [space][newline] and others may not have the space. If you simply strip out the return/newline the last/first words will run together and if you don't notice immediately you're in a mess.

Now repeatedly global search for [return/newline][two spaces] and replace with [return/newline][one space]. When you no longer find the target you know that every line now terminates with [return/newline][space] so just replace that character combination with [space]. Of course, as Jellby points out, if you know how to search for a return/newline with an arbitrary number of whitespace characters, you can do this in one operation.

You now have one impenetrable text block but all you have to do is globally replace your original marker, e.g. &&&&, with a paragraph return and hey presto.
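For anyone who would rather script it, the sequence above can be sketched in Python roughly as follows. This is only a sketch under assumptions: the function name unwrap, the "&&&&" marker and the idea of passing in the paragraph-end string are illustrative choices, and the regex in step 3 stands in for the repeated manual search.

```python
import re

MARKER = "&&&&"  # assumed not to occur anywhere in the text


def unwrap(text: str, para_end: str) -> str:
    """Rejoin hard-wrapped lines, keeping a break only where para_end occurred."""
    # Normalise line endings so "\n" is the only return/newline character.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 1: protect the real paragraph ends with the unique marker.
    text = text.replace(para_end, MARKER)
    # Step 2: newline -> newline + space. Kept to mirror the manual procedure;
    # the regex in step 3 would cope without it.
    text = text.replace("\n", "\n ")
    # Step 3: replace each newline, plus any spaces around it, with one space.
    text = re.sub(r" *\n *", " ", text)
    # Step 4: turn the marker back into a paragraph return.
    return text.replace(MARKER, "\n")


sample = "The quick brown \nfox jumps\nover.\n\nA second\nparagraph."
print(unwrap(sample, "\n\n"))
```

Here a blank line is assumed to be the paragraph marker; for another file you would pass whatever character sequence actually ends its paragraphs.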
Old 06-03-2009, 02:05 AM   #17
Sabardeyn
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
Quote:
Originally Posted by Argel View Post
I've had to do this many hundreds of times over the years
Wouldn't it have been substantially easier if you just wrote it once and ran it as needed?


Gideon,
Most Search & Replace functions are a limited form of regular expressions (regex). What both Jellby and Argel have just posted are forms of regex. Granted, Jellby went hardcore and Argel gave more generalized info, but they're still regex.

I mentioned RegexBuddy because it's a decent way to learn. It allows you to create a formula and save it in a library for future use. It also explains what the formula is doing in English, a feature whose value cannot be stressed enough: try re-reading Jellby's formulas. Complex regex is not easy to understand. (Try understanding the expression halfway down this post if you think I'm kidding...)

You could create a Search & Replace (that is, regex) expression for every change needed in this file, and then save them all individually. You could start a new regex formula, load every one of the expressions you just made into one humongous formula, find the correct "stacking order" so they're all processed correctly, and issue one command to fix the whole file. This humongous formula could be saved as well. And, if you ever need it again, you could load, run and be done almost instantly (at least for the "hands on" portion of the work, conversion would take a bit of time, of course).

You do not have to use any of the software that has been mentioned. You can, as others have mentioned, very easily use almost any existing software that you are comfortable with - providing it can perform the necessary tasks.
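As a rough illustration of the "saved library plus stacking order" idea, here is a minimal Python sketch. The rule list, the "@@PARA@@" placeholder and the function name apply_rules are all made up for the example, and the rules assume a file where a blank line separates paragraphs; they are not anyone's actual formulas from this thread.

```python
import re

# Hypothetical rule library: (description, pattern, replacement).
# The order of the list is the "stacking order"; "@@PARA@@" is an
# assumed placeholder that must not occur in the text.
RULES = [
    ("normalise line endings", r"\r\n?", "\n"),
    ("strip trailing spaces and tabs", r"[ \t]+\n", "\n"),
    ("protect blank-line paragraph breaks", r"\n{2,}", "@@PARA@@"),
    ("join the remaining wrapped lines", r"\n", " "),
    ("restore paragraph breaks", r"@@PARA@@", "\n\n"),
    ("collapse repeated spaces", r" {2,}", " "),
]


def apply_rules(text, rules=RULES):
    """Apply every saved rule once, in list order."""
    for _description, pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


print(apply_rules("Line one \r\nline two\r\n\r\nNext  para"))
```

Changing the order changes the result: joining lines before protecting the blank-line breaks, for instance, would lose every paragraph.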
Old 06-03-2009, 05:39 AM   #18
Argel
Opinionated [but right]
Posts: 281
Karma: 1412
Join Date: Apr 2008
Location: UK
Device: Cybook Gen3, PRS 505, Kindle Int, Oasis, Paperwhite, Scribe
Quote:
Originally Posted by Sabardeyn View Post
Wouldn't it have been substantially easier if you just wrote it once and ran it as needed?
No, not really, since manuscripts are so idiosyncratic that special tools often fail - as Gideon found.

I have some useful Word macros but it's often quicker to do this kind of stuff manually. There are two kinds of people in life - those who spend hours fiddling around with quicker ways to do stuff and those who just do stuff.
Old 06-03-2009, 12:19 PM   #19
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by Argel View Post
No, not really, since manuscripts are so idiosyncratic that special tools often fail - as Gideon found.
The reason that "textify" failed in Gideon's case is that his file has NO indication of "end of paragraph" - there is a carriage return at the end of every line, and no special indication of the start of a paragraph. There is literally nothing for any tool to "pick up on". It's a horribly-formatted file.

Textify works beautifully on the overwhelming majority of files. I must have used it on literally hundreds of files, with great success. It also has the useful feature of being able to create HTML output, with _ _ replaced with italics.
Old 06-03-2009, 12:22 PM   #20
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Quote:
Originally Posted by HarryT View Post
The reason that "textify" failed in Gideon's case is that his file has NO indication of "end of paragraph" - there is a carriage return at the end of every line, and no special indication of the start of a paragraph. There is literally nothing for any tool to "pick up on". It's a horribly-formatted file.

Textify works beautifully on the overwhelming majority of files. I must have used it on literally hundreds of files, with great success. It also has the useful feature of being able to create HTML output, with _ _ replaced with italics.
... how does Gideon know where the paragraph breaks are?

If he knows where they are, the same way he can ascertain it, so ought a computer program be able to... I think?

- Ahi
Old 06-03-2009, 12:27 PM   #21
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by ahi View Post
... how does Gideon know where the paragraph breaks are?

If he knows where they are, the same way he can ascertain it, so ought a computer program be able to... I think?

- Ahi
Because people can easily do "pattern recognition" tasks which are extremely difficult for a computer.

You could say "If a line starts in a capital letter then it's probably a new paragraph", I suppose. It wouldn't be 100% reliable, but it would be a good start.
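A quick way to try that rule out might look like the Python sketch below. It is only an illustration: the function name capital_starts is invented, and skipping leading quote marks is an extra assumption not mentioned above.

```python
def capital_starts(lines):
    """Return the indices of lines whose first letter is upper-case.

    Leading spaces and quote marks are skipped first (an added assumption),
    so that a line opening with dialogue still counts.
    """
    hits = []
    for index, line in enumerate(lines):
        stripped = line.lstrip(" \t\"'")
        if stripped[:1].isupper():
            hits.append(index)
    return hits


lines = [
    "It was late.",
    "and the rain fell",
    "hard. Nobody came.",
    '"Morning," he said.',
]
print(capital_starts(lines))
```

It will also flag any wrapped line that happens to begin with a name or a new sentence, which is why it can only ever be a starting point.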
Old 06-03-2009, 12:34 PM   #22
DaleDe
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Quote:
Originally Posted by HarryT View Post
Because people can easily do "pattern recognition" tasks which are extremely difficult for a computer.

You could say "If a line starts in a capital letter then it's probably a new paragraph", I suppose. It wouldn't be 100% reliable, but it would be a good start.
And if the previous line was shorter than usual, that would also be a clue. Unfortunately the length of the line has become a rather poor indicator, because the original often assumed mono-spaced fonts and currently this is almost never the case. But if you assume it was mono-spaced you can count characters and determine whether the next word would have fitted. It is certainly beyond simple regular expressions.
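The character-counting idea could be sketched in Python along these lines. It assumes the text was wrapped for a mono-spaced display, and it guesses the wrap width from the longest line unless you supply one; the function name paragraph_starts is made up for the example.

```python
def paragraph_starts(lines, width=None):
    """Guess which lines begin a paragraph from the 'would it have fitted' test.

    If the first word of a line would have fitted on the end of the previous
    line, the break before it was probably deliberate.
    """
    lines = [line.rstrip() for line in lines]
    if width is None:
        # Assumption: the longest line shows the original wrap width.
        width = max((len(line) for line in lines), default=0)
    starts = []
    for index, line in enumerate(lines):
        words = line.split()
        if not words:
            continue  # blank line: nothing starts here
        if index == 0 or not lines[index - 1]:
            starts.append(index)  # first line, or line after a blank line
        elif len(lines[index - 1]) + 1 + len(words[0]) <= width:
            starts.append(index)  # the word would have fitted on the line above
    return starts


lines = [
    "The quick brown fox",
    "jumps over the lazy",
    "dog.",
    "Then it went home",
    "again at last.",
]
print(paragraph_starts(lines))
```

A paragraph whose last line happens to run nearly full width will be missed, so the result still wants checking by eye.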

Dale
Old 06-03-2009, 12:43 PM   #23
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Quote:
Originally Posted by HarryT View Post
Because people can easily do "pattern recognition" tasks which are extremely difficult for a computer.

You could say "If a line starts in a capital letter then it's probably a new paragraph", I suppose. It wouldn't be 100% reliable, but it would be a good start.
Speaking in general terms, you are right. This ought very rarely be the case with plaintext though.

Are you guessing, by the way, or have you seen the file? I myself would not claim to be able to tell where the paragraph breaks are if my only indicators were lines starting with capitals... an indicator that is, by the way, trivial to identify and process via a script.

Same question to DaleDe: are you guessing, or is what you are saying specifically the issue with Gideon's file?

- Ahi
Old 06-03-2009, 01:12 PM   #24
ahi
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
Assuming that the paragraph break information in Gideon's file is not impossible to accurately retrieve, throwing together a python script that works thusly might help:

In sequentially parsing the text file, build a list that consists of (1) non-whitespace character sequence strings, and (2) numbers indicating consecutive whitespace weights. As explained in my "text processing ideas" post, the weighting ought to assign a value of 1 for each space character, and 1000 for each linebreak (which can either be chr(10), chr(13) or the two together [which still should be counted as a single linebreak]). (Tabs, I suppose could be counted as having a weight of 4 or 8 or even larger.)

1 linebreak + 5 spaces = 1005
2 linebreaks + 0 spaces = 2000
1 space + 2 linebreaks + 5 spaces = 2006

Once this is done, you basically have a list that you could process sequentially to recreate the input file (save for the whitespaces).
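A minimal sketch of that first pass, assuming Python and a tab weight of 4. The names whitespace_weight and tokenize are illustrative, and only space, tab, CR and LF are treated as whitespace here.

```python
import re

SPACE_WEIGHT = 1
TAB_WEIGHT = 4            # assumed; 8 or larger would work the same way
LINEBREAK_WEIGHT = 1000


def whitespace_weight(run: str) -> int:
    """Weight of one run of consecutive whitespace characters."""
    # chr(13)+chr(10) together is matched first, so it counts as one linebreak.
    linebreaks = len(re.findall(r"\r\n|\r|\n", run))
    return (run.count(" ") * SPACE_WEIGHT
            + run.count("\t") * TAB_WEIGHT
            + linebreaks * LINEBREAK_WEIGHT)


def tokenize(text: str) -> list:
    """Return words (str) and whitespace weights (int) in document order."""
    tokens = []
    for piece in re.findall(r"[ \t\r\n]+|[^ \t\r\n]+", text):
        if piece[0] in " \t\r\n":
            tokens.append(whitespace_weight(piece))
        else:
            tokens.append(piece)
    return tokens


print(whitespace_weight("\n     "))         # 1 linebreak + 5 spaces
print(whitespace_weight(" \r\n\r\n     "))  # 1 space + 2 linebreaks + 5 spaces
print(tokenize("a b\n c"))
```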

At that time, take all the whitespace weights and put them in a separate list, and get the mode of that list. It will be the weight of whitespace used to separate words.

Remove from the list of whitespace weights all instances of the mode weight, and take the mode of what remains. It will be the weight of whitespace used to separate lines.

Then remove from the list of whitespace weights all instances of that second mode weight as well, and take the mode of what remains. It will be the weight of whitespace used to separate paragraphs.

Output the list, replacing the whitespace weights with the appropriate characters (space for word spacing, space for line spacing, linebreak for paragraph spacing).
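Put together, the whole idea might look something like the sketch below. It is not the finished script, just an outline under assumptions: tabs weigh 4, taking the mode three times is done in one go by asking for the three most common weights, and any weight that is none of the three is simply written out as a space (the description above does not say what to do with those).

```python
import re
from collections import Counter


def weight(run):
    """1 per space, 4 per tab (assumed), 1000 per linebreak (CR+LF counts once)."""
    breaks = len(re.findall(r"\r\n|\r|\n", run))
    return run.count(" ") + 4 * run.count("\t") + 1000 * breaks


def reflow(text):
    """Rejoin wrapped lines using the three most frequent whitespace weights."""
    pieces = re.findall(r"[ \t\r\n]+|[^ \t\r\n]+", text.strip())
    tokens = [weight(p) if p[0] in " \t\r\n" else p for p in pieces]
    counts = Counter(t for t in tokens if isinstance(t, int))
    # Most common weight = word gap, next = line gap, next = paragraph gap.
    ranked = [w for w, _count in counts.most_common(3)]
    if len(ranked) < 3:
        raise ValueError("fewer than three distinct whitespace weights found")
    _word_gap, _line_gap, para_gap = ranked
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.append(token)
        elif token == para_gap:
            out.append("\n")   # paragraph spacing becomes a linebreak
        else:
            out.append(" ")    # word gaps, line gaps and anything else
    return "".join(out)


text = ("one two three\nfour five six\nseven eight\n\n"
        "nine ten eleven\ntwelve thirteen\nfourteen\n\nfifteen sixteen")
print(reflow(text))
```

If two weights occur equally often the ranking between them is arbitrary, so on a short or unusual file the guess can easily be wrong.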

Of course, such an approach won't work if the text file is as my fellow forum members suggest it to be. But if they are guessing or mistaken, I might throw this script together this evening and post it...

- Ahi
Old 06-03-2009, 02:31 PM   #25
DaleDe
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
Quote:
Originally Posted by ahi View Post
Speaking in general terms, you are right. This ought very rarely be the case with plaintext though.

Are you guessing, by the way, or have you seen the file? I myself would not claim to be able to tell where the paragraph breaks are if my only indicators were lines starting with capitals... an indicator that is, by the way, trivial to identify and process via a script.

Same question to DaleDe: are you guessing, or is what you are saying specifically the issue with Gideon's file?

- Ahi
My comment was an add-on to Harry's to provide a little more information. He has the file. In general this is an issue that I have also faced from time to time.

Dale
Old 06-03-2009, 03:01 PM   #26
Sabardeyn
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
Argel,

In the vast majority of cases files of any kind will conform to their type. Text, python, HTML, C++, whatever, will use certain patterns. Once you start to understand the pattern, you can make use of it in any regex process. You know this - it is exactly what you suggested in your search & replace example.

However, there will always be cases where a particular user, file or piece of software does not follow the standard pattern for some reason. In those cases you can still run regex, but on a more limited basis and with greater oversight. So, in my example, I would run the "humongous regex" on it first to see what happened. If the result was a really garbled mess, I would revert to the original file, apply each individual expression in turn, and look the file over after each one.
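One way to get that oversight when applying the expressions one at a time is to have each step report how many changes it made. A small Python sketch, with invented rule names and an invented function name apply_with_report:

```python
import re


def apply_with_report(text, rules):
    """Apply (name, pattern, replacement) rules in order.

    Returns the new text plus a list of (name, number_of_substitutions),
    so a rule that fires far more or less often than expected stands out.
    """
    report = []
    for name, pattern, replacement in rules:
        text, count = re.subn(pattern, replacement, text)
        report.append((name, count))
    return text, report


# Hypothetical rules, for illustration only.
rules = [
    ("trailing spaces", r"[ \t]+\n", "\n"),
    ("double spaces", r" {2,}", " "),
]
fixed, report = apply_with_report("a  b \nc\n", rules)
for name, count in report:
    print(name, count)
```

A count of zero, or a count in the thousands where you expected a handful, is usually the sign that the file does not follow the pattern the rule was written for.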

Chances are it will become more uniform in general although several errors will creep in because of the regex. (No automated process is infallible.) Since this is something Gideon would read anyway, he can note any remaining errors as he goes. Either he can correct these one at a time or create a new regex to handle it (correcting similar errors that exist in later portions of the text).

This is what Ahi, Dale and Harry are talking about right now. The file in question sounds as though it's one that does not conform to a known and recognized pattern. So you either have to customize any regex or perform a substantial amount of the corrections manually.

Without seeing the file none of us are capable of providing assistance on this matter. Guessing only gets us so far.
Old 06-03-2009, 10:20 PM   #27
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
If the OP is still following this, the app I linked to in my previous post is basically a tool to apply various preconfigured regex in series. It then displays the results. If you like it, continue to apply various regex, etc, until satisfied, then export the result to any file you like, including the original.

It's got some patterns that search for the things being discussed in the thread.

One of the things that you can search for, and may not be obvious to someone just looking at the text, is a space preceding a hard-return as a marker for end-of-paragraph. I'm surprised how frequently that has turned up.
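To show what that particular pattern looks like in practice, here is a small Python sketch. It is not taken from the app; the function name and the sample text are made up, and it assumes a trailing space really does mean end-of-paragraph in the file at hand.

```python
def unwrap_on_trailing_space(text):
    """Treat 'space before a hard return' as the end-of-paragraph marker."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    out = []
    for line in text.split("\n"):
        if line.endswith(" "):
            out.append(line.rstrip() + "\n")  # marker found: keep the break
        else:
            out.append(line + " ")            # ordinary wrap: join with a space
    return "".join(out).strip()


sample = "It was late\nand raining. \nThe next day\nwas fine. \n"
print(unwrap_on_trailing_space(sample))
```

Before trusting it, it is worth counting how many lines end in a space; if almost all or almost none do, the file is probably not using this convention.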

Luck,

m a r
Old 06-04-2009, 12:17 AM   #28
Gideon
Wearer of Pants
Posts: 1,050
Karma: 7634
Join Date: Jan 2008
Location: Norman, OK
Device: Amazon Kindle DX / iPhone
I actually use a mac, but can dip into windows as needed (reluctantly.)

Me and regular expressions have had some run-ins before, and I really don't care enough to spend the time to learn how to do them. And this file, as Harry said.. it's a mess.

Outside of working out a system where a shorter line precedes a new paragraph, you don't have a lot to work with.

In cases of double carriage returns it is certainly a doable thing to fix the file (I usually use TextMate) but this one is not so fortunate.
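For reference, the double-carriage-return case is simple enough to script as well. A minimal Python sketch, assuming a blank line (possibly containing stray spaces) separates paragraphs; the function name is illustrative:

```python
import re


def unwrap_blank_line_paragraphs(text):
    """Join the lines inside each paragraph; keep one blank line between them."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Any whitespace run containing at least two newlines is a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # split()/join collapses the single newlines and any repeated spaces.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


print(unwrap_blank_line_paragraphs("First line\r\nof one.\r\n\r\nSecond\r\npara."))
```

It does nothing useful for a file with no blank lines at all, which is exactly the problem with this one.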
Old 06-04-2009, 02:38 AM   #29
HarryT
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
Quote:
Originally Posted by Sabardeyn View Post
Without seeing the file none of us are capable of providing assistance on this matter. Guessing only gets us so far.
I have seen the file. As Gideon says, it's a mess, and I don't think that any of the "usual tools" would be able to do anything with it.
Old 06-04-2009, 10:45 AM   #30
rogue_ronin
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
Well, you can always try to find another source... Mr. Google and the Keyword "torrent" are a good match.

m a r