#16 |
Opinionated [but right]
Posts: 281
Karma: 1412
Join Date: Apr 2008
Location: UK
Device: Cybook Gen3, PRS 505, Kindle Int, Oasis, Paperwhite, Scribe
|
I've had to do this many hundreds of times over the years and never found it necessary to use specialist tools. The sequence with any editor is basically as ahi describes above.
First identify whatever character(s) mark a paragraph end and globally replace them with a unique marker - '|' will do the job, but so will "&&&&" or any character combination that doesn't otherwise occur in the text.
You can now remove all the remaining return/newline characters, BUT FIRST globally replace them with the same character plus a space. This is because some lines may end [space][newline] and others may not have the space. If you simply strip out the return/newline, the last/first words will run together, and if you don't notice immediately you're in a mess. Now repeatedly global search for [return/newline][two spaces] and replace with [return/newline][one space]. When you no longer find the target, you know that every line now terminates with [return/newline][space], so just replace that character combination with [space]. Of course, as Jellby points out, if you know how to search for a return/newline with an arbitrary number of whitespace characters, you can do this in one operation.
You now have one impenetrable text block, but all you have to do is globally replace your original marker, e.g. &&&&, with a paragraph return and hey presto. |
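For anyone who would rather script it, the same marker trick can be sketched in Python. This is only a rough sketch and it assumes things the post does not say: that paragraph ends show up as one or more blank lines, and that "&&&&" is safe as a marker. The function name `unwrap` is made up for the example.

```python
import re

def unwrap(text, marker="&&&&"):
    """Join hard-wrapped lines, keeping paragraph breaks (assumed: blank lines)."""
    # Treat CR LF and bare CR as a plain newline.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if marker in text:
        raise ValueError("marker already occurs in the text; pick another one")
    # 1. Protect every paragraph end with the unique marker.
    text = re.sub(r"[ \t]*\n(?:[ \t]*\n)+[ \t]*", marker, text)
    # 2. Turn each remaining line break, plus any spaces around it, into one space.
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)
    # 3. Put the paragraph breaks back.
    return text.replace(marker, "\n")

sample = "The quick \nbrown fox\njumps.\n\nSecond para\n  continues."
print(unwrap(sample))
```

Step 2 is the "one operation" variant: the pattern swallows whatever spaces sit on either side of the line break, so words never run together and no doubled spaces are left behind.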
#17 |
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
|
Wouldn't it have been substantially easier if you just wrote it once and ran it as needed?
Gideon, most Search & Replace functions are a limited form of regular expressions (regex). What both Jellby and Argel have just posted are forms of regex. Granted, Jellby went hardcore and Argel gave more generalized info, but they're still regex.
I mentioned RegexBuddy because it's a decent way to learn. It allows you to create a formula and save it in a library for future use. It also explains what the formula is doing in English, a feature whose value cannot be stressed enough - try re-reading Jellby's formulas again. Complex regex is not easy to understand. (Try understanding the expression halfway down this post if you think I'm kidding...)
You could create a Search & Replace (that is, regex) expression for every change needed in this file, and then save them all individually. You could then start a new regex formula, load every one of the expressions you just made into one humongous formula, find the correct "stacking order" so they're all processed correctly, and issue one command to fix the whole file. This humongous formula could be saved as well. And, if you ever need it again, you could load, run and be done almost instantly (at least for the "hands on" portion of the work; conversion would take a bit of time, of course).
You do not have to use any of the software that has been mentioned. You can, as others have mentioned, very easily use almost any existing software that you are comfortable with - providing it can perform the necessary tasks. |
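To illustrate the "stacking order" idea without any particular tool, here is a minimal Python sketch. The rule list is hypothetical and chosen only for the example (it assumes blank lines separate paragraphs); it is not taken from RegexBuddy or from anyone's posted expressions.

```python
import re

# Hypothetical saved rules, applied strictly in this order.
RULES = [
    (re.compile(r"\r\n?"), "\n"),             # normalize line endings first
    (re.compile(r"[ \t]+\n"), "\n"),          # then drop trailing spaces
    (re.compile(r"\n{3,}"), "\n\n"),          # then collapse runs of blank lines
    (re.compile(r"(?<!\n)\n(?!\n)"), " "),    # finally join single line breaks
]

def apply_rules(text, rules=RULES):
    """Run every saved expression over the text, one after another."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

print(apply_rules("one \r\ntwo\r\n\r\n\r\nthree"))
```

The order matters: if the last rule ran before the line endings were normalized, stray carriage returns would survive inside the joined lines.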
#18 | |
Opinionated [but right]
Posts: 281
Karma: 1412
Join Date: Apr 2008
Location: UK
Device: Cybook Gen3, PRS 505, Kindle Int, Oasis, Paperwhite, Scribe
|
Quote:
I have some useful Word macros but it's often quicker to do this kind of stuff manually. There are two kinds of people in life - those who spend hours fiddling around with quicker ways to do stuff and those who just do stuff. |
|
#19 | |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Quote:
Textify works beautifully on the overwhelming majority of files. I must have used it on literally hundreds of files, with great success. It also has the useful feature of being able to create HTML output, with text wrapped in _underscores_ converted to italics. |
|
#20 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
If he knows where they are, then a computer program ought to be able to ascertain it the same way he can... I think? - Ahi |
|
#21 | |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
Quote:
You could say "If a line starts in a capital letter then it's probably a new paragraph", I suppose. It wouldn't be 100% reliable, but it would be a good start. |
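As a rough illustration of that rule of thumb, a short Python sketch could look like the one below. The function name `join_by_capitals` is made up, and the rule is exactly as unreliable as described: any wrapped line that happens to begin with "I" or a proper noun will be split off as a new paragraph.

```python
def join_by_capitals(text):
    """Start a new paragraph at every line that begins with a capital letter."""
    paragraphs = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no information for this heuristic
        if paragraphs and not stripped[0].isupper():
            # Continuation line: glue it onto the current paragraph.
            paragraphs[-1] += " " + stripped
        else:
            paragraphs.append(stripped)
    return "\n".join(paragraphs)

sample = "It was a dark\nand stormy night.\nThe rain fell\nin torrents."
print(join_by_capitals(sample))
```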
|
#22 | |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
Quote:
Dale |
|
#23 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Are you guessing, by the way, or have you seen the file? I myself would not claim to be able to tell where the paragraphs break if my only indicators were lines starting with capitals... an indicator that is, by the way, trivial to identify and process via a script. Same question to DaleDe: are you guessing, or is what you are saying specifically the issue with Gideon's file? - Ahi |
|
#24 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Assuming that the paragraph break information in Gideon's file is not impossible to accurately retrieve, throwing together a Python script that works as follows might help:
In sequentially parsing the text file, build a list that consists of (1) non-whitespace character sequence strings, and (2) numbers indicating consecutive whitespace weights. As explained in my "text processing ideas" post, the weighting ought to assign a value of 1 for each space character, and 1000 for each linebreak (which can be chr(10), chr(13), or the two together [which should still be counted as a single linebreak]). (Tabs, I suppose, could be counted as having a weight of 4 or 8 or even larger.)
1 linebreak + 5 spaces = 1005
2 linebreaks + 0 spaces = 2000
1 space + 2 linebreaks + 5 spaces = 2006
Once this is done, you basically have a list that you could process sequentially to recreate the input file (save for the whitespaces).
At that time, take all the whitespace weights and put them in a separate list, and get the mode of that list. It will be the weight of whitespace used to separate words.
Remove from the list of whitespace weights all instances of the mode weight, and take the mode of what remains. It will be the weight of whitespace used to separate lines.
Then remove from the list of whitespace weights all instances of that mode weight, and take the mode of what again remains. It will be the weight of whitespace used to separate paragraphs.
Output the list, replacing the whitespace weights with the appropriate characters (space for word spacing, space for line spacing, linebreak for paragraph spacing).
Of course, such an approach won't work if the text file is as my fellow forum members suggest it to be. But if they are guessing or mistaken, I might throw this script together this evening and post it... - Ahi |
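Here is a minimal Python sketch of the weighting idea, not the script promised above. A few things in it are assumptions the post leaves open: tabs count as 4, any whitespace weight other than the paragraph weight is written out as a single space, and the names `weight` and `reflow` are invented for the example.

```python
import re
from collections import Counter

def weight(ws):
    """Weight of a whitespace run: 1000 per linebreak, 1 per space, 4 per tab."""
    breaks = len(re.findall(r"\r\n|\r|\n", ws))  # CR LF counts as one break
    return 1000 * breaks + ws.count(" ") + 4 * ws.count("\t")

def reflow(text):
    # Split into alternating words and whitespace runs: word, ws, word, ...
    parts = re.split(r"([ \t\r\n]+)", text.strip())
    words = parts[0::2]
    gaps = [weight(ws) for ws in parts[1::2]]
    # The three most common weights are, in order, the word, line and
    # paragraph separators (same result as taking the mode three times).
    ranked = [w for w, _ in Counter(gaps).most_common(3)]
    para_weight = ranked[2] if len(ranked) == 3 else None
    out = [words[0]]
    for gap, word in zip(gaps, words[1:]):
        # Assumption: every non-paragraph weight becomes a single space.
        out.append("\n" if gap == para_weight else " ")
        out.append(word)
    return "".join(out)

sample = ("aa bb cc dd\nee ff gg hh\nii jj kk ll\n"
          "    mm nn oo pp\nqq rr ss tt")
print(reflow(sample))
```

In the sample, paragraphs are marked by a linebreak plus a four-space indent (weight 1004), which is rarer than a bare linebreak (weight 1000), so it comes out third in the ranking. In a file where paragraph breaks are more common than line breaks, the ranking would be wrong, so the result needs checking by eye.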
#25 | |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
Quote:
Dale |
|
#26 |
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
|
Argel,
In the vast majority of cases, files of any kind will conform to their type. Text, Python, HTML, C++, whatever, will use certain patterns. Once you start to understand the pattern, you can make use of it in any regex process. You know this - it is exactly what you suggested in your search & replace example.
However, there will always be cases where a particular user, file or piece of software does not follow the standard pattern for some reason. In those cases you can still run regex, but on a more limited basis and with greater oversight.
So, in my example, I would run the "humongous regex" on it first to see what happened. If the result was a really garbled mess, I would revert to the original file, apply each individual expression, and look the file over after each one. Chances are it will become more uniform in general, although several errors will creep in because of the regex. (No automated process is infallible.) Since this is something Gideon would read anyway, he can note any remaining errors as he goes. Either he can correct these one at a time or create a new regex to handle them (correcting similar errors that exist in later portions of the text).
This is what Ahi, Dale and Harry are talking about right now. The file in question sounds as though it's one that does not conform to a known and recognized pattern. So you either have to customize any regex or perform a substantial amount of the corrections manually. Without seeing the file, none of us are capable of providing assistance on this matter. Guessing only gets us so far. |
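The "apply each expression and look the file over" step can be made a little less painful by counting what each expression changed. A minimal Python sketch follows; the rule names and patterns are hypothetical examples, not expressions from this thread.

```python
import re

def apply_with_report(text, rules):
    """Apply each (name, pattern, replacement) in turn and count its hits."""
    report = []
    for name, pattern, replacement in rules:
        # re.subn returns the new text plus the number of substitutions made.
        text, count = re.subn(pattern, replacement, text)
        report.append((name, count))
    return text, report

rules = [
    ("trailing spaces", r"[ \t]+\n", "\n"),
    ("single line breaks", r"(?<!\n)\n(?!\n)", " "),
]
fixed, report = apply_with_report("a \nb\n\nc", rules)
for name, count in report:
    print(f"{name}: {count} change(s)")
```

An expression that reports zero changes, or far more than expected, is the one to inspect first.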
#27 |
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
|
If the OP is still following this, the app I linked to in my previous post is basically a tool to apply various preconfigured regex in series. It then displays the results. If you like it, continue to apply various regex, etc, until satisfied, then export the result to any file you like, including the original.
It's got some patterns that search for the things being discussed in the thread. One of the things that you can search for, and may not be obvious to someone just looking at the text, is a space preceding a hard-return as a marker for end-of-paragraph. I'm surprised how frequently that has turned up. Luck, m a r |
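For files where that hidden marker is present, a short Python sketch of using it might look like this. It is not how the linked app does it; the function name and the NUL placeholder are choices made for the example, and it assumes every paragraph ends with at least one space before the hard return while wrapped lines do not.

```python
import re

def unwrap_on_trailing_space(text):
    """Treat 'space(s) + newline' as a paragraph end, a bare newline as a wrap."""
    text = text.replace("\r\n", "\n")
    placeholder = "\x00"  # assumed never to occur in a plain text book
    # Paragraph ends: one or more spaces right before the line break.
    text = re.sub(r" +\n", placeholder, text)
    # Everything still ending in a bare newline is a wrapped line.
    text = text.replace("\n", " ")
    return text.replace(placeholder, "\n")

sample = "First line\nends here. \nSecond para\ncontinues."
print(unwrap_on_trailing_space(sample))
```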
#28 |
Wearer of Pants
Posts: 1,050
Karma: 7634
Join Date: Jan 2008
Location: Norman, OK
Device: Amazon Kindle DX / iPhone
|
I actually use a mac, but can dip into windows as needed (reluctantly.)
Me and regular expressions have had some run-ins before, and I really don't care enough to spend the time to learn how to do them. And this file, as Harry said... it's a mess. Outside of working out a system where a shorter line precedes a new paragraph, you don't have a lot to work with. In cases of double carriage returns it is certainly a doable thing to fix the file (I usually use TextMate), but this one is not so fortunate. |
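The "shorter line precedes a new paragraph" system could be sketched in Python roughly as below. Everything specific here is an assumption: the function name, and the cutoff that a line under 75% of the longest line counts as "short". It will misfire whenever a paragraph happens to end on a full-width line.

```python
def unwrap_by_line_length(text, ratio=0.75):
    """End a paragraph after any line noticeably shorter than the widest one."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return ""
    width = max(len(line) for line in lines)
    paragraphs, current = [], []
    for line in lines:
        current.append(line)
        if len(line) < ratio * width:
            # Short line: assume the paragraph stops here.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)

sample = ("It was the best of times, it was\n"
          "the worst of times, it was the\n"
          "age of wisdom.\n"
          "It was the epoch of belief, it\n"
          "was the epoch of incredulity.")
print(unwrap_by_line_length(sample))
```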
#29 |
eBook Enthusiast
Posts: 85,544
Karma: 93383043
Join Date: Nov 2006
Location: UK
Device: Kindle Oasis 2, iPad Pro 10.5", iPhone 6
|
|
#30 |
Banned
Posts: 475
Karma: 796
Join Date: Sep 2008
Location: Honolulu
Device: Nokia 770 (fbreader)
|
Well, you can always try to find another source... Mr. Google and the Keyword "torrent" are a good match.
m a r |