MobileRead Forums > E-Book Formats > Workshop

04-24-2012, 08:01 AM   #1
SBT
Fanatic
Posts: 506
Karma: 570148
Join Date: Sep 2010
Location: Norway
Device: prs-t1, phone/Cool Reader, tablet/BlueFire, Nook Simple
diff & automerge multiple text files

To speed up proofreading texts from the Internet Archive, I often diff two different scans/uploads of the same book; vimdiff is my tool of choice. However, some books have more than two uploads, and then it should be possible to auto-merge them, picking for each line the version that most of the files agree on.
Does anybody have an idea how to do this?
04-24-2012, 05:12 PM   #2
DaleDe
Grand Sorcerer
Posts: 9,653
Karma: 5072002
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2
You might try diff3. It has some options that will help in the task.
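A minimal sketch of one way diff3 can help here, assuming exactly three versions. In its default output, diff3 heads each differing hunk with a marker: "====1", "====2" or "====3" means that file is the odd one out, and a bare "====" means all three differ. The function name odd_one_out is made up for the example.

```shell
#!/bin/bash
# Sketch only: odd_one_out is a hypothetical helper, not part of diff3.
# Prints one hunk header per differing region of the three files:
#   ====1 / ====2 / ====3  -> that file disagrees with the other two
#   ====                   -> all three files differ
odd_one_out() {
    diff3 "$1" "$2" "$3" | grep '^===='
}

# Usage (file names are placeholders):
#   odd_one_out a.txt b.txt c.txt | sort | uniq -c
```

Counting the markers gives a quick idea of which upload is the least reliable one, but it does not merge anything by itself.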

Dale
04-25-2012, 04:06 AM   #3
SBT
Fanatic
I had a look at diff3, but it doesn't seem to be able to select 'best out of three' automagically. I ended up writing the following bash script. Given versions a.txt, b.txt, c.txt, ..., it basically does this:
  1. find lines that differ between a and b
  2. count how often each of the two line versions occurs across all file versions
  3. select the one with the most hits.
This can be iterated: repeat the process with c and d, and so on, then diff the refined versions against each other, and so on.

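Code:
#!/bin/bash
# Majority vote on lines that differ between a.txt and b.txt, polling every
# single-letter version (a.txt, b.txt, c.txt, ...). The result goes to ab.txt.
cp a.txt ab.txt
# diff -y marks changed line pairs with "|<TAB>" in the gutter. -W 400 keeps
# diff from truncating long lines (assumes no line is wider than ~190 columns).
gutter=$'^(.*[^[:space:]])[[:space:]]+[|]\t(.*)$'
diff -y -W 400 a.txt b.txt |
while IFS= read -r l
do
    [[ $l =~ $gutter ]] || continue
    a=${BASH_REMATCH[1]}
    b=${BASH_REMATCH[2]}
    # -F: literal text, -x: whole line, so regex characters in the text are harmless
    na=$(cat ?.txt | grep -cFx -- "$a")
    nb=$(cat ?.txt | grep -cFx -- "$b")
    sd=$(( ${#b} - ${#a} ))
    sd=${sd#-}   # absolute length difference
    # assume lines are not similar if the lengths differ by 3 or more
    if (( na < nb && sd < 3 )); then
        # replace the minority line with the majority one, flagged with "# " for review
        a="$a" b="$b" awk '$0 == ENVIRON["a"] { $0 = "# " ENVIRON["b"] } 1' ab.txt > ab.tmp &&
            mv ab.tmp ab.txt
    fi
done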
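The iteration over pairs described above could be wrapped in a function, roughly as sketched below. This is not the script from the post: the name vote_merge and its argument order are made up, the "# " review marker is left out so that the output can be fed straight into the next round, and lines are compared as literal text. It assumes GNU diff and bash.

```shell
#!/bin/bash
# vote_merge FIRST SECOND OUT POLL...   (hypothetical helper, see assumptions above)
# Copies FIRST to OUT. For each line pair that diff reports as changed between
# FIRST and SECOND, counts exact occurrences of both variants in the POLL files
# and switches OUT to SECOND's variant if that one gets more votes.
vote_merge() {
    local first=$1 second=$2 out=$3
    shift 3
    # diff -y puts "|<TAB>" between the two halves of a changed line pair
    local gutter=$'^(.*[^[:space:]])[[:space:]]+[|]\t(.*)$'
    local l a b na nb sd
    cp "$first" "$out" || return 1
    while IFS= read -r l; do
        [[ $l =~ $gutter ]] || continue
        a=${BASH_REMATCH[1]}
        b=${BASH_REMATCH[2]}
        na=$(cat "$@" | grep -cFx -- "$a")
        nb=$(cat "$@" | grep -cFx -- "$b")
        sd=$(( ${#b} - ${#a} ))
        sd=${sd#-}   # absolute length difference
        if (( na < nb && sd < 3 )); then
            a="$a" b="$b" awk '$0 == ENVIRON["a"] { $0 = ENVIRON["b"] } 1' "$out" > "$out.tmp" &&
                mv "$out.tmp" "$out"
        fi
    done < <(diff -y -W 400 "$first" "$second")
}

# Usage sketch for four scans a.txt .. d.txt, polling all of them each time:
#   vote_merge a.txt b.txt ab.txt ?.txt
#   vote_merge c.txt d.txt cd.txt ?.txt
#   vote_merge ab.txt cd.txt abcd.txt ?.txt
```

Since ?.txt only matches the single-letter names, the intermediate files ab.txt and cd.txt never take part in the vote themselves.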
04-25-2012, 10:47 AM   #4
DaleDe
Grand Sorcerer
Interesting. Diff3 does try to pick the best, but only based on the order in which the files are specified. It tries to prefer the newest change rather than voting. I guess it depends on what you think is most important.
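A small sketch of that order dependence, assuming GNU diff3. With -m, diff3 takes its arguments as MINE OLDER YOURS and applies the changes between OLDER and YOURS to MINE. The function name and the three scan files are made up for the demo; two scans agree and one has a simulated OCR error.

```shell
#!/bin/bash
# Sketch only: diff3_order_demo and the scan file names are hypothetical.
# diff3 -m MINE OLDER YOURS applies the changes between OLDER and YOURS to MINE.
# No votes are counted, so a variant found in only one file can still win.
diff3_order_demo() {
    local d
    d=$(mktemp -d) || return 1
    printf 'the quick brown fox\n'  > "$d/scan1.txt"
    printf 'the quick brown fox\n'  > "$d/scan2.txt"
    printf 'the quick brovvn fox\n' > "$d/scan3.txt"   # simulated OCR error
    # Odd file last (YOURS): its difference from OLDER is the change that gets merged in.
    diff3 -m "$d/scan1.txt" "$d/scan2.txt" "$d/scan3.txt"
    # Odd file first (MINE): OLDER and YOURS agree, so MINE is left as it is.
    diff3 -m "$d/scan3.txt" "$d/scan1.txt" "$d/scan2.txt"
    rm -rf "$d"
}
```

In both calls the merged text contains the line that two of the three scans disagree with, which is why a separate vote is needed for this kind of proofreading.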

Glad you got something you like and that works for you. Thanks for posting. Others may find it useful.

Dale
All times are GMT -4.


MobileRead.com is a privately owned, operated and funded community.