Old 11-20-2022, 09:09 AM   #1
feuille
Enthusiast
Posts: 34
Karma: 666
Join Date: May 2020
Device: android smartphone
[GUI Plugin] TextDiff

[GUI Plugin] TextDiff - Version 1.1.0 - 11-26-2022

A Calibre GUI plugin for finding text differences in two book formats.

Main features:
--------------
This plugin shows the differences between two selected book formats.
The formats are first converted to text format (even if the source format is already text) with Calibre's convert utility (https://manual.calibre-ebook.com/gen...k-convert.html).
The resulting text files are then read into memory and possibly edited (removing blank lines and other changes as described under "Planned Features").
The comparison is then done with Python's difflib module (https://docs.python.org/3/library/difflib.html).
The ratio gives a measure of the similarity of the two texts. 1.0 means the texts are identical; a value near 0.0 means that the texts are completely different.
The latter may also occur when the source format has no text content (such as scanned PDF files). In that case, create a new text-based book format with an OCR process.
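
As a rough illustration of the ratio described above, here is a minimal sketch using difflib.SequenceMatcher. Whether the plugin compares line by line or character by character is not stated in this post, so the line-based comparison, the helper name similarity() and the autojunk=False setting are assumptions made for the example.

```python
import difflib


def similarity(text_a: str, text_b: str) -> float:
    """Return difflib's similarity ratio for two texts, compared line by line.

    Hypothetical helper, not taken from the plugin source.
    """
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    # autojunk=False switches off difflib's heuristic that treats very
    # frequent elements as junk in long sequences.
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    return matcher.ratio()


identical = similarity("one\ntwo\nthree", "one\ntwo\nthree")  # 1.0
partial = similarity("one\ntwo\nthree", "one\n2\nthree")      # 2 of 3 lines match
no_text = similarity("", "one\ntwo\nthree")                   # 0.0, e.g. a scanned PDF
```

The empty-text case shows why a format without extractable text ends up with a ratio of 0.0 even if the book itself is the same.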

The detailed workflow is as follows:
1. Select a book with at least two formats or two books with at least one format each to compare.
2. Choose two formats.
3. Choose the output format and other comparison options.
4. Hit "Compare".
5. The formats are converted and compared, and the output is displayed in the output window. A ratio is also computed and displayed.
6. If desired, copy the comparison output to the clipboard, save it to a file, and/or save it as a book in a suitable format (HTML or text).

If you want to compare other formats, repeat step 1 and hit the "Refresh formats" button. Then repeat steps 2 - 6.
The "Compare" dialog is modeless, so you can move it around and still interact with the main Calibre window.
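
For readers curious what text and HTML comparison output can look like, the sketch below uses two standard difflib facilities: unified_diff() for plain text and HtmlDiff.make_file() for a side-by-side HTML page. This only illustrates what difflib offers; which difflib functions the plugin actually calls is an assumption, and the helper names and the "EPUB"/"TXT" labels are made up for the example.

```python
import difflib


def diff_as_text(text_a: str, text_b: str) -> str:
    """Plain-text output in unified diff format (hypothetical helper)."""
    lines = difflib.unified_diff(
        text_a.splitlines(),
        text_b.splitlines(),
        fromfile="EPUB",
        tofile="TXT",
        lineterm="",  # input lines carry no newline, so add none to headers
    )
    return "\n".join(lines)


def diff_as_html(text_a: str, text_b: str) -> str:
    """Complete HTML page with a side-by-side table (hypothetical helper)."""
    differ = difflib.HtmlDiff(wrapcolumn=80)
    return differ.make_file(
        text_a.splitlines(),
        text_b.splitlines(),
        fromdesc="EPUB",
        todesc="TXT",
    )


text_out = diff_as_text("one\ntwo\nthree", "one\n2\nthree")
html_out = diff_as_html("one\ntwo\nthree", "one\n2\nthree")
```

The HTML string can be written to a file and opened in a browser; the text output is suitable for the clipboard.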


Planned Features:
-----------------
- Remove soft hyphens before conversion.
- Custom characters to ignore ("char junk", e.g. "" vs. »«).
- Optimize how the text browser widget is filled (HTML rendering accounts for 2/3 of the runtime!).
- Progress indicator.
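
The first two planned features could look roughly like the sketch below. It is only an illustration: it strips the characters from the already converted text rather than before conversion, and the function name normalize() and its ignore_chars parameter are hypothetical. The empty default reflects that ignoring characters is meant to be opt-in.

```python
SOFT_HYPHEN = "\u00ad"


def normalize(text: str, ignore_chars: str = "") -> str:
    """Drop soft hyphens and any user-chosen characters before comparing.

    Hypothetical helper; ignore_chars is empty by default, so quote marks
    and similar characters are still compared unless the user opts in.
    """
    table = str.maketrans("", "", SOFT_HYPHEN + ignore_chars)
    return text.translate(table)


# With the quote characters ignored, both variants compare as equal.
a = normalize('He said "co\u00adoperate".', ignore_chars='"»«')
b = normalize("He said »cooperate«.", ignore_chars='"»«')

# Without ignore_chars, only the soft hyphen is removed.
c = normalize('He said "co\u00adoperate".')
```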

Limitations:
------------
- The converted formats are stored as strings in memory, so very large books may exhaust the available memory.

Version History:
----------------
Spoiler:
Version 1.1.0 - 11-26-2022
- Changed tool button behavior: show the compare dialog when the icon is clicked, show the menu when the arrow is clicked (thanks to Comfy.n)
- Inverted the HTML/CSS background colors (used to highlight diffs) in dark mode (thanks to Comfy.n and Kovidgoyal)
Version 1.0.0 - 11-20-2022
- Initial release.


Installation:
-------------
Download the attached zip file and install the plugin as described in the plugins thread on MobileRead.

Reporting bugs and suggestions:
-------------------------------
If you find any issues or have suggestions, please report them in this thread.

Attached Files
File Type: zip TextDiff.zip (1.14 MB, 201 views)

Last edited by feuille; 11-26-2022 at 06:22 AM. Reason: Version 1.1.0
Old 11-20-2022, 09:48 AM   #2
JSWolf
Resident Curmudgeon
Posts: 68,648
Karma: 113245921
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by feuille View Post
Planned Features:
-----------------
- Custom characters to ignore ("char junk", e. g. "" vs. »«)
I hope when this is implemented that there will be an option to turn it off. I do want to know if the quotes are different.

Quote:
Limitations:
------------
- The converted formats are stored as strings in memory, so large formats may run out of memory.
This does need to be addressed as some computers are only 32-bit or may not have enough memory. IMHO, store both text files on disk and compare that way instead of fully loading both text files into memory and then comparing.

Overall, I do like the idea of this plugin. Thanks.
Old 11-20-2022, 09:51 AM   #3
ownedbycats
Wizard
Posts: 4,999
Karma: 24003172
Join Date: Oct 2018
Location: Canada
Device: Kobo Aura HD (retired), Kobo Libra H2O
Quote:
Originally Posted by JSWolf View Post
I hope when this is implemented that there will be an option to turn it off. I do want to know if the quotes are different.
"Custom" seems to imply it'll be user-defined?

But yes, the memory limitation seems... not great. I'd use temp files.

This'll be a useful plugin though
Old 11-20-2022, 11:33 AM   #4
JSWolf
Resident Curmudgeon
Posts: 68,648
Karma: 113245921
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by ownedbycats View Post
"Custom" seems to imply it'll be user-defined?

But yes, the memory limitation seems... not great. I'd use temp files.

This'll be a useful plugin though
My laptop does have 16GB, but most don't. My Surface Pro 2 is only 8GB.

I do think the memory usage should be fixed right away.
Old 11-20-2022, 11:49 AM   #5
chaley
Grumpy old git
Posts: 10,807
Karma: 4599395
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by ownedbycats View Post
But yes, the memory limitation seems... not great. I'd use temp files.
Quote:
Originally Posted by JSWolf View Post
My laptop does have 16GB, but most don't. My Surface Pro 2 is only 8GB.

I do think the memory usage should be fixed right away.
The plugin is keeping TXT files in memory. According to this thread on Quora, 1 GB is approximately 900,000 pages, or 179 million "standard" words (5 single-byte characters plus a space). If the text is 100% UTF-8 extended characters then using the same assumptions as in the thread there are 11 bytes per word. A GB would be 98 million words or 488,000 pages.

I don't know how many people will be comparing 900,000 page books, or even 488,000 page books.
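
The figures above can be checked with a few lines of arithmetic. This is just a sketch of the calculation: the 200 words per page is not stated in the post but is what its numbers imply, and the 11 bytes per word presumably means five two-byte characters plus a one-byte space.

```python
GIB = 1024 ** 3        # bytes in one gigabyte (binary)
WORDS_PER_PAGE = 200   # assumption inferred from the figures in the post


def capacity(bytes_per_word: int) -> tuple[int, int]:
    """Return (words, pages) that fit into 1 GB at the given word size."""
    words = GIB // bytes_per_word
    pages = words // WORDS_PER_PAGE
    return words, pages


# 5 single-byte characters plus a space: about 179 million words
ascii_words, ascii_pages = capacity(6)

# 5 two-byte characters plus a space: about 98 million words
utf8_words, utf8_pages = capacity(11)
```

Under these assumptions the page counts come out at roughly 895,000 and 488,000, which matches the rounded numbers quoted from the Quora thread.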
Old 11-20-2022, 03:47 PM   #6
feuille
Enthusiast
Posts: 34
Karma: 666
Join Date: May 2020
Device: android smartphone
Quote:
Originally Posted by JSWolf View Post
I hope when this is implemented that there will be an option to turn it off. I do want to know if the quotes are different.
There will be an option to turn it on
Old 11-21-2022, 07:44 AM   #7
JSWolf
Resident Curmudgeon
Posts: 68,648
Karma: 113245921
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by chaley View Post
The plugin is keeping TXT files in memory. According to this thread on Quora, 1 GB is approximately 900,000 pages, or 179 million "standard" words (5 single-byte characters plus a space). If the text is 100% UTF-8 extended characters then using the same assumptions as in the thread there are 11 bytes per word. A GB would be 98 million words or 488,000 pages.

I don't know how many people will be comparing 900,000 page books, or even 488,000 page books.
Do you think this could be a problem for a computer running a 32-bit version of Calibre where you only get a 3 GB chunk to work with?
Old 11-21-2022, 07:55 AM   #8
ownedbycats
Wizard
Posts: 4,999
Karma: 24003172
Join Date: Oct 2018
Location: Canada
Device: Kobo Aura HD (retired), Kobo Libra H2O
Quote:
Originally Posted by chaley View Post
The plugin is keeping TXT files in memory. According to this thread on Quora, 1 GB is approximately 900,000 pages, or 179 million "standard" words (5 single-byte characters plus a space). If the text is 100% UTF-8 extended characters then using the same assumptions as in the thread there are 11 bytes per word. A GB would be 98 million words or 488,000 pages.

I don't know how many people will be comparing 900,000 page books, or even 488,000 page books.
Just curious: how would this work for PDFs?

While I probably wouldn't be running it specifically on those books, PDFs downloaded from the Internet Archive use some sort of layering compression that means that pdftotext can extract gigabytes of image layers into the temp folder alongside the text layer. (This happens when indexing for FTS or running the word count plugin, which should only require the text layer.) Would it try to keep all that in memory?

Last edited by ownedbycats; 11-21-2022 at 07:59 AM.
Old 11-21-2022, 08:12 AM   #9
chaley
Grumpy old git
Posts: 10,807
Karma: 4599395
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Originally Posted by JSWolf View Post
Do you think this could be a problem for a computer running a 32-bit version of Calibre where you only get A 3GB chunk to work with?
Possibly, but a vanishingly small one. A user would need to be running a 32-bit version (not possible in calibre 6) and be comparing two books in excess of a few hundred thousand pages.

If I were the developer then doing the diffs using files would be low priority, especially if the underlying tool didn't directly support it.

FWIW: calibre keeps the entire db in memory.

@ownedbycats: it converts to txt, which by definition contains only text. Where it gets the text from a layered pdf is another issue.
Old 11-21-2022, 08:15 AM   #10
ownedbycats
Wizard
Posts: 4,999
Karma: 24003172
Join Date: Oct 2018
Location: Canada
Device: Kobo Aura HD (retired), Kobo Libra H2O
Yeah, I just mentioned it because I've seen other things that I thought should be text-only (full-text indexing and the Count Pages plugin) extracting the images into temp.
Old 11-22-2022, 08:43 AM   #11
JSWolf
Resident Curmudgeon
Posts: 68,648
Karma: 113245921
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
I've given TextDiff a try and it's pretty good. Thanks for creating this plugin.
Old 11-24-2022, 03:30 AM   #12
feuille
Enthusiast
Posts: 34
Karma: 666
Join Date: May 2020
Device: android smartphone
Quote:
Originally Posted by JSWolf View Post
Thanks for creating this plugin.
You're welcome
Old 11-24-2022, 04:48 AM   #13
Comfy.n
Addict
Posts: 311
Karma: 2500000
Join Date: Sep 2020
Device: Calibre E-book viewer/ PW3
Thank you for this plugin! I plan to use it more, in the future, to compare Orwell's 1984 translations, but I just gave it a try for testing purposes and noticed the HTML output to file will be great for this use case, as it displays the text differences side-by-side in a very synchronized fashion.

Some quick notes:
  • The toolbar button click could default directly to the "Compare" action, instead of triggering the menu opening, IMO.
  • Calibre UI will hang for a while during TextDiff, so at first I wasn't sure that this was just a temporary halt.
  • The "Save output as ebook" action sometimes returns an error; I forgot to take note of what it was about.
  • Under Dark theme, it's best to use "Save output to file", as the on-screen output isn't tuned to that color palette.


Last edited by Comfy.n; 11-24-2022 at 08:09 AM.
Old 11-24-2022, 05:52 PM   #14
feuille
Enthusiast
Posts: 34
Karma: 666
Join Date: May 2020
Device: android smartphone
Thank you, Comfy.n, for the hints! I'll try to fix that. For better feedback on the program status, I had already considered a progress bar.
Old 11-26-2022, 06:29 AM   #15
feuille
Enthusiast
Posts: 34
Karma: 666
Join Date: May 2020
Device: android smartphone
Version 1.1.0 is out

Hello Comfy.n,

I implemented two of your hints (see history).

To estimate how long a progress bar would need to be shown, I measured intermediate times in the program flow. To my surprise, HTML rendering in the text browser widget consumes 2/3 of the total runtime! In a future release I may implement a workaround such as stepwise asynchronous loading with a timer, but I need to do some research on that first.
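
For anyone who wants to take similar measurements, here is a minimal sketch of timing individual phases with time.perf_counter(). It is not the plugin's code: the phase names are made up and time.sleep() merely stands in for the real conversion and rendering work.

```python
import time
from contextlib import contextmanager

timings = {}  # phase name -> accumulated seconds


@contextmanager
def timed(phase):
    """Accumulate the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[phase] = timings.get(phase, 0.0) + elapsed


# Placeholder workloads; a real run would call the actual steps here.
with timed("convert"):
    time.sleep(0.01)
with timed("render"):
    time.sleep(0.02)

total = sum(timings.values())
shares = {phase: seconds / total for phase, seconds in timings.items()}
```

The shares dictionary then holds each phase's fraction of the total runtime, which is the kind of figure quoted above for the HTML rendering.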