#1
Connoisseur
Posts: 61
Karma: 666
Join Date: May 2020
Location: Germany
Device: android smartphone + tablet
[GUI Plugin] TextDiff
[GUI Plugin] TextDiff - Version 1.2.4 - 01-07-2024
A Calibre GUI plugin for finding text differences between two book formats.

Main features:
--------------
This plugin shows the differences between two selected book formats. The formats are first converted to text format (even if the source format is already text) with Calibre's convert utility (https://manual.calibre-ebook.com/gen...k-convert.html). If the conversion fails, either the format has no text content (as with scanned PDF files) or Calibre cannot find an appropriate conversion tool (such as Microsoft wordconv).

The text files obtained this way are then read into memory and possibly edited (removing blank lines and other changes as described under "Planned Features"). The comparison is then done with Python's difflib (https://docs.python.org/3/library/difflib.html).

The ratio gives a measure of the similarity of the two texts: 1.0 means the texts are identical, while a value near 0.0 means that the texts are completely different. The latter may also occur when the source format has no text content (as with scanned PDF files). In that case, one should create a new book format (text) with an OCR process.

The detailed workflow is as follows:
1. Select a book with at least two formats, or two books with at least one format each, to compare.
2. Choose two formats.
3. Choose the output format and other comparison options.
4. Hit "Compare".
5. The formats are converted and compared, and the output is displayed in the output window. A ratio is also computed and displayed.
6. If desired, copy the comparison output to the clipboard and/or save it to a file and/or save it as a book in a suitable format (HTML or text).

If you want to compare other formats, repeat step 1 and hit the "Refresh formats" button. Then repeat steps 2 - 5.

The "Compare" dialog is modeless, which allows you to move it around and interact with the Calibre window behind it.

Planned Features:
-----------------
- Remove soft hyphens before conversion.
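For readers who want to see what the comparison step amounts to, here is a minimal sketch, not the plugin's actual code. It assumes the texts are compared line by line and that soft hyphens are stripped from the already converted text (the plan above is to remove them before conversion). The helper names `clean_text`, `similarity` and `unified` are made up for illustration.

```python
import difflib

SOFT_HYPHEN = "\u00ad"  # U+00AD, an invisible hyphenation hint


def clean_text(text: str) -> list[str]:
    """Strip soft hyphens and blank lines; return the remaining lines."""
    text = text.replace(SOFT_HYPHEN, "")
    return [line for line in text.splitlines() if line.strip()]


def similarity(text_a: str, text_b: str) -> float:
    """Return difflib's ratio for two texts (assumption: compared by line)."""
    matcher = difflib.SequenceMatcher(None, clean_text(text_a), clean_text(text_b))
    return matcher.ratio()


def unified(text_a: str, text_b: str) -> str:
    """Return a unified diff of the cleaned lines as a single string."""
    diff = difflib.unified_diff(
        clean_text(text_a),
        clean_text(text_b),
        fromfile="format_a",
        tofile="format_b",
        lineterm="",
    )
    return "\n".join(diff)


if __name__ == "__main__":
    a = "Chapter 1\n\nIt was a bright cold day.\n"
    b = "Chapter 1\nIt was a bright, cold day.\n"
    print(similarity(a, b))
    print(unified(a, b))
```

Because blank lines and soft hyphens are dropped first, two texts that differ only in those respects get a ratio of 1.0 in this sketch.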
Limitations:
------------
- The converted formats are stored as strings in memory, so large formats may run out of memory.

Version History:
----------------
Spoiler:
Installation:
-------------
Download the attached zip file and install the plugin as described in the "Introduction to plugins" thread on MobileRead. You need to add the Calibre path to your $PATH variable.

To report bugs and suggestions:
-------------------------------
If you find any issues or have suggestions, please report them on GitHub or in the MobileRead Forum.

Last edited by feuille; 01-08-2024 at 10:18 AM. Reason: Version 1.2.4
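The $PATH note in the installation section suggests that the conversion runs Calibre's `ebook-convert` command-line tool. How the plugin actually invokes it is not stated in the post, so the following is only a sketch under that assumption. `build_convert_command` and `convert_to_text` are hypothetical helper names.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source: Path, target: Path) -> list[str]:
    """Build the command line 'ebook-convert INPUT OUTPUT'.

    ebook-convert picks the output format from the target's extension,
    so a target ending in .txt requests plain text.
    """
    return ["ebook-convert", str(source), str(target)]


def convert_to_text(source: Path, target: Path) -> bool:
    """Convert a book to text; return False if the tool is missing or fails."""
    if shutil.which("ebook-convert") is None:
        # Calibre's command-line tools are not on PATH.
        return False
    result = subprocess.run(
        build_convert_command(source, target),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and target.exists()


if __name__ == "__main__":
    ok = convert_to_text(Path("book.epub"), Path("book.txt"))
    print("converted" if ok else "conversion failed or ebook-convert not found")
```

Checking `shutil.which` first separates "Calibre is not on PATH" from a real conversion failure such as a scanned PDF without a text layer.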
#2
Resident Curmudgeon
Posts: 78,392
Karma: 142887248
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Quote:
Overall, I do like the idea of this plugin. Thanks.
#3
Custom User Title
Posts: 10,288
Karma: 72663495
Join Date: Oct 2018
Location: Canada
Device: Kobo Libra H2O, formerly Aura HD
Quote:
But yes, the memory limitation seems... not great. I'd use temp files. This'll be a useful plugin though.
#4
Resident Curmudgeon
Posts: 78,392
Karma: 142887248
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
I do think the memory usage should be fixed right away.
#5
Grand Sorcerer
Posts: 12,268
Karma: 7955525
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
Quote:
I don't know how many people will be comparing 900,000-page books, or even 488,000-page books.
#6
Connoisseur
Posts: 61
Karma: 666
Join Date: May 2020
Location: Germany
Device: android smartphone + tablet
#7
Resident Curmudgeon
Posts: 78,392
Karma: 142887248
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
#8
Custom User Title
Posts: 10,288
Karma: 72663495
Join Date: Oct 2018
Location: Canada
Device: Kobo Libra H2O, formerly Aura HD
Quote:
While I probably wouldn't be running it specifically on those books, PDFs downloaded from the Internet Archive use some sort of layering compression that means that pdftotext can extract gigabytes of image layers into the temp folder alongside the text layer. (This happens when indexing for FTS or running the word count plugin, which should only require the text layer.) Would it try to keep all that in memory?
Last edited by ownedbycats; 11-21-2022 at 06:59 AM.
#9
Grand Sorcerer
Posts: 12,268
Karma: 7955525
Join Date: Jan 2010
Location: Notts, England
Device: Kobo Libra 2
Quote:
If I were the developer, then doing the diffs using files would be low priority, especially if the underlying tool didn't directly support it. FWIW: calibre keeps the entire db in memory.
@ownedbycats: it converts to txt, which by definition contains only text. Where it gets the text from a layered pdf is another issue.
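To illustrate the point about the underlying tool: as far as I know, difflib's functions take in-memory sequences (lists of lines), not file handles, so a version based on temp files would still read both files fully before comparing. A rough sketch, with `diff_files` as a made-up helper name:

```python
import difflib
import tempfile
from pathlib import Path


def diff_files(path_a: Path, path_b: Path) -> list[str]:
    """Diff two text files; both are loaded into memory as lists of lines."""
    # difflib.unified_diff expects sequences of strings, so the files
    # are read completely even though the text lives on disk.
    lines_a = path_a.read_text(encoding="utf-8").splitlines()
    lines_b = path_b.read_text(encoding="utf-8").splitlines()
    return list(
        difflib.unified_diff(
            lines_a,
            lines_b,
            fromfile=path_a.name,
            tofile=path_b.name,
            lineterm="",
        )
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        a = Path(tmp) / "a.txt"
        b = Path(tmp) / "b.txt"
        a.write_text("one\ntwo\n", encoding="utf-8")
        b.write_text("one\nthree\n", encoding="utf-8")
        print("\n".join(diff_files(a, b)))
```

So with difflib, temp files would mainly change where the converted text is stored between steps, not the peak memory needed during the comparison itself.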
#10
Custom User Title
Posts: 10,288
Karma: 72663495
Join Date: Oct 2018
Location: Canada
Device: Kobo Libra H2O, formerly Aura HD
Yeah, I just mentioned it because I've seen other things that I thought should be text-only (full-text indexing and the Count Pages plugin) extracting the images into temp.
#11
Resident Curmudgeon
Posts: 78,392
Karma: 142887248
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
I've given TextDiff a try and it's pretty good. Thanks for creating this plugin.
#12
Connoisseur
Posts: 61
Karma: 666
Join Date: May 2020
Location: Germany
Device: android smartphone + tablet
#13
want to learn what I want
Posts: 1,534
Karma: 7095191
Join Date: Sep 2020
Device: none
Thank you for this plugin! I plan to use it more, in the future, to compare Orwell's 1984 translations, but I just gave it a try for testing purposes and noticed the HTML output to file will be great for this use case, as it displays the text differences side-by-side in a very synchronized fashion.
some quick notes:
Last edited by Comfy.n; 11-24-2022 at 07:09 AM.
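For anyone curious how a side-by-side HTML comparison like the one described in this post can be produced: Python's difflib ships an `HtmlDiff` class that renders two line lists as a two-column table. Whether the plugin uses exactly this class is an assumption on my part, and `side_by_side_html` is a made-up helper name.

```python
import difflib


def side_by_side_html(text_a: str, text_b: str,
                      label_a: str = "Format A",
                      label_b: str = "Format B") -> str:
    """Return a complete HTML page with the two texts in parallel columns."""
    # wrapcolumn keeps long paragraphs from stretching the table.
    differ = difflib.HtmlDiff(wrapcolumn=80)
    return differ.make_file(
        text_a.splitlines(),
        text_b.splitlines(),
        fromdesc=label_a,
        todesc=label_b,
    )


if __name__ == "__main__":
    page = side_by_side_html(
        "It was a bright cold day in April.",
        "It was a bright, cold day in April.",
        "Translation 1",
        "Translation 2",
    )
    print(len(page), "characters of HTML")
```

The returned string can be written to a file (for example with `pathlib.Path.write_text`) and opened in a browser. Changed lines sit next to each other, which is what makes comparing two translations comfortable.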
#14
Connoisseur
Posts: 61
Karma: 666
Join Date: May 2020
Location: Germany
Device: android smartphone + tablet
Thank you Comfy.n for the hints! I'll try to fix that. For better feedback on the program status, I had already considered a progress bar.
#15
Connoisseur
Posts: 61
Karma: 666
Join Date: May 2020
Location: Germany
Device: android smartphone + tablet
Version 1.1.0 is out
Hello Comfy.n,
I implemented two of your hints (see history). To work out how long a progress bar would need to be displayed, I measured intermediate timings in the program flow. To my surprise, the HTML rendering of the text browser widget consumes 2/3 of the total runtime! In a future release I may implement a workaround such as stepwise asynchronous loading with a timer, but I need to do some research on that first.
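Measuring intermediate timings like this can be sketched with `time.perf_counter` and a small context manager. This is only an illustration, not the plugin's code: the stage names are invented, and the stages timed here are the ones that run without a GUI. Rendering in the Qt text browser widget would be one more stage wrapped the same way.

```python
import difflib
import time
from contextlib import contextmanager


@contextmanager
def stage(timings: dict, name: str):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


def timed_compare(text_a: str, text_b: str) -> dict:
    """Run the comparison steps and return the seconds spent in each."""
    timings = {}
    with stage(timings, "split"):
        lines_a = text_a.splitlines()
        lines_b = text_b.splitlines()
    with stage(timings, "ratio"):
        difflib.SequenceMatcher(None, lines_a, lines_b).ratio()
    with stage(timings, "html"):
        difflib.HtmlDiff().make_table(lines_a, lines_b)
    return timings


def shares(timings: dict) -> dict:
    """Convert absolute timings into fractions of the total runtime."""
    total = sum(timings.values())
    return {name: (t / total if total else 0.0) for name, t in timings.items()}


if __name__ == "__main__":
    result = timed_compare("a\nb\n" * 200, "a\nc\n" * 200)
    for name, share in shares(result).items():
        print(f"{name}: {share:.0%}")
```

The per-stage fractions show directly which step dominates, which is the kind of measurement behind the "2/3 of the total runtime" observation.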