Old 11-01-2022, 09:27 PM   #1681
JSWolf
Resident Curmudgeon
Posts: 80,069
Karma: 147983159
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by compurandom View Post
Last I checked it was both. Subsequent editions I think sometimes increment or suffix the ISBN.

Even if it gets a new ISBN rather than incrementing the old one, it's still a version number of sorts.
I have never seen a novel get a new ISBN. There either is a version number or not.
Old 11-01-2022, 09:35 PM   #1682
JSWolf
Resident Curmudgeon
Posts: 80,069
Karma: 147983159
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
Quote:
Originally Posted by dunhill View Post
The revision or version number I say is
version="1.1"
<p class="trevision">
<span class="tdate">
What is inside the file
This is the code in the HTML file for the DAW eBook.

Code:
<p class="x01-FM-Copyright-Text-Space" id="release_identifier_line">btb_ppg_141035760_c0_r7</p>
Old 11-01-2022, 10:39 PM   #1683
theducks
Well trained by Cats
Posts: 31,141
Karma: 60406498
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by compurandom View Post
Last I checked it was both. Subsequent editions I think sometimes increment or suffix the ISBN.

Even if it gets a new ISBN rather than incrementing the old one, it's still a version number of sorts.
There is no 'suffix' in an ISBN.
The last character is a check digit (mod 11 for ISBN-10, mod 10 for ISBN-13).
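For anyone who wants to verify that last character, here is a minimal sketch of the two check-digit calculations. The function names are made up for illustration; the weights are the standard ones (10 down to 2 for ISBN-10, alternating 1 and 3 for ISBN-13).

```python
def isbn10_check_digit(first9: str) -> str:
    """Check character for the first nine digits of an ISBN-10 (mod 11)."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first9))
    check = (11 - total % 11) % 11
    # A remainder of 10 is written as the letter X.
    return "X" if check == 10 else str(check)


def isbn13_check_digit(first12: str) -> str:
    """Check digit for the first twelve digits of an ISBN-13 (mod 10)."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


print(isbn10_check_digit("030640615"))     # ISBN 0-306-40615-?
print(isbn13_check_digit("978030640615"))  # ISBN 978-0-306-40615-?
```

Because the final character is derived from the other digits, a new edition cannot simply "increment" an ISBN; it gets a different number with its own check digit.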
Old 11-16-2022, 09:41 PM   #1684
tn4w
Enthusiast
Posts: 34
Karma: 10
Join Date: Aug 2022
Device: Windows 10
I'm wondering if someone could develop a plugin that converts books to MP3 or M4A.

I've been looking for a way to listen to books using read-aloud. Unfortunately, Windows 10's native system TTS voices are too artificial and uncomfortable to listen to. I tried the Kindle and Google Play Books read-aloud features, but I find Microsoft Edge's native read-aloud voices the most natural and acceptable.

So currently, I convert books to HTMLZ, unzip them, copy the files to an Android device, and have Edge on Android read the contents of index.html. Edge often fails to read sentences in PDF files, possibly due to internal markup, and the Android version of Edge cannot read PDFs at all. The Android version of Edge also stalls while reading long TXT files, so converting books to HTML seems to be the way to go at the moment.
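The unzip-and-open step could be scripted. Below is a minimal sketch, assuming (as the manual workflow implies) that an HTMLZ file is a zip archive with an index.html inside. It only pulls out plain text that some TTS engine could then be fed; it does not call any TTS service. The names `htmlz_to_text` and `_TextCollector` are made up for this example.

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def htmlz_to_text(htmlz, member="index.html"):
    """Return the plain text of index.html inside an HTMLZ archive.

    `htmlz` may be a path or a file-like object.
    """
    with zipfile.ZipFile(htmlz) as archive:
        html = archive.read(member).decode("utf-8", errors="replace")
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.parts)
```

Each text node ends up on its own line, which also gives a natural place to split the book into chunks if a speech service limits the input size.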

But this is cumbersome. I've researched a little and Microsoft Edge seems to use the Microsoft Azure Text-to-speech technology, and it is available via API for free.

If somebody could develop a plugin that converts books to audio files using the Microsoft Azure TTS API, it would be greatly appreciated.
Old 11-19-2022, 07:33 PM   #1685
Fiat_Lux
Addict
Posts: 394
Karma: 6700000
Join Date: Jan 2012
Location: Gimel
Device: tablets
Quote:
Originally Posted by tn4w View Post
I'm wondering if there is someone who can develop a plugin that convert books to mp3 or m4a.
...

But this is cumbersome. I've researched a little and Microsoft Edge seems to use the Microsoft Azure Text-to-speech technology, and it is available via API for free.

If somebody can develop such a plugin that convert books to audio files using Microsoft Azure TTS API, it would be greatly appreciated.
TTS to MP3: Create MP3 audiobook using Windows TTS
https://www.mobileread.com/forums/sh...d.php?t=299727

Unfortunately, the developer is no longer able to provide ongoing support. The hope is that it will work for Calibre 5.x.

I don't do Windows, so I can't tell you how well it works.
Old 11-20-2022, 12:35 AM   #1686
tn4w
Enthusiast
Posts: 34
Karma: 10
Join Date: Aug 2022
Device: Windows 10
Quote:
Originally Posted by Fiat_Lux View Post
TTS to MP3: Create MP3 audiobook using Windows TTS
https://www.mobileread.com/forums/sh...d.php?t=299727

Unfortunately, the developer is not longer able to provide ongoing support. The hope is that it will work for Calibre 5.x.

I don't do Windows, so I can't tell you how well it works.
I'm aware of it, and I wish its development would continue and add support for Azure TTS voices, since Windows 10's default system voices are not comfortable to listen to.
Old 11-20-2022, 01:18 PM   #1687
feuille
Connoisseur
Posts: 62
Karma: 666
Join Date: May 2020
Location: Germany
Device: android smartphone + tablet
Quote:
Originally Posted by ldolse View Post
Here's an idea for a plugin:

When comparing editions of ebooks, I frequently want to do a diff between the two books to see what the differences in the actual text are. Sometimes you'll come across one edition with nice formatting, but another edition with better proofreading. Other times I'm trying to look to see if there are differences between ebooks claiming to be from significantly different print editions.

An example is the Arthur Conan Doyle books discussed in this thread:
https://www.mobileread.com/forums/sh...52#post1447252

The problem here is you can't just use a diff tool on the raw html, you'll never get anywhere that way.

The solution that seems to work the best is to convert both books to text (enable 'smarten punctuation' to normalize punctuation differences). Then you can load the two editions of the book in a visual diff tool and get a very decent idea of the differences between the two editions.

There's a lot of manual steps to be done here, and it seems like an excellent workflow for a plugin. Basically highlight the two books you want to compare, and the plugin runs those two files through the conversion pipeline, outputs to temp files, and calls up the diff tool of your choice.
This is now implemented by the TextDiff plugin.
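For readers curious what the quoted workflow boils down to, here is a rough sketch using only Python's standard library. It is not the TextDiff plugin's code. It assumes both editions have already been converted to plain text, and instead of "smarten punctuation" it folds curly quotes and dashes down to ASCII, which serves the same purpose of normalizing punctuation before comparing. The function names are hypothetical.

```python
import difflib
import re

# Fold typographic quotes and dashes to their plain ASCII equivalents.
_PUNCT = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})


def normalize(text):
    """Return non-empty lines with punctuation folded and whitespace collapsed."""
    lines = []
    for line in text.translate(_PUNCT).splitlines():
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            lines.append(line)
    return lines


def diff_editions(text_a, text_b, name_a="edition A", name_b="edition B"):
    """Return a unified diff (list of lines) of two editions after normalization."""
    return list(difflib.unified_diff(
        normalize(text_a), normalize(text_b),
        fromfile=name_a, tofile=name_b, lineterm="",
    ))


a = "\u201cElementary,\u201d said he.\nThe game is afoot."
b = '"Elementary," said he.\nThe game is a foot.'
print("\n".join(diff_editions(a, b)))
```

Only the second line shows up as changed; the quote-style difference in the first line is normalized away, which is the point of the punctuation step.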
Old 11-24-2022, 12:18 AM   #1688
tn4w
Enthusiast
Posts: 34
Karma: 10
Join Date: Aug 2022
Device: Windows 10
Idea: Cascaded series book thumbnails in the grid view

A series of books takes up a lot of space in the grid view, and it can slow down display because Calibre takes a moment to load each thumbnail. It might be a good idea to group the books of a series under one thumbnail with a cascaded image (like overlapping Solitaire cards). Clicking that thumbnail would then show the search results for the series.
Old 12-14-2022, 10:58 AM   #1689
The Holy
Enthusiast
Posts: 25
Karma: 10
Join Date: Aug 2021
Device: none
I have two ideas for plugins to identify books with incorrect metadata.

Misidentified check:
A plugin that runs a full-text search on all books in text-based formats.
It matches the title and the last name of the author and makes sure an exact match exists inside the book.
If there are multiple authors, one last-name match from any of them is enough, but the title must always match exactly. Case-insensitive.
This will find many if not all misidentified books. Some false positives can be expected.
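A minimal sketch of that check, assuming the book has already been converted to plain text and that author names are stored as "First Last" (so the last word is the last name). `looks_misidentified` is a made-up name, and the function only reports; it never changes metadata.

```python
import re


def _normalize(text):
    """Casefold and collapse whitespace so matching is case-insensitive."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def looks_misidentified(book_text, title, authors):
    """Return True when the metadata does not seem to match the book text.

    The title must appear verbatim (ignoring case and whitespace), and the
    last name of at least one author must appear as a whole word.
    """
    haystack = _normalize(book_text)
    title_found = _normalize(title) in haystack
    last_names = [_normalize(a).split(" ")[-1] for a in authors if a.strip()]
    author_found = any(
        re.search(r"\b" + re.escape(name) + r"\b", haystack)
        for name in last_names
    )
    return not (title_found and author_found)


text = "DAS BOOT\n\nby Lothar-G\u00fcnther Buchheim"
print(looks_misidentified(text, "Das Boot", ["Lothar-G\u00fcnther Buchheim"]))
print(looks_misidentified(text, "Dead Letters", ["Sheila Lowe"]))
```

Names written "Last, First" or with suffixes such as "Jr." would need extra handling, which is one source of the expected false positives.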

Language check:
Compare the language that is set for each book to its actual contents => only for text based formats
and
Compare the language that is set for each book to what languages are used in the title and comments.
For example by looking for non-English characters and words in title or comments when a book is set to language: English
E.g. The, Der, Die, Das, La, Le, Il, Å, Ä, Ö, Æ, 诶, ēi, も, अ, ب. Perhaps only do the most common languages if it gets to be too complicated.
Perhaps include a setting for minimum matches per page/number of words and/or matches total per book to avoid false positives.
And perhaps only check first 10, 10 in the middle and last 5 pages.
Dictionaries may be a frequent false positive.
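The "first 10, 10 in the middle and last 5 pages" idea can be sketched as a small sampling helper. This assumes the book text has already been split into pages (or any other chunks) somehow; `sample_pages` and its parameters are hypothetical names.

```python
def sample_pages(pages, first=10, middle=10, last=5):
    """Pick the first, middle and last pages of a book, without duplicates.

    `pages` is any sequence of page texts; short books are returned whole.
    """
    total = len(pages)
    if total <= first + middle + last:
        return list(pages)
    # Centre the middle block, but keep it clear of the first and last blocks.
    mid_start = max(first, (total - middle) // 2)
    mid_start = min(mid_start, total - last - middle)
    chosen = (
        list(range(first))
        + list(range(mid_start, mid_start + middle))
        + list(range(total - last, total))
    )
    return [pages[i] for i in chosen]


# With 100 pages this selects pages 0-9, 45-54 and 95-99.
print(len(sample_pages([f"page {i}" for i in range(100)])))
```

Sampling keeps the check fast on large libraries, at the cost of missing a book whose front matter is in one language and body in another.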

Maybe these would be best combined into one plugin so that it checks the language is the same in metadata and the book as well as matching the author and title.
"Misidentified check" or "Fix match" for example.
Or perhaps be added to a plugin like quality check?

Last edited by The Holy; 12-14-2022 at 11:05 AM.
Old 12-14-2022, 12:21 PM   #1690
theducks
Well trained by Cats
Posts: 31,141
Karma: 60406498
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
Quote:
Originally Posted by The Holy View Post
I have two ideas for plugins to identify books with incorrect metadata.

Misidentified check:
A plugin that runs full text search on all books in text based formats
It matches the title and last name of the author and makes sure an exact match exists inside the book.
If multiple authors exist one last name match from any of them is enough but the title must always match exactly. Case agnostic.
This will find many if not all misidentified books. Some false positives can be expected.

Language check:
Compare the language that is set for each book to its actual contents => only for text based formats
and
Compare the language that is set for each book to what languages are used in the title and comments.
For example by looking for non-english characters and words in title or comments when a book is set to language: English
E.g. The, Der, Die, Das, La, Le, Il, Å, Ä, Ö, Æ, 诶, ēi, も, अ, ب. Perhaps only do the most common languages if it gets to be too complicated.
Perhaps include a setting for minimum matches per page/number of words and/or matches total per book to avoid false positives.
And perhaps only check first 10, 10 in the middle and last 5 pages.
Dictionaries may be a frequent false positive.

Maybe these would be best combined into one plugin so that it checks the language is the same in metadata and the book as well as matching the author and title.
"Misidentified check" or "Fix match" for example.
Or perhaps be added to a plugin like quality check?
It took me less than 30 seconds to come up with an example that would fail:
Quote:
Das Boot

Das Boot is a 1981 West German war film written and directed by Wolfgang Petersen, produced by Günter Rohrbach
It clearly meets your tests, but the book/movie is in English (could be either)
Old 12-14-2022, 08:18 PM   #1691
The Holy
Enthusiast
Posts: 25
Karma: 10
Join Date: Aug 2021
Device: none
Quote:
Originally Posted by theducks View Post
took me less than 30 seconds to come up with an example that would fail:


It clearly meets your tests, but the book /movie is in English (could be either)
There will be false positives, like I said. To be clear, what I'm suggesting is basically an advanced search that displays the books matching the search criteria. It would never change the file or the metadata on its own. Thus, false positives are acceptable so long as the plugin returns enough accurate results per false positive.

I would guess most people don't have more than five different languages in their library, if not only one or two, so the user could select the languages in the plugin, each of which is tied to a list of words that are distinctive for that language rather than shared with many others.

If a library only should consist of English and German (because it is all the person thinks exists and has been getting), the user selects English and German. That way it wouldn't match with Italian, for example, due to the words Italian may share with English or German and makes the search simpler and faster. But if none of the English or German words were found/ were found enough times, the book could be in Italian or any other language, while set as English and thus shows up in the results.

Better yet, it could check how common both languages are in Das Boot.

Basically, the user tells the plugin which languages are to be expected by selecting language presets in the plugin containing some of the most common words (or common and unique) from each language expected (The for English and Das for German for example). If a lot more of the English words are found and the language is set to English it will be assumed to be correct and not show up in the search.

There would need to be a min/max required/allowed value for the number of occurrences of words from each language preset to decide whether a book shows up as a result. Let's say the book is set to English in Calibre. If the English words don't occur often enough, or the German words occur too often, it will show up in the results as a possible German book or translation. This would be decided by the min/max values. If it's a 50/50 split, it's probably an English-German dictionary.
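A rough sketch of the min/max idea. The presets, the language codes, and both threshold values are invented for illustration and would need tuning on real books.

```python
import re

# Hypothetical presets: a handful of very common words per language.
PRESETS = {
    "eng": {"the", "and", "of", "with", "was"},
    "deu": {"der", "und", "das", "nicht", "ist"},
}


def preset_shares(text, presets=PRESETS):
    """Share of all words in `text` that belong to each language preset."""
    words = re.findall(r"[^\W\d_]+", text.casefold())
    total = len(words) or 1
    return {
        lang: sum(w in vocab for w in words) / total
        for lang, vocab in presets.items()
    }


def should_flag(text, declared, presets=PRESETS, min_share=0.05, max_other=0.03):
    """True if the declared language looks wrong for this text.

    The book is flagged when the declared language's words are too rare
    (below `min_share`) or another language's words are too common
    (above `max_other`). Both thresholds are illustrative guesses.
    """
    shares = preset_shares(text, presets)
    if shares.get(declared, 0.0) < min_share:
        return True
    return any(
        share > max_other for lang, share in shares.items() if lang != declared
    )


german = "Der Kapit\u00e4n war nicht auf der Br\u00fccke und das Meer ist ruhig"
print(should_flag(german, "eng"))  # German text declared as English
print(should_flag(german, "deu"))
```

A bilingual dictionary would exceed `max_other` for whichever language is declared, so it gets flagged either way, which matches the 50/50 case described above.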



The title/author match would work for Das Boot since the title and author should be the same in the book.
I just added both the English and German version to Calibre and ran a metadata search on both. The German one was changed to English, even though it started out correctly. Looking at the images below, it's clear the function I'm suggesting would work. It would only show the German version, which was mismatched by the metadata search as English. The images also make it clear the title and author match would have to run only on the first and last few pages, and the language match in the middle.

English version would correctly match title, author, and language:
[Attached images: 1.png, 2.png]

German version would correctly match title and author, but not the language, since metadata search set it to English:
[Attached images: 3.png, 4.png, 5.png]

Imagine bulk adding 100 books, running metadata search and applying it. Wouldn't this be the fastest way to accurately identify most that were incorrectly identified? And 100 may be low for a lot of people, imagine doing 100s if not 1000s at a time. I have a lot of books, many of which have the wrong title, author, comment and language. Aside from covers, for which we already have tools for identifying bad ones, these four metadata values are the most important pieces of information in a book, to me anyway, which is why I think this plugin would be a great addition.

Let's say we combine it all into one plugin, here are a few advantages I can come up with:

It will show books which likely have the wrong basic (read:important) metadata!
This would in turn make using the metadata download on all books feel like less of a Hail Mary, since it will be much easier to find misidentified books.

It will show books which may not be the best copy of a book (metadata in Calibre is correct, but the title and author aren't written anywhere in the book, which e-books normally should have, and which may indicate that it is not a good version/copy)

It will show books which are in an unwanted language (you only select English and German because that is all you think you have, but not enough English or German words were found in a book because it's written in some other language that is different enough)
Old 12-15-2022, 06:17 PM   #1692
Fiat_Lux
Addict
Posts: 394
Karma: 6700000
Join Date: Jan 2012
Location: Gimel
Device: tablets
Quote:
Originally Posted by The Holy View Post
I have two ideas for plugins to identify books with incorrect metadata.

Misidentified check:
A plugin that runs full text search on all books in text based formats
It matches the title and last name of the author and makes sure an exact match exists inside the book.
If multiple authors exist one last name match from any of them is enough but the title must always match exactly. Case agnostic.
This will find many if not all misidentified books. Some false positives can be expected.
Sheila Lowe's _Dead Letters_ is a potential example of a false positive. Nine months after it was released, the cover designer, author, and publishing team discovered that the name on the cover and the name of the book were different.
CF https://www.goodreads.com/book/show/...6-dead-letters

Quote:
Language check:
Compare the language that is set for each book to its actual contents => only for text based formats
and
Compare the language that is set for each book to what languages are used in the title and comments.
For example by looking for non-English characters and words in title or comments when a book is set to language: English
English has never met a word that it has not claimed to be a part of the language.

Quote:
E.g. The, Der, Die, Das, La, Le, Il, Å, Ä, Ö, Æ, 诶, ēi, も, अ, ب. Perhaps only do the most common languages if it gets to be too complicated.
Perhaps include a setting for minimum matches per page/number of words and/or matches total per book to avoid false positives.
And perhaps only check first 10, 10 in the middle and last 5 pages.
Dictionaries may be a frequent false positive.
There are a couple of fairly standard algorithms to determine the language of a text.
The most common utilize letter frequency tables.
The least common utilize both word frequency and letter frequency tables.
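As a rough illustration of the letter-frequency approach, here is a sketch that compares a text's letter profile against reference profiles using cosine similarity. The similarity measure is my choice, not something prescribed above, and the function names are hypothetical. Real reference profiles would have to be built from sizeable texts in each language.

```python
import math
from collections import Counter


def letter_profile(text):
    """Relative frequency of each letter in `text` (case-insensitive)."""
    counts = Counter(ch for ch in text.casefold() if ch.isalpha())
    total = sum(counts.values()) or 1
    return {ch: n / total for ch, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[ch] * q.get(ch, 0.0) for ch in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def guess_language(text, reference_profiles):
    """Return the reference language whose letter profile is closest."""
    profile = letter_profile(text)
    return max(
        reference_profiles,
        key=lambda lang: cosine_similarity(profile, reference_profiles[lang]),
    )
```

Usage would be something like `guess_language(sample, {"eng": letter_profile(english_corpus), "deu": letter_profile(german_corpus)})`, where the corpora are texts whose language is already known.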

Depending upon how the program and database is structured, adding a new language can be as easy as dropping a new, language-specific database in a specific folder, and telling the program what the language is, or as complicated as adding new fields to the database, replacing the old database with the new, updated database.

Quote:
Maybe these would be best combined into one plugin so that it checks the language is the same in metadata and the book as well as matching the author and title.
"Misidentified check" or "Fix match" for example.
Or perhaps be added to a plugin like quality check?
Make it two, or maybe three different plugins.
a) It isn't uncommon for official documents from either governments, or NGOs to be in two or more languages.
b) Databases of word frequency tables can become very large, very quickly.

###

_Ethnologue_ claims that there are 7151 spoken languages today, with 4169 having a developed writing system, and a further 151 languages that are exclusively signed.

_Wycliffe Bible Translators_ claims that there are 7388 spoken languages, of which the Bible has been fully translated into 724 languages, and 3266 languages have an ongoing translation project.

For various reasons, I put slightly greater credence on _Wycliffe Bible Translators_ data, than on Ethnologue data.

For a first cut plug-in, I'd mandate UTF-8 glyphs, and use them to break the book into writing system, and from that, use letter frequencies for the specific language. The virtue of this approach is that it can guess the language of any document thrown at it, with an acceptable degree of inaccuracy.
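A sketch of the first step, splitting text by writing system. Python's standard library has no direct lookup for the Unicode script property, so this uses a shortcut that is purely my assumption: the first word of each character's Unicode name (LATIN, CYRILLIC, HIRAGANA, ...). It works on already-decoded text, so the UTF-8 decoding is assumed to have happened when the file was read.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough writing-system label taken from the Unicode character name."""
    if not ch.isalpha():
        return None
    try:
        name = unicodedata.name(ch)
    except ValueError:  # character has no name in the Unicode database
        return None
    return name.split(" ")[0]


def script_histogram(text):
    """Count letters per writing system."""
    return Counter(s for s in map(script_of, text) if s)


print(script_histogram("Das Boot \u3082 \u0905 \u0628"))
```

Once the dominant script is known, only the letter-frequency tables for languages written in that script need to be consulted.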

Either a second plug-in, or a more advanced version, would use word frequencies, with an initial draft of English/not-English, then expand to the ten most common languages, and when that is bug free, go to the 20 most common languages, and then jump to 50, 100, and, maybe 200 most common spoken languages.

Third parties willing to provide letter and/or word frequency tables would enable faster expansion and inclusion of minority and/or endangered and/or extinct and/or conlangs than would otherwise be the case.
Old 12-15-2022, 07:54 PM   #1693
The Holy
Enthusiast
Posts: 25
Karma: 10
Join Date: Aug 2021
Device: none
Quote:
Originally Posted by Fiat_Lux View Post
There are a couple of fairly standard algorithms to determine the language of a text.
The most common utilize letter frequency tables.
The least common utilize both word frequency and letter frequency tables.
Getting the top 100 (or x amount) most common words from the languages and deleting all duplicates would make a list of the most common and unique words. Perhaps that would be a good start.
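A minimal sketch of that idea, assuming one sample text per language is available. The function names and the language codes are made up for illustration.

```python
import re
from collections import Counter


def top_words(text, n=100):
    """The n most frequent words of `text`, most frequent first."""
    words = re.findall(r"[^\W\d_]+", text.casefold())
    return [w for w, _ in Counter(words).most_common(n)]


def unique_top_words(samples, n=100):
    """Map language -> top-n words that are in no other language's top-n list."""
    tops = {lang: top_words(text, n) for lang, text in samples.items()}
    seen = Counter(w for words in tops.values() for w in set(words))
    return {
        lang: [w for w in words if seen[w] == 1]
        for lang, words in tops.items()
    }


samples = {
    "eng": "the die is cast the end",
    "deu": "die Katze und die Maus",
}
print(unique_top_words(samples))
```

A side effect worth noticing: a word that is common in two languages ("die" in this toy sample) disappears from both lists, even though it is very frequent in one of them.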

Quote:
Depending upon how the program and database is structured, adding a new language can be as easy as dropping a new, language-specific database in a specific folder, and telling the program what the language is, or as complicated as adding new fields to the database, replacing the old database with the new, updated database.

Make it two, or maybe three different plugins.
a) It isn't uncommon for official documents from either governments, or NGOs to be in two or more languages.
b) Databases of word frequency tables can become very large, very quickly.

For a first cut plug-in, I'd mandate UTF-8 glyphs, and use them to break the book into writing system, and from that, use letter frequencies for the specific language. The virtue of this approach is that it can guess the language of any document thrown at it, with an acceptable degree of inaccuracy.

Either a second plug-in, or a more advanced version, would use word frequencies, with an initial draft of English/not-English, then expand to the ten most common languages, and when that is bug free, go to the 20 most common languages, and then jump to 50, 100, and, maybe 200 most common spoken languages.

Third parties willing to provide letter and/or word frequency tables would enable faster expansion and inclusion of minority and/or endangered and/or extinct and/or conlangs than would otherwise be the case.
I did think a little more about this earlier today and had the text in the image below written up. I post it as such to avoid the screen getting too crowded. In short, I was able to quickly find a few words each for English, French, German, Spanish, Swedish, and Italian which were only found in one of their books. Meaning, a Ctrl + f search for whole words in the e-book viewer, which only returned results for one of the books.

[Attached images: 1.png, English.png, French.png, German.png, Swedish.png]

Algorithms or a system that could identify any language out of the box would be interesting to test if it already exists. I do wonder, however, what the feasibility of that approach would be in terms of complexity and compute intensity.

I agree we should start small before expanding to multiple languages, perhaps just English and one other. A basic plugin would be great to start testing.
Old 12-15-2022, 08:01 PM   #1694
compurandom
Wizard
Posts: 1,017
Karma: 500000
Join Date: Jun 2015
Device: Rocketbook, kobo aura h2o, kobo forma, kobo libra color
> b) Databases of word frequency tables can become very large, very quickly.

I wouldn't think you need a complete dictionary to do this.

I would expect that having a dictionary of, say, the top 400 words in a language would be plenty to characterize it.

If you were selective, you could probably even pick fewer than 50 "keystone" words that are not shared with other languages, or at least very frequent in one language and very infrequent in other languages and come up with a correct weighted answer.

I'd even guess (i.e., without research or evidence) that given two languages, you could pick 10 words in each that would distinguish a text between the two using a weighted frequency sample of a few pages randomly selected in the book (i.e., page 10, not page 1, and a page full of words, not pictures).

I'm sure in the hundreds to thousands of potential languages, you could probably come up with a small number of words that would assign a book to a language family, and then go down a decision tree to narrow down which one from the family.

Even without having a database, it should be possible to analyze a book, generate a frequency table of the top ~1000 words, have the user supply the language, and build a database. After adding a handful of languages like this, you could start characterizing books and for ones that are wrong, it could generate a differential between the two languages. A user guided selection of words might be useful and improve accuracy, but likely not totally necessary.
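That last idea can be sketched roughly as follows: learn a frequency table from books the user has labelled, score new text against every table, and list the words that best separate two languages. The class name, method names, and the scoring rule (sample word count times the word's frequency in the language) are all assumptions for illustration.

```python
import re
from collections import Counter

_WORD = re.compile(r"[^\W\d_]+")


def word_frequencies(text, top=1000):
    """Relative frequencies of the `top` most common words in `text`."""
    words = _WORD.findall(text.casefold())
    total = len(words) or 1
    return {w: c / total for w, c in Counter(words).most_common(top)}


class LanguageDatabase:
    def __init__(self):
        self.tables = {}

    def learn(self, language, text):
        """Build the table for `language` from a user-labelled book."""
        self.tables[language] = word_frequencies(text)

    def classify(self, text):
        """Return (best language, scores) using a frequency-weighted word count."""
        sample = Counter(_WORD.findall(text.casefold()))
        scores = {
            lang: sum(n * table.get(w, 0.0) for w, n in sample.items())
            for lang, table in self.tables.items()
        }
        return max(scores, key=scores.get), scores

    def differential(self, lang_a, lang_b, n=10):
        """Words most typical of lang_a relative to lang_b."""
        a, b = self.tables[lang_a], self.tables[lang_b]
        diff = {w: a.get(w, 0.0) - b.get(w, 0.0) for w in set(a) | set(b)}
        return sorted(diff, key=lambda w: (-diff[w], w))[:n]
```

Because the tables are learned from the user's own books, no frequency data has to be shipped with the plugin, though a table built from a single book will be skewed toward that book's vocabulary.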

Last edited by compurandom; 12-15-2022 at 08:06 PM.
Old 12-16-2022, 03:45 AM   #1695
Fiat_Lux
Addict
Posts: 394
Karma: 6700000
Join Date: Jan 2012
Location: Gimel
Device: tablets
Quote:
Originally Posted by The Holy View Post
Getting the top 100 (or x amount) most common words from the languages and deleting all duplicates would make a list of the most common and unique words. Perhaps that would be a good start.
Offhand, I don't remember how useful deleting duplicate words from top X word lists is.

What you don't want to happen, is what happened with the Afrikaans dictionary for OpenOffice.org. The final, automated proofreading, was running it against the South African English dictionary, and deleting words found in that dictionary. There was a list of words to be added back in --- "boer", "bakkie", other obvious Afrikaans words that English captured --- but the word "die" took almost a decade to migrate into that "add word back in list". "Die" is Afrikaans for "The".

Quote:
I was able to quickly find a few words each for English, French, German, Spanish, Swedish, and Italian which were only found in one of their books. Meaning, a Ctrl + f search for whole words in the e-book viewer, which only returned results for one of the books.
When languages are very closely related --- Catalan, Valencian, and Spanish, for example --- the unique word list gets very big, if reliability and accuracy are to be maintained.

Quote:
Algorithms or a system that could identify any language out of the box would be interesting to test if it already exists. I do wonder, however, what the feasibility of that approach would be in terms of complexity and compute intensity.
/opt/libreoffice7.4/share/fingerprint/ contains the data that LibreOffice uses to differentiate between languages.

I've forgotten where in the LibreOffice codebase their implementation resides.

The algorithm LibreOffice uses is neither complex nor computationally intensive.
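To give a feel for how cheap such an approach can be, here is a sketch of one well-known technique: ranking character n-grams and comparing ranks with an "out-of-place" distance. Whether this matches what LibreOffice actually does is an assumption on my part, not something confirmed here, and all names and parameters below are hypothetical.

```python
from collections import Counter


def ngram_ranks(text, n_max=3, keep=300):
    """Rank the most frequent character n-grams (length 1..n_max) of `text`."""
    # Words are joined with "_" so that word boundaries become part of n-grams.
    padded = "_" + "_".join(text.casefold().split()) + "_"
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    counts.pop("_", None)  # the bare boundary marker carries no information
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:keep]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_ranks, lang_ranks):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    penalty = len(lang_ranks)
    return sum(
        abs(rank - lang_ranks[g]) if g in lang_ranks else penalty
        for g, rank in doc_ranks.items()
    )


def closest_language(text, fingerprints, n_max=3):
    """Return the language whose fingerprint is nearest to the text's profile."""
    doc = ngram_ranks(text, n_max)
    return min(fingerprints, key=lambda lang: out_of_place(doc, fingerprints[lang]))
```

A fingerprint is just the output of `ngram_ranks` on a known-language text, so adding a language means adding one small table, and classification is a single pass over the sample plus a few hundred dictionary lookups.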

I learned to program using "If Then" & GoTo statements. (Standard Library? What is that?) If the wanted algorithm wasn't in either Knuth's _The Art of Computer Programming_ or Sedgewick, brute force a working solution. An approach that is guaranteed to produce umpteen bugs per line of code. Once a working version exists, throw it all away, and write the program using procedures and functions.

Quote:
I agree we should start small before expanding to multiple languages, perhaps just English and one other. A basic plugin would be great to start testing.
Start with English/Not English, and then expand languages.

###

After thinking some more about it, I'd push for two plugins. One glyph/letter based, and one word based. The former for rough identification and the latter for precise identification.