Old 09-06-2022, 04:11 PM   #1
pzack
Connoisseur
Posts: 79
Karma: 10
Join Date: Aug 2022
Device: kobo sage,elipsa
conversion pyglossary pdf

Good afternoon,

I need help converting a standalone PDF dictionary that I use on my e-reader to a StarDict dictionary for use under KOReader.

I have tried converting a full-text file of this PDF (I did not create it), but pyglossary gives me a boatload of no-tab errors and the StarDict files that it creates from this txt file are empty. I also have an XML file, but I cannot get pyglossary to convert it to StarDict even though pyglossary is supposed to support .xml.

Can anyone suggest ways to convert this PDF to a StarDict dictionary? I know of no way to convert PDF to StarDict in pyglossary. Perhaps there is another conversion tool; I am fishing for a way to do this.
The dictionary would be much more useful to me under StarDict.

Cordially,
pz
Old 09-07-2022, 05:16 PM   #2
Doitsu
Grand Sorcerer
Posts: 5,584
Karma: 22735033
Join Date: Dec 2010
Device: Kindle PW2
Quote:
Originally Posted by pzack View Post
I need help converting a standalone PDF dictionary that I use on my e-reader to a StarDict dictionary for use under KOReader.
You can't directly convert a PDF file to another dictionary file format, because the converter wouldn't be able to reliably identify headwords and definitions.

You might want to ask about converting the xml version in the Index of Custom Dictionaries for Kobo eReader thread.

MR member Markismus might be able to help you, because he often converts non-standard dictionary files to Pocketbook dictionaries.

Last edited by Doitsu; 09-07-2022 at 05:25 PM.
Old 09-08-2022, 01:24 PM   #3
Markismus
Guru
Posts: 897
Karma: 149877
Join Date: Jul 2013
Location: Netherlands
Device: Cracked HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
XML is a language for data storage, not a dictionary format, so you can't expect pyglossary to support every arbitrary XML file. However, you can put the XML file online and post a link to it. Maybe you're lucky.

Suggestions
The PDF-(2-epub-)2-html-2-stardict tool isn't there yet, and probably never will be. The problem is that the nice styling of a PDF puts a lot of extra code in there that has to be differentiated from the words and definitions. Optically that's easy, but not code-wise. You could try to get ABBYY FineReader to recognize it and specify a spreadsheet or CSV file as the output format. However, even ABBYY's output will still have a lot of noise that you'll have to deal with.

What is the name of the dictionary? Maybe it's already present in a nicer format than PDF.
Old 09-09-2022, 12:46 PM   #4
pzack
Good morning M. Markismus,

Thank you for taking the time to respond to my query. I mentioned .xml because it is a supported format in pyglossary (according to GitHub) for conversion; however, the XML file that I have is not converting. I am not sure if this file actually contains the whole dictionary anyway.

The only thing that I can think of is to convert the full-text file that I have, but it is not tab-delimited. When I look at this file in Notepad I see that the headword is not separated out: it is the leading word, but it is part of the definition, which is a paragraph.

Pyglossary asks for a tab-delimited file, citing no-tab errors as it was converting; it produced the three StarDict files from my text file but they were empty. I did not create the text and XML files.

If there is a way to do a mass conversion of the text file, that is, to get the leading headword separated out (and I think that this is what is meant by tab-delimiting a file), then pyglossary may correctly convert the text file. It is almost there but needs the headword separated from the definition. However, I admit that I don't fully understand the structure of a tab-delimited file.
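
To make the idea of a tab-delimited file a bit more concrete: as far as this thread describes it, each line holds the headword, then one tab character, then the definition. The small Python sketch below only reports which non-blank lines lack a tab, which is presumably what the no-tab errors are complaining about. It is not pyglossary's own check, and the function name `lines_missing_tab` is made up for illustration.

```python
def lines_missing_tab(lines):
    """Return (line_number, text) for every non-blank line without a tab."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if line.strip() and "\t" not in line
    ]


sample = [
    "cours\t[kur] n.m. definition\n",      # headword, tab, definition
    "\n",                                  # blank lines are ignored
    "coursier [kursje] n.m. definition\n", # no tab: would be reported
]
for number, text in lines_missing_tab(sample):
    print(f"line {number} has no tab: {text}")
```

To check a real file, the list could be replaced by an open file object, for example `open("dictionary.txt", encoding="utf-8")`, where the file name is only a placeholder.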

I have seen something about dumping the text into Excel or another spreadsheet to build a tab-delimited file but, unfortunately, I have zero experience and knowledge of spreadsheets.


The dictionary has over 100,000 words and I certainly cannot do it manually.

And then there is the file converter "penelope" but I don't know if there is any help in that direction.

Cordially,
pz

Last edited by pzack; 09-09-2022 at 12:49 PM.
Old 09-09-2022, 01:04 PM   #5
Markismus
@pzack Why don't you post a link to the non-tab-delimited file?

If what you're saying turns out to be correct, then all you would need is to prefix each line with a repetition of the first word and a delimiter.
Sed could do that on Linux; any pattern substitution in Perl, Python, Awk or Lua could do that as well.

You could even do it in Excel: first column your line, second column the LEFT function, third column a concatenation of both column values with a delimiter in between.
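
As a rough sketch of that pattern substitution in Python: the snippet below assumes every entry sits on a single line and that the headword is the first whitespace-separated word. Neither assumption has been verified against the actual file, and the function names `prefix_headword` and `convert` are made up for illustration.

```python
def prefix_headword(line: str, delimiter: str = "\t") -> str:
    """Return the line prefixed with a copy of its first word and a delimiter."""
    stripped = line.strip()
    if not stripped:
        return ""  # nothing to prefix on an empty line
    headword = stripped.split(maxsplit=1)[0]
    return f"{headword}{delimiter}{stripped}"


def convert(lines):
    """Convert an iterable of entry lines, skipping blank ones."""
    return [prefix_headword(line) for line in lines if line.strip()]


sample = ["cours [kur] n.m. definition", "", "coursier [kursje] n.m. definition"]
for row in convert(sample):
    print(row)
```

Headwords that consist of more than one word would be cut short by this approach, so it only fits a file where the first word really is the whole headword.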

Last edited by Markismus; 09-09-2022 at 01:07 PM.
Old 09-09-2022, 03:58 PM   #6
pzack
Good afternoon M. Markismus,

Thank you for your quick response.

As I indicated, I don't know how to work with Excel and spreadsheets. However, you have suggested some other possibilities for tab-delimiting the text file.

May I impose upon you to give me an example of how I may do this with the apps that you listed? Please choose the one that may be the simplest to work with. Please understand that I am not a programmer and I am shaky when working with scripts, but I can work in a Linux terminal. Your example could be short and sweet.

I figured that there may be a way to do this and I did see a script for converting this file to tab-delimited but I can't find it; it was a short script for use in Linux.

Please let me see what you come up with before I try a new thread on a tab-delimited conversion.

I think, thanks to you, that we may be headed in the right direction. And here's hoping that once converted, if it can be done, pyglossary will cooperate and give me a StarDict dictionary!

cordially,
pz
Old 09-09-2022, 04:00 PM   #7
Markismus
I already wrote it out with the Excel example. What prevents you from posting a link to the text file? If it's small, you could even zip it and upload it here.
Old 09-09-2022, 04:16 PM   #8
pzack
M. Markismus,

I want to add to my just-sent reply to you that, though I don't understand fully the structure of tab-delimited text files, I assume that pyglossary needs the head word as a hook on which to hang the definition.

My sense is that the tab delimiting isolates or sets apart the headword so that pyglossary sees it as the headword and can build its index or pointers to the headword.

This is how I understand it, but this is purely conjecture on my part. If I am correct, then I need an app, maybe one of the apps that you have suggested, to isolate (or tab?) the headword, which is the first word of each of the paragraphs that include the headword and definition. There are spaces between each paragraph of text. There are no illustrations in the text file.

I would need the syntax to instruct the app to tab-delimit the first word which is the head word.
Maybe this helps to clarify things.

pz
Old 09-09-2022, 07:00 PM   #9
Doitsu
Quote:
Originally Posted by pzack View Post
I would need the syntax to instruct the app to tab-delimit the first word which is the head word.
Maybe this helps to clarify things.
You also might want to look into using StarDict Editor, which you can use to compile and decompile StarDict dictionaries.
It also supports compiling and decompiling Babylon BGL dictionaries.

The Babylon glossary source file syntax, which supports inflections, is very simple:

Code:
#stripmethod=keep
#sametypesequence=h
#bookname=Spanish-English Dictionary

libro|libros
<p>single line definition of 'libro' (may contain html 3.2 tags, e.g <br>)</p>

rana|ranas
<p>single line definition of 'rana' (may contain html 3.2 tags, e.g <br>)</p>
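
If the entries were already available as headword/definition pairs, a source file in the layout above could be generated with a short script. The following Python sketch only mirrors the three header lines and the entry layout shown in the example; the function name `to_babylon_source` and the tuple layout of `entries` are assumptions made for illustration, not part of any tool.

```python
def to_babylon_source(entries, bookname):
    """Build Babylon glossary source text.

    entries: iterable of (headword, inflections, definition_html) tuples,
    where inflections is a list of inflected forms (possibly empty).
    """
    lines = [
        "#stripmethod=keep",
        "#sametypesequence=h",
        f"#bookname={bookname}",
        "",
    ]
    for headword, inflections, definition in entries:
        # Headword and inflected forms share one line, separated by '|'.
        lines.append("|".join([headword, *inflections]))
        # Collapse whitespace runs (including newlines) so the
        # definition stays on a single line.
        lines.append(" ".join(definition.split()))
        # A blank line separates entries.
        lines.append("")
    return "\n".join(lines)


print(to_babylon_source(
    [("libro", ["libros"], "<p>book</p>"), ("rana", ["ranas"], "<p>frog</p>")],
    "Spanish-English Dictionary",
))
```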
Old 09-09-2022, 07:42 PM   #10
pzack
Good evening, Doitsu

Thank you for responding. My dictionary is not in BGL format; thus, I don't think that the StarDict editor is useful here. Actually, I tried this editor and, like pyglossary, it threw up countless no-tab errors on the full-text file and gave me empty StarDict files.

Thank you for the Excel example, but I don't understand Excel.

In looking again at the text file it is like this:

headword space [pronunciation of headword] space definition. In other words, the bracketed pronunciation, in the international phonetic alphabet, is what separates what follows from the next headword and bracket. So whatever follows the bracketed pronunciation of a headword pertains to that headword until the next headword with its bracketed pronunciation. Now, where would one set the tab that would separate one headword and bracket from the next headword and bracket?

Again, I am trying to understand the workings of a tab delimited file and what pyglossary is looking for.

cordially,
pz
Old 09-09-2022, 10:59 PM   #11
DNSB
Bibliophagist
Posts: 35,513
Karma: 145557716
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Forma, Clara HD, Lenovo M8 FHD, Paperwhite 4, Tolino epos
If you can't post the whole text file here as an attachment to your message, then snip a chunk of text and post that. It'll make looking at your issues a lot simpler.

To attach the file, either use the paperclip next to the smiley icon at the top of the message entry box or the Manage Attachments in the Attach files box below the message entry box. A .txt file is limited to 1MB but you can attach a .zip file of up to 20MB.
Old 09-10-2022, 01:57 AM   #12
Sarmat89
Evangelist
Posts: 482
Karma: 2267928
Join Date: Nov 2015
Device: none
It should be simple.

Get yourself an editor with regex support, like Notepad++ or VSCode.

Replace
Code:
^([^[]+?) *(?=\[)
with
Code:
\1\t
.
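
For anyone who prefers a script to an editor, the same search and replace can be sketched in Python. Two details differ from the pattern above and are my own adjustments: the `[` inside the character class is escaped, and `\n` is added to the class so that a match cannot run across line breaks. The function name `tab_delimit` is made up for illustration.

```python
import re

# Everything from the start of a line up to (but not including) the first '[',
# with any spaces directly before the '[' dropped from the captured headword.
HEADWORD = re.compile(r"^([^\[\n]+?) *(?=\[)", flags=re.MULTILINE)


def tab_delimit(text: str) -> str:
    """Replace the spaces between headword and '[' with a single tab."""
    return HEADWORD.sub(r"\1\t", text)


sample = "cours [kur] n.m. definition\ncoursier [kursje] n.m. definition\n"
print(tab_delimit(sample))
```

The bracketed pronunciation ends up at the start of the definition column. Lines that contain no `[` are left untouched, so continuation lines of a long definition still need to be joined to their entry separately.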
Old 09-10-2022, 02:03 AM   #13
Doitsu
Sarmat beat me to the answer.
Old 09-10-2022, 11:41 AM   #14
pzack
Dear Sarmat89,

Thank you for the information and for responding. The code is Greek to me.

Here is how my text file looks, as an example (I did not build this file):

cours [kur] n.m. definition........................................ .................................
.................................................. .................................................. .........
.................................................. .................................................. ...........

.................................................. .................................................. ..........
.................................................. .................................................. .............
coursier [kursje] n.m. definition........................................ .............................
.................................................. .................................................. .............
.................................................. .................................................. ..............

Thus, you have headword space [pronunciation] gender definition.
The definitions can be in separate paragraphs, and sometimes a long definition runs to a number of paragraphs. It is, I think, the bracketed pronunciation with its headword before it that delimits the definitions.
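
A layout like this, where one entry spans several lines or paragraphs, can be folded into one tab-delimited line per entry. The Python sketch below rests on a heuristic that has not been tested against the real file: any line shaped like "text [something] ..." is treated as the start of a new entry, so a continuation line that happens to contain brackets would be misread as a headword line. Paragraph breaks inside a definition are replaced by single spaces. The names `group_entries` and `to_tab_delimited` are made up for illustration.

```python
import re

# Assumption: an entry starts on a line shaped like "headword [pronunciation] ...".
ENTRY_START = re.compile(r"^([^\[\n]+?) *(\[[^\]\n]*\].*)$")


def group_entries(lines):
    """Yield (headword, definition) pairs, folding continuation lines
    and blank paragraph separators into the preceding entry."""
    headword, parts = None, []
    for raw in lines:
        line = raw.strip()
        match = ENTRY_START.match(line)
        if match:
            if headword is not None:
                yield headword, " ".join(parts)
            headword, parts = match.group(1), [match.group(2)]
        elif line and headword is not None:
            parts.append(line)
    if headword is not None:
        yield headword, " ".join(parts)


def to_tab_delimited(lines):
    """Return one 'headword<TAB>definition' line per entry."""
    return "\n".join(f"{word}\t{definition}" for word, definition in group_entries(lines))


sample = [
    "cours [kur] n.m. first part",
    "continued",
    "",
    "second paragraph",
    "coursier [kursje] n.m. other",
]
print(to_tab_delimited(sample))
```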

If I understand tab-delimiting correctly, then the headword and brackets would have a tab, but I don't know where to place the tab or how to actually tab the text.

There are over 100,000 words with definitions (6,000 pages plus), so the program has to run through the file placing the tab somewhere. Or tabs?

If your code example applies here, how would my actual format plug into it? That is, which part of your code represents which part of my example?

I don't know what regex is or how it works. I have Notepad++ under Windows 11 and I have never formatted a text file, let alone built a tab-delimited file.

I assume that the problem in pyglossary is getting the headword with the brackets tabbed so that StarDict can find the word.

Very cordially,
pz
Old 09-10-2022, 11:54 AM   #15
Markismus
Dear pzack,

This is not working. The example given reiterates the problem as you've described it, but it is not a sample. We have already given you multiple solutions to that problem, but they don't seem to help you.

Zip the text file, post it on a file hoster such as Dropbox or pCloud, and share the link. Maybe we can help you if you stop repeating the same information.