conversion pyglossary pdf

pzack · 09-06-2022, 04:11 PM

Good afternoon,

I need help converting a pdf stand alone dictionary that I use on my e-reader to a stardict dictionary for use under koreader.

I have tried converting a full text file of this pdf (I did not create it), but pyglossary is giving me a boatload of "no tab" errors, and the stardict files that it creates from this txt file are empty. I also have an xml file, but I cannot get pyglossary to convert it to stardict even though pyglossary is supposed to support .xml.

Can anyone suggest ways to convert this pdf to a stardict dictionary? I know of no program that would convert pdf to stardict in pyglossary. Perhaps there is another conversion tool; I am fishing for a way to do this.
The dictionary would be much more useful to me under stardict.

Cordially,
pz

Doitsu · 09-07-2022, 05:16 PM

Quote:

Originally Posted by pzack

I need help converting a pdf stand alone dictionary that I use on my e-reader to a stardict dictionary for use under koreader.

You can't directly convert a PDF file to another dictionary file format, because the converter wouldn't be able to reliably identify headwords and definitions.

You might want to ask about converting the xml version in the Index of Custom Dictionaries for Kobo eReader thread.

MR member Markismus might be able to help you, because he often converts non-standard dictionary files to Pocketbook dictionaries.

Markismus · 09-08-2022, 01:24 PM

XML is a language for data storage, not a dictionary format. So you can't expect pyglossary to support just any XML file. However, you can put the XML file online and post a link to it. Maybe you're lucky.

Suggestions
The Pdf-(2-epub-)2-html-2-stardict tool isn't there yet, and probably never will be. The problem is that the nice styling of a PDF puts a lot of extra code in there that has to be differentiated from the words and definitions. Optically that is easy, but not code-wise. You could try to get ABBYY FineReader to recognize it and specify the output format as a spreadsheet or CSV format. However, even ABBYY's output will still have a lot of noise that you'll have to deal with.

What is the name of the dictionary? Maybe it's already present in a nicer format than PDF.

pzack · 09-09-2022, 12:46 PM

Good morning M. Markismus,

Thank you for taking the time to respond to my query. I mentioned .xml because it is a supported format in pyglossary (according to GitHub) for conversion; however, the xml file that I have is not converting. I am not sure if this file actually contains the whole dictionary anyway.

The only thing that I can think of is to convert the full text file that I have, but it is not tab delimited. When I look at this file in Notepad I see that the headword is not separated out: it is the leading word, but it is part of the definition, which is a paragraph.

Pyglossary asks for a tab-delimited file, citing no-tab errors as it was converting; it produced the three stardict files from my text file but they were empty. I did not create the text and xml files.

If there is a way to do a mass conversion of the text file, that is, to get the leading headword separated out (and I think that this is what is meant by tab-delimiting a file), then pyglossary may correctly convert the text file. It is almost there but needs the headword separated from the definition. However, I admit that I don't fully understand the structure of a tab-delimited file.

I have seen something about dumping the text into Excel or another spreadsheet to build a tab-delimited file but, unfortunately, I have zero experience and knowledge of spreadsheets.

The dictionary has over 100,000 words and I certainly cannot do it manually.

And then there is the file converter "penelope" but I don't know if there is any help in that direction.

Cordially,
pz

Markismus · 09-09-2022, 01:04 PM

@pzack Why don't you post a link to the non-tab-delimited file?

If what you're saying turns out to be correct, then all you would need is to prefix each line with a repetition of the first word and a delimiter.
Sed could do that on Linux; any pattern substitution in Perl, Python, Awk or Lua could do that as well.

You could even do it in Excel. First column your line, second column the LEFT-function, third column a concatenation of both column-values with a delimiter in between.
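
For illustration only, the "repeat the first word plus a delimiter" idea could be sketched in Python roughly as follows. The function name `prefix_headword` and the sample lines are made up for this example, and it assumes every entry sits on a single line and that a tab is the delimiter the converter wants; it is not tested against the actual dictionary file.

```python
def prefix_headword(line: str, delimiter: str = "\t") -> str:
    """Return the line prefixed with its own first word and a delimiter.

    Blank lines are returned unchanged. The name and behaviour are
    illustrative assumptions, not part of pyglossary or any other tool.
    """
    words = line.split(maxsplit=1)
    if not words:
        return line  # nothing to prefix on an empty line
    return f"{words[0]}{delimiter}{line}"


# Hypothetical sample lines in the shape described in this thread.
sample = [
    "cours [kur] n.m. definition",
    "",
    "coursier [kursje] n.m. definition",
]
converted = [prefix_headword(line) for line in sample]
for row in converted:
    print(row)
```

If definitions run over several lines, as they may well do in a file extracted from a PDF, this one-line-per-entry assumption breaks and the lines would have to be grouped first.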

pzack · 09-09-2022, 03:58 PM

Good afternoon M. Markismus,

Thank you for your quick response.

As I indicated, I don't know how to work with excel and spreadsheets. However, you have suggested some other possibilities of tab-delimiting the text file.

May I impose upon you to give me an example of how I may do this with the apps that you listed? If you would, choose the one that may be the simplest to work with. Please understand that I am not a programmer and I am shaky when working with scripts. But I can work in a Linux terminal. Your example could be short and sweet.

I figured that there may be a way to do this, and I did see a script for converting this file to tab-delimited but I can't find it; it was a short script for use in Linux.

Please let me see what you come up with before I try a new thread on a tab-delimited conversion.

I think, thanks to you, that we may be headed in the right direction. And here's hoping that once converted (if it can be done), pyglossary will cooperate and give me a stardict dictionary!

cordially,
pz

Markismus · 09-09-2022, 04:00 PM

I already wrote it out with the Excel example. What prevents you from posting a link to the text file? If it's small, you could even zip it and upload it here.

pzack · 09-09-2022, 04:16 PM

M. Markismus,

I want to add to my just-sent reply to you that, though I don't understand fully the structure of tab-delimited text files, I assume that pyglossary needs the head word as a hook on which to hang the definition.

My sense is that the tab delimiting isolates or sets apart the headword so that pyglossary sees it as the headword and can build its index or pointers to the headword.

This is how I understand it, but this is purely conjecture on my part. If I am correct, then I need an app, maybe among the apps that you have suggested, to isolate (or tab?) the headword, which is the first word of each of the paragraphs that contain the headword and definition. There are blank lines between the paragraphs of text. There are no illustrations in the text file.

I would need the syntax to instruct the app to tab-delimit the first word which is the head word.
Maybe this helps to clarify things.

pz

Doitsu · 09-09-2022, 07:00 PM

Quote:

Originally Posted by pzack

I would need the syntax to instruct the app to tab-delimit the first word which is the head word.
Maybe this helps to clarify things.

You also might want to look into using StarDict Editor, which you can use to compile and decompile StarDict dictionaries.
It also supports compiling and decompiling Babylon BGL dictionaries.

The Babylon glossary source file syntax, which supports inflections, is very simple:

Code:

#stripmethod=keep
#sametypesequence=h
#bookname=Spanish-English Dictionary

libro|libros
<p>single line definition of 'libro' (may contain html 3.2 tags, e.g <br>)</p>

rana|ranas
<p>single line definition of 'rana' (may contain html 3.2 tags, e.g <br>)</p>
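
As a rough sketch, text in the layout shown above could be generated from a list of entries with a few lines of Python. The helper names `babylon_entry` and `babylon_source` are invented for this example, and the layout is inferred only from the sample in this post (three header lines, a blank line, then headword-and-inflections followed by a one-line definition), so treat it as an assumption rather than a description of the format's full rules.

```python
def babylon_entry(headword, inflections, definition_html):
    """Format one entry: 'headword|inflection...' then a one-line definition."""
    forms = "|".join([headword, *inflections])
    # The sample above keeps each definition on a single line,
    # so collapse any internal line breaks and runs of whitespace.
    one_line = " ".join(definition_html.split())
    return f"{forms}\n{one_line}\n"


def babylon_source(bookname, entries):
    """Build the whole source text; entries are (headword, inflections, html)."""
    header = "\n".join([
        "#stripmethod=keep",
        "#sametypesequence=h",
        f"#bookname={bookname}",
    ])
    body = "\n".join(babylon_entry(*entry) for entry in entries)
    return f"{header}\n\n{body}"


# Hypothetical entries mirroring the example in this post.
text = babylon_source(
    "Spanish-English Dictionary",
    [
        ("libro", ["libros"], "<p>book</p>"),
        ("rana", ["ranas"], "<p>frog</p>"),
    ],
)
print(text)
```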

pzack · 09-09-2022, 07:42 PM

Good evening, Doitsu

Thank you for responding. My dictionary is not in BGL format, so I don't think that the StarDict editor is useful here. Actually, I tried this editor and, like pyglossary, it threw up countless no-tab errors on the full text file and gave me empty stardict files.

Thank you for the Excel example, but I don't understand Excel.

Looking again at the text file, it is like this:

headword space [pronunciation of headword] space definition. In other words, the bracketed pronunciation (in the International Phonetic Alphabet) is what separates one entry from the next headword and bracket. So whatever follows the bracketed pronunciation of a headword pertains to that headword, until the next bracketed pronunciation with its headword just before it. Now, where would one set the tab that would separate headword and bracket from the next headword and bracket?

Again, I am trying to understand the workings of a tab delimited file and what pyglossary is looking for.

cordially,
pz

DNSB · 09-09-2022, 10:59 PM

If you can't post the whole text file here as an attachment to your message, then snip a chunk of text and post that. It'll make looking at your issues a lot simpler.

To attach the file, either use the paperclip next to the smiley icon at the top of the message entry box or the Manage Attachments in the Attach files box below the message entry box. A .txt file is limited to 1MB but you can attach a .zip file of up to 20MB.

Sarmat89 · 09-10-2022, 01:57 AM

It should be simple.

Get yourself an editor with regex support, like Notepad++ or VSCode.

Replace

Code:

^([^[]+?) *(?=\[)

with

Code:

\1\t

.
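
For anyone who would rather run this from a terminal than from an editor, the same substitution can be sketched in Python. This is only an approximation of the editor search-and-replace above: the character class is written as `[^\[\n]` so that a match cannot run across line breaks (a small change from the pattern as posted), and the names `HEADWORD` and `tab_after_headword` and the sample text are invented for the example.

```python
import re

# Start of line, then the shortest run of non-bracket characters (the
# headword), then optional spaces, stopping right before the first "[".
HEADWORD = re.compile(r"^([^\[\n]+?) *(?=\[)", re.MULTILINE)


def tab_after_headword(text: str) -> str:
    """Replace the spaces between headword and '[' with a single tab."""
    return HEADWORD.sub(r"\1\t", text)


# Hypothetical sample in the shape described in this thread.
sample = (
    "cours [kur] n.m. definition\n"
    "a continuation line without brackets\n"
    "coursier [kursje] n.m. definition"
)
result = tab_after_headword(sample)
print(result)
```

Note that any line containing a `[` somewhere gets a tab before its first bracket, so continuation lines that happen to include bracketed text would also be changed.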

Doitsu · 09-10-2022, 02:03 AM

Sarmat beat me to the answer.

pzack · 09-10-2022, 11:41 AM

Dear Sarmat89,

Thank you for the information and for responding. The code is Greek to me.

Here is an example of how my text file looks (I did not build this file):

cours [kur] n.m. definition........................................ .................................
.................................................. .................................................. .........
.................................................. .................................................. ...........

.................................................. .................................................. ..........
.................................................. .................................................. .............
coursier [kursje] n.m. definition........................................ .............................
.................................................. .................................................. .............
.................................................. .................................................. ..............

Thus, you have headword space [pronunciation] gender definition.
A definition can run over separate paragraphs, sometimes a number of paragraphs for a long definition, and it is, I think, the bracketed pronunciation with its headword before it that delimits the definitions.

If I understand tab-delimiting correctly, then the headword and brackets would have a tab, but I don't know where to place the tab or how to actually tab the text.

There are over 100,000 words with definitions (6,000 pages plus), so the program has to run through the whole file placing the tab somewhere. Or tabs?

If your example of code applies here, how would you plug the actual format into this code? That is, what represents what in your code, looking at my example?

I don't know what regex is or how it works. I have Notepad++ under Windows 11 and I have never formatted a text file, least of all built a tab-delimited file.

I assume that the problem in pyglossary is getting the headword with the brackets tabbed so that stardict can find the word.

Very cordially,
pz
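
A minimal Python sketch of what such a conversion might look like for the layout described above: a line that starts with a headword followed by a bracketed pronunciation opens a new entry, every following line is appended to that entry until the next such line, and each entry is written as headword, tab, definition on one line. The names `ENTRY_START` and `to_tab_delimited` and the sample lines are invented for illustration, and the whole thing rests on the assumption that the real file follows this pattern consistently.

```python
import re

# A new entry: non-bracket text (the headword), spaces, then "[...]"
# and whatever follows on that line.
ENTRY_START = re.compile(r"^([^\[\]\n]+?) +(\[[^\]\n]+\].*)$")


def to_tab_delimited(lines):
    """Group multi-line entries and return 'headword<TAB>definition' rows."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines between paragraphs are skipped
        match = ENTRY_START.match(line)
        if match:
            entries.append([match.group(1), match.group(2)])
        elif entries:
            # Continuation of the previous entry's definition.
            entries[-1][1] += " " + line
    return [f"{head}\t{definition}" for head, definition in entries]


# Hypothetical sample in the shape of the example above.
sample = [
    "cours [kur] n.m. first line of the definition",
    "second line of the same definition",
    "",
    "a further paragraph of the same definition",
    "coursier [kursje] n.m. definition",
]
rows = to_tab_delimited(sample)
for row in rows:
    print(row)
```

One known weakness of this sketch: a continuation line that itself begins with some text followed by a bracketed expression would be mistaken for a new headword, so the output would need checking against the real file.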

Markismus · 09-10-2022, 11:54 AM

Dear pzack,

This is not working. The example given reiterates the problem as you've described it, but it is not a sample. We have already given you multiple solutions to that problem, but they don't seem to help you.

Zip the text file, post it on a file hoster such as Dropbox or pCloud, and share the link. Maybe we can help you if you stop repeating the same information.
