conversion pyglossary pdf - Page 3

Sarmat89 · 09-11-2022, 05:19 PM

Quote:

Originally Posted by pzack

'\' is not recognized as an internal command

Use /\^|\^|/ instead of /\|\|/

pzack · 09-11-2022, 06:35 PM

Good evening M. Markismus,

I went into Linux using perl and was able to execute the four lines of code that you provided. However, the files are empty. Terminal showed the text scrolling when I was creating output1.txt but this file is empty. There was no text scrolling in the terminal window when the other files were created. And the other files are empty.

I think that I'd like to try this myself with your help, given the time that I have already spent on it; could you advise me what went wrong? With the code provided it seems straightforward; I am following your instructions.

Perl seemed to be working as no errors were being reported. I think I will stay under linux with this.

ps

pzack · 09-12-2022, 10:32 AM

Hello, M. Markismus,

May I trouble you to clarify some things for me regarding your message containing the perl code for creating the csv file from my text file?

Are the four lines of perl code that you placed in the highlighted box the ones to be used to create the csv file from my text file, even though I don't see in the box some of the substitutions that are listed above it? The csv file that I created using this code was empty, as were the output1, output2 and output3 txt files.

I am not sure what I am to do with the xml and binary zip files created by another line of code that you have highlighted in a box and which is supposed to correct a certain character problem in the text.

What file am I to use in the pyglossary conversion?

cordially,
pz

Markismus · 09-12-2022, 10:55 AM

The zipped files contain the Stardict dictionary of your snippet of the dictionary in binary form. The xml file contains the same Stardict dictionary in xml format.

If you'd rather use pyGlossary for the conversion of csv format to whatever format, just do that.

If you don't specify the input correctly, you won't get an output. That is probably what is happening.
Why don't you try it with the files that I provided? Keep trying, you'll get there.

If all else fails, you could just send the whole file, couldn't you? As suggested in my first post?

pzack · 09-12-2022, 01:06 PM

Good evening M. Markismus,

Thank you for your kind response and for taking the time to make things a little clearer for me.

I made three output txt files (output1, output2 and output3) and finally output4.csv.

I entered one line at a time in the linux terminal:

perl -pe 's/\n\n+/\|\|/sg' grandl.txt output1.txt
perl -pe 's/\n/ /sg' output1.txt output2.txt
perl -pe 's/\|\|/\n/sg' output2.txt output3.txt
perl -pe 's/^(\S+)/$1 /sg' output3.txt output4.csv

Where is my error (or errors)?

Cordially,
pz

Sarmat89 · 09-12-2022, 01:48 PM

The correct syntax is <grandl.txt >output1.txt

pzack · 09-12-2022, 02:03 PM

Dear Sarmat89,

Thank you much for responding and for pointing this out to me; I completely missed this in the original perl code lines.

I will try this shortly!

Very cordially,
pz

pzack · 09-12-2022, 02:23 PM

Dear Sarmat89,

I just did what you suggested and I got this in linux terminal:

k**~**Documents**perl -pe 's/\n\n+/\|\|/sg' <grand.txt> output1.txt
bash: output1.txt: Is a directory

Now output1.txt is seen as a directory.

The txt files are in my Documents directory and I did a cd Documents so that I am in this directory when I entered the code.

Can you pinpoint the problem?

cordially,
pz

Sarmat89 · 09-12-2022, 04:00 PM

Remove output1?

pzack · 09-12-2022, 06:59 PM

Hello Sarmat89,

Unfortunately, that didn't work.

I am hoping that M. Markismus will respond and that I will be able to build the csv file to put in pyglossary.

It seems that you are not sure where the problem is with the perl codes that I entered in the terminal? But I appreciate your efforts to help me.

Cordially,
pz

pzack · 09-12-2022, 07:06 PM

Hello M. Markismus,

M. Sarmat89 pointed out to me that I had left off the "<>" signs in the lines of perl code.

However, this still produced a problem as I received a bash message saying that "output1.txt is a directory".

When it is convenient for you, would you kindly give me some ideas where the problem might be? You created a csv file with your 4 lines of perl code and I don't understand why I can't duplicate this.

cordially,
pz

Markismus · 09-13-2022, 02:32 AM

@pzack Why don't you start over in a new directory with just the snippet provided by my post and see whether you can generate each step.

You don't have a problem with the conversion, you are trying to get your feet under you in the Linux shell. It's quite powerful and most probably you've created a directory called output1.txt in one of your earlier tries. Try Google for "Linux, shell, command line, remove directory, tutorial"?
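A minimal sketch of how this can be checked from the shell, assuming the stray directory really is named output1.txt and sits in the current directory. The scratch directory `demo` is hypothetical and is only there to reproduce the situation safely.

```shell
# Reproduce the situation in a hypothetical scratch directory.
mkdir -p demo/output1.txt
cd demo

# A leading "d" in the first column of the listing means it is a directory.
ls -ld output1.txt

# Remove it only if it is a directory; rmdir refuses if it is not empty.
if [ -d output1.txt ]; then
    rmdir output1.txt
fi

# Now a redirect can create a regular file with that name.
printf 'test\n' > output1.txt
cd ..
```

Because `rmdir` only removes empty directories, nothing is lost if the name turns out to hold files.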

You could also try getting an account at pCloud or whatever and upload the pdf-file and post it in a personal message to me.

I am still guessing at why you're still trying to do this after 4? days? So I've decided that you like puzzling and the power of possible conversion. I can appreciate that. Success!

pzack · 09-13-2022, 10:25 AM

Good morning M. Markismus,

Thank you for your response and please alert me when your patience wears thin with all this.

You hit the nail on the head when you said that I like puzzling this thing out. My fascination with computers and programming languages started with Dbase2 in 1985, when I taught myself to write a few business programs in Dbase code. Since then, from time to time (when I have the time!) I latch onto a problem. Oh, the frustrating hours trying to install Arch Linux before the arrival on the scene of the calamares installer! All that work...and I don't use Arch at all. It's an intellectual diversion for me and I find it a little stimulating for the gray matter. I never heard of perl until I met you, or of pyglossary until I started to mess with this stuff.

I take it, since you didn't mention it, that the perl code that I entered in the terminal is correct? Sarmat89, who appears to be following this thread, pointed out to me the missing "<>" signs.

My file manager shows no directory named output1, so how could I have a directory with this name? If it does not show up in the file manager, how could it be there?

One thing though, and perhaps this is off the wall: the snippet that you have of the full text does not contain all the very lengthy introductory material that is at the beginning of the dictionary. I don't think that this would be an issue for the conversion, as this material is not important for the word look-up. I imagine the conversion would just ignore this material. Could this be posing a problem?

It was when I put in the "<>" signs for the first text file in each line of your code that I got the "this is a directory" error. Without the <> the code executed, but, as you already know, the csv file created was empty.

I will try converting the snippet in a new directory.

I will be 73 years old near the end of September and I'd like to get this dictionary functioning under stardict in koreader as a wee little gift to myself.

Need your help, though! You're the only one who seems to know this stuff.

very cordially,
pz

pzack · 09-13-2022, 11:00 AM

Hello, M. Markismus,

I put the full txt file in another directory and was able to build the csv file. At last. Thanks to your help and your expertise.

I will put the csv file in pyglossary later in the day when I have the time. I'll be sick if it doesn't work.

I take it that this csv file solves the tab-delimiting issue? Which started all this.

You mentioned the stardict app to convert the file to stardict. Is there a difference from pyglossary, and do you prefer one over the other? Or is there something else to convert to stardict?

Hope the csv file converts correctly.

Cordially,
pz

Markismus · 09-13-2022, 03:37 PM

Cheers from Holland!

So what you have is probably correctly formatted comma-separated-values, which are in part nonsense. Maybe as long as you don't search for those keywords, as long as those keywords don't appear multiple times, as long as no other restriction is hit upon with....everything will be fine...

If pyGlossary fails, try the tool from Stardict itself: It will have the least restrictions.

Like I said in the long post, there is a problem. The separator <EOL><EOL><EOL> (two empty lines) is used not only between articles, but also between subparagraphs within articles. There also now seems to be a problem with the start of the file, which contains text.

If you want to correct for the start of the text:
That can be done manually quite simply.

If you want to correct for the subparagraphs:
Code a loop in which consecutive keywords are compared:
- They should be alphabetically in the correct order.

This is a little complex, because you'll have to compare multiple values to assess not only that a keyword follows on the previous one, but also that it allows the next ones to follow upon it.
So if a paragraph in an article starting with "ab" is followed by one starting with "d", that is only correct if the next one follows upon it, too. So e.g. "ab", "d", "ae" is wrong and the article starting with "d" should be reclassified as a paragraph of the previous article.
So how do you manage an article that has multiple paragraphs? How many keywords of articles do you have to compare to filter out all paragraphs? Well, usually you do something smart whilst running the code. You find an extra criterion or your criterion turns out to be more robust than you thought.
