MobileRead Forums > E-Book Readers > Onyx Boox

Old 10-21-2021, 05:36 AM   #1
Markismus
Guru
Posts: 955
Karma: 149907
Join Date: Jul 2013
Location: Rotterdam
Device: HiSenseA5ProCC, Cracked OnyxNotePro, Note5, Kobo Glo, Aura
Character set used for exported notes

Does anyone know, or know how to determine, the character set Onyx Boox uses for exported notes?

I've written a Perl script to analyse the exported notes and convert them to my layout in LaTeX. However, I'm having a spot of trouble identifying the original character set.
If I open an export in an editor such as Sublime Text, it is displayed as a hexadecimal dump.
Code:
424f 4f58 2052 6561 6469 6e67 204e 6f74
6573 c2a0 7cc2 a03c 3c47 4120 3236 2e20
2d20 4d65 7461 7068 7973 6973 6368 6520
416e 6661 6e67 7367 72c3 bc6e 6465 2064
6572 204c 6f67 696b 2069 6d20 4175 7367
616e 6720 766f 6e20 4c65 6962 6e69 7a20
2853 756d 6d65 7220 7365 6d65 7374 6572
2031 3932 3829 2c20 6564 2e20 4b2e 4865
6c64 2c20 3139 3738 2c20 326e 6420 6564
6e20 3139 3930 2c20 5649 2c20 3239 3270
3e3e 0a4e 6f74 6550 726f 0a0a 5469 6d65
efbc 9a32 3032 302d 3038 2d32 3720 3233
3a32 340a e380 904f 7269 6769 6e61 6c20
5465 7874 e380 9167 6c69 6564 6572 6e64
6520 4175 66ef bfbe 6465 636b 756e 6700
0ae3 8090 416e 6e6f 7461 7469 6f6e 73e3
If I reopen it with encoding UTF-8, it is almost correct, but not entirely. Some encoding troubles remain, such as a 0x00 byte, whitespace characters that are not ordinary spaces, and odd choices of characters for brackets:
Code:
BOOX Reading Notes*|*<<GA 26. - Metaphysische Anfangsgründe der Logik im Ausgang von Leibniz (Summer semester 1928), ed. K.Held, 1978, 2nd edn 1990, VI, 292p>>
NotePro

Time:2020-08-27 23:24
【Original Text】gliedernde Auf￾deckung
For example, the spaces within 'BOOX Reading Notes' are normal spaces, but the ones around the '|' between 'Notes' and '<<' are different. In Perl you can simply write '\s' to match all whitespace and get on with it, but other characters pose more trouble down the line.
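One way to find out exactly which characters an export contains is to decode it as UTF-8 and list every code point that is not plain printable ASCII, together with its Unicode name. Below is a minimal sketch in Python (used purely for illustration; the same idea carries over to Perl). The function names `survey_codepoints` and `describe` are made up for this example, and the sample bytes are copied from the hex dump above.

```python
import unicodedata
from collections import Counter


def survey_codepoints(data: bytes) -> Counter:
    """Count every code point that is not printable ASCII or a newline."""
    text = data.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return Counter(ch for ch in text if ch != "\n" and not (" " <= ch <= "~"))


def describe(ch: str) -> str:
    # unicodedata.name() has no name for control characters or
    # noncharacters, so a placeholder is supplied as the default.
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<no name>')}"


# Fragments copied from the hex dump in this post.
sample = bytes.fromhex(
    "4e6f746573c2a07cc2a03c3c"                # "Notes", c2a0, "|", c2a0, "<<"
    "54696d65efbc9a"                          # "Time", efbc9a
    "e380904f726967696e616c2054657874e38091"  # e38090, "Original Text", e38091
    "417566efbfbe6465636b756e6700"            # "Auf", efbfbe, "deckung", 00
)

for ch, count in sorted(survey_codepoints(sample).items()):
    print(count, describe(ch))
```

Running this over a whole export file (read in binary mode) gives a short inventory of everything unusual in it, which is easier to reason about than a hex dump.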

Last edited by Markismus; 10-21-2021 at 11:38 AM.
Old 10-21-2021, 07:06 AM   #2
Renate
Onyx-maniac
Posts: 3,918
Karma: 17236157
Join Date: Feb 2012
Device: Nook NST, Glow2, 3, 4, '21, Kobo Aura2, Poke3, Poke5, Go6
Quote:
Originally Posted by Markismus
Whitespace characters that are not spaces and odd choices for characters for brackets...
It is normal UTF-8. The choice of characters is a bit idiosyncratic.

U+00A0 No-Break Space
U+3010 Left Black Lenticular Bracket
U+3011 Right Black Lenticular Bracket

Edit: Don't forget

U+FF1A Fullwidth Colon
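If the goal is to get rid of these four characters before further processing, a small translation table is probably enough. This is only a sketch: the ASCII stand-ins below are one possible choice, not something the export format prescribes, and `to_plain` is a made-up name.

```python
# Hypothetical mapping: the replacements on the right are an assumption.
REPLACEMENTS = str.maketrans({
    "\u00a0": " ",  # NO-BREAK SPACE
    "\u3010": "[",  # LEFT BLACK LENTICULAR BRACKET
    "\u3011": "]",  # RIGHT BLACK LENTICULAR BRACKET
    "\uff1a": ":",  # FULLWIDTH COLON
})


def to_plain(text: str) -> str:
    """Replace the four unusual characters with ASCII stand-ins."""
    return text.translate(REPLACEMENTS)


print(to_plain("Time\uff1a2020-08-27 23:24"))
print(to_plain("\u3010Original Text\u3011gliedernde"))
```

For LaTeX output it might be preferable to map U+00A0 to `~` instead of a plain space, since `~` is LaTeX's own non-breaking space.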

Last edited by Renate; 10-21-2021 at 03:48 PM.
Old 10-21-2021, 08:49 AM   #3
Markismus
At least some of the characters aren't:
Code:
UTF-8 "\xEF\xBF\xBE" does not map to Unicode at ....
Apparently, it's a codepoint for U+BFBE. But that's not helping at all.
It appears out of nowhere in the middle of a few words. Rather odd. E.g. in the piece of code in the first post it's between Auf and deckung.

EDIT: Just checked and it appears within the citations from two different books. One book is scanned and OCR'ed, the other is a PDF that contains only text. So it seems that the source is Onyx Boox's export code.

Last edited by Markismus; 10-21-2021 at 11:42 AM.
Old 10-21-2021, 10:21 AM   #4
Renate
Quote:
Originally Posted by Markismus
Apparently, it's a codepoint for U+BFBE.
No, that's a UTF-8-BOM, it marks a file as UTF-8. It should only be at the head of a file. If at all.
Old 10-21-2021, 11:33 AM   #5
Markismus
It's indeed really close to the UTF-8 byte order mark. Wikipedia's BOM article says that one is EF BB BF.

Still, all other errors are gone now that I open and close all files with explicitly specified encodings, both in Perl and in LaTeX. So unless I run into other problems, I'll just have to filter the codepoints out.
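Filtering could look roughly like the sketch below, shown in Python for illustration. It assumes the only unwanted code points are the one encoded as EF BF BE (which decodes to U+FFFE) and the 0x00 byte from the first post. Deleting them outright, rather than substituting a space or hyphen, is also an assumption: in the examples in this thread the stray character sits at what looks like a hyphenation point inside a word, so deletion seems the safer guess. `strip_unwanted` is a made-up name.

```python
# Code points to delete; mapping to None removes them in str.translate().
UNWANTED = {
    0xFFFE: None,  # noncharacter, stored in the export as bytes EF BF BE
    0x0000: None,  # NUL byte seen at the end of the quoted text
}


def strip_unwanted(text: str) -> str:
    """Remove the stray code points from already-decoded text."""
    return text.translate(UNWANTED)


# Bytes as they appear in the export, decoded explicitly as UTF-8.
raw = b"gliedernde Auf\xef\xbf\xbedeckung\x00\n"
print(strip_unwanted(raw.decode("utf-8")), end="")
```

The filter works on decoded text, so the file has to be read with an explicit UTF-8 encoding first, as described above.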

Last edited by Markismus; 10-21-2021 at 11:43 AM.
Old 10-21-2021, 12:48 PM   #6
Renate
Now you're getting me confused.

Normal UTF-8 BOM:
0xEF, 0xBB, 0xBF -> U+FEFF (a valid Unicode character)

0xEF, 0xBF, 0xBE -> U+FFFE (not a Unicode character at all: it is a reserved noncharacter, and it is what a byte-reversed UTF-16 BOM looks like).
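The two mappings above can be checked by hand: a 3-byte UTF-8 sequence has the bit layout 1110xxxx 10xxxxxx 10xxxxxx, and the code point is the 16 payload bits strung together. A minimal Python sketch of that arithmetic follows (`decode_three_byte` is a made-up helper; it deliberately skips the overlong and surrogate checks that a real decoder performs):

```python
def decode_three_byte(b0: int, b1: int, b2: int) -> int:
    """Decode one 3-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)."""
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a 3-byte UTF-8 sequence")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


print(f"U+{decode_three_byte(0xEF, 0xBB, 0xBF):04X}")  # U+FEFF
print(f"U+{decode_three_byte(0xEF, 0xBF, 0xBE):04X}")  # U+FFFE
```

This also suggests where "U+BFBE" came from: presumably the last two bytes, BF BE, were read as if they were the code point itself, without stripping the marker bits.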
Old 10-21-2021, 04:23 PM   #7
Markismus
I'd rather have clear insight into things. However, given the current topic, it's also nice to be confused together.
Anyway, this is what Perl generated while writing to a UTF-8 encoded text file:
Code:
...
Wir versuchen eine philosophische Logik und damit eine Ein\xEF\xBF\xBEführung in das Philosophieren
...
All times are GMT -4.