Old 05-26-2022, 03:31 PM   #525
jhowell
Quote:
Originally Posted by xxyzz View Post
Hi, jhowell:

The `content` string from `convert_to_json_content()` contains footnote numbers, which cause spaCy to mark the words around each number as a single named entity. For example, in "Viktor Chebrikov.69 Gorbachev", 69 is the footnote number and "Gorbachev" is the first word of the next sentence.

Could you please split any paragraph that contains a footnote number into separate paragraphs (and also remove the number), if this feature doesn't require too much time to implement?
Luckily, the KFX format supports a semantic indicator that shows whether text is a footnote reference. It is set as long as the conversion process recognizes a properly formatted footnote.

Using that, I can add an option to replace the characters that make up a footnote link with spaces, so that in your example "Viktor_Chebrikov.69_Gorbachev" would become "Viktor_Chebrikov.___Gorbachev" (underscores stand in for spaces here).
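
To make the idea concrete, here is a rough Python sketch of the replacement step. It is not the actual plugin code: the function name `blank_footnote_refs` and the `(start, end)` span list are hypothetical stand-ins for whatever the KFX footnote indicator ends up providing. The point it illustrates is that overwriting the link characters with spaces, rather than deleting them, leaves the string length and every other character's offset unchanged.

```python
def blank_footnote_refs(content, ref_spans):
    """Overwrite each (start, end) span of `content` with spaces.

    `ref_spans` is a hypothetical list of half-open character ranges
    that have been flagged as footnote links.
    """
    chars = list(content)
    for start, end in ref_spans:
        for i in range(start, end):
            chars[i] = " "
    return "".join(chars)


text = "Viktor Chebrikov.69 Gorbachev"
start = text.index("69")  # in practice the span would come from the footnote indicator
masked = blank_footnote_refs(text, [(start, start + 2)])

print(repr(masked))
print(len(masked) == len(text))
```

Because the length is preserved, any positions spaCy reports for entities in the masked text should still line up with the original text.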

Hopefully that will be enough to solve the problem for you. If there are other cases of numbers appearing in text that cause problems for spaCy, you may need to come up with other ways to recognize them and filter them out.
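
As one example of such a fallback, a regular expression could catch short numbers glued directly onto punctuation that follows a letter. This is only a heuristic sketch with a made-up function name, `mask_trailing_numbers`, and an assumed limit of three digits. It will misfire on text such as "Fig.3", so treat it as a starting point rather than a reliable filter.

```python
import re

# A letter, then punctuation, then 1-3 digits, then whitespace or end of text.
# The lookbehind keeps the letter and punctuation out of the match.
_SUSPECT_NUMBER = re.compile(r"(?<=[A-Za-z][.!?,;:])\d{1,3}(?=\s|$)")


def mask_trailing_numbers(text):
    """Replace suspected footnote numbers with the same number of spaces."""
    return _SUSPECT_NUMBER.sub(lambda m: " " * len(m.group()), text)


print(repr(mask_trailing_numbers("Viktor Chebrikov.69 Gorbachev")))
print(repr(mask_trailing_numbers("It cost 3.50 in 1969.")))
```

Requiring a letter before the punctuation is what keeps decimals like 3.50 from being touched, and requiring punctuation at all leaves ordinary numbers such as years alone.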