BOW not showing attributes in sparse #1063

Open
ajdapretnar opened this issue Jun 17, 2024 · 1 comment
Labels
meal (This will take a day or two), need-discussion, text expert (Requires knowledge of Text add-on.)

Comments

ajdapretnar commented Jun 17, 2024

Describe the bug
This bug is a bit tricky to describe. There are two underlying issues:

  • Bag of Words can return features whose values are all 0.
  • Data Table does not show all-nan features when the data is sparse (probably it can't, or maybe word=nan).

To Reproduce
Say we have the following documents:

| Document | cat | dog | sleep |
|---|---|---|---|
| Cat sleeps. | 1 | 0 | 1 |
| Dog sleeps. | 0 | 1 | 1 |
| Cat sleeps, dog sleeps. | 1 | 1 | 2 |

When computing TF-IDF for "sleep", the IDF is 0, because "sleep" appears in all three documents and log10(3 / 3) = 0. There are three ways of computing IDF.

  1. math.log10(number_of_docs / number_of_docs_with_word) (how we do it)
  2. math.log10(1 + number_of_docs / number_of_docs_with_word) (how we do it with Smooth IDF)
  3. math.log10(number_of_docs / (number_of_docs_with_word + 1)) (how it is recommended)
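To make the difference concrete, here is a minimal sketch that evaluates the three variants above on the example documents. It is not the add-on's code: the function names and the `doc_freq` dictionary are made up for illustration, and the document frequencies are read off the example (cat and dog each occur in 2 of 3 documents, sleep in all 3).

```python
import math

# Document frequencies read off the example above (illustrative only).
n_docs = 3
doc_freq = {"cat": 2, "dog": 2, "sleep": 3}


def idf_plain(n, df):
    # Variant 1: log10(n / df)
    return math.log10(n / df)


def idf_smooth(n, df):
    # Variant 2: log10(1 + n / df)
    return math.log10(1 + n / df)


def idf_df_plus_one(n, df):
    # Variant 3: log10(n / (df + 1))
    return math.log10(n / (df + 1))


# One (variant 1, variant 2, variant 3) tuple per word.
idf_table = {
    word: (
        idf_plain(n_docs, df),
        idf_smooth(n_docs, df),
        idf_df_plus_one(n_docs, df),
    )
    for word, df in doc_freq.items()
}

for word, (plain, smooth, shifted) in idf_table.items():
    print(f"{word:>5}  v1={plain:.3f}  v2={smooth:.3f}  v3={shifted:.3f}")
```

On this corpus, variant 1 gives 0 for "sleep", variant 2 gives a positive value for every word, and variant 3 gives 0 for "cat" and "dog" (since df + 1 = 3) and a negative value for "sleep".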

How scikit-learn does it: idf(t) = log [ n / df(t) ] + 1, or idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1 if smooth_idf=True.
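For comparison, the two scikit-learn-style formulas can be sketched in plain Python. This does not call scikit-learn. The helper name `idf_sklearn_style` is hypothetical, and the sketch assumes `log` is the natural logarithm.

```python
import math


def idf_sklearn_style(n, df, smooth_idf=False):
    """Evaluate the formulas quoted above (hypothetical helper, natural log assumed)."""
    if smooth_idf:
        # idf(t) = log[(1 + n) / (1 + df(t))] + 1
        return math.log((1 + n) / (1 + df)) + 1
    # idf(t) = log[n / df(t)] + 1
    return math.log(n / df) + 1


# "sleep" occurs in all 3 of the 3 example documents.
sleep_plain = idf_sklearn_style(3, 3)
sleep_smooth = idf_sklearn_style(3, 3, smooth_idf=True)
# "cat" occurs in 2 of the 3 example documents.
cat_plain = idf_sklearn_style(3, 2)

print(sleep_plain, sleep_smooth, cat_plain)
```

Because of the trailing + 1, a word that appears in every document gets an IDF of 1 rather than 0 under both formulas, so its TF-IDF value is not zeroed out.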

To reproduce: Create Corpus (with the above documents) → Bag of Words (TF-IDF) → Data Table.

Expected behavior

  1. Resolve the nan result in Bag of Words. Why do we not use scikit-learn? Should we reconsider how we compute IDF?
  2. Resolve the display of nan in sparse.

Orange version:
3.37.0

Text add-on version:
1.7.0

ajdapretnar added the meal (This will take a day or two), text expert (Requires knowledge of Text add-on.), and need-discussion labels on Mar 26, 2025
@ajdapretnar

Related to #1069.
