Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2992: Replace normalize-strings #3108

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

hannaseithe
Copy link
Contributor

@hannaseithe hannaseithe commented Feb 19, 2025

Short Description

The dependency on the library 'normalize-string' has been replaced by the ES6 String.normalize() method. This has been implemented following the approach of digitalfabrik/entitlementcard/#1782

Proposed Changes

  • the dependency on the 'normalize-string' library was deleted
  • implementation of normalizeString follows the implementation of digitalfabrik/entitlementcard/#1782
  • a new file normalizeString.ts was created from where const normalizeString (and default) is exported (also exported from shared) + export from search.ts was deleted
  • tests were moved to seperate file

Side Effects

  • any functionality that relies on string normalization, especially search

Testing

  • check if search functionality still works:
    • "munchen" finds "München"
    • " test " finds "Test"
    • "straße" finds "Straße"
    • "strasse" finds "Straße"

Resolved Issues

Fixes: #2992


@steffenkleinle
Copy link
Member

I think ß should definitely continue to work 👍 Could you try to fix that such that ß just gets normalized as well?

@hannaseithe
Copy link
Contributor Author

hannaseithe commented Feb 19, 2025

#2992 (comment)

[Continuing the discussion here]
@hauf-toni Technically this would not be a problem, I can just make an exception for ß. Its not due to the actual normalization but because we filter out certain characters now beforehand

@hannaseithe hannaseithe self-assigned this Feb 19, 2025
@hannaseithe hannaseithe marked this pull request as ready for review February 19, 2025 14:15
@hannaseithe hannaseithe marked this pull request as draft February 19, 2025 14:49
@hannaseithe
Copy link
Contributor Author

hannaseithe commented Feb 24, 2025

#2992 (comment)

[Continuing the discussion here] @hauf-toni Technically this would not be a problem, I can just make an exception for ß. Its not due to the actual normalization but because we filter out certain characters now beforehand

So I have done some more research into the Eszett (ß) situation. This is what I have come up with.

  1. Before the PR
  • ß was not filtered out
  • what finds "Straße"
    • straße (all 6 letters marked bold)
  • what doesn't find "Straße"
    • strasse/strae/strase
  1. After the change as also implimented in EC
  • ß filtered out
  • what finds "Straße"
    • straße/strae (only the five letters Straß are bold, e is normal)
  • what doesn't find "Straße"
    • strasse/strase
  1. After adapting the EC changes to not filter out ß (current commits), but still use string.normalize()
  • as in 1.)
  1. My suggestion: Option to replace ß bei ss (internally only)
  • ß not filtered but replaced by 'ss'
  • what would find "Straße" (all six letters get bolded)
    • straße/strasse
  • what would not find "Straße"
    • strae/strase

PS.: string.normalize() does not replace ß with ss in any version of normalization (NFC, NFD, NFKC, NFKD), but this can be done by hand

@steffenkleinle
Copy link
Member

#2992 (comment)
[Continuing the discussion here] @hauf-toni Technically this would not be a problem, I can just make an exception for ß. Its not due to the actual normalization but because we filter out certain characters now beforehand

So I have done some more research into the Eszett (ß) situation. This is what I have come up with.

1. Before the PR


* ß was not filtered out

* what finds "Straße"
  
  * straße  (all 6 letters marked bold)

* what doesn't find "Straße"
  
  * strasse/strae/strase


2. After the change as also implimented in EC


* ß filtered out

* what finds "Straße"
  
  * straße/strae (only the five letters Straß are bold, e is normal)

* what doesn't find "Straße"
  
  * strasse/strase


3. After adapting the EC changes to not filter out ß (current commits), but still use string.normalize()


* as in 1.)


4. **My suggestion**: Option to replace ß bei ss (internally only)


* ß not filtered but replaced by 'ss'

* what would find "Straße" (all six letters get bolded)
  
  * straße/strasse

* what would not find "Straße"
  
  * strae/strase

PS.: string.normalize() does not replace ß with ss in any version of normalization (NFC, NFD, NFKC, NFKD), but this can be done by hand

Sounds great, I like your suggestion. Lets go with that 🔍

@steffenkleinle
Copy link
Member

@hannaseithe do you want to create an issue in the entitlementcard repository for this problem and with your findings? Thanks :)

@steffenkleinle steffenkleinle changed the title 2992: Replaced dependency 'normalize-string' with string.normalize() 2992: Replac normalize-strings Mar 6, 2025
@steffenkleinle steffenkleinle changed the title 2992: Replac normalize-strings 2992: Replace normalize-strings Mar 6, 2025
@steffenkleinle
Copy link
Member

steffenkleinle commented Mar 10, 2025

I think a solution making use of the default findChunk implementation would be more elegant/less error prone:

import { findChunks } from 'highlight-words-core'

const findNormalizedChunks = (props: FindChunks): Chunk[] => {
  const chunks: Chunk[] = findChunks(props)
  return chunks.map(chunk => {
    const match = props.textToHighlight.slice(chunk.start, chunk.end)
    const sanitationAdditionalChars = match.split('ß').length - 1
    return { start: chunk.start, end: chunk.end - sanitationAdditionalChars }
  })
}

Also, the current solution crashes for me when entering ).

@steffenkleinle steffenkleinle force-pushed the 2992--Replace-normalize-strings-dependency branch from bb034a6 to fa6db2d Compare March 10, 2025 12:32
@hannaseithe hannaseithe removed their assignment Mar 10, 2025
@@ -79,7 +79,6 @@
"react-native-calendars": "^1.1306.0",
"react-native-gesture-handler": "^2.19.0",
"react-native-get-random-values": "^1.11.0",
"react-native-highlight-words": "^1.0.1",
Copy link
Member

@steffenkleinle steffenkleinle Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unmaintained and does not support custom findChunks method, so I wrote a custom Highlighter component

@steffenkleinle steffenkleinle force-pushed the 2992--Replace-normalize-strings-dependency branch from 535f3a4 to cb51398 Compare March 11, 2025 09:51
@steffenkleinle steffenkleinle force-pushed the 2992--Replace-normalize-strings-dependency branch from cb51398 to 49a5832 Compare March 11, 2025 09:57
@steffenkleinle steffenkleinle marked this pull request as ready for review March 19, 2025 07:47
Copy link
Contributor

@bahaaTuffaha bahaaTuffaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work ✨
Unfortunately there is one issue left.. if you searched in Arabic (after you switch the language) it will return the whole list due to replacing non-Ascii letters and Arabic uses non ascii as far as I know.

You can test it by typing اللغة

@steffenkleinle
Copy link
Member

Great work ✨ Unfortunately there is one issue left.. if you searched in Arabic (after you switch the language) it will return the whole list due to replacing non-Ascii letters and Arabic uses non ascii as far as I know.

You can test it by typing اللغة

Very good catch, thanks for testing thoroughly. Should be fixed now, could you please check again?

@bahaaTuffaha
Copy link
Contributor

bahaaTuffaha commented Mar 28, 2025

Very good catch, thanks for testing thoroughly. Should be fixed now, could you please check again?

"strasse" now doesn't show same results as "straße" maybe writing a condition to check the language whether it's Deutsch?

@steffenkleinle
Copy link
Member

steffenkleinle commented Mar 28, 2025

Very good catch, thanks for testing thoroughly. Should be fixed now, could you please check again?

"strasse" now doesn't show same results as "straße" maybe writing a condition to check the language whether it's Deutsch?

Yes. I think that check is not necessary, this normalization should always be done imo, no matter the user language.

@bahaaTuffaha
Copy link
Contributor

Very good catch, thanks for testing thoroughly. Should be fixed now, could you please check again?

"strasse" now doesn't show same results as "straße" maybe writing a condition to check the language whether it's Deutsch?

Yes. I think that check is not necessary, this normalization should always be done imo, no matter the user language.

Keep normalization but change the regex for Deutsch to nonAsciiRegex = /[^\x00-\x7F\xDF]/g and the rest nonAsciiRegex = /[^\p{Letter}|\p{Number}]/gu .

@steffenkleinle
Copy link
Member

Keep normalization but change the regex for Deutsch to nonAsciiRegex = /[^\x00-\x7F\xDF]/g and the rest nonAsciiRegex = /[^\p{Letter}|\p{Number}]/gu .

Why? What benefit does that serve? In what usecase would we not want to use the same regex?

@bahaaTuffaha
Copy link
Contributor

Keep normalization but change the regex for Deutsch to nonAsciiRegex = /[^\x00-\x7F\xDF]/g and the rest nonAsciiRegex = /[^\p{Letter}|\p{Number}]/gu .

Why? What benefit does that serve? In what usecase would we not want to use the same regex?

My bad I didn't pull your latest changes. ("strasse" works as expected)
image

It's not highlighted completely also noticed some words are highlighted but not relevant to the search.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace normalize-strings dependency
3 participants