Hindi TN 2.0 - Accuracy Enhancements & New Telephone Class Integration #294
base: staging_hi_tn
Conversation
* Future Implementations for classes - Measure, Money, and Date (NVIDIA#258)
* Future Implementations for classes - Measure, Money, and Date Signed-off-by: Namrata Gachchi <[email protected]>
* Resolved the conflicts with mm_yyyy and date ranges and added the previously removed failing test cases. Signed-off-by: Namrata Gachchi <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* removed the unused empty string implementation Signed-off-by: Namrata Gachchi <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* minor fixes for the tagger files Signed-off-by: Namrata Gachchi <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* reformatted decimal final graph Signed-off-by: Namrata Gachchi <[email protected]>
* incorporated the suggestion for decimal graph Signed-off-by: Namrata Gachchi <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Century implementations Signed-off-by: Namrata Gachchi <[email protected]>
* Working on the yyyy format for the date class Signed-off-by: Namrata Gachchi <[email protected]>
* reverted yyyy code Signed-off-by: Namrata Gachchi <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* working on future implementations Signed-off-by: Namrata Gachchi <[email protected]>
* working on improving the date class accuracy Signed-off-by: Namrata Gachchi <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* added year prefix for the date class Signed-off-by: Namrata Gachchi <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* working on the comma cases for date class Signed-off-by: Namrata Gachchi <[email protected]>
* minor fixes Signed-off-by: Namrata Gachchi <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* implemented mixed fractions Signed-off-by: Namrata Gachchi <[email protected]>
* rectified the test case Signed-off-by: Namrata Gachchi <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* working on quarterly measurements Signed-off-by: Namrata Gachchi <[email protected]>
* reformatted the prefixes and suffixes for date tagger class Signed-off-by: Namrata Gachchi <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* replaced text tag with era tag for the date class Signed-off-by: Namrata Gachchi <[email protected]>
* Removed the text tag reference from date class verbalizer Signed-off-by: Namrata Gachchi <[email protected]>
---------
Signed-off-by: Namrata Gachchi <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* update jenkins cache Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Potential fix for code scanning alert no. 821: Unused local variable Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Signed-off-by: Mariana <[email protected]>
---------
Signed-off-by: Namrata Gachchi <[email protected]>
Signed-off-by: Mariana Graterol Fuenmayor <[email protected]>
Signed-off-by: Mariana <[email protected]>
Co-authored-by: Namrata Gachchi <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
nemo_text_processing/text_normalization/hi/taggers/telephone.py
nemo_text_processing/text_normalization/hi/verbalizers/fraction.py
@mgrafu, could you please review this PR?
…e telephone class Signed-off-by: Namrata Gachchi <[email protected]>
nemo_text_processing/text_normalization/hi/data/telephone/STD_codes.tsv
@@ -0,0 +1,8 @@
२ दो
is this mapping any different than cardinals (lines 1-4)?
These refer to the validation of landline numbers starting with specific digits within India.
but is the mapping any different from cardinals? if not, please import from cardinals and restrict the accepted inputs
I have removed the additional mappings from the TSV file and integrated them into the existing TSV files, as you suggested.
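The reuse requested in this thread can be sketched in plain Python. This is only an illustration and not the PR's code: the actual grammars are built as FSTs, and the names `CARDINAL_DIGITS`, `restrict_inputs`, and the particular set of allowed leading digits are assumptions made for the example.

```python
# Hypothetical stand-in for the shared cardinal digit mapping
# (in the repository this would come from the existing cardinal TSV data).
CARDINAL_DIGITS = {
    "०": "शून्य", "१": "एक", "२": "दो", "३": "तीन", "४": "चार",
    "५": "पाँच", "६": "छह", "७": "सात", "८": "आठ", "९": "नौ",
}


def restrict_inputs(mapping, allowed):
    """Return the sub-mapping whose keys are in `allowed`, keeping the original order."""
    return {key: value for key, value in mapping.items() if key in allowed}


# Assumed for illustration only: the digits a landline number may start with.
LANDLINE_START_DIGITS = {"२", "३", "४", "६"}

# The restricted mapping is derived from the cardinals; nothing is duplicated in a new file.
landline_start = restrict_inputs(CARDINAL_DIGITS, LANDLINE_START_DIGITS)
```

The point of the design is that the digit-to-word pairs live in one place, and each class only declares which inputs it accepts.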
@@ -0,0 +1,8 @@
६ छह
is this mapping any different than cardinals (lines 1-4)?
These refer to the validation of mobile numbers starting with specific digits within India.
but is the mapping any different from cardinals? if not, please import from cardinals and restrict the accepted inputs
I have removed the additional mappings from the TSV file and integrated them into the existing TSV files, as you suggested.
@@ -0,0 +1,20 @@
० शून्य
is this mapping any different than cardinals?
For Hindi digits, no, it's actually the same as cardinal single digits. But for English digits, yes, it's just a common resource for telephone class.
can you please use cardinal for Hindi digits and filter the inputs you need, and only add a file for English digits in that case? let's avoid repetition
yes sure, I've updated the same
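A rough sketch of the split discussed in this thread: Hindi digits come from the cardinal mapping, and only the English digits need an extra resource. The names `HINDI_DIGITS`, `ENGLISH_DIGITS`, and `verbalize_digits` are hypothetical, and the sketch assumes English digits are read out with the same Hindi words, which the thread does not confirm.

```python
# Hypothetical stand-in for the existing cardinal single-digit mapping.
HINDI_DIGITS = {
    "०": "शून्य", "१": "एक", "२": "दो", "३": "तीन", "४": "चार",
    "५": "पाँच", "६": "छह", "७": "सात", "८": "आठ", "९": "नौ",
}

# The only telephone-specific resource: ASCII digits mapped to the same words.
# (Assumption: in the PR this would be a small separate TSV file.)
ENGLISH_DIGITS = {str(i): word for i, word in enumerate(HINDI_DIGITS.values())}

TELEPHONE_DIGITS = {**HINDI_DIGITS, **ENGLISH_DIGITS}


def verbalize_digits(number):
    """Read a number digit by digit, ignoring spaces and hyphens."""
    digits = [ch for ch in number if ch not in " -"]
    return " ".join(TELEPHONE_DIGITS[ch] for ch in digits)
```

With this layout, `verbalize_digits("98")` and `verbalize_digits("९८")` produce the same reading, since both scripts resolve to one set of words.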
@@ -0,0 +1,100 @@
० एक
is this mapping any different than cardinals (lines 1-4)?
Yes, actually 0.75 is verbalized with the quarter-less form (पौने), i.e. as a quarter less than one, so zero is mapped to one in paune_mappings.
we don't want a data file that is 100 lines -- please reuse cardinal when applicable or reapply with rules elsewhere
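One way to read the "reapply with rules" suggestion, sketched in plain Python rather than as an FST: if paune_mappings pairs each integer with the next cardinal (as the `० एक` row suggests), the 100-line table can be replaced by a lookup on `n + 1`. The names `CARDINALS` and `verbalize_paune` are hypothetical, and the cardinal table is truncated to a few entries for the example.

```python
# Hypothetical, truncated stand-in for the existing cardinal mapping.
CARDINALS = {0: "शून्य", 1: "एक", 2: "दो", 3: "तीन", 4: "चार", 5: "पाँच"}


def verbalize_paune(integer_part):
    """Verbalize `integer_part`.75 as 'पौने' followed by the next cardinal.

    The rule n -> n + 1 replaces a data file that lists every pair explicitly.
    """
    return f"पौने {CARDINALS[integer_part + 1]}"
```

For example, `verbalize_paune(0)` gives "पौने एक" and `verbalize_paune(1)` gives "पौने दो", both built from the cardinal data alone.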
nemo_text_processing/text_normalization/hi/taggers/telephone.py
    def __init__(self):
        super().__init__(name="telephone", kind="classify")

        mobile_number = generate_mobile(["नंबर", "मोबाइल", "फोन", "कॉल"])
can these inputs be part of a tsv file instead of hardcoding them here?
yes sure, I've removed these inputs and converted them to respective tsv files
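A minimal sketch of moving such inputs into a TSV file, assuming one context word per row. The helper name `load_first_column` is hypothetical (it is not the repository's loader), and an in-memory file stands in for the real data file so the example is self-contained.

```python
import csv
import io


def load_first_column(tsv_file):
    """Return the first column of every non-empty row of a tab-separated file."""
    reader = csv.reader(tsv_file, delimiter="\t")
    return [row[0] for row in reader if row and row[0].strip()]


# Stand-in for a TSV file in the telephone data folder (file name and location are assumptions).
context_words_tsv = io.StringIO("नंबर\nमोबाइल\nफोन\nकॉल\n")

# The tagger would receive this list instead of a hardcoded literal.
context_words = load_first_column(context_words_tsv)
```

Keeping the words in a data file means new context words can be added without touching the tagger code.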
tests/nemo_text_processing/hi/data_text_normalization/test_cases_fraction.txt
What does this PR do?
This PR introduces Hindi Text Normalization 2.0, which features substantial accuracy improvements across multiple classes and the addition of a new Telephone class. It also integrates culturally relevant linguistic constructs to enhance natural language understanding.
Accuracy Improvements by Class:
Key Enhancements:
New Class: Telephone
Linguistic Enrichment:
Before your PR is "Ready for review"
Pre checks:
* … git commit -s to sign.
* … pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
* … bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
* … pytest and Sparrowhawk here.
* … __init__.py for every folder and subfolder, including data folder which has .TSV files?
* … Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
* … Copyright 2015 and onwards Google, Inc. . See an example here.
* … (try import: ... except: ...) if not already done.
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.