metadata info apart from maxElements in dataframe schema #550

Closed
saikumare-a opened this issue Dec 19, 2022 · 31 comments
Labels: enhancement (New feature or request)

@saikumare-a commented Dec 19, 2022

Background

Hi, thank you for making this wonderful package.

Currently, Cobrix provides the maxLength (string), minElements (array), and maxElements (array) metadata in the dataframe schema.

Feature

For ASCII files, adding the info below would help PySpark users with debugging. I see that we could use Scala/Java to get the corresponding info from the COBOL converter/parser, but we only know Python/PySpark.

  1. adding the length, decimal precision, and decimal scale for all primitive column types (decimal, integer, long, float, etc.)
  2. start position and end position
  3. redefines, assumed scale, occurs_depends_on, etc.
  4. other metadata that is already available in the COBOL converter

Example [Optional]

A simple example if applicable.

Proposed Solution [Optional]

Solution Ideas

  1. adding the existing (converter/parser) metadata information to the dataframe schema metadata
saikumare-a added the enhancement (New feature or request) label Dec 19, 2022
@yruslan (Collaborator) commented Dec 21, 2022

We use Spark schema metadata only for information that can affect further processing of the data frame. For instance, the maximum number of elements in arrays helps with flattening the schema.
What you are asking is basically to add the full copybook AST to the Spark schema. This can be achieved by using

val parsedSchema = CopybookParser.parseSimple(copyBookContents)

from this section of the README: https://github.com/AbsaOSS/cobrix#spark-sql-schema-extraction

parsedSchema will contain the AST that has all the info you need.

You can use the Java-to-Python gateway to access the parser.

@saikumare-a (Author):

Hi @yruslan,

  1. the Python-to-Java gateway is blocked for us, and we don't have the option to use any function/code other than the Spark API
  2. also, I see that the above-mentioned parsedSchema provides only start, end, and length information

Adding the full copybook AST as metadata to the Spark schema would be helpful for us and for other users.

@yruslan (Collaborator) commented Dec 21, 2022

  • Sorry that the Java gateway is blocked for you, but that is not a sufficient reason for duplicating functionality inside Cobrix.
  • CopybookParser.parseSimple() and parseTree() provide the full AST with all information extracted from the copybook, not just start and length.
  • Adding the full copybook as Spark schema metadata might help some users, but it can complicate things for others, since the metadata could become too heavy and redundant.

We were thinking of extending the metadata with several other fields (like PIC), so let's keep this feature request open. We will consider the list you provided as potential new fields, but definitely not everything you've listed. For instance, decimal precision and scale are already present in the column type in Spark; there is no need to duplicate them.

Out of curiosity, what do you mean by adding "redefines" as metadata info?

@saikumare-a (Author):

Hi @yruslan,

Thank you for the reply and the quick response.

  1. for decimal fields, precision and scale are present, so there is no need for that info, but the assumed decimal (in the case of PIC S9V9) and the sign character would be needed
  2. for integer & long types, the precision would be needed, based on the PIC info
  3. redefines → which column it redefines
  4. occurs → which column it depends on
  5. level
  6. parent

In summary: adding overall info that can help reproduce the copybook from the Spark dataframe!!

@yruslan (Collaborator) commented Dec 21, 2022

So it seems you need a COBOL parser. :)
Literally all the info you mentioned is available from the object returned by CopybookParser.parseSimple().

@saikumare-a (Author):

Agreed. Can you help by providing a sample or an existing function that can produce this info in a JSON format similar to the Spark schema metadata?

@yruslan (Collaborator) commented Dec 21, 2022

val parsedCopybook = CopybookParser.parseSimple(copybookContents)
val ast = parsedCopybook.getCobolSchema

The ast object will contain a tree of Statement nodes, each of which can be either a Primitive or a Group. You can traverse this tree to produce JSON, etc.

The class definitions for these objects are Primitive and Group in the cobol-parser module.
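For illustration, here is a rough sketch of such a traversal. This is a sketch under assumptions, not official Cobrix API documentation: it presumes the 2.x AST shape (a Group exposes children, a Primitive exposes dataType, and both carry level and name) and that copybookContents holds the copybook text; exact field names may differ between versions.

import za.co.absa.cobrix.cobol.parser.CopybookParser
import za.co.absa.cobrix.cobol.parser.ast.{Group, Primitive, Statement}

// Recursively print each node of the parsed copybook AST.
def printAst(stmt: Statement, indent: Int = 0): Unit = {
  val pad = "  " * indent
  stmt match {
    case g: Group =>
      println(s"$pad${g.level} ${g.name} (group)")
      g.children.foreach(printAst(_, indent + 1)) // descend into child statements
    case p: Primitive =>
      println(s"$pad${p.level} ${p.name} (${p.dataType})")
  }
}

val parsedCopybook = CopybookParser.parseSimple(copybookContents)
printAst(parsedCopybook.getCobolSchema)

Building a JSON string instead of printing is a straightforward extension of the same pattern match.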

@saikumare-a (Author):

Thank you @yruslan. Having a function as part of this package to create the JSON info might help others.

@yruslan (Collaborator) commented Dec 21, 2022

It certainly might. If I create the traversal logic for you that extracts a couple of parameters, could you help me take it from there and create a PR that generates all the info you want?

@yruslan (Collaborator) commented Dec 21, 2022

We have this idea that any person creating a PR for Cobrix is entitled to a pint of beer 🍺 from us if they happen to come to Prague 😄

@saikumare-a (Author):

Very much interested in supporting a PR, but I don't have any knowledge of Scala. I know PySpark/Python and would be ready to help with Python/PySpark.

I am from India and cannot join you :P in person.

I would like to say thanks a lot for making this awesome package :)

@yruslan (Collaborator) commented Dec 21, 2022

Okay, I see. Still, if you happen to go to Prague, let me or Felipe know.

Anyway, the COBOL parser that we have in Cobrix is written in ANTLR by the amazing contributor Tiago.
https://github.com/AbsaOSS/cobrix/tree/master/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/antlr

You can use this parser from Python by either:

  1. the Python-to-Java gateway that PySpark uses. Can you confirm that you can invoke CopybookParser.parseSimple(copyBookContents) from PySpark?
  2. generating a Python parser from the .g4 files; this requires deep technical knowledge of ANTLR, though.

The Python-to-Java gateway is much easier, since the Cobrix parsed copybook is very easy to use. But if it is blocked for you, the only way to reuse the parser is option 2.

@saikumare-a (Author):

I tried the gateway option and saw that the gateway is blocked on our clusters due to a security issue.

The only option left is to use PySpark dataframes :(

@saikumare-a (Author):

Is there a security scanner report for this package, like SonarQube or similar?

@yruslan (Collaborator) commented Dec 22, 2022

Nothing like that. Just the standard GitHub dependency scan.

@saikumare-a (Author):

@yruslan,

Is it possible to include the below metadata info in the 2.6.2 version?

  1. explicit/assumed decimal indicator for decimal columns
  2. copybook-defined length for long/integer columns

We would like to use the "_debug" fields obtained with option("debug","string"), together with the above metadata, as the Cobrix output; we would like to do data quality checks and datatype conversion after the Cobrix conversion.
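For illustration, a rough sketch of the kind of check this enables (hypothetical code, assuming a dataframe df with a decoded column "abc" and its raw counterpart "abc_debug" produced by option("debug","string")):

import org.apache.spark.sql.functions.col

// Naive consistency check between the decoded value and the raw debug text.
// For implied-decimal fields (PIC 9V99) this comparison fails unless the
// implied decimal point is re-inserted first, which is why the extra
// metadata is needed.
val mismatches = df.filter(col("abc").cast("string") =!= col("abc_debug"))
println(s"Rows failing the data quality check: ${mismatches.count()}")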

@saikumare-a (Author):

Hi @yruslan,

Any thoughts on this request?

Is it possible to include the below metadata info in the 2.6.2 version?

  1. explicit decimal indicator for decimal columns
  2. copybook-defined length for long/integer columns

We would like to use the "_debug" fields obtained with option("debug","string"), together with the above metadata, as the Cobrix output; we would like to do data quality checks and datatype conversion after the Cobrix conversion.

@yruslan (Collaborator) commented Dec 29, 2022

Yes, this seems okay. It is going to be in 2.7.0, since adding metadata columns breaks many integration tests and can break library users' tests as well.

I have a question, though. What do you mean by 'explicit decimal indicator'? Can you give an example?

@saikumare-a (Author):

Hi @yruslan,

I mean the "isExplicitDecimalPt" mentioned in the code below, which indicates whether the decimal point is explicitly included (PIC 9.99, etc.) or not (PIC 9V99).

https://github.com/AbsaOSS/cobrix/blob/master/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryUtils.scala

We would like to use option("debug","string") in our process. Using this option, we are unable to infer decimal columns (PIC 9V99) correctly. Adding "isExplicitDecimalPt" to the metadata would solve this issue.

As you mentioned above, I understand that adding all the metadata info would need a lot of integration tests.

Is it possible to include only "isExplicitDecimalPt"="N" in the metadata (only in the case of PIC 9V99, etc.) in the 2.6.2 release and fix the corresponding integration tests (the number might be small), so that we can use option("debug","string")?

The other metadata additions can wait until the 2.7.* release.

Thank you for your support, and let us know your thoughts.

@yruslan (Collaborator) commented Jan 2, 2023

Thanks for the detailed description. We can certainly add the metadata field.

However, could you please clarify how this will help you find the exact position of the implicit decimal?
For example, if your debug data is '12345' and isExplicitDecimalPt='N', the actual number could be 1.2345, 12.345, 123.45, etc. Is Y/N sufficient for isExplicitDecimalPt in order to reconstruct your number?

I'm thinking of adding the PIC of the field from the copybook to the metadata. In your case, you would get pic='9V99' or pic='9.99', and you would be able to infer the presence and position of the explicit decimal for each field. Using the PIC, you can determine integer and long sizes as well. What do you think?
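For illustration, a toy sketch of how such a pic value could be interpreted (a hypothetical helper, not part of Cobrix; it handles only simple PICs such as 9V99 or 9.99, while repeated groups like 9(3)V9(2) would need a fuller parser):

// Returns (hasExplicitDecimalPoint, scale) for a simple PIC string.
def decimalInfo(pic: String): (Boolean, Int) = {
  val explicitPoint = pic.contains('.')   // a '.' occupies a byte in the data
  val separator     = if (explicitPoint) '.' else 'V'
  val scale = pic.split(separator) match {
    case Array(_, fractional) => fractional.count(_ == '9')
    case _                    => 0        // no fractional part
  }
  (explicitPoint, scale)
}

decimalInfo("9V99") // (false, 2): implied decimal point, scale 2
decimalInfo("9.99") // (true, 2): explicit decimal point present in the text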

@saikumare-a (Author) commented Jan 2, 2023

For a decimal column "abc" with PIC 9V99, when option("debug","string") is used, Cobrix will provide 2 fields:

  1. "abc" of decimal(3,2) type
  2. "abc_debug" of string type

If isExplicitDecimalPt='N' is available, we can use the "scale" provided by "abc" (field 1 above), which tells us the exact position of the decimal point, and insert the decimal point into "abc_debug" (field 2 above), as sketched below.

Adding metadata and PIC info will help (but that needs more time to implement and test); I can think of the above option to make something available sooner.
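A minimal sketch of that idea (a hypothetical helper, not Cobrix code; it assumes the debug string contains at least scale digits):

// Re-insert the implied decimal point into a debug string, using the
// scale of the corresponding decimal column (e.g. 2 for decimal(3,2)).
def insertImpliedDecimal(debugValue: String, scale: Int): String = {
  val digits = debugValue.trim
  val (intPart, fracPart) = digits.splitAt(digits.length - scale)
  s"$intPart.$fracPart"
}

insertImpliedDecimal("123", 2) // "1.23" for a PIC 9V99 / decimal(3,2) field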

@yruslan (Collaborator) commented Jan 2, 2023

These 2 options are equally breaking, and it takes about the same time to do either. In order to make it non-breaking, I plan to add the new metadata fields only when this option is set: .option("detailed_metadata", "true").

I prefer adding PIC (and some other metadata, like comp, redefines, etc.) since it is more general and can potentially cover more use cases.
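For illustration, a sketch of how the extra metadata could be inspected once such an option exists (the option name follows the comment above, the "pic" key matches the example later in this thread, and an active SparkSession named spark plus the copybook text in copybookContents are assumed):

// Read a file with the proposed detailed metadata enabled.
val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybookContents)
  .option("detailed_metadata", "true")
  .load("/path/to/data")

// Each StructField carries a Metadata object; "pic" is one of its keys.
df.schema.fields.foreach { field =>
  val pic =
    if (field.metadata.contains("pic")) field.metadata.getString("pic") else "n/a"
  println(s"${field.name}: ${field.dataType.simpleString}, pic = $pic")
}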

@saikumare-a (Author):

Hi @yruslan,

Awesome thought; this would be more helpful and non-breaking.

Waiting to use it as soon as possible :)

Thank you once again!!

@saikumare-a (Author):

> These 2 options are equally breaking, and it takes about the same time to do either. In order to make it non-breaking, I plan to add the new metadata fields only when this option is set: .option("detailed_metadata", "true").
>
> I prefer adding PIC (and some other metadata, like comp, redefines, etc.) since it is more general and can potentially cover more use cases.

Could you provide more detail on the cases in which this option would break things?

@yruslan (Collaborator) commented Jan 2, 2023

It would break our use cases. We have very strict unit test suites that ensure that the schemas coming from Cobrix are exactly as expected, including metadata fields. Adding new metadata fields breaks this check.

@saikumare-a (Author):

Hi @yruslan,

  1. The thought was to add "isExplicitDecimalPt"='N' only in the "N" case, with no change to the metadata in the "Y" case. This would result in fewer breaks.

I agree that the .option("detailed_metadata", "true") approach avoids breakage, and I am good with going with it.

@saikumare-a (Author) commented Jan 3, 2023

Hi @yruslan,

I see that 2.6.2 has just been released. Is .option("detailed_metadata", "true") included in 2.6.2?

We cannot use option("debug","string") until .option("detailed_metadata", "true") is included :(

@yruslan (Collaborator) commented Jan 3, 2023

We plan for it to be part of the next release, 2.6.3.

yruslan added a commit that referenced this issue Jan 26, 2023: "… metadata to Spark schemas."
yruslan added a commit that referenced this issue Jan 27, 2023.
yruslan added a commit that referenced this issue Jan 27, 2023: "This fields specified the name of the field in the copybook. (Before the removal of special characters)"
yruslan added a commit that referenced this issue Jan 30, 2023.
yruslan added a commit that referenced this issue Jan 30, 2023: "This fields specified the name of the field in the copybook. (Before the removal of special characters)"
@yruslan (Collaborator) commented Feb 2, 2023

.option("detailed_metadata", "true") is included in 2.6.3

yruslan closed this as completed Feb 2, 2023
@saikumare-a (Author):

Hi @yruslan,

Thank you for the support on this.

I tested on 2.6.3 and am able to see the PIC info, as below:

copybook = """
01 RECORD.
   05 ID      PIC S9(1).
   05 ID_NUM  PIC S9V9 OCCURS 1 TIMES.
   05 ID_TXT  PIC X(2).
"""

metadata:

{
  "type": "struct",
  "fields": [
    {
      "name": "ID",
      "type": "integer",
      "nullable": true,
      "metadata": {"level": 5, "pic": "S9(1)"}
    },
    {
      "name": "ID_NUM",
      "type": {"type": "array", "elementType": "decimal(2,1)", "containsNull": true},
      "nullable": true,
      "metadata": {"level": 5, "minElements": 0, "maxElements": 1, "pic": "S9(1)V9(1)"}
    },
    {
      "name": "ID_TXT",
      "type": "string",
      "nullable": true,
      "metadata": {"level": 5, "maxLength": 2, "pic": "X(2)"}
    }
  ]
}

Could you also help in adding the below metadata? This would help avoid parsing the PIC info:

  1. decimal precision
  2. decimal scale
  3. assumed decimal indicator (Y/N) (in the case of S9V9)
  4. startPosition and endPosition (in the case of ASCII)

Thank you in advance!!
Regards,
Saikumar

@saikumare-a (Author):

Created a new feature request, #580, for this.
