metadata info apart from maxElements in dataframe schema #550

Closed
saikumare-a opened this issue Dec 19, 2022 · 31 comments
Labels: enhancement (New feature or request)

@saikumare-a commented Dec 19, 2022

Background

Hi, thank you for making this wonderful package.

Currently, Cobrix provides the maxLength (string), minElements (array), and maxElements (array) metadata in the dataframe schema.

Feature

For ASCII files, adding the info below would help PySpark users with debugging. I see that we could use Scala/Java to get the corresponding info from the COBOL converter/parser, but we only know Python/PySpark.

  1. adding the length, decimal precision, and decimal scale for all primitive column types (decimal, integer, long, float, etc.)
  2. start position and end position
  3. redefines, assumed scale, occurs_depends_on, etc.
  4. other metadata that is already available in the COBOL converter

Example [Optional]

A simple example if applicable.

Proposed Solution [Optional]

Solution Ideas

  1. adding the existing (converter/parser) metadata information to the dataframe schema metadata
saikumare-a added the enhancement (New feature or request) label Dec 19, 2022
@yruslan (Collaborator) commented Dec 21, 2022

We use Spark schema metadata only for information that can affect further processing of the data frame. For instance, the maximum number of elements in arrays helps with flattening the schema.
What you are asking is basically to add the full copybook AST to the Spark schema. This can be achieved by using

val parsedSchema = CopybookParser.parseSimple(copyBookContents)

from this section of the README: https://github.com/AbsaOSS/cobrix#spark-sql-schema-extraction

parsedSchema will contain the AST that has all the info you need.

You can use the Java-to-Python gateway to access the parser.

@saikumare-a (Author):

Hi @yruslan,

  1. the Python-to-Java gateway is blocked for us, and we don't have the option to use any function/code other than the Spark API
  2. also, I see that the above-mentioned parsedSchema provides only start, end, and length information

Adding the full copybook AST as metadata to the Spark schema would be helpful for us and for other users.

@yruslan (Collaborator) commented Dec 21, 2022

  • Sorry that the Java gateway is blocked for you, but that is not a sufficient reason for duplicating functionality inside Cobrix.
  • CopybookParser.parseSimple() and parseTree() provide the full AST with all information extracted from the copybook, not just start and length.
  • Adding the full copybook as Spark schema metadata might help some users, but it can complicate things for others, since the metadata could become too heavy and redundant.

We were thinking of extending the metadata with several other fields (like PIC), so let's keep this feature request open. We will consider the list you provided as potential new fields, but definitely not everything you've listed. For instance, decimal precision and scale are already present in the column type in Spark; there is no need to duplicate them.

Out of curiosity, what do you mean by adding "redefines" as metadata info?

@saikumare-a (Author):

Hi @yruslan,

Thank you for the reply and the quick response.

  1. for decimal fields, precision and scale are present, so there is no need for that info, but the assumed decimal (in the case of PIC S9V9) and the sign character would be needed
  2. for integer & long types, the precision would be needed, based on the PIC info
  3. redefines → which column it redefines
  4. occurs → which column it depends on
  5. level
  6. parent

In summary: adding overall info that can help reproduce the copybook from the Spark dataframe!!

@yruslan (Collaborator) commented Dec 21, 2022

So it seems you need a COBOL parser. :)
Literally all the info you mentioned is available from the object returned by CopybookParser.parseSimple().

@saikumare-a (Author):

Agreed. Can you help by providing a sample or an existing function that can produce this info in a JSON format similar to the Spark schema metadata?

@yruslan (Collaborator) commented Dec 21, 2022

val parsedCopybook = CopybookParser.parseSimple(copybookContents)
val ast = parsedCopybook.getCobolSchema

The ast object will contain a tree of Statement nodes, each of which can be either a Primitive or a Group. You can traverse this tree to produce JSON, etc.

The class definitions for these objects are Primitive and Group in the cobol-parser module.
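For illustration, here is a rough sketch of such a traversal. This is a sketch under assumptions, not official Cobrix API documentation: it presumes the 2.x AST shape (a Group exposes children, a Primitive exposes dataType, and both carry level and name) and that copybookContents holds the copybook text; exact field names may differ between versions.

import za.co.absa.cobrix.cobol.parser.CopybookParser
import za.co.absa.cobrix.cobol.parser.ast.{Group, Primitive, Statement}

// Recursively print each node of the parsed copybook AST.
def printAst(stmt: Statement, indent: Int = 0): Unit = {
  val pad = "  " * indent
  stmt match {
    case g: Group =>
      println(s"$pad${g.level} ${g.name} (group)")
      g.children.foreach(printAst(_, indent + 1)) // descend into child statements
    case p: Primitive =>
      println(s"$pad${p.level} ${p.name} (${p.dataType})")
  }
}

val parsedCopybook = CopybookParser.parseSimple(copybookContents)
printAst(parsedCopybook.getCobolSchema)

Building a JSON string instead of printing is a straightforward extension of the same pattern match.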

@saikumare-a (Author):

Thank you @yruslan. Having a function as part of this package to create the JSON info might help others.

@yruslan (Collaborator) commented Dec 21, 2022

It certainly might. If I create the traversal logic for you that extracts a couple of parameters, could you help me take it from there and create a PR that generates all the info you want?

@yruslan (Collaborator) commented Dec 21, 2022

We have this idea that any person creating a PR for Cobrix is entitled to a pint of beer 🍺 from us if they happen to come to Prague 😄

@saikumare-a (Author):

Very much interested in supporting a PR, but I don't have any knowledge of Scala. I know PySpark/Python and would be ready to help with Python/PySpark.

I am from India and cannot join you :P in person.

I would like to say thanks a lot for making this awesome package :)

@yruslan (Collaborator) commented Dec 21, 2022

Okay, I see. Still, if you happen to go to Prague, let me or Felipe know.

Anyway, the COBOL parser that we have in Cobrix is written in ANTLR by the amazing contributor Tiago.
https://github.com/AbsaOSS/cobrix/tree/master/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/antlr

You can use this parser from Python by either:

  1. the Python-to-Java gateway that PySpark uses. Can you confirm that you can invoke CopybookParser.parseSimple(copyBookContents) from PySpark?
  2. generating a Python parser from the .g4 files; this requires deep technical knowledge of ANTLR, though.

The Python-to-Java gateway is much easier, since the Cobrix parsed copybook is very easy to use. But if it is blocked for you, the only way to reuse the parser is option 2.

@saikumare-a (Author):

I tried the gateway option and saw that the gateway is blocked on our clusters due to a security issue.

The only option left is to use PySpark dataframes :(

@saikumare-a (Author):

Is there a security scanner report for this package, like SonarQube or similar?

@yruslan (Collaborator) commented Dec 22, 2022

Nothing like that. Just the standard GitHub dependency scan.

@saikumare-a (Author):

@yruslan,

Is it possible to include the below metadata info in the 2.6.2 version?

  1. explicit/assumed decimal indicator for decimal columns
  2. copybook-defined length for long/integer columns

We would like to use the "_debug" fields obtained with option("debug","string"), together with the above metadata, as the Cobrix output; we would like to do data quality checks and datatype conversion after the Cobrix conversion.
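For illustration, a rough sketch of the kind of check this enables (hypothetical code, assuming a dataframe df with a decoded column "abc" and its raw counterpart "abc_debug" produced by option("debug","string")):

import org.apache.spark.sql.functions.col

// Naive consistency check between the decoded value and the raw debug text.
// For implied-decimal fields (PIC 9V99) this comparison fails unless the
// implied decimal point is re-inserted first, which is why the extra
// metadata is needed.
val mismatches = df.filter(col("abc").cast("string") =!= col("abc_debug"))
println(s"Rows failing the data quality check: ${mismatches.count()}")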

@saikumare-a (Author):

Hi @yruslan,

Any thoughts on this request?

Is it possible to include the below metadata info in the 2.6.2 version?

  1. explicit decimal indicator for decimal columns
  2. copybook-defined length for long/integer columns

We would like to use the "_debug" fields obtained with option("debug","string"), together with the above metadata, as the Cobrix output; we would like to do data quality checks and datatype conversion after the Cobrix conversion.

@yruslan (Collaborator) commented Dec 29, 2022

Yes, this seems okay. It is going to be in 2.7.0, since adding metadata columns breaks many integration tests and can break library users' tests as well.

I have a question, though. What do you mean by 'explicit decimal indicator'? Can you give an example?

@saikumare-a (Author):

Hi @yruslan,

I mean the "isExplicitDecimalPt" mentioned in the code below, which indicates whether the decimal point is explicitly included (PIC 9.99, etc.) or not (PIC 9V99).

https://github.com/AbsaOSS/cobrix/blob/master/cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryUtils.scala

We would like to use option("debug","string") in our process. Using this option, we are unable to infer decimal columns (PIC 9V99) correctly. Adding "isExplicitDecimalPt" to the metadata would solve this issue.

As you mentioned above, I understand that adding all the metadata info would need a lot of integration tests.

Is it possible to include only "isExplicitDecimalPt"="N" in the metadata (only in the case of PIC 9V99, etc.) in the 2.6.2 release and fix the corresponding integration tests (the number might be small), so that we can use option("debug","string")?

The other metadata additions can wait until the 2.7.* release.

Thank you for your support, and let us know your thoughts.

@yruslan (Collaborator) commented Jan 2, 2023

Thanks for the detailed description. We can certainly add the metadata field.

However, could you please clarify how this will help you find the exact position of the implicit decimal?
For example, if your debug data is '12345' and isExplicitDecimalPt='N', the actual number could be 1.2345, 12.345, 123.45, etc. Is Y/N sufficient for isExplicitDecimalPt in order to reconstruct your number?

I'm thinking of adding the PIC of the field from the copybook to the metadata. In your case, you would get pic='9V99' or pic='9.99', and you would be able to infer the presence and position of the explicit decimal for each field. Using the PIC, you can determine integer and long sizes as well. What do you think?
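For illustration, a toy sketch of how such a pic value could be interpreted (a hypothetical helper, not part of Cobrix; it handles only simple PICs such as 9V99 or 9.99, while repeated groups like 9(3)V9(2) would need a fuller parser):

// Returns (hasExplicitDecimalPoint, scale) for a simple PIC string.
def decimalInfo(pic: String): (Boolean, Int) = {
  val explicitPoint = pic.contains('.')   // a '.' occupies a byte in the data
  val separator     = if (explicitPoint) '.' else 'V'
  val scale = pic.split(separator) match {
    case Array(_, fractional) => fractional.count(_ == '9')
    case _                    => 0        // no fractional part
  }
  (explicitPoint, scale)
}

decimalInfo("9V99") // (false, 2): implied decimal point, scale 2
decimalInfo("9.99") // (true, 2): explicit decimal point present in the text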

@saikumare-a (Author) commented Jan 2, 2023

For a decimal column "abc" with PIC 9V99, when option("debug","string") is used, Cobrix will provide 2 fields:

  1. "abc" of decimal(3,2) type
  2. "abc_debug" of string type

If isExplicitDecimalPt='N' is available, we can use the "scale" provided by "abc" (field 1 above), which tells us the exact position of the decimal point, and insert the decimal point into "abc_debug" (field 2 above), as sketched below.

Adding metadata and PIC info will help (but that needs more time to implement and test); I can think of the above option to make something available sooner.
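A minimal sketch of that idea (a hypothetical helper, not Cobrix code; it assumes the debug string contains at least scale digits):

// Re-insert the implied decimal point into a debug string, using the
// scale of the corresponding decimal column (e.g. 2 for decimal(3,2)).
def insertImpliedDecimal(debugValue: String, scale: Int): String = {
  val digits = debugValue.trim
  val (intPart, fracPart) = digits.splitAt(digits.length - scale)
  s"$intPart.$fracPart"
}

insertImpliedDecimal("123", 2) // "1.23" for a PIC 9V99 / decimal(3,2) field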

@yruslan (Collaborator) commented Jan 2, 2023

These 2 options are equally breaking, and it takes about the same time to do either. In order to make it non-breaking, I plan to add the new metadata fields only when this option is set: .option("detailed_metadata", "true").

I prefer adding PIC (and some other metadata, like comp, redefines, etc.) since it is more general and can potentially cover more use cases.
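For illustration, a sketch of how the extra metadata could be inspected once such an option exists (the option name follows the comment above, the "pic" key matches the example later in this thread, and an active SparkSession named spark plus the copybook text in copybookContents are assumed):

// Read a file with the proposed detailed metadata enabled.
val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybookContents)
  .option("detailed_metadata", "true")
  .load("/path/to/data")

// Each StructField carries a Metadata object; "pic" is one of its keys.
df.schema.fields.foreach { field =>
  val pic =
    if (field.metadata.contains("pic")) field.metadata.getString("pic") else "n/a"
  println(s"${field.name}: ${field.dataType.simpleString}, pic = $pic")
}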

@saikumare-a (Author):

Hi @yruslan,

Awesome thought; this would be more helpful and non-breaking.

Waiting to use it as soon as possible :)

Thank you once again!!

@saikumare-a (Author):

> These 2 options are equally breaking, and it takes about the same time to do either. In order to make it non-breaking, I plan to add the new metadata fields only when this option is set: .option("detailed_metadata", "true").
>
> I prefer adding PIC (and some other metadata, like comp, redefines, etc.) since it is more general and can potentially cover more use cases.

Could you provide more detail on the cases in which this option would break things?

@yruslan (Collaborator) commented Jan 2, 2023

It would break our use cases. We have very strict unit test suites that ensure that the schemas coming from Cobrix are exactly as expected, including metadata fields. Adding new metadata fields breaks this check.

@saikumare-a (Author):

Hi @yruslan,

  1. The thought was to add "isExplicitDecimalPt"='N' only in the "N" case, with no change to the metadata in the "Y" case. This would result in fewer breaks.

I agree that the .option("detailed_metadata", "true") approach avoids breakage, and I am good with going with it.

@saikumare-a (Author) commented Jan 3, 2023

Hi @yruslan,

I see that 2.6.2 has just been released. Is .option("detailed_metadata", "true") included in 2.6.2?

We cannot use option("debug","string") until .option("detailed_metadata", "true") is included :(

@yruslan (Collaborator) commented Jan 3, 2023

We plan for it to be part of the next release, 2.6.3.

yruslan added a commit that referenced this issue Jan 26, 2023: "… metadata to Spark schemas."
yruslan added a commit that referenced this issue Jan 27, 2023.
yruslan added a commit that referenced this issue Jan 27, 2023: "This fields specified the name of the field in the copybook. (Before the removal of special characters)"
yruslan added a commit that referenced this issue Jan 30, 2023.
yruslan added a commit that referenced this issue Jan 30, 2023: "This fields specified the name of the field in the copybook. (Before the removal of special characters)"
@yruslan (Collaborator) commented Feb 2, 2023

.option("detailed_metadata", "true") is included in 2.6.3

yruslan closed this as completed Feb 2, 2023
@saikumare-a (Author):

Hi @yruslan,

Thank you for the support on this.

I tested on 2.6.3 and am able to see the PIC info, as below:

copybook = """
01 RECORD.
   05 ID      PIC S9(1).
   05 ID_NUM  PIC S9V9 OCCURS 1 TIMES.
   05 ID_TXT  PIC X(2).
"""

metadata:

{
  "type": "struct",
  "fields": [
    {
      "name": "ID",
      "type": "integer",
      "nullable": true,
      "metadata": {"level": 5, "pic": "S9(1)"}
    },
    {
      "name": "ID_NUM",
      "type": {"type": "array", "elementType": "decimal(2,1)", "containsNull": true},
      "nullable": true,
      "metadata": {"level": 5, "minElements": 0, "maxElements": 1, "pic": "S9(1)V9(1)"}
    },
    {
      "name": "ID_TXT",
      "type": "string",
      "nullable": true,
      "metadata": {"level": 5, "maxLength": 2, "pic": "X(2)"}
    }
  ]
}

Could you also help in adding the below metadata? This would help avoid parsing the PIC info:

  1. decimal precision
  2. decimal scale
  3. assumed decimal indicator (Y/N) (in the case of S9V9)
  4. startPosition and endPosition (in the case of ASCII)

Thank you in advance!!
Regards,
Saikumar

@saikumare-a (Author):

Created a new feature request, #580, for this.
