metadata info apart from maxElements in dataframe schema #550
We use Spark schema metadata only for information that can affect further processing of the data frame. For instance, the maximum number of elements in arrays helps with flattening the schema.
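For illustration, reading this metadata back from a schema looks roughly like this (a sketch; the keys are the ones named in this thread, and their exact names and types may vary between Cobrix versions):

```scala
import org.apache.spark.sql.DataFrame

// A minimal sketch of reading Cobrix-provided metadata from a DataFrame schema.
// The keys ("maxLength", "maxElements") are the ones mentioned in this thread;
// exact names and types may differ between Cobrix versions.
def printFieldMetadata(df: DataFrame): Unit =
  df.schema.fields.foreach { f =>
    if (f.metadata.contains("maxLength"))
      println(s"${f.name}: maxLength = ${f.metadata.getLong("maxLength")}")
    if (f.metadata.contains("maxElements"))
      println(s"${f.name}: maxElements = ${f.metadata.getLong("maxElements")}")
  }
```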
From this section of the README: https://github.com/AbsaOSS/cobrix#spark-sql-schema-extraction
You can use the Java-to-Python gateway to access the parser.
Hi @yruslan,
Adding the full copybook AST as metadata to the Spark schema would be helpful for us and for other users.
We were thinking of extending the metadata with several other fields (like PIC), so let's keep this feature request open. We will consider the list you provided as potential new fields, but definitely not everything you've listed. For instance, decimal precision and scale are already present as the column type in Spark; there is no need to duplicate them. Out of curiosity, what do you mean by adding "redefines" as metadata info?
Hi @yruslan, Thank you for the reply and quick response time.
So it seems you need a COBOL parser. :)
Agreed. Can you help by providing a sample or an existing function that provides this info in a JSON format similar to the Spark schema metadata?
```scala
val parsedCopybook = CopybookParser.parseSimple(copybookContents)
val ast = parsedCopybook.getCobolSchema
```

Here are the class definitions for these objects:
Thank you @yruslan. Having a function as part of this package to create the JSON info might help others.
It certainly might. If I create the traversing logic for you that extracts a couple of parameters, could you help me take it from there and create a PR that generates all the info you want?
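For illustration, a minimal sketch of such traversing logic could look like this (the AST node classes `Group`, `Primitive`, and `Statement` are assumptions based on the Cobrix sources; exact names and members may differ between versions, and the copybook is hypothetical):

```scala
import za.co.absa.cobrix.cobol.parser.CopybookParser
import za.co.absa.cobrix.cobol.parser.ast.{Group, Primitive, Statement}

// Hypothetical copybook to make the sketch self-contained.
val copybookContents =
  """       01  RECORD.
    |          05  ID      PIC 9(4).
    |          05  AMOUNT  PIC 9V99.
    |""".stripMargin

// Recursive traversal: prints each field's level and name. Extracting PIC,
// usage, redefines, etc. into JSON would follow the same recursive pattern.
def traverse(node: Statement, indent: Int = 0): Unit = node match {
  case g: Group =>
    println(" " * indent + s"${g.level} ${g.name}")
    g.children.foreach(traverse(_, indent + 2))
  case p: Primitive =>
    // Leaf field: p.dataType carries the PIC-derived info (precision, scale, ...)
    println(" " * indent + s"${p.level} ${p.name}")
}

val parsed = CopybookParser.parseSimple(copybookContents)
traverse(parsed.ast)
```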
We have this idea that any person creating a PR for Cobrix is entitled to a pint of beer 🍺 from us if they happen to come to Prague 😄
Very much interested in supporting a PR, but I don't have any knowledge of Scala. I know PySpark/Python and would be ready to help with Python/PySpark. I am from India and cannot join you in person :P. I would like to say thanks a lot for making this awesome package :)
Okay, I see. Still, if you happen to come to Prague, let me or Felipe know. Anyway, the COBOL parser that we have in Cobrix is written in ANTLR by the amazing contributor Tiago. You can use this parser from Python by either:

1. calling the Cobrix parser classes through the Python-to-Java gateway, or
2. reusing the ANTLR grammar to generate a standalone parser for Python.
The Python-to-Java gateway is much easier, since the Cobrix parsed copybook is very easy to use. But if it is blocked for you, the only way to reuse the parser is option 2.
Tried the gateway option and saw that the gateway is blocked on our clusters due to some security issue. The only option left is to use PySpark DataFrames :(
Is there a security scanner report for this package, like SonarQube or similar?
Nothing like that. Just the standard GitHub dependency scan.
@yruslan, is it possible to include the below metadata info in the 2.6.2 version?
We would like to use the "_debug" fields obtained with option("debug", "string") together with the above metadata in the Cobrix output, so that we can do data quality checks and data type conversion after the Cobrix conversion.
Hi @yruslan, any thoughts on this request? Is it possible to include the below metadata info in the 2.6.2 version?
Yes, this seems okay. It is going to be in a 2.7.* release. I have a question, though. What do you mean by 'explicit decimal indicator'? Can you give an example?
Hi @yruslan, I mean the "isExplicitDecimalPt" mentioned in the code below, which indicates whether the decimal point is explicitly included (PIC 9.99, etc.) or not (PIC 9V99). We would like to use option("debug", "string") in our process, but with this option we are unable to infer decimal columns (PIC 9V99) correctly; adding "isExplicitDecimalPt" to the metadata would solve this issue. As you mentioned above, I understand that adding all the metadata info would need a lot of integration tests. Is it possible to include only "isExplicitDecimalPt"="N" in the metadata (only in the case of PIC 9V99, etc.) in the 2.6.2 release and fix the corresponding integration tests (the number might be small), so that we can use option("debug", "string")? The other metadata additions can wait till a 2.7.* release. Thank you for your support, and let us know your thoughts.
Thanks for the detailed description. We can certainly add the metadata field. However, could you please clarify how this will help you find the exact position of the implicit decimal? I'm thinking of adding the PIC of the field from the copybook to the metadata. In your case you would get 9V99.
For a decimal column "abc" with PIC 9V99, when option("debug", "string") is used, Cobrix will provide 2 fields:

1. "abc" — the parsed decimal column (with precision and scale in its type)
2. "abc_debug" — the raw value as a string
If isExplicitDecimalPt='N' is available, we can use the "scale" provided by "abc" (Field 1 above), which tells the exact position of the decimal point, and insert the decimal point into "abc_debug" (Field 2 above). Adding the metadata and PIC info will help too (but that needs more time to fix and test); I can think of the above option as something that could be made available sooner.
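For illustration, that post-processing step could look like this (a hypothetical helper, not part of Cobrix):

```scala
// Hypothetical helper (not part of Cobrix): re-insert the implicit decimal
// point into the raw "_debug" string, using the scale taken from the decimal
// column's Spark type or metadata.
def addImplicitDecimalPoint(raw: String, scale: Int): String = {
  val digits = raw.trim
  val (intPart, fracPart) = digits.splitAt(digits.length - scale)
  s"$intPart.$fracPart"
}

// For PIC 9V99 (scale = 2): "123" becomes "1.23"
addImplicitDecimalPoint("123", 2)
```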
These 2 options are equally breaking, and they take about the same time to do. In order to make it non-breaking, I plan to add the new metadata fields only when an option is set (.option("detailed_metadata", "true")). I prefer adding PIC (and some other metadata, like comp, redefines, etc.) since it is more general and can potentially cover more use cases.
Hi @yruslan, awesome thought, this would be more helpful and non-breaking. Waiting to use it as soon as possible. :) Thank you once again!!
Could you provide more detail on the cases in which this would break?
It would break our use cases. We have very strict unit test suites that ensure that the schemas coming from Cobrix are exactly as expected, including metadata fields. Adding new metadata fields breaks this check.
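For illustration, a strict check of that kind might look like this (a hypothetical sketch, not the actual Cobrix test code):

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical strict schema check: the serialized schema embeds each field's
// metadata, so any newly added metadata key changes df.schema.json and fails
// the comparison.
def assertSchemaUnchanged(df: DataFrame, expectedSchemaJson: String): Unit =
  assert(df.schema.json == expectedSchemaJson)
```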
Hi @yruslan,
I agree that the .option("detailed_metadata", "true") case causes no breakage, and I am good with going with this.
Hi @yruslan, I see that 2.6.2 has just been released. Is .option("detailed_metadata", "true") included in 2.6.2? We cannot use option("debug", "string") until .option("detailed_metadata", "true") is included :(
We plan for it to be part of the next release, 2.6.3.
This field specifies the name of the field in the copybook (before the removal of special characters).
.option("detailed_metadata", "true") is included in 2.6.3 |
Hi @yruslan, Thank you for the support on this. Tested on 2.6.3 and able to see the PIC info, as below:

copybook = """
metadata

Could you also help in adding the below metadata? This would help in avoiding parsing the PIC info.
Thank you in advance!! |
Created a new feature request #580 for this.
Background
Hi, thank you for making this wonderful package.
Currently, Cobrix provides the maxLength (String), minElements (Array), and maxElements (Array) metadata info in the DataFrame schema.
Feature
For ASCII files, adding the below info will help with debugging for PySpark users. I see that we could use Scala/Java to get the corresponding info from the COBOL converter/parser, but we only know Python/PySpark.
Example [Optional]
A simple example if applicable.
Proposed Solution [Optional]
Solution Ideas