Custom record extractor implementation for VBVR data files, should it not use rawRecordExtractor? #412
Comments
Thanks for the contribution! I'll take a look and get back to you.
I've checked the implementation and it looks good. It would be really nice to have the VBVR feature as part of Cobrix. I have a couple of ideas on how it can be improved a little, but I'd try them out once a test suite is added for the implementation. I'd like to ask you to add the file to the project, retaining you as the original author, and then I can start working on it and improving it. Specifically, the steps are:
After that, I can merge the pull request into a separate branch, add tests, possibly improve the code, and then create a pull request with you as a reviewer so you can check whether you agree with the changes. The feature will then be available in the next release of Cobrix. Let me know if this works for you.
Thank you, this is wonderful news indeed. It may take me a few days since I have some other higher-priority work I need to get to first, but I've added a story to my board to prepare this PR for you.
The PR is done, here are a few things that came to mind and I'm sure you are already thinking about:
Yes, all of the above considerations will be addressed.
I have a similar set of files. I converted the variable-block file to fixed-block, wrote the result as a new file in HDFS, and then used the Cobrix framework to parse the fixed-block file. My logic to convert variable block to fixed block:
def varBlckToFixedBlck(spark:SparkSession,inputFile:String,hdfsPath:String):String={
Yes, the feature to read VB (VBVR) records directly will be supported in the next release (2.4.0), planned for next week.
This is not available yet, but will be in |
Cobrix .option("record_format", "VB")
.option("is_bdw_big_endian", "true")
.option("is_rdw_big_endian", "true")
.option("bdw_adjustment", "-4") |
Please let me know if it worked for you.
Thanks for the information, @yruslan. These options worked for me. Can you tell me how to parse a VBF (variable block with fixed-width records) file?
VBF files are something new. Cobrix doesn't have direct support for them, but a custom raw record extractor can be used to parse such files. Could I ask you to create a separate feature request for FB record types? I've found documentation on this: https://www.ibm.com/docs/en/zos-basic-skills?topic=set-data-record-formats We can implement direct support for such files. If it is possible from your side to provide a small example FB file (a synthetic file is OK), it can speed up the process.
Background [Optional]
In #338, @yruslan added support for a custom record extractor interface so that users could write their own custom record extractors. I have implemented a custom record extractor to read variable block, variable record datasets as defined by the IBM documentation I referenced in #338.
In implementing this record extractor, my first attempt resulted in a bug where only a single record would be returned from the last block.
I believe that this is because when I call RawRecordContext.next(n) to read the last block of records it advances the "pointer" to EOF, which makes hasNext() return false even if we haven't processed all of the records from that block.
My second attempt, which I've attached, tries to "decouple" hasNext() from the RawRecordContext by implementing a buffering queue and putting the logic to read blocks inside hasNext(), which I don't regard as an ideal solution.
What I don't like about my current implementation is that it has logic inside hasNext() and generally seems inelegant and complicated.
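For readers following along, the buffering idea described above can be sketched outside of Cobrix as a plain Scala iterator. This is only a minimal sketch, not the attached implementation: it reads from a `java.io.InputStream` rather than `RawRecordContext`, the names `VbvrRecordIterator` and `VbvrDemo` are hypothetical, and it assumes the standard layout in which the 4-byte BDW and the 4-byte RDW each carry a 2-byte big-endian length that includes the header itself.

```scala
import java.io.{ByteArrayInputStream, DataInputStream, InputStream}
import scala.collection.mutable

// Hypothetical stand-alone iterator (not part of the Cobrix API).
// hasNext depends on the queue of already-parsed records, not on the stream position.
class VbvrRecordIterator(in: InputStream) extends Iterator[Array[Byte]] {
  private val data = new DataInputStream(in)
  private val pending = mutable.Queue.empty[Array[Byte]]
  private var exhausted = false

  // Reads one whole block and queues every record payload found in it.
  private def fillFromNextBlock(): Unit = {
    val first = data.read() // -1 signals a clean end of input
    if (first < 0) {
      exhausted = true
    } else {
      // Assumption: BDW = 2-byte big-endian length (header included) + 2 reserved bytes.
      val blockLen = (first << 8) | data.readUnsignedByte()
      data.readUnsignedShort() // consume the 2 reserved bytes
      require(blockLen >= 4, s"invalid block length $blockLen")
      val block = new Array[Byte](blockLen - 4)
      data.readFully(block)
      var pos = 0
      while (pos + 4 <= block.length) {
        // Assumption: RDW = 2-byte big-endian length (header included) + 2 reserved bytes.
        val recLen = ((block(pos) & 0xFF) << 8) | (block(pos + 1) & 0xFF)
        require(recLen >= 4 && pos + recLen <= block.length, s"invalid record length $recLen")
        pending.enqueue(java.util.Arrays.copyOfRange(block, pos + 4, pos + recLen))
        pos += recLen
      }
    }
  }

  override def hasNext: Boolean = {
    // Keep reading blocks until a record is queued or the input ends (skips empty blocks).
    while (pending.isEmpty && !exhausted) fillFromNextBlock()
    pending.nonEmpty
  }

  override def next(): Array[Byte] = {
    if (!hasNext) throw new NoSuchElementException("no more records")
    pending.dequeue()
  }
}

// Builds a small synthetic VBVR byte stream and decodes it again.
object VbvrDemo {
  private def withLength(body: Array[Byte]): Array[Byte] = {
    val len = body.length + 4
    Array((len >> 8).toByte, len.toByte, 0.toByte, 0.toByte) ++ body
  }

  private def block(records: String*): Array[Byte] =
    withLength(records.flatMap(r => withLength(r.getBytes("US-ASCII")).toSeq).toArray)

  // Two blocks; the last block holds two records, which is the case that failed before.
  def sample: Array[Byte] = block("A1", "B22") ++ block("C333", "D4")

  def decode(bytes: Array[Byte]): List[String] =
    new VbvrRecordIterator(new ByteArrayInputStream(bytes))
      .map(bytes => new String(bytes, "US-ASCII"))
      .toList
}
```

With this shape, `VbvrDemo.decode(VbvrDemo.sample)` yields all four records, including both records of the final block, because reaching the end of the stream only sets `exhausted` and does not discard what is already queued. The block-reading logic is still triggered from `hasNext`, so this illustrates the pattern in question rather than resolving the design concern.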
Question
Should support for VBVR record types be implemented as a separate block-aware reader type in Cobrix? If not, do you see a better and more elegant way than what I have implemented?
VariableBlockRecordExtractor.scala.zip