
Custom record extractor implementation for VBVR data files, should it not use rawRecordExtractor? #412

Closed
mark-weghorst opened this issue Aug 17, 2021 · 12 comments · Fixed by #418 or #419

@mark-weghorst commented Aug 17, 2021

Background [Optional]

In #338 @yruslan added support for a custom record extractor interface so that users could write their own record extractors. I have implemented a custom record extractor to read variable-block, variable-record (VBVR) datasets as defined by the IBM documentation I referenced in #338.

In implementing this record extractor, my first attempt had a bug where the last block would only return a single record.

I believe this is because when I call RawRecordContext.next(n) to read the last block of records, it advances the "pointer" to EOF, which makes hasNext() return false even if we haven't processed all of the records from that block.

My second attempt, which I've attached, tries to "decouple" hasNext() from the RawRecordContext by implementing a buffering queue and moving the block-reading logic inside hasNext(), which I don't regard as an ideal solution.

The thing I don't like about my current implementation is that it keeps logic inside hasNext() and generally seems inelegant and complicated.
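
For illustration, here is a minimal sketch of this buffering-queue approach (the RawStream trait and all names below are hypothetical stand-ins, not Cobrix's actual RawRecordContext API): hasNext() refills the queue one block at a time, so hitting EOF while reading the last block no longer truncates it.

import scala.collection.mutable

// Hypothetical stand-in for the raw stream wrapped by the record extractor context.
trait RawStream {
  def isEndOfStream: Boolean
  def next(numberOfBytes: Int): Array[Byte] // advances the stream pointer
}

// Block-aware extractor: the record queue decouples hasNext() from the stream position.
class VbvrRecordExtractor(stream: RawStream) extends Iterator[Array[Byte]] {
  private val records = mutable.Queue.empty[Array[Byte]]

  override def hasNext: Boolean = {
    if (records.isEmpty && !stream.isEndOfStream) readNextBlock()
    records.nonEmpty
  }

  override def next(): Array[Byte] = records.dequeue()

  // Reads one block: a 4-byte BDW (2-byte big-endian length + 2 reserved bytes),
  // then records, each prefixed by a 4-byte RDW of the same shape. Lengths are
  // assumed to include the descriptor word itself, per the IBM convention.
  private def readNextBlock(): Unit = {
    val bdw = stream.next(4)
    if (bdw.length == 4) {
      val blockPayload = (((bdw(0) & 0xFF) << 8) | (bdw(1) & 0xFF)) - 4
      var bytesRead = 0
      while (bytesRead < blockPayload) {
        val rdw = stream.next(4)
        val recordPayload = (((rdw(0) & 0xFF) << 8) | (rdw(1) & 0xFF)) - 4
        records.enqueue(stream.next(recordPayload))
        bytesRead += recordPayload + 4
      }
    }
  }
}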

Question

Should support for VBVR record types be implemented as a separate block-aware reader type in Cobrix? If not, do you see a better and more elegant way than what I have implemented?

VariableBlockRecordExtractor.scala.zip

mark-weghorst added the question (Further information is requested) label on Aug 17, 2021
@yruslan (Collaborator) commented Aug 17, 2021

Thanks for the contribution! I'll take a look and get back to you.

yruslan added the accepted and enhancement labels and removed the question label on Aug 18, 2021
yruslan self-assigned this on Aug 18, 2021
@yruslan (Collaborator) commented Aug 18, 2021

I've checked the implementation, and it looks good. It would be really nice to have the VBVR feature as part of Cobrix. I have a couple of ideas on how it can be improved a little, but I'd like to try them out once a test suite is added for the implementation.
If you are okay with using the implementation as a custom record extractor for now, I can integrate this as a native Cobrix feature and add tests as well. But I can only do it in September.

I'd like to ask you to add the file to the project, retaining you as the original author, and then I can start working on improving it.

Specifically, the steps are:

  • Fork Cobrix
  • Create a branch
  • Add the file to za.co.absa.cobrix.cobol.reader.extractors.raw package (cobol-parser module)
  • Add the copyright header to the top of the file by copying one from any other Scala file in Cobrix
  • Create a pull request to Cobrix

After that, I can merge the pull request into a separate branch, add tests, possibly improve the code, and then create a pull request with you as a reviewer so you can check whether you agree with the changes. The feature will then be available in the next release of Cobrix.

Let me know if it works for you.

@mark-weghorst (Author) commented

Thank you, this is wonderful news indeed. It may take me a few days since I have some higher-priority work to get to first, but I've added a story to my board to prepare this PR for you.

@mark-weghorst (Author) commented

The PR is done. Here are a few things that came to mind, which I'm sure you are already thinking about:

  • The current implementation only supports big endian block and record descriptor words
  • I'm not sure what your plans are for configuration switches, but as written this will only work if specified as a custom record extractor

@yruslan (Collaborator) commented Aug 25, 2021

Yes, all of the above considerations will be addressed.
I'll merge the PR and start working on the feature next week.

  • Both big endian and little endian will be supported (a decoding sketch follows this list)
  • The feature will be built-in
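
As an illustration of the endianness point, decoding the 2-byte length field at the start of a 4-byte BDW or RDW differs only in byte order (a standalone sketch, not Cobrix's internal code):

import java.nio.{ByteBuffer, ByteOrder}

// Extract the length field from the first two bytes of a 4-byte BDW or RDW.
def descriptorLength(header: Array[Byte], isBigEndian: Boolean): Int = {
  val order = if (isBigEndian) ByteOrder.BIG_ENDIAN else ByteOrder.LITTLE_ENDIAN
  ByteBuffer.wrap(header, 0, 2).order(order).getShort & 0xFFFF
}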

@sree018 commented Sep 3, 2021

I have a similar set of files. I converted the variable-block files to fixed-block, created new files in HDFS, and then used the Cobrix framework to parse the fixed-block files.
Is there any way to parse the original file instead of creating a new file?

My logic to convert variable block to fixed block (cleaned up so it compiles; it assumes a 2-byte block/record length followed by 2 reserved bytes, as in the original):

import java.io.BufferedOutputStream
import org.apache.commons.io.{FilenameUtils, IOUtils}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

def varBlckToFixedBlck(spark: SparkSession, inputFile: String, hdfsPath: String): String = {
  val fileName = FilenameUtils.getName(inputFile)
  val conf = spark.sparkContext.hadoopConfiguration
  val fs = FileSystem.get(conf)
  val outputFile = s"$hdfsPath/fixBlck_$fileName"
  val inputStream = fs.open(new Path(inputFile)) // fs.open already returns an FSDataInputStream
  val outputStream = new BufferedOutputStream(fs.create(new Path(outputFile)))
  val padding = Array[Byte](0, 0) // pad 2-byte lengths so BigInt stays positive
  var recCount = BigInt(0)
  val recLengths = scala.collection.mutable.Map[Int, Int]().withDefaultValue(0)
  while (inputStream.available() > 0) {
    val bdwBytes = IOUtils.toByteArray(inputStream, 2) // first 2 bytes of the BDW hold the block length
    val blockLength = BigInt(padding ++ bdwBytes).toInt
    val block = Array.ofDim[Byte](blockLength - 2) // rest of the block, incl. 2 reserved BDW bytes
    IOUtils.read(inputStream, block)
    val records = block.drop(2) // skip the 2 reserved BDW bytes
    var offset = 0
    while (offset < blockLength - 4) {
      val recLength = BigInt(padding ++ records.slice(offset, offset + 2)).toInt
      recCount += 1
      offset += recLength
      recLengths(recLength) += 1 // histogram of record lengths
    }
    outputStream.write(records) // records with their RDWs, minus the BDW
  }
  inputStream.close()
  outputStream.close()
  outputFile // the declared String return value
}

@yruslan (Collaborator) commented Sep 3, 2021

Yes, the feature to read VB (VBVR) records directly will be supported in the next release (2.4.0), planned for next week.
The new option will look like this:

.option("record_format", "VB")
.option("bdw_adjustment", 0)
.option("is_bdw_big_endian", "true")

This is not available yet, but will be in 2.4.0.

@yruslan (Collaborator) commented Sep 7, 2021

Cobrix 2.4.0 is released. You can try using .option("record_format", "VB") to get direct support for VBVR files.
For your example, you might need to add the options that specify that the record headers are big endian, and use a BDW adjustment (a complete read call is sketched after the options):

.option("record_format", "VB")
.option("is_bdw_big_endian", "true")
.option("is_rdw_big_endian", "true")
.option("bdw_adjustment", "-4")

@yruslan (Collaborator) commented Sep 7, 2021

Please let me know if it worked for you.

@sree018 commented Sep 10, 2021

Thanks for the information, @yruslan.

These options worked for me:
.option("record_format", "VB")
.option("is_bdw_big_endian", "true")
.option("is_rdw_big_endian", "true")
.option("bdw_adjustment", "-4")
.option("rdw_adjustment", "-4")

Can you tell me how to parse VBF (variable block with fixed-length record) files?

@yruslan (Collaborator) commented Sep 10, 2021

VBF files are something new. Cobrix doesn't have direct support for them, but a custom raw record extractor can be used to parse such files.

Could I ask you to create a separate feature request for FB record types?

I've found documentation on this: https://www.ibm.com/docs/en/zos-basic-skills?topic=set-data-record-formats
Such files are referred to as FB.

We can implement direct support for such files. If you can provide a small example FB file (a synthetic file is ok), it will speed up the process.

@yruslan (Collaborator) commented Sep 17, 2021

@sree018, a PR for FB format is created: #421

You can test it before merging by cloning the repo, checking out the 'feature/420-fb-record-format' branch, and building from sources.
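
By analogy with the VB options above, once that PR is merged, reading an FB file would presumably look something like this (a hedged sketch; the actual option names are defined in PR #421 and may differ):

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy") // placeholder path
  .option("record_format", "FB") // assumed new record_format value from PR #421
  .load("/path/to/fb/data") // placeholder path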
