
Custom record extractor implementation for VBVR data files, should it not use rawRecordExtractor? #412

Closed
mark-weghorst opened this issue Aug 17, 2021 · 12 comments · Fixed by #418 or #419

@mark-weghorst commented Aug 17, 2021

Background [Optional]

In #338 @yruslan added support for a custom record extractor interface so that users could write their own record extractors. I have implemented a custom record extractor to read variable-block, variable-record (VBVR) datasets as defined by the IBM documentation I referenced in #338.

In implementing this record extractor, my first attempt had a bug where the last block would only return a single record.

I believe this is because when I call RawRecordContext.next(n) to read the last block of records, it advances the "pointer" to EOF, which makes hasNext() return false even if we haven't processed all of the records from that block.

My second attempt, which I've attached, tries to "decouple" hasNext() from the RawRecordContext by implementing a buffering queue and moving the block-reading logic inside hasNext(), which I don't regard as an ideal solution.

The thing I don't like about my current implementation is that it keeps logic inside hasNext() and generally seems inelegant and complicated.
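
For illustration, here is a minimal sketch of this buffering-queue approach (the RawStream trait and all names below are hypothetical stand-ins, not Cobrix's actual RawRecordContext API): hasNext() refills the queue one block at a time, so hitting EOF while reading the last block no longer truncates it.

import scala.collection.mutable

// Hypothetical stand-in for the raw stream wrapped by the record extractor context.
trait RawStream {
  def isEndOfStream: Boolean
  def next(numberOfBytes: Int): Array[Byte] // advances the stream pointer
}

// Block-aware extractor: the record queue decouples hasNext() from the stream position.
class VbvrRecordExtractor(stream: RawStream) extends Iterator[Array[Byte]] {
  private val records = mutable.Queue.empty[Array[Byte]]

  override def hasNext: Boolean = {
    if (records.isEmpty && !stream.isEndOfStream) readNextBlock()
    records.nonEmpty
  }

  override def next(): Array[Byte] = records.dequeue()

  // Reads one block: a 4-byte BDW (2-byte big-endian length + 2 reserved bytes),
  // then records, each prefixed by a 4-byte RDW of the same shape. Lengths are
  // assumed to include the descriptor word itself, per the IBM convention.
  private def readNextBlock(): Unit = {
    val bdw = stream.next(4)
    if (bdw.length == 4) {
      val blockPayload = (((bdw(0) & 0xFF) << 8) | (bdw(1) & 0xFF)) - 4
      var bytesRead = 0
      while (bytesRead < blockPayload) {
        val rdw = stream.next(4)
        val recordPayload = (((rdw(0) & 0xFF) << 8) | (rdw(1) & 0xFF)) - 4
        records.enqueue(stream.next(recordPayload))
        bytesRead += recordPayload + 4
      }
    }
  }
}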

Question

Should support for VBVR record types be implemented as a separate block-aware reader type in Cobrix? If not, do you see a better and more elegant way than what I have implemented?

VariableBlockRecordExtractor.scala.zip

mark-weghorst added the question (Further information is requested) label on Aug 17, 2021
@yruslan (Collaborator) commented Aug 17, 2021

Thanks for the contribution! I'll take a look and get back to you.

yruslan added the accepted and enhancement labels and removed the question label on Aug 18, 2021
yruslan self-assigned this on Aug 18, 2021
@yruslan (Collaborator) commented Aug 18, 2021

I've checked the implementation, and it looks good. It would be really nice to have the VBVR feature as part of Cobrix. I have a couple of ideas on how it can be improved a little, but I'd like to try them out once a test suite is added for the implementation.
If you are okay with using the implementation as a custom record extractor for now, I can integrate this as a native Cobrix feature and add tests as well. But I can only do it in September.

I'd like to ask you to add the file to the project, retaining you as the original author, and then I can start working on improving it.

Specifically, the steps are:

  • Fork Cobrix
  • Create a branch
  • Add the file to za.co.absa.cobrix.cobol.reader.extractors.raw package (cobol-parser module)
  • Add the copyright header to the top of the file by copying one from any other Scala file in Cobrix
  • Create a pull request to Cobrix

After that, I can merge the pull request into a separate branch, add tests, possibly improve the code, and then create a pull request with you as a reviewer so you can check whether you agree with the changes. The feature will then be available in the next release of Cobrix.

Let me know if it works for you.

@mark-weghorst (Author) commented

Thank you, this is wonderful news indeed. It may take me a few days since I have some higher-priority work to get to first, but I've added a story to my board to prepare this PR for you.

@mark-weghorst (Author) commented

The PR is done. Here are a few things that came to mind, which I'm sure you are already thinking about:

  • The current implementation only supports big endian block and record descriptor words
  • I'm not sure what your plans are for configuration switches, but as written this will only work if specified as a custom record extractor

@yruslan (Collaborator) commented Aug 25, 2021

Yes, all of the above considerations will be addressed.
I'll merge the PR and start working on the feature next week.

  • Both big endian and little endian will be supported (a decoding sketch follows this list)
  • The feature will be built-in
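
As an illustration of the endianness point, decoding the 2-byte length field at the start of a 4-byte BDW or RDW differs only in byte order (a standalone sketch, not Cobrix's internal code):

import java.nio.{ByteBuffer, ByteOrder}

// Extract the length field from the first two bytes of a 4-byte BDW or RDW.
def descriptorLength(header: Array[Byte], isBigEndian: Boolean): Int = {
  val order = if (isBigEndian) ByteOrder.BIG_ENDIAN else ByteOrder.LITTLE_ENDIAN
  ByteBuffer.wrap(header, 0, 2).order(order).getShort & 0xFFFF
}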

@sree018 commented Sep 3, 2021

I have a similar set of files. I converted the variable-block files to fixed-block, created new files in HDFS, and then used the Cobrix framework to parse the fixed-block files.
Is there any way to parse the original file instead of creating a new file?

My logic to convert variable block to fixed block (cleaned up so it compiles; it assumes a 2-byte block/record length followed by 2 reserved bytes, as in the original):

import java.io.BufferedOutputStream
import org.apache.commons.io.{FilenameUtils, IOUtils}
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

def varBlckToFixedBlck(spark: SparkSession, inputFile: String, hdfsPath: String): String = {
  val fileName = FilenameUtils.getName(inputFile)
  val conf = spark.sparkContext.hadoopConfiguration
  val fs = FileSystem.get(conf)
  val outputFile = s"$hdfsPath/fixBlck_$fileName"
  val inputStream = fs.open(new Path(inputFile)) // fs.open already returns an FSDataInputStream
  val outputStream = new BufferedOutputStream(fs.create(new Path(outputFile)))
  val padding = Array[Byte](0, 0) // pad 2-byte lengths so BigInt stays positive
  var recCount = BigInt(0)
  val recLengths = scala.collection.mutable.Map[Int, Int]().withDefaultValue(0)
  while (inputStream.available() > 0) {
    val bdwBytes = IOUtils.toByteArray(inputStream, 2) // first 2 bytes of the BDW hold the block length
    val blockLength = BigInt(padding ++ bdwBytes).toInt
    val block = Array.ofDim[Byte](blockLength - 2) // rest of the block, incl. 2 reserved BDW bytes
    IOUtils.read(inputStream, block)
    val records = block.drop(2) // skip the 2 reserved BDW bytes
    var offset = 0
    while (offset < blockLength - 4) {
      val recLength = BigInt(padding ++ records.slice(offset, offset + 2)).toInt
      recCount += 1
      offset += recLength
      recLengths(recLength) += 1 // histogram of record lengths
    }
    outputStream.write(records) // records with their RDWs, minus the BDW
  }
  inputStream.close()
  outputStream.close()
  outputFile // the declared String return value
}

@yruslan (Collaborator) commented Sep 3, 2021

Yes, the feature to read VB (VBVR) records directly will be supported in the next release (2.4.0), planned for next week.
The new option will look like this:

.option("record_format", "VB")
.option("bdw_adjustment", 0)
.option("is_bdw_big_endian", "true")

This is not available yet, but will be in 2.4.0.

@yruslan (Collaborator) commented Sep 7, 2021

Cobrix 2.4.0 is released. You can try using .option("record_format", "VB") to get direct support for VBVR files.
For your example, you might need to add the options that specify that the record headers are big endian, and use a BDW adjustment (a complete read call is sketched after the options):

.option("record_format", "VB")
.option("is_bdw_big_endian", "true")
.option("is_rdw_big_endian", "true")
.option("bdw_adjustment", "-4")

@yruslan (Collaborator) commented Sep 7, 2021

Please let me know if it worked for you.

@sree018 commented Sep 10, 2021

Thanks for the information, @yruslan.

These options worked for me:
.option("record_format", "VB")
.option("is_bdw_big_endian", "true")
.option("is_rdw_big_endian", "true")
.option("bdw_adjustment", "-4")
.option("rdw_adjustment", "-4")

Can you tell me how to parse VBF (variable block with fixed-length record) files?

@yruslan (Collaborator) commented Sep 10, 2021

VBF files are something new. Cobrix doesn't have direct support for them, but a custom raw record extractor can be used to parse such files.

Could I ask you to create a separate feature request for FB record types?

I've found documentation on this: https://www.ibm.com/docs/en/zos-basic-skills?topic=set-data-record-formats
Such files are referred to as FB.

We can implement direct support for such files. If you can provide a small example FB file (a synthetic file is ok), it will speed up the process.

@yruslan (Collaborator) commented Sep 17, 2021

@sree018, a PR for FB format is created: #421

You can test it before merging by cloning the repo, checking out the 'feature/420-fb-record-format' branch, and building from sources.
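
By analogy with the VB options above, once that PR is merged, reading an FB file would presumably look something like this (a hedged sketch; the actual option names are defined in PR #421 and may differ):

val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy") // placeholder path
  .option("record_format", "FB") // assumed new record_format value from PR #421
  .load("/path/to/fb/data") // placeholder path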
