Data Validation Using tSchemaComplianceCheck

Data Validation Using tSchemaComplianceCheck
5 (100%) 4 votes

In my Last post, I have demonstrated how to Validate CSV headers  using Talend. In this post we`ll see how to validate source data with output schema? for that we will use tSchemaComplianceCheck component.

tSchemaComplianceCheck

Validates all input rows against a reference schema or check types, nullability, length of rows against reference values. The validation can be carried out in full or partly.

Different use cases:

  • This component is an intermediary step in the flow allowing to exclude from the main flow the non-compliant data.
  • This component cannot be a start component as it requires an input flow.
  • It also requires at least one output component to gather the validated flow.
  • Possibly a second output component for rejected data using Rejects link.

For demonstration we will use below input data.

ID;Name;BirthDate;State;City
1;Dwight;06-04-2008;Delaware;Concord
2;Warren;25-10-2008;Montana
3;Benjamin;17-08-2008;Washington;Austin
4;Harry;14-04-2008;Kansas;Annapolis
5;Ulysses;2007-04-12;Michigan;Raleigh
6;James;19-08-2007;Delaware;Charleston
.7;Bill;20-04-2007;Illinois;Bismarck
8;Ulysses;04-12-2008;;Saint Paul
9;Thomas;09-05-2008;Maryland;Albany
10;Ronald;11-02-2008;Florida;Hartford
Step 1: Create Job and add following components.
  • Add tFileInputDelimited and configure as follows.
    • File Name=”c:\SampleInput.txt”
    • Row Separator & Filed Separator as default.
    • Header=1, footer=0,limit=””
    • Skip Empty Rows=selected
    • Advance Setting–>Trim All Columns=Selected
    • Create Schema using Metatdata. and define all the columns data type =string.
  • Add tSchemaComplianceCheck component and configure it as follows.
    • Connection tFileInputDelimited with tSchemaComplianceCheck component using row->Main
    • Synch Columns.
  • Add 2, tLogRow components and
    • Connect first tLogRow with tSchemaComplianceCheck using Row->Main
    • Connect Second tLogRow with tSchemaComplianceCheck using Row->Rejects

Note: We will configure this component for various mods and see how does it behaves.

Mode 1: Check All Columns From Schema.

Now if you execute your job you will get output like below.

tSchemaComplianceCheck use case in Talend-output 1
tSchemaComplianceCheck use case in Talend-output 1

If you notice our schema is not having all the columns as string data type so and destination also not having any difference hence all the rows went through Main row connection. Not Rejected any Row

Now we will change the tSchemaComplianceCheck Mode-to “Costume Define” with following changes.

  • id {Datatype=Ineteger}
  • BirthDate{DataType=Date,format=”dd-MM-yyyy”}
  • Name{Length=7, maxlength=selected}
  • State{length=10, maxlength=selected}
  • City{Length=10, maxlength=selected}

Mode 2: Costume Define

See the below image for more details.

tSchemaComplianceCheck1 use case custome defined settings
tSchemaComplianceCheck1 use case custome defined settings

Note: change only those properties which are mentioned above setting, keep other setting as it is.

Save the and execute the job you will see below output. if observe you will see all the non compliance rows rejected and shown in reject table.

Main flow has only those rows which comply with our custom schema defined.

tSchemaComplianceCheck2 use case mode output -2
tSchemaComplianceCheck2 use case mode output -2

 Mode 3: Use another schema for compliance check.

change the tSchemaComplianceCheck setting to add another schema for check.

  • Select “Use another schema for compliance check.” mode from basic setting tSchemaComplianceCheck component.
  • Select “schema to test” as built in and click on “synch column” button.
  • Click on “Edit Schema” button to edit as follows.
    • id {Datatype=Ineteger}
    • BirthDate{DataType=Date,format=”dd-MM-yyyy”}
    • Name{Length=7, maxlength=selected}
    • State{length=10, maxlength=selected}
    • City{Length=10, maxlength=selected}
  • See final configuration look like below image.

tSchemaComplianceCheck use case mode-3 setting

tSchemaComplianceCheck use case mode-3 setting

Save and execute the job, you will gate same output as above. by this way you can use schema from repository to validate the data.

tSchemaComplianceCheck use case mode-3  output
tSchemaComplianceCheck use case mode-3 output

This is a very good component but it has several limitations, it will not check for column headers.

 

About Umesh

I am Software consultant with approx 7 years of experience mainly in Business Intelligence and data warehousing assignments using Talend. Writing is not my passion but i am doing it to help others. if you have any special case where you want me to demonstrate then please post me.

5 comments on “Data Validation Using tSchemaComplianceCheck

  1. I have list of XML files in folder, i have tfilelist component as first componet in job, i want to compare schema for every file in this folder against a xml schema that i have created and any file that doesnt match with this schema should be moved to different folder. I am trying to use tSchemaCompliancecheck component here to move rejected files to different folder and accepted files to sent to tFileInputXML component to process further.
    tFileList provides iterate connection and tSchemaComponentCheck accepts only main connection as input connection.
    How can I compare file schema and move rejected files?
    Thanks

    1. Hi Jenny,

      If I understood correct, you want to validate XML files not XML schema? if so then you need to validate XML using tXSDValidator component not schema compliance check. schema compliance check will validate your source structure/schema with target structure/schema.

      Job Design should be like this.

      tFileList—-iterator—-tXSDValidator–if—tFileInputXML—furtherjob
      |____tFileCopy(DiffrentFolder)

      let me know if you have any further questions.

      Thanks
      Umesh

Leave a Reply to Jenny Cancel reply

Your email address will not be published. Required fields are marked *