Validate CSV headers in Talend.

Validate CSV headers in Talend.
5 (100%) 7 votes

File processing is a day to day task in ETL world, and there is huge need of validation regarding source file format, headers, footers, column name, data type and so on, thanks to tSchemaComplianceCheck component which can do most of the validation like.

  • Length checking.
  • Date pattern/format.
  • Data Types

But does not support number of columns and column sequence validation, that which we have to manage using java code, in this post I will describe you how to validate column names and their sequence.

This is our final job design.

Complete CSV Validation Job
Complete CSV Validation Job

Let’s start with adding first component to job designer. Add tFileList and configure to get expected files.

Add tFileInputFullRow component and configure as shown in below screen.

tFileInputFullRow Configuration
tFileInputFullRow Configuration

 

tFileInputFullRow Configuration

  • Add tMap and connect with main link from tFileinputFullRow component.
  • Add tFixedFlowInput and connect with lookup link to tMap then configure as follows.

Note: if you have your refrence header row stored in file or database you can use it instead of tFixedFlowInput.

tFixedFlowInput Configuration
tFixedFlowInput Configuration
  • Configure tMap as follows.
    • Make inner join with your reference line and main line of input.
    • Add two outputs and named it as “matching” and “reject” respectively.
    • In the reject output click on setting “catch lookup inner join reject”=true
    • Add source line to both the flows.

See image for more details.

tMap Setting
tMap Setting
  • Add tJava next to tMap and connect with “matching” flow.
  • Add another tJava next to tMap and connect with “reject” flow.
  • Add tFileInputDelimited and connect with first tjava using “iterate” flow.
  • Configure tFileInputDelimited as shown in below image.

Add tLogRow component to see the output from file.

tFileInputDelimited Configuration
tFileInputDelimited Configuration

You can see that for each file whole sub job will be executed if it is matching with header row then it will be used for reading.

You can connect reject row to make a note of rejected file based on your requirement.

About Umesh

I am Software consultant with approx 7 years of experience mainly in Business Intelligence and data warehousing assignments using Talend. Writing is not my passion but i am doing it to help others. if you have any special case where you want me to demonstrate then please post me.

7 comments on “Validate CSV headers in Talend.

  1. hello umesh ,

    nice post…

    can you pls help me out on following scenario.
    we have various xls files present on server.
    format : sales price MM yyyy.xls.
    example : 1) sales price 01 2014.xls
    2) sales price 03 2014.xls
    3) sales price 09 2014. xls
    4) sales price 01 2015.xls
    … 5) sales price 07 2014.xls

    requirement : I need to fetch latest file I.e. sales price 01 2015.xls

    note: files are not present in exact date order ..means files of 2015 can come after 2013 or prior to 2010 .

    1. Design your job like below.

      tFileList—Iterator—tFileproperty—-tMap—-thashOutput
      onsubjobok
      tHashInput–tAggrigateRow—yourFile.

      Now add fowllowing code in in tmap variable.

      basename.substring(basename.indexOf(“.”)-7).replace(“.xls”, “”)

      use this varibale to parse in output column and named as fileDate.
      TalendDate.parseDate(“MM yyyy”, Var.Name)

      Configure tAggrigaterow component
      click on add button below the oration and add fileDate column.
      Select Max Operation from list, it will give you latest file

  2. Hi Umesh,
    the above post is very useful.
    Could you please help me out on this ?
    My requirement is :-
    The file which gets generated after running etl is “XYZ_20150603_1.csv”

    where XYZ is hardcoded value,
    20150603 is the system date
    and ‘1’ indicates number of times the etl has run in a single day.
    Suppose if etl has ran for 2 times then the value needs to be ‘2’
    My question is how to generate a counter like 1, 2, 3, …… based on the number of times an etl has run in a single day.

    1. Hi Mahesh,

      Thank you for posting such good scenario, you can see my post here wherein I added all the required details for you.

      Regards
      dwetl

Leave a Reply to Rsh Cancel reply

Your email address will not be published. Required fields are marked *