tHTMLInput- Parse HTML in Talend

tHTMLInput From Exchange

tHTMLInput is an excellent component for HTML parsing. It accepts custom selectors mapped to a defined schema and provides the output in tabular format.

Note: The tHTMLInput component is published for the community, and this post is the official documentation for tHTMLInput.

Getting started with tHTMLInput.

Download the tHTMLInput component from Talend Exchange and install it. After installation, you can configure the following settings in the component.

Note: If the component is not displayed in the default list, use the search option with the keyword html or tHTMLInput. The component will then appear in the search results and you can download it.

  • Set the time-out value as an integer.
  • Set the URL (the address of the page to parse). We are using "https://www.talendforge.org/forum/viewtopic.php?id=43567".
  • User Agent: keep the default.
  • Follow redirect: set it according to your URL's behaviour.
  • Max Body Size.
  • Parent Element. This is an important value, because tHTMLInput uses it as the parent element from which it parses all the child elements listed in the schema. In our example we are using "div.postleft", which selects DIV tags with the class name postleft. For more information on selector syntax, see Jsoup.org.
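To make these settings more concrete, the sketch below shows roughly what they control when a page is fetched. It uses only the Python standard library, not the component's own code (which appears to build on Jsoup), so the function name fetch_page, its parameter names, and the default values are all assumptions made for illustration.

```python
from urllib.request import Request, urlopen


def fetch_page(url, timeout_s=10, user_agent="Mozilla/5.0", max_body_size=1_000_000):
    """Fetch a page and return at most max_body_size bytes of its body.

    The parameters are hypothetical stand-ins for the component settings
    (time-out, URL, User Agent, Max Body Size); the units and defaults
    used by tHTMLInput itself may differ.
    """
    request = Request(url, headers={"User-Agent": user_agent})
    # urllib follows HTTP redirects by default, which is comparable to
    # leaving "Follow redirect" enabled.
    with urlopen(request, timeout=timeout_s) as response:
        # Reading a bounded number of bytes caps the size of the body.
        return response.read(max_body_size)
```

Calling fetch_page with the forum URL from the list above would return the raw bytes of that page, which a parser can then process.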

Now configure the schema. This is the list of attributes or elements that we want to get from the parent element, e.g. links, text, attributes.

  • Open the schema editor and create the following columns with their respective data types.
    • Author = String
    • MemberType = String
    • NumberOfPost = String
    • ProfileLink = Object

Note that we are using the Object data type for the "ProfileLink" column because we want the link itself, not the link text, e.g.

<a href="profile.php?id=20081">umeshrakhe</a>

If we set the String type for the "ProfileLink" column, it returns only the link text, which is umeshrakhe. To get the entire link, use Object as the data type.
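The difference between the link text and the entire link can be illustrated outside Talend. The following is a minimal sketch in Python's standard html.parser, not the component's implementation; the class FirstLinkParser and the function link_text_and_tag are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser


class FirstLinkParser(HTMLParser):
    """Records the text and the full markup of the first <a> element."""

    def __init__(self):
        super().__init__()
        self.inside = False    # True while between <a ...> and </a>
        self.finished = False  # True once the first link is complete
        self.text = ""         # what a String column would hold
        self.markup = ""       # the whole tag, as described for Object

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self.inside and not self.finished:
            self.inside = True
            # Raw source text of the start tag, attributes included.
            self.markup = self.get_starttag_text()

    def handle_data(self, data):
        if self.inside:
            self.text += data
            self.markup += data

    def handle_endtag(self, tag):
        if tag == "a" and self.inside:
            self.inside = False
            self.finished = True
            self.markup += "</a>"


def link_text_and_tag(html):
    """Return (link text, whole link markup) for the first link in html."""
    parser = FirstLinkParser()
    parser.feed(html)
    parser.close()
    return parser.text, parser.markup


text, tag = link_text_and_tag('<a href="profile.php?id=20081">umeshrakhe</a>')
print(text)  # umeshrakhe
print(tag)   # <a href="profile.php?id=20081">umeshrakhe</a>
```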

Close the schema editor. Next, we add a selector for each column.

  • Go to the mapping table and set the following values.
    • Author => Selector Code = "a[href]"
    • MemberType => Selector Code = "dd[class=usertitle]"
    • NumberOfPost => Selector Code = "dd:contains(posts)"
    • ProfileLink => Selector Code = "a[abs:href]"
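The idea behind the parent element and the per-column selectors is that every match of the parent yields one output row, and each selector is evaluated inside that parent. The sketch below imitates this with Python's standard library only; it is not how the component is implemented. The sample markup, the class PostLeftParser, and the function extract_rows are hypothetical, and storing the resolved URL in ProfileLink is an assumption about what "abs:href" contributes (the real component may return the whole element).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE_URL = "https://www.talendforge.org/forum/viewtopic.php?id=43567"

# Hypothetical markup, loosely modelled on a forum page; not copied from the site.
SAMPLE_HTML = """
<div class="postleft"><dl>
  <dt><strong><a href="profile.php?id=20081">umeshrakhe</a></strong></dt>
  <dd class="usertitle"><strong>Member</strong></dd>
  <dd>Posts: 12</dd>
</dl></div>
<div class="postleft"><dl>
  <dt><strong><a href="profile.php?id=101">guest01</a></strong></dt>
  <dd class="usertitle"><strong>Guest</strong></dd>
  <dd>From: Somewhere</dd>
  <dd>Posts: 3</dd>
</dl></div>
"""


class PostLeftParser(HTMLParser):
    """Builds one row per <div class="postleft"> parent element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rows = []       # finished rows, one per parent element
        self.row = None      # row being filled while inside a parent
        self.div_depth = 0   # <div> nesting depth inside the parent
        self.capture = None  # kind of child element currently being read
        self.buffer = []     # text collected for that child element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.row is None:
            # Parent selector: div.postleft
            if tag == "div" and "postleft" in (attrs.get("class") or "").split():
                self.row = dict.fromkeys(
                    ["Author", "MemberType", "NumberOfPost", "ProfileLink"])
                self.div_depth = 1
            return
        if tag == "div":
            self.div_depth += 1
        elif (tag == "a" and "href" in attrs
              and self.row["Author"] is None and self.capture is None):
            # First link in the parent: a[href] and a[abs:href]
            self.capture, self.buffer = "a", []
            self.row["ProfileLink"] = urljoin(self.base_url, attrs["href"] or "")
        elif tag == "dd" and self.capture is None:
            # dd[class=usertitle] versus any other dd
            is_title = attrs.get("class") == "usertitle"
            self.capture, self.buffer = ("usertitle" if is_title else "dd"), []

    def handle_data(self, data):
        if self.capture:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.row is None:
            return
        text = "".join(self.buffer).strip()
        if tag == "a" and self.capture == "a":
            self.row["Author"] = text
            self.capture = None
        elif tag == "dd" and self.capture in ("usertitle", "dd"):
            if self.capture == "usertitle":
                self.row["MemberType"] = text
            elif "posts" in text.lower() and self.row["NumberOfPost"] is None:
                # dd:contains(posts), matched without regard to case
                self.row["NumberOfPost"] = text
            self.capture = None
        elif tag == "div":
            self.div_depth -= 1
            if self.div_depth == 0:  # the parent element is closed
                self.rows.append(self.row)
                self.row = None


def extract_rows(html, base_url):
    """Return a list of dicts, one per parent element found in html."""
    parser = PostLeftParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.rows


for row in extract_rows(SAMPLE_HTML, PAGE_URL):
    print(row)
```

Each printed dictionary corresponds to one row of the tabular output, with the four schema columns as keys.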

The tHTMLInput settings look like the image below.

tHTMLInput Setting

  • Add a tLogRow, sync the columns, and execute the job. You will see the following output on the console.
tHTMLInput Output

We provided basic information to the tHTMLInput component and it returned the result in tabular format. In the same way, you can parse multiple pages by using a dynamic URL together with other Talend components. If you face any problem, please reach out to our expert team through the comments or the contact us page.
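A dynamic URL simply means that one part of the address changes from page to page. As a minimal sketch in Python, assuming the pages differ only in the id query parameter, the list of URLs to parse could be generated as follows. The function topic_urls and the second topic id are hypothetical; in a Talend job the varying value would come from another component or a context variable.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.talendforge.org/forum/viewtopic.php"


def topic_urls(topic_ids):
    """Return one page URL per topic id, varying only the id parameter."""
    return [f"{BASE_URL}?{urlencode({'id': topic_id})}" for topic_id in topic_ids]


# 43567 is the topic used in this post; 43568 is a made-up second id.
for url in topic_urls([43567, 43568]):
    print(url)
```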

About dwetl

20 comments on “tHTMLInput- Parse HTML in Talend”

  1. Hello there.

    Thanks a lot for this tutorial!!

    But… I don’t know how to extract data from a website table.

    I don’t know how to search the source code. What can I use in the Mapping area (Selector Code)?

    1. Hi Juan,
      Without a selector it is very difficult to fetch the required details. If you don’t know the particular selector, you can directly use HTML tags like “p”, “span”, “div”, or any other tag that you think contains the required information.

      Otherwise, you can use the contact us page to send me a private message with the website URL and the details you want to extract.

      Thank You
      Dwetl team

      1. Thank you dwetl for this great component.
        It works for me. I had forgotten to set the type to Object, so I didn’t get the whole tag.

        Another question:
        How can I get the redirected links?
        Can you make a tutorial?

  2. Thanks for your work! Just some feedback: I have tried some use cases and ran into issues in these cases:
    – Working behind a proxy. I used tSetProxy but the connection does not succeed.
    – Working with a local file. I used “file:///…” but the connection does not succeed. Reading the Jsoup API, they talk about the parse method: “There is a sister method parse(File in, String charsetName) which uses the file’s location as the baseUri. This is useful if you are working on a filesystem-local site and the relative links it points to are also on the filesystem.”
    Implementing the local file case would be really interesting ;))
    Regards

  3. How do I get the class value?
    Example:

    Hello World

    Output: democlass

    In the above code I need the value of class. How can I write the code to get the value of class in tHTMLInput?

    1. You can write a class in routines and then call its method in the tHTMLInput component. If you post an example, we can give you a better suggestion.

  4. Hi,
    I am trying to crawl HTML table data from a webpage using the tHTMLInput component with particular tag attributes. If the table contains an empty row, the component repeats all the column data again in that empty row. Below is the job sample I am trying.

    tHTMLInput > tMap > tFileOutputDelimited

    Exact question: How can I avoid empty tag attribute values in the crawled data?

    1. Can you provide more details, such as your source data and the output you get? I want to reproduce it so that I can assist better.
