Hello there.
Thanks a lot for this tutorial!
But I don’t know how to extract data from a website table.
I don’t know how to search the source code. What can I use in the Mapping area (Selector Code)?
Hi Juan,
Without a selector it is very difficult to fetch the required details. If you don’t know the particular selector, you can use HTML tags directly, such as “p”, “span”, “div”, or any other tag that you think contains the required information.
Otherwise, you can use the contact us page to send me a private message with the website URL and the details you want to extract.
Thank You
Dwetl team
Hello, how can I get the src attribute value of an image like this?
Hi Aser,
You can use
img[src]
as a selector and add the required attributes.
Could you give us an example of this, please? And what do you mean by “add the required attributes”? Thanks a lot.
Thank you for this great tool, but extracting the link from the href of an anchor tag is quite difficult.
I tried the example shown here and it didn’t work, and even your reply to Aser didn’t clarify the issue.
So how do I extract the link from an anchor tag?
Best regards.
I will submit a newer version soon, but some of these features may not be in it. You can provide a test case so that I can test it.
Thank you, dwetl, for this great component.
It works for me. I had forgotten to set the type to Object, so I didn’t get the whole tag.
Another question:
How can I get the redirected links?
Can you make a tutorial?
Thank you, Aser.
I will test redirected links and provide some tutorials.
Thanks for your work! Just to come back on this: I have tried some use cases and ran into issues in the following ones:
– Working behind a proxy. I used tSetProxy, but the connection does not succeed.
– Working with a local file. I used “file:///…”, but the connection does not succeed. Reading the jsoup API, they mention a parse method: “There is a sister method parse(File in, String charsetName) which uses the file’s location as the baseUri. This is useful if you are working on a filesystem-local site and the relative links it points to are also on the filesystem.”
Implementing the local file case would be really interesting ;))
Regards
Thanks Piaf, it’s really interesting. I will give it a try and then upload a newer version of the component.
So, can we parse a local file?
I get a “java.lang.NullPointerException” when using “file///c:/file.html” or “c:/file.html” as the URL. Is there another way to do it?
I am using version 2 from Talend Exchange.
Thanks for your work.
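One thing that can be checked outside the component is whether the string is a well-formed file URL at all. The sketch below uses only the JDK and shows which scheme java.net.URI sees in each of the strings quoted above. It assumes the component ends up treating the value as a URL, which this thread does not confirm, so it may not explain the NullPointerException. The class name FileUrlCheck and the method schemeOf are hypothetical.

```java
import java.net.URI;
import java.nio.file.Paths;

public class FileUrlCheck {
    // Returns the scheme java.net.URI finds in the string, or null if it finds none.
    static String schemeOf(String value) {
        return URI.create(value).getScheme();
    }

    public static void main(String[] args) {
        // No colon after "file", so the string is parsed as a relative path.
        System.out.println(schemeOf("file///c:/file.html"));
        // The drive letter in front of the colon is read as the scheme.
        System.out.println(schemeOf("c:/file.html"));
        // Well-formed file URL.
        System.out.println(schemeOf("file:///c:/file.html"));
        // Let the JDK build the file URL from a local path (the result depends on the OS).
        System.out.println(Paths.get("file.html").toAbsolutePath().toUri());
    }
}
```

Only the third string has the scheme “file”. Whether tHTMLInput version 2 accepts file URLs at all is a separate question that this check cannot answer.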
How do I get a class value?
Example:
Hello World
Output: democlass
In the above code I need the value of the class. How can I get the value of the class in tHTMLInput?
You can write a class in routines and then call its method in the tHTMLInput component. If you post an example, we can give you a better suggestion.
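A hedged sketch of what such a routine could look like. It assumes the column type is set to Object so that the whole tag comes back as HTML text (as other commenters in this thread report), and that the attribute value is quoted. The class name HtmlAttr, the method attr, and the sample tag are all made up for illustration. A regular expression is only a rough stand-in for a real HTML parser and will miss unquoted or unusual markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlAttr {
    // Returns the value of the first occurrence of the named attribute
    // in the HTML fragment, or null if the attribute is not present.
    public static String attr(String html, String name) {
        if (html == null || name == null) {
            return null;
        }
        // Matches name="value" or name='value', preceded by whitespace.
        Pattern p = Pattern.compile(
                "\\s" + Pattern.quote(name) + "\\s*=\\s*(\"([^\"]*)\"|'([^']*)')",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Group 2 holds a double-quoted value, group 3 a single-quoted one.
        return m.group(2) != null ? m.group(2) : m.group(3);
    }

    public static void main(String[] args) {
        // Hypothetical tag, chosen to match the expected output in the question.
        String tag = "<div class=\"democlass\">Hello World</div>";
        System.out.println(attr(tag, "class"));
    }
}
```

The same helper could be called for “id”, “src”, or “href” on the HTML text of a row, for example from a tMap expression. How the routine is wired into a job depends on your Talend setup.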
Hi,
I am trying to crawl table data from an HTML web page using the tHTMLInput component with particular tag attributes. If the table contains an empty row, the component repeats all the column data in that empty row. Below is the sample job I am trying.
tHTMLInput > tMap > tFileOutputDelimited
Exact question: how can I avoid empty tag attribute values in the crawled data?
Can you provide more details, such as your source data and the output you get? I want to reproduce it so that I can assist better.
I just have a self-made HTML file. What should I provide in the URL field of the tHTMLInput component?
The file name with its path.
I just have an HTML file. What should I provide in the URL field of the tHTMLInput component?
Hi there,
First of all: thanks for a great component. I managed to scrape a table that import.io had no way of scraping. However, like some people before me, I struggle with extracting IDs, links, or other ‘embedded’ information.
The most straightforward example concerns the ID. I want to extract the ID value of every row. At the moment I use “[id]” as the selector code with the column type set to “Object”. This returns the whole line as HTML. Afterwards I clean it up in tMap, but it’s not very easy this way.
I also recognise the issue someone had with an ‘empty row’. This occurs when some records have a field (in my case an element with “[class*=lineuprow]”) and others do not. If it is not available, tHTMLInput just returns all the values it can find in the ‘Parent Element’. My workaround is to extract the first field of every record and use tMap to suppress all records of “[class*=lineuprow]” that start with that same value. But I would prefer tHTMLInput to return NULL when it cannot find the Selector Code.
If you could help me with the first issue, it would be greatly appreciated. The second is a nice-to-have.
Kind regards,
Pepe
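The tMap workaround described above can be written as a small helper, for example as a routine called from tMap. This is only a sketch of that workaround, not a fix in the component. It assumes, as described, that the fallback text taken from the ‘Parent Element’ starts with the record’s first field. The names RowCleanup and nullIfFallback and the sample values are hypothetical.

```java
public class RowCleanup {
    // Returns null when the optional field looks like the parent-element
    // fallback (it starts with the record's first field); otherwise
    // returns the optional field unchanged.
    public static String nullIfFallback(String firstField, String optionalField) {
        if (optionalField == null || firstField == null || firstField.isEmpty()) {
            return optionalField;
        }
        return optionalField.startsWith(firstField) ? null : optionalField;
    }

    public static void main(String[] args) {
        // Made-up values: the first call simulates the fallback case,
        // the second a record where the optional field was really found.
        System.out.println(nullIfFallback("Smith", "Smith 12 Forward"));
        System.out.println(nullIfFallback("Smith", "Starting lineup"));
    }
}
```

A genuine value that happens to start with the same text as the first field would also be suppressed, so the check is a heuristic.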
How do I write a selector for inner div data?
div -> ul -> li -> div ->
Hi sir,
I’m new to Talend and development.
I’m trying to get data from an HTML table ( https://s27.postimg.org/56yzbhxvn/print.png ).
I need to get all three columns and all rows.
But I’m getting an error (java.lang.NullPointerException). Could you help me?
(https://s28.postimg.org/c1mw2oefx/Sem_t_tulo.png)