How to read HTML text?
The problem I'm running into when reading HTML (both the page itself and the DOM inside the page) is that HTML text tends to mix markup with plain, non-markup text. A typical paragraph (like the ones on this page) contains both. If you want to parse and read the whole string as HTML, you have to check whether each character belongs to an HTML tag or to the text around it. For instance, the first "<" marks the start of a tag, so you have to parse it and decide whether to add a <p> tag to the beginning of the string or not.
There are of course other ways of doing it, for instance using regular expressions to find the next thing to act on, but I am wondering whether there is a better way. I'd appreciate any advice.
P.S.: The HTML in the example above is very small and for demonstration purposes only, so it doesn't represent the real-world case I am facing. I'm basically facing situations where I have to parse HTML text like that in the example (whole sentences rather than single words) and then decide what to do with every text node and its parent element. If that were not the case, we could just read the text string, go through every node, and compare each node's type and ID/class to determine how to handle the text.
I had to reread your question several times to understand what you actually wrote. My guess is that you're working with a string that has HTML in it. In that case you'll have to write some code that takes any given piece of HTML as input (a string, such as a single tag), then walks through your string and compares each piece with it. If it's an opening tag, add the opened tag to the end of the string that tracks open tags. If it's a closing tag, remove the tag it closes from that string.
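The opener/closer bookkeeping described above can be sketched with Python's standard-library html.parser, which already splits the input into tags and text. This is only a rough illustration, not the answerer's code: the names TextWithParent and text_nodes_with_parents are made up for this sketch, and the handling of unclosed tags is a simplifying assumption.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextWithParent(HTMLParser):
    """Hypothetical helper: records each text node with its parent element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []    # stack of (tag, attrs) for currently open elements
        self.text_nodes = []   # collected (parent_tag, parent_attrs, text) tuples

    def handle_starttag(self, tag, attrs):
        # Opener: push it onto the end of the stack.
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Closer: drop the most recent matching opener and anything
        # left unclosed above it (assumption: sloppy HTML is tolerated).
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            tag, attrs = self.open_tags[-1] if self.open_tags else (None, {})
            self.text_nodes.append((tag, attrs, text))


def text_nodes_with_parents(html):
    parser = TextWithParent()
    parser.feed(html)
    parser.close()
    return parser.text_nodes


if __name__ == "__main__":
    sample = '<p class="intro">Hello <b>bold</b> world</p>'
    for tag, attrs, text in text_nodes_with_parents(sample):
        print(tag, attrs, text)
```

Each tuple gives the parent tag and its attributes (including id and class, when present), so the decision about what to do with a text node can be made per node without scanning the string character by character.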
How does extracting text from HTML work?
I have a table that is being built dynamically, so the cells, and the content in them, can vary in length. I'd like to wrap some of the long text in a link; however, the links are appearing below the text.
For more information, see the post linked by @Mason.
How to extract text from HTML?
I need a function to extract the text of an HTML element in my program. I know C# has a library for this purpose, but it is slow and time-consuming for pages with heavy text content (it may be used in our mobile application), so I want to find a way to do this without using C#. Thanks.
I think there are a few ways to accomplish this. You can use something like HtmlAgilityPack to search an XML document for certain tags and just pull out the text from those tags.
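Since the question asks for a way that avoids C#, here is a minimal sketch in Python using only the standard-library html.parser. It is an assumption about what "extract the text" should mean here: tags are dropped, the contents of script and style elements are skipped, and whitespace is collapsed. The names TextExtractor and extract_text are hypothetical, chosen for this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, ignoring script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    page = "<div><h1>Title</h1><script>var x = 1;</script><p>Some &amp; text</p></div>"
    print(extract_text(page))  # prints: Title Some & text
```

Joining the pieces with a space keeps text from neighbouring elements apart, at the cost of also splitting a word that is broken up by inline tags; whether that trade-off is acceptable depends on the pages being processed.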
For example, say I have an XML document like this:
foreach (HtmlAgilityPack.HtmlNode element in elements)
But the problem with that is I'm not getting the namespace of foo, the namespace of bar, the inner text content of foo, the namespace of root, etc. So if I did:
foreach (HtmlAgilityPack.HtmlNode element in doc.
I'll get output that looks like this: