My Windows 8 app, whose code can be found here, uses a third-party API (Readability) to scrape news sites such as Yahoo or CNN, so that the user sees only the text of the article and nothing more. Ads and images are removed.
I am using the HTML Agility Pack to parse the HTML and convert it to XAML or a WinRT visual tree for display. Right now, it's all text. The content of each paragraph (each <p> tag) is pulled out and placed into a TextBlock as a Run. A style is applied to the TextBlock so that the font is bigger than the default, but not much else.
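To make the paragraph step concrete, here is a minimal sketch of the idea. It is written in Python with the standard library's html.parser purely for illustration; it is not the app's actual HTML Agility Pack code, and the names ParagraphExtractor and extract_paragraphs are made up for this example. Each string it returns corresponds to what would become one Run in a TextBlock.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the plain text of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Real-world HTML often omits </p>, so a new <p> closes the old one.
            self._flush()
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Text inside nested inline tags (<b>, <a>, ...) also lands here.
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Collapse runs of whitespace and drop empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None


def extract_paragraphs(html):
    """Return the text of every non-empty paragraph, in document order."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing <p> that was never closed
    return parser.paragraphs


if __name__ == "__main__":
    sample = "<div><p>Hello <b>world</b>.</p><p>  Second\n para </p></div>"
    for text in extract_paragraphs(sample):
        print(text)
```

Inline markup such as bold or links is flattened to plain text here, which matches the "it's all text" state of the article view.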
Now I am working on adding images to the article view. The images themselves were easy to add, but captions are proving a more formidable obstacle. Some sites use the alt attribute on the img tag to hold a caption. Some add a paragraph after the image to act as a caption. Some do both.
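The two caption sources described above can at least be collected side by side for each image. The sketch below does only that: it records the alt text and the next non-empty paragraph as candidates and makes no attempt to decide which one, if either, is a real caption. As before, it is Python with the standard library's html.parser for illustration only, and CaptionCandidateFinder and find_caption_candidates are hypothetical names, not part of the app or of the HTML Agility Pack.

```python
from html.parser import HTMLParser


class CaptionCandidateFinder(HTMLParser):
    """Records, per <img>, its alt text and the next non-empty <p> (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.images = []
        self._awaiting = None       # latest image still waiting for a following <p>
        self._buffer = None         # text of the <p> being read, or None outside one
        self._skip_current = False  # True if the current <p> itself wraps an image

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            alt = (attributes.get("alt") or "").strip()
            entry = {
                "src": attributes.get("src"),
                "alt": alt or None,
                "next_paragraph": None,
            }
            self.images.append(entry)
            self._awaiting = entry
            if self._buffer is not None:
                # The image sits inside this <p>, so this <p> is not "the next one".
                self._skip_current = True
        elif tag == "p":
            self._buffer = []
            self._skip_current = False

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "p" or self._buffer is None:
            return
        text = " ".join("".join(self._buffer).split())
        self._buffer = None
        if self._skip_current:
            self._skip_current = False
            return
        if text and self._awaiting is not None:
            self._awaiting["next_paragraph"] = text
            self._awaiting = None


def find_caption_candidates(html):
    """Return one dict per image with keys 'src', 'alt' and 'next_paragraph'."""
    finder = CaptionCandidateFinder()
    finder.feed(html)
    finder.close()
    return finder.images


if __name__ == "__main__":
    sample = '<img src="a.jpg" alt="A cat"><p>A cat sits on a mat.</p>'
    for image in find_caption_candidates(sample):
        print(image)
```

Both fields are only candidates. An alt attribute may be a generic description rather than a caption, and the paragraph that follows an image may be ordinary body text.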
But some do neither, so I cannot assume that the paragraph after an image is its caption. If anyone can figure out a solution to this problem, please comment on this post or email me at feinbergaa at yahoo dot com.