Skip to content
Bruno Sonnino
Menu
  • Home
  • About
Menu

Parsing HTML data with C#

Posted on 16 June 2019

Sometimes, you need to parse some html data to do some processing and present it to the user. That may be a daunting task, as some pages can become very complex and it may be difficult to do it.

For that, you can use an excellent tool, named HTML Agility Pack. With it, you can parse HTML from a string, a file, a web site or even from a WebBrowser: you can add a WebBrowser to your app, navigate to an URL and parse the data from there.

In this article, I'll show how to make a query in Bing, retrieve and parse the response. For that, we need to create the query url and pass it to Bing. You may ask why I'm querying Bing and not Google - I'm doing that because Google makes it difficult to get its data, and I want to show you how to use HTML Agility Pack, and not how to retrieve data from Google 😃. The query should be something like this:

https://www.bing.com/search?q=html+agility+pack&count=100

We will use the Query (q) and the number of results (count) parameters. With them, we can create our program. We will create a WPF program that gets the query text, parses it and presents the results in a Listbox.

Create a new WPF program and name it BingSearch.

The next step is to add the HTML Agility Pack to the project. Right-click the References node in the Solution Explorer and select Manage NuGet Packages. Then add the Html Agility Pack to the project.

Then, in the main window, add this XAML code:

<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="40"/>
        <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <StackPanel Grid.Row="0" Orientation ="Horizontal" 
                Margin="5,0" VerticalAlignment="Center">
        <TextBlock Text="Search" VerticalAlignment="Center"/>
        <TextBox x:Name="TxtSearch" Width="300" Height="30" 
                 Margin="5,0" VerticalContentAlignment="Center"/>
    </StackPanel>
    <Button Grid.Row="0" HorizontalAlignment="Right" 
            Content="Search" Margin="5,0" VerticalAlignment="Center"
            Width="65" Height="30" Click="SearchClick"/>
    <ListBox Grid.Row="1" x:Name="LbxResults" />
</Grid>

Right click in the button's click event handler in the XAML and press F12 to add the handler in code and go to it. Then, add this code to the handler:

private async void SearchClick(object sender, RoutedEventArgs e)
{
    if (string.IsNullOrWhiteSpace(TxtSearch.Text))
        return;
    var queryString = WebUtility.UrlEncode(TxtSearch.Text);
    var htmlWeb = new HtmlWeb();
    var query = $"https://bing.com/search?q={queryString}&count=100";
    var doc = await htmlWeb.LoadFromWebAsync(query);
    var response = doc.DocumentNode.SelectSingleNode("//ol[@id='b_results']");
    var results = response.SelectNodes("//li[@class='b_algo']");
    if (results == null)
    {
        LbxResults.ItemsSource = null;
        return;
    }
    var searchResults = new List<SearchResult>();
    foreach (var result in results)
    {
        var refNode = result.Element("h2").Element("a");
        var url = refNode.Attributes["href"].Value;
        var text = refNode.InnerText;
        var description = result.Element("div").Element("p").InnerText;
        searchResults.Add(new SearchResult(text, url, description));
    }
    LbxResults.ItemsSource = searchResults;
}

Initially we encode the text to search to add it to the query and create the query string. Then we call the LoadFromWebAsync method to load the HTML data from the query response. When the response comes, we get the response node, from the ordered list with id b_results and extract from it the individual results. Finally, we parse each result and add it to a list of SearchResult, and assign the list to the items in the ListBox. You can note that we can find the nodes using XPath, like in

var results = response.SelectNodes("//li[@class='b_algo']");

Or we can traverse the elements and get the text of the resulting node with something like:

var refNode = result.Element("h2").Element("a");
var url = refNode.Attributes["href"].Value;
var text = refNode.InnerText;
var description = WebUtility.HtmlDecode(
    result.Element("div").Element("p").InnerText);

SearchResult is declared as:

internal class SearchResult
{
    public string Text { get; }
    public string Url { get; }
    public string Description { get; }

    public SearchResult(string text, string url, string description)
    {
        Text = text;
        Url = url;
        Description = description;
    }
}

if you run the program, you will see something like this:

The data isn't displayed because we haven't defined any data template for the list items. You can define an item template like that in the XAML:

<ListBox.ItemTemplate>
    <DataTemplate>
        <StackPanel Margin="0,3">
            <TextBlock Text="{Binding Text}" FontWeight="Bold"/>
            <TextBlock >
              <Hyperlink NavigateUri="{Binding Url}" RequestNavigate="LinkNavigate">
                 <TextBlock Text="{Binding Url}"/>
              </Hyperlink>
            </TextBlock>
            <TextBlock Text="{Binding Description}" TextWrapping="Wrap"/>
        </StackPanel>
    </DataTemplate>
</ListBox.ItemTemplate>

The LinkNavigate event handler is:

private void LinkNavigate(object sender, RequestNavigateEventArgs e)
{
    System.Diagnostics.Process.Start(e.Uri.AbsoluteUri);
}

Now, when you run the program, you will get something like this:

You can click on the hyperlink and it will open a browser window with the selected page. We can even go further and add a WebBrowser to our app that will show the selected page when you click on an item. For that, you have to modify the XAML code with something like this:

<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="40"/>
        <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <Grid.ColumnDefinitions>
        <ColumnDefinition Width="*"/>
        <ColumnDefinition Width="*"/>
    </Grid.ColumnDefinitions>
    <StackPanel Grid.Row="0" Orientation ="Horizontal" 
                Margin="5,0" VerticalAlignment="Center">
        <TextBlock Text="Search" VerticalAlignment="Center"/>
        <TextBox x:Name="TxtSearch" Width="300" Height="30" 
                 Margin="5,0" VerticalContentAlignment="Center"/>
    </StackPanel>
    <Button Grid.Row="0" HorizontalAlignment="Right" 
            Content="Search" Margin="5,0" VerticalAlignment="Center"
            Width="65" Height="30" Click="SearchClick"/>
    <ListBox Grid.Row="1" x:Name="LbxResults" 
             ScrollViewer.HorizontalScrollBarVisibility="Disabled"
             SelectionChanged="LinkChanged">
        <ListBox.ItemTemplate>
            <DataTemplate>
                <StackPanel Margin="0,3">
                    <TextBlock Text="{Binding Text}" FontWeight="Bold"/>
                    <TextBlock >
                      <Hyperlink NavigateUri="{Binding Url}" RequestNavigate="LinkNavigate">
                         <TextBlock Text="{Binding Url}"/>
                      </Hyperlink>
                    </TextBlock>
                    <TextBlock Text="{Binding Description}" TextWrapping="Wrap"/>
                </StackPanel>
            </DataTemplate>
        </ListBox.ItemTemplate>
    </ListBox>
    <WebBrowser Grid.Column="1" Grid.RowSpan="2" x:Name="WebPage"  />
</Grid>

We've added a second column to the window and added a WebBrwser to it, then added a SelectionChanged event to the listbox, so we can navigate to the selected page.

The SelectionChanged event handler is:

private void LinkChanged(object sender, SelectionChangedEventArgs e)
{
    if (e.AddedItems?.Count > 0)
    {
        WebPage.Navigate(((SearchResult)e.AddedItems[0]).Url);
    }
}

Now, when you run the app and click on a result, it will show the page in the WebBrowser. One thing that happened is that, sometimes a Javascript error pops up. To remove these errors, I used the solution obtained from here:

public MainWindow()
{
    InitializeComponent();
    WebPage.Navigated += (s, e) => SetSilent(WebPage, true);
}

public static void SetSilent(WebBrowser browser, bool silent)
{
    if (browser == null)
        throw new ArgumentNullException("browser");

    // get an IWebBrowser2 from the document
    IOleServiceProvider sp = browser.Document as IOleServiceProvider;
    if (sp != null)
    {
        Guid IID_IWebBrowserApp = new Guid("0002DF05-0000-0000-C000-000000000046");
        Guid IID_IWebBrowser2 = new Guid("D30C1661-CDAF-11d0-8A3E-00C04FC9E26E");

        object webBrowser;
        sp.QueryService(ref IID_IWebBrowserApp, ref IID_IWebBrowser2, out webBrowser
        if (webBrowser != null)
        {
            webBrowser.GetType().InvokeMember("Silent", 
                BindingFlags.Instance | BindingFlags.Public | 
                BindingFlags.PutDispProperty, null, webBrowser, 
                new object[] { silent });
        }
    }
}

[ComImport, Guid("6D5140C1-7436-11CE-8034-00AA006009FA"), 
    InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
private interface IOleServiceProvider
{
    [PreserveSig]
    int QueryService([In] ref Guid guidService, [In] ref Guid riid, 
        [MarshalAs(UnmanagedType.IDispatch)] out object ppvObject);
}

With this code, the Javascript errors disappear and when you run the app, you will see something like this:

As you can see, the HTML Agility Pack makes it easy to process and parse HTML Pages, allowing you to manipulate them the way you want.

The full source code for this article is in https://github.com/bsonnino/BingSearch

3 thoughts on “Parsing HTML data with C#”

  1. Darren says:
    19 June 2019 at 15:00

    Any ideas why htmlWeb.LoadFromWebAsync(query); doesn’t always work. I have to hit the search button 5 time sometimes before any results are returned. When I step through the debugger “doc” and “response” will be populated but “results” are null because response says, “There are no results for html agility packCheck your spelling or try different keywords”. But if i try and try and try again, it works eventually.

    Reply
    1. bsonnino says:
      20 June 2019 at 17:05

      This seems be something due to the query to Bing and not the Html Agility Pack.
      As you can see, there is a response, it just isn’t what we expect. Maybe there is something to do with the User Agent that is used to query Bing.
      I’ve made some changes, changing the URL and adding the Edge User Agent and it seems to be better:

      var htmlWeb = new HtmlWeb();
      htmlWeb.UserAgent = “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134″;
      var query = $”https://www.bing.com/search?q={queryString}&count=100&toWww=1”;

      Reply
  2. Darren says:
    21 June 2019 at 14:50

    Thanks for the tip. This is better. Now I usually get 2 retries versus 6. Still not very reliable. I’m guessing the problem is somewhere between my connection and bing. The strange thing is that when I search bing from the browser it comes back with the proper results instantly. Thanks again for your help and the demo.

    Reply

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

  • May 2025
  • December 2024
  • October 2024
  • August 2024
  • July 2024
  • June 2024
  • November 2023
  • October 2023
  • August 2023
  • July 2023
  • June 2023
  • May 2023
  • November 2022
  • October 2022
  • September 2022
  • August 2022
  • June 2022
  • April 2022
  • March 2022
  • February 2022
  • January 2022
  • July 2021
  • June 2021
  • May 2021
  • April 2021
  • March 2021
  • February 2021
  • January 2021
  • December 2020
  • October 2020
  • September 2020
  • April 2020
  • March 2020
  • January 2020
  • November 2019
  • September 2019
  • August 2019
  • July 2019
  • June 2019
  • April 2019
  • March 2019
  • February 2019
  • January 2019
  • December 2018
  • November 2018
  • October 2018
  • September 2018
  • August 2018
  • July 2018
  • June 2018
  • May 2018
  • November 2017
  • October 2017
  • September 2017
  • August 2017
  • June 2017
  • May 2017
  • March 2017
  • February 2017
  • January 2017
  • December 2016
  • November 2016
  • October 2016
  • September 2016
  • August 2016
  • July 2016
  • June 2016
  • May 2016
  • April 2016
  • March 2016
  • February 2016
  • October 2015
  • August 2013
  • May 2013
  • February 2012
  • January 2012
  • April 2011
  • March 2011
  • December 2010
  • November 2009
  • June 2009
  • April 2009
  • March 2009
  • February 2009
  • January 2009
  • December 2008
  • November 2008
  • October 2008
  • July 2008
  • March 2008
  • February 2008
  • January 2008
  • December 2007
  • November 2007
  • October 2007
  • September 2007
  • August 2007
  • July 2007
  • Development
  • English
  • Português
  • Uncategorized
  • Windows

.NET AI Algorithms asp.NET Backup C# Debugging Delphi Dependency Injection Desktop Bridge Desktop icons Entity Framework JSON Linq Mef Minimal API MVVM NTFS Open Source OpenXML OzCode PowerShell Sensors Silverlight Source Code Generators sql server Surface Dial Testing Tools TypeScript UI Unit Testing UWP Visual Studio VS Code WCF WebView2 WinAppSDK Windows Windows 10 Windows Forms Windows Phone WPF XAML Zip

  • Entries RSS
  • Comments RSS
©2025 Bruno Sonnino | Design: Newspaperly WordPress Theme