Sometimes you need to parse HTML data, process it and present it to the user. That can be a daunting task, as pages can become very complex and hard to navigate.
For that, you can use an excellent tool named HTML Agility Pack. With it, you can parse HTML from a string, a file, a web site or even a WebBrowser control: you can add a WebBrowser to your app, navigate to a URL and parse the data from there.
In this article, I'll show how to send a query to Bing, then retrieve and parse the response. For that, we need to create the query URL and pass it to Bing. You may ask why I'm querying Bing and not Google: Google makes it difficult to get its data, and I want to show you how to use HTML Agility Pack, not how to retrieve data from Google 😃. The query should look something like this:
https://www.bing.com/search?q=html+agility+pack&count=100
We will use two parameters: the query (q) and the number of results (count). With them, we can create our program: a WPF app that takes the search text, sends the query to Bing, parses the response and presents the results in a ListBox.
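As a minimal sketch of how the q and count parameters fit together, the helper below builds the query URL from the search text. The class name BingQuery and the method BuildUrl are hypothetical names used only for this illustration; the article's own code builds the URL inline.

```csharp
using System;
using System.Net;

public static class BingQuery
{
    // Hypothetical helper (not part of the article's project):
    // builds the Bing search URL from the user's search text.
    public static string BuildUrl(string searchText, int count = 100)
    {
        if (string.IsNullOrWhiteSpace(searchText))
            throw new ArgumentException("Search text is required.", nameof(searchText));

        // WebUtility.UrlEncode escapes reserved characters and turns spaces into '+'.
        var encoded = WebUtility.UrlEncode(searchText);
        return $"https://www.bing.com/search?q={encoded}&count={count}";
    }
}
```

Calling BingQuery.BuildUrl("html agility pack") should yield the sample URL shown above.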
Create a new WPF program and name it BingSearch.
The next step is to add the HTML Agility Pack to the project. Right-click the References node in the Solution Explorer and select Manage NuGet Packages. Then search for the HtmlAgilityPack package and install it.
Then, in the main window, add this XAML code:
<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="40"/>
        <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <StackPanel Grid.Row="0" Orientation="Horizontal"
                Margin="5,0" VerticalAlignment="Center">
        <TextBlock Text="Search" VerticalAlignment="Center"/>
        <TextBox x:Name="TxtSearch" Width="300" Height="30"
                 Margin="5,0" VerticalContentAlignment="Center"/>
    </StackPanel>
    <Button Grid.Row="0" HorizontalAlignment="Right"
            Content="Search" Margin="5,0" VerticalAlignment="Center"
            Width="65" Height="30" Click="SearchClick"/>
    <ListBox Grid.Row="1" x:Name="LbxResults" />
</Grid>
Place the cursor on the button's Click handler name (SearchClick) in the XAML and press F12 to create the handler in the code-behind and go to it. Then, add this code to the handler:
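// Needs: using System.Collections.Generic; using System.Net; using HtmlAgilityPack;
private async void SearchClick(object sender, RoutedEventArgs e)
{
    if (string.IsNullOrWhiteSpace(TxtSearch.Text))
        return;
    // Encode the search text so it is safe to embed in the URL
    var queryString = WebUtility.UrlEncode(TxtSearch.Text);
    var htmlWeb = new HtmlWeb();
    var query = $"https://bing.com/search?q={queryString}&count=100";
    var doc = await htmlWeb.LoadFromWebAsync(query);
    // The results live in the ordered list with id "b_results"
    var response = doc.DocumentNode.SelectSingleNode("//ol[@id='b_results']");
    // "?." guards against a missing list; ".//" keeps the search inside it
    var results = response?.SelectNodes(".//li[@class='b_algo']");
    if (results == null)
    {
        LbxResults.ItemsSource = null;
        return;
    }
    var searchResults = new List<SearchResult>();
    foreach (var result in results)
    {
        var refNode = result.Element("h2")?.Element("a");
        if (refNode == null)
            continue; // skip entries without the expected title link
        var url = refNode.GetAttributeValue("href", string.Empty);
        var text = WebUtility.HtmlDecode(refNode.InnerText);
        // The description may be missing; fall back to an empty string
        var description = WebUtility.HtmlDecode(
            result.Element("div")?.Element("p")?.InnerText ?? string.Empty);
        searchResults.Add(new SearchResult(text, url, description));
    }
    LbxResults.ItemsSource = searchResults;
}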
First we encode the search text and build the query URL. Then we call the LoadFromWebAsync method to load the HTML returned for the query. When the response arrives, we get the response node (the ordered list with id b_results) and extract the individual results from it. Finally, we parse each result, add it to a list of SearchResult and assign the list to the ItemsSource of the ListBox. Note that we can find nodes using XPath. A path that starts with // searches the whole document, even when it is called on a node; to search only the descendants of response, start the path with a dot:
var results = response.SelectNodes(".//li[@class='b_algo']");
Or we can traverse the child elements (Element returns null when the child is missing) and get the text of the resulting node with something like:
var refNode = result.Element("h2").Element("a");
var url = refNode.Attributes["href"].Value;
var text = refNode.InnerText;
var description = WebUtility.HtmlDecode(
    result.Element("div").Element("p").InnerText);
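To try both approaches without calling Bing, here is a small sketch that parses an in-memory string. The markup is made up and only mimics the shape of a result list (it is not real Bing output), and the XPathDemo class and Titles method are hypothetical names. It assumes the HtmlAgilityPack package is installed.

```csharp
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack;

public static class XPathDemo
{
    // Made-up markup: one matching item outside the results list, one inside it.
    private const string Html = @"
<html><body>
  <ul id='other'>
    <li class='b_algo'><h2><a href='https://example.org/'>Outside</a></h2></li>
  </ul>
  <ol id='b_results'>
    <li class='b_algo'>
      <h2><a href='https://example.com/'>Example &amp; Co</a></h2>
      <div><p>A sample description.</p></div>
    </li>
  </ol>
</body></html>";

    // Runs the given XPath from the b_results node and returns the link titles.
    public static List<string> Titles(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        var list = doc.DocumentNode.SelectSingleNode("//ol[@id='b_results']");

        var titles = new List<string>();
        var items = list.SelectNodes(xpath);
        if (items == null)
            return titles; // SelectNodes returns null when nothing matches

        foreach (var item in items)
        {
            // Element traversal: direct child h2, then its direct child a
            var link = item.Element("h2")?.Element("a");
            if (link != null)
                titles.Add(WebUtility.HtmlDecode(link.InnerText));
        }
        return titles;
    }
}
```

Titles(".//li[@class='b_algo']") only looks inside the results list, while Titles("//li[@class='b_algo']") also picks up the item from the other list, because that path is evaluated from the document root.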
SearchResult is declared as:
internal class SearchResult
{
    public string Text { get; }
    public string Url { get; }
    public string Description { get; }

    public SearchResult(string text, string url, string description)
    {
        Text = text;
        Url = url;
        Description = description;
    }
}
If you run the program now, the list shows only the type name of each item instead of the search data, because we haven't defined a data template for the list items. You can define an item template like this in the XAML:
<ListBox.ItemTemplate>
    <DataTemplate>
        <StackPanel Margin="0,3">
            <TextBlock Text="{Binding Text}" FontWeight="Bold"/>
            <TextBlock>
                <Hyperlink NavigateUri="{Binding Url}" RequestNavigate="LinkNavigate">
                    <TextBlock Text="{Binding Url}"/>
                </Hyperlink>
            </TextBlock>
            <TextBlock Text="{Binding Description}" TextWrapping="Wrap"/>
        </StackPanel>
    </DataTemplate>
</ListBox.ItemTemplate>
The LinkNavigate event handler is:
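private void LinkNavigate(object sender, RequestNavigateEventArgs e)
{
    // UseShellExecute lets the OS open the URL in the default browser
    // (it is the default on .NET Framework but must be set on .NET Core and later)
    System.Diagnostics.Process.Start(
        new System.Diagnostics.ProcessStartInfo(e.Uri.AbsoluteUri) { UseShellExecute = true });
    e.Handled = true;
}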
Now, when you run the program, each result shows its title, URL and description.
You can click on the hyperlink and it will open a browser window with the selected page. We can go even further and add a WebBrowser to our app that shows the selected page when you click on an item. For that, modify the XAML code like this:
<Grid>
    <Grid.RowDefinitions>
        <RowDefinition Height="40"/>
        <RowDefinition Height="*"/>
    </Grid.RowDefinitions>
    <Grid.ColumnDefinitions>
        <ColumnDefinition Width="*"/>
        <ColumnDefinition Width="*"/>
    </Grid.ColumnDefinitions>
    <StackPanel Grid.Row="0" Orientation="Horizontal"
                Margin="5,0" VerticalAlignment="Center">
        <TextBlock Text="Search" VerticalAlignment="Center"/>
        <TextBox x:Name="TxtSearch" Width="300" Height="30"
                 Margin="5,0" VerticalContentAlignment="Center"/>
    </StackPanel>
    <Button Grid.Row="0" HorizontalAlignment="Right"
            Content="Search" Margin="5,0" VerticalAlignment="Center"
            Width="65" Height="30" Click="SearchClick"/>
    <ListBox Grid.Row="1" x:Name="LbxResults"
             ScrollViewer.HorizontalScrollBarVisibility="Disabled"
             SelectionChanged="LinkChanged">
        <ListBox.ItemTemplate>
            <DataTemplate>
                <StackPanel Margin="0,3">
                    <TextBlock Text="{Binding Text}" FontWeight="Bold"/>
                    <TextBlock>
                        <Hyperlink NavigateUri="{Binding Url}" RequestNavigate="LinkNavigate">
                            <TextBlock Text="{Binding Url}"/>
                        </Hyperlink>
                    </TextBlock>
                    <TextBlock Text="{Binding Description}" TextWrapping="Wrap"/>
                </StackPanel>
            </DataTemplate>
        </ListBox.ItemTemplate>
    </ListBox>
    <WebBrowser Grid.Column="1" Grid.RowSpan="2" x:Name="WebPage" />
</Grid>
We've added a second column to the window and placed a WebBrowser in it, then added a SelectionChanged event handler to the ListBox, so we can navigate to the selected page.
The SelectionChanged event handler is:
private void LinkChanged(object sender, SelectionChangedEventArgs e)
{
    if (e.AddedItems?.Count > 0)
    {
        WebPage.Navigate(((SearchResult)e.AddedItems[0]).Url);
    }
}
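Since the URL comes from scraped HTML, it may be worth checking it before handing it to the WebBrowser. The sketch below is one possible guard; the UrlCheck class and TryGetWebUri method are hypothetical names that are not part of the original project.

```csharp
using System;

public static class UrlCheck
{
    // Hypothetical helper: accepts only absolute http or https URLs.
    public static bool TryGetWebUri(string url, out Uri uri)
    {
        return Uri.TryCreate(url, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}

// Possible use inside LinkChanged (an assumption, not the article's code):
//   if (UrlCheck.TryGetWebUri(((SearchResult)e.AddedItems[0]).Url, out var uri))
//       WebPage.Navigate(uri);
```

With this guard, relative links or javascript: links would simply be ignored instead of being passed to Navigate.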
Now, when you run the app and click on a result, it will show the page in the WebBrowser. One issue is that a JavaScript error sometimes pops up. To suppress these errors, I used this previously published solution, which sets the Silent property on the underlying browser control:
// Needs: using System.Reflection; using System.Runtime.InteropServices;
public MainWindow()
{
    InitializeComponent();
    // Apply the flag after each navigation, once a document is available
    WebPage.Navigated += (s, e) => SetSilent(WebPage, true);
}

public static void SetSilent(WebBrowser browser, bool silent)
{
    if (browser == null)
        throw new ArgumentNullException(nameof(browser));
    // Get an IWebBrowser2 from the document
    IOleServiceProvider sp = browser.Document as IOleServiceProvider;
    if (sp != null)
    {
        Guid IID_IWebBrowserApp = new Guid("0002DF05-0000-0000-C000-000000000046");
        Guid IID_IWebBrowser2 = new Guid("D30C1661-CDAF-11d0-8A3E-00C04FC9E26E");
        object webBrowser;
        sp.QueryService(ref IID_IWebBrowserApp, ref IID_IWebBrowser2, out webBrowser);
        if (webBrowser != null)
        {
            // Set the COM "Silent" property through late binding
            webBrowser.GetType().InvokeMember("Silent",
                BindingFlags.Instance | BindingFlags.Public |
                BindingFlags.PutDispProperty, null, webBrowser,
                new object[] { silent });
        }
    }
}

[ComImport, Guid("6D5140C1-7436-11CE-8034-00AA006009FA"),
 InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
private interface IOleServiceProvider
{
    [PreserveSig]
    int QueryService([In] ref Guid guidService, [In] ref Guid riid,
        [MarshalAs(UnmanagedType.IDispatch)] out object ppvObject);
}
With this code, the JavaScript errors disappear when you run the app.
As you can see, the HTML Agility Pack makes it easy to process and parse HTML pages, allowing you to manipulate them the way you want.
The full source code for this article is at https://github.com/bsonnino/BingSearch
Any ideas why htmlWeb.LoadFromWebAsync(query); doesn't always work? Sometimes I have to hit the search button 5 times before any results are returned. When I step through the debugger, "doc" and "response" are populated but "results" is null, because the response says, "There are no results for html agility pack. Check your spelling or try different keywords". But if I try and try and try again, it works eventually.
This seems to be due to the query to Bing and not to the HTML Agility Pack.
As you can see, there is a response, it just isn't what we expect. Maybe it has something to do with the User Agent that is used to query Bing.
I've made some changes, changing the URL and adding the Edge User Agent, and it seems to be better:
var htmlWeb = new HtmlWeb();
htmlWeb.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134";
var query = $"https://www.bing.com/search?q={queryString}&count=100&toWww=1";
Thanks for the tip. This is better. Now I usually need 2 retries versus 6. Still not very reliable. I'm guessing the problem is somewhere between my connection and Bing. The strange thing is that when I search Bing from the browser, it comes back with the proper results instantly. Thanks again for your help and the demo.
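If manual retries are needed anyway, one option is to let the app retry by itself. The following is only a sketch under that assumption: the Retry class, the UntilAsync method and the default attempt count and delay are hypothetical choices, and it does not address why Bing sometimes returns an empty result page.

```csharp
using System;
using System.Threading.Tasks;

public static class Retry
{
    // Hypothetical helper: re-runs an async operation until its result is
    // accepted or the attempts run out, waiting a little longer each time.
    public static async Task<T> UntilAsync<T>(
        Func<Task<T>> operation, Func<T, bool> isAcceptable,
        int maxAttempts = 3, int delayMs = 500)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        T result = default(T);
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            result = await operation();
            if (isAcceptable(result))
                return result;
            if (attempt < maxAttempts)
                await Task.Delay(delayMs * attempt);
        }
        return result; // last result, even if it was not accepted
    }
}

// Possible use in SearchClick (an assumption, not the article's code):
//   var doc = await Retry.UntilAsync(
//       () => htmlWeb.LoadFromWebAsync(query),
//       d => d.DocumentNode.SelectNodes("//li[@class='b_algo']") != null);
```

The caller still has to handle the case where the last attempt returns no results, as the original handler does with its null check.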